A Python package to facilitate the iterative process of developing and using schema-like representations of table data to recode and validate instances of those data stored in pandas DataFrames. This is a fairly young attempt to solve a recurring problem many people have. So far I have looked at multiple solutions, but none really did it for me: they either deal primarily with JSON-encoded data, or they only solve the validation side of the problem and consider recoding to be a separate issue. They seem to assume that recoding and cleaning have already been done, and that all we care about is making sure the final product is sane.

To me, this seems backwards.
I need to load, recode, and validate tables all day, every day. Sometimes it's simple: I can `pandas.read_table()` and all is good. But sometimes I have a 700-column RedCap data dump that is complicated af, and it really helps me to develop my recoding logic through an iterative process. For me it makes sense to couple the recoding process directly with the validation process: to write the "tests" for each column first, then add recoding logic in steps until the tests pass.
So Table Enforcer is my attempt to apply a sort of "test-driven development" workflow to data cleaning and validation.
The workflow goes something like this:

- Define a `Column` object that represents the ideal state of your data by passing a list of small, independent, reusable validator functions and some descriptive information.
- Define an `Enforcer` object by passing it a list of your column representation objects.

Please take a look and offer thoughts/advice.
Features:

- `Enforcer` and `Column` classes to define what columns should look like in a table.
- A `CompoundColumn` class that supports complex operations including "one-to-many" and "many-to-one" recoding logic, as sometimes a column tries to do too much and should really be multiple columns, as well as the reverse.
- Decorators for defining parameterized validators like `between_4_and_60()`.
%load_ext autoreload
%autoreload 2
from pathlib import Path
import re
from box import Box
import pandas as pd
import numpy as np
from table_enforcer import Enforcer, Column, CompoundColumn
import table_enforcer.errors as e
from table_enforcer import validate as v
from table_enforcer import recode as r
table_path = "../tests/files/demo_table.csv"
df = pd.read_csv(table_path)
df
Some validator functions come built-in (see the `table_enforcer/validate/funcs.py` module code for the current list):
# ! cat $v.funcs.__file__
In general, validators take a single `pandas.Series` object as input and return a `pandas.Series` of the same shape and index containing `True` or `False` according to which items passed the validation logic. This changes a little if you are using a decorator like `@minmax` (see the section on Decorating Validators).
def gte2(series):
    return series >= 2

def lte10(series):
    return series <= 10

def length_is_one(series):
    return series.str.len() == 1

def valid_sex(series):
    sex = set(['M', 'F'])
    return series.isin(sex)
You can define decorators that extend the call signature of validators to accommodate more complex things like ranges. Here we define a single validator that accomplishes the same thing as the combination of the first two validators defined above:
@v.decorators.minmax(low=2, high=10)
def bt_2_and_10(series):
    """Test that the data items fall within range: 2 <= x <= 10."""
    return series
Note: pay attention to the fact that here we return the original series object; we didn't do ANYTHING to it. The testing gets done in the `@v.decorators.minmax` decorator function. This allows our validators to always accept a single argument at run-time and always return a single series. This is very important.
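For example, based on the docstring above, we would expect the decorated validator to behave like the two separate range validators combined (an illustrative call, not from the package's docs):

bt_2_and_10(pd.Series([1, 5, 12]))  # expect: [False, True, False]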
This is a bit beyond this demo's scope, but here is the `table_enforcer/validate/decorators.py` code so that you can get an idea of how to do it if you are brave.
# !cat $v.decorators.__file__
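If you don't want to read the source just yet, here is a minimal sketch of how such a decorator could work, assuming only the behavior described above (this is not the package's actual implementation, and `minmax_sketch` is a hypothetical name):

import functools

def minmax_sketch(low, high):
    def decorator(validator):
        @functools.wraps(validator)
        def wrapper(series):
            series = validator(series)  # the validator hands back the series unchanged
            return (series >= low) & (series <= high)  # boolean Series, same index
        return wrapper
    return decorator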
Now that we have a few validator functions, it's time to put together a representation of our table and use that object to help us get a final data table in a form we like. We will start with `col1`.
df
Let's say for this example we only care about `col1`, `col3`, and `col4`.
We create a column object for each column we want, giving it the name of the column it represents. We set the options to describe our ideal format, including a list of validator functions that should all pass for all items in the column if we got our way. For now, we will ignore the `recoders` argument. What we end up with is the three column objects below.
col1 = Column(name='col1',
              dtype=np.int,
              unique=False,
              validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
              recoders=None)

col3 = Column(name='col3',
              dtype=np.int,
              unique=True,
              validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
              recoders=None)

col4 = Column(name='col4',
              dtype=str,
              unique=False,
              validators=[v.funcs.upper, length_is_one, valid_sex],
              recoders=None)
For each column object, we store the validator functions in a `dict`. Here is the value for `col4`:
col4.validators
We validate a column by passing our entire original dataframe to the column object. The tests get run independently, and after the data has passed through each test we get a dataframe of results. We can see right away which data items fail which tests. This helps us plan our recoding efforts later: we can immediately see what sorts of operations need to be run to convert those `False` tests to `True`.
col4.validate(df)
In reality we can have hundreds or thousands of rows and it would be nice to focus on only those that fail validation. Here is how you do that.
col4.validate(df, failed_only=True)
Wait a minute... where did that validation test called `dtype` come from? It turns out we get that one for free, since we declared that this column needs to be of dtype `np.int`. We get a similar `unique` validation check inserted if we set that option to `True`.
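Illustrative sketches of roughly what those free checks might do (these are assumptions for exposition, not the package's actual internals):

def dtype_is_int(series):
    # each item is an int (or numpy integer)
    return series.map(lambda x: isinstance(x, (int, np.integer)))

def unique_check(series):
    # True only for items that appear exactly once
    return ~series.duplicated(keep=False)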
Recoder functions have a very similar structure to validator functions, but their purpose is to coerce the raw data into a more useful form. For example, `col4` is obviously meant to represent 'male'/'female' logic, but we don't want to have to interpret all possible ways to represent that logic every time we want to use the column. So we write a recoder function to convert all items to either "M" or "F", so we only need to understand two choices from now on.
df
Some recoder functions come built-in (see the `table_enforcer/recode/funcs.py` module code for the current list):
# !cat $r.funcs.__file__
Like validators, recoders take a single `pandas.Series` object as input and return a `pandas.Series` of the same shape and index as the original. However, instead of returning a series of `True`/`False` values, a recoder performs some operation on the data that gets the column closer to how you want it to look during analysis.

You could build a single, monolithic recoder that transforms a column all the way into what you want in one step, but it may be better to write recoders that each perform only a single step on the way to getting a particular column into shape. This results in reusable, "composable" functions: for the NEXT column, you may not even need to write a new function; all you may need to do is list a sequence of recoders that already exist. This works because recoders are applied in a pipeline model: the output from the first is supplied to the second, and so on.
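Conceptually, the pipeline looks something like this (a sketch of the idea, not the package's internals; `apply_recoders` is a hypothetical helper):

from functools import reduce

def apply_recoders(series, recoders):
    # feed the output of each recoder into the next, left to right
    return reduce(lambda s, recode: recode(s), recoders, series)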
Now, let's write our recoders to transform `col4`.
def standardize_sex(series):
    """Return a series where common representations of 'male'/'female' are standardized.

    Things like ['M', 'MALE', 'M', 'BOY', ...] are converted to `M`.
    Representations of female are treated similarly.
    """
    mapper = {"M": "M",
              "MALE": "M",
              "BOY": "M",
              "F": "F",
              "FEMALE": "F",
              "GIRL": "F",
              }

    if series.str.islower().any():
        raise ValueError("standardize_sex expects input series to contain only UPPERCASE letters.")
    else:
        return series.apply(lambda x: mapper[x])
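A quick check on a toy series (illustrative):

standardize_sex(pd.Series(["M", "FEMALE", "BOY"]))  # expect: M, F, M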
Since we plan to call the built-in recoder `upper` first, we only need to support uppercase text here. This is kind of a silly requirement in practice, but it demonstrates how you can write composable recoders that fail if expectations are not met.
Let's revisit our original definition of `col4` and add the recoders we now have.
col4 = Column(name='col4',
              dtype=str,
              unique=False,
              validators=[v.funcs.upper, length_is_one, valid_sex],
              recoders=[r.funcs.upper, standardize_sex])
Now what do we have after we recode this column?
col4.recode(df)
That looks pretty good. But let's not rely on our eyes and instead rely on the validation tests we defined.
Now we try `col4.recode` again, but this time we tell it to perform validation after recoding and see what happens. If all is well, we will simply get the recoded column back. If validation fails, a `ValidationError` will be raised.
col4.recode(df, validate=True)
We are good! We can now move on to our other columns and repeat the process until all the validation checks pass on all of our columns. Then we can populate our `Enforcer` object, and we will be nearing the end of our initial sanity check / recoding phase for this table.

Just pass a list of the column objects we created to the `Enforcer`. The validation/recoding API for the `Enforcer` mirrors the one for `Column`s, so you can reuse what we learned above.
demo = Enforcer(columns=[col1,col3,col4])
demo.columns
demo.recode(df, validate=True)
NOTE: later we will see what happens when `demo.recode(df, validate=True)` encounters problems in recoding or validating, once we pass some bad data to our `load_csv()` function.

Use these enforcers to recode or validate recoded tables of the same kind as your source table wherever your applications use that type of data, such as in your loading functions.
You can pass them into a function as an argument or reference them from the script's local scope as you would an imported function.
I pass it in here so that I can easily switch between different enforcers for the examples.
def load_csv(path, enforcer):
    df = pd.read_csv(path)
    return enforcer.recode(df, validate=True)
df
df2 = load_csv(path=table_path, enforcer=demo)
df2
In this case we load a CSV that is slightly different from the one we built our enforcer on: perhaps it represents a new data dump or data from another source.
This new table has two problems:

- `col1` includes a value less than 2, and we have a validator that requires that all values be between 2 and 10. This will cause a `ValidationError` with helpful information to be raised.
- `Feemale` is not found in the `standardize_sex()` recoder's `mapper` dictionary. This will cause a `RecodingError` to be raised.

table_path_bad = "../tests/files/demo_table_bad.csv"
df_bad = pd.read_csv(table_path_bad)
df_bad
A `ValidationError` is raised when we try to get a recoded table or column while setting the `recode` method's option `validate=True`.
Let's see what happens when we try to load this table with our original `demo` enforcer.
load_csv(path=table_path_bad, enforcer=demo)
A column object raised a `ValidationError` because it saw some data that did not pass validation while inside the `recode` method (because we told it to validate). It also gives us some helpful information about what went wrong.
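If you would rather handle the failure programmatically than let it propagate, you can catch it (a sketch; this assumes the exception class is exposed in `table_enforcer.errors`, imported above as `e`):

try:
    load_csv(path=table_path_bad, enforcer=demo)
except e.ValidationError as err:
    print(f"validation failed: {err}")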
Now we add a recoder that deals with that value and move on.
# write another recoder
def handle_values_below_detection_limit(series):
    series = series.copy()  # avoid mutating the caller's data
    series[series < 2] = 2
    return series
# add the recoder to a Column representing `col1`
col1_new = Column(name='col1',
                  dtype=np.int,
                  unique=False,
                  validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
                  recoders=[handle_values_below_detection_limit])

# build enforcer with the corrected Column object
demo2 = Enforcer(columns=[col1_new,
                          col3,
                          col4])
Let's try this again with `demo2`.
df3 = load_csv(path=table_path_bad, enforcer=demo2)
df3
Problem 1 seems to be fixed, but now we get this new `RecodingError`.
A `RecodingError` is raised when a recoder function fails and raises an exception: any exception. Like the `ValidationError`, it tries to give you as much helpful information as possible about where and how it failed.
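As with validation failures, the error can be caught if you prefer to handle it yourself (again a sketch, assuming the class lives in `table_enforcer.errors`):

try:
    load_csv(path=table_path_bad, enforcer=demo2)
except e.RecodingError as err:
    print(f"recoding failed: {err}")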
In this case we need to update our recoder to deal with this new typo of the word "female":
# update the recoder
def standardize_sex2(series):
    """Return a series where common representations of 'male'/'female' are standardized.

    Things like ['M', 'MALE', 'M', 'BOY', ...] are converted to `M`.
    Representations of female are treated similarly.
    """
    mapper = {"M": "M",
              "MALE": "M",
              "BOY": "M",
              "F": "F",
              "FEMALE": "F",
              "GIRL": "F",
              "FEEMALE": "F",
              }

    if series.str.islower().any():
        raise ValueError("standardize_sex expects input series to contain only UPPERCASE letters.")
    else:
        return series.apply(lambda x: mapper[x])
# add the recoder to a Column representing `col4`
col4_new = Column(name='col4',
                  dtype=str,
                  unique=False,
                  validators=[v.funcs.upper, length_is_one, valid_sex],
                  recoders=[r.funcs.upper, standardize_sex2])

# build enforcer with the corrected Column object
demo3 = Enforcer(columns=[col1_new,
                          col3,
                          col4_new])
OK, one more time, now with `demo3`.
df4 = load_csv(path=table_path_bad, enforcer=demo3)
df4
Success! Problem 2 is now also fixed and we are ready to get back to work.

...until the next headache.
Overview of the `CompoundColumn` class:

- The `CompoundColumn` class is based on the same class that `Column` inherits from (`BaseColumn`), so it behaves much like the `Column` class that you are used to.
- It implements the methods `validate`, `recode`, and `update_dataframe`. `update_dataframe` is used by the `Enforcer` class and determines how the column object adds columns to the final table.
- It is essentially a `Column` composed of multiple plain `Column` objects: our first example will split a single column into two new `Column`s and will actually contain three plain `Column` objects.

A `CompoundColumn` takes the following arguments:

- `input_columns`: A list of fully defined `Column` objects, each representing a single column in the original `DataFrame`.
- `output_columns`: A list of fully defined `Column` objects, each representing a single column in the final `DataFrame`.
- `column_transform`: A function accepting the input `DataFrame`, performing transformations to it, and returning a new `DataFrame` containing the TRANSFORMED columns only.

Interaction with `Enforcer` objects:

- `CompoundColumn` objects inherit from `BaseColumn` and implement the three essential methods compatible with the interface that `Enforcer` objects expect each column to use: `validate`, `recode`, and `update_dataframe`.
- So interaction with `Enforcer` objects works exactly like the `Column` objects we already know.

print(CompoundColumn.__init__.__doc__)
One-to-many: in this example we will split `col5` into two new columns, so we must keep track of how to deal with three columns:

df
## Validators and Recoders
bad_chars = re.compile(pattern="""[*(]""")

def no_bad_characters(series):
    """Validator"""
    def test(x):
        return bad_chars.search(x) is None
    return series.astype(str).apply(test)
def fix_bad_characters(series):
    """Recoder"""
    def fix(x):
        return bad_chars.sub(repl='', string=x)
    return series.astype(str).apply(fix)

def recode_upper(series):
    return series.astype(str).str.upper()

def to_int(series):
    return series.astype(np.int)

def to_str(series):
    return series.astype(str)
# Transformation function
def split_on_colon(df):
    cols = Box()
    cols.col5_number = df.col5.apply(lambda x: x.split(":")[0])
    cols.col5_word = df.col5.apply(lambda x: x.split(":")[1])
    new_columns = pd.DataFrame(cols)[["col5_number", "col5_word"]]
    return new_columns
## Defining the Input Column
col5 = Column(
    name='col5',
    dtype=str,
    unique=False,
    validators=[v.funcs.not_null, no_bad_characters],
    recoders=[fix_bad_characters])
## Defining the Output Columns (col5_a/col5_b)
col5_a = Column(
    name='col5_number',
    dtype=np.int,
    unique=False,
    validators=[v.funcs.not_null,],
    recoders=[to_int],)

col5_b = Column(
    name='col5_word',
    dtype=str,
    unique=False,
    validators=[v.funcs.not_null, v.funcs.upper, no_bad_characters],
    recoders=[to_str, recode_upper],)
## Defining the Compound Column
col5_split = CompoundColumn(input_columns=[col5],
                            output_columns=[col5_a, col5_b],
                            column_transform=split_on_colon)
When we call `col5_split.validate`, we get a more complex dataframe than before.

col5_split.validate(df)
`col5_split.recode` returns a `pd.DataFrame` representing the `output_columns` in the provided order.

col5_split.recode(df, validate=True)
Many-to-one: this time we will combine several source columns into a single column of type `list` representing the names of which, if any, of the tests were done for each subject. This is the "many-to-one" use of the `CompoundColumn` class.

df
# helper functions
def is_subset(x, ref_set):
    if not isinstance(ref_set, set):
        ref_set = set(ref_set)
    if isinstance(x, (list, tuple, set)):
        set_x = set(x)
    else:
        set_x = set([x])
    return set_x.issubset(ref_set)
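A couple of quick sanity checks of the helper (illustrative):

is_subset(("DNASeq", "RNASeq"), ["DNASeq", "Protein Function", "RNASeq"])  # True
is_subset("fake", ["DNASeq"])  # False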
# Transformation function
def join_as_tuple(df):
    cols = Box()
    cols.col6_7_8 = df[["col6", "col7", "col8"]].apply(lambda row: (row.col6, row.col7, row.col8,), axis=1)
    new_columns = pd.DataFrame(cols)
    return new_columns
# Validators
def col6_valid_values(series):
    """Validator"""
    valid = [None, "DNASeq"]
    return series.apply(is_subset, ref_set=valid)

def col7_valid_values(series):
    """Validator"""
    valid = [None, "Protein Function"]
    return series.apply(is_subset, ref_set=valid)

def col8_valid_values(series):
    """Validator"""
    valid = [None, "RNASeq"]
    return series.apply(is_subset, ref_set=valid)

def col6_7_8_valid_values(series):
    """Validator"""
    valid = set(["DNASeq", "Protein Function", "RNASeq"])
    return series.apply(is_subset, ref_set=valid)
# Recoders
def translate_col6(series):
    """Recode 0 -> None; 1 -> 'DNASeq'."""
    def rcode(x):
        mapping = {0: None, 1: "DNASeq"}
        return mapping[x]
    return series.apply(rcode)

def translate_col7(series):
    """Recode 0 -> None; 1 -> 'Protein Function'."""
    def rcode(x):
        mapping = {0: None, 1: "Protein Function"}
        return mapping[x]
    return series.apply(rcode)

def translate_col8(series):
    """Recode 0 -> None; 1 -> 'RNASeq'."""
    def rcode(x):
        mapping = {0: None, 1: "RNASeq"}
        return mapping[x]
    return series.apply(rcode)

def setify_drop_nones(series):
    """Convert to sets and drop ``None`` values."""
    def drop_nones(x):
        x.discard(None)
        return x
    return series.apply(lambda x: set(x)).apply(drop_nones).apply(list)
# Defining the Input Columns
col6 = Column(
    name='col6',
    dtype=(str, type(None)),
    unique=False,
    validators=[col6_valid_values],
    recoders=[translate_col6],)

col7 = Column(
    name='col7',
    dtype=(str, type(None)),
    unique=False,
    validators=[col7_valid_values],
    recoders=[translate_col7],)

col8 = Column(
    name='col8',
    dtype=(str, type(None)),
    unique=False,
    validators=[col8_valid_values],
    recoders=[translate_col8],)
# Defining the Output Column
col6_7_8 = Column(
    name='col6_7_8',
    dtype=list,
    unique=False,
    validators=[v.funcs.not_null, col6_7_8_valid_values],
    recoders=[setify_drop_nones],)
# Defining the Compound Column
col6_7_8_join = CompoundColumn(input_columns=[col6, col7, col8],
                               output_columns=[col6_7_8],
                               column_transform=join_as_tuple)
df
When we call `col6_7_8_join.validate`, we again get a more complex dataframe than with a plain `Column` object.

col6_7_8_join.validate(df)
col6_7_8_join.recode(df, validate=True)
Now we build a new `Enforcer` object, this time including our `CompoundColumn`s.

now_w_moar_complexity = Enforcer(columns=[col1,
                                          col3,
                                          col4,
                                          col5_split,
                                          col6_7_8_join])
It works just like any other `Enforcer`: `BaseColumn` includes a special set of methods that allow each column type to interact with `Enforcer` objects in its specific way.

# this should fail
now_w_moar_complexity.validate(df)
# this should not
now_w_moar_complexity.recode(df, validate=True)
df
# helpers for snapshotting validation output as JSON strings (used below to
# print the fixtures checked into the test files)
def sort_columns(df):
    return df.T.sort_index().T.sort_index()

def check_and_print(df, column):
    # compares against the module-level `validate_all_json` defined below
    if sort_columns(column.validate(df).reset_index()).equals(sort_columns(pd.read_json(validate_all_json))):
        print(f"validate_all_json = '''{validate_all_json}'''")
    else:
        raise ValueError()
validate_input = col5_split._validate_input(df).reset_index()
validate_input
validate_input_json = validate_input.to_json()
print(f"validate_input_json = '''{validate_input_json}'''")
pd.read_json(validate_input_json)
validate_output = col5_split._validate_output(df).reset_index()
validate_output
validate_output_json = validate_output.to_json()
print(f"validate_output_json = '''{validate_output_json}'''")
pd.read_json(validate_output_json)
validate_all = col5_split.validate(df).reset_index()
validate_all
validate_all_json = validate_all.to_json()
print(f"validate_all_json = '''{validate_all_json}'''")
pd.read_json(validate_all_json)
validate_input = col6_7_8_join._validate_input(df).reset_index()
validate_input
validate_input_json = validate_input.to_json()
print(f"validate_input_json = '''{validate_input_json}'''")
pd.read_json(validate_input_json)
validate_output = col6_7_8_join._validate_output(df).reset_index()
validate_output
validate_output_json = validate_output.to_json()
print(f"validate_output_json = '''{validate_output_json}'''")
pd.read_json(validate_output_json)
validate_all = col6_7_8_join.validate(df).reset_index()
validate_all
validate_all_json = validate_all.to_json()
print(f"validate_all_json = '''{validate_all_json}'''")
pd.read_json(validate_all_json)
enforcer_recode = now_w_moar_complexity.recode(df, validate=True)
enforcer_recode_json = enforcer_recode.to_json()
print(f"enforcer_recode_json = '''{enforcer_recode_json}'''")
pd.read_json(enforcer_recode_json)