A python package to facilitate the iterative process of developing and using schema-like representations of table data to recode and validate instances of these data stored in pandas DataFrames. This is a fairly young attempt to solve a recurrent problem many people have. So far I have looked at multiple solutions, but none really did it for me.
They either deal primarily with JSON encoded data or they only really solve the validation side of the problem and consider recoding to be a separate issue. They seem to assume that recoding and cleaning has already been done and all we care about is making sure the final product is sane.
To me, this seems backwards.
I need to load, recode, and validate tables all day, everyday. Sometimes
its simple; I can pandas.read_table() and all is good. But sometimes I
have a 700 column long RedCap data dump that is complicated af, and it
really helps me to develop my recoding logic through an iterative
process. For me it makes sense to couple the recoding process directly
with the validation process: to write the "tests" for each column first,
then add recoding logic in steps until the tests pass.
So Table Enforcer is my attempt to apply a sort of "test driven development" workflow to data cleaning and validation.
Column object that represents the ideal state of your
data by passing a list of small, independent, reusable validator
functions and some descriptive information.Enforcer object by passing it a list of your column
representation objects.Please take a look and offer thoughts/advice.
Enforcer and Column classes to define what columns should look
like in a table.CompundColumn class that supports complex operations including
"one-to-many" and "many-to-one" recoding logic as sometimes a column
tries to do too much and should really be multiple columns as well
as the reverse.between_4_and_60().%load_ext autoreload
%autoreload 2
from pathlib import Path
import re
from box import Box
import pandas as pd
import numpy as np
from table_enforcer import Enforcer, Column, CompoundColumn
import table_enforcer.errors as e
from table_enforcer import validate as v
from table_enforcer import recode as r
table_path = "../tests/files/demo_table.csv"
df = pd.read_csv(table_path)
df
Some validator functions come built-in (See the table_enforcer/validate/funcs.py module code for current list)
# ! cat $v.funcs.__file__
In general, validators take a single pandas.Series object as input and return a pandas.Series of the same shape and indexes containing True or False relative to which items passed the validation logic.
This changes a little if you are using a decorator like @maxmin (See section on Decorating Validators)
def gte2(series):
return series >= 2
def lte10(series):
return series <= 10
def length_is_one(series):
return series.str.len() == 1
def valid_sex(series):
sex = set(['M', 'F'])
return series.isin(sex)
You can define decorators that extend the call signature of validators to accomadate more complex things like ranges.
Here we can define a single validator that accomplishes the same thing as the combination of the first two validators that we defined above:
@v.decorators.minmax(low=2, high=10)
def bt_2_and_10(series):
"""Test that the data items fall within range: 2 <= x <= 10."""
return series
Note: pay attention to the fact that here we return the original series object. We didn't do ANYTHING to it. The testing gets done in the @v.decorators.minmax decorator function. This allows our validators to always expect a single argument at run-time and always return a single argument. This is very important.
This is a bit beyond this demo's scope but here is the table_enforcer/validate/decorators.py code so that you can get an idea oh how to do it if you are brave.
# !cat $v.decorators.__file__
Now that we have a few validator functions, its time to put together a representation of our table and use that object to help us get us a final data table that is in a form we like. We will start with col1.
df
Lets say for this example we only care about col1,col3, and col4.
We create a column object for each column we want that contains the name of the column we want to represent. We set the options to describe our ideal format. We set the values for each option for each column including a a list of validator functions that should all pass for all items in each column if we got our way.
For now, we will ignore the recoders argument. What we end up with is the three column objects below.
col1 = Column(name='col1',
dtype=np.int,
unique=False,
validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
recoders=None)
col3 = Column(name='col3',
dtype=np.int,
unique=True,
validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
recoders=None)
col4 = Column(name='col4',
dtype=str,
unique=False,
validators=[v.funcs.upper, length_is_one, valid_sex],
recoders=None)
For each column object we store the validator functions in a dict.
Here is the value for col4
col4.validators
We validate a column by passing our entire original dataframe to the column object. The tests get run independently and after the data has passed through each test, we get a dataframe of results. We can see right away which data items fail which tests.
This helps us plan our recoding efforts later. We can immediately see what sorts of operations need to be run to convert those False tests to True.
col4.validate(df)
In reality we can have hundreds or thousands of rows and it would be nice to focus on only those that fail validation. Here is how you do that.
col4.validate(df, failed_only=True)
Wait a minute... Where did that validation test called dtype come from?
Turns out we get that one for free since we defined that this column needs to be of dtype np.int. We get a similar unique validation check inserted if we set that option to True.
Recoder functions have a very similar structure to validator functions. But their purpose is to coerce the raw data into a more useful form. For example Col4 is obviously meant to represent 'male'/'female' logic. But we don't want to have to interpret all possible ways to represent that logic every time we want to use that column. So we write a recoder function to do convert all items to either "M" or "F" so we only need to understand two choices from now on.
df
Some recoder functions come built-in (See the table_enforcer/recode/funcs.py module code for current list)
# !cat $r.funcs.__file__
Like validators, recoders take a single pandas.Series object as input and return a pandas.Series of the same shape and indexes as the original series object. However, instead of returning a series of True/False values, it performs some operation on the data that gets the column data closer to being how you want it to look during analysis operations.
You are able to build a single, monolithic recoder that transforms a column all the way into what you want in a single step. But it may be better to write recoders that perform only a single step on the way to getting a particular column into shape.
This results in reusable functions that are "composable". Meaning that for the NEXT column, you may not even need to write a new function. All you may need to do is list a sequence of recoders that already exist.
This is because, recoders are applied in a pipeline model. The output from the first is supplied to the second etc.
Now, lets write our recoders to transform col4.
def standardize_sex(series):
"""Return a series where common representations of 'male'/'female' are standardized.
Things like ['M', 'MALE', 'M', 'BOY', ...] are converted to `M`.
Representations of female are treated similarly.
"""
mapper = {"M": "M",
"MALE": "M",
"BOY": "M",
"F": "F",
"FEMALE": "F",
"GIRL": "F",
}
if series.str.islower().any():
raise ValueError("standardize_sex expects input series to contain only UPPERCASE letters.")
else:
return series.apply(lambda x: mapper[x])
Since we plan to call the built-in recoder upper first, we only need to support uppercase text here. This is kind of a silly requirement in practice but it demonstrates how you can write composable recoders to fail if expectations are not met.
Lets revisit our original definition of col4 and add the recoders we have now.
col4 = Column(name='col4',
dtype=str,
unique=False,
validators=[v.funcs.upper, length_is_one, valid_sex],
recoders=[r.funcs.upper, standardize_sex])
Now what do we have after we recode this column?
col4.recode(df)
That looks pretty good. But let's not rely on our eyes and instead rely on the validation tests we defined.
Now we try col4.recode again but this time tell it to perform validation after recoding and see what happens.
If all is well, we will simply get the recoded column back. If validation fails, we will get a ValidationError raised.
col4.recode(df, validate=True)
We are good! We can now move on to our other columns and repeat the process until all the validation check pass on all of our columns. Then we can populate our Enforcer object and we will be nearing the end of our initial sanity check / recoding phase for this table.
Just pass a list of the column objects we created to the Enforcer. The validation/recoding api for the Enforcer mirrors the one for Columns so you can reuse what we learned above.
demo = Enforcer(columns=[col1,col3,col4])
demo.columns
demo.recode(df, validate=True)
NOTE:
demo.recode(df, validate=True) encounters problems in recoding or validating when we pass some bad data to our load_csv() function later.Use these enforcers to recode or validate recoded tables of the same kind as your source table wherever your applications use that type of data like in your loading functions.
You can pass them into a function as an argument or reference them from the script's local scope as you would an imported function.
I pass it in here so that I can easily switch between different enforcers for the examples.
def load_csv(path, enforcer):
df = pd.read_csv(path)
return enforcer.recode(df, validate=True)
df
df2 = load_csv(path=table_path, enforcer=demo)
df2
In this case we load a CSV that is slightly different that the one we built our enforcer on: perhaps this is represents a new data dump or data from another source.
col1 includes a value less than 2 and we have a validator that requires that all values be between 2 and 10.
ValidationError with helpful information to be raised.Feemale is not found in the standardize_sex() recoder's mapper dictionary.
RecodingError to be raised.table_path_bad = "../tests/files/demo_table_bad.csv"
df_bad = pd.read_csv(table_path_bad)
df_bad
A ValidationError is raised when we try to get a recoded table or column while setting the recode method's option validate=True.
Lets see what happens when we try to load this table with our original demo enforcer.
load_csv(path=table_path_bad, enforcer=demo)
A column object raised a ValidationError because it saw some data that did not pass validation while inside the recode method because we told it to validate. Also, it gives us some helpful information:
Now we add a recoder that deals with that value and move on.
# write another recoder
def handle_values_below_detection_limit(series):
series[series < 2] = 2
return series
# add the recoder to a Column representing `col1`
col1_new = Column(name='col1',
dtype=np.int,
unique=False,
validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
recoders=[handle_values_below_detection_limit])
# build enforcer with the corrected Column object
demo2 = Enforcer(columns=[col1_new,
col3,
col4])
Lets try this again with demo2.
df3 = load_csv(path=table_path_bad, enforcer=demo2)
df3
Problem 1 seems to be fixed, but now we get this new RecodingError.
A RecodingError is raised when a recoder function fails and raises an exception: any exception.
Like the ValidationError it tries to give you as much helpful information about where and how it failed.
In this case we need to update our recoder to deal with this new typo of the word "female"
# update the recoder
def standardize_sex2(series):
"""Return a series where common representations of 'male'/'female' are standardized.
Things like ['M', 'MALE', 'M', 'BOY', ...] are converted to `M`.
Representations of female are treated similarly.
"""
mapper = {"M": "M",
"MALE": "M",
"BOY": "M",
"F": "F",
"FEMALE": "F",
"GIRL": "F",
"FEEMALE": "F",
}
if series.str.islower().any():
raise ValueError("standardize_sex expects input series to contain only UPPERCASE letters.")
else:
return series.apply(lambda x: mapper[x])
# add the recoder to a Column representing `col4`
col4_new = Column(name='col4',
dtype=str,
unique=False,
validators=[v.funcs.upper, length_is_one, valid_sex],
recoders=[r.funcs.upper, standardize_sex2])
# build enforcer with the corrected Column object
demo3 = Enforcer(columns=[col1_new,
col3,
col4_new])
OK one more time with demo3 this time.
df4 = load_csv(path=table_path_bad, enforcer=demo3)
df4
Success! Problem 2 is no also fixed and we are ready to get back to work.
...until the next head ache.
Overview of CompoundColumn class:
CompoundColumn class is based on the same class that the Column inherits from (BaseColumn).Column class that you are used to.validate, recode, and update_dataframeupdate_dataframe is used by the Enforcer class and determines how the Column object adds columns to the final table.Column.Column objects.Columns and will actually contain three plain Column objects.Column.input_columns: A list of fully defined Column objects, each representing a single column in the original Dataframe. output_columns: A list of fully defined Column objects, each representing a single column in the final DataFrame. column_transform: A function accepting the input DataFrame, performing transformations to it and returning a new DataFrame containing the TRANSFORMED columns only.Interaction with Enforcer objects:
BaseColumn and implement the three essential methods compatible with the interface that Enforcer objects expect each column to use. validaterecodeupdate_dataframeEnforcer objects works exactly like the Column objects we already know.print(CompoundColumn.__init__.__doc__)
col5 into two new columns so we must keep tract of how to deal with three columns:df
## Validators and Recoders
bad_chars = re.compile(pattern="""[*(]""")
def no_bad_characters(series):
"""Validator"""
def test(x):
if bad_chars.search(x) is None:
return True
else:
return False
return series.astype(str).apply(test)
def fix_bad_characters(series):
"""Recoder"""
def fix(x):
return bad_chars.sub(repl='', string=x)
return series.astype(str).apply(fix)
def recode_upper(series):
return series.astype(str).str.upper()
def to_int(series):
return series.astype(np.int)
def to_str(series):
return series.astype(str)
# Transformation function
def split_on_colon(df):
cols = Box()
cols.col5_number = df.col5.apply(lambda x: x.split(":")[0])
cols.col5_word = df.col5.apply(lambda x: x.split(":")[1])
new_columns = pd.DataFrame(cols)[["col5_number", "col5_word"]]
return new_columns
## Defining the Input Column
col5 = Column(
name='col5',
dtype=str,
unique=False,
validators=[v.funcs.not_null, no_bad_characters],
recoders=[fix_bad_characters])
## Defining the Output Columns (col5_a/col5_b)
col5_a = Column(
name='col5_number',
dtype=np.int,
unique=False,
validators=[v.funcs.not_null,],
recoders=[to_int],)
col5_b = Column(
name='col5_word',
dtype=str,
unique=False,
validators=[v.funcs.not_null, v.funcs.upper, no_bad_characters],
recoders=[to_str, recode_upper],)
## Defining the Compound Column
col5_split = CompoundColumn(input_columns=[col5],
output_columns=[col5_a, col5_b],
column_transform=split_on_colon)
col5_split.validate we get a more complex dataframe than before.col5_split.validate(df)
col5_split.recode returns a pd.DataFrame representing the output_columns in the provided order.col5_split.recode(df, validate=True)
list representing the names of which if any of the tests were done for each subject.CompoundColumn class.df
# helper functions
def is_subset(x, ref_set):
if not isinstance(ref_set, set):
valid = set(ref_set)
if isinstance(x, (list, tuple, set)):
set_x = set(x)
else:
set_x = set([x])
return set_x.issubset(ref_set)
# Transformation function
def join_as_tuple(df):
cols = Box()
cols.col6_7_8 = df[["col6", "col7", "col8"]].apply(lambda row: (row.col6, row.col7, row.col8,), axis=1)
new_columns = pd.DataFrame(cols)
return new_columns
# Validators
def col6_valid_values(series):
"""Validator"""
valid = [None, "DNASeq"]
return series.apply(is_subset, ref_set=valid)
def col7_valid_values(series):
"""Validator"""
valid = [None, "Protein Function"]
return series.apply(is_subset, ref_set=valid)
def col8_valid_values(series):
"""Validator"""
valid = [None, "RNASeq"]
return series.apply(is_subset, ref_set=valid)
def col6_7_8_valid_values(series):
"""Validator"""
valid = set(["DNASeq", "Protein Function", "RNASeq"])
return series.apply(is_subset, ref_set=valid)
# Recoders
def translate_col6(series):
"""Recode 0-> None; 1-> 'DNASeq' """
def rcode(x):
mapping = {0: None, 1: "DNASeq"}
return mapping[x]
return series.apply(rcode)
def translate_col7(series):
"""Recode 0-> None; 1-> 'Protein Function' """
def rcode(x):
mapping = {0: None, 1: "Protein Function"}
return mapping[x]
return series.apply(rcode)
def translate_col8(series):
"""Recode 0-> None; 1-> 'RNASeq' """
def rcode(x):
mapping = {0: None, 1: "RNASeq"}
return mapping[x]
return series.apply(rcode)
def setify_drop_nones(series):
"""Convert to sets and drop ``None`` values."""
def drop_nones(x):
x.discard(None)
return x
return series.apply(lambda x: set(x)).apply(drop_nones).apply(list)
# Defining the Input Columns
col6 = Column(
name='col6',
dtype=(str, type(None)),
unique=False,
validators=[col6_valid_values],
recoders=[translate_col6],)
col7 = Column(
name='col7',
dtype=(str, type(None)),
unique=False,
validators=[col7_valid_values],
recoders=[translate_col7],)
col8 = Column(
name='col8',
dtype=(str, type(None)),
unique=False,
validators=[col8_valid_values],
recoders=[translate_col8],)
# Defining the Output Column
col6_7_8 = Column(
name='col6_7_8',
dtype=list,
unique=False,
validators=[v.funcs.not_null, col6_7_8_valid_values],
recoders=[setify_drop_nones],)
# Defining the Compound Column
col6_7_8_join = CompoundColumn(input_columns=[col6, col7, col8],
output_columns=[col6_7_8],
column_transform=join_as_tuple)
df
col6_7_8_join.validate we, again, get a more complex dataframe than with a Column object.col6_7_8_join.validate(df)
col6_7_8_join.recode(df, validate=True)
Enforcer object.CompoundColumns.now_w_moar_complexity = Enforcer(columns=[col1,
col3,
col4,
col5_split,
col6_7_8_join])
Enforcer.BaseColumn includes a special set of methods that allow it to interact with Enforcer objects in their specific ways.# this should fail
now_w_moar_complexity.validate(df)
# this should not
now_w_moar_complexity.recode(df, validate=True)
df
def sort_columns(df):
return df.T.sort_index().T.sort_index()
def check_and_print(df, column):
if sort_columns(column.validate(df).reset_index()).equals(sort_columns(pd.read_json(validate_all_json))):
print(f"validate_all_json = '''{validate_all_json}'''")
else:
raise ValueError()
validate_input = col5_split._validate_input(df).reset_index()
validate_input
validate_input_json = validate_input.to_json()
print(f"validate_input_json = '''{validate_input_json}'''")
pd.read_json(validate_input_json)
validate_output = col5_split._validate_output(df).reset_index()
validate_output
validate_output_json = validate_output.to_json()
print(f"validate_output_json = '''{validate_output_json}'''")
pd.read_json(validate_output_json)
validate_all = col5_split.validate(df).reset_index()
validate_all
validate_all_json = validate_all.to_json()
print(f"validate_all_json = '''{validate_all_json}'''")
pd.read_json(validate_all_json)
validate_input = col6_7_8_join._validate_input(df).reset_index()
validate_input
validate_input_json = validate_input.to_json()
print(f"validate_input_json = '''{validate_input_json}'''")
pd.read_json(validate_input_json)
validate_output = col6_7_8_join._validate_output(df).reset_index()
validate_output
validate_output_json = validate_output.to_json()
print(f"validate_output_json = '''{validate_output_json}'''")
pd.read_json(validate_output_json)
validate_all = col6_7_8_join.validate(df).reset_index()
validate_all
validate_all_json = validate_all.to_json()
print(f"validate_all_json = '''{validate_all_json}'''")
pd.read_json(validate_all_json)
enforcer_recode = now_w_moar_complexity.recode(df, validate=True)
enforcer_recode_json = enforcer_recode.to_json()
print(f"enforcer_recode_json = '''{enforcer_recode_json}'''")
pd.read_json(enforcer_recode_json)