Table Enforcer Demo

Description

A Python package to facilitate the iterative process of developing and using schema-like representations of table data to recode and validate instances of those data stored in pandas DataFrames. This is a fairly young attempt to solve a recurring problem. I have looked at several existing solutions, but none really fit my needs.

They either deal primarily with JSON-encoded data, or they only solve the validation side of the problem and treat recoding as a separate issue. They seem to assume that recoding and cleaning have already been done and that all we care about is making sure the final product is sane.

To me, this seems backwards.

I need to load, recode, and validate tables all day, every day. Sometimes it's simple; I can pandas.read_table() and all is good. But sometimes I have a 700-column RedCap data dump that is seriously complicated, and it really helps to develop my recoding logic through an iterative process. For me it makes sense to couple the recoding process directly with the validation process: to write the "tests" for each column first, then add recoding logic in steps until the tests pass.

So Table Enforcer is my attempt to apply a sort of "test-driven development" workflow to data cleaning and validation.

Basic Workflow

  1. For each column that you care about in your source table:
    1. Define a Column object that represents the ideal state of your data by passing a list of small, independent, reusable validator functions and some descriptive information.
    2. Use this object to validate the column data from your source table.
      • It will fail.
    3. Add small, composable, reusable recoding functions to the column object and iterate until your validations pass.
  2. Define an Enforcer object by passing it a list of your column representation objects.
  3. This enforcer can then be used to recode or validate tables of the same kind as your source table wherever your applications consume that type of data.

Please take a look and offer thoughts/advice.

Features

  • Enforcer and Column classes to define what columns should look like in a table.
  • CompoundColumn class that supports complex operations, including "one-to-many" and "many-to-one" recoding logic, for when a column tries to do too much and should really be multiple columns, or the reverse.
  • Growing cadre of built-in validator functions and decorators.
  • Decorators for use in defining parameterized validators like between_4_and_60().

Imports

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
from pathlib import Path
import re

from box import Box
import pandas as pd
import numpy as np

Table Enforcer

In [3]:
from table_enforcer import Enforcer, Column, CompoundColumn
import table_enforcer.errors as e

from table_enforcer import validate as v
from table_enforcer import recode as r

Constants

Load or create your Table

In [4]:
table_path = "../tests/files/demo_table.csv"
df = pd.read_csv(table_path)
df
Out[4]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1

Validator Functions

Built-in Validators

Some validator functions come built-in (see the table_enforcer/validate/funcs.py module for the current list).

In [5]:
# ! cat $v.funcs.__file__

Validator Call Signatures

In general, validators take a single pandas.Series object as input and return a pandas.Series of the same shape and index containing True or False, indicating which items passed the validation logic.

This changes a little if you are using a decorator like @minmax (see the section on Decorating Validators).

Writing Custom Validators

In [6]:
def gte2(series):
    return series >= 2

def lte10(series):
    return series <= 10

def length_is_one(series):
    return series.str.len() == 1

def valid_sex(series):
    sex = set(['M', 'F'])
    return series.isin(sex)

Decorating Validators

You can define decorators that extend the call signature of validators to accommodate more complex things like ranges.

Here we can define a single validator that accomplishes the same thing as the combination of the first two validators that we defined above:

In [7]:
@v.decorators.minmax(low=2, high=10)
def bt_2_and_10(series):
    """Test that the data items fall within range: 2 <= x <= 10."""
    return series

Note: pay attention to the fact that here we return the original series object unchanged. We didn't do ANYTHING to it. The testing gets done inside the @v.decorators.minmax decorator. This allows our validators to always take a single series at run time and always return a single series. This is very important.
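
The between_4_and_60() validator mentioned in the Features list could be built the same way. A sketch (this exact function is not defined by the package; the name comes from the Features list above):

@v.decorators.minmax(low=4, high=60)
def between_4_and_60(series):
    """Test that the data items fall within range: 4 <= x <= 60."""
    return series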

Writing Custom Validator Decorators

This is a bit beyond this demo's scope, but here is the table_enforcer/validate/decorators.py code so that you can get an idea of how to do it if you are brave.

In [8]:
# !cat $v.decorators.__file__
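
To give a flavor without dumping the file: a range-checking decorator along the lines of minmax could be written roughly like this. This is a hypothetical sketch, not the package's actual implementation:

import functools

def minmax_sketch(low, high):
    """Wrap a validator so the series it returns is range-tested here."""
    def decorator(validator):
        @functools.wraps(validator)  # preserve the validator's name for reporting
        def wrapper(series):
            series = validator(series)  # the validator returns the series unchanged
            return (series >= low) & (series <= high)
        return wrapper
    return decorator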

Defining Columns

Now that we have a few validator functions, it's time to put together a representation of our table and use that object to help us produce a final data table in a form we like. We will start with col1.

In [9]:
df
Out[9]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1

Let's say for this example we only care about col1, col3, and col4.

We create a Column object for each column we want to keep. Each one contains the name of the column it represents, options describing our ideal format, and a list of validator functions that should all pass for every item in that column if we got our way.

For now, we will ignore the recoders argument. What we end up with is the three column objects below.

In [10]:
col1 = Column(name='col1',
             dtype=np.int,
             unique=False,
             validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
             recoders=None)

col3 = Column(name='col3',
             dtype=np.int,
             unique=True,
             validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
             recoders=None)

col4 = Column(name='col4',
             dtype=str,
             unique=False,
             validators=[v.funcs.upper, length_is_one, valid_sex],
             recoders=None)

Validating a Column object

Each column object stores its validator functions in a dict.

Here is the dict for col4:

In [11]:
col4.validators
Out[11]:
{'length_is_one': <function __main__.length_is_one>,
 'upper': <function table_enforcer.validate.funcs.upper>,
 'valid_sex': <function __main__.valid_sex>}

We validate a column by passing our entire original dataframe to the column object. The tests run independently, and after the data has passed through each test we get a dataframe of results. We can see right away which data items fail which tests.

This helps us plan our recoding efforts later: we can immediately see what sorts of operations are needed to convert those False results to True.

In [12]:
col4.validate(df)
Out[12]:
length_is_one upper valid_sex dtype
0 False False False True
1 True False False True
2 True True True True
3 False False False True

In reality we may have hundreds or thousands of rows, and it would be nice to focus only on those that fail validation. Here is how you do that.

In [13]:
col4.validate(df, failed_only=True)
Out[13]:
length_is_one upper valid_sex dtype
0 False False False True
1 True False False True
3 False False False True

Wait a minute... Where did that validation test called dtype come from?

Turns out we get that one for free, since we declared that this column must be of dtype np.int. A similar unique validation check is inserted if we set that option to True.
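
Conceptually (a sketch, not the package's actual internals), those free checks behave like per-item validators along these lines:

def dtype_check(series):
    # each item should be an instance of the declared dtype (np.int here)
    return series.apply(lambda x: isinstance(x, (int, np.integer)))

def unique_check(series):
    # True for items that appear exactly once in the column
    return ~series.duplicated(keep=False)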

Recoder Functions

Recoder functions have a structure very similar to validator functions, but their purpose is to coerce the raw data into a more useful form. For example, col4 is obviously meant to represent 'male'/'female' logic, but we don't want to have to interpret all possible representations of that logic every time we use the column. So we write a recoder function to convert all items to either "M" or "F", and from then on we only need to understand two choices.

In [14]:
df
Out[14]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1

Built-in Recoders

Some recoder functions come built-in (see the table_enforcer/recode/funcs.py module for the current list).

In [15]:
# !cat $r.funcs.__file__

Recoder Call Signatures

Like validators, recoders take a single pandas.Series object as input and return a pandas.Series of the same shape and index as the original. However, instead of returning a series of True/False values, a recoder performs some operation that gets the column data closer to how you want it to look during analysis.
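
For example, a minimal recoder that fits this contract (hypothetical, not part of the package):

def strip_whitespace(series):
    """Recoder: strip leading/trailing whitespace; same shape and index as the input."""
    return series.str.strip()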

Writing Custom Recoders

Recoders are Meant to be Composable

You could build a single, monolithic recoder that transforms a column all the way into what you want in one step. But it is usually better to write recoders that each perform only a single step on the way to getting a particular column into shape.

This results in reusable, "composable" functions: for the NEXT column, you may not even need to write a new function; you may only need to list a sequence of recoders that already exist.

This works because recoders are applied in a pipeline model: the output of the first is supplied to the second, and so on.
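
Conceptually, the pipeline is just function composition (a sketch, not the package's internals):

def apply_recoders(series, recoders):
    for recoder in recoders:
        series = recoder(series)  # the output of one recoder feeds the next
    return series

# e.g. apply_recoders(df['col4'], [r.funcs.upper, standardize_sex])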

Now, let's write our recoders to transform col4.

In [16]:
def standardize_sex(series):
    """Return a series where common representations of 'male'/'female' are standardized.
    
    Things like ['M', 'MALE', 'BOY', ...] are converted to `M`.
    Representations of female are treated similarly. 
    """
    mapper = {"M": "M",
              "MALE": "M",
              "BOY": "M",
              "F": "F",
              "FEMALE": "F",
              "GIRL": "F",
             }
    if series.str.islower().any():
        raise ValueError("standardize_sex expects input series to contain only UPPERCASE letters.")
    else:
        return series.apply(lambda x: mapper[x])

Since we plan to call the built-in recoder upper first, we only need to support uppercase text here. This is kind of a silly requirement in practice, but it demonstrates how you can write composable recoders that fail fast when their expectations are not met.

Next Iteration: Validate Recoded Table

Let's revisit our original definition of col4 and add the recoders we now have.

In [17]:
col4 = Column(name='col4',
             dtype=str,
             unique=False,
             validators=[v.funcs.upper, length_is_one, valid_sex],
             recoders=[r.funcs.upper, standardize_sex])

Now what do we have after we recode this column?

In [18]:
col4.recode(df)
Out[18]:
col4
0 M
1 M
2 F
3 F

That looks pretty good. But let's not rely on our eyes and instead rely on the validation tests we defined.

Now we try col4.recode again, but this time we tell it to perform validation after recoding and see what happens.

If all is well, we simply get the recoded column back. If validation fails, a ValidationError is raised.
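
If you would rather handle a failure programmatically than let it propagate, you can catch the error. A sketch, assuming ValidationError is exposed on the table_enforcer.errors module we imported as e:

try:
    recoded = col4.recode(df, validate=True)
except e.ValidationError as err:
    # the message embeds a per-row report of which validators failed
    print(err)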

In [19]:
col4.recode(df, validate=True)
Out[19]:
col4
0 M
1 M
2 F
3 F

We are good! We can now move on to our other columns and repeat the process until all the validation checks pass on all of our columns. Then we can populate our Enforcer object, and we will be nearing the end of the initial sanity-check/recoding phase for this table.

Defining Table Enforcers

Just pass a list of the column objects we created to the Enforcer. The validation/recoding API for the Enforcer mirrors the one for Columns, so you can reuse what we learned above.

In [20]:
demo = Enforcer(columns=[col1,col3,col4])
In [21]:
demo.columns
Out[21]:
[<table_enforcer.main_classes.Column at 0x7f6bfc99b9e8>,
 <table_enforcer.main_classes.Column at 0x7f6bfc99bf28>,
 <table_enforcer.main_classes.Column at 0x7f6bfc9371d0>]
In [22]:
demo.recode(df, validate=True)
Out[22]:
col1 col3 col4
0 7 10 M
1 2 6 M
2 6 2 F
3 5 5 F

NOTE:

  • We will see what happens when demo.recode(df, validate=True) encounters problems in recoding or validating, when we pass some bad data to our load_csv() function later.

Writing loading functions that use enforcers

Use these enforcers to recode or validate tables of the same kind as your source table wherever your applications consume that type of data, such as in your loading functions.

You can pass them into a function as an argument or reference them from the script's local scope as you would an imported function.

I pass it in here so that I can easily switch between different enforcers for the examples.

In [23]:
def load_csv(path, enforcer):
    df = pd.read_csv(path)
    return enforcer.recode(df, validate=True)

Loading Good Data

In [24]:
df
Out[24]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1
In [25]:
df2 = load_csv(path=table_path, enforcer=demo)
df2
Out[25]:
col1 col3 col4
0 7 10 M
1 2 6 M
2 6 2 F
3 5 5 F

Loading Bad Data

In this case we load a CSV that is slightly different from the one we built our enforcer on: perhaps it represents a new data dump, or data from another source.

The problems

  1. col1 includes a value less than 2, and we have a validator that requires all values to be between 2 and 10.

    • Perhaps this value represents some lab test with a limit of detection of 2, so we can't confidently distinguish values between 0 and 2.
    • So we want to set all values below the detection limit to the detection limit itself. However, we neglected to include a recoder for this, since we didn't see these kinds of values in our first dataset.
    • We will see how this causes a ValidationError with helpful information to be raised.
  2. Feemale is not found in the standardize_sex() recoder's mapper dictionary.

    • Here we encounter a value that actually has nothing to do with a validation check failing, but breaks our recoder itself.
    • This time, it causes a helpful RecodingError to be raised.
In [26]:
table_path_bad = "../tests/files/demo_table_bad.csv"
df_bad = pd.read_csv(table_path_bad)
df_bad
Out[26]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 1 6 2 F 10:ten 0 1 0
3 6 5 5 Feemale 6:Six 1 1 1

ValidationErrors

A ValidationError is raised when we ask the recode method for a recoded table or column with validate=True and the recoded data fails one or more validations.

Let's see what happens when we try to load this table with our original demo enforcer.

In [27]:
load_csv(path=table_path_bad, enforcer=demo)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-27-f46767c92254> in <module>()
----> 1 load_csv(path=table_path_bad, enforcer=demo)

<ipython-input-23-c3b512fc9549> in load_csv(path, enforcer)
      1 def load_csv(path, enforcer):
      2     df = pd.read_csv(path)
----> 3     return enforcer.recode(df, validate=True)

~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in recode(self, table, validate)
     67
     68         for column in self.columns:
---> 69             df = column.update_dataframe(df, table=table, validate=validate)
     70
     71         return df

~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in update_dataframe(self, df, table, validate)
     81         """Perform ``self.recode`` and add resulting column(s) to ``df`` and return ``df``."""
     82         df = df.copy()
---> 83         recoded_columns = self.recode(table=table, validate=validate)
     84         return pd.concat([df, recoded_columns], axis=1)
     85

~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in recode(self, table, validate)
    202             failed_rows = find_failed_rows(self.validate(data.to_frame()))
    203             if failed_rows.shape[0] > 0:
--> 204                 raise ValidationError(f"Rows that failed to validate for column '{self.name}':\n{failed_rows}")
    205
    206         return data.to_frame()

ValidationError: Rows that failed to validate for column 'col1':
   bt_2_and_10  not_null  positive  dtype
2        False      True      True   True

A column object raised a ValidationError because, inside the recode method, it saw data that did not pass validation (because we told it to validate). It also gives us some helpful information:

  1. The column name
  2. The rows that did not pass validation
  3. And which validations happened to fail on each row

Now we add a recoder that deals with that value and move on.

In [54]:
# write another recoder
def handle_values_below_detection_limit(series):
    series = series.copy()  # work on a copy to avoid mutating the caller's data
    series[series < 2] = 2
    return series

# add the recoder to a Column representing `col1`
col1_new = Column(name='col1',
                  dtype=np.int,
                  unique=False,
                  validators=[v.funcs.not_null, v.funcs.positive, bt_2_and_10],
                  recoders=[handle_values_below_detection_limit])

# build enforcer with the corrected Column object
demo2 = Enforcer(columns=[col1_new,
                          col3,
                          col4])

Let's try this again with demo2.

In [53]:
df3 = load_csv(path=table_path_bad, enforcer=demo2)
df3
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in recode(self, table, validate)
    196             try:
--> 197                 data = recoder(data)
    198             except (BaseException) as err:

<ipython-input-16-a3488642c87b> in standardize_sex(series)
     16     else:
---> 17         return series.apply(lambda x: mapper[x])

~/.anaconda/envs/table_enforcer/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-16-a3488642c87b> in <lambda>(x)
     16     else:
---> 17         return series.apply(lambda x: mapper[x])

KeyError: 'FEEMALE'

During handling of the above exception, another exception occurred:

RecodingError                             Traceback (most recent call last)
<ipython-input-53-80a005ab7cfc> in <module>()
----> 1 df3 = load_csv(path=table_path_bad, enforcer=demo2)
      2 df3

<ipython-input-23-c3b512fc9549> in load_csv(path, enforcer)
      1 def load_csv(path, enforcer):
      2     df = pd.read_csv(path)
----> 3     return enforcer.recode(df, validate=True)

~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in recode(self, table, validate)
     67
     68         for column in self.columns:
---> 69             df = column.update_dataframe(df, table=table, validate=validate)
     70
     71         return df

~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in update_dataframe(self, df, table, validate)
     81         """Perform ``self.recode`` and add resulting column(s) to ``df`` and return ``df``."""
     82         df = df.copy()
---> 83         recoded_columns = self.recode(table=table, validate=validate)
     84         return pd.concat([df, recoded_columns], axis=1)
     85

~/src/repos/git/project-repos/table_enforcer/table_enforcer/main_classes.py in recode(self, table, validate)
    197                 data = recoder(data)
    198             except (BaseException) as err:
--> 199                 raise RecodingError(col, recoder, err)
    200
    201         if validate:

RecodingError: Recoder 'standardize_sex' raised the following error on column 'col4': KeyError('FEEMALE',).

Problem 1 seems to be fixed, but now we get this new RecodingError.

RecodingErrors

A RecodingError is raised when a recoder function fails and raises an exception: any exception.

Like the ValidationError, it tries to give you as much helpful information as possible about where and how the failure occurred:

  1. The recoder that raised an exception
  2. The column name
  3. And the exception and its value.
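
Since recoding and validation failures raise distinct error types, a loader can tell them apart. A sketch, assuming both classes are exposed on table_enforcer.errors (imported as e):

try:
    df_clean = load_csv(path=table_path_bad, enforcer=demo2)
except e.RecodingError:
    raise  # the recoder itself broke: fix or extend the recoder
except e.ValidationError:
    raise  # recoding succeeded but the result failed its tests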

In this case we need to update our recoder to deal with this new misspelling of the word "female".

In [55]:
# update the recoder
def standardize_sex2(series):
    """Return a series where common representations of 'male'/'female' are standardized.
    
    Things like ['M', 'MALE', 'BOY', ...] are converted to `M`.
    Representations of female are treated similarly. 
    """
    mapper = {"M": "M",
              "MALE": "M",
              "BOY": "M",
              "F": "F",
              "FEMALE": "F",
              "GIRL": "F",
              "FEEMALE": "F",
             }
    if series.str.islower().any():
        raise ValueError("standardize_sex expects input series to contain only UPPERCASE letters.")
    else:
        return series.apply(lambda x: mapper[x])

# add the recoder to a Column representing `col4`
col4_new = Column(name='col4',
                  dtype=str,
                  unique=False,
                  validators=[v.funcs.upper, length_is_one, valid_sex],
                  recoders=[r.funcs.upper, standardize_sex2])

# build enforcer with the corrected Column object
demo3 = Enforcer(columns=[col1_new,
                          col3,
                          col4_new])

OK, one more time, now with demo3.

In [56]:
df4 = load_csv(path=table_path_bad, enforcer=demo3)
df4
Out[56]:
col1 col3 col4
0 7 10 M
1 2 6 M
2 2 2 F
3 6 5 F

Success! Problem 2 is now also fixed and we are ready to get back to work.

...until the next headache.

The CompoundColumn Class

Overview of CompoundColumn class:

  • The CompoundColumn class is based on the same class that the Column inherits from (BaseColumn).
  • This means it has many methods with the same names and general behavior as the Column class that you are used to.
    • namely: validate, recode, and update_dataframe
      • update_dataframe is used by the Enforcer class and determines how the Column object adds columns to the final table.
  • However, they behave slightly differently, because this class has a slightly different job to do than Column.
  • This class actually contains and coordinates multiple Column objects.
  • These objects allow more complex transformation and validation logic that spans multiple columns in the original table, the final table, or both.
    • Transformations like many-to-one and one-to-many column conversions, for example.
  • It is structured like a container of plain Column objects.
  • Here we introduce the concept of input_columns and output_columns.
  • Its call signature is fairly different from that of its cousin, the Column.
  • It accepts only three arguments:
    • input_columns: A list of fully defined Column objects, each representing a single column in the original Dataframe.
    • output_columns: A list of fully defined Column objects, each representing a single column in the final DataFrame.
    • column_transform: A function accepting the input DataFrame, performing transformations to it and returning a new DataFrame containing the TRANSFORMED columns only.

Interaction with Enforcer objects:

  • CompoundColumn is a subclass of BaseColumn and implements the three essential methods compatible with the interface that Enforcer objects expect each column to provide.
    • validate
    • recode
    • update_dataframe
  • So adding them to Enforcer objects works exactly like the Column objects we already know.
In [30]:
print(CompoundColumn.__init__.__doc__)
Construct a new ``CompoundColumn`` object.

        Args:
            input_columns (list, Column): A list of ``Column`` objects representing column(s) from the SOURCE table.
            output_columns (list, Column): A list of ``Column`` objects representing column(s) from the FINAL table.
            column_transform (Callable): Function accepting the table object, performing transformations to it and returning a DataFrame containing the NEW columns only.

One-to-many Column Transformation

  • Now we will focus on col5.
  • Here we want to split it into two columns, because it represents complex information encoded into a single column.
  • It would be better for us to treat it as separate columns.
  • In this example we will split col5 into two new columns, so we must keep track of three columns:
    • the original column plus two derivative columns.
In [31]:
df
Out[31]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1

Definition and Usage

In [32]:
## Validators and Recoders
bad_chars = re.compile(pattern="""[*(]""")

def no_bad_characters(series):
    """Validator"""
    def test(x):
        if bad_chars.search(x) is None:
            return True
        else:
            return False
    return series.astype(str).apply(test)


def fix_bad_characters(series):
    """Recoder"""
    def fix(x):
        return bad_chars.sub(repl='', string=x)
    return series.astype(str).apply(fix)

def recode_upper(series):
    return series.astype(str).str.upper()

def to_int(series):
    return series.astype(np.int)

def to_str(series):
    return series.astype(str)


# Transformation function
def split_on_colon(df):
    cols = Box()
    cols.col5_number = df.col5.apply(lambda x: x.split(":")[0])
    cols.col5_word = df.col5.apply(lambda x: x.split(":")[1])

    new_columns = pd.DataFrame(cols)[["col5_number", "col5_word"]]
    return new_columns



## Defining the Input Column
col5 = Column(
    name='col5',
    dtype=str,
    unique=False,
    validators=[v.funcs.not_null, no_bad_characters],
    recoders=[fix_bad_characters])

## Defining the Output Columns (col5_a/col5_b)
col5_a = Column(
    name='col5_number',
    dtype=np.int,
    unique=False,
    validators=[v.funcs.not_null,],
    recoders=[to_int],)

col5_b = Column(
    name='col5_word',
    dtype=str,
    unique=False,
    validators=[v.funcs.not_null, v.funcs.upper, no_bad_characters],
    recoders=[to_str, recode_upper],)

## Defining the Compound Column
col5_split = CompoundColumn(input_columns=[col5],
                            output_columns=[col5_a, col5_b],
                            column_transform=split_on_colon)
  • When we run col5_split.validate we get a more complex dataframe than before.
  • It contains the validation results for all three columns (the input column plus the two output columns).
In [33]:
col5_split.validate(df)
Out[33]:
dtype no_bad_characters not_null upper
validation_type column_name row
input col5 0 True False True True
1 True True True True
2 True True True True
3 True True True True
output col5_number 0 False True True True
1 False True True True
2 False True True True
3 False True True True
col5_word 0 True False True False
1 True True True False
2 True True True False
3 True True True False
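
To zero in on failures in this larger frame, plain pandas filtering works (a sketch):

report = col5_split.validate(df)
failed = report[~report.all(axis=1)]  # keep rows where any validator returned False
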
  • col5_split.recode returns a pd.DataFrame representing the output_columns in the provided order.
In [34]:
col5_split.recode(df, validate=True)
Out[34]:
col5_number col5_word
0 1 ONE
1 3 THREE
2 10 TEN
3 6 SIX

Many-To-One Column Transformation

  • Let's look at columns col6, col7, and col8.
    • Imagine that these columns result from a situation where a subject's tissue was submitted for zero or more different tests.
    • These columns track which tests were done for each subject.
  • In a pure SQL-like relational context we would likely want to break these data out into a supporting table with a one-to-many relation (1 subject to N tests).
  • But what if the data is destined for a more flexible format like MongoDB, where you can store these data in a list-like format?
    • In that case we may want to collect these columns into a single column where each item is a list naming which, if any, of the tests were done for each subject.
  • We can do this using the CompoundColumn class.
In [35]:
df
Out[35]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1

Definition and Usage

In [36]:
# helper functions
def is_subset(x, ref_set):
    if not isinstance(ref_set, set):
        ref_set = set(ref_set)  # coerce non-set references to a set

    if isinstance(x, (list, tuple, set)):
        set_x = set(x)
    else:
        set_x = set([x])

    return set_x.issubset(ref_set)


# Transformation function
def join_as_tuple(df):
    cols = Box()
    cols.col6_7_8 = df[["col6", "col7", "col8"]].apply(lambda row: (row.col6, row.col7, row.col8,), axis=1)

    new_columns = pd.DataFrame(cols)
    return new_columns


# Validators
def col6_valid_values(series):
    """Validator"""
    valid = [None, "DNASeq"]
    return series.apply(is_subset, ref_set=valid)

def col7_valid_values(series):
    """Validator"""
    valid = [None, "Protein Function"]
    return series.apply(is_subset, ref_set=valid)

def col8_valid_values(series):
    """Validator"""
    valid = [None, "RNASeq"]
    return series.apply(is_subset, ref_set=valid)


def col6_7_8_valid_values(series):
    """Validator"""
    valid = set(["DNASeq", "Protein Function", "RNASeq"])
    return series.apply(is_subset, ref_set=valid)

# Recoders
def translate_col6(series):
    """Recode 0-> None; 1-> 'DNASeq' """
    def rcode(x):
        mapping = {0: None, 1: "DNASeq"}
        return mapping[x]
    return series.apply(rcode)

def translate_col7(series):
    """Recode 0-> None; 1-> 'Protein Function' """
    def rcode(x):
        mapping = {0: None, 1: "Protein Function"}
        return mapping[x]
    return series.apply(rcode)

def translate_col8(series):
    """Recode 0-> None; 1-> 'RNASeq' """
    def rcode(x):
        mapping = {0: None, 1: "RNASeq"}
        return mapping[x]
    return series.apply(rcode)

def setify_drop_nones(series):
    """Convert to sets and drop ``None`` values."""
    def drop_nones(x):
        x.discard(None)
        return x
    return series.apply(lambda x: set(x)).apply(drop_nones).apply(list)

# Defining the Input Columns
col6 = Column(
    name='col6',
    dtype=(str, type(None)),
    unique=False,
    validators=[col6_valid_values],
    recoders=[translate_col6],)

col7 = Column(
    name='col7',
    dtype=(str, type(None)),
    unique=False,
    validators=[col7_valid_values],
    recoders=[translate_col7],)

col8 = Column(
    name='col8',
    dtype=(str, type(None)),
    unique=False,
    validators=[col8_valid_values],
    recoders=[translate_col8],)


# Defining the Output Column
col6_7_8 = Column(
    name='col6_7_8',
    dtype=list,
    unique=False,
    validators=[v.funcs.not_null, col6_7_8_valid_values],
    recoders=[setify_drop_nones],)

# Defining the Compound Column
col6_7_8_join = CompoundColumn(input_columns=[col6, col7, col8],
                       output_columns=[col6_7_8],
                       column_transform=join_as_tuple)
In [37]:
df
Out[37]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1
  • When we run col6_7_8_join.validate we, again, get a more complex dataframe than with a Column object.
In [38]:
col6_7_8_join.validate(df)
Out[38]:
col6_7_8_valid_values col6_valid_values col7_valid_values col8_valid_values dtype not_null
validation_type column_name row
input col6 0 True False True True False True
1 True False True True False True
2 True False True True False True
3 True False True True False True
col7 0 True True False True False True
1 True True False True False True
2 True True False True False True
3 True True False True False True
col8 0 True True True False False True
1 True True True False False True
2 True True True False False True
3 True True True False False True
output col6_7_8 0 False True True True False True
1 False True True True False True
2 False True True True False True
3 False True True True False True
In [39]:
col6_7_8_join.recode(df, validate=True)
Out[39]:
col6_7_8
0 [RNASeq, Protein Function]
1 [DNASeq]
2 [Protein Function]
3 [DNASeq, RNASeq, Protein Function]

Building an Enforcer

  • Now let's see how these fit together into an Enforcer object.
  • We are using the same table as before, but now we add our new CompoundColumns.
In [40]:
now_w_moar_complexity = Enforcer(columns=[col1,
                                          col3,
                                          col4,
                                          col5_split,
                                          col6_7_8_join])
  • We can do all the same stuff with this Enforcer.
  • This is because each subclass of BaseColumn includes a special set of methods that allow it to interact with Enforcer objects in its own specific way.
In [41]:
# this should fail: the raw table does not pass validation
now_w_moar_complexity.validate(df)
Out[41]:
False
In [42]:
# this should succeed: the recoded table passes
now_w_moar_complexity.recode(df, validate=True)
Out[42]:
col1 col3 col4 col5_number col5_word col6_7_8
0 7 10 M 1 ONE [RNASeq, Protein Function]
1 2 6 M 3 THREE [DNASeq]
2 6 2 F 10 TEN [Protein Function]
3 5 5 F 6 SIX [DNASeq, RNASeq, Protein Function]
  • The original table again:
In [43]:
df
Out[43]:
col1 col2 col3 col4 col5 col6 col7 col8
0 7 3 10 male 01:one* 0 1 1
1 2 4 6 m 3:Three 1 0 0
2 6 6 2 F 10:ten 0 1 0
3 5 5 5 Female 6:Six 1 1 1

Testing Facilitation

The remaining cells generate JSON snapshots of expected outputs for use as fixtures in the package's test suite.

In [44]:
def sort_columns(df):
    """Sort both rows and columns by index so frames can be compared regardless of order."""
    return df.T.sort_index().T.sort_index()

def check_and_print(df, column):
    """Print the JSON fixture only if it still matches the column's current validation output.

    Note: ``validate_all_json`` is read from the notebook's global scope.
    """
    if sort_columns(column.validate(df).reset_index()).equals(sort_columns(pd.read_json(validate_all_json))):
        print(f"validate_all_json = '''{validate_all_json}'''")
    else:
        raise ValueError()

OTM (One-to-Many)

In [45]:
validate_input = col5_split._validate_input(df).reset_index()
validate_input
validate_input_json = validate_input.to_json()

print(f"validate_input_json = '''{validate_input_json}'''")
pd.read_json(validate_input_json)
validate_input_json = '''{"validation_type":{"0":"input","1":"input","2":"input","3":"input"},"column_name":{"0":"col5","1":"col5","2":"col5","3":"col5"},"row":{"0":0,"1":1,"2":2,"3":3},"no_bad_characters":{"0":false,"1":true,"2":true,"3":true},"not_null":{"0":true,"1":true,"2":true,"3":true},"dtype":{"0":true,"1":true,"2":true,"3":true}}'''
Out[45]:
column_name dtype no_bad_characters not_null row validation_type
0 col5 True False True 0 input
1 col5 True True True 1 input
2 col5 True True True 2 input
3 col5 True True True 3 input
In [46]:
validate_output = col5_split._validate_output(df).reset_index()
validate_output
validate_output_json = validate_output.to_json()

print(f"validate_output_json = '''{validate_output_json}'''")
pd.read_json(validate_output_json)
validate_output_json = '''{"validation_type":{"0":"output","1":"output","2":"output","3":"output","4":"output","5":"output","6":"output","7":"output"},"column_name":{"0":"col5_number","1":"col5_number","2":"col5_number","3":"col5_number","4":"col5_word","5":"col5_word","6":"col5_word","7":"col5_word"},"row":{"0":0,"1":1,"2":2,"3":3,"4":0,"5":1,"6":2,"7":3},"dtype":{"0":false,"1":false,"2":false,"3":false,"4":true,"5":true,"6":true,"7":true},"no_bad_characters":{"0":null,"1":null,"2":null,"3":null,"4":false,"5":true,"6":true,"7":true},"not_null":{"0":true,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true},"upper":{"0":null,"1":null,"2":null,"3":null,"4":false,"5":false,"6":false,"7":false}}'''
Out[46]:
column_name dtype no_bad_characters not_null row upper validation_type
0 col5_number False NaN True 0 NaN output
1 col5_number False NaN True 1 NaN output
2 col5_number False NaN True 2 NaN output
3 col5_number False NaN True 3 NaN output
4 col5_word True 0.0 True 0 0.0 output
5 col5_word True 1.0 True 1 0.0 output
6 col5_word True 1.0 True 2 0.0 output
7 col5_word True 1.0 True 3 0.0 output
In [47]:
validate_all = col5_split.validate(df).reset_index()
validate_all
validate_all_json = validate_all.to_json()

print(f"validate_all_json = '''{validate_all_json}'''")
pd.read_json(validate_all_json)
validate_all_json = '''{"validation_type":{"0":"input","1":"input","2":"input","3":"input","4":"output","5":"output","6":"output","7":"output","8":"output","9":"output","10":"output","11":"output"},"column_name":{"0":"col5","1":"col5","2":"col5","3":"col5","4":"col5_number","5":"col5_number","6":"col5_number","7":"col5_number","8":"col5_word","9":"col5_word","10":"col5_word","11":"col5_word"},"row":{"0":0,"1":1,"2":2,"3":3,"4":0,"5":1,"6":2,"7":3,"8":0,"9":1,"10":2,"11":3},"dtype":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":false,"6":false,"7":false,"8":true,"9":true,"10":true,"11":true},"no_bad_characters":{"0":false,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true,"8":false,"9":true,"10":true,"11":true},"not_null":{"0":true,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true,"8":true,"9":true,"10":true,"11":true},"upper":{"0":true,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true,"8":false,"9":false,"10":false,"11":false}}'''
Out[47]:
column_name dtype no_bad_characters not_null row upper validation_type
0 col5 True False True 0 True input
1 col5 True True True 1 True input
10 col5_word True True True 2 False output
11 col5_word True True True 3 False output
2 col5 True True True 2 True input
3 col5 True True True 3 True input
4 col5_number False True True 0 True output
5 col5_number False True True 1 True output
6 col5_number False True True 2 True output
7 col5_number False True True 3 True output
8 col5_word True False True 0 False output
9 col5_word True True True 1 False output

MTO (Many-to-One)

In [48]:
validate_input = col6_7_8_join._validate_input(df).reset_index()
validate_input
validate_input_json = validate_input.to_json()

print(f"validate_input_json = '''{validate_input_json}'''")
pd.read_json(validate_input_json)
validate_input_json = '''{"validation_type":{"0":"input","1":"input","2":"input","3":"input","4":"input","5":"input","6":"input","7":"input","8":"input","9":"input","10":"input","11":"input"},"column_name":{"0":"col6","1":"col6","2":"col6","3":"col6","4":"col7","5":"col7","6":"col7","7":"col7","8":"col8","9":"col8","10":"col8","11":"col8"},"row":{"0":0,"1":1,"2":2,"3":3,"4":0,"5":1,"6":2,"7":3,"8":0,"9":1,"10":2,"11":3},"col6_valid_values":{"0":false,"1":false,"2":false,"3":false,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null},"col7_valid_values":{"0":null,"1":null,"2":null,"3":null,"4":false,"5":false,"6":false,"7":false,"8":null,"9":null,"10":null,"11":null},"col8_valid_values":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":false,"9":false,"10":false,"11":false},"dtype":{"0":false,"1":false,"2":false,"3":false,"4":false,"5":false,"6":false,"7":false,"8":false,"9":false,"10":false,"11":false}}'''
Out[48]:
col6_valid_values col7_valid_values col8_valid_values column_name dtype row validation_type
0 0.0 NaN NaN col6 False 0 input
1 0.0 NaN NaN col6 False 1 input
10 NaN NaN 0.0 col8 False 2 input
11 NaN NaN 0.0 col8 False 3 input
2 0.0 NaN NaN col6 False 2 input
3 0.0 NaN NaN col6 False 3 input
4 NaN 0.0 NaN col7 False 0 input
5 NaN 0.0 NaN col7 False 1 input
6 NaN 0.0 NaN col7 False 2 input
7 NaN 0.0 NaN col7 False 3 input
8 NaN NaN 0.0 col8 False 0 input
9 NaN NaN 0.0 col8 False 1 input
In [49]:
validate_output = col6_7_8_join._validate_output(df).reset_index()
validate_output
validate_output_json = validate_output.to_json()

print(f"validate_output_json = '''{validate_output_json}'''")
pd.read_json(validate_output_json)
validate_output_json = '''{"validation_type":{"0":"output","1":"output","2":"output","3":"output"},"column_name":{"0":"col6_7_8","1":"col6_7_8","2":"col6_7_8","3":"col6_7_8"},"row":{"0":0,"1":1,"2":2,"3":3},"col6_7_8_valid_values":{"0":false,"1":false,"2":false,"3":false},"not_null":{"0":true,"1":true,"2":true,"3":true},"dtype":{"0":false,"1":false,"2":false,"3":false}}'''
Out[49]:
col6_7_8_valid_values column_name dtype not_null row validation_type
0 False col6_7_8 False True 0 output
1 False col6_7_8 False True 1 output
2 False col6_7_8 False True 2 output
3 False col6_7_8 False True 3 output
In [50]:
validate_all = col6_7_8_join.validate(df).reset_index()
validate_all
validate_all_json = validate_all.to_json()

print(f"validate_all_json = '''{validate_all_json}'''")
pd.read_json(validate_all_json)
validate_all_json = '''{"validation_type":{"0":"input","1":"input","2":"input","3":"input","4":"input","5":"input","6":"input","7":"input","8":"input","9":"input","10":"input","11":"input","12":"output","13":"output","14":"output","15":"output"},"column_name":{"0":"col6","1":"col6","2":"col6","3":"col6","4":"col7","5":"col7","6":"col7","7":"col7","8":"col8","9":"col8","10":"col8","11":"col8","12":"col6_7_8","13":"col6_7_8","14":"col6_7_8","15":"col6_7_8"},"row":{"0":0,"1":1,"2":2,"3":3,"4":0,"5":1,"6":2,"7":3,"8":0,"9":1,"10":2,"11":3,"12":0,"13":1,"14":2,"15":3},"col6_7_8_valid_values":{"0":true,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true,"8":true,"9":true,"10":true,"11":true,"12":false,"13":false,"14":false,"15":false},"col6_valid_values":{"0":false,"1":false,"2":false,"3":false,"4":true,"5":true,"6":true,"7":true,"8":true,"9":true,"10":true,"11":true,"12":true,"13":true,"14":true,"15":true},"col7_valid_values":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":false,"6":false,"7":false,"8":true,"9":true,"10":true,"11":true,"12":true,"13":true,"14":true,"15":true},"col8_valid_values":{"0":true,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true,"8":false,"9":false,"10":false,"11":false,"12":true,"13":true,"14":true,"15":true},"dtype":{"0":false,"1":false,"2":false,"3":false,"4":false,"5":false,"6":false,"7":false,"8":false,"9":false,"10":false,"11":false,"12":false,"13":false,"14":false,"15":false},"not_null":{"0":true,"1":true,"2":true,"3":true,"4":true,"5":true,"6":true,"7":true,"8":true,"9":true,"10":true,"11":true,"12":true,"13":true,"14":true,"15":true}}'''
Out[50]:
col6_7_8_valid_values col6_valid_values col7_valid_values col8_valid_values column_name dtype not_null row validation_type
0 True False True True col6 False True 0 input
1 True False True True col6 False True 1 input
10 True True True False col8 False True 2 input
11 True True True False col8 False True 3 input
12 False True True True col6_7_8 False True 0 output
13 False True True True col6_7_8 False True 1 output
14 False True True True col6_7_8 False True 2 output
15 False True True True col6_7_8 False True 3 output
2 True False True True col6 False True 2 input
3 True False True True col6 False True 3 input
4 True True False True col7 False True 0 input
5 True True False True col7 False True 1 input
6 True True False True col7 False True 2 input
7 True True False True col7 False True 3 input
8 True True True False col8 False True 0 input
9 True True True False col8 False True 1 input

Full Enforcer

In [51]:
enforcer_recode = now_w_moar_complexity.recode(df, validate=True)

enforcer_recode_json = enforcer_recode.to_json()
print(f"enforcer_recode_json = '''{enforcer_recode_json}'''")
pd.read_json(enforcer_recode_json)
enforcer_recode_json = '''{"col1":{"0":7,"1":2,"2":6,"3":5},"col3":{"0":10,"1":6,"2":2,"3":5},"col4":{"0":"M","1":"M","2":"F","3":"F"},"col5_number":{"0":1,"1":3,"2":10,"3":6},"col5_word":{"0":"ONE","1":"THREE","2":"TEN","3":"SIX"},"col6_7_8":{"0":["RNASeq","Protein Function"],"1":["DNASeq"],"2":["Protein Function"],"3":["DNASeq","RNASeq","Protein Function"]}}'''
Out[51]:
col1 col3 col4 col5_number col5_word col6_7_8
0 7 10 M 1 ONE [RNASeq, Protein Function]
1 2 6 M 3 THREE [DNASeq]
2 6 2 F 10 TEN [Protein Function]
3 5 5 F 6 SIX [DNASeq, RNASeq, Protein Function]