simpleml.pipelines

Import modules to register class names in the global registry

Define convenience classes composed of different mixins
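As a minimal sketch of that composition pattern (mirroring the Bases documented for each class below), a convenience pipeline is simply a split mixin layered onto the base Pipeline:

    from simpleml.pipelines.base_pipeline import Pipeline
    from simpleml.pipelines.validation_split_mixins import NoSplitMixin, RandomSplitMixin

    # Convenience classes are thin compositions: the mixin supplies
    # split_dataset, the base Pipeline supplies everything else
    class MyRandomSplitPipeline(RandomSplitMixin, Pipeline):
        pass

    class MyNoSplitPipeline(Pipeline, NoSplitMixin):
        pass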

Package Contents

Classes

AbstractPipeline

Abstract base class for all Pipeline objects.

ChronologicalSplitMixin

ChronologicalSplitPipeline

Base class for all Pipeline objects.

DatasetSequence

Sequence wrapper for internal datasets. Only used for raw data mapping so return type is internal Split object

ExplicitSplitMixin

ExplicitSplitPipeline

Base class for all Pipeline objects.

NoSplitMixin

NoSplitPipeline

Base class for all Pipeline objects.

Pipeline

Base class for all Pipeline objects.

RandomSplitMixin

Class to randomly split dataset into different sets

RandomSplitPipeline

Class to randomly split dataset into different sets

Split

Container class for splits

SplitContainer

Explicit instantiation of a defaultdict returning split objects

TransformedSequence

Nested sequence class to apply transforms on batches in real-time and forward through as the next batch

simpleml.pipelines.__author__ = Elisha Yadgaran
class simpleml.pipelines.AbstractPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)

Bases: future.utils.with_metaclass()

Abstract base class for all Pipeline objects.

Relies on mixin classes to define the split_dataset method; will throw an error on use otherwise.

params: pipeline parameter metadata for easy insight into hyperparameters across trainings

__abstract__ = True
object_type = PIPELINE
params
X(self, split=None)

Get X for specific dataset split

_create_external_pipeline(self, external_pipeline_class, transformers, **kwargs)

Should return the desired pipeline object

Parameters

external_pipeline_class – str of class to use, can be ‘default’ or ‘sklearn’
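A hedged sketch of such a factory (not the library's actual implementation; NativePipeline here is a hypothetical stand-in for the 'default' option):

    from sklearn.pipeline import Pipeline as SklearnPipeline

    class NativePipeline:
        # Hypothetical stand-in for the library's 'default' implementation
        def __init__(self, transformers):
            self.transformers = transformers

    def create_external_pipeline(external_pipeline_class, transformers):
        # Dispatch on the requested implementation
        if external_pipeline_class == 'sklearn':
            # transformers assumed to be a list of (name, transformer) steps
            return SklearnPipeline(transformers)
        if external_pipeline_class == 'default':
            return NativePipeline(transformers)
        raise ValueError('Unknown external_pipeline_class: %s' % external_pipeline_class)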

_generator_transform(self, X, dataset_split=None, **kwargs)

Pass through method to external pipeline

Parameters

X – dataframe/matrix to transform, if None, use internal dataset

NOTE: Downstream objects expect to consume a generator with a tuple of X, y, other… not a Split object, so an ordered tuple will be returned

_hash(self)

Hash is the combination of the:

  1. Dataset

  2. Transformers

  3. Transformer Params

  4. Pipeline Config
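A conceptual sketch of composing such a hash (illustrative only; the actual routine may hash each component differently):

    import hashlib

    def pipeline_hash(dataset_hash, transformers, transformer_params, pipeline_config):
        # Combine the four components into one deterministic digest
        payload = repr((dataset_hash, transformers, transformer_params, pipeline_config))
        return hashlib.md5(payload.encode('utf-8')).hexdigest()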

_iterate_split(self, split, infinite_loop=False, batch_size=32, shuffle=True, **kwargs)

Turn a dataset split into a generator

_iterate_split_using_sequence(self, split, batch_size=32, shuffle=True, **kwargs)

Different version of iterate split that uses a keras.utils.Sequence object to play nice with keras and enable thread-safe generation.

_sequence_transform(self, X, dataset_split=None, **kwargs)

Pass through method to external pipeline

Parameters

X – dataframe/matrix to transform, if None, use internal dataset

NOTE: Downstream objects expect to consume a sequence with a tuple of X, y, other… not a Split object, so an ordered tuple will be returned

_transform(self, X, dataset_split=None)

Pass through method to external pipeline

Parameters

X – dataframe/matrix to transform, if None, use internal dataset

Return type

Split object if no dataset is passed (X is None); otherwise the transformed matrix for the input X

add_dataset(self, dataset)

Setter method for dataset used

add_transformer(self, name, transformer)

Setter method for new transformer step

assert_dataset(self, msg='')

Helper method to raise an error if dataset isn’t present

assert_fitted(self, msg='')

Helper method to raise an error if pipeline isn’t fit

property external_pipeline(self)

All pipeline objects require some file-based persisted object.

Wrapper around whatever underlying class is desired (e.g. sklearn or native)

fit(self)

Pass through method to external pipeline

fit_transform(self, **kwargs)

Wrapper for fit and transform methods. ASSUMES it only applies to the default (train) split

property fitted(self)
get_dataset_split(self, split=None, return_generator=False, return_sequence=False, **kwargs)

Get a specific dataset split. Assumes a Split object (simpleml.pipelines.validation_split_mixins.Split) is returned; inherit or implement similar expected attributes to replace it.

Uses internal self._dataset_splits as the split container - assumes dictionary-like itemgetter access
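Illustrative usage (the 'TRAIN' split name and the fitted pipeline are assumptions for the example):

    split = pipeline.get_dataset_split(split='TRAIN')
    X_train, y_train = split.X, split.y

    # Or stream it instead of materializing the full split
    train_generator = pipeline.get_dataset_split(split='TRAIN', return_generator=True)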

get_feature_names(self)

Pass through method to external pipeline. Should return a list of the final features generated by this pipeline

get_params(self, **kwargs)

Pass through method to external pipeline

get_transformers(self)

Pass through method to external pipeline

load(self, **kwargs)

Extend main load routine to load relationship class

remove_transformer(self, name)

Delete method for transformer step

save(self, **kwargs)

Extend parent function with a few additional save routines

  1. save params

  2. save transformer metadata

  3. features

set_params(self, **params)

Pass through method to external pipeline

transform(self, X, return_generator=False, return_sequence=False, **kwargs)

Main transform routine - routes to generator or regular method depending on the flag

Parameters

return_generator – boolean, whether to use the transformation method that returns a generator object or the regular transformed input

return_sequence – boolean, whether to use the method that returns a keras.utils.Sequence object to play nice with keras models
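Illustrative usage of the three return modes (pipeline and X assumed to exist):

    # Transform an explicit matrix
    transformed = pipeline.transform(X)

    # Transform the internal dataset (X=None), streaming the result
    batch_generator = pipeline.transform(None, return_generator=True)

    # Or get a keras.utils.Sequence for thread-safe consumption by keras
    batch_sequence = pipeline.transform(None, return_sequence=True)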

y(self, split=None)

Get labels for specific dataset split

class simpleml.pipelines.ChronologicalSplitMixin(**kwargs)

Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

class simpleml.pipelines.ChronologicalSplitPipeline(**kwargs)

Bases: simpleml.pipelines.validation_split_mixins.ChronologicalSplitMixin, simpleml.pipelines.base_pipeline.Pipeline

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input

class simpleml.pipelines.DatasetSequence(split, batch_size, shuffle)

Bases: simpleml.imports.Sequence

Sequence wrapper for internal datasets. Only used for raw data mapping so return type is internal Split object. Transformed sequences are used to conform with external input types (keras tuples)
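For orientation, a minimal sketch of the keras Sequence contract this class implements (simplified: shuffling, validation, and the internal Split handling are omitted; the keras import path is an assumption):

    import math
    from tensorflow.keras.utils import Sequence

    class MinimalDatasetSequence(Sequence):
        def __init__(self, X, batch_size):
            self.X = X
            self.batch_size = batch_size

        def __len__(self):
            # Number of batches needed to cover the data
            return math.ceil(len(self.X) / self.batch_size)

        def __getitem__(self, index):
            # Slice out the batch at this position
            start = index * self.batch_size
            return self.X[start:start + self.batch_size]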

__getitem__(self, index)

Gets batch at position index.

Parameters

index – position of the batch in the Sequence.

Returns

A batch

__len__(self)

Number of batches in the Sequence.

Returns

The number of batches in the Sequence.

on_epoch_end(self)

Method called at the end of every epoch.

static validated_split(split)

Confirms data is valid, otherwise returns None (makes downstream checking simpler)

class simpleml.pipelines.ExplicitSplitMixin

Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

split_dataset(self)

Method to split the dataframe into different sets. Assumes dataset explicitly delineates between train, validation, and test
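A sketch of the idea, assuming the attached dataset exposes a get(column, split) accessor (hypothetical name; the actual dataset API may differ):

    from simpleml.pipelines.validation_split_mixins import Split, SplitContainer

    def split_dataset(self):
        # Dataset already delineates splits; map each one straight through
        self._dataset_splits = SplitContainer(
            TRAIN=Split(X=self.dataset.get('X', 'TRAIN'), y=self.dataset.get('y', 'TRAIN')),
            VALIDATION=Split(X=self.dataset.get('X', 'VALIDATION'), y=self.dataset.get('y', 'VALIDATION')),
            TEST=Split(X=self.dataset.get('X', 'TEST'), y=self.dataset.get('y', 'TEST')))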

class simpleml.pipelines.ExplicitSplitPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)

Bases: simpleml.pipelines.base_pipeline.Pipeline, simpleml.pipelines.validation_split_mixins.ExplicitSplitMixin

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input

class simpleml.pipelines.NoSplitMixin

Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

split_dataset(self)

Non-split mixin class. Returns full dataset for any split name
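Conceptually (a sketch, assuming the dataset exposes X and y attributes), the defaultdict behavior of SplitContainer makes this straightforward:

    from simpleml.pipelines.validation_split_mixins import Split, SplitContainer

    def split_dataset(self):
        # Any requested split name resolves to the full dataset
        full = Split(X=self.dataset.X, y=self.dataset.y)
        self._dataset_splits = SplitContainer(default_factory=lambda: full)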

class simpleml.pipelines.NoSplitPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)

Bases: simpleml.pipelines.base_pipeline.Pipeline, simpleml.pipelines.validation_split_mixins.NoSplitMixin

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input

class simpleml.pipelines.Pipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)

Bases: simpleml.pipelines.base_pipeline.AbstractPipeline

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input

__table_args__
__tablename__ = pipelines
dataset
dataset_id
class simpleml.pipelines.RandomSplitMixin(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)

Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

Class to randomly split dataset into different sets

Set splitting params. By default, validation_size is 0.0 because validation data is only used for hyperparameter tuning.

split_dataset(self)

Overwrite method to split by percentage
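A sketch of percentage-based splitting using sklearn.model_selection.train_test_split (illustrative; the actual implementation may differ):

    from sklearn.model_selection import train_test_split

    def random_split(X, y, train_size, validation_size=0.0, random_state=123, shuffle=True):
        # Carve off the training set first
        X_train, X_rest, y_train, y_rest = train_test_split(
            X, y, train_size=train_size, random_state=random_state, shuffle=shuffle)
        if validation_size == 0.0:
            return (X_train, y_train), (None, None), (X_rest, y_rest)
        # Split the remainder between validation and test
        val_fraction = validation_size / (1.0 - train_size)
        X_val, X_test, y_val, y_test = train_test_split(
            X_rest, y_rest, train_size=val_fraction, random_state=random_state, shuffle=shuffle)
        return (X_train, y_train), (X_val, y_val), (X_test, y_test)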

class simpleml.pipelines.RandomSplitPipeline(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)

Bases: simpleml.pipelines.validation_split_mixins.RandomSplitMixin, simpleml.pipelines.base_pipeline.Pipeline

Class to randomly split dataset into different sets

Set splitting params. By default, validation_size is 0.0 because validation data is only used for hyperparameter tuning.
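Illustrative instantiation (the dataset object is assumed):

    # 80/20 train/test with no validation split
    pipeline = RandomSplitPipeline(train_size=0.8, test_size=0.2, random_state=123)
    pipeline.add_dataset(dataset)
    pipeline.fit()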

class simpleml.pipelines.Split

Bases: dict

Container class for splits

__getattr__(self, attr)

Default attribute processor (Used in combination with __getitem__ to enable ** syntax)

static is_null_type(obj)

Helper to check for nulls. Useful to avoid passing “empty” attributes so downstream defaults of None get returned instead (ex: **split -> all non-null named params)

squeeze(self)

Helper method to clear up any null-type keys
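Illustrative usage (features, labels, and train are stand-in names):

    split = Split(X=features, y=labels, other=None)
    split.X             # attribute access delegated to the dict entries
    split.squeeze()     # drops the null-valued 'other' key
    train(**split)      # now expands to train(X=features, y=labels)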

class simpleml.pipelines.SplitContainer(default_factory=Split, **kwargs)

Bases: collections.defaultdict

Explicit instantiation of a defaultdict returning split objects
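Illustrative behavior (X_train and y_train are stand-in names):

    container = SplitContainer(TRAIN=Split(X=X_train, y=y_train))
    container['TRAIN'].X    # explicitly registered split
    container['TEST']       # missing names default to an empty Split()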

class simpleml.pipelines.TransformedSequence(pipeline, dataset_sequence)

Bases: simpleml.imports.Sequence

Nested sequence class to apply transforms on batches in real-time and forward through as the next batch

__getitem__(self, *args, **kwargs)

Pass-through to dataset sequence - applies transform on raw data and returns batch

__len__(self)

Pass-through. Returns number of batches in dataset sequence

on_epoch_end(self)
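A sketch of how the two sequence classes compose for keras consumption (names illustrative):

    raw_batches = DatasetSequence(split, batch_size=32, shuffle=True)
    transformed_batches = TransformedSequence(pipeline, raw_batches)

    # keras can consume the Sequence directly
    model.fit(transformed_batches, epochs=10)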