simpleml.pipelines
Import modules to register class names in the global registry.
Define convenience classes composed of different mixins.
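One common way to get registration as a side effect of import is a metaclass that records each class when it is defined. The following is a minimal sketch of that idea only; `REGISTRY`, `RegisteringMeta`, `DemoBase`, and `DemoPipeline` are hypothetical names and are not taken from SimpleML.

```python
# Hypothetical sketch: a registry filled in as a side effect of class definition.
REGISTRY = {}


class RegisteringMeta(type):
    """Metaclass that records every class it creates under its class name."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        REGISTRY[name] = cls


class DemoBase(metaclass=RegisteringMeta):
    """Stand-in base class; subclasses inherit the registering metaclass."""


class DemoPipeline(DemoBase):
    """Stand-in subclass; defining it is enough to register it."""


# Importing a module that defines such classes would populate REGISTRY the same way.
print(sorted(REGISTRY))
```

With this pattern a class can later be looked up by its name string, which is presumably why the package imports its modules eagerly.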
Submodules
Package Contents
Classes
AbstractPipeline | Abstract base class for all Pipeline objects.
Pipeline | Base class for all Pipeline objects.
DatasetSequence | Sequence wrapper for internal datasets. Only used for raw data mapping so return type is internal Split object.
ChronologicalSplitPipeline | Base class for all Pipeline objects.
ExplicitSplitPipeline | Base class for all Pipeline objects.
NoSplitPipeline | Base class for all Pipeline objects.
RandomSplitMixin | Class to randomly split dataset into different sets.
RandomSplitPipeline | Class to randomly split dataset into different sets.
Split | Container class for splits.
SplitContainer | Explicit instantiation of a defaultdict returning split objects.
TransformedSequence | Nested sequence class to apply transforms on batches in real time and forward through as the next batch.
-
class simpleml.pipelines.AbstractPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: future.utils.with_metaclass()
Abstract base class for all Pipeline objects.
Relies on mixin classes to define the split_dataset method; raises an error on use otherwise.
params: pipeline parameter metadata for easy insight into hyperparameters across trainings
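The reliance on mixins for split_dataset can be illustrated with a small composition sketch. This is not SimpleML's code: `BasePipeline`, `NoSplitDemoMixin`, and `NoSplitDemoPipeline` are hypothetical stand-ins, and raising `NotImplementedError` in the base is an assumption about how the "error on use" might surface.

```python
class BasePipeline:
    """Stand-in base class: it cannot split data on its own."""

    def __init__(self, rows):
        self.rows = list(rows)

    def split_dataset(self):
        # Assumption: the base signals that a split mixin is required.
        raise NotImplementedError("Use a split mixin to define split_dataset")


class NoSplitDemoMixin:
    """Stand-in mixin: every split name maps to the full dataset."""

    def split_dataset(self):
        return {"TRAIN": self.rows, "TEST": self.rows}


# In this sketch the mixin is listed first, so its split_dataset
# comes first in the method resolution order.
class NoSplitDemoPipeline(NoSplitDemoMixin, BasePipeline):
    pass


print(NoSplitDemoPipeline([1, 2]).split_dataset())
```

The convenience classes further down this page follow the same shape: a split mixin combined with Pipeline.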
-
__abstract__ = True
-
object_type = PIPELINE
-
params
-
X(self, split=None)
Get X for specific dataset split.
-
_create_external_pipeline(self, external_pipeline_class, transformers, **kwargs)
Should return the desired pipeline object.
- Parameters
external_pipeline_class – str of class to use; can be ‘default’ or ‘sklearn’
-
_generator_transform(self, X, dataset_split=None, **kwargs)
Pass-through method to external pipeline.
- Parameters
X – dataframe/matrix to transform; if None, use internal dataset
NOTE: Downstream objects expect to consume a generator with a tuple of X, y, other… not a Split object, so an ordered tuple will be returned.
-
_hash(self)
Hash is the combination of: 1) Dataset 2) Transformers 3) Transformer Params 4) Pipeline Config
-
_iterate_split(self, split, infinite_loop=False, batch_size=32, shuffle=True, **kwargs)
Turn a dataset split into a generator.
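A batching generator with the same infinite_loop, batch_size, and shuffle options might look roughly like the sketch below. It works on a plain list rather than a Split object, and `iterate_batches` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random


def iterate_batches(rows, infinite_loop=False, batch_size=32, shuffle=True, seed=None):
    """Yield successive batches; start another pass after each one if infinite_loop is True."""
    rows = list(rows)
    if not rows:
        return  # nothing to yield, and avoids spinning forever when infinite_loop is True
    rng = random.Random(seed)
    while True:
        order = list(range(len(rows)))
        if shuffle:
            rng.shuffle(order)  # new ordering for every pass
        for start in range(0, len(order), batch_size):
            yield [rows[i] for i in order[start:start + batch_size]]
        if not infinite_loop:
            break


batches = list(iterate_batches(range(10), batch_size=4, shuffle=False))
print(batches)
```

An infinite generator suits training loops that pull a fixed number of steps per epoch; a single pass suits evaluation.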
-
_iterate_split_using_sequence(self, split, batch_size=32, shuffle=True, **kwargs)
Alternative to _iterate_split that uses a keras.utils.Sequence object to work with Keras and enable thread-safe generation.
-
_sequence_transform(self, X, dataset_split=None, **kwargs)
Pass-through method to external pipeline.
- Parameters
X – dataframe/matrix to transform; if None, use internal dataset
NOTE: Downstream objects expect to consume a sequence with a tuple of X, y, other… not a Split object, so an ordered tuple will be returned.
-
_transform(self, X, dataset_split=None)
Pass-through method to external pipeline.
- Parameters
X – dataframe/matrix to transform; if None, use internal dataset
- Return type
Split object if no dataset is passed (X is None); otherwise the transformed matrix for the input X
-
add_dataset(self, dataset)
Setter method for the dataset used.
-
add_transformer(self, name, transformer)
Setter method for a new transformer step.
-
assert_dataset(self, msg='')
Helper method to raise an error if the dataset isn’t present.
-
assert_fitted(self, msg='')
Helper method to raise an error if the pipeline isn’t fit.
-
property external_pipeline(self)
All pipeline objects require some file-based persisted object.
Wrapper around whatever underlying class is desired (e.g. sklearn or native).
-
fit(self)
Pass-through method to external pipeline.
-
fit_transform(self, **kwargs)
Wrapper for the fit and transform methods. Assumes it only applies to the default (train) split.
-
property fitted(self)
-
get_dataset_split(self, split=None, return_generator=False, return_sequence=False, **kwargs)
Get a specific dataset split. Assumes a Split object (simpleml.pipelines.validation_split_mixins.Split) is returned. Inherit or implement similar expected attributes to replace it.
Uses internal self._dataset_splits as the split container; assumes dictionary-like item access.
-
get_feature_names(self)
Pass-through method to external pipeline. Should return a list of the final features generated by this pipeline.
-
get_params(self, **kwargs)
Pass-through method to external pipeline.
-
get_transformers(self)
Pass-through method to external pipeline.
-
load(self, **kwargs)
Extend main load routine to load relationship class.
-
remove_transformer(self, name)
Delete method for a transformer step.
-
save(self, **kwargs)
Extend parent function with a few additional save routines:
save params
save transformer metadata
features
-
set_params(self, **params)
Pass-through method to external pipeline.
-
transform(self, X, return_generator=False, return_sequence=False, **kwargs)
Main transform routine. Routes to the generator, sequence, or regular method depending on the flags.
- Parameters
return_generator – boolean, whether to use the transformation method that returns a generator object or the regular transformed input
return_sequence – boolean, whether to use the method that returns a keras.utils.Sequence object to work with Keras models
-
y(self, split=None)
Get labels for specific dataset split.
-
-
class simpleml.pipelines.ChronologicalSplitMixin(**kwargs)
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin
-
class simpleml.pipelines.ChronologicalSplitPipeline(**kwargs)
Bases: simpleml.pipelines.validation_split_mixins.ChronologicalSplitMixin, simpleml.pipelines.base_pipeline.Pipeline
Base class for all Pipeline objects.
dataset_id: foreign key relation to the dataset used as input
-
class simpleml.pipelines.DatasetSequence(split, batch_size, shuffle)
Bases: simpleml.imports.Sequence
Sequence wrapper for internal datasets. Only used for raw data mapping, so the return type is the internal Split object. Transformed sequences are used to conform with external input types (keras tuples).
-
__getitem__(self, index)
Gets batch at position index.
- Parameters
index – position of the batch in the Sequence
- Returns
A batch
-
__len__(self)
Number of batches in the Sequence.
- Returns
The number of batches in the Sequence
-
on_epoch_end(self)
Method called at the end of every epoch.
-
static validated_split(split)
Confirms data is valid; otherwise returns None (makes downstream checking simpler).
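The three-method protocol listed above (__len__, __getitem__, on_epoch_end) is the one keras.utils.Sequence subclasses implement. The sketch below mimics it in plain Python, without importing Keras and over a list instead of a Split object. `BatchSequence` and its `seed` argument are hypothetical, and reshuffling in on_epoch_end is an assumption about typical usage, not a statement about DatasetSequence.

```python
import math
import random


class BatchSequence:
    """Index-addressable batches following the __len__/__getitem__/on_epoch_end protocol."""

    def __init__(self, rows, batch_size, shuffle, seed=0):
        self.rows = list(rows)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)
        self.order = list(range(len(self.rows)))
        self.on_epoch_end()  # applies the initial shuffle when shuffle is True

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.rows) / self.batch_size)

    def __getitem__(self, index):
        # Batch at position `index`; IndexError lets plain iteration terminate.
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return [self.rows[i] for i in self.order[start:start + self.batch_size]]

    def on_epoch_end(self):
        # Assumption: reorder between epochs so batches differ from pass to pass.
        if self.shuffle:
            self.rng.shuffle(self.order)


seq = BatchSequence(range(10), batch_size=4, shuffle=False)
print(len(seq), seq[2])
```

Because any batch can be fetched by index, several workers can request different batches at once, which is what makes this style friendlier to threaded loading than a generator.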
-
-
class simpleml.pipelines.ExplicitSplitMixin
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin
-
split_dataset(self)
Method to split the dataframe into different sets. Assumes the dataset explicitly delineates between train, validation, and test.
-
-
class simpleml.pipelines.ExplicitSplitPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: simpleml.pipelines.base_pipeline.Pipeline, simpleml.pipelines.validation_split_mixins.ExplicitSplitMixin
Base class for all Pipeline objects.
dataset_id: foreign key relation to the dataset used as input
-
class simpleml.pipelines.NoSplitMixin
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin
-
split_dataset(self)
Non-split mixin class. Returns the full dataset for any split name.
-
-
class simpleml.pipelines.NoSplitPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: simpleml.pipelines.base_pipeline.Pipeline, simpleml.pipelines.validation_split_mixins.NoSplitMixin
Base class for all Pipeline objects.
dataset_id: foreign key relation to the dataset used as input
-
class simpleml.pipelines.Pipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: simpleml.pipelines.base_pipeline.AbstractPipeline
Base class for all Pipeline objects.
dataset_id: foreign key relation to the dataset used as input
-
__table_args__
-
__tablename__ = pipelines
-
dataset
-
dataset_id
-
-
class simpleml.pipelines.RandomSplitMixin(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin
Class to randomly split dataset into different sets.
Sets the splitting params. By default validation_size is 0.0 because the validation split is only used for hyperparameter tuning.
-
split_dataset(self)
Overrides the method to split by percentage.
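Splitting by percentage with these parameters could be sketched as follows. This is an illustration on a plain list, not the library's implementation: `random_split` is a hypothetical function, the "TRAIN"/"VALIDATION"/"TEST" keys are placeholder names, and treating a missing test_size as "whatever train and validation leave over" is an assumption.

```python
import random


def random_split(rows, train_size, test_size=None, validation_size=0.0,
                 random_state=123, shuffle=True):
    """Split rows into train/validation/test lists by fraction of the total."""
    rows = list(rows)
    if test_size is None:
        # Assumption: test takes the remainder after train and validation.
        test_size = 1.0 - train_size - validation_size
    if shuffle:
        # A seeded generator keeps the split reproducible across runs.
        random.Random(random_state).shuffle(rows)
    n_train = round(len(rows) * train_size)
    n_validation = round(len(rows) * validation_size)
    n_test = round(len(rows) * test_size)
    return {
        "TRAIN": rows[:n_train],
        "VALIDATION": rows[n_train:n_train + n_validation],
        "TEST": rows[n_train + n_validation:n_train + n_validation + n_test],
    }


parts = random_split(range(10), train_size=0.6, validation_size=0.2)
print({name: len(part) for name, part in parts.items()})
```

Fixing random_state matters here because a pipeline that reshuffles differently on each run would leak test rows into training across experiments.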
-
-
class simpleml.pipelines.RandomSplitPipeline(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)
Bases: simpleml.pipelines.validation_split_mixins.RandomSplitMixin, simpleml.pipelines.base_pipeline.Pipeline
Class to randomly split dataset into different sets.
Sets the splitting params. By default validation_size is 0.0 because the validation split is only used for hyperparameter tuning.
-
class simpleml.pipelines.Split
Bases: dict
Container class for splits.
-
__getattr__(self, attr)
Default attribute processor (used in combination with __getitem__ to enable ** syntax).
-
static is_null_type(obj)
Helper to check for nulls. Useful for not passing “empty” attributes, so that defaults of None are returned downstream instead. Example: **split -> all non-null named params.
-
squeeze(self)
Helper method to clear up any null-type keys.
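A dict subclass with attribute access and a squeeze step can be sketched like this. `DemoSplit` and `describe` are hypothetical, and the definition of "null" used here (None or an empty container) is an assumption; the real is_null_type may check other cases.

```python
class DemoSplit(dict):
    """Dict whose keys are also readable as attributes; missing keys read as None."""

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        return self.get(attr)

    @staticmethod
    def is_null_type(obj):
        # Assumption: None and empty containers count as "null".
        return obj is None or (hasattr(obj, "__len__") and len(obj) == 0)

    def squeeze(self):
        # Drop null-type entries in place and return self for chaining.
        for key in [k for k, v in self.items() if self.is_null_type(v)]:
            del self[key]
        return self


def describe(X=None, y=None):
    """Stand-in consumer with None defaults."""
    return (X, y)


split = DemoSplit(X=[1, 2, 3], y=None).squeeze()
print(split.X, split.y, describe(**split))
```

After squeezing, unpacking with ** passes only the populated keys, so the consumer's own defaults apply to everything else.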
-
-
class simpleml.pipelines.SplitContainer(default_factory=Split, **kwargs)
Bases: collections.defaultdict
Explicit instantiation of a defaultdict returning split objects.
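The defaultdict-with-preset-factory pattern is small enough to show in full. In this sketch `DemoContainer` is a hypothetical name and a plain dict stands in for the Split default factory.

```python
from collections import defaultdict


class DemoContainer(defaultdict):
    """defaultdict that creates an empty dict for any split name not seen before."""

    def __init__(self, default_factory=dict, **kwargs):
        # Pre-fill the factory so callers do not have to pass one.
        super().__init__(default_factory, **kwargs)


splits = DemoContainer()
splits["TRAIN"]["X"] = [1, 2, 3]  # "TRAIN" is created on first access
print(dict(splits))
```

Asking for a split name that was never populated returns an empty container instead of raising KeyError, which keeps downstream lookups simple.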
-
class simpleml.pipelines.TransformedSequence(pipeline, dataset_sequence)
Bases: simpleml.imports.Sequence
Nested sequence class to apply transforms on batches in real time and forward through as the next batch.
-
__getitem__(self, *args, **kwargs)
Pass-through to dataset sequence. Applies transform on raw data and returns batch.
-
__len__(self)
Pass-through. Returns number of batches in dataset sequence.
-
on_epoch_end(self)
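The nesting idea, an outer sequence that fetches a raw batch from an inner one and transforms it on access, can be sketched without Keras. `TransformedBatches` is a hypothetical class, a list of lists stands in for the dataset sequence, and a plain function stands in for the pipeline's transform.

```python
class TransformedBatches:
    """Wrap an indexable collection of batches and transform each batch on access."""

    def __init__(self, transform, batches):
        self.transform = transform
        self.batches = batches

    def __len__(self):
        # Pass-through: same number of batches as the wrapped collection.
        return len(self.batches)

    def __getitem__(self, index):
        # Fetch the raw batch, then transform it lazily.
        return self.transform(self.batches[index])


raw = [[1, 2], [3, 4], [5]]
doubled = TransformedBatches(lambda batch: [2 * x for x in batch], raw)
print(len(doubled), doubled[1])
```

Transforming per batch keeps memory use bounded by the batch size, at the cost of repeating the transform every epoch.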
-