simpleml.pipelines

Imports modules to register class names in the global registry and defines convenience classes composed of different mixins.

Package Contents
Classes

AbstractPipeline: Abstract base class for all Pipeline objects.
ChronologicalSplitPipeline: Base class for all Pipeline objects.
DatasetSequence: Sequence wrapper for internal datasets; only used for raw data mapping, so the return type is the internal Split object.
ExplicitSplitPipeline: Base class for all Pipeline objects.
NoSplitPipeline: Base class for all Pipeline objects.
Pipeline: Base class for all Pipeline objects.
RandomSplitMixin: Class to randomly split a dataset into different sets.
RandomSplitPipeline: Class to randomly split a dataset into different sets.
Split: Container class for splits.
SplitContainer: Explicit instantiation of a defaultdict returning Split objects.
TransformedSequence: Nested sequence class to apply transforms on batches in real time and forward them through as the next batch.
class simpleml.pipelines.AbstractPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: future.utils.with_metaclass()

Abstract base class for all Pipeline objects.

Relies on mixin classes to define the split_dataset method; will raise an error on use otherwise.

params: pipeline parameter metadata for easy insight into hyperparameters across trainings
__abstract__ = True

object_type = PIPELINE

params
X(self, split=None)
Get X for a specific dataset split.

_create_external_pipeline(self, external_pipeline_class, transformers, **kwargs)
Should return the desired pipeline object.

Parameters: external_pipeline_class – str of the class to use; can be 'default' or 'sklearn'

_generator_transform(self, X, dataset_split=None, **kwargs)
Pass-through method to the external pipeline.

Parameters: X – dataframe/matrix to transform; if None, use the internal dataset

NOTE: Downstream objects expect to consume a generator with a tuple of (X, y, other...), not a Split object, so an ordered tuple will be returned.
_hash(self)
Hash is the combination of the: 1) Dataset, 2) Transformers, 3) Transformer Params, 4) Pipeline Config.
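The idea of combining those four components into one identity can be sketched by hashing the concatenation of per-component digests. The function below is illustrative only; the name, arguments, and serialization are assumptions, not simpleml's actual implementation:

```python
import hashlib

def composite_hash(dataset_hash, transformers, transformer_params, pipeline_config):
    # Illustrative only: serialize each component deterministically, then
    # hash the joined string so any change in any part changes the result
    parts = [
        str(dataset_hash),
        str(sorted(str(t) for t in transformers)),
        str(sorted(transformer_params.items())),
        str(sorted(pipeline_config.items())),
    ]
    return hashlib.md5('|'.join(parts).encode('utf-8')).hexdigest()
```

The same inputs always produce the same digest, so equal pipelines can be detected without comparing the objects themselves.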
_iterate_split(self, split, infinite_loop=False, batch_size=32, shuffle=True, **kwargs)
Turn a dataset split into a generator.
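The batching pattern behind this method can be sketched standalone. Everything below (the function name, operating on plain lists rather than Split objects) is an assumption for illustration:

```python
import random

def iterate_split(X, y, infinite_loop=False, batch_size=32, shuffle=True):
    # Yield (X_batch, y_batch) tuples; optionally reshuffle and loop forever,
    # which is the behavior training loops typically expect from a generator
    indices = list(range(len(X)))
    while True:
        if shuffle:
            random.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            yield [X[i] for i in batch], [y[i] for i in batch]
        if not infinite_loop:
            break
```

With infinite_loop=False the generator is exhausted after one pass; with infinite_loop=True each epoch is reshuffled before yielding again.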
_iterate_split_using_sequence(self, split, batch_size=32, shuffle=True, **kwargs)
Different version of iterate split that uses a keras.utils.Sequence object to play nicely with Keras and enable thread-safe generation.

_sequence_transform(self, X, dataset_split=None, **kwargs)
Pass-through method to the external pipeline.

Parameters: X – dataframe/matrix to transform; if None, use the internal dataset

NOTE: Downstream objects expect to consume a sequence with a tuple of (X, y, other...), not a Split object, so an ordered tuple will be returned.
_transform(self, X, dataset_split=None)
Pass-through method to the external pipeline.

Parameters: X – dataframe/matrix to transform; if None, use the internal dataset

Return type: Split object if no dataset is passed (X is None); otherwise, the matrix return of the input X
add_dataset(self, dataset)
Setter method for the dataset used.

add_transformer(self, name, transformer)
Setter method for a new transformer step.

assert_dataset(self, msg='')
Helper method to raise an error if the dataset isn't present.

assert_fitted(self, msg='')
Helper method to raise an error if the pipeline isn't fit.

property external_pipeline(self)
All pipeline objects are going to require some filebase persisted object: a wrapper around whatever underlying class is desired (e.g. sklearn or native).
fit(self)
Pass-through method to the external pipeline.

fit_transform(self, **kwargs)
Wrapper for the fit and transform methods. ASSUMES it only applies to the default (train) split.
-
property
fitted
(self)¶
-
get_dataset_split
(self, split=None, return_generator=False, return_sequence=False, **kwargs)¶ Get specific dataset split Assumes a Split object (simpleml.pipelines.validation_split_mixins.Split) is returned. Inherit or implement similar expected attributes to replace
Uses internal self._dataset_splits as the split container - assumes dictionary like itemgetter
get_feature_names(self)
Pass-through method to the external pipeline. Should return a list of the final features generated by this pipeline.

get_params(self, **kwargs)
Pass-through method to the external pipeline.

get_transformers(self)
Pass-through method to the external pipeline.

load(self, **kwargs)
Extend the main load routine to load the relationship class.
remove_transformer(self, name)
Delete method for a transformer step.

save(self, **kwargs)
Extend the parent function with a few additional save routines:
- save params
- save transformer metadata
- save features

set_params(self, **params)
Pass-through method to the external pipeline.
transform(self, X, return_generator=False, return_sequence=False, **kwargs)
Main transform routine - routes to the generator or regular method depending on the flag.

Parameters:
return_generator – boolean, whether to use the transformation method that returns a generator object or the regular transformed input
return_sequence – boolean, whether to use the method that returns a keras.utils.Sequence object to play nicely with Keras models

y(self, split=None)
Get labels for a specific dataset split.
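The flag-based routing in transform can be sketched with a toy class. TransformRouter and its stand-in transforms are hypothetical, not simpleml code; they only show how one entry point dispatches to a direct or generator-based path:

```python
class TransformRouter:
    # Hypothetical sketch: the real Pipeline.transform dispatches to its
    # internal _transform / _generator_transform based on the flags
    def _transform(self, X):
        return [x * 2 for x in X]  # stand-in for the external pipeline

    def _generator_transform(self, X):
        for x in X:
            yield x * 2

    def transform(self, X, return_generator=False):
        if return_generator:
            return self._generator_transform(X)  # lazy, batch-friendly path
        return self._transform(X)                # eager, in-memory path
```

Callers that stream data (e.g. Keras fit loops) would pass return_generator=True; everyone else gets the fully materialized result.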
class simpleml.pipelines.ChronologicalSplitMixin(**kwargs)
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin
class simpleml.pipelines.ChronologicalSplitPipeline(**kwargs)
Bases: simpleml.pipelines.validation_split_mixins.ChronologicalSplitMixin, simpleml.pipelines.base_pipeline.Pipeline

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input
class simpleml.pipelines.DatasetSequence(split, batch_size, shuffle)
Bases: simpleml.imports.Sequence

Sequence wrapper for internal datasets. Only used for raw data mapping, so the return type is the internal Split object. Transformed sequences are used to conform with external input types (Keras tuples).
__getitem__(self, index)
Gets the batch at position index.

Arguments: index – position of the batch in the Sequence
Returns: a batch

__len__(self)
Number of batches in the Sequence.

Returns: the number of batches in the Sequence

on_epoch_end(self)
Method called at the end of every epoch.

static validated_split(split)
Confirms data is valid; otherwise returns None (makes downstream checking simpler).
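The Keras Sequence contract these wrappers implement (__len__ as batch count, __getitem__ as batch lookup) can be sketched with a minimal stand-in. MiniDatasetSequence is hypothetical and works on a plain list instead of a Split:

```python
import math

class MiniDatasetSequence:
    # Minimal Sequence-style wrapper: fixed-size batches over a dataset
    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches; the last batch may be partial, hence ceil
        return math.ceil(len(self.data) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.data[start:start + self.batch_size]
```

Because batches are addressed by index rather than yielded in order, multiple workers can fetch different batches concurrently, which is what makes the Sequence approach thread-safe compared to a plain generator.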
class simpleml.pipelines.ExplicitSplitMixin
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

split_dataset(self)
Method to split the dataframe into different sets. Assumes the dataset explicitly delineates between train, validation, and test.
class simpleml.pipelines.ExplicitSplitPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: simpleml.pipelines.base_pipeline.Pipeline, simpleml.pipelines.validation_split_mixins.ExplicitSplitMixin

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input
class simpleml.pipelines.NoSplitMixin
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

split_dataset(self)
Non-split mixin class. Returns the full dataset for any split name.
class simpleml.pipelines.NoSplitPipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: simpleml.pipelines.base_pipeline.Pipeline, simpleml.pipelines.validation_split_mixins.NoSplitMixin

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input
class simpleml.pipelines.Pipeline(has_external_files=True, transformers=None, external_pipeline_class='default', fitted=False, **kwargs)
Bases: simpleml.pipelines.base_pipeline.AbstractPipeline

Base class for all Pipeline objects.

dataset_id: foreign key relation to the dataset used as input

__table_args__

__tablename__ = pipelines

dataset

dataset_id
class simpleml.pipelines.RandomSplitMixin(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)
Bases: simpleml.pipelines.validation_split_mixins.SplitMixin

Class to randomly split a dataset into different sets.

Set splitting params: by default, validation is 0.0 because it is only used for hyperparameter tuning.

split_dataset(self)
Overwrite method to split by percentage.
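The percentage arithmetic can be sketched as follows. The function name and rounding choice are assumptions; the real mixin also shuffles rows and honors random_state before partitioning:

```python
def split_by_percentage(n, train_size, validation_size=0.0):
    # Partition n row indices into train / validation / test by fraction;
    # the test set takes whatever remains after train and validation
    train_end = round(n * train_size)
    val_end = train_end + round(n * validation_size)
    train = list(range(0, train_end))
    validation = list(range(train_end, val_end))
    test = list(range(val_end, n))
    return train, validation, test
```

With the default validation_size=0.0 the validation set is simply empty, matching the note above that validation exists only for hyperparameter tuning.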
class simpleml.pipelines.RandomSplitPipeline(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)
Bases: simpleml.pipelines.validation_split_mixins.RandomSplitMixin, simpleml.pipelines.base_pipeline.Pipeline

Class to randomly split a dataset into different sets.

Set splitting params: by default, validation is 0.0 because it is only used for hyperparameter tuning.
class simpleml.pipelines.Split
Bases: dict

Container class for splits.

__getattr__(self, attr)
Default attribute processor (used in combination with __getitem__ to enable ** syntax).

static is_null_type(obj)
Helper to check for nulls - useful to not pass "empty" attributes so defaults of None will get returned downstream instead. Ex: **split -> all non-null named params.

squeeze(self)
Helper method to clear out any null-type keys.
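The described behavior can be sketched with a minimal reimplementation. MiniSplit is hypothetical and only assumes the semantics documented above (attribute access via __getitem__, null detection, squeezing):

```python
class MiniSplit(dict):
    # A dict whose keys double as attributes, with null-key squeezing
    def __getattr__(self, attr):
        # Route attribute access through item lookup so **split works naturally
        try:
            return self[attr]
        except KeyError:
            raise AttributeError(attr)

    @staticmethod
    def is_null_type(obj):
        # Treat None and empty containers as "null" so downstream defaults apply
        return obj is None or (hasattr(obj, '__len__') and len(obj) == 0)

    def squeeze(self):
        # Drop null-type keys so they are not passed as explicit kwargs
        for key in [k for k, v in self.items() if self.is_null_type(v)]:
            del self[key]
        return self
```

After squeeze(), unpacking with **split only supplies the non-null parameters, letting a callee's own defaults (e.g. y=None) take effect for the rest.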
class simpleml.pipelines.SplitContainer(default_factory=Split, **kwargs)
Bases: collections.defaultdict

Explicit instantiation of a defaultdict returning Split objects.
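A minimal sketch of the same idea, with plain dicts standing in for Split objects (MiniSplitContainer is hypothetical):

```python
from collections import defaultdict

class MiniSplitContainer(defaultdict):
    # defaultdict preconfigured so any unknown split name yields an empty split
    def __init__(self, default_factory=dict, **kwargs):
        super().__init__(default_factory, **kwargs)
```

This lets downstream code ask for any split name (TRAIN, TEST, ...) without key-existence checks: missing splits just come back empty.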
class simpleml.pipelines.TransformedSequence(pipeline, dataset_sequence)
Bases: simpleml.imports.Sequence

Nested sequence class to apply transforms on batches in real time and forward them through as the next batch.

__getitem__(self, *args, **kwargs)
Pass-through to the dataset sequence - applies the transform on raw data and returns the batch.

__len__(self)
Pass-through. Returns the number of batches in the dataset sequence.

on_epoch_end(self)