`simpleml.pipelines.validation_split_mixins`

Module for different pipeline split methods for cross validation

No Split – Use all the data; hardcoded as the default for all pipelines

Explicit Split – dataset class defines the split

Percentage – random split support for train, validation, test

Chronological – time based split support for train, validation, test

KFold – not yet implemented (TBD)

Module Contents

Classes

`ChronologicalSplitMixin`
`ExplicitSplitMixin`
`KFoldSplitMixin`	TBD on how to implement this. KFold requires K models and unique datasets
`RandomSplitMixin`	Class to randomly split dataset into different sets
`SplitMixin`

Attributes

__author__

simpleml.pipelines.validation_split_mixins.__author__ = Elisha Yadgaran[source]

class simpleml.pipelines.validation_split_mixins.ChronologicalSplitMixin(**kwargs)[source]

Bases: SplitMixin

class simpleml.pipelines.validation_split_mixins.ExplicitSplitMixin[source]

Bases: SplitMixin

split_dataset(self)[source]

Method to split the dataframe into different sets. Assumes dataset explicitly delineates between different splits

Passes forward dataset split names so uniquely named splits will propagate and can be referenced the same way

Return type: None
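As a rough illustration of the pass-through behavior described above, the sketch below uses a plain dictionary as a stand-in for a dataset that already labels its own splits. The names `dataset_splits` and `explicit_split` are hypothetical and are not part of simpleml; the point is only that custom split names survive unchanged.

```python
from typing import Dict, List

# Hypothetical stand-in for a dataset that already labels its own splits.
dataset_splits: Dict[str, List[int]] = {
    "TRAIN": [0, 1, 2, 3],
    "HOLDOUT_2020": [4, 5],  # custom split name defined by the dataset
}


def explicit_split(splits: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Pass every dataset-defined split through under its original name."""
    return {name: list(rows) for name, rows in splits.items()}


pipeline_splits = explicit_split(dataset_splits)
```

Because the names are forwarded as-is, a split such as `HOLDOUT_2020` can be looked up on the pipeline side with the same key the dataset used.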

class simpleml.pipelines.validation_split_mixins.KFoldSplitMixin[source]

Bases: SplitMixin

TBD on how to implement this. KFold requires K models and unique datasets, so it may be easier to wrap a parallelized implementation that internally creates K new Pipeline and Model objects.
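Since this mixin is not implemented, the following is only a generic sketch of why K-fold needs K separate train/test pairs (and therefore K models). The function `k_fold_indices` is hypothetical, uses contiguous unshuffled folds for simplicity, and is not simpleml code.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_rows: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Distribute the remainder so fold sizes differ by at most one row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_rows=10, k=3))
```

Each element of `folds` is an independent train/test pair, which is the unit a wrapper would hand to its own Pipeline and Model objects.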

class simpleml.pipelines.validation_split_mixins.RandomSplitMixin(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)[source]

Bases: SplitMixin

Class to randomly split dataset into different sets

Redefines the splits, so custom-named splits in the dataset cannot be referenced by their original names. Only TRAIN/TEST/VALIDATION are available.

Sets the splitting parameters. By default validation_size is 0.0 because the validation split is only used for hyperparameter tuning.

Parameters

train_size (Union[float, int]) –
test_size (Optional[Union[float, int]]) –
validation_size (Union[float, int]) –
random_state (int) –
shuffle (bool) –
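To make the parameters above concrete, here is a minimal sketch of a seeded percentage split. It assumes all sizes are fractions of the dataset and that a `test_size` of `None` means "whatever is left"; the real class may treat integer sizes and defaults differently. The function `random_split` is hypothetical and only mirrors the parameter names from the signature.

```python
import random
from typing import Dict, Iterable, List, Optional


def random_split(
    index: Iterable[int],
    train_size: float,
    test_size: Optional[float] = None,
    validation_size: float = 0.0,
    random_state: int = 123,
    shuffle: bool = True,
) -> Dict[str, List[int]]:
    """Split row indices into TRAIN/VALIDATION/TEST by fraction (assumed semantics)."""
    rows = list(index)
    if shuffle:
        # A dedicated Random instance keeps the split reproducible per seed.
        random.Random(random_state).shuffle(rows)
    if test_size is None:
        # Assumption: an unspecified test size takes the remaining fraction.
        test_size = 1.0 - train_size - validation_size
    n_train = int(round(train_size * len(rows)))
    n_validation = int(round(validation_size * len(rows)))
    n_test = int(round(test_size * len(rows)))
    validation_end = n_train + n_validation
    return {
        "TRAIN": rows[:n_train],
        "VALIDATION": rows[n_train:validation_end],
        "TEST": rows[validation_end:validation_end + n_test],
    }


splits = random_split(range(10), train_size=0.7)
```

With the default `validation_size` of 0.0 the VALIDATION split is empty, which matches the note that validation data is only needed for hyperparameter tuning.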

static get_index(data)[source]

Helper to extract the index from a dataset. Generates a range index if none exists

Return type: List[int]
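One plausible reading of this helper, sketched without any dependency on simpleml or pandas: use the object's `index` attribute when it has one, otherwise fall back to positional numbering. The function below is a hypothetical approximation, not the library's implementation.

```python
from typing import Any, List


def get_index(data: Any) -> List[int]:
    """Return the data's own index if present, else a 0..n-1 range index."""
    index = getattr(data, "index", None)
    # Plain lists and tuples expose an `index` *method*, so skip callables.
    if index is not None and not callable(index):
        return list(index)
    return list(range(len(data)))


positions = get_index(["a", "b", "c"])
```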

split_dataset(self)[source]

Overrides the base method to split by percentage

Return type: None

class simpleml.pipelines.validation_split_mixins.SplitMixin[source]

Bases: with_metaclass(ABCMeta, object)

containerize_split(self, split_dict)[source]

Parameters: split_dict (Dict[str, simpleml.datasets.dataset_splits.Split]) –
Return type: simpleml.datasets.dataset_splits.SplitContainer

abstract split_dataset(self)[source]

Set the split criteria

Must set self._dataset_splits
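The contract above (an abstract `split_dataset` that must populate `self._dataset_splits`) can be sketched with the standard library's `abc` module. The classes `SplitMixinSketch` and `FirstHalfSplit` are hypothetical, and a plain dictionary stands in for the split container type that the real mixin produces.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional


class SplitMixinSketch(ABC):
    """Illustrative base class: subclasses must define how data is split."""

    @abstractmethod
    def split_dataset(self) -> None:
        """Set the split criteria; must assign self._dataset_splits."""


class FirstHalfSplit(SplitMixinSketch):
    """Toy subclass: first half of the rows is TRAIN, the rest is TEST."""

    def __init__(self, rows: Iterable[int]) -> None:
        self.rows: List[int] = list(rows)
        self._dataset_splits: Optional[Dict[str, List[int]]] = None

    def split_dataset(self) -> None:
        midpoint = len(self.rows) // 2
        self._dataset_splits = {
            "TRAIN": self.rows[:midpoint],
            "TEST": self.rows[midpoint:],
        }


pipeline = FirstHalfSplit(range(4))
pipeline.split_dataset()
```

Declaring the method abstract means a subclass that forgets to implement `split_dataset` fails at instantiation rather than later, when the splits are first read.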
