simpleml.pipelines.validation_split_mixins

Module providing the different pipeline split methods used for cross-validation.

  1. No Split – use all the data; hardcoded as the default for all pipelines

  2. Explicit Split – dataset class defines the split

  3. Percentage – random split support for train, validation, test

  4. Chronological – time based split support for train, validation, test

  5. KFold

Module Contents

Classes

ChronologicalSplitMixin

ExplicitSplitMixin

KFoldSplitMixin

TBD on how to implement this. KFold requires K models and unique datasets

RandomSplitMixin

Class to randomly split dataset into different sets

SplitMixin

Attributes

__author__

simpleml.pipelines.validation_split_mixins.__author__ = Elisha Yadgaran[source]

class simpleml.pipelines.validation_split_mixins.ChronologicalSplitMixin(**kwargs)[source]

Bases: SplitMixin

class simpleml.pipelines.validation_split_mixins.ExplicitSplitMixin[source]

Bases: SplitMixin

split_dataset(self)[source]

Method to split the dataframe into different sets. Assumes the dataset explicitly delineates the different splits.

Passes the dataset's split names forward, so uniquely named splits propagate and can be referenced by the same names.

Return type

None
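
The pass-through behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `ToyDataset`, `ToyExplicitSplit`, and the accessor names `get_split_names` and `get_split` are hypothetical stand-ins chosen for the example, and a plain dict takes the place of the split container.

```python
class ToyDataset:
    """Hypothetical dataset that already defines its own named splits."""

    def __init__(self, splits):
        self._splits = dict(splits)

    def get_split_names(self):
        # Assumed accessor name, for illustration only
        return list(self._splits)

    def get_split(self, name):
        # Assumed accessor name, for illustration only
        return self._splits[name]


class ToyExplicitSplit:
    """Illustrative only: forwards the dataset's splits under their own names."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._dataset_splits = None

    def split_dataset(self):
        # Returns None; the result is stored on the instance
        self._dataset_splits = {
            name: self.dataset.get_split(name)
            for name in self.dataset.get_split_names()
        }


dataset = ToyDataset({"TRAIN": [1, 2, 3], "HOLDOUT": [4, 5]})
pipeline = ToyExplicitSplit(dataset)
result = pipeline.split_dataset()
```

Because the names are copied rather than redefined, a custom split such as `HOLDOUT` in this toy example stays addressable under its original name.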

class simpleml.pipelines.validation_split_mixins.KFoldSplitMixin[source]

Bases: SplitMixin

TBD on how to implement this. KFold requires K models and unique datasets, so it may be easier to wrap a parallelized implementation that internally creates K new Pipeline and Model objects.
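
To illustrate why KFold needs K models and unique datasets, the sketch below generates the K train/test index pairs in plain Python. `kfold_indices` is a hypothetical helper written for this example; it is not part of SimpleML and says nothing about how the mixin will eventually be implemented.

```python
def kfold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(kfold_indices(n_rows=10, k=3))
for fold_number, (train, test) in enumerate(folds):
    # Each pair is a distinct dataset that would need its own fitted model
    print(fold_number, len(train), len(test))
```

Each of the K pairs holds out a different slice of the data, which is why a single pipeline with one split definition does not cover this case.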

class simpleml.pipelines.validation_split_mixins.RandomSplitMixin(train_size, test_size=None, validation_size=0.0, random_state=123, shuffle=True, **kwargs)[source]

Bases: SplitMixin

Class to randomly split dataset into different sets

Redefines the splits, so custom-named splits in the dataset cannot be referenced by their original names; only TRAIN, TEST, and VALIDATION are available.

Sets the splitting parameters. By default, validation_size is 0.0 because the validation split is only used for hyperparameter tuning.

Parameters
  • train_size (Union[float, int]) –

  • test_size (Optional[Union[float, int]]) –

  • validation_size (Union[float, int]) –

  • random_state (int) –

  • shuffle (bool) –

static get_index(data)[source]

Helper to extract the index from a dataset. Generates a range index if none exists.

Return type

List[int]
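
A minimal sketch of the fallback behavior described above, assuming that "index" means a pandas-style `index` attribute. The function body and the `ToyFrame` class are assumptions made for this example; the library's own check may differ.

```python
class ToyFrame:
    """Hypothetical stand-in for a dataframe-like object with an index."""

    def __init__(self, index):
        self.index = index


def get_index(data):
    """Return the data's own index if it has one, else a positional range."""
    index = getattr(data, "index", None)
    # Lists and tuples expose an `index` *method*, which is not a row index
    if index is None or callable(index):
        return list(range(len(data)))
    return list(index)


with_index = get_index(ToyFrame([5, 7, 9]))
without_index = get_index(["a", "b", "c"])
```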

split_dataset(self)[source]

Overrides split_dataset to split by percentage.

Return type

None
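
A percentage split along these lines can be sketched with the standard library. `random_split` is a hypothetical function that reuses the parameter names from the class signature; it handles fractional sizes only, and the treatment of `test_size=None` as "whatever remains" is an assumption, not documented behavior.

```python
import random


def random_split(index, train_size, test_size=None, validation_size=0.0,
                 random_state=123, shuffle=True):
    """Partition `index` into TRAIN/VALIDATION/TEST lists by fraction."""
    rows = list(index)
    if shuffle:
        # A local generator keeps the split reproducible without touching global state
        random.Random(random_state).shuffle(rows)

    n_train = round(len(rows) * train_size)
    n_validation = round(len(rows) * validation_size)
    if test_size is None:
        # Assumption: the test split takes every row left over
        n_test = len(rows) - n_train - n_validation
    else:
        n_test = round(len(rows) * test_size)

    if min(n_train, n_validation, n_test) < 0:
        raise ValueError("split sizes must not be negative")
    if n_train + n_validation + n_test > len(rows):
        raise ValueError("split sizes exceed the dataset")

    validation_start = n_train
    test_start = validation_start + n_validation
    return {
        "TRAIN": rows[:validation_start],
        "VALIDATION": rows[validation_start:test_start],
        "TEST": rows[test_start:test_start + n_test],
    }


splits = random_split(range(10), train_size=0.6, validation_size=0.2)
```

The returned keys are fixed, which mirrors the note that only TRAIN, TEST, and VALIDATION can be referenced after a random split.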

class simpleml.pipelines.validation_split_mixins.SplitMixin[source]

Bases: with_metaclass(ABCMeta, object)

containerize_split(self, split_dict)[source]

Parameters

split_dict (Dict[str, simpleml.datasets.dataset_splits.Split]) –

Return type

simpleml.datasets.dataset_splits.SplitContainer

abstract split_dataset(self)[source]

Sets the split criteria.

Implementations must set self._dataset_splits.
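
The contract for subclasses can be sketched with a reduced abstract base. `ToySplitMixin` and `FirstHalfSplit` are hypothetical classes for this example; the real SplitMixin also provides containerize_split, which is replaced here by a plain dict.

```python
from abc import ABC, abstractmethod


class ToySplitMixin(ABC):
    """Hypothetical stand-in reduced to the documented contract."""

    @abstractmethod
    def split_dataset(self):
        """Set the split criteria. Must set self._dataset_splits."""


class FirstHalfSplit(ToySplitMixin):
    """Illustrative subclass: first half of the rows for TRAIN, the rest for TEST."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._dataset_splits = None

    def split_dataset(self):
        midpoint = len(self.rows) // 2
        # The one requirement: populate self._dataset_splits
        self._dataset_splits = {
            "TRAIN": self.rows[:midpoint],
            "TEST": self.rows[midpoint:],
        }


pipeline = FirstHalfSplit([1, 2, 3, 4])
pipeline.split_dataset()
```

Instantiating the abstract base directly raises a TypeError, so a subclass that does not define split_dataset cannot be created.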