simpleml.datasets

Import modules to register class names in global registry

Define convenience classes composed of different mixins

Package Contents

Classes

Dataset

Base class for all Dataset objects.

NumpyDataset

Composed mixin class with numpy helper methods and a predefined build routine, assuming dataset pipeline existence.

NumpyDatasetMixin

Assumes _external_file is a dictionary of numpy ndarrays

PandasDataset

Composed mixin class with pandas helper methods and a predefined build routine, assuming dataset pipeline existence.

PandasDatasetMixin

“Pandas”esque mixin class with control mechanism for self.dataframe of type dataframe.

simpleml.datasets.__author__ = Elisha Yadgaran
exception simpleml.datasets.DatasetError(*args, **kwargs)

Bases: simpleml.utils.errors.SimpleMLError

Common base class for all non-exit exceptions.

class simpleml.datasets.Dataset(has_external_files=True, label_columns=None, **kwargs)

Bases: simpleml.datasets.base_dataset.AbstractDataset

Base class for all Dataset objects.

pipeline_id: foreign key relation to the dataset pipeline used as input

__table_args__
__tablename__ = datasets
pipeline
pipeline_id
class simpleml.datasets.NumpyDataset(has_external_files=True, label_columns=None, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset, simpleml.datasets.numpy_mixin.NumpyDatasetMixin

Composed mixin class with numpy helper methods and a predefined build routine, assuming dataset pipeline existence.

WARNING: this class will fail unless build_dataframe is overridden or a pipeline is provided!

build_dataframe(self)

Transform the raw dataset via the dataset pipeline into a production-ready dataset. Override this method to disable the raw dataset requirement.

class simpleml.datasets.NumpyDatasetMixin

Bases: simpleml.datasets.abstract_mixin.AbstractDatasetMixin

Assumes _external_file is a dictionary of numpy ndarrays

property X(self)

Return the subset that isn’t in the target labels

get(self, column, split)

Explicitly split validation splits. Assumes self.dataframe has a get method that returns a dictionary of {‘X’: X, ‘y’: y}. Uses self.label_columns if y is named something else; only the first entry in the list is considered.

Returns None for any combination of column/split that isn’t present.
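
The lookup rules above can be sketched roughly as follows. This is a minimal stand-in, not SimpleML's implementation: the function name get_column, the split names "TRAIN" and "TEST", and the alternate label name "label" are all hypothetical, and plain lists stand in for numpy ndarrays.

```python
def get_column(splits, column, split, label_columns=None):
    """Hypothetical stand-in for the documented lookup; not SimpleML's code."""
    # Mirrors the assumed dict-style get on self.dataframe
    split_data = splits.get(split)
    if split_data is None:
        return None  # unknown split
    if column == "y" and "y" not in split_data and label_columns:
        # y is stored under another name: only the first label column is used
        return split_data.get(label_columns[0])
    return split_data.get(column)  # None when the column is absent


# Plain lists stand in for numpy ndarrays
splits = {
    "TRAIN": {"X": [[1, 2], [3, 4]], "y": [0, 1]},
    "TEST": {"X": [[5, 6]], "label": [1]},
}

train_X = get_column(splits, "X", "TRAIN")
test_y = get_column(splits, "y", "TEST", label_columns=["label"])
missing = get_column(splits, "X", "VALIDATION")
```

Returning None rather than raising lets callers probe for optional splits without wrapping every lookup in exception handling.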

get_feature_names(self)

Should return a list of the features in the dataset

property y(self)

Return the target label columns

class simpleml.datasets.PandasDataset(has_external_files=True, label_columns=None, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset, simpleml.datasets.pandas_mixin.PandasDatasetMixin

Composed mixin class with pandas helper methods and a predefined build routine, assuming dataset pipeline existence.

WARNING: this class will fail unless build_dataframe is overridden or a pipeline is provided!

build_dataframe(self)

Transform the raw dataset via the dataset pipeline into a production-ready dataset. Override this method to disable the raw dataset requirement.

static merge_split(split)

Helper method to merge all dataframes in a split object into a single df. Does a column-wise join.

ex: df1 = [A, B, C] (4 rows) + df2 = [D, E, F] (4 rows)
returns: [A, B, C, D, E, F] (4 rows)
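
The column-wise join in the example can be reproduced with plain pandas. The sketch below assumes pd.concat with axis=1 as the joining mechanism; the structure of the split object and the method's actual internals are not shown in this reference.

```python
import pandas as pd

# Two frames with the same 4-row index but different columns
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
df2 = pd.DataFrame({"D": [0, 1, 0, 1], "E": [2, 3, 2, 3], "F": [4, 5, 4, 5]})

# Column-wise join: rows are aligned on the index, columns are appended
merged = pd.concat([df1, df2], axis=1)
```

Because the join aligns on the index, frames with mismatched indexes would produce extra rows filled with NaN instead of a clean 4-row result.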

class simpleml.datasets.PandasDatasetMixin

Bases: simpleml.datasets.abstract_mixin.AbstractDatasetMixin

“Pandas”esque mixin class with control mechanism for self.dataframe of type dataframe. Only assumes pandas syntax, not types, so should be compatible with pandas drop-in replacements.

In particular:
A - type of pd.DataFrame:
  • query()

  • columns

  • drop()

  • __getitem__()

  • squeeze()

B - any other type:
  • get()

  • __getitem__()

  • squeeze()

property X(self)

Return the subset that isn’t in the target labels (across all potential splits)

concatenate_dataframes(self, dataframes, split_names)

Helper method to merge dataframes into a single one with the split specified under DATAFRAME_SPLIT_COLUMN
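
One way to picture this helper is a row-wise concatenation in which each input frame is first tagged with its split name. In the sketch below, the function name concatenate_with_split, the column name "split" (standing in for the value of DATAFRAME_SPLIT_COLUMN, which this reference does not show), and the split names are all assumptions.

```python
import pandas as pd

SPLIT_COLUMN = "split"  # hypothetical stand-in for DATAFRAME_SPLIT_COLUMN


def concatenate_with_split(dataframes, split_names):
    """Stack dataframes row-wise, recording each row's split name."""
    tagged = [
        df.assign(**{SPLIT_COLUMN: name})  # assign returns a copy; inputs stay unchanged
        for df, name in zip(dataframes, split_names)
    ]
    return pd.concat(tagged, axis=0, ignore_index=True)


train = pd.DataFrame({"a": [1, 2]})
test = pd.DataFrame({"a": [3]})
combined = concatenate_with_split([train, test], ["TRAIN", "TEST"])
```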

get(self, column, split)

Explicitly split validation splits. Assumes self.dataframe has a get method that returns the dataframe associated with the split. Uses self.label_columns to separate x and y columns inside the returned dataframe.

Returns an empty dataframe for missing combinations of column & split.
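
A rough sketch of this behavior on a single dataframe follows. It assumes splits are marked by a column (named "split" here purely for illustration) and is not SimpleML's implementation; the function name get_column and the feature and label names are hypothetical.

```python
import pandas as pd


def get_column(dataframe, column, split, label_columns, split_column="split"):
    """Hypothetical stand-in: select a split, then separate X from y."""
    rows = dataframe[dataframe[split_column] == split]  # empty if split is absent
    if column == "y":
        return rows[label_columns]
    if column == "X":
        # Everything that is neither a label nor the split marker
        return rows.drop(columns=label_columns + [split_column])
    return pd.DataFrame()  # unknown column name


df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [4.0, 5.0, 6.0],
    "target": [0, 1, 0],
    "split": ["TRAIN", "TRAIN", "TEST"],
})

train_X = get_column(df, "X", "TRAIN", ["target"])
train_y = get_column(df, "y", "TRAIN", ["target"])
missing = get_column(df, "X", "VALIDATION", ["target"])
```

Returning an empty dataframe for missing combinations keeps downstream pandas calls such as squeeze() or column access from failing on None.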

get_feature_names(self)

Should return a list of the features in the dataset

static load_csv(filename, **kwargs)

Helper method to read in a csv file

property y(self)

Return the target label columns