`simpleml.datasets`

Import modules to register class names in the global registry.

Define convenience classes composed of different mixins.

Package Contents

Classes

`Dataset`	Base class for all Dataset objects.
`NumpyDataset`	Composed mixin class with numpy helper methods and a predefined build routine.
`NumpyDatasetMixin`	Assumes _external_file is a dictionary of numpy ndarrays.
`PandasDataset`	Composed mixin class with pandas helper methods and a predefined build routine.
`PandasDatasetMixin`	“Pandas”esque mixin class with control mechanism for self.dataframe of type dataframe.

simpleml.datasets.__author__ = Elisha Yadgaran

exception simpleml.datasets.DatasetError(*args, **kwargs)

Bases: simpleml.utils.errors.SimpleMLError

Common base class for all non-exit exceptions.

Initialize self. See help(type(self)) for accurate signature.

class simpleml.datasets.Dataset(has_external_files=True, label_columns=None, **kwargs)

Bases: simpleml.datasets.base_dataset.AbstractDataset

Base class for all Dataset objects.

pipeline_id: foreign key relation to the dataset pipeline used as input

__table_args__

__tablename__ = 'datasets'

pipeline

pipeline_id

class simpleml.datasets.NumpyDataset(has_external_files=True, label_columns=None, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset, simpleml.datasets.numpy_mixin.NumpyDatasetMixin

Composed mixin class with numpy helper methods and a predefined build routine, assuming dataset pipeline existence.

WARNING: this class will fail unless build_dataframe is overridden or a pipeline is provided!

build_dataframe(self): Transform the raw dataset via the dataset pipeline into a production-ready dataset. Override this method to disable the raw dataset requirement.

class simpleml.datasets.NumpyDatasetMixin

Bases: simpleml.datasets.abstract_mixin.AbstractDatasetMixin

Assumes _external_file is a dictionary of numpy ndarrays

property X(self): Return the subset that isn’t in the target labels.

get(self, column, split)

Explicitly split validation splits. Assumes self.dataframe has a get method that returns a dictionary of {‘X’: X, ‘y’: y}. Uses self.label_columns if y is named something else; only the first entry in the list is considered.

Returns None for any combination of column/split that isn’t present.
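The lookup described above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the library's implementation: the nested layout ({split: {'X': ..., 'y': ...}}), the split names, and the helper name get_from_splits are all assumptions made for the example.

```python
import numpy as np


def get_from_splits(data, column, split, label_columns=None):
    """Hypothetical helper: fetch 'X' or 'y' for a split, or None if absent."""
    split_data = data.get(split)
    if split_data is None:
        return None
    if column == "y" and label_columns:
        # Only the first label column is consulted, as the docstring describes
        key = label_columns[0]
    else:
        key = column
    return split_data.get(key)


# Made-up data: one split holding a feature matrix and a label vector
data = {"TRAIN": {"X": np.arange(6).reshape(3, 2), "y": np.array([0, 1, 0])}}

train_X = get_from_splits(data, "X", "TRAIN")   # 3x2 ndarray
missing = get_from_splits(data, "X", "TEST")    # None, split is not present
```

Returning None rather than raising lets callers probe for optional splits without exception handling, at the cost of requiring an explicit None check downstream.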

get_feature_names(self): Should return a list of the features in the dataset.

property y(self): Return the target label columns.

class simpleml.datasets.PandasDataset(has_external_files=True, label_columns=None, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset, simpleml.datasets.pandas_mixin.PandasDatasetMixin

Composed mixin class with pandas helper methods and a predefined build routine, assuming dataset pipeline existence.

WARNING: this class will fail unless build_dataframe is overridden or a pipeline is provided!

build_dataframe(self): Transform the raw dataset via the dataset pipeline into a production-ready dataset. Override this method to disable the raw dataset requirement.

static merge_split(split): Helper method to merge all dataframes in a split object into a single df using a column-wise join. Example: df1 = [A, B, C] (4 rows) + df2 = [D, E, F] (4 rows) returns [A, B, C, D, E, F] (4 rows).
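The column-wise join in the example above can be reproduced with plain pandas. This is a minimal sketch of the described behavior, not necessarily how merge_split is implemented; the frames df1 and df2 are made-up data, and the use of pd.concat is an assumption.

```python
import pandas as pd

# Two made-up frames sharing the same 4-row default index
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
df2 = pd.DataFrame({"D": [0, 1, 0, 1], "E": [2, 2, 3, 3], "F": [4, 5, 6, 7]})

# axis=1 places the frames side by side, aligning rows on the index
merged = pd.concat([df1, df2], axis=1)

print(list(merged.columns))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(merged.shape)          # (4, 6)
```

Because pd.concat aligns on the index, frames whose indexes differ would produce extra rows filled with NaN rather than a clean 4-row result.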

class simpleml.datasets.PandasDatasetMixin

Bases: simpleml.datasets.abstract_mixin.AbstractDatasetMixin

“Pandas”esque mixin class with control mechanism for self.dataframe of type dataframe. Only assumes pandas syntax, not types, so should be compatible with pandas drop-in replacements.

In particular:

A - type of pd.DataFrame:

query()
columns
drop()
__getitem__()
squeeze()

B - any other type:

get()
__getitem__()
squeeze()

property X(self): Return the subset that isn’t in the target labels (across all potential splits).

concatenate_dataframes(self, dataframes, split_names): Helper method to merge dataframes into a single one, with the split specified under DATAFRAME_SPLIT_COLUMN.
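As a rough illustration of that idea, the sketch below stacks several frames and records each row's split in an extra column. It is not the library's code: the helper name concatenate_with_split is hypothetical, and the column name "DATASET_SPLIT" is an assumed stand-in for whatever value DATAFRAME_SPLIT_COLUMN actually holds.

```python
import pandas as pd

# Assumed stand-in; the real value of DATAFRAME_SPLIT_COLUMN is not shown in this page
SPLIT_COLUMN = "DATASET_SPLIT"


def concatenate_with_split(dataframes, split_names):
    """Hypothetical helper: stack frames row-wise, tagging rows with their split."""
    labelled = [
        df.assign(**{SPLIT_COLUMN: name})
        for df, name in zip(dataframes, split_names)
    ]
    return pd.concat(labelled, axis=0, ignore_index=True)


# Made-up splits
train = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})
test = pd.DataFrame({"feature": [4, 5], "label": [1, 1]})

combined = concatenate_with_split([train, test], ["TRAIN", "TEST"])
```

Keeping the split as a column means a single dataframe can carry every split, and each one can later be recovered with an ordinary filter on that column.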

get(self, column, split)

Explicitly split validation splits. Assumes self.dataframe has a get method that returns the dataframe associated with the split. Uses self.label_columns to separate the X and y columns inside the returned dataframe.

Returns an empty dataframe for missing combinations of column and split.
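The behavior described above might look like the following in plain pandas. This is a sketch under stated assumptions rather than the mixin's actual code: the helper name get_column, the split column name "DATASET_SPLIT", and the sample data are all hypothetical.

```python
import pandas as pd

# Assumed stand-in for the column that records each row's split
SPLIT_COLUMN = "DATASET_SPLIT"


def get_column(df, column, split, label_columns):
    """Hypothetical helper: return the X or y portion of one split."""
    rows = df[df[SPLIT_COLUMN] == split].drop(columns=[SPLIT_COLUMN])
    if rows.empty or column not in ("X", "y"):
        # Missing combinations yield an empty dataframe instead of an error
        return pd.DataFrame()
    if column == "y":
        return rows[label_columns]
    return rows.drop(columns=label_columns)


# Made-up data with two splits
df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0],
    "feature_2": [4.0, 5.0, 6.0],
    "label": [0, 1, 0],
    SPLIT_COLUMN: ["TRAIN", "TRAIN", "TEST"],
})

train_X = get_column(df, "X", "TRAIN", ["label"])
train_y = get_column(df, "y", "TRAIN", ["label"])
missing = get_column(df, "X", "VALIDATION", ["label"])
```

An empty dataframe keeps the return type uniform, so downstream code can check .empty instead of branching on None.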

get_feature_names(self): Should return a list of the features in the dataset.

static load_csv(filename, **kwargs): Helper method to read in a csv file.

property y(self): Return the target label columns.
