simpleml.datasets.pandas

Dataset Library support for Pandas

Package Contents

Classes

BasePandasDataset

Pandas base class with control mechanism for self.dataframe of type pd.DataFrame

PandasFileBasedDataset

Pandas dataset class that generates the dataframe by reading in a file

PandasPipelineDataset

Pandas dataset class that generates the dataframe as the output of the linked pipeline

Attributes

__author__

simpleml.datasets.pandas.__author__ = Elisha Yadgaran

class simpleml.datasets.pandas.BasePandasDataset(squeeze_return=False, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset

Pandas base class with control mechanism for self.dataframe of type pd.DataFrame

Parameters

squeeze_return (bool) – if True, run dataframe.squeeze() on the value returned by self.get() calls. Useful for aligning return types with what other libraries expect (e.g. a single-label y for sklearn)
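As a rough illustration of what squeezing does, the sketch below uses plain pandas only. It does not call simpleml, and the column names are made up for the example. A single-column dataframe collapses to a Series, while a frame with several rows and several columns comes back unchanged.

```python
import pandas as pd

# Single label column: squeeze() collapses the frame to a Series
labels = pd.DataFrame({"target": [0, 1, 1]})
squeezed = labels.squeeze()

# Several rows and several columns: squeeze() has nothing to collapse
wide = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
unchanged = wide.squeeze()

print(type(squeezed).__name__)
print(type(unchanged).__name__)
```

Note that pandas also collapses a one-row, one-column frame to a scalar, so the squeezed type depends on the shape of the data.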

property X(self)

Return the subset that isn’t in the target labels (across all potential splits)

Return type

pandas.DataFrame

property _dataframe(self)

Overwrite base behavior to return a copy of the data in case consumers attempt to mutate the data structure

Only copies the pandas container. Underlying cell objects (e.g. lists, dicts, objects) can still propagate in-place mutations

Return type

pandas.DataFrame
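The caveat about cell objects can be seen with plain pandas. This is a minimal sketch that assumes the copy is made with DataFrame.copy(); the page does not say how the library copies, and the column names are invented for the example.

```python
import pandas as pd

original = pd.DataFrame({"n": [1, 2], "tags": [["a"], ["b"]]})
copied = original.copy()

# Replacing a cell value in the copy leaves the original untouched
copied.loc[0, "n"] = 99

# Mutating a Python object stored in a cell is visible through both frames,
# because the copy holds a reference to the same list
copied.loc[0, "tags"].append("z")

print(original.loc[0, "n"])
print(original.loc[0, "tags"])
```

Consumers that store mutable objects in cells would need their own deep copy (for example with copy.deepcopy) if they want full isolation.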

static _get(dataframe, columns, split)

Internal method to extract data subsets from a dataframe

Parameters
  • dataframe (pandas.DataFrame) – the dataframe to subset from

  • columns (List[str]) – List of columns to slice from the dataframe

  • split (str) – row identifiers to slice rows (in internal column mapped to DATAFRAME_SPLIT_COLUMN)

Return type

pandas.DataFrame
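The kind of subsetting described here might look like the following sketch. It is not the library's implementation: get_subset is a hypothetical helper, and SPLIT_COLUMN with the value "split" is a stand-in for DATAFRAME_SPLIT_COLUMN, whose actual value is not given on this page.

```python
import pandas as pd

SPLIT_COLUMN = "split"  # assumption: stand-in for DATAFRAME_SPLIT_COLUMN


def get_subset(dataframe, columns, split):
    """Hypothetical helper: filter rows by split label, then slice columns."""
    rows = dataframe[dataframe[SPLIT_COLUMN] == split]
    return rows[columns]


df = pd.DataFrame(
    {
        "x1": [1, 2, 3],
        "label": [0, 1, 0],
        SPLIT_COLUMN: ["TRAIN", "TRAIN", "TEST"],
    }
)
train_x = get_subset(df, ["x1"], "TRAIN")
print(train_x)
```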

_validate_dtype(self, df)

Validating setter method for self._external_file. Checks that the input is of type pd.DataFrame

Parameters

df (pandas.DataFrame) –

Return type

None

static concatenate_dataframes(dataframes, split_names)

Helper method to merge dataframes into a single one with the split specified under DATAFRAME_SPLIT_COLUMN

Parameters
  • dataframes (List[pandas.DataFrame]) –

  • split_names (List[str]) –

Return type

pandas.DataFrame
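One way to picture this merge is a row-wise pd.concat after tagging each frame with its split name. The sketch below is an assumption-laden illustration, not the library's code: concatenate_with_split_labels is a hypothetical name, SPLIT_COLUMN stands in for DATAFRAME_SPLIT_COLUMN, and whether the real method resets the index is not stated on this page.

```python
import pandas as pd

SPLIT_COLUMN = "split"  # assumption: stand-in for DATAFRAME_SPLIT_COLUMN


def concatenate_with_split_labels(dataframes, split_names):
    """Hypothetical helper: tag each frame with its split, then stack rows."""
    labeled = [
        df.assign(**{SPLIT_COLUMN: name})
        for df, name in zip(dataframes, split_names)
    ]
    return pd.concat(labeled, axis=0)


train = pd.DataFrame({"x1": [1, 2]})
test = pd.DataFrame({"x1": [3]})
combined = concatenate_with_split_labels([train, test], ["TRAIN", "TEST"])
print(combined)
```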

get(self, column, split)

Explicitly split validation splits. Uses self.label_columns to separate x and y columns inside the returned dataframe.

Returns an empty dataframe for missing combinations of column and split

Parameters
  • column (Optional[str]) –

  • split (Optional[str]) –

Return type

pandas.DataFrame

get_feature_names(self)

Should return a list of the features in the dataset

Return type

List[str]

get_split(self, split)

Wrapper accessor to return a split object (for internal use)

Parameters

split (Optional[str]) –

Return type

simpleml.pipelines.validation_split_mixins.Split

get_split_names(self)

Helper to expose the splits contained in the dataset

Return type

List[str]

static merge_split(split)

Helper method to merge all dataframes in a split object into a single df. Does a column-wise join.

Example: df1 = [A, B, C] (4 rows) + df2 = [D, E, F] (4 rows) returns [A, B, C, D, E, F] (4 rows)

Parameters

split (simpleml.pipelines.validation_split_mixins.Split) –

Return type

pandas.DataFrame
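The column-wise join in the example above can be reproduced with plain pandas. This sketch uses pd.concat with axis=1 as an assumption about how such a join could be done; it does not construct a Split object or call merge_split itself.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 1, 2, 3], "B": [0, 1, 2, 3], "C": [0, 1, 2, 3]})
df2 = pd.DataFrame({"D": [0, 1, 2, 3], "E": [0, 1, 2, 3], "F": [0, 1, 2, 3]})

# Column-wise join: rows are aligned on the index, columns are appended
merged = pd.concat([df1, df2], axis=1)
print(merged.shape)
```

Because the join aligns on the index, frames with mismatched indexes would produce extra rows filled with NaN rather than four clean rows.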

static squeeze_dataframe(df)

Helper method to run dataframe squeeze and return a series

Parameters

df (pandas.DataFrame) –

Return type

pandas.Series

property y(self)

Return the target label columns

Return type

pandas.DataFrame
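The relationship between X and y can be sketched with plain pandas. This is an illustration only: label_columns here is a local list mirroring the self.label_columns attribute mentioned under get(), and the frame and column names are invented. The real properties may also handle the internal split column, which this sketch leaves out.

```python
import pandas as pd

df = pd.DataFrame({"x1": [1, 2], "x2": [3, 4], "label": [0, 1]})
label_columns = ["label"]  # assumption: mirrors self.label_columns

# y: only the target label columns
y = df[label_columns]

# X: everything that is not a target label column
X = df.drop(columns=label_columns)

print(list(X.columns), list(y.columns))
```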

class simpleml.datasets.pandas.PandasFileBasedDataset(filepath, format, reader_params=None, **kwargs)

Bases: simpleml.datasets.pandas.base.BasePandasDataset

Pandas dataset class that generates the dataframe by reading in a file

Parameters
  • squeeze_return (bool) – if True, run dataframe.squeeze() on the value returned by self.get() calls. Useful for aligning return types with what other libraries expect (e.g. a single-label y for sklearn)

  • filepath (str) –

  • format (str) –

  • reader_params (Optional[Dict]) –

build_dataframe(self)

Must set self._external_file. Can't be declared as an abstractmethod because of the database lookup dependency

Return type

None
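How filepath, format and reader_params might fit together is sketched below. Everything here is an assumption for illustration: read_file and the READERS mapping are hypothetical, and the formats the class actually supports are not listed on this page. Only the pandas reader functions themselves are real.

```python
import os
import tempfile

import pandas as pd

# Assumption: a format string selects a pandas reader function
READERS = {"csv": pd.read_csv, "json": pd.read_json}


def read_file(filepath, format, reader_params=None):
    """Hypothetical helper: dispatch to a pandas reader by format name."""
    if format not in READERS:
        raise ValueError(f"Unsupported format: {format}")
    return READERS[format](filepath, **(reader_params or {}))


# Write a small CSV to a temporary directory, then read one column back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(path, index=False)
    loaded = read_file(path, "csv", reader_params={"usecols": ["a"]})

print(loaded)
```

Passing reader_params through as keyword arguments keeps reader-specific options (such as usecols for CSV) out of the dataset's own signature.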

class simpleml.datasets.pandas.PandasPipelineDataset(squeeze_return=False, **kwargs)

Bases: simpleml.datasets.pandas.base.BasePandasDataset

Pandas dataset class that generates the dataframe as the output of the linked pipeline

Parameters

squeeze_return (bool) – if True, run dataframe.squeeze() on the value returned by self.get() calls. Useful for aligning return types with what other libraries expect (e.g. a single-label y for sklearn)

build_dataframe(self)

Transform the raw dataset via the dataset pipeline to produce a production-ready dataset

Return type

None