simpleml.datasets.pandas.base

Pandas Module for datasets

Inherit and extend for particular patterns

Module Contents

Classes

BasePandasDataset

Pandas base class with control mechanism for self.dataframe of type pd.DataFrame

Attributes

DATAFRAME_SPLIT_COLUMN

__author__

simpleml.datasets.pandas.base.DATAFRAME_SPLIT_COLUMN :str = DATASET_SPLIT
simpleml.datasets.pandas.base.__author__ = Elisha Yadgaran
class simpleml.datasets.pandas.base.BasePandasDataset(squeeze_return=False, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset

Pandas base class with control mechanism for self.dataframe of type pd.DataFrame

Parameters

squeeze_return (bool) – whether to run dataframe.squeeze() on the value returned from self.get() calls. Useful for aligning return types with what other libraries expect (e.g. sklearn expects a one-dimensional y for a single label)

property X(self)

Return the subset that isn’t in the target labels (across all potential splits)

Return type

pandas.DataFrame

property _dataframe(self)

Override base behavior to return a copy of the data in case consumers attempt to mutate the data structure

Only the pandas container is copied. Underlying cell objects (e.g. lists, dicts, objects) are still shared, so in-place mutations of them propagate to the original

Return type

pandas.DataFrame
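
The copy caveat above can be illustrated with plain pandas. This is a minimal sketch, not the property's implementation: the frame and column names are hypothetical, and it only relies on the documented behavior of DataFrame.copy().

```python
import pandas as pd

# Hypothetical data: one scalar column and one column holding list objects
original = pd.DataFrame({"count": [1, 2], "tags": [["a"], ["b"]]})

# copy() duplicates the pandas container, not the Python objects in its cells
copied = original.copy()

# Replacing a cell value only affects the copy
copied.loc[0, "count"] = 99

# Mutating a cell object in place is visible through both frames
copied.loc[0, "tags"].append("z")

scalar_isolated = original.loc[0, "count"] == 1
list_shared = original.loc[0, "tags"] == ["a", "z"]
```

Consumers that need full isolation for object columns would have to copy those cell objects themselves, for example with copy.deepcopy on the affected values.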

static _get(dataframe, columns, split)

Internal method to extract data subsets from a dataframe

Parameters
  • dataframe (pandas.DataFrame) – the dataframe to subset from

  • columns (List[str]) – List of columns to slice from the dataframe

  • split (str) – split name used to select rows, matched against the internal column named by DATAFRAME_SPLIT_COLUMN

Return type

pandas.DataFrame
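
A rough sketch of what such a subsetting step might look like in plain pandas. The function get_subset and the sample data are hypothetical stand-ins, not the library's code, and treating split=None as "all rows" is an assumption.

```python
import pandas as pd

SPLIT_COLUMN = "DATASET_SPLIT"  # value documented for DATAFRAME_SPLIT_COLUMN


def get_subset(dataframe, columns, split=None):
    """Illustrative stand-in for _get: filter rows by split, then slice columns."""
    if split is not None:
        # Keep only the rows tagged with the requested split
        dataframe = dataframe.loc[dataframe[SPLIT_COLUMN] == split]
    return dataframe[columns]


# Hypothetical dataset with two splits
data = pd.DataFrame({
    "age": [31, 45, 27],
    "label": [0, 1, 0],
    SPLIT_COLUMN: ["TRAIN", "TRAIN", "TEST"],
})

train_features = get_subset(data, ["age"], "TRAIN")
missing_split = get_subset(data, ["age"], "VALIDATION")
```

In this sketch a split name with no matching rows simply yields an empty dataframe with the requested columns.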

_validate_dtype(self, df)

Validating setter method for self._external_file. Checks that the input is of type pd.DataFrame

Parameters

df (pandas.DataFrame) –

Return type

None

static concatenate_dataframes(dataframes, split_names)

Helper method to merge dataframes into a single one with the split specified under DATAFRAME_SPLIT_COLUMN

Parameters
  • dataframes (List[pandas.DataFrame]) –

  • split_names (List[str]) –

Return type

pandas.DataFrame
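
One way to picture this helper is as a tagged concatenation. The sketch below is an assumption about the general approach, not the library's implementation: concatenate_with_splits is a hypothetical name, and resetting the index is a choice made here for readability.

```python
import pandas as pd

SPLIT_COLUMN = "DATASET_SPLIT"  # value documented for DATAFRAME_SPLIT_COLUMN


def concatenate_with_splits(dataframes, split_names):
    """Illustrative stand-in: tag each frame with its split name, then stack rows."""
    tagged = [
        # assign() returns a new frame, so the inputs are left untouched
        df.assign(**{SPLIT_COLUMN: name})
        for df, name in zip(dataframes, split_names)
    ]
    return pd.concat(tagged, ignore_index=True)


# Hypothetical per-split frames
train = pd.DataFrame({"age": [31, 45]})
test = pd.DataFrame({"age": [27]})

combined = concatenate_with_splits([train, test], ["TRAIN", "TEST"])
```

Each row of the combined frame then carries its split name, which is what allows later row selection by split.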

get(self, column, split)

Explicitly separate the validation splits. Uses self.label_columns to separate X and y columns inside the returned dataframe

Returns an empty dataframe for missing combinations of column and split

Parameters
  • column (Optional[str]) –

  • split (Optional[str]) –

Return type

pandas.DataFrame

get_feature_names(self)

Should return a list of the features in the dataset

Return type

List[str]

get_split(self, split)

Wrapper accessor to return a split object (for internal use)

Parameters

split (Optional[str]) –

Return type

simpleml.pipelines.validation_split_mixins.Split

get_split_names(self)

Helper to expose the splits contained in the dataset

Return type

List[str]

static merge_split(split)

Helper method to merge all dataframes in a split object into a single dataframe. Does a column-wise join

Example: df1 = [A, B, C] (4 rows) + df2 = [D, E, F] (4 rows) returns [A, B, C, D, E, F] (4 rows)

Parameters

split (simpleml.pipelines.validation_split_mixins.Split) –

Return type

pandas.DataFrame
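
The column-wise join in the example can be reproduced with pandas.concat along axis=1. This is only a sketch: a plain dict stands in for the Split object, and the keys and values are hypothetical.

```python
import pandas as pd

# A plain dict used as a stand-in for a Split object (assumption)
split = {
    "X": pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]}),
    "y": pd.DataFrame({"D": [0, 1, 0, 1], "E": [1, 1, 0, 0], "F": [0, 0, 1, 1]}),
}

# Column-wise join: rows are aligned on the index, columns are placed side by side
merged = pd.concat(list(split.values()), axis=1)
```

Because concat aligns on the index, the frames being joined need matching row indexes to end up with the same number of rows as the inputs.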

static squeeze_dataframe(df)

Helper method to run dataframe squeeze and return a series

Parameters

df (pandas.DataFrame) –

Return type

pandas.Series
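
For reference, this is how DataFrame.squeeze() behaves in plain pandas on hypothetical data. It illustrates the pandas call only, not this helper's exact handling of edge cases.

```python
import pandas as pd

# A single-column frame, e.g. one label column (hypothetical data)
single_label = pd.DataFrame({"label": [0, 1, 0]})
squeezed = single_label.squeeze()  # collapses the length-1 column axis

# A frame with several rows and columns has no length-1 axis to collapse
wide = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
unchanged = wide.squeeze()
```

Here squeezed is a Series named after the original column, while unchanged is still a DataFrame. This is the alignment that squeeze_return is meant to provide for libraries expecting a one-dimensional y.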

property y(self)

Return the target label columns

Return type

pandas.DataFrame
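
The relationship between X and y can be sketched with plain pandas. The label_columns value and the data are hypothetical, and excluding the split column from X is an assumption made for this illustration rather than documented behavior.

```python
import pandas as pd

SPLIT_COLUMN = "DATASET_SPLIT"  # value documented for DATAFRAME_SPLIT_COLUMN
label_columns = ["label"]  # hypothetical stand-in for self.label_columns

data = pd.DataFrame({
    "age": [31, 45, 27],
    "income": [40, 85, 32],
    "label": [0, 1, 0],
    SPLIT_COLUMN: ["TRAIN", "TRAIN", "TEST"],
})

# y: only the target label columns
y = data[label_columns]

# X: everything that is not a label (the split column is also dropped here)
X = data.drop(columns=label_columns + [SPLIT_COLUMN])
```

Both selections span every row regardless of split, mirroring the "across all potential splits" wording used for X.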