`simpleml.datasets.pandas`

Dataset Library support for Pandas

Submodules

Package Contents

Classes

`BasePandasDataset`	Pandas base class with control mechanism for self.dataframe of type pd.DataFrame
`PandasFileBasedDataset`	Pandas dataset class that generates the dataframe by reading in a file
`PandasPipelineDataset`	Pandas dataset class that generates the dataframe as the output of the linked pipeline

Attributes

__author__

simpleml.datasets.pandas.__author__ = Elisha Yadgaran

class simpleml.datasets.pandas.BasePandasDataset(squeeze_return=False, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset

Pandas base class with a control mechanism for self.dataframe of type pd.DataFrame

Parameters: squeeze_return (bool) – whether to run dataframe.squeeze() on the value returned from self.get() calls. Particularly necessary to align input types with other libraries (e.g. sklearn y with a single label)
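To illustrate what squeezing does to a returned dataframe, here is a minimal sketch using plain pandas only. The dataframes and variable names are made up for the example and are not part of SimpleML.

```python
import pandas as pd

# Hypothetical data: a single label column and a two-column feature frame
labels = pd.DataFrame({"label": [0, 1, 1, 0]})
features = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A single-column DataFrame collapses to a one-dimensional Series
squeezed_labels = labels.squeeze()

# A DataFrame with several rows and columns is returned unchanged
squeezed_features = features.squeeze()

# A 1x1 DataFrame collapses all the way to a scalar
squeezed_scalar = pd.DataFrame({"label": [1]}).squeeze()

print(type(squeezed_labels).__name__)
print(type(squeezed_features).__name__)
```

Note the last case: pandas squeezes a one-row, one-column dataframe to a scalar rather than a Series, which may matter when a split contains a single record.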

property X(self)

Return the subset that isn’t in the target labels (across all potential splits)

Return type: pandas.DataFrame

property _dataframe(self)

Override base behavior to return a copy of the data in case consumers attempt to mutate the data structure

Only copies the pandas container. Underlying cell objects (e.g. lists, dicts, objects) can still propagate in-place mutations

Return type: pandas.DataFrame
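The caveat about cell objects can be demonstrated with plain pandas. This sketch assumes the copy is made with `DataFrame.copy()`, which the source does not state explicitly; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical frame with one column of Python lists and one of integers
original = pd.DataFrame({"values": [[1, 2], [3, 4]], "n": [10, 20]})

# copy() duplicates the container, but Python objects stored in cells
# are copied by reference, not recursively
copied = original.copy()

# Reassigning a cell in the copy leaves the original untouched
copied.loc[0, "n"] = 99

# Mutating a list held in a cell is visible through both frames,
# because both cells point at the same list object
copied.loc[0, "values"].append(5)

print(original.loc[0, "n"])
print(original.loc[0, "values"])
```

Consumers that need full isolation for object-valued cells would have to copy those objects themselves, for example with `copy.deepcopy` applied per cell.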

static _get(dataframe, columns, split)

Internal method to extract data subsets from a dataframe

Parameters

dataframe (pandas.DataFrame) – the dataframe to subset from
columns (List[str]) – List of columns to slice from the dataframe
split (str) – row identifiers to slice rows (in internal column mapped to DATAFRAME_SPLIT_COLUMN)

Return type

pandas.DataFrame
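The row and column slicing described for `_get` can be sketched roughly as follows. This is not the library's implementation: `get_subset` and the `"split"` column name are hypothetical stand-ins, and the real value of DATAFRAME_SPLIT_COLUMN is not given in this page.

```python
import pandas as pd

# Hypothetical stand-in for DATAFRAME_SPLIT_COLUMN (actual value not shown here)
SPLIT_COLUMN = "split"


def get_subset(dataframe, columns, split):
    """Return the requested columns for rows belonging to the given split."""
    # Keep only rows whose split identifier matches
    rows = dataframe[dataframe[SPLIT_COLUMN] == split]
    # Then keep only the requested columns
    return rows[columns]


df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
        SPLIT_COLUMN: ["TRAIN", "TRAIN", "TEST"],
    }
)
train_a = get_subset(df, ["a"], "TRAIN")
```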

_validate_dtype(self, df)

Validating setter method for self._external_file. Checks that the input is of type pd.DataFrame

Parameters: df (pandas.DataFrame) –
Return type: None

static concatenate_dataframes(dataframes, split_names)

Helper method to merge dataframes into a single one with the split specified under DATAFRAME_SPLIT_COLUMN

Parameters

dataframes (List[pandas.DataFrame]) –
split_names (List[str]) –

Return type

pandas.DataFrame
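One way to picture this helper is a row-wise concatenation in which each input frame is first tagged with its split name. The sketch below is an assumption about the behavior, not the library code; `concatenate_with_split` and the `"split"` column name are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for DATAFRAME_SPLIT_COLUMN (actual value not shown here)
SPLIT_COLUMN = "split"


def concatenate_with_split(dataframes, split_names):
    """Stack dataframes row-wise, recording each frame's split name."""
    labeled = [
        # assign() returns a new frame, so the inputs are not modified
        df.assign(**{SPLIT_COLUMN: name})
        for df, name in zip(dataframes, split_names)
    ]
    return pd.concat(labeled, axis=0)


train = pd.DataFrame({"a": [1, 2]})
test = pd.DataFrame({"a": [3]})
combined = concatenate_with_split([train, test], ["TRAIN", "TEST"])
```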

get(self, column, split)

Explicitly split validation splits. Uses self.label_columns to separate x and y columns inside the returned dataframe

Returns an empty dataframe for missing combinations of column & split

Parameters

column (Optional[str]) –
split (Optional[str]) –

Return type

pandas.DataFrame

get_feature_names(self)

Should return a list of the features in the dataset

Return type: List[str]

get_split(self, split)

Wrapper accessor to return a split object (for internal use)

Parameters: split (Optional[str]) –
Return type: simpleml.pipelines.validation_split_mixins.Split

get_split_names(self)

Helper to expose the splits contained in the dataset

Return type: List[str]

static merge_split(split)

Helper method to merge all dataframes in a split object into a single df. Does a column-wise join, e.g. df1 = [A, B, C] (4 rows) + df2 = [D, E, F] (4 rows) returns [A, B, C, D, E, F] (4 rows)

Parameters: split (simpleml.pipelines.validation_split_mixins.Split) –
Return type: pandas.DataFrame
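The column-wise join in the docstring example can be reproduced with plain pandas. This sketch uses `pd.concat(..., axis=1)` as an assumed equivalent; the page does not say which pandas call `merge_split` uses internally, and the data is made up.

```python
import pandas as pd

# Two hypothetical frames with 4 rows each and disjoint columns
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
df2 = pd.DataFrame({"D": [0, 0, 1, 1], "E": [2, 2, 3, 3], "F": [4, 4, 5, 5]})

# axis=1 places the frames side by side, aligning rows on the index
merged = pd.concat([df1, df2], axis=1)

print(list(merged.columns))
print(merged.shape)
```

Because alignment is on the index, frames whose indexes differ would produce extra rows filled with missing values rather than a clean 4-row result.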

static squeeze_dataframe(df)

Helper method to run dataframe squeeze and return a series

Parameters: df (pandas.DataFrame) –
Return type: pandas.Series

property y(self)

Return the target label columns

Return type: pandas.DataFrame

class simpleml.datasets.pandas.PandasFileBasedDataset(filepath, format, reader_params=None, **kwargs)

Bases: simpleml.datasets.pandas.base.BasePandasDataset

Pandas dataset class that generates the dataframe by reading in a file

Parameters

squeeze_return (bool) – whether to run dataframe.squeeze() on the value returned from self.get() calls. Particularly necessary to align input types with other libraries (e.g. sklearn y with a single label)
filepath (str) –
format (str) –
reader_params (Optional[Dict]) –

build_dataframe(self)

Must set self._external_file. Can't be declared as an abstractmethod because of the database lookup dependency

Return type: None
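How `filepath`, `format`, and `reader_params` might fit together can be sketched as a lookup from format name to a pandas reader. This is an illustration under assumptions, not the library's code: `READERS` and `read_dataframe` are hypothetical names, and the set of formats SimpleML actually supports is not listed on this page.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mapping from format name to pandas reader function
READERS = {
    "csv": pd.read_csv,
    "json": pd.read_json,
    "parquet": pd.read_parquet,
}


def read_dataframe(filepath, format, reader_params=None):
    """Read a file into a DataFrame using the reader registered for `format`."""
    # `format` mirrors the documented parameter name (it shadows the builtin)
    if format not in READERS:
        raise ValueError(f"Unsupported format: {format}")
    # Extra keyword arguments are passed straight through to the pandas reader
    return READERS[format](filepath, **(reader_params or {}))


# Usage: write a small semicolon-separated file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    with open(path, "w") as handle:
        handle.write("a;b\n1;2\n3;4\n")
    loaded = read_dataframe(path, "csv", reader_params={"sep": ";"})
```

In the real class, the resulting dataframe would presumably be stored on self._external_file by build_dataframe, as the method description requires.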

class simpleml.datasets.pandas.PandasPipelineDataset(squeeze_return=False, **kwargs)

Bases: simpleml.datasets.pandas.base.BasePandasDataset

Pandas dataset class that generates the dataframe as the output of the linked pipeline

Parameters: squeeze_return (bool) – whether to run dataframe.squeeze() on the value returned from self.get() calls. Particularly necessary to align input types with other libraries (e.g. sklearn y with a single label)

build_dataframe(self)

Transform the raw dataset via the dataset pipeline to produce a production-ready dataset

Return type: None
