simpleml.datasets.base_dataset

Base Module for Datasets

Two use cases:
  1. Processed, or traditional, datasets. When the data is already clean and representative, the dataset can be used directly for modeling purposes.

  2. Otherwise, a raw dataset needs to be created first, with a dataset pipeline to transform it into the processed form (a short sketch of both paths follows).
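A minimal sketch of both paths, assuming a hypothetical concrete subclass (MyDataset, fleshed out further down alongside the abstract methods) and an already-built dataset pipeline; the exact call order and available subclasses depend on your SimpleML setup:

    # 1. Processed data: build and persist the dataset directly
    processed = MyDataset(label_columns=['label'])   # MyDataset is hypothetical
    processed.build_dataframe()
    processed.save()

    # 2. Raw data: register a dataset pipeline that transforms it into the processed form
    derived = MyDataset(label_columns=['label'])
    derived.add_pipeline(dataset_pipeline)           # dataset_pipeline assumed to exist already
    derived.build_dataframe()
    derived.save()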

Module Contents

Classes

Dataset

Base class for all Dataset objects.

Attributes

LOGGER

__author__

simpleml.datasets.base_dataset.LOGGER[source]
simpleml.datasets.base_dataset.__author__ = Elisha Yadgaran[source]
class simpleml.datasets.base_dataset.Dataset(has_external_files=True, label_columns=None, other_named_split_sections=None, pipeline_id=None, **kwargs)[source]

Bases: simpleml.persistables.base_persistable.Persistable

Base class for all Dataset objects.

Every dataset has a “dataframe” object associated with it that is responsible for housing the data. The term dataframe is a bit of a misnomer since it does not need to be a pandas.DataFrame object.

Each dataframe can be subdivided by inheriting classes and mixins to support numerous representations, e.g. a y column for supervised learning, train/test/validation splits, etc.

Datasets can be constructed from scratch or as derivatives of existing datasets. In the event of derivation, a pipeline must be specified to transform the original data.

No additional columns

param label_columns: Optional list of column names to register as the “y” split section

param other_named_split_sections: Optional map of section names to lists of column names for other arbitrary split sections; these must match expected consumer signatures (e.g. sample_weights) because they are passed through untouched downstream (e.g. sklearn.fit(**split))

All other columns in the dataframe will automatically be referenced as “X”.

Parameters
  • has_external_files (bool) –

  • label_columns (Optional[List[str]]) –

  • other_named_split_sections (Optional[Dict[str, List[str]]]) –

  • pipeline_id (Optional[Union[str, uuid.uuid4]]) –
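As a rough illustration of how these parameters map onto the dataframe columns, assume a hypothetical concrete subclass MyDataset whose dataframe holds the columns a, b, label and weight:

    dataset = MyDataset(
        label_columns=['label'],                                     # registered as the "y" section
        other_named_split_sections={'sample_weights': ['weight']},  # passed through untouched
    )

    dataset.label_columns                           # ['label']
    dataset.get_section_columns('sample_weights')   # ['weight']
    # every remaining column ('a', 'b') is implicitly part of the "X" section
    dataset.X                                       # feature columns only
    dataset.y                                       # label columns only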

object_type: str = DATASET[source]
property X(self)[source]

Return the subset that isn’t in the target labels

Return type

Any

property _dataframe(self)[source]

Separate property method wrapper for the external object. Allows mixins/subclasses to change the behavior of the accessor.

Return type

Any

_hash(self)[source]

Datasets rely on external data, so instead of hashing only the config, hash the actual resulting dataframe. This requires loading the data before determining duplication, so overwrite for differing behavior.

Technically there is little reason to hash anything besides the dataframe, but a design choice was made to separate the representation of the data from the use cases. For example: two dataset objects with different configured labels will yield different downstream results, even if the data is identical. Similarly with the pipeline, maintaining the back reference to the originating source is preferred, even if the final data can be made through different transformations.

Hash is the combination of the:
  1. Dataframe

  2. Config

  3. Pipeline

Return type

str
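The docstring describes what feeds the hash, not how it is computed. The snippet below is a conceptual illustration only (not the library's implementation) of why two datasets over identical data can still hash differently once the config and pipeline are folded in:

    import hashlib
    import json

    def illustrative_dataset_hash(dataframe_bytes: bytes, config: dict, pipeline_hash: str) -> str:
        """Conceptual only: combine the three components named above."""
        payload = (
            dataframe_bytes
            + json.dumps(config, sort_keys=True).encode()
            + pipeline_hash.encode()
        )
        return hashlib.md5(payload).hexdigest()

    data = b'a,b,label\n1,2,0\n'
    h1 = illustrative_dataset_hash(data, {'label_columns': ['label']}, 'pipeline-123')
    h2 = illustrative_dataset_hash(data, {'label_columns': ['b']}, 'pipeline-123')
    assert h1 != h2   # same data, different configured labels -> different hashes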

_load_pipeline(self)[source]

Helper to fetch the pipeline

_validate_data(self, df)[source]

Hook to validate the contents of the data

Parameters

df (Any) –

Return type

None

_validate_dtype(self, df)[source]

Hook to validate the types of the data

Parameters

df (Any) –

Return type

None

_validate_schema(self, df)[source]

Hook to validate the schema of the data (columns/sections)

Parameters

df (Any) –

Return type

None

_validate_state(self, df)[source]

Hook to validate the persistable state before allowing modification

Parameters

df (Any) –

Return type

None

add_pipeline(self, pipeline)[source]

Setter method for dataset pipeline used

Parameters

pipeline (simpleml.pipelines.base_pipeline.Pipeline) –

Return type

None

abstract build_dataframe(self)[source]

Must set self._external_file. Can't set as an abstractmethod because of a database lookup dependency.

property dataframe(self)[source]

Property wrapper to retrieve the external object associated with the dataset. Automatically checks for unloaded artifacts and loads, if necessary. Will attempt to create a new dataframe via self.build_dataframe() if the external object has not already been created.

Return type

Any
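A minimal sketch (not from the library) of the build_dataframe contract, assuming a pandas-backed dataframe; the subclass name and CSV path are hypothetical, and whether instantiation works as-is depends on your SimpleML/database configuration:

    import pandas as pd

    from simpleml.datasets.base_dataset import Dataset

    class MyCsvDataset(Dataset):
        def build_dataframe(self) -> None:
            # the contract is simply to populate self._external_file
            self._external_file = pd.read_csv('data/train.csv')   # hypothetical path

    dataset = MyCsvDataset(label_columns=['label'])
    df = dataset.dataframe   # loads or builds the external object on first access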

abstract get(self, column, split)[source]

Unimplemented method to explicitly split X and y. Must be implemented by subclasses.

Parameters
  • column (str) –

  • split (str) –

Return type

Any

abstract get_feature_names(self)[source]

Should return a list of the features in the dataset

Return type

List[str]

get_section_columns(self, section)[source]

Helper accessor for column names in the split_section_map

Parameters

section (str) –

Return type

List[str]

abstract get_split(self, split)[source]

Unimplemented method to return a Split object.

Differs from the main get method by wrapping with an internal interface class (Split). Agnostic to implementation library and compatible with downstream SimpleML consumers (pipelines, models)

Parameters

split (str) –

Return type

simpleml.datasets.dataset_splits.Split

abstract get_split_names(self)[source]

Unimplemented method to return the split names available for the dataset

Return type

List[str]
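Pulling the abstract methods together, a rough subclass sketch (assumptions: a pandas-backed in-memory dataframe, a single split exposed as 'TRAIN', and that Split accepts section names as keyword arguments; verify against your SimpleML version):

    from typing import Any, List

    import pandas as pd

    from simpleml.datasets.base_dataset import Dataset
    from simpleml.datasets.dataset_splits import Split

    class InMemoryDataset(Dataset):
        """Hypothetical concrete dataset over an in-memory pandas DataFrame."""

        def build_dataframe(self) -> None:
            self._external_file = pd.DataFrame(
                {'a': [1, 2], 'b': [3, 4], 'label': [0, 1]}
            )

        def get(self, column: str, split: str) -> Any:
            # 'y' -> the registered label columns, 'X' -> everything else
            if column == 'y':
                return self.dataframe[self.label_columns]
            return self.dataframe.drop(columns=self.label_columns)

        def get_feature_names(self) -> List[str]:
            return [c for c in self.dataframe.columns if c not in self.label_columns]

        def get_split(self, split: str) -> Split:
            return Split(X=self.get('X', split), y=self.get('y', split))

        def get_split_names(self) -> List[str]:
            return ['TRAIN']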

property label_columns(self)[source]

Keep column list for labels in metadata to persist through saving

Return type

List[str]

property pipeline(self)[source]

Use a weakref to bind the linked pipeline so it doesn't bloat usage. Returns the pipeline if still available, or tries to fetch it otherwise.

save(self, **kwargs)[source]

Extend parent function with a few additional save routines

Return type

None

property y(self)[source]

Return the target label columns

Return type

Any