simpleml.datasets.base_dataset
Base Module for Datasets
Two use cases:
- Processed, or traditional datasets. In situations of clean, representative data, this can be used directly for modeling purposes.
- Otherwise, a raw dataset needs to be created first with a dataset pipeline to transform it into the processed form.
Module Contents
Classes
AbstractDataset | Abstract Base class for all Dataset objects.
Dataset | Base class for all Dataset objects.
class simpleml.datasets.base_dataset.AbstractDataset(has_external_files=True, label_columns=None, **kwargs)
Bases: future.utils.with_metaclass()
Abstract Base class for all Dataset objects.
Every dataset has a “dataframe” object associated with it that is responsible for housing the data. The term dataframe is a bit of a misnomer since it does not need to be a pandas.DataFrame object.
Each dataframe can be subdivided by inheriting classes and mixins to support numerous representations, for example:
- a y column for supervised learning
- train/test/validation splits
- …
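As a rough illustration of the first representation, the sketch below separates label columns from feature columns. It is a standalone example, not SimpleML code: the function name `split_features_and_labels` and the sample rows are hypothetical, and the "dataframe" is a plain list of dictionaries, in keeping with the note that it need not be a pandas.DataFrame.

```python
def split_features_and_labels(rows, label_columns):
    """Return (X, y): each row split into feature and label dictionaries."""
    labels = set(label_columns)
    # Everything not named as a label is treated as a feature
    X = [{k: v for k, v in row.items() if k not in labels} for row in rows]
    y = [{k: v for k, v in row.items() if k in labels} for row in rows]
    return X, y


# Hypothetical sample data with one label column
rows = [
    {"age": 31, "income": 52000, "churned": 0},
    {"age": 45, "income": 61000, "churned": 1},
]
X, y = split_features_and_labels(rows, label_columns=["churned"])
```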
Datasets can be constructed from scratch or as derivatives of existing datasets. In the event of derivation, a pipeline must be specified to transform the original data.
No additional columns
_hash(self)
Datasets rely on external data, so instead of hashing only the config, hash the actual resulting dataframe. This requires loading the data before determining duplication, so override this method if different behavior is needed.
Technically there is little reason to hash anything besides the dataframe, but a design choice was made to separate the representation of the data from the use cases. For example: two dataset objects with different configured labels will yield different downstream results, even if the data is identical. Similarly with the pipeline, maintaining the back reference to the originating source is preferred, even if the final data can be made through different transformations.
Hash is the combination of the:
- Dataframe
- Config
- Pipeline
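The idea of combining the three components can be sketched as follows. This is a minimal standalone illustration, not the hashing code SimpleML actually uses: the helper names, the SHA-256/JSON approach, and the sample data are all assumptions made for the example.

```python
import hashlib
import json


def _stable_digest(obj):
    """Hash any JSON-serializable object deterministically (hypothetical helper)."""
    payload = json.dumps(obj, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def combined_dataset_hash(dataframe_rows, config, pipeline_hash=None):
    """Combine the data, the config, and the pipeline reference into one hash."""
    parts = {
        "dataframe": _stable_digest(dataframe_rows),
        "config": _stable_digest(config),
        "pipeline": pipeline_hash,  # None when the dataset has no source pipeline
    }
    return _stable_digest(parts)


# Identical data with different configured labels yields different hashes
rows = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
hash_a = combined_dataset_hash(rows, {"label_columns": ["y"]})
hash_b = combined_dataset_hash(rows, {"label_columns": ["x"]})
hash_c = combined_dataset_hash(rows, {"label_columns": ["y"]})
```

Here `hash_a` and `hash_c` match because every component is the same, while `hash_b` differs even though the underlying rows are identical, which mirrors the design choice described above.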
abstract build_dataframe(self)
Must set self._external_file. Can't be set as an abstractmethod because of the database lookup dependency.
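A subclass is expected to supply this method. The sketch below shows the general pattern using stand-in classes rather than the real SimpleML ones: `SketchAbstractDataset` and `InMemoryDataset` are hypothetical names, and raising NotImplementedError is an assumed way of marking the method abstract without the abstractmethod decorator.

```python
class SketchAbstractDataset:
    """Stand-in for an abstract dataset base class (hypothetical, not SimpleML)."""

    def __init__(self, has_external_files=True, label_columns=None, **kwargs):
        self.has_external_files = has_external_files
        self.label_columns = label_columns or []
        self._external_file = None  # populated by build_dataframe

    def build_dataframe(self):
        # Abstract by convention: subclasses must override this method
        raise NotImplementedError


class InMemoryDataset(SketchAbstractDataset):
    """Hypothetical subclass that builds its data from literal rows."""

    def build_dataframe(self):
        # The one requirement from the docs: set self._external_file
        self._external_file = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]


dataset = InMemoryDataset(label_columns=["y"])
dataset.build_dataframe()
```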
class simpleml.datasets.base_dataset.Dataset(has_external_files=True, label_columns=None, **kwargs)
Bases: simpleml.datasets.base_dataset.AbstractDataset
Base class for all Dataset objects.
pipeline_id: foreign key relation to the dataset pipeline used as input