simpleml.datasets.base_dataset module
Base Module for Datasets
Two use cases:

- Processed, or traditional, datasets: in situations of clean, representative data, these can be used directly for modeling purposes.
- Otherwise, a raw dataset needs to be created first, with a dataset pipeline to transform it into the processed form.
class simpleml.datasets.base_dataset.AbstractDataset(has_external_files=True, label_columns=[], **kwargs)[source]

Bases: simpleml.persistables.base_persistable.Persistable, simpleml.persistables.saving.AllSaveMixin
Abstract Base class for all Dataset objects.

Every dataset has a "dataframe" object associated with it that is responsible for housing the data. The term dataframe is a bit of a misnomer since it does not need to be a pandas.DataFrame object.

Each dataframe can be subdivided by inheriting classes and mixins to support numerous representations, e.g.:

- a y column for supervised learning
- train/test/validation splits
- ...

Datasets can be constructed from scratch or as derivatives of existing datasets. In the event of derivation, a pipeline must be specified to transform the original data.

Schema: no additional columns.
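As a sketch of the contract described above, a concrete dataset is responsible for populating its internal data object via `build_dataframe`. The class and helper names below are hypothetical illustrations, not the actual SimpleML implementation; note the "dataframe" here is a plain list of dicts, underscoring that it need not be a pandas.DataFrame:

```python
# Hypothetical sketch of the AbstractDataset contract; names and
# behavior are illustrative only, not the actual SimpleML code.
class InMemoryDataset:
    def __init__(self, records, has_external_files=True, label_columns=None):
        self.records = records
        self.has_external_files = has_external_files
        self.label_columns = label_columns or []
        self._external_file = None  # build_dataframe is responsible for setting this

    def build_dataframe(self):
        # Must set self._external_file; here the "dataframe" is a list of
        # dicts, since it does not need to be a pandas.DataFrame
        self._external_file = list(self.records)

    @property
    def dataframe(self):
        # Lazily build on first access
        if self._external_file is None:
            self.build_dataframe()
        return self._external_file

    def get_labels(self):
        # Project out the label (y) columns from each row
        return [{k: row[k] for k in self.label_columns} for row in self.dataframe]
```

Usage: `InMemoryDataset([{'x': 1, 'y': 0}], label_columns=['y']).get_labels()` returns only the `y` column, mirroring the supervised-learning subdivision mentioned above.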
build_dataframe()[source]
Must set self._external_file. Can't be set as an abstractmethod because of a database lookup dependency.
dataframe

label_columns
Keep the column list for labels in metadata to persist through saving.
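The persistence idea behind `label_columns` can be illustrated with a minimal round trip: store the column list in the metadata mapping so it survives serialization. This is a hypothetical sketch; the actual SimpleML save mechanism (via AllSaveMixin and the database) differs:

```python
import json

# Hypothetical illustration of persisting label_columns through metadata;
# the real SimpleML save path is more involved.
class DatasetRecord:
    def __init__(self, label_columns):
        # Keep the column list inside metadata_ so it survives serialization
        self.metadata_ = {'label_columns': list(label_columns)}

    @property
    def label_columns(self):
        return self.metadata_['label_columns']

    def dumps(self):
        return json.dumps(self.metadata_)

    @classmethod
    def loads(cls, payload):
        obj = cls.__new__(cls)
        obj.metadata_ = json.loads(payload)
        return obj
```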
object_type = 'DATASET'
class simpleml.datasets.base_dataset.Dataset(has_external_files=True, label_columns=[], **kwargs)[source]

Bases: simpleml.datasets.base_dataset.AbstractDataset

Base class for all Dataset objects.

pipeline_id: foreign key relation to the dataset pipeline used as input.
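The derivation pattern (a processed dataset built from raw data via a pipeline, recording the pipeline's id as its input reference) can be sketched as follows. All names here are hypothetical stand-ins, not the actual SimpleML API:

```python
# Hypothetical sketch of dataset derivation: a processed dataset is built
# by running a pipeline over raw data and records the pipeline's id.
class DatasetPipeline:
    def __init__(self, pipeline_id, transform):
        self.id = pipeline_id
        self.transform = transform  # callable applied row by row

class ProcessedDataset:
    def __init__(self, raw_rows, pipeline):
        self.pipeline = pipeline
        self.pipeline_id = pipeline.id  # foreign-key-style link to the input pipeline
        self.raw_rows = raw_rows
        self._external_file = None

    def build_dataframe(self):
        # The processed form is the pipeline's transform applied to the raw data
        self._external_file = [self.pipeline.transform(row) for row in self.raw_rows]
        return self._external_file
```

The design point is that the processed dataset never stores its own copy of the transformation logic; reproducibility comes from the link back to the pipeline that produced it.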
created_timestamp
filepaths
has_external_files
hash_
id
metadata_
modified_timestamp
name
pipeline
pipeline_id
project
registered_name
version
version_description