simpleml.datasets.base_dataset module

Base Module for Datasets

Two use cases:
  1. Processed, or traditional datasets. In situations of clean, representative data, this can be used directly for modeling purposes.
  2. Otherwise, a raw dataset needs to be created first with a dataset pipeline to transform it into the processed form.

class simpleml.datasets.base_dataset.AbstractDataset(has_external_files=True, label_columns=[], **kwargs)[source]

Bases: simpleml.persistables.base_persistable.Persistable, simpleml.persistables.saving.AllSaveMixin

Abstract Base class for all Dataset objects.

Every dataset has a “dataframe” object associated with it that is responsible for housing the data. The term dataframe is a bit of a misnomer since it does not need to be a pandas.DataFrame object.

Each dataframe can be subdivided by inheriting classes and mixins to support numerous representations, e.g. a y column for supervised learning, train/test/validation splits, …
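As a rough illustration of subdividing by label columns, the sketch below splits a dataframe into feature rows and label rows. It uses a plain list of dicts as the dataframe, in line with the note that it need not be a pandas.DataFrame. The `split_xy` helper and the sample column names are hypothetical and are not part of SimpleML.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_xy(rows: List[Row], label_columns: List[str]) -> Tuple[List[Row], List[Row]]:
    """Split rows into feature rows (X) and label rows (y).

    Hypothetical helper for illustration only; not a SimpleML function.
    """
    labels = set(label_columns)
    # Features: every column that is not listed as a label.
    X = [{k: v for k, v in row.items() if k not in labels} for row in rows]
    # Labels: only the columns listed as labels.
    y = [{k: v for k, v in row.items() if k in labels} for row in rows]
    return X, y


# Made-up sample data.
rows = [
    {"age": 31, "income": 52000, "churned": 0},
    {"age": 45, "income": 61000, "churned": 1},
]
X, y = split_xy(rows, label_columns=["churned"])
```

In this sketch the label column list is the only thing that distinguishes X from y, so the same stored data can serve both views.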

Datasets can be constructed from scratch or as derivatives of existing datasets. In the event of derivation, a pipeline must be specified to transform the original data.

Schema: no additional columns.

add_pipeline(pipeline)[source]

Setter method for dataset pipeline used

build_dataframe()[source]

Must set self._external_file. Can't be declared as an abstractmethod because of the database lookup dependency.
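The pattern described here can be sketched without SimpleML: the base class defines `build_dataframe` as a plain method that raises `NotImplementedError`, and subclasses override it to populate `self._external_file`. Everything below is a hypothetical stand-in. The lazy `dataframe` property and the reading that the base class must stay instantiable for database lookups are assumptions made for this sketch, not confirmed library behavior.

```python
class SketchDataset:
    """Hypothetical stand-in for a dataset base class; not SimpleML code."""

    def __init__(self):
        self._external_file = None

    def build_dataframe(self):
        # Plain method instead of @abstractmethod, so the class itself can
        # still be instantiated (assumed here to matter for database lookups).
        raise NotImplementedError("Subclasses must set self._external_file")

    @property
    def dataframe(self):
        # Assumption for this sketch: build lazily on first access.
        if self._external_file is None:
            self.build_dataframe()
        return self._external_file


class InMemoryDataset(SketchDataset):
    """Hypothetical subclass that fulfils the contract."""

    def build_dataframe(self):
        # The one requirement from the docstring: set self._external_file.
        self._external_file = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]


dataset = InMemoryDataset()
```

Accessing `dataset.dataframe` triggers `build_dataframe` once; calling `SketchDataset().build_dataframe()` directly raises `NotImplementedError`.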

dataframe
label_columns

Keep column list for labels in metadata to persist through saving
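One way to picture why the label list lives in metadata: if the property reads from a serializable metadata dict, the labels survive a save and reload of that dict. The sketch below is a minimal stand-in using a JSON round trip. The class, the `"label_columns"` key, and the use of JSON are assumptions for illustration and do not describe how SimpleML actually stores metadata.

```python
import json


class SketchDataset:
    """Hypothetical stand-in; only mimics a metadata-backed property."""

    def __init__(self, label_columns=None):
        # Store a copy of the label list inside the metadata dict so it is
        # saved together with everything else in the metadata.
        self.metadata_ = {"label_columns": list(label_columns or [])}

    @property
    def label_columns(self):
        # Read back from metadata rather than from a separate attribute.
        return self.metadata_.get("label_columns", [])


original = SketchDataset(label_columns=["churned"])
serialized = json.dumps(original.metadata_)  # stand-in for "saving"

restored = SketchDataset()
restored.metadata_ = json.loads(serialized)  # stand-in for "loading"
```

After the round trip, `restored.label_columns` returns the same list that was passed to `original`.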

load(**kwargs)[source]

Extend main load routine to load relationship class

object_type = 'DATASET'
save(**kwargs)[source]

Extend parent function with a few additional save routines

class simpleml.datasets.base_dataset.Dataset(has_external_files=True, label_columns=[], **kwargs)[source]

Bases: simpleml.datasets.base_dataset.AbstractDataset

Base class for all Dataset objects.

pipeline_id: foreign key relation to the dataset pipeline used as input

author
created_timestamp
filepaths
has_external_files
hash_
id
metadata_
modified_timestamp
name
pipeline
pipeline_id
project
registered_name
version
version_description