simpleml.datasets
Import modules to register class names in global registry
Package Contents
Classes
Dataset: Base class for all Dataset objects.
Attributes
- exception simpleml.datasets.DatasetError(*args, **kwargs)[source]
Bases:
SimpleMLError
Common base class for all non-exit exceptions.
Initialize self. See help(type(self)) for accurate signature.
- class simpleml.datasets.Dataset(has_external_files=True, label_columns=None, other_named_split_sections=None, pipeline_id=None, **kwargs)[source]
Bases:
simpleml.persistables.base_persistable.Persistable
Base class for all Dataset objects.
Every dataset has a “dataframe” object associated with it that is responsible for housing the data. The term dataframe is a bit of a misnomer since it does not need to be a pandas.DataFrame object.
Each dataframe can be subdivided by inheriting classes and mixins to support numerous representations, e.g.:
y column for supervised
train/test/validation splits
…
Datasets can be constructed from scratch or as derivatives of existing datasets. In the event of derivation, a pipeline must be specified to transform the original data.
No additional columns
param label_columns: Optional list of column names to register as the “y” split section.
param other_named_split_sections: Optional map of section names to lists of column names for other arbitrary split columns. These must match the expected consumer signatures (e.g. sample_weights) because they are passed through untouched downstream (e.g. sklearn.fit(**split)).
All other columns in the dataframe will automatically be referenced as “X”.
- Parameters
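The column-to-section rule described above can be illustrated with a minimal sketch. It does not use SimpleML itself: `build_section_map` is a hypothetical helper written only for this example, and the column names are made up. It assumes labels go to "y", each named section keeps its listed columns, and whatever remains becomes "X".

```python
from typing import Dict, List, Optional


def build_section_map(
    all_columns: List[str],
    label_columns: Optional[List[str]] = None,
    other_named_split_sections: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper (not part of SimpleML): map section names to columns."""
    # Label columns are registered as the "y" section.
    sections: Dict[str, List[str]] = {"y": list(label_columns or [])}
    # Any other named sections are copied over under their own names.
    for name, columns in (other_named_split_sections or {}).items():
        sections[name] = list(columns)
    # Every column not claimed by a named section falls through to "X".
    claimed = {col for cols in sections.values() for col in cols}
    sections["X"] = [col for col in all_columns if col not in claimed]
    return sections


section_map = build_section_map(
    all_columns=["age", "income", "weight", "churned"],
    label_columns=["churned"],
    other_named_split_sections={"sample_weights": ["weight"]},
)
```

Here `section_map["X"]` holds only `age` and `income`, because `churned` and `weight` were claimed by the "y" and "sample_weights" sections.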
- object_type :str = DATASET
- property X(self)
Return the subset that isn’t in the target labels
- Return type
Any
- property _dataframe(self)
Separate property method wrapper for the external object. Allows mixins/subclasses to change the behavior of the accessor.
- Return type
Any
- _hash(self)
Datasets rely on external data, so instead of hashing only the config, hash the actual resulting dataframe. This requires loading the data before determining duplication, so override this method if different behavior is needed.
Technically there is little reason to hash anything besides the dataframe, but a design choice was made to separate the representation of the data from the use cases. For example: two dataset objects with different configured labels will yield different downstream results, even if the data is identical. Similarly with the pipeline, maintaining the back reference to the originating source is preferred, even if the final data can be made through different transformations.
- Hash is the combination of the:
Dataframe
Config
Pipeline
- Return type
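To make the three-part hash concrete, here is a minimal sketch of combining a dataframe, a config, and a pipeline reference into one digest. SimpleML's actual hashing routine is not shown in this page, so everything here is an assumption: `combined_hash` is a hypothetical function, SHA-256 is a stand-in algorithm, and the dataframe is assumed to be already serialized to bytes.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def combined_hash(
    dataframe_bytes: bytes,
    config: Dict[str, Any],
    pipeline_hash: Optional[str],
) -> str:
    """Hypothetical sketch: fold dataframe, config, and pipeline into one digest."""
    digest = hashlib.sha256()
    # 1. The data itself (assumed to be serialized deterministically).
    digest.update(dataframe_bytes)
    # 2. The config, dumped with sorted keys so key order does not matter.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    # 3. The originating pipeline, or an empty marker when there is none.
    digest.update((pipeline_hash or "").encode("utf-8"))
    return digest.hexdigest()


data = b"age,churned\n31,0\n47,1\n"
hash_a = combined_hash(data, {"label_columns": ["churned"]}, "pipeline-1")
hash_b = combined_hash(data, {"label_columns": ["age"]}, "pipeline-1")
```

The two calls share identical data but differ in configured labels, so `hash_a` and `hash_b` differ. This mirrors the design choice described above: the same data with a different use case is treated as a different dataset.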
- _load_pipeline(self)
Helper to fetch the pipeline
- _validate_data(self, df)
Hook to validate the contents of the data
- Parameters
df (Any) –
- Return type
None
- _validate_dtype(self, df)
Hook to validate the types of the data
- Parameters
df (Any) –
- Return type
None
- _validate_schema(self, df)
Hook to validate the schema of the data (columns/sections)
- Parameters
df (Any) –
- Return type
None
- _validate_state(self, df)
Hook to validate the persistable state before allowing modification
- Parameters
df (Any) –
- Return type
None
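The validation hooks above are natural extension points for subclasses. The following sketch shows one way two of them might be overridden. It is an illustration under assumptions, not SimpleML code: `SketchDataset` and `DatasetError` are local stand-ins, the `validate` driver is hypothetical (this page does not say when the hooks are invoked), and a plain dict of lists stands in for the dataframe.

```python
from typing import Any, Dict, List


class DatasetError(Exception):
    """Local stand-in for simpleml.datasets.DatasetError."""


class SketchDataset:
    """Stand-in class for illustration; not the real Dataset."""

    def __init__(self, label_columns: List[str]) -> None:
        self.label_columns = label_columns

    def _validate_schema(self, df: Dict[str, List[Any]]) -> None:
        # Schema check: every configured label column must be present.
        missing = [col for col in self.label_columns if col not in df]
        if missing:
            raise DatasetError(f"Missing label columns: {missing}")

    def _validate_data(self, df: Dict[str, List[Any]]) -> None:
        # Content check: all columns must have the same number of rows.
        lengths = {len(values) for values in df.values()}
        if len(lengths) > 1:
            raise DatasetError("Columns have inconsistent lengths")

    def validate(self, df: Dict[str, List[Any]]) -> None:
        # Hypothetical driver that runs the hooks in sequence.
        self._validate_schema(df)
        self._validate_data(df)


dataset = SketchDataset(label_columns=["churned"])
dataset.validate({"age": [31, 47], "churned": [0, 1]})  # passes silently
```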
- add_pipeline(self, pipeline)
Setter method for dataset pipeline used
- Parameters
pipeline (simpleml.pipelines.base_pipeline.Pipeline) –
- Return type
None
- abstract build_dataframe(self)
Must set self._external_file. Can't be set as an abstractmethod because of the database lookup dependency.
- property dataframe(self)
Property wrapper to retrieve the external object associated with the dataset. Automatically checks for unloaded artifacts and loads them, if necessary. If the external object has not been created yet, attempts to create a new dataframe via self.build_dataframe().
- Return type
Any
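The build-on-first-access behavior of the `dataframe` property can be sketched as a lazy property. This is a simplified stand-in, not the library's implementation: `LazyDataset` and its `build_calls` counter are hypothetical, and the step that loads previously saved artifacts is omitted.

```python
from typing import Any, Optional


class LazyDataset:
    """Stand-in class showing a lazily built dataframe."""

    def __init__(self) -> None:
        self._external_file: Optional[Any] = None
        self.build_calls = 0  # counter added only to make the laziness visible

    def build_dataframe(self) -> None:
        # A real subclass would query or compute its data here.
        self.build_calls += 1
        self._external_file = {"age": [31, 47], "churned": [0, 1]}

    @property
    def dataframe(self) -> Any:
        # Build only when nothing has been created yet.
        if self._external_file is None:
            self.build_dataframe()
        return self._external_file


lazy = LazyDataset()
first = lazy.dataframe
second = lazy.dataframe
```

In this sketch, accessing `dataframe` twice triggers `build_dataframe` only once, and both accesses return the same object.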
- abstract get(self, column, split)
Unimplemented method to explicitly split X and y. Must be implemented by subclasses.
- abstract get_feature_names(self)
Should return a list of the features in the dataset
- Return type
List[str]
- get_section_columns(self, section)
Helper accessor for column names in the split_section_map
- abstract get_split(self, split)
Unimplemented method to return a Split object.
Differs from the main get method by wrapping with an internal interface class (Split). Agnostic to implementation library and compatible with downstream SimpleML consumers (pipelines, models)
- Parameters
split (str) –
- Return type
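As a rough illustration of why a library-agnostic wrapper is convenient for downstream consumers, the sketch below uses a dict-like container whose sections can be unpacked as keyword arguments. `SplitSketch` and `fit` are hypothetical stand-ins written for this example; the real Split class may differ in its interface.

```python
from typing import Any, List, Optional


class SplitSketch(dict):
    """Hypothetical stand-in for the Split interface class."""

    def __getattr__(self, name: str) -> Any:
        # Allow attribute-style access to sections, e.g. split.X.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def fit(
    X: List[Any],
    y: List[Any],
    sample_weights: Optional[List[float]] = None,
) -> int:
    """Toy consumer that takes sections as keyword arguments."""
    return len(X)


train = SplitSketch(X=[[31], [47]], y=[0, 1])
rows_seen = fit(**train)  # sections are passed through by name
```

Because the section names become keyword arguments, they have to match the consumer's signature, which is the same constraint noted for other_named_split_sections.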
- abstract get_split_names(self)
Unimplemented method to return the split names available for the dataset
- Return type
List[str]
- property label_columns(self)
Keep column list for labels in metadata to persist through saving
- Return type
List[str]
- property pipeline(self)
Use a weakref to bind the linked pipeline so it doesn't bloat usage. Returns the pipeline if it is still available, or tries to fetch it otherwise.
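The weak-reference pattern can be sketched with the standard library's `weakref` module. The classes below are stand-ins, not SimpleML's: `Pipeline` is a placeholder, and `_load_pipeline` here simply returns None where the real helper would presumably look the pipeline up.

```python
import weakref
from typing import Optional


class Pipeline:
    """Placeholder for a pipeline object."""

    def __init__(self, name: str) -> None:
        self.name = name


class DatasetSketch:
    """Stand-in dataset that holds its pipeline through a weak reference."""

    def __init__(self) -> None:
        self._pipeline_ref: Optional["weakref.ReferenceType[Pipeline]"] = None

    def add_pipeline(self, pipeline: Pipeline) -> None:
        # A weak reference does not keep the pipeline alive on its own.
        self._pipeline_ref = weakref.ref(pipeline)

    def _load_pipeline(self) -> Optional[Pipeline]:
        # Assumption: a real implementation would fetch the pipeline here.
        return None

    @property
    def pipeline(self) -> Optional[Pipeline]:
        # Dereference the weakref; fall back to fetching if it is gone.
        pipeline = self._pipeline_ref() if self._pipeline_ref is not None else None
        if pipeline is None:
            pipeline = self._load_pipeline()
        return pipeline


scaler = Pipeline("scaler")
sketch = DatasetSketch()
sketch.add_pipeline(scaler)
linked = sketch.pipeline
```

While `scaler` is still referenced elsewhere, `sketch.pipeline` returns that same object. Once it has been garbage collected, the property falls back to `_load_pipeline`.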
- save(self, **kwargs)
Extend parent function with a few additional save routines
- Return type
None
- property y(self)
Return the target label columns
- Return type
Any