simpleml.datasets.numpy

Dataset library support for NumPy

Package Contents

Classes

BaseNumpyDataset

Assumes self.dataframe is a dictionary of numpy ndarrays

NumpyPipelineDataset

Dataset class with a predefined build routine, assuming dataset pipeline existence.

Attributes

__author__

simpleml.datasets.numpy.__author__ = Elisha Yadgaran

class simpleml.datasets.numpy.BaseNumpyDataset(*args, **kwargs)

Bases: simpleml.datasets.base_dataset.Dataset

Assumes self.dataframe is a dictionary of numpy ndarrays

Parameters
  • label_columns – Optional list of column names to register as the “y” split section

  • other_named_split_sections – Optional map of section names to lists of column names for other arbitrary split columns. These must match the expected consumer signatures (e.g. sample_weights) because they are passed through untouched downstream (e.g. sklearn.fit(**split))

All other columns in the dataframe will automatically be referenced as “X”
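The column-to-section mapping described above can be illustrated with plain NumPy. This is only a minimal sketch under stated assumptions, not SimpleML's implementation: split_sections is a hypothetical helper, and the column and section names are invented for the example.

```python
import numpy as np


def split_sections(data, label_columns=None, other_named_split_sections=None):
    """Hypothetical helper: group a dict of ndarrays into named sections.

    Columns in label_columns go to "y", columns listed in
    other_named_split_sections go to their named section, and every
    remaining column goes to "X".
    """
    label_columns = list(label_columns or [])
    other = dict(other_named_split_sections or {})

    # Collect every column that has been explicitly assigned to a section.
    claimed = set(label_columns)
    for columns in other.values():
        claimed.update(columns)

    sections = {
        "X": {name: arr for name, arr in data.items() if name not in claimed},
        "y": {name: data[name] for name in label_columns},
    }
    for section, columns in other.items():
        sections[section] = {name: data[name] for name in columns}
    return sections


# Illustrative data only: a dictionary of equally sized ndarrays.
data = {
    "age": np.array([31, 45, 27]),
    "income": np.array([52.0, 88.5, 41.2]),
    "label": np.array([0, 1, 0]),
    "weight": np.array([1.0, 0.5, 2.0]),
}

sections = split_sections(
    data,
    label_columns=["label"],
    other_named_split_sections={"sample_weight": ["weight"]},
)
```

Here "age" and "income" end up under "X" because no section claims them. The extra section is named to match a keyword that a downstream consumer would accept, since the docstring notes that these sections are passed through untouched.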

property X(self)

Return the subset that isn’t in the target labels

Return type

numpy.ndarray

get(self, column, split)

Explicitly split validation splits. Assumes self.dataframe has a get method that returns a dictionary of {‘X’: X, ‘y’: y}. Uses self.label_columns if y is named something else; only the first entry in the list is considered.

Returns None for any combination of column/split that isn’t present.

Parameters
  • column (str) –

  • split (str) –

Return type

numpy.ndarray
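The lookup behaviour described for get, where any missing column/split combination yields None, might look roughly like the following. This is a sketch only: get_section and the split names are hypothetical, and the nested dictionary stands in for whatever structure the dataset actually holds.

```python
import numpy as np


def get_section(splits, column, split):
    """Hypothetical stand-in for the lookup described above.

    Returns the requested array, or None when either the split or the
    column within that split is absent.
    """
    return splits.get(split, {}).get(column)


# Illustrative splits; the names "TRAIN" and "TEST" are assumptions.
splits = {
    "TRAIN": {"X": np.array([[1.0, 2.0], [3.0, 4.0]]), "y": np.array([0, 1])},
    "TEST": {"X": np.array([[5.0, 6.0]])},
}

train_X = get_section(splits, "X", "TRAIN")            # present
missing_y = get_section(splits, "y", "TEST")           # column absent in split
missing_split = get_section(splits, "X", "VALIDATION")  # split absent
```

Callers that rely on this convention need to check for None before using the result, since no exception signals the missing data.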

get_feature_names(self)

Should return a list of the features in the dataset

Return type

List[str]

get_split_names(self)

Helper to expose the splits contained in the dataset

Return type

List[str]

property y(self)

Return the target label columns

Return type

numpy.ndarray

class simpleml.datasets.numpy.NumpyPipelineDataset(*args, **kwargs)

Bases: simpleml.datasets.numpy.base.BaseNumpyDataset

Dataset class with a predefined build routine, assuming dataset pipeline existence.

WARNING: this class will fail unless build_dataframe is overridden or a pipeline is provided!

Parameters
  • label_columns – Optional list of column names to register as the “y” split section

  • other_named_split_sections – Optional map of section names to lists of column names for other arbitrary split columns. These must match the expected consumer signatures (e.g. sample_weights) because they are passed through untouched downstream (e.g. sklearn.fit(**split))

All other columns in the dataframe will automatically be referenced as “X”

build_dataframe(self)

Transforms the raw dataset via the dataset pipeline into a production-ready dataset. Override this method to remove the raw dataset requirement.

Return type

None
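The override pattern mentioned in the warning can be sketched without importing SimpleML. Everything here is an assumption made for illustration: PipelineDatasetStandIn is a hypothetical stand-in for the real class, and modelling the pipeline as a plain callable is a simplification.

```python
import numpy as np


class PipelineDatasetStandIn:
    """Hypothetical stand-in, not the SimpleML class."""

    def __init__(self, pipeline=None):
        self.pipeline = pipeline  # assumed here to be a zero-argument callable
        self.dataframe = None

    def build_dataframe(self):
        # Mirrors the documented warning: without a pipeline the default
        # build cannot produce a dataset.
        if self.pipeline is None:
            raise ValueError("no dataset pipeline; override build_dataframe")
        self.dataframe = self.pipeline()


class InMemoryDataset(PipelineDatasetStandIn):
    """Subclass that overrides the build step so no pipeline is needed."""

    def build_dataframe(self):
        # Populate self.dataframe directly with a dictionary of ndarrays.
        self.dataframe = {
            "feature": np.arange(4),
            "label": np.array([0, 1, 0, 1]),
        }


dataset = InMemoryDataset()
result = dataset.build_dataframe()
```

As in the documented signature, the build step returns None and stores its output on the instance rather than handing it back to the caller.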