`simpleml.models.classifiers.sklearn.ensemble`

Wrapper module around sklearn.ensemble

Module Contents

Classes

`SklearnAdaBoostClassifier`	No different than base model. Here just to maintain the pattern
`SklearnBaggingClassifier`	No different than base model. Here just to maintain the pattern
`SklearnExtraTreesClassifier`	No different than base model. Here just to maintain the pattern
`SklearnGradientBoostingClassifier`	No different than base model. Here just to maintain the pattern
`SklearnHistGradientBoostingClassifier`	No different than base model. Here just to maintain the pattern
`SklearnRandomForestClassifier`	No different than base model. Here just to maintain the pattern
`SklearnVotingClassifier`	No different than base model. Here just to maintain the pattern
`WrappedSklearnAdaBoostClassifier`	An AdaBoost classifier.
`WrappedSklearnBaggingClassifier`	A Bagging classifier.
`WrappedSklearnExtraTreesClassifier`	An extra-trees classifier.
`WrappedSklearnGradientBoostingClassifier`	Gradient Boosting for classification.
`WrappedSklearnHistGradientBoostingClassifier`	Histogram-based Gradient Boosting Classification Tree.
`WrappedSklearnRandomForestClassifier`	A random forest classifier.
`WrappedSklearnVotingClassifier`	Soft Voting/Majority Rule classifier for unfitted estimators.

Attributes

`LOGGER`	AdaBoost Classifier
`__author__`

simpleml.models.classifiers.sklearn.ensemble.LOGGER[source]: AdaBoost Classifier

simpleml.models.classifiers.sklearn.ensemble.__author__ = Elisha Yadgaran[source]

class simpleml.models.classifiers.sklearn.ensemble.SklearnAdaBoostClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object
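The constructor notes above can be illustrated with a minimal, self-contained sketch. It is not SimpleML's implementation: `ExternalEstimator`, `ModelSketch`, `AdaBoostSketch`, and the `name` keyword are hypothetical stand-ins, and only the `external_model_kwargs` argument and the `_create_external_model` hook are taken from the signature above.

```python
class ExternalEstimator:
    """Hypothetical stand-in for a third-party estimator with a strict constructor."""

    def __init__(self, n_estimators=50, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


class ModelSketch:
    """Hypothetical wrapper base class (an assumption, not the SimpleML class)."""

    def __init__(self, external_model_kwargs=None, params=None, fitted=False, **kwargs):
        # Estimator arguments are kept apart from wrapper-only metadata
        self.external_model_kwargs = dict(external_model_kwargs or {})
        self.params = dict(params or {})
        self.fitted = fitted
        self.metadata = kwargs  # never forwarded to the external estimator
        self.external_model = self._create_external_model(**self.external_model_kwargs)

    def _create_external_model(self, **kwargs):
        # Hook for subclasses: return the desired model object
        raise NotImplementedError


class AdaBoostSketch(ModelSketch):
    def _create_external_model(self, **kwargs):
        return ExternalEstimator(**kwargs)


model = AdaBoostSketch(external_model_kwargs={"n_estimators": 100}, name="demo")
print(model.external_model.n_estimators, model.metadata)
```

Passing `name="demo"` directly to `ExternalEstimator` would raise a `TypeError`, which is the failure that the explicit separation avoids.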

class simpleml.models.classifiers.sklearn.ensemble.SklearnBaggingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object

class simpleml.models.classifiers.sklearn.ensemble.SklearnExtraTreesClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object

class simpleml.models.classifiers.sklearn.ensemble.SklearnGradientBoostingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object

class simpleml.models.classifiers.sklearn.ensemble.SklearnHistGradientBoostingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object

class simpleml.models.classifiers.sklearn.ensemble.SklearnRandomForestClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object

class simpleml.models.classifiers.sklearn.ensemble.SklearnVotingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)[source]

Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier

Identical to the base model; defined only to maintain the inheritance pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).

Passthrough kwargs for the external model must be separated explicitly, because most external models do not accept arbitrary **kwargs in their constructors.

Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.

Parameters

has_external_files (bool) –
external_model_kwargs (Optional[Dict[str, Any]]) –
params (Optional[Dict[str, Any]]) –
fitted (bool) –
pipeline_id (Optional[Union[str, uuid.uuid4]]) –

_create_external_model(self, **kwargs)[source]

Abstract method for each subclass to implement

should return the desired model object

class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnAdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)[source]

Bases: sklearn.ensemble.AdaBoostClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin

An AdaBoost classifier.

An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

This class implements the algorithm known as AdaBoost-SAMME [2].
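The reweighting described above can be sketched for a single boosting round. This is a simplified illustration of the discrete SAMME update under stated assumptions, not scikit-learn's code: the function name `samme_reweight` is hypothetical, the misclassification flags are supplied by hand rather than produced by a fitted classifier, and the real-valued SAMME.R variant is not covered.

```python
import math


def samme_reweight(weights, misclassified, n_classes, learning_rate=1.0):
    """Return (estimator_weight, new_sample_weights) for one boosting round."""
    total = sum(weights)
    # Weighted error rate of the classifier fitted in this round
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total
    if not 0.0 < err < 1.0 - 1.0 / n_classes:
        raise ValueError("sketch only handles 0 < err < 1 - 1/n_classes")
    # SAMME estimator weight; the log(K - 1) term vanishes for two classes
    alpha = learning_rate * (math.log((1.0 - err) / err) + math.log(n_classes - 1.0))
    # Up-weight only the misclassified samples, then renormalize to sum to 1
    raised = [w * math.exp(alpha) if wrong else w
              for w, wrong in zip(weights, misclassified)]
    norm = sum(raised)
    return alpha, [w / norm for w in raised]


weights = [0.25, 0.25, 0.25, 0.25]
alpha, updated = samme_reweight(weights, [False, False, False, True], n_classes=2)
print(alpha, updated)
```

With one of four equally weighted samples misclassified, the misclassified sample ends the round carrying half of the total weight, so the next classifier concentrates on it.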

Read more in the User Guide.

New in version 0.14.

base_estimator : object, default=None

The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier initialized with max_depth=1.

n_estimators : int, default=50

The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.

learning_rate : float, default=1.0

Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learning_rate and n_estimators parameters.

algorithm : {'SAMME', 'SAMME.R'}, default='SAMME.R'

If 'SAMME.R' then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If 'SAMME' then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.

random_state : int, RandomState instance or None, default=None

Controls the random seed given at each base_estimator at each boosting iteration. Thus, it is only used when base_estimator exposes a random_state. Pass an int for reproducible output across multiple function calls. See Glossary.

base_estimator_ : estimator

The base estimator from which the ensemble is grown.

estimators_ : list of classifiers

The collection of fitted sub-estimators.

classes_ : ndarray of shape (n_classes,)

The class labels.

n_classes_ : int

The number of classes.

estimator_weights_ : ndarray of floats

Weights for each estimator in the boosted ensemble.

estimator_errors_ : ndarray of floats

Classification error for each estimator in the boosted ensemble.

feature_importances_ : ndarray of shape (n_features,)

The impurity-based feature importances if supported by the base_estimator (when based on decision trees).

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

AdaBoostRegressor : An AdaBoost regressor that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction.
GradientBoostingClassifier : GB builds an additive model in a forward stage-wise fashion. Regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
sklearn.tree.DecisionTreeClassifier : A non-parametric supervised learning method used for classification. Creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

1: Y. Freund, R. Schapire, “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting”, 1995.
2: J. Zhu, H. Zou, S. Rosset, T. Hastie, “Multi-class AdaBoost”, 2009.

>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
>>> clf.score(X, y)
0.983...

class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnBaggingClassifier(base_estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)[source]

Bases: sklearn.ensemble.BaggingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin

A Bagging classifier.

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [1]. If samples are drawn with replacement, then the method is known as Bagging [2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [4].

Read more in the User Guide.

New in version 0.15.

base_estimator : object, default=None

The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier.

n_estimators : int, default=10

The number of base estimators in the ensemble.

max_samples : int or float, default=1.0

The number of samples to draw from X to train each base estimator (with replacement by default, see bootstrap for more details).

If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.

max_features : int or float, default=1.0

The number of features to draw from X to train each base estimator (without replacement by default, see bootstrap_features for more details).

If int, then draw max_features features.
If float, then draw max_features * X.shape[1] features.

bootstrap : bool, default=True

Whether samples are drawn with replacement. If False, sampling without replacement is performed.

bootstrap_features : bool, default=False

Whether features are drawn with replacement.
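How max_samples, max_features, bootstrap and bootstrap_features combine can be sketched with the standard library alone. The helper names below are hypothetical, and truncating the float case with `int()` is an assumption of this sketch; the text above only states the product `max_samples * X.shape[0]`.

```python
import random


def resolve_count(value, total):
    """Turn an int (absolute count) or a float (fraction of total) into a count."""
    if isinstance(value, int):
        return value
    return max(1, int(value * total))  # truncation is an assumption of this sketch


def draw_indices(total, count, replace, rng):
    """Draw `count` indices from range(total), with or without replacement."""
    if replace:
        return [rng.randrange(total) for _ in range(count)]
    return rng.sample(range(total), count)


def draw_subset(n_samples, n_features, max_samples=1.0, max_features=1.0,
                bootstrap=True, bootstrap_features=False, seed=0):
    """Pick the row and column indices that one base estimator would be trained on."""
    rng = random.Random(seed)
    rows = draw_indices(n_samples, resolve_count(max_samples, n_samples), bootstrap, rng)
    cols = draw_indices(n_features, resolve_count(max_features, n_features),
                        bootstrap_features, rng)
    return rows, cols


# Subsets of both samples and features (Random Patches), drawn without replacement
rows, cols = draw_subset(10, 4, max_samples=0.5, max_features=2, bootstrap=False)
print(rows, cols)

# Defaults: ten rows drawn with replacement (Bagging), all four features kept
default_rows, default_cols = draw_subset(10, 4)
print(default_rows, sorted(default_cols))
```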

oob_score : bool, default=False

Whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True.

warm_start : bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. See the Glossary.

New in version 0.17: warm_start constructor parameter.

n_jobs : int, default=None

The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int, RandomState instance or None, default=None

Controls the random resampling of the original dataset (sample wise and feature wise). If the base estimator accepts a random_state attribute, a different seed is generated for each instance in the ensemble. Pass an int for reproducible output across multiple function calls. See Glossary.

verbose : int, default=0

Controls the verbosity when fitting and predicting.

base_estimator_ : estimator

The base estimator from which the ensemble is grown.

n_features_ : int

The number of features when fit() is performed.

Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

estimators_ : list of estimators

The collection of fitted base estimators.

estimators_samples_ : list of arrays

The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the indices selected.

estimators_features_ : list of arrays

The subset of drawn features for each base estimator.

classes_ : ndarray of shape (n_classes,)

The class labels.

n_classes_ : int or list

The number of classes.

oob_score_ : float

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

oob_decision_function_ : ndarray of shape (n_samples, n_classes)

Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.

BaggingRegressor : A Bagging regressor.

1: L. Breiman, “Pasting small votes for classification in large databases and on-line”, Machine Learning, 36(1), 85-103, 1999.
2: L. Breiman, “Bagging predictors”, Machine Learning, 24(2), 123-140, 1996.
3: T. Ho, “The random subspace method for constructing decision forests”, Pattern Analysis and Machine Intelligence, 20(8), 832-844, 1998.
4: G. Louppe and P. Geurts, “Ensembles on Random Patches”, Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.

>>> from sklearn.svm import SVC
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = BaggingClassifier(base_estimator=SVC(),
...                         n_estimators=10, random_state=0).fit(X, y)
>>> clf.predict([[0, 0, 0, 0]])
array([1])

class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnExtraTreesClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]

Bases: sklearn.ensemble.ExtraTreesClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin

An extra-trees classifier.

This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Read more in the User Guide.

n_estimators : int, default=100

The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

criterion : {"gini", "entropy"}, default="gini"

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2

The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
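The int-versus-float rule can be written out directly. This minimal sketch implements only the documented formula; the name `resolve_min_samples` is hypothetical, and any extra lower bounds a library may apply are not modelled. The same rule is stated for min_samples_leaf.

```python
import math


def resolve_min_samples(value, n_samples):
    """Interpret an int as an absolute count and a float as a fraction of n_samples."""
    if isinstance(value, int):
        return value
    return math.ceil(value * n_samples)


print(resolve_min_samples(2, 95))    # 2
print(resolve_min_samples(0.1, 95))  # 10, i.e. ceil(9.5)
```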

Changed in version 0.18: Added float values for fractions.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.

min_weight_fraction_leaf : float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
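The cases listed above map onto a small lookup function. This is a sketch under assumptions: `resolve_max_features` is a hypothetical name, and truncating the square root and logarithm to an integer with a floor of one feature is a choice made here, since the list does not say how non-integer results are rounded.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features considered per split, following the documented cases."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))   # integer truncation is assumed
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))   # integer truncation is assumed
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        return max(1, round(max_features * n_features))
    raise ValueError(f"unsupported max_features: {max_features!r}")


for option in ("auto", "sqrt", "log2", None, 3, 0.5):
    print(option, resolve_max_features(option, n_features=16))
```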

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires inspecting more than max_features features.

max_leaf_nodes : int, default=None

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease : float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
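The weighted impurity decrease can be evaluated by hand for a candidate split. The function below transcribes the formula term by term; its name, the sample counts, the impurity values and the 0.05 threshold are made up for illustration.

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)."""
    return n_t / n * (impurity
                      - n_t_r / n_t * right_impurity
                      - n_t_l / n_t * left_impurity)


# A node holding 40 of 100 samples, split into children of 10 (pure) and 30 samples
decrease = weighted_impurity_decrease(n=100, n_t=40, n_t_l=10, n_t_r=30,
                                      impurity=0.5, left_impurity=0.0,
                                      right_impurity=0.4)
min_impurity_decrease = 0.05  # illustrative threshold
print(decrease, decrease >= min_impurity_decrease)
```

Here the decrease is 0.4 * (0.5 - 0.75 * 0.4 - 0.25 * 0.0) = 0.08, so the split would be accepted under a threshold of 0.05.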

New in version 0.19.

bootstrap : bool, default=False

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

oob_score : bool, default=False

Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.

n_jobs : int, default=None

The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int, RandomState instance or None, default=None

Controls 3 sources of randomness:

the bootstrapping of the samples used when building trees (if bootstrap=True)
the sampling of the features to consider when looking for the best split at each node (if max_features < n_features)
the draw of the splits for each of the max_features

See Glossary for details.

verbose : int, default=0

Controls the verbosity when fitting and predicting.

warm_start : bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.

class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, default=None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
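The "balanced" formula n_samples / (n_classes * np.bincount(y)) can be reproduced without NumPy for a single-output label list. The function name `balanced_class_weight` is hypothetical, and a `Counter` stands in for `np.bincount`.

```python
from collections import Counter


def balanced_class_weight(y):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weight([0, 0, 0, 1])
print(weights)
```

The minority class receives weight 4 / (2 * 1) = 2.0 and the majority class 4 / (2 * 3), roughly 0.667, so each class contributes the same total weight.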

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

New in version 0.22.

max_samples : int or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0].

New in version 0.22.

base_estimator_ : ExtraTreesClassifier

The child estimator template used to create the collection of fitted sub-estimators.

estimators_ : list of DecisionTreeClassifier

The collection of fitted sub-estimators.

classes_ : ndarray of shape (n_classes,) or a list of such arrays

The class labels (single output problem), or a list of arrays of class labels (multi-output problem).

n_classes_ : int or list

The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).

feature_importances_ : ndarray of shape (n_features,)

The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

n_features_ : int

The number of features when fit is performed.

Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

n_outputs_ : int

The number of outputs when fit is performed.

oob_score_ : float

Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.

oob_decision_function_ : ndarray of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs)

Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.
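A short calculation shows why a small n_estimators can leave NaN entries. The sketch assumes each tree draws a bootstrap sample of n_samples points with replacement (max_samples=None) and that trees are sampled independently; the function name is hypothetical.

```python
def prob_never_out_of_bag(n_samples, n_estimators):
    """Probability that a given point is in-bag for every tree, so it has no OOB estimate."""
    # Chance that one bootstrap sample of size n_samples contains the point at least once
    p_in_bag = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples
    return p_in_bag ** n_estimators


for trees in (1, 5, 10, 100):
    print(trees, prob_never_out_of_bag(1000, trees))
```

For 1000 samples a single tree keeps a point in-bag about 63% of the time, so with 10 trees roughly 1% of points are expected to have no out-of-bag prediction, while with 100 trees the chance is negligible.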

ExtraTreesRegressor : An extra-trees regressor with random splits.
RandomForestClassifier : A random forest classifier with optimal splits.
RandomForestRegressor : Ensemble regressor using trees with optimal splits.

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

1: P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
ExtraTreesClassifier(random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])

get_feature_metadata(self, features, **kwargs)[source]: By default nothing is implemented

class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnGradientBoostingClassifier(*, loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)[source]

Bases: sklearn.ensemble.GradientBoostingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin

Gradient Boosting for classification.

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
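The stage-wise procedure can be sketched for the binary case in pure Python. This is a deliberately reduced illustration, not scikit-learn's algorithm: it uses a single feature, depth-one stumps instead of deeper regression trees, a starting score of zero, and the plain mean residual as each leaf value. All function names are hypothetical, and `fit_stump` assumes at least two distinct feature values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Least-squares regression stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(x))[:-1]:  # samples with x <= t go left
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Fit one stump per stage to the negative gradient of the binomial deviance."""
    scores = [0.0] * len(x)
    stumps = []
    for _ in range(n_estimators):
        # Negative gradient of the log-loss with respect to the score: y - p
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        t, lv, rv = fit_stump(x, residuals)
        stumps.append((t, lv, rv))
        # Shrink each stage's contribution by the learning rate
        scores = [s + learning_rate * (lv if xi <= t else rv)
                  for xi, s in zip(x, scores)]
    return stumps, scores


def predict_proba(stumps, xi, learning_rate=0.1):
    score = sum(learning_rate * (lv if xi <= t else rv) for t, lv, rv in stumps)
    return sigmoid(score)


x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
stumps, scores = boost(x, y)
print(predict_proba(stumps, 1.5), predict_proba(stumps, 5.5))
```

Each stage adds only `learning_rate` times its stump's output, which is the shrinkage that trades off against the number of stages.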

Read more in the User Guide.

loss : {'deviance', 'exponential'}, default='deviance'

The loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.

learning_rate : float, default=0.1

Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

n_estimators : int, default=100

The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.

subsample : float, default=1.0

The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.

criterion : {'friedman_mse', 'squared_error', 'mse', 'mae'}, default='friedman_mse'

The function to measure the quality of a split. Supported criteria are ‘friedman_mse’ for the mean squared error with improvement score by Friedman, ‘squared_error’ for mean squared error, and ‘mae’ for the mean absolute error. The default value of ‘friedman_mse’ is generally the best as it can provide a better approximation in some cases.

New in version 0.18.

Deprecated since version 0.24: criterion=’mae’ is deprecated and will be removed in version 1.1 (renaming of 0.26). Use criterion=’friedman_mse’ or ‘squared_error’ instead, as trees should use a squared error criterion in Gradient Boosting.

Deprecated since version 1.0: Criterion ‘mse’ was deprecated in v1.0 and will be removed in version 1.2. Use criterion=’squared_error’ which is equivalent.

min_samples_split : int or float, default=2

The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

Changed in version 0.18: Added float values for fractions.
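The int-versus-float rule above can be sketched in a few lines. The helper name `resolve_min_samples_split` is hypothetical and only mirrors the documented rule; it is not part of sklearn or simpleml, and the library may apply further lower bounds internally.

```python
import math


def resolve_min_samples_split(min_samples_split, n_samples):
    """Return the effective minimum sample count needed to split a node."""
    if isinstance(min_samples_split, int):
        # Integers are taken as an absolute count.
        return min_samples_split
    # Floats are treated as a fraction of the training set, rounded up.
    return math.ceil(min_samples_split * n_samples)


effective_from_int = resolve_min_samples_split(2, 1001)       # absolute count
effective_from_float = resolve_min_samples_split(0.25, 1001)  # ceil(250.25)
print(effective_from_int, effective_from_float)
```

The same rule is documented for min_samples_leaf, with min_samples_leaf in place of min_samples_split.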

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.

min_weight_fraction_leaf : float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_depth : int, default=3

The maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.

min_impurity_decrease : float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.
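As a worked instance of the weighted impurity decrease equation, the sketch below evaluates it for one made-up split. The function name and the sample numbers are illustrative assumptions, not values taken from the library.

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """Evaluate the documented weighted impurity decrease for one split."""
    return N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)


# Hypothetical node: 40 of 100 samples reach it; the split sends 10 left, 30 right.
decrease = weighted_impurity_decrease(
    N=100, N_t=40, N_t_L=10, N_t_R=30,
    impurity=0.5, left_impurity=0.0, right_impurity=0.4,
)

# The node is split only if the decrease reaches min_impurity_decrease.
min_impurity_decrease = 0.05
would_split = decrease >= min_impurity_decrease
print(round(decrease, 4), would_split)
```

Here the decrease works out to roughly 0.08, so a threshold of 0.05 would allow the split while a threshold of 0.1 would not.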

init : estimator or ‘zero’, default=None

An estimator object that is used to compute the initial predictions. init has to provide fit() and predict_proba(). If ‘zero’, the initial raw predictions are set to zero. By default, a DummyClassifier predicting the class priors is used.

random_state : int, RandomState instance or None, default=None

Controls the random seed given to each Tree estimator at each boosting iteration. In addition, it controls the random permutation of the features at each split (see Notes for more details). It also controls the random splitting of the training data to obtain a validation set if n_iter_no_change is not None. Pass an int for reproducible output across multiple function calls. See Glossary.

max_features : {‘auto’, ‘sqrt’, ‘log2’}, int or float, default=None

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If ‘auto’, then max_features=sqrt(n_features).
If ‘sqrt’, then max_features=sqrt(n_features).
If ‘log2’, then max_features=log2(n_features).
If None, then max_features=n_features.

Choosing max_features < n_features leads to a reduction of variance and an increase in bias.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires effectively inspecting more than max_features features.
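One way to see how each max_features setting is resolved is to fit a small model and read back the max_features_ attribute. This is a minimal sketch that uses sklearn.ensemble.GradientBoostingClassifier directly (the documented base class of the wrapper) so that it runs without simpleml; the dataset is synthetic, and ‘auto’ is left out because it is not accepted by every sklearn release.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with 16 features so that sqrt and log2 give whole numbers.
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

resolved = {}
for setting in ("sqrt", "log2", 0.5, 3, None):
    clf = GradientBoostingClassifier(
        n_estimators=5, max_features=setting, random_state=0
    )
    clf.fit(X, y)
    # max_features_ holds the inferred number of features per split.
    resolved[setting] = clf.max_features_

print(resolved)
```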

verbose : int, default=0

Enable verbose output. If 1 then it prints progress and performance once in a while (the more trees the lower the frequency). If greater than 1 then it prints progress and performance for every tree.

max_leaf_nodes : int, default=None

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

warm_start : bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution. See the Glossary.
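A minimal sketch of warm starting, assuming the wrapper behaves like its sklearn base class (used directly here so the snippet runs without simpleml). The second call to fit keeps the first 50 stages and adds 30 more instead of starting over.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)
stages_after_first_fit = clf.estimators_.shape[0]

# Raise the target and refit: existing stages are reused, new ones appended.
clf.set_params(n_estimators=80)
clf.fit(X, y)
stages_after_second_fit = clf.estimators_.shape[0]

print(stages_after_first_fit, stages_after_second_fit)
```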

validation_fraction : float, default=0.1

The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.

New in version 0.20.

n_iter_no_change : int, default=None

n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside validation_fraction size of the training data as validation and terminate training when validation score is not improving in all of the previous n_iter_no_change numbers of iterations. The split is stratified.

New in version 0.20.

tol : float, default=1e-4

Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.

New in version 0.20.
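The three early-stopping parameters (validation_fraction, n_iter_no_change and tol) work together. The sketch below, on synthetic data and with the sklearn base class used directly, sets a generous n_estimators budget and lets early stopping decide how many stages are kept; the specific values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting stages
    validation_fraction=0.2,   # held out from the training data
    n_iter_no_change=5,        # patience, in boosting iterations
    tol=1e-4,                  # minimum improvement that counts
    random_state=0,
)
clf.fit(X, y)

# n_estimators_ reports how many stages were actually kept.
stages_used = clf.n_estimators_
print(stages_used, clf.estimators_.shape[0])
```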

ccp_alpha : non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.

New in version 0.22.

n_estimators_ : int

The number of estimators as selected by early stopping (if n_iter_no_change is specified). Otherwise it is set to n_estimators.

New in version 0.20.

feature_importances_ : ndarray of shape (n_features,)

The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
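To compare the two kinds of importance mentioned in the warning, the following sketch computes both on the same fitted model. It uses the sklearn base class directly and a synthetic dataset; the split sizes and n_repeats value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Impurity-based importances come from the training process itself.
impurity_based = clf.feature_importances_

# Permutation importances are measured on held-out data by shuffling columns.
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=0)
permutation_based = result.importances_mean

print(impurity_based.round(3))
print(permutation_based.round(3))
```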

oob_improvement_ : ndarray of shape (n_estimators,)

The improvement in loss (= deviance) on the out-of-bag samples relative to the previous iteration. oob_improvement_[0] is the improvement in loss of the first stage over the init estimator. Only available if subsample < 1.0.
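A short sketch of the subsample dependency, using the sklearn base class directly on synthetic data: with subsample below 1.0 the attribute is populated with one entry per stage, and with the default of 1.0 it is not set at all.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Stochastic gradient boosting: each stage sees half of the samples.
stochastic = GradientBoostingClassifier(
    n_estimators=30, subsample=0.5, random_state=0
).fit(X, y)
oob_shape = stochastic.oob_improvement_.shape

# With the full sample there are no out-of-bag rows to score.
full = GradientBoostingClassifier(
    n_estimators=30, subsample=1.0, random_state=0
).fit(X, y)
has_oob_when_full = hasattr(full, "oob_improvement_")

print(oob_shape, has_oob_when_full)
```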

train_score_ : ndarray of shape (n_estimators,)

The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample. If subsample == 1 this is the deviance on the training data.

loss_ : LossFunction

The concrete LossFunction object.

init_ : estimator

The estimator that provides the initial predictions. Set via the init argument or loss.init_estimator.

estimators_ : ndarray of DecisionTreeRegressor of shape (n_estimators, loss_.K)

The collection of fitted sub-estimators. loss_.K is 1 for binary classification, otherwise n_classes.
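The second dimension of estimators_ can be checked by fitting one binary and one multiclass problem. This sketch uses the sklearn base class directly, with a synthetic binary dataset and the three-class iris data.

```python
from sklearn.datasets import load_iris, make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_bin, y_bin = make_classification(n_samples=200, random_state=0)  # 2 classes
X_multi, y_multi = load_iris(return_X_y=True)                       # 3 classes

binary = GradientBoostingClassifier(n_estimators=10, random_state=0)
binary.fit(X_bin, y_bin)
multi = GradientBoostingClassifier(n_estimators=10, random_state=0)
multi.fit(X_multi, y_multi)

# One regression tree per stage for binary, one per class otherwise.
binary_shape = binary.estimators_.shape
multi_shape = multi.estimators_.shape
print(binary_shape, multi_shape)
```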

classes_ : ndarray of shape (n_classes,)

The class labels.

n_features_ : int

The number of data features.

Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

n_classes_ : int

The number of classes.

max_features_ : int

The inferred value of max_features.

HistGradientBoostingClassifier : Histogram-based Gradient Boosting Classification Tree.

sklearn.tree.DecisionTreeClassifier : A decision tree classifier.

RandomForestClassifier : A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

AdaBoostClassifier : A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.

J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29, No. 5, 2001.

J. Friedman, Stochastic Gradient Boosting, 1999.

T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.

The following example shows how to fit a gradient boosting classifier with 100 decision stumps as weak learners.

>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier

>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]

>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.913...

get_feature_metadata(self, features, **kwargs)[source]: By default nothing is implemented

class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnHistGradientBoostingClassifier(loss='auto', *, learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, categorical_features=None, monotonic_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None)[source]

Bases: sklearn.ensemble.HistGradientBoostingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin

Histogram-based Gradient Boosting Classification Tree.

This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10 000).

This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.

This implementation is inspired by LightGBM.

Read more in the User Guide.

New in version 0.21.

loss : {‘auto’, ‘binary_crossentropy’, ‘categorical_crossentropy’}, default=‘auto’

The loss function to use in the boosting process. ‘binary_crossentropy’ (also known as logistic loss) is used for binary classification and generalizes to ‘categorical_crossentropy’ for multiclass classification. ‘auto’ will automatically choose either loss depending on the nature of the problem.

learning_rate : float, default=0.1

The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.

max_iter : int, default=100

The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built.

max_leaf_nodes : int or None, default=31

The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit.

max_depth : int or None, default=None

The maximum depth of each tree. The depth of a tree is the number of edges to go from the root to the deepest leaf. Depth isn’t constrained by default.

min_samples_leaf : int, default=20

The minimum number of samples per leaf. For small datasets with fewer than a few hundred samples, it is recommended to lower this value since only very shallow trees would be built.

l2_regularization : float, default=0

The L2 regularization parameter. Use 0 for no regularization.

max_bins : int, default=255

The maximum number of bins to use for non-missing values. Before training, each feature of the input array X is binned into integer-valued bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than max_bins bins. In addition to the max_bins bins, one more bin is always reserved for missing values. Must be no larger than 255.

categorical_features : array-like of {bool, int} of shape (n_features) or shape (n_categorical_features,), default=None

Indicates the categorical features.

None : no feature will be considered categorical.
boolean array-like : boolean mask indicating categorical features.
integer array-like : integer indices indicating categorical features.

For each categorical feature, there must be at most max_bins unique categories, and each categorical value must be in [0, max_bins - 1].

Read more in the User Guide.

New in version 0.24.

monotonic_cst : array-like of int of shape (n_features), default=None

Indicates the monotonic constraint to enforce on each feature. -1, 1 and 0 respectively correspond to a negative constraint, positive constraint and no constraint. Read more in the User Guide.

New in version 0.23.

warm_start : bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble. For results to be valid, the estimator should be re-trained on the same data only. See the Glossary.

early_stopping : ‘auto’ or bool, default=‘auto’

If ‘auto’, early stopping is enabled if the sample size is larger than 10000. If True, early stopping is enabled, otherwise early stopping is disabled.

New in version 0.23.

scoring : str or callable or None, default=‘loss’

Scoring parameter to use for early stopping. It can be a single string (see scoring_parameter) or a callable (see scoring). If None, the estimator’s default scorer is used. If scoring='loss', early stopping is checked w.r.t the loss value. Only used if early stopping is performed.

validation_fraction : int or float or None, default=0.1

Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data. Only used if early stopping is performed.

n_iter_no_change : int, default=10

Used to determine when to “early stop”. The fitting process is stopped when none of the last n_iter_no_change scores are better than the (n_iter_no_change - 1)-th-to-last one, up to some tolerance. Only used if early stopping is performed.

tol : float, default=1e-7

The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score.

verbose : int, default=0

The verbosity level. If not zero, print some information about the fitting process.

random_state : int, RandomState instance or None, default=None

Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. Pass an int for reproducible output across multiple function calls. See Glossary.

classes_ : array, shape = (n_classes,)

Class labels.

do_early_stopping_ : bool

Indicates whether early stopping is used during training.

n_iter_ : int

The number of iterations as selected by early stopping, depending on the early_stopping parameter. Otherwise it corresponds to max_iter.

n_trees_per_iteration_ : int

The number of trees that are built at each iteration. This is equal to 1 for binary classification, and to n_classes for multiclass classification.

train_score_ : ndarray, shape (n_iter_+1,)

The scores at each iteration on the training data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. If scoring is not ‘loss’, scores are computed on a subset of at most 10 000 samples. Empty if no early stopping.

validation_score_ : ndarray, shape (n_iter_+1,)

The scores at each iteration on the held-out validation data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. Empty if no early stopping or if validation_fraction is None.

is_categorical_ : ndarray, shape (n_features, ) or None

Boolean mask for the categorical features. None if there are no categorical features.

n_features_in_ : int

Number of features seen during fit.

New in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

GradientBoostingClassifier : Exact gradient boosting method that does not scale as well on datasets with a large number of samples.

sklearn.tree.DecisionTreeClassifier : A decision tree classifier.

RandomForestClassifier : A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

AdaBoostClassifier : A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0

class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnRandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]

Bases: sklearn.ensemble.RandomForestClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin

A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

Read more in the User Guide.

n_estimators : int, default=100

The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

criterion : {“gini”, “entropy”}, default=“gini”

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.

max_depth : int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split : int or float, default=2

The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

Changed in version 0.18: Added float values for fractions.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.

min_weight_fraction_leaf : float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_features : {“auto”, “sqrt”, “log2”}, int or float, default=“auto”

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if this requires effectively inspecting more than max_features features.

max_leaf_nodes : int, default=None

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

min_impurity_decrease : float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.

bootstrap : bool, default=True

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

oob_score : bool, default=False

Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
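A short sketch of the out-of-bag estimate and its dependency on bootstrap, using the sklearn base class directly on synthetic data. The fitted oob_score_ attribute is the sklearn name for the resulting estimate; the dataset and forest size are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# With bootstrap sampling, rows left out of each tree act as a validation set.
clf = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
clf.fit(X, y)
oob_accuracy = clf.oob_score_

# Without bootstrap there are no out-of-bag rows, so fitting is rejected.
try:
    RandomForestClassifier(bootstrap=False, oob_score=True).fit(X, y)
    rejected_without_bootstrap = False
except ValueError:
    rejected_without_bootstrap = True

print(round(oob_accuracy, 3), rejected_without_bootstrap)
```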

n_jobs : int, default=None

The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int, RandomState instance or None, default=None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
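Fixing random_state makes both sources of randomness repeatable. The sketch below, with the sklearn base class used directly on synthetic data, fits two forests with the same seed and checks that their predicted probabilities match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Same seed: identical bootstrap draws and identical feature sampling.
first = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
second = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

identical = np.array_equal(first.predict_proba(X), second.predict_proba(X))
print(identical)
```

Leaving random_state at None would generally give a different forest on each fit.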