simpleml.models.classifiers.sklearn.ensemble
Wrapper module around sklearn.ensemble
Module Contents
Classes
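SklearnAdaBoostClassifier | No different than base model. Here just to maintain the pattern
SklearnBaggingClassifier | No different than base model. Here just to maintain the pattern
SklearnExtraTreesClassifier | No different than base model. Here just to maintain the pattern
SklearnGradientBoostingClassifier | No different than base model. Here just to maintain the pattern
SklearnHistGradientBoostingClassifier | No different than base model. Here just to maintain the pattern
SklearnRandomForestClassifier | No different than base model. Here just to maintain the pattern
SklearnVotingClassifier | No different than base model. Here just to maintain the pattern
WrappedSklearnAdaBoostClassifier | An AdaBoost classifier.
WrappedSklearnBaggingClassifier | A Bagging classifier.
WrappedSklearnExtraTreesClassifier | An extra-trees classifier.
WrappedSklearnGradientBoostingClassifier | Gradient Boosting for classification.
WrappedSklearnHistGradientBoostingClassifier | Histogram-based Gradient Boosting Classification Tree.
WrappedSklearnRandomForestClassifier | A random forest classifier.
WrappedSklearnVotingClassifier | Soft Voting/Majority Rule classifier for unfitted estimators.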
Attributes
AdaBoost Classifier
- class simpleml.models.classifiers.sklearn.ensemble.SklearnAdaBoostClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
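The kwarg separation and the two initialization patterns described above can be illustrated with a small, library-free sketch. `HypotheticalExternalModel` and `HypotheticalWrapper` are illustrative names, not SimpleML classes; the sketch only assumes that `external_model_kwargs` is forwarded to the external constructor while any other keyword stays with the wrapper.

```python
class HypotheticalExternalModel:
    """Stand-in for an external estimator with a strict constructor (no **kwargs)."""

    def __init__(self, n_estimators=50, learning_rate=1.0):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate


class HypotheticalWrapper:
    """Illustrative wrapper; NOT the SimpleML implementation."""

    def __init__(self, external_model_kwargs=None, params=None, **kwargs):
        # Constructor arguments for the external model are kept apart from
        # the wrapper's own metadata so they can be forwarded verbatim.
        self.external_model_kwargs = dict(external_model_kwargs or {})
        self.params = dict(params or {})
        self.metadata = kwargs
        self.external_model = None

    def build_external_model(self):
        # Only the explicitly separated kwargs reach the strict constructor.
        self.external_model = HypotheticalExternalModel(**self.external_model_kwargs)
        return self.external_model


# Pattern 1: full initialization in the constructor.
full = HypotheticalWrapper(external_model_kwargs={"n_estimators": 100}, name="demo")
model_a = full.build_external_model()

# Pattern 2: stepwise configuration before building the external model.
stepwise = HypotheticalWrapper(name="demo")
stepwise.external_model_kwargs["n_estimators"] = 100
model_b = stepwise.build_external_model()
```

Passing `name="demo"` straight to the strict constructor would raise a `TypeError`, which is the failure the separation is meant to avoid.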
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.SklearnBaggingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.SklearnExtraTreesClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.SklearnGradientBoostingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.SklearnHistGradientBoostingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.SklearnRandomForestClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.SklearnVotingClassifier(has_external_files=True, external_model_kwargs=None, params=None, fitted=False, pipeline_id=None, **kwargs)
Bases: simpleml.models.classifiers.sklearn.base_sklearn_classifier.SklearnClassifier
No different than base model. Here just to maintain the pattern Generic Base -> Library Base -> Domain Base -> Individual Models (e.g. [Library]Model -> SklearnModel -> SklearnClassifier -> SklearnLogisticRegression).
Passthrough kwargs for external models must be separated explicitly, since most external models do not support arbitrary **kwargs in their constructors.
Two patterns are supported: full initialization in the constructor, or stepwise configuration before fit and save.
- Parameters
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnAdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
Bases: sklearn.ensemble.AdaBoostClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
An AdaBoost classifier.
An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
This class implements the algorithm known as AdaBoost-SAMME [2].
Read more in the User Guide.
New in version 0.14.
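The reweighting idea described above can be sketched as one round of the discrete SAMME update from [2]. This is a minimal illustration and not scikit-learn's code; `samme_round` and its arguments are hypothetical names, and the weak classifier's predictions are simply given as a list.

```python
import math


def samme_round(sample_weights, y_true, y_pred, n_classes, learning_rate=1.0):
    """One discrete SAMME reweighting step (sketch, not sklearn's code).

    Returns the estimator weight and the renormalized sample weights.
    """
    total = sum(sample_weights)
    missed = [yt != yp for yt, yp in zip(y_true, y_pred)]
    # Weighted error rate of the current weak classifier.
    err = sum(w for w, m in zip(sample_weights, missed) if m) / total
    if err <= 0.0:
        # Perfect fit: nothing left to reweight, boosting can stop early.
        return 1.0, list(sample_weights)
    if err >= 1.0 - 1.0 / n_classes:
        raise ValueError("weak learner is no better than random guessing")
    # Estimator weight from the multi-class SAMME formula.
    alpha = learning_rate * (math.log((1.0 - err) / err) + math.log(n_classes - 1.0))
    # Up-weight only the misclassified samples, then renormalize.
    updated = [w * math.exp(alpha) if m else w for w, m in zip(sample_weights, missed)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
# The third sample is misclassified, so its weight grows for the next round.
alpha, new_weights = samme_round(weights, [0, 1, 1, 0], [0, 1, 0, 0], n_classes=2)
```

After this round the misclassified sample carries half of the total weight, so the next weak classifier is pushed to focus on it.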
- base_estimator : object, default=None
The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier initialized with max_depth=1.
- n_estimators : int, default=50
The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
- learning_rate : float, default=1.0
Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learning_rate and n_estimators parameters.
- algorithm : {'SAMME', 'SAMME.R'}, default='SAMME.R'
If 'SAMME.R' then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If 'SAMME' then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
- random_state : int, RandomState instance or None, default=None
Controls the random seed given to each base_estimator at each boosting iteration. Thus, it is only used when base_estimator exposes a random_state. Pass an int for reproducible output across multiple function calls. See Glossary.
- base_estimator_ : estimator
The base estimator from which the ensemble is grown.
- estimators_ : list of classifiers
The collection of fitted sub-estimators.
- classes_ : ndarray of shape (n_classes,)
The class labels.
- n_classes_ : int
The number of classes.
- estimator_weights_ : ndarray of floats
Weights for each estimator in the boosted ensemble.
- estimator_errors_ : ndarray of floats
Classification error for each estimator in the boosted ensemble.
- feature_importances_ : ndarray of shape (n_features,)
The impurity-based feature importances if supported by the base_estimator (when based on decision trees).
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- n_features_in_ : int
Number of features seen during fit.
New in version 0.24.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- AdaBoostRegressor : An AdaBoost regressor that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset but where the weights of instances are adjusted according to the error of the current prediction.
- GradientBoostingClassifier : GB builds an additive model in a forward stage-wise fashion. Regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
- sklearn.tree.DecisionTreeClassifier : A non-parametric supervised learning method used for classification. Creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
[1] Y. Freund, R. Schapire, “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting”, 1995.
[2] J. Zhu, H. Zou, S. Rosset, T. Hastie, “Multi-class AdaBoost”, 2009.
>>> from sklearn.ensemble import AdaBoostClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
AdaBoostClassifier(n_estimators=100, random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
>>> clf.score(X, y)
0.983...
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnBaggingClassifier(base_estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
Bases: sklearn.ensemble.BaggingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
A Bagging classifier.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [1]. If samples are drawn with replacement, then the method is known as Bagging [2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [4].
Read more in the User Guide.
New in version 0.15.
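The four sampling regimes named above differ only in which axis is subsampled and whether draws are made with replacement. The following library-free sketch illustrates this; `draw_subset` and `draw_for_estimator` are hypothetical helpers, and reading a float as a fraction of the population is an assumption modeled on the max_samples and max_features descriptions in this entry.

```python
import random


def draw_subset(n_items, max_items, replace, rng):
    """Draw item indices; a float max_items is read as a fraction of n_items."""
    k = int(max_items * n_items) if isinstance(max_items, float) else max_items
    if replace:
        return [rng.randrange(n_items) for _ in range(k)]
    return rng.sample(range(n_items), k)


def draw_for_estimator(n_samples, n_features, max_samples, max_features,
                       bootstrap, bootstrap_features, rng):
    """Return (sample_indices, feature_indices) for one base estimator."""
    rows = draw_subset(n_samples, max_samples, bootstrap, rng)
    cols = draw_subset(n_features, max_features, bootstrap_features, rng)
    return rows, cols


rng = random.Random(0)
# Pasting: subsets of samples, drawn without replacement.
pasting = draw_for_estimator(100, 8, 0.5, 1.0, False, False, rng)
# Bagging: samples drawn with replacement.
bagging = draw_for_estimator(100, 8, 1.0, 1.0, True, False, rng)
# Random Subspaces: all samples, a subset of the features.
subspaces = draw_for_estimator(100, 8, 1.0, 0.5, False, False, rng)
# Random Patches: subsets of both samples and features.
patches = draw_for_estimator(100, 8, 0.5, 0.5, False, False, rng)
```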
- base_estimator : object, default=None
The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a DecisionTreeClassifier.
- n_estimators : int, default=10
The number of base estimators in the ensemble.
- max_samples : int or float, default=1.0
The number of samples to draw from X to train each base estimator (with replacement by default, see bootstrap for more details).
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.
- max_features : int or float, default=1.0
The number of features to draw from X to train each base estimator (without replacement by default, see bootstrap_features for more details).
If int, then draw max_features features.
If float, then draw max_features * X.shape[1] features.
- bootstrap : bool, default=True
Whether samples are drawn with replacement. If False, sampling without replacement is performed.
- bootstrap_features : bool, default=False
Whether features are drawn with replacement.
- oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True.
- warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. See the Glossary.
New in version 0.17: warm_start constructor parameter.
- n_jobs : int, default=None
The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- random_state : int, RandomState instance or None, default=None
Controls the random resampling of the original dataset (sample wise and feature wise). If the base estimator accepts a random_state attribute, a different seed is generated for each instance in the ensemble. Pass an int for reproducible output across multiple function calls. See Glossary.
- verbose : int, default=0
Controls the verbosity when fitting and predicting.
- base_estimator_ : estimator
The base estimator from which the ensemble is grown.
- n_features_ : int
The number of features when fit() is performed.
Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
- n_features_in_ : int
Number of features seen during fit.
New in version 0.24.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- estimators_ : list of estimators
The collection of fitted base estimators.
- estimators_samples_ : list of arrays
The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the indices selected.
- estimators_features_ : list of arrays
The subset of drawn features for each base estimator.
- classes_ : ndarray of shape (n_classes,)
The class labels.
- n_classes_ : int or list
The number of classes.
- oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_decision_function_ : ndarray of shape (n_samples, n_classes)
Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.
BaggingRegressor : A Bagging regressor.
[1] L. Breiman, “Pasting small votes for classification in large databases and on-line”, Machine Learning, 36(1), 85-103, 1999.
[2] L. Breiman, “Bagging predictors”, Machine Learning, 24(2), 123-140, 1996.
[3] T. Ho, “The random subspace method for constructing decision forests”, Pattern Analysis and Machine Intelligence, 20(8), 832-844, 1998.
[4] G. Louppe and P. Geurts, “Ensembles on Random Patches”, Machine Learning and Knowledge Discovery in Databases, 346-361, 2012.
>>> from sklearn.svm import SVC
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = BaggingClassifier(base_estimator=SVC(),
...                         n_estimators=10, random_state=0).fit(X, y)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
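The out-of-bag attributes documented in this entry rest on a simple set difference: for each estimator, the out-of-bag samples are the ones its bootstrap draw never selected. A minimal library-free sketch, with `out_of_bag_indices` as a hypothetical helper and plain index lists standing in for estimators_samples_:

```python
import random


def out_of_bag_indices(n_samples, in_bag):
    """Indices never drawn for an estimator, i.e. its out-of-bag samples."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


rng = random.Random(0)
n_samples, n_estimators = 20, 10
# Bootstrap: each estimator draws n_samples indices with replacement.
bags = [[rng.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_estimators)]
oob = [out_of_bag_indices(n_samples, bag) for bag in bags]

# A sample only gets an out-of-bag prediction if at least one estimator
# left it out; a sample listed here would have no such prediction.
never_left_out = [i for i in range(n_samples)
                  if all(i in set(bag) for bag in bags)]
```

With few estimators, `never_left_out` is more likely to be non-empty, which corresponds to the NaN case mentioned for oob_decision_function_.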
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnExtraTreesClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=False, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
Bases: sklearn.ensemble.ExtraTreesClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
An extra-trees classifier.
This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Read more in the User Guide.
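The averaging step mentioned above can be sketched without any library: each tree produces a class-probability vector for a sample, and the ensemble averages those vectors before picking a class. The per-tree outputs below are made-up values for illustration, and `average_probabilities` is a hypothetical helper.

```python
def average_probabilities(per_tree_probas):
    """Average class-probability vectors predicted by individual trees."""
    n_trees = len(per_tree_probas)
    n_classes = len(per_tree_probas[0])
    return [sum(p[c] for p in per_tree_probas) / n_trees for c in range(n_classes)]


# Hypothetical per-tree outputs for one sample and two classes.
tree_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
mean_proba = average_probabilities(tree_outputs)
# The ensemble prediction is the class with the highest averaged probability.
predicted_class = max(range(len(mean_proba)), key=mean_proba.__getitem__)
```

Averaging smooths out the variance of individual randomized trees, which is the mechanism behind the over-fitting control described above.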
- n_estimators : int, default=100
The number of trees in the forest.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
- criterion : {"gini", "entropy"}, default="gini"
The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
- max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
- min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If "auto", then max_features=sqrt(n_features).
If "sqrt", then max_features=sqrt(n_features).
If "log2", then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
- max_leaf_nodes : int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
- bootstrap : bool, default=False
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- oob_score : bool, default=False
Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
- n_jobs : int, default=None
The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- random_state : int, RandomState instance or None, default=None
Controls 3 sources of randomness:
the bootstrapping of the samples used when building trees (if bootstrap=True)
the sampling of the features to consider when looking for the best split at each node (if max_features < n_features)
the draw of the splits for each of the max_features
See Glossary for details.
- verbose : int, default=0
Controls the verbosity when fitting and predicting.
- warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.
- class_weight : {"balanced", "balanced_subsample"}, dict or list of dicts, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
The "balanced_subsample" mode is the same as "balanced" except that weights are computed based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
- ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.
New in version 0.22.
- max_samples : int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0].
New in version 0.22.
- base_estimator_ : ExtraTreesClassifier
The child estimator template used to create the collection of fitted sub-estimators.
- estimators_ : list of DecisionTreeClassifier
The collection of fitted sub-estimators.
- classes_ : ndarray of shape (n_classes,) or a list of such arrays
The class labels (single output problem), or a list of arrays of class labels (multi-output problem).
- n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).
- feature_importances_ : ndarray of shape (n_features,)
The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- n_features_ : int
The number of features when fit is performed.
Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
- n_features_in_ : int
Number of features seen during fit.
New in version 0.24.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- n_outputs_ : int
The number of outputs when fit is performed.
- oob_score_ : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_decision_function_ : ndarray of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs)
Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.
ExtraTreesRegressor : An extra-trees regressor with random splits.
RandomForestClassifier : A random forest classifier with optimal splits.
RandomForestRegressor : Ensemble regressor using trees with optimal splits.
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
[1] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006.
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
>>> clf.fit(X, y)
ExtraTreesClassifier(random_state=0)
>>> clf.predict([[0, 0, 0, 0]])
array([1])
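The weighted impurity decrease formula given under min_impurity_decrease can be worked through on a toy node. The sketch below assumes Gini impurity and unweighted samples (so every count is a plain sample count); `gini` and `weighted_impurity_decrease` are hypothetical helpers, not scikit-learn functions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def weighted_impurity_decrease(n_total, parent, left, right):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)."""
    n_t, n_t_l, n_t_r = len(parent), len(left), len(right)
    return n_t / n_total * (
        gini(parent) - n_t_r / n_t * gini(right) - n_t_l / n_t * gini(left)
    )


# Toy node holding 8 of the 10 training samples, split into two children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]
decrease = weighted_impurity_decrease(10, parent, left, right)

# The node is split only if the decrease reaches the threshold.
min_impurity_decrease = 0.05
should_split = decrease >= min_impurity_decrease
```

Here the parent impurity is 0.5 and each child has impurity 0.375, so the decrease is 0.8 * 0.125 = 0.1 and the split is accepted under the chosen threshold.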
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnGradientBoostingClassifier(*, loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
Bases: sklearn.ensemble.GradientBoostingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
Gradient Boosting for classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
Read more in the User Guide.
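For the binary case, the negative gradient that each stage's regression tree is fit on has a simple closed form. The sketch below assumes labels in {0, 1} and the log loss form of the binomial deviance, for which the negative gradient with respect to the raw score is y - p (up to a constant factor). The names are hypothetical, and the update step is a simplification rather than scikit-learn's procedure.

```python
import math


def sigmoid(raw):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-raw))


def negative_gradient(y_true, raw_scores):
    """Pseudo-residuals y - p for the binomial deviance (log loss), y in {0, 1}."""
    return [y - sigmoid(f) for y, f in zip(y_true, raw_scores)]


y = [1, 0, 1]
raw = [0.0, 0.0, 2.0]
residuals = negative_gradient(y, raw)

# A real stage would fit a regression tree to `residuals`; in this sketch
# the residuals themselves stand in for that tree's predictions.
learning_rate = 0.1
updated_raw = [f + learning_rate * r for f, r in zip(raw, residuals)]
```

Each update moves the raw score toward the correct label, and learning_rate shrinks the size of that move.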
- loss : {'deviance', 'exponential'}, default='deviance'
The loss function to be optimized. 'deviance' refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss 'exponential' gradient boosting recovers the AdaBoost algorithm.
- learning_rate : float, default=0.1
Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
- n_estimators : int, default=100
The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
- subsample : float, default=1.0
The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
- criterion : {'friedman_mse', 'squared_error', 'mse', 'mae'}, default='friedman_mse'
The function to measure the quality of a split. Supported criteria are 'friedman_mse' for the mean squared error with improvement score by Friedman, 'squared_error' for mean squared error, and 'mae' for the mean absolute error. The default value of 'friedman_mse' is generally the best as it can provide a better approximation in some cases.
New in version 0.18.
Deprecated since version 0.24: criterion='mae' is deprecated and will be removed in version 1.1 (renaming of 0.26). Use criterion='friedman_mse' or 'squared_error' instead, as trees should use a squared error criterion in Gradient Boosting.
Deprecated since version 1.0: Criterion 'mse' was deprecated in v1.0 and will be removed in version 1.2. Use criterion='squared_error' which is equivalent.
- min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
- min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- max_depth : int, default=3
The maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
- min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
- init : estimator or 'zero', default=None
An estimator object that is used to compute the initial predictions. init has to provide fit() and predict_proba(). If 'zero', the initial raw predictions are set to zero. By default, a DummyEstimator predicting the classes priors is used.
- random_state : int, RandomState instance or None, default=None
Controls the random seed given to each Tree estimator at each boosting iteration. In addition, it controls the random permutation of the features at each split (see Notes for more details). It also controls the random splitting of the training data to obtain a validation set if n_iter_no_change is not None. Pass an int for reproducible output across multiple function calls. See Glossary.
- max_features{‘auto’, ‘sqrt’, ‘log2’}, int or float, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If ‘auto’, then max_features=sqrt(n_features).
If ‘sqrt’, then max_features=sqrt(n_features).
If ‘log2’, then max_features=log2(n_features).
If None, then max_features=n_features.
Choosing max_features < n_features leads to a reduction of variance and an increase in bias.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
- verboseint, default=0
Enable verbose output. If 1 then it prints progress and performance once in a while (the more trees the lower the frequency). If greater than 1 then it prints progress and performance for every tree.
- max_leaf_nodesint, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- warm_startbool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution. See the Glossary.
- validation_fractionfloat, default=0.1
The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.
New in version 0.20.
- n_iter_no_changeint, default=None
n_iter_no_change is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside validation_fraction size of the training data as validation and terminate training when validation score is not improving in all of the previous n_iter_no_change numbers of iterations. The split is stratified.
New in version 0.20.
- tolfloat, default=1e-4
Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
New in version 0.20.
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.
New in version 0.22.
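The weighted impurity decrease that min_impurity_decrease is compared against can be worked through by hand. The following is a minimal sketch in plain Python, not scikit-learn's internal code; the helper name weighted_impurity_decrease, the sample counts and the impurity values are all made up for illustration.

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """Evaluate the documented formula for one candidate split.

    n      -- total number of (weighted) samples
    n_t    -- samples at the current node
    n_t_l  -- samples sent to the left child
    n_t_r  -- samples sent to the right child
    """
    return n_t / n * (
        impurity
        - n_t_r / n_t * right_impurity
        - n_t_l / n_t * left_impurity
    )


# Hypothetical node: 40 of 100 samples reach it, and the split sends
# 10 samples to a pure left child and 30 samples to the right child.
decrease = weighted_impurity_decrease(
    n=100, n_t=40, n_t_l=10, n_t_r=30,
    impurity=0.5, left_impurity=0.0, right_impurity=0.4,
)

# The node is split only if the decrease reaches the threshold.
min_impurity_decrease = 0.05
should_split = decrease >= min_impurity_decrease
print(round(decrease, 4), should_split)
```

Because the decrease is scaled by N_t / N, the same local improvement counts for less at nodes that hold only a small share of the data.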
- n_estimators_int
The number of estimators as selected by early stopping (if n_iter_no_change is specified). Otherwise it is set to n_estimators.
New in version 0.20.
- feature_importances_ndarray of shape (n_features,)
The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- oob_improvement_ndarray of shape (n_estimators,)
The improvement in loss (= deviance) on the out-of-bag samples relative to the previous iteration. oob_improvement_[0] is the improvement in loss of the first stage over the init estimator. Only available if subsample < 1.0.
- train_score_ndarray of shape (n_estimators,)
The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample. If subsample == 1 this is the deviance on the training data.
- loss_LossFunction
The concrete LossFunction object.
- init_estimator
The estimator that provides the initial predictions. Set via the init argument or loss.init_estimator.
- estimators_ndarray of DecisionTreeRegressor of shape (n_estimators, loss_.K)
The collection of fitted sub-estimators. loss_.K is 1 for binary classification, otherwise n_classes.
- classes_ndarray of shape (n_classes,)
The classes labels.
- n_features_int
The number of data features.
Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
- n_features_in_int
Number of features seen during fit.
New in version 0.24.
- feature_names_in_ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- n_classes_int
The number of classes.
- max_features_int
The inferred value of max_features.
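The relationship between n_estimators, n_iter_no_change and the fitted n_estimators_ attribute can be seen on a small synthetic problem. This is a sketch that calls scikit-learn's GradientBoostingClassifier directly rather than the simpleml wrapper; the dataset sizes and hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary problem; the sizes are arbitrary.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=200,         # upper bound on boosting stages
    n_iter_no_change=5,       # an integer here enables early stopping
    validation_fraction=0.2,  # held-out share used to monitor the loss
    tol=1e-4,
    random_state=0,
)
clf.fit(X, y)

# n_estimators_ reports how many stages were kept; estimators_ holds
# one row of fitted trees per kept stage.
print(clf.n_estimators_, clf.estimators_.shape)
```

With n_iter_no_change left at None, n_estimators_ would simply equal n_estimators.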
- HistGradientBoostingClassifier : Histogram-based Gradient Boosting Classification Tree.
- sklearn.tree.DecisionTreeClassifier : A decision tree classifier.
- RandomForestClassifier : A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- AdaBoostClassifier : A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
J. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29, No. 5, 2001.
J. Friedman, Stochastic Gradient Boosting, 1999.
T. Hastie, R. Tibshirani and J. Friedman. Elements of Statistical Learning Ed. 2, Springer, 2009.
The following example shows how to fit a gradient boosting classifier with 100 decision stumps as weak learners.
>>> from sklearn.datasets import make_hastie_10_2
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> X, y = make_hastie_10_2(random_state=0)
>>> X_train, X_test = X[:2000], X[2000:]
>>> y_train, y_test = y[:2000], y[2000:]
>>> clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
...     max_depth=1, random_state=0).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.913...
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnHistGradientBoostingClassifier(loss='auto', *, learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, categorical_features=None, monotonic_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None)[source]
Bases:
sklearn.ensemble.HistGradientBoostingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
Histogram-based Gradient Boosting Classification Tree.
This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10 000).
This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child consequently. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.
This implementation is inspired by LightGBM.
Read more in the User Guide.
New in version 0.21.
- loss{‘auto’, ‘binary_crossentropy’, ‘categorical_crossentropy’}, default=’auto’
The loss function to use in the boosting process. ‘binary_crossentropy’ (also known as logistic loss) is used for binary classification and generalizes to ‘categorical_crossentropy’ for multiclass classification. ‘auto’ will automatically choose either loss depending on the nature of the problem.
- learning_ratefloat, default=0.1
The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaves values. Use 1 for no shrinkage.
- max_iterint, default=100
The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built.
- max_leaf_nodesint or None, default=31
The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit.
- max_depthint or None, default=None
The maximum depth of each tree. The depth of a tree is the number of edges to go from the root to the deepest leaf. Depth isn’t constrained by default.
- min_samples_leafint, default=20
The minimum number of samples per leaf. For small datasets with less than a few hundred samples, it is recommended to lower this value since only very shallow trees would be built.
- l2_regularizationfloat, default=0
The L2 regularization parameter. Use 0 for no regularization.
- max_binsint, default=255
The maximum number of bins to use for non-missing values. Before training, each feature of the input array X is binned into integer-valued bins, which allows for a much faster training stage. Features with a small number of unique values may use less than max_bins bins. In addition to the max_bins bins, one more bin is always reserved for missing values. Must be no larger than 255.
- categorical_featuresarray-like of {bool, int} of shape (n_features) or shape (n_categorical_features,), default=None
Indicates the categorical features.
None : no feature will be considered categorical.
boolean array-like : boolean mask indicating categorical features.
integer array-like : integer indices indicating categorical features.
For each categorical feature, there must be at most max_bins unique categories, and each categorical value must be in [0, max_bins -1].
Read more in the User Guide.
New in version 0.24.
- monotonic_cstarray-like of int of shape (n_features), default=None
Indicates the monotonic constraint to enforce on each feature. -1, 1 and 0 respectively correspond to a negative constraint, positive constraint and no constraint. Read more in the User Guide.
New in version 0.23.
- warm_startbool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble. For results to be valid, the estimator should be re-trained on the same data only. See the Glossary.
- early_stopping‘auto’ or bool, default=’auto’
If ‘auto’, early stopping is enabled if the sample size is larger than 10000. If True, early stopping is enabled, otherwise early stopping is disabled.
New in version 0.23.
- scoringstr or callable or None, default=’loss’
Scoring parameter to use for early stopping. It can be a single string (see scoring_parameter) or a callable (see scoring). If None, the estimator’s default scorer is used. If scoring='loss', early stopping is checked w.r.t the loss value. Only used if early stopping is performed.
- validation_fractionint or float or None, default=0.1
Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data. Only used if early stopping is performed.
- n_iter_no_changeint, default=10
Used to determine when to “early stop”. The fitting process is stopped when none of the last n_iter_no_change scores are better than the (n_iter_no_change - 1)-th-to-last one, up to some tolerance. Only used if early stopping is performed.
- tolfloat, default=1e-7
The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score.
- verboseint, default=0
The verbosity level. If not zero, print some information about the fitting process.
- random_stateint, RandomState instance or None, default=None
Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. Pass an int for reproducible output across multiple function calls. See Glossary.
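The interaction between n_iter_no_change and tol is easier to follow with a small stand-alone check. The function below is a sketch of one reading of the rule described above, not the library's code: it assumes that higher scores are better and that the reference score is the one immediately preceding the last n_iter_no_change scores. The name should_stop and the score lists are hypothetical.

```python
def should_stop(scores, n_iter_no_change=10, tol=1e-7):
    """Return True when none of the last n_iter_no_change scores beats
    the reference score (the one just before them) by more than tol.

    Scores follow the "higher is better" convention.
    """
    reference_position = n_iter_no_change + 1
    if len(scores) < reference_position:
        return False  # not enough history yet
    reference_score = scores[-reference_position] + tol
    recent_scores = scores[-n_iter_no_change:]
    return not any(score > reference_score for score in recent_scores)


# Hypothetical validation scores after each boosting iteration.
plateau = [0.50, 0.60, 0.60, 0.60, 0.60]
improving = [0.50, 0.60, 0.70, 0.80]
slow = [0.60, 0.61, 0.61, 0.61]

print(should_stop(plateau, n_iter_no_change=3))
print(should_stop(improving, n_iter_no_change=3))
# A larger tol makes the small gain in `slow` count as no improvement.
print(should_stop(slow, n_iter_no_change=3, tol=0.05))
```

Raising tol lifts the bar that recent scores must clear, which is why a higher tolerance makes early stopping more likely.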
- classes_array, shape = (n_classes,)
Class labels.
- do_early_stopping_bool
Indicates whether early stopping is used during training.
- n_iter_int
The number of iterations as selected by early stopping, depending on the early_stopping parameter. Otherwise it corresponds to max_iter.
- n_trees_per_iteration_int
The number of trees that are built at each iteration. This is equal to 1 for binary classification, and to n_classes for multiclass classification.
- train_score_ndarray, shape (n_iter_+1,)
The scores at each iteration on the training data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. If scoring is not ‘loss’, scores are computed on a subset of at most 10 000 samples. Empty if no early stopping.
- validation_score_ndarray, shape (n_iter_+1,)
The scores at each iteration on the held-out validation data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. Empty if no early stopping or if validation_fraction is None.
- is_categorical_ndarray, shape (n_features, ) or None
Boolean mask for the categorical features. None if there are no categorical features.
- n_features_in_int
Number of features seen during fit.
New in version 0.24.
- feature_names_in_ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- GradientBoostingClassifier : Exact gradient boosting method that does not scale as well on datasets with a large number of samples.
- sklearn.tree.DecisionTreeClassifier : A decision tree classifier.
- RandomForestClassifier : A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- AdaBoostClassifier : A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnRandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]
Bases:
sklearn.ensemble.RandomForestClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
A random forest classifier.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Read more in the User Guide.
- n_estimatorsint, default=100
The number of trees in the forest.
Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.
- criterion{“gini”, “entropy”}, default=”gini”
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.
- max_depthint, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- min_samples_splitint or float, default=2
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
Changed in version 0.18: Added float values for fractions.
- min_samples_leafint or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
- min_weight_fraction_leaffloat, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- max_features{“auto”, “sqrt”, “log2”}, int or float, default=”auto”
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
- max_leaf_nodesint, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- min_impurity_decreasefloat, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
New in version 0.19.
- bootstrapbool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- oob_scorebool, default=False
Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
- n_jobsint, default=None
The number of jobs to run in parallel. fit(), predict(), decision_path() and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- random_stateint, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
- verboseint, default=0
Controls the verbosity when fitting and predicting.
- warm_startbool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.
- class_weight{“balanced”, “balanced_subsample”}, dict or list of dicts, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
- ccp_alphanon-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details.
New in version 0.22.
- max_samplesint or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
If None (default), then draw X.shape[0] samples.
If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0].
New in version 0.22.
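The “balanced” class_weight formula, n_samples / (n_classes * np.bincount(y)), can be reproduced without any library. The following is a plain-Python sketch; the helper name balanced_class_weights and the label list are hypothetical, and it uses collections.Counter in place of np.bincount so that labels need not be small non-negative integers.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequencies:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}


# Hypothetical imbalanced labels: three samples of class 0, one of class 1.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

Under this weighting every class contributes the same total weight, n_samples / n_classes, so the rare class is not drowned out by the frequent one.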
- base_estimator_DecisionTreeClassifier
The child estimator template used to create the collection of fitted sub-estimators.
- estimators_list of DecisionTreeClassifier
The collection of fitted sub-estimators.
- classes_ndarray of shape (n_classes,) or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output problem).
- n_classes_int or list
The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem).
- n_features_int
The number of features when fit is performed.
Deprecated since version 1.0: Attribute n_features_ was deprecated in version 1.0 and will be removed in 1.2. Use n_features_in_ instead.
- n_features_in_int
Number of features seen during fit.
New in version 0.24.
- feature_names_in_ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- n_outputs_int
The number of outputs when fit is performed.
- feature_importances_ndarray of shape (n_features,)
The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.
- oob_score_float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True.
- oob_decision_function_ndarray of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs)
Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True.
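The out-of-bag attributes only appear when both bootstrap and oob_score are enabled. The sketch below fits scikit-learn's RandomForestClassifier directly (not the simpleml wrapper) on synthetic data; the dataset shape and the max_samples value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,     # required for out-of-bag estimates
    oob_score=True,
    max_samples=0.8,    # each tree draws 0.8 * 300 = 240 samples
    random_state=0,
).fit(X, y)

# Accuracy estimated from the samples each tree did not see.
print(clf.oob_score_)
# One row of class probabilities per training sample.
print(clf.oob_decision_function_.shape)
```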
- sklearn.tree.DecisionTreeClassifier : A decision tree classifier.
- sklearn.ensemble.ExtraTreesClassifier : Ensemble of extremely randomized tree classifiers.
The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
[1] Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(...)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
- class simpleml.models.classifiers.sklearn.ensemble.WrappedSklearnVotingClassifier(estimators, *, voting='hard', weights=None, n_jobs=None, flatten_transform=True, verbose=False)[source]
Bases:
sklearn.ensemble.VotingClassifier, simpleml.models.classifiers.external_models.ClassificationExternalModelMixin
Soft Voting/Majority Rule classifier for unfitted estimators.
Read more in the User Guide.
New in version 0.17.
- estimatorslist of (str, estimator) tuples
Invoking the fit method on the VotingClassifier will fit clones of those original estimators that will be stored in the class attribute self.estimators_. An estimator can be set to 'drop' using set_params.
Changed in version 0.21: 'drop' is accepted. Using None was deprecated in 0.22 and support was removed in 0.24.
- voting{‘hard’, ‘soft’}, default=’hard’
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
- weightsarray-like of shape (n_classifiers,), default=None
Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights if None.
- n_jobsint, default=None
The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
New in version 0.18.
- flatten_transformbool, default=True
Affects shape of transform output only when voting=’soft’. If voting=’soft’ and flatten_transform=True, transform method returns matrix with shape (n_samples, n_classifiers * n_classes). If flatten_transform=False, it returns (n_classifiers, n_samples, n_classes).
- verbosebool, default=False
If True, the time elapsed while fitting will be printed as it is completed.
New in version 0.23.
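The difference between voting='hard' and voting='soft', and the role of weights, can be illustrated for a single sample without fitting anything. This is a simplified plain-Python sketch of the two rules as described above, not the VotingClassifier implementation; the function names, the tie-breaking choice and the probability values are hypothetical.

```python
def hard_vote(predicted_labels, weights=None):
    """Weighted majority vote over one sample's predicted class labels."""
    weights = weights or [1] * len(predicted_labels)
    totals = {}
    for label, weight in zip(predicted_labels, weights):
        totals[label] = totals.get(label, 0) + weight
    # Ties go to the smallest label in this sketch.
    return max(sorted(totals), key=lambda label: totals[label])


def soft_vote(probabilities, weights=None):
    """Argmax of the weighted average of per-classifier class probabilities."""
    weights = weights or [1] * len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / sum(weights)
        for k in range(n_classes)
    ]
    return averaged.index(max(averaged))


# Hypothetical outputs of three classifiers for one sample, classes 0 and 1.
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
labels = [p.index(max(p)) for p in probabilities]  # each classifier's pick

print(hard_vote(labels))         # two of the three classifiers pick class 1
print(soft_vote(probabilities))  # the confident first classifier dominates
```

The two rules can disagree when one classifier is much more confident than the others, which is why soft voting is recommended only for well-calibrated classifiers.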
- estimators_list of classifiers
The collection of fitted sub-estimators as defined in estimators that are not ‘drop’.
- named_estimators_Bunch
Attribute to access any fitted sub-estimators by name.
New in version 0.20.
- le_LabelEncoder
Transformer used to encode the labels during fit and decode during prediction.
- classes_ndarray of shape (n_classes,)
The classes labels.
- n_features_in_int
Number of features seen during fit. Only defined if the underlying classifier exposes such an attribute when fit.
New in version 0.24.
- feature_names_in_ndarray of shape (n_features_in_,)
Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.
New in version 1.0.
VotingRegressor : Prediction voting regressor.
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf1 = VotingClassifier(estimators=[
...         ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
>>> eclf1 = eclf1.fit(X, y)
>>> print(eclf1.predict(X))
[1 1 1 2 2 2]
>>> np.array_equal(eclf1.named_estimators_.lr.predict(X),
...                eclf1.named_estimators_['lr'].predict(X))
True
>>> eclf2 = VotingClassifier(estimators=[
...         ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...         voting='soft')
>>> eclf2 = eclf2.fit(X, y)
>>> print(eclf2.predict(X))
[1 1 1 2 2 2]
>>> eclf3 = VotingClassifier(estimators=[
...         ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
...         voting='soft', weights=[2,1,1],
...         flatten_transform=True)
>>> eclf3 = eclf3.fit(X, y)
>>> print(eclf3.predict(X))
[1 1 1 2 2 2]
>>> print(eclf3.transform(X).shape)
(6, 6)