yobx.sklearn.tests_helper
- yobx.sklearn.tests_helper.dump_data_and_model(data: ndarray, model: BaseEstimator, model_onnx: ModelProto, basename: str)
Validate an ONNX model against the original scikit-learn model.
Runs data through both onnxruntime and ExtendedReferenceEvaluator and compares both outputs to the predictions produced by model:
If the model has predict_proba(), both labels and probabilities are compared (outputs 0 and 1 respectively).
Else if the model has transform(), the transformed output is compared (output 0).
Otherwise the predict() output is compared (output 0).
All comparisons use _assert_close() with default tolerance.
- Parameters:
data – Input feature matrix (np.ndarray) passed to all evaluators.
model – The fitted scikit-learn estimator used as the reference.
model_onnx – The ONNX representation of model to validate.
basename – Short identifier used as a prefix in assertion messages to make failures easier to locate.
- Raises:
AssertionError – If any output differs beyond the tolerance defined in _assert_close().
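The comparison rule described above can be sketched as a small dispatch function. This is an illustration only, not the library's implementation: reference_outputs and the two toy classes are hypothetical names, and the real helper additionally runs both ONNX evaluators and calls _assert_close().

```python
def reference_outputs(model, data):
    """Return the reference outputs in the order they would be compared.

    Hypothetical helper mirroring the documented priority:
    predict_proba() first, then transform(), then predict().
    """
    if hasattr(model, "predict_proba"):
        # Classifier: output 0 holds the labels, output 1 the probabilities.
        return [model.predict(data), model.predict_proba(data)]
    if hasattr(model, "transform"):
        # Transformer: a single output 0.
        return [model.transform(data)]
    # Anything else (for example a regressor): a single output 0.
    return [model.predict(data)]


class ToyClassifier:
    """Stand-in for a fitted classifier (always predicts class 0)."""

    def predict(self, data):
        return [0 for _ in data]

    def predict_proba(self, data):
        return [[1.0, 0.0] for _ in data]


class ToyTransformer:
    """Stand-in for a fitted transformer (doubles every value)."""

    def transform(self, data):
        return [[2 * v for v in row] for row in data]


rows = [[1.0, 2.0], [3.0, 4.0]]
print(len(reference_outputs(ToyClassifier(), rows)))   # two outputs
print(len(reference_outputs(ToyTransformer(), rows)))  # one output
```

Checking predict_proba() before transform() matters for estimators that expose both: they are treated as classifiers.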
- yobx.sklearn.tests_helper.fit_classification_model(model, n_classes, is_int=False, pos_features=False, label_string=False, random_state=42, is_bool=False, n_features=20, n_redundant=None, n_repeated=None, cls_dtype=None, is_double=False, n_samples=250)
Fit a classification model on a synthetic dataset and return it with test data.
Generates a classification dataset using sklearn.datasets.make_classification(), fits model on the training split, and returns the fitted model together with the held-out test set.
- Parameters:
model – An unfitted scikit-learn classifier.
n_classes – Number of target classes.
is_int – When True, cast features to np.int64.
pos_features – When True, take the absolute value of all features so that every entry is non-negative.
label_string – When True, convert integer labels to string labels of the form "cl<i>".
random_state – Random seed forwarded to make_classification() and train_test_split().
is_bool – When True, cast features to bool (implies is_int for the intermediate cast).
n_features – Total number of features in the generated dataset.
n_redundant – Number of redundant features. Defaults to min(2, n_features - min(7, n_features)), that is, 2 when n_features is at least 9, 1 when it is 8, and 0 otherwise.
n_repeated – Number of repeated features. Defaults to 0.
cls_dtype – If provided, cast the label array to this dtype before fitting.
is_double – When True, cast features to np.float64 after the integer/bool cast step.
n_samples – Number of samples in the generated dataset.
- Returns:
A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
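The default for n_redundant is easier to read with a few values worked out. The snippet below only evaluates the expression quoted in the parameter description; default_n_redundant is a hypothetical name used for illustration.

```python
def default_n_redundant(n_features):
    # Same expression as in the n_redundant description.
    return min(2, n_features - min(7, n_features))


# Small feature counts leave no room for redundant features.
for n_features in (4, 7, 8, 9, 20):
    print(n_features, default_n_redundant(n_features))
```

The default only reaches 2 once n_features is 9 or more, so the default n_features=20 always yields two redundant features.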
- yobx.sklearn.tests_helper.fit_clustering_model(model, n_classes, is_int=False, pos_features=False, label_string=False, random_state=42, is_bool=False, n_features=20, n_redundant=None, n_repeated=None)
Fit a clustering model on a synthetic dataset and return it with test data.
Generates a classification dataset (used purely for the feature matrix) using sklearn.datasets.make_classification(), fits model on the training split (labels are not used), and returns the fitted model together with the held-out test set.
- Parameters:
model – An unfitted scikit-learn clustering estimator (e.g. sklearn.cluster.KMeans).
n_classes – Number of classes used when generating the dataset (acts as a proxy for the number of natural clusters in the feature space).
is_int – When True, cast features to np.int64.
pos_features – When True, take the absolute value of all features so that every entry is non-negative.
label_string – When True, convert integer labels to string labels of the form "cl<i>" (not used for fitting but kept for API symmetry).
random_state – Random seed forwarded to make_classification() and train_test_split().
is_bool – When True, cast features to bool (implies is_int for the intermediate cast).
n_features – Total number of features in the generated dataset.
n_redundant – Number of redundant features. Defaults to min(2, n_features - min(7, n_features)), that is, 2 when n_features is at least 9, 1 when it is 8, and 0 otherwise.
n_repeated – Number of repeated features. Defaults to 0.
- Returns:
A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
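The feature flags is_int, pos_features and is_bool can be pictured as a short chain of NumPy casts. The sketch below is an assumption about how they combine, not the helper's source: prepare_features is a hypothetical name, and the order (integer cast, absolute value, bool cast) is inferred from the note that is_bool implies an intermediate integer cast.

```python
import numpy as np


def prepare_features(X, is_int=False, pos_features=False, is_bool=False):
    """Hypothetical sketch of the documented feature casts."""
    if is_int or is_bool:
        # Integer cast truncates toward zero; is_bool goes through it too.
        X = X.astype(np.int64)
    if pos_features:
        # Make every entry non-negative.
        X = np.abs(X)
    if is_bool:
        # Non-zero integers become True, zeros become False.
        X = X.astype(bool)
    return X


X = np.array([[-1.5, 0.2], [2.7, -3.1]])
print(prepare_features(X, is_int=True, pos_features=True))
print(prepare_features(X, is_bool=True))
```

Because of the intermediate integer cast, a float such as 0.2 ends up as False under is_bool, whereas a direct float-to-bool cast would give True.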
- yobx.sklearn.tests_helper.fit_multi_output_classification_model(model, n_classes=3, n_samples=100, n_features=4, n_informative=5, n_outputs=2)
Fit a multi-output classification model on a synthetic integer dataset.
Generates a random integer feature matrix and a multi-column integer label matrix, fits a RandomForestClassifier (the model parameter is accepted for API consistency but is replaced internally), and returns the fitted model together with a small test set.
- Parameters:
model – Accepted for API consistency; internally replaced by a RandomForestClassifier.
n_classes – Number of distinct integer classes per output column.
n_samples – Number of training samples.
n_features – Number of feature columns.
n_informative – Upper bound for the random integer values in the feature matrix.
n_outputs – Number of output columns (targets).
- Returns:
A tuple (fitted_model, X_test) where X_test contains 10 samples drawn from the same integer distribution as the training set.
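To make the shapes concrete, the data described above can be approximated with NumPy alone. This is a sketch under stated assumptions: make_integer_multi_output and its seed argument are hypothetical, and treating n_informative as an exclusive upper bound is a guess, since the description only calls it an upper bound.

```python
import numpy as np


def make_integer_multi_output(n_classes=3, n_samples=100, n_features=4,
                              n_informative=5, n_outputs=2, seed=0):
    """Hypothetical generator for integer features and multi-column labels."""
    rng = np.random.default_rng(seed)
    # Features: integers in [0, n_informative); exclusive bound is assumed.
    X = rng.integers(0, n_informative, size=(n_samples, n_features))
    # Labels: one integer column per output, values in [0, n_classes).
    y = rng.integers(0, n_classes, size=(n_samples, n_outputs))
    # Test set: 10 samples from the same distribution as X.
    X_test = rng.integers(0, n_informative, size=(10, n_features))
    return X, y, X_test


X, y, X_test = make_integer_multi_output()
print(X.shape, y.shape, X_test.shape)
```

A label matrix with one column per output is the layout a RandomForestClassifier accepts directly for multi-output fitting.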
- yobx.sklearn.tests_helper.fit_multilabel_classification_model(model, n_classes=5, n_labels=2, n_samples=200, n_features=20, is_int=False)
Fit a multilabel classification model on a synthetic dataset.
Generates a multilabel dataset using sklearn.datasets.make_multilabel_classification(), fits model on the training split, and returns the fitted model together with the held-out test set.
- Parameters:
model – An unfitted scikit-learn multilabel classifier.
n_classes – Number of classes (output labels).
n_labels – Average number of labels per sample.
n_samples – Total number of samples in the generated dataset.
n_features – Number of features in the generated dataset.
is_int – When True, cast features to np.int64; otherwise cast to np.float32.
- Returns:
A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
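The same steps can be reproduced by hand with scikit-learn, which helps when a test needs to inspect the labels as well. The sketch below follows the description for the default is_int=False case; the 50/50 split, the random_state value and the choice of KNeighborsClassifier are assumptions made for illustration, not details taken from the helper.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Dataset with the helper's documented default sizes.
X, y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, n_labels=2, random_state=42
)
# is_int=False branch: features are cast to float32.
X = X.astype(np.float32)

# Split ratio and seed are assumptions.
X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Any estimator that accepts a 2-D indicator matrix works here.
model = KNeighborsClassifier().fit(X_train, y_train)
pred = model.predict(X_test)
print(X_test.shape, pred.shape)
```

The label matrix is a dense indicator array with one column per class, so predictions have shape (n_test_samples, n_classes).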
- yobx.sklearn.tests_helper.fit_regression_model(model, is_int=False, n_targets=1, is_bool=False, factor=1.0, n_features=10, n_samples=250, n_informative=10)
Fit a regression model on a synthetic dataset and return it with test data.
Generates a regression dataset using sklearn.datasets.make_regression(), fits model on the training split, and returns the fitted model together with the held-out test set.
- Parameters:
model – An unfitted scikit-learn regressor.
is_int – When True, cast features to np.int64.
n_targets – Number of regression targets.
is_bool – When True, cast features to bool (implies is_int for the intermediate cast).
factor – Multiplicative scaling factor applied to the target values after generation.
n_features – Total number of features in the generated dataset.
n_samples – Number of samples in the generated dataset.
n_informative – Number of informative features used by make_regression().
- Returns:
A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
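The effect of factor is easiest to see with a linear model, where scaling the targets scales the predictions by the same amount. The sketch below rebuilds the documented steps by hand; fit_scaled is a hypothetical name, and the split ratio and seeds are assumptions rather than values taken from the helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_scaled(factor):
    """Hypothetical re-creation: generate, scale targets, split, fit."""
    X, y = make_regression(
        n_samples=250, n_features=10, n_informative=10,
        n_targets=1, random_state=42,
    )
    # factor multiplies the targets after generation.
    y = y * factor
    # Split ratio and seed are assumptions.
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=0.5, random_state=42
    )
    return LinearRegression().fit(X_train, y_train), X_test


model_1, X_test = fit_scaled(1.0)
model_10, _ = fit_scaled(10.0)
# For a linear model, predictions scale with the targets.
ratio_ok = np.allclose(
    model_10.predict(X_test), 10.0 * model_1.predict(X_test),
    rtol=1e-6, atol=1e-6,
)
print(ratio_ok)
```

A large factor is a convenient way to stress numerical tolerances when comparing a converted model against the original, since absolute errors grow with the magnitude of the outputs.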