yobx.sklearn.tests_helper

yobx.sklearn.tests_helper.dump_data_and_model(data: ndarray, model: BaseEstimator, model_onnx: ModelProto, basename: str)

Validate an ONNX model against the original scikit-learn model.

Runs data through both onnxruntime and ExtendedReferenceEvaluator, then compares the outputs of each runtime to the predictions produced by model, as follows:

If the model has predict_proba(), both labels and probabilities are compared (outputs 0 and 1 respectively).
Else if the model has transform(), the transformed output is compared (output 0).
Otherwise the predict() output is compared (output 0).

All comparisons use _assert_close() with default tolerance.
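
The output-selection rule above can be sketched as a small dispatch function. This is a minimal illustration, not the helper's actual code: expected_outputs and the two stand-in classes are hypothetical names introduced here.

```python
def expected_outputs(model, data):
    """Return the reference outputs in the order they would be compared.

    Hypothetical helper mirroring the rule described above.
    """
    if hasattr(model, "predict_proba"):
        # Classifier: output 0 holds the labels, output 1 the probabilities.
        return [model.predict(data), model.predict_proba(data)]
    if hasattr(model, "transform"):
        # Transformer: a single output (output 0).
        return [model.transform(data)]
    # Anything else: the predict() result is the single output.
    return [model.predict(data)]


class FakeClassifier:
    """Stand-in estimator exposing predict() and predict_proba()."""

    def predict(self, data):
        return [0 for _ in data]

    def predict_proba(self, data):
        return [[1.0, 0.0] for _ in data]


class FakeTransformer:
    """Stand-in estimator exposing only transform()."""

    def transform(self, data):
        return [[2 * v for v in row] for row in data]
```

Because predict_proba() is checked first, an estimator that exposes both predict_proba() and transform() is treated as a classifier in this sketch.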

Parameters:

data – Input feature matrix (np.ndarray) passed to all evaluators.
model – The fitted scikit-learn estimator used as the reference.
model_onnx – The ONNX representation of model to validate.
basename – Short identifier used as a prefix in assertion messages to make failures easier to locate.

Raises:

AssertionError – If any output differs beyond the tolerance defined in _assert_close().
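
To show why a basename prefix helps when reading test failures, here is a minimal pure-Python sketch of a prefixed comparison. It is not the implementation of _assert_close(): the name assert_close_with_prefix and the rtol and atol values are assumptions made for the example.

```python
import math


def assert_close_with_prefix(expected, got, basename, rtol=1e-5, atol=1e-5):
    """Compare two flat sequences of floats and prefix failures with basename.

    Hypothetical stand-in for _assert_close(); rtol and atol are assumed values.
    """
    if len(expected) != len(got):
        raise AssertionError(
            f"{basename}: length mismatch {len(expected)} != {len(got)}"
        )
    for i, (e, g) in enumerate(zip(expected, got)):
        if not math.isclose(e, g, rel_tol=rtol, abs_tol=atol):
            raise AssertionError(
                f"{basename}: output differs at index {i}: {e} != {g}"
            )
```

With a distinct basename per test case, the assertion message alone identifies which model produced the mismatch.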

yobx.sklearn.tests_helper.fit_classification_model(model, n_classes, is_int=False, pos_features=False, label_string=False, random_state=42, is_bool=False, n_features=20, n_redundant=None, n_repeated=None, cls_dtype=None, is_double=False, n_samples=250)

Fit a classification model on a synthetic dataset and return it with test data.

Generates a classification dataset using sklearn.datasets.make_classification(), fits model on the training split, and returns the fitted model together with the held-out test set.

Parameters:

model – An unfitted scikit-learn classifier.
n_classes – Number of target classes.
is_int – When True, cast features to np.int64.
pos_features – When True, take the absolute value of all features so that every entry is non-negative.
label_string – When True, convert integer labels to string labels of the form "cl<i>".
random_state – Random seed forwarded to make_classification() and train_test_split().
is_bool – When True, cast features to bool. The features first go through the same integer cast as is_int.
n_features – Total number of features in the generated dataset.
n_redundant – Number of redundant features. Defaults to min(2, n_features - min(7, n_features)): 2 when n_features is at least 9, 1 when n_features is 8, and 0 when n_features is 7 or fewer.
n_repeated – Number of repeated features. Defaults to 0.
cls_dtype – If provided, cast the label array to this dtype before fitting.
is_double – When True, cast features to np.float64 after the integer/bool cast step.
n_samples – Number of samples in the generated dataset.

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
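
The feature casts and label conversion described in the parameter list can be sketched with NumPy as follows. This is an illustration under assumptions, not the helper's code: cast_features and string_labels are hypothetical names, and the position of the pos_features step relative to the casts is assumed.

```python
import numpy as np


def cast_features(X, is_int=False, is_bool=False, is_double=False,
                  pos_features=False):
    """Apply the documented feature casts (hypothetical helper, order assumed)."""
    if pos_features:
        # Make every entry non-negative.
        X = np.abs(X)
    if is_int or is_bool:
        # is_bool goes through the integer cast first.
        X = X.astype(np.int64)
    if is_bool:
        X = X.astype(bool)
    if is_double:
        # Applied after the integer/bool step, as documented.
        X = X.astype(np.float64)
    return X


def string_labels(y):
    """Turn integer labels into strings of the form "cl<i>"."""
    return np.array([f"cl{i}" for i in y])
```

Note that combining is_int with is_double yields float64 features that hold integer values, since the integer cast truncates before the float cast is applied.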

yobx.sklearn.tests_helper.fit_clustering_model(model, n_classes, is_int=False, pos_features=False, label_string=False, random_state=42, is_bool=False, n_features=20, n_redundant=None, n_repeated=None)

Fit a clustering model on a synthetic dataset and return it with test data.

Generates a classification dataset (used purely for the feature matrix) using sklearn.datasets.make_classification(), fits model on the training split (labels are not used), and returns the fitted model together with the held-out test set.

Parameters:

model – An unfitted scikit-learn clustering estimator (e.g. sklearn.cluster.KMeans).
n_classes – Number of classes used when generating the dataset (acts as a proxy for the number of natural clusters in the feature space).
is_int – When True, cast features to np.int64.
pos_features – When True, take the absolute value of all features so that every entry is non-negative.
label_string – When True, convert integer labels to string labels of the form "cl<i>" (not used for fitting but kept for API symmetry).
random_state – Random seed forwarded to make_classification() and train_test_split().
is_bool – When True, cast features to bool. The features first go through the same integer cast as is_int.
n_features – Total number of features in the generated dataset.
n_redundant – Number of redundant features. Defaults to min(2, n_features - min(7, n_features)): 2 when n_features is at least 9, 1 when n_features is 8, and 0 when n_features is 7 or fewer.
n_repeated – Number of repeated features. Defaults to 0.

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
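
The default for n_redundant is easiest to read when evaluated for a few feature counts. The function name default_n_redundant is hypothetical; the expression is the one quoted in the parameter description.

```python
def default_n_redundant(n_features):
    """Default number of redundant features, per the documented formula."""
    return min(2, n_features - min(7, n_features))


# Print the default for a range of small and large feature counts.
for n_features in (4, 7, 8, 9, 20):
    print(n_features, default_n_redundant(n_features))
```

The inner min(7, n_features) reserves up to seven features before any redundant ones are added, which is why the default drops to 0 for seven or fewer features.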

yobx.sklearn.tests_helper.fit_multi_output_classification_model(model, n_classes=3, n_samples=100, n_features=4, n_informative=5, n_outputs=2)

Fit a multi-output classification model on a synthetic integer dataset.

Generates a random integer feature matrix and a multi-column integer label matrix, fits a RandomForestClassifier (the model parameter is accepted for API consistency but is replaced internally), and returns the fitted model together with a small test set.

Parameters:

model – Accepted for API consistency; internally replaced by a RandomForestClassifier.
n_classes – Number of distinct integer classes per output column.
n_samples – Number of training samples.
n_features – Number of feature columns.
n_informative – Upper bound for the random integer values in the feature matrix.
n_outputs – Number of output columns (targets).

Returns:

A tuple (fitted_model, X_test) where X_test contains 10 samples drawn from the same integer distribution as the training set.
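
The shape of the data this helper works with can be sketched with NumPy. This is a guess at the general approach, not the helper's code: make_integer_multi_output is a hypothetical name, the seed is arbitrary, and treating n_informative as an exclusive upper bound is an assumption.

```python
import numpy as np


def make_integer_multi_output(n_classes=3, n_samples=100, n_features=4,
                              n_informative=5, n_outputs=2, n_test=10, seed=0):
    """Generate integer features and multi-column integer labels (sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: n_informative is an exclusive upper bound on feature values.
    X = rng.integers(0, n_informative, size=(n_samples, n_features))
    # One integer label column per output.
    y = rng.integers(0, n_classes, size=(n_samples, n_outputs))
    # Test set drawn from the same distribution as the training features.
    X_test = rng.integers(0, n_informative, size=(n_test, n_features))
    return X, y, X_test
```

Here n_informative does not have its usual scikit-learn meaning of "number of informative features"; it only bounds the feature values.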

yobx.sklearn.tests_helper.fit_multilabel_classification_model(model, n_classes=5, n_labels=2, n_samples=200, n_features=20, is_int=False)

Fit a multilabel classification model on a synthetic dataset.

Generates a multilabel dataset using sklearn.datasets.make_multilabel_classification(), fits model on the training split, and returns the fitted model together with the held-out test set.

Parameters:

model – An unfitted scikit-learn multilabel classifier.
n_classes – Number of classes (output labels).
n_labels – Average number of labels per sample.
n_samples – Total number of samples in the generated dataset.
n_features – Number of features in the generated dataset.
is_int – When True, cast features to np.int64; otherwise cast to np.float32.

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
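
The steps described above can be approximated directly with scikit-learn. This is a sketch, not the helper's code: the split ratio, the seeds, and the choice of DecisionTreeClassifier are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# y is a binary indicator matrix with one column per class.
X, y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, n_labels=2, random_state=0
)
# is_int=False branch; with is_int=True the cast would be to np.int64.
X = X.astype(np.float32)

# Assumption: the split ratio and seed are illustrative only.
X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)  # one 0/1 column per class
```

Decision trees accept an indicator matrix as the target, so no wrapper such as OneVsRestClassifier is needed in this sketch.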

yobx.sklearn.tests_helper.fit_regression_model(model, is_int=False, n_targets=1, is_bool=False, factor=1.0, n_features=10, n_samples=250, n_informative=10)

Fit a regression model on a synthetic dataset and return it with test data.

Generates a regression dataset using sklearn.datasets.make_regression(), fits model on the training split, and returns the fitted model together with the held-out test set.

Parameters:

model – An unfitted scikit-learn regressor.
is_int – When True, cast features to np.int64.
n_targets – Number of regression targets.
is_bool – When True, cast features to bool. The features first go through the same integer cast as is_int.
factor – Multiplicative scaling factor applied to the target values after generation.
n_features – Total number of features in the generated dataset.
n_samples – Number of samples in the generated dataset.
n_informative – Number of informative features used by make_regression().

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
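
A rough equivalent of these steps, written directly against scikit-learn, looks like the following. It is a sketch under assumptions rather than the helper's code: the split ratio, the seeds, the factor value, and the use of LinearRegression are chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

factor = 10.0  # illustrative value; the helper's default is 1.0

# With n_targets=2 the target array has shape (n_samples, 2).
X, y = make_regression(
    n_samples=250, n_features=10, n_informative=10, n_targets=2, random_state=0
)
# Scale the targets after generation, as the factor parameter describes.
y = y * factor

# Assumption: the split ratio and seed are illustrative only.
X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
```

A large factor widens the range of the predictions, which makes it useful for checking that a converted model stays within tolerance at larger output magnitudes.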