yobx.sklearn.tests_helper

yobx.sklearn.tests_helper.dump_data_and_model(data: ndarray, model: BaseEstimator, model_onnx: ModelProto, basename: str)

Validate an ONNX model against the original scikit-learn model.

Runs data through both onnxruntime and ExtendedReferenceEvaluator, then compares the outputs of each evaluator with the predictions produced by model:

  • If the model has predict_proba(), both labels and probabilities are compared (outputs 0 and 1 respectively).

  • Else if the model has transform(), the transformed output is compared (output 0).

  • Otherwise the predict() output is compared (output 0).

All comparisons use _assert_close() with default tolerance.

Parameters:
  • data – Input feature matrix (np.ndarray) passed to all evaluators.

  • model – The fitted scikit-learn estimator used as the reference.

  • model_onnx – The ONNX representation of model to validate.

  • basename – Short identifier used as a prefix in assertion messages to make failures easier to locate.

Raises:

AssertionError – If any output differs beyond the tolerance defined in _assert_close().
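
The dispatch order described above can be sketched in a few lines. This is only an illustration of the documented rule, not the helper's actual code: reference_outputs and the three toy classes are hypothetical names introduced here, and plain Python lists stand in for NumPy arrays.

```python
def reference_outputs(model, data):
    """Collect reference outputs in the order the ONNX outputs are compared.

    Hypothetical helper: it mirrors the documented dispatch order only.
    """
    if hasattr(model, "predict_proba"):
        # Classifier: output 0 is the labels, output 1 the probabilities.
        return [model.predict(data), model.predict_proba(data)]
    if hasattr(model, "transform"):
        # Transformer: a single transformed output.
        return [model.transform(data)]
    # Anything else (for example a regressor): a single predict() output.
    return [model.predict(data)]


class ToyClassifier:
    """Stand-in for an estimator exposing predict() and predict_proba()."""

    def predict(self, data):
        return [0 for _ in data]

    def predict_proba(self, data):
        return [[1.0, 0.0] for _ in data]


class ToyTransformer:
    """Stand-in for an estimator exposing only transform()."""

    def transform(self, data):
        return [[2 * v for v in row] for row in data]


class ToyRegressor:
    """Stand-in for an estimator exposing only predict()."""

    def predict(self, data):
        return [sum(row) for row in data]


data = [[1.0, 2.0], [3.0, 4.0]]
clf_out = reference_outputs(ToyClassifier(), data)
trf_out = reference_outputs(ToyTransformer(), data)
reg_out = reference_outputs(ToyRegressor(), data)
```

Because predict_proba() is checked first, a classifier always yields two reference outputs, while transformers and regressors yield one.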

yobx.sklearn.tests_helper.fit_classification_model(model, n_classes, is_int=False, pos_features=False, label_string=False, random_state=42, is_bool=False, n_features=20, n_redundant=None, n_repeated=None, cls_dtype=None, is_double=False, n_samples=250)

Fit a classification model on a synthetic dataset and return it with test data.

Generates a classification dataset using sklearn.datasets.make_classification(), fits model on the training split, and returns the fitted model together with the held-out test set.

Parameters:
  • model – An unfitted scikit-learn classifier.

  • n_classes – Number of target classes.

  • is_int – When True, cast features to np.int64.

  • pos_features – When True, take the absolute value of all features so that every entry is non-negative.

  • label_string – When True, convert integer labels to string labels of the form "cl<i>".

  • random_state – Random seed forwarded to make_classification() and train_test_split().

  • is_bool – When True, cast features to bool (implies is_int for the intermediate cast).

  • n_features – Total number of features in the generated dataset.

  • n_redundant – Number of redundant features. Defaults to min(2, n_features - min(7, n_features)), which is 2 when n_features >= 9, 1 when n_features is 8, and 0 when n_features <= 7.

  • n_repeated – Number of repeated features. Defaults to 0.

  • cls_dtype – If provided, cast the label array to this dtype before fitting.

  • is_double – When True, cast features to np.float64 after the integer/bool cast step.

  • n_samples – Number of samples in the generated dataset.

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
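
As a rough illustration, the steps described for this helper can be reproduced with scikit-learn alone. The sketch below is not the helper's implementation. It assumes that the number of informative features is min(7, n_features) (suggested by the n_redundant default but not stated), that features are float32 when no cast flag is set, that the absolute value is taken before the cast, and that train_test_split() uses its default split size. DecisionTreeClassifier is an arbitrary choice of classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

n_classes, n_features, n_samples, random_state = 3, 20, 250, 42

# Assumption: the informative-feature count is min(7, n_features).
n_informative = min(7, n_features)
# Documented default for n_redundant.
n_redundant = min(2, n_features - n_informative)

X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=0,
    n_classes=n_classes,
    random_state=random_state,
)

# pos_features=True: every entry becomes non-negative.
X = np.abs(X)
# Assumption: float32 is the feature dtype when no cast flag is set.
X = X.astype(np.float32)
# label_string=True: integer labels become "cl0", "cl1", ...
y = np.array([f"cl{i}" for i in y])

# Assumption: default split size of train_test_split().
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=random_state)
fitted_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
```

The pair (fitted_model, X_test) has the same shape as the documented return value and can be handed to a converter and then to a comparison helper.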

yobx.sklearn.tests_helper.fit_clustering_model(model, n_classes, is_int=False, pos_features=False, label_string=False, random_state=42, is_bool=False, n_features=20, n_redundant=None, n_repeated=None)

Fit a clustering model on a synthetic dataset and return it with test data.

Generates a classification dataset (used purely for the feature matrix) using sklearn.datasets.make_classification(), fits model on the training split (labels are not used), and returns the fitted model together with the held-out test set.

Parameters:
  • model – An unfitted scikit-learn clustering estimator (e.g. sklearn.cluster.KMeans).

  • n_classes – Number of classes used when generating the dataset (acts as a proxy for the number of natural clusters in the feature space).

  • is_int – When True, cast features to np.int64.

  • pos_features – When True, take the absolute value of all features so that every entry is non-negative.

  • label_string – When True, convert integer labels to string labels of the form "cl<i>" (not used for fitting but kept for API symmetry).

  • random_state – Random seed forwarded to make_classification() and train_test_split().

  • is_bool – When True, cast features to bool (implies is_int for the intermediate cast).

  • n_features – Total number of features in the generated dataset.

  • n_redundant – Number of redundant features. Defaults to min(2, n_features - min(7, n_features)), which is 2 when n_features >= 9, 1 when n_features is 8, and 0 when n_features <= 7.

  • n_repeated – Number of repeated features. Defaults to 0.

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
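
The point that labels are generated but never used for fitting can be shown with a short scikit-learn sketch. It is an illustration under assumptions, not the helper's code: the sample count (250), the informative-feature count (7), and the default split size are not documented for this function and are chosen here only for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

n_classes, n_features, random_state = 3, 20, 42

# Only the feature matrix is kept; the generated labels are discarded.
# Assumptions: n_samples=250 and n_informative=7 are illustrative values.
X, _ = make_classification(
    n_samples=250,
    n_features=n_features,
    n_informative=7,
    n_redundant=2,
    n_repeated=0,
    n_classes=n_classes,
    random_state=random_state,
)

# Splitting a single array returns just the two feature matrices.
X_train, X_test = train_test_split(X, random_state=random_state)

# The clustering estimator is fitted on features alone.
fitted_model = KMeans(n_clusters=n_classes, n_init=10, random_state=random_state)
fitted_model.fit(X_train)

cluster_ids = fitted_model.predict(X_test)
```

Here n_classes doubles as the number of clusters requested from KMeans, which matches its documented role as a proxy for the number of natural clusters.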

yobx.sklearn.tests_helper.fit_multi_output_classification_model(model, n_classes=3, n_samples=100, n_features=4, n_informative=5, n_outputs=2)

Fit a multi-output classification model on a synthetic integer dataset.

Generates a random integer feature matrix and a multi-column integer label matrix, fits a RandomForestClassifier (the model parameter is accepted for API consistency but is replaced internally), and returns the fitted model together with a small test set.

Parameters:
  • model – Accepted for API consistency; internally replaced by a RandomForestClassifier.

  • n_classes – Number of distinct integer classes per output column.

  • n_samples – Number of training samples.

  • n_features – Number of feature columns.

  • n_informative – Upper bound for the random integer values in the feature matrix.

  • n_outputs – Number of output columns (targets).

Returns:

A tuple (fitted_model, X_test) where X_test contains 10 samples drawn from the same integer distribution as the training set.
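
A comparable dataset can be built by hand with NumPy, which makes the role of n_informative as a value bound easier to see. This is a sketch under assumptions rather than the helper's code: the random generator and its seed, the exclusive treatment of the upper bound, and the n_estimators value are all choices made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_classes, n_samples, n_features, n_informative, n_outputs = 3, 100, 4, 5, 2

# Assumption: the seeding strategy is not documented; a fixed seed is used here.
rng = np.random.default_rng(0)

# Integer features in [0, n_informative); the bound is assumed to be exclusive.
X_train = rng.integers(0, n_informative, size=(n_samples, n_features))
# One integer label column per output, with values in [0, n_classes).
y_train = rng.integers(0, n_classes, size=(n_samples, n_outputs))
# Small test set of 10 samples drawn from the same distribution.
X_test = rng.integers(0, n_informative, size=(10, n_features))

# The documented behaviour is to fit a RandomForestClassifier regardless of
# the model argument; n_estimators=10 is an arbitrary value for this sketch.
fitted_model = RandomForestClassifier(n_estimators=10, random_state=0)
fitted_model.fit(X_train, y_train)

pred = fitted_model.predict(X_test)
```

With a two-dimensional label matrix, predict() returns one column per output, so pred has shape (10, n_outputs).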

yobx.sklearn.tests_helper.fit_multilabel_classification_model(model, n_classes=5, n_labels=2, n_samples=200, n_features=20, is_int=False)

Fit a multilabel classification model on a synthetic dataset.

Generates a multilabel dataset using sklearn.datasets.make_multilabel_classification(), fits model on the training split, and returns the fitted model together with the held-out test set.

Parameters:
  • model – An unfitted scikit-learn multilabel classifier.

  • n_classes – Number of classes (output labels).

  • n_labels – Average number of labels per sample.

  • n_samples – Total number of samples in the generated dataset.

  • n_features – Number of features in the generated dataset.

  • is_int – When True, cast features to np.int64; otherwise cast to np.float32.

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
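
The same flow can be approximated directly with scikit-learn, shown here for the is_int=False case. The sketch is illustrative only: the random seeds and the default split size are assumptions, and DecisionTreeClassifier is simply one estimator that accepts a label indicator matrix.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

n_classes, n_labels, n_samples, n_features = 5, 2, 200, 20

# Y is a binary indicator matrix with one column per class.
# Assumption: the seed is not documented; 0 is used for reproducibility.
X, Y = make_multilabel_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_classes=n_classes,
    n_labels=n_labels,
    random_state=0,
)

# is_int=False: features are cast to float32 (is_int=True would use int64).
X = X.astype(np.float32)

# Assumption: default split size of train_test_split().
X_train, X_test, Y_train, _ = train_test_split(X, Y, random_state=0)
fitted_model = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)

pred = fitted_model.predict(X_test)
```

Each row of pred is a 0/1 vector of length n_classes, one flag per label.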

yobx.sklearn.tests_helper.fit_regression_model(model, is_int=False, n_targets=1, is_bool=False, factor=1.0, n_features=10, n_samples=250, n_informative=10)

Fit a regression model on a synthetic dataset and return it with test data.

Generates a regression dataset using sklearn.datasets.make_regression(), fits model on the training split, and returns the fitted model together with the held-out test set.

Parameters:
  • model – An unfitted scikit-learn regressor.

  • is_int – When True, cast features to np.int64.

  • n_targets – Number of regression targets.

  • is_bool – When True, cast features to bool (implies is_int for the intermediate cast).

  • factor – Multiplicative scaling factor applied to the target values after generation.

  • n_features – Total number of features in the generated dataset.

  • n_samples – Number of samples in the generated dataset.

  • n_informative – Number of informative features used by make_regression().

Returns:

A tuple (fitted_model, X_test) where X_test is the held-out feature matrix.
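
A minimal scikit-learn sketch of these steps follows, mainly to show where factor enters. It is not the helper's implementation: the random seeds, the float32 feature dtype, and the default split size are assumptions, and LinearRegression is an arbitrary choice of regressor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

n_features, n_samples, n_informative, n_targets, factor = 10, 250, 10, 1, 0.5

# Assumption: the seed is not documented; 0 is used for reproducibility.
X, y = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_informative,
    n_targets=n_targets,
    random_state=0,
)

# factor rescales the targets after generation; the features are untouched.
y_scaled = y * factor

# Assumption: float32 is the feature dtype when no cast flag is set.
X = X.astype(np.float32)

# Assumption: default split size of train_test_split().
X_train, X_test, y_train, _ = train_test_split(X, y_scaled, random_state=0)
fitted_model = LinearRegression().fit(X_train, y_train)

pred = fitted_model.predict(X_test)
```

A factor below 1 shrinks the target range, which can be useful when large target magnitudes would otherwise dominate an absolute tolerance check.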