yobx.sklearn.impute.knn_imputer#

yobx.sklearn.impute.knn_imputer.sklearn_knn_imputer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: KNNImputer, X: str, name: str = 'knn_imputer') → str[source]#

Converts a sklearn.impute.KNNImputer into ONNX.

Each missing value in feature f of a test sample i is replaced by the (weighted) mean of feature f over the n_neighbors nearest training samples that have a valid (non-NaN) value in that feature. Distances are computed with the nan-euclidean metric: features that are NaN in either sample are excluded from the distance calculation.

Both weights='uniform' and weights='distance' are supported.
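The behaviour being reproduced can be checked directly against scikit-learn (a small sketch with made-up data; the converter itself is not involved):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy training data: all values valid; feature 1 is what gets imputed.
train = np.array([[1.0, 10.0], [2.0, 20.0], [8.0, 80.0], [9.0, 90.0]])
test = np.array([[1.2, np.nan]])

# The two nearest training rows (by nan-euclidean distance, which here
# only uses feature 0) are [1, 10] and [2, 20].
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit(train).transform(test)
# Unweighted mean of the donors' feature-1 values: (10 + 20) / 2 = 15.

distance = KNNImputer(n_neighbors=2, weights="distance").fit(train).transform(test)
# Inverse-distance weights: the common sqrt(2) nan-euclidean scaling cancels,
# leaving a 1/0.2 : 1/0.8 = 4 : 1 ratio, so (4 * 10 + 1 * 20) / 5 = 12.
```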

Distance computation

When the com.microsoft ONNX domain is registered, com.microsoft.CDist (with metric="sqeuclidean") is used as a hardware-accelerated starting point. CDist computes ||x_filled[i] - train_filled[j]||² over all features including NaN-zeroed positions; two MatMul corrections remove the extra terms:

nan_sq[i,j] = CDist(x_filled, train_filled)[i,j]      (sqeuclidean)
            - MatMul(nan_x_float, train_sq.T)[i,j]     (y_f² where test NaN)
            - MatMul(ax_sq, nan_train.T)[i,j]          (x_f² where train NaN)
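The two corrections can be checked in NumPy (a sketch of the arithmetic only; an explicit pairwise squared distance stands in for com.microsoft.CDist):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # test samples   [N, F]
T = rng.normal(size=(5, 4))      # train samples  [M, F]
X[0, 1] = np.nan
T[2, 3] = np.nan
T[4, 0] = np.nan

nan_x = np.isnan(X)
nan_t = np.isnan(T)
x_filled = np.where(nan_x, 0.0, X)
t_filled = np.where(nan_t, 0.0, T)

# Stand-in for com.microsoft.CDist with metric="sqeuclidean".
cdist_sq = ((x_filled[:, None, :] - t_filled[None, :, :]) ** 2).sum(-1)

# Two MatMul corrections: remove y_f² where the test value is NaN and
# x_f² where the train value is NaN (both-NaN features contribute 0 anyway).
nan_sq = (cdist_sq
          - nan_x.astype(float) @ (t_filled ** 2).T
          - (x_filled ** 2) @ nan_t.astype(float).T)

# Reference: squared differences summed over features valid in both samples.
valid = ~nan_x[:, None, :] & ~nan_t[None, :, :]
diff = x_filled[:, None, :] - t_filled[None, :, :]
ref = np.where(valid, diff ** 2, 0.0).sum(-1)
assert np.allclose(nan_sq, ref)
```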

When CDist is not available, the same result is obtained from standard MatMul operations over the zero-filled value and validity matrices:

nan_sq[i,j] = MatMul(x_sq, train_valid.T)[i,j]
            - 2 × MatMul(x_filled, train_filled.T)[i,j]
            + MatMul(vx_float, train_sq.T)[i,j]

In both cases n_valid (count of features valid in both samples) is computed separately and used to scale the result:

n_valid[i,j] = MatMul(vx_float, train_valid.T)[i,j]
dist_sq[i,j] = n_features / max(n_valid[i,j], 1) × nan_sq[i,j]
dist_sq[i,j] = inf  if n_valid[i,j] == 0
dists[i,j]   = sqrt(dist_sq[i,j])
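The MatMul expansion and the n_valid scaling can be sketched in NumPy and checked against scikit-learn's reference metric (the arithmetic only, not the generated graph):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
T = rng.normal(size=(6, 4))
X[0, 1] = np.nan
T[2, 3] = np.nan

vx = (~np.isnan(X)).astype(float)     # test validity mask   [N, F]
vt = (~np.isnan(T)).astype(float)     # train validity mask  [M, F]
x_filled = np.nan_to_num(X)
t_filled = np.nan_to_num(T)

# MatMul expansion of the masked squared distance (the no-CDist branch).
nan_sq = ((x_filled ** 2) @ vt.T
          - 2.0 * x_filled @ t_filled.T
          + vx @ (t_filled ** 2).T)

# Scale by n_features / n_valid and guard the no-overlap case.
n_valid = vx @ vt.T
dist_sq = X.shape[1] / np.maximum(n_valid, 1.0) * nan_sq
dist_sq[n_valid == 0] = np.inf
dists = np.sqrt(np.maximum(dist_sq, 0.0))

assert np.allclose(dists, nan_euclidean_distances(X, T))
```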

Imputation (per feature)

For each feature f, distances to training samples with NaN in feature f are set to infinity before TopK so that only valid donors are selected:

D_f = Where(train_valid[:, f], dists, inf)          [N, M]
top_k_dists_f, top_k_idx_f = TopK(D_f, k, axis=1, largest=False)
                                                    [N, k]
neighbor_vals_f = Gather(train_filled[:, f], top_k_idx_f.flat())
                .reshape(N, k)
is_inf_f = IsInf(top_k_dists_f)               [N, k] — fewer than k donors?
valid_float_f = Cast(~is_inf_f, float)         [N, k]
imputed_f = sum(neighbor_vals_f * valid_float_f) / max(sum(valid_float_f), 1)
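A per-feature NumPy sketch of the donor selection above (uniform weights only; argsort stands in for TopK and the toy distances are tie-free):

```python
import numpy as np

def impute_feature(f, train, dists, k):
    """Impute column f of the test set from precomputed distances [N, M]."""
    valid_f = ~np.isnan(train[:, f])                    # usable donors   [M]
    D_f = np.where(valid_f, dists, np.inf)              # invalid -> inf  [N, M]
    top_idx = np.argsort(D_f, axis=1)[:, :k]            # TopK(largest=False)
    top_dists = np.take_along_axis(D_f, top_idx, axis=1)
    neighbor_vals = np.nan_to_num(train[:, f])[top_idx]           # [N, k]
    donor_ok = (~np.isinf(top_dists)).astype(float)     # fewer than k donors?
    denom = np.maximum(donor_ok.sum(axis=1), 1.0)
    return (neighbor_vals * donor_ok).sum(axis=1) / denom

train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
dists = np.array([[1.0, 0.5, 3.0]])   # hypothetical precomputed distances [1, 3]
col = impute_feature(1, train, dists, k=2)
# Row 1 cannot donate for feature 1; the two nearest valid donors are
# rows 0 and 2, so the imputed value is (2 + 6) / 2 = 4.
```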

The imputed column vectors are concatenated and Where-selected against the original X (NaN preserved where no valid donors exist):

imputed = Concat(imputed_0, ..., imputed_{F-1}, axis=1)  [N, F]
result  = Where(IsNaN(X), imputed, X)                    [N, F]
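Putting the pieces together, the whole procedure can be reproduced in NumPy and compared with KNNImputer.transform (uniform weights, tie-free toy data; a sketch, not the generated graph):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
X = np.array([[2.2, np.nan]])
k = 2

dists = nan_euclidean_distances(X, train)              # [N, M]

cols = []
for f in range(train.shape[1]):
    D_f = np.where(~np.isnan(train[:, f]), dists, np.inf)
    idx = np.argsort(D_f, axis=1)[:, :k]               # TopK(largest=False)
    top = np.take_along_axis(D_f, idx, axis=1)
    vals = np.nan_to_num(train[:, f])[idx]
    ok = (~np.isinf(top)).astype(float)
    cols.append((vals * ok).sum(1) / np.maximum(ok.sum(1), 1.0))

imputed = np.stack(cols, axis=1)                       # Concat(..., axis=1)
result = np.where(np.isnan(X), imputed, X)             # keep observed values

ref = KNNImputer(n_neighbors=k, weights="uniform").fit(train).transform(X)
assert np.allclose(result, ref)                        # both give [[2.2, 4.0]]
```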

add_indicator=True is not supported and raises NotImplementedError. A custom callable metric is likewise unsupported and raises NotImplementedError.

Parameters:
  • g – the graph builder to add nodes to

  • sts – shapes defined by scikit-learn

  • outputs – desired output names

  • estimator – a fitted KNNImputer

  • X – input name

  • name – prefix name for the added nodes

Returns:

output name