yobx.sklearn.impute.knn_imputer
- yobx.sklearn.impute.knn_imputer.sklearn_knn_imputer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: KNNImputer, X: str, name: str = 'knn_imputer') → str
Converts a sklearn.impute.KNNImputer into ONNX.

Each missing value in feature f of a test sample i is replaced by the (weighted) mean of the n_neighbors nearest training samples that also have a valid (non-NaN) value in feature f. Distances are computed with the nan-euclidean metric (features that are NaN in either sample are excluded from the distance calculation).

Both weights='uniform' and weights='distance' are supported.

Distance computation
When the com.microsoft ONNX domain is registered, com.microsoft.CDist (with metric="sqeuclidean") is used as a hardware-accelerated starting point. CDist computes ||x_filled[i] - train_filled[j]||² over all features including NaN-zeroed positions; two MatMul corrections remove the extra terms:

    nan_sq[i,j] = CDist(x_filled, train_filled)[i,j]        (sqeuclidean)
                  - MatMul(nan_x_float, train_sq.T)[i,j]    (y_f² where test NaN)
                  - MatMul(ax_sq, nan_train.T)[i,j]         (x_f² where train NaN)

When CDist is not available the same result is obtained via four standard MatMul operations:

    nan_sq[i,j] = MatMul(x_sq, train_valid.T)[i,j]
                  - 2 × MatMul(x_filled, train_filled.T)[i,j]
                  + MatMul(vx_float, train_sq.T)[i,j]

In both cases n_valid (count of features valid in both samples) is computed separately and used to scale the result:

    n_valid[i,j] = MatMul(vx_float, train_valid.T)[i,j]
    dist_sq[i,j] = n_features / max(n_valid[i,j], 1) × nan_sq[i,j]
    dist_sq[i,j] = inf  if n_valid[i,j] == 0
    dists[i,j]   = sqrt(dist_sq[i,j])
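The two formulations above can be checked against each other outside ONNX. The following is a minimal NumPy sketch, not the converter's code: the helper names (`_prepare`, `nan_sq_matmul`, `nan_sq_cdist`, `nan_euclidean`) are hypothetical, plain broadcasting stands in for `com.microsoft.CDist`, and `x_sq` (the squared zero-filled test matrix) is assumed to be the operand written `ax_sq` in the CDist formula. The clip before the square root is an addition of this sketch.

```python
import numpy as np


def _prepare(A):
    """Return (filled, valid_float, nan_float, squared) for a matrix with NaNs."""
    nan_mask = np.isnan(A)
    filled = np.where(nan_mask, 0.0, A)  # NaN positions zeroed
    return filled, (~nan_mask).astype(float), nan_mask.astype(float), filled ** 2


def nan_sq_matmul(X, train):
    """Three-MatMul form: sum of (x_f - y_f)^2 over features valid in both samples."""
    x_filled, vx_float, _, x_sq = _prepare(X)
    train_filled, train_valid, _, train_sq = _prepare(train)
    return (
        x_sq @ train_valid.T
        - 2.0 * (x_filled @ train_filled.T)
        + vx_float @ train_sq.T
    )


def nan_sq_cdist(X, train):
    """CDist form: squared distance on zero-filled inputs minus two corrections."""
    x_filled, _, nan_x_float, x_sq = _prepare(X)
    train_filled, _, nan_train, train_sq = _prepare(train)
    # Broadcasting stand-in for com.microsoft.CDist(metric="sqeuclidean").
    full = ((x_filled[:, None, :] - train_filled[None, :, :]) ** 2).sum(axis=2)
    # x_sq is assumed to correspond to "ax_sq" in the formula above.
    return full - nan_x_float @ train_sq.T - x_sq @ nan_train.T


def nan_euclidean(X, train):
    """Scale nan_sq by n_features / n_valid and take the square root."""
    _, vx_float, _, _ = _prepare(X)
    _, train_valid, _, _ = _prepare(train)
    n_valid = vx_float @ train_valid.T  # features valid in both samples
    dist_sq = X.shape[1] / np.maximum(n_valid, 1.0) * nan_sq_matmul(X, train)
    dist_sq = np.where(n_valid == 0, np.inf, dist_sq)
    # Clipping at 0 guards against tiny negative round-off (sketch-only addition).
    return np.sqrt(np.maximum(dist_sq, 0.0))


X = np.array([[1.0, np.nan, 3.0], [np.nan, np.nan, np.nan]])
train = np.array([[1.0, 2.0, np.nan], [4.0, 5.0, 6.0], [np.nan, 1.0, 0.0]])
dists = nan_euclidean(X, train)
print(np.allclose(nan_sq_matmul(X, train), nan_sq_cdist(X, train)))
print(dists)
```

The second test row is entirely NaN, so it shares no valid feature with any training row and every distance in that row is set to infinity.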
Imputation (per feature)

For each feature f, distances to training samples with NaN in feature f are set to infinity before TopK so that only valid donors are selected:

    D_f = Where(train_valid[:, f], dists, inf)                        [N, M]
    top_k_dists_f, top_k_idx_f = TopK(D_f, k, axis=1, largest=False)  [N, k]
    neighbor_vals_f = Gather(train_filled[:, f], top_k_idx_f.flat())
                      .reshape(N, k)
    is_inf_f = IsInf(top_k_dists_f)                                   [N, k]  (fewer than k donors?)
    valid_float_f = Cast(~is_inf_f, float)                            [N, k]
    imputed_f = sum(neighbor_vals_f * valid_float_f)
                / max(sum(valid_float_f), 1)

The imputed column vectors are concatenated and Where-selected against the original X (NaN preserved where no valid donors exist):

    imputed = Concat(imputed_0, ..., imputed_{F-1}, axis=1)  [N, F]
    result  = Where(IsNaN(X), imputed, X)                    [N, F]

add_indicator=True is not supported and raises NotImplementedError. Custom callable metric values are not supported and raise NotImplementedError.

- Parameters:
g – the graph builder to add nodes to
sts – shapes defined by scikit-learn
estimator – a fitted KNNImputer
outputs – desired output names
X – input name
name – prefix name for the added nodes
- Returns:
output name
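
The per-feature imputation steps listed in the description can be mirrored in NumPy for the weights='uniform' case. This is a hedged sketch rather than the converter's implementation: `nan_euclidean` and `knn_impute_uniform` are hypothetical helper names, `argsort` stands in for the ONNX TopK node, and the distances are computed by direct broadcasting instead of MatMul. The example data is chosen so that every missing entry has at least one donor; the no-donor case is not modelled here.

```python
import numpy as np


def nan_euclidean(X, train):
    """Reference nan-euclidean distances, shape [N, M]; inf when no feature is shared."""
    both = ~np.isnan(X)[:, None, :] & ~np.isnan(train)[None, :, :]
    diff = np.where(both, X[:, None, :] - train[None, :, :], 0.0)
    n_valid = both.sum(axis=2)
    dist_sq = X.shape[1] / np.maximum(n_valid, 1) * (diff ** 2).sum(axis=2)
    return np.sqrt(np.where(n_valid == 0, np.inf, dist_sq))


def knn_impute_uniform(X, train, k):
    """Uniform-weight KNN imputation following the per-feature steps."""
    k = min(k, train.shape[0])
    train_valid = ~np.isnan(train)
    train_filled = np.where(train_valid, train, 0.0)
    dists = nan_euclidean(X, train)
    columns = []
    for f in range(X.shape[1]):
        # Training rows with NaN in feature f cannot be donors.
        D_f = np.where(train_valid[:, f][None, :], dists, np.inf)
        # Stand-in for TopK(largest=False): indices of the k smallest distances.
        top_k_idx = np.argsort(D_f, axis=1, kind="stable")[:, :k]
        top_k_dists = np.take_along_axis(D_f, top_k_idx, axis=1)
        neighbor_vals = train_filled[:, f][top_k_idx]
        # Infinite distances mark slots with no valid donor (fewer than k donors).
        valid_float = (~np.isinf(top_k_dists)).astype(np.float64)
        total = (neighbor_vals * valid_float).sum(axis=1)
        count = valid_float.sum(axis=1)
        columns.append(total / np.maximum(count, 1.0))
    imputed = np.stack(columns, axis=1)
    # Keep observed values; use imputed values only where X is NaN.
    return np.where(np.isnan(X), imputed, X)


train = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 10.0]])
X = np.array([[2.0, np.nan], [np.nan, 7.0]])
result = knn_impute_uniform(X, train, k=2)
print(result)
```

In this example the second training row has NaN in the second feature, so it is masked out as a donor for that feature even though it ties for the nearest neighbour of the first test row.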