yobx.sklearn.feature_extraction.feature_hasher

yobx.sklearn.feature_extraction.feature_hasher.sklearn_feature_hasher(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: FeatureHasher, X: str, X_values: str | None = None, name: str = 'feature_hasher') → str

Converts a sklearn.feature_extraction.FeatureHasher into ONNX.

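FeatureHasher maps a sequence of feature dictionaries (or pairs, or strings) to a matrix with a fixed number of columns (n_features) via the hashing trick, specifically murmurhash3_32 with seed=0 and signed output (positive=False). scikit-learn returns that matrix in sparse form; the ONNX graph built here produces the dense equivalent.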

This converter requires the com.microsoft opset (ONNX Runtime contrib ops) because the hashing is performed inline using com.microsoft.MurmurHash3, which exactly matches sklearn’s murmurhash3_32.

The primary input X must be a 2-D string tensor of shape (N, K), where K is the maximum number of features per sample; samples with fewer features are padded with "". A companion float input X_values of the same shape carries the feature values: 0.0 in every padding slot, and 1.0 in every non-padding slot when input_type='string'.
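The padded layout described above can be prepared in plain Python before the data is handed to an ONNX runtime. The following is a minimal sketch, not part of yobx: the helper name `pad_samples` and the dictionary-shaped input are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def pad_samples(
    samples: List[Dict[str, float]],
) -> Tuple[List[List[str]], List[List[float]]]:
    """Pack per-sample {feature name: value} dicts into two (N, K) grids.

    Hypothetical helper: it only illustrates the padding convention,
    "" for names and 0.0 for values in unused slots.
    """
    # K is the largest number of features found in any single sample.
    k = max((len(s) for s in samples), default=0)
    names: List[List[str]] = []
    values: List[List[float]] = []
    for sample in samples:
        pad = k - len(sample)
        names.append(list(sample.keys()) + [""] * pad)
        values.append([float(v) for v in sample.values()] + [0.0] * pad)
    return names, values


samples = [{"cat": 1.0, "dog": 2.0, "fish": 1.0}, {"cat": 3.0}, {}]
X, X_values = pad_samples(samples)
```

Both nested lists have shape (3, 3) here and can be turned into a string array and a float32 array respectively. For input_type='string', every real entry would carry the value 1.0 instead of a user-supplied weight.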

X (N, K) STRING, X_values (N, K) float
  │
  ├── MurmurHash3(X, seed=0, positive=0) ──► hashes (N, K) INT32
  ├── abs(hashes) % n_features ─────────────► indices (N, K) INT64
  ├── where(hashes >= 0, +1, −1) ───────────► signs (N, K) float
  ├── signs * X_values ─────────────────────► weighted (N, K) float
  └── ScatterElements(zeros, indices, weighted,
                      axis=1, reduction='add') → output (N, n_features)
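The dataflow in the diagram can be mirrored in pure Python to reason about what the graph computes. This is a reference sketch under stated assumptions, not the converter's implementation: `murmurhash3_32` below is a hand-written MurmurHash3 (x86, 32-bit) returning a signed value, feature names are assumed to be hashed as UTF-8 bytes, and `hash_features` is a hypothetical name.

```python
from typing import List

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmurhash3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as a signed 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:  # 1 to 3 leftover bytes
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    # Final avalanche mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    # Reinterpret the unsigned result as a signed INT32 (positive=0).
    return h - (1 << 32) if h & 0x80000000 else h


def hash_features(
    X: List[List[str]], X_values: List[List[float]], n_features: int
) -> List[List[float]]:
    """Follow the diagram: hash, index, sign, weight, scatter-add."""
    output = []
    for names, values in zip(X, X_values):
        row = [0.0] * n_features  # the "zeros" operand of ScatterElements
        for name, value in zip(names, values):
            h = murmurhash3_32(name.encode("utf-8"), seed=0)  # assumed UTF-8
            index = abs(h) % n_features
            sign = 1.0 if h >= 0 else -1.0
            row[index] += sign * value  # reduction='add'
        output.append(row)
    return output


X = [["cat", "dog", "fish"], ["cat", "", ""], ["", "", ""]]
X_values = [[1.0, 2.0, 1.0], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
hashed = hash_features(X, X_values, n_features=8)
```

The sketch also shows why padding is harmless: a padding slot still receives a hash and a column index, but its value of 0.0 adds nothing to the column it is scattered into. Python integers do not overflow, so `abs` needs no special handling for the smallest INT32 value here; a fixed-width implementation has to treat that case deliberately.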
Parameters:
  • g – the graph builder to add nodes to

  • sts – shapes defined by scikit-learn (unused; present for interface consistency)

  • estimator – a FeatureHasher instance

  • outputs – desired output names

  • X – primary input name — a STRING tensor of shape (N, K) containing feature names (shorter samples padded with "").

  • X_values – companion float input of shape (N, K) with feature values (1.0 per non-padding entry, 0.0 for padding slots).

  • name – prefix name for the added nodes

Returns:

output name

Raises:

NotImplementedError – if the com.microsoft opset is not registered in the graph builder, or if X is not a STRING tensor.