yobx.sklearn.feature_extraction.feature_hasher
- yobx.sklearn.feature_extraction.feature_hasher.sklearn_feature_hasher(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: FeatureHasher, X: str, X_values: str | None = None, name: str = 'feature_hasher') → str
Converts a sklearn.feature_extraction.FeatureHasher into ONNX.

FeatureHasher maps a sequence of feature dictionaries (or pairs, or strings) to a fixed-size dense matrix via the hashing trick, specifically murmurhash3_32 with seed=0 and signed output (positive=False).

This converter requires the com.microsoft opset (ONNX Runtime contrib ops) because the hashing is performed inline using com.microsoft.MurmurHash3, which exactly matches sklearn's murmurhash3_32.

The primary input X must be a 2-D string tensor of shape (N, K), where K is the maximum number of features per sample (shorter samples are padded with ""). A second, companion float input X_values of the same shape carries the feature values (1.0 per non-padding entry for input_type='string', 0.0 for padding slots).

    X (N, K) STRING, X_values (N, K) float
     │
     ├── MurmurHash3(X, seed=0, positive=0) ──► hashes   (N, K) INT32
     ├── abs(hashes) % n_features ─────────────► indices  (N, K) INT64
     ├── where(hashes >= 0, +1, −1) ───────────► signs    (N, K) float
     ├── signs * X_values ─────────────────────► weighted (N, K) float
     └── ScatterElements(zeros, indices, weighted, axis=1, reduction='add')
             → output (N, n_features)

- Parameters:
g – the graph builder to add nodes to
sts – shapes defined by scikit-learn (unused; present for interface consistency)
estimator – a FeatureHasher instance
outputs – desired output names
X – primary input name: a STRING tensor of shape (N, K) containing feature names (shorter samples padded with "").
X_values – companion float input of shape (N, K) with feature values (1.0 per non-padding entry, 0.0 for padding slots).
name – prefix name for the added nodes
- Returns:
output name
- Raises:
NotImplementedError – if the com.microsoft opset is not registered in the graph builder, or if X is not a STRING tensor.
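
The data flow described above can be mimicked in plain Python to check expectations about the padded inputs and the shape of the result. This is only a sketch, not the converter's code: `murmurhash3_32`, `pad_string_samples`, and `hash_features` are hypothetical helpers written for this illustration. It assumes `input_type='string'` and that feature names are hashed as their UTF-8 bytes, and it uses Python's unbounded integers, so it says nothing about how the ONNX graph handles int32 edge cases.

```python
from typing import List, Sequence, Tuple

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit unsigned integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmurhash3_32(key: str, seed: int = 0) -> int:
    """Signed 32-bit MurmurHash3 (x86 variant) of the UTF-8 bytes of key.

    Hypothetical pure-Python helper; hashing UTF-8 bytes is an assumption.
    """
    data = key.encode("utf-8")
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4
    # Body: consume the input four bytes at a time (little-endian).
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: mix in the remaining 0 to 3 bytes.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalization: mix in the length, then avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    # Reinterpret as a signed 32-bit value (positive=False behaviour).
    return h - (1 << 32) if h >= (1 << 31) else h


def pad_string_samples(
    samples: Sequence[Sequence[str]],
) -> Tuple[List[List[str]], List[List[float]]]:
    """Build the (N, K) inputs X and X_values from variable-length samples."""
    k = max((len(s) for s in samples), default=0)
    X = [list(s) + [""] * (k - len(s)) for s in samples]
    X_values = [[1.0] * len(s) + [0.0] * (k - len(s)) for s in samples]
    return X, X_values


def hash_features(
    X: List[List[str]], X_values: List[List[float]], n_features: int
) -> List[List[float]]:
    """Follow the documented steps: hash, index, sign, weight, scatter-add."""
    output = []
    for names, values in zip(X, X_values):
        row = [0.0] * n_features
        for name, value in zip(names, values):
            h = murmurhash3_32(name)          # hashes
            index = abs(h) % n_features       # indices
            sign = 1.0 if h >= 0 else -1.0    # signs
            row[index] += sign * value        # scatter with reduction='add'
        output.append(row)
    return output


samples = [["cat", "dog"], ["cat"], []]
X, X_values = pad_string_samples(samples)
dense = hash_features(X, X_values, n_features=8)
```

Padding slots carry a value of 0.0, so they add nothing to the output row regardless of which bucket the empty string hashes to. That is why the companion input X_values is needed alongside X.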