yobx.sklearn.feature_extraction.feature_hasher
- yobx.sklearn.feature_extraction.feature_hasher.sklearn_feature_hasher(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: FeatureHasher, X: str, name: str = 'feature_hasher') → str
Converts a sklearn.feature_extraction.FeatureHasher into ONNX using the com.microsoft.MurmurHash3 operator (ONNX Runtime ≥ 1.10).
Input format
The input tensor X must be a 2-D string tensor of shape (N, max_features_per_sample). Each row contains the feature names for one sample; shorter rows are padded with empty strings "", which are silently ignored. This matches the padded string array produced by converting a list of feature-name lists into a rectangular numpy array.
Note
Only input_type='string' is supported. The 'dict' and 'pair' input types require variable-length inputs that cannot be represented as fixed-size ONNX tensors. Empty strings "" are treated as padding and are not counted as features, so avoid using "" as an actual feature name.
Requirements
The graph builder must have the com.microsoft ONNX domain registered (pass target_opset={'': 18, 'com.microsoft': 1} to yobx.sklearn.to_onnx()).
Supported options
- input_type='string': the only supported value.
- n_features: any positive integer.
- alternate_sign=True (default) or False.
- dtype=numpy.float32 or numpy.float64.
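The padded layout described under "Input format" can be produced with a few lines of numpy. The following is a minimal sketch, not part of yobx: the helper name `pad_feature_names` is hypothetical, and the use of an object-dtype array to hold the strings is an assumption about what the runtime accepts as a string tensor.

```python
import numpy as np


def pad_feature_names(samples, pad=""):
    """Turn a list of feature-name lists into a rectangular 2-D array.

    Hypothetical helper: shorter rows are right-padded with ``pad``
    (the empty string, which the converter treats as padding).
    """
    width = max((len(row) for row in samples), default=0)
    padded = [list(row) + [pad] * (width - len(row)) for row in samples]
    # reshape keeps the result 2-D even when there are no samples or no features
    return np.array(padded, dtype=object).reshape(len(samples), width)


samples = [["cat", "dog", "fish"], ["bird"], ["cat", "cat"]]
X = pad_feature_names(samples)
print(X.shape)        # (3, 3)
print(X[1].tolist())  # ['bird', '', '']
```

Because "" is reserved for padding, a helper like this should only be used when no real feature is named "".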
Graph layout
    X  (N, max_tokens)  STRING
      │
      ├── MurmurHash3(seed=0, positive=0) → (N, max_tokens) INT32
      │       [com.microsoft domain]
      ├── Cast → INT64
      ├── GreaterOrEqual(0) → bool mask          (alternate_sign only)
      ├── Where(mask, +1, −1) → sign values
      ├── Equal(0) → is_empty mask
      ├── Where(is_empty, 0, sign) → values
      ├── Abs → abs_hash
      ├── Mod(n_features) → indices  (N, max_tokens) INT64
      ├── ConstantOfShape([N, n_features]) → zeros
      ├── ScatterElements(reduction='add', axis=1) → accumulated
      └── Cast(dtype) → output  (N, n_features)  FLOAT/DOUBLE
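The arithmetic that follows the hashing step can be mirrored in numpy to see how signs, indices, and the scatter-add interact. This is an illustrative sketch under stated assumptions, not the converter's implementation: the function name `hashes_to_features` is hypothetical, the input hash values are made up rather than computed with MurmurHash3, and the reading that `Equal(0)` is applied to the hash value is an assumption based on the diagram.

```python
import numpy as np


def hashes_to_features(hashes, n_features, alternate_sign=True, dtype=np.float32):
    """Emulate the post-hash part of the graph on an (N, max_tokens) int32 array."""
    h = hashes.astype(np.int64)                       # Cast -> INT64
    if alternate_sign:
        sign = np.where(h >= 0, 1, -1)                # GreaterOrEqual(0) + Where
    else:
        sign = np.ones_like(h)                        # every feature counts as +1
    values = np.where(h == 0, 0, sign)                # Equal(0) + Where: padding adds 0
    indices = np.abs(h) % n_features                  # Abs + Mod
    out = np.zeros((h.shape[0], n_features), dtype=np.int64)  # ConstantOfShape
    rows = np.arange(h.shape[0])[:, None]
    np.add.at(out, (rows, indices), values)           # scatter-add along axis 1
    return out.astype(dtype)                          # Cast(dtype)


# Made-up hash values; 0 stands for a padded (empty) slot in this sketch.
hashes = np.array([[5, -7, 0], [12, 12, -3]], dtype=np.int32)
signed = hashes_to_features(hashes, n_features=4)
unsigned = hashes_to_features(hashes, n_features=4, alternate_sign=False)
print(signed.tolist())    # [[0.0, 1.0, 0.0, -1.0], [2.0, 0.0, 0.0, -1.0]]
print(unsigned.tolist())  # [[0.0, 1.0, 0.0, 1.0], [2.0, 0.0, 0.0, 1.0]]
```

In the second row the value 12 appears twice and both occurrences land in column 0, which is why the scatter step needs an additive reduction rather than a plain assignment.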
- Parameters:
g – graph builder to add nodes to
sts – shapes defined by scikit-learn (unused; present for interface consistency)
estimator – a fitted FeatureHasher instance
outputs – desired output names
X – input tensor name: a STRING tensor of shape (N, max_features_per_sample) padded with ""
name – prefix for added node names
- Returns:
output tensor name
- Raises:
NotImplementedError – if input_type is not 'string' or if the input tensor is not of type STRING
RuntimeError – if the com.microsoft ONNX domain is not registered in the graph builder