yobx.sklearn.feature_extraction.feature_hasher

yobx.sklearn.feature_extraction.feature_hasher.sklearn_feature_hasher(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: FeatureHasher, X: str, name: str = 'feature_hasher') → str

Converts a sklearn.feature_extraction.FeatureHasher into ONNX using the com.microsoft.MurmurHash3 operator (ONNX Runtime ≥ 1.10).

Input format

The input tensor X must be a 2-D string tensor of shape (N, max_features_per_sample), where each row holds the feature names of one sample. Rows with fewer features are padded with empty strings "", which are silently ignored. This is the padded string array obtained by converting a list of feature-name lists into a rectangular numpy array.
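
The padding described above can be sketched as follows. This is a minimal illustration, not part of yobx: the helper name pad_feature_names is hypothetical, and the use of an object-dtype array is an assumption about how the strings are later fed to the runtime.

```python
import numpy as np


def pad_feature_names(samples, pad=""):
    """Pad a list of feature-name lists into a rectangular (N, width) array.

    Hypothetical helper for illustration only; it is not part of yobx.
    """
    # Width of the rectangular array is the length of the longest sample.
    width = max((len(s) for s in samples), default=0)
    # Shorter rows are right-padded with the empty string.
    padded = [list(s) + [pad] * (width - len(s)) for s in samples]
    # Assumption: an object-dtype array of Python str is used for the STRING tensor.
    return np.array(padded, dtype=object).reshape(len(samples), width)


samples = [["cat", "dog"], ["bird"], ["cat", "fish", "dog"]]
X = pad_feature_names(samples)
print(X.shape)  # (3, 3)
```

Because "" is reserved for padding, a sample that really contains an empty feature name cannot be distinguished from a padded slot.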

Note

Only input_type='string' is supported. The 'dict' and 'pair' input types require variable-length inputs that cannot be represented as fixed-size ONNX tensors. Empty strings "" are treated as padding and are not counted as features; users should therefore avoid using "" as an actual feature name.

Requirements

The graph builder must have the com.microsoft ONNX domain registered (pass target_opset={'': 18, 'com.microsoft': 1} to yobx.sklearn.to_onnx()).

Supported options

  • input_type='string' — the only supported value.

  • n_features — any positive integer.

  • alternate_sign=True (default) or False.

  • dtype=numpy.float32 or numpy.float64.

Graph layout

X  (N, max_tokens) STRING
│
├── MurmurHash3(seed=0, positive=0)   →  (N, max_tokens) INT32
│       [com.microsoft domain]
├── Cast → INT64
├── GreaterOrEqual(0) → bool mask      (alternate_sign only)
├── Where(mask, +1, −1) → sign values
├── Equal(0) → is_empty mask
├── Where(is_empty, 0, sign) → values
├── Abs → abs_hash
├── Mod(n_features) → indices  (N, max_tokens) INT64
├── ConstantOfShape([N, n_features]) → zeros
├── ScatterElements(reduction='add', axis=1) → accumulated
└── Cast(dtype) → output  (N, n_features) FLOAT/DOUBLE
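
The steps after MurmurHash3 can be emulated in numpy to see what the graph computes. This is a rough sketch under stated assumptions, not the converter's code: the function name accumulate_hashed is hypothetical, the signed INT32 hash values are taken as given instead of being computed, the padding mask is passed in explicitly because the diagram does not spell out how it is derived, and the accumulation is done in INT64 before the final cast.

```python
import numpy as np


def accumulate_hashed(hashes, is_empty, n_features,
                      alternate_sign=True, dtype=np.float32):
    """Emulate the post-hash steps of the graph (hypothetical helper).

    hashes:   (N, T) int32 array of signed hash values (assumed given).
    is_empty: (N, T) bool array, True where the slot is padding.
    """
    # Cast to INT64 so that abs() cannot overflow on the INT32 minimum.
    h = hashes.astype(np.int64)
    if alternate_sign:
        # GreaterOrEqual(0) followed by Where(mask, +1, -1).
        sign = np.where(h >= 0, 1, -1)
    else:
        sign = np.ones_like(h)
    # Where(is_empty, 0, sign): padded slots contribute nothing.
    values = np.where(is_empty, 0, sign)
    # Abs followed by Mod(n_features) gives the output column of each token.
    indices = np.abs(h) % n_features
    # ConstantOfShape([N, n_features]) then ScatterElements(reduction='add', axis=1).
    out = np.zeros((h.shape[0], n_features), dtype=np.int64)
    rows = np.broadcast_to(np.arange(h.shape[0])[:, None], indices.shape)
    np.add.at(out, (rows, indices), values)
    # Final Cast(dtype).
    return out.astype(dtype)


hashes = np.array([[5, -7, 0], [-3, 3, 9]], dtype=np.int32)
is_empty = np.array([[False, False, True], [False, False, False]])
signed = accumulate_hashed(hashes, is_empty, n_features=4)
unsigned = accumulate_hashed(hashes, is_empty, n_features=4, alternate_sign=False)
print(signed)
print(unsigned)
```

In the second row of the example, the hashes -3 and 3 land in the same column with opposite signs and cancel when alternate_sign is enabled; with alternate_sign=False they add up instead.
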
Parameters:
  • g – graph builder to add nodes to

  • sts – shapes defined by scikit-learn (unused; present for interface consistency)

  • estimator – a fitted FeatureHasher instance

  • outputs – desired output names

  • X – name of the input tensor, which must be a STRING tensor of shape (N, max_features_per_sample) padded with ""

  • name – prefix for added node names

Returns:

output tensor name

Raises:
  • NotImplementedError – if input_type is not 'string' or if the input tensor is not of type STRING

  • RuntimeError – if the com.microsoft ONNX domain is not registered in the graph builder