yobx.sklearn.feature_extraction.count_vectorizer

yobx.sklearn.feature_extraction.count_vectorizer.sklearn_count_vectorizer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: CountVectorizer, X: str, name: str = 'count_vectorizer') → str

Converts a sklearn.feature_extraction.text.CountVectorizer into ONNX using the TfIdfVectorizer operator (opset 9+).

The input tensor X must already be tokenized: it should be a 2-D string tensor of shape (N, max_tokens_per_doc) where each row contains the pre-split tokens of one document, and shorter rows are padded with empty strings "" (which the operator ignores). This matches the behavior expected by the ONNX TfIdfVectorizer operator; raw text documents cannot be accepted because ONNX has no standard tokenizer.
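A minimal sketch of producing that padded layout; `tokenize_and_pad` is a hypothetical helper, and the whitespace split is an assumption (scikit-learn's default analyzer uses a regex token pattern, so match your estimator's tokenization in practice):

```python
def tokenize_and_pad(docs):
    # Whitespace-tokenize each document, then right-pad every row with ""
    # so all rows share the same length. The ONNX TfIdfVectorizer operator
    # ignores "" entries, so the padding does not affect the counts.
    rows = [doc.split() for doc in docs]
    width = max(len(r) for r in rows)
    return [r + [""] * (width - len(r)) for r in rows]

X = tokenize_and_pad(["the cat sat on the mat", "hello world"])
# Every row now has 6 entries; the second row ends with four "" pads.
```

The nested list can then be converted to a 2-D string tensor (e.g. a NumPy array of strings) before being fed to the graph.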

Supported options

  • analyzer='word' — the only supported value; 'char' and 'char_wb' require character-level tokenization that has no ONNX equivalent.

  • ngram_range=(min_n, max_n) — any pair of positive integers with 1 ≤ min_n ≤ max_n.

  • binary=False (default, TF counts) or binary=True (binary presence, 0 or 1 per feature).

  • vocabulary_ — any fitted vocabulary; no restriction on size.
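To make the options above concrete, here is a pure-Python sketch of what the exported graph computes in TF mode for one padded row. The helper name and toy vocabulary are illustrative, not part of the library:

```python
def tf_counts(row, vocabulary, ngram_range=(1, 1), binary=False):
    # Emulate ONNX TfIdfVectorizer in TF mode for a single padded token row:
    # count how often each vocabulary n-gram occurs, skipping "" padding.
    tokens = [t for t in row if t != ""]          # "" pads are ignored
    out = [0] * len(vocabulary)
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocabulary:
                out[vocabulary[gram]] += 1
    if binary:                                    # binary=True clips counts to 0/1
        out = [1 if c else 0 for c in out]
    return out

vocab = {"the": 0, "cat": 1, "the cat": 2}
row = ["the", "cat", "ate", "the", "fish", ""]
tf_counts(row, vocab, ngram_range=(1, 2))   # → [2, 1, 1]
```

With binary=True the same call returns [1, 1, 1], mirroring CountVectorizer's binary presence mode.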

Graph layout

X  (N, seq_len) STRING
│
└── TfIdfVectorizer(mode="TF", pool_strings=vocab, …)
       └── output  (N, n_features) FLOAT
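The node's n-gram pool is derived from the fitted vocabulary_. A sketch of that flattening, following the ONNX TfIdfVectorizer attribute layout (pool_strings holds the tokens of all n-grams grouped by length, ngram_counts holds the start offset of each length group, ngram_indexes maps each pooled n-gram to its output column); the helper name is hypothetical:

```python
def vocab_to_tfidf_attrs(vocabulary):
    # Flatten a CountVectorizer-style vocabulary {ngram: column} into the
    # three pool attributes of the ONNX TfIdfVectorizer operator.
    grams = sorted(vocabulary, key=lambda g: (len(g.split()), vocabulary[g]))
    pool_strings, ngram_counts, ngram_indexes = [], [], []
    current_n = 0
    for gram in grams:
        tokens = gram.split()
        while current_n < len(tokens):
            # Record where each n-gram length group starts in the pool
            # (empty groups get the same offset as the next group).
            ngram_counts.append(len(pool_strings))
            current_n += 1
        pool_strings.extend(tokens)               # n-gram tokens laid out flat
        ngram_indexes.append(vocabulary[gram])    # output column of this n-gram
    return ngram_counts, ngram_indexes, pool_strings

vocab_to_tfidf_attrs({"the": 0, "cat": 1, "the cat": 2})
# → ([0, 2], [0, 1, 2], ["the", "cat", "the", "cat"])
```

Here the 1-grams occupy pool positions 0–1 and the single 2-gram starts at position 2, which is exactly what ngram_counts = [0, 2] encodes.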
Parameters:
  • g – the graph builder to add nodes to

  • sts – shapes defined by scikit-learn

  • outputs – desired output names

  • estimator – a fitted CountVectorizer

  • X – input tensor name; a STRING tensor of shape (N, max_tokens_per_doc), with rows padded with "" as needed

  • name – prefix for added node names

Returns:

output tensor name

Raises:

NotImplementedError – if analyzer is not 'word'