yobx.sklearn.feature_extraction.count_vectorizer

yobx.sklearn.feature_extraction.count_vectorizer.sklearn_count_vectorizer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: CountVectorizer, X: str, name: str = 'count_vectorizer') → str

Converts a sklearn.feature_extraction.text.CountVectorizer into ONNX using the TfIdfVectorizer operator (opset 9+).

The input tensor X must already be tokenized: it should be a 2-D string tensor of shape (N, max_tokens_per_doc) where each row contains the pre-split tokens of one document, and shorter rows are padded with empty strings "" (which the operator ignores). This matches the behavior expected by the ONNX TfIdfVectorizer operator; raw text documents cannot be accepted because ONNX has no standard tokenizer.
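A minimal sketch of producing that padded layout; `tokenize_and_pad` is a hypothetical helper, and the whitespace split is an assumption (scikit-learn's default analyzer uses a regex token pattern, so match your estimator's tokenization in practice):

```python
def tokenize_and_pad(docs):
    # Whitespace-tokenize each document, then right-pad every row with ""
    # so all rows share the same length. The ONNX TfIdfVectorizer operator
    # ignores "" entries, so the padding does not affect the counts.
    rows = [doc.split() for doc in docs]
    width = max(len(r) for r in rows)
    return [r + [""] * (width - len(r)) for r in rows]

X = tokenize_and_pad(["the cat sat on the mat", "hello world"])
# Every row now has 6 entries; the second row ends with four "" pads.
```

The nested list can then be converted to a 2-D string tensor (e.g. a NumPy array of strings) before being fed to the graph.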

Supported options

  • analyzer='word' — the only supported value; 'char' and 'char_wb' require character-level tokenization that has no ONNX equivalent.

  • ngram_range=(min_n, max_n) — any pair of positive integers with 1 ≤ min_n ≤ max_n.

  • binary=False (default, TF counts) or binary=True (binary presence, 0 or 1 per feature).

  • vocabulary_ — any fitted vocabulary; no restriction on size.
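To make the options above concrete, here is a pure-Python sketch of what the exported graph computes in TF mode for one padded row. The helper name and toy vocabulary are illustrative, not part of the library:

```python
def tf_counts(row, vocabulary, ngram_range=(1, 1), binary=False):
    # Emulate ONNX TfIdfVectorizer in TF mode for a single padded token row:
    # count how often each vocabulary n-gram occurs, skipping "" padding.
    tokens = [t for t in row if t != ""]          # "" pads are ignored
    out = [0] * len(vocabulary)
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocabulary:
                out[vocabulary[gram]] += 1
    if binary:                                    # binary=True clips counts to 0/1
        out = [1 if c else 0 for c in out]
    return out

vocab = {"the": 0, "cat": 1, "the cat": 2}
row = ["the", "cat", "ate", "the", "fish", ""]
tf_counts(row, vocab, ngram_range=(1, 2))   # → [2, 1, 1]
```

With binary=True the same call returns [1, 1, 1], mirroring CountVectorizer's binary presence mode.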

Graph layout

X  (N, seq_len) STRING
│
└── TfIdfVectorizer(mode="TF", pool_strings=vocab, …)
       └── output  (N, n_features) FLOAT
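The node's n-gram pool is derived from the fitted vocabulary_. A sketch of that flattening, following the ONNX TfIdfVectorizer attribute layout (pool_strings holds the tokens of all n-grams grouped by length, ngram_counts holds the start offset of each length group, ngram_indexes maps each pooled n-gram to its output column); the helper name is hypothetical:

```python
def vocab_to_tfidf_attrs(vocabulary):
    # Flatten a CountVectorizer-style vocabulary {ngram: column} into the
    # three pool attributes of the ONNX TfIdfVectorizer operator.
    grams = sorted(vocabulary, key=lambda g: (len(g.split()), vocabulary[g]))
    pool_strings, ngram_counts, ngram_indexes = [], [], []
    current_n = 0
    for gram in grams:
        tokens = gram.split()
        while current_n < len(tokens):
            # Record where each n-gram length group starts in the pool
            # (empty groups get the same offset as the next group).
            ngram_counts.append(len(pool_strings))
            current_n += 1
        pool_strings.extend(tokens)               # n-gram tokens laid out flat
        ngram_indexes.append(vocabulary[gram])    # output column of this n-gram
    return ngram_counts, ngram_indexes, pool_strings

vocab_to_tfidf_attrs({"the": 0, "cat": 1, "the cat": 2})
# → ([0, 2], [0, 1, 2], ["the", "cat", "the", "cat"])
```

Here the 1-grams occupy pool positions 0–1 and the single 2-gram starts at position 2, which is exactly what ngram_counts = [0, 2] encodes.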
Parameters:
  • g – the graph builder to add nodes to

  • sts – shapes defined by scikit-learn

  • outputs – desired output names

  • estimator – a fitted CountVectorizer

  • X – input tensor name; a STRING tensor of shape (N, max_tokens_per_doc), with rows padded with "" as needed

  • name – prefix for added node names

Returns:

output tensor name

Raises:

NotImplementedError – if analyzer is not 'word'