yobx.sklearn.feature_extraction.count_vectorizer
- yobx.sklearn.feature_extraction.count_vectorizer.sklearn_count_vectorizer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: CountVectorizer, X: str, name: str = 'count_vectorizer') -> str
Converts a sklearn.feature_extraction.text.CountVectorizer into ONNX using the TfIdfVectorizer operator (opset 9+).

The input tensor X must already be tokenized: it should be a 2-D string tensor of shape (N, max_tokens_per_doc) where each row contains the pre-split tokens of one document and shorter rows are padded with empty strings "" (which the operator ignores). This matches the behaviour expected by the ONNX TfIdfVectorizer operator; raw text documents cannot be accepted because ONNX lacks a standard tokeniser.

Supported options

- analyzer='word': the only supported value; 'char' and 'char_wb' require character-level input tokenisation that has no ONNX equivalent.
- ngram_range=(min_n, max_n): any combination of positive integers.
- binary=False (default, term-frequency counts) or binary=True (binary presence, 0 or 1 per feature).
- vocabulary_: any fitted vocabulary; no restriction on size.
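To illustrate the expected input layout, here is a small sketch (not part of yobx; the whitespace split and the pad_tokens helper are illustrative only) that pads pre-split documents with empty strings into the rectangular (N, max_tokens_per_doc) shape the converter expects:

```python
# Sketch: build the 2-D padded token tensor the converter expects.
# The whitespace split is illustrative; use whatever tokenisation
# matches the fitted CountVectorizer's analyzer.

def pad_tokens(docs):
    """Split each document on whitespace and pad shorter rows with ""
    so every row has the same length (max_tokens_per_doc)."""
    tokenized = [doc.split() for doc in docs]
    max_len = max(len(row) for row in tokenized)
    return [row + [""] * (max_len - len(row)) for row in tokenized]

docs = ["the cat sat", "the cat sat on the mat"]
X = pad_tokens(docs)
# X is 2 rows of 6 entries each; the first row ends in three "" pads,
# which the ONNX TfIdfVectorizer operator ignores at inference time.
```

The padding token must be the empty string specifically, since that is the value the operator skips.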
Graph layout

X (N, seq_len) STRING
 │
 └── TfIdfVectorizer(mode=TF/IDF, pool_strings=vocab, …)
      └── output (N, n_features) FLOAT

- Parameters:
g – the graph builder to add nodes to
sts – shapes defined by scikit-learn
estimator – a fitted CountVectorizer
outputs – desired output names
X – input tensor name: a STRING tensor of shape (N, max_tokens_per_doc) (rows padded with "" as needed)
name – prefix for added node names
- Returns:
output tensor name
- Raises:
NotImplementedError – if analyzer is not 'word'
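For intuition about what the emitted node computes, the following pure-Python sketch models TF-mode counting as the ONNX TfIdfVectorizer operator performs it for word n-grams: each vocabulary n-gram is counted per row, with "" padding skipped. This is a reference model of the operator's behaviour for illustration, not yobx code.

```python
# Reference sketch (not yobx code): TF-mode word n-gram counting as in
# the ONNX TfIdfVectorizer operator, ignoring "" padding tokens.

def tf_counts(row, vocabulary, ngram_range=(1, 2), binary=False):
    """Count vocabulary n-grams in one padded token row.

    row         : list of str tokens, padded with "" at the end
    vocabulary  : dict mapping n-gram string -> feature index
    ngram_range : (min_n, max_n) as in CountVectorizer
    binary      : if True, clip counts to 0/1
    """
    tokens = [t for t in row if t != ""]          # drop padding
    counts = [0.0] * len(vocabulary)
    min_n, max_n = ngram_range
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            idx = vocabulary.get(gram)
            if idx is not None:
                counts[idx] += 1.0
    if binary:
        counts = [1.0 if c > 0 else 0.0 for c in counts]
    return counts

vocab = {"cat": 0, "sat": 1, "cat sat": 2}
tf_counts(["the", "cat", "sat", "", ""], vocab)  # → [1.0, 1.0, 1.0]
```

With binary=True the same call clips every non-zero count to 1.0, matching the binary=True option described above.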