yobx.sklearn.feature_extraction.tfidf_vectorizer
- yobx.sklearn.feature_extraction.tfidf_vectorizer.sklearn_tfidf_vectorizer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: TfidfVectorizer, X: str, name: str = 'tfidf_vectorizer') → str
Converts a `sklearn.feature_extraction.text.TfidfVectorizer` into ONNX. `TfidfVectorizer` combines a `CountVectorizer` with a `TfidfTransformer`, and this converter reproduces that two-step pipeline:

- Count step: the ONNX `TfIdfVectorizer` operator (opset 9+) maps a pre-tokenised string tensor to raw term-frequency counts, exactly as `CountVectorizer.transform` does.
- TF-IDF step: the same sublinear-TF scaling, IDF weighting, and L1/L2 row normalisation implemented by the `TfidfTransformer` converter.
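The two-step decomposition above can be checked with scikit-learn alone, without the converter: fitting a `TfidfVectorizer` produces the same matrix as a `CountVectorizer` followed by a `TfidfTransformer` (the corpus below is illustrative).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["the cat sat", "the cat sat on the mat", "dogs and cats"]

# One-step: fit TfidfVectorizer directly on the raw documents.
one_step = TfidfVectorizer().fit_transform(docs).toarray()

# Two-step: raw term counts, then TF-IDF weighting + L2 normalisation,
# mirroring the converter's count step and TF-IDF step.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts).toarray()

assert np.allclose(one_step, two_step)
```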
The input tensor X must already be tokenised: a 2-D string tensor of shape (N, max_tokens_per_doc) where shorter rows are padded with empty strings "". Raw text documents are not accepted because ONNX lacks a standard tokeniser.

Supported options

- `analyzer='word'` only (character-level tokenisation has no ONNX equivalent).
- `ngram_range`, `sublinear_tf`, `use_idf`, `norm` ('l1', 'l2', or None): all supported.
- `smooth_idf`, `binary`: reflected via the fitted `idf_` values and the count mode respectively.
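One plausible way to build the padded input array is sketched below: tokenise each document with the fitted vectorizer's own `build_analyzer()` and right-pad rows with empty strings. This is an assumption-laden sketch (the corpus, variable names, and `dtype=object` choice are illustrative, and it presumes the converter expects the analyzer's word-level tokens rather than raw text).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = TfidfVectorizer().fit(docs)

# build_analyzer() returns the same word-level tokeniser used at fit time.
analyze = vec.build_analyzer()
token_rows = [analyze(d) for d in docs]
max_len = max(len(row) for row in token_rows)

# Right-pad shorter rows with "" to get a rectangular (N, max_tokens_per_doc) array.
X = np.array(
    [row + [""] * (max_len - len(row)) for row in token_rows],
    dtype=object,  # a string dtype may be required by the target ONNX runtime
)
print(X.shape)  # (2, 6)
```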
Graph layout (all options active)

```
X (N, seq_len) STRING
 │
 └── TfIdfVectorizer(mode=TF, …)             # raw counts
      │
      ├── Greater(0) / Log / Add(1) / Where  # sublinear_tf (optional)
      ├── Mul(idf_)                          # use_idf (optional)
      └── ReduceL2 / Div                     # norm (optional)
           └── output (N, n_features) float
```

- Parameters:
  - g – the graph builder to add nodes to
  - sts – shapes defined by scikit-learn
  - estimator – a fitted TfidfVectorizer
  - outputs – desired output names
  - X – input tensor name: a STRING tensor of shape (N, max_tokens_per_doc), rows padded with "" as needed
  - name – prefix for added node names
- Returns:
output tensor name
- Raises:
NotImplementedError – if analyzer is not 'word', or if norm is 'l1' or 'l2' and the graph opset is < 18
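The TF-IDF arithmetic that the graph layout encodes after the count step (sublinear TF, IDF multiplication, L2 row normalisation) can be sketched in NumPy and checked against scikit-learn's own output. The corpus and variable names are illustrative; the branch comments map each line to the corresponding operators in the graph above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]
vec = TfidfVectorizer(sublinear_tf=True, use_idf=True, norm="l2").fit(docs)
expected = vec.transform(docs).toarray()

# Raw counts over the fitted vocabulary — the TfIdfVectorizer(mode=TF) step.
counts = (
    CountVectorizer(vocabulary=vec.vocabulary_)
    .transform(docs)
    .toarray()
    .astype(float)
)

# sublinear_tf: tf -> 1 + log(tf) on non-zero entries (Greater/Log/Add/Where branch).
tf = np.where(counts > 0, 1.0 + np.log(np.where(counts > 0, counts, 1.0)), 0.0)
# use_idf: multiply by the fitted idf_ vector (Mul branch).
weighted = tf * vec.idf_
# norm='l2': divide each row by its L2 norm (ReduceL2/Div branch).
out = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

assert np.allclose(out, expected)
```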