yobx.sklearn.feature_extraction.tfidf_vectorizer

yobx.sklearn.feature_extraction.tfidf_vectorizer.sklearn_tfidf_vectorizer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: TfidfVectorizer, X: str, name: str = 'tfidf_vectorizer') → str

Converts a sklearn.feature_extraction.text.TfidfVectorizer into ONNX.

TfidfVectorizer combines a CountVectorizer with a TfidfTransformer. This converter reproduces that two-step pipeline:

Count step — the ONNX TfIdfVectorizer operator (opset 9+) maps a pre-tokenised string tensor to raw term-frequency counts, exactly as CountVectorizer.transform does.
TF-IDF step — the same sublinear-TF scaling, IDF weighting, and L1/L2 row normalisation implemented by the TfidfTransformer converter.

The input tensor X must already be tokenised — a 2-D string tensor of shape (N, max_tokens_per_doc) where shorter rows are padded with empty strings "". Raw text documents are not accepted because ONNX lacks a standard tokeniser.
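
The padding requirement above can be sketched in plain Python. This is only an illustration: the helper name tokenize_and_pad is hypothetical, and the regular expression is assumed to be scikit-learn's default token_pattern for analyzer='word'. A vectorizer fitted with a custom tokenizer, preprocessor, or stop-word list would need the same logic applied here.

```python
import re

# Assumption: scikit-learn's default token_pattern (words of 2+ characters).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize_and_pad(docs, lowercase=True):
    """Tokenise each document and pad rows with "" to a common width."""
    rows = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        rows.append(TOKEN_PATTERN.findall(text))
    # Width of the longest row; 0 when there are no documents.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


docs = ["The cat sat", "A dog"]
padded = tokenize_and_pad(docs)
# padded == [["the", "cat", "sat"], ["dog", "", ""]]
```

The resulting rectangular list has shape (N, max_tokens_per_doc) and can be turned into a string array before being fed to the converted model. The single-character token "A" is dropped because the assumed pattern only matches words of two or more characters.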

Supported options

analyzer='word' only (character-level tokenisation has no ONNX equivalent).
ngram_range, sublinear_tf, use_idf, norm ('l1', 'l2', or None): all supported.
smooth_idf, binary — reflected via the fitted idf_ values and the count mode respectively.
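
To illustrate why smooth_idf and binary need no dedicated graph nodes, the sketch below reproduces the IDF formula documented for scikit-learn's TfidfTransformer and the effect of binary=True on counts. The function names idf_weights and binarize are hypothetical, and the code only mirrors the scikit-learn definitions; it is not the converter's implementation.

```python
import math


def idf_weights(doc_freq, n_docs, smooth_idf=True):
    """IDF per term as documented for scikit-learn's TfidfTransformer.

    smooth_idf adds one to both the document count and each document
    frequency, as if one extra document contained every term.
    """
    extra = 1 if smooth_idf else 0
    return [math.log((n_docs + extra) / (df + extra)) + 1.0 for df in doc_freq]


def binarize(counts):
    """binary=True: every non-zero count becomes 1."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


smoothed = idf_weights([3, 1], n_docs=3)                 # smooth_idf=True
plain = idf_weights([3, 1], n_docs=3, smooth_idf=False)  # smooth_idf=False
flags = binarize([[2, 0, 5]])
```

Because the fitted idf_ vector already contains whichever variant was chosen at fit time, a converter only has to store it as a constant and multiply.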

Graph layout (all options active)

X  (N, seq_len) STRING
│
└── TfIdfVectorizer(mode=TF, …)       # raw counts
       │
       ├── Greater(0) / Log / Add(1) / Where  # sublinear_tf (optional)
       ├── Mul(idf_)                           # use_idf (optional)
       └── ReduceL2 / Div                      # norm (optional)
              └── output  (N, n_features) float
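
As a reading aid for the layout above, here is a minimal pure-Python reference for the three optional stages applied to a dense matrix of raw counts. The function name tfidf_from_counts is hypothetical, and leaving all-zero rows unchanged is an assumption about edge-case handling rather than something this page states.

```python
import math


def tfidf_from_counts(counts, idf=None, sublinear_tf=False, norm="l2"):
    """Apply the optional TF-IDF stages to rows of raw term counts."""
    out = []
    for row in counts:
        row = [float(c) for c in row]
        if sublinear_tf:
            # Mirrors Greater(0) / Log / Add(1) / Where: 1 + log(tf) where tf > 0.
            row = [math.log(c) + 1.0 if c > 0 else c for c in row]
        if idf is not None:
            # Mirrors Mul(idf_): element-wise product with the fitted weights.
            row = [c * w for c, w in zip(row, idf)]
        if norm == "l2":
            scale = math.sqrt(sum(c * c for c in row))
        elif norm == "l1":
            scale = sum(abs(c) for c in row)
        else:
            scale = 1.0
        # Assumption: rows with zero norm are returned unchanged.
        out.append([c / scale for c in row] if scale > 0 else row)
    return out


result = tfidf_from_counts([[3, 0, 4], [0, 0, 0]], idf=[1.0, 1.0, 1.0])
# First row is divided by its L2 norm (5.0); the all-zero row stays zero.
```

Comparing such a reference against the output of the converted graph on a few small inputs is one way to sanity-check a conversion.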

Parameters:

g – the graph builder to add nodes to
sts – shapes defined by scikit-learn
outputs – desired output names
estimator – a fitted TfidfVectorizer
X – input tensor name: a STRING tensor of shape (N, max_tokens_per_doc) (rows padded with "" as needed)
name – prefix for added node names

Returns:

output tensor name

Raises:

NotImplementedError – if analyzer is not 'word', or if norm is 'l1' or 'l2' while the graph opset is below 18