yobx.sklearn.feature_extraction.tfidf_vectorizer

yobx.sklearn.feature_extraction.tfidf_vectorizer.sklearn_tfidf_vectorizer(g: GraphBuilderExtendedProtocol, sts: Dict, outputs: List[str], estimator: TfidfVectorizer, X: str, name: str = 'tfidf_vectorizer') → str

Converts a sklearn.feature_extraction.text.TfidfVectorizer into ONNX.

TfidfVectorizer combines a CountVectorizer with a TfidfTransformer. This converter reproduces that two-step pipeline:

  1. Count step — the ONNX TfIdfVectorizer operator (opset 9+) maps a pre-tokenised string tensor to raw term-frequency counts, exactly as CountVectorizer.transform does.

  2. TF-IDF step — the same sublinear-TF scaling, IDF weighting, and L1/L2 row normalisation implemented by the TfidfTransformer converter.
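The count step can be pictured with a small pure-Python sketch. It is only an illustration of the semantics described above, not the converter's code: it handles unigrams only, and the helper `count_terms` and the toy vocabulary are hypothetical names introduced here.

```python
from collections import Counter


def count_terms(token_rows, vocabulary):
    """Count unigram occurrences per row.

    Tokens missing from the vocabulary, including the "" padding,
    contribute nothing to the counts.
    """
    n_features = len(vocabulary)
    counts = []
    for row in token_rows:
        seen = Counter(tok for tok in row if tok in vocabulary)
        vec = [0.0] * n_features
        for tok, c in seen.items():
            vec[vocabulary[tok]] = float(c)
        counts.append(vec)
    return counts


# Hypothetical vocabulary mapping term -> column index,
# in the spirit of a fitted vectorizer's vocabulary_ attribute.
vocabulary = {"cat": 0, "dog": 1, "sat": 2}
rows = [
    ["the", "cat", "sat", "cat"],
    ["dog", "", "", ""],  # shorter document, padded with ""
]
counts = count_terms(rows, vocabulary)
```

Each output row has one column per vocabulary entry, which is the (N, n_features) layout that the TF-IDF step then rescales.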

The input tensor X must already be tokenised — a 2-D string tensor of shape (N, max_tokens_per_doc) where shorter rows are padded with empty strings "". Raw text documents are not accepted because ONNX lacks a standard tokeniser.
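One way to prepare such an input is sketched below. It assumes the vectorizer was fitted with scikit-learn's default word analyzer (lowercasing plus the default `token_pattern`), with no stop words, accent stripping, or custom preprocessor; any other configuration would need matching preprocessing. The helper name `tokenize_and_pad` is hypothetical.

```python
import re

# scikit-learn's default token_pattern for analyzer='word':
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize_and_pad(docs, lowercase=True):
    """Tokenise each document and right-pad rows with "" to equal length."""
    tokenized = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        tokenized.append(TOKEN_PATTERN.findall(text))
    width = max((len(t) for t in tokenized), default=0)
    return [t + [""] * (width - len(t)) for t in tokenized]


padded = tokenize_and_pad(["The cat sat.", "A dog"])
```

The resulting list of lists has shape (N, max_tokens_per_doc) and could be converted to a string array (for example with NumPy) before being fed to an ONNX runtime.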

Supported options

  • analyzer='word' only (character-level tokenisation has no ONNX equivalent).

  • ngram_range, sublinear_tf, use_idf, norm ('l1', 'l2', or None): all supported.

  • smooth_idf, binary: supported; smooth_idf is reflected in the fitted idf_ values and binary in the count mode.
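As a reminder of why smooth_idf needs no dedicated handling, the sketch below reproduces the IDF formulas documented by scikit-learn for TfidfTransformer, together with the effect of binary on counts. The function names are hypothetical and the code is an illustration, not part of the converter.

```python
import math


def idf_weights(doc_freq, n_docs, smooth_idf=True):
    """IDF as documented by scikit-learn.

    smooth_idf=True : ln((1 + n) / (1 + df)) + 1
    smooth_idf=False: ln(n / df) + 1
    """
    s = 1 if smooth_idf else 0
    return [math.log((n_docs + s) / (df + s)) + 1.0 for df in doc_freq]


def binarize(counts):
    """binary=True: every non-zero count becomes 1."""
    return [[1.0 if c > 0 else 0.0 for c in row] for row in counts]


smooth = idf_weights([1, 3], n_docs=3, smooth_idf=True)
plain = idf_weights([1, 3], n_docs=3, smooth_idf=False)
```

Because the smoothing choice only changes the numbers stored in idf_ at fit time, a converter that multiplies by the fitted idf_ picks it up automatically.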

Graph layout (all options active)

X  (N, seq_len) STRING
│
└── TfIdfVectorizer(mode=TF, …)       # raw counts
       │
       ├── Greater(0) / Log / Add(1) / Where  # sublinear_tf (optional)
       ├── Mul(idf_)                           # use_idf (optional)
       └── ReduceL2 / Div                      # norm (optional)
              └── output  (N, n_features) float
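
The post-count stages of this layout can be mirrored in plain Python as a reference for checking outputs. This is a minimal sketch under assumptions: the helper `tfidf_rows` is hypothetical, and leaving all-zero rows unchanged follows scikit-learn's behaviour rather than anything stated here about the generated graph.

```python
import math


def tfidf_rows(counts, idf=None, sublinear_tf=False, norm="l2"):
    """Apply sublinear TF, IDF weighting and row normalisation to raw counts."""
    out = []
    for row in counts:
        vals = list(row)
        if sublinear_tf:
            # Where(count > 0, Log(count) + 1, count)
            vals = [math.log(v) + 1.0 if v > 0 else v for v in vals]
        if idf is not None:
            # Mul(idf_)
            vals = [v * w for v, w in zip(vals, idf)]
        if norm == "l2":
            scale = math.sqrt(sum(v * v for v in vals))
        elif norm == "l1":
            scale = sum(abs(v) for v in vals)
        else:
            scale = 0.0  # norm=None: skip normalisation
        if scale > 0:
            # Div by the row norm; all-zero rows are left untouched (assumption)
            vals = [v / scale for v in vals]
        out.append(vals)
    return out


normalised = tfidf_rows([[3.0, 4.0], [0.0, 0.0]], norm="l2")
scaled = tfidf_rows([[math.e, 0.0]], idf=[2.0, 5.0], sublinear_tf=True, norm=None)
```

Comparing such a reference against estimator.transform on a few documents is a quick way to confirm which options are active before converting.
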
Parameters:
  • g – the graph builder to add nodes to

  • sts – shapes defined by scikit-learn

  • estimator – a fitted TfidfVectorizer

  • outputs – desired output names

  • X – name of the input tensor, a STRING tensor of shape (N, max_tokens_per_doc) whose rows are padded with "" as needed

  • name – prefix for added node names

Returns:

output tensor name

Raises:

NotImplementedError – if analyzer is not 'word', or if norm is 'l1' or 'l2' and the graph opset is < 18