.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_traceable_ngrams_tfidf.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_traceable_ngrams_tfidf.py:

Traceable n-grams with tf-idf
=============================

The notebook looks at how n-grams are stored in *CountVectorizer* and
*TfidfVectorizer*, and shows that the current storage (scikit-learn <= 0.21)
is ambiguous in some cases.

Example with CountVectorizer
----------------------------

scikit-learn version
~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 17-37

.. code-block:: Python

    import numpy
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from mlinsights.mlmodel.sklearn_text import (
        TraceableCountVectorizer,
        TraceableTfidfVectorizer,
    )

    corpus = numpy.array(
        [
            "This is the first document.",
            "This document is the second document.",
            "Is this the first document?",
            "",
        ]
    ).reshape((4,))

    mod1 = CountVectorizer(ngram_range=(1, 2))
    mod1.fit(corpus)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    CountVectorizer(ngram_range=(1, 2))

.. GENERATED FROM PYTHON SOURCE LINES 39-42

.. code-block:: Python

    mod1.transform(corpus).todense()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    matrix([[1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
            [2, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0],
            [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

.. GENERATED FROM PYTHON SOURCE LINES 44-48

.. code-block:: Python

    mod1.vocabulary_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'this': 12, 'is': 4, 'the': 9, 'first': 2, 'document': 0, 'this is': 14, 'is the': 5, 'the first': 10, 'first document': 3, 'second': 7, 'this document': 13, 'document is': 1, 'the second': 11, 'second document': 8, 'is this': 6, 'this the': 15}

.. GENERATED FROM PYTHON SOURCE LINES 50-61

.. code-block:: Python

    corpus = numpy.array(
        [
            "This is the first document.",
            "This document is the second document.",
            "Is this the first document?",
            "",
        ]
    ).reshape((4,))

.. GENERATED FROM PYTHON SOURCE LINES 63-67

.. code-block:: Python

    mod2 = TraceableCountVectorizer(ngram_range=(1, 2))
    mod2.fit(corpus)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TraceableCountVectorizer(ngram_range=(1, 2))

.. GENERATED FROM PYTHON SOURCE LINES 69-72

.. code-block:: Python

    mod2.transform(corpus).todense()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    matrix([[1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
            [2, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0],
            [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

.. GENERATED FROM PYTHON SOURCE LINES 74-78

.. code-block:: Python

    mod2.vocabulary_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {('this',): 12, ('is',): 4, ('the',): 9, ('first',): 2, ('document',): 0, ('this', 'is'): 14, ('is', 'the'): 5, ('the', 'first'): 10, ('first', 'document'): 3, ('second',): 7, ('this', 'document'): 13, ('document', 'is'): 1, ('the', 'second'): 11, ('second', 'document'): 8, ('is', 'this'): 6, ('this', 'the'): 15}

.. GENERATED FROM PYTHON SOURCE LINES 79-88

The new class does exactly the same thing but keeps each n-gram in a more
explicit form, a tuple of tokens. The original form, a single string, is
sometimes ambiguous, as the next example shows.

Funny example with TfidfVectorizer
----------------------------------

scikit-learn version
~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 88-99

.. code-block:: Python

    corpus = numpy.array(
        [
            "This is the first document.",
            "This document is the second document.",
            "Is this the first document?",
            "",
        ]
    ).reshape((4,))

.. GENERATED FROM PYTHON SOURCE LINES 101-104

.. code-block:: Python

    mod1 = TfidfVectorizer(ngram_range=(1, 2), token_pattern="[a-zA-Z ]{1,4}")
    mod1.fit(corpus)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TfidfVectorizer(ngram_range=(1, 2), token_pattern='[a-zA-Z ]{1,4}')

.. GENERATED FROM PYTHON SOURCE LINES 106-109

.. code-block:: Python

    mod1.transform(corpus).todense()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    matrix([[0.        , 0.        , 0.32940523, 0.32940523, 0.        ,
             0.        , 0.        , 0.        , 0.25970687, 0.25970687,
             0.        , 0.        , 0.25970687, 0.25970687, 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.25970687,
             0.        , 0.        , 0.25970687, 0.25970687, 0.        ,
             0.        , 0.25970687, 0.25970687, 0.25970687, 0.        ,
             0.32940523, 0.        , 0.        ],
            [0.24528087, 0.24528087, 0.        , 0.        , 0.24528087,
             0.24528087, 0.24528087, 0.24528087, 0.        , 0.        ,
             0.24528087, 0.24528087, 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.24528087, 0.        ,
             0.24528087, 0.24528087, 0.        , 0.        , 0.24528087,
             0.24528087, 0.        , 0.        , 0.19338226, 0.24528087,
             0.        , 0.24528087, 0.24528087],
            [0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.25453384, 0.25453384,
             0.        , 0.        , 0.25453384, 0.25453384, 0.3228439 ,
             0.3228439 , 0.3228439 , 0.3228439 , 0.        , 0.25453384,
             0.        , 0.        , 0.25453384, 0.25453384, 0.        ,
             0.        , 0.25453384, 0.25453384, 0.        , 0.        ,
             0.        , 0.        , 0.        ],
            [0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        ]])

.. GENERATED FROM PYTHON SOURCE LINES 111-115

.. code-block:: Python

    mod1.vocabulary_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'this': 28, ' is ': 2, 'the ': 26, 'firs': 12, 't do': 22, 'cume': 8, 'nt': 19, 'this is ': 30, ' is the ': 3, 'the firs': 27, 'firs t do': 13, 't do cume': 23, 'cume nt': 9, ' doc': 0, 'umen': 31, 't is': 24, ' the': 6, ' sec': 4, 'ond ': 20, 'docu': 10, 'ment': 18, 'this doc': 29, ' doc umen': 1, 'umen t is': 32, 't is the': 25, ' the sec': 7, ' sec ond ': 5, 'ond docu': 21, 'docu ment': 11, 'is t': 16, 'his ': 14, 'is t his ': 17, 'his the ': 15}

.. GENERATED FROM PYTHON SOURCE LINES 116-118

mlinsights version
~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 118-122

.. code-block:: Python

    mod2 = TraceableTfidfVectorizer(ngram_range=(1, 2), token_pattern="[a-zA-Z ]{1,4}")
    mod2.fit(corpus)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    TraceableTfidfVectorizer(ngram_range=(1, 2), token_pattern='[a-zA-Z ]{1,4}')

.. GENERATED FROM PYTHON SOURCE LINES 124-127

.. code-block:: Python

    mod2.transform(corpus).todense()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    matrix([[0.        , 0.        , 0.32940523, 0.32940523, 0.        ,
             0.        , 0.        , 0.        , 0.25970687, 0.25970687,
             0.        , 0.        , 0.25970687, 0.25970687, 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.25970687,
             0.        , 0.        , 0.25970687, 0.25970687, 0.        ,
             0.        , 0.25970687, 0.25970687, 0.25970687, 0.        ,
             0.32940523, 0.        , 0.        ],
            [0.24528087, 0.24528087, 0.        , 0.        , 0.24528087,
             0.24528087, 0.24528087, 0.24528087, 0.        , 0.        ,
             0.24528087, 0.24528087, 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.24528087, 0.        ,
             0.24528087, 0.24528087, 0.        , 0.        , 0.24528087,
             0.24528087, 0.        , 0.        , 0.19338226, 0.24528087,
             0.        , 0.24528087, 0.24528087],
            [0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.25453384, 0.25453384,
             0.        , 0.        , 0.25453384, 0.25453384, 0.3228439 ,
             0.3228439 , 0.3228439 , 0.3228439 , 0.        , 0.25453384,
             0.        , 0.        , 0.25453384, 0.25453384, 0.        ,
             0.        , 0.25453384, 0.25453384, 0.        , 0.        ,
             0.        , 0.        , 0.        ],
            [0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        , 0.        , 0.        ,
             0.        , 0.        , 0.        ]])

.. GENERATED FROM PYTHON SOURCE LINES 129-133

.. code-block:: Python

    mod2.vocabulary_

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {('this',): 28, (' is ',): 2, ('the ',): 26, ('firs',): 12, ('t do',): 22, ('cume',): 8, ('nt',): 19, ('this', ' is '): 30, (' is ', 'the '): 3, ('the ', 'firs'): 27, ('firs', 't do'): 13, ('t do', 'cume'): 23, ('cume', 'nt'): 9, (' doc',): 0, ('umen',): 31, ('t is',): 24, (' the',): 6, (' sec',): 4, ('ond ',): 20, ('docu',): 10, ('ment',): 18, ('this', ' doc'): 29, (' doc', 'umen'): 1, ('umen', 't is'): 32, ('t is', ' the'): 25, (' the', ' sec'): 7, (' sec', 'ond '): 5, ('ond ', 'docu'): 21, ('docu', 'ment'): 11, ('is t',): 16, ('his ',): 14, ('is t', 'his '): 17, ('his ', 'the '): 15}

.. GENERATED FROM PYTHON SOURCE LINES 134-141

As you can see, the original n-gram ``'t is the'`` (index 25 in the
vocabulary) is a little bit ambiguous. It is in fact ``('t is', ' the')``,
as the *TraceableTfidfVectorizer* lets you know. The original form could
have been ``('t', 'is the')``, ``('t is', ' the')``, ``('t is ', ' the')``,
``('t is ', 'the')``, ``('t', 'is ', 'the')``\ … The regular expression gives
some hints, but not information that can easily be used to guess the right
one.
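
The ambiguity described above can be reproduced without either vectorizer.
The sketch below is only an illustration and is not part of scikit-learn or
mlinsights: ``candidate_bigrams`` is a hypothetical helper that lists every
pair of strings which, joined by a single space (the way scikit-learn builds
word n-grams), gives the same stored key. Joining ``'t is'`` and ``' the'``
produces two consecutive spaces, which may not be visible in the printed
vocabulary.

.. code-block:: Python

    def candidate_bigrams(joined):
        """Return every (left, right) pair with left + " " + right == joined."""
        return [
            (joined[:i], joined[i + 1 :])
            for i, char in enumerate(joined)
            # Only a space strictly inside the string can be the separator.
            if char == " " and 0 < i < len(joined) - 1
        ]


    # The two tokens found by the traceable vectorizer, joined with one space.
    joined = " ".join(("t is", " the"))

    candidates = candidate_bigrams(joined)
    for pair in candidates:
        print(pair)
    # ('t', 'is  the')
    # ('t is', ' the')
    # ('t is ', 'the')

All three pairs lead to the same string key, so the key alone cannot tell
which pair of tokens produced it. Keeping the tuple removes the need to guess.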