module mlmodel.sklearn_text#

Inheritance diagram of mlinsights.mlmodel.sklearn_text

Short summary#

module mlinsights.mlmodel.sklearn_text

Overloads TfidfVectorizer and CountVectorizer.


Classes#

NGramsMixin: overloads method _word_ngrams.

TraceableCountVectorizer: inherits from NGramsMixin, which overloads method _word_ngrams.

TraceableTfidfVectorizer: inherits from NGramsMixin, which overloads method _word_ngrams.

Properties#

_repr_html_ (defined on both vectorizer classes): HTML representation of estimator. This is redundant with the logic of _repr_mimebundle_. The latter should …

idf_: Inverse document frequency vector, only defined if use_idf=True. Returns an ndarray of shape …

Methods#

_word_ngrams: Turn tokens into a sequence of n-grams after stop words filtering. Defined on NGramsMixin and inherited by TraceableCountVectorizer and TraceableTfidfVectorizer.

Documentation#

Overloads TfidfVectorizer and CountVectorizer.


class mlinsights.mlmodel.sklearn_text.NGramsMixin#

Bases: _VectorizerMixin

Overloads method _word_ngrams to store tuples instead of strings in member vocabulary_ of TfidfVectorizer or CountVectorizer. This member contains the list of n-grams used to process documents. See TraceableCountVectorizer and TraceableTfidfVectorizer for examples.


_word_ngrams(tokens, stop_words=None)#

Turn tokens into a sequence of n-grams after stop words filtering
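Concretely, here is a minimal sketch (plain Python, not the library's code) contrasting what the two implementations of _word_ngrams produce for ngram_range=(1, 2):

tokens = ["this", "is", "the", "first", "document"]

# scikit-learn joins the tokens of each n-gram with a space, so the
# vocabulary ends up with plain strings:
sklearn_style = list(tokens) + [
    " ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)
]
# ['this', 'is', 'the', 'first', 'document',
#  'this is', 'is the', 'the first', 'first document']

# NGramsMixin keeps each n-gram as a tuple of its tokens instead:
traceable_style = [(t,) for t in tokens] + [
    tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)
]
# [('this',), ('is',), ('the',), ('first',), ('document',),
#  ('this', 'is'), ('is', 'the'), ('the', 'first'), ('first', 'document')]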


class mlinsights.mlmodel.sklearn_text.TraceableCountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)#

Bases: CountVectorizer, NGramsMixin

Inherits from NGramsMixin, which overloads method _word_ngrams to keep more information about n-grams while still producing the same outputs as CountVectorizer.

<<<

import numpy
from sklearn.feature_extraction.text import CountVectorizer
from mlinsights.mlmodel.sklearn_text import TraceableCountVectorizer
from pprint import pformat

corpus = numpy.array([
    "This is the first document.",
    "This document is the second document.",
    "Is this the first document?",
    "",
]).reshape((4, ))

print('CountVectorizer from scikit-learn')
# standard vectorizer: vocabulary_ keys are space-joined strings
mod1 = CountVectorizer(ngram_range=(1, 2))
mod1.fit(corpus)
print(mod1.transform(corpus).todense()[:2])
print(pformat(mod1.vocabulary_)[:100])

print('TraceableCountVectorizer from mlinsights')
# traceable variant: same counts, but vocabulary_ keys are token tuples
mod2 = TraceableCountVectorizer(ngram_range=(1, 2))
mod2.fit(corpus)
print(mod2.transform(corpus).todense()[:2])
print(pformat(mod2.vocabulary_)[:100])

>>>

    CountVectorizer from scikit-learn
    [[1 0 1 1 1 1 0 0 0 1 1 0 1 0 1 0]
     [2 1 0 0 1 1 0 1 1 1 0 1 1 1 0 0]]
    {'document': 0,
     'document is': 1,
     'first': 2,
     'first document': 3,
     'is': 4,
     'is the': 5,
     'is t
    TraceableCountVectorizer from mlinsights
    [[1 0 1 1 1 1 0 0 0 1 1 0 1 0 1 0]
     [2 1 0 0 1 1 0 1 1 1 0 1 1 1 0 0]]
    {('document',): 0,
     ('document', 'is'): 1,
     ('first',): 2,
     ('first', 'document'): 3,
     ('is',): 4,
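
The tuple keys make it easy to label the columns of the output matrix. A short usage sketch (the variable names are illustrative, not part of the API):

from mlinsights.mlmodel.sklearn_text import TraceableCountVectorizer

corpus = ["This is the first document.",
          "This document is the second document."]
vect = TraceableCountVectorizer(ngram_range=(1, 2))
counts = vect.fit_transform(corpus)

# invert the fitted vocabulary to map each column index back to its n-gram
index_to_ngram = {index: ngram for ngram, index in vect.vocabulary_.items()}
for column in range(counts.shape[1]):
    print(column, index_to_ngram[column])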

A weirder example with TraceableTfidfVectorizer shows more differences.


_word_ngrams(tokens, stop_words=None)#

Turn tokens into a sequence of n-grams after stop words filtering

class mlinsights.mlmodel.sklearn_text.TraceableTfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)#

Bases: TfidfVectorizer, NGramsMixin

Inherits from NGramsMixin, which overloads method _word_ngrams to keep more information about n-grams while still producing the same outputs as TfidfVectorizer.

<<<

import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from mlinsights.mlmodel.sklearn_text import TraceableTfidfVectorizer
from pprint import pformat

corpus = numpy.array([
    "This is the first document.",
    "This document is the second document.",
    "Is this the first document?",
    "",
]).reshape((4, ))

print('TfidfVectorizer from scikit-learn')
# this token_pattern lets tokens include spaces, which makes the
# space-joined string keys of vocabulary_ ambiguous
mod1 = TfidfVectorizer(ngram_range=(1, 2),
                       token_pattern="[a-zA-Z ]{1,4}")
mod1.fit(corpus)
print(mod1.transform(corpus).todense()[:2])
print(pformat(mod1.vocabulary_)[:100])

print('TraceableTfidfVectorizer from mlinsights')
# traceable variant: same weights, but tuple keys keep token boundaries
mod2 = TraceableTfidfVectorizer(ngram_range=(1, 2),
                                token_pattern="[a-zA-Z ]{1,4}")
mod2.fit(corpus)
print(mod2.transform(corpus).todense()[:2])
print(pformat(mod2.vocabulary_)[:100])

>>>

    TfidfVectorizer from scikit-learn
    [[0.    0.    0.329 0.329 0.    0.    0.    0.    0.26  0.26  0.    0.
      0.26  0.26  0.    0.    0.    0.    0.    0.26  0.    0.    0.26  0.26
      0.    0.    0.26  0.26  0.26  0.    0.329 0.    0.   ]
     [0.245 0.245 0.    0.    0.245 0.245 0.245 0.245 0.    0.    0.245 0.245
      0.    0.    0.    0.    0.    0.    0.245 0.    0.245 0.245 0.    0.
      0.245 0.245 0.    0.    0.193 0.245 0.    0.245 0.245]]
    {' doc': 0,
     ' doc umen': 1,
     ' is ': 2,
     ' is  the ': 3,
     ' sec': 4,
     ' sec ond ': 5,
     ' the': 6,
     
    TraceableTfidfVectorizer from mlinsights
    [[0.    0.    0.329 0.329 0.    0.    0.    0.    0.26  0.26  0.    0.
      0.26  0.26  0.    0.    0.    0.    0.    0.26  0.    0.    0.26  0.26
      0.    0.    0.26  0.26  0.26  0.    0.329 0.    0.   ]
     [0.245 0.245 0.    0.    0.245 0.245 0.245 0.245 0.    0.    0.245 0.245
      0.    0.    0.    0.    0.    0.    0.245 0.    0.245 0.245 0.    0.
      0.245 0.245 0.    0.    0.193 0.245 0.    0.245 0.245]]
    {(' doc',): 0,
     (' doc', 'umen'): 1,
     (' is ',): 2,
     (' is ', 'the '): 3,
     (' sec',): 4,
     (' sec', '
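
The tuple keys are what keep this example readable: with token_pattern="[a-zA-Z ]{1,4}", tokens may themselves contain spaces, so scikit-learn's space-joined string keys blur token boundaries. A small illustration in plain Python:

# scikit-learn's key for the bigram (' is ', 'the ') joins the tokens with
# one more space, losing the original boundaries:
print(" ".join((" is ", "the ")))  # -> ' is  the '
# the tuple kept by TraceableTfidfVectorizer stays unambiguous:
print((" is ", "the "))            # -> (' is ', 'the ')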


_word_ngrams(tokens, stop_words=None)#

Turn tokens into a sequence of n-grams after stop words filtering
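
As a sanity check of the claim that the traceable classes keep the same numerical outputs as their scikit-learn counterparts, one can compare both transforms directly (a hedged sketch, not part of the module):

import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from mlinsights.mlmodel.sklearn_text import TraceableTfidfVectorizer

corpus = ["This is the first document.",
          "This document is the second document."]
a = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus)
b = TraceableTfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus)
# both matrices should agree element-wise
assert numpy.allclose(a.todense(), b.todense())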