Module tomotopy.label

Submodule tomotopy.label provides automatic topic labeling techniques. You can label topics with a short script like the one below; example results are attached at the bottom of the code.

::

import tomotopy as tp

corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
# data_feeder yields a tuple of (raw string, user data) or a str (raw string)
corpus.process(open(input_file, encoding='utf-8'))

# make LDA model and train
mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
mdl.train(0)  # train(0) initializes the model so the corpus statistics below are available
print('Num docs:', len(mdl.docs), ', Vocab size:', len(mdl.used_vocabs), ', Num words:', mdl.num_words)
print('Removed top words:', mdl.removed_top_words)
for i in range(0, 1000, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# extract candidates for auto topic labeling
extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
cands = extractor.extract(mdl)

# ranking the candidates of labels for a specific topic
labeler = tp.label.FoRelevance(mdl, cands, min_df=5, smoothing=1e-2, mu=0.25)
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    print("Labels:", ', '.join(label for label, score in labeler.get_topic_labels(k, top_n=5)))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep='\t')
    print()

# Example of Results
# -----------------
# == Topic #13 ==
# Labels: american basebal, american actress, lawyer politician, race car driver, brown american
# american        0.061747949570417404
# english 0.02476435713469982
# player  0.02357063814997673
# politician      0.020087148994207382
# footbal 0.016364915296435356
# author  0.014303036034107208
# actor   0.01202411763370037
# french  0.009745198301970959
# academ  0.009701790288090706
# produc  0.008822779171168804
# 
# == Topic #16 ==
# Labels: lunar, saturn, orbit moon, nasa report, orbit around
# apollo  0.03052366152405739
# star    0.017564402893185616
# mission 0.015656694769859314
# earth   0.01532777864485979
# lunar   0.015130429528653622
# moon    0.013683202676475048
# orbit   0.011315013282001019
# crew    0.01092031504958868
# space   0.010821640491485596
# nasa    0.009999352507293224

Classes

class Candidate (*args, **kwargs)

Instance variables

var cf
var df
var name
var score
var words
class FoRelevance (topic_model, cands, min_df=5, smoothing=0.01, mu=0.25, window_size=-1, workers=0)

Added in version: 0.6.0

This type provides an implementation of First-order Relevance for topic labeling based on the following papers:

  • Mei, Q., Shen, X., & Zhai, C. (2007, August). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 490-499).

Parameters

topic_model
an instance of topic model to label topics
cands : Iterable[Candidate]
a list of candidates to be used as topic labels
min_df : int
minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value higher for larger corpora.
smoothing : float
a small value greater than 0 for Laplace smoothing
mu : float
a discriminative coefficient. The larger this value, the higher the final score of candidates that score high on a specific topic but low on the other topics.
window_size : int

Added in version: 0.10.0

size of the sliding window for calculating co-occurrence. If window_size=-1, the whole document is used instead of sliding windows. If your documents are long, setting this value to 50 ~ 100 rather than -1 is recommended.

workers : int
an integer indicating the number of worker threads used for sampling. If workers is 0, all cores in the system are used.
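As a rough sketch of the first-order relevance idea from Mei et al. (2007): a label's score for a topic is roughly the expected PMI between the label and the topic's words, minus mu times the label's average score over the other topics. This is only a toy illustration, not tomotopy's implementation; the function name, data, and numbers below are all made up:

```python
def relevance(label_pmi, topic_word, mu=0.25):
    """Toy first-order relevance with a discriminative penalty.

    label_pmi[label][word] -> PMI(word, label) from corpus co-occurrence.
    topic_word[k][word]    -> p(word | topic k).
    A label's score for topic k is its expected PMI under p(word | k),
    minus mu times its average score over the other topics.
    """
    def base(label, dist):
        return sum(p * label_pmi[label].get(w, 0.0) for w, p in dist.items())

    k = len(topic_word)
    scores = []
    for i in range(k):
        row = {}
        for label in label_pmi:
            own = base(label, topic_word[i])
            others = sum(base(label, topic_word[j]) for j in range(k) if j != i)
            row[label] = own - mu * others / max(k - 1, 1)
        scores.append(row)
    return scores

# Two toy topics and two candidate labels with invented PMI values.
topic_word = [{'moon': 0.6, 'orbit': 0.4}, {'actor': 0.7, 'film': 0.3}]
label_pmi = {'lunar orbit': {'moon': 2.0, 'orbit': 2.5},
             'movie star': {'actor': 2.2, 'film': 1.8}}
scores = relevance(label_pmi, topic_word)
# 'lunar orbit' wins topic 0; 'movie star' wins topic 1.
```

A larger mu pushes down labels that fit many topics at once, which is why it is described above as a discriminative coefficient.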

Methods

def get_topic_labels(self, k, top_n=10) -> List[Tuple[str, float]]

Return the top-n label candidates for the topic k

Parameters

k : int
an integer indicating a topic
top_n : int
the number of labels
class PMIExtractor (min_cf=10, min_df=5, min_len=1, max_len=5, max_cand=5000, normalized=False)

Added in version: 0.6.0

PMIExtractor exploits multivariate pointwise mutual information to extract collocations, i.e. strings of words that co-occur more often than chance.

Parameters

min_cf : int
minimum collection frequency of collocations. Collocations with a smaller collection frequency than min_cf are excluded from the candidates. Set this value higher for larger corpora.
min_df : int
minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value higher for larger corpora.
min_len : int

Added in version: 0.10.0

minimum length of collocations. min_len=1 means that not only collocations but also all single words are extracted. Single words are not counted toward max_cand.

max_len : int
maximum length of collocations
max_cand : int
maximum number of candidates to extract
normalized : bool
whether to use normalized PMI (NPMI) instead of raw PMI to score collocations
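As a toy illustration of the underlying idea (not tomotopy's implementation; the function name, data, and thresholds below are invented), PMI for a two-word collocation compares how often the pair occurs against how often its words' individual frequencies would predict:

```python
import math
from collections import Counter

def pmi_bigrams(docs, min_count=2):
    """Score each adjacent word pair by pointwise mutual information.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); higher scores mean the
    pair co-occurs more often than its word frequencies would predict.
    Pairs rarer than min_count are dropped, like min_cf above.
    """
    words = Counter()
    pairs = Counter()
    total = 0
    for doc in docs:
        for i, w in enumerate(doc):
            words[w] += 1
            total += 1
            if i + 1 < len(doc):
                pairs[(w, doc[i + 1])] += 1
    n_pairs = sum(pairs.values())
    scores = {}
    for (a, b), c in pairs.items():
        if c < min_count:
            continue
        p_ab = c / n_pairs
        p_a = words[a] / total
        p_b = words[b] / total
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

docs = [['new', 'york', 'city'], ['new', 'york', 'times'],
        ['brave', 'new', 'world'], ['new', 'ideas']]
scores = pmi_bigrams(docs)
# Only ('new', 'york') passes min_count=2, and its PMI is positive.
```

The min_count cutoff also shows why min_cf and min_df matter: plain PMI is biased toward very rare pairs, so frequency thresholds keep one-off co-occurrences out of the candidate set.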

Methods

def extract(self, topic_model) -> List

Return the list of Candidates extracted from topic_model

Parameters

topic_model
an instance of topic model with documents to extract candidates