Module tomotopy.label

Submodule tomotopy.label provides automatic topic labeling techniques. You can label topics with a short script like the one below; example results are attached at the bottom of the code.

::

import tomotopy as tp

corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
# data_feeder yields a tuple of (raw string, user data) or a str (raw string)
corpus.process(open(input_file, encoding='utf-8'))

# make LDA model and train
mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
mdl.train(0)  # train(0) initializes the model so the corpus statistics below are available
print('Num docs:', len(mdl.docs), ', Vocab size:', len(mdl.used_vocabs), ', Num words:', mdl.num_words)
print('Removed top words:', mdl.removed_top_words)
for i in range(0, 1000, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# extract candidates for auto topic labeling
extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
cands = extractor.extract(mdl)

# ranking the candidates of labels for a specific topic
labeler = tp.label.FoRelevance(mdl, cands, min_df=5, smoothing=1e-2, mu=0.25)
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    print("Labels:", ', '.join(label for label, score in labeler.get_topic_labels(k, top_n=5)))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep='\t')
    print()

# Example of Results
# -----------------
# == Topic #13 ==
# Labels: american basebal, american actress, lawyer politician, race car driver, brown american
# american        0.061747949570417404
# english 0.02476435713469982
# player  0.02357063814997673
# politician      0.020087148994207382
# footbal 0.016364915296435356
# author  0.014303036034107208
# actor   0.01202411763370037
# french  0.009745198301970959
# academ  0.009701790288090706
# produc  0.008822779171168804
# 
# == Topic #16 ==
# Labels: lunar, saturn, orbit moon, nasa report, orbit around
# apollo  0.03052366152405739
# star    0.017564402893185616
# mission 0.015656694769859314
# earth   0.01532777864485979
# lunar   0.015130429528653622
# moon    0.013683202676475048
# orbit   0.011315013282001019
# crew    0.01092031504958868
# space   0.010821640491485596
# nasa    0.009999352507293224

Classes

class Candidate (*args, **kwargs)

Instance variables

var cf
var df
var name
var score
var words
class FoRelevance (topic_model, cands, min_df=5, smoothing=0.01, mu=0.25, window_size=-1, workers=0)

Added in version: 0.6.0

This type provides an implementation of First-order Relevance for topic labeling based on the following papers:

  • Mei, Q., Shen, X., & Zhai, C. (2007, August). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 490-499).

Parameters

topic_model
an instance of topic model to label topics
cands : Iterable[Candidate]
a list of candidates to be used as topic labels
min_df : int
minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value higher for larger corpora.
smoothing : float
a small value greater than 0 for Laplace smoothing
mu : float
a discriminative coefficient. The larger this value, the higher the final score of candidates that score high on a specific topic but low on the other topics.
window_size : int

Added in version: 0.10.0

size of the sliding window for calculating co-occurrence. If window_size=-1, the whole document is used instead of sliding windows. If your documents are long, setting this value to 50 ~ 100 rather than -1 is recommended.

workers : int
an integer indicating the number of worker threads used for sampling. If workers is 0, all cores in the system are used.
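As a rough sketch of the first-order relevance idea from Mei et al. (2007): a label's score for a topic is roughly the expected PMI between the label and the topic's words, minus mu times the label's average score over the other topics. This is only a toy illustration, not tomotopy's implementation; the function name, data, and numbers below are all made up:

```python
def relevance(label_pmi, topic_word, mu=0.25):
    """Toy first-order relevance with a discriminative penalty.

    label_pmi[label][word] -> PMI(word, label) from corpus co-occurrence.
    topic_word[k][word]    -> p(word | topic k).
    A label's score for topic k is its expected PMI under p(word | k),
    minus mu times its average score over the other topics.
    """
    def base(label, dist):
        return sum(p * label_pmi[label].get(w, 0.0) for w, p in dist.items())

    k = len(topic_word)
    scores = []
    for i in range(k):
        row = {}
        for label in label_pmi:
            own = base(label, topic_word[i])
            others = sum(base(label, topic_word[j]) for j in range(k) if j != i)
            row[label] = own - mu * others / max(k - 1, 1)
        scores.append(row)
    return scores

# Two toy topics and two candidate labels with invented PMI values.
topic_word = [{'moon': 0.6, 'orbit': 0.4}, {'actor': 0.7, 'film': 0.3}]
label_pmi = {'lunar orbit': {'moon': 2.0, 'orbit': 2.5},
             'movie star': {'actor': 2.2, 'film': 1.8}}
scores = relevance(label_pmi, topic_word)
# 'lunar orbit' wins topic 0; 'movie star' wins topic 1.
```

A larger mu pushes down labels that fit many topics at once, which is why it is described above as a discriminative coefficient.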

Methods

def get_topic_labels(self, k, top_n=10) -> List[Tuple[str, float]]

Return the top-n label candidates for the topic k

Parameters

k : int
an integer indicating a topic
top_n : int
the number of labels
class PMIExtractor (min_cf=10, min_df=5, min_len=1, max_len=5, max_cand=5000, normalized=False)

Added in version: 0.6.0

PMIExtractor exploits multivariate pointwise mutual information to extract collocations, i.e. strings of words that co-occur more often than chance.

Parameters

min_cf : int
minimum collection frequency of collocations. Collocations with a smaller collection frequency than min_cf are excluded from the candidates. Set this value higher for larger corpora.
min_df : int
minimum document frequency of collocations. Collocations with a smaller document frequency than min_df are excluded from the candidates. Set this value higher for larger corpora.
min_len : int

Added in version: 0.10.0

minimum length of collocations. min_len=1 means that not only collocations but also all single words are extracted. Single words are not counted toward max_cand.

max_len : int
maximum length of collocations
max_cand : int
maximum number of candidates to extract
normalized : bool
whether to use normalized PMI (NPMI) instead of raw PMI to score collocations
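As a toy illustration of the underlying idea (not tomotopy's implementation; the function name, data, and thresholds below are invented), PMI for a two-word collocation compares how often the pair occurs against how often its words' individual frequencies would predict:

```python
import math
from collections import Counter

def pmi_bigrams(docs, min_count=2):
    """Score each adjacent word pair by pointwise mutual information.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); higher scores mean the
    pair co-occurs more often than its word frequencies would predict.
    Pairs rarer than min_count are dropped, like min_cf above.
    """
    words = Counter()
    pairs = Counter()
    total = 0
    for doc in docs:
        for i, w in enumerate(doc):
            words[w] += 1
            total += 1
            if i + 1 < len(doc):
                pairs[(w, doc[i + 1])] += 1
    n_pairs = sum(pairs.values())
    scores = {}
    for (a, b), c in pairs.items():
        if c < min_count:
            continue
        p_ab = c / n_pairs
        p_a = words[a] / total
        p_b = words[b] / total
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

docs = [['new', 'york', 'city'], ['new', 'york', 'times'],
        ['brave', 'new', 'world'], ['new', 'ideas']]
scores = pmi_bigrams(docs)
# Only ('new', 'york') passes min_count=2, and its PMI is positive.
```

The min_count cutoff also shows why min_cf and min_df matter: plain PMI is biased toward very rare pairs, so frequency thresholds keep one-off co-occurrences out of the candidate set.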

Methods

def extract(self, topic_model) -> List

Return the list of Candidates extracted from topic_model

Parameters

topic_model
an instance of topic model with documents to extract candidates