Module `tomotopy.label`

Submodule `tomotopy.label` provides automatic topic labeling techniques. Topics can be labeled automatically with a short script like the one below. Sample results are appended at the end of the code.

import tomotopy as tp

corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
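# process() accepts an iterable whose items are either a str (raw text) or a tuple of (raw text, user data)
# input_file is assumed to be the path to a UTF-8 text file with one document per line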
corpus.process(open(input_file, encoding='utf-8'))

# make LDA model and train
mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
mdl.train(0)
print('Num docs:', len(mdl.docs), ', Vocab size:', mdl.num_vocabs, ', Num words:', mdl.num_words)
print('Removed top words:', mdl.removed_top_words)
for i in range(0, 1000, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# extract candidates for auto topic labeling
extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
cands = extractor.extract(mdl)

# ranking the candidates of labels for a specific topic
labeler = tp.label.FoRelevance(mdl, cands, min_df=5, smoothing=1e-2, mu=0.25)
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    print("Labels:", ', '.join(label for label, score in labeler.get_topic_labels(k, top_n=5)))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep='\t')
    print()

# Example of Results
# -----------------
# == Topic #13 ==
# Labels: american basebal, american actress, lawyer politician, race car driver, brown american
# american        0.061747949570417404
# english 0.02476435713469982
# player  0.02357063814997673
# politician      0.020087148994207382
# footbal 0.016364915296435356
# author  0.014303036034107208
# actor   0.01202411763370037
# french  0.009745198301970959
# academ  0.009701790288090706
# produc  0.008822779171168804
# 
# == Topic #16 ==
# Labels: lunar, saturn, orbit moon, nasa report, orbit around
# apollo  0.03052366152405739
# star    0.017564402893185616
# mission 0.015656694769859314
# earth   0.01532777864485979
# lunar   0.015130429528653622
# moon    0.013683202676475048
# orbit   0.011315013282001019
# crew    0.01092031504958868
# space   0.010821640491485596
# nasa    0.009999352507293224

Source code

"""
Submodule `tomotopy.label` provides automatic topic labeling techniques.
You can label topics automatically with simple code like below. The results are attached to the bottom of the code.

.. include:: ./auto_labeling_code.rst
"""

def _load():
    import tomotopy
    for k in dir(tomotopy.label):
        if not k.startswith('_'): globals()[k] = getattr(tomotopy.label, k)
_load()
del _load

import os
if os.environ.get('TOMOTOPY_LANG') == 'kr':
    __doc__ = """
`tomotopy.label` 서브모듈은 자동 토픽 라벨링 기법을 제공합니다.
아래에 나온 코드처럼 간단한 작업을 통해 토픽 모델의 결과에 이름을 붙일 수 있습니다. 그 결과는 코드 하단에 첨부되어 있습니다.

.. include:: ./auto_labeling_code.rst
"""
del os

Classes

class Candidate (...)

Candidate

Instance variables

var name: the actual name of the topic label candidate
var score: score of the candidate (read-only)
var words: words of the candidate for topic label (read-only)

class FoRelevance (topic_model, cands, min_df=5, smoothing=0.01, mu=0.25, workers=0)

Added in version: 0.6.0

This type implements First-order Relevance for topic labeling, based on the following paper:

Mei, Q., Shen, X., & Zhai, C. (2007, August). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 490-499).

Parameters

topic_model: the topic model instance whose topics are to be labeled
cands : Iterable[Candidate]: a list of candidates to be used as topic labels
min_df : int: minimum document frequency of collocations. Collocations with a document frequency smaller than min_df are excluded from the candidates. Set this value higher for large corpora.
smoothing : float: a small value greater than 0 for Laplace smoothing
mu : float: a discriminative coefficient. The larger this value, the higher the final score of candidates that score high on a specific topic and low on the other topics.
workers : int: an integer indicating the number of workers to perform sampling. If workers is 0, the number of cores in the system is used.
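The effect of mu can be illustrated with a minimal sketch. It assumes the discriminative adjustment described in the cited paper, where the average score of a label on the other topics, weighted by mu, is subtracted from its score on the target topic. The function name `discriminative_scores` and the toy scores are hypothetical, and tomotopy's internal computation may differ.

```python
def discriminative_scores(scores, mu=0.25):
    """Adjust per-topic label scores so that labels shared by many topics are penalized.

    scores[i][label] is the raw relevance of `label` for topic i (hypothetical input).
    Assumed form: adjusted = score_i - mu * mean(score_j for j != i).
    """
    k = len(scores)
    if k < 2:
        # Nothing to discriminate against with a single topic.
        return [dict(row) for row in scores]
    adjusted = []
    for i in range(k):
        row = {}
        for label in scores[i]:
            others = sum(scores[j][label] for j in range(k) if j != i)
            row[label] = scores[i][label] - mu * others / (k - 1)
        adjusted.append(row)
    return adjusted


# Toy example: "new york" scores well on both topics, "space mission" on topic 0 only.
raw = [
    {"space mission": 2.0, "new york": 1.5},
    {"space mission": 0.1, "new york": 1.4},
]
adj = discriminative_scores(raw, mu=0.5)
for topic_id, row in enumerate(adj):
    print(topic_id, sorted(row.items(), key=lambda kv: kv[1], reverse=True))
```

With a larger mu, the gap between a topic-specific label and a label shared across topics widens, which is the behavior the mu description above refers to.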

Methods

def get_topic_labels(self, k, top_n=10)

Return the top-n label candidates for the topic k

Parameters

k : int: an integer indicating a topic
top_n : int: the number of labels

class PMIExtractor (min_cf=10, min_df=5, max_len=5, max_cand=5000)

Added in version: 0.6.0

PMIExtractor uses multivariate pointwise mutual information to extract collocations. It finds sequences of words that co-occur more often than would be expected by chance.

Parameters

min_cf : int: minimum collection frequency of collocations. Collocations with a collection frequency smaller than min_cf are excluded from the candidates. Set this value higher for large corpora.
min_df : int: minimum document frequency of collocations. Collocations with a document frequency smaller than min_df are excluded from the candidates. Set this value higher for large corpora.
max_len : int: maximum length of collocations
max_cand : int: maximum number of candidates to extract
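To make the roles of min_cf, min_df and max_cand concrete, here is a simplified sketch of PMI-based collocation extraction. It handles two-word collocations only, using the standard bivariate PMI, log p(a, b) / (p(a) p(b)). It does not reproduce the multivariate PMI that PMIExtractor applies to longer sequences, and the function name `pmi_bigrams` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(docs, min_cf=2, min_df=2, max_cand=10):
    """Rank adjacent word pairs by pointwise mutual information (bigram-only sketch)."""
    unigram_cf = Counter()  # collection frequency of each word
    bigram_cf = Counter()   # collection frequency of each adjacent pair
    bigram_df = Counter()   # number of documents containing each pair
    for doc in docs:
        unigram_cf.update(doc)
        pairs = list(zip(doc, doc[1:]))
        bigram_cf.update(pairs)
        bigram_df.update(set(pairs))

    total_uni = sum(unigram_cf.values())
    total_bi = sum(bigram_cf.values())
    scored = []
    for (a, b), cf in bigram_cf.items():
        # Drop rare pairs: PMI is unreliable for low counts.
        if cf < min_cf or bigram_df[(a, b)] < min_df:
            continue
        p_ab = cf / total_bi
        p_a = unigram_cf[a] / total_uni
        p_b = unigram_cf[b] / total_uni
        scored.append(((a, b), math.log(p_ab / (p_a * p_b))))

    # Keep only the max_cand highest-scoring pairs.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:max_cand]


docs = [
    "the lunar orbit mission reached lunar orbit".split(),
    "crew entered lunar orbit near the moon".split(),
    "the crew left the moon".split(),
]
cands = pmi_bigrams(docs, min_cf=2, min_df=2)
for pair, score in cands:
    print(" ".join(pair), round(score, 3))
```

In this toy corpus only "lunar orbit" and "the moon" pass both frequency thresholds. Raising min_cf and min_df on a large corpus filters out pairs whose high PMI comes from a handful of chance co-occurrences.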

Methods

def extract(self, topic_model)

Return the list of Candidates extracted from topic_model

Parameters

topic_model: a topic model instance whose documents are used to extract candidates