Module tomotopy.label
Submodule `tomotopy.label` provides automatic topic labeling techniques.
You can label topics automatically with a few lines of code, as shown below. Example results are appended at the bottom of the code.
::
import tomotopy as tp
corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
# data_feeder yields a tuple of (raw string, user data) or a str (raw string)
corpus.process(open(input_file, encoding='utf-8'))
# make LDA model and train
mdl = tp.LDAModel(k=20, min_cf=10, min_df=5, corpus=corpus)
mdl.train(0)
print('Num docs:', len(mdl.docs), ', Vocab size:', len(mdl.used_vocabs), ', Num words:', mdl.num_words)
print('Removed top words:', mdl.removed_top_words)
for i in range(0, 1000, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))
# extract candidates for auto topic labeling
extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, max_len=5, max_cand=10000)
cands = extractor.extract(mdl)
# rank the label candidates for each topic
labeler = tp.label.FoRelevance(mdl, cands, min_df=5, smoothing=1e-2, mu=0.25)
for k in range(mdl.k):
    print("== Topic #{} ==".format(k))
    print("Labels:", ', '.join(label for label, score in labeler.get_topic_labels(k, top_n=5)))
    for word, prob in mdl.get_topic_words(k, top_n=10):
        print(word, prob, sep='\t')
    print()
# Example of Results
# -----------------
# == Topic #13 ==
# Labels: american basebal, american actress, lawyer politician, race car driver, brown american
# american 0.061747949570417404
# english 0.02476435713469982
# player 0.02357063814997673
# politician 0.020087148994207382
# footbal 0.016364915296435356
# author 0.014303036034107208
# actor 0.01202411763370037
# french 0.009745198301970959
# academ 0.009701790288090706
# produc 0.008822779171168804
#
# == Topic #16 ==
# Labels: lunar, saturn, orbit moon, nasa report, orbit around
# apollo 0.03052366152405739
# star 0.017564402893185616
# mission 0.015656694769859314
# earth 0.01532777864485979
# lunar 0.015130429528653622
# moon 0.013683202676475048
# orbit 0.011315013282001019
# crew 0.01092031504958868
# space 0.010821640491485596
# nasa 0.009999352507293224
Classes
class Candidate
Instance variables
var cf
    collection frequency of the candidate (read-only)
var df
    document frequency of the candidate (read-only)
var name
    the actual name of the candidate used as a topic label
var score
    score of the candidate (read-only)
var words
    words of the candidate for the topic label (read-only)
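
Candidates returned by an extractor can be inspected through these attributes. Below is a minimal sketch, assuming mdl and extractor were prepared as in the example at the top of this page::

    # print the first few label candidates with their statistics
    for cand in extractor.extract(mdl)[:10]:
        # cand.name is the display form; cand.words holds the underlying tokens
        print(cand.name, cand.words, cand.cf, cand.df, cand.score, sep='\t')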
class FoRelevance (topic_model, cands, min_df=5, smoothing=0.01, mu=0.25, window_size=-1, workers=0)
Added in version: 0.6.0
This type provides an implementation of First-order Relevance for topic labeling based on the following paper:
- Mei, Q., Shen, X., & Zhai, C. (2007, August). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 490-499).
Parameters
topic_model
    an instance of a topic model whose topics are to be labeled
cands : Iterable[Candidate]
    a list of candidates to be used as topic labels
min_df : int
    minimum document frequency of collocations. Collocations with a document frequency smaller than min_df are excluded from the candidates. Set this value larger if the corpus is big.
smoothing : float
    a small value greater than 0 for Laplace smoothing
mu : float
    a discriminative coefficient. The larger this value, the higher the final score given to candidates that score high on a specific topic and low on the other topics.
window_size : int
    Added in version: 0.10.0
    size of the sliding window for calculating co-occurrence. If window_size=-1, the whole document is used instead of sliding windows. If your documents are long, it is recommended to set this value to 50 ~ 100 rather than -1.
workers : int
    an integer indicating the number of workers used for sampling. If workers is 0, all cores in the system are used.
Methods
def get_topic_labels(self, k, top_n=10)
    Return the top-n label candidates for topic k.
    Parameters
    k : int
        an integer indicating the topic
    top_n : int
        the number of labels to return
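
For a corpus of long documents, the labeler can use a sliding window instead of whole documents when computing co-occurrence. Below is a minimal sketch, assuming mdl and cands were prepared as in the example at the top of this page (the parameter values are only illustrative)::

    import tomotopy as tp

    # use 100-word sliding windows instead of whole documents for co-occurrence
    labeler = tp.label.FoRelevance(mdl, cands, min_df=5, smoothing=1e-2, mu=0.25, window_size=100)
    for label, score in labeler.get_topic_labels(0, top_n=5):
        print(label, score, sep='\t')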
class PMIExtractor (min_cf=10, min_df=5, min_len=1, max_len=5, max_cand=5000, normalized=False)
Added in version: 0.6.0
PMIExtractor exploits multivariate pointwise mutual information to extract collocations, i.e. strings of words that co-occur statistically often.
Parameters
min_cf : int
    minimum collection frequency of collocations. Collocations with a collection frequency smaller than min_cf are excluded from the candidates. Set this value larger if the corpus is big.
min_df : int
    minimum document frequency of collocations. Collocations with a document frequency smaller than min_df are excluded from the candidates. Set this value larger if the corpus is big.
min_len : int
    Added in version: 0.10.0
    minimum length of collocations. min_len=1 means that not only collocations but also all single words are extracted. Single words are not counted toward max_cand.
max_len : int
    maximum length of collocations
max_cand : int
    maximum number of candidates to extract
Methods
def extract(self, topic_model)
    Return the list of Candidate objects extracted from topic_model.
    Parameters
    topic_model
        an instance of a topic model with documents from which candidates are extracted
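
Below is a minimal usage sketch, assuming mdl is an already trained tomotopy model (the parameter values are only illustrative)::

    import tomotopy as tp

    # extract single words and collocations of up to 3 tokens as label candidates
    extractor = tp.label.PMIExtractor(min_cf=10, min_df=5, min_len=1, max_len=3, max_cand=5000)
    cands = extractor.extract(mdl)
    print('Number of candidates:', len(cands))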