Package tomotopy

The Python package tomotopy provides types and functions for various topic models, including LDA, DMR, HDP, MG-LDA, PA and HPA. It is written in C++ for speed and provides a Python extension.

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models including

Getting Started

You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/) ::

$ pip install --upgrade pip
$ pip install tomotopy

The supported OS and Python versions are:

  • Linux (x86-64) with Python >= 3.6
  • macOS >= 10.13 with Python >= 3.6
  • Windows 7 or later (x86, x86-64) with Python >= 3.6
  • Other OS with Python >= 3.6: compilation from source code is required (with a C++14-compatible compiler)

After installing, you can start tomotopy by just importing. ::

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports 'none', training iterations may take a long time. However, since most modern Intel and AMD CPUs provide SIMD instruction sets, SIMD acceleration can usually deliver a big improvement.

Here is sample code for simple LDA training on texts from the file 'sample.txt'. ::

import tomotopy as tp
mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

mdl.summary()

Performance Of Tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) that gensim's LdaModel uses, but each iteration can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

The following chart compares the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim trains for 10 iterations.

↑ Performance in Intel i5-6600, x86-64 (4 cores)

↑ Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

↑ Performance in AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times more iterations, its overall running time was 5 to 10 times shorter than that of gensim. It also yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques. But from a practical point of view, we can compare their speed and results. The following chart shows the log-likelihood per word of the two models' results.

Top words of topics generated by tomotopy

#1 : use, acid, cell, form, also, effect
#2 : use, number, one, set, comput, function
#3 : state, use, may, court, law, person
#4 : state, american, nation, parti, new, elect
#5 : film, music, play, song, anim, album
#6 : art, work, design, de, build, artist
#7 : american, player, english, politician, footbal, author
#8 : appl, use, comput, system, softwar, compani
#9 : day, unit, de, state, german, dutch
#10 : team, game, first, club, leagu, play
#11 : church, roman, god, greek, centuri, bc
#12 : atom, use, star, electron, metal, element
#13 : alexand, king, ii, emperor, son, iii
#14 : languag, arab, use, word, english, form
#15 : speci, island, plant, famili, order, use
#16 : work, univers, world, book, human, theori
#17 : citi, area, region, popul, south, world
#18 : forc, war, armi, militari, jew, countri
#19 : year, first, would, later, time, death
#20 : apollo, use, aircraft, flight, mission, first

Top words of topics generated by gensim

#1 : use, acid, may, also, azerbaijan, cell
#2 : use, system, comput, one, also, time
#3 : state, citi, day, nation, year, area
#4 : state, lincoln, american, war, union, bell
#5 : anim, game, anal, atari, area, sex
#6 : art, use, work, also, includ, first
#7 : american, player, english, politician, footbal, author
#8 : new, american, team, season, leagu, year
#9 : appl, ii, martin, aston, magnitud, star
#10 : bc, assyrian, use, speer, also, abort
#11 : use, arsen, also, audi, one, first
#12 : algebra, use, set, ture, number, tank
#13 : appl, state, use, also, includ, product
#14 : use, languag, word, arab, also, english
#15 : god, work, one, also, greek, name
#16 : first, one, also, time, work, film
#17 : church, alexand, arab, also, anglican, use
#18 : british, american, new, war, armi, alfr
#19 : airlin, vote, candid, approv, footbal, air
#20 : apollo, mission, lunar, first, crew, land

The SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Vocabulary Controlling Using CF And DF

CF (collection frequency) and DF (document frequency) are concepts used in information retrieval: CF is the total number of times a word appears in the corpus, and DF is the number of documents in the corpus in which the word appears. tomotopy exposes these two measures as the parameters min_cf and min_df, which trim low-frequency words when building the corpus.

For example, let's say we have 5 documents #0 ~ #4 which are composed of the following words: ::

#0 : a, b, c, d, e, c
#1 : a, b, e, f
#2 : c, d, c
#3 : a, e, f, g
#4 : a, b, g

The CF of a and the CF of c are both 4, because each appears 4 times in the entire corpus. But the DF of a is 4 and the DF of c is 2, because a appears in #0, #1, #3 and #4, while c appears only in #0 and #2. So if we trim low-frequency words using min_cf=3, the result is as follows: ::

(d, f and g are removed.)
#0 : a, b, c, e, c
#1 : a, b, e
#2 : c, c
#3 : a, e
#4 : a, b

However, with min_df=3 the result is: ::

(c, d, f and g are removed.)
#0 : a, b, e
#1 : a, b, e
#2 : (empty doc)
#3 : a, e
#4 : a, b

As we can see, min_df is a stronger criterion than min_cf. In performing topic modeling, words that appear repeatedly in only one document do not contribute to estimating the topic-word distribution. So, removing words with low df is a good way to reduce model size while preserving the results of the final model. In short, please prefer using min_df to min_cf.
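
The trimming rule described above can be reproduced in a few lines of plain Python. The sketch below only illustrates the CF and DF definitions on the five example documents; the helper name trim_vocab is hypothetical and is not part of tomotopy.

```python
from collections import Counter

# The five example documents #0 ~ #4 from the text above
docs = [
    ['a', 'b', 'c', 'd', 'e', 'c'],
    ['a', 'b', 'e', 'f'],
    ['c', 'd', 'c'],
    ['a', 'e', 'f', 'g'],
    ['a', 'b', 'g'],
]

def trim_vocab(docs, min_cf=0, min_df=0):
    """Drop words whose CF is below min_cf or whose DF is below min_df."""
    # CF: total number of occurrences of each word in the corpus
    cf = Counter(w for doc in docs for w in doc)
    # DF: number of documents containing each word (count each doc once)
    df = Counter(w for doc in docs for w in set(doc))
    keep = {w for w in cf if cf[w] >= min_cf and df[w] >= min_df}
    return [[w for w in doc if w in keep] for doc in docs]

by_cf = trim_vocab(docs, min_cf=3)  # removes d, f and g
by_df = trim_vocab(docs, min_df=3)  # removes c, d, f and g

for i, (c, d) in enumerate(zip(by_cf, by_df)):
    print('#{} : min_cf=3 -> {} | min_df=3 -> {}'.format(i, c, d))
```

With min_df=3, document #2 ends up empty, which matches the listing above.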

Model Save And Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file. ::

import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is HDP model, 
# so when you load it by LDA model, it will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method is called.

See more at LDAModel.save() and LDAModel.load() methods.

Documents In The Model And Out Of The Model

We can use a topic model for two major purposes. The basic one is to discover topics from a set of documents as the result of a trained model, and the more advanced one is to infer topic distributions for unseen documents by using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of document are generated differently. A document in the model is created by the LDAModel.add_doc() method. add_doc can be called only before LDAModel.train() starts. In other words, once train has been called, add_doc cannot add a document to the model, because the set of documents used for training has become fixed.

To acquire the instance of the created document, you should use LDAModel.docs like:

::

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is generated by the LDAModel.make_doc() method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

::

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document

Inference For Unseen Documents

If a new document is created by LDAModel.make_doc(), its topic distribution can be inferred by the model. Inference for unseen documents should be performed using the LDAModel.infer() method.

::

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single Document instance or a list of Document instances. See more at LDAModel.infer().

Corpus And Transform

Every topic model in tomotopy has its own internal document type. A document of the type suitable for each model is created and added through that model's add_doc method. However, adding the same list of documents to different models becomes quite inconvenient, because add_doc must be called on the same list of documents for each model. Thus, tomotopy provides the Corpus class, which holds a list of documents. A Corpus can be inserted into any model by passing it as the argument corpus to __init__ or to the add_corpus method of each model. Inserting a Corpus has the same effect as inserting the documents the corpus holds.

Some topic models require different data for their documents. For example, DMRModel requires the argument metadata of type str, but PLDAModel requires the argument labels of type List[str]. Since a Corpus holds an independent set of documents rather than being tied to a specific topic model, the data types required by a topic model may not match when a corpus is added to that topic model. In this case, miscellaneous data can be transformed to fit the target topic model using the argument transform. See more details in the following code:

::

from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus) 
# You lose the a_data field of corpus,
# and the metadata that DMRModel requires is filled with the default value, an empty str.

assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}
# this function transforms a_data to metadata

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now docs in model have non-default metadata, generated from the a_data field.

assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'
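
A transform is an ordinary function from one dict of keyword arguments to another, so it can be written and checked without a model. The sketch below assumes the same a_data field as in the code above and shows a hypothetical transform that produces the labels argument (a List[str]) mentioned for PLDAModel; the function name is illustrative only.

```python
def transform_a_data_to_labels(misc: dict) -> dict:
    # misc holds the extra keyword arguments stored with a document
    # in the corpus (here, the a_data field); the returned dict holds
    # the keyword arguments expected by the target model.
    return {'labels': [str(misc['a_data'])]}

print(transform_a_data_to_labels({'a_data': 1}))  # {'labels': ['1']}
```

Such a function would be passed as the transform argument of add_corpus in the same way as transform_a_data_to_metadata above.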

Parallel Sampling Algorithms

Since version 0.5.0, tomotopy allows you to choose a parallelism algorithm. The algorithm provided in versions up to 0.4.2 is COPY_MERGE, which is available for all topic models. The new algorithm PARTITION, available since 0.5.0, generally makes training faster and more memory-efficient, but it is not available for all topic models.

The following chart shows the speed difference between the two algorithms based on the number of topics and the number of workers.
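
Besides speed, the two algorithms differ in memory use: according to the ParallelScheme descriptions on this page, COPY_MERGE consumes RAM in proportion to the number of workers, while PARTITION consumes only about twice as much RAM as a single-threaded run. The sketch below is a rough back-of-the-envelope model of that statement, not a measurement; the function relative_memory is hypothetical, and it assumes that COPY_MERGE keeps exactly one copy of the model state per worker.

```python
def relative_memory(scheme: str, workers: int) -> int:
    """Rough RAM multiplier relative to a single-threaded sampler."""
    if workers < 1:
        raise ValueError('workers must be at least 1')
    if scheme == 'COPY_MERGE':
        # Assumption: one full copy of the model state per worker
        return workers
    if scheme == 'PARTITION':
        # Stated as twice the single-threaded usage, regardless of workers
        return 2
    raise ValueError('unknown scheme: {}'.format(scheme))

for workers in (1, 2, 4, 8, 16):
    print(workers,
          relative_memory('COPY_MERGE', workers),
          relative_memory('PARTITION', workers))
```

Under these assumptions the memory advantage of PARTITION only appears with more than two workers, which is consistent with the advice to prefer it when the number of workers is large.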

Performance By Version

Performance changes by version are shown in the following graph. The time it takes to run the LDA model train with 1000 iteration was measured. (Docs: 11314, Vocab: 60382, Words: 2364724, Intel Xeon Gold 5120 @2.2GHz)

Pinning Topics Using Word Priors

Since version 0.6.0, a new method LDAModel.set_word_prior() has been available. It allows you to control the word prior for each topic. For example, we can set the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the rest of the topics with the following code. This means that the probability that the word 'church' is assigned to topic 0 is 10 times higher than the probability of it being assigned to any other topic. Therefore, most occurrences of 'church' are assigned to topic 0, so topic 0 ends up containing many words related to 'church'. This makes it possible to place a particular topic at a specific topic number.

::

import tomotopy as tp
mdl = tp.LDAModel(k=20)

# add documents into mdl

# setting word prior
mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See word_prior_example in example.py for more details.
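
The prior list passed to set_word_prior has one weight per topic, so it can be built and checked separately. The sketch below is a minimal illustration in plain Python; the helper make_word_prior is hypothetical and not part of tomotopy.

```python
def make_word_prior(k: int, pinned_topic: int,
                    high: float = 1.0, low: float = 0.1) -> list:
    """Build a per-topic prior that favours one topic for a word."""
    if not 0 <= pinned_topic < k:
        raise ValueError('pinned_topic must be in range [0, k)')
    return [high if t == pinned_topic else low for t in range(k)]

prior = make_word_prior(20, pinned_topic=0)
# Ratio between the prior weight in the pinned topic and in any other topic
ratio = prior[0] / prior[1]
print(len(prior), ratio)
```

The resulting list has the same contents as the list comprehension in the example above and could be passed as the second argument of mdl.set_word_prior('church', prior).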

Examples

You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License

tomotopy is licensed under the terms of MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.12.3 (2022-07-19)

    • New features
      • Now, inserting an empty document using LDAModel.add_doc() just ignores it instead of raising an exception. If the newly added argument ignore_empty_words is set to False, an exception is raised as before.
      • HDPModel.purge_dead_topics() method is added to remove non-live topics from the model.
    • Bug fixes
      • Fixed an issue that prevents setting user defined values for nuSq in SLDAModel (by @jucendrero).
      • Fixed an issue where tomotopy.utils.Coherence did not work for DTModel.
      • Fixed an issue that often crashed when calling make_dic() before calling train().
      • Resolved the problem that the results of DMRModel and GDMRModel are different even when the seed is fixed.
      • The parameter optimization process of DMRModel and GDMRModel has been improved.
      • Fixed an issue that sometimes crashed when calling LDAModel.copy().
  • 0.12.2 (2021-09-06)

    • An issue where calling convert_to_lda of HDPModel with min_cf > 0, min_df > 0 or rm_top > 0 causes a crash has been fixed.
    • A new argument from_pseudo_doc is added to Document.get_topics() and Document.get_topic_dist(). This argument is only valid for documents of PTModel; it controls the source used for computing the topic distribution.
    • A default value for argument p of PTModel has been changed. The new default value is k * 10.
    • Using documents generated by make_doc without calling infer doesn't cause a crash anymore, but just prints warning messages.
    • An issue where the internal C++ code could not be compiled in a clang C++17 environment has been fixed.
  • 0.12.1 (2021-06-20)

  • 0.12.0 (2021-04-26)

    • Now DMRModel and GDMRModel support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )
    • The performance of GDMRModel was improved.
    • A copy() method has been added for all topic models to do a deep copy.
    • An issue was fixed where words that are excluded from training (by min_cf, min_df) have incorrect topic id. Now all excluded words have -1 as topic id.
    • Now all exceptions and warnings generated by tomotopy follow standard Python types.
    • Compiler requirements have been raised to C++14.
  • 0.11.1 (2021-03-28)

    • A critical bug of asymmetric alphas was fixed. Due to this bug, version 0.11.0 has been removed from releases.
  • 0.11.0 (2021-03-26) (removed)

  • 0.10.2 (2021-02-16)

    • An issue was fixed where LDAModel.train() fails with large K.
    • An issue was fixed where Corpus loses their uid values.
  • 0.10.1 (2021-02-14)

  • 0.10.0 (2020-12-19)

    • The interfaces of Corpus and of LDAModel.docs were unified. Now you can access the documents in a corpus in the same manner.
    • getitem of Corpus was improved. Not only indexing by int, but also indexing by Iterable[int] and slicing are supported. Indexing by uid is also supported.
    • New methods Corpus.extract_ngrams() and Corpus.concat_ngrams() were added. They extract n-gram collocations using PMI and concatenate them into single words.
    • A new method LDAModel.add_corpus() was added, and LDAModel.infer() can receive corpus as input.
    • A new module tomotopy.coherence was added. It provides the way to calculate coherence of the model.
    • A parameter window_size was added to FoRelevance.
    • An issue was fixed where NaN often occurs when training HDPModel.
    • Now Python3.9 is supported.
    • A dependency to py-cpuinfo was removed and the initializing of the module was improved.
  • 0.9.1 (2020-08-08)

  • 0.9.0 (2020-08-04)

  • 0.8.2 (2020-07-14)

    • New properties DTModel.num_timepoints and DTModel.num_docs_by_timepoint have been added.
    • A bug which causes different results with the different platform even if seeds were the same was partially fixed. As a result of this fix, now tomotopy in 32 bit yields different training results from earlier version.
  • 0.8.1 (2020-06-08)

  • 0.8.0 (2020-06-06)

    • Since NumPy was introduced in tomotopy, many methods and properties of tomotopy return not just list, but numpy.ndarray now.
    • Tomotopy has a new dependency NumPy >= 1.10.0.
    • A wrong estimation of LDAModel.infer() was fixed.
    • A new method about converting HDPModel to LDAModel was added.
    • New properties including LDAModel.used_vocabs, LDAModel.used_vocab_freq and LDAModel.used_vocab_df were added into topic models.
    • A new g-DMR topic model(GDMRModel) was added.
    • An error at initializing FoRelevance in macOS was fixed.
    • An error that occurred when using Corpus created without raw parameters was fixed.
  • 0.7.1 (2020-05-08)

  • 0.7.0 (2020-04-18)

  • 0.6.2 (2020-03-28)

    • A critical bug related to save and load was fixed. Version 0.6.0 and 0.6.1 have been removed from releases.
  • 0.6.1 (2020-03-22) (removed)

    • A bug related to module loading was fixed.
  • 0.6.0 (2020-03-22) (removed)

    • Corpus class that manages multiple documents easily was added.
    • LDAModel.set_word_prior() method that controls word-topic priors of topic models was added.
    • A new argument min_df that filters words based on document frequency was added into every topic model's init.
    • tomotopy.label, the submodule about topic labeling was added. Currently, only FoRelevance is provided.
  • 0.5.2 (2020-03-01)

    • A segmentation fault problem was fixed in LLDAModel.add_doc().
    • A bug was fixed that infer of HDPModel sometimes crashes the program.
    • A crash in LDAModel.infer() with ps=tomotopy.ParallelScheme.PARTITION, together=True was fixed.
  • 0.5.1 (2020-01-11)

    • A bug was fixed that SLDAModel.make_doc() doesn't support missing values for y.
    • Now SLDAModel fully supports missing values for response variables y. Documents with missing values (NaN) are included in modeling topic, but excluded from regression of response variables.
  • 0.5.0 (2019-12-30)

    • Now PAModel.infer() returns both topic distribution and sub-topic distribution.
    • New methods get_sub_topics and get_sub_topic_dist were added into Document. (for PAModel)
    • New parameter parallel was added for LDAModel.train() and LDAModel.infer() method. You can select parallelism algorithm by changing this parameter.
    • ParallelScheme.PARTITION, a new algorithm, was added. It works efficiently when the number of workers is large, the number of topics or the size of vocabulary is big.
    • A bug where rm_top didn't work at min_cf < 2 was fixed.
  • 0.4.2 (2019-11-30)

    • Wrong topic assignments of LLDAModel and PLDAModel were fixed.
    • Readable repr of Document and tomotopy.Dictionary was implemented.
  • 0.4.1 (2019-11-27)

    • A bug at init function of PLDAModel was fixed.
  • 0.4.0 (2019-11-18)

  • 0.3.1 (2019-11-05)

    • An issue where get_topic_dist() returns incorrect value when min_cf or rm_top is set was fixed.
    • The return value of get_topic_dist() of MGLDAModel document was fixed to include local topics.
    • The estimation speed with tw=ONE was improved.
  • 0.3.0 (2019-10-06)

    • A new model, LLDAModel was added into the package.
    • A crashing issue of HDPModel was fixed.
    • Since hyperparameter estimation for HDPModel was implemented, the result of HDPModel may differ from previous versions. If you want to turn off hyperparameter estimation of HDPModel, set optim_interval to zero.
  • 0.2.0 (2019-08-18)

    • New models including CTModel and SLDAModel were added into the package.
    • A new parameter option rm_top was added for all topic models.
    • The problems in save and load method for PAModel and HPAModel were fixed.
    • An occasional crash in loading HDPModel was fixed.
    • The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.
  • 0.1.6 (2019-08-09)

    • Compiling errors at clang with macOS environment were fixed.
  • 0.1.4 (2019-08-05)

    • The issue when add_doc receives an empty list as input was fixed.
    • The issue that PAModel.get_topic_words() doesn't extract the word distribution of subtopic was fixed.
  • 0.1.3 (2019-05-19)

    • The parameter min_cf and its stopword-removing function were added for all topic models.
  • 0.1.0 (2019-05-12)

    • First version of tomotopy
Source code
"""
Python package `tomotopy` provides types and functions for various Topic Model 
including LDA, DMR, HDP, MG-LDA, PA and HPA. It is written in C++ for speed and provides Python extension.

.. include:: ./documentation.rst
"""
from tomotopy._version import __version__
from enum import IntEnum

class TermWeight(IntEnum):
    """
    This enumeration is for Term Weighting Scheme and it is based on following paper:
    
    > * Wilson, A. T., & Chew, P. A. (2010, June). Term weighting schemes for latent dirichlet allocation. In human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics (pp. 465-473). Association for Computational Linguistics.
    
    There are three options for term weighting and the basic one is ONE. The others also can be applied for all topic models in `tomotopy`. 
    """

    ONE = 0
    """ Consider every term equal (default)"""

    IDF = 1
    """ 
    Use Inverse Document Frequency term weighting.
    
    Thus, a term occurring in almost every document has very low weighting
    and a term occurring in only a few documents has high weighting. 
    """

    PMI = 2
    """
    Use Pointwise Mutual Information term weighting.
    """

class ParallelScheme(IntEnum):
    """
    This enumeration is for Parallelizing Scheme:
    There are three options for parallelizing and the basic one is DEFAULT. Not all models support all options. 
    """

    DEFAULT = 0
    """tomotopy chooses the best available parallelism scheme for your model"""

    NONE = 1
    """ 
    Turn off multi-threading for Gibbs sampling at training or inference. Operations other than Gibbs sampling may use multithreading.
    """

    COPY_MERGE = 2
    """
    Use Copy and Merge algorithm from AD-LDA. It consumes RAM in proportion to the number of workers. 
    This has advantages when you have a small number of workers and a small number of topics and vocabulary sizes in the model.
    Prior to version 0.5, all models used this algorithm by default. 
    
    > * Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.
    """

    PARTITION = 3
    """
    Use Partitioning algorithm from PCGS. It consumes only twice as much RAM as a single-threaded algorithm, regardless of the number of workers.
    This has advantages when you have a large number of workers or a large number of topics and vocabulary sizes in the model.
    
    > * Yan, F., Xu, N., & Qi, Y. (2009). Parallel inference for latent dirichlet allocation on graphics processing units. In Advances in neural information processing systems (pp. 2134-2142).
    """

isa = ''
"""
Indicate which SIMD instruction set is used for acceleration.
It can be one of `'avx2'`, `'avx'`, `'sse2'` and `'none'`.
"""

from _tomotopy import *

import tomotopy.utils as utils
import tomotopy.coherence as coherence
import tomotopy.label as label

import os
if os.environ.get('TOMOTOPY_LANG') == 'kr':
    __doc__ = """`tomotopy` 패키지는 Python에서 사용가능한 다양한 토픽 모델링 타입과 함수를 제공합니다.
내부 모듈은 c++로 작성되었기 때문에 빠른 속도를 자랑합니다.

.. include:: ./documentation.kr.rst
"""
    __pdoc__ = {}
    __pdoc__['isa'] = """현재 로드된 모듈이 어떤 SIMD 명령어 세트를 사용하는지 표시합니다. 
이 값은 `'avx2'`, `'avx'`, `'sse2'`, `'none'` 중 하나입니다."""
    __pdoc__['TermWeight'] = """용어 가중치 기법을 선택하는 데에 사용되는 열거형입니다. 여기에 제시된 용어 가중치 기법들은 다음 논문을 바탕으로 하였습니다:
    
> * Wilson, A. T., & Chew, P. A. (2010, June). Term weighting schemes for latent dirichlet allocation. In human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics (pp. 465-473). Association for Computational Linguistics.

총 3가지 가중치 기법을 사용할 수 있으며 기본값은 ONE입니다. 기본값뿐만 아니라 다른 모든 기법들도 `tomotopy`의 모든 토픽 모델에 사용할 수 있습니다. """
    __pdoc__['TermWeight.ONE'] = """모든 용어를 동일하게 간주합니다. (기본값)"""
    __pdoc__['TermWeight.IDF'] = """역문헌빈도(IDF)를 가중치로 사용합니다.

따라서 모든 문헌에 거의 골고루 등장하는 용어의 경우 낮은 가중치를 가지게 되며, 
소수의 특정 문헌에만 집중적으로 등장하는 용어의 경우 높은 가중치를 가지게 됩니다."""
    __pdoc__['TermWeight.PMI'] = """점별 상호정보량(PMI)을 가중치로 사용합니다."""
    __pdoc__['ParallelScheme'] = """병렬화 기법을 선택하는 데에 사용되는 열거형입니다. 총 3가지 기법을 사용할 수 있으나, 모든 모델이 아래의 기법을 전부 지원하지는 않습니다."""
    __pdoc__['ParallelScheme.DEFAULT'] = """tomotopy가 모델에 따라 적합한 병럴화 기법을 선택하도록 합니다. 이 값이 기본값입니다."""
    __pdoc__['ParallelScheme.NONE'] = """깁스 샘플링에 병렬화 기법을 사용하지 않습니다. 깁스 샘플링을 제외한 다른 연산들은 여전히 병렬로 처리될 수 있습니다."""
    __pdoc__['ParallelScheme.COPY_MERGE'] = """
AD-LDA에서 제안된 복사 후 합치기 알고리즘을 사용합니다. 이는 작업자 수에 비례해 메모리를 소모합니다. 
작업자 수가 적거나, 토픽 개수 혹은 어휘 집합의 크기가 작을 때 유리합니다.
0.5버전 이전까지는 모든 모델은 이 알고리즘을 기본으로 사용했습니다.
    
> * Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.
"""
    __pdoc__['ParallelScheme.PARTITION'] =     """
PCGS에서 제안된 분할 샘플링 알고리즘을 사용합니다. 작업자 수에 관계없이 단일 스레드 알고리즘에 비해 2배의 메모리만 소모합니다.
작업자 수가 많거나, 토픽 개수 혹은 어휘 집합의 크기가 클 때 유리합니다.
    
> * Yan, F., Xu, N., & Qi, Y. (2009). Parallel inference for latent dirichlet allocation on graphics processing units. In Advances in neural information processing systems (pp. 2134-2142).
"""
del IntEnum, os

Sub-modules

tomotopy.coherence

Added in version: 0.10.0 …

tomotopy.label

Submodule tomotopy.label provides automatic topic labeling techniques. You can label topics automatically with simple code like below. The results …

tomotopy.utils

Submodule tomotopy.utils provides various utilities for topic modeling. Corpus class helps manage multiple documents easily. The …

Global variables

var isa

Indicate which SIMD instruction set is used for acceleration. It can be one of 'avx2', 'avx', 'sse2' and 'none'.

Classes

class CTModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, smoothing_alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

Added in version: 0.2.0

This type provides Correlated Topic Model (CTM) and its implementation is based on following papers:

  • Blei, D., & Lafferty, J. (2006). Correlated topic models. Advances in neural information processing systems, 18, 147.
  • Mimno, D., Wallach, H., & McCallum, A. (2008, December). Gibbs sampling for logistic normal topic models with graph-based priors. In NIPS Workshop on Analyzing Graphs (Vol. 61).

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int
the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int
the number of topics, between 1 and 32767
smoothing_alpha : Union[float, Iterable[float]]
small smoothing value that prevents topic counts from being zero, given as a single float in the symmetric case and as a list of float with length k in the asymmetric case.
eta : float
hyperparameter of Dirichlet distribution for topic-word
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var alpha

This property is not available in CTModel. Use CTModel.prior_mean and CTModel.prior_cov instead.

Added in version: 0.9.1

var num_beta_sample

the number of times to sample beta parameters, default value is 10.

CTModel samples num_beta_sample beta parameters for each document. The more beta parameters it samples, the more accurate the distribution will be, but the longer it takes to learn. If you have a small number of documents in your model, keeping this value larger will help you get a better result.

var num_tmn_sample

the number of iterations for sampling Truncated Multivariate Normal distribution, default value is 5.

If your model shows biased topic correlations, increasing this value may be helpful.

var prior_cov

the covariance matrix of the prior logistic-normal distribution for the topic distribution (read-only)

var prior_mean

the mean of prior logistic-normal distribution for the topic distribution (read-only)

Methods

def get_correlations(self, topic_id=None)

Return correlations between the topic topic_id and other topics. The returned value is a list of floats of size LDAModel.k.

Parameters

topic_id : Union[int, None]

an integer in range [0, k), indicating the topic

If omitted, the whole correlation matrix is returned.

Inherited members

class DMRModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, sigma=1.0, alpha_epsilon=1e-10, seed=None, corpus=None, transform=None)

This type provides Dirichlet Multinomial Regression(DMR) topic model and its implementation is based on following papers:

  • Mimno, D., & McCallum, A. (2012). Topic models conditioned on arbitrary features with dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

k : int
the number of topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
an initial value of exponential of mean of normal distribution for lambdas, given as a single float in case of symmetric prior and as a list with length k of float in case of asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic - word
sigma : float
standard deviation of normal distribution for lambdas
alpha_epsilon : float
small smoothing value for preventing exp(lambdas) to be near zero
seed : int
random seed. default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Subclasses

Instance variables

var alpha

Dirichlet prior on the per-document topic distributions for each metadata in the shape [k, f]. Equivalent to np.exp(DMRModel.lambdas) (read-only)

Added in version: 0.9.0

Warning

Prior to version 0.11.0, there was a bug in the lambda getter, so it yielded the wrong value. It is recommended to upgrade to version 0.11.0 or later.

var alpha_epsilon

the smoothing value alpha-epsilon (read-only)

var f

the number of metadata features (read-only)

var lambda_

parameter lambdas in the shape [k, len(metadata_dict), l] where k is the number of topics and l is the size of vector for multi_metadata (read-only)

See DMRModel.get_topic_prior() for the relation between the lambda parameter and the topic prior.

Added in version: 0.12.0

var lambdas

parameter lambdas in the shape [k, f] (read-only)

Warning

Prior to version 0.11.0, there was a bug in the lambda getter, so it yielded the wrong value. It is recommended to upgrade to version 0.11.0 or later.

var metadata_dict

a dictionary of metadata in type tomotopy.Dictionary (read-only)

var multi_metadata_dict

a dictionary of metadata in type tomotopy.Dictionary (read-only)

Added in version: 0.12.0

This dictionary is distinct from metadata_dict.

var sigma

the hyperparameter sigma (read-only)

Methods

def add_doc(self, words, metadata='', multi_metadata=[], ignore_empty_words=True)

Add a new document into the model instance with metadata and return an index of the inserted document.

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

words : Iterable[str]
an iterable of str
metadata : str
metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
metadata of the document (for multiple values)
ignore_empty_words : bool
If True, an empty words argument doesn't raise an exception and makes the method return None.
def get_topic_prior(self, metadata='', multi_metadata=[], raw=False)

Added in version: 0.12.0

Calculate the topic prior of any document with the given metadata and multi_metadata. If raw is true, the value without applying exp() is returned, otherwise, the value with applying exp() is returned.

The topic prior is calculated as follows:

np.dot(lambda_[:, idx(metadata)], np.concat([[1], multi_hot(multi_metadata)]))

where idx(metadata) and multi_hot(multi_metadata) indicate the integer id of the given metadata and the multi-hot encoded binary vector for the given multi_metadata, respectively.
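The formula above can be spelled out with plain Python lists. This is only an illustrative sketch, not the library's implementation: `topic_prior` is a hypothetical function, the metadata are passed as integer ids rather than strings, and it follows only the stated formula and the description of `raw`.

```python
import math

def topic_prior(lambda_, metadata_idx, multi_metadata_ids, raw=False):
    """Evaluate the documented formula with nested lists (illustration only).

    lambda_ is nested as [k][num_metadata][l], where l is 1 plus the number
    of distinct multi_metadata values.
    """
    l = len(lambda_[0][metadata_idx])
    # Feature vector: a constant 1 followed by the multi-hot encoding.
    features = [1.0] + [0.0] * (l - 1)
    for i in multi_metadata_ids:
        features[1 + i] = 1.0
    prior = []
    for per_topic in lambda_:
        weights = per_topic[metadata_idx]
        value = sum(w * x for w, x in zip(weights, features))
        # raw=True keeps the value as is; otherwise exp() is applied.
        prior.append(value if raw else math.exp(value))
    return prior
```

For two topics, one metadata value and two multi_metadata values, `topic_prior([[[0.0, 1.0, 2.0]], [[1.0, 0.0, -1.0]]], 0, [1], raw=True)` sums the constant term and the weight of the second multi_metadata value for each topic.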

Parameters

metadata : str
metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
metadata of the document (for multiple values)
raw : bool
If raw is true, the raw value of parameters without applying exp() is returned.
def make_doc(self, words, metadata='', multi_metadata=[])

Return a new Document instance for an unseen document with words and metadata that can be used for LDAModel.infer() method.

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

words : Iterable[str]
an iterable of str
metadata : str
metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
metadata of the document (for multiple values)

Inherited members

class DTModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, t=1, alpha_var=0.1, eta_var=0.1, phi_var=0.1, lr_a=0.01, lr_b=0.1, lr_c=0.55, seed=None, corpus=None, transform=None)

This type provides Dynamic Topic model and its implementation is based on following papers:

  • Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113-120).
  • Bhadury, A., Chen, J., Zhu, J., & Liu, S. (2016, April). Scaling up dynamic topic models. In Proceedings of the 25th International Conference on World Wide Web (pp. 381-390). https://github.com/Arnie0426/FastDTM

Added in version: 0.7.0

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int
minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded
rm_top : int
the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int
the number of topics between 1 ~ 32767
t : int
the number of timepoints
alpha_var : float
transition variance of alpha (per-document topic distribution)
eta_var : float
variance of eta (topic distribution of each document) from its alpha
phi_var : float
transition variance of phi (word distribution of each topic)
lr_a : float
shape parameter a greater than zero, for SGLD step size calculated as e_i = a * (b + i) ^ (-c)
lr_b : float
shape parameter b greater than or equal to zero, for SGLD step size calculated as e_i = a * (b + i) ^ (-c)
lr_c : float
shape parameter c with range (0.5, 1], for SGLD step size calculated as e_i = a * (b + i) ^ (-c)
seed : int
random seed. default value is a random number from std::random_device{} in C++
corpus : Corpus
a list of documents to be added into the model
transform : Callable[dict, dict]
a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var alpha

per-document topic distribution in the shape [num_timepoints, k] (read-only)

Added in version: 0.9.0

var eta

This property is not available in DTModel. Use DTModel.docs[x].eta instead.

Added in version: 0.9.0

var lr_a

parameter a greater than zero for SGLD step size (e_i = a * (b + i) ^ -c)

var lr_b

parameter b greater than or equal to zero for SGLD step size (e_i = a * (b + i) ^ -c)

var lr_c

parameter c with range (0.5, 1] for SGLD step size (e_i = a * (b + i) ^ -c)
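To get a feel for how lr_a, lr_b and lr_c shape the SGLD step size, the schedule e_i = a * (b + i) ^ (-c) can be evaluated directly. The function below is a hypothetical helper written for this illustration; whether the library counts iterations from 0 or 1 is not stated here, so the index i is an assumption.

```python
def sgld_step_size(i, lr_a=0.01, lr_b=0.1, lr_c=0.55):
    """Step size e_i = a * (b + i) ** (-c) for iteration i."""
    if lr_a <= 0 or lr_b < 0 or not (0.5 < lr_c <= 1):
        raise ValueError("expected lr_a > 0, lr_b >= 0 and 0.5 < lr_c <= 1")
    return lr_a * (lr_b + i) ** (-lr_c)

# The step size shrinks as training proceeds.
steps = [sgld_step_size(i) for i in (1, 10, 100, 1000)]
```

A larger lr_c makes the step size decay faster, while lr_b mainly dampens the very first steps.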

var num_docs_by_timepoint

the number of documents in the model by timepoint (read-only)

var num_timepoints

the number of timepoints of the model (read-only)

Methods

def add_doc(self, words, timepoint=0, ignore_empty_words=True)

Add a new document into the model instance with timepoint and return an index of the inserted document.

Parameters

words : Iterable[str]
an iterable of str
timepoint : int
an integer with range [0, t)
ignore_empty_words : bool
If True, an empty words argument doesn't raise an exception and makes the method return None.
def get_alpha(self, timepoint)

Return a list of alpha parameters for timepoint.

Parameters

timepoint : int
an integer with range [0, t)
def get_count_by_topics(self)

Return the number of words allocated to each timepoint and topic in the shape [num_timepoints, k].

Added in version: 0.9.0

def get_phi(self, timepoint, topic_id)

Return a list of phi parameters for timepoint and topic_id.

Parameters

timepoint : int
an integer with range [0, t)
topic_id : int
an integer with range [0, k)
def get_topic_word_dist(self, topic_id, timepoint, normalize=True)

Return the word distribution of the topic topic_id with timepoint. The returned value is a list that has len(vocabs) fraction numbers indicating probabilities for each word in the current topic.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
timepoint : int
an integer in range [0, t), indicating the timepoint
normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.

def get_topic_words(self, topic_id, timepoint, top_n=10)

Return the top_n words and their probabilities in the topic topic_id with timepoint. The return type is a list of (word:str, probability:float).

Parameters

topic_id : int
an integer in range [0, k), indicating the topic
timepoint : int
an integer in range [0, t), indicating the timepoint
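Since every topic has a separate word distribution per timepoint, a common pattern is to collect the top words of one topic across all timepoints. The sketch below assumes only the `num_timepoints` property and the `get_topic_words` signature documented here; `topic_words_over_time` itself is a hypothetical helper.

```python
def topic_words_over_time(model, topic_id, top_n=5):
    """Collect the top words of one topic at every timepoint (hypothetical helper)."""
    timeline = []
    for timepoint in range(model.num_timepoints):
        # Each entry is a (word, probability) pair.
        pairs = model.get_topic_words(topic_id, timepoint=timepoint, top_n=top_n)
        timeline.append([word for word, _ in pairs])
    return timeline
```

Given a trained DTModel `mdl`, `topic_words_over_time(mdl, 0)` would return one word list per timepoint, which makes shifts in vocabulary easy to compare.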
def make_doc(self, words, timepoint=0)

Return a new Document instance for an unseen document with words and timepoint that can be used for LDAModel.infer() method.

Parameters

words : Iterable[str]
an iterable of str
timepoint : int
an integer with range [0, t)

Inherited members

class Document

This type provides an abstract model for accessing the documents used in a Topic Model.

An instance of this type can be acquired from LDAModel.make_doc() method or LDAModel.docs member of each Topic Model instance.

Instance variables

var beta

a list of beta parameters for each topic (for only CTModel model, read-only)

Added in version: 0.2.0

var eta

a list of eta parameters(topic distribution) for the current document (for only DTModel model, read-only)

Added in version: 0.7.0

var labels

a list of (label, list of probabilities of each topic belonging to the label) of the document (for only LLDAModel and PLDAModel models, read-only)

Added in version: 0.3.0

var metadata

categorical metadata of the document (for only DMRModel and GDMRModel model, read-only)

var multi_metadata

categorical multiple metadata of the document (for only DMRModel and GDMRModel model, read-only)

Added in version: 0.12.0

var numeric_metadata

continuous numeric metadata of the document (for only GDMRModel model, read-only)

Added in version: 0.11.0

var path

a list of topic ids by depth for a given document (for only HLDAModel model, read-only)

Added in version: 0.7.1

var pseudo_doc_id

id of a pseudo document where the document is allocated to (for only PTModel model, read-only)

Added in version: 0.11.0

var raw

a raw text of the document (read-only)

var span

a span (tuple of a start position and an end position in bytes) for each word token in the document (read-only)

var subtopics

a list of sub topics for each word (for only PAModel and HPAModel model, read-only)

var timepoint

a timepoint of the document (for only DTModel model, read-only)

Added in version: 0.7.0

var topics

a list of topics for each word (read-only)

This represents super topics in PAModel and HPAModel model.

var uid

a unique id of the document (read-only)

var vars

a list of response variables (for only SLDAModel model, read-only)

Added in version: 0.2.0

var weight

a weight of the document (read-only)

var windows

a list of window IDs for each word (for only MGLDAModel model, read-only)

var words

a list of IDs for each word (read-only)

Methods

def get_count_vector(self)

Added in version: 0.7.0

Return a count vector for the current document.

def get_ll(self)

Added in version: 0.10.0

Return total log-likelihood for the current document.

def get_sub_topic_dist(self, normalize=True)

Added in version: 0.5.0

Return a distribution of the sub topics in the document. (for only PAModel)

Parameters

normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.

def get_sub_topics(self, top_n=10)

Added in version: 0.5.0

Return the top_n sub topics of the document with their probabilities. (for only PAModel)

def get_topic_dist(self, normalize=True, from_pseudo_doc=False)

Return a distribution of the topics in the document.

Parameters

normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.

from_pseudo_doc : bool

Added in version: 0.12.2

If True, it returns the topic distribution of its pseudo document. Only valid for PTModel.

def get_topics(self, top_n=10, from_pseudo_doc=False)

Return the top_n topics of the document with their probabilities.

Parameters

top_n : int
the n in "top-n"
from_pseudo_doc : bool

Added in version: 0.12.2

If True, it returns the topic distribution of its pseudo document. Only valid for PTModel.

def get_words(self, top_n=10)

Added in version: 0.4.2

Return the top_n words of the document with their probabilities.

class GDMRModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, degrees=[], alpha=0.1, eta=0.01, sigma=1.0, sigma0=3.0, decay=0, alpha_epsilon=1e-10, metadata_range=None, seed=None, corpus=None, transform=None)

This type provides Generalized DMR(g-DMR) topic model and its implementation is based on following papers:

  • Lee, M., & Song, M. Incorporating citation impact into analysis of research trends. Scientometrics, 1-34.

Added in version: 0.8.0

Warning

Until version 0.10.2, metadata was used to represent numeric data and there was no argument for categorical data. Since version 0.11.0, the name of the previous metadata argument is changed to numeric_metadata, and metadata is added to represent categorical data for unification with the DMRModel. So metadata arguments in the older codes should be replaced with numeric_metadata to work in version 0.11.0.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int
minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded
rm_top : int
the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int
the number of topics between 1 ~ 32767
degrees : Iterable[int]

a list of the degrees of Legendre polynomials for TDF(Topic Distribution Function). Its length should be equal to the number of metadata variables.

Its default value is [] in which case the model doesn't use any metadata variable and as a result, it becomes the same as the LDA or DMR model.

alpha : Union[float, Iterable[float]]
exponential of mean of normal distribution for lambdas, given as a single float in case of symmetric prior and as a list with length k of float in case of asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic - word
sigma : float
standard deviation of normal distribution for non-constant terms of lambdas
sigma0 : float
standard deviation of normal distribution for constant terms of lambdas
decay : float

Added in version: 0.11.0

decay's exponent that causes the coefficient of the higher-order term of lambdas to become smaller

alpha_epsilon : float
small smoothing value for preventing exp(lambdas) to be near zero
metadata_range : Iterable[Iterable[float]]

a list of minimum and maximum value of each numeric metadata variable. Its length should be equal to the length of degrees.

For example, metadata_range = [(2000, 2017), (0, 1)] means that the first variable has a range from 2000 to 2017 and the second one has a range from 0 to 1. Its default value is None, in which case the range of each variable is obtained from the input documents.

seed : int
random seed. default value is a random number from std::random_device{} in C++
corpus : Corpus
a list of documents to be added into the model
transform : Callable[dict, dict]
a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var decay

the hyperparameter decay (read-only)

var degrees

the degrees of Legendre polynomials (read-only)

var metadata_range

the ranges of each metadata variable (read-only)
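When metadata_range is left as None in the constructor, the ranges are obtained from the input documents. A minimal sketch of that idea, assuming it amounts to taking the minimum and maximum of each numeric variable (the library's exact procedure is not shown in this document, and `infer_metadata_range` is a hypothetical name):

```python
def infer_metadata_range(numeric_metadata_rows):
    """Return [(min, max), ...], one pair per numeric metadata variable."""
    if not numeric_metadata_rows:
        raise ValueError("at least one document is required")
    # Transpose rows (one per document) into columns (one per variable).
    columns = zip(*numeric_metadata_rows)
    return [(min(col), max(col)) for col in columns]
```

Passing an explicit metadata_range instead is useful when later documents may fall outside the range seen in the training data.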

var sigma0

the hyperparameter sigma0 (read-only)

Methods

def add_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[], ignore_empty_words=True)

Add a new document into the model instance with metadata and return an index of the inserted document.

Changed in version: 0.11.0

Until version 0.10.2, metadata was used to represent numeric data and there was no argument for categorical data. Since version 0.11.0, the name of the previous metadata argument is changed to numeric_metadata, and metadata is added to represent categorical data for unification with the DMRModel.

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

words : Iterable[str]
an iterable of str
numeric_metadata : Iterable[float]
continuous numeric metadata variable of the document. Its length should be equal to the length of degrees.
metadata : str
categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]
metadata of the document (for multiple values)
ignore_empty_words : bool
If True, an empty words argument doesn't raise an exception and makes the method return None.
def make_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[])

Return a new Document instance for an unseen document with words and metadata that can be used for LDAModel.infer() method.

Changed in version: 0.11.0

Until version 0.10.2, metadata was used to represent numeric data and there was no argument for categorical data. Since version 0.11.0, the name of the previous metadata argument is changed to numeric_metadata, and metadata is added to represent categorical data for unification with the DMRModel.

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

words : Iterable[str]
an iterable of str
numeric_metadata : Iterable[float]
continuous numeric metadata variable of the document. Its length should be equal to the length of degrees.
metadata : str
categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]
metadata of the document (for multiple values)
def tdf(self, numeric_metadata, metadata='', multi_metadata=[], normalize=True)

Calculate a topic distribution for given numeric_metadata value. It returns a list with length k.

Changed in version: 0.11.0

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

numeric_metadata : Iterable[float]
continuous metadata variable whose length should be equal to the length of degrees.
metadata : str
categorical metadata variable
multi_metadata : Iterable[str]
categorical metadata variables (for multiple values)
normalize : bool
If true, the method returns probabilities for each topic in range [0, 1]. Otherwise, it returns raw values in logit.
def tdf_linspace(self, numeric_metadata_start, numeric_metadata_stop, num, metadata='', multi_metadata=[], endpoint=True, normalize=True)

Calculate topic distributions for numeric metadata values evenly spaced between numeric_metadata_start and numeric_metadata_stop. It returns an array with shape [*num, k].

Changed in version: 0.11.0

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

numeric_metadata_start : Iterable[float]
the starting value of each continuous metadata variable whose length should be equal to the length of degrees.
numeric_metadata_stop : Iterable[float]
the end value of each continuous metadata variable whose length should be equal to the length of degrees.
num : Iterable[int]
the number of samples to generate for each metadata variable. Must be non-negative. Its length should be equal to the length of degrees.
metadata : str
categorical metadata variable
multi_metadata : Iterable[str]
categorical metadata variables (for multiple values)
endpoint : bool
If True, numeric_metadata_stop is the last sample. Otherwise, it is not included. Default is True.
normalize : bool
If true, the method returns probabilities for each topic in range [0, 1]. Otherwise, it returns raw values in logit.

Returns

samples : ndarray
with shape [*num, k].
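To make the shape [*num, k] concrete, the sketch below builds the grid of numeric metadata points that such a call would cover: one evenly spaced axis per variable, combined into every possible tuple. It is an illustration in plain Python with hypothetical helper names, not the library's internal code.

```python
from itertools import product

def linspace(start, stop, num, endpoint=True):
    """Evenly spaced values from start to stop, similar to numpy.linspace."""
    if num <= 0:
        return []
    if num == 1:
        return [float(start)]
    divisor = num - 1 if endpoint else num
    step = (stop - start) / divisor
    return [start + step * i for i in range(num)]

def metadata_grid(starts, stops, nums, endpoint=True):
    """All combinations of per-variable sample points, in row-major order."""
    axes = [linspace(a, b, n, endpoint) for a, b, n in zip(starts, stops, nums)]
    return list(product(*axes))
```

For example, `metadata_grid([2000, 0], [2010, 1], [3, 2])` yields 3 * 2 = 6 points, so the corresponding result for a model with k topics would have shape [3, 2, k].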

Inherited members

class HDPModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, initial_k=2, alpha=0.1, eta=0.01, gamma=0.1, seed=None, corpus=None, transform=None)

This type provides Hierarchical Dirichlet Process(HDP) topic model and its implementation is based on following papers:

  • Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in neural information processing systems (pp. 1385-1392).
  • Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.

Changed in version: 0.3.0

Since version 0.3.0, hyperparameter estimation for alpha and gamma has been added. You can turn off this estimation by setting optim_interval to zero.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

initial_k : int
the initial number of topics between 2 ~ 32767. The number of topics will be adjusted for data during training.
Since version 0.3.0, the default value has been changed to 2 from 1.
alpha : float
concentration coefficient of Dirichlet Process for document-table
eta : float
hyperparameter of Dirichlet distribution for topic-word
gamma : float
concentration coefficient of Dirichlet Process for table-topic
seed : int
random seed. default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var gamma

the hyperparameter gamma (read-only)

var live_k

the number of alive topics (read-only)

var num_tables

the number of total tables (read-only)

Methods

def convert_to_lda(self, topic_threshold=0.0)

Added in version: 0.8.0

Convert the current HDP model to equivalent LDA model and return (new_lda_model, new_topic_id). Topics with proportion less than topic_threshold are removed in new_lda_model.

new_topic_id is an array of length HDPModel.k and new_topic_id[i] indicates a topic id of new LDA model, equivalent to topic i of original HDP model. If topic i of original HDP model is not alive or is removed in LDA model, new_topic_id[i] would be -1.

Parameters

topic_threshold : float
Topics with a proportion less than this value are removed in the new LDA model. The default value is 0, which means that no topics are removed except those that are not alive.
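The returned new_topic_id array is what lets you carry topic ids over from the HDP model to the converted LDA model. The helper below is a hypothetical sketch of that lookup; it treats -1 as "no counterpart", as described above.

```python
def remap_topics(old_topic_ids, new_topic_id):
    """Translate HDP topic ids into ids of the converted model.

    Ids mapped to -1 (dead or removed topics) become None.
    """
    remapped = []
    for old in old_topic_ids:
        new = new_topic_id[old]
        remapped.append(None if new == -1 else int(new))
    return remapped
```

With `new_topic_id = [0, -1, 1, -1, 2]`, the old topics 0, 2 and 4 survive as topics 0, 1 and 2, while topics 1 and 3 have no counterpart in the new model.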
def is_live_topic(self, topic_id)

Return True if the topic topic_id is valid, otherwise return False.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
def purge_dead_topics(self)

Added in version: 0.12.3

Purge all non-alive topics from the model and return new_topic_ids. After this call, HDPModel.k shrinks to HDPModel.live_k and all topics of the model become live.

new_topic_ids is an array of length HDPModel.k and new_topic_ids[i] indicates a topic id of the new model, equivalent to topic i of the previous HDP model. If topic i of the previous HDP model is not alive or is removed in the new model, new_topic_ids[i] would be -1.

Inherited members

class HLDAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, depth=2, alpha=0.1, eta=0.01, gamma=0.1, seed=None, corpus=None, transform=None)

This type provides Hierarchical LDA topic model and its implementation is based on following papers:

  • Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in neural information processing systems (pp. 17-24).

Added in version: 0.4.0

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int
the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.
depth : int
the maximum depth level of hierarchy between 2 ~ 32767
alpha : Union[float, Iterable[float]]
hyperparameter of Dirichlet distribution for document-depth level, given as a single float in case of symmetric prior and as a list with length depth of float in case of asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic-word
gamma : float
concentration coefficient of Dirichlet Process
seed : int
random seed. default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var depth

the depth of the hierarchy (read-only)

var gamma

the hyperparameter gamma (read-only)

var live_k

the number of alive topics (read-only)

Methods

def children_topics(self, topic_id)

Return a list of the topic IDs of the children of a topic topic_id.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
def is_live_topic(self, topic_id)

Return True if the topic topic_id is alive, otherwise return False.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
def level(self, topic_id)

Return the level of a topic topic_id.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
def num_docs_of_topic(self, topic_id)

Return the number of documents belonging to a topic topic_id.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
def parent_topic(self, topic_id)

Return the topic ID of parent of a topic topic_id.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
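The methods above are enough to traverse the topic hierarchy. The generator below is a sketch that visits live topics depth first; `walk_topic_tree` is a hypothetical helper, and starting from topic 0 as the root is an assumption that this document does not state.

```python
def walk_topic_tree(model, topic_id=0):
    """Yield (level, topic_id, num_docs) for live topics, depth first."""
    if not model.is_live_topic(topic_id):
        return
    yield model.level(topic_id), topic_id, model.num_docs_of_topic(topic_id)
    for child in model.children_topics(topic_id):
        # Recurse into each child topic.
        yield from walk_topic_tree(model, child)
```

Each yielded level can be used, for instance, as an indentation width when printing the tree of a trained HLDAModel.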

Inherited members

class HPAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k1=1, k2=1, alpha=0.1, subalpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

This type provides Hierarchical Pachinko Allocation(HPA) topic model and its implementation is based on following papers:

  • Mimno, D., Li, W., & McCallum, A. (2007, June). Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th international conference on Machine learning (pp. 633-640). ACM.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

k1 : int
the number of super topics between 1 ~ 32767
k2 : int
the number of sub topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
initial hyperparameter of Dirichlet distribution for document-topic, given as a single float in case of symmetric prior and as a list with length k1 + 1 of float in case of asymmetric prior.
subalpha : Union[float, Iterable[float]]

Added in version: 0.11.0

initial hyperparameter of Dirichlet distribution for super-sub topic, given as a single float in case of symmetric prior and as a list with length k2 + 1 of float in case of asymmetric prior.

eta : float
hyperparameter of Dirichlet distribution for topic-word
seed : int
random seed. default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var alpha

Dirichlet prior on the per-document super topic distributions in shape [k1 + 1]. Its element 0 indicates the prior to the top topic and elements 1 ~ k1 indicate ones to the super topics. (read-only)

Added in version: 0.9.0

var subalpha

Dirichlet prior on the sub topic distributions for each super topic in shape [k1, k2 + 1]. Its [x, 0] element indicates the prior to the super topic x and [x, 1 ~ k2] elements indicate ones to the sub topics in the super topic x. (read-only)

Added in version: 0.9.0

Methods

def get_topic_word_dist(self, topic_id, normalize=True)

Return the word distribution of the topic topic_id. The returned value is a list that has len(vocabs) fraction numbers indicating probabilities for each word in current topic.

Parameters

topic_id : int
0 indicates the top topic, a number in range [1, 1 + k1) indicates a super topic and a number in range [1 + k1, 1 + k1 + k2) indicates a sub topic.
normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.

def get_topic_words(self, topic_id, top_n=10)

Return the top_n words and their probabilities in the topic topic_id. The return type is a list of (word:str, probability:float).

Parameters

topic_id : int
0 indicates the top topic, a number in range [1, 1 + k1) indicates a super topic and a number in range [1 + k1, 1 + k1 + k2) indicates a sub topic.
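Because HPAModel addresses the top topic, the super topics and the sub topics through a single flat topic_id, it can help to decode that id explicitly. The function below is a hypothetical helper that follows the ranges stated above.

```python
def describe_hpa_topic(topic_id, k1, k2):
    """Map a flat HPA topic_id to (kind, index within that kind)."""
    if topic_id == 0:
        return ("top", 0)
    if 1 <= topic_id < 1 + k1:
        return ("super", topic_id - 1)
    if 1 + k1 <= topic_id < 1 + k1 + k2:
        return ("sub", topic_id - 1 - k1)
    raise ValueError("topic_id out of range")
```

With k1 = 2 and k2 = 3, ids 1 and 2 are the super topics and ids 3 to 5 are the sub topics.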
def infer(self, doc, iter=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None)

Return the inferred topic distribution from unseen docs.

Parameters

doc : Union[Document, Iterable[Document], Corpus]

an instance of Document or a list of instances of Document to be inferred by the model. It can be acquired from LDAModel.make_doc() method.

Changed in version: 0.10.0

Since version 0.10.0, infer can receive a raw Corpus instance. In this case, you don't need to call make_doc. infer generates documents bound to the model, estimates their topic distributions and returns a corpus that contains the generated documents as the result.

iter : int
an integer indicating the number of iterations used to estimate the topic distribution of doc. A higher value generally gives a more accurate result.
tolerance : float
isn't currently used.
workers : int
an integer indicating the number of workers to perform samplings. If workers is 0, the number of cores in the system will be used.
parallel : Union[int, ParallelScheme]

Added in version: 0.5.0

the parallelism scheme for inference. The default value is ParallelScheme.DEFAULT, which means that tomotopy selects the best scheme for the model.

together : bool
If True, all docs are inferred together in one process; otherwise each doc is inferred independently. The default value is False.
transform : Callable[dict, dict]

Added in version: 0.10.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model. Available when doc is given as an instance of Corpus.

Returns

result : Union[List[float], List[List[float]], Corpus]

If doc is given as a single Document, result is a single List[float] indicating its topic distribution.

If doc is given as a list of Documents, result is a list of List[float] indicating topic distributions for each document.

If doc is given as an instance of Corpus, result is another instance of Corpus which contains the inferred documents. You can get the topic distribution for each document using Document.get_topic_dist().

log_ll : List[float]
a list of log-likelihoods, one for each doc

Inherited members

class LDAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

This type provides Latent Dirichlet Allocation(LDA) topic model and its implementation is based on the following papers:

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
  • Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

k : int
the number of topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
hyperparameter of Dirichlet distribution for document-topic, given as a single float in case of symmetric prior and as a list with length k of float in case of asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic-word
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model
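
The alpha parameter above accepts either a single float (symmetric prior) or a list of k floats (asymmetric prior). The following minimal sketch shows one way the two forms can be brought to the same shape. The helper expand_alpha is hypothetical and only illustrates the documented convention; it is not how tomotopy necessarily handles the value internally.

```python
def expand_alpha(alpha, k):
    """Return alpha as a list of k floats.

    Hypothetical helper for illustration; not a tomotopy function.
    A single number is treated as a symmetric prior and repeated k times,
    an iterable is treated as an asymmetric prior and must have length k.
    """
    if isinstance(alpha, (int, float)):
        return [float(alpha)] * k
    values = [float(a) for a in alpha]
    if len(values) != k:
        raise ValueError('asymmetric alpha must have exactly k elements')
    return values


print(expand_alpha(0.1, 3))               # symmetric prior over 3 topics
print(expand_alpha([0.5, 0.3, 0.2], 3))   # asymmetric prior over 3 topics
```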

Subclasses

Static methods

def load(filename)

Return the model instance loaded from file filename.

def loads(data)

Return the model instance loaded from data in bytes-like object.

Instance variables

var alpha

Dirichlet prior on the per-document topic distributions (read-only)

var burn_in

get or set the burn-in iterations for optimizing parameters

Its default value is 0.

var docs

a list-like interface of Document in the model instance (read-only)

var eta

the hyperparameter eta (read-only)

var global_step

the total number of iterations of training (read-only)

Added in version: 0.9.0

var k

K, the number of topics (read-only)

var ll_per_word

a log likelihood per-word of the model (read-only)

var num_vocabs

the number of vocabularies after words with a smaller frequency were removed (read-only)

This value is 0 before train is called.

Deprecated since version: 0.8.0

Due to the confusion of its name, this property will be removed. Please use len(used_vocabs) instead.

var num_words

the number of total words (read-only)

This value is 0 before train is called.

var optim_interval

get or set the interval for optimizing parameters

Its default value is 10. If it is set to 0, the parameter optimization is turned off.

var perplexity

a perplexity of the model (read-only)
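
The source does not state how perplexity relates to ll_per_word. Under the conventional definition, which is assumed here, perplexity is the exponential of the negative per-word log likelihood. The function name below is hypothetical and is not part of tomotopy.

```python
import math


def perplexity_from_ll_per_word(ll_per_word):
    """Conventional perplexity: exp of the negative per-word log likelihood.

    Assumption: this mirrors the usual textbook definition; the section
    above does not confirm the exact formula used by the library.
    """
    return math.exp(-ll_per_word)


# A less negative log likelihood per word gives a lower perplexity.
print(perplexity_from_ll_per_word(-8.0))
print(perplexity_from_ll_per_word(-7.5))
```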

var removed_top_words

a list of str containing the words removed from the model when rm_top is set greater than 0 at model initialization (read-only)

var tw

the term weighting scheme (read-only)

var used_vocab_df

a list of vocabulary document-frequencies which contains only vocabularies actually used in modeling (read-only)

Added in version: 0.8.0

var used_vocab_freq

a list of vocabulary frequencies which contains only vocabularies actually used in modeling (read-only)

Added in version: 0.8.0

var used_vocab_weighted_freq

a list of term-weighted vocabulary frequencies which contains only vocabularies actually used in modeling (read-only)

Added in version: 0.12.1

var used_vocabs

a dictionary, which contains only the vocabularies actually used in modeling, as the type tomotopy.Dictionary (read-only)

Added in version: 0.8.0

var vocab_df

a list of vocabulary document-frequencies which contains both vocabularies filtered by frequency and vocabularies actually used in modeling (read-only)

Added in version: 0.8.0

var vocab_freq

a list of vocabulary frequencies which contains both vocabularies filtered by frequency and vocabularies actually used in modeling (read-only)

var vocabs

a dictionary, which contains both vocabularies filtered by frequency and vocabularies actually used in modeling, as the type tomotopy.Dictionary (read-only)

Methods

def add_corpus(self, corpus, transform=None)

Added in version: 0.10.0

Add new documents into the model instance using Corpus and return an instance of corpus that contains the inserted documents. This method should be called before calling the LDAModel.train().

Parameters

corpus : Corpus
corpus that contains documents to be added
transform : Callable[dict, dict]
a callable object to manipulate arbitrary keyword arguments for a specific topic model
def add_doc(self, words, ignore_empty_words=True)

Add a new document into the model instance and return an index of the inserted document. This method should be called before calling the LDAModel.train().

Changed in version: 0.12.3

A new argument ignore_empty_words was added.

Parameters

words : Iterable[str]
an iterable of str
ignore_empty_words : bool
If True, an empty words argument does not raise an exception and makes the method return None instead.
def copy(self)

Added in version: 0.12.0

Return a new deep-copied instance of the current instance

def get_count_by_topics(self)

Return the number of words allocated to each topic.

def get_topic_word_dist(self, topic_id, normalize=True)

Return the word distribution of the topic topic_id. The returned value is a list of len(vocabs) floats giving the probability of each word in the current topic.

Parameters

topic_id : int
an integer in range [0, k) indicating the topic
normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
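
The effect of normalize can be pictured with plain lists. The sketch below assumes that the raw values are non-negative weights, which the source does not spell out, and shows how they relate to a distribution that sums to 1. normalize_weights is a hypothetical helper, not a tomotopy function.

```python
def normalize_weights(raw):
    """Scale non-negative raw weights so that they sum to 1.

    Hypothetical helper for illustration; the exact raw values returned
    with normalize=False are an assumption here.
    """
    total = sum(raw)
    if total <= 0:
        raise ValueError('raw weights must have a positive sum')
    return [value / total for value in raw]


raw = [2.0, 1.0, 1.0]             # stand-in for a normalize=False result
dist = normalize_weights(raw)     # stand-in for a normalize=True result
print(dist, sum(dist))
```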

def get_topic_words(self, topic_id, top_n=10)

Return the top_n words and their probabilities in the topic topic_id. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int
an integer in range [0, k), indicating the topic
def get_word_prior(self, word)

Added in version: 0.6.0

Return word-topic prior for word. If there is no set prior for word, an empty list is returned.

Parameters

word : str
a word
def infer(self, doc, iter=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None)

Return the inferred topic distribution from unseen docs.

Parameters

doc : Union[Document, Iterable[Document], Corpus]

an instance of Document or a list of instances of Document to be inferred by the model. It can be acquired from LDAModel.make_doc() method.

Changed in version: 0.10.0

Since version 0.10.0, infer can receive a raw Corpus instance. In this case, you don't need to call make_doc. infer generates documents bound to the model, estimates their topic distributions and returns a corpus that contains the generated documents as the result.

iter : int
an integer indicating the number of iterations used to estimate the topic distribution of doc. A higher value generally gives a more accurate result.
tolerance : float
isn't currently used.
workers : int
an integer indicating the number of workers to perform samplings. If workers is 0, the number of cores in the system will be used.
parallel : Union[int, ParallelScheme]

Added in version: 0.5.0

the parallelism scheme for inference. The default value is ParallelScheme.DEFAULT, which means that tomotopy selects the best scheme for the model.

together : bool
If True, all docs are inferred together in one process; otherwise each doc is inferred independently. The default value is False.
transform : Callable[dict, dict]

Added in version: 0.10.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model. Available when doc is given as an instance of Corpus.

Returns

result : Union[List[float], List[List[float]], Corpus]

If doc is given as a single Document, result is a single List[float] indicating its topic distribution.

If doc is given as a list of Documents, result is a list of List[float] indicating topic distributions for each document.

If doc is given as an instance of Corpus, result is another instance of Corpus which contains the inferred documents. You can get the topic distribution for each document using Document.get_topic_dist().

log_ll : List[float]
a list of log-likelihoods, one for each doc
def make_doc(self, words)

Return a new Document instance for an unseen document with words that can be used for LDAModel.infer() method.

Parameters

words : Iterable[str]
an iterable of str
def save(self, filename, full=True)

Save the model instance to file filename. Return None.

If full is True, the model is saved with all of its documents and state. Use the full model if you want to continue training later. If False, only the topic parameters of the model are saved; such a model can be used only for inference on unseen documents.

Added in version: 0.6.0

Since version 0.6.0, the model file format has been changed. Thus model files saved in version 0.6.0 or later are not compatible with versions prior to 0.5.2.

def saves(self, full=True)

Added in version: 0.11.0

Serialize the model instance into bytes object and return it. The arguments work the same as LDAModel.save().

def set_word_prior(self, word, prior)

Added in version: 0.6.0

Set word-topic prior. This method should be called before calling the LDAModel.train().

Parameters

word : str
a word to be set
prior : Iterable[float]
topic distribution of word whose length is equal to LDAModel.k
def summary(self, initial_hp=True, params=True, topic_word_top_n=5, file=None, flush=False)

Added in version: 0.9.0

Print a human-readable description of the current model.

Parameters

initial_hp : bool
whether to show the initial parameters at model creation
params : bool
whether to show the current parameters of the model
topic_word_top_n : int
the number of words by topic to display
file
a file-like object (stream), default is sys.stdout
flush : bool
whether to forcibly flush the stream
def train(self, iter=10, workers=0, parallel=0, freeze_topics=False)

Train the model using Gibbs-sampling with iter iterations. Return None. After calling this method, you can no longer call LDAModel.add_doc() or LDAModel.set_word_prior().

Parameters

iter : int
the number of iterations of Gibbs-sampling
workers : int
an integer indicating the number of workers to perform samplings. If workers is 0, the number of cores in the system will be used.
parallel : Union[int, ParallelScheme]

Added in version: 0.5.0

the parallelism scheme for training. The default value is ParallelScheme.DEFAULT, which means that tomotopy selects the best scheme for the model.

freeze_topics : bool

Added in version: 0.10.1

prevents new topics from being created during training. Only valid for HLDAModel
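
The parallel argument is typed as Union[int, ParallelScheme], and its default 0 corresponds to ParallelScheme.DEFAULT. Because ParallelScheme is an IntEnum, an integer and the matching member are interchangeable. The stand-in enum below mirrors only the members visible in this reference (the real enumeration may define more), and resolve_scheme is a hypothetical helper.

```python
from enum import IntEnum


class ParallelSchemeSketch(IntEnum):
    """Stand-in for tomotopy.ParallelScheme, for illustration only."""
    DEFAULT = 0
    NONE = 1
    COPY_MERGE = 2


def resolve_scheme(parallel):
    """Accept either an int or an enum member and return the enum member."""
    return ParallelSchemeSketch(parallel)


print(resolve_scheme(0).name)                                # DEFAULT
print(resolve_scheme(ParallelSchemeSketch.COPY_MERGE).name)  # COPY_MERGE
print(ParallelSchemeSketch.NONE == 1)                        # True
```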

class LLDAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

This type provides Labeled LDA(L-LDA) topic model and its implementation is based on the following paper:

  • Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 (pp. 248-256). Association for Computational Linguistics.

Added in version: 0.3.0

Deprecated since version: 0.11.0

Use PLDAModel instead.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int
the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int
the number of topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
hyperparameter of Dirichlet distribution for document-topic, given as a single float in case of symmetric prior and as a list with length k of float in case of asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic-word
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var topic_label_dict

a dictionary of topic labels in type tomotopy.Dictionary (read-only)

Methods

def add_doc(self, words, labels=[], ignore_empty_words=True)

Add a new document into the model instance with labels and return an index of the inserted document.

Parameters

words : Iterable[str]
an iterable of str
labels : Iterable[str]
labels of the document
ignore_empty_words : bool
If True, an empty words argument does not raise an exception and makes the method return None instead.
def get_topic_words(self, topic_id, top_n=10)

Return the top_n words and their probabilities in the topic topic_id. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int
Integers in the range [0, l), where l is the total number of labels, represent a topic that belongs to the corresponding label. The label name can be found by looking up LLDAModel.topic_label_dict. Integers in the range [l, k) represent a latent topic which doesn't belong to any label.
def make_doc(self, words, labels=[])

Return a new Document instance for an unseen document with words and labels that can be used for LDAModel.infer() method.

Parameters

words : Iterable[str]
an iterable of str
labels : Iterable[str]
labels of the document

Inherited members

class MGLDAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k_g=1, k_l=1, t=3, alpha_g=0.1, alpha_l=0.1, alpha_mg=0.1, alpha_ml=0.1, eta_g=0.01, eta_l=0.01, gamma=0.1, seed=None, corpus=None, transform=None)

This type provides Multi Grain Latent Dirichlet Allocation(MG-LDA) topic model and its implementation is based on the following paper:

  • Titov, I., & McDonald, R. (2008, April). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web (pp. 111-120). ACM.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

k_g : int
the number of global topics between 1 ~ 32767
k_l : int
the number of local topics between 1 ~ 32767
t : int
the size of sentence window
alpha_g : float
hyperparameter of Dirichlet distribution for document-global topic
alpha_l : float
hyperparameter of Dirichlet distribution for document-local topic
alpha_mg : float
hyperparameter of Dirichlet distribution for global-local selection (global coef)
alpha_ml : float
hyperparameter of Dirichlet distribution for global-local selection (local coef)
eta_g : float
hyperparameter of Dirichlet distribution for global topic-word
eta_l : float
hyperparameter of Dirichlet distribution for local topic-word
gamma : float
hyperparameter of Dirichlet distribution for sentence-window
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model
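
The vocabulary filters min_cf, min_df and rm_top described above can be pictured with a pure-Python sketch. It assumes that the frequency filters are applied first and that rm_top then drops the most frequent of the remaining words, with ties broken alphabetically; the source does not specify that order or the tie-breaking, and filter_vocab is a hypothetical helper, not a tomotopy function.

```python
from collections import Counter


def filter_vocab(docs, min_cf=0, min_df=0, rm_top=0):
    """Return (kept_words, removed_top_words) for tokenized documents.

    Illustration only. Assumed order: frequency filters first, then
    removal of the rm_top most frequent remaining words.
    """
    # Collection frequency: total occurrences across all documents.
    cf = Counter(word for doc in docs for word in doc)
    # Document frequency: number of documents containing the word.
    df = Counter(word for doc in docs for word in set(doc))
    kept = [w for w in cf if cf[w] >= min_cf and df[w] >= min_df]
    # Most frequent first; alphabetical order breaks ties (an assumption).
    kept.sort(key=lambda w: (-cf[w], w))
    return kept[rm_top:], kept[:rm_top]


docs = [['a', 'b', 'a'], ['a', 'c'], ['b', 'a']]
print(filter_vocab(docs, min_cf=2, rm_top=1))
```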

Ancestors

Instance variables

var alpha_g

the hyperparameter alpha_g (read-only)

var alpha_l

the hyperparameter alpha_l (read-only)

var alpha_mg

the hyperparameter alpha_mg (read-only)

var alpha_ml

the hyperparameter alpha_ml (read-only)

var eta_g

the hyperparameter eta_g (read-only)

var eta_l

the hyperparameter eta_l (read-only)

var gamma

the hyperparameter gamma (read-only)

var k_g

the hyperparameter k_g (read-only)

var k_l

the hyperparameter k_l (read-only)

var t

the hyperparameter t (read-only)

Methods

def add_doc(self, words, delimiter='.', ignore_empty_words=True)

Add a new document into the model instance and return an index of the inserted document.

Parameters

words : Iterable[str]
an iterable of str
delimiter : str
a sentence separator. words will be separated by this value into sentences.
ignore_empty_words : bool
If True, an empty words argument does not raise an exception and makes the method return None instead.
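
The delimiter parameter separates words into sentences. A rough picture of that splitting is sketched below. Whether the delimiter token itself is dropped and how empty sentences are handled are assumptions made here, and split_sentences is a hypothetical helper rather than a tomotopy function.

```python
def split_sentences(words, delimiter='.'):
    """Group a flat list of tokens into sentences at each delimiter token.

    Illustration only. Assumptions: the delimiter token is discarded and
    empty sentences are skipped.
    """
    sentences, current = [], []
    for word in words:
        if word == delimiter:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(word)
    if current:
        sentences.append(current)
    return sentences


print(split_sentences(['good', 'room', '.', 'bad', 'food', '.']))
```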
def get_topic_word_dist(self, topic_id, normalize=True)

Return the word distribution of the topic topic_id. The returned value is a list of len(vocabs) floats giving the probability of each word in the current topic.

Parameters

topic_id : int
A number in range [0, k_g) indicates a global topic and a number in range [k_g, k_g + k_l) indicates a local topic.
normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.

def get_topic_words(self, topic_id, top_n=10)

Return the top_n words and their probabilities in the topic topic_id. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int
A number in range [0, k_g) indicates a global topic and a number in range [k_g, k_g + k_l) indicates a local topic.
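
The split between global and local topic ids described above can be written as a short lookup. describe_mglda_topic is a hypothetical helper that only restates the documented ranges; it is not part of tomotopy.

```python
def describe_mglda_topic(topic_id, k_g, k_l):
    """Map a flat topic_id to ('global', i) or ('local', j).

    Hypothetical helper for illustration, following the documented ranges
    [0, k_g) for global topics and [k_g, k_g + k_l) for local topics.
    """
    if 0 <= topic_id < k_g:
        return ('global', topic_id)
    if k_g <= topic_id < k_g + k_l:
        return ('local', topic_id - k_g)
    raise ValueError('topic_id must be in range [0, k_g + k_l)')


for tid in range(5):
    print(tid, describe_mglda_topic(tid, k_g=2, k_l=3))
```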
def make_doc(self, words, delimiter='.')

Return a new Document instance for an unseen document with words that can be used for LDAModel.infer() method.

Parameters

words : Iterable[str]
an iterable of str
delimiter : str
a sentence separator. words will be separated by this value into sentences.

Inherited members

class PAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k1=1, k2=1, alpha=0.1, subalpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

This type provides Pachinko Allocation(PA) topic model and its implementation is based on the following paper:

  • Li, W., & McCallum, A. (2006, June). Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning (pp. 577-584). ACM.

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.

k1 : int
the number of super topics between 1 ~ 32767
k2 : int
the number of sub topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
initial hyperparameter of Dirichlet distribution for document-super topic, given as a single float in case of symmetric prior and as a list with length k1 of float in case of asymmetric prior.
subalpha : Union[float, Iterable[float]]

Added in version: 0.11.0

initial hyperparameter of Dirichlet distribution for super-sub topic, given as a single float in case of symmetric prior and as a list with length k2 of float in case of asymmetric prior.

eta : float
hyperparameter of Dirichlet distribution for sub topic-word
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Subclasses

Instance variables

var alpha

Dirichlet prior on the per-document super topic distributions in shape [k1] (read-only)

Added in version: 0.9.0

var k1

k1, the number of super topics (read-only)

var k2

k2, the number of sub topics (read-only)

var subalpha

Dirichlet prior on the sub topic distributions for each super topic in shape [k1, k2] (read-only)

Added in version: 0.9.0

Methods

def get_count_by_super_topic(self)

Return the number of words allocated to each super-topic.

Added in version: 0.9.0

def get_sub_topic_dist(self, super_topic_id, normalize=True)

Return a distribution of the sub topics in a super topic super_topic_id. The returned value is a list of k2 floats giving the probability of each sub topic in the current super topic.

Parameters

super_topic_id : int
indicating the super topic, in range [0, k1)
def get_sub_topics(self, super_topic_id, top_n=10)

Added in version: 0.1.4

Return the top_n sub topics and their probabilities in a super topic super_topic_id. The return type is a list of (subtopic:int, probability:float) tuples.

Parameters

super_topic_id : int
indicating the super topic, in range [0, k1)
def get_topic_word_dist(self, sub_topic_id, normalize=True)

Return the word distribution of the sub topic sub_topic_id. The returned value is a list of len(vocabs) floats giving the probability of each word in the current sub topic.

Parameters

sub_topic_id : int
indicating the sub topic, in range [0, k2)
normalize : bool

Added in version: 0.11.0

If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.

def get_topic_words(self, sub_topic_id, top_n=10)

Return the top_n words and their probabilities in the sub topic sub_topic_id. The return type is a list of (word:str, probability:float) tuples.

Parameters

sub_topic_id : int
indicating the sub topic, in range [0, k2)
def infer(self, doc, iter=100, tolerance=-1, workers=0, parallel=0, together=False)

Added in version: 0.5.0

Return the inferred topic distribution and sub-topic distribution from unseen docs.

Parameters

doc : Union[Document, Iterable[Document], Corpus]

an instance of Document or a list of instances of Document to be inferred by the model. It can be acquired from LDAModel.make_doc() method.

Changed in version: 0.10.0

Since version 0.10.0, infer can receive a raw Corpus instance. In this case, you don't need to call make_doc. infer generates documents bound to the model, estimates their topic distributions and returns a corpus that contains the generated documents as the result.

iter : int
an integer indicating the number of iterations used to estimate the topic distribution of doc. A higher value generally gives a more accurate result.
tolerance : float
isn't currently used.
workers : int
an integer indicating the number of workers to perform samplings. If workers is 0, the number of cores in the system will be used.
parallel : Union[int, ParallelScheme]

Added in version: 0.5.0

the parallelism scheme for inference. The default value is ParallelScheme.DEFAULT, which means that tomotopy selects the best scheme for the model.

together : bool
If True, all docs are inferred together in one process; otherwise each doc is inferred independently. The default value is False.
transform : Callable[dict, dict]

Added in version: 0.10.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model. Available when doc is given as an instance of Corpus.

Returns

result : Union[Tuple[List[float], List[float]], List[Tuple[List[float], List[float]]], Corpus]

If doc is given as a single Document, result is a tuple of List[float] indicating its topic distribution and List[float] indicating its sub-topic distribution.

If doc is given as a list of Documents, result is a list of such tuples, one (topic distribution, sub-topic distribution) pair for each document.

If doc is given as an instance of Corpus, result is another instance of Corpus which contains the inferred documents. You can get the topic distribution for each document using Document.get_topic_dist() and the sub-topic distribution using Document.get_sub_topic_dist().

log_ll : List[float]
a list of log-likelihoods, one for each doc

Inherited members

class PLDAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, latent_topics=0, topics_per_label=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

This type provides Partially Labeled LDA(PLDA) topic model and its implementation is based on the following paper:

  • Ramage, D., Manning, C. D., & Dumais, S. (2011, August). Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-465). ACM.

Added in version: 0.4.0

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded

rm_top : int
the number of top words to be removed. If you want to remove too common words from model, you can set this value to 1 or more. The default value is 0, which means no top words are removed.
latent_topics : int
the number of latent topics, which are shared to all documents, between 1 ~ 32767
topics_per_label : int
the number of topics per label between 1 ~ 32767
alpha : Union[float, Iterable[float]]
hyperparameter of Dirichlet distribution for document-topic, given as a single float in case of symmetric prior and as a list with length k of float in case of asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic-word
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var latent_topics

the number of latent topics (read-only)

var topic_label_dict

a dictionary of topic labels in type tomotopy.Dictionary (read-only)

var topics_per_label

the number of topics per label (read-only)

Methods

def add_doc(self, words, labels=[], ignore_empty_words=True)

Add a new document into the model instance with labels and return an index of the inserted document.

Parameters

words : Iterable[str]
an iterable of str
labels : Iterable[str]
labels of the document
ignore_empty_words : bool
If True, an empty words argument does not raise an exception and makes the method return None instead.
def get_topic_words(self, topic_id, top_n=10)

Return the top_n words and their probabilities in the topic topic_id. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int
Integers in the range [0, l * topics_per_label), where l is the total number of labels, represent a topic that belongs to the corresponding label. The label name can be found by looking up PLDAModel.topic_label_dict. Integers in the range [l * topics_per_label, l * topics_per_label + latent_topics) represent a latent topic which doesn't belong to any label.
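
The ranges above can be turned into a lookup from topic_id to a label. The sketch assumes that the topics of one label are contiguous, so that the label index is topic_id // topics_per_label; the source gives only the overall ranges, so that mapping is an assumption. The labels list stands in for PLDAModel.topic_label_dict and describe_plda_topic is a hypothetical helper.

```python
def describe_plda_topic(topic_id, labels, topics_per_label, latent_topics):
    """Map a flat topic_id to (kind, label, index).

    Illustration only. Assumption: topics of the same label are stored
    contiguously, so label index = topic_id // topics_per_label.
    """
    labeled = len(labels) * topics_per_label
    if 0 <= topic_id < labeled:
        label = labels[topic_id // topics_per_label]
        return ('label', label, topic_id % topics_per_label)
    if labeled <= topic_id < labeled + latent_topics:
        return ('latent', None, topic_id - labeled)
    raise ValueError('topic_id is out of range')


labels = ['sports', 'politics']   # stand-in for topic_label_dict
for tid in range(5):
    print(tid, describe_plda_topic(tid, labels, topics_per_label=2, latent_topics=1))
```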
def make_doc(self, words, labels=[])

Return a new Document instance for an unseen document with words and labels that can be used for LDAModel.infer() method.

Parameters

words : Iterable[str]
an iterable of str
labels : Iterable[str]
labels of the document

Inherited members

class PTModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, p=None, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

Added in version: 0.11.0

This type provides Pseudo-document based Topic Model (PTM) and its implementation is based on the following paper:

  • Zuo, Y., Wu, J., Zhang, H., Lin, H., Wang, F., Xu, K., & Xiong, H. (2016, August). Topic modeling of short texts: A pseudo-document view. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2105-2114).

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int
minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int
the number of top words to be removed. To remove overly common words from the model, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int
the number of topics, in the range 1 to 32767
p : int
the number of pseudo documents

Changed in version: 0.12.2

The default value is changed to 10 * k.
alpha : Union[float, Iterable[float]]
hyperparameter of the Dirichlet distribution for document-topic, given as a single float for a symmetric prior or as a list of k floats for an asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic-word
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus
a list of documents to be added into the model
transform : Callable[dict, dict]
a callable object to manipulate arbitrary keyword arguments for a specific topic model

Ancestors

Instance variables

var p

the number of pseudo documents (read-only)

Added in version: 0.11.0

Inherited members

class ParallelScheme (value, names=None, *, module=None, qualname=None, type=None, start=1)

This enumeration selects the parallelizing scheme. It defines four values, and the basic one is DEFAULT. Not all models support all options.

class ParallelScheme(IntEnum):
    """
    This enumeration selects the parallelizing scheme.
    It defines four values, and the basic one is DEFAULT. Not all models support all options.
    """

    DEFAULT = 0
    """tomotopy chooses the best available parallelism scheme for your model"""

    NONE = 1
    """ 
    Turn off multithreading for Gibbs sampling during training or inference. Operations other than Gibbs sampling may still use multithreading.
    """

    COPY_MERGE = 2
    """
    Use the Copy and Merge algorithm from AD-LDA. It consumes RAM in proportion to the number of workers.
    It is advantageous when the number of workers is small and the model has few topics and a small vocabulary.
    Prior to version 0.5, all models used this algorithm by default.
    
    > * Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.
    """

    PARTITION = 3
    """
    Use the Partitioning algorithm from PCGS. It consumes only twice as much RAM as a single-threaded algorithm, regardless of the number of workers.
    It is advantageous when the number of workers is large or the model has many topics and a large vocabulary.
    
    > * Yan, F., Xu, N., & Qi, Y. (2009). Parallel inference for latent dirichlet allocation on graphics processing units. In Advances in neural information processing systems (pp. 2134-2142).
    """

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.Enum

Class variables

var COPY_MERGE

Use the Copy and Merge algorithm from AD-LDA. It consumes RAM in proportion to the number of workers. It is advantageous when the number of workers is small and the model has few topics and a small vocabulary. Prior to version 0.5, all models used this algorithm by default.

  • Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.
var DEFAULT

tomotopy chooses the best available parallelism scheme for your model

var NONE

Turn off multithreading for Gibbs sampling during training or inference. Operations other than Gibbs sampling may still use multithreading.

var PARTITION

Use the Partitioning algorithm from PCGS. It consumes only twice as much RAM as a single-threaded algorithm, regardless of the number of workers. It is advantageous when the number of workers is large or the model has many topics and a large vocabulary.

  • Yan, F., Xu, N., & Qi, Y. (2009). Parallel inference for latent dirichlet allocation on graphics processing units. In Advances in neural information processing systems (pp. 2134-2142).
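The memory trade-off between COPY_MERGE and PARTITION described above can be illustrated with a rough sketch. The enumeration is mirrored locally so that the snippet runs without tomotopy, and relative_ram is a hypothetical helper. Treating "in proportion to the number of workers" as a factor of exactly workers is an assumption made only for illustration.

```python
from enum import IntEnum


class ParallelScheme(IntEnum):
    # Local mirror of the documented values, for illustration only.
    DEFAULT = 0
    NONE = 1
    COPY_MERGE = 2
    PARTITION = 3


def relative_ram(scheme, workers):
    """Rough RAM multiple relative to a single-threaded run (hypothetical)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if scheme == ParallelScheme.NONE:
        return 1
    if scheme == ParallelScheme.COPY_MERGE:
        # Assumption: proportionality constant of 1 per worker.
        return workers
    if scheme == ParallelScheme.PARTITION:
        # Documented as about twice the single-threaded footprint.
        return 2
    raise ValueError("DEFAULT is resolved by the library; no estimate here")


for workers in (2, 8, 32):
    print(workers,
          relative_ram(ParallelScheme.COPY_MERGE, workers),
          relative_ram(ParallelScheme.PARTITION, workers))
```

Under these assumptions, PARTITION becomes the cheaper choice in memory once more than two workers are used.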
class SLDAModel (tw=TermWeight.ONE, min_cf=0, min_df=0, rm_top=0, k=1, vars='', alpha=0.1, eta=0.01, mu=[], nu_sq=[], glm_param=[], seed=None, corpus=None, transform=None)

This type provides the supervised Latent Dirichlet Allocation (sLDA) topic model. Its implementation is based on the following papers:

  • McAuliffe, J. D., & Blei, D. M. (2008). Supervised topic models. In Advances in neural information processing systems (pp. 121-128).
  • A Python implementation using Gibbs sampling: https://github.com/Savvysherpa/slda

Added in version: 0.2.0

Parameters

tw : Union[int, TermWeight]
term weighting scheme in TermWeight. The default value is TermWeight.ONE
min_cf : int
minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded.

rm_top : int
the number of top words to be removed. To remove overly common words from the model, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int
the number of topics, in the range 1 to 32767
vars : Iterable[str]

indicates the types of response variables. The length of vars determines the number of response variables, and each element of vars determines the type of the corresponding variable. The available types are:

  • 'l': linear variable (any real value)
  • 'b': binary variable (0 or 1)
alpha : Union[float, Iterable[float]]
hyperparameter of the Dirichlet distribution for document-topic, given as a single float for a symmetric prior or as a list of k floats for an asymmetric prior.
eta : float
hyperparameter of Dirichlet distribution for topic-word
mu : Union[float, Iterable[float]]
mean of regression coefficients. The default value is 0.
nu_sq : Union[float, Iterable[float]]
variance of regression coefficients. The default value is 1.
glm_param : Union[float, Iterable[float]]
the parameter for the Generalized Linear Model. The default value is 1.
seed : int
random seed. The default value is a random number from std::random_device{} in C++
corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model
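As a minimal sketch of what a transform callable might look like, the function below turns a document's keyword arguments into the y argument that sLDA expects. It assumes the callable receives the keyword arguments as a dict and returns a dict, as the Callable[dict, dict] annotation suggests. The key 'rating' and the function name rating_to_y are hypothetical.

```python
import math


def rating_to_y(misc):
    """Map a document's keyword arguments to sLDA keyword arguments.

    'rating' is a hypothetical key used only for this illustration.
    """
    rating = misc.get("rating")
    # A missing rating becomes NaN, the documented marker for a missing value.
    y = math.nan if rating is None else float(rating)
    return {"y": [y]}


print(rating_to_y({"rating": 4}))
print(rating_to_y({"author": "unknown"}))
```

Such a callable would be passed as transform=rating_to_y together with corpus when constructing the model.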

Ancestors

Instance variables

var f

the number of response variables (read-only)
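The relationship between vars and f can be sketched in plain Python. The helper describe_vars and the table VAR_TYPES are hypothetical names used only for illustration. The sketch shows that each element of vars names one response variable, so f equals the length of vars.

```python
# Type codes documented for the vars parameter.
VAR_TYPES = {"l": "linear", "b": "binary"}


def describe_vars(var_spec):
    """Return the readable type of each response variable in var_spec."""
    unknown = [v for v in var_spec if v not in VAR_TYPES]
    if unknown:
        raise ValueError("unknown variable types: {}".format(unknown))
    return [VAR_TYPES[v] for v in var_spec]


types = describe_vars("lb")   # a str is itself an iterable of str
f = len(types)                # number of response variables
print(f, types)
```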

Methods

def add_doc(self, words, y=[], ignore_empty_words=True)

Add a new document into the model instance with response variables y and return an index of the inserted document.

Parameters

words : Iterable[str]
an iterable of str
y : Iterable[float]

response variables of this document. The length of y must be equal to the number of response variables of the model (SLDAModel.f).

Changed in version: 0.5.1

If you have a missing value, you can set the item as NaN. Documents with NaN variables are included in modeling topics, but excluded from regression.

ignore_empty_words : bool
If True, an empty words argument doesn't raise an exception; the method returns None instead.
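The NaN rule for y can be illustrated with a small sketch that separates documents into those that would contribute to the regression and those that would not. It assumes a single response variable (f = 1), and split_for_regression is a hypothetical helper, not a tomotopy function.

```python
import math


def split_for_regression(docs):
    """Split (words, y) pairs by whether the single response value is present."""
    kept, skipped = [], []
    for words, y in docs:
        # Assumption for this sketch: one response variable per document.
        target = skipped if math.isnan(y[0]) else kept
        target.append((words, y))
    return kept, skipped


docs = [
    (["good", "movie"], [4.5]),
    (["no", "rating", "given"], [math.nan]),
    (["bad", "movie"], [1.0]),
]
kept, skipped = split_for_regression(docs)
print(len(docs), "documents contribute to topic modeling")
print(len(kept), "documents contribute to the regression")
```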
def estimate(self, doc)

Return the estimated response variable for doc. If doc is an unseen document instance generated by the SLDAModel.make_doc() method, it must first be inferred with the LDAModel.infer() method.

Parameters

doc : Document
an instance of document or a list of them to be used for estimating response variables
def get_regression_coef(self, var_id=None)

Return the regression coefficient of the response variable var_id.

Parameters

var_id : int

indicates the response variable, in the range [0, f)

If omitted, the full regression coefficient matrix with shape [f, k] is returned.

def get_var_type(self, var_id)

Return the type of the response variable var_id. 'l' means linear variable, 'b' means binary variable.
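To show how regression coefficients of shape [f, k], variable types, and a document's topic distribution fit together, here is a sketch of the textbook sLDA prediction rule: a dot product for linear variables, passed through a logistic function for binary ones. This is an assumption about the general technique, not a transcription of tomotopy's internals, and sketch_estimate is a hypothetical name.

```python
import math


def sketch_estimate(coef, var_types, topic_dist):
    """Estimate each response variable from a topic distribution.

    coef has shape [f, k]; var_types has length f; topic_dist has length k.
    """
    estimates = []
    for row, var_type in zip(coef, var_types):
        score = sum(c * t for c, t in zip(row, topic_dist))
        if var_type == "l":
            estimates.append(score)  # linear variable: identity link
        elif var_type == "b":
            # binary variable: logistic link gives a probability
            estimates.append(1.0 / (1.0 + math.exp(-score)))
        else:
            raise ValueError("unknown variable type: {!r}".format(var_type))
    return estimates


coef = [[1.0, -1.0],   # coefficients of a linear variable over k = 2 topics
        [0.0, 0.0]]    # coefficients of a binary variable
print(sketch_estimate(coef, "lb", [0.75, 0.25]))
```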

def make_doc(self, words, y=[])

Return a new Document instance for an unseen document with words and response variables y, which can be passed to the LDAModel.infer() method.

Parameters

words : Iterable[str]
an iterable of str
y : Iterable[float]
response variables of this document. The length of y doesn't have to be equal to the number of response variables of the model (SLDAModel.f). If the length of y is shorter than SLDAModel.f, missing values are automatically filled with NaN.
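The padding behavior described for y can be mimicked in plain Python. pad_response is a hypothetical helper. Rejecting a y that is longer than f is an assumption of this sketch, since the text only covers the shorter case.

```python
import math


def pad_response(y, f):
    """Pad y with NaN up to f response variables."""
    y = list(y)
    if len(y) > f:
        # Assumption: the documented behavior only covers len(y) <= f.
        raise ValueError("y has more values than response variables")
    return y + [math.nan] * (f - len(y))


print(pad_response([1.5], 3))
print(pad_response([], 2))
```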

Inherited members

class TermWeight (value, names=None, *, module=None, qualname=None, type=None, start=1)

This enumeration is for the term weighting scheme and is based on the following paper:

  • Wilson, A. T., & Chew, P. A. (2010, June). Term weighting schemes for latent dirichlet allocation. In human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics (pp. 465-473). Association for Computational Linguistics.

There are three options for term weighting, and the basic one is ONE. The others can also be applied to all topic models in tomotopy.

class TermWeight(IntEnum):
    """
    This enumeration is for the term weighting scheme and is based on the following paper:
    
    > * Wilson, A. T., & Chew, P. A. (2010, June). Term weighting schemes for latent dirichlet allocation. In human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics (pp. 465-473). Association for Computational Linguistics.
    
    There are three options for term weighting, and the basic one is ONE. The others can also be applied to all topic models in `tomotopy`.
    """

    ONE = 0
    """ Consider every term equal (default)"""

    IDF = 1
    """ 
    Use Inverse Document Frequency term weighting.
    
    Thus, a term occurring in almost every document has a very low weight
    and a term occurring in only a few documents has a high weight.
    """

    PMI = 2
    """
    Use Pointwise Mutual Information term weighting.
    """

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.Enum

Class variables

var IDF

Use Inverse Document Frequency term weighting.

Thus, a term occurring in almost every document has a very low weight and a term occurring in only a few documents has a high weight.
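The effect of IDF weighting can be seen in a small sketch. It assumes the common log(N / df) definition, where N is the number of documents and df is the number of documents containing the term. The exact formula tomotopy applies is not given here, and idf_weights is a hypothetical helper.

```python
import math


def idf_weights(docs):
    """Compute log(N / df) for every term in a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for word in set(doc):  # count each term once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


docs = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
weights = idf_weights(docs)
# 'the' occurs in every document, so its weight is the lowest possible.
print(weights["the"], weights["cat"])
```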

var ONE

Consider every term equal (default)

var PMI

Use Pointwise Mutual Information term weighting.