Module `tomotopy.models`

The tomotopy.models submodule provides a variety of topic model classes. All models are based on LDAModel, which implements basic Latent Dirichlet Allocation. The derived models are DMR, GDMR, HDP, MGLDA, PA, HPA, CT, SLDA, LLDA, PLDA, HLDA, DT, and PT.

Classes

class CTModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, smoothing_alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

class CTModel(_CTModel, LDAModel):
    '''.. versionadded:: 0.2.0
This type provides Correlated Topic Model (CTM) and its implementation is based on the following papers:
        
> * Blei, D., & Lafferty, J. (2006). Correlated topic models. Advances in neural information processing systems, 18, 147.
> * Mimno, D., Wallach, H., & McCallum, A. (2008, December). Gibbs sampling for logistic normal topic models with graph-based priors. In NIPS Workshop on Analyzing Graphs (Vol. 61).'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, smoothing_alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
smoothing_alpha : Union[float, Iterable[float]]
    small smoothing value that prevents topic counts from being zero, given as a single `float` for the symmetric case and as a list of `float` with length `k` for the asymmetric case.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
seed : int
    random seed. The default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            smoothing_alpha,
            eta,
            seed,
            corpus,
            transform,
        )

    def get_correlations(self, topic_id=None) -> List[float]:
        '''Return correlations between the topic `topic_id` and other topics.
The returned value is a `list` of `float`s of size `tomotopy.models.LDAModel.k`.

Parameters
----------
topic_id : Union[int, None]
    an integer in range [0, `k`), indicating the topic
    
    If omitted, the whole correlation matrix is returned.
'''
        return self._get_correlations(topic_id)
    
    @property
    def num_beta_samples(self) -> int:
        '''the number of times to sample beta parameters, default value is 10.

CTModel samples `num_beta_samples` beta parameters for each document. 
The more beta it samples, the more accurate the distribution will be, but the more time it takes to learn. 
If you have a small number of documents in your model, keeping this value larger will help you get a better result.
'''
        return self._num_beta_samples
    
    @num_beta_samples.setter
    def num_beta_samples(self, value: int):
        self._num_beta_samples = value
    
    @property
    def num_tmn_samples(self) -> int:
        '''the number of iterations for sampling Truncated Multivariate Normal distribution, default value is 5.

If your model shows biased topic correlations, increasing this value may be helpful.'''
        return self._num_tmn_samples
    
    @num_tmn_samples.setter
    def num_tmn_samples(self, value: int):
        self._num_tmn_samples = value

    @property
    def prior_mean(self) -> np.ndarray:
        '''the mean of prior logistic-normal distribution for the topic distribution (read-only)'''
        return self._prior_mean
    
    @property
    def prior_cov(self) -> np.ndarray:
        '''the covariance matrix of prior logistic-normal distribution for the topic distribution (read-only)'''
        return self._prior_cov
    
    @property
    def alpha(self) -> float:
        '''This property is not available in `CTModel`. Use `CTModel.prior_mean` and `CTModel.prior_cov` instead.

.. versionadded:: 0.9.1'''
        raise AttributeError("CTModel has no attribute 'alpha'. Use 'prior_mean' and 'prior_cov' instead.")
    
    def _summary_params_info(self, file):
        print('| prior_mean (Prior mean of Logit-normal for the per-document topic distributions)\n'
            '|  {}'.format(_format_numpy(self.prior_mean, '|  ')), file=file)
        print('| prior_cov (Prior covariance of Logit-normal for the per-document topic distributions)\n'
            '|  {}'.format(_format_numpy(self.prior_cov, '|  ')), file=file)
        print('| eta (Dirichlet prior on the per-topic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)

Added in version: 0.2.0

This type provides an implementation of the Correlated Topic Model (CTM). The main algorithm is based on the following papers:

Blei, D., & Lafferty, J. (2006). Correlated topic models. Advances in neural information processing systems, 18, 147.

Mimno, D., Wallach, H., & McCallum, A. (2008, December). Gibbs sampling for logistic normal topic models with graph-based priors. In NIPS Workshop on Analyzing Graphs (Vol. 61).

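Parameters

tw : Union[int, TermWeight]: term weighting scheme, given as an enumeration value of TermWeight. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

minimum document frequency of words. Words that appear in fewer than min_df documents are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: the number of most frequent words to be removed. If overly common words show up among the top results of the topic model and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int: the number of topics, an integer between 1 and 32767
smoothing_alpha : Union[float, Iterable[float]]: small smoothing value that prevents topic counts from being zero, given as a single float for the symmetric case and as a list of float with length k for the asymmetric case.
eta : float: hyperparameter of the Dirichlet distribution for topic-word
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even if this value is fixed, results may differ between runs when train is called with workers set to 2 or more, because of the nondeterminism inherent in multithreading.
corpus : Corpus: Added in version: 0.6.0

a collection of documents to be added into the topic model.
transform : Callable[dict, dict]: Added in version: 0.6.0

a callable object that manipulates arbitrary keyword arguments for a specific topic model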
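
To make the effect of min_cf, min_df, and rm_top concrete, the following pure-Python sketch applies the same three rules to a toy corpus. It is an illustration only, not tomotopy's internal code: the helper name `filter_vocab`, the order in which the rules are applied, and the tie-breaking are assumptions made for this example.

```python
from collections import Counter

def filter_vocab(docs, min_cf=0, min_df=0, rm_top=0):
    """Return the set of words that survive the three filters (hypothetical helper)."""
    # Collection frequency: total number of occurrences across all documents.
    cf = Counter(w for doc in docs for w in doc)
    # Document frequency: number of documents that contain the word.
    df = Counter(w for doc in docs for w in set(doc))
    kept = [w for w in cf if cf[w] >= min_cf and df[w] >= min_df]
    # Drop the rm_top most frequent surviving words (ties broken alphabetically here).
    kept.sort(key=lambda w: (-cf[w], w))
    return set(kept[rm_top:])

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "the", "bird"],
]
# "cat", "dog", and "bird" fail both thresholds; "the" is then removed as the top word.
vocab = filter_vocab(docs, min_cf=2, min_df=2, rm_top=1)
```

With the default values of 0 for all three arguments, every word in the toy corpus is kept, which mirrors the defaults described above.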

Ancestors

tomotopy._CTModel
LDAModel
tomotopy._LDAModel

Instance variables

prop alpha : float

    @property
    def alpha(self) -> float:
        '''This property is not available in `CTModel`. Use `CTModel.prior_mean` and `CTModel.prior_cov` instead.

.. versionadded:: 0.9.1'''
        raise AttributeError("CTModel has no attribute 'alpha'. Use 'prior_mean' and 'prior_cov' instead.")

This property is not available in CTModel. Use CTModel.prior_mean and CTModel.prior_cov instead.

Added in version: 0.9.1

prop num_beta_samples : int

    @property
    def num_beta_samples(self) -> int:
        '''the number of times to sample beta parameters, default value is 10.

CTModel samples `num_beta_samples` beta parameters for each document. 
The more beta it samples, the more accurate the distribution will be, but the more time it takes to learn. 
If you have a small number of documents in your model, keeping this value larger will help you get a better result.
'''
        return self._num_beta_samples

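the number of times to sample beta parameters. The default value is 10.

CTModel samples num_beta_samples beta parameters for each document. The more beta parameters it samples, the more accurate the overall distribution becomes, but the longer training takes. If the model contains only a small number of documents, a larger value can help produce more accurate results.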

prop num_tmn_samples : int

    @property
    def num_tmn_samples(self) -> int:
        '''the number of iterations for sampling Truncated Multivariate Normal distribution, default value is 5.

If your model shows biased topic correlations, increasing this value may be helpful.'''
        return self._num_tmn_samples

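the number of iterations for sampling from the truncated multivariate normal distribution. The default value is 5.

If the resulting topic correlations look biased, increasing this value may help reduce the bias.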

prop prior_cov : numpy.ndarray

@property
def prior_cov(self) -> np.ndarray:
    '''the covariance matrix of prior logistic-normal distribution for the topic distribution (read-only)'''
    return self._prior_cov

the covariance matrix of the prior logistic-normal distribution for the topic distribution (read-only)

prop prior_mean : numpy.ndarray

@property
def prior_mean(self) -> np.ndarray:
    '''the mean of prior logistic-normal distribution for the topic distribution (read-only)'''
    return self._prior_mean

the mean vector of the prior logistic-normal distribution for the topic distribution (read-only)
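
As a rough illustration of what a logistic-normal prior over topic distributions means, the sketch below draws a Gaussian vector and maps it onto the probability simplex with a softmax. It assumes a diagonal covariance for simplicity and uses only the standard library. The names `sample_topic_dist`, `mean`, and `variances` are invented for this example and do not correspond to tomotopy internals, whose exact parameterization is not shown in this page.

```python
import math
import random

def sample_topic_dist(mean, variances, rng):
    """Draw one topic distribution from a logistic-normal with diagonal covariance."""
    # Step 1: sample each coordinate from an independent normal distribution.
    x = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variances)]
    # Step 2: a softmax maps the real-valued vector onto the probability simplex.
    peak = max(x)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in x]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
theta = sample_topic_dist(mean=[0.0, 0.0, 0.0], variances=[1.0, 1.0, 1.0], rng=rng)
```

A full covariance matrix, as exposed by prior_cov, additionally lets pairs of topics rise and fall together, which is what distinguishes this prior from a Dirichlet.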

Methods

def get_correlations(self, topic_id=None) ‑> List[float]

    def get_correlations(self, topic_id=None) -> List[float]:
        '''Return correlations between the topic `topic_id` and other topics.
The returned value is a `list` of `float`s of size `tomotopy.models.LDAModel.k`.

Parameters
----------
topic_id : Union[int, None]
    an integer in range [0, `k`), indicating the topic
    
    If omitted, the whole correlation matrix is returned.
'''
        return self._get_correlations(topic_id)

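Return correlations between the topic topic_id and the other topics. The returned value is a list of float of length LDAModel.k.

Parameters

topic_id : Union[int, None]

an integer in range [0, k), indicating the topic

If omitted, the whole correlation matrix is returned.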
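
The listing above delegates to `_get_correlations`, so the exact computation is not visible here. A standard way to obtain correlations from a covariance matrix such as prior_cov is to divide each entry by the product of the two standard deviations; the sketch below shows that conversion on a hand-written matrix. The helper name `cov_to_corr` is hypothetical, and whether tomotopy computes its result this way is an assumption.

```python
import math

def cov_to_corr(cov):
    """Convert a square covariance matrix (list of lists) into a correlation matrix."""
    n = len(cov)
    # Standard deviation of each variable is the square root of its variance.
    std = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]

cov = [
    [4.0, 2.0],
    [2.0, 9.0],
]
corr = cov_to_corr(cov)
# One row of the matrix plays the role of the per-topic list described above.
row_for_topic_0 = corr[0]
```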

Inherited members

LDAModel:
- add_corpus
- add_doc
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- make_doc
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class DMRModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, sigma=1.0, alpha_epsilon=1e-10, seed=None, corpus=None, transform=None)

class DMRModel(_DMRModel, LDAModel):
    '''This type provides Dirichlet Multinomial Regression(DMR) topic model and its implementation is based on the following papers:

> * Mimno, D., & McCallum, A. (2012). Topic models conditioned on arbitrary features with dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, sigma=1.0, alpha_epsilon=0.0000000001, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    .. versionadded:: 0.2.0
    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
    an initial value of exponential of mean of normal distribution for `lambdas`, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic - word
sigma : float
    standard deviation of normal distribution for `lambdas`
alpha_epsilon : float
    small smoothing value that prevents `exp(lambdas)` from being near zero
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            alpha,
            eta,
            sigma,
            alpha_epsilon,
            seed,
            corpus,
            transform,
        )

    def add_doc(self, words, metadata='', multi_metadata=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `metadata` and return an index of the inserted document.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
metadata : str
    metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, metadata, multi_metadata, ignore_empty_words)
    
    def make_doc(self, words, metadata='', multi_metadata=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `metadata` that can be used for `tomotopy.models.LDAModel.infer` method.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
metadata : str
    metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
'''
        return self._make_doc(words, metadata, multi_metadata)
    
    def get_topic_prior(self, metadata='', multi_metadata=[], raw=False) -> List[float]:
        '''.. versionadded:: 0.12.0

Calculate the topic prior of any document with the given `metadata` and `multi_metadata`. 
If `raw` is true, the value without applying `exp()` is returned, otherwise, the value with applying `exp()` is returned.

The topic prior is calculated as follows:

`np.dot(lambda_[:, idx(metadata)], np.concatenate([[1], multi_hot(multi_metadata)]))`

where `idx(metadata)` and `multi_hot(multi_metadata)` indicate 
an integer id of the given `metadata` and a multi-hot encoded binary vector for the given `multi_metadata`, respectively.


Parameters
----------
metadata : str
    metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
raw : bool
    If `raw` is true, the raw value of parameters without applying `exp()` is returned.
'''
        return self._get_topic_prior(metadata, multi_metadata, raw)
    
    @property
    def f(self) -> float:
        '''the number of metadata features (read-only)'''
        return self._f
    
    @property
    def sigma(self) -> float:
        '''the hyperparameter sigma (read-only)'''
        return self._sigma
    
    @property
    def alpha_epsilon(self) -> float:
        '''the smoothing value alpha-epsilon (read-only)'''
        return self._alpha_epsilon
    
    @property
    def metadata_dict(self):
        '''a dictionary of metadata in type `tomotopy.Dictionary` (read-only)'''
        return self._metadata_dict
    
    @property
    def multi_metadata_dict(self):
        '''a dictionary of metadata in type `tomotopy.Dictionary` (read-only)

.. versionadded:: 0.12.0

    This dictionary is distinct from `metadata_dict`.'''
        return self._multi_metadata_dict
    
    @property
    def lambdas(self) -> np.ndarray:
        '''parameter lambdas in the shape `[k, f]` (read-only)

.. warning::

    Prior to version 0.11.0, there was a bug in the lambda getter, so it yielded the wrong value. It is recommended to upgrade to version 0.11.0 or later.'''
        return self._lambdas
    
    @property
    def lambda_(self) -> np.ndarray:
        '''parameter lambdas in the shape `[k, len(metadata_dict), l]` where `k` is the number of topics and `l` is the size of vector for multi_metadata (read-only)

See `tomotopy.models.DMRModel.get_topic_prior` for the relation between the lambda parameter and the topic prior.

.. versionadded:: 0.12.0
'''
        return self._lambda_
    
    @property
    def alpha(self) -> np.ndarray:
        '''Dirichlet prior on the per-document topic distributions for each metadata in the shape `[k, f]`. Equivalent to `np.exp(DMRModel.lambdas)` (read-only)

.. versionadded:: 0.9.0

.. warning::

    Prior to version 0.11.0, there was a bug in the lambda getter, so it yielded the wrong value. It is recommended to upgrade to version 0.11.0 or later.'''
        return self._alpha
    
    def _summary_basic_info(self, file):
        LDAModel._summary_basic_info(self, file)
        md_cnt = Counter(doc.metadata for doc in self.docs)
        if len(md_cnt) > 1:
            print('| Metadata of docs and its distribution', file=file)
            for md in self.metadata_dict:
                print('|  {}: {}'.format(md, md_cnt.get(md, 0)), file=file)
        md_cnt = Counter()
        for doc in self.docs:
            md_cnt.update(doc.multi_metadata)
        if len(md_cnt) > 0:
            print('| Multi-Metadata of docs and its distribution', file=file)
            for md in self.multi_metadata_dict:
                print('|  {}: {}'.format(md, md_cnt.get(md, 0)), file=file)

    def _summary_params_info(self, file):
        print('| lambda (feature vector per metadata of documents)\n'
            '|  {}'.format(_format_numpy(self.lambda_, '|  ')), file=file)
        print('| alpha (Dirichlet prior on the per-document topic distributions for each metadata)', file=file)
        for i, md in enumerate(self.metadata_dict):
            print('|  {}: {}'.format(md, _format_numpy(self.alpha[:, i], '|    ')), file=file)
        print('| eta (Dirichlet prior on the per-topic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)

This type provides an implementation of the Dirichlet Multinomial Regression (DMR) topic model. The main algorithm is based on the following paper:

Mimno, D., & McCallum, A. (2012). Topic models conditioned on arbitrary features with dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278.

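Parameters

tw : Union[int, TermWeight]: term weighting scheme, given as an enumeration value of TermWeight. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

minimum document frequency of words. Words that appear in fewer than min_df documents are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: Added in version: 0.2.0

the number of most frequent words to be removed. If overly common words show up among the top results of the topic model and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int: the number of topics, an integer between 1 and 32767.
alpha : Union[float, Iterable[float]]: initial value of the exponential of the mean of the normal distribution for lambdas, given as a single float for the symmetric case and as a list of float with length k for the asymmetric case.
eta : float: hyperparameter of the Dirichlet distribution for topic-word
sigma : float: standard deviation of the normal distribution for lambdas
alpha_epsilon : float: small smoothing value that prevents exp(lambdas) from becoming zero
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even if this value is fixed, results may differ between runs when train is called with workers set to 2 or more, because of the nondeterminism inherent in multithreading.
corpus : Corpus: Added in version: 0.6.0

a collection of documents to be added into the topic model.
transform : Callable[dict, dict]: Added in version: 0.6.0

a callable object that manipulates arbitrary keyword arguments for a specific topic model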
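
The roles of alpha, sigma, and alpha_epsilon can be sketched with the standard library alone. The example below assumes that each lambda is drawn from a normal distribution centred on log(alpha) with standard deviation sigma, and that alpha_epsilon is added after exponentiation; the second point in particular is a guess about where the smoothing is applied, and the helper names `expand_alpha` and `draw_prior` are hypothetical.

```python
import math
import random

def expand_alpha(alpha, k):
    """Accept a single float (symmetric) or a length-k iterable (asymmetric)."""
    if isinstance(alpha, (int, float)):
        return [float(alpha)] * k
    values = [float(a) for a in alpha]
    if len(values) != k:
        raise ValueError("alpha must be a float or an iterable of length k")
    return values

def draw_prior(alpha, k, sigma, alpha_epsilon, rng):
    # The normal distribution for each lambda is centred on log(alpha),
    # so that exp(mean) equals the requested alpha.
    means = [math.log(a) for a in expand_alpha(alpha, k)]
    lambdas = [rng.gauss(m, sigma) for m in means]
    # Assumption: alpha_epsilon is added after exp() to keep the prior away from zero.
    return [math.exp(x) + alpha_epsilon for x in lambdas]

rng = random.Random(42)
prior = draw_prior(alpha=0.1, k=3, sigma=1.0, alpha_epsilon=1e-10, rng=rng)
# With sigma = 0 the draw collapses to the mean, so the prior is alpha plus epsilon.
exact = draw_prior(alpha=[0.1, 0.2], k=2, sigma=0.0, alpha_epsilon=1e-10, rng=rng)
```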

Ancestors

tomotopy._DMRModel
LDAModel
tomotopy._LDAModel

Subclasses

GDMRModel

Instance variables

prop alpha : numpy.ndarray

    @property
    def alpha(self) -> np.ndarray:
        '''Dirichlet prior on the per-document topic distributions for each metadata in the shape `[k, f]`. Equivalent to `np.exp(DMRModel.lambdas)` (read-only)

.. versionadded:: 0.9.0

.. warning::

    Prior to version 0.11.0, there was a bug in the lambda getter, so it yielded the wrong value. It is recommended to upgrade to version 0.11.0 or later.'''
        return self._alpha

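Dirichlet prior on the per-document topic distributions for each metadata, in the shape [k, f]. Equivalent to np.exp(DMRModel.lambdas) (read-only)

Added in version: 0.9.0

Warning

Prior to version 0.11.0, a bug in the lambda getter caused it to return wrong values. Upgrading to version 0.11.0 or later is recommended.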

prop alpha_epsilon : float

@property
def alpha_epsilon(self) -> float:
    '''the smoothing value alpha-epsilon (read-only)'''
    return self._alpha_epsilon

the smoothing value alpha-epsilon (read-only)

prop f : float

@property
def f(self) -> float:
    '''the number of metadata features (read-only)'''
    return self._f

the number of metadata features (read-only)

prop lambda_ : numpy.ndarray

    @property
    def lambda_(self) -> np.ndarray:
        '''parameter lambdas in the shape `[k, len(metadata_dict), l]` where `k` is the number of topics and `l` is the size of vector for multi_metadata (read-only)

See `tomotopy.models.DMRModel.get_topic_prior` for the relation between the lambda parameter and the topic prior.

.. versionadded:: 0.12.0
'''
        return self._lambda_

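the lambda parameters of the current model, a float array in the shape [k, len(metadata_dict), l], where k is the number of topics and l is the size of the vector for multi_metadata (read-only)

See DMRModel.get_topic_prior() for the relation between the lambda parameter and the topic prior.

Added in version: 0.12.0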

prop lambdas : numpy.ndarray

    @property
    def lambdas(self) -> np.ndarray:
        '''parameter lambdas in the shape `[k, f]` (read-only)

.. warning::

    Prior to version 0.11.0, there was a bug in the lambda getter, so it yielded the wrong value. It is recommended to upgrade to version 0.11.0 or later.'''
        return self._lambdas

the lambda parameters of the current model, a float array in the shape [k, f] (read-only)

Warning

Prior to version 0.11.0, a bug in the lambda getter caused it to return wrong values. Upgrading to version 0.11.0 or later is recommended.

prop metadata_dict

@property
def metadata_dict(self):
    '''a dictionary of metadata in type `tomotopy.Dictionary` (read-only)'''
    return self._metadata_dict

a dictionary of metadata in type tomotopy.Dictionary (read-only)

prop multi_metadata_dict

    @property
    def multi_metadata_dict(self):
        '''a dictionary of metadata in type `tomotopy.Dictionary` (read-only)

.. versionadded:: 0.12.0

    This dictionary is distinct from `metadata_dict`.'''
        return self._multi_metadata_dict

a dictionary of metadata in type tomotopy.Dictionary (read-only)

Added in version: 0.12.0

This dictionary is distinct from metadata_dict.

prop sigma : float

@property
def sigma(self) -> float:
    '''the hyperparameter sigma (read-only)'''
    return self._sigma

the hyperparameter sigma (read-only)

Methods

def add_doc(self, words, metadata='', multi_metadata=[], ignore_empty_words=True) ‑> int | None

    def add_doc(self, words, metadata='', multi_metadata=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `metadata` and return an index of the inserted document.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
metadata : str
    metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, metadata, multi_metadata, ignore_empty_words)

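Add a new document with metadata into the model instance and return the index of the inserted document.

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
metadata : str: metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]: metadata of the document (use this when multiple values are needed)
ignore_empty_words : bool: if True, empty words does not raise an exception and the method returns None instead.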

def get_topic_prior(self, metadata='', multi_metadata=[], raw=False) ‑> List[float]

    def get_topic_prior(self, metadata='', multi_metadata=[], raw=False) -> List[float]:
        '''.. versionadded:: 0.12.0

Calculate the topic prior of any document with the given `metadata` and `multi_metadata`. 
If `raw` is true, the value without applying `exp()` is returned, otherwise, the value with applying `exp()` is returned.

The topic prior is calculated as follows:

`np.dot(lambda_[:, idx(metadata)], np.concatenate([[1], multi_hot(multi_metadata)]))`

where `idx(metadata)` and `multi_hot(multi_metadata)` indicate 
an integer id of the given `metadata` and a multi-hot encoded binary vector for the given `multi_metadata`, respectively.


Parameters
----------
metadata : str
    metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
raw : bool
    If `raw` is true, the raw value of parameters without applying `exp()` is returned.
'''
        return self._get_topic_prior(metadata, multi_metadata, raw)

Added in version: 0.12.0

Calculate the topic prior for the given metadata and multi_metadata. If raw is true, the value before applying exp() is returned; otherwise the value after applying exp() is returned.

The topic prior is calculated as follows:

np.dot(lambda_[:, idx(metadata)], np.concatenate([[1], multi_hot(multi_metadata)]))

where idx(metadata) is the integer index of the given metadata and multi_hot(multi_metadata) is the multi-hot encoding of multi_metadata, a vector of 0s and 1s.

Parameters

metadata : str: metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]: metadata of the document (use this when multiple values are needed)
raw : bool: if true, the value of the parameters without exp() applied is returned.
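
The formula for the topic prior can be traced by hand with plain Python lists. The sketch below assumes a toy `lambda_` with k = 2 topics, two metadata values, and two multi_metadata values (so l = 3: one bias slot plus two multi-hot slots). All numbers, the two id tables, and the helper names `multi_hot` and `topic_prior` are made up for illustration and are not part of tomotopy.

```python
import math

# Hypothetical id tables standing in for metadata_dict and multi_metadata_dict.
metadata_ids = {"2019": 0, "2020": 1}
multi_ids = {"sports": 0, "politics": 1}

# lambda_[topic][metadata id] is a vector of length l = 1 + len(multi_ids).
lambda_ = [
    [[0.0, 1.0, 0.0], [0.5, 0.0, 1.0]],   # topic 0
    [[-1.0, 0.0, 2.0], [0.0, 0.5, 0.5]],  # topic 1
]

def multi_hot(values):
    """Encode a list of multi_metadata values as a vector of 0s and 1s."""
    vec = [0.0] * len(multi_ids)
    for value in values:
        vec[multi_ids[value]] = 1.0
    return vec

def topic_prior(metadata, multi_metadata, raw=False):
    # Feature vector: a constant 1 followed by the multi-hot encoding.
    features = [1.0] + multi_hot(multi_metadata)
    m = metadata_ids[metadata]
    # Dot product of each topic's lambda vector with the feature vector.
    scores = [sum(w * x for w, x in zip(topic[m], features)) for topic in lambda_]
    return scores if raw else [math.exp(s) for s in scores]

raw_prior = topic_prior("2020", ["politics"], raw=True)
prior = topic_prior("2020", ["politics"])
```

Here the feature vector is [1, 0, 1], so the raw scores are 0.5 + 1.0 = 1.5 for topic 0 and 0.0 + 0.5 = 0.5 for topic 1, and the non-raw result is the element-wise exponential of those two values.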

def make_doc(self, words, metadata='', multi_metadata=[]) ‑> Document

    def make_doc(self, words, metadata='', multi_metadata=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `metadata` that can be used for `tomotopy.models.LDAModel.infer` method.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
metadata : str
    metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
'''
        return self._make_doc(words, metadata, multi_metadata)

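Return a new Document instance for an unseen document built from words and metadata. This instance can be passed to the LDAModel.infer() method.

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
metadata : str: metadata of the document (e.g., author, title or year)
multi_metadata : Iterable[str]: metadata of the document (use this when multiple values are needed)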

Inherited members

LDAModel:
- add_corpus
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class DTModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, t=1, alpha_var=0.1, eta_var=0.1, phi_var=0.1, lr_a=0.01, lr_b=0.1, lr_c=0.55, seed=None, corpus=None, transform=None)

class DTModel(_DTModel, LDAModel):
    '''This type provides Dynamic Topic model and its implementation is based on the following papers:

> * Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113-120).
> * Bhadury, A., Chen, J., Zhu, J., & Liu, S. (2016, April). Scaling up dynamic topic models. In Proceedings of the 25th International Conference on World Wide Web (pp. 381-390).
> https://github.com/Arnie0426/FastDTM

.. versionadded:: 0.7.0'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, t=1, alpha_var=0.1, eta_var=0.1, phi_var=0.1, lr_a=0.01, lr_b=0.1, lr_c=0.55, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
t : int
    the number of timepoints
alpha_var : float
    transition variance of alpha (per-document topic distribution)
eta_var : float
    variance of eta (topic distribution of each document) from its alpha 
phi_var : float
    transition variance of phi (word distribution of each topic)
lr_a : float
    shape parameter `a` greater than zero, for SGLD step size calculated as `e_i = a * (b + i) ^ (-c)`
lr_b : float
    shape parameter `b` greater than or equal to zero, for SGLD step size calculated as `e_i = a * (b + i) ^ (-c)`
lr_c : float
    shape parameter `c` with range (0.5, 1], for SGLD step size calculated as `e_i = a * (b + i) ^ (-c)`
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    a list of documents to be added into the model
transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            t,
            alpha_var,
            eta_var,
            phi_var,
            lr_a,
            lr_b,
            lr_c,
            seed,
            corpus,
            transform,
        )

    def add_doc(self, words, timepoint=0, ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `timepoint` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
timepoint : int
    an integer with range [0, `t`)
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, timepoint, ignore_empty_words)
    
    def make_doc(self, words, timepoint=0) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `timepoint` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
timepoint : int
    an integer with range [0, `t`)
'''
        return self._make_doc(words, timepoint)
    
    def get_alpha(self, timepoint) -> List[float]:
        '''Return a `list` of alpha parameters for `timepoint`.

Parameters
----------
timepoint : int
    an integer with range [0, `t`)
'''
        return self._get_alpha(timepoint)
    
    def get_phi(self, timepoint, topic_id) -> List[float]:
        '''Return a `list` of phi parameters for `timepoint` and `topic_id`.

Parameters
----------
timepoint : int
    an integer with range [0, `t`)
topic_id : int
    an integer with range [0, `k`)
'''
        return self._get_phi(timepoint, topic_id)
    
    def get_topic_words(self, topic_id, timepoint, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id` with `timepoint`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int
    an integer in range [0, `k`), indicating the topic
timepoint : int
        an integer in range [0, `t`), indicating the timepoint
'''
        return self._get_topic_words(topic_id, timepoint, top_n)
    
    def get_topic_word_dist(self, topic_id, timepoint, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id` with `timepoint`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
timepoint : int
        an integer in range [0, `t`), indicating the timepoint
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, timepoint, normalize)
    
    def get_count_by_topics(self) -> np.ndarray:
        '''Return the number of words allocated to each timepoint and topic in the shape `[num_timepoints, k]`.

.. versionadded:: 0.9.0'''
        return self._get_count_by_topics()
    
    @property
    def lr_a(self) -> float:
        '''the shape parameter `a` greater than zero for SGLD step size (e_i = a * (b + i) ^ -c)'''
        return self._lr_a
    
    @lr_a.setter
    def lr_a(self, value: float):
        self._lr_a = value

    @property
    def lr_b(self) -> float:
        '''the shape parameter `b` greater than or equal to zero for SGLD step size (e_i = a * (b + i) ^ -c)'''
        return self._lr_b
    
    @lr_b.setter
    def lr_b(self, value: float):
        self._lr_b = value

    @property
    def lr_c(self) -> float:
        '''the shape parameter `c` with range (0.5, 1] for SGLD step size (e_i = a * (b + i) ^ -c)'''
        return self._lr_c
    
    @lr_c.setter
    def lr_c(self, value: float):
        self._lr_c = value

    @property
    def num_timepoints(self) -> int:
        '''the number of timepoints of the model (read-only)'''
        return self._num_timepoints
    
    @property
    def num_docs_by_timepoint(self) -> List[int]:
        '''the number of documents in the model by timepoint (read-only)'''
        return self._num_docs_by_timepoint
    
    @property
    def alpha(self) -> float:
        '''per-document topic distribution in the shape `[num_timepoints, k]` (read-only)

.. versionadded:: 0.9.0'''
        return self._alpha
    
    @property
    def eta(self):
        '''This property is not available in `DTModel`. Use `DTModel.docs[x].eta` instead.

.. versionadded:: 0.9.0'''
        raise AttributeError("DTModel has no attribute 'eta'. Use 'docs[x].eta' instead.")
    
    def _summary_params_info(self, file):
        print('| alpha (Dirichlet prior on the per-document topic distributions for each timepoint)\n'
            '|  {}'.format(_format_numpy(self.alpha, '|  ')), file=file)
        print('| phi (Dirichlet prior on the per-time&topic word distribution)\n'
            '|  ...', file=file)
        
    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        for k in range(self.k):
            print('| #{} ({})'.format(k, topic_cnt[:, k].sum()), file=file)
            for t in range(self.num_timepoints):
                words = ' '.join(w for w, _ in self.get_topic_words(k, t, top_n=topic_word_top_n))
                print('|  t={} ({}) : {}'.format(t, topic_cnt[t, k], words), file=file)

This type provides an implementation of the Dynamic Topic Model. The main algorithm is based on the following papers:

Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning (pp. 113-120).

Bhadury, A., Chen, J., Zhu, J., & Liu, S. (2016, April). Scaling up dynamic topic models. In Proceedings of the 25th International Conference on World Wide Web (pp. 381-390). https://github.com/Arnie0426/FastDTM

Added in version: 0.7.0

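Parameters

tw : Union[int, TermWeight]: term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: minimum document frequency of words. Words that appear in fewer documents than min_df are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: the number of top words to be removed. If words that are too common appear among the top results of the topic model and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int: the number of topics, an integer in the range 1 ~ 32767.
t : int: the number of timepoints
alpha_var : float: transition variance of the alpha parameter (per-timepoint topic distribution)
eta_var : float: variance of the eta parameter (per-document topic distribution) from alpha
phi_var : float: transition variance of the phi parameter (per-topic word distribution)
lr_a : float: the value a, greater than zero, used to calculate the SGLD step size e_i = a * (b + i) ^ (-c)
lr_b : float: the value b, greater than or equal to zero, used to calculate the SGLD step size e_i = a * (b + i) ^ (-c)
lr_c : float: the value c, in the range (0.5, 1], used to calculate the SGLD step size e_i = a * (b + i) ^ (-c)
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even if this value is fixed, results may differ between runs when workers is set to 2 or more in train, because of the nondeterminism introduced by multithreading.
corpus : Corpus: a set of documents to be added to the topic model
transform : Callable[dict, dict]: a callable object to manipulate arbitrary keyword arguments for a specific topic model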
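
The step-size formula shared by lr_a, lr_b and lr_c can be tried out in isolation. The following is a minimal sketch in plain Python, not the library's internal code: the function name `sgld_step_size` is hypothetical, and whether the library counts iterations from 0 or 1 is an assumption made here for illustration only.

```python
def sgld_step_size(i: int, a: float, b: float, c: float) -> float:
    """Return e_i = a * (b + i) ** (-c) for a non-negative iteration index i."""
    # Range checks mirror the constraints stated for lr_a, lr_b and lr_c.
    if a <= 0:
        raise ValueError("a must be greater than zero")
    if b < 0:
        raise ValueError("b must be greater than or equal to zero")
    if not 0.5 < c <= 1:
        raise ValueError("c must be in the range (0.5, 1]")
    if i < 0:
        raise ValueError("i must be non-negative")
    if b + i == 0:
        # 0 ** (-c) is undefined, so b = 0 needs an index starting at 1.
        raise ValueError("b + i must be positive")
    return a * (b + i) ** (-c)


# Illustrative values only (not taken from the documentation above):
# the step size shrinks as the iteration index grows.
schedule = [sgld_step_size(i, a=0.01, b=0.1, c=0.55) for i in range(5)]
```

A larger c makes the step size decay faster, while a larger b flattens the first few steps.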

Ancestors

tomotopy._DTModel
LDAModel
tomotopy._LDAModel

Instance variables

prop alpha : float

Expand source code

    @property
    def alpha(self) -> float:
        '''per-document topic distribution in the shape `[num_timepoints, k]` (read-only)

.. versionadded:: 0.9.0'''
        return self._alpha

Per-document topic distribution in the shape [num_timepoints, k] (read-only)

Added in version: 0.9.0

prop eta

Expand source code

    @property
    def eta(self):
        '''This property is not available in `DTModel`. Use `DTModel.docs[x].eta` instead.

.. versionadded:: 0.9.0'''
        raise AttributeError("DTModel has no attribute 'eta'. Use 'docs[x].eta' instead.")

This property is not available in DTModel. Use DTModel.docs[x].eta instead.

Added in version: 0.9.0

prop lr_a : float

Expand source code

@property
def lr_a(self) -> float:
    '''the shape parameter `a` greater than zero for SGLD step size (e_i = a * (b + i) ^ -c)'''
    return self._lr_a

The parameter a, greater than zero, that determines the SGLD step size (e_i = a * (b + i) ^ -c)

prop lr_b : float

Expand source code

@property
def lr_b(self) -> float:
    '''the shape parameter `b` greater than or equal to zero for SGLD step size (e_i = a * (b + i) ^ -c)'''
    return self._lr_b

The parameter b, greater than or equal to zero, that determines the SGLD step size (e_i = a * (b + i) ^ -c)

prop lr_c : float

Expand source code

@property
def lr_c(self) -> float:
    '''the shape parameter `c` with range (0.5, 1] for SGLD step size (e_i = a * (b + i) ^ -c)'''
    return self._lr_c

The parameter c, in the range (0.5, 1], that determines the SGLD step size (e_i = a * (b + i) ^ -c)

prop num_docs_by_timepoint : List[int]

Expand source code

@property
def num_docs_by_timepoint(self) -> List[int]:
    '''the number of documents in the model by timepoint (read-only)'''
    return self._num_docs_by_timepoint

The number of documents in the model for each timepoint (read-only)

prop num_timepoints : int

Expand source code

@property
def num_timepoints(self) -> int:
    '''the number of timepoints of the model (read-only)'''
    return self._num_timepoints

The number of timepoints of the model (read-only)

Methods

def add_doc(self, words, timepoint=0, ignore_empty_words=True) ‑> int | None

Expand source code

    def add_doc(self, words, timepoint=0, ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `timepoint` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
timepoint : int
    an integer with range [0, `t`)
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, timepoint, ignore_empty_words)

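Adds a new document at timepoint to the model and returns the index of the added document.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
timepoint : int: an integer in the range [0, t) indicating the timepoint
ignore_empty_words : bool: if True, an empty words does not raise an exception and makes the method return None.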
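
Because timepoint must fall in [0, t), it can help to check a whole batch before handing it to the model. The helper below is a hedged sketch: `add_timestamped_docs` is a hypothetical name, and `mdl` is only assumed to expose the `add_doc(words, timepoint=...)` method documented here.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def add_timestamped_docs(
    mdl,
    docs: Iterable[Tuple[Sequence[str], int]],
    num_timepoints: int,
) -> List[Optional[int]]:
    """Add (words, timepoint) pairs to a DTModel-like object."""
    docs = list(docs)
    # Validate every timepoint first, so that one bad record does not
    # leave the model half-filled.
    for _, timepoint in docs:
        if not 0 <= timepoint < num_timepoints:
            raise ValueError(
                "timepoint {} is outside [0, {})".format(timepoint, num_timepoints)
            )
    # add_doc returns the index of the inserted document, or None when
    # `words` is empty and ignore_empty_words keeps its default of True.
    return [mdl.add_doc(words, timepoint=timepoint) for words, timepoint in docs]
```

With the real library this might be called as `add_timestamped_docs(tp.DTModel(k=10, t=3), records, 3)`, assuming `import tomotopy as tp` and a `records` list of (words, timepoint) pairs.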

def get_alpha(self, timepoint) ‑> List[float]

Expand source code

    def get_alpha(self, timepoint) -> List[float]:
        '''Return a `list` of alpha parameters for `timepoint`.

Parameters
----------
timepoint : int
    an integer with range [0, `t`)
'''
        return self._get_alpha(timepoint)

Returns a list of the alpha parameters for timepoint.

Parameters

timepoint : int: an integer in the range [0, t) indicating the timepoint

def get_count_by_topics(self) ‑> numpy.ndarray

Expand source code

    def get_count_by_topics(self) -> np.ndarray:
        '''Return the number of words allocated to each timepoint and topic in the shape `[num_timepoints, k]`.

.. versionadded:: 0.9.0'''
        return self._get_count_by_topics()

Returns the number of words allocated to each timepoint and topic in the shape [num_timepoints, k].

Added in version: 0.9.0
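
A count matrix of shape [num_timepoints, k] is often easier to read as per-timepoint shares. The sketch below works on any row-iterable of numbers (a nested list, or the array returned by get_count_by_topics); the function name `topic_shares_by_timepoint` is hypothetical.

```python
def topic_shares_by_timepoint(counts):
    """Convert a [num_timepoints, k] count matrix into row-wise proportions."""
    shares = []
    for row in counts:
        row = [float(c) for c in row]
        total = sum(row)
        # A timepoint with no allocated words yields all zeros rather
        # than a division by zero.
        shares.append([c / total if total > 0 else 0.0 for c in row])
    return shares


# Toy counts for 3 timepoints and 2 topics (made-up numbers).
example = topic_shares_by_timepoint([[2, 2], [0, 0], [1, 3]])
```

Each row of the result sums to 1 unless the timepoint is empty, which makes topics comparable across timepoints with different numbers of documents.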

def get_phi(self, timepoint, topic_id) ‑> List[float]

Expand source code

    def get_phi(self, timepoint, topic_id) -> List[float]:
        '''Return a `list` of phi parameters for `timepoint` and `topic_id`.

Parameters
----------
timepoint : int
    an integer with range [0, `t`)
topic_id : int
    an integer with range [0, `k`)
'''
        return self._get_phi(timepoint, topic_id)

Returns a list of the phi parameters for topic_id at timepoint.

Parameters

timepoint : int: an integer in the range [0, t) indicating the timepoint
topic_id : int: an integer in the range [0, k) indicating the topic

def get_topic_word_dist(self, topic_id, timepoint, normalize=True) ‑> List[float]

Expand source code

    def get_topic_word_dist(self, topic_id, timepoint, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id` with `timepoint`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
timepoint : int
        an integer in range [0, `t`), indicating the timepoint
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, timepoint, normalize)

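Returns the word distribution of the topic topic_id at timepoint. The returned value is a list of len(vocabs) floating-point numbers indicating the probability of each word in the current topic.

Parameters

topic_id : int: an integer in the range [0, k) indicating the topic
timepoint : int: an integer in the range [0, t) indicating the timepoint
normalize : bool: Added in version: 0.11.0

If True, returns a probability distribution that sums to 1. If False, returns the raw, unnormalized values.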

def get_topic_words(self, topic_id, timepoint, top_n=10) ‑> List[Tuple[str, float]]

Expand source code

    def get_topic_words(self, topic_id, timepoint, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id` with `timepoint`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int
    an integer in range [0, `k`), indicating the topic
timepoint : int
        an integer in range [0, `t`), indicating the timepoint
'''
        return self._get_topic_words(topic_id, timepoint, top_n)

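Returns the top_n words of the topic topic_id at timepoint together with their probabilities. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int: an integer in the range [0, k) indicating the topic
timepoint : int: an integer in the range [0, t) indicating the timepoint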
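
Since get_topic_words takes both a topic and a timepoint, one natural use is to follow a single topic across all timepoints. This is a minimal sketch with a hypothetical helper name, `topic_words_over_time`; it relies only on the `num_timepoints` property and the `get_topic_words(topic_id, timepoint, top_n)` signature shown in this section.

```python
def topic_words_over_time(mdl, topic_id, top_n=5):
    """Return one list of top words per timepoint for a single topic."""
    trajectory = []
    for timepoint in range(mdl.num_timepoints):
        # Each entry is a (word, probability) tuple; keep only the word.
        pairs = mdl.get_topic_words(topic_id, timepoint, top_n=top_n)
        trajectory.append([word for word, _ in pairs])
    return trajectory
```

On a trained model, printing the rows of `topic_words_over_time(mdl, 0)` gives a compact view of how topic 0 drifts over time.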

def make_doc(self, words, timepoint=0) ‑> Document

Expand source code

    def make_doc(self, words, timepoint=0) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `timepoint` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
timepoint : int
    an integer with range [0, `t`)
'''
        return self._make_doc(words, timepoint)

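Returns a new Document instance for a new document built from words. This instance can be used with the LDAModel.infer() method.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
timepoint : int: an integer in the range [0, t) indicating the timepoint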

Inherited members

LDAModel:
- add_corpus
- burn_in
- copy
- docs
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class GDMRModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, degrees=[], alpha=0.1, eta=0.01, sigma=1.0, sigma0=3.0, decay=0, alpha_epsilon=1e-10, metadata_range=None, seed=None, corpus=None, transform=None)

Expand source code

class GDMRModel(_GDMRModel, DMRModel):
    '''This type provides Generalized DMR(g-DMR) topic model and its implementation is based on the following papers:

> * Lee, M., & Song, M. Incorporating citation impact into analysis of research trends. Scientometrics, 1-34.

.. versionadded:: 0.8.0

.. warning::

    Until version 0.10.2, `metadata` was used to represent numeric data and there was no argument for categorical data.
    Since version 0.11.0, the name of the previous `metadata` argument is changed to `numeric_metadata`, 
    and `metadata` is added to represent categorical data for unification with the `tomotopy.models.DMRModel`.
    So `metadata` arguments in the older codes should be replaced with `numeric_metadata` to work in version 0.11.0.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, degrees=[], alpha=0.1, eta=0.01, sigma=1.0, sigma0=3.0, decay=0, alpha_epsilon=0.0000000001, metadata_range=None, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
degrees : Iterable[int]
    a list of the degrees of Legendre polynomials for TDF(Topic Distribution Function). Its length should be equal to the number of metadata variables.

    Its default value is `[]` in which case the model doesn't use any metadata variable and as a result, it becomes the same as an LDA or DMR model. 
alpha : Union[float, Iterable[float]]
    exponential of mean of normal distribution for `lambdas`, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic - word
sigma : float
    standard deviation of normal distribution for non-constant terms of `lambdas`
sigma0 : float
    standard deviation of normal distribution for constant terms of `lambdas`
decay : float
    .. versionadded:: 0.11.0

    decay's exponent that causes the coefficient of the higher-order term of `lambdas` to become smaller
alpha_epsilon : float
    small smoothing value for preventing `exp(lambdas)` to be near zero
metadata_range : Iterable[Iterable[float]]
    a list of minimum and maximum value of each numeric metadata variable. Its length should be equal to the length of `degrees`.
    
    For example, `metadata_range = [(2000, 2017), (0, 1)]` means that the first variable has a range from 2000 to 2017 and the second one has a range from 0 to 1.
    Its default value is `None` in which case the ranges of each variable are obtained from input documents.
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    a list of documents to be added into the model
transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            degrees,
            alpha,
            eta,
            sigma,
            sigma0,
            decay,
            alpha_epsilon,
            metadata_range,
            seed,
            corpus,
            transform,
        )

    def add_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `metadata` and return an index of the inserted document.

.. versionchanged:: 0.11.0

    Until version 0.10.2, `metadata` was used to represent numeric data and there was no argument for categorical data.
    Since version 0.11.0, the name of the previous `metadata` argument is changed to `numeric_metadata`, 
    and `metadata` is added to represent categorical data for unification with the `tomotopy.models.DMRModel`.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
numeric_metadata : Iterable[float]
    continuous numeric metadata variable of the document. Its length should be equal to the length of `degrees`.
metadata : str
    categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, numeric_metadata, metadata, multi_metadata, ignore_empty_words)
    
    def make_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `metadata` that can be used for `tomotopy.models.LDAModel.infer` method.

.. versionchanged:: 0.11.0

    Until version 0.10.2, `metadata` was used to represent numeric data and there was no argument for categorical data.
    Since version 0.11.0, the name of the previous `metadata` argument is changed to `numeric_metadata`, 
    and `metadata` is added to represent categorical data for unification with the `tomotopy.models.DMRModel`.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
numeric_metadata : Iterable[float]
    continuous numeric metadata variable of the document. Its length should be equal to the length of `degrees`.
metadata : str
    categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
'''
        return self._make_doc(words, numeric_metadata, metadata, multi_metadata)
    
    def tdf(self, numeric_metadata, metadata='', multi_metadata=[], normalize=True) -> List[float]:
        '''Calculate a topic distribution for given `numeric_metadata` value. It returns a list with length `k`.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
numeric_metadata : Iterable[float]
    continuous metadata variable whose length should be equal to the length of `degrees`.
metadata : str    
    categorical metadata variable
multi_metadata : Iterable[str]
    categorical metadata variables (for multiple values)
normalize : bool
    If true, the method returns probabilities for each topic in range [0, 1]. Otherwise, it returns raw values in logit.
'''
        return self._tdf(numeric_metadata, metadata, multi_metadata, normalize)
    
    def tdf_linspace(self, numeric_metadata_start, numeric_metadata_stop, num, metadata='', multi_metadata=[], endpoint=True, normalize=True) -> np.ndarray:
        '''Calculate topic distributions over a linspace of `numeric_metadata` values.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
numeric_metadata_start : Iterable[float]
    the starting value of each continuous metadata variable whose length should be equal to the length of `degrees`.
numeric_metadata_stop : Iterable[float]
    the end value of each continuous metadata variable whose length should be equal to the length of `degrees`.
num : Iterable[int]
    the number of samples to generate for each metadata variable. Must be non-negative. Its length should be equal to the length of `degrees`.
metadata : str
    categorical metadata variable
multi_metadata : Iterable[str]
    categorical metadata variables (for multiple values)
endpoint : bool
    If True, `numeric_metadata_stop` is the last sample. Otherwise, it is not included. Default is True.
normalize : bool
    If true, the method returns probabilities for each topic in range [0, 1]. Otherwise, it returns raw values in logit.

Returns
-------
samples : ndarray
    with shape `[*num, k]`. 
'''
        return self._tdf_linspace(numeric_metadata_start, numeric_metadata_stop, num, metadata, multi_metadata, endpoint, normalize)
    
    @property
    def degrees(self) -> List[int]:
        '''the degrees of Legendre polynomials (read-only)'''
        return self._degrees

    @property
    def sigma0(self) -> float:
        '''the hyperparameter sigma0 (read-only)'''
        return self._sigma0
    
    @property
    def decay(self) -> float:
        '''the hyperparameter decay (read-only)'''
        return self._decay
    
    @property
    def metadata_range(self) -> List[Tuple[float, float]]:
        '''the ranges of each metadata variable (read-only)'''
        return self._metadata_range
    
    def _summary_basic_info(self, file):
        LDAModel._summary_basic_info(self, file)

        md_cnt = Counter(doc.metadata for doc in self.docs)
        if len(md_cnt) > 1:
            print('| Categorical metadata of docs and its distribution', file=file)
            for md in self.metadata_dict:
                print('|  {}: {}'.format(md, md_cnt.get(md, 0)), file=file)
        md_cnt = Counter()
        [md_cnt.update(doc.multi_metadata) for doc in self.docs]
        if len(md_cnt) > 0:
            print('| Categorical multi-metadata of docs and its distribution', file=file)
            for md in self.multi_metadata_dict:
                print('|  {}: {}'.format(md, md_cnt.get(md, 0)), file=file)

        md_stack = np.stack([doc.numeric_metadata for doc in self.docs])
        md_min = md_stack.min(axis=0)
        md_max = md_stack.max(axis=0)
        md_avg = np.average(md_stack, axis=0)
        md_std = np.std(md_stack, axis=0)
        print('| Numeric metadata distribution of docs', file=file)
        for i in range(md_stack.shape[1]):
            print('|  #{}: Range={:.5}~{:.5}, Avg={:.5}, Stdev={:.5}'.format(i, md_min[i], md_max[i], md_avg[i], md_std[i]), file=file)

    def _summary_params_info(self, file):
        print('| lambda (feature vector per metadata of documents)\n'
            '|  {}'.format(_format_numpy(self.lambda_, '|  ')), file=file)
        print('| eta (Dirichlet prior on the per-topic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)

This type provides an implementation of the Generalized DMR (g-DMR) topic model. The main algorithm is based on the following paper:

Lee, M., & Song, M. Incorporating citation impact into analysis of research trends. Scientometrics, 1-34.

Added in version: 0.8.0

Warning

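Until version 0.10.2, metadata was used to represent continuous numeric variables, and there was no separate argument for categorical variables. Since version 0.11.0, for consistency with DMRModel, the previous metadata argument is renamed to numeric_metadata, and the name metadata is used for categorical variables. Code written for older versions must therefore replace its metadata argument with numeric_metadata to work in version 0.11.0.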

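Parameters

tw : Union[int, TermWeight]

term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.

min_cf : int

minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.

min_df : int

minimum document frequency of words. Words that appear in fewer documents than min_df are excluded from the model. The default value is 0, which means no words are excluded.

rm_top : int

the number of top words to be removed. If words that are too common appear among the top results of the topic model and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.

k : int

the number of topics, an integer in the range 1 ~ 32767.

degrees : Iterable[int]

a list of the degrees of the Legendre polynomials used for the TDF (Topic Distribution Function). Its length should be equal to the number of metadata variables.

The default value is [], in which case the model does not use any metadata variable and becomes the same as an LDA or DMR model.

alpha : Union[float, Iterable[float]]

initial value of the exponential of the mean of the lambdas parameters, given as a single float for a symmetric prior or as a list of k floats for an asymmetric prior.

eta : float

hyperparameter of the Dirichlet distribution for topic-word

sigma : float

standard deviation of the non-constant terms of the lambdas parameters

sigma0 : float

standard deviation of the constant terms of the lambdas parameters

decay : float

Added in version: 0.11.0

decay exponent that makes the coefficients of the higher-order terms of the lambdas parameters smaller

alpha_epsilon : float

smoothing coefficient that prevents exp(lambdas) from becoming 0

metadata_range : Iterable[Iterable[float]]

a list specifying the minimum and maximum value of each metadata variable. Its length should be equal to the length of degrees.

For example, metadata_range = [(2000, 2017), (0, 1)] means that the first variable ranges from 2000 to 2017 and the second one ranges from 0 to 1. The default value is None, in which case the minimum and maximum values are obtained from the metadata of the input documents.

seed : int

random seed. The default value is a random integer generated by std::random_device{} in C++. Even if this value is fixed, results may differ between runs when workers is set to 2 or more in train, because of the nondeterminism introduced by multithreading.

corpus : Corpus

a set of documents to be added to the topic model

transform : Callable[dict, dict]

a callable object to manipulate arbitrary keyword arguments for a specific topic model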
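
To make the roles of degrees and metadata_range more concrete, the sketch below evaluates Legendre polynomials up to a given degree for one numeric metadata value. It is an illustration only: the helper names are hypothetical, and rescaling the variable to [-1, 1] (the standard domain of Legendre polynomials) is an assumption, so the library's internal scaling may differ.

```python
def rescale(value, lo, hi):
    """Map value from [lo, hi] linearly onto [-1, 1] (assumed scaling)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def legendre_basis(x, degree):
    """Return [P_0(x), ..., P_degree(x)] using Bonnet's recursion."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    values = [1.0]
    if degree >= 1:
        values.append(float(x))
    for n in range(1, degree):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values


# A year in the middle of the range (2000, 2017), expanded with degree 2.
x = rescale(2008.5, 2000, 2017)
basis = legendre_basis(x, 2)
```

A degree of d for a variable therefore yields d + 1 basis terms, and the constant term P_0 corresponds to the "constant terms" governed by sigma0, while higher-order terms correspond to the "non-constant terms" governed by sigma.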

Ancestors

tomotopy._GDMRModel
DMRModel
tomotopy._DMRModel
LDAModel
tomotopy._LDAModel

Instance variables

prop decay : float

Expand source code

@property
def decay(self) -> float:
    '''the hyperparameter decay (read-only)'''
    return self._decay

The hyperparameter decay (read-only)

prop degrees : List[int]

Expand source code

@property
def degrees(self) -> List[int]:
    '''the degrees of Legendre polynomials (read-only)'''
    return self._degrees

The degrees of the Legendre polynomials (read-only)

prop metadata_range : List[Tuple[float, float]]

Expand source code

@property
def metadata_range(self) -> List[Tuple[float, float]]:
    '''the ranges of each metadata variable (read-only)'''
    return self._metadata_range

A list of the ranges of each metadata variable (read-only)

prop sigma0 : float

Expand source code

@property
def sigma0(self) -> float:
    '''the hyperparameter sigma0 (read-only)'''
    return self._sigma0

The hyperparameter sigma0 (read-only)

Methods

def add_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[], ignore_empty_words=True) ‑> int | None

Expand source code

    def add_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `metadata` and return an index of the inserted document.

.. versionchanged:: 0.11.0

    Until version 0.10.2, `metadata` was used to represent numeric data and there was no argument for categorical data.
    Since version 0.11.0, the name of the previous `metadata` argument is changed to `numeric_metadata`, 
    and `metadata` is added to represent categorical data for unification with the `tomotopy.models.DMRModel`.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
numeric_metadata : Iterable[float]
    continuous numeric metadata variable of the document. Its length should be equal to the length of `degrees`.
metadata : str
    categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, numeric_metadata, metadata, multi_metadata, ignore_empty_words)

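Adds a new document with metadata to the model and returns the index of the added document.

Changed in version: 0.11.0

Until version 0.10.2, metadata was used to represent numeric data and there was no argument for categorical data. Since version 0.11.0, the previous metadata argument is renamed to numeric_metadata, and metadata is added to represent categorical data for consistency with DMRModel.

Changed in version: 0.12.0

A new argument multi_metadata for entering multiple metadata values was added.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
numeric_metadata : Iterable[float]: continuous numeric metadata variables of the document. Its length should be equal to the length of degrees.
metadata : str: categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]: metadata of the document (use this when multiple values are needed)
ignore_empty_words : bool: if True, an empty words does not raise an exception and makes the method return None.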
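
The length requirement on numeric_metadata is easy to get wrong when migrating code written for versions up to 0.10.2. The wrapper below is a hedged sketch with a hypothetical name, `add_gdmr_doc`; it assumes only the `degrees` property and the `add_doc` keyword arguments documented in this section.

```python
def add_gdmr_doc(mdl, words, numeric_metadata, metadata=""):
    """Add one document after checking numeric_metadata against mdl.degrees."""
    numeric_metadata = [float(v) for v in numeric_metadata]
    # One numeric value is expected per entry in `degrees`.
    if len(numeric_metadata) != len(mdl.degrees):
        raise ValueError(
            "expected {} numeric metadata values, got {}".format(
                len(mdl.degrees), len(numeric_metadata)
            )
        )
    # Numeric values go to `numeric_metadata`; the categorical value
    # goes to `metadata` (the naming used since version 0.11.0).
    return mdl.add_doc(words, numeric_metadata=numeric_metadata, metadata=metadata)
```

For a model built with `degrees=[3]`, a call such as `add_gdmr_doc(mdl, words, [2015], metadata="journal_a")` would pass the publication year as the numeric variable and the journal as the categorical one; the value names here are made up for illustration.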

def make_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[]) ‑> Document

Expand source code

    def make_doc(self, words, numeric_metadata=[], metadata='', multi_metadata=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `metadata` that can be used for `tomotopy.models.LDAModel.infer` method.

.. versionchanged:: 0.11.0

    Until version 0.10.2, `metadata` was used to represent numeric data and there was no argument for categorical data.
    Since version 0.11.0, the name of the previous `metadata` argument is changed to `numeric_metadata`, 
    and `metadata` is added to represent categorical data for unification with the `tomotopy.models.DMRModel`.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
numeric_metadata : Iterable[float]
    continuous numeric metadata variable of the document. Its length should be equal to the length of `degrees`.
metadata : str
    categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]
    metadata of the document (for multiple values)
'''
        return self._make_doc(words, numeric_metadata, metadata, multi_metadata)

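Returns a new Document instance for a new document built from words. This instance can be used with the LDAModel.infer() method.

Changed in version: 0.11.0

Until version 0.10.2, metadata was used to represent numeric data and there was no argument for categorical data. Since version 0.11.0, the previous metadata argument is renamed to numeric_metadata, and metadata is added to represent categorical data for consistency with DMRModel.

Changed in version: 0.12.0

A new argument multi_metadata for entering multiple metadata values was added.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
numeric_metadata : Iterable[float]: continuous numeric metadata variables of the document. Its length should be equal to the length of degrees.
metadata : str: categorical metadata of the document (e.g., author, title, journal or country)
multi_metadata : Iterable[str]: metadata of the document (use this when multiple values are needed)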

def tdf(self, numeric_metadata, metadata='', multi_metadata=[], normalize=True) ‑> List[float]

Expand source code

    def tdf(self, numeric_metadata, metadata='', multi_metadata=[], normalize=True) -> List[float]:
        '''Calculate a topic distribution for given `numeric_metadata` value. It returns a list with length `k`.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
numeric_metadata : Iterable[float]
    continuous metadata variable whose length should be equal to the length of `degrees`.
metadata : str    
    categorical metadata variable
multi_metadata : Iterable[str]
    categorical metadata variables (for multiple values)
normalize : bool
    If true, the method returns probabilities for each topic in range [0, 1]. Otherwise, it returns raw values in logit.
'''
        return self._tdf(numeric_metadata, metadata, multi_metadata, normalize)

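Calculates a topic distribution for the given numeric_metadata value and returns it as a list of length k.

Changed in version: 0.11.0

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

numeric_metadata : Iterable[float]: continuous metadata variable. Its length should be equal to the length of degrees.
metadata : str: categorical metadata variable
multi_metadata : Iterable[str]: categorical metadata variables (use this when multiple values are needed)
normalize : bool: If true, the method returns a probability distribution in which each value lies in the range [0, 1]. Otherwise, it returns the raw logit values.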
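To illustrate the difference between the two settings of normalize, the sketch below turns raw per-topic logit values into probabilities. The exact normalization used inside the library is not shown in this section; a softmax is assumed here, and `logits_to_probabilities` is a hypothetical helper, not part of tomotopy.

```python
import math


def logits_to_probabilities(logits):
    """Turn raw per-topic logit values into probabilities that sum to 1 (softmax)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this assumption, calling tdf with normalize=False and passing the result through such a function should give values comparable to normalize=True.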

def tdf_linspace(self, numeric_metadata_start, numeric_metadata_stop, num, metadata='', multi_metadata=[], endpoint=True, normalize=True) ‑> numpy.ndarray

Expand source code

    def tdf_linspace(self, numeric_metadata_start, numeric_metadata_stop, num, metadata='', multi_metadata=[], endpoint=True, normalize=True) -> np.ndarray:
        '''Calculate topic distributions over a linspace of `numeric_metadata` values.

.. versionchanged:: 0.12.0

    A new argument `multi_metadata` for multiple values of metadata was added.

Parameters
----------
numeric_metadata_start : Iterable[float]
    the starting value of each continuous metadata variable whose length should be equal to the length of `degrees`.
numeric_metadata_stop : Iterable[float]
    the end value of each continuous metadata variable whose length should be equal to the length of `degrees`.
num : Iterable[int]
    the number of samples to generate for each metadata variable. Must be non-negative. Its length should be equal to the length of `degrees`.
metadata : str
    categorical metadata variable
multi_metadata : Iterable[str]
    categorical metadata variables (for multiple values)
endpoint : bool
    If True, `numeric_metadata_stop` is the last sample. Otherwise, it is not included. Default is True.
normalize : bool
    If true, the method returns probabilities for each topic in range [0, 1]. Otherwise, it returns raw values in logit.

Returns
-------
samples : ndarray
    with shape `[*num, k]`. 
'''
        return self._tdf_linspace(numeric_metadata_start, numeric_metadata_stop, num, metadata, multi_metadata, endpoint, normalize)

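Calculates topic distributions over a linspace of numeric_metadata values and returns them as an ndarray of shape [*num, k].

Changed in version: 0.11.0

Changed in version: 0.12.0

A new argument multi_metadata for multiple values of metadata was added.

Parameters

numeric_metadata_start : Iterable[float]: the starting value of each continuous metadata variable of the document. Its length should be equal to the length of degrees.
numeric_metadata_stop : Iterable[float]: the end value of each continuous metadata variable of the document. Its length should be equal to the length of degrees.
num : Iterable[int]: the number of samples to generate for each metadata variable (a non-negative integer). Its length should be equal to the length of degrees.
metadata : str: categorical metadata variable
multi_metadata : Iterable[str]: categorical metadata variables (use this when multiple values are needed)
endpoint : bool: If true, numeric_metadata_stop is the last sample. Otherwise, the end value is not included in the samples. The default is true.
normalize : bool: If true, the method returns a probability distribution in which each value lies in the range [0, 1]. Otherwise, it returns the raw logit values.

Inherited members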
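The leading dimensions of the `[*num, k]` result correspond to a grid with one axis per continuous metadata variable. The sketch below builds such a grid with the standard library only. It assumes the start, stop, num and endpoint arguments follow numpy.linspace semantics; `linspace` and `metadata_grid` are hypothetical helpers, not tomotopy functions.

```python
from itertools import product


def linspace(start, stop, num, endpoint=True):
    """Evenly spaced values from start to stop, following numpy.linspace semantics."""
    if num <= 0:
        return []
    if num == 1:
        return [float(start)]
    divisor = num - 1 if endpoint else num
    step = (stop - start) / divisor
    return [start + i * step for i in range(num)]


def metadata_grid(starts, stops, nums, endpoint=True):
    """All combinations of sample points, one axis per metadata variable."""
    axes = [linspace(a, b, n, endpoint) for a, b, n in zip(starts, stops, nums)]
    return list(product(*axes))
```

For two metadata variables with num = [3, 2], the grid holds 3 * 2 = 6 points, which matches a result of shape [3, 2, k] with one topic distribution per grid point.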

DMRModel:
- add_corpus
- alpha
- alpha_epsilon
- burn_in
- copy
- docs
- eta
- f
- get_count_by_topics
- get_topic_prior
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- lambda_
- lambdas
- ll_per_word
- load
- loads
- metadata_dict
- multi_metadata_dict
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- sigma
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class HDPModel (tw='one', min_cf=0, min_df=0, rm_top=0, initial_k=2, alpha=0.1, eta=0.01, gamma=0.1, seed=None, corpus=None, transform=None)

Expand source code

class HDPModel(_HDPModel, LDAModel):
    '''This type provides Hierarchical Dirichlet Process(HDP) topic model and its implementation is based on the following papers:

> * Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in neural information processing systems (pp. 1385-1392).
> * Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.

.. versionchanged:: 0.3.0

    Since version 0.3.0, hyperparameter estimation for `alpha` and `gamma` has been added. You can turn off this estimation by setting `optim_interval` to zero.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, initial_k=2, alpha=0.1, eta=0.01, gamma=0.1, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    .. versionadded:: 0.2.0
    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
initial_k : int
    the initial number of topics between 2 ~ 32767
    The number of topics will be adjusted based on the data during training.
        
        Since version 0.3.0, the default value has been changed to 2 from 1.
alpha : float
    concentration coefficient of Dirichlet Process for document-table 
eta : float
    hyperparameter of Dirichlet distribution for topic-word
gamma : float
    concentration coefficient of Dirichlet Process for table-topic
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            initial_k,
            alpha,
            eta,
            gamma,
            seed,
            corpus,
            transform,
        )

    def is_live_topic(self, topic_id) -> bool:
        '''Return `True` if the topic `topic_id` is valid, otherwise return `False`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._is_live_topic(topic_id)
    
    def convert_to_lda(self, topic_threshold=0.0) -> Tuple['LDAModel', List[int]]:
        '''.. versionadded:: 0.8.0

Convert the current HDP model to equivalent LDA model and return `(new_lda_model, new_topic_id)`.
Topics with proportion less than `topic_threshold` are removed in `new_lda_model`.

`new_topic_id` is an array of length `HDPModel.k` and `new_topic_id[i]` indicates a topic id of new LDA model, equivalent to topic `i` of original HDP model.
If topic `i` of original HDP model is not alive or is removed in LDA model, `new_topic_id[i]` would be `-1`.

Parameters
----------
topic_threshold : float
    Topics with proportion less than this value are removed in the new LDA model.
    The default value is 0, which means that no topics other than non-alive ones are removed.
'''
        return self._convert_to_lda(LDAModel, topic_threshold)
    
    def purge_dead_topics(self) -> List[int]:
        '''.. versionadded:: 0.12.3

Purge all non-alive topics from the model and return `new_topic_ids`. After called, `HDPModel.k` shrinks to `HDPModel.live_k` and all topics of the model become live.

`new_topic_id` is an array of length `HDPModel.k` and `new_topic_id[i]` indicates a topic id of the new model, equivalent to topic `i` of previous HDP model.
If topic `i` of previous HDP model is not alive or is removed in the new model, `new_topic_id[i]` would be `-1`.
'''
        return self._purge_dead_topics()
    
    @property
    def gamma(self) -> float:
        '''the hyperparameter gamma (read-only)'''
        return self._gamma
    
    @property
    def live_k(self) -> int:
        '''the number of alive topics (read-only)'''
        return self._live_k
    
    @property
    def num_tables(self) -> int:
        '''the number of total tables (read-only)'''
        return self._num_tables
    
    def _progress_tqdm(self, current_iteration:int, total_iteration:int):
        self._tqdm.set_postfix_str(f'# Topics: {self.live_k}, LLPW: {self.ll_per_word:.6f}')
        self._tqdm.update(current_iteration - self._tqdm.n)
    
    def _summary_params_info(self, file):
        print('| alpha (concentration coefficient of Dirichlet Process for document-table)\n'
            '|  {:.5}'.format(self.alpha), file=file)
        print('| eta (Dirichlet prior on the per-topic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)
        print('| gamma (concentration coefficient of Dirichlet Process for table-topic)\n'
            '|  {:.5}'.format(self.gamma), file=file)
        print('| Number of Topics: {}'.format(self.live_k), file=file)
        print('| Number of Tables: {}'.format(self.num_tables), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        for k in range(self.k):
            if not self.is_live_topic(k): continue
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('| #{} ({}) : {}'.format(k, topic_cnt[k], words), file=file)

This type provides an implementation of the Hierarchical Dirichlet Process (HDP) topic model. The main algorithm is based on the following papers:

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in neural information processing systems (pp. 1385-1392).

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.

Changed in version: 0.3.0

Since version 0.3.0, hyperparameter estimation for alpha and gamma has been added. You can turn off this estimation by setting optim_interval to zero.

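Parameters

tw : Union[int, TermWeight]

term weighting scheme in TermWeight. The default value is TermWeight.ONE.

min_cf : int

minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.

min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded.

rm_top : int

Added in version: 0.2.0

the number of top words to be removed. If you want to remove overly common words from the model, set this value to 1 or more. The default value is 0, which means no top words are removed.

initial_k : int

the initial number of topics, an integer between 2 and 32767. The number of topics will be adjusted based on the data during training.

Since version 0.3.0, the default value has been changed from 1 to 2.

alpha : float

concentration coefficient of the Dirichlet Process for document-table

eta : float

hyperparameter of the Dirichlet distribution for topic-word

gamma : float

concentration coefficient of the Dirichlet Process for table-topic

seed : int

random seed. The default value is a random number from std::random_device{} in C++.

corpus : Corpus

Added in version: 0.6.0

a list of documents to be added into the model

transform : Callable[dict, dict]

Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model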

Ancestors

tomotopy._HDPModel
LDAModel
tomotopy._LDAModel

Instance variables

prop gamma : float

Expand source code

@property
def gamma(self) -> float:
    '''the hyperparameter gamma (read-only)'''
    return self._gamma

the hyperparameter gamma (read-only)

prop live_k : int

Expand source code

@property
def live_k(self) -> int:
    '''the number of alive topics (read-only)'''
    return self._live_k

the number of alive topics in the current model (read-only)

prop num_tables : int

Expand source code

@property
def num_tables(self) -> int:
    '''the number of total tables (read-only)'''
    return self._num_tables

the total number of tables in the current model (read-only)

Methods

def convert_to_lda(self, topic_threshold=0.0) ‑> Tuple[LDAModel, List[int]]

Expand source code

    def convert_to_lda(self, topic_threshold=0.0) -> Tuple['LDAModel', List[int]]:
        '''.. versionadded:: 0.8.0

Convert the current HDP model to equivalent LDA model and return `(new_lda_model, new_topic_id)`.
Topics with proportion less than `topic_threshold` are removed in `new_lda_model`.

`new_topic_id` is an array of length `HDPModel.k` and `new_topic_id[i]` indicates a topic id of new LDA model, equivalent to topic `i` of original HDP model.
If topic `i` of original HDP model is not alive or is removed in LDA model, `new_topic_id[i]` would be `-1`.

Parameters
----------
topic_threshold : float
    Topics with proportion less than this value are removed in the new LDA model.
    The default value is 0, which means that no topics other than non-alive ones are removed.
'''
        return self._convert_to_lda(LDAModel, topic_threshold)

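Added in version: 0.8.0

Converts the current HDP model to an equivalent LDA model and returns (new_lda_model, new_topic_id). Topics with a proportion less than topic_threshold are removed from new_lda_model.

new_topic_id is an array of length HDPModel.k, and new_topic_id[i] is the ID of the topic in the new LDA model that is equivalent to topic i of the original HDP model. If topic i of the original HDP model is not alive or has been removed from the new LDA model, new_topic_id[i] is -1.

Parameters

topic_threshold : float: Topics with a proportion less than this value are removed from the new LDA model. The default value is 0, in which case every topic except the non-alive ones is included in the LDA model.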
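The returned new_topic_id array maps HDP topic IDs to LDA topic IDs. When working with the converted model it is often handier to look up the original HDP topic from an LDA topic ID. The sketch below inverts the array; `lda_to_hdp_topic_ids` is a hypothetical helper written for illustration only.

```python
def lda_to_hdp_topic_ids(new_topic_id):
    """Invert the HDP -> LDA mapping: key is the LDA topic id, value is the HDP topic id."""
    # Entries equal to -1 mark HDP topics that are not alive or were removed.
    return {new: old for old, new in enumerate(new_topic_id) if new != -1}
```

For example, an array [0, -1, 1, -1, 2] says that HDP topics 1 and 3 have no counterpart, while LDA topics 0, 1 and 2 come from HDP topics 0, 2 and 4.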

def is_live_topic(self, topic_id) ‑> bool

Expand source code

    def is_live_topic(self, topic_id) -> bool:
        '''Return `True` if the topic `topic_id` is valid, otherwise return `False`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._is_live_topic(topic_id)

Returns True if topic_id refers to a valid topic, otherwise returns False.

Parameters

topic_id : int: an integer in range [0, k) indicating the topic
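Because k counts every topic slot of an HDP model, including dead ones, code that iterates over topics usually filters with is_live_topic, as the class's own _summary_topics_info does. A minimal sketch, where `live_topic_ids` is a hypothetical helper name:

```python
def live_topic_ids(model):
    """Return the ids of all live topics, skipping the slots that are not alive."""
    return [topic_id for topic_id in range(model.k) if model.is_live_topic(topic_id)]
```

The helper only relies on the k attribute and the is_live_topic method described above, so it also works with HLDAModel.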

def purge_dead_topics(self) ‑> List[int]

Expand source code

    def purge_dead_topics(self) -> List[int]:
        '''.. versionadded:: 0.12.3

Purge all non-alive topics from the model and return `new_topic_ids`. After called, `HDPModel.k` shrinks to `HDPModel.live_k` and all topics of the model become live.

`new_topic_id` is an array of length `HDPModel.k` and `new_topic_id[i]` indicates a topic id of the new model, equivalent to topic `i` of previous HDP model.
If topic `i` of previous HDP model is not alive or is removed in the new model, `new_topic_id[i]` would be `-1`.
'''
        return self._purge_dead_topics()

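Added in version: 0.12.3

Purges all non-alive topics from the current model and returns new_topic_ids. After the call, HDPModel.k shrinks to the value of HDPModel.live_k and every topic of the model is alive.

new_topic_ids is an array whose length is the value of HDPModel.k before the call, and new_topic_ids[i] is the ID of the topic in the new model that is equivalent to topic i of the previous HDP model. If topic i of the previous HDP model is not alive or has been removed from the new model, new_topic_ids[i] is -1.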
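Any per-topic data kept outside the model, such as manually assigned labels, has to be renumbered after purging. The sketch below applies the returned array to a dictionary keyed by the old topic IDs; `remap_topic_labels` and the label dictionary are hypothetical and only serve as an illustration.

```python
def remap_topic_labels(labels, new_topic_ids):
    """Carry per-topic annotations over to the renumbered model, dropping purged topics."""
    remapped = {}
    for old_id, label in labels.items():
        new_id = new_topic_ids[old_id]
        if new_id != -1:  # -1 marks a topic that no longer exists
            remapped[new_id] = label
    return remapped
```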

Inherited members

LDAModel:
- add_corpus
- add_doc
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- make_doc
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class HLDAModel (tw='one', min_cf=0, min_df=0, rm_top=0, depth=2, alpha=0.1, eta=0.01, gamma=0.1, seed=None, corpus=None, transform=None)

Expand source code

class HLDAModel(_HLDAModel, LDAModel):
    '''This type provides Hierarchical LDA topic model and its implementation is based on the following papers:

> * Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in neural information processing systems (pp. 17-24).

.. versionadded:: 0.4.0'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, depth=2, alpha=0.1, eta=0.01, gamma=0.1, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
depth : int
    the maximum depth level of hierarchy between 2 ~ 32767
alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-depth level, given as a single `float` in case of symmetric prior and as a list with length `depth` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
gamma : float
    concentration coefficient of Dirichlet Process
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            depth,
            alpha,
            eta,
            gamma,
            seed,
            corpus,
            transform,
        )

    def is_live_topic(self, topic_id) -> bool:
        '''Return `True` if the topic `topic_id` is alive, otherwise return `False`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._is_live_topic(topic_id)
    
    def num_docs_of_topic(self, topic_id) -> int:
        '''Return the number of documents belonging to a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._num_docs_of_topic(topic_id)
    
    def level(self, topic_id) -> int:
        '''Return the level of a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._level(topic_id)
    
    def parent_topic(self, topic_id) -> int:
        '''Return the topic ID of parent of a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._parent_topic(topic_id)
    
    def children_topics(self, topic_id) -> List[int]:
        '''Return a list of topic IDs with children of a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._children_topics(topic_id)
    
    @property
    def gamma(self) -> float:
        '''the hyperparameter gamma (read-only)'''
        return self._gamma
    
    @property
    def live_k(self) -> int:
        '''the number of alive topics (read-only)'''
        return self._live_k
    
    @property
    def depth(self) -> int:
        '''the maximum depth level of hierarchy (read-only)'''
        return self._depth
    
    def _progress_tqdm(self, current_iteration:int, total_iteration:int):
        self._tqdm.set_postfix_str(f'# Topics: {self.live_k}, LLPW: {self.ll_per_word:.6f}')
        self._tqdm.update(current_iteration - self._tqdm.n)
    
    def _summary_params_info(self, file):
        print('| alpha (Dirichlet prior on the per-document depth level distributions)\n'
            '|  {}'.format(_format_numpy(self.alpha, '|  ')), file=file)
        print('| eta (Dirichlet prior on the per-topic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)
        print('| gamma (concentration coefficient of Dirichlet Process)\n'
            '|  {:.5}'.format(self.gamma), file=file)
        print('| Number of Topics: {}'.format(self.live_k), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()

        def print_hierarchical(k=0, level=0):
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('| {}#{} ({}, {}) : {}'.format('  ' * level, k, topic_cnt[k], self.num_docs_of_topic(k), words), file=file)
            for c in np.sort(self.children_topics(k)):
                print_hierarchical(c, level + 1)

        print_hierarchical()

This type provides an implementation of the Hierarchical LDA topic model. The main algorithm is based on the following paper:

Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in neural information processing systems (pp. 17-24).

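Added in version: 0.4.0

Parameters

tw : Union[int, TermWeight]: term weighting scheme in TermWeight. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across all documents is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

minimum document frequency of words. Words that appear in fewer documents than min_df are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: Added in version: 0.2.0

the number of top words to be removed. If overly common words show up among the top results of the topic model and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
depth : int: the depth of the topic hierarchy, an integer between 2 and 32767.
alpha : Union[float, Iterable[float]]: hyperparameter of the Dirichlet distribution for document-depth level, given as a single float in case of a symmetric prior and as a list of float with length depth in case of an asymmetric prior.
eta : float: hyperparameter of the Dirichlet distribution for topic-word
gamma : float: concentration coefficient of the Dirichlet Process
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even if this value is fixed, results may differ between runs when workers is set to 2 or more in train, because of the nondeterminism that arises in multi-threading.
corpus : Corpus: Added in version: 0.6.0

a list of documents to be added into the model
transform : Callable[dict, dict]: Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model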

Ancestors

tomotopy._HLDAModel
LDAModel
tomotopy._LDAModel

Instance variables

prop depth : int

Expand source code

@property
def depth(self) -> int:
    '''the maximum depth level of hierarchy (read-only)'''
    return self._depth

the maximum depth level of the hierarchy in the current model (read-only)

prop gamma : float

Expand source code

@property
def gamma(self) -> float:
    '''the hyperparameter gamma (read-only)'''
    return self._gamma

the hyperparameter gamma (read-only)

prop live_k : int

Expand source code

@property
def live_k(self) -> int:
    '''the number of alive topics (read-only)'''
    return self._live_k

the number of alive topics in the current model (read-only)

Methods

def children_topics(self, topic_id) ‑> List[int]

Expand source code

    def children_topics(self, topic_id) -> List[int]:
        '''Return a list of topic IDs with children of a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._children_topics(topic_id)

Returns the IDs of the child topics of topic topic_id as a list.

Parameters

topic_id : int: an integer in range [0, k) indicating the topic
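children_topics is enough to walk the whole topic tree. The sketch below mirrors the recursion in the class's _summary_topics_info, which starts at topic 0 and visits children in sorted order; `walk_topic_tree` is a hypothetical helper, not a tomotopy function.

```python
def walk_topic_tree(model, topic_id=0, level=0):
    """Yield (topic_id, level) pairs in pre-order, starting from the root topic 0."""
    yield topic_id, level
    for child in sorted(model.children_topics(topic_id)):
        yield from walk_topic_tree(model, child, level + 1)
```

Each yielded pair can be combined with get_topic_words or num_docs_of_topic to print an indented overview of the hierarchy.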

def is_live_topic(self, topic_id) ‑> bool

Expand source code

    def is_live_topic(self, topic_id) -> bool:
        '''Return `True` if the topic `topic_id` is alive, otherwise return `False`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._is_live_topic(topic_id)

Returns True if topic_id refers to a valid topic, otherwise returns False.

Parameters

topic_id : int: an integer in range [0, k) indicating the topic

def level(self, topic_id) ‑> int

Expand source code

    def level(self, topic_id) -> int:
        '''Return the level of a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._level(topic_id)

Returns the level of topic topic_id.

Parameters

topic_id : int: an integer in range [0, k) indicating the topic

def num_docs_of_topic(self, topic_id) ‑> int

Expand source code

    def num_docs_of_topic(self, topic_id) -> int:
        '''Return the number of documents belonging to a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._num_docs_of_topic(topic_id)

Returns the number of documents belonging to topic topic_id.

Parameters

topic_id : int: an integer in range [0, k) indicating the topic

def parent_topic(self, topic_id) ‑> int

Expand source code

    def parent_topic(self, topic_id) -> int:
        '''Return the topic ID of parent of a topic `topic_id`.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
'''
        return self._parent_topic(topic_id)

Returns the ID of the parent topic of topic topic_id.

Parameters

topic_id : int: an integer in range [0, k) indicating the topic
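level and parent_topic together give the path from any topic up to the root. The sketch below assumes that the root topic has level 0, so that a topic at level n has exactly n ancestors; this section does not state that explicitly. `path_to_root` is a hypothetical helper.

```python
def path_to_root(model, topic_id):
    """Return the topic ids from `topic_id` up to the root, inclusive."""
    path = [topic_id]
    # Assumption: the root topic has level 0, so a topic at level n has n ancestors.
    for _ in range(model.level(topic_id)):
        topic_id = model.parent_topic(topic_id)
        path.append(topic_id)
    return path
```

Counting steps with level avoids relying on whatever value parent_topic returns for the root itself.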

Inherited members

LDAModel:
- add_corpus
- add_doc
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- make_doc
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class HPAModel (tw='one', min_cf=0, min_df=0, rm_top=0, k1=1, k2=1, alpha=0.1, subalpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

Expand source code

class HPAModel(_HPAModel, PAModel):
    '''This type provides Hierarchical Pachinko Allocation(HPA) topic model and its implementation is based on the following papers:

> * Mimno, D., Li, W., & McCallum, A. (2007, June). Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th international conference on Machine learning (pp. 633-640). ACM.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k1=1, k2=1, alpha=0.1, subalpha=0.1, eta=0.01, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    .. versionadded:: 0.2.0
    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k1 : int
    the number of super topics between 1 ~ 32767
k2 : int
    the number of sub topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
    initial hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k1 + 1` of `float` in case of asymmetric prior.
subalpha : Union[float, Iterable[float]]
    .. versionadded:: 0.11.0

    initial hyperparameter of Dirichlet distribution for super-sub topic, given as a single `float` in case of symmetric prior and as a list with length `k2 + 1` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k1,
            k2,
            alpha,
            subalpha,
            eta,
            seed,
            corpus,
            transform,
        )

    def get_topic_words(self, topic_id, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int
    0 indicates the top topic, 
    a number in range [1, 1 + `k1`) indicates a super topic and
    a number in range [1 + `k1`, 1 + `k1` + `k2`) indicates a sub topic.
'''
        return self._get_topic_words(topic_id, top_n)
    
    def get_topic_word_dist(self, topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int
    0 indicates the top topic, 
    a number in range [1, 1 + `k1`) indicates a super topic and
    a number in range [1 + `k1`, 1 + `k1` + `k2`) indicates a sub topic.
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, normalize)
    
    @property
    def alpha(self) -> float:
        '''Dirichlet prior on the per-document super topic distributions in shape `[k1 + 1]`. 
Its element 0 indicates the prior to the top topic and elements 1 ~ k1 indicate ones to the super topics. (read-only)

.. versionadded:: 0.9.0'''
        return self._alpha
    
    @property
    def subalpha(self) -> float:
        '''Dirichlet prior on the sub topic distributions for each super topic in shape `[k1, k2 + 1]`.
Its `[x, 0]` element indicates the prior to the super topic `x` 
and `[x, 1 ~ k2]` elements indicate ones to the sub topics in the super topic `x`. (read-only)

.. versionadded:: 0.9.0'''
        return self._subalpha
    
    def _summary_params_info(self, file):
        print('| alpha (Dirichlet prior on the per-document super topic distributions)\n'
            '|  {} {}'.format(self.alpha[:1], _format_numpy(self.alpha[1:], '|  ')), file=file)
        print('| subalpha (Dirichlet prior on the sub topic distributions for each super topic)', file=file)
        for k1 in range(self.k1):
            print('|  Super #{}: {} {}'.format(k1, self.subalpha[k1, :1], _format_numpy(self.subalpha[k1, 1:], '|   ')), file=file)
        print('| eta (Dirichlet prior on the per-subtopic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        words = ' '.join(w for w, _ in self.get_topic_words(0, top_n=topic_word_top_n))
        print('| Top-topic ({}) : {}'.format(topic_cnt[0], words), file=file)
        print('| Super-topics', file=file)
        for k in range(1, 1 + self.k1):
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('|  #Super{} ({}) : {}'.format(k - 1, topic_cnt[k], words), file=file)
            words = ' '.join('#{}'.format(w) for w, _ in self.get_sub_topics(k - 1, top_n=topic_word_top_n))
            print('|    its sub-topics : {}'.format(words), file=file)
        print('| Sub-topics', file=file)
        for k in range(1 + self.k1, 1 + self.k1 + self.k2):
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('|  #{} ({}) : {}'.format(k - 1 - self.k1, topic_cnt[k], words), file=file)

This type provides an implementation of the Hierarchical Pachinko Allocation (HPA) topic model. The main algorithm is based on the following paper:

Mimno, D., Li, W., & McCallum, A. (2007, June). Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th international conference on Machine learning (pp. 633-640). ACM.

파라미터

tw : Union[int, TermWeight]: 용어 가중치 기법을 나타내는 TermWeight의 열거값. 기본값은 TermWeight.ONE 입니다.
min_cf : int: 단어의 최소 장서 빈도. 전체 문헌 내의 출현 빈도가 min_cf보다 작은 단어들은 모델에서 제외시킵니다. 기본값은 0으로, 이 경우 어떤 단어도 제외되지 않습니다.
min_df : int: 추가된 버전: 0.6.0

단어의 최소 문헌 빈도. 출현한 문헌 숫자가 min_df보다 작은 단어들은 모델에서 제외시킵니다. 기본값은 0으로, 이 경우 어떤 단어도 제외되지 않습니다.
rm_top : int: 추가된 버전: 0.2.0

제거될 최상위 빈도 단어의 개수. 만약 너무 흔한 단어가 토픽 모델 상위 결과에 등장해 이를 제거하고 싶은 경우, 이 값을 1 이상의 수로 설정하십시오. 기본값은 0으로, 이 경우 최상위 빈도 단어는 전혀 제거되지 않습니다.* k1 : 상위 토픽의 개수, 1 ~ 32767 사이의 정수.
k1 : int: 상위 토픽의 개수, 1 ~ 32767 사이의 정수
k2 : int: 하위 토픽의 개수, 1 ~ 32767 사이의 정수.
alpha : Union[float, Iterable[float]]: 문헌-토픽 디리클레 분포의 하이퍼 파라미터, 대칭일 경우 float값 하나로, 비대칭일 경우 k1 + 1 길이의 float 리스트로 입력할 수 있습니다.
subalpha : Union[float, Iterable[float]]: 추가된 버전: 0.11.0

상위-하위 토픽 디리클레 분포의 하이퍼 파라미터, 대칭일 경우 float값 하나로, 비대칭일 경우 k2 + 1 길이의 float 리스트로 입력할 수 있습니다.
eta : float: 토픽-단어 디리클레 분포의 하이퍼 파라미터
seed : int: 난수의 시드값. 기본값은 C++의 std::random_device{}이 생성하는 임의의 정수입니다. 이 값을 고정하더라도 train시 workers를 2 이상으로 두면, 멀티 스레딩 과정에서 발생하는 우연성 때문에 실행시마다 결과가 달라질 수 있습니다.
corpus : Corpus: 추가된 버전: 0.6.0

토픽 모델에 추가될 문헌들의 집합을 지정합니다.
transform : Callable[dict, dict]: 추가된 버전: 0.6.0

특정한 토픽 모델에 맞춰 임의 키워드 인자를 조작하기 위한 호출가능한 객체
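
As an illustration of the symmetric and asymmetric forms that `alpha` and `subalpha` accept, the sketch below expands either form into an explicit list of the expected length. The helper `expand_prior` is hypothetical and not part of tomotopy; it only mirrors the length rules stated above (`k1 + 1` for `alpha`, `k2 + 1` for `subalpha`).

```python
from typing import Iterable, List, Union


def expand_prior(value: Union[float, Iterable[float]], length: int) -> List[float]:
    """Expand a symmetric (single float) or asymmetric (list) prior.

    Hypothetical helper for illustration only; it is not a tomotopy function.
    """
    if isinstance(value, (int, float)):
        # Symmetric prior: the same value is repeated for every component.
        return [float(value)] * length
    values = [float(v) for v in value]
    if len(values) != length:
        raise ValueError(f"expected {length} values, got {len(values)}")
    return values


k1, k2 = 2, 3
alpha = expand_prior(0.1, k1 + 1)                      # top topic + k1 super topics
subalpha = expand_prior([0.5, 0.1, 0.1, 0.1], k2 + 1)  # asymmetric form, k2 + 1 values
print(alpha)
print(subalpha)
```

Passing a list of the wrong length raises `ValueError` in this sketch, which makes a mismatch between `k1`/`k2` and a hand-written prior easy to spot before building a model.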

Ancestors

tomotopy._HPAModel
PAModel
tomotopy._PAModel
LDAModel
tomotopy._LDAModel

Instance variables

prop alpha : float

    @property
    def alpha(self) -> float:
        '''Dirichlet prior on the per-document super topic distributions in shape `[k1 + 1]`. 
Its element 0 indicates the prior to the top topic and elements 1 ~ k1 indicates ones to the super topics. (read-only)

.. versionadded:: 0.9.0'''
        return self._alpha

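Dirichlet prior on the per-document super topic distributions, in shape [k1 + 1]. Element 0 is the prior for the top topic, and elements 1 ~ k1 are the priors for the super topics. (read-only)

Added in version: 0.9.0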

prop subalpha : float

    @property
    def subalpha(self) -> float:
        '''Dirichlet prior on the sub topic distributions for each super topic in shape `[k1, k2 + 1]`.
Its `[x, 0]` element indicates the prior to the super topic `x` 
and `[x, 1 ~ k2]` elements indicate ones to the sub topics in the super topic `x`. (read-only)

.. versionadded:: 0.9.0'''
        return self._subalpha

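Dirichlet prior on the sub topic distributions for each super topic, in shape [k1, k2 + 1]. The [x, 0] element is the prior for super topic x, and the [x, 1 ~ k2] elements are the priors for the sub topics within super topic x. (read-only)

Added in version: 0.9.0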

Methods

def get_topic_word_dist(self, topic_id, normalize=True) ‑> List[float]

    def get_topic_word_dist(self, topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int
    0 indicates the top topic, 
    a number in range [1, 1 + `k1`) indicates a super topic and
    a number in range [1 + `k1`, 1 + `k1` + `k2`) indicates a sub topic.
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, normalize)

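Return the word distribution of the topic topic_id. The returned value is a list of len(vocabs) fractional numbers indicating the probability of each word in the current topic.

Parameters

topic_id : int: 0 indicates the top topic, an integer in range [1, 1 + k1) indicates a super topic, and an integer in range [1 + k1, 1 + k1 + k2) indicates a sub topic.
normalize : bool: Added in version: 0.11.0

If True, it returns a probability distribution that sums to 1. Otherwise, it returns the raw, unnormalized values.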
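
The flat `topic_id` layout described above (top topic, then super topics, then sub topics) can be sketched as a small lookup. `describe_hpa_topic_id` is a hypothetical helper, not a tomotopy function; the per-level indices it returns follow the `k - 1` and `k - 1 - k1` offsets used by `_summary_topics_info` in the source listing.

```python
from typing import Tuple


def describe_hpa_topic_id(topic_id: int, k1: int, k2: int) -> Tuple[str, int]:
    """Map a flat HPA topic_id to (level, index within that level)."""
    if topic_id == 0:
        return ("top", 0)
    if 1 <= topic_id < 1 + k1:
        return ("super", topic_id - 1)
    if 1 + k1 <= topic_id < 1 + k1 + k2:
        return ("sub", topic_id - 1 - k1)
    raise ValueError(f"topic_id must be in [0, {1 + k1 + k2}), got {topic_id}")


# With k1=2 super topics and k2=3 sub topics there are 1 + 2 + 3 = 6 ids.
for tid in range(6):
    print(tid, describe_hpa_topic_id(tid, k1=2, k2=3))
```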

def get_topic_words(self, topic_id, top_n=10) ‑> List[Tuple[str, float]]

    def get_topic_words(self, topic_id, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int
    0 indicates the top topic, 
    a number in range [1, 1 + `k1`) indicates a super topic and
    a number in range [1 + `k1`, 1 + `k1` + `k2`) indicates a sub topic.
'''
        return self._get_topic_words(topic_id, top_n)

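Return the top top_n words in the topic topic_id along with their probabilities. The return type is a list of (word: str, probability: float) tuples.

Parameters

topic_id : int: 0 indicates the top topic, an integer in range [1, 1 + k1) indicates a super topic, and an integer in range [1 + k1, 1 + k1 + k2) indicates a sub topic.

Inherited members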

PAModel:
- add_corpus
- add_doc
- burn_in
- copy
- docs
- eta
- get_count_by_super_topic
- get_count_by_topics
- get_sub_topic_dist
- get_sub_topics
- get_word_prior
- global_step
- infer
- k
- k1
- k2
- ll_per_word
- load
- loads
- make_doc
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class LDAModel (tw: int | str = 'one', min_cf: int = 0, min_df: int = 0, rm_top: int = 0, k: int = 1, alpha: float | List[float] = 0.1, eta: float = 0.01, seed: int | None = None, corpus=None, transform=None)

class LDAModel(_LDAModel):
    '''This type provides Latent Dirichlet Allocation(LDA) topic model and its implementation is based on the following papers:
        
> * Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
> * Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.'''

    def __init__(self, 
                 tw: Union[int, str] ='one',
                 min_cf: int = 0,
                 min_df: int = 0,
                 rm_top: int = 0,
                 k: int = 1,
                 alpha: Union[float, List[float]] = 0.1,
                 eta: float = 0.01,
                 seed: Optional[int] = None,
                 corpus = None,
                 transform = None,
                 ):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    .. versionadded:: 0.2.0
    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
seed : int
    random seed. The default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            alpha,
            eta,
            seed,
            corpus,
            transform,
        )
    
    @classmethod
    def load(cls, filename: str) -> 'LDAModel':
        '''Return the model instance loaded from file `filename`.'''
        inst, extra_data = cls._load(cls, filename)
        inst.init_params = pickle.loads(extra_data)
        return inst
    
    @classmethod
    def loads(cls, data: bytes) -> 'LDAModel':
        '''Return the model instance loaded from `data` in a bytes-like object.'''
        inst, extra_data = cls._loads(cls, data)
        inst.init_params = pickle.loads(extra_data)
        return inst
    
    @property
    def alpha(self) -> Union[float, List[float]]:
        '''Dirichlet prior on the per-document topic distributions (read-only)'''
        return self._alpha
    
    @property
    def burn_in(self) -> int:
        '''get or set the burn-in iterations for optimizing parameters

Its default value is 0.'''
        return self._burn_in
    
    @burn_in.setter
    def burn_in(self, value: int):
        self._burn_in = value
    
    @property
    def docs(self):
        '''a `list`-like interface of `tomotopy.utils.Document` in the model instance (read-only)'''
        return self._docs
    
    @property
    def eta(self) -> float:
        '''the hyperparameter eta (read-only)'''
        return self._eta
    
    @property
    def global_step(self) -> int:
        '''the total number of iterations of training (read-only)

.. versionadded:: 0.9.0'''
        return self._global_step
    
    @property
    def k(self) -> int:
        '''K, the number of topics (read-only)'''
        return self._k
    
    @property
    def ll_per_word(self) -> float:
        '''a log likelihood per-word of the model (read-only)'''
        return self._ll_per_word
    
    @property
    def num_vocabs(self) -> int:
        '''the number of vocabularies after words with a smaller frequency were removed (read-only)

This value is 0 before `train` is called.

.. deprecated:: 0.8.0

    Due to the confusion of its name, this property will be removed. Please use `len(used_vocabs)` instead.'''
        return self._num_vocabs
    
    @property
    def num_words(self) -> int:
        '''the number of total words (read-only)

This value is 0 before `train` is called.'''
        return self._num_words
    
    @property
    def optim_interval(self) -> int:
        '''get or set the interval for optimizing parameters

Its default value is 10. If it is set to 0, the parameter optimization is turned off.'''
        return self._optim_interval
    
    @optim_interval.setter
    def optim_interval(self, value: int):
        self._optim_interval = value
    
    @property
    def perplexity(self) -> float:
        '''a perplexity of the model (read-only)'''
        return self._perplexity
    
    @property
    def removed_top_words(self) -> List[str]:
        '''a `list` of `str` which is a word removed from the model if you set `rm_top` greater than 0 at initializing the model (read-only)'''
        return self._removed_top_words
    
    @property
    def tw(self) -> int:
        '''the term weighting scheme (read-only)'''
        return self._tw
    
    @property
    def used_vocab_df(self) -> List[int]:
        '''a `list` of vocabulary document-frequencies which contains only vocabularies actually used in modeling (read-only)

.. versionadded:: 0.8.0'''
        return self._used_vocab_df
    
    @property
    def used_vocab_freq(self) -> List[int]:
        '''a `list` of vocabulary frequencies which contains only vocabularies actually used in modeling (read-only)

.. versionadded:: 0.8.0'''
        return self._used_vocab_freq
    
    @property
    def used_vocab_weighted_freq(self) -> List[float]:
        '''a `list` of term-weighted vocabulary frequencies which contains only vocabularies actually used in modeling (read-only)

.. versionadded:: 0.12.1'''
        return self._used_vocab_weighted_freq
    
    @property
    def used_vocabs(self):
        '''a dictionary, which contains only the vocabularies actually used in modeling, as the type `tomotopy.Dictionary` (read-only)

.. versionadded:: 0.8.0'''
        return self._used_vocabs
    
    @property
    def vocab_df(self) -> List[int]:
        '''a `list` of vocabulary document-frequencies which contains both vocabularies filtered by frequency and vocabularies actually used in modeling (read-only)

.. versionadded:: 0.8.0'''
        return self._vocab_df
    
    @property
    def vocab_freq(self) -> List[int]:
        '''a `list` of vocabulary frequencies which contains both vocabularies filtered by frequency and vocabularies actually used in modeling (read-only)'''
        return self._vocab_freq
    
    @property
    def vocabs(self):
        '''a dictionary, which contains both vocabularies filtered by frequency and vocabularies actually used in modeling, as the type `tomotopy.Dictionary` (read-only)'''
        return self._vocabs
    
    def add_corpus(self, corpus, transform=None) -> Corpus:
        '''.. versionadded:: 0.10.0

Add new documents into the model instance using `tomotopy.utils.Corpus` and return an instance of corpus that contains the inserted documents. 
This method should be called before calling the `tomotopy.models.LDAModel.train`.

Parameters
----------
corpus : tomotopy.utils.Corpus
    corpus that contains documents to be added
transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model
'''
        return self._add_corpus(corpus, transform)
    
    def add_doc(self, words, ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance and return an index of the inserted document. This method should be called before calling the `tomotopy.models.LDAModel.train`.

.. versionchanged:: 0.12.3

    A new argument `ignore_empty_words` was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, ignore_empty_words)
    
    def copy(self) -> 'LDAModel':
        '''.. versionadded:: 0.12.0

Return a new deep-copied instance of the current instance'''
        return self._copy(type(self))
    
    def get_count_by_topics(self) -> List[int]:
        '''Return the number of words allocated to each topic.'''
        return self._get_count_by_topics()
    
    def get_hash(self) -> int:
        return self._get_hash()
    
    def get_topic_word_dist(self, topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, normalize)
    
    def get_topic_words(self, topic_id, top_n=10, return_id=False) -> Union[List[Tuple[str, float]], List[Tuple[str, int, float]]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`) tuples if return_id is False,
otherwise a `list` of (word:`str`, word_id:`int`, probability:`float`) tuples.

Parameters
----------
topic_id : int
    an integer in range [0, `k`), indicating the topic
top_n : int
    the number of words to be returned
return_id : bool
    If `True`, it returns the word IDs too.
'''
        return self._get_topic_words(topic_id, top_n, return_id)
    
    def get_word_forms(self, idx = -1):
        return self._get_word_forms(idx)
    
    def get_word_prior(self, word) -> List[float]:
        '''.. versionadded:: 0.6.0

Return word-topic prior for `word`. If there is no set prior for `word`, an empty list is returned.

Parameters
----------
word : str
    a word
'''
        return self._get_word_prior(word)
    
    def infer(self, doc, iterations=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None) -> Tuple[Union[List[float], List[List[float]], Corpus], List[float]]:
        '''Return the inferred topic distribution from unseen `doc`s.

Parameters
----------
doc : Union[tomotopy.utils.Document, Iterable[tomotopy.utils.Document], tomotopy.utils.Corpus]
    an instance of `tomotopy.utils.Document` or a `list` of instances of `tomotopy.utils.Document` to be inferred by the model.
    It can be acquired from `tomotopy.models.LDAModel.make_doc` method.

    .. versionchanged:: 0.10.0

        Since version 0.10.0, `infer` can receive a raw corpus instance of `tomotopy.utils.Corpus`. 
        In this case, you don't need to call `make_doc`. `infer` would generate documents bound to the model, estimate its topic distributions and
        return a corpus containing generated documents as the result.
iterations : int
    an integer indicating the number of iteration to estimate the distribution of topics of `doc`.
    The higher value will generate a more accurate result.
tolerance : float
    This parameter is not currently used.
workers : int
    an integer indicating the number of workers to perform samplings. 
    If `workers` is 0, the number of cores in the system will be used.
parallel : Union[int, tomotopy.ParallelScheme]
    .. versionadded:: 0.5.0
    
    the parallelism scheme for inference. the default value is ParallelScheme.DEFAULT which means that tomotopy selects the best scheme by model.
together : bool
    all `doc`s are inferred together in one process if True, otherwise each `doc` is inferred independently. Its default value is `False`.
transform : Callable[dict, dict]
    .. versionadded:: 0.10.0
    
    a callable object to manipulate arbitrary keyword arguments for a specific topic model. 
    Available when `doc` is given as an instance of `tomotopy.utils.Corpus`.

Returns
-------
result : Union[List[float], List[List[float]], tomotopy.utils.Corpus]
    If `doc` is given as a single `tomotopy.utils.Document`, `result` is a single `List[float]` indicating its topic distribution.
    
    If `doc` is given as a list of `tomotopy.utils.Document`s, `result` is a list of `List[float]` indicating topic distributions for each document.
    
    If `doc` is given as an instance of `tomotopy.utils.Corpus`, `result` is another instance of `tomotopy.utils.Corpus` which contains inferred documents.
    You can get topic distribution for each document using `tomotopy.utils.Document.get_topic_dist`.
log_ll : List[float]
    a list of log-likelihoods for each `doc`
'''
        return self._infer(doc, iterations, tolerance, workers, parallel, together, transform)
    
    def make_doc(self, words) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
'''
        return self._make_doc(words)
    
    def save(self, filename: str, full=True) -> None:
        '''Save the model instance to file `filename`. Return `None`.

If `full` is `True`, the model is saved with all its documents and state. Use the full model if you want to continue training later.
If `False`, only the topic parameters of the model are saved. Such a model can only be used for inference on unseen documents.

.. versionadded:: 0.6.0

Since version 0.6.0, the model file format has been changed. 
Thus model files saved in version 0.6.0 or later are not compatible with versions prior to 0.5.2.
'''
        extra_data = pickle.dumps(self.init_params)
        return self._save(filename, extra_data, full)
    
    def saves(self, full=True) -> bytes:
        '''.. versionadded:: 0.11.0

Serialize the model instance into `bytes` object and return it. The arguments work the same as `tomotopy.models.LDAModel.save`.'''
        extra_data = pickle.dumps(self.init_params)
        return self._saves(extra_data, full)
    
    def set_word_prior(self, word, prior) -> None:
        '''.. versionadded:: 0.6.0

Set word-topic prior. This method should be called before calling the `tomotopy.models.LDAModel.train`.

Parameters
----------
word : str
    a word to be set
prior : Union[Iterable[float], Dict[int, float]]
    topic distribution of `word` whose length is equal to `tomotopy.models.LDAModel.k`

Note
----
Since version 0.12.6, this method can accept a dictionary type parameter as well as a list type parameter for `prior`.
The key of the dictionary is the topic id and the value is the prior of the topic. If the prior of a topic is not set, the default value is set to `eta` parameter of the model.
```python
>>> model = tp.LDAModel(k=3, eta=0.01)
>>> model.set_word_prior('apple', [0.01, 0.9, 0.01])
>>> model.set_word_prior('apple', {1: 0.9}) # same effect as above
```
'''
        return self._set_word_prior(word, prior)
    
    @classmethod
    def _summary_extract_param_desc(cls:type):
        doc_string = cls.__init__.__doc__
        if not doc_string: return {}
        ps = doc_string.split('Parameters\n')[1].split('\n')
        param_name = re.compile(r'^([a-zA-Z0-9_]+)\s*:\s*')
        directive = re.compile(r'^\s*\.\.')
        descriptive = re.compile(r'\s+([^\s].*)')
        period = re.compile(r'[.,](\s|$)')
        ret = {}
        name = None
        desc = ''
        for p in ps:
            if directive.search(p): continue
            m = param_name.search(p)
            if m:
                if name: ret[name] = desc.split('. ')[0]
                name = m.group(1)
                desc = ''
                continue
            m = descriptive.search(p)
            if m:
                desc += (' ' if desc else '') + m.group(1)
                continue
        if name: ret[name] = period.split(desc)[0]
        return ret

    def _summary_basic_info(self, file):
        p = self.used_vocab_freq
        p = p / p.sum()
        entropy = -(p * np.log(p + 1e-20)).sum()

        p = self.used_vocab_weighted_freq
        p /= p.sum()
        w_entropy = -(p * np.log(p + 1e-20)).sum()

        print('| {} (current version: {})'.format(type(self).__name__, __version__), file=file)
        print('| {} docs, {} words'.format(len(self.docs), self.num_words), file=file)
        print('| Total Vocabs: {}, Used Vocabs: {}'.format(len(self.vocabs), len(self.used_vocabs)), file=file)
        print('| Entropy of words: {:.5f}'.format(entropy), file=file)
        print('| Entropy of term-weighted words: {:.5f}'.format(w_entropy), file=file)
        print('| Removed Vocabs: {}'.format(' '.join(self.removed_top_words) if self.removed_top_words else '<NA>'), file=file)

    def _summary_training_info(self, file):
        print('| Iterations: {}, Burn-in steps: {}'.format(self.global_step, self.burn_in), file=file)
        print('| Optimization Interval: {}'.format(self.optim_interval), file=file)
        print('| Log-likelihood per word: {:.5f}'.format(self.ll_per_word), file=file)

    def _summary_initial_params_info(self, file):
        try:
            param_desc = self._summary_extract_param_desc()
        except:
            param_desc = {}
        if hasattr(self, 'init_params'):
            for k, v in self.init_params.items():
                if type(v) is float: fmt = ':.5'
                else: fmt = ''

                try:
                    getattr(self, f'_summary_initial_params_info_{k}')(v, file)
                except AttributeError:
                    if k in param_desc:
                        print(('| {}: {' + fmt + '} ({})').format(k, v, param_desc[k]), file=file)
                    else:
                        print(('| {}: {' + fmt + '}').format(k, v), file=file)
        else:
            print('| Not Available (The model seems to have been built in version < 0.9.0.)', file=file)

    def _summary_initial_params_info_tw(self, v, file):
        from tomotopy import TermWeight
        try:
            if isinstance(v, str):
                v = TermWeight[v.upper()].name
            else:
                v = TermWeight(v).name
        except:
            pass
        print('| tw: TermWeight.{}'.format(v), file=file)

    def _summary_initial_params_info_version(self, v, file):
        print('| trained in version {}'.format(v), file=file)

    def _summary_params_info(self, file):
        print('| alpha (Dirichlet prior on the per-document topic distributions)\n'
            '|  {}'.format(_format_numpy(self.alpha, '|  ')), file=file)
        print('| eta (Dirichlet prior on the per-topic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        for k in range(self.k):
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('| #{} ({}) : {}'.format(k, topic_cnt[k], words), file=file)

    def summary(self, initial_hp=True, params=True, topic_word_top_n=5, file=None, flush=False) -> None:
        '''.. versionadded:: 0.9.0

Print human-readable description of the current model

Parameters
----------
initial_hp : bool
    whether to show the initial parameters at model creation
params : bool
    whether to show the current parameters of the model
topic_word_top_n : int
    the number of words by topic to display
file
    a file-like object (stream), default is `sys.stdout`
flush : bool
    whether to forcibly flush the stream
'''
        flush = flush or False

        print('<Basic Info>', file=file)
        self._summary_basic_info(file=file)
        print('|', file=file)
        print('<Training Info>', file=file)
        self._summary_training_info(file=file)
        print('|', file=file)

        if initial_hp:
            print('<Initial Parameters>', file=file)
            self._summary_initial_params_info(file=file)
            print('|', file=file)
        
        if params:
            print('<Parameters>', file=file)
            self._summary_params_info(file=file)
            print('|', file=file)

        if topic_word_top_n > 0:
            print('<Topics>', file=file)
            self._summary_topics_info(file=file, topic_word_top_n=topic_word_top_n)
            print('|', file=file)

        print(file=file, flush=flush)

    
    def train(self, iterations=10, workers=0, parallel=0, freeze_topics=False, callback_interval=10, callback=None, show_progress=False) -> None:
        '''Train the model using Gibbs-sampling with `iterations` iterations. Return `None`. 
After calling this method, you cannot `tomotopy.models.LDAModel.add_doc` or `tomotopy.models.LDAModel.set_word_prior` more.

Parameters
----------
iterations : int
    the number of iterations of Gibbs-sampling
workers : int
    an integer indicating the number of workers to perform samplings. 
    If `workers` is 0, the number of cores in the system will be used.
parallel : Union[int, tomotopy.ParallelScheme]
    .. versionadded:: 0.5.0
    
    the parallelism scheme for training. the default value is `tomotopy.ParallelScheme.DEFAULT` which means that tomotopy selects the best scheme by model.
freeze_topics : bool
    .. versionadded:: 0.10.1

    prevents creating a new topic when training. Only valid for `tomotopy.models.HLDAModel`
callback_interval : int
    .. versionadded:: 0.12.6

    the interval of calling `callback` function. If `callback_interval` <= 0, `callback` function is called at the beginning and the end of training.
callback : Callable[[tomotopy.models.LDAModel, int, int], None]
    .. versionadded:: 0.12.6

    a callable object which is called every `callback_interval` iterations. 
    It receives three arguments: the current model, the current number of iterations, and the total number of iterations.
show_progress : bool
    .. versionadded:: 0.12.6

    If `True`, it shows progress bar during training using `tqdm` package.
'''
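        if show_progress:
            if callback is None:
                # No user callback: the progress bar is the only callback.
                callback = LDAModel._show_progress
            else:
                # Bind the user's callback to a separate name so the wrapper
                # does not call itself after `callback` is reassigned below.
                user_callback = callback
                def _multiple_callbacks(*args):
                    user_callback(*args)
                    LDAModel._show_progress(*args)
                callback = _multiple_callbacks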
        return self._train(iterations, workers, parallel, freeze_topics, callback_interval, callback)
    
    def _init_tqdm(self, current_iteration:int, total_iteration:int):
        from tqdm import tqdm
        self._tqdm = tqdm(total=total_iteration, desc='Iteration')
    
    def _close_tqdm(self, current_iteration:int, total_iteration:int):
        self._tqdm.update(current_iteration - self._tqdm.n)
        self._tqdm.close()
        self._tqdm = None
    
    def _progress_tqdm(self, current_iteration:int, total_iteration:int):
        self._tqdm.set_postfix_str(f'LLPW: {self.ll_per_word:.6f}')
        self._tqdm.update(current_iteration - self._tqdm.n)
    
    def _show_progress(self, current_iteration:int, total_iteration:int):
        if current_iteration == 0:
            self._init_tqdm(current_iteration, total_iteration)
        elif current_iteration == total_iteration:
            self._close_tqdm(current_iteration, total_iteration)
        else:
            self._progress_tqdm(current_iteration, total_iteration)
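
The `callback` argument of `train` in the listing above is documented as receiving three arguments: the model, the current number of iterations, and the total number of iterations. A minimal sketch of combining several such callbacks into one is given below. `combine_callbacks`, `log_progress`, `mark_done`, and the stand-in loop are hypothetical names used only for illustration; no tomotopy model is involved.

```python
from typing import Any, Callable, List

Callback = Callable[[Any, int, int], None]


def combine_callbacks(*callbacks: Callback) -> Callback:
    """Return one callback that calls each given callback in order."""
    def combined(model: Any, current: int, total: int) -> None:
        for cb in callbacks:
            cb(model, current, total)
    return combined


calls: List[str] = []


def log_progress(model: Any, current: int, total: int) -> None:
    # Record every report as "current/total".
    calls.append(f"{current}/{total}")


def mark_done(model: Any, current: int, total: int) -> None:
    # Only reacts to the final report.
    if current == total:
        calls.append("done")


callback = combine_callbacks(log_progress, mark_done)
# Stand-in for a training loop that reports every 10 iterations.
for step in range(0, 31, 10):
    callback(None, step, 30)
print(calls)
```

Because the individual callbacks are captured as arguments of `combine_callbacks`, rebinding the name `callback` afterwards does not change which functions the combined callback invokes.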

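This type provides an implementation of the Latent Dirichlet Allocation (LDA) topic model. The main algorithm is based on the following papers:

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10(Aug), 1801-1828.

Parameters

tw : Union[int, TermWeight]: term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

minimum document frequency of words. Words that appear in fewer documents than min_df are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: Added in version: 0.2.0

the number of top-frequency words to be removed. If overly common words show up among the top results of the topic model and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int: the number of topics, an integer between 1 and 32767.
alpha : Union[float, Iterable[float]]: hyperparameter of the Dirichlet distribution for document-topic, given as a single float for a symmetric prior and as a list of float with length k for an asymmetric prior.
eta : float: hyperparameter of the Dirichlet distribution for topic-word.
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even if this value is fixed, results may differ between runs when workers is set to 2 or more in train, because of the nondeterminism introduced by multithreading.
corpus : Corpus: Added in version: 0.6.0

a collection of documents to be added to the topic model.
transform : Callable[dict, dict]: Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model.

Ancestors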
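
To make the three vocabulary filters concrete, here is a pure-Python sketch of what `min_cf`, `min_df`, and `rm_top` mean for a toy corpus. `filter_vocab` is a hypothetical helper, not tomotopy code. The order in which the filters are applied and the alphabetical tie-breaking are assumptions made for this sketch; this page does not specify how tomotopy handles either.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def filter_vocab(docs: Iterable[List[str]], min_cf: int = 0, min_df: int = 0,
                 rm_top: int = 0) -> Tuple[Set[str], List[str]]:
    """Return (kept words, removed top words) under the three filters."""
    docs = [list(d) for d in docs]
    cf = Counter(w for d in docs for w in d)        # collection frequency
    df = Counter(w for d in docs for w in set(d))   # document frequency
    kept = {w for w in cf if cf[w] >= min_cf and df[w] >= min_df}
    # Assumption: top words are ranked by collection frequency among the
    # survivors, with ties broken alphabetically to keep the sketch deterministic.
    ranked = sorted(kept, key=lambda w: (-cf[w], w))
    removed_top = ranked[:rm_top]
    return kept - set(removed_top), removed_top


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
kept, removed = filter_vocab(docs, min_cf=2, min_df=2, rm_top=1)
print(sorted(kept), removed)
```

In this toy corpus, "dog" and "ran" fall below both thresholds, and "the" is the single most frequent survivor, so it plays the role that `removed_top_words` reports on a real model.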

tomotopy._LDAModel

Static methods

def load(filename: str) ‑> LDAModel: Return the model instance loaded from the file filename.
def loads(data: bytes) ‑> LDAModel: Return the model instance loaded from data, a bytes-like object.

Instance variables

prop alpha : float | List[float]

@property
def alpha(self) -> Union[float, List[float]]:
    '''Dirichlet prior on the per-document topic distributions (read-only)'''
    return self._alpha

Dirichlet prior on the per-document topic distributions (read-only)

prop burn_in : int

    @property
    def burn_in(self) -> int:
        '''get or set the burn-in iterations for optimizing parameters

Its default value is 0.'''
        return self._burn_in

Get or set the number of burn-in iterations for optimizing parameters.

Its default value is 0.

prop docs

@property
def docs(self):
    '''a `list`-like interface of `tomotopy.utils.Document` in the model instance (read-only)'''
    return self._docs

a list-like interface to the Document instances in the model (read-only)

prop eta : float

@property
def eta(self) -> float:
    '''the hyperparameter eta (read-only)'''
    return self._eta

the hyperparameter eta (read-only)

prop global_step : int

    @property
    def global_step(self) -> int:
        '''the total number of iterations of training (read-only)

.. versionadded:: 0.9.0'''
        return self._global_step

the total number of training iterations performed so far (read-only)

Added in version: 0.9.0

prop k : int

@property
def k(self) -> int:
    '''K, the number of topics (read-only)'''
    return self._k

the number of topics (read-only)

prop ll_per_word : float

@property
def ll_per_word(self) -> float:
    '''a log likelihood per-word of the model (read-only)'''
    return self._ll_per_word

the log-likelihood per word of the current model (read-only)

prop num_vocabs : int

    @property
    def num_vocabs(self) -> int:
        '''the number of vocabularies after words with a smaller frequency were removed (read-only)

This value is 0 before `train` is called.

.. deprecated:: 0.8.0

    Due to the confusion of its name, this property will be removed. Please use `len(used_vocabs)` instead.'''
        return self._num_vocabs

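The number of vocabulary entries remaining after low-frequency words have been removed (read-only)

This value is 0 before train is called.

Deprecated since version: 0.8.0

Because its name is confusing, this property will be removed. Use len(used_vocabs) instead.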

prop num_words : int

Expand source code

    @property
    def num_words(self) -> int:
        '''the number of total words (read-only)

This value is 0 before `train` is called.'''
        return self._num_words

The total number of words across all documents in the current model (read-only)

This value is 0 before train is called.

prop optim_interval : int

Expand source code

    @property
    def optim_interval(self) -> int:
        '''get or set the interval for optimizing parameters

Its default value is 10. If it is set to 0, the parameter optimization is turned off.'''
        return self._optim_interval

Gets or sets the interval for parameter optimization.

The default value is 10. If it is set to 0, parameter optimization is not performed during training.

prop perplexity : float

Expand source code

@property
def perplexity(self) -> float:
    '''a perplexity of the model (read-only)'''
    return self._perplexity

The perplexity of the current model (read-only)
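The two likelihood-based properties are closely related: perplexity is conventionally defined as the exponential of the negative per-word log-likelihood. The sketch below illustrates that relation with a hypothetical helper, `perplexity_from_ll`, and a made-up log-likelihood value. It does not call tomotopy, and whether the library computes `perplexity` in exactly this way is an assumption.

```python
import math


def perplexity_from_ll(ll_per_word: float) -> float:
    """Conventional perplexity: exp of the negative per-word log-likelihood."""
    return math.exp(-ll_per_word)


# Made-up value standing in for `model.ll_per_word` (an assumption for illustration).
example_ll = -8.0
example_perplexity = perplexity_from_ll(example_ll)
print(example_perplexity)
```

A less negative log-likelihood gives a lower perplexity, so both numbers point in the same direction when comparing models trained on the same corpus.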

prop removed_top_words : List[str]

Expand source code

@property
def removed_top_words(self) -> List[str]:
    '''a `list` of `str` which is a word removed from the model if you set `rm_top` greater than 0 at initializing the model (read-only)'''
    return self._removed_top_words

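If the rm_top parameter was set to 1 or more when the model was created, this is the list of words that were excluded from the model because of their high frequency. (read-only)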

prop tw : int

Expand source code

@property
def tw(self) -> int:
    '''the term weighting scheme (read-only)'''
    return self._tw

The term weighting scheme of the current model (read-only)

prop used_vocab_df : List[int]

Expand source code

    @property
    def used_vocab_df(self) -> List[int]:
        '''a `list` of vocabulary document-frequencies which contains only vocabularies actually used in modeling (read-only)

.. versionadded:: 0.8.0'''
        return self._used_vocab_df

A list of the document frequencies of the vocabulary entries actually used in the model (read-only)

Added in version: 0.8.0

prop used_vocab_freq : List[int]

Expand source code

    @property
    def used_vocab_freq(self) -> List[int]:
        '''a `list` of vocabulary frequencies which contains only vocabularies actually used in modeling (read-only)

.. versionadded:: 0.8.0'''
        return self._used_vocab_freq

A list of the frequencies of the vocabulary entries actually used in the model (read-only)

Added in version: 0.8.0

prop used_vocab_weighted_freq : List[float]

Expand source code

    @property
    def used_vocab_weighted_freq(self) -> List[float]:
        '''a `list` of term-weighted vocabulary frequencies which contains only vocabularies actually used in modeling (read-only)

.. versionadded:: 0.12.1'''
        return self._used_vocab_weighted_freq

A list of the term-weighted frequencies of the vocabulary entries actually used in the model (read-only)

Added in version: 0.12.1

prop used_vocabs

Expand source code

    @property
    def used_vocabs(self):
        '''a dictionary, which contains only the vocabularies actually used in modeling, as the type `tomotopy.Dictionary` (read-only)

.. versionadded:: 0.8.0'''
        return self._used_vocabs

A dictionary of type tomotopy.Dictionary containing only the vocabulary entries actually used in the model (read-only)

Added in version: 0.8.0

prop vocab_df : List[int]

Expand source code

    @property
    def vocab_df(self) -> List[int]:
        '''a `list` of vocabulary document-frequencies which contains both vocabularies filtered by frequency and vocabularies actually used in modeling (read-only)

.. versionadded:: 0.8.0'''
        return self._vocab_df

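A list of document frequencies for the whole vocabulary, covering both the entries filtered out by frequency and the entries actually used in the model (read-only)

Added in version: 0.8.0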

prop vocab_freq : List[int]

Expand source code

@property
def vocab_freq(self) -> List[int]:
    '''a `list` of vocabulary frequencies which contains both vocabularies filtered by frequency and vocabularies actually used in modeling (read-only)'''
    return self._vocab_freq

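A list of frequencies for the whole vocabulary, covering both the entries filtered out by frequency and the entries actually used in the model (read-only)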

prop vocabs

Expand source code

@property
def vocabs(self):
    '''a dictionary, which contains both vocabularies filtered by frequency and vocabularies actually used in modeling, as the type `tomotopy.Dictionary` (read-only)'''
    return self._vocabs

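A dictionary of type tomotopy.Dictionary containing the whole vocabulary, covering both the entries filtered out by frequency and the entries actually used in the model (read-only)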

Methods

def add_corpus(self, corpus, transform=None) ‑> Corpus

Expand source code

    def add_corpus(self, corpus, transform=None) -> Corpus:
        '''.. versionadded:: 0.10.0

Add new documents into the model instance using `tomotopy.utils.Corpus` and return an instance of corpus that contains the inserted documents. 
This method should be called before calling the `tomotopy.models.LDAModel.train`.

Parameters
----------
corpus : tomotopy.utils.Corpus
    corpus that contains documents to be added
transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model
'''
        return self._add_corpus(corpus, transform)

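Added in version: 0.10.0

Adds new documents to the current model from a corpus and returns a new corpus consisting of the added documents. This method can only be used before LDAModel.train() is called.

Parameters

corpus : Corpus: A corpus consisting of the documents to be added to the topic model
transform : Callable[dict, dict]: A callable object that manipulates arbitrary keyword arguments for a specific topic model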

def add_doc(self, words, ignore_empty_words=True) ‑> int | None

Expand source code

    def add_doc(self, words, ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance and return an index of the inserted document. This method should be called before calling the `tomotopy.models.LDAModel.train`.

.. versionchanged:: 0.12.3

    A new argument `ignore_empty_words` was added.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, ignore_empty_words)

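Adds a new document to the current model and returns the index of the added document. This method can only be used before LDAModel.train() is called.

Changed in version: 0.12.3

The ignore_empty_words argument was added.

Parameters

words : Iterable[str]: An iterable of str listing each word of the document
ignore_empty_words : bool: If True, an empty words does not raise an exception and the method returns None instead.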

def copy(self) ‑> LDAModel

Expand source code

    def copy(self) -> 'LDAModel':
        '''.. versionadded:: 0.12.0

Return a new deep-copied instance of the current instance'''
        return self._copy(type(self))

Added in version: 0.12.0

Returns a new, deep-copied instance of the current instance.

def get_count_by_topics(self) ‑> List[int]

Expand source code

def get_count_by_topics(self) -> List[int]:
    '''Return the number of words allocated to each topic.'''
    return self._get_count_by_topics()

Returns the number of words allocated to each topic as a list.

def get_hash(self) ‑> int

Expand source code

def get_hash(self) -> int:
    return self._get_hash()

def get_topic_word_dist(self, topic_id, normalize=True) ‑> List[float]

Expand source code

    def get_topic_word_dist(self, topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int
    an integer in range [0, `k`) indicating the topic
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, normalize)

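Returns the word distribution of the topic topic_id. The returned value is a list of len(vocabs) fractional numbers giving the probability of each word in the topic.

Parameters

topic_id : int: An integer in the range [0, k) indicating the topic
normalize : bool: Added in version: 0.11.0

If True, returns a probability distribution that sums to 1. If False, returns the raw, unnormalized values.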
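To make the effect of normalize concrete, the sketch below rescales a list of raw weights so that it sums to 1. The helper `normalize_weights` and the sample numbers are hypothetical and do not call tomotopy; what the raw values returned with normalize=False actually contain (counts, smoothed counts, or something else) is not specified here.

```python
from typing import List


def normalize_weights(raw: List[float]) -> List[float]:
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(raw)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in raw]


# Made-up raw weights for a three-word vocabulary.
raw_topic_weights = [3.0, 1.0, 4.0]
dist = normalize_weights(raw_topic_weights)
print(dist)
```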

def get_topic_words(self, topic_id, top_n=10, return_id=False) ‑> List[Tuple[str, float]] | List[Tuple[str, int, float]]

Expand source code

    def get_topic_words(self, topic_id, top_n=10, return_id=False) -> Union[List[Tuple[str, float]], List[Tuple[str, int, float]]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`) tuples if return_id is False,
otherwise a `list` of (word:`str`, word_id:`int`, probability:`float`) tuples.

Parameters
----------
topic_id : int
    an integer in range [0, `k`), indicating the topic
top_n : int
    the number of words to be returned
return_id : bool
    If `True`, it returns the word IDs too.
'''
        return self._get_topic_words(topic_id, top_n, return_id)

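Returns the top_n words of the topic topic_id together with their probabilities. The return type is a list of (word:str, probability:float) tuples if return_id is False, otherwise a list of (word:str, word_id:int, probability:float) tuples.

Parameters

topic_id : int: An integer in the range [0, k) indicating the topic
top_n : int: The number of words to return
return_id : bool: If True, the word IDs are returned as well.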
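The relationship between a topic's word distribution and its top words can be sketched in plain Python. The helper `top_words` below is hypothetical and only mirrors the tuple shapes described above; tomotopy's own ranking and tie-breaking may differ.

```python
from typing import List, Sequence


def top_words(dist: Sequence[float], vocab: Sequence[str],
              top_n: int = 10, return_id: bool = False) -> List[tuple]:
    """Pick the top_n most probable words from a topic-word distribution."""
    # Sort word ids by descending probability and keep the first top_n.
    ranked = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)[:top_n]
    if return_id:
        return [(vocab[i], i, dist[i]) for i in ranked]
    return [(vocab[i], dist[i]) for i in ranked]


# Made-up vocabulary and distribution for illustration.
vocab = ["apple", "banana", "cherry"]
dist = [0.2, 0.5, 0.3]
print(top_words(dist, vocab, top_n=2))
print(top_words(dist, vocab, top_n=2, return_id=True))
```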

def get_word_forms(self, idx=-1)

Expand source code

def get_word_forms(self, idx = -1):
    return self._get_word_forms(idx)

def get_word_prior(self, word) ‑> List[float]

Expand source code

    def get_word_prior(self, word) -> List[float]:
        '''.. versionadded:: 0.6.0

Return word-topic prior for `word`. If there is no set prior for `word`, an empty list is returned.

Parameters
----------
word : str
    a word
'''
        return self._get_word_prior(word)

Added in version: 0.6.0

Returns the word-topic prior for word. If no prior has been set for it, an empty list is returned.

Parameters

word : str: A word

def infer(self, doc, iterations=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None) ‑> Tuple[List[float] | List[List[float]] | Corpus, List[float]]

Expand source code

    def infer(self, doc, iterations=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None) -> Tuple[Union[List[float], List[List[float]], Corpus], List[float]]:
        '''Return the inferred topic distribution from unseen `doc`s.

Parameters
----------
doc : Union[tomotopy.utils.Document, Iterable[tomotopy.utils.Document], tomotopy.utils.Corpus]
    an instance of `tomotopy.utils.Document` or a `list` of instances of `tomotopy.utils.Document` to be inferred by the model.
    It can be acquired from `tomotopy.models.LDAModel.make_doc` method.

    .. versionchanged:: 0.10.0

        Since version 0.10.0, `infer` can receive a raw corpus instance of `tomotopy.utils.Corpus`. 
        In this case, you don't need to call `make_doc`. `infer` would generate documents bound to the model, estimate its topic distributions and
        return a corpus containing generated documents as the result.
iterations : int
    an integer indicating the number of iterations used to estimate the distribution of topics of `doc`.
    The higher value will generate a more accurate result.
tolerance : float
    This parameter is not currently used.
workers : int
    an integer indicating the number of workers to perform samplings. 
    If `workers` is 0, the number of cores in the system will be used.
parallel : Union[int, tomotopy.ParallelScheme]
    .. versionadded:: 0.5.0
    
    the parallelism scheme for inference. The default value is ParallelScheme.DEFAULT, which means that tomotopy selects the best scheme for the model.
together : bool
    all `doc`s are inferred together in one process if True, otherwise each `doc` is inferred independently. Its default value is `False`.
transform : Callable[dict, dict]
    .. versionadded:: 0.10.0
    
    a callable object to manipulate arbitrary keyword arguments for a specific topic model. 
    Available when `doc` is given as an instance of `tomotopy.utils.Corpus`.

Returns
-------
result : Union[List[float], List[List[float]], tomotopy.utils.Corpus]
    If `doc` is given as a single `tomotopy.utils.Document`, `result` is a single `List[float]` indicating its topic distribution.
    
    If `doc` is given as a list of `tomotopy.utils.Document`s, `result` is a list of `List[float]` indicating topic distributions for each document.
    
    If `doc` is given as an instance of `tomotopy.utils.Corpus`, `result` is another instance of `tomotopy.utils.Corpus` which contains inferred documents.
    You can get topic distribution for each document using `tomotopy.utils.Document.get_topic_dist`.
log_ll : List[float]
    a list of log-likelihoods for each `doc`
'''
        return self._infer(doc, iterations, tolerance, workers, parallel, together, transform)

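Infers and returns the topic distribution of each new document in doc. The return type is (topic distribution of doc, log-likelihood) or (list of topic distributions of doc, log-likelihood).

Parameters

doc : Union[Document, Iterable[Document], Corpus]: An instance of Document, or a list of such instances, to be used for inference. These instances can be obtained through the LDAModel.make_doc() method.

Changed in version: 0.10.0

Since version 0.10.0, infer can take an instance of Corpus directly. In this case there is no need to call make_doc: infer generates documents bound to the model, estimates their topic distributions, and returns a Corpus containing the generated documents.
iterations : int: The number of iterations used to infer the topic distribution of doc. Higher values can give more accurate results.
tolerance : float: Currently unused
workers : int: The number of threads used for Gibbs sampling. If this value is 0, all available cores in the system are used.
parallel : Union[int, ParallelScheme]: Added in version: 0.5.0

The parallelism scheme for inference. The default value is ParallelScheme.DEFAULT, which lets tomotopy select the best scheme for the model.
together : bool: If True, all documents in doc are put into the model and inferred together. If False, each document is inferred separately. The default value is False.
transform : Callable[dict, dict]: Added in version: 0.10.0

A callable object that manipulates arbitrary keyword arguments for a specific topic model. Available only when doc is given as an instance of Corpus.

Returns

result : Union[List[float], List[List[float]], Corpus]

If doc is given as a single Document, result is a List[float] giving the topic distribution of the document.

If doc is given as a list of Documents, result is a list of List[float] giving the topic distribution of each document.

If doc is given as an instance of Corpus, result is a new instance of Corpus containing the inferred documents. Use Document.get_topic_dist() to get the topic distribution of each document.

log_ll : List[float]

A list of the log-likelihoods of each document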
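Once a topic distribution has been obtained for a single document, a common next step is to rank its topics. The sketch below does this in plain Python with a hypothetical helper, `dominant_topics`, and a made-up distribution standing in for the first element of the tuple returned by infer; it does not call tomotopy.

```python
from typing import List, Sequence, Tuple


def dominant_topics(topic_dist: Sequence[float], n: int = 3) -> List[Tuple[int, float]]:
    """Return the n most probable (topic_id, probability) pairs."""
    order = sorted(range(len(topic_dist)), key=lambda t: topic_dist[t], reverse=True)
    return [(t, topic_dist[t]) for t in order[:n]]


# Made-up distribution over k=4 topics (an assumption for illustration).
inferred = [0.1, 0.6, 0.05, 0.25]
print(dominant_topics(inferred, n=2))
```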

def make_doc(self, words) ‑> Document

Expand source code

    def make_doc(self, words) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
'''
        return self._make_doc(words)

Returns a new Document instance for an unseen document built from words. This instance can be used with the LDAModel.infer() method.

Parameters

words : Iterable[str]: An iterable of str listing each word of the document

def save(self, filename: str, full=True) ‑> None

Expand source code

    def save(self, filename: str, full=True) -> None:
        '''Save the model instance to file `filename`. Return `None`.

If `full` is `True`, the model is saved with all its documents and state. Use the full model if you want to continue training later.
If `False`, only the topic parameters of the model are saved. Such a model can only be used for inference on unseen documents.

.. versionadded:: 0.6.0

Since version 0.6.0, the model file format has been changed. 
Thus model files saved in version 0.6.0 or later are not compatible with versions prior to 0.5.2.
'''
        extra_data = pickle.dumps(self.init_params)
        return self._save(filename, extra_data, full)

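Saves the current model to the file at path filename. Returns None.

If full is True, the entire state of the model is saved to the file. To reload the saved model and continue training (train), save it with full = True. If False, only the parameters related to topic inference are saved. In this case the file is smaller, but further training is not possible and the model can only be used to infer topics for new documents (infer).

Added in version: 0.6.0

Since version 0.6.0, the model file format has changed. Model files saved in version 0.6.0 or later are therefore not compatible with versions prior to 0.5.2.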

def saves(self, full=True) ‑> bytes

Expand source code

    def saves(self, full=True) -> bytes:
        '''.. versionadded:: 0.11.0

Serialize the model instance into `bytes` object and return it. The arguments work the same as `tomotopy.models.LDAModel.save`.'''
        extra_data = pickle.dumps(self.init_params)
        return self._saves(extra_data, full)

Added in version: 0.11.0

Serializes the current model into bytes and returns it. The arguments work the same as in LDAModel.save().

def set_word_prior(self, word, prior) ‑> None

Expand source code

    def set_word_prior(self, word, prior) -> None:
        '''.. versionadded:: 0.6.0

Set word-topic prior. This method should be called before calling the `tomotopy.models.LDAModel.train`.

Parameters
----------
word : str
    a word to be set
prior : Union[Iterable[float], Dict[int, float]]
    topic distribution of `word` whose length is equal to `tomotopy.models.LDAModel.k`

Note
----
Since version 0.12.6, this method can accept a dictionary type parameter as well as a list type parameter for `prior`.
The key of the dictionary is the topic id and the value is the prior of the topic. If the prior of a topic is not set, the default value is set to `eta` parameter of the model.
```python
>>> model = tp.LDAModel(k=3, eta=0.01)
>>> model.set_word_prior('apple', [0.01, 0.9, 0.01])
>>> model.set_word_prior('apple', {1: 0.9}) # same effect as above
```
'''
        return self._set_word_prior(word, prior)

Added in version: 0.6.0

Sets the word-topic prior. This method can only be used before LDAModel.train() is called.

Parameters

word : str: The word to set the prior for
prior : Union[Iterable[float], Dict[int, float]]: The topic distribution of word. When given as a list, its length must equal LDAModel.k.

Note

Since version 0.12.6, this method accepts a dictionary as well as a list for prior. The keys of the dictionary are topic ids and the values are the priors of those topics. If the prior of a topic is not set, it defaults to the eta parameter of the model.

>>> model = tp.LDAModel(k=3, eta=0.01)
>>> model.set_word_prior('apple', [0.01, 0.9, 0.01])
>>> model.set_word_prior('apple', {1: 0.9}) # same effect as above
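The equivalence between the list form and the dictionary form can be sketched in plain Python. The helper `expand_word_prior` below is hypothetical and only mirrors the documented rule that topics missing from the dictionary fall back to eta; it is not tomotopy's implementation.

```python
from typing import Dict, Iterable, List, Union


def expand_word_prior(prior: Union[Iterable[float], Dict[int, float]],
                      k: int, eta: float) -> List[float]:
    """Expand a list or dict prior into a dense list of length k."""
    if isinstance(prior, dict):
        for topic_id in prior:
            if not 0 <= topic_id < k:
                raise ValueError(f"topic id {topic_id} is outside [0, {k})")
        # Topics that are not mentioned fall back to eta.
        return [float(prior.get(t, eta)) for t in range(k)]
    values = [float(p) for p in prior]
    if len(values) != k:
        raise ValueError(f"expected {k} values, got {len(values)}")
    return values


print(expand_word_prior({1: 0.9}, k=3, eta=0.01))
print(expand_word_prior([0.01, 0.9, 0.01], k=3, eta=0.01))
```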

def summary(self, initial_hp=True, params=True, topic_word_top_n=5, file=None, flush=False) ‑> None

Expand source code

    def summary(self, initial_hp=True, params=True, topic_word_top_n=5, file=None, flush=False) -> None:
        '''.. versionadded:: 0.9.0

Print human-readable description of the current model

Parameters
----------
initial_hp : bool
    whether to show the initial parameters at model creation
params : bool
    whether to show the current parameters of the model
topic_word_top_n : int
    the number of words by topic to display
file
    a file-like object (stream), default is `sys.stdout`
flush : bool
    whether to forcibly flush the stream
'''
        flush = flush or False

        print('<Basic Info>', file=file)
        self._summary_basic_info(file=file)
        print('|', file=file)
        print('<Training Info>', file=file)
        self._summary_training_info(file=file)
        print('|', file=file)

        if initial_hp:
            print('<Initial Parameters>', file=file)
            self._summary_initial_params_info(file=file)
            print('|', file=file)
        
        if params:
            print('<Parameters>', file=file)
            self._summary_params_info(file=file)
            print('|', file=file)

        if topic_word_top_n > 0:
            print('<Topics>', file=file)
            self._summary_topics_info(file=file, topic_word_top_n=topic_word_top_n)
            print('|', file=file)

        print(file=file, flush=flush)

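Added in version: 0.9.0

Prints a human-readable summary of the current model.

Parameters

initial_hp : bool: Whether to show the initial parameters given at model creation
params : bool: Whether to show the current parameters of the model
topic_word_top_n : int: The number of words to display per topic
file: The target to print the summary to; the default is sys.stdout
flush : bool: Whether to forcibly flush the output stream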

def train(self, iterations=10, workers=0, parallel=0, freeze_topics=False, callback_interval=10, callback=None, show_progress=False) ‑> None

Expand source code

    def train(self, iterations=10, workers=0, parallel=0, freeze_topics=False, callback_interval=10, callback=None, show_progress=False) -> None:
        '''Train the model using Gibbs-sampling with `iterations` iterations. Return `None`. 
After calling this method, you can no longer call `tomotopy.models.LDAModel.add_doc` or `tomotopy.models.LDAModel.set_word_prior`.

Parameters
----------
iterations : int
    the number of iterations of Gibbs-sampling
workers : int
    an integer indicating the number of workers to perform samplings. 
    If `workers` is 0, the number of cores in the system will be used.
parallel : Union[int, tomotopy.ParallelScheme]
    .. versionadded:: 0.5.0
    
    the parallelism scheme for training. The default value is `tomotopy.ParallelScheme.DEFAULT`, which means that tomotopy selects the best scheme for the model.
freeze_topics : bool
    .. versionadded:: 0.10.1

    prevents creating a new topic when training. Only valid for `tomotopy.models.HLDAModel`
callback_interval : int
    .. versionadded:: 0.12.6

    the interval of calling `callback` function. If `callback_interval` <= 0, `callback` function is called at the beginning and the end of training.
callback : Callable[[tomotopy.models.LDAModel, int, int], None]
    .. versionadded:: 0.12.6

    a callable object which is called every `callback_interval` iterations. 
    It receives three arguments: the current model, the current number of iterations, and the total number of iterations.
show_progress : bool
    .. versionadded:: 0.12.6

    If `True`, it shows progress bar during training using `tqdm` package.
'''
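        if show_progress:
            if callback is None:
                # No user callback: report progress only.
                callback = LDAModel._show_progress
            else:
                # Keep the user's callback under its own name so the wrapper does not call itself.
                user_callback = callback
                def _multiple_callbacks(*args):
                    user_callback(*args)
                    LDAModel._show_progress(*args)
                callback = _multiple_callbacks
        return self._train(iterations, workers, parallel, freeze_topics, callback_interval, callback)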

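Trains the current model by running Gibbs sampling for iterations iterations. Returns None. After this method has been called, new training documents can no longer be added to the model with LDAModel.add_doc().

Parameters

iterations : int: The number of Gibbs sampling iterations
workers : int: The number of threads used for Gibbs sampling. If this value is 0, all available cores in the system are used.
parallel : Union[int, ParallelScheme]: Added in version: 0.5.0

The parallelism scheme for training. The default value is ParallelScheme.DEFAULT, which lets tomotopy select the best scheme for the model.
freeze_topics : bool: Added in version: 0.10.1

Prevents new topics from being created during training. This parameter is valid only for HLDAModel.
callback_interval : int: Added in version: 0.12.6

The interval at which the callback function is called. If callback_interval <= 0, the callback function is called only at the beginning and the end of training.
callback : Callable[[LDAModel, int, int], None]: Added in version: 0.12.6

A callable object that is called every callback_interval iterations during training. It receives three arguments: the current model, the current number of iterations, and the total number of iterations.
show_progress : bool: Added in version: 0.12.6

If True, training progress is displayed using the tqdm package.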
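Since callback receives the model, the current iteration, and the total number of iterations, several callbacks can be chained into one. The sketch below shows one way to do that in plain Python. `combine_callbacks`, `log_step`, and `report_progress` are hypothetical names, the model argument is replaced by None, and the iteration values passed in the loop are made up rather than taken from tomotopy's actual call schedule.

```python
from typing import Callable, List, Optional, Tuple

Callback = Callable[[object, int, int], None]


def combine_callbacks(*callbacks: Optional[Callback]) -> Callback:
    """Return one callback that invokes every non-None callback in order."""
    # Capture the callbacks in a separate list so the wrapper never refers to itself.
    active = [cb for cb in callbacks if cb is not None]

    def combined(model: object, current: int, total: int) -> None:
        for cb in active:
            cb(model, current, total)

    return combined


seen: List[Tuple[int, int]] = []
progress_lines: List[str] = []


def log_step(model: object, current: int, total: int) -> None:
    seen.append((current, total))


def report_progress(model: object, current: int, total: int) -> None:
    progress_lines.append(f"{current}/{total}")


callback = combine_callbacks(log_step, None, report_progress)
for step in (0, 10, 20):  # made-up schedule, for illustration only
    callback(None, step, 20)
print(progress_lines)
```

Binding the wrapped callbacks to their own names matters: a wrapper that is assigned back to the variable it reads from would end up calling itself.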

class LLDAModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

Expand source code

class LLDAModel(_LLDAModel, LDAModel):
    '''This type provides Labeled LDA(L-LDA) topic model and its implementation is based on the following papers:
        
> * Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 (pp. 248-256). Association for Computational Linguistics.

.. versionadded:: 0.3.0

.. deprecated:: 0.11.0
    Use `tomotopy.models.PLDAModel` instead.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
seed : int
    random seed. The default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            alpha,
            eta,
            seed,
            corpus,
            transform,
        )
    
    def add_doc(self, words, labels=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `labels` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, labels, ignore_empty_words)
    
    def make_doc(self, words, labels=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `labels` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
'''
        return self._make_doc(words, labels)
    
    def get_topic_words(self, topic_id, top_n=10, return_id=False) -> Union[List[Tuple[str, float]], List[Tuple[int, str, float]]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`) if `return_id` is False, or a `list` of (word_id:`int`, word:`str`, probability:`float`) if `return_id` is True.

Parameters
----------
topic_id : int
    Integers in the range [0, `l`), where `l` is the number of total labels, represent a topic that belongs to the corresponding label.
    The label name can be found by looking up `tomotopy.models.LLDAModel.topic_label_dict`.
    Integers in the range [`l`, `k`) represent a latent topic which does not belong to any label.
top_n : int
    the number of top words to return
return_id : bool
    If `True`, it returns a list of (word_id, word, probability) where `word_id` is an integer indicating the id of the word in the model's vocabulary. Otherwise, it returns a list of (word, probability).
'''
        return self._get_topic_words(topic_id, top_n, return_id)
    
    @property
    def topic_label_dict(self):
        '''a dictionary of topic labels in type `tomotopy.Dictionary` (read-only)'''
        return self._topic_label_dict
    
    def _summary_basic_info(self, file):
        LDAModel._summary_basic_info(self, file)
        label_cnt = Counter(l for doc in self.docs for l, _ in doc.labels)
        print('| Label of docs and its distribution', file=file)
        for lb in self.topic_label_dict:
            print('|  {}: {}'.format(lb, label_cnt.get(lb, 0)), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        for k in range(self.k):
            label = ('Label {} (#{})'.format(self.topic_label_dict[k], k) 
                if k < len(self.topic_label_dict) else '#{}'.format(k))
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('| {} ({}) : {}'.format(label, topic_cnt[k], words), file=file)

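This type provides an implementation of the Labeled LDA (L-LDA) topic model. The main algorithm is based on the following paper:

Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1 (pp. 248-256). Association for Computational Linguistics.

Added in version: 0.3.0

Deprecated since version: 0.11.0

Use PLDAModel instead.

Parameters

tw : Union[int, TermWeight]: A TermWeight enumeration value giving the term weighting scheme. The default value is TermWeight.ONE.
min_cf : int: Minimum collection frequency of words. Words whose frequency across all documents is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

Minimum document frequency of words. Words that appear in fewer documents than min_df are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: The number of top-frequency words to remove. If overly common words appear at the top of the topic model results and you want to remove them, set this value to 1 or more. The default value is 0, which means no top-frequency words are removed.
k : int: The number of topics, an integer in the range 1 ~ 32767.
alpha : Union[float, Iterable[float]]: Hyperparameter of the document-topic Dirichlet distribution, given as a single float for a symmetric prior or as a list of float of length k for an asymmetric prior.
eta : float: Hyperparameter of the topic-word Dirichlet distribution
seed : int: random seed. The default value is a random integer generated by C++'s std::random_device{}. Even with a fixed seed, results may differ between runs if train is called with workers set to 2 or more, because multithreading introduces nondeterminism.
corpus : Corpus: Added in version: 0.6.0

The set of documents to be added to the topic model.
transform : Callable[dict, dict]: Added in version: 0.6.0

A callable object that manipulates arbitrary keyword arguments for a specific topic model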
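The three vocabulary filters, min_cf, min_df, and rm_top, can be illustrated with a small stand-alone sketch. The helper `filter_vocab` below is hypothetical and does not call tomotopy; in particular, applying rm_top after the frequency thresholds and breaking ties alphabetically are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def filter_vocab(docs: Sequence[Sequence[str]], min_cf: int = 0,
                 min_df: int = 0, rm_top: int = 0) -> Tuple[List[str], List[str]]:
    """Return (used words, removed top words) after frequency filtering."""
    # Collection frequency: total occurrences across all documents.
    cf = Counter(w for doc in docs for w in doc)
    # Document frequency: number of documents containing the word.
    df = Counter(w for doc in docs for w in set(doc))
    kept = [w for w in cf if cf[w] >= min_cf and df[w] >= min_df]
    # Most frequent first; alphabetical order breaks ties (an assumption).
    kept.sort(key=lambda w: (-cf[w], w))
    return kept[rm_top:], kept[:rm_top]


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "the"],
]
used, removed_top = filter_vocab(docs, min_cf=2, min_df=2, rm_top=1)
print(used, removed_top)
```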

Ancestors

tomotopy._LLDAModel
LDAModel
tomotopy._LDAModel

Instance variables

prop topic_label_dict

Expand source code

@property
def topic_label_dict(self):
    '''a dictionary of topic labels in type `tomotopy.Dictionary` (read-only)'''
    return self._topic_label_dict

A dictionary of topic labels of type tomotopy.Dictionary (read-only)

Methods

def add_doc(self, words, labels=[], ignore_empty_words=True) ‑> int | None

Expand source code

    def add_doc(self, words, labels=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `labels` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, labels, ignore_empty_words)

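Adds a new document with labels to the current model and returns the index of the added document.

Parameters

words : Iterable[str]: An iterable of str listing each word of the document
labels : Iterable[str]: The list of labels of the document
ignore_empty_words : bool: If True, an empty words does not raise an exception and the method returns None instead.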

def get_topic_words(self, topic_id, top_n=10, return_id=False) ‑> List[Tuple[str, float]] | List[Tuple[int, str, float]]

Expand source code

    def get_topic_words(self, topic_id, top_n=10, return_id=False) -> Union[List[Tuple[str, float]], List[Tuple[int, str, float]]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`) if `return_id` is False, or a `list` of (word_id:`int`, word:`str`, probability:`float`) if `return_id` is True.

Parameters
----------
topic_id : int
    Integers in the range [0, `l`), where `l` is the number of total labels, represent a topic that belongs to the corresponding label.
    The label name can be found by looking up `tomotopy.models.LLDAModel.topic_label_dict`.
    Integers in the range [`l`, `k`) represent a latent topic which does not belong to any label.
top_n : int
    the number of top words to return
return_id : bool
    If `True`, it returns a list of (word_id, word, probability) where `word_id` is an integer indicating the id of the word in the model's vocabulary. Otherwise, it returns a list of (word, probability).
'''
        return self._get_topic_words(topic_id, top_n, return_id)

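Returns the top_n words of the topic topic_id together with their probabilities. The return type is a list of (word:str, probability:float) tuples if return_id is False, or a list of (word_id:int, word:str, probability:float) tuples if return_id is True.

Parameters

topic_id : int: With l denoting the total number of labels, integers in the range [0, l) indicate the topic that belongs to the corresponding label. The label name of such a topic can be found by looking up LLDAModel.topic_label_dict. Integers in the range [l, k) indicate a latent topic that does not belong to any label.
top_n : int: The number of top words to return
return_id : bool: If True, the id of each word in the model's vocabulary is returned as well.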
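The split of topic ids into labeled and latent ranges can be sketched as a small lookup. The helper `describe_topic` and the sample labels below are hypothetical; a plain list stands in for topic_label_dict, and tomotopy is not called.

```python
from typing import Sequence


def describe_topic(topic_id: int, labels: Sequence[str], k: int) -> str:
    """Map a topic id to its label name, or mark it as a latent topic."""
    if not 0 <= topic_id < k:
        raise ValueError(f"topic id {topic_id} is outside [0, {k})")
    if topic_id < len(labels):
        # Ids in [0, l) belong to the label at the same position.
        return f"label:{labels[topic_id]}"
    # Ids in [l, k) are latent topics without a label.
    return f"latent:{topic_id}"


labels = ["sports", "politics"]  # made-up labels, so l = 2
k = 4
print([describe_topic(t, labels, k) for t in range(k)])
```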

def make_doc(self, words, labels=[]) ‑> Document

Expand source code

    def make_doc(self, words, labels=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `labels` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
'''
        return self._make_doc(words, labels)

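Returns a new Document instance for an unseen document built from words and labels. This instance can be passed to the LDAModel.infer() method.

Parameters

words : Iterable[str]: an iterable of str listing each word of the document
labels : Iterable[str]: the list of labels of the document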

Inherited members

LDAModel:
- add_corpus
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class MGLDAModel (tw='one', min_cf=0, min_df=0, rm_top=0, k_g=1, k_l=1, t=3, alpha_g=0.1, alpha_l=0.1, alpha_mg=0.1, alpha_ml=0.1, eta_g=0.01, eta_l=0.01, gamma=0.1, seed=None, corpus=None, transform=None)

Expand source code

class MGLDAModel(_MGLDAModel, LDAModel):
    '''This type provides Multi Grain Latent Dirichlet Allocation(MG-LDA) topic model and its implementation is based on the following papers:

> * Titov, I., & McDonald, R. (2008, April). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web (pp. 111-120). ACM.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k_g=1, k_l=1, t=3, alpha_g=0.1, alpha_l=0.1, alpha_mg=0.1, alpha_ml=0.1, eta_g=0.01, eta_l=0.01, gamma=0.1, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    .. versionadded:: 0.2.0
    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k_g : int
    the number of global topics between 1 ~ 32767
k_l : int
    the number of local topics between 1 ~ 32767
t : int
    the size of sentence window
alpha_g : float
    hyperparameter of Dirichlet distribution for document-global topic
alpha_l : float
    hyperparameter of Dirichlet distribution for document-local topic
alpha_mg : float
    hyperparameter of Dirichlet distribution for global-local selection (global coef)
alpha_ml : float
    hyperparameter of Dirichlet distribution for global-local selection (local coef)
eta_g : float
    hyperparameter of Dirichlet distribution for global topic-word
eta_l : float
    hyperparameter of Dirichlet distribution for local topic-word
gamma : float
    hyperparameter of Dirichlet distribution for sentence-window
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k_g,
            k_l,
            t,
            alpha_g,
            alpha_l,
            alpha_mg,
            alpha_ml,
            eta_g,
            eta_l,
            gamma,
            seed,
            corpus,
            transform,
        )
    
    def add_doc(self, words, delimiter='.', ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
delimiter : str
    a sentence separator. `words` will be separated by this value into sentences.
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, delimiter, ignore_empty_words)
    
    def make_doc(self, words, delimiter='.') -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
delimiter : str
    a sentence separator. `words` will be separated by this value into sentences.
'''
        return self._make_doc(words, delimiter)
    
    def get_topic_words(self, topic_id, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int 
    A number in range [0, `k_g`) indicates a global topic and 
    a number in range [`k_g`, `k_g` + `k_l`) indicates a local topic.
'''
        return self._get_topic_words(topic_id, top_n)
    
    def get_topic_word_dist(self, topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int 
    A number in range [0, `k_g`) indicates a global topic and 
    a number in range [`k_g`, `k_g` + `k_l`) indicates a local topic.
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, normalize)
    
    @property
    def k_g(self) -> int:
        '''the hyperparameter k_g (read-only)'''
        return self._k
    
    @property
    def k_l(self) -> int:
        '''the hyperparameter k_l (read-only)'''
        return self._k_l
    
    @property
    def gamma(self) -> float:
        '''the hyperparameter gamma (read-only)'''
        return self._gamma
    
    @property
    def t(self) -> int:
        '''the hyperparameter t (read-only)'''
        return self._t
    
    @property
    def alpha_g(self) -> float:
        '''the hyperparameter alpha_g (read-only)'''
        return self._alpha
    
    @property
    def alpha_l(self) -> float:
        '''the hyperparameter alpha_l (read-only)'''
        return self._alpha_l
    
    @property
    def alpha_mg(self) -> float:
        '''the hyperparameter alpha_mg (read-only)'''
        return self._alpha_mg
    
    @property
    def alpha_ml(self) -> float:
        '''the hyperparameter alpha_ml (read-only)'''
        return self._alpha_ml
    
    @property
    def eta_g(self) -> float:
        '''the hyperparameter eta_g (read-only)'''
        return self._eta
    
    @property
    def eta_l(self) -> float:
        '''the hyperparameter eta_l (read-only)'''
        return self._eta_l

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        print('| Global Topic', file=file)
        for k in range(self.k):
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('|  #{} ({}) : {}'.format(k, topic_cnt[k], words), file=file)
        print('| Local Topic', file=file)
        for k in range(self.k_l):
            words = ' '.join(w for w, _ in self.get_topic_words(k + self.k, top_n=topic_word_top_n))
            print('|  #{} ({}) : {}'.format(k, topic_cnt[k + self.k], words), file=file)

This type provides an implementation of the Multi Grain Latent Dirichlet Allocation (MG-LDA) topic model. The main algorithm is based on the following paper:

Titov, I., & McDonald, R. (2008, April). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web (pp. 111-120). ACM.

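Parameters

tw : Union[int, TermWeight]: the term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

minimum document frequency of words. Words that appear in fewer than min_df documents are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: Added in version: 0.2.0

the number of top-frequency words to be removed. If overly common words show up at the top of the topic results and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k_g : int: the number of global topics, an integer between 1 and 32767
k_l : int: the number of local topics, an integer between 1 and 32767
t : int: the size of the sentence window
alpha_g : float: hyperparameter of the Dirichlet distribution for document-global topic
alpha_l : float: hyperparameter of the Dirichlet distribution for document-local topic
alpha_mg : float: hyperparameter of the Dirichlet distribution for global-local selection (global coefficient)
alpha_ml : float: hyperparameter of the Dirichlet distribution for global-local selection (local coefficient)
eta_g : float: hyperparameter of the Dirichlet distribution for global topic-word
eta_l : float: hyperparameter of the Dirichlet distribution for local topic-word
gamma : float: hyperparameter of the Dirichlet distribution for sentence-window
seed : int: random seed. The default value is a random number from std::random_device{} in C++. Even with a fixed seed, results may differ between runs when train is called with workers set to 2 or more, because of the nondeterminism introduced by multithreading.
corpus : Corpus: Added in version: 0.6.0

a set of documents to be added to the topic model
transform : Callable[dict, dict]: Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model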

Ancestors

tomotopy._MGLDAModel
LDAModel
tomotopy._LDAModel

Instance variables

prop alpha_g : float

Expand source code

@property
def alpha_g(self) -> float:
    '''the hyperparameter alpha_g (read-only)'''
    return self._alpha

The hyperparameter alpha_g (read-only)

prop alpha_l : float

Expand source code

@property
def alpha_l(self) -> float:
    '''the hyperparameter alpha_l (read-only)'''
    return self._alpha_l

The hyperparameter alpha_l (read-only)

prop alpha_mg : float

Expand source code

@property
def alpha_mg(self) -> float:
    '''the hyperparameter alpha_mg (read-only)'''
    return self._alpha_mg

The hyperparameter alpha_mg (read-only)

prop alpha_ml : float

Expand source code

@property
def alpha_ml(self) -> float:
    '''the hyperparameter alpha_ml (read-only)'''
    return self._alpha_ml

The hyperparameter alpha_ml (read-only)

prop eta_g : float

Expand source code

@property
def eta_g(self) -> float:
    '''the hyperparameter eta_g (read-only)'''
    return self._eta

The hyperparameter eta_g (read-only)

prop eta_l : float

Expand source code

@property
def eta_l(self) -> float:
    '''the hyperparameter eta_l (read-only)'''
    return self._eta_l

The hyperparameter eta_l (read-only)

prop gamma : float

Expand source code

@property
def gamma(self) -> float:
    '''the hyperparameter gamma (read-only)'''
    return self._gamma

The hyperparameter gamma (read-only)

prop k_g : int

Expand source code

@property
def k_g(self) -> int:
    '''the hyperparameter k_g (read-only)'''
    return self._k

The hyperparameter k_g (read-only)

prop k_l : int

Expand source code

@property
def k_l(self) -> int:
    '''the hyperparameter k_l (read-only)'''
    return self._k_l

The hyperparameter k_l (read-only)

prop t : int

Expand source code

@property
def t(self) -> int:
    '''the hyperparameter t (read-only)'''
    return self._t

The hyperparameter t (read-only)

Methods

def add_doc(self, words, delimiter='.', ignore_empty_words=True) ‑> int | None

Expand source code

    def add_doc(self, words, delimiter='.', ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
delimiter : str
    a sentence separator. `words` will be separated by this value into sentences.
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, delimiter, ignore_empty_words)

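Adds a new document to the current model and returns the index of the added document.

Parameters

words : Iterable[str]: an iterable of str listing each word of the document
delimiter : str: a sentence separator. words is split into sentences at this value.
ignore_empty_words : bool: if True, an empty words does not raise an exception and makes the method return None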
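The effect of delimiter can be pictured with a small pure-Python sketch. It assumes that the delimiter appears in words as its own token and is dropped once it has closed a sentence. `split_into_sentences` is a hypothetical helper for illustration only and may differ from what the library does internally, for example in how empty sentences are handled.

```python
from typing import Iterable, List


def split_into_sentences(words: Iterable[str], delimiter: str = '.') -> List[List[str]]:
    """Hypothetical helper: group tokens into sentences at each delimiter token.

    Assumption: the delimiter is a standalone token and is not kept.
    Empty sentences (e.g. from consecutive delimiters) are discarded here.
    """
    sentences: List[List[str]] = [[]]
    for word in words:
        if word == delimiter:
            sentences.append([])  # start a new sentence
        else:
            sentences[-1].append(word)
    return [sentence for sentence in sentences if sentence]
```

Under this assumption, a tokenized review such as `['good', 'food', '.', 'slow', 'service', '.']` would be treated as two sentences, which is the unit the sentence window t slides over.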

def get_topic_word_dist(self, topic_id, normalize=True) ‑> List[float]

Expand source code

    def get_topic_word_dist(self, topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the topic `topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current topic.

Parameters
----------
topic_id : int 
    A number in range [0, `k_g`) indicates a global topic and 
    a number in range [`k_g`, `k_g` + `k_l`) indicates a local topic.
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(topic_id, normalize)

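Returns the word distribution of the topic topic_id. The returned value is a list of len(vocabs) floats giving the probability of each word in the current topic.

Parameters

topic_id : int: an integer in the range [0, k_g) refers to a global topic, and an integer in the range [k_g, k_g + k_l) refers to a local topic.
normalize : bool: Added in version: 0.11.0

If True, returns a probability distribution that sums to 1. If False, returns the raw, unnormalized values.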
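The relationship between the two settings of normalize can be sketched as follows. This assumes the normalized result is simply the raw values divided by their sum; `normalize_weights` is a hypothetical helper, not a tomotopy function.

```python
from typing import List, Sequence


def normalize_weights(raw: Sequence[float]) -> List[float]:
    """Hypothetical helper: rescale raw topic-word values so they sum to 1."""
    total = sum(raw)
    if total <= 0:
        raise ValueError("raw values must have a positive sum")
    return [value / total for value in raw]
```

Under that assumption, applying the helper to the output of get_topic_word_dist(topic_id, normalize=False) would be expected to approximate the output with normalize=True.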

def get_topic_words(self, topic_id, top_n=10) ‑> List[Tuple[str, float]]

Expand source code

    def get_topic_words(self, topic_id, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int 
    A number in range [0, `k_g`) indicates a global topic and 
    a number in range [`k_g`, `k_g` + `k_l`) indicates a local topic.
'''
        return self._get_topic_words(topic_id, top_n)

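Returns the top_n words of the topic topic_id together with their probabilities. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int: an integer in the range [0, k_g) refers to a global topic, and an integer in the range [k_g, k_g + k_l) refers to a local topic.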
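Because global and local topics share one id space, it can help to decode a topic_id back into its kind and its position within that kind. The sketch below follows the ranges stated above; `mglda_topic_kind` is a hypothetical helper and not part of the library.

```python
from typing import Tuple


def mglda_topic_kind(topic_id: int, k_g: int, k_l: int) -> Tuple[str, int]:
    """Hypothetical helper: decode an MG-LDA topic id.

    Returns ('global', i) for ids in [0, k_g) and
    ('local', i) for ids in [k_g, k_g + k_l), where i is the
    zero-based index within the global or local topic group.
    """
    if 0 <= topic_id < k_g:
        return ('global', topic_id)
    if k_g <= topic_id < k_g + k_l:
        return ('local', topic_id - k_g)
    raise IndexError("topic_id must be in the range [0, k_g + k_l)")
```

For a model with k_g=3 and k_l=2, id 3 is the first local topic, which matches how the summary code in the class source offsets local topics by the number of global topics.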

def make_doc(self, words, delimiter='.') ‑> Document

Expand source code

    def make_doc(self, words, delimiter='.') -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
delimiter : str
    a sentence separator. `words` will be separated by this value into sentences.
'''
        return self._make_doc(words, delimiter)

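Returns a new Document instance for an unseen document built from words. This instance can be passed to the LDAModel.infer() method.

Parameters

words : Iterable[str]: an iterable of str listing each word of the document
delimiter : str: a sentence separator. words is split into sentences at this value.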

Inherited members

LDAModel:
- add_corpus
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class PAModel (tw='one', min_cf=0, min_df=0, rm_top=0, k1=1, k2=1, alpha=0.1, subalpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

Expand source code

class PAModel(_PAModel, LDAModel):
    '''This type provides Pachinko Allocation(PA) topic model and its implementation is based on the following papers:

> * Li, W., & McCallum, A. (2006, June). Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning (pp. 577-584). ACM.'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k1=1, k2=1, alpha=0.1, subalpha=0.1, eta=0.01, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    .. versionadded:: 0.2.0
    
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k1 : int
    the number of super topics between 1 ~ 32767
k2 : int
    the number of sub topics between 1 ~ 32767
alpha : Union[float, Iterable[float]]
    initial hyperparameter of Dirichlet distribution for document-super topic, given as a single `float` in case of symmetric prior and as a list with length `k1` of `float` in case of asymmetric prior.
subalpha : Union[float, Iterable[float]]
    .. versionadded:: 0.11.0

    initial hyperparameter of Dirichlet distribution for super-sub topic, given as a single `float` in case of symmetric prior and as a list with length `k2` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for sub topic-word
seed : int
    random seed. default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k1,
            k2,
            alpha,
            subalpha,
            eta,
            seed,
            corpus,
            transform,
        )

    def get_topic_words(self, sub_topic_id, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the sub topic `sub_topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
sub_topic_id : int
    indicating the sub topic, in range [0, `k2`)
'''
        return self._get_topic_words(sub_topic_id, top_n)
    
    def get_topic_word_dist(self, sub_topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the sub topic `sub_topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current sub topic.

Parameters
----------
sub_topic_id : int
    indicating the sub topic, in range [0, `k2`)
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(sub_topic_id, normalize)
    
    def get_sub_topics(self, super_topic_id, top_n=10) -> List[Tuple[int, float]]:
        '''.. versionadded:: 0.1.4

Return the `top_n` sub topics and their probabilities in the super topic `super_topic_id`.
The return type is a `list` of (subtopic:`int`, probability:`float`).

Parameters
----------
super_topic_id : int
    indicating the super topic, in range [0, `k1`)
'''
        return self._get_sub_topics(super_topic_id, top_n)
    
    def get_sub_topic_dist(self, super_topic_id, normalize=True) -> List[float]:
        '''Return a distribution of the sub topics in a super topic `super_topic_id`.
The returned value is a `list` that has `k2` fraction numbers indicating probabilities for each sub topic in the current super topic.

Parameters
----------
super_topic_id : int
    indicating the super topic, in range [0, `k1`)
'''
        return self._get_sub_topic_dist(super_topic_id, normalize)
    
    def infer(self, doc, iterations=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None) -> Tuple[Union[Tuple[List[float], List[float]], List[Tuple[List[float], List[float]]], Corpus], List[float]]:
        '''.. versionadded:: 0.5.0

Return the inferred topic distribution and sub-topic distribution from unseen `doc`s.

Parameters
----------
doc : Union[tomotopy.utils.Document, Iterable[tomotopy.utils.Document], tomotopy.utils.Corpus]
    an instance of `tomotopy.utils.Document` or a `list` of instances of `tomotopy.utils.Document` to be inferred by the model.
    It can be acquired from `tomotopy.models.LDAModel.make_doc` method.

    .. versionchanged:: 0.10.0

        Since version 0.10.0, `infer` can receive a raw corpus instance of `tomotopy.utils.Corpus`. 
        In this case, you don't need to call `make_doc`. `infer` would generate documents bound to the model, estimate its topic distributions and
        return a corpus containing generated documents as the result.
iterations : int
    an integer indicating the number of iteration to estimate the distribution of topics of `doc`.
    The higher value will generate a more accurate result.
tolerance : float
    This parameter is not currently used.
workers : int
    an integer indicating the number of workers to perform samplings. 
    If `workers` is 0, the number of cores in the system will be used.
parallel : Union[int, tomotopy.ParallelScheme]
    .. versionadded:: 0.5.0
    
    the parallelism scheme for inference. the default value is ParallelScheme.DEFAULT which means that tomotopy selects the best scheme by model.
together : bool
    all `doc`s are inferred together in one process if True, otherwise each `doc` is inferred independently. Its default value is `False`.
transform : Callable[dict, dict]
    .. versionadded:: 0.10.0
    
    a callable object to manipulate arbitrary keyword arguments for a specific topic model. 
    Available when `doc` is given as an instance of `tomotopy.utils.Corpus`.

Returns
-------
result : Union[Tuple[List[float], List[float]], List[Tuple[List[float], List[float]]], tomotopy.utils.Corpus]
    If `doc` is given as a single `tomotopy.utils.Document`, `result` is a tuple of `List[float]` indicating its topic distribution and `List[float]` indicating its sub-topic distribution.
    
    If `doc` is given as a list of `tomotopy.utils.Document`s, `result` is a list of `List[float]` indicating topic distributions for each document.
    
    If `doc` is given as an instance of `tomotopy.utils.Corpus`, `result` is another instance of `tomotopy.utils.Corpus` which contains inferred documents.
    You can get topic distribution for each document using `tomotopy.utils.Document.get_topic_dist` and sub-topic distribution using `tomotopy.utils.Document.get_sub_topic_dist`
log_ll : List[float]
    a list of log-likelihoods for each `doc`
'''
        return self._infer(doc, iterations, tolerance, workers, parallel, together, transform)
    
    def get_count_by_super_topic(self) -> List[int]:
        '''Return the number of words allocated to each super-topic.

.. versionadded:: 0.9.0'''
        return self._get_count_by_super_topic()
    
    @property
    def k1(self) -> int:
        '''k1, the number of super topics (read-only)'''
        return self._k
    
    @property
    def k2(self) -> int:
        '''k2, the number of sub topics (read-only)'''
        return self._k2
    
    @property
    def alpha(self) -> float:
        '''Dirichlet prior on the per-document super topic distributions in shape `[k1]` (read-only)

.. versionadded:: 0.9.0'''
        return self._alpha
    
    @property
    def subalpha(self) -> float:
        '''Dirichlet prior on the sub topic distributions for each super topic in shape `[k1, k2]` (read-only)

.. versionadded:: 0.9.0'''
        return self._subalpha
    
    def _summary_params_info(self, file):
        print('| alpha (Dirichlet prior on the per-document super topic distributions)\n'
            '|  {}'.format(_format_numpy(self.alpha, '|  ')), file=file)
        print('| subalpha (Dirichlet prior on the sub topic distributions for each super topic)', file=file)
        for k1 in range(self.k1):
            print('|  Super #{}: {}'.format(k1, _format_numpy(self.subalpha[k1], '|   ')), file=file)
        print('| eta (Dirichlet prior on the per-subtopic word distribution)\n'
            '|  {:.5}'.format(self.eta), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_super_topic()
        print('| Sub-topic distribution of Super-topics', file=file)
        for k in range(self.k1):
            words = ' '.join('#{}'.format(w) for w, _ in self.get_sub_topics(k, top_n=topic_word_top_n))
            print('|  #Super{} ({}) : {}'.format(k, topic_cnt[k], words), file=file)
        topic_cnt = self.get_count_by_topics()
        print('| Word distribution of Sub-topics', file=file)
        for k in range(self.k2):
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('|  #{} ({}) : {}'.format(k, topic_cnt[k], words), file=file)

This type provides an implementation of the Pachinko Allocation (PA) topic model. The main algorithm is based on the following paper:

Li, W., & McCallum, A. (2006, June). Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning (pp. 577-584). ACM.

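Parameters

tw : Union[int, TermWeight]: the term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

minimum document frequency of words. Words that appear in fewer than min_df documents are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: Added in version: 0.2.0

the number of top-frequency words to be removed. If overly common words show up at the top of the topic results and you want to remove them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k1 : int: the number of super topics, an integer between 1 and 32767
k2 : int: the number of sub topics, an integer between 1 and 32767
alpha : Union[float, Iterable[float]]: hyperparameter of the Dirichlet distribution for document-super topic, given as a single float for a symmetric prior or as a list of k1 floats for an asymmetric prior.
subalpha : Union[float, Iterable[float]]: Added in version: 0.11.0

hyperparameter of the Dirichlet distribution for super-sub topic, given as a single float for a symmetric prior or as a list of k2 floats for an asymmetric prior.
eta : float: hyperparameter of the Dirichlet distribution for sub topic-word
seed : int: random seed. The default value is a random number from std::random_device{} in C++. Even with a fixed seed, results may differ between runs when train is called with workers set to 2 or more, because of the nondeterminism introduced by multithreading.
corpus : Corpus: Added in version: 0.6.0

a set of documents to be added to the topic model
transform : Callable[dict, dict]: Added in version: 0.6.0

a callable object to manipulate arbitrary keyword arguments for a specific topic model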
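The two accepted forms of alpha and subalpha (one float for a symmetric prior, a list for an asymmetric one) can be illustrated with a short sketch. `expand_prior` is a hypothetical helper written for this example; it only shows how the two forms relate and is not how the library stores its priors.

```python
from typing import Iterable, List, Union


def expand_prior(value: Union[float, Iterable[float]], length: int) -> List[float]:
    """Hypothetical helper: expand a prior argument into an explicit list.

    A single number is repeated `length` times (symmetric prior);
    an iterable must already contain exactly `length` values (asymmetric prior).
    """
    if isinstance(value, (int, float)):
        return [float(value)] * length
    values = [float(v) for v in value]
    if len(values) != length:
        raise ValueError("expected {} values, got {}".format(length, len(values)))
    return values
```

With k1=3, passing alpha=0.1 is conceptually the same as passing [0.1, 0.1, 0.1], whereas an asymmetric alpha must supply exactly k1 values, and an asymmetric subalpha exactly k2 values.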

Ancestors

tomotopy._PAModel
LDAModel
tomotopy._LDAModel

Subclasses

HPAModel

Instance variables

prop alpha : float

Expand source code

    @property
    def alpha(self) -> float:
        '''Dirichlet prior on the per-document super topic distributions in shape `[k1]` (read-only)

.. versionadded:: 0.9.0'''
        return self._alpha

Dirichlet prior on the per-document super topic distributions, in shape [k1] (read-only)

Added in version: 0.9.0

prop k1 : int

Expand source code

@property
def k1(self) -> int:
    '''k1, the number of super topics (read-only)'''
    return self._k

k1, the number of super topics (read-only)

prop k2 : int

Expand source code

@property
def k2(self) -> int:
    '''k2, the number of sub topics (read-only)'''
    return self._k2

k2, the number of sub topics (read-only)

prop subalpha : float

Expand source code

    @property
    def subalpha(self) -> float:
        '''Dirichlet prior on the sub topic distributions for each super topic in shape `[k1, k2]` (read-only)

.. versionadded:: 0.9.0'''
        return self._subalpha

Dirichlet prior on the sub topic distributions for each super topic, in shape [k1, k2] (read-only)

Added in version: 0.9.0

Methods

def get_count_by_super_topic(self) ‑> List[int]

Expand source code

    def get_count_by_super_topic(self) -> List[int]:
        '''Return the number of words allocated to each super-topic.

.. versionadded:: 0.9.0'''
        return self._get_count_by_super_topic()

Returns the number of words allocated to each super topic, as a list.

Added in version: 0.9.0

def get_sub_topic_dist(self, super_topic_id, normalize=True) ‑> List[float]

Expand source code

    def get_sub_topic_dist(self, super_topic_id, normalize=True) -> List[float]:
        '''Return a distribution of the sub topics in a super topic `super_topic_id`.
The returned value is a `list` that has `k2` fraction numbers indicating probabilities for each sub topic in the current super topic.

Parameters
----------
super_topic_id : int
    indicating the super topic, in range [0, `k1`)
'''
        return self._get_sub_topic_dist(super_topic_id, normalize)

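Returns the distribution of the sub topics in the super topic super_topic_id. The returned value is a list of k2 floats giving the probability of each sub topic within the current super topic.

Parameters

super_topic_id : int: an integer in the range [0, k1) indicating the super topic
normalize : bool: Added in version: 0.11.0

If True, returns a probability distribution that sums to 1. If False, returns the raw, unnormalized values.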

def get_sub_topics(self, super_topic_id, top_n=10) ‑> List[Tuple[int, float]]

Expand source code

    def get_sub_topics(self, super_topic_id, top_n=10) -> List[Tuple[int, float]]:
        '''.. versionadded:: 0.1.4

Return the `top_n` sub topics and their probabilities in the super topic `super_topic_id`.
The return type is a `list` of (subtopic:`int`, probability:`float`).

Parameters
----------
super_topic_id : int
    indicating the super topic, in range [0, `k1`)
'''
        return self._get_sub_topics(super_topic_id, top_n)

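Added in version: 0.1.4

Returns the top_n sub topics of the super topic super_topic_id together with their probabilities. The return type is a list of (subtopic:int, probability:float) tuples.

Parameters

super_topic_id : int: an integer in the range [0, k1) indicating the super topic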
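The shape of the get_sub_topics result can be related to a full sub topic distribution with a small sketch. It assumes the top_n entries are simply the highest-probability sub topics in descending order. `top_n_sub_topics` is a hypothetical helper, and its handling of ties is a choice made for this example only.

```python
from typing import List, Sequence, Tuple


def top_n_sub_topics(dist: Sequence[float], top_n: int = 10) -> List[Tuple[int, float]]:
    """Hypothetical helper: pick the top_n (sub_topic_id, probability) pairs.

    `dist` plays the role of a sub topic distribution with k2 entries,
    such as the list returned by get_sub_topic_dist.
    """
    # sorted() is stable, so equal probabilities keep ascending id order
    ranked = sorted(enumerate(dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```

For a distribution [0.1, 0.6, 0.3] and top_n=2, the helper yields sub topic 1 followed by sub topic 2, each paired with its probability.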

def get_topic_word_dist(self, sub_topic_id, normalize=True) ‑> List[float]

Expand source code

    def get_topic_word_dist(self, sub_topic_id, normalize=True) -> List[float]:
        '''Return the word distribution of the sub topic `sub_topic_id`.
The returned value is a `list` that has `len(vocabs)` fraction numbers indicating probabilities for each word in the current sub topic.

Parameters
----------
sub_topic_id : int
    indicating the sub topic, in range [0, `k2`)
normalize : bool
    .. versionadded:: 0.11.0

    If True, it returns the probability distribution with the sum being 1. Otherwise it returns the distribution of raw values.
'''
        return self._get_topic_word_dist(sub_topic_id, normalize)

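Returns the word distribution of the sub topic sub_topic_id. The returned value is a list of len(vocabs) floats giving the probability of each word in the current sub topic.

Parameters

sub_topic_id : int: an integer in the range [0, k2) indicating the sub topic
normalize : bool: Added in version: 0.11.0

If True, returns a probability distribution that sums to 1. If False, returns the raw, unnormalized values.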

def get_topic_words(self, sub_topic_id, top_n=10) ‑> List[Tuple[str, float]]

Expand source code

    def get_topic_words(self, sub_topic_id, top_n=10) -> List[Tuple[str, float]]:
        '''Return the `top_n` words and their probabilities in the sub topic `sub_topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
sub_topic_id : int
    indicating the sub topic, in range [0, `k2`)
'''
        return self._get_topic_words(sub_topic_id, top_n)

Returns the top_n words of the sub topic sub_topic_id together with their probabilities. The return type is a list of (word:str, probability:float) tuples.

Parameters

sub_topic_id : int: an integer in the range [0, k2) indicating the sub topic

def infer(self, doc, iterations=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None) ‑> Tuple[Tuple[List[float], List[float]] | List[Tuple[List[float], List[float]]] | Corpus, List[float]]

Expand source code

    def infer(self, doc, iterations=100, tolerance=-1, workers=0, parallel=0, together=False, transform=None) -> Tuple[Union[Tuple[List[float], List[float]], List[Tuple[List[float], List[float]]], Corpus], List[float]]:
        '''.. versionadded:: 0.5.0

Return the inferred topic distribution and sub-topic distribution from unseen `doc`s.

Parameters
----------
doc : Union[tomotopy.utils.Document, Iterable[tomotopy.utils.Document], tomotopy.utils.Corpus]
    an instance of `tomotopy.utils.Document` or a `list` of instances of `tomotopy.utils.Document` to be inferred by the model.
    It can be acquired from `tomotopy.models.LDAModel.make_doc` method.

    .. versionchanged:: 0.10.0

        Since version 0.10.0, `infer` can receive a raw corpus instance of `tomotopy.utils.Corpus`. 
        In this case, you don't need to call `make_doc`. `infer` would generate documents bound to the model, estimate its topic distributions and
        return a corpus containing generated documents as the result.
iterations : int
    an integer indicating the number of iterations used to estimate the distribution of topics of `doc`.
    A higher value generally yields a more accurate result.
tolerance : float
    This parameter is not currently used.
workers : int
    an integer indicating the number of workers to perform samplings. 
    If `workers` is 0, the number of cores in the system will be used.
parallel : Union[int, tomotopy.ParallelScheme]
    .. versionadded:: 0.5.0
    
    the parallelism scheme for inference. the default value is ParallelScheme.DEFAULT which means that tomotopy selects the best scheme by model.
together : bool
    all `doc`s are inferred together in one process if True, otherwise each `doc` is inferred independently. Its default value is `False`.
transform : Callable[dict, dict]
    .. versionadded:: 0.10.0
    
    a callable object to manipulate arbitrary keyword arguments for a specific topic model. 
    Available when `doc` is given as an instance of `tomotopy.utils.Corpus`.

Returns
-------
result : Union[Tuple[List[float], List[float]], List[Tuple[List[float], List[float]]], tomotopy.utils.Corpus]
    If `doc` is given as a single `tomotopy.utils.Document`, `result` is a tuple of `List[float]` indicating its topic distribution and `List[float]` indicating its sub-topic distribution.
    
    If `doc` is given as a list of `tomotopy.utils.Document`s, `result` is a list of tuples, each holding a `List[float]` topic distribution and a `List[float]` sub-topic distribution for one document.
    
    If `doc` is given as an instance of `tomotopy.utils.Corpus`, `result` is another instance of `tomotopy.utils.Corpus` which contains inferred documents.
    You can get topic distribution for each document using `tomotopy.utils.Document.get_topic_dist` and sub-topic distribution using `tomotopy.utils.Document.get_sub_topic_dist`
log_ll : List[float]
    a list of log-likelihoods for each `doc`
'''
        return self._infer(doc, iterations, tolerance, workers, parallel, together, transform)

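Added in version: 0.5.0

Return the inferred topic distribution and sub-topic distribution for each unseen document in doc. The return type is ((topic distribution of doc, sub-topic distribution of doc), log-likelihood) or (a list of (topic distribution of doc, sub-topic distribution of doc), log-likelihood).

Parameters

doc : Union[Document, Iterable[Document], Corpus]: an instance of Document, or a list of such instances, to be inferred by the model. These instances can be obtained from the LDAModel.make_doc() method.

Changed in version: 0.10.0

Since version 0.10.0, infer can receive an instance of Corpus directly. In this case there is no need to call make_doc: infer generates documents bound to the model, estimates their topic distributions, and returns a Corpus containing the generated documents as the result.
iterations : int: the number of iterations used to estimate the topic distribution of doc. A higher value can give a more accurate result.
tolerance : float: not currently used
workers : int: the number of threads used for Gibbs sampling. If this value is 0, all available cores in the system are used.
parallel : Union[int, ParallelScheme]: Added in version: 0.5.0

The parallelism scheme for inference. The default value is ParallelScheme.DEFAULT, which lets tomotopy select the best scheme for the model.
together : bool: If True, all documents in doc are fed to the model and inferred together. If False, each document is inferred separately. The default value is False.
transform : Callable[dict, dict]: Added in version: 0.10.0

A callable object to manipulate arbitrary keyword arguments for a specific topic model. Available only when doc is given as an instance of Corpus.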

Returns

result : Union[Tuple[List[float], List[float]], List[Tuple[List[float], List[float]]], Corpus]

If doc is given as a single Document, result is a tuple of a List[float] indicating its topic distribution and a List[float] indicating its sub-topic distribution.

If doc is given as a list of Documents, result is a list of such tuples, one per document.

If doc is given as an instance of Corpus, result is a new instance of Corpus containing the inferred documents. Use Document.get_topic_dist() to get the topic distribution of each document and Document.get_sub_topic_dist() to get its sub-topic distribution.

log_ll : List[float]

a list of log-likelihoods, one per document
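
Because result takes a different shape for a single Document and for a list of Documents, calling code may want to normalize both to one form. The sketch below does that for plain Python values shaped like the documented return types. The helper `as_pairs` and the mock numbers are hypothetical, no model is involved, and the Corpus case is deliberately not handled.

```python
from typing import List, Tuple, Union

Dist = List[float]
Pair = Tuple[Dist, Dist]  # (topic distribution, sub-topic distribution)


def as_pairs(result: Union[Pair, List[Pair]]) -> List[Pair]:
    """Wrap a single-document result in a list; pass a list result through."""
    if isinstance(result, tuple):
        return [result]
    return list(result)


# Mock return value shaped like the documented output for two documents.
mock_return = (
    [([0.7, 0.3], [0.2, 0.5, 0.3]), ([0.1, 0.9], [0.6, 0.3, 0.1])],
    [-12.5, -9.0],
)
result, log_ll = mock_return
for (topic_dist, sub_topic_dist), ll in zip(as_pairs(result), log_ll):
    print(len(topic_dist), len(sub_topic_dist), ll)
```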

Inherited methods and variables

LDAModel:
- add_corpus
- add_doc
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_word_prior
- global_step
- k
- ll_per_word
- load
- loads
- make_doc
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class PLDAModel (tw='one', min_cf=0, min_df=0, rm_top=0, latent_topics=0, topics_per_label=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

class PLDAModel(_PLDAModel, LDAModel):
    '''This type provides Partially Labeled LDA(PLDA) topic model and its implementation is based on the following papers:
        
> * Ramage, D., Manning, C. D., & Dumais, S. (2011, August). Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-465). ACM.

.. versionadded:: 0.4.0'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, latent_topics=0, topics_per_label=1, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
latent_topics : int
    the number of latent topics, which are shared to all documents, between 1 ~ 32767
topics_per_label : int
    the number of topics per label between 1 ~ 32767
alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
seed : int
    random seed. The default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            latent_topics,
            topics_per_label,
            alpha,
            eta,
            seed,
            corpus,
            transform,
        )

    def add_doc(self, words, labels=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `labels` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, labels, ignore_empty_words)
    
    def make_doc(self, words, labels=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `labels` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
'''
        return self._make_doc(words, labels)
    
    def get_topic_words(self, topic_id, top_n=10, return_id=False) -> Union[List[Tuple[str, float]], List[Tuple[int, str, float]]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int
    Integers in the range [0, `l` * `topics_per_label`), where `l` is the number of total labels, represent a topic that belongs to the corresponding label.
    The label name can be found by looking up `tomotopy.models.PLDAModel.topic_label_dict`.
    Integers in the range [`l` * `topics_per_label`, `l` * `topics_per_label` + `latent_topics`) represent a latent topic which does not belong to any label.
top_n : int
    the number of top words to return
return_id : bool
    If `True`, it returns a list of (word_id:`int`, word:`str`, probability:`float`) instead of (word:`str`, probability:`float`).
    
'''
        return self._get_topic_words(topic_id, top_n, return_id)
    
    @property
    def topic_label_dict(self):
        '''a dictionary of topic labels in type `tomotopy.Dictionary` (read-only)'''
        return self._topic_label_dict
    
    @property
    def latent_topics(self) -> int:
        '''the number of latent topics (read-only)'''
        return self._latent_topics
    
    @property
    def topics_per_label(self) -> int:
        '''the number of topics per label (read-only)'''
        return self._topics_per_label
    
    def _summary_basic_info(self, file):
        LDAModel._summary_basic_info(self, file)
        label_cnt = Counter(l for doc in self.docs for l, _ in doc.labels)
        print('| Label of docs and its distribution', file=file)
        for lb in self.topic_label_dict:
            print('|  {}: {}'.format(lb, label_cnt.get(lb, 0)), file=file)

    def _summary_topics_info(self, file, topic_word_top_n):
        topic_cnt = self.get_count_by_topics()
        for k in range(self.k):
            l = k // self.topics_per_label
            label = ('Label {}-{} (#{})'.format(self.topic_label_dict[l], k % self.topics_per_label, k) 
                if l < len(self.topic_label_dict) else 'Latent {} (#{})'.format(k - self.topics_per_label * len(self.topic_label_dict), k))
            words = ' '.join(w for w, _ in self.get_topic_words(k, top_n=topic_word_top_n))
            print('| {} ({}) : {}'.format(label, topic_cnt[k], words), file=file)

This type provides an implementation of the Partially Labeled LDA (PLDA) topic model. The main algorithm is based on the following paper:

Ramage, D., Manning, C. D., & Dumais, S. (2011, August). Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-465). ACM.

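Added in version: 0.4.0

Parameters

tw : Union[int, TermWeight]: term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: Added in version: 0.6.0

Minimum document frequency of words. Words that appear in fewer than min_df documents are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: the number of most frequent words to remove. If overly common words show up at the top of the topic results and you want to drop them, set this value to 1 or more. The default value is 0, which means no top words are removed.
latent_topics : int: the number of latent topics shared by all documents, an integer between 1 ~ 32767.
topics_per_label : int: the number of topics per label, an integer between 1 ~ 32767.
alpha : Union[float, Iterable[float]]: hyperparameter of the Dirichlet distribution for document-topic, given as a single float for a symmetric prior or as a list of float with length k for an asymmetric prior.
eta : float: hyperparameter of the Dirichlet distribution for topic-word
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even with a fixed seed, results may differ between runs when workers is set to 2 or more in train, because of the nondeterminism introduced by multithreading.
corpus : Corpus: Added in version: 0.6.0

A set of documents to be added into the topic model.
transform : Callable[dict, dict]: Added in version: 0.6.0

A callable object to manipulate arbitrary keyword arguments for a specific topic model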

Ancestors

tomotopy._PLDAModel
LDAModel
tomotopy._LDAModel

Instance variables

prop latent_topics : int

@property
def latent_topics(self) -> int:
    '''the number of latent topics (read-only)'''
    return self._latent_topics

the number of latent topics (read-only)

prop topic_label_dict

@property
def topic_label_dict(self):
    '''a dictionary of topic labels in type `tomotopy.Dictionary` (read-only)'''
    return self._topic_label_dict

a dictionary of topic labels in type tomotopy.Dictionary (read-only)

prop topics_per_label : int

@property
def topics_per_label(self) -> int:
    '''the number of topics per label (read-only)'''
    return self._topics_per_label

the number of topics per label (read-only)

Methods

def add_doc(self, words, labels=[], ignore_empty_words=True) ‑> int | None

    def add_doc(self, words, labels=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with `labels` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, labels, ignore_empty_words)

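Add a new document with labels into the model and return the index of the inserted document.

Parameters

words : Iterable[str]: an iterable of str
labels : Iterable[str]: labels of the document
ignore_empty_words : bool: If True, an empty words does not raise an exception and makes the method return None.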

def get_topic_words(self, topic_id, top_n=10, return_id=False) ‑> List[Tuple[str, float]] | List[Tuple[int, str, float]]

    def get_topic_words(self, topic_id, top_n=10, return_id=False) -> Union[List[Tuple[str, float]], List[Tuple[int, str, float]]]:
        '''Return the `top_n` words and their probabilities in the topic `topic_id`. 
The return type is a `list` of (word:`str`, probability:`float`).

Parameters
----------
topic_id : int
    Integers in the range [0, `l` * `topics_per_label`), where `l` is the number of total labels, represent a topic that belongs to the corresponding label.
    The label name can be found by looking up `tomotopy.models.PLDAModel.topic_label_dict`.
    Integers in the range [`l` * `topics_per_label`, `l` * `topics_per_label` + `latent_topics`) represent a latent topic which does not belong to any label.
top_n : int
    the number of top words to return
return_id : bool
    If `True`, it returns a list of (word_id:`int`, word:`str`, probability:`float`) instead of (word:`str`, probability:`float`).
    
'''
        return self._get_topic_words(topic_id, top_n, return_id)

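Return the top_n words and their probabilities in the topic topic_id. The return type is a list of (word:str, probability:float) tuples.

Parameters

topic_id : int: With l as the total number of labels, integers in the range [0, l * topics_per_label) refer to a topic that belongs to the corresponding label. The label name of such a topic can be found by looking up PLDAModel.topic_label_dict. Integers in the range [l * topics_per_label, l * topics_per_label + latent_topics) refer to a latent topic that does not belong to any label.
top_n : int: the number of top words to return
return_id : bool: If True, a list of (word_id:int, word:str, probability:float) is returned instead of (word:str, probability:float).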
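
The topic_id layout described above can be sketched as a small lookup that mirrors the index arithmetic in `_summary_topics_info`. The helper `describe_topic` is hypothetical, and a plain list of label names stands in for PLDAModel.topic_label_dict as an assumption.

```python
from typing import List, Tuple


def describe_topic(topic_id: int, labels: List[str],
                   topics_per_label: int, latent_topics: int) -> Tuple[str, int]:
    """Map a topic id to (label name or 'latent', index within that group)."""
    num_label_topics = len(labels) * topics_per_label
    if not 0 <= topic_id < num_label_topics + latent_topics:
        raise IndexError("topic_id out of range")
    if topic_id < num_label_topics:
        # Label-bound topics come first, topics_per_label ids per label.
        return labels[topic_id // topics_per_label], topic_id % topics_per_label
    # Remaining ids are latent topics that belong to no label.
    return "latent", topic_id - num_label_topics


# Made-up configuration: two labels, two topics per label, one latent topic.
labels = ["sports", "politics"]
for t in range(5):
    print(t, describe_topic(t, labels, topics_per_label=2, latent_topics=1))
```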

def make_doc(self, words, labels=[]) ‑> Document

    def make_doc(self, words, labels=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and `labels` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
labels : Iterable[str]
    labels of the document
'''
        return self._make_doc(words, labels)

Return a new Document instance, built from words and labels, that can be used with the LDAModel.infer() method.

Parameters

words : Iterable[str]: an iterable of str
labels : Iterable[str]: labels of the document

Inherited methods and variables

LDAModel:
- add_corpus
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class PTModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, p=None, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None)

class PTModel(_PTModel, LDAModel):
    '''.. versionadded:: 0.11.0
This type provides Pseudo-document based Topic Model (PTM) and its implementation is based on the following papers:
        
> * Zuo, Y., Wu, J., Zhang, H., Lin, H., Wang, F., Xu, K., & Xiong, H. (2016, August). Topic modeling of short texts: A pseudo-document view. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2105-2114).'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, p=None, alpha=0.1, eta=0.01, seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
p : int
    the number of pseudo documents

    .. versionchanged:: 0.12.2

        The default value is changed to `10 * k`.
alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
seed : int
    random seed. The default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    a list of documents to be added into the model
transform : Callable[dict, dict]
    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            p,
            alpha,
            eta,
            seed,
            corpus,
            transform,
        )
    
    @property
    def p(self) -> int:
        '''the number of pseudo documents (read-only)

.. versionadded:: 0.11.0'''
        return self._p

Added in version: 0.11.0

This type provides an implementation of the Pseudo-document based Topic Model (PTM). The main algorithm is based on the following paper:

Zuo, Y., Wu, J., Zhang, H., Lin, H., Wang, F., Xu, K., & Xiong, H. (2016, August). Topic modeling of short texts: A pseudo-document view. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2105-2114).

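Parameters

tw : Union[int, TermWeight]: term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.
min_cf : int: minimum collection frequency of words. Words whose frequency across the whole collection is smaller than min_cf are excluded from the model. The default value is 0, which means no words are excluded.
min_df : int: minimum document frequency of words. Words that appear in fewer than min_df documents are excluded from the model. The default value is 0, which means no words are excluded.
rm_top : int: the number of most frequent words to remove. If overly common words show up at the top of the topic results and you want to drop them, set this value to 1 or more. The default value is 0, which means no top words are removed.
k : int: the number of topics, an integer between 1 ~ 32767
p : int: the number of pseudo documents

Changed in version: 0.12.2
The default value is changed to 10 * k.
alpha : Union[float, Iterable[float]]: hyperparameter of the Dirichlet distribution for document-topic, given as a single float for a symmetric prior or as a list of float with length k for an asymmetric prior.
eta : float: hyperparameter of the Dirichlet distribution for topic-word
seed : int: random seed. The default value is a random integer generated by std::random_device{} in C++. Even with a fixed seed, results may differ between runs when workers is set to 2 or more in train, because of the nondeterminism introduced by multithreading.
corpus : Corpus: a set of documents to be added into the topic model.
transform : Callable[dict, dict]: a callable object to manipulate arbitrary keyword arguments for a specific topic model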
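
Two of the parameters above have defaults or shapes that depend on k. The sketch below spells them out: p falls back to 10 * k when left as None, and alpha is either a single float (symmetric) or a list of length k (asymmetric). Both helpers, `resolve_p` and `expand_alpha`, are hypothetical and only restate the documented behaviour; how the library resolves these values internally is an assumption.

```python
from typing import Iterable, List, Optional, Union


def resolve_p(k: int, p: Optional[int] = None) -> int:
    """Return p, or the documented default 10 * k when p is None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 10 * k if p is None else p


def expand_alpha(alpha: Union[float, Iterable[float]], k: int) -> List[float]:
    """Expand a symmetric alpha to k values; validate an asymmetric one."""
    if isinstance(alpha, (int, float)):
        return [float(alpha)] * k
    values = [float(a) for a in alpha]
    if len(values) != k:
        raise ValueError("asymmetric alpha must have length k")
    return values


print(resolve_p(5))          # default number of pseudo documents for k=5
print(expand_alpha(0.1, 3))  # symmetric prior expanded to three topics
```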

Ancestors

tomotopy._PTModel
LDAModel
tomotopy._LDAModel

Instance variables

prop p : int

    @property
    def p(self) -> int:
        '''the number of pseudo documents (read-only)

.. versionadded:: 0.11.0'''
        return self._p

the number of pseudo documents (read-only)

Added in version: 0.11.0

Inherited methods and variables

LDAModel:
- add_corpus
- add_doc
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- make_doc
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs

class SLDAModel (tw='one', min_cf=0, min_df=0, rm_top=0, k=1, vars='', alpha=0.1, eta=0.01, mu=[], nu_sq=[], glm_param=[], seed=None, corpus=None, transform=None)

class SLDAModel(_SLDAModel, LDAModel):
    '''This type provides supervised Latent Dirichlet Allocation(sLDA) topic model and its implementation is based on the following papers:
        
> * Mcauliffe, J. D., & Blei, D. M. (2008). Supervised topic models. In Advances in neural information processing systems (pp. 121-128).
> * Python version implementation using Gibbs sampling : https://github.com/Savvysherpa/slda

.. versionadded:: 0.2.0'''

    def __init__(self,
                 tw='one', min_cf=0, min_df=0, rm_top=0, k=1, vars='', alpha=0.1, eta=0.01, mu=[], nu_sq=[], glm_param=[], seed=None, corpus=None, transform=None):
        '''Parameters
----------
tw : Union[int, tomotopy.TermWeight]
    term weighting scheme in `tomotopy.TermWeight`. The default value is TermWeight.ONE
min_cf : int
    minimum collection frequency of words. Words with a smaller collection frequency than `min_cf` are excluded from the model.
    The default value is 0, which means no words are excluded.
min_df : int
    .. versionadded:: 0.6.0

    minimum document frequency of words. Words with a smaller document frequency than `min_df` are excluded from the model.
    The default value is 0, which means no words are excluded.
rm_top : int
    the number of top words to be removed. If you want to remove too common words from the model, you can set this value to 1 or more.
    The default value is 0, which means no top words are removed.
k : int
    the number of topics between 1 ~ 32767
vars : Iterable[str]
    indicating types of response variables.
    The length of `vars` determines the number of response variables, and each element of `vars` determines a type of the variable.
    The list of available types is like below:
    
    > * 'l': linear variable (any real value)
    > * 'b': binary variable (0 or 1)
alpha : Union[float, Iterable[float]]
    hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.
eta : float
    hyperparameter of Dirichlet distribution for topic-word
mu : Union[float, Iterable[float]]
    mean of regression coefficients, default value is 0
nu_sq : Union[float, Iterable[float]]
    variance of regression coefficients, default value is 1
glm_param : Union[float, Iterable[float]]
    the parameter for Generalized Linear Model, default value is 1
seed : int
    random seed. The default value is a random number from `std::random_device{}` in C++
corpus : tomotopy.utils.Corpus
    .. versionadded:: 0.6.0

    a list of documents to be added into the model
transform : Callable[dict, dict]
    .. versionadded:: 0.6.0

    a callable object to manipulate arbitrary keyword arguments for a specific topic model

'''
        # get initial params
        self.init_params = deepcopy({k: v for k, v in locals().items() if k != 'self' and not k.startswith('_')})
        self.init_params['version'] = __version__
        tw = _convert_term_weight(tw)

        super().__init__(
            tw,
            min_cf,
            min_df,
            rm_top,
            k,
            vars,
            alpha,
            eta,
            mu,
            nu_sq,
            glm_param,
            seed,
            corpus,
            transform,
        )

    def add_doc(self, words, y=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with response variables `y` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
y : Iterable[float]
    response variables of this document. 
    The length of `y` must be equal to the number of response variables of the model (`tomotopy.models.SLDAModel.f`).
    
    .. versionchanged:: 0.5.1
    
        If you have a missing value, you can set the item as `NaN`. Documents with `NaN` variables are included in modeling topics, but excluded from regression.
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, y, ignore_empty_words)
    
    def make_doc(self, words, y=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and response variables `y` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
y : Iterable[float]
    response variables of this document. 
    The length of `y` doesn't have to be equal to the number of response variables of the model (`tomotopy.models.SLDAModel.f`).
    If the length of `y` is shorter than `tomotopy.models.SLDAModel.f`, missing values are automatically filled with `NaN`.
'''
        return self._make_doc(words, y)
    
    def get_regression_coef(self, var_id=None) -> List[float]:
        '''Return the regression coefficient of the response variable `var_id`.

Parameters
----------
var_id : int
    indicating the response variable, in range [0, `f`)

    If omitted, the whole regression coefficients with shape `[f, k]` are returned.
'''
        return self._get_regression_coef(var_id)
    
    def get_var_type(self, var_id) -> str:
        '''Return the type of the response variable `var_id`. 'l' means linear variable, 'b' means binary variable.'''
        return self._get_var_type(var_id)
    
    def estimate(self, doc) -> List[float]:
        '''Return the estimated response variable for `doc`.
If `doc` is an unseen document instance which is generated by `tomotopy.models.SLDAModel.make_doc` method, it should be inferred by `tomotopy.models.LDAModel.infer` method first.

Parameters
----------
doc : tomotopy.utils.Document
    an instance of document or a list of them to be used for estimating response variables
'''
        return self._estimate(doc)
    
    @property
    def f(self) -> int:
        '''the number of response variables (read-only)'''
        return self._f
    
    def _summary_initial_params_info_vars(self, v, file):
        var_type = {'l':'linear', 'b':'binary'}
        print('| vars: {}'.format(', '.join(map(var_type.__getitem__, v))), file=file)

    def _summary_params_info(self, file):
        LDAModel._summary_params_info(self, file)
        var_type = {'l':'linear', 'b':'binary'}
        print('| regression coefficients of response variables', file=file)
        for f in range(self.f):
            print('|  #{} ({}): {}'.format(f, 
                var_type.get(self.get_var_type(f)),
                _format_numpy(self.get_regression_coef(f), '|    ')
            ), file=file)

This type provides an implementation of the supervised Latent Dirichlet Allocation (sLDA) topic model. The main algorithm is based on the following papers:

Mcauliffe, J. D., & Blei, D. M. (2008). Supervised topic models. In Advances in neural information processing systems (pp. 121-128).

Python version implementation using Gibbs sampling : https://github.com/Savvysherpa/slda

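Added in version: 0.2.0

Parameters

tw : Union[int, TermWeight]

term weighting scheme, given as a value of the TermWeight enumeration. The default value is TermWeight.ONE.

min_cf : int

minimum collection frequency of words. Words with a smaller collection frequency than min_cf are excluded from the model. The default value is 0, which means no words are excluded.

min_df : int

Added in version: 0.6.0

minimum document frequency of words. Words with a smaller document frequency than min_df are excluded from the model. The default value is 0, which means no words are excluded.

rm_top : int

the number of top words to be removed. To remove overly common words from the model, set this value to 1 or more. The default value is 0, which means no top words are removed.

k : int

the number of topics, an integer between 1 ~ 32767

vars : Iterable[str]

the types of the response variables. The length of vars determines the number of response variables the model uses, and each element of vars determines the type of the corresponding response variable. The available types are:

'l': linear variable (any real value)

'b': binary variable (0 or 1 only)

alpha : Union[float, Iterable[float]]

hyperparameter of the Dirichlet distribution for document-topic, given as a single float for a symmetric prior or as a list of float with length k for an asymmetric prior.

eta : float

hyperparameter of the Dirichlet distribution for topic-word

mu : Union[float, Iterable[float]]

mean of the regression coefficients, default value is 0

nu_sq : Union[float, Iterable[float]]

variance of the regression coefficients, default value is 1

glm_param : Union[float, Iterable[float]]

the parameter for the Generalized Linear Model, default value is 1

seed : int

random seed. The default value is a random number from std::random_device{} in C++.

corpus : Corpus

Added in version: 0.6.0

A set of documents to be added into the topic model.

transform : Callable[dict, dict]

Added in version: 0.6.0

A callable object to manipulate arbitrary keyword arguments for a specific topic model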

Ancestors

tomotopy._SLDAModel
LDAModel
tomotopy._LDAModel

Instance variables

prop f : int

@property
def f(self) -> int:
    '''the number of response variables (read-only)'''
    return self._f

the number of response variables (read-only)

Methods

def add_doc(self, words, y=[], ignore_empty_words=True) ‑> int | None

    def add_doc(self, words, y=[], ignore_empty_words=True) -> Optional[int]:
        '''Add a new document into the model instance with response variables `y` and return an index of the inserted document.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
y : Iterable[float]
    response variables of this document. 
    The length of `y` must be equal to the number of response variables of the model (`tomotopy.models.SLDAModel.f`).
    
    .. versionchanged:: 0.5.1
    
        If you have a missing value, you can set the item as `NaN`. Documents with `NaN` variables are included in modeling topics, but excluded from regression.
ignore_empty_words : bool
    If `True`, empty `words` doesn't raise an exception and makes the method return None.
'''
        return self._add_doc(words, y, ignore_empty_words)

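Add a new document with response variables y into the model and return the index of the inserted document.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
y : Iterable[float]: a list of float used as the response variables of the document. The length of y must be equal to the number of response variables of the model, SLDAModel.f.

Changed in version: 0.5.1

If a value is missing, set that item to NaN. Documents with NaN values are included in topic modeling but excluded from the regression of response variables.
ignore_empty_words : bool: If True, an empty words does not raise an exception and makes the method return None.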

def estimate(self, doc) ‑> List[float]

    def estimate(self, doc) -> List[float]:
        '''Return the estimated response variable for `doc`.
If `doc` is an unseen document instance which is generated by `tomotopy.models.SLDAModel.make_doc` method, it should be inferred by `tomotopy.models.LDAModel.infer` method first.

Parameters
----------
doc : tomotopy.utils.Document
    an instance of document or a list of them to be used for estimating response variables
'''
        return self._estimate(doc)

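Return the estimated response variables for doc. If doc is an instance generated by SLDAModel.make_doc(), run topic inference on it with LDAModel.infer() first, then call this method.

Parameters

doc : Document: an instance of a document, or a list of such instances, whose response variables are to be estimated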
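
To give a feel for what an estimate looks like numerically, the sketch below follows the textbook sLDA formulation: a linear variable is predicted as the dot product of the regression coefficients and the document's topic proportions, and a binary variable passes that score through the logistic function. This is an assumption about the general technique, not a description of tomotopy's internal computation, and `estimate_response` and the numbers are hypothetical.

```python
import math
from typing import Sequence


def estimate_response(topic_dist: Sequence[float], coef: Sequence[float], var_type: str) -> float:
    """Predict one response variable from topic proportions (textbook sLDA sketch)."""
    if len(topic_dist) != len(coef):
        raise ValueError("topic_dist and coef must have the same length")
    score = sum(t * c for t, c in zip(topic_dist, coef))
    if var_type == "l":   # linear variable: the score itself
        return score
    if var_type == "b":   # binary variable: probability via the logistic function
        return 1.0 / (1.0 + math.exp(-score))
    raise ValueError("var_type must be 'l' or 'b'")


# Made-up topic proportions and coefficients for k = 2 topics.
linear_value = estimate_response([0.5, 0.5], [2.0, 4.0], "l")
binary_value = estimate_response([0.5, 0.5], [1.0, -1.0], "b")
print(linear_value, binary_value)
```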

def get_regression_coef(self, var_id=None) ‑> List[float]

    def get_regression_coef(self, var_id=None) -> List[float]:
        '''Return the regression coefficient of the response variable `var_id`.

Parameters
----------
var_id : int
    indicating the response variable, in range [0, `f`)

    If omitted, the whole regression coefficients with shape `[f, k]` are returned.
'''
        return self._get_regression_coef(var_id)

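Return the regression coefficients of the response variable var_id.

Parameters

var_id : int

an integer in the range [0, f) indicating the response variable

If omitted, the whole set of regression coefficients with shape [f, k] is returned.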

def get_var_type(self, var_id) ‑> str

def get_var_type(self, var_id) -> str:
    '''Return the type of the response variable `var_id`. 'l' means linear variable, 'b' means binary variable.'''
    return self._get_var_type(var_id)

Return the type of the response variable var_id. 'l' means linear variable, 'b' means binary variable.
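
The one-letter type codes can be expanded into readable names in the same way `_summary_initial_params_info_vars` does for the model summary. The helper `describe_vars` below is a hypothetical stand-alone version of that mapping; the error for unknown codes is an added assumption.

```python
from typing import Iterable, List

VAR_TYPE_NAMES = {"l": "linear", "b": "binary"}


def describe_vars(vars_spec: Iterable[str]) -> List[str]:
    """Expand codes such as 'lb' into ['linear', 'binary']."""
    names = []
    for code in vars_spec:
        if code not in VAR_TYPE_NAMES:
            raise ValueError(f"unknown response variable type: {code!r}")
        names.append(VAR_TYPE_NAMES[code])
    return names


# A spec with one linear and one binary response variable, so f would be 2.
print(describe_vars("lb"))
```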

def make_doc(self, words, y=[]) ‑> Document

    def make_doc(self, words, y=[]) -> Document:
        '''Return a new `tomotopy.utils.Document` instance for an unseen document with `words` and response variables `y` that can be used for `tomotopy.models.LDAModel.infer` method.

Parameters
----------
words : Iterable[str]
    an iterable of `str`
y : Iterable[float]
    response variables of this document. 
    The length of `y` doesn't have to be equal to the number of response variables of the model (`tomotopy.models.SLDAModel.f`).
    If the length of `y` is shorter than `tomotopy.models.SLDAModel.f`, missing values are automatically filled with `NaN`.
'''
        return self._make_doc(words, y)

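Return a new Document instance for an unseen document with words and response variables y. This instance can be used with the LDAModel.infer() method.

Parameters

words : Iterable[str]: an iterable of str listing the words of the document
y : Iterable[float]: a list of float used as the response variables of the document. The length of y does not have to be equal to the number of response variables of the model, SLDAModel.f. If y is shorter than SLDAModel.f, the missing values are automatically filled with NaN.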
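
The padding rule for y can be sketched in plain Python as follows. The helper `pad_responses` is hypothetical, and raising an error when y is longer than f is an assumption, since the text only covers the shorter case.

```python
import math
from typing import Iterable, List


def pad_responses(y: Iterable[float], f: int) -> List[float]:
    """Pad y with NaN up to f response variables."""
    values = [float(v) for v in y]
    if len(values) > f:
        # Assumption: the documentation does not say what happens here.
        raise ValueError("y has more items than the model has response variables")
    return values + [math.nan] * (f - len(values))


# One known response out of three; the other two become NaN.
padded = pad_responses([1.5], 3)
print(padded)
```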

Inherited methods and variables

LDAModel:
- add_corpus
- alpha
- burn_in
- copy
- docs
- eta
- get_count_by_topics
- get_topic_word_dist
- get_topic_words
- get_word_prior
- global_step
- infer
- k
- ll_per_word
- load
- loads
- num_vocabs
- num_words
- optim_interval
- perplexity
- removed_top_words
- save
- saves
- set_word_prior
- summary
- train
- tw
- used_vocab_df
- used_vocab_freq
- used_vocab_weighted_freq
- used_vocabs
- vocab_df
- vocab_freq
- vocabs