Module tomotopy.coherence

Added in version: 0.10.0

This module provides a way to calculate topic coherence introduced in the following paper:

Classes

class Coherence (corpus, coherence='u_mass', window_size=0, targets=None, top_n=10, eps=1e-12, gamma=1.0)
Expand source code
class Coherence(_Coherence):
    '''`Coherence` class provides a way to calculate topic coherence.
    '''
    _COH_MAP = {
        'u_mass':(ProbEstimation.DOCUMENT, 0, Segmentation.ONE_PRE, ConfirmMeasure.LOGCOND, IndirectMeasure.NONE),
        'c_v':(ProbEstimation.SLIDING_WINDOWS, 110, Segmentation.ONE_SET, ConfirmMeasure.NPMI, IndirectMeasure.COSINE),
        'c_uci':(ProbEstimation.SLIDING_WINDOWS, 10, Segmentation.ONE_ONE, ConfirmMeasure.PMI, IndirectMeasure.NONE),
        'c_npmi':(ProbEstimation.SLIDING_WINDOWS, 10, Segmentation.ONE_ONE, ConfirmMeasure.NPMI, IndirectMeasure.NONE)
    }

    def __init__(self, corpus, coherence='u_mass', window_size=0, targets=None, top_n=10, eps=1e-12, gamma=1.0):
        '''Initialize an instance to calculate coherence for given corpus

Parameters
----------
corpus : Union[tomotopy.utils.Corpus, tomotopy.models.LDAModel]
    A reference corpus to be used for estimating probability. 
    Supports not only an instance of `tomotopy.utils.Corpus`, but also any topic model instances including `tomotopy.models.LDAModel` and its descendants.
    If `corpus` is an instance of `tomotopy.utils.Corpus`, `targets` must be given too.
coherence : Union[str, Tuple[int, int, int], Tuple[int, int, int, int]]
    A coherence metric to be used. The metric can be a combination of (`tomotopy.coherence.ProbEstimation`, `tomotopy.coherence.Segmentation`, `tomotopy.coherence.ConfirmMeasure`)
    or a combination of (`tomotopy.coherence.ProbEstimation`, `tomotopy.coherence.Segmentation`, `tomotopy.coherence.ConfirmMeasure`, `tomotopy.coherence.IndirectMeasure`).
    
    Also shorthands in `str` type are supported as follows:
    > * 'u_mass' : (`tomotopy.coherence.ProbEstimation.DOCUMENT`, `tomotopy.coherence.Segmentation.ONE_PRE`, `tomotopy.coherence.ConfirmMeasure.LOGCOND`)
    > * 'c_uci' : (`tomotopy.coherence.ProbEstimation.SLIDING_WINDOWS`, `tomotopy.coherence.Segmentation.ONE_ONE`, `tomotopy.coherence.ConfirmMeasure.PMI`)
    > * 'c_npmi' : (`tomotopy.coherence.ProbEstimation.SLIDING_WINDOWS`, `tomotopy.coherence.Segmentation.ONE_ONE`, `tomotopy.coherence.ConfirmMeasure.NPMI`)
    > * 'c_v' : (`tomotopy.coherence.ProbEstimation.SLIDING_WINDOWS`, `tomotopy.coherence.Segmentation.ONE_SET`, `tomotopy.coherence.ConfirmMeasure.NPMI`, `tomotopy.coherence.IndirectMeasure.COSINE`)
window_size : int
    A window size for `tomotopy.coherence.ProbEstimation.SLIDING_WINDOWS`.
    default value is 10 for 'c_uci' and 'c_npmi' and 110 for 'c_v'.
targets : Iterable[str]
    If `corpus` is an instance of `tomotopy.utils.Corpus`, target words must be given. 
    Only words that are provided as `targets` are included in probability estimation.
top_n : int
    The number of top words to be extracted from each topic.
    If `corpus` is an instance of `tomotopy.models.LDAModel` and its descendants, target words are extracted from each topic of the topic model.
    But if `targets` are given, target words are extracted from `targets`, even if `corpus` is an instance of topic models.
eps : float
    An epsilon value to prevent division by zero
gamma : float
    A gamma value for indirect confirm measures
        '''
        import tomotopy as tp
        self._top_n = top_n
        if isinstance(corpus, tp.DTModel):
            self._topic_model = corpus
            if targets is None:
                targets = itertools.chain(*((w for w, _ in corpus.get_topic_words(k, t, top_n=top_n)) for k in range(corpus.k) for t in range(corpus.num_timepoints)))
            corpus = corpus.docs
        elif isinstance(corpus, tp.LDAModel):
            self._topic_model = corpus
            if targets is None:
                targets = itertools.chain(*((w for w, _ in corpus.get_topic_words(k, top_n=top_n)) for k in range(corpus.k)))
            corpus = corpus.docs
        else:
            self._topic_model = None
        
        try:
            pe, w, seg, cm, im = Coherence._COH_MAP[coherence]
        except KeyError:
            if type(coherence) is str: raise ValueError("Unknown `coherence` value (given {})".format(repr(coherence)))
            if len(coherence) not in (3, 4): raise ValueError("`coherence` must be a tuple with len=3 or len=4 (given {})".format(repr(coherence)))
            if len(coherence) == 3: 
                pe, seg, cm = coherence
                im = IndirectMeasure.NONE
            else:
                pe, seg, cm, im = coherence
            w = 0

            if type(pe) is str: pe = ProbEstimation[pe]
            if type(seg) is str: seg = Segmentation[seg]
            if type(cm) is str: cm = ConfirmMeasure[cm]
            if type(im) is str: im = IndirectMeasure[im]
        
        if not targets: raise ValueError("`targets` must be given as a non-empty iterable of str.")

        super().__init__(corpus, pe, seg, cm, im, window_size or w, eps, gamma, targets)
    
    def get_score(self, words=None, topic_id=None, timepoint=None) -> float:
        '''Calculate the coherence score for given `words` or `topic_id`

Parameters
----------
words : Iterable[str]
    Words whose coherence is calculated. 
    If `tomotopy.coherence.Coherence` was initialized using `corpus` as `tomotopy.models.LDAModel` or its descendants, `words` can be omitted.
    In this case, words are extracted from the topic with `topic_id`.
topic_id : int
    An id of the topic from which words are extracted. 
    This parameter is valid when `tomotopy.coherence.Coherence` was initialized using `corpus` as `tomotopy.models.LDAModel` or its descendants.
    If this is omitted, the average score of all topics is returned.
timepoint : int
    ..versionadded:: 0.12.3

    A timepoint of the topic from which words are extracted. (Only for `tomotopy.models.DTModel`)
        '''
        import tomotopy as tp
        if words is None and self._topic_model is None:
            raise ValueError("`words` must be provided if `Coherence` is not bound to an instance of topic model.")
        if isinstance(self._topic_model, tp.DTModel):
            if int(topic_id is None) + int(timepoint is None) == 1:
                raise ValueError("Both `topic_id` and `timepoint` should be given.")
            if words is None and topic_id is None:
                c = []
                for k in range(self._topic_model.k):
                    for t in range(self._topic_model.num_timepoints):
                        c.append(super().get_score((w for w, _ in self._topic_model.get_topic_words(k, timepoint=t, top_n=self._top_n))))
                return sum(c) / len(c)
            
            if words is None:
                words = (w for w, _ in self._topic_model.get_topic_words(topic_id, timepoint=timepoint, top_n=self._top_n))
            return super().get_score(words)
        else:
            if timepoint is not None:
                raise ValueError("`timepoint` is valid for only `DTModel`.")
            if words is None and topic_id is None:
                c = []
                for k in range(self._topic_model.k):
                    c.append(super().get_score((w for w, _ in self._topic_model.get_topic_words(k, top_n=self._top_n))))
                return sum(c) / len(c)
            
            if words is None:
                words = (w for w, _ in self._topic_model.get_topic_words(topic_id, top_n=self._top_n))
            return super().get_score(words)

Coherence class provides a way to calculate topic coherence.

Initialize an instance to calculate coherence for given corpus

Parameters

corpus : Union[Corpus, LDAModel]
A reference corpus to be used for estimating probability. Supports not only an instance of Corpus, but also any topic model instances including LDAModel and its descendants. If corpus is an instance of Corpus, targets must be given too.
coherence : Union[str, Tuple[int, int, int], Tuple[int, int, int, int]]

A coherence metric to be used. The metric can be a combination of (ProbEstimation, Segmentation, ConfirmMeasure) or a combination of (ProbEstimation, Segmentation, ConfirmMeasure, IndirectMeasure).

Also shorthands in str type are supported as follows:

window_size : int
A window size for ProbEstimation.SLIDING_WINDOWS. default value is 10 for 'c_uci' and 'c_npmi' and 110 for 'c_v'.
targets : Iterable[str]
If corpus is an instance of Corpus, target words must be given. Only words that are provided as targets are included in probability estimation.
top_n : int
The number of top words to be extracted from each topic. If corpus is an instance of LDAModel and its descendants, target words are extracted from each topic of the topic model. But if targets are given, target words are extracted from targets, even if corpus is an instance of topic models.
eps : float
An epsilon value to prevent division by zero
gamma : float
A gamma value for indirect confirm measures

Ancestors

  • tomotopy._Coherence

Methods

def get_score(self, words=None, topic_id=None, timepoint=None) ‑> float
Expand source code
    def get_score(self, words=None, topic_id=None, timepoint=None) -> float:
        '''Calculate the coherence score for given `words` or `topic_id`

Parameters
----------
words : Iterable[str]
    Words whose coherence is calculated. 
    If `tomotopy.coherence.Coherence` was initialized using `corpus` as `tomotopy.models.LDAModel` or its descendants, `words` can be omitted.
    In this case, words are extracted from the topic with `topic_id`.
topic_id : int
    An id of the topic from which words are extracted. 
    This parameter is valid when `tomotopy.coherence.Coherence` was initialized using `corpus` as `tomotopy.models.LDAModel` or its descendants.
    If this is omitted, the average score of all topics is returned.
timepoint : int
    ..versionadded:: 0.12.3

    A timepoint of the topic from which words are extracted. (Only for `tomotopy.models.DTModel`)
        '''
        import tomotopy as tp
        if words is None and self._topic_model is None:
            raise ValueError("`words` must be provided if `Coherence` is not bound to an instance of topic model.")
        if isinstance(self._topic_model, tp.DTModel):
            if int(topic_id is None) + int(timepoint is None) == 1:
                raise ValueError("Both `topic_id` and `timepoint` should be given.")
            if words is None and topic_id is None:
                c = []
                for k in range(self._topic_model.k):
                    for t in range(self._topic_model.num_timepoints):
                        c.append(super().get_score((w for w, _ in self._topic_model.get_topic_words(k, timepoint=t, top_n=self._top_n))))
                return sum(c) / len(c)
            
            if words is None:
                words = (w for w, _ in self._topic_model.get_topic_words(topic_id, timepoint=timepoint, top_n=self._top_n))
            return super().get_score(words)
        else:
            if timepoint is not None:
                raise ValueError("`timepoint` is valid for only `DTModel`.")
            if words is None and topic_id is None:
                c = []
                for k in range(self._topic_model.k):
                    c.append(super().get_score((w for w, _ in self._topic_model.get_topic_words(k, top_n=self._top_n))))
                return sum(c) / len(c)
            
            if words is None:
                words = (w for w, _ in self._topic_model.get_topic_words(topic_id, top_n=self._top_n))
            return super().get_score(words)

Calculate the coherence score for given words or topic_id

Parameters

words : Iterable[str]
Words whose coherence is calculated. If Coherence was initialized using corpus as LDAModel or its descendants, words can be omitted. In this case, words are extracted from the topic with topic_id.
topic_id : int
An id of the topic from which words are extracted. This parameter is valid when Coherence was initialized using corpus as LDAModel or its descendants. If this is omitted, the average score of all topics is returned.
timepoint : int

Added in version: 0.12.3

A timepoint of the topic from which words are extracted. (Only for DTModel)

class ConfirmMeasure (*args, **kwds)
Expand source code
class ConfirmMeasure(IntEnum):
    '''
    This enumeration follows the direct confirm measure in the paper.
    '''

    DIFFERENCE = 1
    RATIO = 2
    LIKELIHOOD = 3
    LOGLIKELIHOOD = 4
    PMI = 5
    NPMI = 6
    LOGCOND = 7

This enumeration follows the direct confirm measure in the paper.

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.ReprEnum
  • enum.Enum

Class variables

var DIFFERENCE

The type of the None singleton.

var LIKELIHOOD

The type of the None singleton.

var LOGCOND

The type of the None singleton.

var LOGLIKELIHOOD

The type of the None singleton.

var NPMI

The type of the None singleton.

var PMI

The type of the None singleton.

var RATIO

The type of the None singleton.

class IndirectMeasure (*args, **kwds)
Expand source code
class IndirectMeasure(IntEnum):
    '''
    This enumeration follows the indirect confirm measure in the paper.
    '''

    NONE = 0
    COSINE = 1
    DICE = 2
    JACCARD = 3

This enumeration follows the indirect confirm measure in the paper.

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.ReprEnum
  • enum.Enum

Class variables

var COSINE

The type of the None singleton.

var DICE

The type of the None singleton.

var JACCARD

The type of the None singleton.

var NONE

The type of the None singleton.

class ProbEstimation (*args, **kwds)
Expand source code
class ProbEstimation(IntEnum):
    '''
    This enumeration follows the probability estimator in the paper.
    '''
    DOCUMENT = 1
    SLIDING_WINDOWS = 2

This enumeration follows the probability estimator in the paper.

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.ReprEnum
  • enum.Enum

Class variables

var DOCUMENT

The type of the None singleton.

var SLIDING_WINDOWS

The type of the None singleton.

class Segmentation (*args, **kwds)
Expand source code
class Segmentation(IntEnum):
    '''
    This enumeration follows the segmentation in the paper.
    '''

    ONE_ONE = 1
    ONE_PRE = 2
    ONE_SUC = 3
    ONE_ALL = 4
    ONE_SET = 5

This enumeration follows the segmentation in the paper.

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.ReprEnum
  • enum.Enum

Class variables

var ONE_ALL

The type of the None singleton.

var ONE_ONE

The type of the None singleton.

var ONE_PRE

The type of the None singleton.

var ONE_SET

The type of the None singleton.

var ONE_SUC

The type of the None singleton.