minimel.clean module

Filter anchor counts

minimel.clean.get_titles(indexdbfile, ents=None, names=None, language=None)

Get Wikipedia article titles

minimel.clean.steps(count)
minimel.clean.filter_steps(ent_count, cutoff=0.7)
minimel.clean.filter_counts_cutoff(s_e_count, cutoff=0.7)
minimel.clean.entropy(ent_count)
minimel.clean.countratio(ent_count)
minimel.clean.tokens(s, N=3)
minimel.clean.tokenscore(name, count, id_titles)

Calculate the mean token overlap between the name and the titles of its candidate entities (asymmetric Jaccard index).
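As a rough illustration of an asymmetric Jaccard overlap, the sketch below scores how many of a name's tokens also appear in each candidate title and averages the result. It is not the module's implementation: the helper names (char_ngrams, asymmetric_jaccard, mean_token_overlap) are hypothetical, and tokenizing into lowercase character trigrams with an unweighted mean are assumptions.

```python
def char_ngrams(s, n=3):
    """Return the set of lowercase character n-grams of s (assumed tokenization)."""
    s = s.lower()
    if len(s) < n:
        # Strings shorter than n form a single token; empty strings have none.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def asymmetric_jaccard(name, title, n=3):
    """Fraction of the name's n-grams that also occur in the title."""
    a, b = char_ngrams(name, n), char_ngrams(title, n)
    return len(a & b) / len(a) if a else 0.0


def mean_token_overlap(name, titles, n=3):
    """Unweighted mean of the asymmetric overlap across candidate titles."""
    if not titles:
        return 0.0
    return sum(asymmetric_jaccard(name, t, n) for t in titles) / len(titles)
```

The score is asymmetric because it is normalized by the name's token set only: a short name fully contained in a long title scores 1.0, while the reverse direction scores much lower.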

minimel.clean.cluster(name_scores, score_threshold)
minimel.clean.clean(indexdbfile: Path, disambigfile: Path, countfile: Path, namecountfile: Path | None = None, *, outfile: Path | None = None, stem: str | None = None, freqnorm: bool = False, badentfile: Path | None = None, min_count: int = 2, tokenscore_threshold: float = 0.1, entropy_threshold: float = 1.0, countratio_threshold: float = 0.5, quantile_top_shadowed: float | None = None, cluster_threshold: float | None = None)

Filter anchor counts (given their candidate entity counts).

First, only keep ambiguous candidate entities that either reach the minimal count (min_count) or are linked from disambiguation pages. Then, among names whose tokenscore is low, remove those with high entropy or a high countratio (len / sum).
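The second filtering step can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the module's code: the function names are hypothetical, the entropy is assumed to be Shannon entropy in bits, the strict comparisons against the thresholds are assumed, and the entity IDs in the usage note are made up. Only the countratio definition (len / sum) and the default threshold values come from the documentation above.

```python
import math


def count_entropy(ent_count):
    """Shannon entropy (base 2, an assumption) of the candidate count distribution."""
    total = sum(ent_count.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in ent_count.values() if c > 0
    )


def count_ratio(ent_count):
    """Number of candidate entities divided by the sum of their counts (len / sum)."""
    total = sum(ent_count.values())
    return len(ent_count) / total if total else 0.0


def should_remove(ent_count, token_score, tokenscore_threshold=0.1,
                  entropy_threshold=1.0, countratio_threshold=0.5):
    """Flag a name for removal: low token score AND (high entropy OR high ratio)."""
    if token_score >= tokenscore_threshold:
        return False  # the name resembles its candidate titles, so keep it
    return (count_entropy(ent_count) > entropy_threshold
            or count_ratio(ent_count) > countratio_threshold)
```

For instance, a name with counts {"Q1": 98, "Q2": 2} has a peaked distribution and a small ratio, so it would be kept even with a token score of 0, whereas {"Q1": 1, "Q2": 1, "Q3": 1} is flat and rarely linked, so it would be removed.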

Parameters:
Keyword Arguments:
  • outfile – Output file or directory (default: clean.json)

  • stem – Stemming language ISO 639-1 (2-letter) code

  • min_count – Minimal candidate entity count

  • freqnorm – Normalize counts by total entity frequency

  • badentfile – File of entity IDs to ignore, one per line

  • tokenscore_threshold – Threshold for the mean asymmetric Jaccard index between the name and candidate entity labels

  • entropy_threshold – Entropy threshold (high entropy = flat dist)

  • countratio_threshold – Count-ratio (len / sum) threshold

  • quantile_top_shadowed – Only train models for the given quantile of names with the highest counts of candidate entities shadowed by the top candidate