minimel.clean module

Filter anchor counts

minimel.clean.get_titles(indexdbfile, ents=None, names=None, language=None): Get Wikipedia article titles

minimel.clean.steps(count)

minimel.clean.filter_steps(ent_count, cutoff=0.7)

minimel.clean.filter_counts_cutoff(s_e_count, cutoff=0.7)

minimel.clean.entropy(ent_count)

minimel.clean.countratio(ent_count)

minimel.clean.tokens(s, N=3)

minimel.clean.tokenscore(name, count, id_titles): Calculate the mean token overlap between the name and the titles of its candidate entities (asymmetric Jaccard index)
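
To illustrate what a mean asymmetric token overlap can look like, here is a minimal sketch. It is not the minimel implementation. It assumes that tokens are lowercase character n-grams (suggested by the N=3 default of tokens, but not confirmed here), that the asymmetric index is the share of the name's tokens that also occur in a title, and that the mean is unweighted, so the count argument is ignored. The helper names char_ngrams, asymmetric_jaccard and mean_overlap are hypothetical.

```python
def char_ngrams(s, n=3):
    """Return the set of lowercase character n-grams of s (assumed tokenizer)."""
    s = s.lower()
    if len(s) <= n:
        # Short strings are kept as a single token.
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def asymmetric_jaccard(name, title, n=3):
    """Share of the name's n-grams that also occur in the title.

    The score is asymmetric: swapping name and title changes the denominator.
    """
    a, b = char_ngrams(name, n), char_ngrams(title, n)
    return len(a & b) / len(a) if a else 0.0


def mean_overlap(name, titles, n=3):
    """Unweighted mean of the asymmetric overlap over all candidate titles."""
    if not titles:
        return 0.0
    return sum(asymmetric_jaccard(name, t, n) for t in titles) / len(titles)
```

Under these assumptions, a short name fully contained in a longer title scores 1.0, while the reverse direction scores lower, which is why the measure is called asymmetric.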

minimel.clean.cluster(name_scores, score_threshold)

minimel.clean.clean(indexdbfile: Path, disambigfile: Path, countfile: Path, namecountfile: Path | None = None, *, outfile: Path | None = None, stem: str | None = None, freqnorm: bool = False, badentfile: Path | None = None, min_count: int = 2, tokenscore_threshold: float = 0.1, entropy_threshold: float = 1.0, countratio_threshold: float = 0.5, quantile_top_shadowed: float | None = None, cluster_threshold: float | None = None)

Filter anchor counts (given their candidate entity counts).

First, only candidate entities of ambiguous names are kept that either reach the minimal count (min_count) or are linked from disambiguation pages. Then, among names with a low tokenscore, those with high entropy or a high countratio (len / sum) are removed.
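
The first step can be sketched as follows. This is an illustration under stated assumptions, not the library's code. The counts use the {word: {Q_ent: count}} layout documented for countfile, but the layout of the disambiguation data ({name: [entity IDs]}) is a guess, and keep_candidates is a hypothetical helper. The entity IDs are made up.

```python
def keep_candidates(name_ent_count, disambig, min_count=2):
    """Keep entities that reach min_count or are linked from a disambiguation page.

    name_ent_count: {name: {entity_id: count}}
    disambig: {name: [entity_id, ...]}  (assumed layout)
    """
    kept = {}
    for name, ent_count in name_ent_count.items():
        linked = set(disambig.get(name, ()))
        filtered = {
            ent: count
            for ent, count in ent_count.items()
            if count >= min_count or ent in linked
        }
        if filtered:
            # Names that lose all their candidates are dropped entirely.
            kept[name] = filtered
    return kept


counts = {"Paris": {"Q1": 120, "Q2": 1, "Q3": 1}}
disambig = {"Paris": ["Q2"]}
result = keep_candidates(counts, disambig, min_count=2)
```

Here Q1 survives because of its count, Q2 survives because it is listed on the (assumed) disambiguation page despite its low count, and Q3 is dropped.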

Parameters:

indexdbfile (Path) – Wikimapper index sqlite3 database
disambigfile (Path) – Disambiguation JSON file
countfile (Path) – Hyperlink anchor count {word: {Q_ent: count}} JSON file
namecountfile (Optional[Path]) – Counts of names (regardless of hyperlinks)
outfile (Optional[Path])
stem (Optional[str])
freqnorm (bool)
badentfile (Optional[Path])
min_count (int)
tokenscore_threshold (float)
entropy_threshold (float)
countratio_threshold (float)
quantile_top_shadowed (Optional[float])
cluster_threshold (Optional[float])

Keyword Arguments:

outfile – Output file or directory (default: clean.json)
stem – Stemming language ISO 639-1 (2-letter) code
min_count – Minimal candidate entity count
freqnorm – Normalize counts by total entity frequency
badentfile – File of entity IDs to ignore, one per line
tokenscore_threshold – Threshold for mean asymmetric Jaccard index between name and candidate entity labels
entropy_threshold – Entropy threshold (high entropy = flat dist)
countratio_threshold – Count-ratio (len / sum) threshold
quantile_top_shadowed – Only train models for the given percentage of names with the highest counts of candidate entities shadowed by the top candidate
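
The two distribution measures behind entropy_threshold and countratio_threshold can be sketched as below. This is a rough illustration, not the minimel source. It assumes Shannon entropy in base 2 (the logarithm base is not documented here) and reads "len / sum" as the number of candidate entities divided by the total count. The function names shannon_entropy and count_ratio are hypothetical.

```python
import math


def shannon_entropy(ent_count):
    """Entropy (assumed base 2) of the count distribution over candidate entities."""
    total = sum(ent_count.values())
    if total <= 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total)
        for c in ent_count.values()
        if c > 0
    )


def count_ratio(ent_count):
    """Number of candidate entities divided by the sum of their counts."""
    total = sum(ent_count.values())
    return len(ent_count) / total if total else 0.0
```

Under these assumptions a flat distribution such as {"Q1": 5, "Q2": 5} has high entropy, and a name whose candidates were each seen only once has a count ratio of 1.0. Both patterns suggest an unreliable anchor.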