minimel.experiment module

class minimel.experiment.log_time(fname)

Bases: object

minimel.experiment.find(directory, glob)

minimel.experiment.sweep(**kw)

minimel.experiment.make_dir_params(name, **params)

minimel.experiment.get_dir_params(dirname: Path)

Parameters: dirname (Path)

minimel.experiment.experiment(root: Path = PosixPath('.'), *, outdir: Path | None = None, nparts: int = 100, head: int | None = None, split: List[int] = (None,), fold: List[int] = (None,), stem: List[str] = ('',), min_count: List[int] = (2,), freqnorm: List[bool] = (False,), badentfile: List[Path] = ('',), tokenscore_threshold: List[float] = (0.1,), entropy_threshold: List[float] = (1.0,), countratio_threshold: List[float] = (0.5,), quantile_top_shadowed: List[float] = (0,), cluster_threshold: List[float] = (None,), vectorizer: List[Path] = ('',), ent_feats_csv: List[Path] = ('',), balanced: List[bool] = (False,), usenil: List[bool] = (False,), bits: List[int] = (20,), runfile: List[Path] = ('',), use_fallback: List[bool] = (True,), also_baseline: bool = True, evaluate: bool = False, evaluate_per_name: bool = False)

Run all steps to train and evaluate EL models over a parameter sweep.
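Most parameters in the signature above take a list of values. The following is a minimal sketch of what a sweep over such lists could look like, assuming it means one run per combination (Cartesian product) of the listed values. `expand_grid` is a hypothetical helper written for illustration; it is not part of minimel, and the actual behavior of `minimel.experiment.sweep` is not documented here.

```python
from itertools import product


def expand_grid(**kw):
    """Yield one dict per combination of the list-valued keyword arguments.

    Hypothetical helper: assumes a sweep is a plain Cartesian product.
    """
    names = list(kw)
    for values in product(*(kw[name] for name in names)):
        yield dict(zip(names, values))


# Two values for min_count and two for entropy_threshold give four settings.
grid = list(expand_grid(min_count=[2, 5], entropy_threshold=[0.5, 1.0], stem=["en"]))
for setting in grid:
    print(setting)
```

Under this assumption, the number of runs grows multiplicatively with each list that holds more than one value, so it helps to keep most lists at a single element.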

The root directory must contain the following files:

index_*.dawg: DAWG trie mapping of article names -> numeric IDs
*-disambig.txt: See disambig_ent_file in get_disambig
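Before starting a long run, it may be worth checking that the root directory actually contains these files. A minimal sketch using only the standard library; `missing_inputs` is a hypothetical helper, not a minimel function, and it checks only that at least one file matches each pattern, not that the file contents are valid.

```python
from pathlib import Path


def missing_inputs(root: Path) -> list:
    """Return the required glob patterns that match no file in `root`."""
    required = ["index_*.dawg", "*-disambig.txt"]
    return [pattern for pattern in required if not any(root.glob(pattern))]


# Report which expected inputs are absent from the current directory.
print(missing_inputs(Path(".")))
```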

Parameters:

root (Path) – Root directory
outdir (Optional[Path])
nparts (int)
head (Optional[int])
split (List[int])
fold (List[int])
stem (List[str])
min_count (List[int])
freqnorm (List[bool])
badentfile (List[Path])
tokenscore_threshold (List[float])
entropy_threshold (List[float])
countratio_threshold (List[float])
quantile_top_shadowed (List[float])
cluster_threshold (List[float])
vectorizer (List[Path])
ent_feats_csv (List[Path])
balanced (List[bool])
usenil (List[bool])
bits (List[int])
runfile (List[Path])
use_fallback (List[bool])
also_baseline (bool)
evaluate (bool)
evaluate_per_name (bool)

Keyword Arguments:

outdir – Write outputs to this directory
nparts – Number of parts to chunk wikidump into
head – Use only the first N lines of each partition
split – Split the data into several parts
fold – Ignore this fold of the split data in training, use in evaluation
stem – Stemming language ISO 639-1 (2-letter) code (use X for no stemming)
min_count – Minimum number of (anchor-text, target) occurrences
freqnorm – Normalize counts by total entity frequency (1/0)
badentfile – File of entity IDs to ignore, one per line (default: *-disambig.txt)
tokenscore_threshold – Threshold for mean asymmetric Jaccard index between name and candidate entity labels
entropy_threshold – Entropy threshold (high entropy = flat distribution)
countratio_threshold – Count-ratio (len / sum) threshold
quantile_top_shadowed – Only train models for this quantile of names with the highest counts of candidate entities shadowed by the top candidate
cluster_threshold – Cluster names based on their meanings
vectorizer – Scikit-learn vectorizer .pickle or Fasttext .bin word embeddings. If unset, use tokens directly.
ent_feats_csv – CSV of (ent_id,space separated feat list) entity features
balanced – Use balanced training
usenil – Use NIL option for training unlinked mentions
bits – Number of bits of the Vowpal Wabbit feature hash function
runfile – TSV rows of (ID, {name -> ID}, text) or ({name -> ID}, text)
use_fallback – Use raw counts as fallback
also_baseline – Also run a baseline model without model predictions
evaluate – Write evaluation scores to file
evaluate_per_name – Write evaluation scores per name to file
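To make the entropy_threshold and countratio_threshold descriptions concrete, here is a sketch of the two statistics for the candidate-entity counts of a single name. It rests on assumptions: the entropy is taken in base 2 over the normalized counts, and the count ratio is read literally as the number of candidates divided by the total count ("len / sum"). The function names and the example counts are hypothetical, and how minimel compares these values against the thresholds is not stated above.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2, an assumption) of the normalized counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def count_ratio(counts):
    """Number of distinct candidates divided by the total count (len / sum)."""
    return len(counts) / sum(counts.values())


# Hypothetical link counts per candidate entity for one name.
flat = {"Q1": 8, "Q2": 8}      # two equally likely candidates
skewed = {"Q1": 15, "Q2": 1}   # one dominant candidate

print(entropy(flat), count_ratio(flat))
print(entropy(skewed), count_ratio(skewed))
```

The flat distribution has the higher entropy of the two, which matches the "high entropy = flat distribution" note on entropy_threshold.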