minimel.experiment module
- minimel.experiment.find(directory, glob)
- minimel.experiment.sweep(**kw)
- minimel.experiment.make_dir_params(name, **params)
- minimel.experiment.experiment(root: Path = PosixPath('.'), *, outdir: Path | None = None, nparts: int = 100, head: int | None = None, split: List[int] = (None,), fold: List[int] = (None,), stem: List[str] = ('',), min_count: List[int] = (2,), freqnorm: List[bool] = (False,), badentfile: List[Path] = ('',), tokenscore_threshold: List[float] = (0.1,), entropy_threshold: List[float] = (1.0,), countratio_threshold: List[float] = (0.5,), quantile_top_shadowed: List[float] = (0,), cluster_threshold: List[float] = (None,), vectorizer: List[Path] = ('',), ent_feats_csv: List[Path] = ('',), balanced: List[bool] = (False,), usenil: List[bool] = (False,), bits: List[int] = (20,), runfile: List[Path] = ('',), use_fallback: List[bool] = (True,), also_baseline: bool = True, evaluate: bool = False, evaluate_per_name: bool = False)
Run all steps to train and evaluate EL models over a parameter sweep.
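Most keyword arguments in the signature are list-valued, which suggests that each combination of values defines one run of the sweep. The sketch below shows one way such an expansion could work, assuming a plain Cartesian product. `expand_sweep` is a hypothetical helper for illustration, not the implementation of `minimel.experiment.sweep`, whose behaviour is not described here.

```python
from itertools import product


def expand_sweep(**params):
    """Yield one dict per combination of the list-valued parameters.

    Hypothetical helper: assumes the sweep is a Cartesian product.
    """
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))


# Two values for min_count times two for entropy_threshold gives four runs.
grid = list(
    expand_sweep(min_count=(2, 5), entropy_threshold=(0.5, 1.0), bits=(20,))
)
for run in grid:
    print(run)
```

Under this assumption the number of runs is the product of the list lengths, so adding values to several parameters at once grows the sweep quickly.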
The root directory must contain the following files:
index_*.dawg: DAWG trie mapping of article names -> numeric IDs
*-disambig.txt: See disambig_ent_file in get_disambig
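Before starting a long run, it can help to confirm that the root directory contains the required inputs. A minimal sketch using only the two file patterns listed above; `missing_inputs` is a hypothetical helper and not part of minimel.

```python
from pathlib import Path

# File patterns taken from the list of required root-directory files.
REQUIRED_PATTERNS = ("index_*.dawg", "*-disambig.txt")


def missing_inputs(root):
    """Return the required patterns that match no file in `root`."""
    root = Path(root)
    return [
        pattern
        for pattern in REQUIRED_PATTERNS
        if not any(root.glob(pattern))
    ]


if __name__ == "__main__":
    missing = missing_inputs(".")
    if missing:
        print("Missing required files matching:", ", ".join(missing))
```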
- Parameters:
- Keyword Arguments:
outdir – Write outputs to this directory
nparts – Number of parts to chunk wikidump into
head – Use only N first lines from each partition
split – Split the data into several parts
fold – Ignore this fold of the split data in training, use in evaluation
stem – Stemming language ISO 639-1 (2-letter) code (use X for no stemming)
min_count – Minimum (anchor-text, target) occurrence count
freqnorm – Normalize counts by total entity frequency (1/0)
badentfile – File of entity IDs to ignore, one per line (default: *-disambig.txt)
tokenscore_threshold – Threshold for mean asymmetric Jaccard index between name and candidate entity labels
entropy_threshold – Entropy threshold (high entropy = flat dist)
countratio_threshold – Count-ratio (len / sum) threshold
quantile_top_shadowed – Only train models for the given quantile of names with the highest counts of candidate entities shadowed by the top candidate
cluster_threshold – Cluster names based on their meanings
vectorizer – Scikit-learn vectorizer .pickle or Fasttext .bin word embeddings. If unset, use tokens directly.
ent_feats_csv – CSV of (ent_id,space separated feat list) entity features
balanced – Use balanced training
usenil – Use NIL option for training unlinked mentions
bits – Number of bits of the Vowpal Wabbit feature hash function
runfile – TSV rows of (ID, {name -> ID}, text) or ({name -> ID}, text)
use_fallback – Use raw counts as fallback
also_baseline – Also run a baseline model without model predictions
evaluate – Write evaluation scores to file
evaluate_per_name – Write evaluation scores per name to file
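To make the entropy_threshold and countratio_threshold descriptions concrete, the sketch below computes both statistics for the candidate-entity counts of a single name. It assumes Shannon entropy in bits and reads "len / sum" as the number of candidates divided by the total count. Both function names are hypothetical, and minimel's exact definitions (including the log base and the direction in which the thresholds filter) may differ.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a count distribution (assumed definition)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def count_ratio(counts):
    """Number of candidates divided by total count (assumed 'len / sum')."""
    return len(counts) / sum(counts)


# Hypothetical candidate counts for two names.
peaked = [98, 1, 1]       # one dominant candidate: low entropy
flat = [10, 10, 10, 10]   # evenly spread candidates: high entropy

for label, counts in (("peaked", peaked), ("flat", flat)):
    print(label, round(entropy(counts), 3), round(count_ratio(counts), 3))
```

A flat distribution over four candidates reaches the maximum of 2 bits, while the peaked one stays close to zero, which matches the "high entropy = flat dist" note above.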