minimel.experiment module

class minimel.experiment.log_time(fname)

Bases: object

minimel.experiment.find(directory, glob)
minimel.experiment.sweep(**kw)
minimel.experiment.make_dir_params(name, **params)
minimel.experiment.get_dir_params(dirname: Path)
Parameters:

dirname (Path)

minimel.experiment.experiment(root: Path = PosixPath('.'), *, outdir: Path | None = None, nparts: int = 100, head: int | None = None, split: List[int] = (None,), fold: List[int] = (None,), stem: List[str] = ('',), min_count: List[int] = (2,), freqnorm: List[bool] = (False,), badentfile: List[Path] = ('',), tokenscore_threshold: List[float] = (0.1,), entropy_threshold: List[float] = (1.0,), countratio_threshold: List[float] = (0.5,), quantile_top_shadowed: List[float] = (0,), cluster_threshold: List[float] = (None,), vectorizer: List[Path] = ('',), ent_feats_csv: List[Path] = ('',), balanced: List[bool] = (False,), usenil: List[bool] = (False,), bits: List[int] = (20,), runfile: List[Path] = ('',), use_fallback: List[bool] = (True,), also_baseline: bool = True, evaluate: bool = False, evaluate_per_name: bool = False)

Run all steps to train and evaluate EL models over a parameter sweep.
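
Most keyword arguments below accept a list of values. As a minimal sketch of what a parameter sweep over such lists can look like, assuming one run per combination of values, the grid can be expanded with itertools.product. The helper name expand_grid is hypothetical and not part of minimel; the actual sweep implementation may differ.

```python
from itertools import product


def expand_grid(**kw):
    """Yield one dict per combination of the list-valued keyword arguments.

    Hypothetical helper for illustration only; not part of minimel.
    """
    names = list(kw)
    for values in product(*(kw[name] for name in names)):
        yield dict(zip(names, values))


# Two values for min_count and two for bits give four configurations.
configs = list(expand_grid(min_count=[2, 5], bits=[20, 24]))
for config in configs:
    print(config)
```

With single-element lists, as in the defaults of the signature above, such a grid contains exactly one configuration.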

The root directory must contain the following files:

  • index_*.dawg: DAWG trie mapping of article names -> numeric IDs

  • *-disambig.txt: See disambig_ent_file in get_disambig
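
Before starting a long run, it may help to confirm that the root directory holds both kinds of files. The sketch below uses only pathlib; the function missing_inputs and the example file names are made up for illustration and are not part of minimel.

```python
import tempfile
from pathlib import Path

# Glob patterns taken from the list of required files above.
REQUIRED_PATTERNS = ("index_*.dawg", "*-disambig.txt")


def missing_inputs(root):
    """Return the required glob patterns that match no file in root."""
    root = Path(root)
    return [pattern for pattern in REQUIRED_PATTERNS if not any(root.glob(pattern))]


# Demonstration in a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index_example.dawg").touch()
    before = missing_inputs(tmp)  # the disambiguation file is still missing
    (Path(tmp) / "example-disambig.txt").touch()
    after = missing_inputs(tmp)   # both patterns now match

print(before, after)
```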

Parameters:
Keyword Arguments:
  • outdir – Write outputs to this directory

  • nparts – Number of parts to chunk wikidump into

  • head – Use only the first N lines of each partition

  • split – Split the data into several parts

  • fold – Ignore this fold of the split data in training, use in evaluation

  • stem – Stemming language ISO 639-1 (2-letter) code (use X for no stemming)

  • min_count – Minimum (anchor-text, target) occurrence count

  • freqnorm – Normalize counts by total entity frequency (1/0)

  • badentfile – File of entity IDs to ignore, one per line (default: *-disambig.txt)

  • tokenscore_threshold – Threshold for mean asymmetric Jaccard index between name and candidate entity labels

  • entropy_threshold – Entropy threshold (high entropy = flat dist)

  • countratio_threshold – Count-ratio (len / sum) threshold

  • quantile_top_shadowed – Only train models for the given percentage of names with the highest counts of candidate entities shadowed by the top candidate

  • cluster_threshold – Cluster names based on their meanings

  • vectorizer – Scikit-learn vectorizer .pickle or Fasttext .bin word embeddings. If unset, use tokens directly.

  • ent_feats_csv – CSV of (ent_id,space separated feat list) entity features

  • balanced – Use balanced training

  • usenil – Use NIL option for training unlinked mentions

  • bits – Number of bits of the Vowpal Wabbit feature hash function

  • runfile – TSV rows of (ID, {name -> ID}, text) or ({name -> ID}, text)

  • use_fallback – Use raw counts as fallback

  • also_baseline – Also run a baseline model without model predictions

  • evaluate – Write evaluation scores to file

  • evaluate_per_name – Write evaluation scores per name to file
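
To make the entropy_threshold and countratio_threshold descriptions above more concrete, here is a minimal sketch of the two quantities for a single name, computed from a list of per-candidate counts. It assumes base-2 Shannon entropy over the normalized counts and reads "len / sum" as the number of candidates divided by the total count. Both readings, the function names, and the example counts are assumptions for illustration; how minimel computes the values and applies the thresholds is not shown here.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2, an assumption) of the normalized counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def count_ratio(counts):
    """Number of candidates divided by the total count (len / sum)."""
    return len(counts) / sum(counts)


# Hypothetical candidate counts for two names.
flat = [10, 10, 10, 10]   # evenly spread over four candidates
peaked = [37, 1, 1, 1]    # dominated by one candidate

print(entropy(flat), entropy(peaked))
print(count_ratio(flat), count_ratio(peaked))
```

A flat distribution gives the higher entropy, which matches the "high entropy = flat dist" note in the parameter description.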