Command Line Interface

usage: minimel [-h] [--verbose] [--slurm]
               {prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
               ...

positional arguments:
  {prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
    prepare             Download required files and make indices
    index               Make an efficient DAWG trie index from a Wikimapper sqlite file
    xml-db              Make a name database from Wikidump page ids
    query-pages         Query the Wikidata API to get disambiguation pages (and list pages, if requested)
    get-disambig        Get disambiguation links.
    get-paragraphs      Extract hyperlinks from Wikipedia dumps.
    count               Count targets per anchor text in Wikipedia paragraphs.
    count-names         Count anchor texts in Wikipedia paragraphs.
    clean               Filter anchor counts (given their candidate entity counts).
    vectorize           Vectorize paragraph text dataset into Vowpal Wabbit format
    ent-feats           Extract entity features from parquet triples
    train               Train Logistic Regression models
    run                 Perform entity disambiguation
    evaluate            Evaluate predictions
    experiment          Run all steps to train and evaluate EL models over a parameter sweep.
    audit               Print prediction scores and model coefficients

options:
  -h, --help            show this help message and exit
  --verbose, -v         Verbosity (use -vv for debug messages)
  --slurm, -s           Use Slurm

prepare

usage: minimel prepare [-h] [-r ROOTDIR] [-m MIRROR]
                       [-o | --overwrite | --no-overwrite] [-n NPARTS]
                       [-i | --index-only | --no-index-only]
                       [-c CUSTOM_LANGCODE]
                       wikiname version

Download required files and make indices

positional arguments:
  wikiname              Wikipedia edition name (e.g. "simplewiki")
  version               Wikipedia version (e.g. "latest")

options:
  -h, --help            show this help message and exit
  -r ROOTDIR, --rootdir ROOTDIR
                        Root directory
                        (default: None)
  -m MIRROR, --mirror MIRROR
                        Wikimedia mirror
                        (default: https://dumps.wikimedia.org)
  -o, --overwrite, --no-overwrite
                        Whether to overwrite existing files
                        (default: False)
  -n NPARTS, --nparts NPARTS
                        Number of chunks to read
                        (default: 100)
  -i, --index-only, --no-index-only
                        Whether to only create the DAWG index
                        (default: False)
  -c CUSTOM_LANGCODE, --custom-langcode CUSTOM_LANGCODE
                        Custom language code (if different from wikiname, e.g. "en-simple")
                        (default: None)

index

usage: minimel index [-h] db_fname

Make an efficient DAWG trie index from a Wikimapper sqlite file

positional arguments:
  db_fname    Wikimapper SQLite3 index file

options:
  -h, --help  show this help message and exit

xml-db

usage: minimel xml-db [-h] [--ns NS] [--nparts NPARTS] wikidump

Make a name database from Wikidump page ids

positional arguments:
  wikidump         Wikipedia XML dump file

options:
  -h, --help       show this help message and exit
  --ns NS          Page Namespace
                   (default: 0)
  --nparts NPARTS  Number of chunks to read
                   (default: 100)

query-pages

usage: minimel query-pages [-h]
                           [-q | --query-listpages | --no-query-listpages]
                           [-o OUTFILE]
                           langcode

Query the Wikidata API to get disambiguation pages (and list pages, if requested)

Returns Wikidata Qids, one per line

positional arguments:
  langcode              Wikipedia language code

options:
  -h, --help            show this help message and exit
  -q, --query-listpages, --no-query-listpages
                        Whether to also query for list pages
                        (default: False)
  -o OUTFILE, --outfile OUTFILE
                        (default: None)

get-disambig

usage: minimel get-disambig [-h] [-d DISAMBIG_TEMPLATE] [-n NPARTS]
                            wikidump dawgfile [disambig_ent_file]

Get disambiguation links.

Writes disambig.json.

positional arguments:
  wikidump              Wikipedia XML dump file
  dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
  disambig_ent_file     Flat text file of disambiguation pages with one entity ID per line
                        (default: None)

options:
  -h, --help            show this help message and exit
  -d DISAMBIG_TEMPLATE, --disambig-template DISAMBIG_TEMPLATE
                        Use disambiguation pages that contain a template with this name
                        instead of disambig_ent_file (if disambig_ent_file is also given,
                        it is created)
                        (default: None)
  -n NPARTS, --nparts NPARTS
                        Number of chunks to read
                        (default: 1000)

get-paragraphs

usage: minimel get-paragraphs [-h] [-n NPARTS] wikidump dawgfile [skip ...]

Extract hyperlinks from Wikipedia dumps.

Writes to outdir.

positional arguments:
  wikidump              Wikipedia pages-articles XML dump file
  dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
  skip                  Skip pages with this prefix

options:
  -h, --help            show this help message and exit
  -n NPARTS, --nparts NPARTS
                        Number of chunks to read
                        (default: 1000)

count

usage: minimel count [-h] [-o OUTFILE] [-m MIN_COUNT] [--stem STEM]
                     [--head HEAD] [--split SPLIT] [-f FOLD]
                     paragraphlinks

Count targets per anchor text in Wikipedia paragraphs.

Writes count.min{min_count}[.stem-{LANG}].json

positional arguments:
  paragraphlinks        Directory of (pagetitle, links-json, paragraph) .tsv files

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Output file or directory (default: count.json)
                        (default: None)
  -m MIN_COUNT, --min-count MIN_COUNT
                        Minimal (anchor-text, target) occurrence
                        (default: 2)
  --stem STEM           Stemming language ISO 639-1 (2-letter) code
                        (default: None)
  --head HEAD           Use only N first lines from each partition
                        (default: None)
  --split SPLIT         Split the data into several parts
                        (default: None)
  -f FOLD, --fold FOLD  Ignore this fold of the split data
                        (default: None)
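
As a rough illustration of what this step computes, the Python sketch below tallies (anchor-text, target) pairs and drops pairs seen fewer than min_count times, producing the {word: {Q_ent: count}} shape that clean takes as its countfile. The function name count_targets, the in-memory input, the placeholder entity IDs and the inclusive threshold are all assumptions made for the example; the actual command reads partitioned .tsv files and may differ in detail.

```python
from collections import Counter, defaultdict


def count_targets(links, min_count=2):
    """Count targets per anchor text, keeping pairs seen at least min_count times.

    links: iterable of (anchor_text, target_id) pairs (hypothetical in-memory input).
    """
    pair_counts = Counter(links)
    result = defaultdict(dict)
    for (anchor, target), n in pair_counts.items():
        if n >= min_count:  # assumption: the threshold is inclusive
            result[anchor][target] = n
    return dict(result)


# Placeholder entity IDs, for illustration only.
links = [
    ("Paris", "Q1"), ("Paris", "Q1"), ("Paris", "Q1"),
    ("Paris", "Q2"), ("Paris", "Q2"),
    ("Paris", "Q3"),
]
counts = count_targets(links, min_count=2)
print(counts)
```

The pair ("Paris", "Q3") occurs only once, so it falls below the default threshold of 2 and is dropped.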

count-names

usage: minimel count-names [-h] [-o OUTFILE] [-s STEM] [--head HEAD]
                           paragraphlinks countfile

Count anchor texts in Wikipedia paragraphs.

positional arguments:
  paragraphlinks        Directory of (pagetitle, links-json, paragraph) .tsv files
  countfile             Hyperlink anchor count JSON file

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Output file or directory (default: name{countfile}[.stem-{LANG}].json)
                        (default: None)
  -s STEM, --stem STEM  Stemming language ISO 639-1 (2-letter) code
                        (default: None)
  --head HEAD           Use only N first lines from each partition
                        (default: None)

clean

usage: minimel clean [-h] [-o OUTFILE] [-s STEM]
                     [-f | --freqnorm | --no-freqnorm] [-b BADENTFILE]
                     [-m MIN_COUNT] [-t TOKENSCORE_THRESHOLD]
                     [-e ENTROPY_THRESHOLD]
                     [--countratio-threshold COUNTRATIO_THRESHOLD]
                     [-q QUANTILE_TOP_SHADOWED]
                     [--cluster-threshold CLUSTER_THRESHOLD]
                     indexdbfile disambigfile countfile [namecountfile]

Filter anchor counts (given their candidate entity counts).

First, only keep ambiguous candidate entities that either have minimal counts or are
linked from disambiguation pages.
If the tokenscore is low, then names with high entropy or countratio
(len / sum) are removed.

positional arguments:
  indexdbfile           Wikimapper index sqlite3 database
  disambigfile          Disambiguation JSON file
  countfile             Hyperlink anchor count {word: {Q_ent: count}} JSON file
  namecountfile         Counts of names (regardless of hyperlinks)
                        (default: None)

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Output file or directory (default: clean.json)
                        (default: None)
  -s STEM, --stem STEM  Stemming language ISO 639-1 (2-letter) code
                        (default: None)
  -f, --freqnorm, --no-freqnorm
                        Normalize counts by total entity frequency
                        (default: False)
  -b BADENTFILE, --badentfile BADENTFILE
                        File of entity IDs to ignore, one per line
                        (default: None)
  -m MIN_COUNT, --min-count MIN_COUNT
                        Minimal candidate entity count
                        (default: 2)
  -t TOKENSCORE_THRESHOLD, --tokenscore-threshold TOKENSCORE_THRESHOLD
                        Threshold for mean asymmetric Jaccard index
                        between name and candidate entity labels
                        (default: 0.1)
  -e ENTROPY_THRESHOLD, --entropy-threshold ENTROPY_THRESHOLD
                        Entropy threshold (high entropy = flat dist)
                        (default: 1.0)
  --countratio-threshold COUNTRATIO_THRESHOLD
                        Count-ratio (len / sum) threshold
                        (default: 0.5)
  -q QUANTILE_TOP_SHADOWED, --quantile-top-shadowed QUANTILE_TOP_SHADOWED
                        Only train models for the given % of names with the highest
                        counts of candidate entities shadowed by the top candidate
                        (default: None)
  --cluster-threshold CLUSTER_THRESHOLD
                        (default: None)
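
To make the entropy and count-ratio criteria concrete, here is a minimal Python sketch of both statistics for one name's candidate counts. The helper names entropy and countratio are hypothetical, the example counts are made up, and the natural-log base is an assumption; the source only states that high entropy means a flat distribution and that the count ratio is len / sum.

```python
import math


def entropy(counts):
    """Shannon entropy (natural log, an assumption) of the normalized counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def countratio(counts):
    """Number of candidate entities divided by the total count (len / sum)."""
    return len(counts) / sum(counts.values())


# Made-up candidate counts with placeholder entity IDs.
peaked = {"Q1": 98, "Q2": 1, "Q3": 1}  # one dominant candidate
flat = {"Q1": 2, "Q2": 2, "Q3": 2}     # evenly spread candidates

for label, counts in [("peaked", peaked), ("flat", flat)]:
    print(label, round(entropy(counts), 3), round(countratio(counts), 3))
```

Under these assumptions the flat distribution has the higher entropy and the higher count ratio, which is the kind of name the description above says is removed when its tokenscore is low.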

vectorize

usage: minimel vectorize [-h] [-o OUTFILE] [--head HEAD] [--stem STEM]
                         [-v VECTORIZER] [-e ENT_FEATS_CSV]
                         [-b | --balanced | --no-balanced]
                         [-u | --usenil | --no-usenil] [--split SPLIT]
                         [-f FOLD]
                         paragraphlinks name_count_json

Vectorize paragraph text dataset into Vowpal Wabbit format

positional arguments:
  paragraphlinks        Paragraph links directory
  name_count_json       Surfaceform count json file

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Output file or directory (default: vec*.parts)
                        (default: None)
  --head HEAD           Use only N first lines from each partition
                        (default: None)
  --stem STEM           Stemming language ISO 639-1 (2-letter) code
                        (default: None)
  -v VECTORIZER, --vectorizer VECTORIZER
                        Scikit-learn vectorizer .pickle or Fasttext .bin word
                        embeddings. If unset, use tokens directly.
                        (default: None)
  -e ENT_FEATS_CSV, --ent-feats-csv ENT_FEATS_CSV
                        CSV of (ent_id,space separated feat list) entity features
                        (default: None)
  -b, --balanced, --no-balanced
                        Use balanced training
                        (default: False)
  -u, --usenil, --no-usenil
                        Use NIL option
                        (default: False)
  --split SPLIT         Split the data into several parts
                        (default: None)
  -f FOLD, --fold FOLD  Ignore this fold of the split data
                        (default: None)
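
For readers unfamiliar with the Vowpal Wabbit text format, the sketch below formats one example as a label followed by a namespace of token features. It only illustrates the general "label |namespace features" layout; how minimel actually encodes labels, namespaces and entity features is not stated here, so the function names, the namespace "t" and the integer label are assumptions.

```python
import re


def vw_escape(token):
    """Replace characters that are special in VW's text format (whitespace, ':' and '|')."""
    return re.sub(r"[\s:|]", "_", token)


def to_vw_line(label, tokens, namespace="t"):
    """Format one example as '<label> |<namespace> tok1 tok2 ...'."""
    feats = " ".join(vw_escape(t) for t in tokens)
    return f"{label} |{namespace} {feats}"


line = to_vw_line(1, ["capital", "of", "France", "10:30"])
print(line)
```

Escaping matters because a raw ":" would be read as a feature-value separator and a raw "|" as the start of a new namespace.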

ent-feats

usage: minimel ent-feats [-h] [-p PART] spo_parquet anchor_json

Extract entity features from parquet triples

positional arguments:
  spo_parquet           Parquet triple file
  anchor_json           Anchor counts

options:
  -h, --help            show this help message and exit
  -p PART, --part PART  Filter part of features based on count
                        <1: Quantile of feature count
                        >1: Minimum feature count
                        (default: 1)

train

usage: minimel train [-h] [-o OUTFILE] [-b BITS] vec_file

Train Logistic Regression models

Writes

positional arguments:
  vec_file              Training data in Vowpal Wabbit format

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Output file or directory (default: model.b{bits}.vw)
                        (default: None)
  -b BITS, --bits BITS  Number of bits of the Vowpal Wabbit feature hash function
                        (default: 20)
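
The effect of --bits can be sketched as follows: with b bits, hashed features share a weight table of 2**b slots, so a larger b means fewer collisions at the cost of a larger model. The function below is a hypothetical stand-in that uses zlib.crc32; Vowpal Wabbit uses its own hash function, so the slot numbers will not match a real model.

```python
import zlib


def bucket(feature, bits=20):
    """Map a feature string to one of 2**bits weight slots.

    zlib.crc32 is a stand-in here; Vowpal Wabbit uses its own hash function.
    """
    return zlib.crc32(feature.encode("utf-8")) & ((1 << bits) - 1)


# Table size for a few settings of --bits.
for bits in (18, 20, 24):
    print(bits, 1 << bits)

# The same feature lands in a much smaller range when bits is small.
print(bucket("capital"), bucket("capital", bits=4))
```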

run

usage: minimel run [-h] [-o OUTFILE] [-v VECTORIZER]
                   [--ent-feats-csv ENT_FEATS_CSV] [-l LANG]
                   [--fallback FALLBACK] [--evaluate | --no-evaluate]
                   [--evalfile EVALFILE]
                   [--evalfile-per-name EVALFILE_PER_NAME]
                   [-p | --predict-only | --no-predict-only]
                   [-a | --all-scores | --no-all-scores]
                   [-u | --upperbound | --no-upperbound] [-s SPLIT]
                   [--fold FOLD]
                   dawgfile [candidatefile] [modelfile] [runfiles ...]

Perform entity disambiguation

positional arguments:
  dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
  candidatefile         Candidate {name -> [ID]} json
                        (default: None)
  modelfile             Vowpal Wabbit model
                        (default: None)
  runfiles              Input file (- or absent for standard input). TSV rows of
                        (ID, {name -> ID}, text) or ({name -> ID}, text) or (text)

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --outfile OUTFILE
                        Write outputs to file (default: stdout)
                        (default: None)
  -v VECTORIZER, --vectorizer VECTORIZER
                        Scikit-learn vectorizer .pickle or Fasttext .bin word
                        embeddings. If unset, use HashingVectorizer.
                        (default: None)
  --ent-feats-csv ENT_FEATS_CSV
                        CSV of (ent_id,space separated feat list) entity features
                        (default: None)
  -l LANG, --lang LANG  (default: None)
  --fallback FALLBACK   Additional fallback deterministic name -> ID json
                        (default: None)
  --evaluate, --no-evaluate
                        Report evaluation scores instead of predictions
                        (default: False)
  --evalfile EVALFILE   Write evaluation results to file
                        (default: None)
  --evalfile-per-name EVALFILE_PER_NAME
                        Write evaluation results per name to file
                        (default: None)
  -p, --predict-only, --no-predict-only
                        Only print predictions, not original text
                        (default: True)
  -a, --all-scores, --no-all-scores
                        Output all candidate scores
                        (default: False)
  -u, --upperbound, --no-upperbound
                        Create upper bound on performance
                        (default: False)
  -s SPLIT, --split SPLIT
                        Split the data into several parts
                        (default: None)
  --fold FOLD           Use only this fold of the split data
                        (default: None)
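
The three accepted row shapes for runfiles can be illustrated with a short Python sketch that builds tab-separated rows. The helper make_row is hypothetical, and two details are assumptions: that the {name -> ID} mapping is serialized as JSON, and that IDs are written as Q-prefixed strings.

```python
import json


def make_row(text, names=None, doc_id=None):
    """Build one TSV row: (ID, {name -> ID}, text), ({name -> ID}, text) or (text)."""
    if doc_id is not None and names is None:
        raise ValueError("a row with an ID also needs a name mapping")
    fields = []
    if doc_id is not None:
        fields.append(str(doc_id))
    if names is not None:
        fields.append(json.dumps(names))  # assumption: the mapping is JSON-encoded
    # Tabs and newlines inside the text would break the row layout.
    fields.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(fields)


text = "Paris is the capital of France."
rows = [
    make_row(text, {"Paris": "Q90"}, "doc1"),  # (ID, {name -> ID}, text)
    make_row(text, {"Paris": "Q90"}),          # ({name -> ID}, text)
    make_row(text),                            # (text)
]
print("\n".join(rows))
```

Rows like these could be saved to a file and passed as runfiles, or piped to standard input.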

evaluate

usage: minimel evaluate [-h] [-a [AGG ...]] [-e EVALFILE]
                        goldfile [predfiles ...]

Evaluate predictions

positional arguments:
  goldfile
  predfiles

options:
  -h, --help            show this help message and exit
  -a [AGG ...], --agg [AGG ...]
                        Aggregation jsons (TODO: depend on data...?)
                        (default: ())
  -e EVALFILE, --evalfile EVALFILE
                        Write evaluation results to file
                        (default: None)

experiment

usage: minimel experiment [-h] [-o OUTDIR] [-n NPARTS] [--head HEAD]
                          [--split [SPLIT ...]] [--fold [FOLD ...]]
                          [--stem [STEM ...]] [-m [MIN_COUNT ...]]
                          [--freqnorm [FREQNORM ...]]
                          [--badentfile [BADENTFILE ...]]
                          [-t [TOKENSCORE_THRESHOLD ...]]
                          [--entropy-threshold [ENTROPY_THRESHOLD ...]]
                          [--countratio-threshold [COUNTRATIO_THRESHOLD ...]]
                          [-q [QUANTILE_TOP_SHADOWED ...]]
                          [--cluster-threshold [CLUSTER_THRESHOLD ...]]
                          [-v [VECTORIZER ...]]
                          [--ent-feats-csv [ENT_FEATS_CSV ...]]
                          [--balanced [BALANCED ...]] [--usenil [USENIL ...]]
                          [--bits [BITS ...]] [-r [RUNFILE ...]]
                          [--use-fallback [USE_FALLBACK ...]]
                          [-a | --also-baseline | --no-also-baseline]
                          [--evaluate | --no-evaluate]
                          [--evaluate-per-name | --no-evaluate-per-name]
                          [root]

Run all steps to train and evaluate EL models over a parameter sweep.

The root directory must contain the following files:

- index_*.dawg: DAWG trie mapping of article names -> numeric IDs

- *-disambig.txt: See disambig_ent_file in get-disambig

positional arguments:
  root                  Root directory
                        (default: .)

options:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Write outputs to this directory
                        (default: None)
  -n NPARTS, --nparts NPARTS
                        Number of parts to chunk wikidump into
                        (default: 100)
  --head HEAD           Use only N first lines from each partition
                        (default: None)
  --split [SPLIT ...]   Split the data into several parts
                        (default: (None,))
  --fold [FOLD ...]     Ignore this fold of the split data in training, use in evaluation
                        (default: (None,))
  --stem [STEM ...]     Stemming language ISO 639-1 (2-letter) code (use X for no stemming)
                        (default: ('',))
  -m [MIN_COUNT ...], --min-count [MIN_COUNT ...]
                        Minimal (anchor-text, target) occurrence
                        (default: (2,))
  --freqnorm [FREQNORM ...]
                        Normalize counts by total entity frequency (1/0)
                        (default: (False,))
  --badentfile [BADENTFILE ...]
                        File of entity IDs to ignore, one per line (default: *-disambig.txt)
                        (default: ('',))
  -t [TOKENSCORE_THRESHOLD ...], --tokenscore-threshold [TOKENSCORE_THRESHOLD ...]
                        Threshold for mean asymmetric Jaccard index
                        between name and candidate entity labels
                        (default: (0.1,))
  --entropy-threshold [ENTROPY_THRESHOLD ...]
                        Entropy threshold (high entropy = flat dist)
                        (default: (1.0,))
  --countratio-threshold [COUNTRATIO_THRESHOLD ...]
                        Count-ratio (len / sum) threshold
                        (default: (0.5,))
  -q [QUANTILE_TOP_SHADOWED ...], --quantile-top-shadowed [QUANTILE_TOP_SHADOWED ...]
                        Only train models for the given % of names with the highest
                        counts of candidate entities shadowed by the top candidate
                        (default: (0,))
  --cluster-threshold [CLUSTER_THRESHOLD ...]
                        Cluster names based on their meanings
                        (default: (None,))
  -v [VECTORIZER ...], --vectorizer [VECTORIZER ...]
                        Scikit-learn vectorizer .pickle or Fasttext .bin word
                        embeddings. If unset, use tokens directly.
                        (default: ('',))
  --ent-feats-csv [ENT_FEATS_CSV ...]
                        CSV of (ent_id,space separated feat list) entity features
                        (default: ('',))
  --balanced [BALANCED ...]
                        Use balanced training
                        (default: (False,))
  --usenil [USENIL ...]
                        Use NIL option for training unlinked mentions
                        (default: (False,))
  --bits [BITS ...]     Number of bits of the Vowpal Wabbit feature hash function
                        (default: (20,))
  -r [RUNFILE ...], --runfile [RUNFILE ...]
                        TSV rows of (ID, {name -> ID}, text) or ({name -> ID}, text)
                        (default: ('',))
  --use-fallback [USE_FALLBACK ...]
                        Use raw counts as fallback
                        (default: (True,))
  -a, --also-baseline, --no-also-baseline
                        Also run a baseline model without model predictions
                        (default: True)
  --evaluate, --no-evaluate
                        Write evaluation scores to file
                        (default: False)
  --evaluate-per-name, --no-evaluate-per-name
                        Write evaluation scores per name to file
                        (default: False)
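
Since most experiment options accept several values, one plausible reading of "parameter sweep" is that every combination of the given values is run. The sketch below enumerates such a grid with itertools.product. Treating the sweep as a full Cartesian product is an assumption, not something the help text confirms, and the dictionary keys simply mirror three of the option names above.

```python
from itertools import product

# Hypothetical sweep: two values for --min-count and --bits, one for the entropy threshold.
grid = {
    "min_count": (2, 5),
    "entropy_threshold": (1.0,),
    "bits": (20, 24),
}

# One configuration per combination of values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
for cfg in configs:
    print(cfg)
```

Under this reading the number of runs is the product of the number of values per option, so the grid above yields 2 x 1 x 2 = 4 configurations.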

audit

usage: minimel audit [-h] modelfile datafile name [limit]

Print prediction scores and model coefficients

positional arguments:
  modelfile   Model
  datafile    VW format vectorized data
  name
  limit       (default: 1000)

options:
  -h, --help  show this help message and exit