Command Line Interface
usage: minimel [-h] [--verbose] [--slurm]
{prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
...
positional arguments:
{prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
prepare Download required files and make indices
index Make an efficient DAWG trie index from a Wikimapper sqlite file
xml-db Make a name database from Wikidump page ids
query-pages Query the Wikidata API for disambiguation pages (and list pages, if requested)
get-disambig Get disambiguation links.
get-paragraphs Extract hyperlinks from Wikipedia dumps.
count Count targets per anchor text in Wikipedia paragraphs.
count-names Count anchor texts in Wikipedia paragraphs.
clean Filter anchor counts (given their candidate entity counts).
vectorize Vectorize paragraph text dataset into Vowpal Wabbit format
ent-feats Extract entity features from parquet triples
train Train Logistic Regression models
run Perform entity disambiguation
evaluate Evaluate predictions
experiment Run all steps to train and evaluate EL models over a parameter sweep.
audit Print prediction scores and model coefficients
options:
-h, --help show this help message and exit
--verbose, -v Verbosity (use -vv for debug messages)
--slurm, -s Use Slurm
prepare
usage: minimel prepare [-h] [-r ROOTDIR] [-m MIRROR]
[-o | --overwrite | --no-overwrite] [-n NPARTS]
[-i | --index-only | --no-index-only]
[-c CUSTOM_LANGCODE]
wikiname version
Download required files and make indices
positional arguments:
wikiname Wikipedia edition name (e.g. "simplewiki")
version Wikipedia version (e.g. "latest")
options:
-h, --help show this help message and exit
-r ROOTDIR, --rootdir ROOTDIR
Root directory
(default: None)
-m MIRROR, --mirror MIRROR
Wikimedia mirror
(default: https://dumps.wikimedia.org)
-o, --overwrite, --no-overwrite
Whether to overwrite existing files
(default: False)
-n NPARTS, --nparts NPARTS
Number of chunks to read
(default: 100)
-i, --index-only, --no-index-only
Whether to only create the DAWG index
(default: False)
-c CUSTOM_LANGCODE, --custom-langcode CUSTOM_LANGCODE
Custom language code (if different from wikiname, e.g. "en-simple")
(default: None)
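As an illustration of how the positional arguments and options above combine, here is a minimal sketch that assembles a `minimel prepare` command line from Python. The helper name `prepare_command` and the directory name `wiki` are made up for this example; only flags listed in the usage text above are used.

```python
import shlex


def prepare_command(wikiname, version, rootdir=None, nparts=None, overwrite=False):
    """Build an argument list for `minimel prepare` (hypothetical helper)."""
    cmd = ["minimel", "prepare"]
    if rootdir is not None:
        cmd += ["--rootdir", rootdir]
    if nparts is not None:
        cmd += ["--nparts", str(nparts)]
    if overwrite:
        cmd.append("--overwrite")
    # Positional arguments come last: wikiname, then version.
    cmd += [wikiname, version]
    return cmd


cmd = prepare_command("simplewiki", "latest", rootdir="wiki", nparts=100)
print(shlex.join(cmd))
# prints: minimel prepare --rootdir wiki --nparts 100 simplewiki latest
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if minimel is installed.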
index
usage: minimel index [-h] db_fname
Make an efficient DAWG trie index from a Wikimapper sqlite file
positional arguments:
db_fname Wikimapper SQLite3 index file
options:
-h, --help show this help message and exit
xml-db
usage: minimel xml-db [-h] [--ns NS] [--nparts NPARTS] wikidump
Make a name database from Wikidump page ids
positional arguments:
wikidump Wikipedia XML dump file
options:
-h, --help show this help message and exit
--ns NS Page Namespace
(default: 0)
--nparts NPARTS Number of chunks to read
(default: 100)
query-pages
usage: minimel query-pages [-h]
[-q | --query-listpages | --no-query-listpages]
[-o OUTFILE]
langcode
Query the Wikidata API for disambiguation pages (and list pages, if requested)
Returns Wikidata QIDs, one per line
positional arguments:
langcode Wikipedia language code
options:
-h, --help show this help message and exit
-q, --query-listpages, --no-query-listpages
Whether to also query for list pages
(default: False)
-o OUTFILE, --outfile OUTFILE
(default: None)
get-disambig
usage: minimel get-disambig [-h] [-d DISAMBIG_TEMPLATE] [-n NPARTS]
wikidump dawgfile [disambig_ent_file]
Get disambiguation links.
Writes disambig.json.
positional arguments:
wikidump Wikipedia XML dump file
dawgfile DAWG trie file of Wikipedia > Wikidata mapping
disambig_ent_file Flat text file of disambiguation pages with one entity ID per line
(default: None)
options:
-h, --help show this help message and exit
-d DISAMBIG_TEMPLATE, --disambig-template DISAMBIG_TEMPLATE
Use disambiguation pages that contain a template with this name instead of disambig_ent_file (if disambig_ent_file is provided, create it)
(default: None)
-n NPARTS, --nparts NPARTS
Number of chunks to read
(default: 1000)
get-paragraphs
usage: minimel get-paragraphs [-h] [-n NPARTS] wikidump dawgfile [skip ...]
Extract hyperlinks from Wikipedia dumps.
Writes to outdir.
positional arguments:
wikidump Wikipedia pages-articles XML dump file
dawgfile DAWG trie file of Wikipedia > Wikidata mapping
skip Skip pages with this prefix
options:
-h, --help show this help message and exit
-n NPARTS, --nparts NPARTS
Number of chunks to read
(default: 1000)
count
usage: minimel count [-h] [-o OUTFILE] [-m MIN_COUNT] [--stem STEM]
[--head HEAD] [--split SPLIT] [-f FOLD]
paragraphlinks
Count targets per anchor text in Wikipedia paragraphs.
Writes count.min{min_count}[.stem-{LANG}].json
positional arguments:
paragraphlinks Directory of (pagetitle, links-json, paragraph) .tsv files
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
Output file or directory (default: count.json)
(default: None)
-m MIN_COUNT, --min-count MIN_COUNT
Minimal (anchor-text, target) occurrence
(default: 2)
--stem STEM Stemming language ISO 639-1 (2-letter) code
(default: None)
--head HEAD Use only the first N lines from each partition
(default: None)
--split SPLIT Split the data into several parts
(default: None)
-f FOLD, --fold FOLD Ignore this fold of the split data
(default: None)
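The counting step described for `count` can be sketched roughly as follows. This is not minimel's implementation: it assumes each links-json field encodes an `{anchor_text: target_id}` object, and the IDs `Q1` and `Q2` are placeholders. The output file name follows the documented pattern `count.min{min_count}[.stem-{LANG}].json`.

```python
import json
from collections import Counter, defaultdict


def count_targets(rows, min_count=2):
    """Count targets per anchor text; keep pairs seen at least min_count times.

    rows: iterable of (pagetitle, links_json, paragraph) tuples.
    Assumption: links_json encodes {anchor_text: target_id}.
    """
    pair_counts = Counter()
    for _title, links_json, _paragraph in rows:
        for anchor, target in json.loads(links_json).items():
            pair_counts[(anchor, target)] += 1
    counts = defaultdict(dict)
    for (anchor, target), n in pair_counts.items():
        if n >= min_count:
            counts[anchor][target] = n
    return dict(counts)


def default_outfile(min_count=2, stem=None):
    """Build the documented default name count.min{min_count}[.stem-{LANG}].json."""
    name = f"count.min{min_count}"
    if stem:
        name += f".stem-{stem}"
    return name + ".json"


rows = [
    ("Page A", '{"Paris": "Q1"}', "first paragraph"),
    ("Page B", '{"Paris": "Q1"}', "second paragraph"),
    ("Page C", '{"Paris": "Q2"}', "third paragraph"),
]
print(count_targets(rows))          # {'Paris': {'Q1': 2}}
print(default_outfile(2, "en"))     # count.min2.stem-en.json
```

The result has the `{word: {Q_ent: count}}` shape that the `clean` command expects as its countfile.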
count-names
usage: minimel count-names [-h] [-o OUTFILE] [-s STEM] [--head HEAD]
paragraphlinks countfile
Count anchor texts in Wikipedia paragraphs.
positional arguments:
paragraphlinks Directory of (pagetitle, links-json, paragraph) .tsv files
countfile Hyperlink anchor count JSON file
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
Output file or directory (default: name{countfile}[.stem-{LANG}].json)
(default: None)
-s STEM, --stem STEM Stemming language ISO 639-1 (2-letter) code
(default: None)
--head HEAD Use only the first N lines from each partition
(default: None)
clean
usage: minimel clean [-h] [-o OUTFILE] [-s STEM]
[-f | --freqnorm | --no-freqnorm] [-b BADENTFILE]
[-m MIN_COUNT] [-t TOKENSCORE_THRESHOLD]
[-e ENTROPY_THRESHOLD]
[--countratio-threshold COUNTRATIO_THRESHOLD]
[-q QUANTILE_TOP_SHADOWED]
[--cluster-threshold CLUSTER_THRESHOLD]
indexdbfile disambigfile countfile [namecountfile]
Filter anchor counts (given their candidate entity counts).
First, only keep ambiguous candidate entities that either have minimal counts or are
linked from disambiguation pages.
If the tokenscore is low, then names with high entropy or countratio
(len / sum) are removed.
positional arguments:
indexdbfile Wikimapper index sqlite3 database
disambigfile Disambiguation JSON file
countfile Hyperlink anchor count {word: {Q_ent: count}} JSON file
namecountfile Counts of names (regardless of hyperlinks)
(default: None)
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
Output file or directory (default: clean.json)
(default: None)
-s STEM, --stem STEM Stemming language ISO 639-1 (2-letter) code
(default: None)
-f, --freqnorm, --no-freqnorm
Normalize counts by total entity frequency
(default: False)
-b BADENTFILE, --badentfile BADENTFILE
File of entity IDs to ignore, one per line
(default: None)
-m MIN_COUNT, --min-count MIN_COUNT
Minimal candidate entity count
(default: 2)
-t TOKENSCORE_THRESHOLD, --tokenscore-threshold TOKENSCORE_THRESHOLD
Threshold for mean asymmetric Jaccard index
between name and candidate entity labels
(default: 0.1)
-e ENTROPY_THRESHOLD, --entropy-threshold ENTROPY_THRESHOLD
Entropy threshold (high entropy = flat dist)
(default: 1.0)
--countratio-threshold COUNTRATIO_THRESHOLD
Count-ratio (len / sum) threshold
(default: 0.5)
-q QUANTILE_TOP_SHADOWED, --quantile-top-shadowed QUANTILE_TOP_SHADOWED
Only train models for the given percentage of names with the highest counts
of candidate entities shadowed by the top candidate
(default: None)
--cluster-threshold CLUSTER_THRESHOLD
Cluster names based on their meanings
(default: None)
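To make the entropy and count-ratio thresholds of `clean` concrete, the sketch below computes both quantities for one name's candidate counts. It is an illustration under assumptions, not minimel's code: the natural logarithm, the strict comparisons, and the function names are all choices made for this example, and the token score is taken as a precomputed input.

```python
import math


def entropy(counts):
    """Shannon entropy (natural log, an assumption) of a {entity: count} dict."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def count_ratio(counts):
    """Count ratio as documented: len / sum."""
    return len(counts) / sum(counts.values())


def should_remove(counts, tokenscore, tokenscore_threshold=0.1,
                  entropy_threshold=1.0, countratio_threshold=0.5):
    """Remove a name only if its token score is low AND it looks noisy."""
    if tokenscore >= tokenscore_threshold:
        return False
    return (entropy(counts) > entropy_threshold
            or count_ratio(counts) > countratio_threshold)


peaked = {"Q1": 98, "Q2": 2}                   # one dominant candidate
flat = {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 1}    # flat distribution

print(should_remove(peaked, tokenscore=0.0))   # False
print(should_remove(flat, tokenscore=0.0))     # True
```

A flat distribution over four candidates has entropy ln 4 (about 1.39) and count ratio 1.0, so it exceeds both default thresholds; the peaked one exceeds neither.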
vectorize
usage: minimel vectorize [-h] [-o OUTFILE] [--head HEAD] [--stem STEM]
[-v VECTORIZER] [-e ENT_FEATS_CSV]
[-b | --balanced | --no-balanced]
[-u | --usenil | --no-usenil] [--split SPLIT]
[-f FOLD]
paragraphlinks name_count_json
Vectorize paragraph text dataset into Vowpal Wabbit format
positional arguments:
paragraphlinks Paragraph links directory
name_count_json Surface form count JSON file
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
Output file or directory (default: vec*.parts)
(default: None)
--head HEAD Use only the first N lines from each partition
(default: None)
--stem STEM Stemming language ISO 639-1 (2-letter) code
(default: None)
-v VECTORIZER, --vectorizer VECTORIZER
Scikit-learn vectorizer .pickle or Fasttext .bin word
embeddings. If unset, use tokens directly.
(default: None)
-e ENT_FEATS_CSV, --ent-feats-csv ENT_FEATS_CSV
CSV of (ent_id, space-separated feature list) entity features
(default: None)
-b, --balanced, --no-balanced
Use balanced training
(default: False)
-u, --usenil, --no-usenil
Use NIL option
(default: False)
--split SPLIT Split the data into several parts
(default: None)
-f FOLD, --fold FOLD Ignore this fold of the split data
(default: None)
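For readers unfamiliar with the Vowpal Wabbit text format that `vectorize` targets, the following sketch turns a label and a token list into one line of the general form `label |namespace feature ...`. The exact labels, namespaces, and feature layout that minimel emits are not documented here, so the namespace `t` and both function names are assumptions for illustration.

```python
import re

# Whitespace, "|" and ":" have special meaning in VW's text input format.
_VW_SPECIAL = re.compile(r"[\s|:]")


def vw_feature(token):
    """Replace characters that VW would interpret as syntax."""
    return _VW_SPECIAL.sub("_", token)


def to_vw_line(label, tokens, namespace="t"):
    """Format one example as 'label |namespace feat1 feat2 ...'."""
    feats = " ".join(vw_feature(t) for t in tokens if t)
    return f"{label} |{namespace} {feats}"


print(to_vw_line(1, ["capital", "of", "France"]))
# prints: 1 |t capital of France
print(to_vw_line(2, ["ratio:1", "a|b"]))
# prints: 2 |t ratio_1 a_b
```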
ent-feats
usage: minimel ent-feats [-h] [-p PART] spo_parquet anchor_json
Extract entity features from parquet triples
positional arguments:
spo_parquet Parquet triple file
anchor_json Anchor counts
options:
-h, --help show this help message and exit
-p PART, --part PART Filter part of features based on count
<1: Quantile of feature count
>1: Minimum feature count
(default: 1)
train
usage: minimel train [-h] [-o OUTFILE] [-b BITS] vec_file
Train Logistic Regression models
Writes the trained model to OUTFILE.
positional arguments:
vec_file Training data in Vowpal Wabbit format
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
Output file or directory (default: model.b{bits}.vw)
(default: None)
-b BITS, --bits BITS Number of bits of the Vowpal Wabbit feature hash function
(default: 20)
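The `--bits` option sets the size of the hashed feature space: with b bits there are 2^b weight slots, and each feature name is hashed into one of them. The sketch below illustrates the idea only. Vowpal Wabbit uses its own hash function, so `zlib.crc32` is a stand-in, and both function names are hypothetical.

```python
import zlib


def table_size(bits=20):
    """Number of weight slots available with a given number of hash bits."""
    return 1 << bits


def feature_slot(feature, bits=20):
    """Map a feature name to a slot index (crc32 is a stand-in hash)."""
    mask = (1 << bits) - 1
    return zlib.crc32(feature.encode("utf-8")) & mask


print(table_size(20))   # 1048576
print(table_size(24))   # 16777216
print(0 <= feature_slot("paris") < table_size(20))   # True
```

More bits mean fewer hash collisions between features, at the cost of a larger model.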
run
usage: minimel run [-h] [-o OUTFILE] [-v VECTORIZER]
[--ent-feats-csv ENT_FEATS_CSV] [-l LANG]
[--fallback FALLBACK] [--evaluate | --no-evaluate]
[--evalfile EVALFILE]
[--evalfile-per-name EVALFILE_PER_NAME]
[-p | --predict-only | --no-predict-only]
[-a | --all-scores | --no-all-scores]
[-u | --upperbound | --no-upperbound] [-s SPLIT]
[--fold FOLD]
dawgfile [candidatefile] [modelfile] [runfiles ...]
Perform entity disambiguation
positional arguments:
dawgfile DAWG trie file of Wikipedia > Wikidata mapping
candidatefile Candidate {name -> [ID]} JSON
(default: None)
modelfile Vowpal Wabbit model
(default: None)
runfiles Input file (- or absent for standard input). TSV rows of
(ID, {name -> ID}, text) or ({name -> ID}, text) or (text)
options:
-h, --help show this help message and exit
-o OUTFILE, --outfile OUTFILE
Write outputs to file (default: stdout)
(default: None)
-v VECTORIZER, --vectorizer VECTORIZER
Scikit-learn vectorizer .pickle or Fasttext .bin word
embeddings. If unset, use HashingVectorizer.
(default: None)
--ent-feats-csv ENT_FEATS_CSV
CSV of (ent_id, space-separated feature list) entity features
(default: None)
-l LANG, --lang LANG (default: None)
--fallback FALLBACK Additional deterministic fallback {name -> ID} JSON
(default: None)
--evaluate, --no-evaluate
Report evaluation scores instead of predictions
(default: False)
--evalfile EVALFILE Write evaluation results to file
(default: None)
--evalfile-per-name EVALFILE_PER_NAME
Write evaluation results per name to file
(default: None)
-p, --predict-only, --no-predict-only
Only print predictions, not original text
(default: True)
-a, --all-scores, --no-all-scores
Output all candidate scores
(default: False)
-u, --upperbound, --no-upperbound
Create upper bound on performance
(default: False)
-s SPLIT, --split SPLIT
Split the data into several parts
(default: None)
--fold FOLD Use only this fold of the split data
(default: None)
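The three input row shapes accepted by `run` can be told apart by their column count. The sketch below shows one way to parse them; it assumes the `{name -> ID}` column is a JSON object, and the function name `parse_run_row` and the sample ID `Q1` are made up for this example.

```python
import json


def parse_run_row(line):
    """Parse one TSV row into (ident, links, text).

    Accepts (ID, {name -> ID}, text), ({name -> ID}, text) or (text).
    Missing columns are returned as None.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 1:
        return None, None, fields[0]
    if len(fields) == 2:
        return None, json.loads(fields[0]), fields[1]
    if len(fields) == 3:
        return fields[0], json.loads(fields[1]), fields[2]
    raise ValueError(f"expected 1 to 3 tab-separated columns, got {len(fields)}")


print(parse_run_row('doc1\t{"Paris": "Q1"}\tParis is a city.\n'))
# prints: ('doc1', {'Paris': 'Q1'}, 'Paris is a city.')
print(parse_run_row("Paris is a city.\n"))
# prints: (None, None, 'Paris is a city.')
```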
evaluate
usage: minimel evaluate [-h] [-a [AGG ...]] [-e EVALFILE]
goldfile [predfiles ...]
Evaluate predictions
positional arguments:
goldfile
predfiles
options:
-h, --help show this help message and exit
-a [AGG ...], --agg [AGG ...]
Aggregation jsons (TODO: depend on data...?)
(default: ())
-e EVALFILE, --evalfile EVALFILE
Write evaluation results to file
(default: None)
experiment
usage: minimel experiment [-h] [-o OUTDIR] [-n NPARTS] [--head HEAD]
[--split [SPLIT ...]] [--fold [FOLD ...]]
[--stem [STEM ...]] [-m [MIN_COUNT ...]]
[--freqnorm [FREQNORM ...]]
[--badentfile [BADENTFILE ...]]
[-t [TOKENSCORE_THRESHOLD ...]]
[--entropy-threshold [ENTROPY_THRESHOLD ...]]
[--countratio-threshold [COUNTRATIO_THRESHOLD ...]]
[-q [QUANTILE_TOP_SHADOWED ...]]
[--cluster-threshold [CLUSTER_THRESHOLD ...]]
[-v [VECTORIZER ...]]
[--ent-feats-csv [ENT_FEATS_CSV ...]]
[--balanced [BALANCED ...]] [--usenil [USENIL ...]]
[--bits [BITS ...]] [-r [RUNFILE ...]]
[--use-fallback [USE_FALLBACK ...]]
[-a | --also-baseline | --no-also-baseline]
[--evaluate | --no-evaluate]
[--evaluate-per-name | --no-evaluate-per-name]
[root]
Run all steps to train and evaluate EL models over a parameter sweep.
The root directory must contain the following files:
- index_*.dawg: DAWG trie mapping of article names -> numeric IDs
- *-disambig.txt: See disambig_ent_file in get-disambig
positional arguments:
root Root directory
(default: .)
options:
-h, --help show this help message and exit
-o OUTDIR, --outdir OUTDIR
Write outputs to this directory
(default: None)
-n NPARTS, --nparts NPARTS
Number of parts to chunk wikidump into
(default: 100)
--head HEAD Use only the first N lines from each partition
(default: None)
--split [SPLIT ...] Split the data into several parts
(default: (None,))
--fold [FOLD ...] Ignore this fold of the split data in training, use in evaluation
(default: (None,))
--stem [STEM ...] Stemming language ISO 639-1 (2-letter) code (use X for no stemming)
(default: ('',))
-m [MIN_COUNT ...], --min-count [MIN_COUNT ...]
Minimal (anchor-text, target) occurrence
(default: (2,))
--freqnorm [FREQNORM ...]
Normalize counts by total entity frequency (1/0)
(default: (False,))
--badentfile [BADENTFILE ...]
File of entity IDs to ignore, one per line (default: *-disambig.txt)
(default: ('',))
-t [TOKENSCORE_THRESHOLD ...], --tokenscore-threshold [TOKENSCORE_THRESHOLD ...]
Threshold for mean asymmetric Jaccard index
between name and candidate entity labels
(default: (0.1,))
--entropy-threshold [ENTROPY_THRESHOLD ...]
Entropy threshold (high entropy = flat dist)
(default: (1.0,))
--countratio-threshold [COUNTRATIO_THRESHOLD ...]
Count-ratio (len / sum) threshold
(default: (0.5,))
-q [QUANTILE_TOP_SHADOWED ...], --quantile-top-shadowed [QUANTILE_TOP_SHADOWED ...]
Only train models for the given percentage of names with the highest counts
of candidate entities shadowed by the top candidate
(default: (0,))
--cluster-threshold [CLUSTER_THRESHOLD ...]
Cluster names based on their meanings
(default: (None,))
-v [VECTORIZER ...], --vectorizer [VECTORIZER ...]
Scikit-learn vectorizer .pickle or Fasttext .bin word
embeddings. If unset, use tokens directly.
(default: ('',))
--ent-feats-csv [ENT_FEATS_CSV ...]
CSV of (ent_id, space-separated feature list) entity features
(default: ('',))
--balanced [BALANCED ...]
Use balanced training
(default: (False,))
--usenil [USENIL ...]
Use NIL option for training unlinked mentions
(default: (False,))
--bits [BITS ...] Number of bits of the Vowpal Wabbit feature hash function
(default: (20,))
-r [RUNFILE ...], --runfile [RUNFILE ...]
TSV rows of (ID, {name -> ID}, text) or ({name -> ID}, text)
(default: ('',))
--use-fallback [USE_FALLBACK ...]
Use raw counts as fallback
(default: (True,))
-a, --also-baseline, --no-also-baseline
Also run a baseline model without model predictions
(default: True)
--evaluate, --no-evaluate
Write evaluation scores to file
(default: False)
--evaluate-per-name, --no-evaluate-per-name
Write evaluation scores per name to file
(default: False)
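Most `experiment` options accept several values, which together define the parameter sweep. One plausible reading, shown in the sketch below, is that every combination of values is run (a Cartesian product). Whether minimel enumerates combinations exactly this way is an assumption, and the `sweep` helper is hypothetical.

```python
from itertools import product


def sweep(**param_values):
    """Yield one configuration dict per combination of parameter values."""
    names = list(param_values)
    for combo in product(*(param_values[n] for n in names)):
        yield dict(zip(names, combo))


configs = list(sweep(min_count=(2, 5), bits=(20, 24), balanced=(False,)))
print(len(configs))   # 4
print(configs[0])     # {'min_count': 2, 'bits': 20, 'balanced': False}
```

Under this reading, passing two values each for `--min-count` and `--bits` would give four training and evaluation runs.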
audit
usage: minimel audit [-h] modelfile datafile name [limit]
Print prediction scores and model coefficients
positional arguments:
modelfile Model
datafile VW format vectorized data
name
limit (default: 1000)
options:
-h, --help show this help message and exit