Command Line Interface
======================

.. ansi-block::

   usage: minimel [-h] [--verbose] [--slurm]
                  {prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
                  ...

   positional arguments:
     {prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
       prepare             Download required files and make indices
       index               Make an efficient DAWG trie index from a Wikimapper
                           sqlite file
       xml-db              Make a name database from Wikidump page ids
       query-pages         Query the Wikidata API to get disambiguation (& list
                           pages if indicated)
       get-disambig        Get disambiguation links.
       get-paragraphs      Extract hyperlinks from Wikipedia dumps.
       count               Count targets per anchor text in Wikipedia
                           paragraphs.
       count-names         Count anchor texts in Wikipedia paragraphs.
       clean               Filter anchor counts (given their candidate entity
                           counts).
       vectorize           Vectorize paragraph text dataset into Vowpal Wabbit
                           format
       ent-feats           Extract entity features from parquet triples
       train               Train Logistic Regression models
       run                 Perform entity disambiguation
       evaluate            Evaluate predictions
       experiment          Run all steps to train and evaluate EL models over a
                           parameter sweep.
       audit               Print prediction scores and model coefficients

   options:
     -h, --help            show this help message and exit
     --verbose, -v         Verbosity (use -vv for debug messages)
     --slurm, -s           Use Slurm

prepare
^^^^^^^

.. ansi-block::

   usage: minimel prepare [-h] [-r ROOTDIR] [-m MIRROR]
                          [-o | --overwrite | --no-overwrite] [-n NPARTS]
                          [-i | --index-only | --no-index-only]
                          [-c CUSTOM_LANGCODE]
                          wikiname version

   Download required files and make indices

   positional arguments:
     wikiname              Wikipedia edition name (eg. "simplewiki")
     version               Wikipedia version (eg. "latest")

   options:
     -h, --help            show this help message and exit
     -r ROOTDIR, --rootdir ROOTDIR
                           Root directory (default: None)
     -m MIRROR, --mirror MIRROR
                           Wikimedia mirror (default:
                           https://dumps.wikimedia.org)
     -o, --overwrite, --no-overwrite
                           Whether to overwrite existing files (default: False)
     -n NPARTS, --nparts NPARTS
                           Number of chunks to read (default: 100)
     -i, --index-only, --no-index-only
                           Whether to only create the DAWG index (default:
                           False)
     -c CUSTOM_LANGCODE, --custom-langcode CUSTOM_LANGCODE
                           Custom language code (if different from wikiname,
                           e.g. "en-simple") (default: None)

index
^^^^^

.. ansi-block::

   usage: minimel index [-h] db_fname

   Make an efficient DAWG trie index from a Wikimapper sqlite file

   positional arguments:
     db_fname              Wikimapper SQLite3 index file

   options:
     -h, --help            show this help message and exit

xml-db
^^^^^^

.. ansi-block::

   usage: minimel xml-db [-h] [--ns NS] [--nparts NPARTS] wikidump

   Make a name database from Wikidump page ids

   positional arguments:
     wikidump              Wikipedia XML dump file

   options:
     -h, --help            show this help message and exit
     --ns NS               Page Namespace (default: 0)
     --nparts NPARTS       Number of chunks to read (default: 100)

query-pages
^^^^^^^^^^^

.. ansi-block::

   usage: minimel query-pages [-h]
                              [-q | --query-listpages | --no-query-listpages]
                              [-o OUTFILE]
                              langcode

   Query the Wikidata API to get disambiguation (& list pages if indicated)
   Returns Wikidata Qids, one per line

   positional arguments:
     langcode              Wikipedia language code

   options:
     -h, --help            show this help message and exit
     -q, --query-listpages, --no-query-listpages
                           Whether to also query for list pages (default:
                           False)
     -o OUTFILE, --outfile OUTFILE
                           (default: None)

get-disambig
^^^^^^^^^^^^

.. ansi-block::

   usage: minimel get-disambig [-h] [-d DISAMBIG_TEMPLATE] [-n NPARTS]
                               wikidump dawgfile [disambig_ent_file]

   Get disambiguation links. Writes disambig.json.

   positional arguments:
     wikidump              Wikipedia XML dump file
     dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
     disambig_ent_file     Flat text file of disambiguation pages with one
                           entity ID per line (default: None)

   options:
     -h, --help            show this help message and exit
     -d DISAMBIG_TEMPLATE, --disambig-template DISAMBIG_TEMPLATE
                           Use disambiguation pages that contain a template
                           with this name instead of disambig_ent_file (if
                           disambig_ent_file is provided, create it) (default:
                           None)
     -n NPARTS, --nparts NPARTS
                           Number of chunks to read (default: 1000)

get-paragraphs
^^^^^^^^^^^^^^

.. ansi-block::

   usage: minimel get-paragraphs [-h] [-n NPARTS] wikidump dawgfile [skip ...]

   Extract hyperlinks from Wikipedia dumps. Writes to outdir.

   positional arguments:
     wikidump              Wikipedia pages-articles XML dump file
     dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
     skip                  Skip pages with this prefix

   options:
     -h, --help            show this help message and exit
     -n NPARTS, --nparts NPARTS
                           Number of chunks to read (default: 1000)

count
^^^^^

.. ansi-block::

   usage: minimel count [-h] [-o OUTFILE] [-m MIN_COUNT] [--stem STEM]
                        [--head HEAD] [--split SPLIT] [-f FOLD]
                        paragraphlinks

   Count targets per anchor text in Wikipedia paragraphs.
   Writes count.min{min_count}[.stem-{LANG}].json

   positional arguments:
     paragraphlinks        Directory of (pagetitle, links-json, paragraph) .tsv
                           files

   options:
     -h, --help            show this help message and exit
     -o OUTFILE, --outfile OUTFILE
                           Output file or directory (default: count.json)
                           (default: None)
     -m MIN_COUNT, --min-count MIN_COUNT
                           Minimal (anchor-text, target) occurrence (default:
                           2)
     --stem STEM           Stemming language ISO 639-1 (2-letter) code
                           (default: None)
     --head HEAD           Use only N first lines from each partition (default:
                           None)
     --split SPLIT         Split the data into several parts (default: None)
     -f FOLD, --fold FOLD  Ignore this fold of the split data (default: None)

count-names
^^^^^^^^^^^

.. ansi-block::

   usage: minimel count-names [-h] [-o OUTFILE] [-s STEM] [--head HEAD]
                              paragraphlinks countfile

   Count anchor texts in Wikipedia paragraphs.

   positional arguments:
     paragraphlinks        Directory of (pagetitle, links-json, paragraph) .tsv
                           files
     countfile             Hyperlink anchor count JSON file

   options:
     -h, --help            show this help message and exit
     -o OUTFILE, --outfile OUTFILE
                           Output file or directory (default:
                           name{countfile}[.stem-{LANG}].json) (default: None)
     -s STEM, --stem STEM  Stemming language ISO 639-1 (2-letter) code
                           (default: None)
     --head HEAD           Use only N first lines from each partition (default:
                           None)

clean
^^^^^

.. ansi-block::

   usage: minimel clean [-h] [-o OUTFILE] [-s STEM]
                        [-f | --freqnorm | --no-freqnorm] [-b BADENTFILE]
                        [-m MIN_COUNT] [-t TOKENSCORE_THRESHOLD]
                        [-e ENTROPY_THRESHOLD]
                        [--countratio-threshold COUNTRATIO_THRESHOLD]
                        [-q QUANTILE_TOP_SHADOWED]
                        [--cluster-threshold CLUSTER_THRESHOLD]
                        indexdbfile disambigfile countfile [namecountfile]

   Filter anchor counts (given their candidate entity counts).

   First, only keep ambiguous candidate entities that either have minimal
   counts or are linked from disambiguation pages. If the tokenscore is low,
   then names with high entropy or countratio (len / sum) are removed.

   positional arguments:
     indexdbfile           Wikimapper index sqlite3 database
     disambigfile          Disambiguation JSON file
     countfile             Hyperlink anchor count {word: {Q_ent: count}} JSON
                           file
     namecountfile         Counts of names (regardless of hyperlinks) (default:
                           None)

   options:
     -h, --help            show this help message and exit
     -o OUTFILE, --outfile OUTFILE
                           Output file or directory (default: clean.json)
                           (default: None)
     -s STEM, --stem STEM  Stemming language ISO 639-1 (2-letter) code
                           (default: None)
     -f, --freqnorm, --no-freqnorm
                           Normalize counts by total entity frequency (default:
                           False)
     -b BADENTFILE, --badentfile BADENTFILE
                           Files of entity IDs to ignore, one per line
                           (default: None)
     -m MIN_COUNT, --min-count MIN_COUNT
                           Minimal candidate entity count (default: 2)
     -t TOKENSCORE_THRESHOLD, --tokenscore-threshold TOKENSCORE_THRESHOLD
                           Threshold for mean asymmentric Jaccard index between
                           name and candidate entity labels (default: 0.1)
     -e ENTROPY_THRESHOLD, --entropy-threshold ENTROPY_THRESHOLD
                           Entropy threshold (high entropy = flat dist)
                           (default: 1.0)
     --countratio-threshold COUNTRATIO_THRESHOLD
                           Count-ratio (len / sum) threshold (default: 0.5)
     -q QUANTILE_TOP_SHADOWED, --quantile-top-shadowed QUANTILE_TOP_SHADOWED
                           Only train models for a % names with highest counts
                           of candidate entities shadowed by the top candidate
                           (default: None)
     --cluster-threshold CLUSTER_THRESHOLD
                           (default: None)

vectorize
^^^^^^^^^

.. ansi-block::

   usage: minimel vectorize [-h] [-o OUTFILE] [--head HEAD] [--stem STEM]
                            [-v VECTORIZER] [-e ENT_FEATS_CSV]
                            [-b | --balanced | --no-balanced]
                            [-u | --usenil | --no-usenil] [--split SPLIT]
                            [-f FOLD]
                            paragraphlinks name_count_json

   Vectorize paragraph text dataset into Vowpal Wabbit format

   positional arguments:
     paragraphlinks        Paragraph links directory
     name_count_json       Surfaceform count json file

   options:
     -h, --help            show this help message and exit
     -o OUTFILE, --outfile OUTFILE
                           Output file or directory (default: vec*.parts)
                           (default: None)
     --head HEAD           Use only N first lines from each partition (default:
                           None)
     --stem STEM           Stemming language ISO 639-1 (2-letter) code
                           (default: None)
     -v VECTORIZER, --vectorizer VECTORIZER
                           Scikit-learn vectorizer .pickle or Fasttext .bin
                           word embeddings. If unset, use tokens directly.
                           (default: None)
     -e ENT_FEATS_CSV, --ent-feats-csv ENT_FEATS_CSV
                           CSV of (ent_id,space separated feat list) entity
                           features (default: None)
     -b, --balanced, --no-balanced
                           Use balanced training (default: False)
     -u, --usenil, --no-usenil
                           Use NIL option (default: False)
     --split SPLIT         Split the data into several parts (default: None)
     -f FOLD, --fold FOLD  Ignore this fold of the split data (default: None)

ent-feats
^^^^^^^^^

.. ansi-block::

   usage: minimel ent-feats [-h] [-p PART] spo_parquet anchor_json

   Extract entity features from parquet triples

   positional arguments:
     spo_parquet           Parquet triple file
     anchor_json           Anchor counts

   options:
     -h, --help            show this help message and exit
     -p PART, --part PART  Filter part of features based on count
                           <1: Quantile of feature count
                           >1: Minimum feature count (default: 1)

train
^^^^^

.. ansi-block::

   usage: minimel train [-h] [-o OUTFILE] [-b BITS] vec_file

   Train Logistic Regression models

   Writes

   positional arguments:
     vec_file              Training data in Vowpal Wabbit format

   options:
     -h, --help            show this help message and exit
     -o OUTFILE, --outfile OUTFILE
                           Output file or directory (default:
                           model.b{bits}.vw) (default: None)
     -b BITS, --bits BITS  Number of bits of the Vowpal Wabbit feature hash
                           function (default: 20)

run
^^^

.. ansi-block::

   usage: minimel run [-h] [-o OUTFILE] [-v VECTORIZER]
                      [--ent-feats-csv ENT_FEATS_CSV] [-l LANG]
                      [--fallback FALLBACK] [--evaluate | --no-evaluate]
                      [--evalfile EVALFILE]
                      [--evalfile-per-name EVALFILE_PER_NAME]
                      [-p | --predict-only | --no-predict-only]
                      [-a | --all-scores | --no-all-scores]
                      [-u | --upperbound | --no-upperbound] [-s SPLIT]
                      [--fold FOLD]
                      dawgfile [candidatefile] [modelfile] [runfiles ...]

   Perform entity disambiguation

   positional arguments:
     dawgfile              DAWG trie file of Wikipedia > Wikidata count
     candidatefile         Candidate {name -> [ID]} json (default: None)
     modelfile             Vowpal Wabbit model (default: None)
     runfiles              Input file (- or absent for standard input). TSV
                           rows of (ID, {name -> ID}, text) or ({name -> ID},
                           text) or (text)

   options:
     -h, --help            show this help message and exit
     -o OUTFILE, --outfile OUTFILE
                           Write outputs to file (default: stdout) (default:
                           None)
     -v VECTORIZER, --vectorizer VECTORIZER
                           Scikit-learn vectorizer .pickle or Fasttext .bin
                           word embeddings. If unset, use HashingVectorizer.
                           (default: None)
     --ent-feats-csv ENT_FEATS_CSV
                           CSV of (ent_id,space separated feat list) entity
                           features (default: None)
     -l LANG, --lang LANG  (default: None)
     --fallback FALLBACK   Additional fallback deterministic name -> ID json
                           (default: None)
     --evaluate, --no-evaluate
                           Report evaluation scores instead of predictions
                           (default: False)
     --evalfile EVALFILE   Write evaluation results to file (default: None)
     --evalfile-per-name EVALFILE_PER_NAME
                           Write evaluation results per name to file (default:
                           None)
     -p, --predict-only, --no-predict-only
                           Only print predictions, not original text (default:
                           True)
     -a, --all-scores, --no-all-scores
                           Output all candidate scores (default: False)
     -u, --upperbound, --no-upperbound
                           Create upper bound on performance (default: False)
     -s SPLIT, --split SPLIT
                           Split the data into several parts (default: None)
     --fold FOLD           Use only this fold of the split data (default: None)

evaluate
^^^^^^^^

.. ansi-block::

   usage: minimel evaluate [-h] [-a [AGG ...]] [-e EVALFILE]
                           goldfile [predfiles ...]

   Evaluate predictions

   positional arguments:
     goldfile
     predfiles

   options:
     -h, --help            show this help message and exit
     -a [AGG ...], --agg [AGG ...]
                           Aggregation jsons (TODO: depend on data...?)
                           (default: ())
     -e EVALFILE, --evalfile EVALFILE
                           Write evaluation results to file (default: None)

experiment
^^^^^^^^^^

.. ansi-block::

   usage: minimel experiment [-h] [-o OUTDIR] [-n NPARTS] [--head HEAD]
                             [--split [SPLIT ...]] [--fold [FOLD ...]]
                             [--stem [STEM ...]] [-m [MIN_COUNT ...]]
                             [--freqnorm [FREQNORM ...]]
                             [--badentfile [BADENTFILE ...]]
                             [-t [TOKENSCORE_THRESHOLD ...]]
                             [--entropy-threshold [ENTROPY_THRESHOLD ...]]
                             [--countratio-threshold [COUNTRATIO_THRESHOLD ...]]
                             [-q [QUANTILE_TOP_SHADOWED ...]]
                             [--cluster-threshold [CLUSTER_THRESHOLD ...]]
                             [-v [VECTORIZER ...]]
                             [--ent-feats-csv [ENT_FEATS_CSV ...]]
                             [--balanced [BALANCED ...]]
                             [--usenil [USENIL ...]] [--bits [BITS ...]]
                             [-r [RUNFILE ...]]
                             [--use-fallback [USE_FALLBACK ...]]
                             [-a | --also-baseline | --no-also-baseline]
                             [--evaluate | --no-evaluate]
                             [--evaluate-per-name | --no-evaluate-per-name]
                             [root]

   Run all steps to train and evaluate EL models over a parameter sweep.

   The root directory must contain the following files:

   - index_*.dawg: DAWG trie mapping of article names -> numeric IDs
   - *-disambig.txt: See disambig_ent_file in ~get_disambig.get_disambig

   positional arguments:
     root                  Root directory (default: .)

   options:
     -h, --help            show this help message and exit
     -o OUTDIR, --outdir OUTDIR
                           Write outputs to this directory (default: None)
     -n NPARTS, --nparts NPARTS
                           Number of parts to chunk wikidump into (default:
                           100)
     --head HEAD           Use only N first lines from each partition (default:
                           None)
     --split [SPLIT ...]   Split the data into several parts (default: (None,))
     --fold [FOLD ...]     Ignore this fold of the split data in training, use
                           in evaluation (default: (None,))
     --stem [STEM ...]     Stemming language ISO 639-1 (2-letter) code (use X
                           for no stemming) (default: ('',))
     -m [MIN_COUNT ...], --min-count [MIN_COUNT ...]
                           Minimal (anchor-text, target) occurrence (default:
                           (2,))
     --freqnorm [FREQNORM ...]
                           Normalize counts by total entity frequency (1/0)
                           (default: (False,))
     --badentfile [BADENTFILE ...]
                           File of entity IDs to ignore, one per line (default:
                           *-disambig.txt) (default: ('',))
     -t [TOKENSCORE_THRESHOLD ...], --tokenscore-threshold [TOKENSCORE_THRESHOLD ...]
                           Threshold for mean asymmentric Jaccard index between
                           name and candidate entity labels (default: (0.1,))
     --entropy-threshold [ENTROPY_THRESHOLD ...]
                           Entropy threshold (high entropy = flat dist)
                           (default: (1.0,))
     --countratio-threshold [COUNTRATIO_THRESHOLD ...]
                           Count-ratio (len / sum) threshold (default: (0.5,))
     -q [QUANTILE_TOP_SHADOWED ...], --quantile-top-shadowed [QUANTILE_TOP_SHADOWED ...]
                           Only train models for a % names with highest counts
                           of candidate entities shadowed by the top candidate
                           (default: (0,))
     --cluster-threshold [CLUSTER_THRESHOLD ...]
                           Cluster names based on their meanings (default:
                           (None,))
     -v [VECTORIZER ...], --vectorizer [VECTORIZER ...]
                           Scikit-learn vectorizer .pickle or Fasttext .bin
                           word embeddings. If unset, use tokens directly.
                           (default: ('',))
     --ent-feats-csv [ENT_FEATS_CSV ...]
                           CSV of (ent_id,space separated feat list) entity
                           features (default: ('',))
     --balanced [BALANCED ...]
                           Use balanced training (default: (False,))
     --usenil [USENIL ...]
                           Use NIL option for training unlinked mentions
                           (default: (False,))
     --bits [BITS ...]     Number of bits of the Vowpal Wabbit feature hash
                           function (default: (20,))
     -r [RUNFILE ...], --runfile [RUNFILE ...]
                           TSV rows of (ID, {name -> ID}, text) or ({name ->
                           ID}, text) (default: ('',))
     --use-fallback [USE_FALLBACK ...]
                           Use raw counts as fallback (default: (True,))
     -a, --also-baseline, --no-also-baseline
                           Also run a baseline model without model predictions
                           (default: True)
     --evaluate, --no-evaluate
                           Write evaluation scores to file (default: False)
     --evaluate-per-name, --no-evaluate-per-name
                           Write evaluation scores per name to file (default:
                           False)

audit
^^^^^

.. ansi-block::

   usage: minimel audit [-h] modelfile datafile name [limit]

   Print prediction scores and model coefficients

   positional arguments:
     modelfile             Model
     datafile              VW format vectorized data
     name
     limit                 (default: 1000)

   options:
     -h, --help            show this help message and exit
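
Example workflow
^^^^^^^^^^^^^^^^

The subcommands above appear to form a pipeline in which each step consumes
files written by an earlier one. The sketch below is a dry run that only
builds and prints the command lines; it does not execute ``minimel``. It uses
only subcommands and flags listed in the help output above. The ordering of
the steps, every file name and path, and the choice of which output feeds
which input are assumptions made for illustration, so check them against the
help text before relying on them.

.. code-block:: python

   import shlex

   # Subcommands listed in the top-level usage message above.
   SUBCOMMANDS = {
       "prepare", "index", "xml-db", "query-pages", "get-disambig",
       "get-paragraphs", "count", "count-names", "clean", "vectorize",
       "ent-feats", "train", "run", "evaluate", "experiment", "audit",
   }


   def minimel(subcommand, *args):
       """Build the argument list for one ``minimel`` invocation."""
       if subcommand not in SUBCOMMANDS:
           raise ValueError(f"unknown minimel subcommand: {subcommand}")
       return ["minimel", subcommand, *(str(a) for a in args)]


   # Hypothetical file names and locations, chosen only for illustration.
   DUMP = "wiki/simplewiki-latest-pages-articles.xml"
   DAWG = "wiki/index_simplewiki-latest.dawg"
   INDEX_DB = "wiki/index_simplewiki-latest.db"
   PARAGRAPHS = "wiki/paragraph-links"

   # Assumed step order, inferred from the argument descriptions.
   PIPELINE = [
       minimel("prepare", "simplewiki", "latest", "--rootdir", "wiki"),
       minimel("get-disambig", DUMP, DAWG),
       minimel("get-paragraphs", DUMP, DAWG),
       minimel("count", PARAGRAPHS, "--min-count", 2,
               "--outfile", "wiki/count.json"),
       minimel("clean", INDEX_DB, "wiki/disambig.json", "wiki/count.json",
               "--outfile", "wiki/clean.json"),
       minimel("vectorize", PARAGRAPHS, "wiki/clean.json",
               "--outfile", "wiki/vec.parts"),
       minimel("train", "wiki/vec.parts", "--bits", 20,
               "--outfile", "wiki/model.vw"),
       minimel("run", DAWG, "wiki/clean.json", "wiki/model.vw", "input.tsv"),
   ]

   # Dry run: print each command as a shell-quoted line.
   for argv in PIPELINE:
       print(shlex.join(argv))

Each entry in ``PIPELINE`` is a plain argument list, so it could be handed to
``subprocess.run(argv, check=True)`` once the paths match a real setup. The
``experiment`` subcommand may be a simpler route when a parameter sweep over
these steps is wanted.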