Command Line Interface
======================

.. ansi-block::

    
	usage: minimel [-h] [--verbose] [--slurm]
	               {prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
	               ...
	
	positional arguments:
	  {prepare,index,xml-db,query-pages,get-disambig,get-paragraphs,count,count-names,clean,vectorize,ent-feats,train,run,evaluate,experiment,audit}
	    prepare             Download required files and make indices
	    index               Make an efficient DAWG trie index from a Wikimapper sqlite file
	    xml-db              Make a name database from Wikidump page ids
	    query-pages         Query the Wikidata API to get disambiguation (& list pages if indicated)
	    get-disambig        Get disambiguation links.
	    get-paragraphs      Extract hyperlinks from Wikipedia dumps.
	    count               Count targets per anchor text in Wikipedia paragraphs.
	    count-names         Count anchor texts in Wikipedia paragraphs.
	    clean               Filter anchor counts (given their candidate entity counts).
	    vectorize           Vectorize paragraph text dataset into Vowpal Wabbit format
	    ent-feats           Extract entity features from parquet triples
	    train               Train Logistic Regression models
	    run                 Perform entity disambiguation
	    evaluate            Evaluate predictions
	    experiment          Run all steps to train and evaluate EL models over a parameter sweep.
	    audit               Print prediction scores and model coefficients
	
	options:
	  -h, --help            show this help message and exit
	  --verbose, -v         Verbosity (use -vv for debug messages)
	  --slurm, -s           Use Slurm


prepare
^^^^^^^

.. ansi-block::

    
	usage: minimel prepare [-h] [-r ROOTDIR] [-m MIRROR]
	                       [-o | --overwrite | --no-overwrite] [-n NPARTS]
	                       [-i | --index-only | --no-index-only]
	                       [-c CUSTOM_LANGCODE]
	                       wikiname version
	
	Download required files and make indices
	
	positional arguments:
	  wikiname              Wikipedia edition name (e.g. "simplewiki")
	  version               Wikipedia version (e.g. "latest")
	
	options:
	  -h, --help            show this help message and exit
	  -r ROOTDIR, --rootdir ROOTDIR
	                        Root directory
	                        (default: None)
	  -m MIRROR, --mirror MIRROR
	                        Wikimedia mirror
	                        (default: https://dumps.wikimedia.org)
	  -o, --overwrite, --no-overwrite
	                        Whether to overwrite existing files
	                        (default: False)
	  -n NPARTS, --nparts NPARTS
	                        Number of chunks to read
	                        (default: 100)
	  -i, --index-only, --no-index-only
	                        Whether to only create the DAWG index
	                        (default: False)
	  -c CUSTOM_LANGCODE, --custom-langcode CUSTOM_LANGCODE
	                        Custom language code (if different from wikiname, e.g. "en-simple")
	                        (default: None)


index
^^^^^

.. ansi-block::

    
	usage: minimel index [-h] db_fname
	
	Make an efficient DAWG trie index from a Wikimapper sqlite file
	
	positional arguments:
	  db_fname    Wikimapper SQLite3 index file
	
	options:
	  -h, --help  show this help message and exit


xml-db
^^^^^^

.. ansi-block::

    
	usage: minimel xml-db [-h] [--ns NS] [--nparts NPARTS] wikidump
	
	Make a name database from Wikidump page ids
	
	positional arguments:
	  wikidump         Wikipedia XML dump file
	
	options:
	  -h, --help       show this help message and exit
	  --ns NS          Page Namespace
	                   (default: 0)
	  --nparts NPARTS  Number of chunks to read
	                   (default: 100)


query-pages
^^^^^^^^^^^

.. ansi-block::

    
	usage: minimel query-pages [-h]
	                           [-q | --query-listpages | --no-query-listpages]
	                           [-o OUTFILE]
	                           langcode
	
	Query the Wikidata API to get disambiguation (& list pages if indicated)
	
	Returns Wikidata Qids, one per line
	
	positional arguments:
	  langcode              Wikipedia language code
	
	options:
	  -h, --help            show this help message and exit
	  -q, --query-listpages, --no-query-listpages
	                        Whether to also query for list pages
	                        (default: False)
	  -o OUTFILE, --outfile OUTFILE
	                        (default: None)


get-disambig
^^^^^^^^^^^^

.. ansi-block::

    
	usage: minimel get-disambig [-h] [-d DISAMBIG_TEMPLATE] [-n NPARTS]
	                            wikidump dawgfile [disambig_ent_file]
	
	Get disambiguation links.
	
	Writes [4mdisambig.json[0m.
	
	positional arguments:
	  wikidump              Wikipedia XML dump file
	  dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
	  disambig_ent_file     Flat text file of disambiguation pages with one entity ID per line
	                        (default: None)
	
	options:
	  -h, --help            show this help message and exit
	  -d DISAMBIG_TEMPLATE, --disambig-template DISAMBIG_TEMPLATE
	                        Use disambiguation pages that contain a template with this name instead of [4mdisambig_ent_file[0m (if disambig_ent_file is provided, create it)
	                        (default: None)
	  -n NPARTS, --nparts NPARTS
	                        Number of chunks to read
	                        (default: 1000)


get-paragraphs
^^^^^^^^^^^^^^

.. ansi-block::

    
	usage: minimel get-paragraphs [-h] [-n NPARTS] wikidump dawgfile [skip ...]
	
	Extract hyperlinks from Wikipedia dumps.
	
	Writes to [4moutdir[0m.
	
	positional arguments:
	  wikidump              Wikipedia pages-articles XML dump file
	  dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
	  skip                  Skip pages with this prefix
	
	options:
	  -h, --help            show this help message and exit
	  -n NPARTS, --nparts NPARTS
	                        Number of chunks to read
	                        (default: 1000)


count
^^^^^

.. ansi-block::

    
	usage: minimel count [-h] [-o OUTFILE] [-m MIN_COUNT] [--stem STEM]
	                     [--head HEAD] [--split SPLIT] [-f FOLD]
	                     paragraphlinks
	
	Count targets per anchor text in Wikipedia paragraphs.
	
	Writes [4mcount.min{min_count}[.stem-{LANG}].json[0m
	
	positional arguments:
	  paragraphlinks        Directory of (pagetitle, links-json, paragraph) .tsv files
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTFILE, --outfile OUTFILE
	                        Output file or directory (default: [4mcount.json[0m)
	                        (default: None)
	  -m MIN_COUNT, --min-count MIN_COUNT
	                        Minimal (anchor-text, target) occurrence
	                        (default: 2)
	  --stem STEM           Stemming language ISO 639-1 (2-letter) code
	                        (default: None)
	  --head HEAD           Use only N first lines from each partition
	                        (default: None)
	  --split SPLIT         Split the data into several parts
	                        (default: None)
	  -f FOLD, --fold FOLD  Ignore this fold of the split data
	                        (default: None)
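
As a rough illustration of what this step computes, the sketch below tallies link targets per anchor text and drops pairs seen fewer than ``min_count`` times. It assumes that the links-json column holds a flat ``{anchor: entity_id}`` object and that the minimum is inclusive. The function name ``count_targets``, the sample rows and the placeholder IDs ``Q1`` and ``Q2`` are hypothetical; the actual implementation may differ, for instance in normalization or stemming.

.. code-block:: python

    import json
    from collections import Counter, defaultdict


    def count_targets(rows, min_count=2):
        """Tally (anchor, target) pairs; keep those seen at least min_count times."""
        counts = defaultdict(Counter)
        for _pagetitle, links_json, _paragraph in rows:
            # Assumed shape of the links column: {"anchor text": "entity ID"}
            for anchor, target in json.loads(links_json).items():
                counts[anchor][target] += 1
        filtered = {}
        for anchor, targets in counts.items():
            kept = {t: c for t, c in targets.items() if c >= min_count}
            if kept:
                filtered[anchor] = kept
        return filtered


    # Hypothetical rows in (pagetitle, links-json, paragraph) order
    rows = [
        ("Page A", '{"Paris": "Q1"}', "Paris is the capital of France."),
        ("Page B", '{"Paris": "Q1"}', "She moved to Paris."),
        ("Page C", '{"Paris": "Q2"}', "Paris abducted Helen."),
    ]
    result = count_targets(rows)

With the default ``min_count`` of 2, only the pair seen twice survives, so ``result`` has the same ``{word: {Q_ent: count}}`` shape that ``clean`` expects as its ``countfile``.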


count-names
^^^^^^^^^^^

.. ansi-block::

    
	usage: minimel count-names [-h] [-o OUTFILE] [-s STEM] [--head HEAD]
	                           paragraphlinks countfile
	
	Count anchor texts in Wikipedia paragraphs.
	
	positional arguments:
	  paragraphlinks        Directory of (pagetitle, links-json, paragraph) .tsv files
	  countfile             Hyperlink anchor count JSON file
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTFILE, --outfile OUTFILE
	                        Output file or directory (default: [4mname{countfile}[.stem-{LANG}].json[0m)
	                        (default: None)
	  -s STEM, --stem STEM  Stemming language ISO 639-1 (2-letter) code
	                        (default: None)
	  --head HEAD           Use only N first lines from each partition
	                        (default: None)


clean
^^^^^

.. ansi-block::

    
	usage: minimel clean [-h] [-o OUTFILE] [-s STEM]
	                     [-f | --freqnorm | --no-freqnorm] [-b BADENTFILE]
	                     [-m MIN_COUNT] [-t TOKENSCORE_THRESHOLD]
	                     [-e ENTROPY_THRESHOLD]
	                     [--countratio-threshold COUNTRATIO_THRESHOLD]
	                     [-q QUANTILE_TOP_SHADOWED]
	                     [--cluster-threshold CLUSTER_THRESHOLD]
	                     indexdbfile disambigfile countfile [namecountfile]
	
	Filter anchor counts (given their candidate entity counts).
	
	First, only keep ambiguous candidate entities that either have minimal counts or are
	linked from disambiguation pages.
	If the tokenscore is low, then names with high entropy or countratio
	(len / sum) are removed.
	
	positional arguments:
	  indexdbfile           Wikimapper index sqlite3 database
	  disambigfile          Disambiguation JSON file
	  countfile             Hyperlink anchor count {word: {Q_ent: count}} JSON file
	  namecountfile         Counts of names (regardless of hyperlinks)
	                        (default: None)
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTFILE, --outfile OUTFILE
	                        Output file or directory (default: [4mclean.json[0m)
	                        (default: None)
	  -s STEM, --stem STEM  Stemming language ISO 639-1 (2-letter) code
	                        (default: None)
	  -f, --freqnorm, --no-freqnorm
	                        Normalize counts by total entity frequency
	                        (default: False)
	  -b BADENTFILE, --badentfile BADENTFILE
	                        File of entity IDs to ignore, one per line
	                        (default: None)
	  -m MIN_COUNT, --min-count MIN_COUNT
	                        Minimal candidate entity count
	                        (default: 2)
	  -t TOKENSCORE_THRESHOLD, --tokenscore-threshold TOKENSCORE_THRESHOLD
	                        Threshold for mean asymmetric Jaccard index
	                        between name and candidate entity labels
	                        (default: 0.1)
	  -e ENTROPY_THRESHOLD, --entropy-threshold ENTROPY_THRESHOLD
	                        Entropy threshold (high entropy = flat dist)
	                        (default: 1.0)
	  --countratio-threshold COUNTRATIO_THRESHOLD
	                        Count-ratio (len / sum) threshold
	                        (default: 0.5)
	  -q QUANTILE_TOP_SHADOWED, --quantile-top-shadowed QUANTILE_TOP_SHADOWED
	                        Only train models for a % of names with highest counts
	                        of candidate entities shadowed by the top candidate
	                        (default: None)
	  --cluster-threshold CLUSTER_THRESHOLD
	                        (default: None)
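
The three statistics named in the options above can be sketched as follows. This is only one plausible reading of the help text, not the project's code. It assumes natural-log Shannon entropy, takes the count ratio literally as ``len / sum``, and interprets the asymmetric Jaccard index as the share of the name's tokens that also occur in a candidate label. All function names and sample values are hypothetical.

.. code-block:: python

    import math


    def entropy(counts):
        """Shannon entropy (natural log) of a name's candidate-count distribution."""
        total = sum(counts.values())
        return -sum((c / total) * math.log(c / total) for c in counts.values())


    def count_ratio(counts):
        """Number of candidates divided by their total count (len / sum)."""
        return len(counts) / sum(counts.values())


    def token_score(name, labels):
        """Mean share of the name's tokens that appear in each candidate label."""
        name_tokens = set(name.lower().split())
        shares = [
            len(name_tokens & set(label.lower().split())) / len(name_tokens)
            for label in labels
        ]
        return sum(shares) / len(shares)


    # Hypothetical candidate counts for two names
    peaked = {"Q1": 98, "Q2": 1, "Q3": 1}      # one dominant candidate
    flat = {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 1}  # no dominant candidate

Under these assumptions ``flat`` has an entropy of ``log(4)`` (above the default threshold of 1.0) and a count ratio of 1.0, while ``peaked`` stays well below both thresholds. A name like ``flat`` would therefore be a removal candidate if its token score is also low.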


vectorize
^^^^^^^^^

.. ansi-block::

    
	usage: minimel vectorize [-h] [-o OUTFILE] [--head HEAD] [--stem STEM]
	                         [-v VECTORIZER] [-e ENT_FEATS_CSV]
	                         [-b | --balanced | --no-balanced]
	                         [-u | --usenil | --no-usenil] [--split SPLIT]
	                         [-f FOLD]
	                         paragraphlinks name_count_json
	
	Vectorize paragraph text dataset into Vowpal Wabbit format
	
	positional arguments:
	  paragraphlinks        Paragraph links directory
	  name_count_json       Surfaceform count json file
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTFILE, --outfile OUTFILE
	                        Output file or directory (default: [4mvec*.parts[0m)
	                        (default: None)
	  --head HEAD           Use only N first lines from each partition
	                        (default: None)
	  --stem STEM           Stemming language ISO 639-1 (2-letter) code
	                        (default: None)
	  -v VECTORIZER, --vectorizer VECTORIZER
	                        Scikit-learn vectorizer .pickle or Fasttext .bin word
	                        embeddings. If unset, use tokens directly.
	                        (default: None)
	  -e ENT_FEATS_CSV, --ent-feats-csv ENT_FEATS_CSV
	                        CSV of (ent_id,space separated feat list) entity features
	                        (default: None)
	  -b, --balanced, --no-balanced
	                        Use balanced training
	                        (default: False)
	  -u, --usenil, --no-usenil
	                        Use NIL option
	                        (default: False)
	  --split SPLIT         Split the data into several parts
	                        (default: None)
	  -f FOLD, --fold FOLD  Ignore this fold of the split data
	                        (default: None)
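
For readers unfamiliar with the Vowpal Wabbit text format, the sketch below builds one generic example line of the form ``<label> |<namespace> <features>``. The labels, namespaces and feature layout that ``vectorize`` actually emits are not shown in this reference, so the helper ``to_vw_line`` and the namespace ``t`` are purely illustrative.

.. code-block:: python

    def to_vw_line(label, tokens, namespace="t"):
        """Format one example as a VW text line: '<label> |<ns> <features>'."""
        # ':' and '|' are reserved in the VW input format, so replace them
        safe = [tok.replace(":", "_").replace("|", "_") for tok in tokens if tok]
        return f"{label} |{namespace} {' '.join(safe)}"


    # Tokens used directly as features, as when no vectorizer is given
    line = to_vw_line(1, "Paris is the capital of France".lower().split())

Here ``line`` is ``1 |t paris is the capital of france``.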


ent-feats
^^^^^^^^^

.. ansi-block::

    
	usage: minimel ent-feats [-h] [-p PART] spo_parquet anchor_json
	
	Extract entity features from parquet triples
	
	positional arguments:
	  spo_parquet           Parquet triple file
	  anchor_json           Anchor counts
	
	options:
	  -h, --help            show this help message and exit
	  -p PART, --part PART  Filter part of features based on count
	                        <1: Quantile of feature count
	                        >1: Minimum feature count
	                        (default: 1)


train
^^^^^

.. ansi-block::

    
	usage: minimel train [-h] [-o OUTFILE] [-b BITS] vec_file
	
	Train Logistic Regression models
	
	Writes
	
	positional arguments:
	  vec_file              Training data in Vowpal Wabbit format
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTFILE, --outfile OUTFILE
	                        Output file or directory (default: [4mmodel.b{bits}.vw[0m)
	                        (default: None)
	  -b BITS, --bits BITS  Number of bits of the Vowpal Wabbit feature hash function
	                        (default: 20)
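
The ``--bits`` option controls the size of the hashed feature space: with ``b`` bits there are ``2**b`` weight slots, and every feature name is hashed into one of them. The sketch below shows the idea using ``zlib.crc32`` as a stand-in hash. Vowpal Wabbit uses its own hash function, so the slot numbers here will not match a real model, and ``feature_slot`` is a hypothetical helper.

.. code-block:: python

    import zlib


    def feature_slot(feature, bits=20):
        """Map a feature name to one of 2**bits weight slots."""
        # zlib.crc32 is a stand-in; Vowpal Wabbit uses a different hash function
        return zlib.crc32(feature.encode("utf-8")) & ((1 << bits) - 1)


    n_slots = 1 << 20  # 1,048,576 slots at the default of 20 bits
    slot = feature_slot("paris")

Fewer bits give a smaller model but more collisions between unrelated features; more bits do the opposite.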


run
^^^

.. ansi-block::

    
	usage: minimel run [-h] [-o OUTFILE] [-v VECTORIZER]
	                   [--ent-feats-csv ENT_FEATS_CSV] [-l LANG]
	                   [--fallback FALLBACK] [--evaluate | --no-evaluate]
	                   [--evalfile EVALFILE]
	                   [--evalfile-per-name EVALFILE_PER_NAME]
	                   [-p | --predict-only | --no-predict-only]
	                   [-a | --all-scores | --no-all-scores]
	                   [-u | --upperbound | --no-upperbound] [-s SPLIT]
	                   [--fold FOLD]
	                   dawgfile [candidatefile] [modelfile] [runfiles ...]
	
	Perform entity disambiguation
	
	positional arguments:
	  dawgfile              DAWG trie file of Wikipedia > Wikidata mapping
	  candidatefile         Candidate {name -> [ID]} json
	                        (default: None)
	  modelfile             Vowpal Wabbit model
	                        (default: None)
	  runfiles              Input file (- or absent for standard input). TSV rows of
	                        (ID, {name -> ID}, text) or ({name -> ID}, text) or (text)
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTFILE, --outfile OUTFILE
	                        Write outputs to file (default: stdout)
	                        (default: None)
	  -v VECTORIZER, --vectorizer VECTORIZER
	                        Scikit-learn vectorizer .pickle or Fasttext .bin word
	                        embeddings. If unset, use HashingVectorizer.
	                        (default: None)
	  --ent-feats-csv ENT_FEATS_CSV
	                        CSV of (ent_id,space separated feat list) entity features
	                        (default: None)
	  -l LANG, --lang LANG  (default: None)
	  --fallback FALLBACK   Additional fallback deterministic name -> ID json
	                        (default: None)
	  --evaluate, --no-evaluate
	                        Report evaluation scores instead of predictions
	                        (default: False)
	  --evalfile EVALFILE   Write evaluation results to file
	                        (default: None)
	  --evalfile-per-name EVALFILE_PER_NAME
	                        Write evaluation results per name to file
	                        (default: None)
	  -p, --predict-only, --no-predict-only
	                        Only print predictions, not original text
	                        (default: True)
	  -a, --all-scores, --no-all-scores
	                        Output all candidate scores
	                        (default: False)
	  -u, --upperbound, --no-upperbound
	                        Create upper bound on performance
	                        (default: False)
	  -s SPLIT, --split SPLIT
	                        Split the data into several parts
	                        (default: None)
	  --fold FOLD           Use only this fold of the split data
	                        (default: None)
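
The three accepted row shapes for ``runfiles`` can be told apart by their number of tab-separated fields. The sketch below is a hypothetical reader, not the project's parser. It assumes that the ``{name -> ID}`` column is a JSON object and that the text column is the last field.

.. code-block:: python

    import json


    def parse_row(line):
        """Split one input row into (doc_id, links, text); absent fields are None."""
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3:
            doc_id, links, text = fields
            return doc_id, json.loads(links), text
        if len(fields) == 2:
            links, text = fields
            return None, json.loads(links), text
        return None, None, fields[0]


    # Hypothetical rows: one with all three columns, one with text only
    full = parse_row('doc1\t{"Paris": "Q90"}\tParis is in France.\n')
    bare = parse_row("Paris is in France.\n")
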


evaluate
^^^^^^^^

.. ansi-block::

    
	usage: minimel evaluate [-h] [-a [AGG ...]] [-e EVALFILE]
	                        goldfile [predfiles ...]
	
	Evaluate predictions
	
	positional arguments:
	  goldfile
	  predfiles
	
	options:
	  -h, --help            show this help message and exit
	  -a [AGG ...], --agg [AGG ...]
	                        Aggregation jsons (TODO: depend on data...?)
	                        (default: ())
	  -e EVALFILE, --evalfile EVALFILE
	                        Write evaluation results to file
	                        (default: None)


experiment
^^^^^^^^^^

.. ansi-block::

    
	usage: minimel experiment [-h] [-o OUTDIR] [-n NPARTS] [--head HEAD]
	                          [--split [SPLIT ...]] [--fold [FOLD ...]]
	                          [--stem [STEM ...]] [-m [MIN_COUNT ...]]
	                          [--freqnorm [FREQNORM ...]]
	                          [--badentfile [BADENTFILE ...]]
	                          [-t [TOKENSCORE_THRESHOLD ...]]
	                          [--entropy-threshold [ENTROPY_THRESHOLD ...]]
	                          [--countratio-threshold [COUNTRATIO_THRESHOLD ...]]
	                          [-q [QUANTILE_TOP_SHADOWED ...]]
	                          [--cluster-threshold [CLUSTER_THRESHOLD ...]]
	                          [-v [VECTORIZER ...]]
	                          [--ent-feats-csv [ENT_FEATS_CSV ...]]
	                          [--balanced [BALANCED ...]] [--usenil [USENIL ...]]
	                          [--bits [BITS ...]] [-r [RUNFILE ...]]
	                          [--use-fallback [USE_FALLBACK ...]]
	                          [-a | --also-baseline | --no-also-baseline]
	                          [--evaluate | --no-evaluate]
	                          [--evaluate-per-name | --no-evaluate-per-name]
	                          [root]
	
	Run all steps to train and evaluate EL models over a parameter sweep.
	
	The root directory must contain the following files:
	
	- [4mindex_*.dawg[0m: DAWG trie mapping of article names -> numeric IDs
	
	- [4m*-disambig.txt[0m: See [4mdisambig_ent_file[0m in ~get_disambig.get_disambig
	
	positional arguments:
	  root                  Root directory
	                        (default: .)
	
	options:
	  -h, --help            show this help message and exit
	  -o OUTDIR, --outdir OUTDIR
	                        Write outputs to this directory
	                        (default: None)
	  -n NPARTS, --nparts NPARTS
	                        Number of parts to chunk wikidump into
	                        (default: 100)
	  --head HEAD           Use only N first lines from each partition
	                        (default: None)
	  --split [SPLIT ...]   Split the data into several parts
	                        (default: (None,))
	  --fold [FOLD ...]     Ignore this fold of the split data in training, use in evaluation
	                        (default: (None,))
	  --stem [STEM ...]     Stemming language ISO 639-1 (2-letter) code (use X for no stemming)
	                        (default: ('',))
	  -m [MIN_COUNT ...], --min-count [MIN_COUNT ...]
	                        Minimal (anchor-text, target) occurrence
	                        (default: (2,))
	  --freqnorm [FREQNORM ...]
	                        Normalize counts by total entity frequency (1/0)
	                        (default: (False,))
	  --badentfile [BADENTFILE ...]
	                        File of entity IDs to ignore, one per line (default: [4m*-disambig.txt[0m)
	                        (default: ('',))
	  -t [TOKENSCORE_THRESHOLD ...], --tokenscore-threshold [TOKENSCORE_THRESHOLD ...]
	                        Threshold for mean asymmetric Jaccard index
	                        between name and candidate entity labels
	                        (default: (0.1,))
	  --entropy-threshold [ENTROPY_THRESHOLD ...]
	                        Entropy threshold (high entropy = flat dist)
	                        (default: (1.0,))
	  --countratio-threshold [COUNTRATIO_THRESHOLD ...]
	                        Count-ratio (len / sum) threshold
	                        (default: (0.5,))
	  -q [QUANTILE_TOP_SHADOWED ...], --quantile-top-shadowed [QUANTILE_TOP_SHADOWED ...]
	                        Only train models for a % of names with highest counts
	                        of candidate entities shadowed by the top candidate
	                        (default: (0,))
	  --cluster-threshold [CLUSTER_THRESHOLD ...]
	                        Cluster names based on their meanings
	                        (default: (None,))
	  -v [VECTORIZER ...], --vectorizer [VECTORIZER ...]
	                        Scikit-learn vectorizer .pickle or Fasttext .bin word
	                        embeddings. If unset, use tokens directly.
	                        (default: ('',))
	  --ent-feats-csv [ENT_FEATS_CSV ...]
	                        CSV of (ent_id,space separated feat list) entity features
	                        (default: ('',))
	  --balanced [BALANCED ...]
	                        Use balanced training
	                        (default: (False,))
	  --usenil [USENIL ...]
	                        Use NIL option for training unlinked mentions
	                        (default: (False,))
	  --bits [BITS ...]     Number of bits of the Vowpal Wabbit feature hash function
	                        (default: (20,))
	  -r [RUNFILE ...], --runfile [RUNFILE ...]
	                        TSV rows of (ID, {name -> ID}, text) or ({name -> ID}, text)
	                        (default: ('',))
	  --use-fallback [USE_FALLBACK ...]
	                        Use raw counts as fallback
	                        (default: (True,))
	  -a, --also-baseline, --no-also-baseline
	                        Also run a baseline model without model predictions
	                        (default: True)
	  --evaluate, --no-evaluate
	                        Write evaluation scores to file
	                        (default: False)
	  --evaluate-per-name, --no-evaluate-per-name
	                        Write evaluation scores per name to file
	                        (default: False)
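
Most options above accept several values, which together define the parameter sweep. One common way to expand such a sweep is a Cartesian product over all value lists, sketched below. Whether ``experiment`` runs every combination in exactly this way is an assumption, and the sweep values shown are hypothetical.

.. code-block:: python

    from itertools import product

    # Hypothetical sweep values, keyed by option names from the usage above
    sweep = {
        "min_count": (2, 5),
        "bits": (20, 24),
        "balanced": (False, True),
    }

    # One configuration dict per combination of values
    configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]

Two values for each of three options yield eight configurations, so the number of runs grows multiplicatively with each option that receives more than one value.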


audit
^^^^^

.. ansi-block::

    
	usage: minimel audit [-h] modelfile datafile name [limit]
	
	Print prediction scores and model coefficients
	
	positional arguments:
	  modelfile   Model
	  datafile    VW format vectorized data
	  name
	  limit       (default: 1000)
	
	options:
	  -h, --help  show this help message and exit