minimel.vectorize module

Vectorize paragraph text dataset

minimel.vectorize.vw_tok(text)

minimel.vectorize.vw(lines, name_count_json: Path, ent_feats_csv=None, balanced=False, stem=False, usenil=False, head=None, split=None, fold=None)

Create training data in Vowpal Wabbit (VW) format

Parameters:

lines – iterable of tab-separated lines with three fields: pageid, a JSON object of {name: entityid}, and the paragraph text
name_count_json (Path) – path to a JSON file of {name: {entityid: weight}}
ent_feats_csv – path to a CSV file of (entityid,feat1 feat2 feat3 …) rows

Keyword Arguments:

head – Use only the first N lines from each partition
stem – ISO 639-1 (2-letter) code of the stemming language
ent_feats_csv – CSV of (ent_id,space-separated feature list) entity features
balanced – Use balanced training
usenil – Use NIL option
split – Split the data into several parts
fold – Ignore this fold of the split data
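
The input shapes listed above can be illustrated with a short sketch. It does not call minimel; it only builds and parses data in the documented layouts. All page IDs, entity IDs, weights, and feature names are invented for illustration, and treating entity IDs as integers is an assumption.

```python
import csv
import io
import json

# One input line: pageid <TAB> JSON of {name: entityid} <TAB> paragraph text.
# All values here are made up for illustration.
links = {"Paris": 90, "France": 142}
line = "\t".join(["12345", json.dumps(links), "Paris is the capital of France."])

# Parse the line back into its three fields.
pageid, links_json, text = line.split("\t", 2)
parsed_links = json.loads(links_json)

# name_count_json content: {name: {entityid: weight}}.
# JSON object keys are always strings, so entity IDs appear as strings here.
name_counts = json.loads('{"Paris": {"90": 1200, "167646": 35}}')

# ent_feats_csv content: one row per entity, features separated by spaces.
ent_feats_text = "90,city capital\n142,country\n"
ent_feats = {
    int(ent_id): feats.split()
    for ent_id, feats in csv.reader(io.StringIO(ent_feats_text))
}
```

Because the feature list is a single CSV field, splitting it on whitespace recovers the individual features.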

class minimel.vectorize.TransLiterator(lang)

Bases: object

code(text)

minimel.vectorize.hashvec(paragraphs, dim=None, lang=None, tokenizer=None)
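
The source gives no description for hashvec. If it follows the usual feature-hashing approach suggested by its name and dim parameter (an assumption), the idea can be sketched as below. The helper hash_features is hypothetical and is not part of minimel; it uses plain whitespace tokenization and a CRC32 hash rather than whatever minimel uses internally.

```python
import zlib
from collections import Counter


def hash_features(text: str, dim: int = 16) -> dict:
    """Map whitespace tokens to bucket counts using a stable hash.

    Hypothetical helper for illustration; not minimel's implementation.
    """
    buckets = Counter()
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        buckets[zlib.crc32(token.encode("utf-8")) % dim] += 1
    return dict(buckets)


vec = hash_features("the cat saw the other cat", dim=16)
```

A fixed dim bounds the feature space regardless of vocabulary size, at the cost of occasional collisions between unrelated tokens.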

minimel.vectorize.transform(paragraphs, vectorizer)

minimel.vectorize.embed(paragraphs, embeddingsfile, dim=None)

Vectorize paragraph text dataset into Vowpal Wabbit format

Parameters:

paragraphlinks (Path) – Paragraph links directory
name_count_json (Path) – Surfaceform count json file
outfile (Optional[Path])
head (Optional[int])
stem (Optional[str])
vectorizer (Optional[Path])
ent_feats_csv (Optional[Path])
balanced (bool)
usenil (bool)
split (Optional[int])
fold (Optional[int])

Keyword Arguments:

outfile – Output file or directory (default: vec*.parts)
head – Use only the first N lines from each partition
stem – ISO 639-1 (2-letter) code of the stemming language
vectorizer – scikit-learn vectorizer (.pickle) or fastText word embeddings (.bin). If unset, tokens are used directly.
ent_feats_csv – CSV of (ent_id,space-separated feature list) entity features
balanced – Use balanced training
usenil – Use NIL option
split – Split the data into several parts
fold – Ignore this fold of the split data
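
One plausible reading of split and fold is a k-fold hold-out: the data is divided into split parts, and the part numbered fold is left out of the training output. The sketch below assumes round-robin assignment by line index, which the source does not confirm. The helper drop_fold is hypothetical and is not part of minimel.

```python
from typing import Iterable, Iterator, Optional


def drop_fold(lines: Iterable[str], split: Optional[int] = None,
              fold: Optional[int] = None) -> Iterator[str]:
    """Yield lines, skipping those assigned to the held-out fold.

    Hypothetical helper; assumes line i belongs to part i % split.
    """
    for i, line in enumerate(lines):
        if split and fold is not None and i % split == fold:
            continue
        yield line


data = [f"line{i}" for i in range(6)]
train = list(drop_fold(data, split=3, fold=1))
held_out = [x for x in data if x not in train]
```

Running the same data once per fold value would give each part a turn as the held-out set, which is the usual setup for cross-validation.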