minimel.vectorize module

Vectorize paragraph text dataset

minimel.vectorize.vw_tok(text)
minimel.vectorize.vw(lines, name_count_json: Path, ent_feats_csv=None, balanced=False, stem=False, usenil=False, head=None, split=None, fold=None)

Create training data in Vowpal Wabbit (VW) format

Parameters:
  • lines – iterable of TSV lines with the fields (pageid, {name: entityid} JSON object, text)

  • name_count_json (Path) – path to json file of {name: {entityid: weight}}

  • ent_feats_csv – path to csv of (entityid,feat1 feat2 feat3 …)
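
The three input formats described above can be illustrated with a small parsing sketch. The sample values, the helper names parse_line and read_ent_feats, and the choice to keep pageid as a string are assumptions made for illustration; they are not part of minimel.

```python
import csv
import io
import json

# Hypothetical sample data that follows the documented shapes.
tsv_line = '12\t{"Paris": 90}\tParis is the capital of France.\n'
name_count = {"Paris": {"90": 120, "167646": 3}}  # {name: {entityid: weight}}
ent_feats_text = "90,city capital\n167646,person\n"  # entityid,feat1 feat2 ...


def parse_line(line):
    """Split one (pageid, links JSON, text) TSV line into its three fields."""
    pageid, links_json, text = line.rstrip("\n").split("\t", 2)
    return pageid, json.loads(links_json), text


def read_ent_feats(csv_text):
    """Map each entity id to its list of space-separated features."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1].split() for row in reader if row}


pageid, links, text = parse_line(tsv_line)
ent_feats = read_ent_feats(ent_feats_text)
for name, ent in links.items():
    # Candidate entities and their weights for this name, if any.
    candidates = name_count.get(name, {})
    print(pageid, name, ent, candidates, ent_feats.get(str(ent), []))
```

JSON object keys are always strings, so the sketch converts the linked entity id with str() before looking up its features.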

Keyword Arguments:
  • head – Use only the first N lines of each partition

  • stem – ISO 639-1 (2-letter) code of the language to use for stemming

  • ent_feats_csv – CSV of (ent_id,space separated feat list) entity features

  • balanced – Use balanced training

  • usenil – Use NIL option

  • split – Split the data into several parts

  • fold – Ignore this fold of the split data
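
How head, split and fold might interact can be sketched in a few lines. This is only an illustration: the helper select_lines is hypothetical, it works on a single iterable rather than on partitions, and the round-robin assignment of line i to fold i % split is an assumption, since the way minimel assigns lines to folds is not documented here.

```python
from itertools import islice


def select_lines(lines, head=None, split=None, fold=None):
    """Yield lines, optionally truncated to `head` and with one fold left out.

    Assumption: line i belongs to fold i % split.
    """
    if head is not None:
        lines = islice(lines, head)  # keep only the first `head` lines
    for i, line in enumerate(lines):
        if split is not None and fold is not None and i % split == fold:
            continue  # this line is in the ignored fold
        yield line


lines = [f"line{i}" for i in range(10)]
train = list(select_lines(lines, split=5, fold=0))
print(train)
```

Leaving one fold out of the training data in this way keeps it available for evaluation, which is the usual reason for a split and fold pair of options.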

class minimel.vectorize.TransLiterator(lang)

Bases: object

code(text)

minimel.vectorize.hashvec(paragraphs, dim=None, lang=None, tokenizer=None)
minimel.vectorize.transform(paragraphs, vectorizer)
minimel.vectorize.embed(paragraphs, embeddingsfile, dim=None)
minimel.vectorize.vectorize(paragraphlinks: Path, name_count_json: Path, *, outfile: Path | None = None, head: int | None = None, stem: str | None = None, vectorizer: Path | None = None, ent_feats_csv: Path | None = None, balanced: bool = False, usenil: bool = False, split: int | None = None, fold: int | None = None)

Vectorize paragraph text dataset into Vowpal Wabbit format

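Parameters:
  • paragraphlinks (Path) – path to the paragraph text dataset to vectorize

  • name_count_json (Path) – path to json file of {name: {entityid: weight}}
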
Keyword Arguments:
  • outfile – Output file or directory (default: vec*.parts)

  • head – Use only the first N lines of each partition

  • stem – ISO 639-1 (2-letter) code of the language to use for stemming

  • vectorizer – Path to a pickled scikit-learn vectorizer (.pickle) or fastText word embeddings (.bin). If unset, tokens are used directly.

  • ent_feats_csv – CSV of (ent_id,space separated feat list) entity features

  • balanced – Use balanced training

  • usenil – Use NIL option

  • split – Split the data into several parts

  • fold – Ignore this fold of the split data
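
For readers unfamiliar with the Vowpal Wabbit text format, the sketch below shows its basic shape when tokens are used directly as features: a label, a "|" followed by a namespace, then space-separated feature names. The helpers vw_safe_tokens and vw_example, the namespace "t" and the integer label are hypothetical; the label scheme and namespaces that minimel actually writes are not documented here, and vw_safe_tokens is not the module's vw_tok.

```python
import re


def vw_safe_tokens(text):
    """Return lowercase word tokens.

    The pattern \\w+ never matches a space, '|' or ':', which are the
    characters with special meaning in VW's text input format.
    """
    return re.findall(r"\w+", text.lower())


def vw_example(label, text, namespace="t"):
    """Format one example as '<label> |<namespace> <feature> <feature> ...'."""
    return f"{label} |{namespace} " + " ".join(vw_safe_tokens(text))


line = vw_example(1, "Paris: capital | largest city of France")
print(line)
```

Dropping or escaping "|" and ":" matters because an unescaped "|" would start a new namespace and an unescaped ":" would be read as a feature value separator.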