minimel.vectorize module
Vectorize paragraph text dataset
- minimel.vectorize.vw_tok(text)
- minimel.vectorize.vw(lines, name_count_json: Path, ent_feats_csv=None, balanced=False, stem=False, usenil=False, head=None, split=None, fold=None)
Create VW-formatted training data
- Parameters:
lines – iterable of (pageid, {name: entityid} json, text) tsv lines
name_count_json (Path) – path to json file of {name: {entityid: weight}}
ent_feats_csv – path to csv of (entityid,feat1 feat2 feat3 …)
- Keyword Arguments:
head – Use only the first N lines from each partition
stem – Stemming language ISO 639-1 (2-letter) code
ent_feats_csv – CSV of (ent_id,space separated feat list) entity features
balanced – Use balanced training
usenil – Use NIL option
split – Split the data into several parts
fold – Ignore this fold of the split data
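The input shapes described above can be illustrated with a small sketch. It only builds sample data in the documented layouts and does not call minimel. The entity IDs, names, features, and the `make_line` helper are invented for illustration, and whether entity IDs are stored as integers or strings is an assumption.

```python
import csv
import io
import json


def make_line(pageid, links, text):
    """Hypothetical helper: build one (pageid, {name: entityid} json, text) TSV line."""
    return "\t".join([str(pageid), json.dumps(links), text])


# Two paragraphs that mention the same name but link to different entities.
# All IDs below are made up for this example.
lines = [
    make_line(1, {"Paris": 90}, "Paris is the capital of France."),
    make_line(2, {"Paris": 167646}, "Paris abducted Helen of Sparta."),
]

# Content of name_count_json: {name: {entityid: weight}}.
# JSON object keys are always strings, so entity IDs appear quoted here.
name_counts = {"Paris": {"90": 120, "167646": 7}}
name_count_text = json.dumps(name_counts)

# Content of ent_feats_csv: one row per entity, (entityid, "feat1 feat2 ...").
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([90, "city capital"])
writer.writerow([167646, "mythical prince"])
ent_feats_text = buf.getvalue()

# Reading one TSV line back into its three fields.
pageid, links_json, text = lines[0].split("\t")
links = json.loads(links_json)
```

In practice these strings would be written to files and their paths passed as `name_count_json` and `ent_feats_csv`.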
- minimel.vectorize.hashvec(paragraphs, dim=None, lang=None, tokenizer=None)
- minimel.vectorize.transform(paragraphs, vectorizer)
- minimel.vectorize.embed(paragraphs, embeddingsfile, dim=None)
- minimel.vectorize.vectorize(paragraphlinks: Path, name_count_json: Path, *, outfile: Path | None = None, head: int | None = None, stem: str | None = None, vectorizer: Path | None = None, ent_feats_csv: Path | None = None, balanced: bool = False, usenil: bool = False, split: int | None = None, fold: int | None = None)
Vectorize paragraph text dataset into Vowpal Wabbit format
- Parameters:
paragraphlinks (Path)
name_count_json (Path) – path to json file of {name: {entityid: weight}}
- Keyword Arguments:
outfile – Output file or directory (default: vec*.parts)
head – Use only the first N lines from each partition
stem – Stemming language ISO 639-1 (2-letter) code
vectorizer – Scikit-learn vectorizer .pickle or Fasttext .bin word embeddings. If unset, use tokens directly.
ent_feats_csv – CSV of (ent_id,space separated feat list) entity features
balanced – Use balanced training
usenil – Use NIL option
split – Split the data into several parts
fold – Ignore this fold of the split data
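The split and fold options suggest a hold-out scheme: the data is divided into several parts and one part is left out. How minimel assigns lines to parts is not stated here, so the following is only a sketch under the assumption of round-robin assignment by line index. The function `drop_fold` is hypothetical and not part of minimel.

```python
from typing import Iterable, Iterator, Optional


def drop_fold(
    lines: Iterable[str],
    split: Optional[int] = None,
    fold: Optional[int] = None,
) -> Iterator[str]:
    """Yield lines, skipping those that fall into part `fold` of `split` parts.

    Assumption: line i belongs to part i % split. The real assignment
    used by minimel may differ.
    """
    for i, line in enumerate(lines):
        if split is not None and fold is not None and i % split == fold:
            continue  # this line belongs to the held-out fold
        yield line


sample = [f"line{i}" for i in range(6)]
kept = list(drop_fold(sample, split=3, fold=1))
```

With `split=3` and `fold=1`, the lines at indices 1 and 4 are left out, which keeps them available for evaluation while the rest is used for training.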