minimel.get_disambig module

Extract list-hyperlinks from Wikipedia disambiguation pages

minimel.get_disambig.writer(fn)
minimel.get_disambig.query_pages(langcode: str, *, query_listpages: bool = False, outfile: Path | None = None)

Query the Wikidata API for disambiguation pages (and list pages, if requested).

Returns Wikidata QIDs, one per line.

Keyword Arguments:
  • query_listpages (bool) – Whether to also query for list pages
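
Because the result is one Wikidata QID per line, it can be read back with the standard library alone. The following is a minimal sketch, assuming the QIDs were saved to a UTF-8 text file; `read_qids` and `QID_PATTERN` are hypothetical names used for illustration and are not part of minimel.

```python
import re
from pathlib import Path

# A Wikidata QID is the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def read_qids(path: Path) -> set[str]:
    """Read a flat text file with one Wikidata QID per line.

    Hypothetical helper: blank lines and lines that are not
    well-formed QIDs are skipped.
    """
    qids: set[str] = set()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if QID_PATTERN.fullmatch(line):
                qids.add(line)
    return qids
```

A set is used here so that repeated IDs collapse and membership checks stay cheap when filtering pages.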

minimel.get_disambig.get_disambig(wikidump: Path, dawgfile: Path, disambig_ent_file: Path | None = None, *, disambig_template: str | None = None, nparts: int = 1000)

Get disambiguation links.

Writes disambig.json.

Parameters:
  • wikidump (Path) – Wikipedia XML dump file

  • dawgfile (Path) – DAWG trie file of the Wikipedia-to-Wikidata mapping

  • disambig_ent_file (Optional[Path]) – Flat text file of disambiguation pages with one entity ID per line

Keyword Arguments:
  • disambig_template (Optional[str]) – Name of a template that marks disambiguation pages. If set, pages containing this template are used instead of the IDs listed in disambig_ent_file; if disambig_ent_file is also provided, that file is created.

  • nparts (int) – Number of chunks to read (default: 1000)
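
To illustrate what selecting pages by template name can look like, here is a minimal sketch that checks whether a page's wikitext contains a template with a given name. It is an assumption about the general technique, not minimel's actual implementation; `has_template` is a hypothetical helper, and the regular expression only handles simple `{{Name}}` and `{{Name|...}}` forms.

```python
import re


def has_template(wikitext: str, template_name: str) -> bool:
    """Return True if the wikitext uses a template with this name.

    Hypothetical helper for illustration only.
    """
    if not template_name:
        raise ValueError("template_name must not be empty")
    first, rest = template_name[0], template_name[1:]
    # On Wikipedia, the first letter of a page or template name is
    # case-insensitive, so accept both cases for that letter only.
    first_alt = "(?:" + re.escape(first.upper()) + "|" + re.escape(first.lower()) + ")"
    # The name must be followed by "|" (arguments) or "}}" (end of the
    # template), so "Disambiguation needed" does not match "Disambiguation".
    pattern = re.compile(
        r"\{\{\s*" + first_alt + re.escape(rest) + r"\s*(?:\||\}\})"
    )
    return pattern.search(wikitext) is not None
```

For example, `has_template("{{Disambiguation}}\n* [[A]]", "disambiguation")` evaluates to `True`. A full wikitext parser would be more robust against comments and nested templates than this regular expression.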