minimel.get_paragraphs module

Extract hyperlinks from Wikipedia dumps

minimel.get_paragraphs.get_str(node)
minimel.get_paragraphs.get_text(w)
minimel.get_paragraphs.process_line(pagename, mwcode, index, skip=None)
minimel.get_paragraphs.get_anchor_paragraphs(lines, dawgfile, skip=[])
minimel.get_paragraphs.get_paragraphs(wikidump: Path, dawgfile: Path, *skip: str, nparts: int = 1000)

Extract hyperlinks from Wikipedia dumps.

Writes to outdir.

Parameters:
  • wikidump (Path) – Wikipedia pages-articles XML dump file

  • dawgfile (Path) – DAWG trie file containing the Wikipedia-to-Wikidata mapping

  • skip (str) – Skip pages with any of these prefixes (zero or more may be given)

  • nparts (int) – Number of chunks to read (keyword-only; default: 1000)
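
To illustrate what extracting hyperlinks from wikitext involves, here is a minimal, self-contained sketch. It is not minimel's implementation: the function name page_links, the regular expression, and the way skip is applied to page names are all assumptions made for this example, and it only handles simple [[Target]] and [[Target|anchor]] links.

```python
import re

# Hypothetical pattern (an assumption, not taken from minimel):
# matches [[Target]], [[Target#Section]] and [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def page_links(pagename, wikitext, skip=()):
    """Return (anchor_text, target_page) pairs for simple wikilinks.

    Pages whose name starts with any prefix in `skip` yield no links.
    """
    # str.startswith accepts a tuple of prefixes; an empty tuple never matches.
    if pagename.startswith(tuple(skip)):
        return []
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe (or with an empty label), the anchor is the target itself.
        anchor = (match.group(2) or target).strip()
        # Use underscores in page titles, as Wikipedia URLs do.
        links.append((anchor, target.replace(" ", "_")))
    return links


text = "[[Amsterdam]] is the capital of [[Netherlands|the Netherlands]]."
print(page_links("Amsterdam", text))
print(page_links("Wikipedia:Sandbox", text, skip=("Wikipedia:",)))
```

The first call returns one pair per link, and the second returns an empty list because the page name matches a skip prefix. A real dump contains templates, nested links, and file links that a regular expression like this does not handle, so a full wikitext parser is the safer choice for production use.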