Getting Started

First, install the library with the train extras, which are needed to train models:

pip install -e "git+https://github.com/bennokr/minimel.git#egg=minimel[train]"
[1]:
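wiki = 'iawiki-latest' # use the small Interlingua Wikipedia as a test case
root = 'wiki/' + wiki
!mkdir -p $root
# Download the page, page_props and redirect tables for this wiki
!wikimapper download $wiki --dir $root
# Build an SQLite index that maps Wikipedia titles to Wikidata IDs
outdb = f'{root}/index_{wiki}.db'
!wikimapper create $wiki --dumpdir $root --target $outdb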
2024-05-22 23:31:53,738 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-page.sql.gz] to [wiki/iawiki-latest/iawiki-latest-page.sql.gz]
2024-05-22 23:32:02,031 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-page_props.sql.gz] to [wiki/iawiki-latest/iawiki-latest-page_props.sql.gz]
2024-05-22 23:32:04,885 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-redirect.sql.gz] to [wiki/iawiki-latest/iawiki-latest-redirect.sql.gz]
2024-05-22 23:32:06,819 - wikimapper.processor - INFO - Creating index for [iawiki-latest] in [wiki/iawiki-latest/index_iawiki-latest.db]
2024-05-22 23:32:06,822 - wikimapper.processor - INFO - Parsing pages dump
2024-05-22 23:32:07,209 - wikimapper.processor - INFO - Creating database index on 'wikipedia_title'
2024-05-22 23:32:07,237 - wikimapper.processor - INFO - Parsing page properties dump
2024-05-22 23:32:07,529 - wikimapper.processor - INFO - Parsing redirects dump
2024-05-22 23:32:07,591 - wikimapper.processor - INFO - Creating database index on 'wikidata_id'
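The index written by wikimapper is an SQLite file, so it can be inspected directly from Python. The sketch below looks up the Wikidata ID stored for a page title using only the standard sqlite3 module. It assumes the table is called mapping and has wikipedia_title and wikidata_id columns; this matches the index names in the log above but is not confirmed by this notebook. lookup_title is a hypothetical helper, not part of wikimapper or minimel.

```python
import sqlite3

def lookup_title(db_path, title):
    """Return the Wikidata ID stored for a page title, or None if absent."""
    # Assumed schema: table `mapping` with columns
    # `wikipedia_title` and `wikidata_id` (an assumption, see the text above).
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT wikidata_id FROM mapping WHERE wikipedia_title = ?",
            (title,),
        ).fetchone()
    finally:
        con.close()
    return row[0] if row else None
```

With the variables from the cell above, a call would look like lookup_title(outdb, 'Interlingua'). Titles in the dump tables normally use underscores instead of spaces, so multi-word titles may need that form.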
[2]:
# Build a DAWG trie from the wikimapper index; it is saved next to it as a .dawg file
!minimel -v index $outdb
Loading mapping...: 100%|█████████████| 34570/34570 [00:00<00:00, 329200.66it/s]
INFO:root:Building IntDAWG trie...
INFO:root:Saving to wiki/iawiki-latest/index_iawiki-latest.dawg...
[3]:
wikiname = wiki.split('-')[0]  # 'iawiki'
# Download the articles dump and decompress it in place
!wget -P $root https://dumps.wikimedia.org/$wikiname/latest/$wiki-pages-articles.xml.bz2
!bunzip2 $root/$wiki-pages-articles.xml.bz2
--2024-05-22 23:35:13--  https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-pages-articles.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.71
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10654986 (10M) [application/octet-stream]
Saving to: ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’

iawiki-latest-pages  82%[===============>    ]   8,37M  17,3KB/s    in 4m 7s

2024-05-22 23:39:26 (34,7 KB/s) - Connection closed at byte 8781489. Retrying.

--2024-05-22 23:39:27--  (try: 2)  https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-pages-articles.xml.bz2
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 10654986 (10M), 1873497 (1,8M) remaining [application/octet-stream]
Saving to: ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’

iawiki-latest-pages 100%[++++++++++++++++===>]  10,16M   404KB/s    in 4,5s

2024-05-22 23:39:34 (404 KB/s) - ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’ saved [10654986/10654986]
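If bunzip2 is not available, the decompression step can be done from Python instead. This is a minimal sketch using only the standard library; decompress_bz2 is a hypothetical helper name, and unlike bunzip2 it leaves the compressed file in place.

```python
import bz2
import shutil

def decompress_bz2(src, dst, chunk_size=1 << 20):
    """Stream-decompress src (.bz2) into dst without loading it into memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in chunks so large dumps do not need to fit in RAM.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

For this notebook the call would be decompress_bz2(f'{root}/{wiki}-pages-articles.xml.bz2', f'{root}/{wiki}-pages-articles.xml').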

[4]:
dump = f'{root}/{wiki}-pages-articles.xml'
dawg = f'{root}/index_{wiki}.dawg'
# Extract paragraphs and their links from the dump, split into 100 partitions
!minimel -v get-paragraphs -n 100 $dump $dawg
[########################################] | 100% Completed | 30.5s
INFO:root:Finished in 30s
INFO:root:Wrote 100 partitions
[7]:
lang = wiki.split('wiki')[0]  # 'ia'
disambigpages = f'{root}/ents-disambig.txt'
# Query the disambiguation pages for this language edition
!minimel -v query-pages $lang -o $disambigpages
INFO:root:Writing to wiki/iawiki-latest/ents-disambig.txt
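The two string splits used so far derive the dump name and the language code from the same wiki identifier. The short check below spells out what they evaluate to for this notebook's value; it only restates expressions already used in the cells above.

```python
wiki = 'iawiki-latest'
root = 'wiki/' + wiki

wikiname = wiki.split('-')[0]   # part before the first '-': the dump name
lang = wiki.split('wiki')[0]    # part before 'wiki': the language code

print(wikiname)                       # iawiki
print(lang)                           # ia
print(f'{root}/index_{wiki}.db')      # wiki/iawiki-latest/index_iawiki-latest.db
```

To run the notebook on another Wikipedia, only the value of wiki needs to change, as long as it keeps the <language>wiki-<date> shape.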
[8]:
# Extract the links found on the disambiguation pages
!minimel -v get-disambig -n 100 $dump $dawg $disambigpages
INFO:root:Extracting disambiguation links...
[########################################] | 100% Completed |  2.5s
INFO:root:Finished in 2s
INFO:root:Writing to wiki/iawiki-latest/disambig.json
[9]:
paragraphlinks = f'{root}/{wiki}-paragraph-links/'
# Count the links in the extracted paragraphs and aggregate the counts
!minimel -v count $paragraphlinks
INFO:root:Counting links...
[########################################] | 100% Completed |  6.8s
INFO:root:Finished in 6s
INFO:root:Got 32602 counts.
INFO:root:Aggregating...
[########################################] | 100% Completed | 10.5s
INFO:root:Finished in 10s
INFO:root:Writing to wiki/iawiki-latest/count.min2.json
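To make the counting step more concrete, here is a rough sketch of the general idea: tally how often each name links to each entity and drop rare pairs. The input shape, the output shape and the threshold of 2 (suggested only by the min2 in the file name) are all assumptions made for illustration; count_links is a hypothetical helper and not minimel's implementation.

```python
from collections import Counter, defaultdict

def count_links(links, min_count=2):
    """Aggregate (name, entity) pairs into {name: {entity: count}}.

    Pairs seen fewer than min_count times are dropped. Both the input
    shape and the threshold are assumptions made for illustration.
    """
    pair_counts = Counter(links)
    counts = defaultdict(dict)
    for (name, entity), n in pair_counts.items():
        if n >= min_count:
            counts[name][entity] = n
    return dict(counts)

# Toy data with placeholder entity keys, not taken from the dump.
links = [
    ("Paris", "city"), ("Paris", "city"), ("Paris", "city"),
    ("Paris", "myth"), ("Paris", "myth"),
    ("Paris", "film"),  # seen once, dropped by min_count=2
]
print(count_links(links))  # {'Paris': {'city': 3, 'myth': 2}}
```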
[10]:
# Get Wikidata IDs for disambiguation and list articles
badent = f'{root}/badent.txt'
!minimel query-pages $lang -q -o $badent
[11]:
disambigfile = f'{root}/disambig.json'
countfile = f'{root}/count.min2.json'
# Remove the bad entities and filter out bad names from the counts
!minimel -v clean -b $badent $outdb $disambigfile $countfile
Counting entities...: 100%|███████████| 11560/11560 [00:00<00:00, 178917.02it/s]
INFO:root:Removing 133 bad entities
Loading labels...: 100%|███████████████| 34570/34570 [00:00<00:00, 97792.93it/s]
Filtering names...: 100%|██████████████| 11498/11498 [00:00<00:00, 17444.66it/s]
INFO:root:Filtering out 1 bad names
INFO:root:Keeping 11497 good names
INFO:root:Writing to wiki/iawiki-latest/clean.json
[12]:
cleanfile = f'{root}/clean.json'
# Vectorize training examples for the names that remain ambiguous
!minimel -v vectorize $paragraphlinks $cleanfile
INFO:root:Vectorizing training examples for 286 ambiguous names
INFO:root:Writing to wiki/iawiki-latest/vec.clean.dat.parts
[########################################] | 100% Completed |  3.4s
INFO:root:Finished in 3s
INFO:root:Wrote 34 partitions
INFO:root:Concatenating to wiki/iawiki-latest/vec.clean.dat
Concatenating: 100%|██████████████████████████| 34/34 [00:00<00:00, 3840.94it/s]
[13]:
vecfile = f'{root}/vec.clean.dat'
# Train a Vowpal Wabbit model on the vectorized examples
!minimel -v train $vecfile
INFO:root:Writing to wiki/iawiki-latest/model.20b.vw
creating quadratic features for pairs: ls sf
final_regressor = wiki/iawiki-latest/model.20b.vw
creating cache_file = wiki/iawiki-latest/vec.clean.dat.cache
Reading datafile = wiki/iawiki-latest/vec.clean.dat
num sources = 1
Num weight bits = 20
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled learners: gd, scorer-identity, csoaa_ldf-prob, shared_feature_merger
Input label = CS
Output pred = SCALARS
average  since         example        example        current        current  current
loss     last          counter         weight          label        predict features
0.000000 0.000000            1            1.0        unknown              0     1414
0.000000 0.000000            2            2.0        unknown              0       24
0.000000 0.000000            4            4.0        unknown              0      348
0.000000 0.000000            8            8.0        unknown              0       12
0.125000 0.250000           16           16.0        unknown              0      188
0.093750 0.062500           32           32.0        unknown              0      100
0.046875 0.000000           64           64.0        unknown              0      108
0.039062 0.031250          128          128.0        unknown              0       72
0.031250 0.023438          256          256.0        unknown              0      552
0.064453 0.097656          512          512.0        unknown              0       88
0.102539 0.140625         1024         1024.0          known             41      150
0.092773 0.083008         2048         2048.0        unknown              0      104
0.104248 0.115723         4096         4096.0          known          37922      136
0.115845 0.127441         8192         8192.0        unknown              0      348
unknown  unknown         16384        16384.0        unknown              0       24 h
unknown  unknown         32768        32768.0        unknown              0      108 h

finished run
number of examples per pass = 10250
passes used = 4
weighted example sum = 41000.000000
weighted label sum = 0.000000
average loss = undefined (no holdout)
average multiclass log loss = 0.394560 h
total feature number = 15863608
INFO:root:Wrote to model.20b.vw
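Some figures in the training log above can be cross-checked with simple arithmetic. The sketch below recomputes three of them: the size of the weight table implied by 20 weight bits, the weighted example sum from the number of passes, and the running average loss from the "since last" column. The column interpretation follows Vowpal Wabbit's usual progress output, where each report covers the examples seen since the previous one; all numbers are copied from the log.

```python
# Values copied from the training log above.
weight_bits = 20
examples_per_pass = 10250
passes_used = 4

print(2 ** weight_bits)                 # 1048576 weight slots
print(examples_per_pass * passes_used)  # 41000, the weighted example sum

# (example counter, average loss, loss since last report)
rows = [
    (8, 0.000000, 0.000000),
    (16, 0.125000, 0.250000),
    (32, 0.093750, 0.062500),
    (64, 0.046875, 0.000000),
    (128, 0.039062, 0.031250),
]

def running_average(prev_avg, prev_n, since_last, n):
    """Combine the previous average with the loss over the newest examples."""
    return (prev_avg * prev_n + since_last * (n - prev_n)) / n

# Each logged average should match the value rebuilt from the row before it.
for (prev_n, prev_avg, _), (n, avg, since_last) in zip(rows, rows[1:]):
    recomputed = running_average(prev_avg, prev_n, since_last, n)
    assert abs(recomputed - avg) < 1e-5
```

The rows marked with h at the end of the log report loss on held-out examples, which is why their first two columns read unknown.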
[16]:
modelfile = f'{root}/model.20b.vw'
# Run the model on the paragraph links and evaluate it; predictions are discarded via /dev/null
!minimel -v run --evaluate -o /dev/null $dawg $cleanfile $modelfile $paragraphlinks/*
Predicting: 100%|███████████████████████| 59765/59765 [00:17<00:00, 3471.64it/s]
INFO:root:,,0
micro,precision,0.909326061550448
micro,recall,0.909326061550448
micro,fscore,0.909326061550448
macro,precision,0.9236526246023489
macro,recall,0.9062367026135526
macro,fscore,0.9121998587060755
,support,192525.0
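In the evaluation output above, the micro-averaged precision, recall and F-score are identical. This is what one would expect if every example receives exactly one predicted label: each wrong prediction then counts as one false positive and one false negative, so all three values reduce to accuracy. The sketch below shows this with toy labels (not data from this run); micro_macro is a hypothetical helper and not minimel's evaluation code.

```python
from collections import Counter

def micro_macro(gold, pred):
    """Return micro and macro (precision, recall, F-score) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[g] += 1  # the gold label gains a false negative

    def prf(t, f_pos, f_neg):
        prec = t / (t + f_pos) if t + f_pos else 0.0
        rec = t / (t + f_neg) if t + f_neg else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1

    # Micro: pool the counts over all labels, then compute the scores once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute the scores per label, then take the unweighted mean.
    labels = sorted(set(gold) | set(pred))
    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(col) / len(labels) for col in zip(*per_label))
    return micro, macro

gold = ["a", "a", "a", "b", "b", "c"]
pred = ["a", "a", "b", "b", "b", "a"]
micro, macro = micro_macro(gold, pred)
print(micro)  # three equal values
print(macro)  # three different values
```

The macro values differ because they average per-label scores with equal weight. The reported macro F-score (0.9122) is also lower than the harmonic mean of the reported macro precision and recall (about 0.9149), which is consistent with it being a mean of per-label F-scores.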