Getting Started
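First, install the library with the [train] extra, which pulls in the dependencies needed to train models (the quotes keep the shell from interpreting the square brackets):
pip install -e "git+https://github.com/bennokr/minimel.git#egg=minimel[train]"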
[1]:
wiki = 'iawiki-latest' # the Interlingua Wikipedia is small, which makes it convenient for testing
root = 'wiki/' + wiki
!mkdir -p $root
!wikimapper download $wiki --dir $root
outdb = f'{root}/index_{wiki}.db'
!wikimapper create $wiki --dumpdir $root --target $outdb
2024-05-22 23:31:53,738 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-page.sql.gz] to [wiki/iawiki-latest/iawiki-latest-page.sql.gz]
2024-05-22 23:32:02,031 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-page_props.sql.gz] to [wiki/iawiki-latest/iawiki-latest-page_props.sql.gz]
2024-05-22 23:32:04,885 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-redirect.sql.gz] to [wiki/iawiki-latest/iawiki-latest-redirect.sql.gz]
2024-05-22 23:32:06,819 - wikimapper.processor - INFO - Creating index for [iawiki-latest] in [wiki/iawiki-latest/index_iawiki-latest.db]
2024-05-22 23:32:06,822 - wikimapper.processor - INFO - Parsing pages dump
2024-05-22 23:32:07,209 - wikimapper.processor - INFO - Creating database index on 'wikipedia_title'
2024-05-22 23:32:07,237 - wikimapper.processor - INFO - Parsing page properties dump
2024-05-22 23:32:07,529 - wikimapper.processor - INFO - Parsing redirects dump
2024-05-22 23:32:07,591 - wikimapper.processor - INFO - Creating database index on 'wikidata_id'
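The index written by wikimapper is an SQLite database, so it can be inspected directly from Python. The following is a minimal sketch rather than wikimapper's own API: the table name `mapping` and the column `wikipedia_id` are assumptions (the log above only confirms the `wikipedia_title` and `wikidata_id` columns), titles are assumed to use underscores instead of spaces, and the rows are made-up placeholders in an in-memory database so the example runs without the real index.

```python
import sqlite3


def title_to_qid(conn, title):
    """Return the Wikidata ID stored for a Wikipedia title, or None if absent."""
    row = conn.execute(
        "SELECT wikidata_id FROM mapping WHERE wikipedia_title = ?",
        (title.replace(" ", "_"),),  # assumption: titles are stored with underscores
    ).fetchone()
    return row[0] if row else None


# Stand-in for the real index file; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping ("
    "wikipedia_id INTEGER PRIMARY KEY, wikipedia_title TEXT, wikidata_id TEXT)"
)
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?)",
    [(1, "Example_page", "Q-demo-1"), (2, "Another_page", "Q-demo-2")],
)

print(title_to_qid(conn, "Example page"))
print(title_to_qid(conn, "Missing page"))
```

To try this against the real index, replace the in-memory connection with `sqlite3.connect(outdb)` and check the actual schema first, for example with `SELECT sql FROM sqlite_master`.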
[2]:
!minimel -v index $outdb
Loading mapping...: 100%|█████████████| 34570/34570 [00:00<00:00, 329200.66it/s]
INFO:root:Building IntDAWG trie...
INFO:root:Saving to wiki/iawiki-latest/index_iawiki-latest.dawg...
[3]:
wikiname = wiki.split('-')[0]
!wget -P $root https://dumps.wikimedia.org/$wikiname/latest/$wiki-pages-articles.xml.bz2
!bunzip2 $root/$wiki-pages-articles.xml.bz2
--2024-05-22 23:35:13-- https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-pages-articles.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.71
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10654986 (10M) [application/octet-stream]
Saving to: ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’
iawiki-latest-pages 82%[===============> ] 8,37M 17,3KB/s in 4m 7s
2024-05-22 23:39:26 (34,7 KB/s) - Connection closed at byte 8781489. Retrying.
--2024-05-22 23:39:27-- (try: 2) https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-pages-articles.xml.bz2
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 10654986 (10M), 1873497 (1,8M) remaining [application/octet-stream]
Saving to: ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’
iawiki-latest-pages 100%[++++++++++++++++===>] 10,16M 404KB/s in 4,5s
2024-05-22 23:39:34 (404 KB/s) - ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’ saved [10654986/10654986]
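The cell above derives the download URL from the `wiki` name using shell interpolation. The same derivation can be sketched in plain Python, which makes it easier to reuse for another language edition. The helper name `dump_locations` and its `base` parameter are hypothetical; the URL pattern and file names are the ones used in this notebook.

```python
def dump_locations(wiki, base="wiki/"):
    """Derive the article dump URL and local paths for a name like 'iawiki-latest'."""
    wikiname = wiki.split("-")[0]  # 'iawiki-latest' -> 'iawiki'
    root = base + wiki
    filename = f"{wiki}-pages-articles.xml.bz2"
    return {
        "url": f"https://dumps.wikimedia.org/{wikiname}/latest/{filename}",
        "archive": f"{root}/{filename}",
        # bunzip2 strips the '.bz2' suffix when it decompresses the archive
        "dump": f"{root}/{filename[:-len('.bz2')]}",
    }


loc = dump_locations("iawiki-latest")
for key, value in loc.items():
    print(key, value)
```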
[4]:
dump = f'{root}/{wiki}-pages-articles.xml'
dawg = f'{root}/index_{wiki}.dawg'
!minimel -v get-paragraphs -n 100 $dump $dawg
[########################################] | 100% Completed | 30.5s
INFO:root:Finished in 30s
INFO:root:Wrote 100 partitions
[7]:
lang = wiki.split('wiki')[0]
disambigpages = f'{root}/ents-disambig.txt'
!minimel -v query-pages $lang -o $disambigpages
INFO:root:Writing to wiki/iawiki-latest/ents-disambig.txt
[8]:
!minimel -v get-disambig -n 100 $dump $dawg $disambigpages
INFO:root:Extracting disambiguation links...
[########################################] | 100% Completed | 2.5s
INFO:root:Finished in 2s
INFO:root:Writing to wiki/iawiki-latest/disambig.json
[9]:
paragraphlinks = f'{root}/{wiki}-paragraph-links/'
!minimel -v count $paragraphlinks
INFO:root:Counting links...
[########################################] | 100% Completed | 6.8s
INFO:root:Finished in 6s
INFO:root:Got 32602 counts.
INFO:root:Aggregating...
[########################################] | 100% Completed | 10.5s
INFO:root:Finished in 10s
INFO:root:Writing to wiki/iawiki-latest/count.min2.json
[10]:
# Get Wikidata IDs for disambiguation and list articles
badent = f'{root}/badent.txt'
!minimel query-pages $lang -q -o $badent
[11]:
disambigfile = f'{root}/disambig.json'
countfile = f'{root}/count.min2.json'
!minimel -v clean -b $badent $outdb $disambigfile $countfile
Counting entities...: 100%|███████████| 11560/11560 [00:00<00:00, 178917.02it/s]
INFO:root:Removing 133 bad entities
Loading labels...: 100%|███████████████| 34570/34570 [00:00<00:00, 97792.93it/s]
Filtering names...: 100%|██████████████| 11498/11498 [00:00<00:00, 17444.66it/s]
INFO:root:Filtering out 1 bad names
INFO:root:Keeping 11497 good names
INFO:root:Writing to wiki/iawiki-latest/clean.json
[12]:
cleanfile = f'{root}/clean.json'
!minimel -v vectorize $paragraphlinks $cleanfile
INFO:root:Vectorizing training examples for 286 ambiguous names
INFO:root:Writing to wiki/iawiki-latest/vec.clean.dat.parts
[########################################] | 100% Completed | 3.4s
INFO:root:Finished in 3s
INFO:root:Wrote 34 partitions
INFO:root:Concatenating to wiki/iawiki-latest/vec.clean.dat
Concatenating: 100%|██████████████████████████| 34/34 [00:00<00:00, 3840.94it/s]
[13]:
vecfile = f'{root}/vec.clean.dat'
!minimel -v train $vecfile
INFO:root:Writing to wiki/iawiki-latest/model.20b.vw
creating quadratic features for pairs: ls sf
final_regressor = wiki/iawiki-latest/model.20b.vw
creating cache_file = wiki/iawiki-latest/vec.clean.dat.cache
Reading datafile = wiki/iawiki-latest/vec.clean.dat
num sources = 1
Num weight bits = 20
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled learners: gd, scorer-identity, csoaa_ldf-prob, shared_feature_merger
Input label = CS
Output pred = SCALARS
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0  unknown        0     1414
0.000000 0.000000            2            2.0  unknown        0       24
0.000000 0.000000            4            4.0  unknown        0      348
0.000000 0.000000            8            8.0  unknown        0       12
0.125000 0.250000           16           16.0  unknown        0      188
0.093750 0.062500           32           32.0  unknown        0      100
0.046875 0.000000           64           64.0  unknown        0      108
0.039062 0.031250          128          128.0  unknown        0       72
0.031250 0.023438          256          256.0  unknown        0      552
0.064453 0.097656          512          512.0  unknown        0       88
0.102539 0.140625         1024         1024.0    known       41      150
0.092773 0.083008         2048         2048.0  unknown        0      104
0.104248 0.115723         4096         4096.0    known    37922      136
0.115845 0.127441         8192         8192.0  unknown        0      348
unknown  unknown         16384        16384.0  unknown        0       24 h
unknown  unknown         32768        32768.0  unknown        0      108 h
finished run
number of examples per pass = 10250
passes used = 4
weighted example sum = 41000.000000
weighted label sum = 0.000000
average loss = undefined (no holdout)
average multiclass log loss = 0.394560 h
total feature number = 15863608
INFO:root:Wrote to model.20b.vw
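The training log reports `Num weight bits = 20`, which matches the `20b` in the model file name: Vowpal Wabbit stores its weights in a table of 2^20 slots and maps each feature to a slot by hashing. The sketch below only illustrates that idea. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration and not the hash function Vowpal Wabbit actually uses, and the feature strings are hypothetical.

```python
import zlib

NUM_BITS = 20  # taken from "Num weight bits = 20" in the log above


def bucket(feature, bits=NUM_BITS):
    """Map a feature string to a slot in a weight table with 2**bits entries."""
    mask = (1 << bits) - 1  # keep only the lowest `bits` bits of the hash
    return zlib.crc32(feature.encode("utf-8")) & mask


print("table size:", 1 << NUM_BITS)
for feature in ["word=insula", "word=lingua"]:  # hypothetical feature names
    print(feature, "->", bucket(feature))
```

Because distinct features can land in the same slot, a larger bit count reduces collisions at the cost of a bigger model file.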
[16]:
modelfile = f'{root}/model.20b.vw'
!minimel -v run --evaluate -o /dev/null $dawg $cleanfile $modelfile $paragraphlinks/*
Predicting: 100%|███████████████████████| 59765/59765 [00:17<00:00, 3471.64it/s]
INFO:root:,,0
micro,precision,0.909326061550448
micro,recall,0.909326061550448
micro,fscore,0.909326061550448
macro,precision,0.9236526246023489
macro,recall,0.9062367026135526
macro,fscore,0.9121998587060755
,support,192525.0
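The three micro-averaged scores above are identical. That is what one would expect if every evaluated mention receives exactly one predicted entity, because micro precision and micro recall then both reduce to plain accuracy. The following is a minimal sketch of the two averaging schemes, not minimel's own evaluation code: the labels "A" and "B" are toy stand-ins for entity IDs, and computing the macro scores as the unweighted mean of per-label scores is an assumption.

```python
from collections import Counter


def prf(true_pos, n_pred, n_gold):
    """Precision, recall and F1 from raw counts, guarding against division by zero."""
    prec = true_pos / n_pred if n_pred else 0.0
    rec = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


def micro_macro(gold, pred):
    """Return (micro, macro) triples of (precision, recall, F1) for single-label data."""
    labels = sorted(set(gold) | set(pred))
    tp = Counter(g for g, p in zip(gold, pred) if g == p)
    n_pred, n_gold = Counter(pred), Counter(gold)
    # Micro: pool all decisions first, then compute the scores once.
    micro = prf(sum(tp.values()), len(pred), len(gold))
    # Macro: compute the scores per label, then take the unweighted mean.
    per_label = [prf(tp[l], n_pred[l], n_gold[l]) for l in labels]
    macro = tuple(sum(s[i] for s in per_label) / len(labels) for i in range(3))
    return micro, macro


gold = ["A", "A", "A", "B"]  # toy gold labels
pred = ["A", "A", "B", "B"]  # toy predictions, one per mention
micro, macro = micro_macro(gold, pred)
print("micro", micro)
print("macro", macro)
```

In the toy data the micro scores all equal the accuracy (3 of 4 correct), while the macro scores differ from each other because the rare label "B" counts as much as the frequent label "A".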