wikidict-wordlist - Wikipedia Monolingual Reference Data (wordlists)
This repository makes available a collection of wordlists derived from article titles in various language Wikipedias. The data has been extracted from Wikidata.
Data
The data directory contains subdirectories arranged in order of ISO language code.
The basic filename pattern is `[ISO]-wordlist_wiki.txt`, where `[ISO]` is the ISO code of the target language. All available languages are listed below.
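The filename pattern above can be turned into a small loader. This is a minimal sketch, not part of the repository: it assumes each wordlist lives at `data/[ISO]/[ISO]-wordlist_wiki.txt`, is UTF-8 encoded, and holds one entry per line. The function names `wordlist_path` and `load_wordlist` are hypothetical.

```python
from pathlib import Path


def wordlist_path(data_dir, iso):
    """Build the path to one wordlist.

    Assumes the layout <data_dir>/<ISO>/<ISO>-wordlist_wiki.txt.
    """
    return Path(data_dir) / iso / f"{iso}-wordlist_wiki.txt"


def load_wordlist(data_dir, iso):
    """Read a wordlist into a list of entries.

    Assumes UTF-8 text with one entry per line; blank lines are skipped.
    """
    path = wordlist_path(data_dir, iso)
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

For example, `load_wordlist("data", "cy")` would return the Welsh entries as a list of strings, provided the layout matches the assumptions above.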
Available languages
| Language code | Language name |
|---|---|
| af | Afrikaans |
| am | Amharic |
| ang | Anglo-Saxon |
| ar | Arabic |
| arc | Aramaic |
| bg | Bulgarian |
| bi | Bislama |
| bn | Bengali |
| bo | Tibetan |
| br | Breton |
| bs | Bosnian |
| ca | Catalan |
| cdo | Min Dong |
| chr | Cherokee |
| chy | Cheyenne |
| cr | Cree |
| cs | Czech |
| cy | Welsh |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| eo | Esperanto |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| ff | Fula |
| fi | Finnish |
| fr | French |
| ga | Irish |
| gan | Gan |
| gd | Scottish Gaelic |
| gu | Gujarati |
| gv | Manx |
| ha | Hausa |
| hak | Hakka |
| haw | Hawaiian |
| he | Hebrew |
| hi | Hindi |
| hr | Croatian |
| ht | Haitian |
| hu | Hungarian |
| hy | Armenian |
| id | Indonesian |
| ig | Igbo |
| is | Icelandic |
| it | Italian |
| iu | Inuktitut |
| ja | Japanese |
| jbo | Lojban |
| jv | Javanese |
| ka | Georgian |
| kg | Kongo |
| ki | Kikuyu |
| kl | Greenlandic |
| km | Khmer |
| ko | Korean |
| la | Latin |
| lg | Luganda |
| lo | Lao |
| lt | Lithuanian |
| lv | Latvian |
| mg | Malagasy |
| mi | Maori |
| mn | Mongolian |
| ms | Malay |
| mt | Maltese |
| nah | Nahuatl |
| ne | Nepali |
| nl | Dutch |
| nn | Norwegian (Nynorsk) |
| no | Norwegian |
| nv | Navajo |
| ny | Chichewa |
| oc | Occitan |
| pa | Punjabi |
| pi | Pali |
| pl | Polish |
| ps | Pashto |
| pt | Portuguese |
| qu | Quechua |
| ro | Romanian |
| ru | Russian |
| sa | Sanskrit |
| se | Northern Sami |
| sh | Serbo-Croatian |
| sk | Slovak |
| sl | Slovenian |
| sn | Shona |
| so | Somali |
| sq | Albanian |
| sr | Serbian |
| sv | Swedish |
| sw | Kiswahili |
| ta | Tamil |
| te | Telugu |
| th | Thai |
| tl | Tagalog |
| tpi | Tok Pisin |
| tr | Turkish |
| ug | Uyghur |
| uk | Ukrainian |
| ur | Urdu |
| vi | Vietnamese |
| wo | Wolof |
| wuu | Wu |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| za | Zhuang |
| zh | Chinese (Mandarin) |
| zh_classical | Classical Chinese |
| zh_min_nan | Min Nan |
| zh_yue | Cantonese |
| zu | Zulu |
Statistics
Wordlist size
| Language code | # of entries |
|---|---|
| af | 33599 |
| am | 11014 |
| ang | 2977 |
| ar | 446845 |
| arc | 1829 |
| bg | 225573 |
| bi | 490 |
| bn | 59121 |
| bo | 2929 |
| br | 49865 |
| bs | 64229 |
| ca | 438072 |
| cdo | 2909 |
| chr | 492 |
| chy | 710 |
| cr | 70 |
| cs | 327321 |
| cy | 52130 |
| da | 196279 |
| de | 1787961 |
| el | 136650 |
| en | 4798378 |
| eo | 209308 |
| es | 1346715 |
| et | 124124 |
| eu | 203027 |
| fa | 744454 |
| ff | 464 |
| fi | 363265 |
| fr | 1862431 |
| ga | 35768 |
| gan | 14253 |
| gd | 15561 |
| gu | 27615 |
| gv | 4723 |
| ha | 518 |
| hak | 4123 |
| haw | 2009 |
| he | 209505 |
| hi | 120411 |
| hr | 139555 |
| ht | 45669 |
| hu | 323069 |
| hy | 161719 |
| id | 338477 |
| ig | 1075 |
| is | 39429 |
| it | 1183116 |
| iu | 383 |
| ja | 951498 |
| jbo | 1179 |
| jv | 45722 |
| ka | 118968 |
| kg | 868 |
| ki | 311 |
| kl | 1839 |
| km | 4713 |
| ko | 446200 |
| la | 111691 |
| lg | 179 |
| lo | 1913 |
| lt | 173148 |
| lv | 58016 |
| mg | 77182 |
| mi | 2579 |
| mn | 18668 |
| ms | 245936 |
| mt | 2981 |
| nah | 10519 |
| ne | 24961 |
| nl | 1812937 |
| nn | 117294 |
| no | 403749 |
| nv | 3887 |
| ny | 170 |
| oc | 88788 |
| pa | 14042 |
| pi | 2759 |
| pl | 1088821 |
| ps | 5148 |
| pt | 866567 |
| qu | 18494 |
| ro | 264609 |
| ru | 1461243 |
| sa | 12256 |
| se | 7216 |
| sh | 284238 |
| sk | 269048 |
| sl | 132095 |
| sn | 1671 |
| so | 2760 |
| sq | 53553 |
| sr | 351888 |
| sv | 1954061 |
| sw | 26694 |
| ta | 80394 |
| te | 63860 |
| th | 134176 |
| tl | 57983 |
| tpi | 1336 |
| tr | 247607 |
| ug | 2596 |
| uk | 638342 |
| ur | 125182 |
| vi | 1241500 |
| wo | 1636 |
| wuu | 5032 |
| xh | 319 |
| yi | 12575 |
| yo | 35053 |
| za | 808 |
| zh | 804107 |
| zh_classical | 3855 |
| zh_min_nan | 14851 |
| zh_yue | 32062 |
| zu | 689 |
Top ten wordlists by number of entries
| Language code | # of entries |
|---|---|
| en | 4798378 |
| sv | 1954061 |
| fr | 1862431 |
| nl | 1812937 |
| de | 1787961 |
| ru | 1461243 |
| es | 1346715 |
| vi | 1241500 |
| it | 1183116 |
| pl | 1088821 |
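Figures like those in the tables above could be recomputed from a local copy of the data. The following is a rough sketch under assumptions, not the method used to produce the published numbers: it assumes the layout `data/[ISO]/[ISO]-wordlist_wiki.txt`, UTF-8 files with one entry per line, and counts non-empty lines. The names `count_entries` and `top_wordlists` are hypothetical, and the resulting counts may differ from the tables if entries were counted differently.

```python
from pathlib import Path

# Filename suffix taken from the pattern [ISO]-wordlist_wiki.txt.
SUFFIX = "-wordlist_wiki.txt"


def count_entries(path):
    """Count the non-empty lines in one wordlist file (assumed one entry per line)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def top_wordlists(data_dir, n=10):
    """Return the n largest wordlists as (language code, entry count) pairs."""
    counts = {}
    # Assumed layout: one subdirectory per language code under data_dir.
    for path in Path(data_dir).glob(f"*/*{SUFFIX}"):
        iso = path.name[: -len(SUFFIX)]
        counts[iso] = count_entries(path)
    # Sort by entry count, largest first, and keep the first n.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
```

Calling `top_wordlists("data")` would then yield a ranking in the same shape as the top ten table, as a list of pairs.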
License
According to the Wikidata website:
> All structured data from the main and property namespace is available under the Creative Commons CC0 License
The data in this repository is therefore made available under the same Creative Commons CC0 License as that used by the Wikidata project. All of the data has been derived from the Wikidata JSON format database dumps.

