Data Set Statistics

DBpedia Data Set Statistics


This page provides statistics about the DBpedia 3.8 release. The release contains localized editions of DBpedia for 111 languages, each extracted from the Wikipedia edition in the corresponding language. For 22 of these languages, we report the overall number of things (instances) described in the localized version of DBpedia as well as the number of facts (statements) that have been extracted from the infoboxes describing these things. Afterwards, we report the number of instances of popular classes within these 22 DBpedia editions.



1 Instances, Properties, and Statements per Language


The same thing, for instance a person or city, might be described by multiple pages within Wikipedia editions in different languages. Pages describing the same thing are often interlinked by cross-language links within Wikipedia.


When DBpedia extracts data from these pages, it produces two types of data sets. The localized data sets contain all things that are described in a specific language edition; each thing is identified by a language-specific URI. In addition, we produce a canonicalized data set for each language. The canonicalized data sets contain only things for which a corresponding page exists in the English edition of Wikipedia. Within all canonicalized data sets, the same thing is identified by the same URI from the generic namespace http://dbpedia.org/resource/.
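The distinction between localized and canonicalized identifiers can be illustrated with a minimal Python sketch. It is not the extraction framework's actual code: the cross-language link table, the function names, and the localized URI pattern `http://<lang>.dbpedia.org/resource/` are assumptions made for illustration only.

```python
from typing import Optional

GENERIC_NS = "http://dbpedia.org/resource/"

# Hypothetical cross-language links: (language, local page title) -> English page title.
CROSS_LANGUAGE_LINKS = {
    ("de", "München"): "Munich",
    ("it", "Monaco_di_Baviera"): "Munich",
}


def localized_uri(lang: str, title: str) -> str:
    """Language-specific URI (assumed pattern, for illustration only)."""
    return f"http://{lang}.dbpedia.org/resource/{title}"


def canonical_uri(lang: str, title: str) -> Optional[str]:
    """Generic-namespace URI, or None when no English page is linked."""
    if lang == "en":
        return GENERIC_NS + title
    english_title = CROSS_LANGUAGE_LINKS.get((lang, title))
    if english_title is None:
        # No corresponding English page: the thing is absent from the canonicalized data set.
        return None
    return GENERIC_NS + english_title
```

In this sketch, the German and Italian pages resolve to the same generic URI, while a page without an English counterpart yields `None` and would appear only in the localized data set.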


DBpedia uses two different extractors to extract data from Wikipedia infoboxes. The mapping-based extractor extracts data only for the infoboxes for which a language-specific extraction mapping to the DBpedia ontology exists in the DBpedia mapping wiki. Based on these mappings, it normalizes the different names that are used in various languages to refer to the same property. The second extractor is the raw infobox extractor which uses a generic heuristic to extract data from all infoboxes. The raw infobox extractor does not normalize property names but produces language-specific properties that directly reflect the property name in the Wikipedia infobox.
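The difference between the two extractors can be sketched in a few lines of Python. This is a simplified illustration, not the real extractors: the mapping table, the template names, the infobox keys, and the `lang:key` naming of raw properties are all hypothetical.

```python
# Hypothetical mappings: (language, infobox template) -> {infobox key: ontology property}.
MAPPINGS = {
    ("en", "Infobox settlement"): {"population_total": "populationTotal"},
    ("de", "Infobox Ort"): {"einwohner": "populationTotal"},
}


def mapping_based_extract(lang, template, infobox):
    """Extract only mapped keys, renamed to a shared ontology property."""
    mapping = MAPPINGS.get((lang, template))
    if mapping is None:
        # No mapping exists for this template: nothing is extracted.
        return {}
    return {mapping[key]: value for key, value in infobox.items() if key in mapping}


def raw_extract(lang, template, infobox):
    """Extract every key, keeping the language-specific property name."""
    return {f"{lang}:{key}": value for key, value in infobox.items()}


german_infobox = {"einwohner": "1000", "höhe": "520"}
print(mapping_based_extract("de", "Infobox Ort", german_infobox))
print(raw_extract("de", "Infobox Ort", german_infobox))
```

The mapping-based call returns only the normalized `populationTotal` value, whereas the raw call returns both keys under their original German names. This is why the raw extractor yields far more distinct properties than the mapping-based one in the table below.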


Below we report the overall number of things (instances), distinct ontology and raw-infobox properties, infobox statements, and type statements for all 22 languages for which mappings exist in the DBpedia mapping wiki. The rows are sorted in descending order by the number of instances for which mapping-based infobox data exists (Instances, CD, withMD column).


The column headings have the following meanings:

  • LD = Localized Data Sets.
  • CD = Canonicalized Data Sets.
  • all = Overall number of instances in the data set, calculated based on the short abstract dumps.
  • withMD = Number of instances for which mapping-based infobox data exists.
  • Raw Properties = Number of different properties that are generated by the raw infobox extractor.
  • Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor.
  • Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor.
  • Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor; includes type statements.
 
Language | Instances, LD, all | Instances, CD, all | Instances, CD, withMD | Raw Properties, CD | Mapping Properties, CD | Raw Statements, CD | Mapping Statements, CD | Type Statements, CD
en | 3,769,926 | 3,769,926 | 2,359,521 | 48,293 | 1,313 | 65,143,840 | 33,742,015 | 13,655,887
it | 882,127 | 580,620 | 383,643 | 9,716 | 181 | 12,227,870 | 4,804,731 | 2,142,194
pl | 848,298 | 538,641 | 344,875 | 7,306 | 266 | 7,696,193 | 4,511,794 | 2,086,071
es | 879,091 | 542,524 | 310,348 | 14,643 | 476 | 7,740,458 | 4,383,206 | 1,695,745
pt | 699,446 | 460,258 | 272,660 | 12,851 | 602 | 6,255,151 | 4,005,527 | 1,493,280
fr | 1,197,334 | 740,044 | 214,953 | 13,551 | 228 | 8,854,322 | 2,901,809 | 1,287,965
de | 1,243,771 | 650,037 | 204,335 | 9,593 | 261 | 7,603,562 | 2,880,381 | 1,151,623
ru | 822,681 | 439,605 | 123,011 | 13,522 | 76 | 6,973,305 | 1,389,473 | 692,282
ca | 367,362 | 241,534 | 112,934 | 8,696 | 183 | 3,689,870 | 1,301,868 | 721,940
hu | 209,180 | 138,998 | 63,441 | 6,821 | 295 | 2,506,399 | 601,037 | 325,401
eu | 132,877 | 108,713 | 41,401 | 2,245 | 19 | 2,255,897 | 532,709 | 245,678
tr | 187,850 | 106,644 | 40,438 | 7,512 | 440 | 1,350,679 | 556,943 | 229,317
bg | 125,762 | 87,679 | 38,825 | 3,984 | 274 | 774,443 | 488,678 | 196,907
cs | 225,133 | 148,819 | 34,893 | 5,564 | 334 | 1,857,230 | 474,459 | 208,044
ko | 196,132 | 124,591 | 30,962 | 7,095 | 419 | 1,035,606 | 417,605 | 183,714
sl | 129,834 | 73,099 | 22,036 | 4,235 | 470 | 1,213,801 | 222,447 | 133,660
ar | 165,722 | 103,059 | 16,236 | 7,898 | 268 | 635,058 | 168,686 | 93,845
el | 71,936 | 48,260 | 10,813 | 2,866 | 288 | 206,460 | 113,838 | 58,878
hr | 109,890 | 71,469 | 10,343 | 3,334 | 158 | 701,182 | 151,196 | 66,937
nl | 992,557 | 477,443 | 8,525 | 6,988 | 22 | 6,759,879 | 74,473 | 53,946
ga | 14,761 | 13,308 | 3,562 | 1,076 | 72 | 71,707 | 39,129 | 20,433
bn | 23,447 | 18,624 | 550 | 4,791 | 76 | 136,013 | 5,641 | 2,935
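Two derived figures help when reading the table: the share of canonicalized instances that have mapping-based data (withMD divided by all), and the number of mapping-based statements that are not type statements (Mapping Statements minus Type Statements, since the former includes the latter). The Python sketch below computes both for three rows copied from the table; the variable and function names are illustrative, not part of any DBpedia tooling.

```python
# lang -> (Instances CD all, Instances CD withMD, Mapping Statements CD, Type Statements CD)
ROWS = {
    "en": (3_769_926, 2_359_521, 33_742_015, 13_655_887),
    "it": (580_620, 383_643, 4_804_731, 2_142_194),
    "nl": (477_443, 8_525, 74_473, 53_946),
}


def mapping_coverage(cd_all, cd_with_md):
    """Fraction of canonicalized instances that have mapping-based infobox data."""
    return cd_with_md / cd_all


def non_type_statements(mapping_statements, type_statements):
    """Mapping-based statements excluding type statements."""
    return mapping_statements - type_statements


coverage = {lang: mapping_coverage(r[0], r[1]) for lang, r in ROWS.items()}
facts = {lang: non_type_statements(r[2], r[3]) for lang, r in ROWS.items()}

for lang in ROWS:
    print(f"{lang}: coverage {coverage[lang]:.1%}, non-type statements {facts[lang]:,}")
```

For these rows, mapping coverage is roughly 63% for English, 66% for Italian, and under 2% for Dutch, which reflects how many infobox mappings existed for each language rather than the size of the underlying Wikipedia edition.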

2 Instances of Selected Classes per Language


The table below reports the number of instances for a set of selected classes within the canonicalized DBpedia data sets for each language.
Class | en | it | pl | es | pt | fr | de | ru | ca | hu | eu | tr | bg | cs | ko | sl | ar | el | hr | nl | ga | bn
Person | 763,643 | 145,060 | 70,708 | 65,337 | 43,057 | 62,942 | 33,122 | 18,620 | 7,107 | 15,529 | 0 | 14,368 | 16,095 | 8,492 | 11,198 | 5,788 | 5,910 | 3,388 | 4,992 | 7,075 | 1,304 | 425
Artist | 61,073 | 12,511 | 16,120 | 25,992 | 10,571 | 13,465 | 0 | 0 | 2,004 | 3,821 | 0 | 5,441 | 1,234 | 2,292 | 3,675 | 580 | 2,147 | 961 | 3,684 | 5,997 | 645 | 0
Actor | 2,431 | 0 | 8,049 | 9,850 | 5,486 | 9,328 | 0 | 0 | 1,430 | 1,840 | 0 | 2,170 | 950 | 1,932 | 1,990 | 405 | 1,288 | 0 | 1,355 | 5,997 | 355 | 0
MusicalArtist | 34,246 | 12,511 | 6,254 | 11,540 | 4,937 | 0 | 0 | 0 | 574 | 0 | 0 | 2,452 | 68 | 0 | 1,296 | 30 | 601 | 114 | 1,840 | 0 | 82 | 0
Athlete | 185,126 | 47,187 | 30,332 | 19,482 | 14,130 | 21,646 | 31,237 | 0 | 721 | 4,527 | 0 | 4,544 | 2,411 | 2,814 | 3,503 | 1,936 | 1,545 | 559 | 0 | 0 | 196 | 0
Politician | 23,096 | 0 | 8,943 | 5,513 | 3,342 | 0 | 0 | 12,004 | 1,376 | 760 | 0 | 0 | 0 | 0 | 592 | 61 | 0 | 306 | 0 | 54 | 297 | 0
Place | 572,728 | 141,101 | 182,727 | 132,961 | 116,660 | 80,602 | 131,766 | 67,932 | 73,078 | 18,324 | 40,821 | 8,422 | 11,082 | 16,254 | 7,115 | 11,364 | 5,020 | 2,680 | 808 | 589 | 1,133 | 0
Building | 60,514 | 1,270 | 2,946 | 3,570 | 803 | 921 | 83 | 43 | 0 | 527 | 0 | 916 | 127 | 284 | 603 | 82 | 169 | 76 | 9 | 0 | 0 | 0
Airport | 11,533 | 0 | 3,121 | 1,697 | 636 | 0 | 0 | 0 | 0 | 162 | 0 | 0 | 0 | 154 | 375 | 72 | 269 | 0 | 0 | 0 | 0 | 0
Bridge | 2,996 | 0 | 212 | 203 | 80 | 0 | 0 | 0 | 0 | 86 | 0 | 314 | 0 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Skyscraper | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0
PopulatedPlace | 387,166 | 138,077 | 167,034 | 121,204 | 109,418 | 72,252 | 79,410 | 63,826 | 72,743 | 15,535 | 40,821 | 6,982 | 5,778 | 12,445 | 1,526 | 10,976 | 4,532 | 2,360 | 127 | 589 | 1,133 | 0
River | 24,267 | 0 | 1,383 | 2 | 4,149 | 3,333 | 6,707 | 3,924 | 0 | 565 | 0 | 0 | 218 | 1,593 | 151 | 149 | 0 | 3 | 511 | 0 | 0 | 0
Organisation | 192,832 | 4,142 | 12,193 | 11,710 | 10,949 | 17,513 | 16,973 | 1,598 | 1,399 | 3,993 | 288 | 3,637 | 3,231 | 1,406 | 4,468 | 554 | 1,824 | 1,089 | 0 | 0 | 164 | 0
Band | 27,061 | 0 | 2,993 | 0 | 4,476 | 3,868 | 5,368 | 0 | 263 | 802 | 288 | 21 | 1,206 | 0 | 809 | 12 | 0 | 147 | 0 | 0 | 93 | 0
Company | 44,516 | 4,142 | 2,566 | 975 | 1,903 | 5,832 | 7,200 | 0 | 440 | 618 | 0 | 642 | 337 | 916 | 1,328 | 121 | 866 | 138 | 0 | 0 | 0 | 0
Educ.Institution | 42,270 | 0 | 599 | 1,207 | 270 | 1,636 | 2,171 | 1,010 | 0 | 115 | 0 | 213 | 135 | 110 | 785 | 46 | 273 | 61 | 0 | 0 | 0 | 0
Work | 333,269 | 51,918 | 32,386 | 36,484 | 33,869 | 39,195 | 18,195 | 34,363 | 5,240 | 9,777 | 0 | 8,008 | 3,718 | 5,976 | 4,770 | 1,580 | 2,513 | 1,494 | 4,320 | 861 | 114 | 125
Book | 26,198 | 4,232 | 1,282 | 1,561 | 949 | 2,628 | 0 | 15,730 | 78 | 425 | 0 | 519 | 397 | 346 | 171 | 201 | 201 | 135 | 188 | 0 | 0 | 40
Film | 71,715 | 17,210 | 9,467 | 9,396 | 8,896 | 9,741 | 15,038 | 12,045 | 3,859 | 2,320 | 0 | 3,040 | 1,323 | 1,480 | 1,258 | 226 | 836 | 693 | 1,247 | 0 | 114 | 62
MusicalWork | 159,070 | 23,652 | 14,987 | 15,911 | 17,053 | 16,206 | 0 | 6,588 | 697 | 4,704 | 0 | 2,712 | 1,430 | 3,154 | 1,310 | 300 | 355 | 235 | 2,705 | 0 | 0 | 0
Album | 112,248 | 23,652 | 10,853 | 9,815 | 11,211 | 16,206 | 0 | 6,588 | 697 | 3,535 | 0 | 1,581 | 1,100 | 3,022 | 856 | 165 | 184 | 131 | 2,045 | 0 | 0 | 0
Single | 41,774 | 0 | 4,134 | 5,252 | 5,104 | 0 | 0 | 0 | 0 | 1,047 | 0 | 1,063 | 330 | 0 | 418 | 127 | 171 | 102 | 660 | 0 | 0 | 0
Software | 27,947 | 5,682 | 3,050 | 4,833 | 3,419 | 5,733 | 2,368 | 0 | 606 | 857 | 0 | 823 | 85 | 894 | 1,277 | 176 | 799 | 166 | 0 | 861 | 0 | 23
TelevisionShow | 23,480 | 0 | 2,390 | 2,427 | 2,184 | 3,218 | 0 | 0 | 0 | 677 | 0 | 540 | 405 | 0 | 555 | 103 | 104 | 129 | 0 | 0 | 0 | 0
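Because the listed classes are nested, a subclass should never have more instances than its superclass within the same language. The Python sketch below checks this for the Person branch using the en and it columns copied from the table. The parent relation shown (Artist under Person, Actor and MusicalArtist under Artist) is an assumption inferred from the class names and the table's grouping, not something this page states.

```python
# Counts copied from the table above (canonicalized data sets).
COUNTS = {
    "en": {"Person": 763_643, "Artist": 61_073, "Actor": 2_431, "MusicalArtist": 34_246},
    "it": {"Person": 145_060, "Artist": 12_511, "Actor": 0, "MusicalArtist": 12_511},
}

# Assumed subclass -> superclass relation (illustrative subset only).
PARENT = {"Artist": "Person", "Actor": "Artist", "MusicalArtist": "Artist"}


def hierarchy_violations(counts):
    """Return (child, parent) pairs where the child has more instances than the parent."""
    return [
        (child, parent)
        for child, parent in PARENT.items()
        if counts[child] > counts[parent]
    ]


def share_of(counts, cls, ancestor):
    """Fraction of the ancestor's instances that belong to cls."""
    return counts[cls] / counts[ancestor]


for lang, counts in COUNTS.items():
    print(lang, hierarchy_violations(counts), f"{share_of(counts, 'Artist', 'Person'):.1%}")
```

Both columns pass the check. A zero in the table (for example, Actor in the Italian column) most likely means that no mapping assigned that class in the given language, not that no such things exist in that Wikipedia edition.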

3 Cross-Language Overlap


For detailed statistics about the overlap of the DBpedia data sets in different languages, please refer to Cross-Language Overlap Statistics.