Data Set Statistics

DBpedia 3.9 Data Set Statistics


This page provides statistics about the DBpedia 3.9 release. The release contains localized editions of DBpedia for 120 languages, each extracted from the Wikipedia edition in the corresponding language. For 24 of these languages, we report the overall number of things (instances) described in the localized edition of DBpedia as well as the number of facts (statements) extracted from the infoboxes describing these things. We then report the number of instances of popular classes within these 24 DBpedia editions.


Dataset statistics for DBpedia 3.8 can be found here. Below we compare the numbers between the two releases.




1 Instances, Properties, and Statements per Language


The same thing, for instance a person or city, might be described by multiple pages within Wikipedia editions in different languages. Pages describing the same thing are often interlinked by cross-language links within Wikipedia.


When DBpedia extracts data from these pages, it produces two types of data sets. The localized data sets contain all things that are described in a specific language; in them, things are identified with a language-specific URI. In addition, we produce a canonicalized data set for each language. The canonicalized data sets contain only things for which a corresponding page exists in the English edition of Wikipedia. Within all canonicalized data sets, the same thing is identified with the same URI from the generic namespace http://dbpedia.org/resource/.
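
The distinction between the two kinds of identifiers can be sketched as follows. This is a minimal illustration, not DBpedia's extraction code: the function names and the `english_title_for` lookup table are hypothetical, and the localized URI pattern `http://<lang>.dbpedia.org/resource/<Title>` is an assumption that is not stated on this page. Only the generic namespace is taken from the text above.

```python
GENERIC_NS = "http://dbpedia.org/resource/"  # generic namespace named in the text


def localized_uri(lang, title):
    # ASSUMPTION: localized data sets use one namespace per language edition.
    return f"http://{lang}.dbpedia.org/resource/{title.replace(' ', '_')}"


def canonical_uri(lang, title, english_title_for):
    """Return the generic-namespace URI, or None if no English page is linked.

    english_title_for is a hypothetical dict built from cross-language links:
    (lang, title) -> title of the corresponding English Wikipedia page.
    """
    en_title = title if lang == "en" else english_title_for.get((lang, title))
    if en_title is None:
        return None  # thing is absent from the canonicalized data set
    return GENERIC_NS + en_title.replace(" ", "_")


# Hypothetical cross-language link table with a single entry.
links = {("de", "München"): "Munich"}
print(localized_uri("de", "München"))
print(canonical_uri("de", "München", links))
print(canonical_uri("de", "Seite ohne englisches Gegenstück", links))
```

A thing without an English counterpart still gets a localized URI but no canonical one, which is why the canonicalized instance counts in the tables below are never larger than the localized ones.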


DBpedia uses two different extractors to extract data from Wikipedia infoboxes. The mapping-based extractor extracts data only for the infoboxes for which a language-specific extraction mapping to the DBpedia ontology exists in the DBpedia mapping wiki. Based on these mappings, it normalizes the different names that are used in various languages to refer to the same property. The second extractor is the raw infobox extractor which uses a generic heuristic to extract data from all infoboxes. The raw infobox extractor does not normalize property names but produces language-specific properties that directly reflect the property name in the Wikipedia infobox.
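
The difference between the two extractors can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the actual extraction framework: the `MAPPINGS` table, both function names, and the property names are hypothetical, and real mappings in the mapping wiki are defined per infobox template rather than per bare property name.

```python
# Hypothetical mapping table: (language, infobox property) -> ontology property.
MAPPINGS = {
    ("en", "birth_place"): "birthPlace",
    ("de", "geburtsort"): "birthPlace",
}


def mapping_based_extract(lang, infobox):
    """Keep only properties with a mapping and normalize their names."""
    facts = {}
    for name, value in infobox.items():
        ontology_property = MAPPINGS.get((lang, name.lower()))
        if ontology_property is not None:
            facts[ontology_property] = value
    return facts


def raw_extract(lang, infobox):
    """Keep every property under its original, language-specific name."""
    return {f"{lang}:{name}": value for name, value in infobox.items()}


infobox_de = {"Geburtsort": "Ulm", "Sterbeort": "Princeton"}
print(mapping_based_extract("de", infobox_de))  # only the mapped property survives
print(raw_extract("de", infobox_de))            # both properties, names unchanged
```

The sketch shows why the raw extractor yields many more distinct properties and statements than the mapping-based one, at the cost of property names that differ from language to language.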


Below we report the overall number of things (instances), distinct ontology and raw-infobox properties, infobox statements, and type statements for all 24 languages for which mappings exist in the DBpedia mapping wiki. The rows follow the order used for DBpedia 3.8, that is, they are sorted by the number of instances for which mapping-based infobox data existed in that release (Instances, CD, withMD column); the two languages added in 3.9 come last.


The column headings have the following meanings:

  • LD = Localized Data Sets.
  • CD = Canonicalized Data Sets.
  • all = Overall number of instances in the data set, calculated based on the short abstract dumps.
  • withMD = Number of instances for which mapping-based infobox data exists.
  • Raw Properties = Number of different properties that are generated by the raw infobox extractor.
  • Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor.
  • Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor.
  • Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor; includes type statements.
 
Language | Instances (LD, all) | Instances (CD, all) | Instances (CD, withMD) | Raw Properties (CD) | Mapping Properties (CD) | Raw Statements (CD) | Mapping Statements (CD) | Type Statements (CD)
en | 4,004,478 | 4,004,478 | 3,255,435 | 51,736 | 1,373 | 70,147,399 | 41,804,545 | 16,366,701
it | 979,726 | 646,271 | 473,595 | 10,241 | 211 | 14,366,288 | 5,724,415 | 2,364,096
pl | 960,781 | 598,383 | 334,214 | 7,478 | 264 | 8,113,838 | 4,624,126 | 2,031,952
es | 964,838 | 601,258 | 376,975 | 15,992 | 549 | 9,147,643 | 5,950,626 | 2,305,659
pt | 736,443 | 493,944 | 298,475 | 13,740 | 620 | 6,934,107 | 4,489,235 | 1,641,916
fr | 1,314,943 | 820,694 | 346,214 | 13,990 | 689 | 10,741,192 | 5,273,302 | 2,145,950
de | 1,367,844 | 716,047 | 327,548 | 10,659 | 327 | 9,284,326 | 4,070,927 | 1,800,424
ru | 953,813 | 502,252 | 236,067 | 14,771 | 149 | 8,390,368 | 3,174,725 | 1,315,619
ca | 391,188 | 263,071 | 119,675 | 9,391 | 184 | 4,057,610 | 1,420,025 | 757,526
hu | 229,389 | 153,607 | 68,939 | 7,283 | 298 | 2,859,593 | 669,836 | 358,586
eu | 148,260 | 118,662 | 74,114 | 2,683 | 97 | 2,381,903 | 975,775 | 456,815
tr | 207,630 | 124,372 | 47,673 | 8,172 | 438 | 1,701,192 | 648,288 | 270,546
bg | 139,738 | 98,364 | 43,961 | 4,728 | 268 | 950,554 | 564,830 | 225,843
cs | 255,392 | 168,414 | 40,549 | 5,873 | 340 | 2,192,854 | 556,742 | 244,058
ko | 230,691 | 149,696 | 47,081 | 7,605 | 435 | 1,276,866 | 646,461 | 271,610
sl | 132,727 | 78,178 | 23,584 | 4,473 | 474 | 1,335,247 | 265,908 | 151,203
ar | 210,871 | 128,250 | 25,325 | 9,492 | 286 | 883,730 | 256,761 | 143,042
el | 81,250 | 55,725 | 27,856 | 3,695 | 461 | 287,562 | 275,669 | 159,570
hr | 122,898 | 79,757 | 11,452 | 3,501 | 158 | 779,862 | 168,804 | 74,455
nl | 1,404,595 | 559,842 | 368,688 | 7,481 | 642 | 7,916,452 | 5,039,583 | 2,144,581
ga | 19,449 | 17,350 | 3,791 | 1,128 | 72 | 76,746 | 41,331 | 21,847
bn | 25,809 | 20,745 | 1,275 | 5,467 | 86 | 176,630 | 13,852 | 6,856
id | 207,055 | 111,990 | 33,385 | 10,264 | 372 | 1,417,031 | 449,244 | 199,564
ja | 824,573 | 356,222 | 115,227 | 14,752 | 395 | 4,353,518 | 1,674,891 | 656,290
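
One way to read the instance columns together is as a coverage ratio: the share of canonicalized instances for which mapping-based infobox data exists (withMD divided by all). The following sketch computes it for two rows copied from the table above; the function name and the derived percentage are illustrative additions and are not part of the published statistics.

```python
# (Instances CD all, Instances CD withMD), copied from the table above.
ROWS = {
    "en": (4_004_478, 3_255_435),
    "bn": (20_745, 1_275),
}


def mapping_coverage(cd_all, with_md):
    """Percentage of canonicalized instances that have mapping-based data."""
    return round(with_md / cd_all * 100, 1)


for lang, (cd_all, with_md) in ROWS.items():
    print(lang, mapping_coverage(cd_all, with_md))
```

For these two rows the ratio is about 81.3% for English and about 6.1% for Bengali, which shows how strongly the availability of mappings varies between languages.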

The following table combines the dataset statistics for DBpedia 3.8 with the statistics presented above, allowing a comparison between the two versions. The % columns give the relative increase in the number of instances, properties, or statements in version 3.9 with respect to 3.8. Two languages are new in the 3.9 release, Japanese and Indonesian, for which property mappings have become available; their numbers appear in the last two rows of the table.
Language | Instances (LD, all) 3.8 | Instances (LD, all) 3.9 | Instances (LD, all) % | Instances (CD, all) 3.8 | Instances (CD, all) 3.9 | Instances (CD, all) % | Instances (CD, withMD) 3.8 | Instances (CD, withMD) 3.9 | Instances (CD, withMD) % | Raw Properties (CD) 3.8 | Raw Properties (CD) 3.9 | Raw Properties (CD) % | Mapping Properties (CD) 3.8 | Mapping Properties (CD) 3.9 | Mapping Properties (CD) % | Raw Statements (CD) 3.8 | Raw Statements (CD) 3.9 | Raw Statements (CD) % | Mapping Statements (CD) 3.8 | Mapping Statements (CD) 3.9 | Mapping Statements (CD) % | Type Statements (CD) 3.8 | Type Statements (CD) 3.9 | Type Statements (CD) %
en | 3,769,926 | 4,004,478 | 6.2 | 3,769,926 | 4,004,478 | 6.2 | 2,359,521 | 3,255,435 | 38.0 | 48,293 | 51,736 | 7.1 | 1,313 | 1,373 | 4.6 | 65,143,840 | 70,147,399 | 7.7 | 33,742,015 | 41,804,545 | 23.9 | 13,655,887 | 16,366,701 | 19.9
it | 882,127 | 979,726 | 11.1 | 580,620 | 646,271 | 11.3 | 383,643 | 473,595 | 23.4 | 9,716 | 10,241 | 5.4 | 181 | 211 | 16.6 | 12,227,870 | 14,366,288 | 17.5 | 4,804,731 | 5,724,415 | 19.1 | 2,142,194 | 2,364,096 | 10.4
pl | 848,298 | 960,781 | 13.3 | 538,641 | 598,383 | 11.1 | 344,875 | 334,214 | -3.1 | 7,306 | 7,478 | 2.4 | 266 | 264 | -0.8 | 7,696,193 | 8,113,838 | 5.4 | 4,511,794 | 4,624,126 | 2.5 | 2,086,071 | 2,031,952 | -2.6
es | 879,091 | 964,838 | 9.8 | 542,524 | 601,258 | 10.8 | 310,348 | 376,975 | 21.5 | 14,643 | 15,992 | 9.2 | 476 | 549 | 15.3 | 7,740,458 | 9,147,643 | 18.2 | 4,383,206 | 5,950,626 | 35.8 | 1,695,745 | 2,305,659 | 36.0
pt | 699,446 | 736,443 | 5.3 | 460,258 | 493,944 | 7.3 | 272,660 | 298,475 | 9.5 | 12,851 | 13,740 | 6.9 | 602 | 620 | 3.0 | 6,255,151 | 6,934,107 | 10.9 | 4,005,527 | 4,489,235 | 12.1 | 1,493,280 | 1,641,916 | 10.0
fr | 1,197,334 | 1,314,943 | 9.8 | 740,044 | 820,694 | 10.9 | 214,953 | 346,214 | 61.1 | 13,551 | 13,990 | 3.2 | 228 | 689 | 202.2 | 8,854,322 | 10,741,192 | 21.3 | 2,901,809 | 5,273,302 | 81.7 | 1,287,965 | 2,145,950 | 66.6
de | 1,243,771 | 1,367,844 | 10.0 | 650,037 | 716,047 | 10.2 | 204,335 | 327,548 | 60.3 | 9,593 | 10,659 | 11.1 | 261 | 327 | 25.3 | 7,603,562 | 9,284,326 | 22.1 | 2,880,381 | 4,070,927 | 41.3 | 1,151,623 | 1,800,424 | 56.3
ru | 822,681 | 953,813 | 15.9 | 439,605 | 502,252 | 14.3 | 123,011 | 236,067 | 91.9 | 13,522 | 14,771 | 9.2 | 76 | 149 | 96.1 | 6,973,305 | 8,390,368 | 20.3 | 1,389,473 | 3,174,725 | 128.5 | 692,282 | 1,315,619 | 90.0
ca | 367,362 | 391,188 | 6.5 | 241,534 | 263,071 | 8.9 | 112,934 | 119,675 | 6.0 | 8,696 | 9,391 | 8.0 | 183 | 184 | 0.5 | 3,689,870 | 4,057,610 | 10.0 | 1,301,868 | 1,420,025 | 9.1 | 721,940 | 757,526 | 4.9
hu | 209,180 | 229,389 | 9.7 | 138,998 | 153,607 | 10.5 | 63,441 | 68,939 | 8.7 | 6,821 | 7,283 | 6.8 | 295 | 298 | 1.0 | 2,506,399 | 2,859,593 | 14.1 | 601,037 | 669,836 | 11.4 | 325,401 | 358,586 | 10.2
eu | 132,877 | 148,260 | 11.6 | 108,713 | 118,662 | 9.2 | 41,401 | 74,114 | 79.0 | 2,245 | 2,683 | 19.5 | 19 | 97 | 410.5 | 2,255,897 | 2,381,903 | 5.6 | 532,709 | 975,775 | 83.2 | 245,678 | 456,815 | 85.9
tr | 187,850 | 207,630 | 10.5 | 106,644 | 124,372 | 16.6 | 40,438 | 47,673 | 17.9 | 7,512 | 8,172 | 8.8 | 440 | 438 | -0.5 | 1,350,679 | 1,701,192 | 26.0 | 556,943 | 648,288 | 16.4 | 229,317 | 270,546 | 18.0
bg | 125,762 | 139,738 | 11.1 | 87,679 | 98,364 | 12.2 | 38,825 | 43,961 | 13.2 | 3,984 | 4,728 | 18.7 | 274 | 268 | -2.2 | 774,443 | 950,554 | 22.7 | 488,678 | 564,830 | 15.6 | 196,907 | 225,843 | 14.7
cs | 225,133 | 255,392 | 13.4 | 148,819 | 168,414 | 13.2 | 34,893 | 40,549 | 16.2 | 5,564 | 5,873 | 5.6 | 334 | 340 | 1.8 | 1,857,230 | 2,192,854 | 18.1 | 474,459 | 556,742 | 17.3 | 208,044 | 244,058 | 17.3
ko | 196,132 | 230,691 | 17.6 | 124,591 | 149,696 | 20.1 | 30,962 | 47,081 | 52.1 | 7,095 | 7,605 | 7.2 | 419 | 435 | 3.8 | 1,035,606 | 1,276,866 | 23.3 | 417,605 | 646,461 | 54.8 | 183,714 | 271,610 | 47.8
sl | 129,834 | 132,727 | 2.2 | 73,099 | 78,178 | 6.9 | 22,036 | 23,584 | 7.0 | 4,235 | 4,473 | 5.6 | 470 | 474 | 0.9 | 1,213,801 | 1,335,247 | 10.0 | 222,447 | 265,908 | 19.5 | 133,660 | 151,203 | 13.1
ar | 165,722 | 210,871 | 27.2 | 103,059 | 128,250 | 24.4 | 16,236 | 25,325 | 56.0 | 7,898 | 9,492 | 20.2 | 268 | 286 | 6.7 | 635,058 | 883,730 | 39.2 | 168,686 | 256,761 | 52.2 | 93,845 | 143,042 | 52.4
el | 71,936 | 81,250 | 12.9 | 48,260 | 55,725 | 15.5 | 10,813 | 27,856 | 157.6 | 2,866 | 3,695 | 28.9 | 288 | 461 | 60.1 | 206,460 | 287,562 | 39.3 | 113,838 | 275,669 | 142.2 | 58,878 | 159,570 | 171.0
hr | 109,890 | 122,898 | 11.8 | 71,469 | 79,757 | 11.6 | 10,343 | 11,452 | 10.7 | 3,334 | 3,501 | 5.0 | 158 | 158 | 0.0 | 701,182 | 779,862 | 11.2 | 151,196 | 168,804 | 11.6 | 66,937 | 74,455 | 11.2
nl | 992,557 | 1,404,595 | 41.5 | 477,443 | 559,842 | 17.3 | 8,525 | 368,688 | 4,224.8 | 6,988 | 7,481 | 7.1 | 22 | 642 | 2,818.2 | 6,759,879 | 7,916,452 | 17.1 | 74,473 | 5,039,583 | 6,667.0 | 53,946 | 2,144,581 | 3,875.4
ga | 14,761 | 19,449 | 31.8 | 13,308 | 17,350 | 30.4 | 3,562 | 3,791 | 6.4 | 1,076 | 1,128 | 4.8 | 72 | 72 | 0.0 | 71,707 | 76,746 | 7.0 | 39,129 | 41,331 | 5.6 | 20,433 | 21,847 | 6.9
bn | 23,447 | 25,809 | 10.1 | 18,624 | 20,745 | 11.4 | 550 | 1,275 | 131.8 | 4,791 | 5,467 | 14.1 | 76 | 86 | 13.2 | 136,013 | 176,630 | 29.9 | 5,641 | 13,852 | 145.6 | 2,935 | 6,856 | 133.6
id | - | 207,055 | - | - | 111,990 | - | - | 33,385 | - | - | 10,264 | - | - | 372 | - | - | 1,417,031 | - | - | 449,244 | - | - | 199,564 | -
ja | - | 824,573 | - | - | 356,222 | - | - | 115,227 | - | - | 14,752 | - | - | 395 | - | - | 4,353,518 | - | - | 1,674,891 | - | - | 656,290 | -
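
The % columns can be reproduced as the relative change from the 3.8 value to the 3.9 value. The sketch below assumes that this is how the percentages were derived and that they were rounded to one decimal place; the function name is hypothetical. The input values are copied from the table above.

```python
def percent_increase(v38, v39):
    """Relative change from the 3.8 value to the 3.9 value, in percent.

    Returns None when there is no 3.8 value, as for the languages
    that are new in 3.9.
    """
    if v38 is None:
        return None
    return round((v39 - v38) / v38 * 100, 1)


print(percent_increase(3_769_926, 4_004_478))  # en, Instances (LD, all)
print(percent_increase(344_875, 334_214))      # pl, Instances (CD, withMD)
print(percent_increase(None, 207_055))         # id, new in 3.9
```

Under these assumptions the first two calls give 6.2 and -3.1, matching the corresponding table cells; a negative value means that the count decreased between the two releases.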

2 Instances of Selected Classes per Language


The table below reports the number of instances for a set of selected classes within the canonicalized DBpedia data sets for each language.
Class | en | it | pl | es | pt | fr | de | ru | ca | hu | eu | tr | bg | cs | ko | sl | ar | el | hr | nl | ga | bn | id | ja
Person | 831,558 | 167,168 | 84,216 | 83,452 | 51,977 | 109,975 | 119,171 | 74,613 | 8,559 | 17,765 | 3,780 | 17,262 | 17,542 | 10,742 | 17,798 | 6,737 | 7,751 | 4,050 | 5,588 | 45,495 | 1,446 | 1,094 | 14,723 | 40,627
Artist | 68,237 | 13,741 | 17,564 | 30,152 | 12,238 | 19,597 | 0 | 26,213 | 2,328 | 4,438 | 913 | 6,827 | 1,707 | 2,613 | 8,448 | 1,207 | 2,737 | 1,206 | 4,031 | 13,873 | 771 | 0 | 2,479 | 19,334
Actor | 2,670 | 0 | 8,803 | 11,589 | 6,423 | 11,799 | 0 | 0 | 1,643 | 2,106 | 913 | 2,634 | 1,268 | 2,198 | 5,957 | 420 | 1,482 | 0 | 1,466 | 6,782 | 468 | 0 | 0 | 9,118
MusicalArtist | 37,936 | 13,741 | 6,221 | 13,096 | 5,541 | 2,509 | 0 | 7,780 | 685 | 0 | 0 | 3,009 | 71 | 0 | 2,038 | 592 | 838 | 152 | 2,000 | 5,248 | 83 | 0 | 1,904 | 7,817
Athlete | 232,082 | 58,293 | 39,846 | 26,204 | 17,805 | 51,404 | 35,824 | 14,279 | 821 | 5,217 | 439 | 6,016 | 3,003 | 3,805 | 4,418 | 1,976 | 1,658 | 823 | 0 | 21,295 | 195 | 0 | 5,597 | 15,225
Politician | 24,724 | 4,184 | 9,940 | 6,337 | 3,785 | 0 | 0 | 0 | 1,590 | 895 | 766 | 0 | 0 | 0 | 668 | 62 | 0 | 523 | 0 | 1,567 | 309 | 0 | 0 | 1,787
Place | 639,450 | 163,545 | 151,583 | 148,027 | 120,700 | 139,290 | 145,266 | 84,008 | 73,922 | 19,923 | 49,003 | 8,823 | 13,227 | 18,399 | 8,159 | 12,200 | 11,061 | 3,736 | 997 | 185,289 | 1,188 | 0 | 5,676 | 19,588
Building | 67,287 | 2,387 | 2,191 | 3,524 | 194 | 5,519 | 667 | 56 | 0 | 385 | 729 | 299 | 175 | 154 | 187 | 31 | 95 | 22 | 20 | 2,723 | 0 | 0 | 86 | 1,162
Bridge | 3,259 | 0 | 229 | 222 | 100 | 0 | 0 | 0 | 0 | 96 | 0 | 321 | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 250 | 0 | 0 | 0 | 0
Airport | 12,231 | 0 | 3,333 | 1,832 | 676 | 1,342 | 0 | 0 | 0 | 169 | 0 | 0 | 0 | 170 | 455 | 76 | 299 | 0 | 0 | 1,041 | 0 | 0 | 1,288 | 545
Skyscraper | 2 | 0 | 0 | 0 | 0 | 0 | 562 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 71 | 0 | 0 | 0 | 0 | 379 | 0 | 0 | 0 | 0
PopulatedPlace | 427,068 | 154,224 | 134,067 | 128,875 | 112,910 | 113,956 | 85,848 | 79,399 | 73,458 | 16,756 | 47,898 | 7,978 | 7,697 | 14,040 | 1,806 | 11,778 | 10,393 | 3,114 | 127 | 170,380 | 1,188 | 0 | 3,440 | 4,555
River | 24,962 | 1,976 | 1,510 | 2 | 4,255 | 3,540 | 7,317 | 4,319 | 0 | 627 | 339 | 0 | 357 | 1,674 | 156 | 163 | 0 | 25 | 645 | 1,138 | 0 | 0 | 95 | 714
Organisation | 209,471 | 9,282 | 14,105 | 13,942 | 12,476 | 21,192 | 24,494 | 13,939 | 1,517 | 4,329 | 309 | 4,603 | 3,676 | 1,139 | 5,236 | 610 | 2,144 | 1,352 | 0 | 10,078 | 173 | 0 | 2,070 | 7,826
Band | 28,682 | 0 | 3,855 | 0 | 4,811 | 4,422 | 5,899 | 4,305 | 268 | 873 | 309 | 18 | 1,393 | 0 | 1,041 | 12 | 0 | 161 | 0 | 1,824 | 101 | 0 | 0 | 0
Company | 49,402 | 4,688 | 2,813 | 1,040 | 2,310 | 6,707 | 8,264 | 4,101 | 497 | 713 | 0 | 841 | 378 | 594 | 1,603 | 127 | 1,053 | 62 | 0 | 1,746 | 0 | 0 | 404 | 3,282
EducationalInstitution | 45,234 | 0 | 709 | 1,544 | 444 | 2,386 | 2,382 | 1,245 | 0 | 126 | 0 | 370 | 159 | 129 | 884 | 53 | 343 | 77 | 0 | 617 | 0 | 0 | 352 | 1,736
Work | 372,226 | 67,261 | 36,105 | 44,371 | 38,062 | 50,649 | 28,659 | 39,193 | 5,017 | 11,048 | 615 | 9,551 | 4,335 | 7,225 | 6,909 | 1,164 | 3,269 | 12,443 | 4,630 | 22,296 | 116 | 181 | 6,030 | 26,683
Book | 28,128 | 5,267 | 1,483 | 1,902 | 1,144 | 3,068 | 0 | 14,842 | 83 | 529 | 0 | 567 | 436 | 523 | 198 | 234 | 288 | 10,436 | 209 | 771 | 0 | 58 | 236 | 459
Film | 77,769 | 20,772 | 11,049 | 11,062 | 10,352 | 12,083 | 16,953 | 13,630 | 4,292 | 2,642 | 468 | 3,344 | 1,743 | 1,855 | 2,250 | 235 | 1,116 | 1,012 | 1,415 | 9,032 | 116 | 99 | 2,625 | 9,314
MusicalWork | 166,520 | 26,323 | 16,162 | 18,692 | 18,727 | 19,227 | 4,965 | 10,721 | 1 | 5,075 | 147 | 3,610 | 1,524 | 3,707 | 1,694 | 357 | 377 | 461 | 2,761 | 7,815 | 0 | 0 | 2,512 | 8,182
Album | 116,371 | 26,323 | 11,627 | 11,434 | 12,162 | 18,445 | 4,965 | 7,773 | 1 | 3,764 | 147 | 1,806 | 1,154 | 3,562 | 1,050 | 185 | 199 | 317 | 2,093 | 4,325 | 0 | 0 | 1,575 | 4,779
Single | 41,763 | 0 | 4,535 | 6,233 | 5,685 | 0 | 0 | 2,948 | 0 | 1,175 | 0 | 1,682 | 370 | 0 | 604 | 164 | 178 | 139 | 668 | 2,573 | 0 | 0 | 937 | 3,403
Software | 28,865 | 6,541 | 3,285 | 5,654 | 3,835 | 8,336 | 4,413 | 0 | 641 | 984 | 0 | 919 | 95 | 1,012 | 1,534 | 187 | 1,050 | 227 | 0 | 1,118 | 0 | 24 | 383 | 5,177
TelevisionShow | 25,629 | 4,255 | 2,782 | 2,884 | 2,910 | 3,838 | 0 | 0 | 0 | 876 | 0 | 644 | 452 | 0 | 983 | 112 | 136 | 145 | 0 | 1,893 | 0 | 0 | 0 | 737

3 Cross-Language Overlap


For detailed statistics about the overlap of the DBpedia data sets in different languages, please refer to Cross-Language Overlap Statistics.