Dataset Statistics

DBpedia 2014 Data Set Statistics


This page provides statistics about the DBpedia 2014 release. The release contains localized editions of DBpedia for 125 languages which have been extracted from the Wikipedia edition in the corresponding language. For 28 out of these languages, we report the overall number of things (instances) being described in the localized version of DBpedia as well as the number of facts (statements) that have been extracted from infoboxes describing these things. Afterwards, we report the number of instances of popular classes within these 28 DBpedia editions.

Dataset statistics for DBpedia 3.9 are published on a separate page. Below we compare the numbers between the two releases.



1 Instances, Properties, and Statements per Language

The same thing, for instance a person or city, might be described by multiple pages within Wikipedia editions in different languages. Pages describing the same thing are often interlinked by cross-language links within Wikipedia.

When DBpedia extracts data from these pages, it produces two types of data sets. The localized data sets contain all things that are described in a specific language; in them, things are identified with a language-specific URI. In addition, we produce a canonicalized data set for each language. The canonicalized data sets contain only things for which a corresponding page exists in the English edition of Wikipedia. Within all canonicalized data sets, the same thing is identified with the same URI from the generic namespace http://dbpedia.org/resource/.
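
The distinction between localized and canonicalized identifiers can be illustrated with a minimal Python sketch. It is not DBpedia's extraction code: the function names and the `cross_language_links` dictionary are hypothetical, and the sketch assumes that localized URIs follow the pattern http://<lang>.dbpedia.org/resource/<title>.

```python
from typing import Dict, Optional, Tuple

GENERIC_NS = "http://dbpedia.org/resource/"  # generic namespace named in the text


def localized_uri(lang: str, title: str) -> str:
    """Build a language-specific URI (assumed pattern, see lead-in)."""
    return f"http://{lang}.dbpedia.org/resource/{title.replace(' ', '_')}"


def canonicalize(
    lang: str,
    title: str,
    cross_language_links: Dict[Tuple[str, str], str],
) -> Optional[str]:
    """Return the generic URI if the page links to an English page, else None.

    cross_language_links is a hypothetical lookup from (language, local title)
    to the title of the corresponding English Wikipedia page.
    """
    english_title = cross_language_links.get((lang, title))
    if english_title is None:
        # No English counterpart: the thing appears only in the localized data set.
        return None
    return GENERIC_NS + english_title.replace(" ", "_")


links = {("de", "München"): "Munich"}
munich = canonicalize("de", "München", links)
missing = canonicalize("de", "Beispielseite", links)
```

In this sketch, a thing without an English counterpart yields `None`, mirroring the rule that canonicalized data sets contain only things with a page in the English edition.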

DBpedia uses two different extractors to extract data from Wikipedia infoboxes. The mapping-based extractor extracts data only for the infoboxes for which a language-specific extraction mapping to the DBpedia ontology exists in the DBpedia mapping wiki. Based on these mappings, it normalizes the different names that are used in various languages to refer to the same property. The second extractor is the raw infobox extractor which uses a generic heuristic to extract data from all infoboxes. The raw infobox extractor does not normalize property names but produces language-specific properties that directly reflect the property name in the Wikipedia infobox.
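
The difference between the two extractors can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual extraction framework: the `MAPPINGS` table, the template names, and the infobox keys are invented for the example, and the property namespaces used here are assumptions.

```python
from typing import Dict, Tuple

ONTOLOGY_NS = "http://dbpedia.org/ontology/"  # assumed ontology namespace

# Hypothetical mappings: (language, infobox template) -> {infobox key: ontology property}
MAPPINGS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("de", "Infobox Person"): {"geburtsdatum": "birthDate"},
    ("en", "Infobox person"): {"birth_date": "birthDate"},
}


def mapping_based_extract(lang: str, template: str, infobox: Dict[str, str]) -> Dict[str, str]:
    """Extract only mapped keys and normalize them to ontology properties."""
    mapping = MAPPINGS.get((lang, template))
    if mapping is None:
        return {}  # no mapping for this infobox: nothing is extracted
    return {ONTOLOGY_NS + mapping[key]: value
            for key, value in infobox.items() if key in mapping}


def raw_extract(lang: str, infobox: Dict[str, str]) -> Dict[str, str]:
    """Extract every key, keeping the language-specific property name."""
    namespace = f"http://{lang}.dbpedia.org/property/"  # assumed namespace pattern
    return {namespace + key: value for key, value in infobox.items()}


infobox_de = {"geburtsdatum": "1879-03-14", "unterschrift": "sig.svg"}
mapped = mapping_based_extract("de", "Infobox Person", infobox_de)
raw = raw_extract("de", infobox_de)
```

Here the mapping-based result contains a single normalized property, while the raw result keeps both infobox keys under their original German names.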

Below we report the overall number of things (instances), different ontology and raw-infobox properties, infobox statements and type statements for all 28 languages for which mappings exist in the DBpedia mapping wiki. The rows are sorted according to the number of instances for which mapping-based infobox data exists (Instances, CD, withMD column).

The column headings have the following meanings:

  • LD = Localized Data Sets.
  • CD = Canonicalized Data Sets.
  • all = Overall number of instances in the data set, calculated based on the labels and redirects dumps.
  • withMD = Number of instances for which mapping-based infobox data exists.
  • Raw Properties = Number of different properties that are generated by the raw infobox extractor.
  • Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor.
  • Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor.
  • Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor; includes type statements.

 

Language     Instances, LD, all     Instances, CD, all     Instances, CD, withMD     Raw Properties, CD     Mapping Properties, CD     Raw Statements, CD     Mapping Statements, CD     Type Statements, CD
en 4,584,616 4,584,616 4,232,626 55,986 1,122 68,091,260 56,549,445 28,563,803
it 1,128,909 745,345 540,474 10,591 249 13,840,025 7,413,922 3,929,338
de 1,692,634 857,196 479,731 11,695 420 9,677,586 6,059,745 3,468,237
nl 1,774,536 674,849 455,222 8,100 634 8,044,539 5,857,801 3,118,581
es 1,086,296 683,251 419,328 17,347 457 9,728,204 6,538,847 3,190,529
fr 1,504,453 942,505 415,390 15,111 595 11,521,313 6,234,623 3,396,756
pl 1,043,400 653,571 411,883 7,751 219 8,554,227 5,590,196 3,189,677
pt 812,610 552,362 321,211 14,637 522 7,069,586 4,801,340 2,185,948
ru 1,119,142 579,612 266,562 15,665 141 8,825,572 3,717,635 1,986,532
ja 913,488 397,907 134,380 15,981 342 4,403,612 2,028,745 1,002,180
ca 426,696 289,485 128,544 10,183 175 3,643,659 1,574,797 962,352
eu 178,822 139,023 90,948 2,947 118 2,010,728 916,523 577,224
hu 260,512 171,391 76,273 7,806 268 2,429,115 830,290 536,290
ko 276,881 178,872 58,937 8,503 377 1,409,638 878,745 458,870
tr 233,737 143,914 57,034 9,008 370 1,636,893 825,459 443,345
cs 296,094 193,674 48,356 6,368 291 2,272,303 649,900 377,149
bg 161,427 112,571 44,698 5,095 223 964,269 599,891 333,355
ar 266,386 170,430 44,298 11,008 254 1,185,465 479,823 316,167
id 354,326 142,616 43,980 11,514 329 1,599,822 653,002 347,255
el 96,301 67,390 36,255 4,437 445 389,068 382,708 252,492
sl 140,612 85,167 25,494 4,844 406 950,604 323,292 212,340
hr 135,272 92,952 12,003 3,674 139 827,890 200,690 106,691
ga 30,670 27,674 4,176 1,231 67 83,457 51,086 31,872
bn 29,631 26,136 2,160 6,609 83 271,070 30,350 19,015
be (new) 71,656 52,040 23,512 4,998 175 557,540 301,188 168,132
cy (new) 57,127 43,127 11,945 2,084 28 204,058 59,428 54,578
sk (new) 192,410 138,492 5,268 4,757 25 1,814,997 70,207 21,148
sr (new) 246,996 189,158 138,166 6,069 470 2,278,757 1,853,525 873,394
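
For readers who want to work with these rows programmatically, the following Python sketch parses one row into named fields and derives the share of canonicalized instances that have mapping-based data. The field names in `COLUMNS` and the `parse_row` helper are hypothetical and chosen only for this example.

```python
from typing import Dict, Tuple

# Hypothetical field names, in the order of the table columns above.
COLUMNS = [
    "instances_ld_all", "instances_cd_all", "instances_cd_with_md",
    "raw_properties_cd", "mapping_properties_cd",
    "raw_statements_cd", "mapping_statements_cd", "type_statements_cd",
]


def parse_row(line: str) -> Tuple[str, Dict[str, int]]:
    """Split a table row into its language code and integer column values."""
    tokens = line.replace("(new)", "").split()
    lang, values = tokens[0], tokens[1:]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
    return lang, {name: int(value.replace(",", ""))
                  for name, value in zip(COLUMNS, values)}


lang, row = parse_row(
    "it 1,128,909 745,345 540,474 10,591 249 13,840,025 7,413,922 3,929,338"
)
# Share of canonicalized Italian instances with mapping-based infobox data.
coverage = row["instances_cd_with_md"] / row["instances_cd_all"]
print(f"{lang}: {coverage:.1%}")
```

For the Italian row, the coverage comes out at roughly 72.5%.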


The following table combines the dataset statistics for DBpedia 3.9 with the statistics presented above, allowing a comparison between the two versions. The % columns contain the percentage change of the respective count in version 2014 relative to 3.9. The 2014 release adds four languages for which property mappings have become available: Belarusian (be), Serbian (sr), Welsh (cy), and Slovak (sk); their numbers can be found in the last four rows of the table. The decrease in the number of raw properties is due to the fact that triple de-duplication was introduced in 2014.

Language     Instances, LD, all     Instances, CD, all     Instances, CD, withMD     Raw Properties, CD     Mapping Properties, CD     Raw Statements, CD     Mapping Statements, CD     Type Statements, CD
  3.9 2014 % 3.9 2014 % 3.9 2014 % 3.9 2014 % 3.9 2014 % 3.9 2014 % 3.9 2014 % 3.9 2014 %
en 4,258,406 4,584,616 7.7 4,258,406 4,584,616 7.7 3,255,435 4,232,626 30 51,736 55,986 8.2 1,373 1,122 -18.3 70,147,399 68,091,260 -2.9 41,804,545 56,549,445 35.3 16,366,701 28,563,803 74.5
it 1,029,528 1,128,909 9.7 672,981 745,345 10.8 473,595 540,474 14.1 10,241 10,591 3.4 211 249 18.0 14,366,288 13,840,025 -3.7 5,724,415 7,413,922 29.5 2,364,096 3,929,338 66.2
de 1,547,785 1,692,634 9.4 779,104 857,196 10 327,548 479,731 46.5 10,659 11,695 9.7 327 420 28.4 9,284,326 9,677,586 4.2 4,070,927 6,059,745 48.9 1,800,424 3,468,237 92.6
nl 1,461,314 1,774,536 21.4 590,014 674,849 14.4 368,688 455,222 23.5 7,481 8,100 8.3 642 634 -1.2 7,916,452 8,044,539 1.6 5,039,583 5,857,801 16.2 2,144,581 3,118,581 45.4
es 1,003,158 1,086,296 8.3 621,472 683,251 9.9 376,975 419,328 11.2 15,992 17,347 8.5 549 457 -16.8 9,147,643 9,728,204 6.3 5,950,626 6,538,847 9.9 2,305,659 3,190,529 38.4
fr 1,378,099 1,504,453 9.2 856,004 942,505 10.1 346,214 415,390 20 13,990 15,111 8 689 595 -13.6 10,741,192 11,521,313 7.3 5,273,302 6,234,623 18.2 2,145,950 3,396,756 58.3
pl 960,880 1,043,400 8.6 598,754 653,571 9.2 334,214 411,883 23.2 7,478 7,751 3.7 264 219 -17.0 8,113,838 8,554,227 5.4 4,624,126 5,590,196 20.9 2,031,952 3,189,677 57
pt 764,132 812,610 6.3 511,741 552,362 7.9 298,475 321,211 7.6 13,740 14,637 6.5 620 522 -15.8 6,934,107 7,069,586 2 4,489,235 4,801,340 7.0 1,641,916 2,185,948 33.1
ru 999,165 1,119,142 12 516,870 579,612 12.1 236,067 266,562 12.9 14,771 15,665 6.1 149 141 -5.4 8,390,368 8,825,572 5.2 3,174,725 3,717,635 17.1 1,315,619 1,986,532 51
ja 860,917 913,488 6.1 370,912 397,907 7.3 115,227 134,380 16.6 14,752 15,981 8.3 395 342 -13.4 4,353,518 4,403,612 1.2 1,674,891 2,028,745 21.1 656,290 1,002,180 52.7
ca 400,271 426,696 6.6 267,856 289,485 8.1 119,675 128,544 7.4 9,391 10,183 8.4 184 175 -4.9 4,057,610 3,643,659 -10.2 1,420,025 1,574,797 10.9 757,526 962,352 27
eu 150,294 178,822 19 119,752 139,023 16.1 74,114 90,948 22.7 2,683 2,947 9.8 97 118 21.6 2,381,903 2,010,728 -15.6 975,775 916,523 -6.1 456,815 577,224 26.4
hu 239,711 260,512 8.7 157,034 171,391 9.1 68,939 76,273 10.6 7,283 7,806 7.2 298 268 -10.1 2,859,593 2,429,115 -15.1 669,836 830,290 24.0 358,586 536,290 49.6
ko 237,506 276,881 16.6 154,397 178,872 15.9 47,081 58,937 25.2 7,605 8,503 11.8 435 377 -13.3 1,276,866 1,409,638 10.4 646,461 878,745 35.9 271,610 458,870 68.9
tr 213,820 233,737 9.3 127,281 143,914 13.1 47,673 57,034 19.6 8,172 9,008 10.2 438 370 -15.5 1,701,192 1,636,893 -3.8 648,288 825,459 27.3 270,546 443,345 63.9
cs 263,317 296,094 12.4 172,763 193,674 12.1 40,549 48,356 19.3 5,873 6,368 8.4 340 291 -14.4 2,192,854 2,272,303 3.6 556,742 649,900 16.7 244,058 377,149 54.5
bg 146,608 161,427 10.1 101,310 112,571 11.1 43,961 44,698 1.7 4,728 5,095 7.8 268 223 -16.8 950,554 964,269 1.4 564,830 599,891 6.2 225,843 333,355 47.6
ar 215,042 266,386 23.9 129,600 170,430 31.5 25,325 44,298 74.9 9,492 11,008 16 286 254 -11.2 883,730 1,185,465 34.1 256,761 479,823 86.9 143,042 316,167 121
id 208,891 354,326 69.6 113,047 142,616 26.2 33,385 43,980 31.7 10,264 11,514 12.2 372 329 -11.6 1,417,031 1,599,822 12.9 449,244 653,002 45.4 199,564 347,255 74
el 84,359 96,301 14.2 57,249 67,390 17.7 27,856 36,255 30.2 3,695 4,437 20.1 461 445 -3.5 287,562 389,068 35.3 275,669 382,708 38.8 159,570 252,492 58.2
sl 136,684 140,612 2.9 80,102 85,167 6.3 23,584 25,494 8.1 4,473 4,844 8.3 474 406 -14.3 1,335,247 950,604 -28.8 265,908 323,292 21.6 151,203 212,340 40.4
hr 127,930 135,272 5.7 82,016 92,952 13.3 11,452 12,003 4.8 3,501 3,674 4.9 158 139 -12.0 779,862 827,890 6.2 168,804 200,690 18.9 74,455 106,691 43.3
ga 19,450 30,670 57.7 17,350 27,674 59.5 3,791 4,176 10.2 1,128 1,231 9.1 72 67 -6.9 76,746 83,457 8.7 41,331 51,086 23.6 21,847 31,872 45.9
bn 25,811 29,631 14.8 20,753 26,136 25.9 1,275 2,160 69.4 5,467 6,609 20.9 86 83 -3.5 176,630 271,070 53.5 13,852 30,350 119.1 6,856 19,015 177.3
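be - 71,656 - - 52,040 - - 23,512 - - 4,998 - - 175 - - 557,540 - - 301,188 - - 168,132 -
cy - 57,127 - - 43,127 - - 11,945 - - 2,084 - - 28 - - 204,058 - - 59,428 - - 54,578 -
sk - 192,410 - - 138,492 - - 5,268 - - 4,757 - - 25 - - 1,814,997 - - 70,207 - - 21,148 -
sr - 246,996 - - 189,158 - - 138,166 - - 6,069 - - 470 - - 2,278,757 - - 1,853,525 - - 873,394 -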
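
The % columns follow the usual relative-change formula, (value in 2014 − value in 3.9) / value in 3.9 × 100, rounded to one decimal place. The short Python sketch below reproduces two entries of the English row; the `percent_change` helper is a name chosen for this illustration.

```python
def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


# English row, "Instances, LD, all": 4,258,406 (3.9) -> 4,584,616 (2014)
instances_change = percent_change(4_258_406, 4_584_616)  # 7.7

# English row, "Mapping Properties, CD": 1,373 (3.9) -> 1,122 (2014)
properties_change = percent_change(1_373, 1_122)  # -18.3
```

A negative value indicates that the count dropped between the two releases.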

 

 

2 Instances of Selected Classes per Language

The table below reports the number of instances for a set of selected classes within the canonicalized DBpedia data sets for each language.
Class en it pl es pt fr de ru ca bn eu ga hr nl cs el bg hu ko sl tr id ja cy be sk sr
Person 1,445,104 189,448 96,135 99,147 60,056 134,749 179,421 86,269 10,533 1,788 4,366 1,511 5,869 54,879 12,884 5,964 19,047 22,444 21,844 7,740 21,422 17,627 48,642 656 6,724 668 16,300
Athlete 268,773 67,932 45,890 31,527 19,849 65,782 42,101 16,631 905 0 480 197 0 26,113 4,885 1,782 3,625 5,919 5,828 2,577 8,144 6,439 17,883 0 827 668 3,889
Actor 6,501 508 10,106 13,831 7,546 14,019 0 0 2,054 0 1,052 515 1,550 8,117 2,680 0 1,552 2,519 7,000 435 2,912 0 10,633 0 25 0 2,054
Artist 96,282 15,621 20,180 34,898 14,603 32,562 0 30,266 3,193 0 1,052 823 4,348 16,656 3,195 1,567 2,404 6,633 10,118 1,276 7,805 2,867 21,896 0 1,931 0 4,449
Musical Artist 45,089 15,113 6,924 14,594 6,332 11,138 0 9,015 1,139 0 0 76 2,121 5,959 0 200 48 0 2,525 614 3,499 2,186 8,566 0 434 0 1,049
Politician 40,343 4,893 10,639 7,460 4,110 11,461 0 0 1,901 0 977 316 0 1,805 0 792 0 1,025 707 63 0 0 2,849 283 30 0 1,552
Scientist 18,233 0 0 4,626 6,242 2,431 0 9,322 0 604 0 0 189 1,309 356 19 872 487 737 421 0 612 1,148 0 1,002 0 1,042
Place 735,062 177,524 211,084 156,377 123,114 148,586 168,082 91,099 74,835 0 50,969 1,385 1,063 202,393 21,582 5,000 10,865 19,992 11,031 12,634 10,697 7,068 20,669 11,182 11,820 0 78,506
PopulatedPlace 478,351 160,582 191,208 133,947 114,155 118,716 96,556 86,137 74,344 0 48,804 1,385 128 183,335 16,661 4,132 5,321 16,331 3,472 12,148 8,865 4,493 4,889 376 10,031 0 76,063
Building 68,582 3,888 2,549 4,455 228 6,926 990 82 0 0 1,013 0 23 2,373 293 30 236 513 306 42 427 236 1,106 0 324 0 173
Airport 13,649 1,069 3,392 1,921 720 1,499 2,087 0 0 0 0 0 0 1,050 174 0 0 187 528 78 0 1,330 565 0 12 0 132
Bridge 3,543 216 249 444 108 0 711 0 0 0 0 0 0 305 0 0 0 107 31 0 337 0 0 0 0 0 27
River 26,295 2,099 1,859 3 4,397 3,957 7,949 4,598 0 0 469 0 696 1,257 1,712 54 378 730 163 172 0 101 792 0 537 0 759
Organisation 241,286 15,554 15,288 15,955 13,625 27,542 28,935 15,414 1,623 0 912 177 0 12,234 1,327 2,336 3,924 4,829 6,448 704 5,190 2,734 8,475 77 765 0 3,479
Company 58,400 5,337 3,054 1,077 2,610 8,180 9,473 4,512 541 0 0 0 0 2,490 708 192 430 831 1,844 160 956 560 3,485 0 209 0 363
Educ. Institution 49,172 918 845 1,709 514 2,943 2,600 1,418 0 0 0 0 0 775 146 106 175 158 1,001 56 430 449 1,938 77 181 0 103
Band 30,572 0 4,054 0 5,076 5,177 6,462 4,656 297 0 324 104 0 2,057 0 250 1,348 949 1,395 14 21 0 0 0 87 0 336
SportsTeam 28,357 7,900 5,048 5,585 3,575 5,844 4,157 3,930 0 0 429 73 0 5,751 0 1,169 1,032 2,322 1,384 188 2,276 1,301 2,034 0 45 0 1,804
Work 411,295 78,975 37,363 50,374 43,263 61,212 38,945 45,848 6,113 372 1,065 119 4,822 30,126 8,489 16,072 5,204 12,449 9,068 1,221 11,226 6,802 29,139 30 503 0 3,800
Musical Work 180,308 31,309 17,406 21,379 20,815 22,065 7,540 12,445 230 0 151 0 2,808 8,774 4,189 1,018 1,792 5,374 2,137 379 4,338 2,804 8,900 0 9 0 911
Album 123,374 30,252 12,406 12,897 13,006 14,426 5,462 8,856 230 0 151 0 2,135 4,786 4,041 845 1,364 3,952 1,256 202 2,108 1,741 5,178 0 0 0 635
Single 45,433 0 5,000 7,296 6,271 5,621 1,532 3,589 0 0 0 0 673 2,982 0 166 428 1,276 833 169 2,055 1,063 3,722 0 9 0 268
Film 87,282 24,156 12,555 12,140 11,643 15,669 18,707 14,912 5,105 220 914 119 1,487 10,239 2,388 1,128 2,095 3,188 3,392 259 3,768 3,011 10,126 0 80 0 1,733
Book 31,029 6,083 1,687 2,217 1,343 3,549 0 18,491 109 123 0 0 241 953 597 13,133 556 608 216 239 754 241 547 30 67 0 220
Software 31,401 7,145 1,187 6,284 4,245 8,980 5,286 0 669 29 0 0 0 3,878 1,172 305 99 1,105 1,753 182 1,081 430 5,526 0 141 0 276
Television Show 29,466 570 3,071 3,544 3,433 4,373 3,399 0 0 0 0 0 0 2,077 0 186 559 1,152 1,285 117 728 0 888 0 0 0 530
Event 45,377 18,118 4,064 5,050 6,488 23,123 3,294 3,519 0 0 0 0 9 5,552 1,429 359 753 3,627 1,720 1,220 2,006 758 1,864 0 381 0 1,539
Celestial Body 32,864 16,974 13,312 2,541 15,648 0 5,199 544 0 0 0 0 0 1,666 7 2,146 138 10,581 0 735 689 0 3,666 0 59 4,600 5,512
Species 252,166 22,810 23,474 64,950 47,407 0 28,491 23,303 35,440 0 0 749 0 129,539 0 0 3,897 94 6,290 7 0 8,068 10,603 0 2,448 0 7,542
Mean Of Transportation 50,984 6,692 4,293 5,019 3,401 2,005 6,681 0 0 0 0 0 0 2,683 1,207 553 148 1,201 488 429 576 769 2,048 0 26 0 1,073
Disease 6,078 1,868 1,448 2,397 1,134 0 0 0 0 0 0 0 0 1,088 290 286 29 0 536 283 300 0 548 0 192 0 587
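
Because some class names contain spaces, a row of this table cannot be split on whitespace alone. The following Python sketch shows one way to separate the class label from the per-language counts; `parse_class_row` and the `NUMBER` pattern are hypothetical helpers written for this example, and the language order is taken from the table header.

```python
import re
from typing import Dict, Tuple

# Language codes in the order of the table header.
LANGS = ("en it pl es pt fr de ru ca bn eu ga hr nl cs el bg hu ko sl "
         "tr id ja cy be sk sr").split()

# Matches counts such as "0", "76" or "1,445,104".
NUMBER = re.compile(r"^\d{1,3}(,\d{3})*$")


def parse_class_row(line: str) -> Tuple[str, Dict[str, int]]:
    """Split a row into the class label and a language -> count mapping."""
    tokens = line.split()
    # The label ends where the first numeric token begins.
    first_number = next(i for i, tok in enumerate(tokens) if NUMBER.match(tok))
    label = " ".join(tokens[:first_number])
    counts = [int(tok.replace(",", "")) for tok in tokens[first_number:]]
    if len(counts) != len(LANGS):
        raise ValueError(f"expected {len(LANGS)} counts, got {len(counts)}")
    return label, dict(zip(LANGS, counts))


label, counts = parse_class_row(
    "Musical Artist 45,089 15,113 6,924 14,594 6,332 11,138 0 9,015 1,139 0 0 76 "
    "2,121 5,959 0 200 48 0 2,525 614 3,499 2,186 8,566 0 434 0 1,049"
)
```

A count of 0 in the parsed result corresponds to a 0 entry in the table for that language.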

 

3 Cross-Language Overlap

For detailed statistics about the overlap of the DBpedia data sets in different languages, please refer to Cross-Language Overlap Statistics.