Cross-Language Overlap Statistics


General statistics about the DBpedia 3.9 release are provided on the Data Set Statistics page. Below, we analyze the extent to which the same instance is described in different languages within the canonicalized, mapping-based data sets.


We consider the DBpedia versions in the following 24 languages: en, ar, bn, eu, ga, hr, nl, cs, de, el, es, fr, it, pl, pt, bg, ca, hu, ko, ru, sl, tr, id, ja. The last two, Indonesian (id) and Japanese (ja), are new in the 3.9 release and were not part of 3.8.


A thing (instance) is identified with the same URI within all canonicalized data sets. The mapping-based infobox extractor normalizes the property names that are used in different languages to refer to the same property to a single English property name. This allows us to 'merge' the datasets from the 24 languages into a single multi-lingual version which is analyzed below.
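The merge described above can be pictured with a minimal Python sketch. It assumes each language edition is available as a list of (subject URI, property, value) triples that already use canonical URIs and normalized English property names. The `ex:` identifiers and the function name `merge_languages` are hypothetical and serve only as illustration; they are not taken from the release.

```python
from collections import defaultdict

# Hypothetical per-language extracts; the "ex:" names are made up for illustration.
triples_by_lang = {
    "en": [("ex:CityA", "ex:country", "ex:CountryX")],
    "de": [("ex:CityA", "ex:country", "ex:CountryX"),
           ("ex:CityA", "ex:postalCode", "12345")],
}

def merge_languages(triples_by_lang):
    """Union the triples of all languages, recording which languages state each one."""
    merged = defaultdict(set)
    for lang, triples in triples_by_lang.items():
        for triple in triples:
            merged[triple].add(lang)
    return dict(merged)

merged = merge_languages(triples_by_lang)
for triple, langs in sorted(merged.items()):
    print(triple, sorted(langs))
```

Because subjects and properties are shared across editions, a plain set union is enough here; no cross-language entity matching is needed at merge time.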


Cross-language overlap statistics for DBpedia 3.8 are available on a separate page. Below we compare the numbers between the two releases.

1 Overall Number of Instances and Statements for Selected Classes


In the table below we report, for a number of important DBpedia classes, the overall number of instances, the number of mapping-based infobox statements, and the number of distinct mapping-based infobox properties. For the last figure, all statements that have the same predicate and refer to the same instance are counted as a single property.
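The three counts can be illustrated with a short Python sketch. It assumes the statements of one class are given as (instance, predicate, value) triples; the `ex:` names and the function `class_statistics` are hypothetical. In particular, the "distinct properties" figure is computed here as the number of distinct (instance, predicate) pairs, which is our reading of the counting rule above.

```python
def class_statistics(triples):
    """Return (instances, statements, distinct properties) for a list of triples."""
    instances = {s for s, _, _ in triples}
    statements = len(triples)
    # Statements sharing instance and predicate count as one property.
    distinct_properties = {(s, p) for s, p, _ in triples}
    return len(instances), statements, len(distinct_properties)

# Made-up example: ex:A has two ex:name statements, which count as one property.
example = [
    ("ex:A", "ex:name", "Alpha"),
    ("ex:A", "ex:name", "Alfa"),
    ("ex:A", "ex:birthPlace", "ex:X"),
    ("ex:B", "ex:name", "Beta"),
]
stats = class_statistics(example)
print(stats)
```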


The number 979,053 for the class Person means that all 24 language versions together describe 979,053 different persons. As we will see below, a single person can be described in multiple languages. The number is also higher than the number of persons described in the canonicalized English infobox data set (831,558), as there might be an infobox on a non-English page describing a person while there is no infobox on the English page describing the same person. For the same reason, the number of distinct properties in the merged version is 2,045 compared to 1,373 distinct properties in the English version alone.


The Instances 2009 column contains the number of instances that were contained in the 2009 edition of DBpedia for comparison. Note that the old numbers refer to English DBpedia only, while the new numbers refer to the merge of all languages. For some of the DBpedia classes 2009 data is not available.


Classes in bold (e.g. Work) are top-level classes of the DBpedia ontology, i.e. direct subclasses of owl:Thing; classes in italic (e.g. MusicalWork) are on the 2nd level, and the rest (e.g. Album) are on the 3rd level of the class hierarchy.
Class Instances(3.9) Instances(3.8) Instances(2009) Statements(3.9) Statements(3.8) DistinctProperties(3.9) DistinctProperties(3.8)
Person 979,053 871,630 198,056 25,692,893 18,323,794 7,760,313 6,195,234
Artist 128,912 100,793 54,262 5,693,303 3,723,440 1,390,336 998,616
Actor 31,861 25,340 26,009 1,759,089 1,070,066 372,112 247,690
MusicalArtist 57,488 46,364 19,535 2,979,826 2,069,152 715,093 550,225
Athlete 267,977 217,067 74,832 9,805,423 6,373,136 2,586,730 1,853,233
Politician 39,584 41,126 12,874 1,587,472 1,407,548 486,174 454,209
Place 723,757 643,260 247,507 32,250,749 24,698,893 9,979,392 8,026,305
Building 76,107 65,355 23,304 1,321,828 1,058,610 634,144 530,010
Airport 12,566 11,675 7,971 449,272 352,377 153,614 138,944
Bridge 3,765 3,425 1,420 78,418 66,968 38,846 34,470
Skyscraper 795 68 2,028 35,403 3,091 11,415 719
PopulatedPlace 467,259 424,291 181,847 26,584,052 20,565,679 7,694,182 6,212,991
River 27,818 26,892 10,797 806,429 681,782 229,762 208,146
Organisation 228,286 206,670 91,275 6,256,810 4,940,190 2,282,581 2,029,620
SportsLeague 3,853 2,746 n/a 124,224 70,475 36,378 23,379
SportsTeam 25,336 19,113 n/a 1,025,695 565,625 204,706 150,382
Band 31,344 29,101 14,952 1,449,829 1,126,744 340,016 298,743
Company 55,320 48,989 20,173 1,292,526 1,048,251 524,303 445,758
Educ.Institution 47,101 43,250 21,052 1,101,287 958,257 543,322 493,792
Work 410,676 360,808 189,620 12,641,898 9,649,228 4,153,448 3,566,511
Book 55,694 44,339 15,677 1,721,132 1,111,960 517,959 408,724
Film 81,833 75,067 34,680 3,408,374 2,663,487 923,070 787,129
MusicalWork 169,811 160,383 101,985 5,213,842 4,116,625 1,794,704 1,635,655
Album 128,514 122,729 74,055 4,381,101 3,400,942 1,360,640 1,224,746
Single 42,933 42,393 24,597 1,543,304 1,226,636 576,181 534,023
Software 30,495 28,930 5,652 932,543 731,138 272,488 242,411
TelevisionShow 28,559 24,784 10,169 788,740 565,136 342,230 282,594
Species 234,512 205,231 n/a 6,375,368 4,415,050 2,166,889 1,836,221
Activity 1,707 1,490 n/a 14,562 10,282 6,744 5,213
AnatomicalStructure 4,454 4,217 n/a 39,596 36,179 25,105 23,732
Biomolecule 17,300 12,108 n/a 113,106 72,444 54,898 42,836
CelestialBody 21,697 19,839 n/a 536,325 465,133 205,953 185,628
ChemicalSubstance 12,064 10,849 n/a 213,352 140,466 71,102 51,687
Device 28,961 25,983 n/a 220,662 191,195 122,024 108,322
Event 57,775 44,481 n/a 1,377,047 830,068 380,148 270,979
Food 4,339 1,671 n/a 35,420 14,274 18,771 7,930
Disease 6,021 5,776 n/a 84,070 58,217 30,600 27,917
Drug 5,396 5,197 n/a 69,472 56,580 38,299 31,609
EthnicGroup 4,163 3,820 n/a 56,146 46,043 21,072 18,485
Holiday 898 750 n/a 8,184 5,914 4,047 3,036

2 Cross-Language Instance Overlap


Next, we measure the extent to which the multi-lingual DBpedia dataset obtained by 'merging' the 24 language versions is actually multi-lingual.


In the table below, the Instances column contains the total number of instances per class across all languages. The 1 column contains the number of instances that are described in exactly one of the 24 languages, the 2 column contains the number of instances that are described in exactly 2 languages (but not in 3 or more), and so on.


So, for example, 2,066 Persons are described in 12 languages but not in 13 or more languages. If we sum columns 2 to 24 for the Person class, we find that 295,853 Persons are described in 2 or more languages. This might seem small compared to the total of 979,053 Persons, but the second table on the Data Set Statistics page shows that there are 831,558 Persons in English DBpedia and only 167,168 Persons in the second-largest edition, the Italian DBpedia. Therefore, most of the 683,200 Persons described in just one language are very likely to be present in English DBpedia only.
Class Instances 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Person 979,053 683,200 129,557 62,966 34,112 21,543 13,435 8,967 6,165 4,553 3,621 2,650 2,066 1,627 1,293 1,035 702 510 353 281 188 102 72 38 17
Artist 128,912 50,019 28,653 16,073 9,807 6,276 4,262 3,147 2,402 1,823 1,482 1,154 874 686 532 436 330 241 181 187 139 87 66 38 17
Actor 31,861 3,821 8,030 5,552 3,621 2,468 1,780 1,303 995 836 661 581 424 360 305 266 205 146 101 126 99 67 59 38 17
MusicalArtist 57,488 24,952 11,639 6,339 3,920 2,737 1,863 1,459 1,176 831 703 506 390 261 193 142 121 79 56 57 40 14 9 0 1
Athlete 267,977 153,599 48,614 26,507 14,327 8,959 4,990 3,269 2,019 1,356 1,070 762 602 471 443 338 241 193 120 54 25 12 6 0 0
Politician 39,584 19,173 8,063 4,202 2,148 1,420 1,026 744 574 469 390 299 256 216 187 172 95 54 39 32 22 3 0 0 0
Place 723,757 340,140 149,192 60,423 31,806 30,030 19,608 16,841 20,175 28,551 17,629 5,845 1,509 672 401 296 271 217 89 40 16 6 0 0 0
Building 76,107 62,698 7,856 2,759 1,181 633 366 224 145 91 63 38 30 10 10 2 0 1 0 0 0 0 0 0 0
Airport 12,566 7,739 2,274 1,057 593 318 201 164 90 71 36 19 1 3 0 0 0 0 0 0 0 0 0 0 0
Bridge 3,765 3,092 417 149 54 18 21 6 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Skyscraper 795 57 259 177 131 89 45 20 7 4 1 4 1 0 0 0 0 0 0 0 0 0 0 0 0
PopulatedPlace 467,259 156,980 107,900 44,503 24,991 26,594 17,612 15,541 19,345 28,002 17,216 5,570 1,336 558 329 230 227 198 71 36 15 5 0 0 0
River 27,818 15,779 6,742 2,180 827 626 458 311 230 160 128 116 72 59 35 39 25 16 13 1 1 0 0 0 0
Organisation 228,286 167,572 27,017 11,870 6,588 4,134 2,825 2,000 1,431 1,037 931 699 560 421 334 271 262 160 74 53 32 12 3 0 0
SportsLeague 3,853 1,754 636 451 338 285 230 139 2 1 1 16 0 0 0 0 0 0 0 0 0 0 0 0 0
SportsTeam 25,336 12,715 4,554 2,546 1,402 907 622 445 330 270 291 274 237 179 156 141 149 88 26 4 0 0 0 0 0
Band 31,344 17,024 5,057 2,607 1,581 1,099 766 645 570 428 386 286 225 166 129 89 87 61 44 47 32 12 3 0 0
Company 55,320 38,709 7,827 3,381 1,906 1,137 776 525 336 232 165 92 67 58 37 33 24 9 4 2 0 0 0 0 0
Educ.Institution 47,101 41,036 3,453 1,129 571 325 204 122 99 50 44 28 15 11 6 5 1 2 0 0 0 0 0 0 0
Work 410,676 255,075 69,316 30,495 16,623 10,522 7,034 5,349 3,976 3,129 2,356 1,824 1,419 1,041 752 639 409 269 182 136 71 33 17 8 1
Book 55,694 33,522 8,824 3,926 2,393 1,528 1,134 903 698 515 411 350 342 305 237 252 124 81 62 44 23 11 5 4 0
Film 81,833 41,060 16,078 7,605 4,349 2,772 2,072 1,537 1,262 1,117 941 794 603 457 327 247 195 144 104 83 47 22 12 4 1
MusicalWork 169,811 110,752 27,884 11,768 6,201 4,109 2,525 1,969 1,364 1,083 690 505 348 208 144 107 72 45 22 11 4 0 0 0 0
Album 128,514 77,500 22,030 10,174 5,815 3,961 2,483 1,954 1,358 1,083 690 505 348 208 144 107 72 45 22 11 4 0 0 0 0
Single 42,933 25,209 7,532 3,943 2,139 1,381 905 637 425 298 178 119 77 47 19 18 4 1 1 0 0 0 0 0 0
Software 30,495 14,939 6,048 3,257 1,886 1,307 909 628 464 336 259 147 113 69 42 43 30 12 4 2 0 0 0 0 0
TelevisionShow 28,559 18,443 4,424 1,988 1,422 756 498 365 232 156 104 58 42 35 23 9 1 1 1 1 0 0 0 0 0
Species 234,512 100,655 64,630 27,515 16,867 9,057 5,354 3,472 2,281 1,549 1,163 863 598 311 120 66 11 0 0 0 0 0 0 0 0
Activity 1,707 1,369 220 64 24 12 6 3 1 6 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
AnatomicalStructure 4,454 3,621 640 171 19 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Biomolecule 17,300 14,868 1,356 1,031 38 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CelestialBody 21,697 4,515 2,766 1,797 3,771 5,033 1,870 1,167 660 61 50 5 1 1 0 0 0 0 0 0 0 0 0 0 0
ChemicalSubstance 12,064 3,705 3,145 1,825 1,179 829 619 501 209 47 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Device 28,961 26,854 1,144 543 215 93 54 31 16 8 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0
Event 57,775 35,552 10,922 5,486 2,255 1,276 908 551 255 178 126 105 59 38 37 21 6 0 0 0 0 0 0 0 0
Food 4,339 4,124 179 26 6 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Disease 6,021 2,823 1,082 665 447 333 239 155 105 77 33 32 18 3 6 1 2 0 0 0 0 0 0 0 0
Drug 5,396 3,322 883 448 293 248 141 46 12 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
EthnicGroup 4,163 3,277 517 206 60 51 33 13 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Holiday 898 593 192 68 34 9 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
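The per-class histogram and the Person figures quoted in this section can be reproduced with a few lines of Python. This is only a sketch: the function `language_histogram` and its toy input are hypothetical, while `person_row` is copied from columns 1 to 24 of the Person row above.

```python
from collections import Counter

def language_histogram(langs_per_instance):
    """Map k to the number of instances described in exactly k languages."""
    return Counter(len(langs) for langs in langs_per_instance.values())

# Toy input with made-up identifiers: two instances in one language, one in two.
toy = {"ex:A": {"en"}, "ex:B": {"de"}, "ex:C": {"en", "it"}}
toy_hist = language_histogram(toy)

# Columns 1..24 of the Person row in the table above.
person_row = [683200, 129557, 62966, 34112, 21543, 13435, 8967, 6165,
              4553, 3621, 2650, 2066, 1627, 1293, 1035, 702,
              510, 353, 281, 188, 102, 72, 38, 17]

in_two_or_more = sum(person_row[1:])  # columns 2..24
total_persons = sum(person_row)       # should equal the Instances column
print(in_two_or_more, total_persons)
```

The sum of columns 2 to 24 is 295,853, and adding the 683,200 single-language Persons gives the total of 979,053 reported in the Instances column.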

3 Cross-Language Property Overlap


The table below contains statistics on properties of the PopulatedPlace class that are frequently used in different languages. For each property, we report the number of instances that have this property in exactly 1, 2, 3, ..., 8 languages, and in 9 to 24 languages (column >8).

Property Total 1 2 3 4 5 6 7 8 >8
rdf:type 467,259 156,980 107,900 44,503 24,991 26,594 17,612 15,541 19,345 53,793
foaf:name 463,390 160,232 114,157 39,928 28,094 29,659 20,575 30,880 25,054 14,811
dbp-ont:country 440,677 229,005 89,070 52,387 17,410 11,723 4,612 29,021 7,376 73
wgs84:lat 391,537 149,248 92,600 35,848 37,671 35,144 20,054 16,824 2,865 1,283
wgs84:long 391,537 149,247 92,601 35,848 37,671 35,144 20,054 16,824 2,865 1,283
georss:point 390,682 150,334 92,458 36,517 36,501 33,917 19,994 16,818 2,860 1,283
dbp-ont:populationTotal 336,331 101,425 70,162 32,841 26,340 41,903 39,924 16,311 4,400 3,025
dbp-ont:isPartOf 296,005 293,480 2,525 0 0 0 0 0 0 0
dbp-ont:postalCode 235,865 88,888 29,843 23,141 14,632 10,060 9,810 14,528 12,726 32,237
dbp-ont:timeZone 213,091 167,334 43,734 1,735 213 56 15 3 1 0
dbp-ont:areaCode 206,730 109,880 48,298 14,967 8,740 7,212 14,712 2,452 315 154
dbp-ont:utcOffset 197,984 195,878 2,069 37 0 0 0 0 0 0
dbp-ont:type 190,096 165,556 24,166 367 7 0 0 0 0 0
dbp-ont:elevation 188,585 89,731 36,274 21,997 16,946 18,572 3,531 1,102 265 167
dbp-ont:areaTotal 184,410 95,704 58,693 23,577 3,856 1,882 456 160 61 21
dbp-ont:area 138,125 68,821 35,482 33,788 33 1 0 0 0 0
dbp-ont:censusYear 123,933 110,610 9,927 3,115 281 0 0 0 0 0
dbp-ont:administrativeDistrict 122,687 109,726 12,026 790 145 0 0 0 0 0
dbp-ont:region 121,205 50,884 19,810 8,064 5,765 10,072 16,879 8,648 1,006 77
dbp-ont:populationDensity 119,908 96,076 19,848 2,702 1,006 49 51 120 49 7
dbp-ont:vehicleCode 107,712 79,751 14,956 11,371 1,476 145 13 0 0 0
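The bucketing used in this table can be sketched in Python as follows. The function `property_overlap`, its input layout of (language, instance, property) rows, and the `ex:` names are assumptions made for illustration. The final check uses real figures: columns 9 to 24 of the PopulatedPlace row in Section 2 should add up to the >8 value of rdf:type, since every instance of the class carries that property.

```python
from collections import Counter, defaultdict

def property_overlap(rows, cap=8):
    """rows: iterable of (language, instance, property).

    Returns, per property, how many instances have it in exactly
    1..cap languages, with everything above cap pooled under '>cap'.
    """
    langs = defaultdict(set)
    for lang, instance, prop in rows:
        langs[(instance, prop)].add(lang)
    table = defaultdict(Counter)
    for (_, prop), lang_set in langs.items():
        n = len(lang_set)
        table[prop][n if n <= cap else f">{cap}"] += 1
    return table

# Toy input: ex:A has ex:p in 9 (made-up) languages, ex:B in one.
rows = [(f"lang{i}", "ex:A", "ex:p") for i in range(9)] + [("lang0", "ex:B", "ex:p")]
overlap = property_overlap(rows)

# Columns 9..24 of the PopulatedPlace row in Section 2.
populated_place_9_to_24 = [28002, 17216, 5570, 1336, 558, 329, 230, 227,
                           198, 71, 36, 15, 5, 0, 0, 0]
more_than_eight = sum(populated_place_9_to_24)
print(dict(overlap["ex:p"]), more_than_eight)
```

The sum is 53,793, which matches the >8 entry of the rdf:type row.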