Cross-Language Overlap Statistics


General statistics about the DBpedia 3.8 release are provided on the Data Set Statistics page. Below we analyze, for the canonicalized mapping-based data sets, to what extent the same instance is described in different languages.


We consider the DBpedia versions in the following 22 languages: en, ar, bn, eu, ga, hr, nl, cs, de, el, es, fr, it, pl, pt, bg, ca, hu, ko, ru, sl, tr.


A thing (instance) is identified by the same URI in all canonicalized data sets. The mapping-based infobox extractor maps the property names that different languages use for the same property to a single English property name. This allows us to 'merge' the data sets from the 22 languages into a single multilingual version, which is analyzed below.
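The merge described above can be pictured as a set union over triples. The following is a minimal sketch, not the actual DBpedia tooling: the function name `merge_languages`, the in-memory triple representation, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# A statement is modeled as (subject URI, property name, value).
Triple = Tuple[str, str, str]


def merge_languages(datasets: Dict[str, Iterable[Triple]]) -> Dict[Triple, Set[str]]:
    """Union the per-language triples and remember which languages assert each one.

    This only works because subjects share one URI across languages and
    property names have already been normalized to a single English name.
    """
    merged: Dict[Triple, Set[str]] = {}
    for lang, triples in datasets.items():
        for triple in triples:
            merged.setdefault(triple, set()).add(lang)
    return merged


# Hypothetical toy input: two language editions describing the same instance.
datasets = {
    "en": [("http://dbpedia.org/resource/Berlin", "DBP-ONT:country", "Germany")],
    "de": [
        ("http://dbpedia.org/resource/Berlin", "DBP-ONT:country", "Germany"),
        ("http://dbpedia.org/resource/Berlin", "DBP-ONT:elevation", "34"),
    ],
}
merged = merge_languages(datasets)
for triple, langs in merged.items():
    print(triple, sorted(langs))
```

Keeping the set of contributing languages next to each triple is a design choice of this sketch; it makes the overlap analysis in the later sections a simple counting exercise.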



1 Overall Number of Instances and Statements for Selected Classes


In the table below we report, for a number of important DBpedia classes, the overall number of instances, the number of mapping-based infobox statements, and the number of distinct mapping-based infobox properties. For the last figure, all statements that have the same predicate and refer to the same instance are counted as a single property.
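To make the three measures concrete, here is a small sketch of how they could be computed from a list of triples. The function name `class_statistics` and the toy data are hypothetical, and the sketch assumes that identical triples are counted once; the page does not say how duplicates were treated.

```python
def class_statistics(triples):
    """Return (instances, statements, distinct properties) for one class.

    triples: iterable of (instance URI, predicate, value).
    Assumption: identical triples are deduplicated before counting.
    """
    statements = set(triples)
    instances = {subject for subject, _, _ in statements}
    # Statements with the same predicate on the same instance count once.
    distinct_properties = {(subject, predicate) for subject, predicate, _ in statements}
    return len(instances), len(statements), len(distinct_properties)


# Hypothetical toy data: instance A has two values for the same predicate.
toy = [
    ("ex:A", "FOAF:name", "A"),
    ("ex:A", "FOAF:name", "Alpha"),
    ("ex:A", "DBP-ONT:country", "ex:X"),
    ("ex:B", "FOAF:name", "B"),
]
stats = class_statistics(toy)
print(stats)  # (2, 4, 3)
```

In the toy data the two `FOAF:name` statements about `ex:A` collapse into one distinct property, which is why the third number is smaller than the second.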


The number 871,630 for the class Person means that all 22 language versions together describe 871,630 different persons. As we will see below, a single person can be described in multiple languages. The number is also higher than the number of persons described in the canonicalized English infobox data set (763,643), because a non-English page may contain an infobox describing a person while the English page for the same person has none. For the same reason, the number of distinct properties in the merged version is 1,542, compared to 1,313 distinct properties in the English version alone.


The Instances 2009 column contains the number of instances that were contained in the 2009 edition of DBpedia for comparison. Note that the old numbers refer to English DBpedia only, while the new numbers refer to the merge of all languages.
Class Instances Instances 2009 Statements Distinct Properties
Person 871,630 198,056 18,323,794 6,195,234
Artist 100,793 54,262 3,723,440 998,616
Actor 25,340 26,009 1,070,066 247,690
MusicalArtist 46,364 19,535 2,069,152 550,225
Athlete 217,067 74,832 6,373,136 1,853,233
Politician 41,126 12,874 1,407,548 454,209
Place 643,260 247,507 24,698,893 8,026,305
Building 65,355 23,304 1,058,610 530,010
Airport 11,675 7,971 352,377 138,944
Bridge 3,425 1,420 66,968 34,470
Skyscraper 68 2,028 3,091 719
PopulatedPlace 424,291 181,847 20,565,679 6,212,991
River 26,892 10,797 681,782 208,146
Organisation 206,670 91,275 4,940,190 2,029,620
Band 29,101 14,952 1,126,744 298,743
Company 48,989 20,173 1,048,251 445,758
Educ.Institution 43,250 21,052 958,257 493,792
Work 360,808 189,620 9,649,228 3,566,511
Book 44,339 15,677 1,111,960 408,724
Film 75,067 34,680 2,663,487 787,129
MusicalWork 160,383 101,985 4,116,625 1,635,655
Album 122,729 74,055 3,400,942 1,224,746
Single 42,393 24,597 1,226,636 534,023
Software 28,930 5,652 731,138 242,411
TelevisionShow 24,784 10,169 565,136 282,594

2 Cross-Language Instance Overlap


Next we try to measure to what extent the multilingual DBpedia data set obtained by 'merging' the 22 language versions is actually multilingual.


In the table below, the Instances column contains the total number of instances per class across all languages. The 1 column contains the number of instances that are described in only one of the 22 languages, the 2 column contains the number of instances that are described in exactly 2 languages, and so on.
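The column layout corresponds to a histogram over the number of languages per instance. A minimal sketch, assuming a mapping from instance URI to the set of languages that describe it is already available (the function name `language_histogram` and the toy data are hypothetical):

```python
from collections import Counter


def language_histogram(instance_languages):
    """Count instances by the exact number of languages describing them.

    instance_languages: dict mapping instance URI -> set of language codes.
    Returns a Counter: number of languages -> number of instances.
    """
    return Counter(len(langs) for langs in instance_languages.values())


# Hypothetical toy data.
instance_languages = {
    "ex:A": {"en"},
    "ex:B": {"en", "de"},
    "ex:C": {"en", "de", "fr"},
    "ex:D": {"it"},
}
histogram = language_histogram(instance_languages)
print(sorted(histogram.items()))  # [(1, 2), (2, 1), (3, 1)]
```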


So, for example, 732 Persons are described in 12 languages but not in 13 or more. Summing the columns 2 to >20 for the Person class shows that 195,263 Persons are described in 2 or more languages. This might seem small compared to the total of 871,630 Persons, but recall from the second table on the Data Set Statistics page that there are 763,643 Persons in the English DBpedia and only 145,060 Persons in the second-largest one, the Italian DBpedia. Therefore, most of the 676,367 Persons described in just one language are very likely present in the English DBpedia only.
Class Instances 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 >20
Person 871,630 676,367 94,339 42,382 21,647 12,936 8,198 5,295 3,437 2,391 1,493 1,063 732 532 329 224 126 85 38 16 0
Artist 100,793 48,046 21,751 11,208 6,543 4,168 2,739 1,999 1,282 929 602 435 316 244 178 134 93 72 38 16 0
Actor 25,340 3,761 7,966 4,735 2,846 1,815 1,174 800 565 430 301 230 172 147 126 94 71 57 35 15 0
MusicalArtist 46,364 24,524 8,514 4,428 2,803 1,917 1,339 1,034 630 423 257 170 111 82 53 37 23 15 3 1 0
Athlete 217,067 142,131 35,432 17,662 8,724 5,113 2,971 1,884 1,116 706 459 309 240 173 87 40 16 4 0 0 0
Politician 41,126 20,439 8,165 4,269 2,461 1,712 1,292 807 631 455 301 236 138 97 56 42 16 9 0 0 0
Place 643,260 307,729 150,349 45,836 36,339 20,831 13,523 20,808 31,422 11,262 3,356 785 387 281 219 103 19 9 2 0 0
Building 65,355 56,357 5,895 1,771 718 294 146 85 41 26 8 12 1 1 0 0 0 0 0 0 0
Airport 11,675 7,845 2,294 863 414 142 72 26 14 5 0 0 0 0 0 0 0 0 0 0 0
PopulatedPlace 424,291 134,882 120,416 37,050 32,929 19,058 12,588 20,273 31,111 11,078 3,253 712 343 257 214 98 19 8 2 0 0
River 26,892 16,054 6,599 2,102 800 569 300 178 118 78 43 28 12 10 1 0 0 0 0 0 0
Organisation 206,670 160,398 22,661 9,312 5,002 3,221 2,072 1,421 928 594 399 268 150 101 64 42 21 13 2 1 0
Band 29,101 17,119 4,657 2,286 1,400 962 759 653 415 276 195 116 87 62 47 32 19 13 2 1 0
Company 48,989 35,806 6,405 2,836 1,442 990 671 332 210 101 63 51 24 29 17 10 2 0 0 0 0
Educ.Institution 43,250 38,802 2,539 939 463 218 126 72 48 21 13 6 1 2 0 0 0 0 0 0 0
Work 360,808 243,706 54,855 23,097 12,605 8,277 5,732 4,007 2,911 1,995 1,274 808 574 389 252 146 94 51 20 14 1
Book 44,339 29,406 6,601 2,905 1,700 1,179 838 620 415 234 133 92 61 55 37 26 13 15 6 3 0
Film 75,067 43,414 12,722 6,040 3,500 2,343 1,800 1,332 1,148 926 613 391 307 199 139 80 61 27 13 11 1
MusicalWork 160,383 111,897 24,453 9,479 5,074 3,375 2,213 1,492 968 601 363 213 128 66 36 16 6 2 1 0 0
Album 122,729 79,074 20,680 8,651 4,910 3,334 2,193 1,487 968 601 363 213 128 66 36 16 6 2 1 0 0
Single 42,393 28,707 6,497 2,963 1,590 1,114 712 434 194 104 40 28 9 1 0 0 0 0 0 0 0
Software 28,930 16,545 5,459 2,588 1,472 1,004 681 439 295 174 101 74 48 31 13 5 1 0 0 0 0
TelevisionShow 24,784 18,802 3,121 1,231 668 422 217 141 85 49 35 11 1 1 0 0 0 0 0 0 0
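The arithmetic for the Person row can be checked directly from the table. The values below are copied from that row; the variable names are illustrative only.

```python
# Person row of the table above: total, then columns 1, 2, ..., 19, >20.
person_total = 871_630
person_by_language_count = [
    676_367, 94_339, 42_382, 21_647, 12_936, 8_198, 5_295, 3_437, 2_391, 1_493,
    1_063, 732, 532, 329, 224, 126, 85, 38, 16, 0,
]

# Persons described in 2 or more languages: every column except the first.
multi_language = sum(person_by_language_count[1:])
assert multi_language == 195_263

# The columns partition the class, so they add up to the Instances column.
assert sum(person_by_language_count) == person_total

share = multi_language / person_total
print(f"{share:.1%} of Persons are described in 2 or more languages")
```

Under these figures, roughly 22% of all Persons in the merged data set are described in more than one language.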

3 Cross-Language Property Overlap


The table below contains statistics on the properties of the PopulatedPlace class that are frequently used in different languages. For each property we report the number of instances that have it in exactly 1, 2, 3, ..., 8 languages and in more than 8 (that is, 9 to 22) languages.

Property Total 1 2 3 4 5 6 7 8 >8
rdf:type 440,275 134,882 120,416 37,050 32,929 19,058 12,588 20,273 31,111 15,984
FOAF:name 422,707 144,738 116,480 36,420 34,634 23,515 34,326 20,250 6,866 2,739
DBP-ONT:country 389,964 246,787 81,367 10,617 11,061 7,079 32,957 85 11 0
WGS84:long 344,584 140,259 86,915 39,024 23,878 26,497 14,145 12,482 978 203
WGS84:lat 344,584 140,260 86,914 39,024 23,878 26,497 14,145 12,482 978 203
GEORSS:point 344,408 141,370 87,318 39,445 22,516 25,788 14,110 12,477 978 203
DBP-ONT:populationTotal 298,923 88,181 70,049 30,281 46,607 49,574 9,045 2,906 1,150 565
DBP-ONT:isPartOf 245,696 245,693 3 0 0 0 0 0 0 0
DBP-ONT:postalCode 219,424 84,962 23,887 23,649 12,494 11,163 13,859 13,697 21,749 6,982
DBP-ONT:areaCode 194,572 111,002 42,458 13,299 8,947 12,320 5,369 857 206 57
DBP-ONT:areaTotal 179,580 120,692 28,412 24,929 3,441 1,549 335 131 67 12
DBP-ONT:elevation 171,109 86,863 35,951 24,898 19,269 3,366 586 138 32 3
DBP-ONT:type 170,160 145,885 24,214 59 2 0 0 0 0 0
DBP-ONT:timeZone 165,985 125,102 39,585 1,201 85 11 1 0 0 0
DBP-ONT:utcOffset 156,417 154,845 1,540 32 0 0 0 0 0 0
DBP-ONT:area 128,269 68,785 20,655 8,519 30,280 30 0 0 0 0
DBP-ONT:administrativeDistrict 115,475 105,737 8,832 766 140 0 0 0 0 0
DBP-ONT:populationDensity 106,290 94,877 9,143 1,936 69 15 20 51 117 31
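A table of this shape could be derived by counting, for every (instance, property) pair, the number of languages that provide it and then grouping the high counts into one bucket. The sketch below is only an illustration of that idea: the function name `property_overlap`, the input layout, and the toy data are assumptions, not the procedure actually used for this page.

```python
from collections import Counter, defaultdict


def property_overlap(statements, cap=8):
    """Per property, count instances by the number of languages providing it.

    statements: iterable of (language, instance URI, property).
    Counts above `cap` are grouped into a single ">cap" bucket.
    """
    languages = defaultdict(set)
    for lang, instance, prop in statements:
        languages[(instance, prop)].add(lang)

    table = defaultdict(Counter)
    for (_, prop), langs in languages.items():
        n = len(langs)
        bucket = str(n) if n <= cap else f">{cap}"
        table[prop][bucket] += 1
    return table


# Hypothetical toy data.
statements = [
    ("en", "ex:Berlin", "DBP-ONT:country"),
    ("de", "ex:Berlin", "DBP-ONT:country"),
    ("en", "ex:Paris", "DBP-ONT:country"),
    ("en", "ex:Berlin", "DBP-ONT:elevation"),
]
overlap = property_overlap(statements)
for prop, buckets in overlap.items():
    print(prop, dict(buckets))
```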