Getting Started
Guide for Internationalization Developers
General information about the extraction framework is in the main documentation. The procedure is exactly the same, but you will have to change some configuration files for better results. Most internationalization (I18n) configuration options are in the core module under org.dbpedia.extraction.config.
Questions may be asked on the DBpedia developers list.
You are also encouraged to read the DBpedia I18n paper before proceeding; all issues on this page are discussed there:
http://dx.doi.org/10.1016/j.websem.2012.01.001 /
http://svn.aksw.org/papers/2011/DBpedia_I18n/public.pdf
Encoding / resource namespace / titles
We encourage you to use xx.dbpedia.org as the namespace of your localized extraction of language xx.
However, if you want, you can choose the generic domain name dbpedia.org instead of the default xx.dbpedia.org.
The option (for now) is in the following file:
A setting in dump/extract.properties selects whether resource identifiers are serialized as URIs, IRIs, or both. For example, with the following settings, files with the suffix iri.nt (containing IRIs in N-Triples format) and uri.nq (containing URIs in N-Quads format) are written.
Currently, these format combinations are available: iri or uri, followed by a dot and one of nt (N-Triples), nq (N-Quads), ttl (Turtle), tql (Turtle Quads – N-Quads with Turtle encoding), triples.trix, quads.trix.
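As a rough illustration of the combinations listed above, the following Python sketch enumerates every possible suffix. The names POLICIES, SERIALIZATIONS and format_suffixes are hypothetical and are not part of the extraction framework; only the suffix components themselves come from the list above.

```python
from itertools import product

# Components taken from the list above; the variable names are illustrative only.
POLICIES = ["iri", "uri"]
SERIALIZATIONS = ["nt", "nq", "ttl", "tql", "triples.trix", "quads.trix"]


def format_suffixes():
    """Return every '<policy>.<serialization>' suffix, e.g. 'iri.nt'."""
    return [f"{policy}.{fmt}" for policy, fmt in product(POLICIES, SERIALIZATIONS)]


if __name__ == "__main__":
    for suffix in format_suffixes():
        print(suffix)
```

The example from the previous paragraph (iri.nt and uri.nq) corresponds to two entries of this list.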
Extractor / Parser tuning
Some extractors and parsers are language-sensitive, and you need to set up language-specific options for them (in org.dbpedia.extraction.config) to work:
- Disambiguation Extractor
- Homepage Extractor
- Image Extractor
- Infobox Extractor
- Inter Language Links Extractor
- Template Parameter Extractor
- Date Time Parser
- Duration Parser
- Flag Template Parser
- Unit Value Parser
For a list of the extractors that you should use for your language, see:
http://mappings.dbpedia.org/index.php/DBpedia_datasets
Interlinking
In order to create links to the English DBpedia and to the LOD Cloud, you will have to run some scripts. Go to scripts/shell-script and run the interwiki links script first, then the interlinking script.
Interwiki creates owl:sameAs links between two Wikipedias / DBpedias. It uses the Inter Language Links Extractor but removes all one-way links: it has been shown that one-way links (<10%) are responsible for >90% of errors in article linking (see DBpedia I18n paper, sec. 5.1).
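The idea of dropping one-way links can be sketched in a few lines of Python. This is not the interwiki script itself, only a minimal illustration of the filtering step; the function names and the sample links are made up for the example.

```python
def keep_two_way_links(a_to_b, b_to_a):
    """Return the (a, b) pairs from a_to_b that are confirmed by a (b, a) back link."""
    back_links = set(b_to_a)
    return sorted(pair for pair in set(a_to_b) if (pair[1], pair[0]) in back_links)


def to_same_as_triple(subject, obj):
    """Serialize one link as an owl:sameAs statement in N-Triples."""
    return f"<{subject}> <http://www.w3.org/2002/07/owl#sameAs> <{obj}> ."


if __name__ == "__main__":
    # Made-up sample links between a Greek and an English article set.
    el_to_en = [
        ("http://el.dbpedia.org/resource/Αθήνα", "http://dbpedia.org/resource/Athens"),
        ("http://el.dbpedia.org/resource/Ρώμη", "http://dbpedia.org/resource/Rome_(disambiguation)"),
    ]
    en_to_el = [
        ("http://dbpedia.org/resource/Athens", "http://el.dbpedia.org/resource/Αθήνα"),
    ]
    # Only the Athens link is confirmed in both directions, so only it is printed.
    for a, b in keep_two_way_links(el_to_en, en_to_el):
        print(to_same_as_triple(a, b))
```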
The interlinking script takes the owl:sameAs links (the output of the previous script), downloads all the databases linking DBpedia to other data sources, and filters them down to the common triples.
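One plausible reading of "filtering down to the common triples" is sketched below in Python. It is an assumption-laden illustration, not the script's actual logic: it assumes the sameAs links are available as a mapping from English resources to localized ones, and that each kept link is re-attached to the localized resource. The function name and data layout are hypothetical.

```python
def filter_common_triples(same_as, external_links):
    """Keep external links whose subject has a localized counterpart.

    same_as: dict mapping an English DBpedia resource to the localized resource.
    external_links: iterable of (subject, predicate, object) triples whose
    subject is an English DBpedia resource.
    """
    common = []
    for subject, predicate, obj in external_links:
        localized = same_as.get(subject)
        if localized is not None:
            # Assumption: the link is re-attached to the localized resource.
            common.append((localized, predicate, obj))
    return common
```

Triples whose subject has no sameAs counterpart in the localized chapter are simply dropped.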
Loading triples
You can use any triple store of your choice. So far, most i18n chapters use Virtuoso. Instructions on how to load your triples into a Virtuoso triple store are available at:
Step-by-step:
http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtAWSDBpedia351C
http://www.openlinksw.com/blog/~kidehen/?id=1654
How Do I?
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/#How%20Do%20I...
Guide by DBpedia Polish:
http://translate.google.com/translate?sl=pl&tl=en&js=n&prev=_t&hl=en&ie=...
General Virtuoso instructions:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFInsert
Example loading script (populate.sql):
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia_vad/
There is a script made by DBpedia Greece to clear and reload datasets:
http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia_vad/file/1f3041fb38ec/populate.sql
TODO: describe solutions for the loading errors with IRIs
Dereferenceable IRIs / URIs
After installing the Virtuoso server, execute the following statements to adjust Virtuoso registry values:
The default DBpedia plug-in was changed to accept these variables as parameters. Note that the HTTP protocol only accepts URIs, so an encoding/decoding strategy was implemented to dereference IRIs (just set 'dbp_decode_iri' to 'on'). All TCN (Transparent Content Negotiation) rules have been implemented (see DBpedia I18n paper, sec. 6).
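To make the IRI/URI distinction concrete, here is a minimal Python sketch of percent-encoding an IRI into a URI and decoding it back. It only illustrates the general idea and is not the plug-in's actual code; the function names and the el.dbpedia.org example resource are chosen for illustration.

```python
from urllib.parse import quote, unquote

# Characters that are left untouched: URI delimiters plus '%', so that
# already percent-encoded input is not encoded a second time.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters as UTF-8 bytes."""
    return quote(iri, safe=_SAFE)


def uri_to_iri(uri):
    """Decode percent-escapes back to Unicode characters.

    Simplification: this also decodes escaped reserved characters such as
    %2F, which a strict converter would leave encoded.
    """
    return unquote(uri)


if __name__ == "__main__":
    iri = "http://el.dbpedia.org/resource/Αθήνα"
    uri = iri_to_uri(iri)
    print(uri)
    print(uri_to_iri(uri))
```

The encoded form is what travels over HTTP; the decoded form is what appears in the iri.* dump files.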
When the server is accessible through its official address, say xx.dbpedia.org, where xx stands for a language code, the virtual host must be declared. To do so, open the Conductor interface at http://xx.dbpedia.org/conductor, open the tab Web Application Server, then the subtab Virtual Domains & Directories. Define a host xx.dbpedia.org on port 80 and interface 0.0.0.0. If you already installed the DBpedia VAD, you will need to uninstall and reinstall it to configure this host as well.
Download the DBpedia VAD and install it: vad_install('[path/to/dbpedia_dav.vad]', 0); (the file dbpedia_dav.vad must be stored in a folder listed in the entry Dirs Allowed of the file virtuoso.ini).
Note (1): the above VAD file is now forked at http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia_vad/ and you are encouraged to use the fork instead and to report back any bugs.
Note (2): You must set the registry values above before installing this VAD!
If you need to change the values, first uninstall the DBpedia VAD: vad_uninstall('dbpedia/[version]');.
The latest version is 1.3.25; you can check which one you installed by entering vad_list_packages ();.
To check the registry values, use select registry_get([entry_key]);, for instance select registry_get('dbp_website');
The Virtuoso server should now be configured properly and you should see something like this.
If you get an empty page with code 404 (check with curl -I <url>), you probably need to set Dynamic Local = 0 in the [URIQA] section of your virtuoso.ini configuration file.
To 'resolve' namespaces like it.dbpedia.org/property to the dbprob prefix, add an entry in the Virtuoso Conductor under the Linked Data / Namespaces tab.
Setting up Apache
Some chapters keep different services on different machines, or everything on one machine alongside software for content management (e.g. Drupal), project management (e.g. TRAC), etc. Some of us chose to use Apache to handle redirects to the different services.
Here is an example virtual host configuration (make sure you have mod_rewrite enabled). In this example, 10.0.0.2 is the IP address of the main server, and two other servers are located at dataserver.example.com and anothermachine.example.com.
RewriteRule may cause problems with URIs containing non-ASCII characters. In that case, make sure that mod_proxy is enabled and use ProxyPass instead. For example:
Setting up your i18n chapter
If you want to set up a new DBpedia chapter, you first have to download the extraction framework and look at the i18n-specific changes described on this page.
When you are done, you can set up your chapter. Please perform the following steps:
1) Add your URLs to the table of chapters: http://wiki.dbpedia.org/Internationalization/Chapters
2) Add your name to the contacts: http://dbpedia.org/Internationalization
3) Send dbpedia-developers a pull request for the modifications you've made to the code
4) Add to the wiki any instructions that were missing when you started, adding the things that you had to figure out yourself
5) Set up a landing page acknowledging the mapping editors and other people that supported the creation of your chapter.
6) Give us the IPs for redirection/forwarding to your subdomain (e.g. es.dbpedia.org)
7) Make sure you've spelled DBpedia correctly (it is not DBPedia or dbPedia, it is DBpedia)
8) We recommend naming your chapter DBpedia <Language Name>. Examples: DBpedia Portuguese, DBpedia Italiano, DBpédia en français. Avoid using country names or nationalities, especially in cases where the language is spoken in multiple countries. Prefer less esoteric names: Italiano may sound better than Italophone.
More help
See what other chapters have done and ask for help on the list:
Last Modification: 2012-05-15 16:55:46 by Pablo Mendes