DBpedia Latest Core Releases

About

The DBpedia Latest Core Release is the small subset of the total DBpedia releases that we load into the main DBpedia SPARQL endpoint, the Linked Data Interface and Lookup Search. Latest Core is also called Tiny Diamond and is the smallest Knowledge Graph released by DBpedia. Legacy note: Latest Core is the refurbished equivalent of the “core” folder used for releases before 2020 (e.g. DBpedia 2016-10).

In a nutshell, the dataset:

  • contains factual data from articles and infoboxes of the English Wikipedia Language Edition (WPLE)
  • is enriched with labels and abstracts from the largest Wikipedia Language Editions
  • is enriched with links
  • is enriched with rdf:type statements to several ontologies
  • contains approx. 900 million triples (Jan 2021) and is steadily growing. In the last three years, infoboxes across all WPLEs grew by 150%, and the English WPLE doubled in size (p. 8).

Databus Collection: https://databus.dbpedia.org/dbpedia/collections/latest-core

Contribution and Improvement

Running the extraction each month follows the paradigm “Extraction as a Platform” (EaaP). Any change that reaches the master branch of the DBpedia Information Extraction Framework (DIEF) takes effect in the next release and improves DBpedia. Please submit data issues you encounter via the GitHub issue tracker. The debugging process consists of writing data tests (SHACL and others), fixing the software or mappings, and merging into master. General topics (e.g. new extractors) are discussed in the DBpedia Forum.

Release Cycle and Timeliness of Data

MARVIN Release Bot (>15 days delay)

Extraction is fully automated and run by our release bot MARVIN via cronjob (see MARVIN-config). The results are registered on the Databus under the MARVIN agent, then cleaned and registered again under the DBpedia organisation. The normal cycle is as follows: on the 1st of each month, Wikimedia starts preparing the Wikipedia dumps, which become available around the 10th. MARVIN then downloads and processes them (5-10 days), which means that the release of the month is normally available on the 15th of each month.
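The schedule above can be turned into a rough date window. The following is a minimal sketch that only restates the numbers from the paragraph (dumps available around the 10th, 5-10 days of processing); the function name `release_window` and its parameters are hypothetical and not part of any DBpedia tooling.

```python
from datetime import date, timedelta
from typing import Tuple


def release_window(
    year: int,
    month: int,
    dumps_ready_day: int = 10,      # "available around the 10th"
    min_processing_days: int = 5,   # lower bound of "5-10 days"
    max_processing_days: int = 10,  # upper bound of "5-10 days"
) -> Tuple[date, date]:
    """Return the (earliest, latest) expected release date for a given month."""
    dumps_ready = date(year, month, dumps_ready_day)
    earliest = dumps_ready + timedelta(days=min_processing_days)
    latest = dumps_ready + timedelta(days=max_processing_days)
    return earliest, latest


# Example: the January 2021 cycle.
earliest, latest = release_window(2021, 1)
print(earliest.isoformat(), latest.isoformat())
```

Under these assumptions, the 15th corresponds to the fast end of the processing range; a slower run would land a few days later.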

Note: the MARVIN Release Bot has many external dependencies and can break; in that case, the release is normally skipped. A dashboard is available to track progress.

DBpedia Live (DBpedia in realtime)

DBpedia Live continuously processes all Wikipedia edits in real time (with a delay of only a few minutes). Please check out the DBpedia Live page.

Online Access and Querying

At frequent intervals, we update all DBpedia services based on Latest Core. They are hosted for free (best effort, with around 10 million API hits per day):

Fork the DBpedia services and set up your own local mirror:

Download the Latest Core Collection

The collection contains a small but useful subset of the datasets from the DBpedia extractions. Moreover, this subset is loaded into the main DBpedia SPARQL endpoint. With the Databus Latest-Core Collection, it is easy to fetch a fresh, custom-tailored selection of DBpedia files for a specific use case (e.g. a custom list of languages).

(Manually) Go to https://databus.dbpedia.org/dbpedia/collections/latest-core and click on the individual download links.

(Automatically) Retrieve the Latest Core SPARQL query:

  • Visit the collection page and click on Actions > Copy Query to Clipboard
  • or run curl https://databus.dbpedia.org/dbpedia/collections/latest-core -H "accept: text/sparql" > query.sparql
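For scripted workflows, the curl call above can be mirrored in Python. This is a minimal sketch using only the standard library; the helper names `build_query_request` and `fetch_collection_query` are hypothetical, and the only behaviour assumed is what the curl example already shows (a GET request with an `Accept: text/sparql` header).

```python
import urllib.request

# Collection URL as given in this document.
COLLECTION_URL = "https://databus.dbpedia.org/dbpedia/collections/latest-core"


def build_query_request(collection_url: str = COLLECTION_URL) -> urllib.request.Request:
    """Build a GET request asking the collection page for its SPARQL query."""
    return urllib.request.Request(collection_url, headers={"Accept": "text/sparql"})


def fetch_collection_query(collection_url: str = COLLECTION_URL, timeout: float = 30.0) -> str:
    """Download the collection's SPARQL query as text (performs a network call)."""
    request = build_query_request(collection_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires network access):
#     with open("query.sparql", "w", encoding="utf-8") as handle:
#         handle.write(fetch_collection_query())
```

Separating request construction from the network call keeps the header logic easy to check without contacting the Databus.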

Select one of the following options:

  • Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)
    curl -X POST --data-urlencode query@query.sparql -d format=text/tab-separated-values https://databus.dbpedia.org/repo/sparql
    The query returns a list of download links, which can be retrieved with wget.
  • Give the query to the Databus Client. The Client provides additional options for compression and format conversion, so you don’t need to do it manually.
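The first option can also be sketched in Python. The example below sends the query as a form-encoded POST, as in the curl command, and extracts the download links from the tab-separated result. It assumes that the response has one header row and that the first column holds the link (possibly wrapped in quotes or angle brackets); the helper names are hypothetical, and this is not the Databus Client.

```python
import csv
import io
import os
import urllib.parse
import urllib.request
from typing import List

# Endpoint as given in this document.
DATABUS_SPARQL = "https://databus.dbpedia.org/repo/sparql"


def build_post_request(query: str, endpoint: str = DATABUS_SPARQL) -> urllib.request.Request:
    """Build a form-encoded POST request carrying the query and the TSV format."""
    body = urllib.parse.urlencode(
        {"query": query, "format": "text/tab-separated-values"}
    ).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, method="POST")


def parse_download_links(tsv_text: str) -> List[str]:
    """Return the first column of a TSV result, skipping the header row (assumed)."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(rows, None)  # skip the header row
    return [row[0].strip('"<>') for row in rows if row and row[0]]


def download(link: str, target_dir: str = ".") -> str:
    """Download one file into target_dir and return the local path (network call)."""
    filename = os.path.basename(urllib.parse.urlparse(link).path)
    target = os.path.join(target_dir, filename)
    urllib.request.urlretrieve(link, target)
    return target


# Usage (requires network access and a previously saved query.sparql):
#     with open("query.sparql", encoding="utf-8") as handle:
#         request = build_post_request(handle.read())
#     with urllib.request.urlopen(request, timeout=60) as response:
#         links = parse_download_links(response.read().decode("utf-8"))
#     for link in links:
#         download(link)
```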

DBpedia Extraction Groups

The New DBpedia Release Cycle follows an Extraction as a Platform (EaaP) approach. At regular intervals (normally each month), the DBpedia extraction framework is run automatically over the Wikipedia (all languages) and Wikidata dumps to extract around 5000 RDF files, packaged in 50 artifacts and 4 high-level groups: Generic (using generic parsers and language-specific RDF properties), Mappings (using editable ontology mappings from mappings.dbpedia.org), Text (abstract and article full-text extraction), and Wikidata (mapped and cleaned Wikidata data using the DBpedia Ontology). A full description of the release cycle can be found in Hofer et al., The New DBpedia Release Cycle: Increasing Agility and Efficiency in Knowledge Extraction Workflows, SEMANTiCS 2020.
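To see how a list of files spreads over the four groups, the file URLs can be bucketed by group name. This is a minimal sketch that assumes the group appears as a lowercase path segment (`generic`, `mappings`, `text`, `wikidata`) somewhere in each URL; the example URLs and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse

# Assumed lowercase path segments for the four extraction groups.
EXTRACTION_GROUPS = ("generic", "mappings", "text", "wikidata")


def extraction_group(url: str) -> str:
    """Return the first path segment that names a known group, else 'other'."""
    for segment in urlparse(url).path.split("/"):
        if segment in EXTRACTION_GROUPS:
            return segment
    return "other"


def group_files(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket file URLs by extraction group, preserving input order per group."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        grouped[extraction_group(url)].append(url)
    return dict(grouped)


# Hypothetical example URLs, for illustration only.
example_urls = [
    "https://example.org/dbpedia/mappings/instance-types/2021.01.01/instance-types_lang=en.ttl.bz2",
    "https://example.org/dbpedia/text/short-abstracts/2021.01.01/short-abstracts_lang=en.ttl.bz2",
    "https://example.org/misc/readme.txt",
]
for group, files in group_files(example_urls).items():
    print(group, len(files))
```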

The Databus Latest-Core Collection 

Databus Collections can be seen as customizable, dynamic shopping carts for data (files). The collection link for the Latest Core files is https://databus.dbpedia.org/dbpedia/collections/latest-core . This collection updates automatically and always refers to the latest available files. Only a small part of the data from the DBpedia Extraction Groups (approx. 100 of 4000 files, or 2.5%) is selected in the latest-core collection. If you would like to customize it, we advise creating your own Databus collection: 1. register/login 2. go to the collection and click “Action” -> “Edit Copy”

Improving

Documentation and Statistics

Current issues

  • sameAs links to other DBpedia Chapters, e.g. de.dbpedia.org (in progress)
  • rdfs:label/comment/dbo:abstract are only available in English; this was en + 19 languages and could be up to 140 languages (in progress) (chapters: ar, ca, cs, de, el, eo, es, eu, fr, ga, id, it, ja, ko, nl, pl, pt, sv, uk) (additional: ru, zh)
  • http://purl.org/linguistics/gold/hypernym , 4 million relations missing
  • sdtypes from Mannheim need to be checked
  • Yago types and links are missing (in progress)
  • ImageExtractor was malfunctioning and disabled, i.e. only images from infoboxes are extracted, no clean licenses. (Will be fixed with https://databus.dbpedia.org/dbpedia/wikidata/images/)
  • sameAs links to external Linked Data sites are currently not updated (in progress, we are centralizing this with Global ID management)
  • Umbel in store, but not in Databus collection, loaded from https://github.com/structureddynamics/UMBEL/blob/master/External%20Ontol…

What the Future Holds

  • Fused data: We already created several tests for a fused dataset of dbo properties. This dataset enriches the English version with Wikidata and dbo properties from over 20 Wikipedia languages, resulting in a denser graph.
  • Community extensions such as caligraph.org or https://ner.vse.cz/datasets/linkedhypernyms/ can now be streamlined, contributed more easily via the Databus, and routed to the main endpoint and chapter knowledge graphs.
  • Links, Mappings, Ontologies: A special focus of DBpedia will be to take the role of a custodian for links, mappings, ontologies on the web of data and make these easier to contribute and more centrally available.