Tim Berners-Lee, one of the Internet's pioneers, heralded the project as "one of the more famous pieces" of the decentralized Linked Data effort.[4] As of June 2021, DBpedia contained over 850 million triples.
Wikipedia articles consist mostly of free text, but also include structured information embedded in the articles, such as "infobox" tables (the pull-out panels that appear in the top right of the default view of many Wikipedia articles, or at the start of the mobile versions), categorization information, images, geo-coordinates and links to external Web pages. This structured information is extracted and put in a uniform dataset which can be queried.
Dataset
The 2016-04 release of the DBpedia data set describes 6.0 million entities, out of which 5.2 million are classified in a consistent ontology, including 1.5 million persons, 810,000 places, 135,000 music albums, 106,000 films, 20,000 video games, 275,000 organizations, 301,000 species and 5,000 diseases.[8] DBpedia uses the Resource Description Framework (RDF) to represent extracted information and consists of 9.5 billion RDF triples, of which 1.3 billion were extracted from the English edition of Wikipedia and 5.0 billion from other language editions.[8]
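To make the idea of an RDF triple concrete, the following minimal sketch formats subject, predicate and object statements as N-Triples lines. The helper names (`escape_literal`, `to_ntriple`) are hypothetical, and the two sample statements are chosen for illustration only; they are not taken from an actual DBpedia dump.

```python
def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks inside a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriple(subject: str, predicate: str, obj: str, is_iri: bool = True) -> str:
    """Format one (subject, predicate, object) statement as an N-Triples line."""
    obj_term = f"<{obj}>" if is_iri else f'"{escape_literal(obj)}"'
    return f"<{subject}> <{predicate}> {obj_term} ."


# Namespace prefixes used by DBpedia resources and its ontology.
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Illustrative statements only (assumed, not copied from a release).
triples = [
    to_ntriple(DBR + "Tokyo_Mew_Mew", DBO + "author", DBR + "Mia_Ikumi"),
    to_ntriple(DBR + "Mia_Ikumi", RDFS_LABEL, "Mia Ikumi", is_iri=False),
]

for line in triples:
    print(line)
```

Each printed line is one triple: the first links two resources, while the second attaches a plain text label to a resource.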
From this data set, information spread across multiple pages can be combined. For example, book authorship can be assembled from the page about the work and the page about its author.
One of the challenges in extracting information from Wikipedia is that the same concepts can be expressed using different parameters in infobox and other templates, such as |birthplace= and |placeofbirth=. Because of this, queries about where people were born would have to search for both of these properties in order to get more complete results. As a result, the DBpedia Mapping Language has been developed to help in mapping these properties to an ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.[9]
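The effect of such mappings can be sketched in a few lines of Python. This is only an illustration of the idea, not the DBpedia Mapping Language itself: the table `PARAMETER_TO_PROPERTY` and the function `normalize_infobox` are hypothetical, and the sample infobox values are made up.

```python
# Hypothetical mapping table: raw infobox parameter -> ontology property.
# Both spellings from the text map to the same property.
PARAMETER_TO_PROPERTY = {
    "birthplace": "dbo:birthPlace",
    "placeofbirth": "dbo:birthPlace",
}


def normalize_infobox(raw: dict[str, str]) -> dict[str, list[str]]:
    """Rewrite raw infobox parameters to ontology properties.

    Parameters without a mapping are skipped in this sketch.
    """
    normalized: dict[str, list[str]] = {}
    for parameter, value in raw.items():
        prop = PARAMETER_TO_PROPERTY.get(parameter.strip().lower())
        if prop is None:
            continue  # no mapping defined for this parameter
        normalized.setdefault(prop, []).append(value)
    return normalized


# Two made-up infoboxes that spell the same concept differently.
page_a = normalize_infobox({"birthplace": "Berlin"})
page_b = normalize_infobox({"placeofbirth": "Tokyo", "occupation": "Writer"})

print(page_a)
print(page_b)
```

After normalization, a query only needs to ask for one property to cover both pages, instead of searching for every synonym.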
Version 2014 was released in September 2014.[10] A main change since previous versions was the way abstract texts were extracted. Specifically, running a local mirror of Wikipedia and retrieving rendered abstracts from it made extracted texts considerably cleaner. Also, a new data set extracted from Wikimedia Commons was introduced.
As of June 2021, DBpedia contains over 850 million triples.[11]
Examples
DBpedia extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across multiple Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL.
For example, suppose one were interested in the Japanese shōjo manga series Tokyo Mew Mew and wanted to find the genres of other works written by its illustrator, Mia Ikumi. DBpedia combines information from Wikipedia's entries on Tokyo Mew Mew, Mia Ikumi, and this author's works such as Super Doll Licca-chan and Koi Cupid. Since DBpedia normalises information into a single database, a single query can be asked without needing to know exactly which entry carries each fragment of information, and it will list the related genres.
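A query of this kind might look like the sketch below, which embeds a SPARQL query in Python and builds a GET request URL for it without sending anything. Several details are assumptions rather than facts from the text: the endpoint address `https://dbpedia.org/sparql`, the property names `dbo:author` and `dbo:genre`, and the helper `build_request_url`.

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Assumed property names: dbo:author links a work to its creator,
# dbo:genre links a work to a genre.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT DISTINCT ?work ?genre WHERE {
  dbr:Tokyo_Mew_Mew dbo:author ?who .
  ?work dbo:author ?who .
  OPTIONAL { ?work dbo:genre ?genre }
}
"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL carrying the query in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


url = build_request_url(QUERY)
print(url)
```

The query first finds whoever is recorded as the author of Tokyo Mew Mew, then every work with the same author, and finally each work's genre where one is recorded. The resulting URL could be fetched with any HTTP client, typically with an `Accept` header asking for a SPARQL results format such as JSON.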
Such a rich source of structured cross-domain knowledge is fertile ground for artificial intelligence systems. DBpedia was used as one of the knowledge sources in IBM Watson's Jeopardy!-winning system.[22]
Data about creators from DBpedia can be used for enriching artworks' sales observations.[24]
The crowdsourcing software company Ushahidi built a prototype of its software that leveraged DBpedia to perform semantic annotations on citizen-generated reports. The prototype incorporated the "YODIE" (Yet another Open Data Information Extraction system) service[25] developed by the University of Sheffield, which uses DBpedia to perform the annotations. The goal for Ushahidi was to improve the speed and ease with which incoming reports could be validated and managed.[26]
DBpedia Spotlight
DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text. This allows linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight performs named entity extraction, including entity detection and name resolution (in other words, disambiguation). It can also be used for named entity recognition and other information extraction tasks. DBpedia Spotlight aims to be customizable for many use cases. Instead of focusing on a few entity types, the project strives to support the annotation of all 3.5 million entities and concepts from more than 320 classes in DBpedia. The project started in June 2010 at the Web Based Systems Group at the Free University of Berlin.
DBpedia Spotlight is publicly available as a web service for testing and as a Java/Scala API licensed via the Apache License. The DBpedia Spotlight distribution includes a jQuery plugin that allows developers to annotate pages anywhere on the Web by adding one line to their page.[27] Clients are also available in Java or PHP.[28] The tool handles various languages through its demo page[29] and web services. Internationalization is supported for any language that has a Wikipedia edition.[30]
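As a rough illustration of calling the web service, the sketch below builds an annotation request URL without sending it. The endpoint address `https://api.dbpedia-spotlight.org/en/annotate` and the `text` and `confidence` parameters are assumptions about the public demo service, and `build_annotate_url` is a hypothetical helper; the service's own documentation should be checked before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed address of the public English annotation endpoint.
SPOTLIGHT_ANNOTATE = "https://api.dbpedia-spotlight.org/en/annotate"


def build_annotate_url(
    text: str,
    confidence: float = 0.5,
    base: str = SPOTLIGHT_ANNOTATE,
) -> str:
    """Return a GET URL asking the service to annotate `text`.

    `confidence` is assumed to be a threshold between 0 and 1.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return f"{base}?{urlencode({'text': text, 'confidence': confidence})}"


request_url = build_annotate_url("Berlin is in Germany")
print(request_url)
```

A client would send this URL with an `Accept` header naming the desired response format and then read the linked DBpedia resources from the reply.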
Archivo ontology database
From 2020, the DBpedia project provides a regularly updated database of web‑accessible ontologies written in the OWL ontology language.[31] Archivo also provides a four-star rating scheme for the ontologies it scrapes, based on accessibility, quality, and related fitness‑for‑use criteria. For instance, SHACL compliance for graph‑based data is evaluated when appropriate. Ontologies should also contain metadata about their characteristics and specify a public license describing their terms‑of‑use.[32][33] As of June 2021, the Archivo database contains 1368 entries.
History
DBpedia was initiated in 2007 by Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak and Zachary Ives.[5]
"Life in the Linked Data Cloud". opencalais.com. Archived from the original on 24 November 2009. Retrieved 10 November 2009. Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format.
"Zemanta talks Linked Data with SDK and commercial API". ZDNet. Archived from the original on 28 February 2010. Retrieved 10 November 2009. Zemanta fully supports the Linking Open Data initiative. It is the first API that returns disambiguated entities linked to dbPedia, Freebase, MusicBrainz, and Semantic Crunchbase.
"BBC Learning - Open Lab - Reference". BBC. Archived from the original on 25 August 2009. Retrieved 10 November 2009. Dbpedia is a database version of Wikipedia. It is used in a lot of projects for a wide range of different reasons. At the BBC we are using it for tagging content.
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. "Building Watson: An Overview of the DeepQA Project." Archived 6 November 2020 at the Wayback Machine. In AI Magazine, Fall 2010. Association for the Advancement of Artificial Intelligence (AAAI).
Filipiak, Dominik; Filipowska, Agata (2 December 2015). "DBpedia in the Art Market". Business Information Systems Workshops. Lecture Notes in Business Information Processing. Vol. 228. pp. 321–331. doi:10.1007/978-3-319-26762-3_28. ISBN 978-3-319-26761-6.
Frey, Johannes; Streitmatter, Denis; Götz, Fabian; Hellmann, Sebastian; Arndt, Natanael (27 October 2020). "DBpedia Archivo: a web-scale interface for ontology archiving under consumer-oriented aspects". In Sure-Vetter, York; Sack, Harald; Cudré-Mauroux, Philippe; Maleshkova, Maria; Pellegrini, Tassilo; Acosta, Maribel (eds.). Semantic systems: the power of AI and knowledge graphs. Cham, Switzerland: Springer. doi:10.1007/978-3-030-59833-4_2. ISBN 978-3-030-59832-7. S2CID 219939266.