DBpedia Blog

New DBpedia Usage Report for July 2017 through January 2021

Summary

Our partner OpenLink Software recently published a new DBpedia usage report on the SPARQL endpoint and associated Linked Data deployment.

Copyright © 2021 OpenLink Software

Introduction

This document presents usage statistics for the DBpedia 2016-10 dataset, collected between July 2017 and January 2021. They span more than three and a half years of logs from the DBpedia web service operated by OpenLink Software at http://dbpedia.org/sparql/.

The log files used to prepare this document cover the DBpedia 2016-10 release.

Infrastructure

The DBpedia service consists of:

  • two or more Virtuoso Universal Server instances, which handle Linked Data deployment and provide a SPARQL endpoint that delivers RDF data in a variety of document formats, selected by content negotiation
  • a reverse proxy server, which redirects client requests to an available Virtuoso instance and caches the results in case another client repeats the same request within a specified timeframe
  • a physical computer hosted in OpenLink Software’s datacenter

Currently the DBpedia service is hosted on two virtual machines running CentOS 6, each using 8 Intel Xeon E5-2630 2.30 GHz cores with 200 GB SSD and 64 GB memory, hosting Virtuoso 7.2 Enterprise Edition with the Column Store Module.

Rate and Connection limits

To maintain equitable access to the DBpedia service for everyone, OpenLink Software limits both the request rate and the number of concurrent connections per IP address, which reduces disruption caused by faulty or misbehaving applications.

The current limits are:

  • Connection limit of 50 parallel connections per IP address. This limit is fairly high to accommodate multiple clients behind Network Address Translation (NAT), which appear to the service as a single IP address. Without tracking cookies, it is impossible to distinguish between machines inside a NAT network, and for privacy and legal reasons, OpenLink Software has decided not to use such cookies at this point in time.
  • Rate limit of 100 requests per second per IP address, with an initial burst of 120 requests.
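
A client that wants to stay under these limits can throttle itself before sending requests. The following is a minimal client-side sketch, assuming a token-bucket model with the figures quoted above (100 requests per second, burst of 120). The class name `TokenBucket` and its parameters are illustrative, and nothing here describes how OpenLink Software enforces the limits on the server side.

```python
import time


class TokenBucket:
    """Client-side throttle: at most `rate` requests/second after an initial burst."""

    def __init__(self, rate=100.0, burst=120, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = burst         # maximum tokens that can accumulate
        self.tokens = float(burst)    # start full, which allows the initial burst
        self.clock = clock            # injectable clock, handy for testing
        self.last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False if the caller should wait."""
        now = self.clock()
        # Refill in proportion to the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Calling `try_acquire()` before each request, and pausing briefly whenever it returns `False`, keeps a single client below the stated rate. Clients that share one NAT address would still need to coordinate among themselves.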

As part of monitoring the DBpedia service, OpenLink Software performs frequent traffic analysis to make sure the service is running smoothly.

Ideally, applications should be written to check the HTTP status code of each request, and in case of a 503 (Service Unavailable) or 429 (Too Many Requests) code, perform a 1–2 second sleep before retrying the request.
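
The retry behaviour described above could look roughly like the following sketch. The helper names `http_get` and `get_with_retry`, the 1.5 second pause, the five-attempt cap, and the 150 second socket timeout are assumptions chosen for illustration; only the status codes and the 1–2 second sleep come from the recommendation itself.

```python
import time
import urllib.error
import urllib.request

RETRY_STATUSES = {429, 503}  # Too Many Requests, Service Unavailable


def http_get(url):
    """Send a GET request and return (status, headers, body) without raising on HTTP errors."""
    try:
        # The 150 s timeout is an assumption, picked to exceed the server's query timeout.
        with urllib.request.urlopen(url, timeout=150) as resp:
            return resp.status, dict(resp.headers.items()), resp.read()
    except urllib.error.HTTPError as err:
        return err.code, dict(err.headers.items()), err.read()


def get_with_retry(url, send=http_get, max_attempts=5, pause=1.5, sleep=time.sleep):
    """Retry on 429/503, sleeping `pause` seconds between attempts."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send(url)
        if status not in RETRY_STATUSES:
            return status, headers, body
        if attempt < max_attempts:
            sleep(pause)
    # Still throttled after the last attempt: hand the final response to the caller.
    return status, headers, body
```

The `send` and `sleep` parameters exist only so the function can be exercised without network access. Capping the number of attempts avoids hammering the service when it is unavailable for longer periods.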

OpenLink Software may alter these parameters at any time to make sure the service remains reachable to the general public.

In case of misuse, OpenLink Software may temporarily block an offender’s IP address from accessing the DBpedia service. This temporary ban will be automatically lifted once such a blocked IP address refrains from making any request to the DBpedia service for at least 5 minutes.

Configured Virtuoso limits on the DBpedia endpoint

The Virtuoso configuration for the DBpedia endpoint includes:

  • Query Execution Timeout of 120 seconds. This is the query solution preparation threshold. If the timeout stops execution before the solution is complete — i.e., if the solution is partial — this is indicated to the query client via HTTP response headers.
  • Maximum SPARQL query solution (aka result set) size of 10,000 rows. This is the maximum number of solution rows (for SELECT queries) or triple/quad statements (for CONSTRUCT or DESCRIBE queries) returned per query-solution-retrieval round-trip.

Virtuoso “Anytime Query” Functionality

Anytime Query is a core feature of Virtuoso that addresses the challenges of providing a publicly accessible interface for ad-hoc querying at Web scale. It allows an application that complies with the SPARQL and HTTP protocols to issue long-running or large-solution queries whose complete solution would exceed the configured query timeout or result set limits. Rather than being rebuffed with no solution, the application receives a partial solution that conforms to those thresholds. This feature also enables the use of LIMIT and OFFSET (typically combined with ORDER BY and/or GROUP BY) to create windows (also known as sliding windows or cursors) for iterating through the complete query solution without being adversely affected by inserts or deletions.

Note: Even while paging through a partial query solution, Virtuoso continues to work towards a complete solution in the background.
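
The LIMIT/OFFSET windowing mentioned above can be sketched on the client side as follows. This is an illustration only: `run_query` is a hypothetical callable that sends the SPARQL text to an endpoint and returns a list of rows, and the default page size of 10,000 simply mirrors the configured maximum result set size. The base query is assumed to carry its own ORDER BY so that successive windows do not overlap.

```python
def page_query(base_query, page_size, page_index):
    """Append a LIMIT/OFFSET window to a query that already has an ORDER BY clause."""
    return f"{base_query.rstrip()}\nLIMIT {page_size} OFFSET {page_index * page_size}"


def fetch_all(base_query, run_query, page_size=10000):
    """Collect every row by requesting consecutive windows until a short page arrives."""
    rows = []
    page_index = 0
    while True:
        page = run_query(page_query(base_query, page_size, page_index))
        rows.extend(page)
        if len(page) < page_size:
            # A page shorter than the window is treated as the last one.
            return rows
        page_index += 1
```

A page can also come back short because the query timed out, so a careful client would inspect the response headers before treating a short page as the end of the solution.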

Custom HTTP headers

As the W3C SPARQL standard currently does not specify an authoritative status code or header response to report a partial result set, OpenLink Software has opted to have Virtuoso return a status code of 200 to denote a successful request and add a custom header to the result to indicate that the result was limited to what could be returned within the settings enforced by the server.

If full execution of the query would return more than the configured maximum number of rows, the X-SPARQL-MaxRows line is added, as shown below:

HTTP/1.1 200 OK
Date: Tue, 1 Jan 2018 12:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 1427536
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/07.20.3224 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SPARQL-MaxRows: 10000
Expires: Tue, 07 Jan 2018 12:00:00 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes

If the Anytime Query timeout is reached, several headers are added:

HTTP/1.1 200 OK
Date: Tue, 01 Jan 2018 12:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 80
Connection: keep-alive
Server: Virtuoso/07.20.3224 (Linux) i686-generic-linux-glibc212-64 VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 7 rnd 64.87M seq 0 same seg 1 same pg 0 same par 0 disk 0 spec disk 0B / 0 mess
X-Exec-Milliseconds: 30000
X-Exec-DB-Activity: 7 rnd 64.87M seq 0 same seg 1 same pg 0 same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Expires: Tue, 07 Jan 2018 12:00:00 GMT
Cache-Control: max-age=604800
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding
Accept-Ranges: bytes
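
A client can check for these headers to tell a complete answer from a partial one. The sketch below assumes that headers are available as a plain dictionary and that `X-SPARQL-MaxRows` and `X-SQL-State: S1TAT` appear only in the two situations shown above. The function name `partial_result_reason` and its return values are illustrative.

```python
def partial_result_reason(headers):
    """Return 'timeout', 'max_rows', or None, based on Virtuoso's custom response headers."""
    # HTTP header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-sql-state", "").strip() == "S1TAT":
        return "timeout"    # Anytime Query timeout: results are incomplete
    if "x-sparql-maxrows" in lowered:
        return "max_rows"   # result set was cut off at the configured row limit
    return None             # no sign of truncation


# Example with header values taken from the responses shown above.
print(partial_result_reason({"X-SPARQL-MaxRows": "10000"}))
print(partial_result_reason({"X-SQL-State": "S1TAT", "X-Exec-Milliseconds": "30000"}))
```

Because both cases arrive with status code 200, checking the status alone is not enough to detect a truncated result.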

Hosting Independent DBpedia Instances

The restrictions described above may impair some complex analytical queries. Users who frequently encounter these limits are advised to use one of the following methods:

HTTP logs

The HTTP server log files used in this report exclude traffic generated by:

  • IP addresses that were temporarily rate-limited after their burst period
  • IP addresses that were banned after misuse
  • applications, spiders, and other crawlers that were blocked after frequently hitting the rate-limiter or which generally claimed too many resources

The system uses a combination of firewall rules and Access Control Lists (ACLs) to quickly drop such connections, so legitimate users of the DBpedia service can continue to connect and execute queries.

To save time, these dropped connections are not recorded in the log files.

The data for this document was extracted from reports generated by Webalizer v2.21.

HTTP Usage Historical Overview

The table below shows the average number of Visits and Hits per day during the time each DBpedia dataset was live on the http://dbpedia.org/sparql endpoint.

DBpedia | From       | Until      | Days  | Visits per day | Hits per day | Total Hits
3.3     | 2009-06-30 | 2009-11-05 | 128   | 9,602          | 733,811      | 94,661,592
3.4     | 2009-11-06 | 2010-04-07 | 152   | 11,100         | 1,212,549    | 185,519,930
3.5     | 2010-04-08 | 2011-01-17 | 284   | 16,381         | 1,122,612    | 282,898,279
3.6     | 2011-01-18 | 2011-06-30 | 163   | 19,288         | 1,328,355    | 219,178,587
3.7     | 2011-07-01 | 2012-06-19 | 354   | 23,408         | 2,052,660    | 594,338,675
3.8     | 2012-06-20 | 2013-09-19 | 456   | 16,614         | 2,925,335    | 570,440,410
3.9     | 2013-09-20 | 2014-09-02 | 347   | 22,026         | 3,035,428    | 1,062,399,840
2014    | 2014-09-03 | 2015-07-05 | 305   | 27,927         | 3,423,490    | 1,051,011,401
2015-04 | 2015-07-06 | 2016-03-31 | 269   | 24,689         | 3,516,936    | 953,089,788
2015-10 | 2016-04-01 | 2016-10-13 | 195   | 110,745        | 6,581,217    | 1,263,593,686
2016-04 | 2016-10-14 | 2017-07-03 | 262   | 231,735        | 7,646,447    | 2,003,369,014
2016-10 | 2017-07-04 | 2021-01-07 | 1,283 | 257,994        | 7,542,623    | 9,501,427,081

For detailed information on the specific usage numbers, please visit the original report by OpenLink Software published here. Older reports are also available through their site. The previous usage report (2020) is available on the DBpedia blog.

Further Links

For the latest news, subscribe to the DBpedia Newsletter, check our DBpedia Website and follow us on Twitter or LinkedIn.

Thanks for reading and keep using DBpedia!

Yours, DBpedia Association