Performance Benchmark Report
============================

The purpose of this document is to provide very broad performance
measurements and a comparison between the Lakesuperior and Fedora/Modeshape
implementations.

Environment
-----------

Hardware
~~~~~~~~

- MacBook Pro14,2
- 1x Intel(R) Core(TM) i5 @3.1Ghz
- 16Gb RAM
- SSD
- OS X 10.13

Software
~~~~~~~~

- Python 3.7.2
- lmdb 0.9.22

Benchmark script
~~~~~~~~~~~~~~~~

`Generator script <../../util/benchmark.py>`__

The script was run with default values: respectively, 10,000 and 100,000
children under the same parent. PUT and POST requests were tested separately.
The script only measures the time spent on the PUT or POST requests, not the
time used to generate the random data.
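
The loop below is a simplified illustration of how such a measurement can be
taken; it is a stand-in for ``util/benchmark.py``, not the actual script, and
the endpoint, child count and payload stub are placeholders::

   import time
   import uuid

   import requests

   PARENT = 'http://localhost:8000/ldp/pomegranate'  # hypothetical parent
   N_CHILDREN = 10000

   def make_payload():
       # Stand-in for the 200-triple random graph described under "Data Set".
       return b'<> a <http://www.w3.org/ns/ldp#Resource> .'

   total = 0.
   for i in range(N_CHILDREN):
       payload = make_payload()  # payload generation is not timed
       start = time.time()
       rsp = requests.put(
           f'{PARENT}/child-{uuid.uuid4()}', data=payload,
           headers={'Content-Type': 'text/turtle'})
       total += time.time() - start
       rsp.raise_for_status()

   print(f'Average PUT time: {total / N_CHILDREN * 1000:.1f}ms')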

Data Set
~~~~~~~~

Synthetic graph created by the benchmark script. The graph is unique for each
request and consists of 200 triples which are partly random data, with a
consistent size and variation:

- 50 triples have an object that is a URI of an external resource (50
  unique predicates; 5 unique objects).
- 50 triples have an object that is a URI of a repository-managed
  resource (50 unique predicates; 5 unique objects).
- 100 triples have an object that is a 64-character random Unicode
  string (50 unique predicates; 100 unique objects).
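
A rough sketch of such a generator, using rdflib, is shown below. It only
approximates the distribution described above and is not the code used by the
benchmark script; the subject and predicate URIs are made up, and plain ASCII
is used in place of random Unicode::

   import random
   import string

   from rdflib import Graph, Literal, URIRef

   def random_string(length=64):
       return ''.join(random.choice(string.ascii_letters) for _ in range(length))

   def synthetic_graph(subj=URIRef('http://localhost:8000/ldp/child')):
       gr = Graph()
       # 50 triples pointing to external resources (50 predicates, 5 objects).
       for i in range(50):
           gr.add((subj, URIRef(f'urn:pred:ext_{i}'),
                   URIRef(f'http://example.org/ext/{i % 5}')))
       # 50 triples pointing to repository-managed resources.
       for i in range(50):
           gr.add((subj, URIRef(f'urn:pred:int_{i}'),
                   URIRef(f'http://localhost:8000/ldp/res_{i % 5}')))
       # 100 random 64-character literals (50 predicates, 100 unique objects).
       for i in range(100):
           gr.add((subj, URIRef(f'urn:pred:lit_{i % 50}'),
                   Literal(random_string())))
       return gr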

LDP Data Retrieval
~~~~~~~~~~~~~~~~~~

REST API request::

   time curl http://localhost:8000/ldp/pomegranate > /dev/null

SPARQL Query
~~~~~~~~~~~~

*Note:* The query may take a long time and is therefore run against the
single-threaded server (``lsup-server``), which does not impose a timeout (of
course, gunicorn could also be used by changing the configuration to allow a
long timeout).

Sample query::

   PREFIX ldp: <http://www.w3.org/ns/ldp#>
   SELECT (COUNT(?s) AS ?c) WHERE {
     ?s a ldp:Resource .
     ?s a ldp:Container .
   }

Raw request::

   time curl -iXPOST -H'Accept:application/sparql-results+json' \
   -H'Content-Type:application/x-www-form-urlencoded; charset=UTF-8' \
   -d 'query=PREFIX+ldp:+<http://www.w3.org/ns/ldp#> SELECT+(COUNT(?s)+AS+?c)'\
   '+WHERE+{ ++?s+a+ldp:Resource+. ++?s+a+ldp:Container+. }+' \
   http://localhost:5000/query/sparql
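
An equivalent request using the Python ``requests`` library may be easier to
read than the URL-encoded form above; this is only a convenience sketch, not
part of the benchmark, and the endpoint matches the raw request shown above::

   import requests

   query = '''
   PREFIX ldp: <http://www.w3.org/ns/ldp#>
   SELECT (COUNT(?s) AS ?c) WHERE {
     ?s a ldp:Resource .
     ?s a ldp:Container .
   }
   '''
   rsp = requests.post(
       'http://localhost:5000/query/sparql',
       data={'query': query},  # sent as application/x-www-form-urlencoded
       headers={'Accept': 'application/sparql-results+json'})
   print(rsp.json())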

Python API Retrieval
~~~~~~~~~~~~~~~~~~~~

In order to illustrate the advantages of the Python API, a sample retrieval
of the container resource after the load was timed in an IPython console::

   In [1]: from lakesuperior import env_setup
   In [2]: from lakesuperior.api import resource as rsrc_api
   In [3]: %timeit x = rsrc_api.get('/pomegranate').imr.as_rdflib

Results
-------

10K Resources
^^^^^^^^^^^^^

=============================== ============= ============= ============ ============ ============
System                          PUT           POST          Store        GET          SPARQL Query
=============================== ============= ============= ============ ============ ============
FCREPO / Modeshape 4.7.5        68ms (100%)   XXms (100%)   3.9Gb (100%) 6.2s (100%)  N/A
Lakesuperior 1.0a20 REST API    105ms (159%)  XXXms (XXX%)  298Mb (8%)   2.1s         XXXXXXXs
Lakesuperior 1.0a20 Python API  53ms (126%)   XXms (XXX%)   789Mb (21%)  381ms        N/A
=============================== ============= ============= ============ ============ ============

**Notes:**

- The Python API time for the GET request in alpha18 is 8.5% of the request
  time. This means that over 91% of the time is spent serializing the
  results. This time could be dramatically reduced by using faster
  serialization libraries, or can be outright zeroed out by an application
  that uses the Python API directly and manipulates the native RDFLib objects
  (of course, if a serialized output is eventually needed, that cost is
  unavoidable). See the sketch after this list.
- Similarly, the ``triples`` retrieval method accounts for only 13.6% of the
  SPARQL query request time. The rest is spent evaluating the SPARQL query
  and processing the results. An application can use ``triples`` directly for
  relatively simple lookups without that overhead.
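
As a minimal sketch of the first point above, an application using the Python
API can keep working with the native RDFLib graph without ever serializing
it. This assumes that the graph returned by ``as_rdflib`` is a standard
RDFLib ``Graph`` and that, as in an LDP response, it includes the containment
triples::

   from lakesuperior import env_setup
   from lakesuperior.api import resource as rsrc_api
   from rdflib import URIRef

   # Retrieve the container once; no Turtle/JSON-LD serialization happens.
   gr = rsrc_api.get('/pomegranate').imr.as_rdflib

   print(len(gr), 'triples in the container graph')

   # In-memory lookup of the contained children with plain rdflib calls.
   ldp_contains = URIRef('http://www.w3.org/ns/ldp#contains')
   children = list(gr.objects(None, ldp_contains))
   print(len(children), 'child resources')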

100K Resources
^^^^^^^^^^^^^^

=============================== =============== =============== ============= =============== ==============
System                          PUT             POST            Store         GET             SPARQL Query
=============================== =============== =============== ============= =============== ==============
FCREPO / Modeshape 4.7.5        500+ms\*        65ms (100%)\*\* 12Gb (100%)   3m41s (100%)    N/A
Lakesuperior 1.0a20 REST API    104ms (100%)    123ms (189%)    8.7Gb (72%)   30s (14%)       XXXXXXXXs
Lakesuperior 1.0a20 Python API  69ms (60%)      XXms (XXX%)     8.7Gb (72%)   6s (2.7%)       XXXXXXXs\*\*\*
=============================== =============== =============== ============= =============== ==============

\* POST was stopped at 30K resources after the ingest time reached >1s per
resource. This is the manifestation of the "many members" issue which is
visible in the graph below. The "Store" value is for the PUT operation, which
ran regularly with 100K resources.

\*\* The POST test with 100K resources was conducted with Fedora 4.7.5,
because 5.0 would not automatically create a pairtree, thereby resulting in
the same performance as the PUT method.

\*\*\* Timing based on a warm cache. The first query timed at 0m22.2s.

Conclusions
-----------

Lakesuperior appears to be markedly slower on writes and markedly faster on
reads. Both of these factors are very likely related to the underlying LMDB
store, which is optimized for read performance.

In a real-world application scenario, in which a client may perform multiple
reads before and after storing resources, the write performance gap may
decrease. A Python application using the Python API for querying and writing
would see a dramatic improvement in read timings and a more modest one in
write timings.

Comparison of results between the laptop and the server demonstrates that
both read and write performance ratios between the repository systems are
identical in the two environments.

As may be obvious, these are only very partial and specific results. They
should not be taken as a thorough performance assessment. Such an assessment
may be impossible and pointless to make, given the very different nature of
the storage models, which may behave radically differently depending on many
variables.