
Performance Benchmark Report
============================

The purpose of this document is to provide very broad performance measurements
and a comparison between the Lakesuperior and Fedora/Modeshape implementations.

Environment
-----------

Hardware
~~~~~~~~

- MacBook Pro 14,2
- 1x Intel(R) Core(TM) i5 @ 3.1 GHz
- 16 GB RAM
- SSD
- OS X 10.13
- Python 3.7.2
- lmdb 0.9.22
Benchmark Script
~~~~~~~~~~~~~~~~

`Source code <../../util/benchmark.py>`__

The script was run by generating 100,000 children under the same parent. PUT
and POST requests were tested separately. The POST method produced pairtrees
in Fedora to counter its known issue with many resources as direct children of
a container.

The script measures only the time taken by the PUT or POST requests, not
the time used to generate the random data.
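The timing discipline described above (measure the request, not the data
generation) can be sketched roughly as follows. This is an illustrative
sketch, not the actual benchmark script; ``send`` and ``make_body`` are
hypothetical stand-ins for the script's request and data-generation
functions::

   import time

   def benchmark_requests(send, make_body, count):
       """Time only the `send` calls, excluding body generation.

       `send` performs one PUT or POST request and `make_body` generates
       one synthetic graph; both are hypothetical stand-ins.
       """
       total = 0.0
       for i in range(count):
           body = make_body()                    # data generation: not timed
           start = time.perf_counter()
           send(i, body)                         # only the request is timed
           total += time.perf_counter() - start
       return total / count                      # mean seconds per request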
Data Set
~~~~~~~~

A synthetic graph is created by the benchmark script. The graph is unique for
each request and consists of 200 triples which are partly random data,
with a consistent size and variation:

- 50 triples have an object that is a URI of an external resource (50
  unique predicates; 5 unique objects).
- 50 triples have an object that is a URI of a repository-managed
  resource (50 unique predicates; 5 unique objects).
- 100 triples have an object that is a 64-character random Unicode
  string (50 unique predicates; 100 unique objects).

The benchmark script is also capable of generating random binaries and a mix of
binary and RDF resources; a large-scale benchmark, however, was impractical at
the moment due to storage constraints.
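The triple distribution above can be reproduced with a small generator. The
sketch below uses plain string tuples rather than an RDF library, and all
URIs and predicate names are made up for illustration; only the counts and
the predicate/object cardinalities match the description::

   import random
   import string

   def make_synthetic_graph():
       """Build a 200-triple set with the distribution described above
       (an illustrative sketch, not the real benchmark generator)."""
       subj = "urn:bench:subject"
       ext_objs = [f"http://example.org/ext/{i}" for i in range(5)]
       repo_objs = [f"http://localhost:8000/ldp/res/{i}" for i in range(5)]
       triples = set()
       # 50 triples whose object is an external URI (50 predicates, 5 objects).
       for i in range(50):
           triples.add((subj, f"urn:bench:p_ext/{i}", random.choice(ext_objs)))
       # 50 triples whose object is a repository-managed URI.
       for i in range(50):
           triples.add((subj, f"urn:bench:p_repo/{i}", random.choice(repo_objs)))
       # 100 triples with 64-character random string literals (50 predicates reused).
       for i in range(100):
           lit = "".join(random.choices(string.ascii_letters, k=64))
           triples.add((subj, f"urn:bench:p_lit/{i % 50}", lit))
       return triples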
LDP Data Retrieval
~~~~~~~~~~~~~~~~~~

REST API request::

   time curl http://localhost:8000/ldp/pomegranate > /dev/null
SPARQL Query
~~~~~~~~~~~~

The following query was used against the repository after the 100K resource
ingest::

   PREFIX ldp: <http://www.w3.org/ns/ldp#>
   SELECT (COUNT(?s) AS ?c) WHERE {
     ?s a ldp:Resource .
     ?s a ldp:Container .
   }

Raw request::

   time curl -iXPOST -H'Accept:application/sparql-results+json' \
       -H'Content-Type:application/x-www-form-urlencoded; charset=UTF-8' \
       -d 'query=PREFIX+ldp:+<http://www.w3.org/ns/ldp#> SELECT+(COUNT(?s)+AS+?c)'\
       '+WHERE+{ ++?s+a+ldp:Resource+. ++?s+a+ldp:Container+. }+' \
       http://localhost:5000/query/sparql
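The form body that the curl command builds by hand can also be produced from
Python with the standard library. This is a sketch for illustration: the
endpoint URL is the one shown above, and ``run_query`` requires a running
server, so only the encoding helper is exercised standalone::

   from urllib.parse import urlencode
   from urllib.request import Request, urlopen

   # The query from the section above.
   QUERY = """\
   PREFIX ldp: <http://www.w3.org/ns/ldp#>
   SELECT (COUNT(?s) AS ?c) WHERE {
     ?s a ldp:Resource .
     ?s a ldp:Container .
   }
   """

   def encode_body(query):
       """Form-encode the query as the curl command does by hand."""
       return urlencode({"query": query}).encode("ascii")

   def run_query(endpoint="http://localhost:5000/query/sparql"):
       """POST the query to the SPARQL endpoint (needs a live server)."""
       req = Request(
           endpoint,
           data=encode_body(QUERY),
           headers={
               "Accept": "application/sparql-results+json",
               "Content-Type":
                   "application/x-www-form-urlencoded; charset=UTF-8",
           },
       )
       with urlopen(req) as resp:
           return resp.read()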
Python API Retrieval
~~~~~~~~~~~~~~~~~~~~

In order to illustrate the advantages of the Python API, a sample retrieval of
the container resource after the load was timed. This was done in an
IPython console::

   In [1]: from lakesuperior import env_setup
   In [2]: from lakesuperior.api import resource as rsrc_api
   In [3]: %timeit x = rsrc_api.get('/pomegranate').imr.as_rdflib()
Results
-------

=================== ============ ================ =========== =================== ============
Software            PUT          POST             Store Size  GET                 SPARQL Query
=================== ============ ================ =========== =================== ============
FCREPO 5.0.2        >500ms [#]_  65ms (100%) [#]_ 12GB (100%) 3m41s (100%)        N/A
Lakesuperior REST   104ms (100%) 123ms (189%)     8.7GB (72%) 30s (14%)           19.3s (100%)
Lakesuperior Python 69ms (60%)   58ms (89%)       8.7GB (72%) 6.7s (3%) [#]_ [#]_ 9.17s (47%)
=================== ============ ================ =========== =================== ============

.. [#] POST was stopped at 30K resources after the ingest time reached >1s per
   resource. This is the manifestation of the "many members" issue which is
   visible in the graph below. The "Store" value is for the PUT operation,
   which ran regularly with 100K resources.
.. [#] The POST test with 100K resources was conducted with Fedora 4.7.5,
   because 5.0 would not automatically create a pairtree, thereby resulting
   in the same performance as the PUT method.
.. [#] Timing based on a warm cache. The first query timed at 22.2s.
.. [#] The Python API time for the "GET request" (retrieval) without the
   conversion to Python in alpha20 is 3.2 seconds, versus the 6.7s that
   includes conversion to Python/RDFLib objects. This can be improved by
   using more efficient libraries that allow serialization and
   deserialization of RDF.
Charts
------

.. figure:: assets/plot_fcrepo_put_30K.png
   :alt: Fedora with PUT, 30K request time chart

   Fedora/Modeshape using PUT requests under the same parent. The "many
   members" issue is clearly visible after a threshold is reached.

.. figure:: assets/plot_fcrepo_post_100K.png
   :alt: Fedora with POST, 100K request time chart

   Fedora/Modeshape using POST requests generating pairtrees. The performance
   is greatly improved; however, the ingest time increases linearly with the
   repository size (O(n) time complexity).

.. figure:: assets/plot_lsup_post_100K.png
   :alt: Lakesuperior with POST, 100K request time chart

   Lakesuperior using POST requests, NOT generating pairtrees (equivalent to
   a PUT request). The timing increase is closer to an O(log n) pattern.

.. figure:: assets/plot_lsup_pyapi_post_100K.png
   :alt: Lakesuperior Python API, 100K request time chart

   Lakesuperior using the Python API. The pattern is much smoother, with less
   frequent and less pronounced spikes. The O(log n) performance is more
   clearly visible here: time increases quickly at the beginning, then slows
   down as the repository size increases.
Conclusions
-----------

Lakesuperior appears to be slower on writes and much faster on reads than
Fedora 4-5. Both of these factors are very likely related to the underlying
LMDB store, which is optimized for read performance. The write performance gap
is more than closed when ingesting via the Python API.

In a real-world application scenario, in which a client may perform multiple
reads before and after storing resources, the write performance gap may
decrease. A Python application using the Python API for querying and writing
would experience a dramatic improvement in read as well as write timings.

As should be obvious, these are only very partial and specific
results. They should not be taken as a thorough performance assessment, but
rather as a starting point to which specific use-case variables may be added.