- Performance Benchmark Report
- ============================
- This document provides broad performance measurements for Lakesuperior and
- compares them with the Fedora/Modeshape implementation.
- Environment
- -----------
- Hardware
- ~~~~~~~~
- - MacBook Pro14,2
- - 1x Intel(R) Core(TM) i5 @3.1GHz
- - 16GB RAM
- - SSD
- - OS X 10.13
- - Python 3.7.2
- - lmdb 0.9.22
- Benchmark script
- ~~~~~~~~~~~~~~~~
- `Source code <../../util/benchmark.py>`__
- The script was run to generate 100,000 children under the same parent. PUT
- and POST requests were tested separately. The POST method produced pairtrees
- in Fedora to counter its known issue with many resources as direct children of
- a container.
- The script measures only the time spent on the PUT or POST requests; the time
- used to generate the random data is not counted.
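A minimal sketch of the timing approach described above, in which payload generation stays outside the timed window. The names `timed_ingest`, `send` and `make_payload` are hypothetical and are not taken from `benchmark.py`; a dictionary stands in for the repository so that the sketch runs on its own.

```python
import time
import uuid


def timed_ingest(send, make_payload, count):
    """Return the duration in seconds of each of ``count`` calls to ``send``."""
    timings = []
    for _ in range(count):
        # Identifier and payload are generated before the clock starts.
        uid = uuid.uuid4().hex
        payload = make_payload()
        start = time.perf_counter()
        send(uid, payload)  # only this call is timed
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in for a PUT request: store the payload in a dictionary.
stored = {}
timings = timed_ingest(stored.__setitem__, lambda: b"x" * 1024, 100)
print(f"mean time per request: {sum(timings) / len(timings) * 1000:.4f} ms")
```

In a real run, `send` would wrap the HTTP PUT or POST call, so that network and server time are measured and data generation is not.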
- Data Set
- ~~~~~~~~
- Synthetic graph created by the benchmark script. The graph is unique for
- each request and consists of 200 triples which are partly random data,
- with a consistent size and variation:
- - 50 triples have an object that is a URI of an external resource (50
- unique predicates; 5 unique objects).
- - 50 triples have an object that is a URI of a repository-managed
- resource (50 unique predicates; 5 unique objects).
- - 100 triples have an object that is a 64-character random Unicode
- string (50 unique predicates; 100 unique objects).
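The shape of this data set can be sketched with the standard library alone. This is an illustration under assumptions, not the generator used by the benchmark script: the namespaces, the function names, the alphabet used for the literals and the choice of a separate predicate set per group are all hypothetical.

```python
import random
import string
import uuid

# Hypothetical namespaces, chosen only for this illustration.
PRED_NS = "http://example.org/pred/"
EXT_NS = "http://example.org/external/"
REPO_NS = "http://localhost:8000/ldp/"

# Letters from two scripts, so that the literals are not plain ASCII.
ALPHABET = string.ascii_letters + "".join(chr(c) for c in range(0x0410, 0x0450))


def random_literal(length=64):
    """Return a random string of ``length`` characters drawn from ALPHABET."""
    return "".join(random.choices(ALPHABET, k=length))


def make_graph(subject):
    """Build 200 (subject, predicate, object) tuples shaped like the data set."""
    ext_objs = [EXT_NS + uuid.uuid4().hex for _ in range(5)]
    repo_objs = [REPO_NS + uuid.uuid4().hex for _ in range(5)]
    triples = []
    for i in range(50):
        # 50 predicates each for external and repository-managed URIs,
        # cycling through 5 unique objects per group.
        triples.append((subject, f"{PRED_NS}ext{i}", ext_objs[i % 5]))
        triples.append((subject, f"{PRED_NS}repo{i}", repo_objs[i % 5]))
        # Two random literals per literal predicate: 100 triples in total.
        for _ in range(2):
            triples.append((subject, f"{PRED_NS}lit{i}", random_literal()))
    return triples
```

Calling `make_graph` once per request yields a new graph each time with the same size and the same mix of object types.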
- The benchmark script is also capable of generating random binaries and a mix of
- binary and RDF resources; a large-scale benchmark of those, however, was
- impractical at the time of writing due to storage constraints.
- LDP Data Retrieval
- ~~~~~~~~~~~~~~~~~~
- REST API request::
-     time curl http://localhost:8000/ldp/pomegranate > /dev/null
- SPARQL Query
- ~~~~~~~~~~~~
- The following query was used against the repository after the 100K resource
- ingest::
-     PREFIX ldp: <http://www.w3.org/ns/ldp#>
-     SELECT (COUNT(?s) AS ?c) WHERE {
-       ?s a ldp:Resource .
-       ?s a ldp:Container .
-     }
- Raw request::
-     time curl -iXPOST -H'Accept:application/sparql-results+json' \
-     -H'Content-Type:application/x-www-form-urlencoded; charset=UTF-8' \
-     -d 'query=PREFIX+ldp:+<http://www.w3.org/ns/ldp#> SELECT+(COUNT(?s)+AS+?c)'\
-     '+WHERE+{ ++?s+a+ldp:Resource+. ++?s+a+ldp:Container+. }+' \
-     http://localhost:5000/query/sparql
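For readers who prefer not to encode the query by hand, the same request can be assembled with the Python standard library. This is a sketch only: the helper names `build_sparql_request` and `run_query` are hypothetical, and the endpoint and headers are copied from the curl command above.

```python
import urllib.parse
import urllib.request

QUERY = """PREFIX ldp: <http://www.w3.org/ns/ldp#>
SELECT (COUNT(?s) AS ?c) WHERE {
  ?s a ldp:Resource .
  ?s a ldp:Container .
}"""


def build_sparql_request(query, endpoint="http://localhost:5000/query/sparql"):
    """Build (but do not send) a form-encoded SPARQL POST request."""
    # urlencode takes care of escaping spaces, braces, '#' and newlines.
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "application/sparql-results+json",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def run_query(query):
    """Send the request and return the raw response body.

    Requires a running server at the endpoint.
    """
    with urllib.request.urlopen(build_sparql_request(query)) as response:
        return response.read()
```

Wrapping `run_query(QUERY)` between two `time.perf_counter()` calls would give a timing comparable to the `time curl` invocation, minus the process start-up cost of curl.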
- Python API Retrieval
- ~~~~~~~~~~~~~~~~~~~~
- To illustrate the advantages of the Python API, a sample retrieval of the
- container resource was timed after the load. This was done in an IPython
- console::
-     In [1]: from lakesuperior import env_setup
-     In [2]: from lakesuperior.api import resource as rsrc_api
-     In [3]: %timeit x = rsrc_api.get('/pomegranate').imr.as_rdflib()
- Results
- -------
- =================== =============== ================ ============= ==================== ==============
- Software            PUT             POST             Store Size    GET                  SPARQL Query
- =================== =============== ================ ============= ==================== ==============
- FCREPO 5.0.2        >500ms [#]_     65ms (100%) [#]_ 12GB (100%)   3m41s (100%)         N/A
- Lakesuperior REST   104ms (100%)    123ms (189%)     8.7GB (72%)   30s (14%)            19.3s (100%)
- Lakesuperior Python 69ms (60%)      58ms (89%)       8.7GB (72%)   6.7s (3%) [#]_ [#]_  9.17s (47%)
- =================== =============== ================ ============= ==================== ==============
- .. [#] PUT was stopped at 30K resources after the ingest time reached >1s per
-     resource. This is the manifestation of the "many members" issue which is
-     visible in the graph below. The "Store" value is for the POST operation,
-     which ran normally with 100K resources.
- .. [#] The POST test with 100K resources was conducted with Fedora 4.7.5
-     because 5.0 would not automatically create a pairtree, thereby resulting
-     in the same performance as the PUT method.
- .. [#] Timing based on a warm cache. The first query timed at 22.2s.
- .. [#] The Python API time for the "GET request" (retrieval) without the
-     conversion to Python in alpha20 is 3.2 seconds, versus the 6.7s that
-     includes conversion to Python/RDFlib objects. This can be improved by
-     using more efficient libraries that allow serialization and
-     deserialization of RDF.
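The percentages in the GET column express each timing relative to the Fedora baseline (3m41s, that is, 221 seconds). The arithmetic can be sketched as follows; `to_seconds` and `relative` are hypothetical helper names, and rounding to the nearest integer is an assumption.

```python
import re


def to_seconds(value):
    """Convert a duration such as '3m41s', '30s' or '6.7s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + float(seconds)


def relative(value, baseline):
    """Express ``value`` as a percentage of ``baseline``, rounded to an integer."""
    return round(100 * to_seconds(value) / to_seconds(baseline))


for timing in ("30s", "6.7s"):
    print(timing, f"{relative(timing, '3m41s')}%")
```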
- Charts
- ------
- .. figure:: assets/plot_fcrepo_put_30K.png
-    :alt: Fedora with PUT, 30K request time chart
-    Fedora/Modeshape using PUT requests under the same parent. The "many
-    members" issue is clearly visible after a threshold is reached.
- .. figure:: assets/plot_fcrepo_post_100K.png
-    :alt: Fedora with POST, 100K request time chart
-    Fedora/Modeshape using POST requests generating pairtrees. The performance
-    is greatly improved; however, the ingest time increases linearly with the
-    repository size (O(n) time complexity).
- .. figure:: assets/plot_lsup_post_100K.png
-    :alt: Lakesuperior with POST, 100K request time chart
-    Lakesuperior using POST requests, NOT generating pairtrees (equivalent to
-    a PUT request). The timing increase is closer to an O(log n) pattern.
- .. figure:: assets/plot_lsup_pyapi_post_100K.png
-    :alt: Lakesuperior Python API, 100K request time chart
-    Lakesuperior using Python API. The pattern is much smoother, with less
-    frequent and less pronounced spikes. The O(log n) performance is more
-    clearly visible here: time increases quickly at the beginning, then slows
-    down as the repository size increases.
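The O(n) and O(log n) readings of the charts are visual. One way to check them against recorded timings is to compare how well a straight line fits time versus resource count and time versus the logarithm of the resource count. The sketch below is an assumption about how such a check could be done, not part of the benchmark script; `r_squared` and `growth_fit` are hypothetical names.

```python
import math


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys).

    Undefined (division by zero) if either series is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a simple linear regression, R^2 equals the squared correlation.
    return sxy ** 2 / (sxx * syy)


def growth_fit(timings):
    """Compare a linear and a logarithmic model of time versus resource count."""
    counts = list(range(1, len(timings) + 1))
    return {
        "linear": r_squared(counts, timings),
        "logarithmic": r_squared([math.log(c) for c in counts], timings),
    }
```

The model with the higher value describes the series better. Real timings are noisy, so averaging over windows of requests before fitting is likely to give a clearer result.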
- Conclusions
- -----------
- Lakesuperior appears to be slower on writes and much faster on reads than
- Fedora 4-5. Both of these factors are very likely related to the underlying
- LMDB store, which is optimized for read performance. The write performance gap
- is more than closed when ingesting via the Python API.
- In a real-world application scenario, in which a client may perform multiple
- reads before and after storing resources, the write performance gap may
- decrease. A Python application using the Python API for querying and writing
- would experience a dramatic improvement in read as well as write timings.
- As may be obvious, these are only partial and specific results. They should
- not be taken as a thorough performance assessment, but rather as a starting
- point to which specific use-case variables may be added.