By Theofilos Ioannidis (tioannid [at] di [dot] uoa [dot] gr)
The FAIR Data Principles [1] comprise 15 facets grouped under the four letters of FAIR: Findable, Accessible, Interoperable and Reusable. These principles are gradually being adopted by the research community. The European Commission, in an effort to promote them, has expanded its demand for research to produce open data.
The FAIR data principles emphasize enhancing the ability of machines to find and use data automatically, and clearly state that they are distinct from peer initiatives that focus on the human scholar. As pointed out in [2], the Interoperable and Reusable facets of FAIR are very difficult to adhere to and require time and resources. The initiative therefore does not prioritize creating a platform that is friendly to the human scholar.
The authors of [3] state that open access to research data may become a double-edged sword: FAIR principles encourage systematic reuse of data and metadata standards, which may facilitate the introduction and perpetuation of errors, bias, questionable interpretations and limitations of the original study. Additional checks on research quality and on the integrity of data and methods are therefore needed.
HOBBIT [4] (Holistic Benchmarking of Big Linked Data) is a benchmarking platform that, by design, tries to comply with the FAIR data principles. Platforms of this kind are benchmarking frameworks designed for deployment on cloud infrastructures, with distributed file systems and containerization technologies. They are multi-user environments where researchers can store and share datasets, query sets, execution results and system modules. They promise increased reuse of implemented systems and workloads, improved transparency of experiment results through easy repetition of experiments and comparison of results, and assistance in managing the whole process through intuitive web UIs. HOBBIT is the most complete of these platforms: it extends the scope of benchmarking to the entire linked data life-cycle (link discovery, for example), employs intuitive web UIs and allows the integration of systems written in various programming languages.
A common problem with all such distributed infrastructures is that they require a substantial initial investment of time from the user to learn the related technologies, the UIs and especially the platform APIs, such as the benchmarking API. In addition, a set of rules and policies usually dominates these environments and dictates the style and pace of work. A side effect of HOBBIT's compliance with the FAIR initiative is that substantial effort and time are required to learn the HOBBIT platform core API and the RabbitMQ message broker API, and to become familiar with the Docker technology stack [5] and the Maven build toolchain.
Many containers are required for benchmarks and systems. More specifically, 4 containers (Data Generator, Task Generator, Evaluation Module, Benchmark Controller) [HOBBIT: How to integrate a Benchmark] are needed to describe a benchmark, and up to 2 containers (System and SystemAdapter) [HOBBIT: How to Integrate Your Own Code] are needed for each system to be tested.
Each system, benchmark and experiment must provide a model compliant with the [HOBBIT ontology] as an RDF file in Turtle format, which describes the component's properties, optional configuration parameters, Docker image URLs, etc.
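To give a feel for what such a model file involves, the following is a minimal sketch in Python that renders a Turtle description for a single system. The namespace and the terms `hobbit:SystemInstance`, `hobbit:imageName` and `hobbit:implementsAPI` are written from memory of the HOBBIT vocabulary and should be treated as assumptions to be checked against the ontology itself. The `SystemDescription` class, the `to_turtle` function and all example URIs are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass

# Assumed namespace of the HOBBIT vocabulary; verify against the ontology.
HOBBIT_NS = "http://w3id.org/hobbit/vocab#"


@dataclass
class SystemDescription:
    """Hypothetical container for the fields a system model might carry."""
    uri: str             # URI identifying the system
    label: str           # short human-readable name
    comment: str         # longer description
    image_name: str      # Docker image reference in the registry
    implements_api: str  # URI of the benchmark API the system supports


def _quote(text: str) -> str:
    """Wrap text as a Turtle string literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_turtle(desc: SystemDescription) -> str:
    """Render the description as a small Turtle document."""
    lines = [
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix hobbit: <{HOBBIT_NS}> .",
        "",
        f"<{desc.uri}> a hobbit:SystemInstance ;",
        f"    rdfs:label {_quote(desc.label)}@en ;",
        f"    rdfs:comment {_quote(desc.comment)}@en ;",
        f"    hobbit:imageName {_quote(desc.image_name)} ;",
        f"    hobbit:implementsAPI <{desc.implements_api}> .",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = SystemDescription(
        uri="http://example.org/systems/MyStore",
        label="My Store",
        comment="An RDF store wrapped for benchmarking",
        image_name="registry.example.org/demo/mystore",
        implements_api="http://example.org/benchmarks/MyBenchmarkAPI",
    )
    print(to_turtle(example))
```

Even this reduced form shows that the researcher has to know the vocabulary terms, the image location and the API URI of the target benchmark before anything can be uploaded. A real model may also need configuration parameters, which this sketch leaves out.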
The standard way of uploading a system or benchmark to the HOBBIT platform requires that the respective container images are pushed to the Docker registry of a GitLab repository and that the RDF model file is committed to this repository.
The researcher is not spared any effort in setting up, configuring, optimizing and dockerizing the system that needs to be tested. Instructions are available on using skeleton Java classes for creating generic SystemAdapter and Benchmark components, which leaves system-related and benchmark-related coding to the user. The knowledge that is left out includes system architecture setup and geospatial or GeoSPARQL support, e.g. configuring the RDF store's memory allocation based on the available container memory or customizing the spatial index used for the repositories. The data loading process is part of the containers' code, which prevents the more efficient external bulk loader utilities of RDF stores from being used to evaluate the stores properly. Monitoring and handling of the query execution phases is left to users to implement if they want fine-grained control over exception handling.
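As an illustration of the kind of system-specific knowledge each researcher has to supply, here is a minimal sketch in Python of sizing a store's memory from the container's memory limit. It assumes a JVM-based RDF store configured through the standard `-Xmx` flag and a Linux container exposing its limit through the usual cgroup files. The 50% heap fraction is an arbitrary placeholder rather than a recommendation, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

# Usual locations of the memory limit inside a Linux container.
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)

# cgroup v1 reports a very large number when no limit is set;
# values at or above this threshold are treated as "unlimited".
UNLIMITED_THRESHOLD = 1 << 60


def container_memory_limit(
    files: Iterable[Path] = CGROUP_LIMIT_FILES,
) -> Optional[int]:
    """Return the memory limit in bytes, or None if no limit can be determined."""
    for path in files:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file missing or unreadable: try the next candidate
        # cgroup v2 writes the literal "max" when the container is unlimited.
        if raw.isdigit() and int(raw) < UNLIMITED_THRESHOLD:
            return int(raw)
    return None


def jvm_heap_flag(limit_bytes: int, heap_fraction: float = 0.5) -> str:
    """Build a -Xmx flag that gives the JVM a fraction of the container limit."""
    if not 0 < heap_fraction <= 1:
        raise ValueError("heap_fraction must be in (0, 1]")
    heap_mib = int(limit_bytes * heap_fraction) // (1024 * 1024)
    return f"-Xmx{heap_mib}m"


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("No container memory limit detected; keeping store defaults.")
    else:
        print(jvm_heap_flag(limit))
```

The right fraction depends on the store, for example on how much memory it keeps off-heap for indexes or memory-mapped files. That is exactly the kind of knowledge the platform does not carry over from one researcher to the next.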
Overall, HOBBIT appears to achieve the generality needed to accommodate benchmarks across the whole Linked Data life-cycle, component flexibility through containerization, language independence, vertical scalability and compliance with the FAIR initiative. On the other hand, HOBBIT increases platform complexity, sacrifices usability for new users and does not provide out-of-the-box reusability of benchmark-specific and system-specific knowledge for benchmark researchers. Human scholars need to invest heavily in this framework and still do not get the assistance they would expect for their effort. In the context of a task as difficult and detailed as benchmarking, this additional set of requirements may seem unfair to many benchmark researchers.