gms | German Medical Science

Information Retrieval Meeting (IRM 2022)

10.06. - 11.06.2022, Cologne

Online information retrieval evaluation using the STELLA framework

Meeting Abstract

  • presenting/speaker Timo Breuer - TH Köln – University of Applied Sciences, Cologne, Germany
  • Narges Tavakolpoursaleh - GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
  • Johann Schaible - EU|FH – University of Applied Sciences, Bruehl, Germany
  • Daniel Hienert - GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
  • Philipp Schaer - TH Köln – University of Applied Sciences, Cologne, Germany
  • corresponding author Leyla Jael Castro - ZB MED – Information Centre for Life Sciences, Cologne, Germany

Information Retrieval Meeting (IRM 2022). Cologne, 10.-11.06.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. Doc22irm22

doi: 10.3205/22irm22, urn:nbn:de:0183-22irm229

Published: June 8, 2022

© 2022 Breuer et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at



Introduction: Involving users in early phases of software development has become a common strategy, as it enables developers to consider user needs from the beginning. Once a system is in production, new opportunities to observe, evaluate, and learn from users emerge as more information becomes available. Gathering information from users to continuously evaluate their behavior is common practice for commercial software, while the Cranfield paradigm remains the preferred option for Information Retrieval (IR) and recommendation systems in the academic world. Here we introduce the Infrastructures for Living Labs STELLA project, which aims to create an evaluation infrastructure allowing experimental systems to run alongside production web-based academic search systems with real users. STELLA combines user interactions and log file analyses to enable large-scale A/B experiments for academic search.

Methods: The STELLA evaluation infrastructure provides an online, reproducible environment allowing developers and researchers to work together to produce and evaluate new retrieval and recommendation approaches for existing IR systems. STELLA integrates with a production system, allows experimental systems to run alongside the production one, and evaluates the performance of those experimental systems using real-time interaction data from the system's regular users. The production system acts as a baseline that experimental systems try to outperform. Our experimental setup uses interleaving, i.e., it combines results from experimental systems with those from the corresponding baseline systems. STELLA gathers information on user interactions and provides statistics useful to developers and researchers. The STELLA architecture (Figure 1 [Fig. 1]) comprises three main elements: (i) micro-services corresponding to experimental systems, (ii) a multi-container application (MCA) bundling together all the participant experimental systems, and (iii) a central server to manage participant and production systems, and to provide feedback.
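The interleaving step described above can be illustrated with a short sketch. This is a minimal, hypothetical implementation of team-draft-style interleaving (one common interleaving scheme; the abstract does not specify which variant STELLA uses): the baseline and an experimental ranking alternately contribute their next unseen document, and the team labels allow later clicks to be credited to the system that contributed the clicked item.

```python
import random

def team_draft_interleave(baseline, experimental, seed=0):
    """Interleave two ranked lists of document IDs (team-draft style).

    Returns the interleaved ranking and, for each position, the team
    ("A" = baseline, "B" = experimental) that contributed the document,
    so user clicks can be attributed to one of the two systems.
    """
    rng = random.Random(seed)
    interleaved, teams, seen = [], [], set()
    i = j = 0
    while i < len(baseline) or j < len(experimental):
        # In each round, a coin flip decides which system picks first.
        for team in rng.sample(["A", "B"], 2):
            if team == "A":
                # Skip documents already placed by the other system.
                while i < len(baseline) and baseline[i] in seen:
                    i += 1
                if i < len(baseline):
                    interleaved.append(baseline[i])
                    teams.append("A")
                    seen.add(baseline[i])
                    i += 1
            else:
                while j < len(experimental) and experimental[j] in seen:
                    j += 1
                if j < len(experimental):
                    interleaved.append(experimental[j])
                    teams.append("B")
                    seen.add(experimental[j])
                    j += 1
    return interleaved, teams
```

In a living-lab setting, the interleaved list is what the real user sees; comparing how often clicks land on team A versus team B documents gives a per-session preference signal between the baseline and the experimental system.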

Results: STELLA was the technological and methodological foundation of the CLEF 2021 Living Labs for Academic Search (LiLAS) lab. LiLAS aimed to strengthen the concept of user-centric living labs for academic search with two separate evaluation rounds of 4 weeks each. LiLAS integrated STELLA into two academic search systems: LIVIVO (for the task of ranking documents with respect to a head query) and GESIS Search (for the task of ranking datasets with respect to a reference document). We evaluated nine experimental systems contributed by three participating groups. Overall, we consider our lab a successful advancement over previous living lab experiments. We were able to exemplify the benefits of fully dockerized systems delivering results for arbitrary queries on-the-fly. Furthermore, we could confirm several previous findings, for instance the power laws underlying the click distributions.

Keywords: living labs, information retrieval, recommendation systems, reproducibility, evaluation framework