SIGIR 2025 Proceedings


SIGIR '25: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval




SESSION: Keynotes

Please meet AI, Our Dear New Colleague. In other Words: Can Scientists and Machines Truly Cooperate?

  • Iryna Gurevych

How can AI and LLMs facilitate the work of scientists at different stages of the research process? Can technology even make scientists obsolete? AI and Large Language Models (LLMs) have recently taken on a rapidly growing role in science as a target application domain. This includes assessing the impact of scientific work, facilitating the writing and revising of manuscripts, and providing intelligent support for manuscript quality assessment, peer review, and scientific discussion. The talk will illustrate such methods and models using several tasks from the scientific domain. We argue that while AI and LLMs can effectively support and augment specific steps of the research process, expert-AI collaboration may be a more promising mode for complex research tasks.

SESSION: Conversational IR and Intelligent Agents

MSCRS: Multi-modal Semantic Graph Prompt Learning Framework for Conversational Recommender Systems

  • Yibiao Wei
  • Jie Zou
  • Weikang Guo
  • Guoqing Wang
  • Xing Xu
  • Yang Yang

Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by interacting with users through conversations. Most existing CRS studies focus on extracting user preferences from conversational contexts. However, due to the short and sparse nature of conversational contexts, it is difficult to fully capture user preferences from conversational contexts alone. We argue that multi-modal semantic information can enrich user preference expressions from diverse dimensions (e.g., a user's preference for a certain movie may stem from its magnificent visual effects and compelling storyline). In this paper, we propose a multi-modal semantic graph prompt learning framework for CRS, named MSCRS. First, we extract textual and image features of items mentioned in the conversational contexts. Second, we capture higher-order semantic associations within different semantic modalities (collaborative, textual, and image) by constructing modality-specific graph structures. Finally, we propose an innovative integration of multi-modal semantic graphs with prompt learning, harnessing the power of large language models to comprehensively explore high-dimensional semantic relationships. Experimental results demonstrate that our proposed method significantly improves accuracy in item recommendation and generates more natural and contextually relevant content in response generation. Code and extended multi-modal CRS datasets are available at https://github.com/BIAOBIAO12138/MSCRS-main.

SESSION: Benchmarks and Datasets

OmniNER2025: Diverse and Comprehensive Fine-Grained NER Dataset and Benchmark for Chinese

  • Yong Zhou
  • Shuaipeng Liu
  • Yunqing Li
  • Mengting Hu
  • Wen Dai
  • Xiaowei Zhao
  • Xiujuan Xu

As Named Entity Recognition (NER) tasks have evolved, artificial intelligence has been widely applied in this field. However, most benchmarks are limited to English, making it challenging to replicate successful experiences in other languages. To expand NER to informal and diverse Chinese text scenarios, we propose a new large-scale Chinese NER dataset, OmniNER2025. The dataset, collected from user posts on Xiaohongshu, a popular Chinese social media platform, contains 195,568 samples and 89 categories, all manually annotated. To our knowledge, it is currently the largest Chinese open-source NER dataset in terms of sample size, category diversity, and domain coverage. It is more challenging than existing Chinese NER datasets and better reflects real-world applications. The large sample size and diverse entity types provide valuable research resources. Additionally, we introduce the ERRTA tool for error analysis and teacher-model guidance, significantly reducing model errors and improving performance. In the future, we will refine the ERRTA framework and explore optimization strategies to enhance the practical value of NER models. By releasing the OmniNER2025 dataset and introducing the ERRTA tool, we advance fine-grained NER research and improve model performance, promoting its application and development in real-world scenarios.

SESSION: Evaluation

Preference-Strength-Aware Self-Improving Alignment with Generative Preference Models

  • Yuanzhao Zhai
  • Zhuo Zhang
  • Cheng Yang
  • Kele Xu
  • Yue Yu
  • Wei Li
  • Hui Wang
  • Zenglin Xu
  • Dawei Feng
  • Bo Ding
  • Huaimin Wang

Self-improving alignment leveraging large language models (LLMs) to automatically generate synthetic preference data has garnered significant attention as a means of reducing reliance on human labelers. These methods typically employ the LLM-as-a-judge mechanism, where the LLM generates responses and then uses itself to judge which response best aligns with the given prompt, curating a binary self-preferred dataset. However, these methods encounter two major challenges: (1) LLM-as-a-judge often produces error-prone evaluations, resulting in low-quality preference annotation, and (2) their optimization strategies often overlook the strength of preferences within binary pairs, leading to overfitting. This paper proposes a novel method, Preference-Strength-aware Optimization (PSO), to address these issues. Specifically, PSO frames the preference annotation process as a judgment token prediction task, using the generative preference model to produce reliable judgments. The predicted judgment token indicates the preferred response, and its corresponding probability reflects the disparity between responses, referred to as preference strength. Based on this strength, we introduce a new preference-strength-aware loss to adaptively reweight the impact of different response pairs on optimization, concentrating the model's learning on high-quality response pairs. Our experiments demonstrate that PSO significantly improves performance on preference benchmarks, achieving stronger alignment with human preferences, reducing verbose responses, and mitigating overfitting. Furthermore, PSO exhibits robust generalization and sample efficiency, offering a scalable and promising solution for LLM alignment without relying on human-annotated preferences.
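
The abstract does not give the loss in closed form; the following is a minimal sketch of a preference-strength-aware reweighting of a standard pairwise (DPO-style) loss, where the weighting scheme and the mapping from judgment-token probability to strength are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def strength_weighted_pairwise_loss(logp_chosen, logp_rejected, strength):
        """Pairwise preference loss reweighted by preference strength.

        logp_chosen / logp_rejected: log-probabilities the policy assigns to
        the preferred / dispreferred response (shape: [batch]).
        strength: probability the generative preference model assigns to its
        judgment token, in [0.5, 1]; values near 0.5 indicate a near-tie.
        """
        per_pair = -F.logsigmoid(logp_chosen - logp_rejected)
        weight = 2.0 * strength - 1.0  # map [0.5, 1] onto [0, 1]
        return (weight * per_pair).mean()  # near-ties contribute little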

SESSION: Domain-specific Applications 1

Open-World Fine-Grained Fashion Retrieval with LLM-based Commonsense Knowledge Infusion

  • Jianfeng Dong
  • Junwei Zhu
  • Daizong Liu
  • Xiaoye Qu
  • Cuizhu Bao
  • Zhike Han
  • Jixiang Zhu
  • Xun Wang

Attribute-Specific Fashion Retrieval (ASFR) focuses on retrieving images based on fine-grained, attribute-specific criteria rather than naive global visual similarity, enabling more precise and interpretable search results. Existing ASFR methods ideally assume that all attribute semantics are in-domain distributions of the training datasets. However, realistic scenarios are generally more complex and naturally contain unseen attribute information, often resulting in ungeneralizable retrieval outcomes. In this paper, we take the first step to address the new and challenging open-world ASFR setting, which involves handling diverse and practical attributes instead of relying solely on predefined attribute sets in closed-world scenarios. Specifically, to comprehend unseen attributes, we propose a novel LLM-based Commonsense Knowledge Infusion (CoKi) framework that integrates commonsense knowledge as complementary context into attribute representations using a Large Language Model (LLM). By infusing such LLM-based commonsense knowledge through descriptive contexts, our method enables robust semantic enrichment and effective generalization to unseen attributes. Additionally, we introduce a modality-switchable prompt and an imputation mechanism to ensure model robustness across diverse input configurations by dynamically adapting to missing modalities. Extensive experiments demonstrate that our approach not only achieves state-of-the-art in-domain retrieval performance but also significantly enhances adaptability to unseen attributes and cross-domain generalization, establishing a new benchmark for fine-grained fashion retrieval in open-world scenarios. Our source code is publicly available at https://github.com/HuiGuanLab/CoKi.

SESSION: Domain-specific Applications 2

SESSION: FATE 1

PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage

  • Wenyi Zhang
  • Ju Jia
  • Xiaojun Jia
  • Yihao Huang
  • Xinfeng Li
  • Cong Wu
  • Lina Wang

Multimodal datasets can be leveraged to pre-train large-scale vision-language models by providing cross-modal semantics. Current endeavors for determining the usage of datasets mainly focus on single-modal dataset ownership verification through intrusive methods and non-intrusive techniques, while cross-modal approaches remain under-explored. Intrusive methods can adapt to multimodal datasets but degrade model accuracy, while non-intrusive methods rely on label-driven decision boundaries that fail to guarantee stable behaviors for verification. To address these issues, we propose a novel prompt-adapted transferable fingerprinting scheme from a training-free perspective, called PATFinger, which incorporates the global optimal perturbation (GOP) and adaptive prompts to capture dataset-specific distribution characteristics. Our scheme utilizes inherent dataset attributes as fingerprints instead of compelling the model to learn triggers. The GOP is derived from the sample distribution to maximize embedding drifts between different modalities. Subsequently, our PATFinger re-aligns the adaptive prompt with GOP samples to capture the cross-modal interactions on the carefully crafted surrogate model. This allows the dataset owner to check the usage of datasets by observing specific prediction behaviors linked to the PATFinger during retrieval queries. Extensive experiments demonstrate the effectiveness of our scheme in detecting unauthorized multimodal dataset usage across various cross-modal retrieval architectures, outperforming state-of-the-art baselines by 30%.

Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks

  • Shengyao Zhuang
  • Ekaterina Khramtsova
  • Xueguang Ma
  • Bevan Koopman
  • Jimmy Lin
  • Guido Zuccon

Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods. In this study, we propose three pixel poisoning attack methods designed to compromise VLM-based retrievers and evaluate their effectiveness under various attack settings and parameter configurations. Our empirical results demonstrate that injecting even a single adversarial screenshot into the retrieval corpus can significantly disrupt search results, poisoning the top-10 retrieved documents for 41.9% of queries in the case of DSE and 26.4% for ColPali. These vulnerability rates notably exceed those observed with equivalent attacks on text-only retrievers. Moreover, when targeting a small set of known queries, the attack success rate raises, achieving complete success in certain cases. By exposing the vulnerabilities inherent in vision-language models, this work highlights the potential risks associated with their deployment.
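
As an illustration of the attack family (a projected-gradient sketch, not the paper's three methods), an adversarial screenshot can be crafted by perturbing pixels so the document embedding drifts toward a set of target query embeddings; the encoder interface and hyperparameters below are assumptions.

    import torch

    def poison_screenshot(encoder, image, query_embs, steps=100, eps=8/255, lr=1/255):
        """PGD-style sketch: perturb a screenshot so its embedding moves
        toward target query embeddings (encoder params assumed frozen)."""
        delta = torch.zeros_like(image, requires_grad=True)
        for _ in range(steps):
            emb = encoder((image + delta).clamp(0, 1))
            loss = -(emb @ query_embs.T).mean()  # maximize query similarity
            loss.backward()
            with torch.no_grad():
                delta -= lr * delta.grad.sign()  # signed gradient step
                delta.clamp_(-eps, eps)          # keep perturbation small
                delta.grad.zero_()
        return (image + delta).clamp(0, 1).detach()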

SESSION: FATE 2

Query Smarter, Trust Better? Exploring Search Behaviours for Verifying News Accuracy

  • David Elsweiler
  • Samy Ateia
  • Markus Bink
  • Gregor Donabauer
  • Marcos Fernández Pichel
  • Alexander Frummet
  • Udo Kruschwitz
  • David E. Losada
  • Bernd Ludwig
  • Selina Meyer
  • Noel Pascual Presa

While it is often assumed that searching for information to evaluate misinformation will help identify false claims, recent work suggests that search behaviours can instead reinforce belief in misleading news, particularly when users generate queries using vocabulary from the source articles. Our research explores how different query generation strategies affect news verification and whether the way people search influences the accuracy of their information evaluation. A mixed-methods approach was used, consisting of three parts: (1) an analysis of existing data to understand how search behaviour influences trust in fake news, (2) a simulation of query generation strategies using a Large Language Model (LLM) to assess the impact of different query formulations on search result quality, and (3) a user study to examine how 'Boost' interventions in interface design can guide users to adopt more effective query strategies. The results show that search behaviour significantly affects trust in news, with successful searches involving multiple queries and yielding higher-quality results. Queries inspired by different parts of a news article produced search results of varying quality, and weak initial queries improved when reformulated using full SERP information. Although 'Boost' interventions had limited impact, the study suggests that interface design encouraging users to thoroughly review search results can enhance query formulation. This study highlights the importance of query strategies in evaluating news and proposes that interface design can play a key role in promoting more effective search practices, serving as one component of a broader set of interventions to combat misinformation.

SESSION: FATE 3

Measuring Text-Image Retrieval Fairness with Synthetic Data

  • Lluis Gomez

In this paper, we study social bias in cross-modal text-image retrieval systems, focusing on the interaction between textual queries and image responses. Despite the significant advancements in cross-modal retrieval models, the potential for social bias in their responses remains a pressing concern, necessitating a comprehensive framework for assessment and mitigation. We introduce a novel framework for evaluating social bias in cross-modal retrieval systems, leveraging a new dataset and appropriate metrics specifically designed for this purpose. Our dataset, Social Inclusive Synthetic Professionals Images (SISPI), comprises 49K images generated using state-of-the-art text-to-image models, ensuring a balanced representation of demographic groups across various professional roles. We use this dataset to conduct an extensive analysis of social bias (gender and ethnic) in state-of-the-art cross-modal retrieval models, including CLIP, ALIGN, BLIP, FLAVA, COCA, and many others. Using diversity metrics, grounded in the distribution of different demographic groups' images in the retrieval rankings, we provide a quantitative measure of fairness, facilitating a detailed analysis of models' behavior. Our work sheds light on biases present in current cross-modal retrieval systems and emphasizes the importance of training data curation, providing a foundation for future research and development towards more equitable and unbiased models. The dataset and code of our framework are publicly available at https://sispi-benchmark.github.io/sispi-benchmark/.
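
The paper's diversity metrics are grounded in the distribution of demographic groups across retrieval rankings; one simple metric in that spirit (an illustration, not necessarily one of the paper's exact metrics) is the normalized entropy of group counts in the top-k results:

    import math
    from collections import Counter

    def topk_group_entropy(ranked_groups, k, groups):
        """Normalized entropy of demographic-group counts in the top-k.

        ranked_groups: group label of each retrieved image, best first.
        groups: the full set of possible group labels.
        Returns 1.0 for a perfectly balanced top-k, 0.0 for a single group.
        """
        counts = Counter(ranked_groups[:k])
        total = sum(counts.values())
        probs = [counts.get(g, 0) / total for g in groups]
        h = -sum(p * math.log(p) for p in probs if p > 0)
        return h / math.log(len(groups))

    # e.g. topk_group_entropy(["f", "m", "f", "m"], k=4, groups=["f", "m"]) == 1.0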

SESSION: Humans and Interfaces

SESSION: Machine Learning 1

PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization

  • Yang Jiao
  • Xiaodong Wang
  • Kai Yang

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications, e.g., medical question-answering, mathematical sciences, and code generation. However, they also exhibit inherent limitations, such as outdated knowledge and susceptibility to hallucinations. Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to address these issues, but it also introduces new vulnerabilities. Recent efforts have focused on the security of RAG-based LLMs, yet existing attack methods face three critical challenges: (1) their effectiveness declines sharply when only a limited number of poisoned texts can be injected into the knowledge database, (2) they lack sufficient stealth, as the attacks are often detectable by anomaly detection systems, which compromises their effectiveness, and (3) they rely on heuristic approaches to generate poisoned texts, lacking formal optimization frameworks and theoretical guarantees, which limits their effectiveness and applicability. To address these issues, we propose the coordinated Prompt-RAG Attack (PR-Attack), a novel optimization-driven attack that introduces a small number of poisoned texts into the knowledge database while embedding a backdoor trigger within the prompt. When activated, the trigger causes the LLM to generate pre-designed responses to targeted queries, while maintaining normal behavior in other contexts. This ensures both high effectiveness and stealth. We formulate the attack generation process as a bilevel optimization problem leveraging a principled optimization framework to develop optimal poisoned texts and triggers. Extensive experiments across diverse LLMs and datasets demonstrate the effectiveness of PR-Attack, achieving a high attack success rate even with a limited number of poisoned texts and significantly improved stealth compared to existing methods. These results highlight the potential risks posed by PR-Attack and emphasize the importance of securing RAG-based LLMs against such threats.

SESSION: Machine Learning 2

SESSION: Image Retrieval

SESSION: Video Retrieval

Queries Are Not Alone: Clustering Text Embeddings for Video Search

  • Peiyang Liu
  • Xi Wang
  • Ziqiang Cui
  • Wei Ye

The rapid proliferation of video content across various platforms has highlighted the urgent need for advanced video retrieval systems. Traditional methods, which primarily depend on directly matching textual queries with video metadata, often fail to bridge the semantic gap between text descriptions and the multifaceted nature of video content. This paper introduces a novel framework, the Video-Text Cluster (VTC), which enhances video retrieval by clustering text queries to capture a broader semantic scope. We propose a unique clustering mechanism that groups related queries, enabling our system to consider multiple interpretations and nuances of each query. This clustering is further refined by our innovative Sweeper module, which identifies and mitigates noise within these clusters. Additionally, we introduce the Video-Text Cluster-Attention (VTC-Att) mechanism, which dynamically adjusts focus within the clusters based on the video content, ensuring that the retrieval process emphasizes the most relevant textual features. Extensive experiments demonstrate that our proposed model surpasses existing state-of-the-art models on five public datasets.

SESSION: Multi-modal Retrieval

Meta-Guided Adaptive Weight Learner for Noisy Correspondence

  • Chenyu Mu
  • Erkun Yang
  • Cheng Deng

Cross-modal retrieval with noisy correspondences is a critical challenge, especially when data annotations for large-scale multimodal datasets are prone to systematic corruption. To mitigate the impact of noise, many existing methods rely on small-loss sample selection to filter out clean samples. However, these methods inevitably result in the inclusion of false positives, which significantly degrade performance. To tackle this issue, we propose a novel method, named the Meta Similarity Importance Assignment Network (MSIAN), to achieve robust cross-modal retrieval. MSIAN employs a meta-learning strategy to dynamically learn the importance of each sample through a two-level optimization process. By adaptively guiding the learning process, MSIAN adjusts the importance weight of each sample based on its inherent trustworthiness. This iterative mechanism progressively shifts the network's focus to the most reliable data points, amplifying the impact of credible samples while diminishing the adaptive weight of noisy ones. Furthermore, MSIAN dynamically adapts the soft margin of each sample through continuously updated adaptive weights, thereby improving the robustness of the model. Extensive experiments on three widely used datasets, including Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the effectiveness of our approach in improving cross-modal retrieval performance.

Multi-level Encoding with Hierarchical Alignment for Sketch-Based 3D Shape Retrieval

  • Donglin Zhang
  • Changxing Li
  • Xiao-Jun Wu

Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes using hand-drawn sketches as query inputs. Although existing SBSR methods have achieved promising results, several challenges still require further investigation. First, most existing approaches usually leverage simple aggregation schemes, often failing to capture the intrinsic relationships between views, which limits the effectiveness of 3D shape feature extraction. Second, conventional SBSR primarily focuses on instance-level alignment while ignoring multi-level alignment, which may neglect complex hierarchical relationships. To address these limitations, we propose a novel Multi-level Encoding with Hierarchical Alignment (MEHA) method for SBSR. Specifically, we adopt spatial encoding and view encoding for multiple views of 3D shapes. The proposed aggregation scheme then integrates these multi-level embedded local features to enhance the representation of 3D shape features. Considering the complexity of 3D shapes, MEHA adopts a two-stage training process: the first stage focuses on learning 3D shape features, while the second stage emphasizes modality alignment. Furthermore, we introduce a hierarchical alignment strategy that bridges the modality gap through instance-level, prototype-level, and centre-level alignment. Extensive experiments on two public benchmark datasets demonstrate the superiority of our method, showing that MEHA outperforms the state-of-the-art baselines.

SESSION: Biomedical and Health

ProtChatGPT: Towards Understanding Proteins with Hybrid Representation and Large Language Models

  • Chao Wang
  • Hehe Fan
  • Ruijie Quan
  • Lina Yao
  • Yi Yang

Protein research is crucial in various scientific disciplines, but understanding their intricate structure-function relationships remains challenging. Recent advancements in Large Language Models (LLMs) have significantly improved the comprehension of task-specific knowledge, suggesting the potential for specialized ChatGPT-like systems in protein research to aid fundamental investigations. In this work, we introduce ProtChatGPT, which aims to learn and understand protein structures using natural language. ProtChatGPT enables users to upload proteins, ask questions, and engage in interactive conversations to produce comprehensive answers. The system comprises multi-level protein encoding, protein-language alignment, and instruction tuning of LLMs. A protein first passes through multiple protein encoders and a PLP-former to produce multi-level hybrid protein embeddings, which are then aligned through a Protein Context Gating (PCG) module with contrastive learning, and projected by an adapter to conform with the LLM. The LLM finally combines user questions with projected protein embeddings to generate informative answers. Experiments show that ProtChatGPT can produce promising responses to proteins and the corresponding user questions. We hope that ProtChatGPT could form the basis for further exploration and application in protein research. Code and our pre-trained model will be publicly available.

SESSION: Question Answering

OBELLA: Open the Book for Evaluating Long-Form Large Language Model Answers in Open-Domain Question Answering

  • Tianyu Ren
  • Zhaoyu Zhang
  • Hui Wang
  • Karen Rafferty

Reliable factuality evaluation is critical for the iterative development of open-domain question answering (ODQA) systems, especially given the rise of large language models (LLMs) and their propensity for hallucination. However, state-of-the-art (SOTA) automatic metrics, which are mostly supervised, remain notably less reliable than humans. In this paper, we identify two key challenges behind this gap: (1) length distribution mismatch between lengthy LLM answers and the shorter training answers used by current metrics; and (2) reference incompleteness, where current metrics often misjudge valid system answers absent from the given references, a challenge worsened by the diversity of LLM outputs. To address these issues, we present a new ODQA factuality evaluation dataset called OBELLA (Open-Book Evaluation for Long-form LLM Answers). OBELLA narrows the length distribution mismatch by significantly increasing the candidate answer length to align with LLM outputs. Moreover, it introduces a neutral class for plausible yet under-supported candidate answers to differentiate reference incompleteness from outright incorrectness, thus enabling flexible reevaluation by consulting external knowledge for more references. Based on OBELLA, we propose a novel metric named OBELLAM (OBELLA Metric). OBELLAM integrates a cross-attention mechanism to enhance long-form candidate answer representations and employs a dynamic closed-open book evaluation strategy to tackle reference incompleteness. Our OBELLAM sets a new SOTA in aligning with human judgments across two ODQA evaluation benchmarks, marking a promising step toward more robust ODQA factuality evaluation.

Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models

  • Hosein Azarbonyad
  • Zi Long Zhu
  • Georgios Cheirmpos
  • Zubair Afzal
  • Vikrant Yadav
  • Georgios Tsatsaronis

When deciding to read an article or incorporate it into their research, scholars often seek to quickly identify and understand its main ideas. In this paper, we aim to extract these key concepts and contributions from scientific articles in the form of Question and Answer (QA) pairs. We propose two distinct approaches for generating QAs. The first approach involves selecting salient paragraphs, using a Large Language Model (LLM) to generate questions, ranking these questions by the likelihood of obtaining meaningful answers, and subsequently generating answers. This method relies exclusively on the content of the articles. However, assessing an article's novelty typically requires comparison with the existing literature. Therefore, our second approach leverages a Knowledge Graph (KG) for QA generation. We construct a KG by fine-tuning an Entity Relationship (ER) extraction model on scientific articles and using it to build the graph. We then employ a salient triplet extraction method to select the most pertinent ERs per article, utilizing metrics such as the centrality of entities based on a triplet TF-IDF-like measure. This measure assesses the saliency of a triplet based on its importance within the article compared to its prevalence in the literature. For evaluation, we generate QAs using both approaches and have them assessed by Subject Matter Experts (SMEs) through a set of predefined metrics to evaluate the quality of both questions and answers. Our evaluations demonstrate that the KG-based approach effectively captures the main ideas discussed in the articles. Furthermore, our findings indicate that fine-tuning the ER extraction model on our scientific corpus is crucial for extracting high-quality triplets from such documents.
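
The "triplet TF-IDF-like measure" weighs a triplet's frequency within the article against its prevalence in the literature; a minimal sketch of one such score, under our reading of the description (the inputs and exact weighting are assumptions):

    import math

    def triplet_saliency(triplet, article_triplets, corpus_doc_freq, n_docs):
        """TF-IDF-style saliency of an (entity, relation, entity) triplet.

        article_triplets: all triplets extracted from the current article.
        corpus_doc_freq: number of articles in the literature that contain
        this triplet. High saliency = frequent here, rare elsewhere.
        """
        tf = article_triplets.count(triplet) / max(len(article_triplets), 1)
        idf = math.log(n_docs / (1 + corpus_doc_freq))
        return tf * idf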

SESSION: Knowledge and Knowledge Graphs

Mitigating Modality Bias in Multi-modal Entity Alignment from a Causal Perspective

  • Taoyu Su
  • Jiawei Sheng
  • Duohe Ma
  • Xiaodong Li
  • Juwei Yue
  • Mengxiao Song
  • Yingkai Tang
  • Tingwen Liu

Multi-Modal Entity Alignment (MMEA) aims to retrieve equivalent entities from different Multi-Modal Knowledge Graphs (MMKGs), a critical information retrieval task. Existing studies have explored various fusion paradigms and consistency constraints to improve the alignment of equivalent entities, while overlooking that the visual modality may not always contribute positively. Empirically, entities with low-similarity images usually generate unsatisfactory performance, highlighting the limitation of overly relying on visual features. We believe the model can be biased toward the visual modality, leading to a shortcut image-matching task. To address this, we propose a counterfactual debiasing framework for MMEA, termed CDMEA, which investigates visual modality bias from a causal perspective. Our approach aims to leverage both visual and graph modalities to enhance MMEA while suppressing the direct causal effect of the visual modality on model predictions. By estimating the Total Effect (TE) of both modalities and excluding the Natural Direct Effect (NDE) of the visual modality, we ensure that the model predicts based on the Total Indirect Effect (TIE), effectively utilizing both modalities and reducing visual modality bias. Extensive experiments on 9 benchmark datasets show that CDMEA outperforms 14 state-of-the-art methods, especially in low-similarity, high-noise, and low-resource data scenarios.
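
In the standard counterfactual notation used by this line of work (the symbols here are illustrative, not necessarily the paper's: v is the visual input, g the graph/fused path, and starred symbols the counterfactual reference values), the quantities in the abstract decompose as:

    \mathrm{TE} = Y_{v,g} - Y_{v^{*},g^{*}}, \qquad
    \mathrm{NDE} = Y_{v,g^{*}} - Y_{v^{*},g^{*}}, \qquad
    \mathrm{TIE} = \mathrm{TE} - \mathrm{NDE} = Y_{v,g} - Y_{v,g^{*}}

Ranking candidate entities by TIE therefore removes the direct, potentially biased contribution of the visual modality while retaining its contribution through the fused path.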

SESSION: Natural Language Processing 1

Predicting RAG Performance for Text Completion

  • Oz Huly
  • David Carmel
  • Oren Kurland

We address the challenge of predicting the performance of using retrieval augmented generation (RAG) in large language models (LLMs) for the task of text completion; specifically, we predict the perplexity gain attained by applying RAG. We present novel supervised post-retrieval prediction methods that utilize the specific characteristics of the text completion setting. Our predictors substantially outperform a wide variety of prediction methods originally proposed for ad hoc document retrieval. We then show that integrating our post-retrieval predictors with recently proposed post-generation predictors - i.e., those analyzing the next-token distribution - is of much merit: the resultant prediction quality is statistically significantly better than that of using the post-generation predictors alone. Finally, we show that our post-retrieval predictors are as effective as post-generation predictors for selective application of RAG. This finding is of utmost importance in terms of efficiency of selective RAG.
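
For concreteness, one plausible formalization of the predicted quantity, assuming the usual definition of perplexity (the paper's exact definition of perplexity gain may differ):

    import math

    def perplexity(log_probs):
        """Perplexity from per-token log-probabilities (natural log)."""
        return math.exp(-sum(log_probs) / len(log_probs))

    def rag_perplexity_gain(log_probs_no_rag, log_probs_rag):
        """Relative drop in perplexity on the completion when the prompt
        is augmented with retrieved text; positive values mean RAG helped."""
        p0 = perplexity(log_probs_no_rag)
        p1 = perplexity(log_probs_rag)
        return (p0 - p1) / p0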

SESSION: Natural Language Processing 2

SESSION: Natural Language Processing 3

Optimizing Tail-Head Trade-off for Extreme Multi-Label Text Classification (XMTC) with RAG-Labels and a Dynamic Two-Stage Retrieval and Fusion Pipeline

  • Celso França
  • Gestefane Rabbi
  • Thiago Salles
  • Washington Cunha
  • Leonardo Rocha
  • Marcos André Gonçalves

We tackle Extreme Multi-Label Text Classification (XMTC), which involves assigning relevant labels to texts from a huge label space. Attempting to optimize the underexplored tail-head trade-off, we address the XMTC task through its core challenges of volume, skewness, and quality by proposing xCoRetriev, a novel two-stage retrieving and fusing ranking pipeline. Our pipeline addresses the volume challenge by dynamically slicing the large label space; it also tackles the skewness challenge by favoring the tail labels while fusing sparse and dense retrievers. Finally, xCoRetriev addresses the quality challenge by enhancing the label space with Retrieval-Augmented Generated (RAG)-labels. Our experiments with four XMTC benchmarks with hundreds of thousands of text documents and labels against six state-of-the-art XMTC baselines demonstrate xCoRetriev's strengths in terms of: (i) scalability for large label spaces, being among the most efficient methods at training and prediction; (ii) effectiveness in the face of high skewness, with gains of up to 48% in propensity-scored metrics against the best state-of-the-art baselines; and (iii) capability of handling very noisy datasets by exploiting RAG-labels.

SESSION: RecSys: Sequential 1

Pre-train, Align, and Disentangle: Empowering Sequential Recommendation with Large Language Models

  • Yuhao Wang
  • Junwei Pan
  • Pengyue Jia
  • Wanyu Wang
  • Maolin Wang
  • Zhixiang Feng
  • Xiaotian Li
  • Jie Jiang
  • Xiangyu Zhao

Sequential Recommendation (SR) aims to leverage the sequential patterns in users' historical interactions to accurately track their preferences. However, the primary reliance of existing SR methods on collaborative data results in challenges such as the cold-start problem and sub-optimal performance. Concurrently, despite the proven effectiveness of large language models (LLMs), their integration into commercial recommender systems is impeded by issues such as high inference latency, incomplete capture of all distribution statistics, and catastrophic forgetting. To address these issues, we introduce a novel Pre-train, Align, and Disentangle (PAD) framework to enhance SR models with LLMs. In particular, we initially pre-train both the SR and LLM models to obtain collaborative and textual embeddings. Subsequently, we propose a characteristic recommendation-anchored alignment loss using multi-kernel maximum mean discrepancy with Gaussian kernels. Lastly, a triple-experts architecture, comprising aligned and modality-specific experts with disentangled embeddings, is fine-tuned in a frequency-aware manner. Experimental results on three public datasets validate the efficacy of PAD, indicating substantial enhancements and compatibility with various SR backbone models, particularly for cold items. The code and datasets are accessible for reproduction: https://github.com/Applied-Machine-Learning-Lab/PAD.
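
The alignment loss is described as a multi-kernel maximum mean discrepancy (MMD) with Gaussian kernels; below is a minimal sketch of that statistic (a simple biased estimator with an unweighted sum of kernels and illustrative bandwidths; PAD's recommendation-anchored variant adds more than this).

    import torch

    def multi_kernel_mmd(x, y, bandwidths=(1.0, 2.0, 4.0, 8.0)):
        """Squared MMD between two embedding batches under a sum of
        Gaussian kernels (biased V-statistic estimator).

        x: collaborative embeddings [n, d]; y: textual embeddings [m, d].
        Minimizing this pulls the two embedding distributions together.
        """
        def gram(a, b):
            d2 = torch.cdist(a, b) ** 2  # pairwise squared distances
            return sum(torch.exp(-d2 / (2 * s ** 2)) for s in bandwidths)
        return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()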

SESSION: RecSys: Sequential 2

Multi-Grained Patch Training for Efficient LLM-based Recommendation

  • Jiayi Liao
  • Ruobing Xie
  • Sihang Li
  • Xiang Wang
  • Xingwu Sun
  • Zhanhui Kang
  • Xiangnan He

Large Language Models (LLMs) have emerged as a new paradigm for recommendation by converting interacted item history into language modeling. However, constrained by the limited context length of LLMs, existing approaches have to truncate item history in the prompt, focusing only on recent interactions and sacrificing the ability to model long-term history. To enable LLMs to model long histories, we pursue a concise embedding representation for items and sessions. In the LLM embedding space, we construct an item's embedding by aggregating its textual token embeddings; similarly, we construct a session's embedding by aggregating its item embeddings. While efficient, this way poses two challenges since it ignores the temporal significance of user interactions and LLMs do not natively interpret our custom embeddings. To overcome these, we propose PatchRec, a multi-grained patch training method consisting of two stages: (1) Patch Pre-training, which familiarizes LLMs with aggregated embeddings -- patches, and (2) Patch Fine-tuning, which enables LLMs to capture time-aware significance in interaction history. Extensive experiments show that PatchRec effectively models longer behavior histories with improved efficiency. This work facilitates the practical use of LLMs for modeling long behavior histories.
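
A minimal sketch of the aggregation idea, assuming mean pooling (the abstract says embeddings are "aggregated" without fixing the operator):

    import torch

    def item_patch(token_embs):
        """Compress an item's textual token embeddings [t, d] into one patch [d]."""
        return token_embs.mean(dim=0)

    def session_patch(item_patches):
        """Compress a session's stacked item patches [n, d] into one patch [d]."""
        return item_patches.mean(dim=0)

    # A long history then occupies roughly one position per session patch
    # instead of dozens of text tokens, freeing LLM context for more
    # interactions; the two training stages teach the LLM to read patches.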

Multi-Modal Multi-Behavior Sequential Recommendation with Conditional Diffusion-Based Feature Denoising

  • Xiaoxi Cui
  • Weihai Lu
  • Yu Tong
  • Yiheng Li
  • Zhejun Zhao

Sequential recommendation systems utilize historical user interactions to predict user preferences. Effectively integrating diverse user behavior patterns with rich multimodal information of items to enhance the accuracy of sequential recommendations is an emerging and challenging research direction. This paper focuses on the problem of multi-modal multi-behavior sequential recommendation, aiming to address the following challenges: (1) the lack of effective characterization of modal preferences across different behaviors, as user attention to different item modalities varies depending on the behavior; (2) the difficulty of effectively mitigating implicit noise in user behavior, such as unintended actions like accidental clicks; (3) the inability to handle modality noise in multi-modal representations, which further impacts the accurate modeling of user preferences. To tackle these issues, we propose a novel Multi-Modal Multi-Behavior Sequential Recommendation model (M3BSR). This model first removes noise in multi-modal representations using a Conditional Diffusion Modality Denoising Layer. Subsequently, it utilizes deep behavioral information to guide the denoising of shallow behavioral data, thereby alleviating the impact of noise in implicit feedback through Conditional Diffusion Behavior Denoising. Finally, by introducing a Multi-Expert Interest Extraction Layer, M3BSR explicitly models the common and specific interests across behaviors and modalities to enhance recommendation performance. Experimental results indicate that M3BSR significantly outperforms existing state-of-the-art methods on benchmark datasets.

Mitigating Distribution Shifts in Sequential Recommendation: An Invariance Perspective

  • Yuxin Liao
  • Yonghui Yang
  • Min Hou
  • Le Wu
  • Hefei Xu
  • Hao Liu

Sequential recommendation aims to learn users' dynamic preferences from their historical interactions and predict the next item they are most likely to engage with. In real-world scenarios, time-varying factors (e.g., product promotions, seasonal changes) induce distribution shifts in user interactions. Despite the demonstrated success of existing models, their generalization capability remains limited under such dynamic conditions. Current methods tackle this challenge by leveraging distributionally robust optimization (DRO) to optimize the "worst-case" loss or by employing manually designed data augmentation to enrich the training distribution. Despite their effectiveness, DRO-based approaches are inherently constrained by the sparsity of training data, limiting the range of distributions they can model, while manually designed augmentations risk introducing noise or irrelevant information that could distort user preference learning. Furthermore, these methods often overlook the sensitivity of user interactions to distribution shifts, which is essential for capturing the stable factors in the evolution of user preferences in real-world settings.

In this work, we tackle the distribution shifting problem from the perspective of invariant learning. We propose a novel framework called Invariant Learning for Distribution Shifts in SEquential RecommendAtion (IDEA) to develop robust sequential recommendation. The key idea of IDEA lies in learning stable preferences across various distribution-aware environments. Since explicit environments are unavailable, we first extract multiple subsequences by dropping potential noise items, then extend environments with our proposed subsequence mixup. Given the simulated environments, IDEA then learns stable user preferences through invariant risk minimization (IRM) across various environments. To encourage the diversity of simulated environments, IDEA employs an adversarial training strategy to explore potentially diverse environments and further enhance the model's generalization to unseen test distributions. It is worth mentioning that IDEA is a flexible model-agnostic framework, which is applicable to various sequential recommendation models. Extensive experimental results on three public datasets clearly demonstrate the effectiveness of the proposed framework. Our code is available at: https://github.com/hermione314/IDEA.
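
Since the abstract invokes invariant risk minimization across simulated environments, here is a minimal sketch of the standard IRMv1 objective (Arjovsky et al.) as one concrete instantiation; IDEA's exact objective, environment construction, and adversarial component are not reproduced here.

    import torch

    def irmv1_loss(model, env_batches, loss_fn, penalty_weight=1.0):
        """Mean empirical risk plus the IRMv1 invariance penalty.

        env_batches: iterable of (x, y) pairs, one batch per simulated
        environment. Assumes model outputs and `scale` share a device.
        """
        scale = torch.ones(1, requires_grad=True)  # dummy classifier weight
        risks, penalties = [], []
        for x, y in env_batches:
            risk = loss_fn(model(x) * scale, y)
            # Gradient of this environment's risk w.r.t. the dummy weight;
            # it vanishes when the predictor is simultaneously optimal in
            # every environment (the invariance condition).
            grad = torch.autograd.grad(risk, scale, create_graph=True)[0]
            risks.append(risk)
            penalties.append((grad ** 2).sum())
        return torch.stack(risks).mean() + penalty_weight * torch.stack(penalties).mean()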

SESSION: RecSys: Sequential 3

SESSION: RecSys: FATE

Review-driven Personalized Preference Reasoning with Large Language Models for Recommendation

  • Jieyong Kim
  • Hyunseo Kim
  • Hyunjin Cho
  • SeongKu Kang
  • Buru Chang
  • Jinyoung Yeo
  • Dongha Lee

Recent advancements in Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, generating significant interest in their application to recommendation systems. However, existing methods have not fully harnessed the potential of LLMs, often constrained by limited input information or failing to fully utilize their advanced reasoning capabilities. To address these limitations, we introduce EXP3RT, a novel LLM-based recommender designed to leverage rich preference information contained in user and item reviews. EXP3RT is fine-tuned through distillation from a teacher LLM to perform three key steps in order: (1) preference extraction, (2) profile construction, and (3) textual reasoning for rating prediction. EXP3RT first extracts and encapsulates essential subjective preferences from raw reviews, then aggregates and summarizes them according to specific criteria to create user and item profiles. It then generates detailed step-by-step reasoning followed by a predicted rating, i.e., reasoning-enhanced rating prediction, by considering both subjective and objective information from user/item profiles and item descriptions. This personalized preference reasoning from EXP3RT enhances rating prediction accuracy and also provides faithful and reasonable explanations for recommendation. Extensive experiments show that EXP3RT outperforms existing methods on both rating prediction and candidate item reranking for top-k recommendation, while significantly enhancing the explainability of recommendation systems.

SESSION: RecSys: Domain-specific

NR4DER: Neural Re-ranking for Diversified Exercise Recommendation

  • Xinghe Cheng
  • Xufang Zhou
  • Liangda Fang
  • Chaobo He
  • Yuyu Zhou
  • Weiqi Luo
  • Zhiguo Gong
  • Quanlong Guan

With the widespread adoption of online education platforms, an increasing number of students are gaining new knowledge through Massive Open Online Courses (MOOCs). Exercise recommendation has made strides toward improving student learning outcomes. However, existing methods not only struggle with high dropout rates but also fail to match the diverse learning paces of students. They frequently face difficulties in adjusting to inactive students' learning patterns and in accommodating individualized learning paces, resulting in limited accuracy and diversity in recommendations. To tackle these challenges, we propose Neural Re-ranking for Diversified Exercise Recommendation (in short, NR4DER). NR4DER first leverages the mLSTM model to improve the effectiveness of the exercise filter module. It then employs a sequence enhancement method to improve the representation of inactive students and accurately match students with exercises of appropriate difficulty. Finally, it utilizes neural re-ranking to generate diverse recommendation lists based on individual students' learning histories. Extensive experimental results indicate that NR4DER significantly outperforms existing methods across multiple real-world datasets and effectively caters to the diverse learning paces of students.

SESSION: RecSys: Multimodal

MELON: Learning Multi-Aspect Modality Preferences for Accurate Multimedia Recommendation

  • Dongho Jeong
  • Taeri Kim
  • Donghyeon Cho
  • Sang-Wook Kim

Existing multimedia recommender systems have made the best efforts to predict user preferences for items by utilizing behavioral similarities between users and the modality features of items a user has interacted with. However, we identify two key limitations in existing methods regarding preferences for modality features: (L1) although preferences for modality features are an important aspect of users' preferences, existing methods only leverage neighbors with similar interactions and do not consider neighbors who may have similar preferences for modality features while having different interactions; (L2) although modality features of a user and an item may have a complex geometric relationship in the latent space, existing methods overlook and face challenges in precisely capturing this relationship. To address these two limitations, we propose a novel multimedia recommendation framework, named MELON, which is based on two core ideas: (Idea 1) Modality-cEntered embedding extraction; (Idea 2) reLatiOnship-ceNtered embedding extraction. We validate the effectiveness and validity of MELON through extensive experiments with four real-world datasets, showing 10.51% higher accuracy compared to the best competitor in terms of recall@10. The code and dataset of MELON are available at https://github.com/Bigdasgit/MELON.

SESSION: RecSys: LLMs

MSL: Not All Tokens Are What You Need for Tuning LLM as a Recommender

  • Bohao Wang
  • Feng Liu
  • Jiawei Chen
  • Xingyu Lou
  • Changwang Zhang
  • Jun Wang
  • Yuegang Sun
  • Yan Feng
  • Chun Chen
  • Can Wang

Large language models (LLMs), known for their comprehension capabilities and extensive knowledge, have been increasingly applied to recommendation systems (RS). Given the fundamental gap between the mechanism of LLMs and the requirement of RS, researchers have focused on fine-tuning LLMs with recommendation-specific data to enhance their performance. Language Modeling Loss (LML), originally designed for language generation tasks, is commonly adopted. However, we identify two critical limitations of LML: 1) it exhibits significant divergence from the recommendation objective; 2) it erroneously treats all fictitious item descriptions as negative samples, introducing misleading training signals.

To address these limitations, we propose a novel Masked Softmax Loss (MSL) tailored for fine-tuning LLMs on recommendation. MSL improves LML by identifying and masking invalid tokens that could lead to fictitious item descriptions during loss computation. This strategy can effectively avoid the interference from erroneous negative signals and ensure good alignment with the recommendation objective, supported by theoretical guarantees. During implementation, we identify a potential challenge related to gradient vanishing of MSL. To overcome this, we further introduce the temperature coefficient and propose an Adaptive Temperature Strategy (ATS) that adaptively adjusts the temperature without requiring extensive hyperparameter tuning. Extensive experiments conducted on four public datasets further validate the effectiveness of MSL, achieving an average improvement of 42.24% in NDCG@10. The code is available at https://github.com/WANGBohaO-jpg/MSL.
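
A minimal sketch of the masking idea: restrict the softmax at each decoding step to tokens that can extend a real item description (how the valid-token mask is built, e.g. from a prefix tree over the catalog, is the non-trivial part and is assumed as given here), with the temperature that ATS would adapt exposed as a parameter.

    import torch
    import torch.nn.functional as F

    def masked_softmax_loss(logits, target_ids, valid_mask, temperature=1.0):
        """Cross-entropy computed over only the valid tokens at each step.

        logits: [seq, vocab]; target_ids: [seq] (assumed valid everywhere);
        valid_mask: [seq, vocab] boolean, True where the token can extend a
        real item description at that step.
        """
        scaled = logits / temperature
        scaled = scaled.masked_fill(~valid_mask, float("-inf"))  # drop invalid tokens
        logp = F.log_softmax(scaled, dim=-1)
        return -logp.gather(-1, target_ids.unsqueeze(-1)).mean()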

Order-agnostic Identifier for Large Language Model-based Generative Recommendation

  • Xinyu Lin
  • Haihan Shi
  • Wenjie Wang
  • Fuli Feng
  • Qifan Wang
  • See-Kiong Ng
  • Tat-Seng Chua

Leveraging Large Language Models (LLMs) for generative recommendation has attracted significant research interest, where item tokenization is a critical step. It involves assigning item identifiers for LLMs to encode user history and generate the next item. Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings. Token-sequence identifiers face issues such as the local optima problem in beam search and low generation efficiency due to step-by-step generation. In contrast, single-token identifiers fail to capture rich semantics or encode Collaborative Filtering (CF) information, resulting in suboptimal performance.

To address these issues, we propose two fundamental principles for item identifier design: 1) integrating both CF and semantic information to fully capture multi-dimensional item information, and 2) designing order-agnostic identifiers without token dependency, mitigating the local optima issue and achieving simultaneous generation for generation efficiency. Accordingly, we introduce a novel set identifier paradigm for LLM-based generative recommendation, representing each item as a set of order-agnostic tokens. To implement this paradigm, we propose SETRec, which leverages CF and semantic tokenizers to obtain order-agnostic multi-dimensional tokens. To eliminate token dependency, SETRec uses a sparse attention mask for user history encoding and a query-guided generation mechanism for simultaneous token generation. We instantiate SETRec on T5 and Qwen (from 1.5B to 7B). Extensive experiments on four datasets demonstrate its effectiveness across various scenarios (e.g., full ranking, warm- and cold-start ranking, and various item popularity groups). Moreover, results validate SETRec's superior efficiency and show promising scalability on cold-start items as model sizes increase.

SESSION: RecSys: Collaborative Filtering

SESSION: RecSys: Graphs

Rating-Aware Homogeneous Review Graphs and User Likes/Dislikes Differentiation for Effective Recommendations

  • Jiwon Son
  • Hyunjoon Kim
  • Sang-Wook Kim

The goal of Review-Based Recommendation System (RBRS) is to effectively learn the representations of users and items by utilizing review texts in addition to user-item interactions. From user-item interaction graphs widely employed in recommendation systems, recent RBRS methods using graph neural networks (GNNs) obtain the representations by associating each edge between a user and an item with the review information of the user for that item. However, these GNN-based RBRS methods present two main issues: (1) by converting each review text into the weight, i.e., a single value, of an edge between a user node and an item node, they lose the rich information about users and items inherent in the review; and (2) by creating only a single general representation for each user, they cannot represent the individual effects of users' likes and dislikes on their ratings for items they have interacted with. To address these problems, we propose a novel GNN-based RBRS, named LETTER, utilizing homogeneous graphs, i.e., user-user graphs and an item-item graph, to learn general representations of users and items along with users' like and dislike representations. LETTER can learn user and item representations without losing review information by utilizing the proposed homogeneous graphs. Furthermore, LETTER explicitly models the influence of users' like and dislike representations on their ratings to perform accurate rating predictions. Through experiments on six datasets, we verify that the proposed LETTER outperforms nine state-of-the-art RBRSs by up to 23.1%. Our source code is available at https://github.com/Bigdasgit/LETTER.

SESSION: RecSys: Scalability, Embeddings and Training

Pre-training for Recommendation Unlearning

  • Guoxuan Chen
  • Lianghao Xia
  • Chao Huang

Modern recommender systems powered by Graph Neural Networks (GNNs) excel at modeling complex user-item interactions, yet increasingly face scenarios requiring selective forgetting of training data. Beyond user requests to remove specific interactions due to privacy concerns or preference changes, regulatory frameworks mandate recommender systems' ability to eliminate the influence of certain user data from models. This recommendation unlearning challenge presents unique difficulties, as removing connections within interaction graphs creates ripple effects throughout the model, potentially impacting recommendations for numerous users. Traditional approaches suffer from significant drawbacks: fragmentation methods damage graph structure and diminish performance, while influence function techniques make assumptions that may not hold in complex GNNs, particularly with self-supervised or random architectures. To address these limitations, we propose a novel model-agnostic pre-training paradigm, UnlearnRec, that prepares systems for efficient unlearning operations. Our Influence Encoder takes unlearning requests together with existing model parameters and directly produces the updated parameters of the unlearned model with little fine-tuning, avoiding complete retraining while preserving model performance characteristics. Extensive evaluation on public benchmarks demonstrates that our method delivers exceptional unlearning effectiveness while providing more than 10x speedup compared to retraining approaches. We release our method implementation at: https://github.com/HKUDS/UnlearnRec.

Multi-scenario Instance Embedding Learning for Deep Recommender Systems

  • Chaohua Yang
  • Dugang Liu
  • Xing Tang
  • Yuwen Fu
  • Xiuqiang He
  • Xiangyu Zhao
  • Zhong Ming

Multi-scenario recommendation (MSR) has become a core component of various online platforms, but its increasing model size has also brought attention to its efficiency optimization. An important effort is to find effective and efficient feature embedding layers for MSR. Existing work focuses on scenario-level feature selection, i.e., all instance embeddings in the same scenario get the same filtering results on the feature set, and the filtering results differ across scenarios. However, this ignores the information redundancy of the dimension set and the individuality of different instances in the same scenario. To address these limitations, we propose a multi-scenario instance embedding learning (MultiEmb) framework that implements exclusive feature-dimension redundant information removal for different instances within a scenario to obtain the optimal individual embeddings. The core of our MultiEmb is to introduce an instance embedding selection network to effectively complete the above challenging tasks, in which a set of feature selection and dimension selection adaptive components are equipped for each scenario, and their combination completes the optimal embedding selection for each instance. Finally, we evaluate MultiEmb through extensive experiments on two public multi-scenario benchmarks and demonstrate its effectiveness, compatibility, and transferability.

SESSION: RecSys: Ranking and Adaptivity

MGIPF: Multi-Granularity Interest Prediction Framework for Personalized Recommendation

  • Ruoxuan Feng
  • Zhen Tian
  • Qiushi Peng
  • Jiaxin Mao
  • Wayne Xin Zhao
  • Di Hu
  • Changwang Zhang

Personalized recommender systems, which focus on predicting users' interests, have significantly enhanced user experiences across diverse applications. However, existing approaches implicitly model users' preferences by fitting fine-grained labels (e.g., click labels), while often neglecting the coarse-grained interest information inherent in the inputs themselves. Relying solely on fine-grained labels can harm interest modeling and limit performance, as the labels may carry inevitable noise in real-world scenarios. In addition, most existing approaches demand considerable data to effectively model users' multi-granularity interests; with limited or no supporting examples, they deliver subpar performance due to the significant long-tail phenomenon. To tackle these issues, we propose a novel learning framework named the Multi-Granularity Interest Prediction Framework (MGIPF), for better modeling users' diverse interests. Unlike prior work, our key idea is to utilize both the coarse-grained and fine-grained interests for supervising the training of models. Specifically, we introduce a pseudo-labeling approach explicitly mining users' potential multi-granularity interests from the raw data, and propose coarse-grained interest prediction modules that collaborate to utilize the multi-granularity supervision signals to enhance the learning of low-frequency items. The corresponding coarse-grained losses are softly weighted, taking into account the varying confidence of potential multi-granularity preferences on positive and negative samples. Importantly, our framework is lightweight and adaptable, capable of being applied effectively to mainstream recommendation models, establishing a comprehensive end-to-end training process. Extensive experiments conducted on three publicly available datasets have demonstrated the efficacy of our approach. The code is available at https://github.com/GeWu-Lab/MGIPF.

SESSION: Reranking

QDER: Query-Specific Document and Entity Representations for Multi-Vector Document Re-Ranking

  • Shubham Chatterjee
  • Jeff Dalton

Neural IR has advanced through two distinct paths: entity-oriented approaches leveraging knowledge graphs and multi-vector models capturing fine-grained semantics. We introduce QDER, a neural re-ranking model that unifies these approaches by integrating knowledge graph semantics into a multi-vector model.

QDER's key innovation lies in its modeling of query-document relationships: rather than computing similarity scores on aggregated embeddings, we maintain individual token and entity representations throughout the ranking process, performing aggregation only at the final scoring stage, an approach we call "late aggregation." We first transform these fine-grained representations through learned attention patterns, then apply carefully chosen mathematical operations for precise matches. Experiments across five standard benchmarks show that QDER achieves significant performance gains, with improvements of 36% in nDCG@20 over the strongest baseline on TREC Robust 2004 and similar improvements on other datasets. QDER particularly excels on difficult queries, achieving an nDCG@20 of 0.70 where traditional approaches fail completely (nDCG@20 = 0.0), setting a foundation for future work in entity-aware retrieval.
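
A minimal sketch of late aggregation with a max-sim interaction (an illustrative choice in the spirit of multi-vector late interaction; QDER's learned attention transforms and scoring operations are richer than this):

    import torch

    def late_aggregation_score(q_tok, d_tok, q_ent, d_ent):
        """Score a query-document pair from fine-grained representations.

        q_tok: [nq, d] and d_tok: [nd, d] token embeddings;
        q_ent: [eq, d] and d_ent: [ed, d] entity embeddings.
        Each query unit keeps its own representation; per-unit similarities
        are aggregated into a single score only at this final step.
        """
        tok_sim = (q_tok @ d_tok.T).max(dim=1).values.sum()
        ent_sim = (q_ent @ d_ent.T).max(dim=1).values.sum()
        return tok_sim + ent_sim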

SESSION: Search and Ranking 1

On the Scaling of Robustness and Effectiveness in Dense Retrieval

  • Yu-An Liu
  • Ruqing Zhang
  • Jiafeng Guo
  • Maarten de Rijke
  • Yixing Fan
  • Xueqi Cheng

Robustness and effectiveness are critical aspects of developing dense retrieval models for real-world applications, and it is known that there is a trade-off between the two. Recent work has addressed scaling laws of effectiveness in dense retrieval, revealing a power-law relationship between effectiveness and the size of models and data. Does robustness follow scaling laws too? If so, can scaling improve both robustness and effectiveness together, or do they remain locked in a trade-off?

To answer these questions, we conduct a comprehensive experimental study. We find that: (i) Robustness, including out-of-distribution and adversarial robustness, also follows a scaling law. (ii) Robustness and effectiveness exhibit different scaling patterns, leading to significant resource costs when jointly improving both. Given these findings, we turn to a third factor that affects model performance beyond model size and data size: the optimization strategy. We find that: (i) By fitting different optimization strategies, the joint performance of robustness and effectiveness traces out a Pareto frontier. (ii) When the optimization strategy strays from Pareto efficiency, the joint performance scales in a sub-optimal direction. (iii) By adjusting the optimization weights to fit the Pareto efficiency, we can achieve Pareto training, where the scaling of joint performance becomes most efficient. Even without requiring additional resources, Pareto training matches the performance achieved by scaling resources several-fold under optimization strategies that overly prioritize either robustness or effectiveness. Finally, we demonstrate that our findings can help deploy dense retrieval models in real-world applications that scale efficiently and are balanced for robustness and effectiveness.

Optimizing Compound Retrieval Systems

  • Harrie Oosterhuis
  • Rolf Jagerman
  • Zhen Qin
  • Xuanhui Wang

Modern retrieval systems do not rely on a single ranking model to construct their rankings. Instead, they generally take a cascading approach where a sequence of ranking models is applied in multiple re-ranking stages. They thereby balance the quality of the top-K ranking against computational costs by limiting the number of documents each model re-ranks. However, the cascading approach is not the only way models can interact to form a retrieval system.

We propose the concept of compound retrieval systems as a broader class of retrieval systems that apply multiple prediction models. This encapsulates cascading models but also allows interaction types beyond top-K re-ranking. In particular, we enable interactions with large language models (LLMs) which can provide relative relevance comparisons. We focus on the optimization of compound retrieval system design, which uniquely involves learning where to apply the component models and how to aggregate their predictions into a final ranking. This work shows how our compound approach can combine the classic BM25 retrieval model with state-of-the-art (pairwise) LLM relevance predictions, while optimizing a given ranking metric and efficiency target. Our experimental results show that optimized compound retrieval systems provide better trade-offs between effectiveness and efficiency than cascading approaches, even when applied in a self-supervised manner.

With the introduction of compound retrieval systems, we hope to inspire the information retrieval field to think more out-of-the-box about how prediction models can interact to form rankings.

Precise Zero-Shot Pointwise Ranking with LLMs through Post-Aggregated Global Context Information

  • Kehan Long
  • Shasha Li
  • Chen Xu
  • Jintao Tang
  • Ting Wang

Recent advancements have successfully harnessed the power of Large Language Models (LLMs) for zero-shot document ranking, exploring a variety of prompting strategies. Comparative approaches like pairwise and listwise achieve high effectiveness but are computationally intensive and thus less practical for larger-scale applications. Scoring-based pointwise approaches exhibit superior efficiency by independently and simultaneously generating the relevance scores for each candidate document. However, this independence ignores critical comparative insights between documents, resulting in inconsistent scoring and suboptimal performance. In this paper, we aim to improve the effectiveness of pointwise methods while preserving their efficiency through two key innovations: (1) We propose a novel Global-Consistent Comparative Pointwise Ranking (GCCP) strategy that incorporates global reference comparisons between each candidate and an anchor document to generate contrastive relevance scores. We strategically design the anchor document as a query-focused summary of pseudo-relevant candidates, which serves as an effective reference point by capturing the global context for document comparison. (2) These contrastive relevance scores can be efficiently Post-Aggregated with existing pointwise methods, seamlessly integrating essential Global Context information in a training-free manner (PAGC). Extensive experiments on the TREC DL and BEIR benchmarks demonstrate that our approach significantly outperforms previous pointwise methods while maintaining comparable efficiency. Our method also achieves competitive performance against comparative methods that require substantially more computational resources. Further analyses validate the efficacy of our anchor construction strategy.
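
The post-aggregation step can be pictured as interpolating an independent pointwise score with a contrastive score measured against the shared anchor; the helper below is a hypothetical sketch, not the paper's exact aggregation rule.

```python
def post_aggregate(pointwise_score: float, anchor_score: float,
                   weight: float = 0.5) -> float:
    # anchor_score: how strongly the candidate is preferred over the anchor
    # document (e.g., a log-probability difference produced by the LLM).
    # Being training-free, the fusion is a simple weighted combination.
    return (1 - weight) * pointwise_score + weight * anchor_score
```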

SESSION: Search and Ranking 2

SESSION: Efficiency

SESSION: Short Research Papers

AgentCF++: Memory-enhanced LLM-based Agents for Popularity-aware Cross-domain Recommendations

  • Jiahao Liu
  • Shengkang Gu
  • Dongsheng Li
  • Guangping Zhang
  • Mingzhe Han
  • Hansu Gu
  • Peng Zhang
  • Tun Lu
  • Li Shang
  • Ning Gu

LLM-based user agents, which simulate user interaction behavior, are emerging as a promising approach to enhancing recommender systems. In real-world scenarios, users' interactions often exhibit cross-domain characteristics and are influenced by others. However, the memory design in current methods causes user agents to introduce significant irrelevant information during decision-making in cross-domain scenarios and makes them unable to recognize the influence of other users' interactions, such as popularity factors. To tackle these issues, we propose AgentCF++, which combines a dual-layer memory architecture with a two-step fusion mechanism. This design avoids irrelevant information during decision-making while ensuring effective integration of cross-domain preferences. We also introduce the concepts of interest groups and group-shared memory to better capture the influence of popularity factors on users with similar interests. Comprehensive experiments validate the effectiveness of AgentCF++. Our code is available at https://github.com/jhliu0807/AgentCF-plus.

Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

  • Dario Di Palma
  • Felice Antonio Merra
  • Maurizio Sfilio
  • Vito Walter Anelli
  • Fedelucio Narducci
  • Tommaso Di Noia

Large Language Models (LLMs) have become increasingly central to recommendation scenarios due to their remarkable natural language understanding and generation capabilities. Although significant research has explored the use of LLMs for various recommendation tasks, little effort has been dedicated to verifying whether they have memorized public recommendation datasets as part of their training data. This is undesirable because memorization reduces the generalizability of research findings: benchmarking on memorized datasets does not guarantee generalization to unseen datasets. Furthermore, memorization can amplify biases; for example, some popular items may be recommended more frequently than others.

In this work, we investigate whether LLMs have memorized public recommendation datasets. Specifically, we examine two model families (GPT and Llama) across multiple sizes, focusing on one of the most widely used datasets in recommender systems: MovieLens-1M. First, we define dataset memorization as the extent to which item attributes, user profiles, and user-item interactions can be retrieved by prompting the LLMs. Second, we analyze the impact of memorization on recommendation performance. Lastly, we examine whether memorization varies across model families and model sizes. Our results reveal that all models exhibit some degree of memorization of MovieLens-1M, and that recommendation performance is related to the extent of memorization.

Counterfactual Model Selection in Contextual Bandits

  • Shion Ishikawa
  • Young-joo Chung
  • Yun-Ching Liu
  • Yu Hirate

Contextual bandit algorithms are crucial in various decision-making applications, such as personalized content recommendation, online advertising, and e-commerce banner placement. Despite their successful applications in various domains, contextual bandit algorithms still face significant challenges with exploration efficiency compared to non-contextual bandit algorithms, due to exploration in feature spaces. To overcome this issue, model selection policies such as MetaEXP and MetaCORRAL have been proposed to interactively explore base policies. In this paper, we introduce a novel counterfactual approach to the model selection problem in contextual bandits. Unlike previous methods, our approach leverages unbiased Off-Policy Evaluation (OPE) to dynamically select base policies, making it more robust to model misspecification. We present two new algorithms, MetaEXP-OPE and MetaGreedy-OPE, which utilize OPE for the model selection policy. We also provide a theoretical analysis of regret bounds and evaluate the impact of different OPE estimators. We evaluated our approach on synthetic data and on a semi-synthetic simulator built from a real-world dataset; the results show that MetaEXP-OPE and MetaGreedy-OPE significantly outperform existing policies, including MetaEXP and MetaCORRAL.

Characterising Topic Familiarity and Query Specificity Using Eye-Tracking Data

  • Jiaman He
  • Zikang Leng
  • Dana McKay
  • Johanne R. Trippas
  • Damiano Spina

Eye-tracking data has been shown to correlate with a user's knowledge level and query formulation behaviour. While previous work has focused primarily on eye gaze fixations for attention analysis, often requiring additional contextual information, our study investigates the memory-related cognitive dimension by relying solely on pupil dilation and gaze velocity to infer users' topic familiarity and query specificity, without needing any contextual information. Using eye-tracking data collected in a lab user study (N=18), we achieved a Macro F1 score of 71.25% for predicting topic familiarity with a Gradient Boosting classifier, and a Macro F1 score of 60.54% with a k-nearest neighbours (KNN) classifier for query specificity. Furthermore, we developed a novel annotation guideline, tailored specifically for question answering, to manually classify queries as Specific or Non-specific. This study demonstrates the feasibility of using eye-tracking to better understand topic familiarity and query specificity in search.

Bias in Language Models: Interplay of Architecture and Data?

  • Mozhgan Talebpour
  • Yunfei Long
  • Alba G. Seco De Herrera
  • Shoaib Jameel

Pre-trained language models (PLMs), despite showing strong performance, can carry and amplify biases, which limits the development of fair NLP and IR systems. This research investigates the foundational origins of bias within PLMs, moving beyond detection to a detailed analysis of its formation and propagation across diverse architectures. Through a novel attention weight analysis, we reveal distinct attention patterns for biased versus neutral content, offering insights into the internal representations learned by PLMs. Our findings demonstrate a complex interplay between training data and model architecture, revealing that while the transformer's self-attention mechanism amplifies existing biases, the training data plays a crucial role in the initial encoding of bias within the model's representations.

Dynamic Margin-based Contrastive Learning for Robust Negative Sampling in Information Retrieval

  • Tsai-Tsung Chen
  • Chuan-Ju Wang
  • Ming-Feng Tsai

Modern information retrieval (IR) systems, powered by bi-encoder architectures and pretrained language models, rely on effective negative sampling for contrastive learning. While easy negatives are computationally simple, they fail to challenge the model, whereas hard negatives, selected via methods like BM25, ANCE, or ADORE, can be overly difficult and misleading. To address this, this paper proposes dynamic margin-based contrastive learning (DMCL), which adaptively adjusts the decision boundary based on query-negative similarity, ensuring consistent exposure to moderately hard negatives. Experiments across diverse datasets and models show that DMCL outperforms traditional methods, achieving state-of-the-art retrieval performance with minimal computational cost.
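
As a rough illustration of a similarity-dependent margin, consider the following PyTorch sketch; the margin schedule, scale, and names are assumptions for exposition rather than DMCL's published formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_margin_loss(q, pos, negs, base_margin=0.2, scale=0.3):
    """q: (dim,) query; pos: (dim,) positive doc; negs: (n, dim) negatives."""
    sim_pos = F.cosine_similarity(q.unsqueeze(0), pos.unsqueeze(0))
    sim_neg = F.cosine_similarity(q.unsqueeze(0), negs)
    # Harder negatives (already similar to the query) get a smaller required
    # margin, so they contribute useful but not overwhelming gradients.
    margin = base_margin + scale * (1.0 - sim_neg)
    return F.relu(margin + sim_neg - sim_pos).mean()
```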

EIoU-EMC: A Novel Loss for Domain-specific Nested Entity Recognition

  • Jian Zhang
  • Tianqing Zhang
  • Qi Li
  • Hongwei Wang

Nested NER tasks face challenges in specific domains, such as biomedical and industrial fields, particularly due to low-resource settings and class imbalance, which impede their wide application. In this study, we design a novel loss, EIoU-EMC, by enhancing the implementation of the Intersection over Union loss and the Multi-class loss. Our proposed method specifically leverages information about entity boundaries and entity classification, thereby enhancing the model's capacity to learn from a limited number of data samples. To validate the performance of this method on NER tasks, we conducted experiments on three distinct biomedical NER datasets and one dataset we constructed from industrial complex-equipment maintenance documents. Compared to strong baselines, our method demonstrates competitive performance across all datasets. In the experimental analysis, our proposed method exhibits significant advancements in entity boundary recognition and entity classification. Our code and data are available at https://github.com/luminous11/EIoU-EMC/

ELEC: Efficient Large Language Model-Empowered Click-Through Rate Prediction

  • Rui Dong
  • Wentao Ouyang
  • Xiangzheng Liu

Click-through rate (CTR) prediction plays an important role in online advertising systems. On the one hand, traditional CTR prediction models capture the collaborative signals in tabular data via feature interaction modeling, but they lose the semantics in text. On the other hand, Large Language Models (LLMs) excel at understanding the context and meaning behind text, but they face challenges in capturing collaborative signals and have long inference latency. In this paper, we aim to leverage the benefits of both types of models, pursuing collaboration, semantics, and efficiency. We present ELEC, an Efficient LLM-Empowered CTR prediction framework. We first adapt an LLM for the CTR prediction task. To leverage the ability of the LLM while keeping efficiency, we utilize a pseudo-siamese network that contains a gain network and a vanilla network. We inject the high-level representation vector generated by the LLM into a collaborative CTR model to form the gain network, such that it can take advantage of both tabular and textual modeling; however, its reliance on the LLM limits its efficiency. We then distill the knowledge from the gain network to the vanilla network at both the score level and the representation level, such that the vanilla network takes only tabular data as input but can still achieve performance comparable to the gain network. Our approach is model-agnostic and allows for integration with various existing LLMs and collaborative CTR models. Experiments on real-world datasets demonstrate the effectiveness and efficiency of ELEC for CTR prediction.

Dual Debiasing in LLM-based Recommendation

  • Sijin Lu
  • Zhibo Man
  • Fangyuan Luo
  • Jun Wu

Large language models (LLMs) have been widely applied in recommender systems, achieving remarkable success. However, LLM-based recommendation (LR) suffers from more severe popularity bias than conventional recommendation (CR), stemming from both the training and inference stages. In this paper, we propose a novel debiasing method for LR that performs debiasing in both stages, termed Dual Debiasing in LR (D²LR). Concretely, in the training stage, we apply token-wise inverse propensity score weighting to force the LLM to pay more attention to unpopular tokens. In the inference stage, we train a more biased CR model by increasing the weights of popular items and use its item scores to adjust the generation probabilities of the corresponding tokens, suppressing the excessive generation of popular tokens. Experiments conducted on three real-world datasets validate the effectiveness of D²LR in mitigating popularity bias in LR.
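
The training-stage component can be sketched as a token-wise inverse-propensity-weighted language modeling loss; here, normalized token popularity stands in for the propensity, which is an assumption made purely for illustration.

```python
import torch
import torch.nn.functional as F

def token_ips_loss(logits, target_ids, token_popularity, eps=1e-6):
    """logits: (seq_len, vocab); target_ids: (seq_len,);
    token_popularity: (vocab,) corpus frequency of each token in [0, 1]."""
    nll = F.cross_entropy(logits, target_ids, reduction="none")
    propensity = token_popularity[target_ids].clamp(min=eps)
    weights = 1.0 / propensity          # unpopular tokens weigh more
    weights = weights / weights.mean()  # keep the loss scale stable
    return (weights * nll).mean()
```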

A Comparative Study of Large Language Models and Traditional Privacy Measures to Evaluate Query Obfuscation Approaches

  • Francesco Luigi De Faveri
  • Guglielmo Faggioli
  • Nicola Ferro

When interacting with an Information Retrieval (IR) system, users might disclose personal information, such as medical details, through their queries. Thus, assessing the level of privacy granted to users when querying an IR system is essential to determine the confidentiality of submitted sensitive data. Query obfuscation protocols have traditionally been employed to obscure a user's real information need when retrieving documents. In these protocols, the query is modified using ε-Differential Privacy (DP) obfuscation mechanisms, which alter query terms according to a predefined privacy budget ε. While this budget ensures formal mathematical guarantees, it says little about the privacy actually experienced by the user, which calls for empirical privacy evaluation. Such privacy assessments employ lexical and semantic similarity measures between the original and obfuscated queries. In this study, we explore the role of Large Language Models (LLMs) in privacy evaluation, simulating a scenario where users employ such models to determine whether their input has been effectively privatized. Our primary research objective is to determine whether LLMs provide a novel perspective on privacy estimation and whether their assessments can serve as a proxy for traditional similarity metrics, such as the Jaccard and cosine similarity derived from Transformer-based sentence embeddings. Our findings reveal a positive correlation between LLM-generated privacy scores and cosine similarity computed using different Transformer architectures, suggesting that LLM assessments act as a proxy for similarity-based measures.

GINGER: Grounded Information Nugget-Based Generation of Responses

  • Weronika Łajewska
  • Krisztian Balog

Retrieval-augmented generation (RAG) faces challenges related to factual correctness, source attribution, and response completeness. To address them, we propose a modular pipeline for grounded response generation that operates on information nuggets - minimal, atomic units of relevant information extracted from retrieved documents. The multistage pipeline encompasses nugget detection, clustering, ranking, top cluster summarization, and fluency enhancement. It guarantees grounding in specific facts, facilitates source attribution, and ensures maximum information inclusion within length constraints. Experiments on the TREC RAG'24 dataset, using the AutoNuggetizer framework, demonstrate that GINGER achieves state-of-the-art performance on this benchmark.

Echoes in the Feed: Evolution-aware Prompt-augmented Micro-video Popularity Prediction

  • Wei Chen
  • Jiao Li
  • Jian Lang
  • Zhangtao Cheng
  • Yong Wang
  • Fan Zhou

Micro-video popularity prediction (MVPP) is a crucial research topic with important implications for social media marketing and stakeholders. Current works in MVPP utilize pre-trained vision-language models (PVLs) to model multimodal features for prediction, but fail to capture the evolving popularity trends of micro-videos, leading to suboptimal results. To tackle this limitation, we propose EvoPro, an Evolution-aware Prompt-augmented framework that enhances MVPP. First, inspired by the powerful multimodal understanding and text generation skills of Large Multimodal Models (LMMs), an LMM-driven generative retriever is proposed to create contextually rich retrieval queries and perform precise video-to-video retrieval, forming dynamic micro-video support sets that effectively reflect evolving patterns. Building upon this, a graph-based prompter generates evolutionary prompts by capturing the relational structures within the support set. These prompts, representing the latest trend dynamics, serve as few-shot examples to guide the PVLs. By integrating evolutionary prompts, the PVLs are empowered to model the evolving popularity trends more accurately, yielding stronger and more predictive representations. Extensive experiments conducted on three benchmarks demonstrate that EvoPro significantly outperforms competitive baselines.

Efficient Conversational Search via Topical Locality in Dense Retrieval

  • Cristina Ioana Muntean
  • Franco Maria Nardini
  • Raffaele Perego
  • Guido Rocchietti
  • Cosimo Rulli

Pre-trained language models have been widely exploited to learn dense representations of documents and queries for information retrieval. While previous efforts have primarily focused on improving effectiveness and user satisfaction, response time remains a critical bottleneck of conversational search systems. To address this, we exploit the topical locality inherent in conversational queries, i.e., the tendency of queries within a conversation to focus on related topics. By leveraging query embedding similarities, we dynamically restrict the search space to semantically relevant document clusters, reducing computational complexity without compromising retrieval quality. We evaluate our approach on the TREC CAsT 2019 and 2020 datasets using multiple embedding models and vector indexes, achieving improvements in processing speed of up to 10.3X with little loss in performance (4.3X without any loss). Our results show that the proposed system effectively handles complex, multi-turn queries with high precision and efficiency, offering a practical solution for real-time conversational search.
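
A minimal sketch of the pruning idea follows, assuming documents have been pre-clustered offline and the query embedding selects which clusters to search; the data layout and names are illustrative.

```python
import numpy as np

def restricted_search(query_emb, centroids, clusters, top_clusters=4, k=10):
    """centroids: (C, dim) cluster centroids;
    clusters: list of (doc_ids, doc_embs) pairs, one per cluster."""
    scores = centroids @ query_emb                 # cluster-level relevance
    selected = np.argsort(-scores)[:top_clusters]  # topically closest clusters
    ids, sims = [], []
    for c in selected:
        doc_ids, doc_embs = clusters[c]
        s = doc_embs @ query_emb                   # exact search inside cluster
        ids.extend(doc_ids)
        sims.extend(s.tolist())
    order = np.argsort(-np.asarray(sims))[:k]
    return [ids[i] for i in order]
```

Since consecutive turns in a conversation tend to hit related topics, the selected cluster subset could in principle be reused across turns.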

Assessing Support for the TREC 2024 RAG Track: A Large-Scale Comparative Study of LLM and Human Evaluations

  • Nandan Thakur
  • Ronak Pradeep
  • Shivani Upadhyay
  • Daniel Campos
  • Nick Craswell
  • Ian Soboroff
  • Hoa Trang Dang
  • Jimmy Lin

Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing ''ground truth''. A crucial factor in RAG evaluation is ''support'': whether the information in the cited documents supports the answer. We conducted a comparative study of submissions to the TREC 2024 RAG Track, evaluating an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate good agreement between human and GPT-4o predictions. Further analysis of the disagreements shows that an independent human judge agrees more closely with GPT-4o than with the original human judge, suggesting that LLM judges can be a reliable alternative for support assessment. We provide a qualitative analysis of human and GPT-4o errors to help guide future evaluations.

Bias-Aware Curriculum Sampling For Fair Ranking

  • Shirin Seyedsalehi
  • Hai Son Le
  • Morteza Zihayat
  • Ebrahim Bagheri

Neural ranking models are widely used to retrieve and rank relevant documents. However, these models may inherit and amplify biases present in the training data, posing challenges for fairness and relevance in ranking outputs. In this paper, we propose a novel curriculum-based training approach that manages bias exposure throughout the training process. We design a bias-aware curriculum that schedules the model's exposure to biased samples across training stages, allowing the model to establish a fair relevance baseline. We conduct extensive experiments across different LLMs and datasets to evaluate the effectiveness of our approach. Our results demonstrate that our proposed strategy outperforms other bias reduction methods in terms of both fairness and relevance, without sacrificing retrieval effectiveness.

Automatic Document Editing for Improved Ranking

  • Niv Bardas
  • Tommy Mordo
  • Oren Kurland
  • Moshe Tennenholtz

We present a study of using large language models (LLMs) to modify a document so as to have it highly ranked for a query by an undisclosed ranking function. We present different prompting methods inspired by work on using LLMs to induce rankings. Empirical evaluation attests to the merits of the best-performing methods compared to human modifications and to a highly effective feature-based modification method.

An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-Doc

  • Aldo Porco
  • Dhruv Mehra
  • Igor Malioutov
  • Karthik Radhakrishnan
  • Moniba Keymanesh
  • Daniel Preoţiuc-Pietro
  • Sean MacAvaney
  • Pengxiang Cheng

Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity across terms in the collection, only within a given query or document. Terms with very high Document Frequencies (DFs) substantially increase latency in production retrieval engines, such as Apache Solr, due to their lengthy posting lists. To address the issue of high DFs, we present a new variant of FLOPS regularization: DF-FLOPS. This new regularization technique penalizes the usage of high-DF terms, thereby shortening posting lists and reducing retrieval latency. Unlike other inference-time sparsification methods, such as stopword removal, DF-FLOPS regularization allows for the selective inclusion of high-frequency terms in cases where the terms are truly salient. We find that DF-FLOPS successfully reduces the prevalence of high-DF terms and lowers retrieval latency (around 10x faster) in a production-grade engine while maintaining effectiveness both in-domain (only a 2.2-point drop in MRR@10) and cross-domain (improved performance in 12 out of 13 tasks on which we tested). With retrieval latencies on par with BM25, this work provides an important step towards making LSR practical for deployment in production-grade search engines.
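
A hedged sketch of what a DF-aware FLOPS-style regularizer could look like: the standard FLOPS statistic (squared mean activation per term) is additionally weighted by each term's document frequency, so high-DF terms pay a larger penalty. The exact weighting used in the paper may differ.

```python
import torch

def df_flops(term_weights, doc_freq, total_docs):
    """term_weights: (batch, vocab) non-negative encoder outputs;
    doc_freq: (vocab,) tensor with each term's current document frequency."""
    mean_act = term_weights.mean(dim=0)          # standard FLOPS statistic
    df_penalty = doc_freq.float() / total_docs   # higher DF -> larger penalty
    return ((1.0 + df_penalty) * mean_act.pow(2)).sum()
```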

Dual-perspective Data Augmentation and Curriculum Learning Framework for Low-resource Complex Named Entity Recognition

  • Mengxiao Song
  • Tianyun Liu
  • Wenyuan Zhang
  • Quangang Li
  • Tingwen Liu

Low-resource complex named entity recognition focuses on identifying complex entities, such as creative works and product names, in scenarios where annotated training data is limited. Recent advanced works address this task through data augmentation and make substantial progress. However, existing methods ignore the influence of different types or levels of augmented data on model optimization at different learning stages. To address this, we propose a dual-perspective data augmentation and curriculum learning framework. Specifically, we first employ a large language model (LLM) to construct two kinds of augmented datasets, from a context perspective and an entity perspective, respectively. Then, we present a multi-stage curriculum learning strategy, including a novel adaptive curriculum arrangement algorithm, to automatically select the most suitable kind of augmented set to optimize the target model at each training epoch, thus using the augmented data more effectively and controllably. Experimental results on the public benchmark across various low-resource settings show that our framework outperforms previous works.

Conversational Argument Search Under Selective Exposure: Strategies for Balanced Perspective Access

  • Kyusik Kim
  • Jeongwoo Ryu
  • Dongseok Heo
  • Hyungwoo Song
  • Changhoon Oh
  • Bongwon Suh

Conversational argument search systems influence how users access diverse perspectives but are prone to selective exposure. To address this, we propose two strategies: an interface-level multi-agent framework that structures perspective presentation and an interaction-level questioning strategy that encourages deeper engagement. We evaluate these strategies through a 2 x 2 factorial user study, examining their impact on selective exposure. Results show that the multi-agent setup facilitates broader perspective comparison, while agent-initiated questioning fosters deeper reflection; together, they promote more balanced argument access. Based on these findings, we discuss how conversational search systems can mitigate selective exposure by implementing multi-agent interactions and questioning mechanisms.

Axiomatic Re-Ranking for Argument Retrieval

  • Maximilian Heinrich
  • Marvin Vogel
  • Alexander Bondarenko
  • Matthias Hagen
  • Benno Stein

Information retrieval axioms are formalized constraints that retrieval systems should ideally satisfy (e.g., that documents containing the query terms more often should be ranked higher). In this paper, we propose new axioms that focus on the scenario of argument retrieval: retrieval for queries that need arguments in the results. Our underlying axiomatic idea is that in such scenarios, documents with argumentative units that are similar to the query should be prioritized. We test our new axioms in re-ranking experiments on the data of the Touché 2020 and 2021 shared tasks on argument retrieval for controversial questions, and show that the new axioms can improve the effectiveness of Touché's strong DirichletLM baseline model and even of the top-performing system from Touché 2021, a system already specifically optimized for argument retrieval. Finally, we also propose a new method for visualizing the relationships between axioms based on their effects in re-ranking settings.

Document Similarity Enhanced IPS Estimation for Unbiased Learning to Rank

  • Zeyan Liang
  • Graham McDonald
  • Iadh Ounis

Learning to Rank (LTR) models learn from historical user interactions, such as user clicks. However, there is an inherent bias in users' clicks due to position bias, i.e., users are more likely to click highly-ranked documents than low-ranked documents. To address this bias when training LTR models, many approaches from the literature re-weight the users' click data using Inverse Propensity Scoring (IPS). IPS re-weights a user's clicks in proportion to the position in the historical ranking at which a document was placed when it was clicked, since low-ranked documents are less likely to be seen by a user. In this paper, we argue that low-ranked documents that are similar to highly-ranked relevant documents are also likely to be relevant. Moreover, accounting for the similarity of low-ranked documents to highly-ranked relevant documents when calculating IPS can more effectively mitigate the effects of position bias. Therefore, we propose an extension to IPS, called IPSsim, that takes into consideration the similarity of documents when estimating IPS. We evaluate our IPSsim estimator using two large publicly available LTR datasets under a number of simulated user click settings, and with different numbers of training clicks. Our experiments show that our IPSsim estimator is more effective than the existing IPS estimators for learning an unbiased LTR model, particularly in top-n settings when n >= 30. For example, when n = 50, our IPSsim estimator achieves a statistically significant ~3% improvement (p < 0.05) in terms of NDCG compared to the Doubly Robust estimator from the literature.
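
As an illustration only, a similarity-aware propensity estimate might let a low-ranked clicked document borrow the higher examination propensity of the highly-ranked document it most resembles; the estimator below is a sketch under that reading, not the paper's exact IPSsim formula, and every threshold is a made-up default.

```python
import numpy as np

def ips_sim_weights(click_ranks, doc_embs, propensity, sim_threshold=0.8):
    """click_ranks: (n,) 1-based rank of each clicked document;
    doc_embs: (n, dim) L2-normalized embeddings of the clicked documents;
    propensity: (max_rank,) examination probability per rank position."""
    weights = np.empty(len(click_ranks))
    top = [i for i, r in enumerate(click_ranks) if r <= 10]  # "highly ranked"
    for i, rank in enumerate(click_ranks):
        p = propensity[rank - 1]
        if top:
            sims = doc_embs[top] @ doc_embs[i]
            j = top[int(np.argmax(sims))]
            if sims.max() >= sim_threshold:
                # Similar to a highly-ranked document: adopt the higher
                # propensity, shrinking the inverse-propensity weight.
                p = max(p, propensity[click_ranks[j] - 1])
        weights[i] = 1.0 / p
    return weights
```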

Balancing Precision and Generalization: Dynamic Instruction Generation for Model Adaptive Zero-Shot Reasoning in LLMs

  • Ruihan Zhu
  • Bo Wang
  • Dongming Zhao
  • Jing Liu
  • Ruifang He
  • Yuexian Hou

Current research shows that providing instructions to guide Large Language Models (LLMs) improves performance on reasoning tasks, but existing methods struggle to balance accuracy and generalization. Manually crafted instructions tailored to specific LLMs and tasks improve performance but reduce generalizability, while more general instructions lack detail and lower performance. To address this, we propose a dynamic instruction-generation method using an Instruction-Generation Prompt (IGP). IGP categorizes problems into domains and integrates the model's capabilities to generate detailed task-specific instructions, resulting in a comprehensive plan. This approach achieves high precision with general prompts, without requiring in-depth knowledge of LLMs or tasks. We validated our method across five LLMs and ten datasets in three task categories. Our dynamically generated instructions outperformed traditionally handcrafted, LLM-specific instructions across various LLMs and tasks.

Evaluating Contrastive Feedback for Effective User Simulations

  • Andreas Konstantin Kruff
  • Timo Breuer
  • Philipp Schaer

The use of Large Language Models (LLMs) for simulating user behavior in the domain of Interactive Information Retrieval has recently gained significant popularity. However, their application and capabilities remain highly debated and understudied. This study explores whether the underlying principles of contrastive training techniques, which have been effective for fine-tuning LLMs, can also be applied beneficially in the area of prompt engineering for user simulations.

Previous research has shown that LLMs possess comprehensive world knowledge, which can be leveraged to provide accurate estimates of relevant documents. This study attempts to simulate a knowledge state by enhancing the model with additional implicit contextual information gained during the simulation. This approach enables the model to refine the scope of desired documents further. The primary objective of this study is to analyze how different modalities of contextual information influence the effectiveness of user simulations.

Various user configurations were tested, where models are provided with summaries of already judged relevant, irrelevant, or both types of documents in a contrastive manner. The focus of this study is the assessment of the impact of the prompting techniques on the simulated user agent performance. We hereby lay the foundations for leveraging LLMs as part of more realistic simulated users.

Effective Inference-Free Retrieval for Learned Sparse Representations

  • Franco Maria Nardini
  • Thong Nguyen
  • Cosimo Rulli
  • Rossano Venturini
  • Andrew Yates

Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words. Several efforts in the literature have shown that sparsity is key to enabling a good trade-off between the efficiency and effectiveness of the query processor. To induce the right degree of sparsity, researchers typically use regularization techniques when training LSR models. Recently, new efficient inverted-index-based retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models? In this paper, we conduct an extended evaluation of regularization approaches for LSR in which we discuss their effectiveness, efficiency, and out-of-domain generalization capabilities. We first show that regularization can be relaxed to produce more effective LSR encoders. We also show that query encoding is now the bottleneck limiting overall query processor performance. To remove this bottleneck, we advance the state of the art of inference-free LSR by proposing Learned Inference-free Retrieval (Li-Lsr). At training time, Li-Lsr learns a score for each token, casting the query encoding step into a seamless table lookup. Our approach yields state-of-the-art effectiveness for both in-domain and out-of-domain evaluation, surpassing Splade-v3-Doc by 1 point of MRR@10 on MS MARCO and 1.8 points of nDCG@10 on BEIR.
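
The inference-free query encoding amounts to a table lookup over learned per-token scores; a minimal sketch follows (variable names are illustrative).

```python
def encode_query_inference_free(query_tokens, token_scores):
    """token_scores: dict mapping token -> learned scalar weight,
    produced once at training time; no model forward pass at query time."""
    weights = {}
    for tok in query_tokens:
        if tok in token_scores:
            weights[tok] = weights.get(tok, 0.0) + token_scores[tok]
    return weights  # sparse query vector ready for the inverted index

# e.g., encode_query_inference_free(["cheap", "flights"],
#                                   {"cheap": 1.3, "flights": 2.1})
```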

Are Information Retrieval Approaches Good at Harmonising Longitudinal Surveys in Social Science?

  • Wing Yan Li
  • Zeqiang Wang
  • Jon Johnson
  • Suparna De

Automated detection of semantically equivalent questions in longitudinal social science surveys is crucial for long-term studies informing empirical research in the social, economic, and health sciences. Retrieving equivalent questions faces dual challenges: inconsistent representation of theoretical constructs (i.e. concept/sub-concept) across studies as well as between question and response options, and the evolution of vocabulary and structure in longitudinal text. To address these challenges, our multi-disciplinary collaboration of computer scientists and survey specialists presents a new information retrieval (IR) task of identifying concept (e.g. Housing, Job, etc.) equivalence across question and response options to harmonise longitudinal population studies. This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020, including probabilistic models, linear probing of language models, and pre-trained neural networks specialised for IR. We show that IR-specialised neural models achieve the highest overall performance, with other approaches performing comparably. Additionally, re-ranking the probabilistic model's results with neural models introduces only modest improvements of at most 0.07 in F1-score. Qualitative post-hoc evaluation by survey specialists shows that models generally have low sensitivity to questions with high lexical overlap, particularly in cases where sub-concepts are mismatched. Altogether, our analysis serves to further research on harmonising longitudinal studies in social science.

Aligning Web Query Generation with Ranking Objectives via Direct Preference Optimization

  • João Coelho
  • Bruno Martins
  • João Magalhães
  • Chenyan Xiong

Neural retrieval models excel in Web search, but their training requires substantial amounts of labeled query-document pairs, which are costly to obtain. With the widespread availability of Web document collections like ClueWeb22, synthetic queries generated by large language models offer a scalable alternative. Still, synthetic training queries often vary in quality, which leads to suboptimal downstream retrieval performance. Existing methods typically filter out noisy query-document pairs based on signals from an external re-ranker. In contrast, we propose a framework that leverages Direct Preference Optimization (DPO) to integrate ranking signals into the query generation process, aiming to directly optimize the model towards generating high-quality queries that maximize downstream retrieval effectiveness. Experiments show higher ranker-assessed relevance between query-document pairs after DPO, leading to stronger downstream performance on the MS MARCO benchmark when compared to baseline models trained with synthetic data.
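
One way to picture the integration of ranking signals is to build DPO preference pairs directly from ranker scores over queries sampled for the same document; the helper below is a hypothetical sketch in which ranker_score can be any relevance scorer, not the paper's exact pipeline.

```python
def build_dpo_pairs(doc, queries, ranker_score, margin=0.1):
    """Return (prompt, chosen, rejected) triples: a synthetic query that the
    ranker scores clearly higher for `doc` is preferred over a weaker one."""
    scored = sorted(((ranker_score(q, doc), q) for q in queries), reverse=True)
    pairs = []
    # Pair the strongest queries with the weakest ones, keeping only pairs
    # separated by at least `margin` so preferences are unambiguous.
    for (hi_s, hi_q), (lo_s, lo_q) in zip(scored, reversed(scored)):
        if hi_s - lo_s >= margin:
            pairs.append((doc, hi_q, lo_q))
    return pairs
```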

Dynamic Superblock Pruning for Fast Learned Sparse Retrieval

  • Parker Carlson
  • Wentai Xie
  • Shanxiu He
  • Tao Yang

This paper proposes superblock pruning (SP) for top-k online document retrieval over learned sparse representations. SP structures the sparse index as a set of superblocks over a sequence of document blocks and performs superblock-level selection to decide whether some superblocks can be pruned before visiting their child blocks. SP generalizes previous flat block-based and cluster-based pruning, allowing the early detection of groups of documents that cannot, or are unlikely to, appear in the final top-k list. SP can accelerate sparse retrieval in a rank-safe or approximate manner under a high-relevance competitiveness constraint. Our experiments show that the proposed scheme significantly outperforms state-of-the-art baselines on MS MARCO passages on a single-threaded CPU.
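
A rank-safe sketch of the superblock selection step, assuming each superblock stores per-term maximum weights so that every document score inside it can be upper-bounded before any child block is visited; the data layout and the score_docs callable are illustrative.

```python
import heapq

def superblock_topk(query, superblocks, k=10):
    """query: dict term -> weight;
    superblocks: list of (max_term_weight: dict, score_docs: callable),
    where score_docs(query) yields exact document scores in the block."""
    heap = []  # min-heap holding the current top-k scores
    for max_w, score_docs in superblocks:
        # Upper bound on any document score within this superblock.
        bound = sum(w * max_w.get(t, 0.0) for t, w in query.items())
        if len(heap) == k and bound <= heap[0]:
            continue  # prune: no document here can enter the top-k
        for s in score_docs(query):
            if len(heap) < k:
                heapq.heappush(heap, s)
            elif s > heap[0]:
                heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)
```

Relaxing the pruning condition (e.g., pruning when bound <= theta * heap[0] for some theta > 1) would turn the same loop into an approximate variant.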

A Large-Scale Study of Reranker Relevance Feedback at Inference

  • Revanth Gangi Reddy
  • Pradeep Dasigi
  • Md Arafat Sultan
  • Arman Cohan
  • Avirup Sil
  • Heng Ji
  • Hannaneh Hajishirzi

Neural IR systems often employ a retrieve-and-rerank framework: a bi-encoder retrieves a fixed number of candidates (e.g., K=100), which a cross-encoder then reranks. Recent studies have indicated that relevance feedback from the reranker at inference time can improve the recall of the retriever. The approach works by updating the retriever's query representations via a distillation process that aligns them with the reranker's predictions. While a powerful idea, the arguably narrow scope of past studies, focusing on a small number of specific domains such as English question answering and entity retrieval, has left a gap in our understanding of how well it generalizes. In this paper, we study inference-time reranker relevance feedback extensively across multiple retrieval domains, languages, and modalities, while also investigating aspects such as the performance and latency implications of the number of distillation updates and feedback candidates.

Bridging Time Gaps: Temporal Logic Relations for Enhancing Temporal Reasoning in Large Language Models

  • Xintong Song
  • Bin Liang
  • Yang Sun
  • Chenhua Zhang
  • Bingbing Wang
  • Ruifeng Xu

The understanding and cognition of time are fundamental to large language models' understanding of the world. Although large language models (LLMs) have demonstrated strong capabilities on multiple reasoning tasks, they still have significant deficiencies in temporal reasoning, mainly due to the diversity of temporal expressions and the lack of temporal logic reasoning capabilities. In this study, we propose a novel Temporal Chain of Thought framework (TempCoT) to improve the performance of LLMs on temporal reasoning tasks through a three-stage reasoning strategy. First, TempCoT explicitly extracts time constraints to ensure the accuracy of time references during reasoning. Second, a semantic retrieval mechanism is introduced to dynamically obtain key temporal facts to enhance the integrity and reliability of information. Finally, an explicit temporal logic reasoning module is constructed based on point algebra to improve the consistency and interpretability of reasoning. Experimental results show that TempCoT significantly improves the temporal reasoning performance of five different LLMs and shows stronger robustness on complex temporal tasks.

Augmenting Vision-Language Retrieval: The Role of Multimodal LLMs as Synthetic Data Generators

  • Aidan Bell
  • James Gore
  • Behrooz Mansouri

Multimodal Large Language Models (MLLMs) connect and interpret different data types, making them suitable for various vision-language tasks. Despite the rapid advancements in MLLMs, their effectiveness for specialized cross-modal retrieval tasks remains underexplored. A challenging example is art retrieval, where the task is to find visually and conceptually relevant artwork corresponding to a textual description. This paper investigates the effects of fine-tuning cross-modal retrieval models using both human-annotated and MLLM-generated captions for artistic paintings. To this end, two cross-modal retrieval models, Long-CLIP and BLIP, are studied. Experimental results show that models fine-tuned on MLLM-generated captions achieve search effectiveness comparable to those fine-tuned on human-annotated captions.

Deep Multiple Quantization Network on Long Behavior Sequence for Click-Through Rate Prediction

  • Zhuoxing Wei
  • Qi Liu
  • Qingchen Xie

In Click-Through Rate (CTR) prediction, the long behavior sequence, comprising a user's historical interactions with items over a long period, has a vital influence on assessing the user's interest in the candidate item. Existing approaches balance efficiency and effectiveness through a two-stage paradigm: first retrieving hundreds of candidate-related items and then extracting an interest intensity vector through target attention. However, we argue that the discrepancy in target attention's relevance distribution between the retrieved items and the full long behavior sequence inevitably leads to a performance decline. To alleviate this discrepancy, we propose the Deep Multiple Quantization Network (DMQN), which processes the long behavior sequence end-to-end by compressing it. Firstly, the entire long behavior sequence is quantized into multiple codeword sequences based on multiple independent codebooks. A Hierarchical Sequential Transduction Unit is incorporated to facilitate the interaction of the reduced codeword sequences. Then, attention between the candidate and the multiple codeword sequences outputs the interest vector. To enable online serving, intermediate representations of the codeword sequences are cached, significantly reducing latency. Our extensive experiments on both industrial and public datasets confirm the effectiveness and efficiency of DMQN. An A/B test in our advertising system shows that DMQN improves CTR by 3.5% and RPM by 2.0%.
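
The compression step can be sketched as nearest-codeword quantization under multiple independent codebooks (illustrative only; DMQN additionally interacts the codeword sequences through its Hierarchical Sequential Transduction Unit and caches their representations for serving).

```python
import torch

def quantize_sequence(behavior_embs, codebooks):
    """behavior_embs: (seq_len, dim) item embeddings of the user history;
    codebooks: list of (num_codes, dim) tensors, independently learned."""
    codeword_seqs = []
    for cb in codebooks:
        dists = torch.cdist(behavior_embs, cb)     # distance to each codeword
        codeword_seqs.append(dists.argmin(dim=1))  # nearest-codeword ids
    return codeword_seqs  # multiple compressed views of the long sequence
```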

SESSION: Low Resource Environment Papers

Dense Retrieval for Low Resource languages - the Case of Amharic Language

  • Tilahun Yeshambel
  • Moncef Garouani
  • Serge Molina
  • Josiane Mothe

This paper presents our investigation into dense retrieval models for Amharic, a low-resource language spoken by more than 120 million people. We constructed training datasets tailored to dense retrieval models and evaluated model performance by comparing dense and sparse retrieval approaches on Amharic information retrieval. The study also highlights the challenges and efforts involved in advancing retrieval systems for low-resource languages.

Advancing Chichewa IR

  • Stanley Ndebvu
  • Reuben Moyo
  • Catherine Chavula

Malawi is home to over ten local languages, including Chichewa, yet many of these languages lack both printed and digital resources. Consequently, access to information in these languages is limited, and this hinders knowledge sharing, which may potentially impact socio-economic development. In this paper, we discuss our work on developing language resources and tools for Chichewa. We begin by providing an overview of the Chichewa language and highlight its inherent complexities that require new approaches to information retrieval (IR) and natural language processing (NLP). We then present our past, current, and ongoing research and conclude with future directions. Our goal is to engage with the IR community to discuss how we can advance IR for low-resource languages (LRLs) like Chichewa.

Towards Enhanced Agricultural Information Access in Kiswahili: Integrating Knowledge Graphs and Retrieval-Augmented Generation

  • Joseph P. Telemala
  • Neema N. Lyimo
  • Anna R. Kimaro
  • Camilius A. Sanga

Access to and consumption of agricultural research findings remains a challenge for Kiswahili-speaking farmers and extension officers in Tanzania due to the predominance of English in agricultural scholarly publications. To address this challenge, the Mkulima repository, a digital collection of over 600 Swahili agricultural publications, was developed at the Sokoine University of Agriculture to provide agricultural knowledge in Kiswahili. However, its current structure limits effective retrieval and accessibility, given its intended audience of smallholder farmers. This work-in-progress aims to improve access to agricultural knowledge in Kiswahili through a hybrid model that integrates a domain-specific Knowledge Graph (KG) with Retrieval-Augmented Generation (RAG), an approach that combines traditional retrieval with generative language models to produce informed answers. The project's findings aim to contribute to AI-driven retrieval systems for low-resource languages, with results targeted for submission as a paper to SIGIR 2026.

Some Things Never Change: Overcoming Persistent Challenges in Children IR

  • Maria Soledad Pera
  • Theo Huibers
  • Emiliana Murgia
  • Monica Landoni

There is a lack of a steady and solid influx of information retrieval (IR) research that has children (as the user group) as the protagonist. Existing work is scattered, conducted by only a few research groups, and often based on small-scale user studies or data that cannot be widely shared. Moreover, much of the current research focuses on specific age ranges and abilities, neglecting the broader spectrum of children's needs. Consequently, the paucity of IR research on how search and recommender systems serve and/or ultimately affect children translates into one of many 'Low-resource environments' in IR. Drawing from the literature and our experience in this area, we highlight key challenges and encourage greater attention from the IR community to address this critical gap.

IR for AAC Users: A Hyperdimensional Computing (Vector Symbolic Architectures) Approach

  • Hunter Briegel
  • Maya Pagal
  • J. Shane Culpepper

This work proposes Hyperdimensional Computing (HDC) as a design paradigm [13] to facilitate search and recommendation activities for disabled users employing symbolic augmentative and alternative communication (AAC) systems. Such a context necessitates flexibility and composability in item and query representations as a consequence of vocabularies being tailored to an individual user. HDC is suggested to meet these needs in an efficient manner. However, construction and empirical evaluation are left to additional research.

Project development for symbolic AAC supports a small but highly diverse user base. This creates a ''low resource environment'' from both an economic and technical perspective. The heterogeneity of user communication abilities, preferences, and symbolic interpretation limits the generalisability of datasets and presents a cold start problem. Furthermore, countries with underdeveloped social welfare and health infrastructure, the Philippines serving as an illustrative example, have few practitioners able to fine-tune AAC devices or support inclusive decision making. Consequently, this exacerbates resource constraints, which motivates systems to offer consumer-level ease of use. Rural areas, in any socioeconomic setting, lack reliable network connectivity: designers should anticipate regular on-device computation, leaving networked tasks to limited occasions. Technologies for disabilities, broadly speaking, are related to all 17 UN Sustainable Development Goals [21]. Additionally, AAC is a focus area of UNICEF's work on disabilities and inclusion for marginalised children [2].

By proposing HDC as a way to mediate diverse feature spaces, search and recommendation applications are able to reuse semantic components and process structures, reducing costs. This flexibility enables adaptation to underserved cultures, where localised symbol interpretations are particularly needed [3]. A similar approach has been taken to circulate open access symbols [2]. HDC's development comes from an interdisciplinary perspective, implementing cognitive and linguistic science precepts computationally [8]. Building on this idea, AAC for search requires wider collaboration with entities outside the traditional information retrieval (IR) community, such as speech pathologists, disability support workers, and end users with disabilities.

SESSION: Reproducibility Papers

A Reproducibility Study of Graph-Based Legal Case Retrieval

  • Gregor Donabauer
  • Udo Kruschwitz

Legal retrieval is a widely studied area in Information Retrieval (IR), and a key task in this domain is retrieving relevant cases based on a given query case, often done by applying language models as encoders to model case similarity. Recently, Tang et al. proposed CaseLink, a novel graph-based method for legal case retrieval that models both cases and legal charges as nodes in a network, with edges representing relationships such as references and shared semantics. This approach offers a new perspective on the task by capturing higher-order relationships between cases, going beyond the stand-alone level of documents. However, while this shift in approaching legal case retrieval is a promising direction in an understudied area of graph-based legal IR, challenges in reproducing novel results have recently been highlighted, with multiple studies reporting difficulties in replicating previous findings. Thus, in this work we reproduce CaseLink to support future research in this area of IR. In particular, we aim to assess its reliability and generalizability by (i) first reproducing the original study setup and (ii) applying the approach to an additional dataset. We then build upon the original implementations by (iii) evaluating the approach's performance when using a more sophisticated graph data representation and (iv) using an open large language model (LLM) in the pipeline to address limitations that are known to result from using closed models accessed via an API. Our findings aim to improve the understanding of graph-based approaches in legal IR and contribute to improving reproducibility in the field. To achieve this, we share all our implementations and experimental artifacts with the community.

Accelerating Listwise Reranking: Reproducing and Enhancing FIRST

  • Zijian Chen
  • Ronak Pradeep
  • Jimmy Lin

Large language models (LLMs) have emerged as powerful listwise rerankers but remain prohibitively slow for many real-world applications. Moreover, training on the language modeling (LM) objective is not intrinsically aligned with reranking tasks. To address these challenges, FIRST, a novel approach for listwise reranking, integrates a learning-to-rank objective and leverages only the logits of the first generated token for reranking, significantly reducing computational overhead while preserving effectiveness. We systematically evaluate the capabilities and limitations of FIRST. By extending its evaluation to TREC Deep Learning collections (DL19-23), we show that FIRST achieves robust out-of-domain effectiveness. Through training FIRST on a variety of backbone models, we demonstrate its generalizability across different model architectures, achieving effectiveness that surpasses the original implementation. Further analysis of the interaction between FIRST and various first-stage retrievers reveals diminishing returns akin to traditional LLM rerankers. A comprehensive latency study confirms that FIRST consistently delivers a 40% efficiency gain over traditional rerankers without sacrificing effectiveness. Notably, while LM training implicitly improves zero-shot single-token reranking, our experiments also highlight potential conflicts between LM pre-training and subsequent fine-tuning on the FIRST objective. These findings pave the way for more efficient and effective listwise reranking in future applications. Our code is available at: https://rankllm.ai.
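
The single-token trick can be sketched as follows, assuming each passage in the listwise prompt is tagged with a one-token identifier (''A'', ''B'', ...) so that a single forward pass yields a full candidate ordering.

```python
import torch

def first_token_rank(logits, candidate_token_ids):
    """logits: (vocab,) next-token logits after the listwise prompt;
    candidate_token_ids: vocabulary ids of the identifier tokens."""
    scores = logits[candidate_token_ids]           # one logit per candidate
    order = torch.argsort(scores, descending=True)
    return order.tolist()                          # permutation of candidates
```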

Benchmark Granularity and Model Robustness for Image-Text Retrieval: A Reproducibility Study

  • Mariya Hendriksen
  • Shuo Zhang
  • Ridho Reinanda
  • Mohamed Yahya
  • Edgar Meij
  • Maarten de Rijke

Image-Text Retrieval (ITR) systems are central to multimodal information access, with Vision-Language Models (VLMs) showing strong performance on standard benchmarks. However, these benchmarks predominantly rely on coarse-grained annotations, limiting their ability to reveal how models would perform under real-world conditions, where query granularity varies. Motivated by this gap, we examine how dataset granularity and query perturbations affect retrieval performance and robustness across four architecturally diverse VLMs (ALIGN, AltCLIP, CLIP, and GroupViT). Using both standard benchmarks (MS-COCO, Flickr30k) and their fine-grained variants, we show that richer captions consistently enhance retrieval, especially in text-to-image tasks, where we observe an average improvement of 16.23%, compared to 6.44% in image-to-text. To assess robustness, we introduce a taxonomy of perturbations and conduct extensive experiments, revealing that while perturbations typically degrade performance, they can also unexpectedly improve retrieval, exposing nuanced model behaviors. Notably, word order emerges as a critical factor, contradicting prior assumptions that models are insensitive to it. Our results highlight variation in model robustness and a dataset-dependent relationship between caption granularity and perturbation sensitivity, and emphasize the necessity of evaluating models on datasets of varying granularity.

Gosling Grows Up: Retrieval with Learned Dense and Sparse Representations Using Anserini

  • Jimmy Lin
  • Arthur Haonan Chen
  • Carlos Lassance
  • Xueguang Ma
  • Ronak Pradeep
  • Tommaso Teofili
  • Jasper Xian
  • Jheng-Hong Yang
  • Brayden Zhong
  • Vincent Zhong

The Anserini IR toolkit has come a long way since efforts began in 2015. Although the goals of the project - to bridge research and practice in information retrieval, and to provide reproducible, easy-to-use baselines - have remained constant, the world has changed quite a bit. We discuss how Anserini has evolved in response to this changing environment, the most significant change being the advent of transformer-based retrieval models, which did not exist when the project started. The bi-encoder architecture provides a framework for understanding retrieval models based on dense and sparse vector representations, and offers a reference for conveying the capabilities of our toolkit. Anserini provides end-to-end first-stage retrieval based on single-vector learned dense and sparse representations, directly building on the open-source Lucene search library and the ONNX runtime. This minimal design accelerates the pace of research and fosters reproducibility, enabling ''two-click reproductions''. By better aligning research and practice, we increase the potential real-world impact of research innovations.
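
From Python, Anserini's prebuilt Lucene indexes are commonly accessed through its companion toolkit Pyserini; a minimal BM25 example over one of the prebuilt MS MARCO indexes:

    # First-stage BM25 retrieval over a prebuilt Anserini/Lucene index.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
    for hit in searcher.search("what causes tides", k=5):
        print(hit.docid, round(hit.score, 3))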

Refined Medical Search via Dense Retrieval and User Interaction

  • Reyhaneh Goli
  • Alistair Moffat
  • George Buchanan

Users formulate search queries that reflect an information need. Those queries are then submitted to a search service in the expectation that the retrieved results will allow the user to complete an external task and will align with their broader information context.

In this study, we reimplement and reproduce the log-augmented dense retrieval approach introduced by Jin, Shin, and Lu in 2023. As part of our study we extend the experimentation by: (1) using all of the available training data rather than a subset; (2) exploring a second dense retrieval model; and (3) enhancing the approach by incorporating user reformulation behavior into the dense retrieval computation so as to improve ranking effectiveness. To evaluate these modifications, we again utilize the TripClick IR benchmark, which comprises approximately four million click log entries from a health domain web search engine.

Although we were unable to exactly replicate the results of Jin, Shin, and Lu, our findings confirm the overall trends reported in their study. Specifically, our results support the conclusion that incorporating user interactions into dense retrieval models improves ranking effectiveness compared to when no user information is available. Moreover, our enhanced formulation yields further small gains in retrieval effectiveness.
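
Schematically, one way to fold interaction signals into a dense retriever is to mix the query embedding with embeddings of previously clicked documents; the interpolation below is our simplification for illustration, not the exact formulation of Jin, Shin, and Lu or of the enhancement studied here.

    # Schematic log-augmented query representation (simplified illustration).
    import numpy as np

    def log_augmented_query(q_vec, clicked_doc_vecs, alpha=0.7):
        if not clicked_doc_vecs:
            return q_vec
        centroid = np.mean(clicked_doc_vecs, axis=0)  # signal from the click log
        mixed = alpha * q_vec + (1.0 - alpha) * centroid
        return mixed / np.linalg.norm(mixed)  # unit length for dot-product search

    q = np.random.rand(768)
    clicks = [np.random.rand(768) for _ in range(3)]
    print(log_augmented_query(q, clicks).shape)  # (768,)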

Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction

  • Jingfen Qiao
  • Jia-Huei Ju
  • Xinyu Ma
  • Evangelos Kanoulas
  • Andrew Yates

Visual Document Retrieval (VDR) is an emerging research area that focuses on encoding and retrieving document images directly, bypassing the dependence on Optical Character Recognition (OCR) for document search. A recent advance in VDR was introduced by ColPali, which significantly improved retrieval effectiveness through a late interaction mechanism. On an established benchmark, ColPali demonstrated substantial performance gains over existing baselines that do not use late interaction. In this study, we investigate the reproducibility and replicability of VDR methods with and without late interaction mechanisms by systematically evaluating their performance across multiple pre-trained vision-language models. Our findings confirm that late interaction yields considerable improvements in retrieval effectiveness; however, it also introduces computational inefficiencies during inference. Additionally, we examine the adaptability of VDR models to textual inputs and assess their robustness across text-intensive datasets within the proposed benchmark, particularly when scaling the indexing mechanism. Furthermore, we investigate the specific contributions of late interaction by examining query-patch matching in the context of visual document retrieval. We find that although query tokens cannot explicitly match image patches as in the text retrieval scenario, they tend to match the patches that contain visually similar tokens, or the patches surrounding them.
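
The late interaction scoring itself is the standard ColBERT-style MaxSim: each query token takes its best-matching patch embedding, and these maxima are summed.

    # MaxSim late interaction between query tokens and image-patch embeddings.
    import torch

    def maxsim_score(query_emb, patch_emb):
        # query_emb: (num_query_tokens, dim); patch_emb: (num_patches, dim)
        sim = query_emb @ patch_emb.T          # (tokens, patches)
        return sim.max(dim=1).values.sum()     # best patch per token, summed

    q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
    d = torch.nn.functional.normalize(torch.randn(1030, 128), dim=-1)
    print(maxsim_score(q, d))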

Reproducing NevIR: Negation in Neural Information Retrieval

  • Coen van den Elsen
  • Francien Barkhof
  • Thijmen Nijdam
  • Simon Lupart
  • Mohammad Aliannejadi

Negation is a fundamental aspect of human communication, yet it remains a challenge for Language Models (LMs) in Information Retrieval (IR). Despite the heavy reliance of modern neural IR systems on LMs, little attention has been given to their handling of negation. In this study, we reproduce and extend the findings of NevIR, a benchmark study that revealed most IR models perform at or below the level of random ranking when dealing with negation. We replicate NevIR's original experiments and evaluate newly developed state-of-the-art IR models. Our findings show that a recently emerging category, listwise Large Language Model (LLM) re-rankers, outperforms other models but still falls short of human performance. Additionally, we leverage ExcluIR, a benchmark dataset designed for exclusionary queries with extensive negation, to assess the generalisability of negation understanding. Our findings suggest that fine-tuning on one dataset does not reliably improve performance on the other, indicating notable differences in their data distributions. Furthermore, we observe that only cross-encoders and listwise LLM re-rankers achieve reasonable performance across both negation tasks.
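
NevIR's headline metric is a pairwise accuracy over contrastive pairs; a minimal sketch, where score(query, doc) stands in for any retriever or reranker:

    # A pair counts as correct only if the model prefers the right document
    # for *both* queries; random scoring solves a pair with probability 0.25,
    # which is what "at the level of random ranking" refers to.
    def pairwise_accuracy(pairs, score):
        correct = 0
        for q1, q2, d1, d2 in pairs:  # d1 is relevant to q1, d2 to q2
            if score(q1, d1) > score(q1, d2) and score(q2, d2) > score(q2, d1):
                correct += 1
        return correct / len(pairs)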

Revisiting Algorithmic Audits of TikTok: Poor Reproducibility and Short-term Validity of Findings

  • Matej Mosnar
  • Adam Skurla
  • Branislav Pecher
  • Matus Tibensky
  • Jan Jakubcik
  • Adrian Bindas
  • Peter Sakalik
  • Ivan Srba

Social media platforms are constantly shifting towards algorithmically curated content based on implicit or explicit user feedback. Regulators, as well as researchers, are calling for systematic social media algorithmic audits, as this shift risks enclosing users in filter bubbles and steering them towards more problematic content. An important aspect of such audits is the reproducibility and generalisability of their findings, as these allow verifiable conclusions to be drawn and potential changes in algorithms to be audited over time. In this work, we study the reproducibility of existing sockpuppeting audits of TikTok recommender systems and the generalizability of their findings. In our efforts to reproduce the previous works, we encounter multiple challenges stemming from social media platform changes and content evolution, but also from the research works themselves. These drawbacks limit audit reproducibility and require extensive effort, along with inevitable adjustments to the auditing methodology. Our experiments also reveal that these one-shot audit findings often hold only in the short term, implying that the reproducibility and generalizability of the audits heavily depend on the methodological choices and the state of algorithms and content on the platform. This highlights the importance of reproducible audits that allow us to determine how the situation changes over time.

SESSION: Resource Papers

An EEG Dataset of Word-level Brain Responses for Semantic Text Relevance

  • Vadym Gryshchuk
  • Michiel M. Spapé
  • Maria Maistro
  • Christina Lioma
  • Tuukka Ruotsalo

Electroencephalography (EEG) can enable non-invasive, real-time measurement of brain activity reflecting cognitive processes during human language processing. Previously released EEG datasets primarily capture brain signals recorded either during natural reading or within controlled psycholinguistic experimental settings. Given that information retrieval research depends on understanding and modelling relevance, we present a novel dataset including EEG data recorded while participants read text that is semantically relevant or irrelevant to self-selected topics. The dataset contains 23,270 time-locked (∼0.7 s) word-level EEG recordings. Using these data, we conduct benchmark experiments with two evaluation protocols, cross-subject and within-subject, focusing on two prediction tasks: word relevance and sentence relevance. We report the performance of five well-known models on these tasks. Altogether, our dataset paves the way for advancing research on language relevance, brain input and feedback-based recommendation and retrieval systems, and the development of brain-computer interface (BCI) devices for online detection of language relevance. Our dataset and code are openly released at https://osf.io/xh3g5/wiki/home/ and at HuggingFace: https://huggingface.co/datasets/Quoron/EEG-semantic-text-relevance.
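
The HuggingFace copy can be inspected with the standard datasets API; the splits and fields printed below come from the dataset itself, so consult the dataset card for their exact meaning.

    # Load the released EEG dataset and inspect its schema.
    from datasets import load_dataset

    ds = load_dataset("Quoron/EEG-semantic-text-relevance")
    print(ds)  # splits, column names, and sizes
    first_split = list(ds.keys())[0]
    print(next(iter(ds[first_split])).keys())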

CoLoTa: A Dataset for Entity-based Commonsense Reasoning over Long-Tail Knowledge

  • Armin Toroghi
  • Willis Guo
  • Scott Sanner

The rise of Large Language Models (LLMs) has redefined the AI landscape, particularly due to their ability to encode factual and commonsense knowledge, and their outstanding performance in tasks requiring reasoning. Despite these advances, hallucinations and reasoning errors remain a significant barrier to their deployment in high-stakes settings. In this work, we observe that even the most prominent LLMs, such as OpenAI-o1, suffer from high rates of reasoning errors and hallucinations on tasks requiring commonsense reasoning over obscure, long-tail entities. To investigate this limitation, we present a new dataset for Commonsense reasoning over Long-Tail entities (CoLoTa), which consists of 3,300 queries from question answering and claim verification tasks and covers a diverse range of commonsense reasoning skills. We remark that CoLoTa can also serve as a Knowledge Graph Question Answering (KGQA) dataset, since the supporting knowledge required to answer its queries is present in the Wikidata knowledge graph. However, as opposed to existing KGQA benchmarks that merely focus on factoid questions, our CoLoTa queries also require commonsense reasoning. Our experiments with strong LLM-based KGQA methodologies indicate their severe inability to answer queries involving commonsense reasoning. Hence, we propose CoLoTa as a novel benchmark for assessing both (i) LLM commonsense reasoning capabilities and their robustness to hallucinations on long-tail entities and (ii) the commonsense reasoning capabilities of KGQA methods.

Doctron: A Web-based Collaborative Annotation Tool for Ground Truth Creation in IR

  • Ornella Irrera
  • Stefano Marchesin
  • Farzad Shami
  • Gianmaria Silvello

In Information Retrieval (IR), ground truth creation is a crucial yet resource-intensive task that relies on human experts to build test collections, which are essential for training and evaluating retrieval models. Large-scale evaluation campaigns, such as TREC and CLEF, demand significant human effort to produce reliable, high-quality annotations. To ease this process, tailored annotation tools are pivotal to supporting assessors and streamlining their workload. To this end, we introduce Doctron, a web-based, dockerized annotation tool designed to streamline ground truth creation for IR tasks. Doctron enables the annotation of both textual documents and images. It supports annotating textual passages, identifying relationships, tagging and linking entities, evaluating document relevance to a topic with graded labels, and performing object detection. It offers a collaborative environment where teams can work with defined user roles and permissions. The integration of Inter Annotator Agreement (IAA) measures helps to identify inconsistencies between annotators, thereby ensuring the reliability and high quality of the annotated ground truth data.

FairDiverse: A Comprehensive Toolkit for Fairness- and Diversity-aware Information Retrieval

  • Chen Xu
  • Zhirui Deng
  • Clara Rus
  • Xiaopeng Ye
  • Yuanna Liu
  • Jun Xu
  • Zhicheng Dou
  • Ji-Rong Wen
  • Maarten de Rijke

In modern information retrieval (IR), going beyond accuracy is crucial for maintaining a healthy ecosystem, particularly in meeting fairness and diversity requirements. To address these needs, various datasets, algorithms, and evaluation methods have been developed. These algorithms are often tested with different metrics, datasets, and experimental settings, making comparisons inconsistent and challenging. Consequently, there is an urgent need for a comprehensive IR toolkit, enabling standardized assessments of fairness- and diversity-aware algorithms across IR tasks. To address these issues, we introduce an open-source standardized toolkit called FairDiverse. First, FairDiverse provides a comprehensive framework for incorporating fairness- and diversity-aware approaches, including pre-processing, in-processing, and post-processing methods, into different pipeline stages of IR. Second, FairDiverse enables the evaluation of 29 fairness and diversity algorithms across 16 base models for two fundamental IR tasks, search and recommendation, facilitating the establishment of a comprehensive benchmark. Finally, FairDiverse is highly extensible, offering multiple APIs that enable IR researchers to quickly develop their own fairness- and diversity-aware IR models, and it allows for fair comparisons with existing baselines. The project is open-sourced on GitHub: https://github.com/XuChen0427/FairDiverse.

JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System

  • Weihang Su
  • Baoqing Yue
  • Qingyao Ai
  • Yiran Hu
  • Jiaqi Li
  • Changyue Wang
  • Kaiyuan Zhang
  • Yueyue Wu
  • Yiqun Liu

This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance in this task, there is still substantial room for further improvement. All code and datasets are available at: https://github.com/oneal2000/JuDGE.

KIMERA: From Evaluation-as-a-Service to Evaluation-in-the-Cloud

  • Andrea Pasin
  • Nicola Ferro

Experimental evaluation steers the development of Information Retrieval (IR) systems, and large-scale evaluation campaigns provide the field with a common infrastructure to conduct comparable evaluation exercises. Over the years, tools and platforms have been developed to manage and automate these activities, enhance the reproducibility of conducted experiments, and facilitate data sharing. In this context, Evaluation-as-a-Service (EaaS) emerged as an approach to avoid distributing experimental collections, which may contain copyrighted or sensitive data, and instead execute containerised code on that data on remote servers. We propose Kubernetes Infrastructure for Managed Evaluation and Resource Access (KIMERA) as the next step from EaaS into Evaluation-in-the-Cloud (EitC), allowing researchers to directly code and execute their systems through their browsers, requiring only an internet connection. Moreover, recent advancements, such as Large Language Models, or new computing paradigms, such as quantum computers, require external third-party services and computational resources. In this respect, KIMERA streamlines and simplifies on-demand access to such services via their APIs. In more detail, KIMERA relies on state-of-the-art containerization and orchestration tools, such as Docker and Kubernetes, to provide a robust, scalable, secure, and fault-tolerant IR evaluation platform. KIMERA monitors and stores all the participants' submissions, accurately keeping track of resource usage, allowing both the efficiency and the effectiveness of the deployed methods to be evaluated. Moreover, all participants can be assigned workspaces sharing the same resources (i.e., CPU and RAM), thus enhancing reproducibility and comparability among systems. Finally, KIMERA has been designed with modularity and extensibility in mind, allowing it to be easily adapted to new use cases and usage scenarios. KIMERA has been developed and adopted in the context of the QuantumCLEF lab, to allow for mixed experiments comparing approaches running on traditional hardware and on real quantum annealers provided by external companies. KIMERA has also been used as a learning resource to provide Quantum Computing tutorials for IR at major conferences, such as ECIR and SIGIR. The source code of KIMERA is openly available at https://github.com/MjPaxter/KIMERA.

MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation

  • Qinhan Yu
  • Zhiyou Xiao
  • Binghui Li
  • Zhengren Wang
  • Chong Chen
  • Wentao Zhang

Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a comprehensive benchmark for effectively evaluating its performance is still notably lacking. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that can leverage LLMs/MLLMs to generate multimodal responses. Our datasets and complete evaluation results for 11 popular generative models are available at https://github.com/MRAMG-Bench/MRAMG.

nlcTables: A Dataset for Marrying Natural Language Conditions with Table Discovery

  • Lingxi Cui
  • Huan Li
  • Ke Chen
  • Lidan Shou
  • Gang Chen

With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at https://github.com/SuDIS-ZJU/nlcTables to foster future research.

PILs of Knowledge: A Synthetic Benchmark for Evaluating Question Answering Systems in Healthcare

  • Riccardo Lunardi
  • Michael Soprano
  • Paolo Coppola
  • Vincenzo Della Mea
  • Stefano Mizzaro
  • Kevin Roitero

Patient Information Leaflets (PILs) provide essential information about medication usage, side effects, precautions, and interactions, making them a valuable resource for Question Answering (QA) systems in healthcare. However, no dedicated benchmark currently exists to evaluate QA systems specifically on PILs, limiting progress in this domain. To address this gap, we introduce a fact-supported synthetic benchmark composed of multiple-choice questions and answers generated from real PILs. We construct the benchmark using a fully automated pipeline that leverages multiple Large Language Models (LLMs) to generate diverse, realistic, and contextually relevant question-answer pairs. The benchmark is publicly released as a standardized evaluation framework for assessing the ability of LLMs to process and reason over PIL content. To validate its effectiveness, we conduct an initial evaluation with state-of-the-art LLMs, showing that the benchmark presents a realistic and challenging task, making it a valuable resource for advancing QA research in the healthcare domain.

Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

  • Jia Chen
  • Qian Dong
  • Haitao Li
  • Xiaohui He
  • Yan Gao
  • Shaosheng Cao
  • Yi Wu
  • Ping Yang
  • Chen Xu
  • Yao Hu
  • Qingyao Ai
  • Yiqun Liu

User-generated content (UGC) communities, especially those featuring multimodal content, improve user experiences by integrating visual and textual information into results (or items). The challenge of improving user experiences in complex systems with search and recommendation (S&R) services has drawn significant attention from both academia and industry in recent years. However, the lack of high-quality datasets has limited research progress on multimodal S&R. To address the growing need for developing better S&R services, we present a novel multimodal information retrieval dataset in this paper, namely Qilin. The dataset is collected from Xiaohongshu, a popular social platform with over 300 million monthly active users and an average search penetration rate of over 70%. In contrast to existing datasets, Qilin offers a comprehensive collection of user sessions with heterogeneous results like image-text notes, video notes, commercial notes, and direct answers, facilitating the development of advanced multimodal neural retrieval models across diverse task settings. To better model user satisfaction and support the analysis of heterogeneous user behaviors, we also collect extensive APP-level contextual signals and genuine user feedback. Notably, Qilin contains user-favored answers and their referred results for search requests triggering the Deep Query Answering (DQA) module. This enables not only the training and evaluation of a Retrieval-augmented Generation (RAG) pipeline, but also the exploration of how such a module affects users' search behavior. Through comprehensive analysis and experiments, we provide interesting findings and insights for further improving S&R systems. We hope that Qilin will significantly contribute to the advancement of multimodal content platforms with S&R services in the future.

Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for Deep Research

  • Corbin Rosset
  • Ho-Lam Chung
  • Guanghui Qin
  • Ethan Chau
  • Zhuo Feng
  • Ahmed Awadallah
  • Jennifer Neville
  • Nikhil Rao

Existing question answering (QA) datasets are no longer challenging to the most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5, and HotpotQA mainly study ''known unknowns'' with clear indications of both what information is missing and how to find it to answer the question. A yet unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs, i.e. ''unknown unknowns''. We claim we can find such questions in search engine logs, which is surprising because most question-intent queries are indeed factoid. Furthermore, recent products like Google's DeepResearch (announced a year after this resource was released publicly) specifically address such queries, retrieving hundreds of documents to synthesize report-style responses. We present Researchy Questions, the world's first, only, and largest public dataset of ''Deep Research'' questions filtered from real search engine logs to be non-factoid, ''decompositional'', and multi-perspective. We show that users spend substantial ''effort'' on these questions in terms of signals like clicks and session length. We also show that ''slow thinking'' answering techniques, like decomposition into sub-questions, show benefits over answering directly. We release (at https://huggingface.co/datasets/corbyrosset/researchy_questions) about 100k Researchy Questions with a permissive CDLA-2.0 license, along with click histograms on over 350k Clueweb22 URLs that were clicked for each question.
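
Because the release is large, streaming a few records is a convenient first look; the split name below is an assumption, so consult the dataset card.

    # Stream a handful of Researchy Questions without a full download.
    from datasets import load_dataset

    stream = load_dataset("corbyrosset/researchy_questions",
                          split="train", streaming=True)  # split name assumed
    for record in stream.take(3):
        print(record)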

TIREx Tracker: The Information Retrieval Experiment Tracker

  • Tim Hagen
  • Maik Fröbe
  • Jan Heinrich Merker
  • Harrisen Scells
  • Matthias Hagen
  • Martin Potthast

The reproducibility and transparency of retrieval experiments depend on the availability of information about the experimental setup. However, the manual collection of experiment metadata can be tedious, error-prone, and inconsistent, which calls for an automated systematic collection. Expanding ir_metadata, we present the TIREx tracker, a tool that records hardware configurations, power/CPU/RAM/GPU usage, and experiment/system versions. Implemented as a lightweight platform-independent C binary, the TIREx tracker integrates seamlessly into Python, Java, or C/C++ workflows and can be easily integrated into shared task submissions, as we demonstrate for the TIRA/TIREx platform. Code, binaries, and documentation of the TIREx tracker are publicly available at https://github.com/tira-io/tirex-tracker.

WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation

  • Jamshid Mozafari
  • Florian Gerhold
  • Adam Jatowt

The use of Large Language Models (LLMs) has increased significantly, with users frequently asking questions to chatbots. At a time when information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and to maintain strong reasoning skills. This paper addresses such challenges by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WikiHint, which is based on Wikipedia and includes 5,000 hints created for 1,000 questions. We then fine-tune open-source LLMs for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HintRank, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves the quality of generated hints, and (c) encoder-based models perform better than decoder-based models in hint ranking.

Wrong Answers Can Also Be Useful: PlausibleQA - A Large-Scale QA Dataset with Answer Plausibility Scores

  • Jamshid Mozafari
  • Abdelrahman Abdallah
  • Bhawna Piryani
  • Adam Jatowt

Large Language Models (LLMs) are revolutionizing information retrieval, with chatbots becoming an important source for answering user queries. Since LLMs are, by design, made to prioritize generating correct answers, the value of highly plausible yet incorrect answers (candidate answers) tends to be overlooked. However, such answers can still prove useful; for example, they can play a crucial role in tasks like Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA). Existing QA datasets primarily focus on correct answers without explicit consideration of the plausibility of other candidate answers, limiting opportunities for more nuanced evaluations of models. To address this gap, we introduce PlausibleQA, a large-scale dataset comprising 10,000 questions and 100,000 candidate answers, each annotated with plausibility scores and justifications for their selection. Additionally, the dataset includes 900,000 justifications for pairwise comparisons between candidate answers, further refining plausibility assessments. We evaluate PlausibleQA through human assessments and empirical experiments, demonstrating its utility in MCQA and QARA analysis. Our findings show that plausibility-aware approaches are effective for MCQA distractor generation and QARA. We release PlausibleQA as a resource for advancing QA research and enhancing LLM performance in distinguishing plausible distractors from correct answers.

SESSION: Perspective Papers

SESSION: Demo Papers

DeepReport: An AI-assisted Idea Generation System for Scientific Research

  • Yi Xu
  • Luoyi Fu
  • Shuqian Sheng
  • Bo Xue
  • Jiaxin Ding
  • Lei Zhou
  • Xinbing Wang
  • Chenghu Zhou

Nowadays, the explosive growth of academic literature has gone far beyond scientists' limited capacity to read through it, making it increasingly difficult for them to absorb disciplinary insights and to extract the intellectual essence critical for generating novel research ideas in interdisciplinary studies. To address this, we develop DeepReport, an AI-assisted scientific idea generation system that alleviates this research burden. Technically, DeepReport maintains evolving concept co-occurrence graphs to extract core insights from over 260 million publications across all disciplines. These concepts are periodically collected and updated, enabling the automatic extraction of hidden cross-domain connections. Combining temporal link prediction and analysis techniques with large language models, DeepReport is able to further transform these patterns of insights into actionable ideas. With functions for integrating up-to-date academic databases, visualizing dynamic relationships between concepts, and automatically generating new ideas, DeepReport empowers researchers to navigate complex knowledge landscapes, reduce cognitive burdens, and accelerate the generation of groundbreaking concepts. This work provides an in-depth exploration of DeepReport's architecture, functionalities, and applications, highlighting its transformative potential for advancing interdisciplinary research and fostering innovation. DeepReport is available at https://idea.acemap.cn/.
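
The graph side of such a pipeline can be sketched in miniature: build a concept co-occurrence graph from papers, then score unseen concept pairs with a link predictor (a classical Adamic-Adar index here, whereas DeepReport's own predictor is temporal).

    # Toy concept co-occurrence graph with classical link prediction.
    from itertools import combinations
    import networkx as nx

    papers = [
        {"graph neural networks", "drug discovery"},
        {"graph neural networks", "recommendation"},
        {"drug discovery", "protein folding"},
    ]

    G = nx.Graph()
    for concepts in papers:
        G.add_edges_from(combinations(sorted(concepts), 2))

    candidates = [p for p in combinations(sorted(G.nodes), 2)
                  if not G.has_edge(*p)]
    for u, v, score in nx.adamic_adar_index(G, candidates):
        print(f"{u} -- {v}: {score:.3f}")  # high score = promising new link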

NLQxform-UI: An Interactive and Intuitive Scholarly Question Answering System

  • Ruijie Wang
  • Zhiruo Zhang
  • Luca Rossetto
  • Florian Ruosch
  • Abraham Bernstein

Most scholarly search services only provide basic text-matching or similarity-based searches, with limited operations that require manual configuration, such as sorting and filtering by specific metadata attributes. These capabilities are insufficient for researchers who often have queries that involve complex constraints and operations, such as ''enumerating the authors of a given paper along with the venues where they have published other papers.'' In this work, we develop an interactive and intuitive scholarly question answering system called NLQxform-UI, which allows users to pose complex queries in the form of natural language questions. It is capable of automatically translating these questions into SPARQL queries that can be executed over the DBLP knowledge graph to retrieve expected answers. Furthermore, users can interact with each step of the answering process and browse the final results in a web-based interface. A video recording of our system is available at https://youtu.be/elq8CPykiyk. Additionally, the system has been completely open-sourced: https://github.com/ruijie-wang-uzh/NLQxform-UI.
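
To illustrate the target representation, the sketch below runs a hand-written SPARQL query of the kind such a system emits against the public DBLP endpoint; the endpoint URL and the schema properties (dblp:title, dblp:authoredBy) are our best-effort assumptions, not output of NLQxform-UI, and exact title matching is brittle in practice.

    # Hand-written SPARQL over the DBLP knowledge graph (illustrative only).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://sparql.dblp.org/sparql")
    sparql.setQuery("""
    PREFIX dblp: <https://dblp.org/rdf/schema#>
    SELECT ?author ?otherPaper WHERE {
      ?paper dblp:title ?title ;
             dblp:authoredBy ?author .
      ?otherPaper dblp:authoredBy ?author .
      FILTER(CONTAINS(STR(?title), "Attention is All you Need"))
    } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["author"]["value"], row["otherPaper"]["value"])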

ROKSANA: An Open-Source Toolkit for Robust Graph-Based Keyword Search

  • Radin Hamidi Rad
  • Amir Khosrojerdi
  • Ebrahim Bagheri

We introduce ROKSANA, an open-source Python toolkit designed to support research in graph-based keyword search under adversarial settings. ROKSANA provides a modular environment for dataset handling, graph neural network (GNN)-based retrieval, and adversarial attack modeling, enabling systematic evaluation of search robustness. The framework integrates built-in retrieval and attack methods while allowing seamless customization of search algorithms and perturbation strategies. Users can benchmark performance on a centralized leaderboard, generate reproducible evaluation reports, and explore ranking behaviors through an interactive web-based visualization interface. By centering on reproducibility, extensibility, and collaborative benchmarking, ROKSANA serves as a comprehensive platform for advancing robust and interpretable keyword search in graphs. This demonstration will showcase ROKSANA's capabilities in real-time, illustrating its impact on experimental workflows and adversarial robustness analysis in graph IR research.

MMMORRF: Multimodal Multilingual MOdularized Reciprocal Rank Fusion

  • Saron Samuel
  • Dan DeGenaro
  • Jimena Guallar-Blasco
  • Kate Sanders
  • Seun Eisape
  • Arun Reddy
  • Alexander Martin
  • Andrew Yates
  • Eugene Yang
  • Cameron Carpenter
  • David Etter
  • Efsun Kayi
  • Matthew Wiesner
  • Kenton Murray
  • Reno Kriz

Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We create MMMORRF, a search system that extracts text and features from both visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs instead of visual descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval.
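
Weighted reciprocal rank fusion itself is standard: each modality contributes w_m / (k + rank) for every document it ranks. The weights below are illustrative, not the published modality-aware values.

    # Weighted RRF over per-modality rankings.
    from collections import defaultdict

    def weighted_rrf(rankings, weights, k=60):
        scores = defaultdict(float)
        for modality, ranked_docs in rankings.items():
            for rank, doc in enumerate(ranked_docs, start=1):
                scores[doc] += weights[modality] / (k + rank)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(weighted_rrf(
        rankings={"ocr": ["v3", "v1"], "asr": ["v1", "v2"], "visual": ["v2", "v3"]},
        weights={"ocr": 1.0, "asr": 1.0, "visual": 0.5},
    ))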

Multimodal Search in Chemical Documents and Reactions

  • Ayush Kumar Shah
  • Abhisek Dey
  • Leo Luo
  • Bryan Amador
  • Patrick Philippy
  • Ming Zhong
  • Siru Ouyang
  • David Mark Friday
  • David Bianchi
  • Nick Jackson
  • Richard Zanibbi
  • Jiawei Han

We present a multimodal search tool for retrieval of chemical reactions, molecular structures, and associated text from scientific literature. Queries may combine molecular diagrams, textual descriptions, and reaction data, allowing users to connect different chemical information representations. Indexing includes chemical diagram extraction and parsing, extraction of reaction data from text in tabular form, and cross-modal linking of diagrams with their mentions in text. We describe the system's architecture and retrieval features, along with expert assessments of the system. Our demo highlights the workflow and search components. Online demo: https://www.cs.rit.edu/~dprl/reactionminer-demo-landing

Constructing and Evaluating Declarative RAG Pipelines in PyTerrier

  • Craig Macdonald
  • Jinyuan Fang
  • Andrew Parry
  • Zaiqiao Meng

Search engines often follow a pipeline architecture, where complex but effective reranking components are used to refine the results of an initial retrieval. Retrieval augmented generation (RAG) is an exciting application of the pipeline architecture, where the final component generates a coherent answer for the users from the retrieved documents. In this demo paper, we describe how such RAG pipelines can be formulated in the declarative PyTerrier architecture, and the advantages of doing so. Our PyTerrier-RAG extension for PyTerrier provides easy access to standard RAG datasets and evaluation measures, state-of-the-art LLM readers, and, using PyTerrier's unique operator notation, easy-to-build pipelines. We demonstrate the succinctness of indexing and RAG pipelines on standard datasets (including Natural Questions) and how to build on the larger PyTerrier ecosystem with state-of-the-art sparse, learned-sparse, and dense retrievers, and other neural rankers.
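
The flavor of the operator notation can be shown with plain PyTerrier; the retrieval stage below is real PyTerrier, while the commented-out reader stands in for the LLM readers that PyTerrier-RAG supplies.

    # Compose pipeline stages with PyTerrier's >> operator; % cuts the ranking.
    import pyterrier as pt

    if not pt.started():
        pt.init()

    bm25 = pt.BatchRetrieve.from_dataset("vaswani", "terrier_stemmed",
                                         wmodel="BM25") % 5
    # rag = bm25 >> reader  # reader: a PyTerrier-RAG LLM answering stage
    print(bm25.search("information retrieval evaluation"))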

SESSION: Tutorials

SESSION: Workshop Summaries

SESSION: Doctoral Consortium

SESSION: SIRIP