SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval

SESSION: Keynote Talks

Representation Learning and Information Retrieval

Yiming Yang

How to best represent words, documents, queries, entities, relations, and other variables in information retrieval (IR) and related applications has been a fundamental research question for decades. Early IR systems relied on independence assumptions about words and documents for simplicity and scalability, which were clearly sub-optimal from a semantic point of view. The rapid development of deep neural networks in the past decade has revolutionized the representation learning technologies for contextualized word embedding and graph-enhanced document embedding, leading to the new era of dense IR. This talk highlights such impactful shifts in representation learning for IR and related areas, the new challenges that come with them, and the remedies, including our recent work in large-scale dense IR [1, 9], in graph-based reasoning for knowledge-enhanced predictions [10], in self-refinement of large language models (LLMs) with retrieval augmented generation (RAG) [2,7] and iterative feedback [3,4], in principle-driven self-alignment of LLMs with minimum human supervision [6], etc. More generally, the power of such deep learning goes beyond IR enhancements, e.g., for significantly improving the state-of-the-art solvers for NP-Complete problems in classical computer science [5,8].

SESSION: Session: LLMs and Search

TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision

Ruiwen Zhou
Yingxuan Yang
Muning Wen
Ying Wen
Wenhao Wang
Chunling Xi
Guoqiang Xu
Yong Yu
Weinan Zhang

Several large language model (LLM) agents have been constructed for diverse purposes such as web navigation and online shopping, leveraging the broad knowledge and text comprehension capabilities of LLMs. Many of these works rely on in-context examples to achieve generalization without requiring fine-tuning. However, few have addressed the challenge of selecting and effectively utilizing these examples. Recent approaches have introduced trajectory-level retrieval with task meta-data and the use of trajectories as in-context examples to enhance overall performance in some sequential decision making tasks like computer control. Nevertheless, these methods face issues such as plausible examples being retrieved without task-specific state transition dynamics, and long inputs with plenty of irrelevant context due to the use of complete trajectories. In this paper, we propose a novel framework (TRAD) to tackle these problems. TRAD first employs Thought Retrieval for step-level demonstration selection through thought matching, enhancing the quality of demonstrations and reducing irrelevant input noise. Then, Aligned Decision is introduced to complement retrieved demonstration steps with their preceding or subsequent steps, providing tolerance for imperfect thought and offering a balance between more context and less noise. Extensive experiments on ALFWorld and Mind2Web benchmarks demonstrate that TRAD not only surpasses state-of-the-art models but also effectively reduces noise and promotes generalization. Furthermore, TRAD has been deployed in real-world scenarios of a global business insurance company and yields an improved success rate of robotic process automation. Our code is available at https://github.com/skyriver-2000/TRAD-Official.
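The two stages described in this abstract (step-level retrieval by thought matching, then expansion with neighboring steps) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Jaccard similarity stands in for whatever thought encoder TRAD uses, and the function names, the trajectory format, and the toy data are all hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity; a stand-in for a learned thought encoder."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_steps(current_thought, trajectories, k=1):
    """Score every demonstration step by thought similarity; return top-k (traj, step) indices."""
    scored = [
        (jaccard(current_thought, step["thought"]), t, s)
        for t, traj in enumerate(trajectories)
        for s, step in enumerate(traj)
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(t, s) for _, t, s in scored[:k]]

def align(trajectories, hits, before=1, after=1):
    """Complement each retrieved step with its preceding and subsequent steps."""
    demos = []
    for t, s in hits:
        traj = trajectories[t]
        lo, hi = max(0, s - before), min(len(traj), s + after + 1)
        demos.append(traj[lo:hi])
    return demos

# Hypothetical demonstration trajectories (thought/action pairs).
trajectories = [
    [
        {"thought": "find the search box", "action": "CLICK search"},
        {"thought": "type the product name", "action": "TYPE shoes"},
        {"thought": "submit the search", "action": "PRESS enter"},
    ],
    [
        {"thought": "open the login page", "action": "CLICK login"},
        {"thought": "enter the password", "action": "TYPE ****"},
    ],
]

hits = retrieve_steps("type the product name into the box", trajectories, k=1)
demos = align(trajectories, hits, before=1, after=1)
demo_actions = [step["action"] for step in demos[0]]
```

The `before` and `after` parameters control the trade-off the abstract mentions: larger windows give more context, smaller windows give less noise.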

"In-Context Learning" or: How I learned to stop worrying and love "Applied Information Retrieval"

Andrew Parry
Debasis Ganguly
Manish Chandra

With the increasing ability of large language models (LLMs), in-context learning (ICL) has evolved as a new paradigm for natural language processing (NLP), where instead of fine-tuning the parameters of an LLM specific to a downstream task with labeled examples, a small number of such examples is appended to a prompt instruction for controlling the decoder's generation process. ICL, thus, is conceptually similar to a non-parametric approach, such as k-NN, where the prediction for each instance essentially depends on the local topology, i.e., on a localised set of similar instances and their labels (called few-shot examples). This suggests that a test instance in ICL is analogous to a query in IR, and similar examples in ICL retrieved from a training set relate to a set of documents retrieved from a collection in IR. While standard unsupervised ranking models can be used to retrieve these few-shot examples from a training set, the effectiveness of the examples can potentially be improved by re-defining the notion of relevance specific to its utility for the downstream task, i.e., considering an example to be relevant if including it in the prompt instruction leads to a correct prediction. With this task-specific notion of relevance, it is possible to train a supervised ranking model (e.g., a bi-encoder or cross-encoder), which potentially learns to optimally select the few-shot examples. We believe that the recent advances in neural rankers can potentially find a use case for this task of optimally choosing examples for more effective downstream ICL predictions.
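The analogy between few-shot example selection and retrieval can be sketched as follows: the test instance acts as the query, the training set as the collection, and the top-k results become the in-context examples. This is only an illustration; the bag-of-words cosine ranker, the function names, and the toy training set are assumptions standing in for any unsupervised ranking model.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_few_shot(query, train_set, k=2):
    """Treat the test instance as a query and rank labeled training examples by similarity."""
    q = bow(query)
    ranked = sorted(train_set, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(instruction, examples, query):
    """Append the retrieved examples, then the unlabeled test instance, to the instruction."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)

# Hypothetical labeled training set.
train_set = [
    ("the movie was wonderful", "positive"),
    ("the food was terrible", "negative"),
    ("a wonderful and moving film", "positive"),
    ("the service was slow", "negative"),
]

examples = retrieve_few_shot("what a wonderful movie", train_set, k=2)
prompt = build_prompt("Classify the sentiment.", examples, "what a wonderful movie")
```

A supervised variant, as the abstract suggests, would replace `cosine` with a trained bi-encoder or cross-encoder whose relevance labels reflect whether an example leads to a correct prediction.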

CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks

Xiaoxi Li
Zhicheng Dou
Yujia Zhou
Fangchao Liu

Large language models (LLMs) have gained significant attention in various fields but are prone to hallucination, especially in knowledge-intensive (KI) tasks. To address this, retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy. However, traditional retrieval modules often rely on a large document index and are disconnected from generative tasks. With the advent of generative retrieval (GR), language models can retrieve by directly generating document identifiers (DocIDs), offering superior performance in retrieval tasks. However, the potential relationship between GR and downstream tasks remains unexplored. In this paper, we propose CorpusLM, a unified language model that leverages an external corpus to tackle various knowledge-intensive tasks by integrating generative retrieval, closed-book generation, and RAG through a unified greedy decoding process. We design the following mechanisms to facilitate effective retrieval and generation, and improve the end-to-end effectiveness of KI tasks: (1) We develop a ranking-oriented DocID list generation strategy, which refines GR by directly learning from a DocID ranking list, to improve retrieval quality. (2) We design a continuous DocIDs-References-Answer generation strategy, which facilitates effective and efficient RAG. (3) We employ well-designed unsupervised DocID understanding tasks, to comprehend DocID semantics and their relevance to downstream tasks. We evaluate our approach on the widely used KILT benchmark with two variants of backbone models, i.e., T5 and Llama2. Experimental results demonstrate the superior performance of our models in both retrieval and downstream tasks.

A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models

Shengyao Zhuang
Honglei Zhuang
Bevan Koopman
Guido Zuccon

We propose a novel zero-shot document ranking approach based on Large Language Models (LLMs): the Setwise prompting approach. Our approach complements existing prompting approaches for LLM-based zero-shot ranking: Pointwise, Pairwise, and Listwise. Through the first-of-its-kind comparative evaluation within a consistent experimental framework and considering factors like model size, token consumption, latency, among others, we show that existing approaches are inherently characterised by trade-offs between effectiveness and efficiency. We find that while Pointwise approaches score high on efficiency, they suffer from poor effectiveness. Conversely, Pairwise approaches demonstrate superior effectiveness but incur high computational overhead. Our Setwise approach, instead, reduces the number of LLM inferences and the amount of prompt token consumption during the ranking procedure, compared to previous methods. This significantly improves the efficiency of LLM-based zero-shot ranking, while also retaining high zero-shot ranking effectiveness. We make our code and results publicly available at https://github.com/ielab/llm-rankers.
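To illustrate why asking the model to choose from a set can reduce the number of inferences, the sketch below finds the top document with a simple running-champion tournament. It is an assumption-laden toy, not the sorting procedure from the paper: `toy_pick_best` is a hypothetical term-overlap stand-in for an LLM call, and `set_size=2` is used here as a pairwise-style baseline.

```python
def setwise_top1(query, docs, pick_best, set_size=4):
    """Find the top document by repeatedly asking for the best of a small set.

    Returns (best_doc, number_of_pick_best_calls).
    """
    champion = docs[0]
    calls = 0
    i = 1
    while i < len(docs):
        challengers = docs[i:i + set_size - 1]
        champion = pick_best(query, [champion] + challengers)
        calls += 1
        i += set_size - 1
    return champion, calls

def toy_pick_best(query, candidates):
    """Stand-in for one LLM inference: pick the candidate sharing the most terms with the query."""
    q = set(query.lower().split())
    return max(candidates, key=lambda d: len(q & set(d.lower().split())))

# Hypothetical candidate documents.
docs = [
    "cooking pasta at home",
    "neural networks for images",
    "ranking sports teams",
    "gardening tips",
    "neural ranking models for search",
    "history of models",
    "weather report",
]

best_setwise, calls_setwise = setwise_top1("neural ranking models", docs, toy_pick_best, set_size=4)
best_pairwise, calls_pairwise = setwise_top1("neural ranking models", docs, toy_pick_best, set_size=2)
```

In this toy, both settings return the same document, but comparing four candidates per call needs fewer calls than comparing two at a time.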

Unsupervised Large Language Model Alignment for Information Retrieval via Contrastive Feedback

Qian Dong
Yiding Liu
Qingyao Ai
Zhijing Wu
Haitao Li
Yiqun Liu
Shuaiqiang Wang
Dawei Yin
Shaoping Ma

Large language models (LLMs) have demonstrated remarkable capabilities across various research domains, including the field of Information Retrieval (IR). However, the responses generated by off-the-shelf LLMs tend to be generic, i.e., they cannot capture the distinctiveness of each document among documents with similar content. This limits the performance of LLMs in IR because finding and distinguishing relevant documents from a large number of similar documents is a typical problem in many IR tasks. To address this issue, we propose an unsupervised alignment method, namely Reinforcement Learning from Contrastive Feedback (RLCF), empowering LLMs to generate both high-quality and context-specific responses. Our approach constructs unsupervised contrastive feedback signals based on similar document groups, and adopts a reward function, named group-wise reciprocal rank, to optimize LLMs. We conduct extensive experiments to evaluate the effectiveness of RLCF.
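The abstract does not define the group-wise reciprocal rank reward, so the following is one plausible reading, offered only as a sketch: a response generated for one document is scored against every document in its group of similar documents, and the reward is the reciprocal of the source document's rank. The Jaccard scorer, the pessimistic tie handling, and all names are assumptions.

```python
def overlap_score(text_a, text_b):
    """Jaccard term overlap; a stand-in for a real retrieval model's relevance score."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def reciprocal_rank_reward(response, source_index, group, score=overlap_score):
    """Reward a response by how well it singles out its source document within the group.

    Ties are resolved pessimistically, so a response that matches every
    document in the group equally well gets the lowest reward.
    """
    scores = [score(response, doc) for doc in group]
    rank = 1 + sum(
        1 for j, s in enumerate(scores)
        if j != source_index and s >= scores[source_index]
    )
    return 1.0 / rank

# Hypothetical group of similar documents.
group = [
    "red running shoes for men",
    "red running shoes for women",
    "blue running shoes for men",
]

specific_reward = reciprocal_rank_reward("red shoes for women", 1, group)
generic_reward = reciprocal_rank_reward("running shoes", 1, group)
```

Under this reading, a generic response that fits all three documents equally is rewarded less than one that captures what is distinctive about its source.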

SESSION: Session: Reasoning and Knowledge Graphs

MetaHKG: Meta Hyperbolic Learning for Few-shot Temporal Reasoning

Ruijie Wang
Yutong Zhang
Jinyang Li
Shengzhong Liu
Dachun Sun
Tianchen Wang
Tianshi Wang
Yizhuo Chen
Denizhan Kara
Tarek Abdelzaher

This paper investigates the few-shot temporal reasoning capability within the hyperbolic space. The goal is to forecast future events for newly emerging entities within temporal knowledge graphs (TKGs), leveraging only a limited set of initial observations. Hyperbolic space is advantageous for modeling emerging graph entities for two reasons: First, its geometric property of exponential expansion aligns with the rapid growth of new entities in real-world graphs; Second, it excels in capturing power-law patterns and hierarchical structures, making it well suited for new entities distributed at the peripheries of graph hierarchies and loosely connected with others through few links. We therefore propose a meta-learning framework, MetaHKG, to enable few-shot temporal reasoning within a hyperbolic space. Unlike prior hyperbolic learning works, MetaHKG addresses the challenges of effectively representing new entities in TKGs and adapting model parameters by incorporating novel hyperbolic time encodings and temporal attention networks that achieve translational invariance. We also introduce a meta hyperbolic optimization algorithm to enhance model adaptation by learning both global and entity-specific parameters through bi-level optimization. Comprehensive experiments conducted on three real-world temporal knowledge graphs demonstrate the superiority of MetaHKG over a diverse range of baselines, with an average relative improvement of 5.2%. Compared to its Euclidean counterpart, MetaHKG operates in a lower-dimensional space but yields more stable and efficient adaptability towards new entities.
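The exponential-expansion property mentioned above can be made concrete with the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model MetaHKG uses, so this choice and the function name are assumptions made for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# The same Euclidean gap (0.1) measured near the center and near the boundary.
near_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
```

The second distance is several times larger than the first even though both pairs are 0.1 apart in Euclidean terms, which is why peripheral entities can be kept well separated in few dimensions.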

Transformer-based Reasoning for Learning Evolutionary Chain of Events on Temporal Knowledge Graph

Zhiyu Fang
Shuai-Long Lei
Xiaobin Zhu
Chun Yang
Shi-Xue Zhang
Xu-Cheng Yin
Jingyan Qin

Temporal Knowledge Graph (TKG) reasoning often involves completing missing factual elements along the timeline. Although existing methods can learn good embeddings for each factual element in quadruples by integrating temporal information, they often fail to infer the evolution of temporal facts. This is mainly because of (1) insufficiently exploring the internal structure and semantic relationships within individual quadruples and (2) inadequately learning a unified representation of the contextual and temporal correlations among different quadruples. To overcome these limitations, we propose a novel Transformer-based reasoning model (dubbed ECEformer) for TKG to learn the Evolutionary Chain of Events (ECE). Specifically, we unfold the neighborhood subgraph of an entity node in chronological order, forming an evolutionary chain of events as the input for our model. Subsequently, we utilize a Transformer encoder to learn the embeddings of intra-quadruples for ECE. We then craft a mixed-context reasoning module based on the multi-layer perceptron (MLP) to learn the unified representations of inter-quadruples for ECE while accomplishing temporal knowledge reasoning. In addition, to enhance the timeliness of the events, we devise an additional time prediction task to complete effective temporal information within the learned unified representation. Extensive experiments on six benchmark datasets verify the state-of-the-art performance and the effectiveness of our method.

LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval

Zhenyu Yang
Dizhan Xue
Shengsheng Qian
Weiming Dong
Changsheng Xu

Zero-Shot Composed Image Retrieval (ZS-CIR), which aims to retrieve a target image based on a query composed of a reference image and a modification text without training samples, has garnered increasing interest in recent years. Specifically, the modification text describes the distinction between the two images. To conduct ZS-CIR, the prevailing methods employ pre-trained image-to-text models to transform the query image and text into a single text, which is then projected into the common feature space by CLIP to retrieve the target image. However, these methods neglect that ZS-CIR is a typical fuzzy retrieval task, where the semantics of the target image are not strictly defined by the query image and text. To overcome this limitation, this paper proposes a training-free LLM-based Divergent Reasoning and Ensemble (LDRE) method for ZS-CIR to capture diverse possible semantics of the composed result. Firstly, we employ a pre-trained captioning model to generate dense captions for the reference image, focusing on different semantic perspectives of the reference image. Then, we prompt Large Language Models (LLMs) to conduct divergent compositional reasoning based on the dense captions and modification text, deriving divergent edited captions that cover the possible semantics of the composed target. Finally, we design a divergent caption ensemble to obtain the ensemble caption feature weighted by semantic relevance scores, which is subsequently utilized to retrieve the target image in the CLIP feature space. Extensive experiments on three public datasets demonstrate that our proposed LDRE achieves new state-of-the-art performance.
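The final ensemble step can be sketched as a relevance-weighted average of caption features followed by cosine retrieval. The hand-written 3-dimensional vectors below stand in for CLIP features, and the function names and scores are hypothetical; how LDRE actually computes its relevance scores is not specified in the abstract.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ensemble_caption_feature(caption_features, relevance_scores):
    """Average the edited-caption features, weighted by their relevance scores."""
    total = sum(relevance_scores)
    dim = len(caption_features[0])
    mixed = [
        sum(w * f[d] for w, f in zip(relevance_scores, caption_features)) / total
        for d in range(dim)
    ]
    return normalize(mixed)

def retrieve(query_feature, image_features):
    """Return the index of the image with the highest cosine similarity to the query."""
    sims = [dot(query_feature, normalize(img)) for img in image_features]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy stand-ins for text and image features in a shared space.
caption_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
relevance_scores = [0.2, 0.1, 0.7]
image_features = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.6, 0.6, 0.1]]

query = ensemble_caption_feature(caption_features, relevance_scores)
best_image = retrieve(query, image_features)
```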

NativE: Multi-modal Knowledge Graph Completion in the Wild

Yichi Zhang
Zhuo Chen
Lingbing Guo
Yajing Xu
Binbin Hu
Ziqi Liu
Wen Zhang
Huajun Chen

Multi-modal knowledge graph completion (MMKGC) aims to automatically discover the unobserved factual knowledge from a given multi-modal knowledge graph by collaboratively modeling the triple structure and multi-modal information from entities. However, real-world MMKGs present challenges due to their diverse and imbalanced nature, which means that the modality information can span various types (e.g., image, text, numeric, audio, video) but its distribution among entities is uneven, leading to missing modalities for certain entities. Existing works usually focus on common modalities like image and text while neglecting the imbalanced distribution of modal information. To address these issues, we propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities and employs a collaborative modality adversarial training framework to augment the imbalanced modality information. We construct a new benchmark called WildKGC with five datasets to evaluate our method. The empirical results compared with 21 recent baselines confirm the superiority of our method, consistently achieving state-of-the-art performance across different datasets and various scenarios while remaining efficient and generalizable. Our code and data are released at https://github.com/zjukg/NATIVE.

Contrast then Memorize: Semantic Neighbor Retrieval-Enhanced Inductive Multimodal Knowledge Graph Completion

Yu Zhao
Ying Zhang
Baohang Zhou
Xinying Qian
Kehui Song
Xiangrui Cai

A large number of studies have emerged for Multimodal Knowledge Graph Completion (MKGC) to predict the missing links in MKGs. However, fewer studies have addressed inductive MKGC (IMKGC), which involves emerging entities unseen during training. Existing inductive approaches focus on learning textual entity representations, which neglect rich semantic information in the visual modality. Moreover, they focus on aggregating structural neighbors from existing KGs, which are usually limited for emerging entities. In contrast, semantic neighbors are decoupled from the topological linkage and usually imply the true target entity. In this paper, we propose the IMKGC task and a semantic neighbor retrieval-enhanced IMKGC framework CMR, where the contrast step brings helpful semantic neighbors close, and the memorize step then supports semantic neighbor retrieval to enhance inference. Specifically, we first propose a unified cross-modal contrastive learning to simultaneously capture the textual-visual and textual-textual correlations of query-entity pairs in a unified representation space. The contrastive learning increases the similarity of positive query-entity pairs, therefore making the representations of helpful semantic neighbors close. Then, we explicitly memorize the knowledge representations to support the semantic neighbor retrieval. At test time, we retrieve the nearest semantic neighbors and interpolate them to the query-entity similarity distribution to augment the final prediction. Extensive experiments validate the effectiveness of CMR on three inductive MKGC datasets. Code is available at https://github.com/OreOZhao/CMR.
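The test-time step, retrieving semantic neighbors and interpolating them into the model's similarity distribution, can be sketched as a simple mixture of two distributions. The mixing weight `lam`, the softmax weighting of neighbors, and the toy scores are assumptions; the abstract does not give the exact formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def knn_distribution(neighbor_targets, neighbor_sims, num_entities):
    """Turn retrieved neighbors into a distribution over their target entities."""
    weights = softmax(neighbor_sims)
    probs = [0.0] * num_entities
    for target, w in zip(neighbor_targets, weights):
        probs[target] += w
    return probs

def interpolate(model_probs, knn_probs, lam=0.5):
    """Mix the model's query-entity distribution with the neighbor-based one."""
    return [(1.0 - lam) * p + lam * q for p, q in zip(model_probs, knn_probs)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Toy case: the model narrowly prefers entity 0, but the memorized
# neighbors of the query mostly point to entity 1.
model_probs = softmax([2.0, 1.9, 0.0])
knn_probs = knn_distribution([1, 1, 0], [3.0, 2.5, 0.5], num_entities=3)
final_probs = interpolate(model_probs, knn_probs, lam=0.5)
```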

EditKG: Editing Knowledge Graph for Recommendation

Gu Tang
Xiaoying Gan
Jinghe Wang
Bin Lu
Lyuwen Wu
Luoyi Fu
Chenghu Zhou

With the enrichment of user-item interactions, Graph Neural Networks (GNNs) are widely used in recommender systems to alleviate information overload. Nevertheless, they still suffer from the cold-start issue. Knowledge Graphs (KGs), providing external information, have been extensively applied in GNN-based methods to mitigate this issue. However, current KG-aware recommendation methods suffer from the knowledge imbalance problem caused by incompleteness of existing KGs. This imbalance is reflected by the long-tail phenomenon of item attributes, i.e., unpopular items usually lack more attributes compared to popular items. To tackle this problem, we propose a novel framework called EditKG: Editing Knowledge Graph for Recommendation, to balance attribute distribution of items via editing KGs. EditKG consists of two key designs: Knowledge Generator and Knowledge Deleter. Knowledge Generator generates attributes for items by exploring their mutual information correlations and semantic correlations. Knowledge Deleter removes the task-irrelevant item attributes according to the parameterized task relevance score, while dropping the spurious item attributes through aligning the attribute scores. Extensive experiments on three benchmark datasets demonstrate that EditKG significantly outperforms state-of-the-art methods, and achieves 8.98% average improvement. The implementations are available at https://github.com/gutang-97/2024SIGIR-EditKG.

Amazon-KG: A Knowledge Graph Enhanced Cross-Domain Recommendation Dataset

Yuhan Wang
Qing Xie
Mengzi Tang
Lin Li
Jingling Yuan
Yongjian Liu

Cross-domain recommendation (CDR) aims to utilize the information from relevant domains to guide the recommendation task in the target domain, and shows great potential in alleviating the data sparsity and cold-start problems of recommender systems. Most existing methods utilize the interaction information (e.g., ratings and clicks) or consider auxiliary information (e.g., tags and comments) to analyze the users' cross-domain preferences, but such kinds of information ignore the intrinsic semantic relationship of different domains. In order to effectively explore the inter-domain correlations, encyclopedic knowledge graphs (KGs) involving different domains are highly desired in cross-domain recommendation tasks because they contain general information covering various domains in a structured data format. However, there are few datasets containing KG information for CDR tasks, so in order to enrich the available data resources, we build a KG-enhanced cross-domain recommendation dataset, named Amazon-KG, based on the widely used Amazon dataset for CDR and the well-known KG DBpedia. In this work, we analyze the potential of applying KGs in cross-domain recommendation, and describe the construction process of our dataset in detail. Finally, we perform quantitative statistical analysis on the dataset. We believe that datasets like Amazon-KG contribute to the development of knowledge-aware cross-domain recommender systems. Our dataset has been released at https://github.com/WangYuhan-0520/Amazon-KG-v2.0-dataset.

YAGO 4.5: A Large and Clean Knowledge Base with a Rich Taxonomy

Fabian M. Suchanek
Mehwish Alam
Thomas Bonald
Lihu Chen
Pierre-Henri Paris
Jules Soria

Knowledge Bases (KBs) find applications in many knowledge-intensive tasks and, most notably, in information retrieval. Wikidata is one of the largest public general-purpose KBs. Yet, its collaborative nature has led to a convoluted schema and taxonomy. The YAGO 4 KB cleaned up the taxonomy by incorporating the ontology of Schema.org, resulting in a cleaner structure amenable to automated reasoning. However, it also cut away large parts of the Wikidata taxonomy, which is essential for information retrieval. In this paper, we extend YAGO 4 with a large part of the Wikidata taxonomy -- while respecting logical constraints and the distinction between classes and instances. This yields YAGO 4.5, a new, logically consistent version of YAGO that adds a rich layer of informative classes. An intrinsic and an extrinsic evaluation show the value of the new resource.

SESSION: Session: Efficiency for Search

Ranked List Truncation for Large Language Model-based Re-Ranking

Chuan Meng
Negar Arabzadeh
Arian Askari
Mohammad Aliannejadi
Maarten de Rijke

We study ranked list truncation (RLT) from a novel retrieve-then-re-rank perspective, where we optimize re-ranking by truncating the retrieved list (i.e., trim re-ranking candidates). RLT is crucial for re-ranking as it can improve re-ranking efficiency by sending variable-length candidate lists to a re-ranker on a per-query basis. It also has the potential to improve re-ranking effectiveness. Despite its importance, there is limited research into applying RLT methods to this new perspective. To address this research gap, we reproduce existing RLT methods in the context of re-ranking, especially newly emerged large language model (LLM)-based re-ranking. In particular, we examine to what extent established findings on RLT for retrieval are generalizable to the "retrieve-then-re-rank" setup from three perspectives: (i) assessing RLT methods in the context of LLM-based re-ranking with lexical first-stage retrieval, (ii) investigating the impact of different types of first-stage retrievers on RLT methods, and (iii) investigating the impact of different types of re-rankers on RLT methods. We perform experiments on the TREC 2019 and 2020 deep learning tracks, investigating 8 RLT methods for pipelines involving 3 retrievers and 2 re-rankers. We reach new insights into RLT methods in the context of re-ranking.
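As a rough illustration of per-query truncation before re-ranking, the sketch below cuts each retrieved list at a fraction of its top first-stage score. This heuristic is an assumption chosen for brevity; it is not claimed to be one of the eight RLT methods examined in the paper, and the function name and parameters are hypothetical.

```python
def truncate_by_relative_score(scores, ratio=0.5, min_keep=1):
    """Return how many top candidates to send to the re-ranker.

    `scores` are positive first-stage retrieval scores sorted in descending
    order; candidates scoring below `ratio` times the top score are cut.
    """
    cutoff = scores[0] * ratio
    keep = sum(1 for s in scores if s >= cutoff)
    return max(min_keep, keep)

# Two hypothetical queries get different re-ranking depths.
depth_flat = truncate_by_relative_score([10.0, 9.0, 8.0, 3.0, 2.0])
depth_peaked = truncate_by_relative_score([10.0, 2.0, 1.0])
```

A query whose scores drop off sharply sends fewer candidates to the expensive re-ranker, which is the efficiency argument made in the abstract.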

Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations

Sebastian Bruch
Franco Maria Nardini
Cosimo Rulli
Rossano Venturini

Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.
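The idea of using a per-block summary to decide whether a block needs to be evaluated can be sketched as follows. This is a heavily simplified toy, not Seismic itself: it uses a flat list of blocks rather than per-term inverted lists, assumes non-negative weights, and uses a coordinate-wise maximum as the summary (an assumption), which makes the skip test an exact upper bound for top-1 search.

```python
def sparse_dot(q, d):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def summarize(block):
    """Coordinate-wise maximum over the block's vectors (assumed summary form)."""
    summary = {}
    for _, vec in block:
        for t, w in vec.items():
            summary[t] = max(summary.get(t, 0.0), w)
    return summary

def search_top1(query, blocks):
    """Return (best_doc_id, best_score, number_of_blocks_evaluated)."""
    best_id, best_score, evaluated = None, float("-inf"), 0
    for block, summary in blocks:
        # With non-negative weights, the summary score bounds every document
        # in the block, so the block can be skipped if it cannot win.
        if sparse_dot(query, summary) <= best_score:
            continue
        evaluated += 1
        for doc_id, vec in block:
            s = sparse_dot(query, vec)
            if s > best_score:
                best_id, best_score = doc_id, s
    return best_id, best_score, evaluated

# Hypothetical blocks of similar documents.
block_a = [("d1", {"neural": 1.2, "ranking": 0.8}), ("d2", {"neural": 0.9, "search": 0.5})]
block_b = [("d3", {"cooking": 1.0, "pasta": 0.7}), ("d4", {"pasta": 1.1, "ranking": 0.1})]
blocks = [(b, summarize(b)) for b in (block_a, block_b)]

best_id, best_score, evaluated = search_top1({"neural": 1.0, "ranking": 1.0}, blocks)
```

In the toy run the second block is never scored document by document, because its summary already shows it cannot beat the current best.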

GUITAR: Gradient Pruning toward Fast Neural Ranking

Weijie Zhao
Shulong Tan
Ping Li

With the continuous popularity of deep learning and representation learning, fast vector search has become a vital task in various ranking/retrieval based applications, such as recommendation, ad ranking, and question answering. Neural network based ranking is widely adopted due to its powerful capacity in modeling complex relationships, such as those between users and items or questions and answers. However, it is usually used offline or for re-ranking because it is computationally expensive. Online neural network ranking (so-called fast neural ranking) is considered challenging because neural network measures are usually non-convex and asymmetric. Traditional Approximate Nearest Neighbor (ANN) search, which usually focuses on metric ranking measures, is not applicable to these advanced measures.

In this paper, we introduce a novel graph searching framework to accelerate the searching in the fast neural ranking problem. The proposed graph searching algorithm is bi-level: we first construct a probable candidate set; then we only evaluate the neural network measure over the probable candidate set instead of evaluating the neural network over all neighbors. Specifically, we propose a gradient-based algorithm that approximates the rank of the neural network matching score to construct the probable candidate set; and we present an angle-based heuristic procedure to adaptively identify the proper size of the probable candidate set. Empirical results on public data confirm the effectiveness of our proposed algorithms.

Neural Passage Quality Estimation for Static Pruning

Xuejun Chang
Debabrata Mishra
Craig Macdonald
Sean MacAvaney

Neural networks, especially those that use large, pre-trained language models, have improved search engines in various ways. Most prominently, they can estimate the relevance of a passage or document to a user's query. In this work, we depart from this direction by exploring whether neural networks can effectively predict which of a document's passages are unlikely to be relevant to any query submitted to the search engine. We refer to this query-agnostic estimation of passage relevance as a passage's quality. We find that our novel methods for estimating passage quality allow passage corpora to be pruned considerably while maintaining statistically equivalent effectiveness; our best methods can consistently prune >25% of passages in a corpus, across various retrieval pipelines. Such substantial pruning reduces the operating costs of neural search engines in terms of computing resources, power usage, and carbon footprint, both when processing queries (thanks to a smaller index size) and when indexing (lightweight models can prune low-quality passages prior to the costly dense or learned sparse encoding step). This work sets the stage for developing more advanced neural "learning-what-to-index" methods.
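Once a query-agnostic quality score exists for each passage, static pruning reduces to dropping the lowest-scoring fraction before indexing. The sketch below assumes the scores are already computed; the function name, the passage identifiers, and the score values are hypothetical, and the quality estimator itself (the paper's contribution) is not modeled.

```python
def prune_by_quality(quality_by_id, fraction=0.25):
    """Drop the lowest-quality fraction of passages; return the ids to index."""
    n_drop = int(len(quality_by_id) * fraction)
    lowest = set(sorted(quality_by_id, key=quality_by_id.get)[:n_drop])
    return [pid for pid in quality_by_id if pid not in lowest]

# Hypothetical quality estimates in [0, 1] for four passages.
quality_by_id = {"p1": 0.91, "p2": 0.08, "p3": 0.55, "p4": 0.73}
kept = prune_by_quality(quality_by_id, fraction=0.25)
```

Only the passages in `kept` would then go through the costly dense or learned sparse encoding step.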

Revisiting Document Expansion and Filtering for Effective First-Stage Retrieval

Watheq Mansour
Shengyao Zhuang
Guido Zuccon
Joel Mackenzie

Document expansion is a technique that aims to reduce the likelihood of term mismatch by augmenting documents with related terms or queries. Doc2Query minus minus (Doc2Query--) represents an extension to the expansion process that uses a neural model to identify and remove expansions that may not be relevant to the given document, thereby increasing the quality of the ranking while simultaneously reducing the amount of augmented data. In this work, we conduct a detailed reproducibility study of Doc2Query-- to better understand the trade-offs inherent to document expansion and filtering mechanisms. After successfully reproducing the best-performing method from the Doc2Query-- family, we show that filtering actually harms recall-based metrics on various test collections. Next, we explore whether the two-stage "generate-then-filter" process can be replaced with a single generation phase via reinforcement learning. Finally, we extend our experimentation to learned sparse retrieval models and demonstrate that filtering is not helpful when term weights can be learned. Overall, our work provides a deeper understanding of the behaviour and characteristics of common document expansion mechanisms, and paves the way for developing more efficient yet effective augmentation models.

SESSION: Session: Multimedia 1

Unsupervised Cross-Domain Image Retrieval with Semantic-Attended Mixture-of-Experts

Kai Wang
Jiayang Liu
Xing Xu
Jingkuan Song
Xin Liu
Heng Tao Shen

Unsupervised cross-domain image retrieval is designed to facilitate the retrieval between images in different domains in an unsupervised way. Without the guidance of labels, both intra-domain semantic learning and inter-domain semantic alignment pose significant challenges to the model's learning process. The resolution of these challenges relies on the accurate capture of domain-invariant semantic features by the model. Based on this consideration, we propose our Semantic-Attended Mixture of Experts (SA-MoE) model. Leveraging the proficiency of the MoE network in capturing visual features, we enhance the model's focus on semantically relevant features through a series of strategies. We first utilize the self-attention mechanism of the Vision Transformer to adaptively collect information with different weights on instances from different domains. In addition, we introduce contextual semantic association metrics to more accurately measure the semantic relatedness between instances. By utilizing the association metrics, secondary clustering is performed in the feature space to reinforce semantic relationships. Finally, we employ the metrics for information selection on the fused data to remove the semantic noise. We conduct extensive experiments on three widely used datasets. Consistent comparisons with existing methods indicate that our model achieves state-of-the-art performance.

Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images

Shicheng Xu
Danyang Hou
Liang Pang
Jingcheng Deng
Jun Xu
Huawei Shen
Xueqi Cheng

With the application of generation models, the internet is increasingly inundated with AI-generated content (AIGC), so that both real and AI-generated content is indexed in the corpus for search. This paper explores the impact of AI-generated images on text-image search in this scenario. Firstly, we construct a benchmark consisting of both real and AI-generated images for this study. In this benchmark, AI-generated images possess visual semantics sufficiently similar to real images. Experiments on this benchmark reveal that text-image retrieval models tend to rank the AI-generated images higher than the real images, even though the AI-generated images do not exhibit more visually relevant semantics to the queries than real images. We call this bias invisible relevance bias. This bias is detected across retrieval models with different training data and architectures. Further exploration reveals that mixing AI-generated images into the training data of retrieval models exacerbates the invisible relevance bias. These problems cause a vicious cycle in which AI-generated images have a higher chance of being exposed from massive data, which makes them more likely to be mixed into the training of retrieval models, and such training makes the invisible relevance bias more and more serious. To mitigate this bias and elucidate its potential causes, we first propose an effective method to alleviate it. Subsequently, we apply our proposed debiasing method to retroactively identify the causes of this bias, revealing that the AI-generated images induce the image encoder to embed additional information into their representation. This information makes the retriever estimate a higher relevance score. We conduct experiments to support this assertion.

Findings in this paper reveal the potential impact of AI-generated images on retrieval and have implications for further research. Code is released at https://github.com/xsc1234/Invisible-Relevance-Bias.

COMI: COrrect and MItigate Shortcut Learning Behavior in Deep Neural Networks

Lili Zhao
Qi Liu
Linan Yue
Wei Chen
Liyi Chen
Ruijun Sun
Chao Song

Deep Neural Networks (DNNs), despite their notable progress across information retrieval tasks, encounter the issues of shortcut learning and struggle with poor generalization due to their reliance on spurious correlations between features and labels. Current research mainly mitigates shortcut learning behavior using augmentation and distillation techniques, but these methods could be laborious and introduce unwarranted biases. To tackle these issues, in this paper, we propose COMI, a novel method to COrrect and MItigate shortcut learning behavior. Inspired by the ways students solve shortcuts in educational scenarios, we aim to reduce the model's reliance on shortcuts and enhance its ability to extract underlying information integrated with standard Empirical Risk Minimization (ERM). Specifically, we first design a Correct Habit (CoHa) strategy to retrieve the top m challenging samples for priority training, which encourages the model to rely less on shortcuts in the early training. Then, to extract more meaningful underlying information, the information derived from ERM is separated into task-relevant and task-irrelevant information: the former serves as the primary basis for model predictions, while the latter is considered non-essential. However, within task-relevant information, certain potential shortcuts contribute to overconfident predictions. To mitigate this, we design a Deep Mitigation (DeMi) network with a shortcut margin loss to adaptively control the feature weights of shortcuts and eliminate their influence. Besides, to counteract the issue of unknown shortcut tokens in NLP, we adopt the locally interpretable module LIME to help recognize shortcut tokens. Finally, extensive experiments conducted on NLP and CV tasks demonstrate the effectiveness of COMI, which can perform well on both IID and OOD samples.

Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval

Haokun Wen
Xuemeng Song
Xiaolin Chen
Yinwei Wei
Liqiang Nie
Tat-Seng Chua

Composed image retrieval (CIR) aims to retrieve the target image based on a multimodal query, i.e., a reference image paired with corresponding modification text. Recent CIR studies leverage vision-language pre-trained (VLP) methods as the feature extraction backbone and perform nonlinear feature-level multimodal query fusion to retrieve the target image. Despite the promising performance, we argue that their nonlinear feature-level multimodal fusion may lead to the fused feature deviating from the original embedding space, potentially hurting the retrieval performance. To address this issue, in this work, we propose shifting the multimodal fusion from the feature level to the raw-data level to fully exploit the VLP model's multimodal encoding and cross-modal alignment abilities. In particular, we introduce a Dual Query Unification-based Composed Image Retrieval framework (DQU-CIR), whose backbone simply involves a VLP model's image encoder and a text encoder. Specifically, DQU-CIR first employs two training-free query unification components to derive a unified textual and visual query based on the raw data of the multimodal query, respectively. The unified textual query is derived by concatenating the modification text with the extracted reference image's textual description, while the unified visual query is created by writing the key modification words onto the reference image. Ultimately, to address diverse search intentions, DQU-CIR linearly combines the features of the two unified queries encoded by the VLP model to retrieve the target image. Extensive experiments on four real-world datasets validate the effectiveness of our proposed method.
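
The final step described above, linearly combining the features of the two unified queries and retrieving the target image, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the fixed `weight` parameter, and the use of cosine similarity for ranking are illustrative choices, not details confirmed by the abstract, and the toy two-dimensional vectors stand in for real VLP encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def combine_queries(text_feat, visual_feat, weight=0.5):
    """Linear combination of the two unified-query features.

    `weight` is a hypothetical mixing coefficient in [0, 1].
    """
    t, v = normalize(text_feat), normalize(visual_feat)
    return normalize([weight * a + (1.0 - weight) * b for a, b in zip(t, v)])


def rank_targets(query, targets):
    """Return target indices sorted by descending cosine similarity."""
    scores = [sum(q * c for q, c in zip(query, normalize(t))) for t in targets]
    return sorted(range(len(targets)), key=lambda i: scores[i], reverse=True)


# Toy features standing in for encoder outputs.
query = combine_queries([1.0, 0.0], [0.0, 1.0], weight=0.5)
ranking = rank_targets(query, [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
```

Because the combination is linear and both inputs come from the same encoder space, the fused query stays close to that space, which is the motivation the abstract gives for avoiding nonlinear feature-level fusion.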

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

Haoqiang Lin
Haokun Wen
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie

Composed Image Retrieval (CIR) allows users to search target images with a multimodal query, comprising a reference image and a modification text that describes the user's modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, recent researchers have shifted to the challenging task of zero-shot CIR (ZS-CIR), which targets fulfilling CIR without annotated triplets. The pioneer ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in the textual form, while the latter works on jointly aligning the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.

SESSION: Session: Evaluation

The Treatment of Ties in Rank-Biased Overlap

Matteo Corsi
Julián Urbano

Rank-Biased Overlap (RBO) is a similarity measure for indefinite rankings: it is top-weighted, and can be computed when only a prefix of the rankings is known or when they have only some items in common. It is widely used for instance to analyze differences between search engines by comparing the rankings of documents they retrieve for the same queries. In these situations, though, it is very frequent to find tied documents that have the same score. Unfortunately, the treatment of ties in RBO remains superficial and incomplete, in the sense that it is not clear how to calculate it from the ranking prefixes only. In addition, the existing way of dealing with ties is very different from the one traditionally followed in the field of Statistics, most notably found in rank correlation coefficients such as Kendall's and Spearman's. In this paper we propose a generalized formulation for RBO to handle ties, thanks to which we complete the original definitions by showing how to perform prefix evaluation. We also use it to fully develop two variants that align with the ones found in the Statistics literature: one when there is a reference ranking to compare to, and one when there is not. Overall, these three variants provide researchers with flexibility when comparing rankings with RBO, by clearly determining what ties mean, and how they should be treated. Finally, using both synthetic and TREC data, we demonstrate the use of these new tie-aware RBO measures. We show that the scores may differ substantially from the original tie-unaware RBO measure, where ties had to be broken at random or by arbitrary criteria such as by document ID. Overall, these results evidence the need for a proper account of ties in rank similarity measures such as RBO.
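
For orientation, the original tie-unaware RBO can be sketched in a few lines. The sketch below evaluates the standard sum (1 - p) * sum over d of p^(d-1) * A_d, where A_d is the proportion of items shared by the two prefixes of length d, truncated at a finite depth. It does not implement the tie-aware variants or the prefix extrapolation discussed in the abstract, it assumes each ranking has no ties and no repeated items, and the function name `rbo_truncated` is hypothetical.

```python
def rbo_truncated(s, t, p=0.9, depth=None):
    """Truncated, tie-unaware RBO between two rankings (lists of item IDs).

    p controls top-weightedness: smaller p puts more weight on early ranks.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    k = min(len(s), len(t))
    if depth is not None:
        k = min(k, depth)
    total = 0.0
    for d in range(1, k + 1):
        # Agreement at depth d: size of the prefix intersection divided by d.
        agreement = len(set(s[:d]) & set(t[:d])) / d
        total += p ** (d - 1) * agreement
    return (1.0 - p) * total


same = rbo_truncated(["a", "b", "c"], ["a", "b", "c"], p=0.5)
swapped = rbo_truncated(["a", "b"], ["b", "a"], p=0.5)
disjoint = rbo_truncated(["a", "b"], ["c", "d"], p=0.5)
```

Note that two identical rankings of length 3 score 1 - 0.5^3 = 0.875 here rather than 1, because the truncated sum ignores the unseen tail. Breaking a tie one way or the other changes the prefixes, and hence the score, which is why an explicit treatment of ties matters.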

Uncontextualized significance considered dangerous

Nicola Ferro
Mark Sanderson

We examine the context of significance tests in offline retrieval experiments. Our Information Retrieval (IR) community is notable for its experimental rigour: the use of statistical significance is growing across our publications. However, we show that ignoring the context of a test risks Type I errors, leading to potential publication bias. We examine two contexts: multiple testing and the types of the retrieval systems being compared. Our results show that multiple testing corrections are critical for experimental work. In addition, we find that past research on the reliability of test collections may be flawed owing to the type of systems examined. The latter result has not been shown before. Together our results suggest substantial numbers of Type I errors in offline IR experiments. We detail a methodology to alleviate the errors.
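
To make the multiple testing point concrete, here is a minimal sketch of one standard correction, the Holm-Bonferroni step-down procedure. The abstract does not say which correction the authors use, so this choice is an assumption made purely for illustration, as are the function name and the example p-values.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected.

    Tests are visited from smallest to largest p-value; the i-th smallest
    (0-based rank) is compared against alpha / (m - rank), and the
    procedure stops at the first test that fails.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are also not rejected
    return reject


# Three system comparisons, each "significant" at 0.05 when taken alone.
p_values = [0.01, 0.04, 0.03]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values, alpha=0.05)
```

In this toy case all three comparisons pass an uncorrected 0.05 threshold, but only the first survives the correction, which illustrates how uncontextualized tests can inflate the number of reported significant differences.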

Can We Trust Recommender System Fairness Evaluation? The Role of Fairness and Relevance

Theresia Veronika Rampisela
Tuukka Ruotsalo
Maria Maistro
Christina Lioma

Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance/fairness measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? We empirically study for the first time the behaviour of these measures across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank position changes than relevance- and fairness-only measures, meaning that they are less granular than traditional RS measures; and iii) tend to compress scores at the low end of their range, meaning that they are not very expressive. We counter the above limitations with a set of guidelines on the appropriate usage of such measures, i.e., they should be used with caution due to their tendency to contradict each other and to have a very small empirical range.

What Matters in a Measure? A Perspective from Large-Scale Search Evaluation

Paul Thomas
Gabriella Kazai
Nick Craswell
Seth Spielman

Information retrieval (IR) has a large literature on evaluation, dating back decades and forming a central part of the research culture. The largest proportion of this literature discusses techniques to turn a sequence of relevance labels into a single number, reflecting the system's performance: precision or cumulative gain, for example, or dozens of alternatives. Those techniques, or metrics, are themselves evaluated, commonly by reference to sensitivity and validity.

In our experience measuring search in industrial settings, a measurement regime needs many other qualities to be practical. For example, we must also consider how much a metric costs; how robust it is to the happenstance of sampling; whether it is debuggable; and what activities are incentivised when a metric is taken as a goal.

In this perspective paper we discuss what makes a search metric successful in large-scale settings, including factors which are not often canvassed in IR research but which are important in "real-world" use. We illustrate this with examples, including from industrial settings, and offer suggestions for metrics as part of a working system.

CIRAL: A Test Collection for CLIR Evaluations in African Languages

Mofetoluwa Adeyemi
Akintunde Oladipo
Xinyu Zhang
David Alfonso-Hermelo
Mehdi Rezagholizadeh
Boxing Chen
Abdul-Hakeem Omotayo
Idris Abdulmumin
Naome A. Etori
Toyib Babatunde Musa
Samuel Fanijo
Oluwabusayo Olufunke Awoyomi
Saheed Abdullahi Salahudeen
Labaran Adamu Mohammed
Daud Olamide Abolade
Falalu Ibrahim Lawan
Maryam Sabo Abubakar
Ruqayya Nasir Iro
Amina Imam Abubakar
Shafie Abdi Mohamed
Hanad Mohamud Mohamed
Tunde Oluwaseyi Ajayi
Jimmy Lin

Cross-lingual information retrieval (CLIR) continues to be an actively studied topic in information retrieval (IR), and there have been consistent efforts in curating test collections to support its research. However, there is a lack of high-quality human-annotated CLIR resources for African languages: the few existing collections are mostly curated synthetically or from sources with limited corpora for these languages. We present CIRAL, a test collection for cross-lingual retrieval with English queries and passages in four African languages: Hausa, Somali, Swahili, and Yoruba. CIRAL's corpora are obtained from Indigenous African websites and consist of a total of over 2.5 million passages. We gathered over 1,600 queries and 30k high-quality binary relevance judgments annotated by native speakers of the languages. Additional pools were also obtained at CIRAL's shared task, which was hosted at the Forum for Information Retrieval Evaluation 2023 to encourage community participation in CLIR for African languages. We describe the design and curation process of our test collection and provide reproducible baselines that demonstrate CIRAL's utility in evaluating the effectiveness of systems. CIRAL is available at https://github.com/ciralproject/ciral.

ACORDAR 2.0: A Test Collection for Ad Hoc Dataset Retrieval with Densely Pooled Datasets and Question-Style Queries

Qiaosheng Chen
Weiqing Luo
Zixian Huang
Tengteng Lin
Xiaxia Wang
Ahmet Soylu
Basil Ell
Baifan Zhou
Evgeny Kharlamov
Gong Cheng

Dataset search, or more specifically ad hoc dataset retrieval, a trending specialized IR task, has received increasing attention in both academia and industry. While methods and systems continue evolving, existing test collections for this task exhibit shortcomings: in particular, they suffer from lexical bias in pooling and are limited to keyword-style queries for evaluation. To address these limitations, in this paper, we construct ACORDAR 2.0, a new test collection for this task which is also the largest to date. To reduce lexical bias in pooling, we adapt dense retrieval models to large structured data, using them to find an extended set of semantically relevant datasets to be annotated. To diversify query forms, we employ a large language model to rewrite keyword queries into high-quality question-style queries. We use the test collection to evaluate popular sparse and dense retrieval models to establish a baseline for future studies. The test collection and source code are publicly available.

Browsing and Searching Metadata of TREC

Timo Breuer
Ellen M. Voorhees
Ian Soboroff

Information Retrieval (IR) research is deeply rooted in experimentation and evaluation, and the Text REtrieval Conference (TREC) has been playing a central role in making that possible since its inauguration in 1992. TREC's mission centers around providing the infrastructure and resources to make IR evaluations possible at scale. Over the years, a plethora of different retrieval problems were addressed, culminating in data artifacts that remained as valuable and useful tools for the IR community. Even though the data are largely available from TREC's website, there is currently no resource that facilitates a cohesive way to obtain metadata information about the run file - the IR community's de-facto standard data format for storing rankings of system-oriented IR experiments.

To this end, the work at hand introduces a software suite that facilitates access to metadata of experimental resources, resulting from over 30 years of IR experiments and evaluations at TREC. With a particular focus on the run files, the paper motivates the requirements for better access to TREC metadata and details the concepts, the resources, the corresponding implementations, and possible use cases. More specifically, we contribute a web interface to browse former TREC submissions. Besides, we provide the underlying metadatabase and a corresponding RESTful interface for more principled and structured queries about the TREC metadata.

SESSION: Session: RecSys and LLMs

Large Language Models for Intent-Driven Session Recommendations

Zhu Sun
Hongyang Liu
Xinghua Qu
Kaidong Feng
Yan Wang
Yew Soon Ong

The goal of intent-aware session recommendation (ISR) approaches is to capture user intents within a session for accurate next-item prediction. However, the capability of these approaches is limited by assuming all sessions have a uniform and fixed number of intents. In reality, user sessions can vary, where the number of intentions may differ from one to another. Moreover, they can only learn user intents in the latent space, which further restricts the model's transparency. To ease these issues, we propose a simple yet effective paradigm for ISR motivated by the advanced reasoning capability of large language models (LLMs). Specifically, we first create an initial prompt to instruct LLMs to predict the next item by inferring varying user intents reflected in a session. Then, we propose an effective optimization mechanism to automatically optimize prompts with an iterative self-reflection. Finally, we leverage the robust generalizability of LLMs across diverse domains to efficiently select the optimal prompt for ISR. As such, the proposed paradigm effectively guides LLMs to identify varying user intents at a semantic level, thus delivering more accurate and comprehensible recommendations. Extensive experiments on three real-world datasets verify the superiority of our proposed method.

Sequential Recommendation with Latent Relations based on Large Language Model

Shenghao Yang
Weizhi Ma
Peijie Sun
Qingyao Ai
Yiqun Liu
Mingchen Cai
Min Zhang

Sequential recommender systems predict items that may interest users by modeling their preferences based on historical interactions. Traditional sequential recommendation methods rely on capturing implicit collaborative filtering signals among items. Recent relation-aware sequential recommendation models have achieved promising performance by explicitly incorporating item relations into the modeling of user historical sequences, where most relations are extracted from knowledge graphs. However, existing methods rely on manually predefined relations and suffer the sparsity issue, limiting the generalization ability in diverse scenarios with varied item relations.

In this paper, we propose a novel relation-aware sequential recommendation framework with Latent Relation Discovery (LRD). Different from previous relation-aware models that rely on predefined rules, we propose to leverage the Large Language Model (LLM) to provide new types of relations and connections between items. The motivation is that the LLM contains abundant world knowledge, which can be adopted to mine latent relations of items for recommendation. Specifically, inspired by the fact that humans can describe relations between items using natural language, LRD harnesses the LLM that has demonstrated human-like knowledge to obtain language knowledge representations of items. These representations are fed into a latent relation discovery module based on the discrete state variational autoencoder (DVAE). Then the self-supervised relation discovery tasks and recommendation tasks are jointly optimized. Experimental results on multiple public datasets demonstrate that our proposed latent relation discovery method can be incorporated with existing relation-aware sequential recommendation models and significantly improve the performance. Further analysis experiments indicate the effectiveness and reliability of the discovered latent relations.

Enhancing Sequential Recommenders with Augmented Knowledge from Aligned Large Language Models

Yankun Ren
Zhongde Chen
Xinxing Yang
Longfei Li
Cong Jiang
Lei Cheng
Bo Zhang
Linjian Mo
Jun Zhou

Recommender systems are widely used in various online platforms. In the context of sequential recommendation, it is essential to accurately capture the chronological patterns in user activities to generate relevant recommendations. Conventional ID-based sequential recommenders have shown promise but lack comprehensive real-world knowledge about items, limiting their effectiveness. Recent advancements in Large Language Models (LLMs) offer the potential to bridge this gap by leveraging the extensive real-world knowledge encapsulated in LLMs. However, integrating LLMs into sequential recommender systems comes with its own challenges, including inadequate representation of sequential behavior patterns and long inference latency. In this paper, we propose SeRALM (Enhancing Sequential Recommenders with Augmented Knowledge from Aligned Large Language Models) to address these challenges. SeRALM integrates LLMs with conventional ID-based sequential recommenders for sequential recommendation tasks. We combine text-format knowledge generated by LLMs with item IDs and feed this enriched data into ID-based recommenders, benefitting from the strengths of both paradigms. Moreover, we develop a theoretically underpinned alignment training method to refine LLMs' generation using feedback from ID-based recommenders for better knowledge augmentation. We also present an asynchronous technique to expedite the alignment training process. Experimental results on public benchmarks demonstrate that SeRALM significantly improves the performances of ID-based sequential recommenders. Further, a series of ablation studies and analyses corroborate SeRALM's proficiency in steering LLMs to generate more pertinent and advantageous knowledge across diverse scenarios.

IDGenRec: LLM-RecSys Alignment with Textual ID Learning

Juntao Tan
Shuyuan Xu
Wenyue Hua
Yingqiang Ge
Zelong Li
Yongfeng Zhang

LLM-based Generative recommendation has attracted significant attention. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current generative recommendation approaches struggle to effectively encode items within the text-to-text framework. Due to this issue, the true potential of LLM-based generative recommendation remains largely unexplored. To better align LLMs with recommendation needs, we propose IDGenRec, representing each item as a unique, concise, semantically rich, platform-agnostic textual ID using human language tokens. This is achieved by training a textual ID generator alongside the LLM-based recommender, enabling seamless integration of personalized recommendations into natural language generation. Notably, as user history is expressed in natural language and decoupled from the original dataset, our approach suggests the potential for a foundational generative recommendation model.

Experiments show that our framework consistently surpasses existing models in sequential recommendation under the standard experimental setting. Then, we train a foundation recommendation model on a collected fusion dataset and test its recommendation performance on 6 unseen datasets across different platforms under a completely zero-shot setting. The results show that the zero-shot performance of the pre-trained model is comparable to or even better than some traditional recommendation models based on supervised training, showing the potential of the IDGenRec paradigm serving as the foundation model for generative recommendation. Code and data are open-sourced at https://github.com/agiresearch/IDGenRec.

Data-efficient Fine-tuning for LLM-based Recommendation

Xinyu Lin
Wenjie Wang
Yongqi Li
Shuo Yang
Fuli Feng
Yinwei Wei
Tat-Seng Chua

Leveraging Large Language Models (LLMs) for recommendation has recently garnered considerable attention, where fine-tuning plays a key role in LLMs' adaptation. However, the cost of fine-tuning LLMs on rapidly expanding recommendation data limits their practical application. To address this challenge, few-shot fine-tuning offers a promising approach to quickly adapt LLMs to new recommendation data. We propose the task of data pruning for efficient LLM-based recommendation, aimed at identifying representative samples tailored for LLMs' few-shot fine-tuning. While coreset selection is closely related to the proposed task, existing coreset selection methods often rely on suboptimal heuristic metrics or entail costly optimization on large-scale recommendation data.

To tackle these issues, we introduce two primary objectives for the data pruning task in the context of LLM-based recommendation: 1) high accuracy aims to identify the influential samples that can lead to high overall performance; and 2) high efficiency underlines the low costs of the data pruning process. To pursue the two objectives, we propose a novel data pruning method incorporating two scores, namely influence score and effort score, to efficiently identify the influential samples. Particularly, the influence score is introduced to accurately estimate the influence of removing each sample on the overall performance. To achieve low costs of the data pruning process, we employ a small-sized surrogate model to replace LLMs to obtain the influence score. Considering the potential gap between the surrogate model and LLMs, we further propose an effort score to prioritize some hard samples specifically for LLMs. We instantiate the proposed method on two competitive LLM-based recommender models, and empirical results on three real-world datasets validate the effectiveness of our proposed method. In particular, our method uses only 2% of the samples to surpass full-data fine-tuning, reducing time costs by 97%.

Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling

Jie Wang
Alexandros Karatzoglou
Ioannis Arapakis
Joemon M. Jose

Reinforcement Learning (RL)-based recommender systems have demonstrated promising performance in session-based and sequential recommendation tasks. Existing offline RL-based sequential recommendation methods face the challenge of obtaining effective user feedback from the environment. Developing a model for the user state and shaping an appropriate reward for recommendation remains a challenge. In this paper, we leverage language understanding capabilities and adapt large language models (LLMs) as an environment (LE) to enhance RL-based recommenders. The LE is learned from a subset of user-item interaction data, thus reducing the need for large training data, and can synthesize user feedback for offline data by: (i) acting as a state model that produces high-quality states that enrich the user representation, and (ii) functioning as a reward model to accurately capture nuanced user preferences on actions. Moreover, the LE allows us to generate positive actions that augment the limited offline training data. We propose an LE Augmentation (LEA) method to further improve recommendation performance by jointly optimising the supervised component and the RL policy, using the augmented actions and historical user signals. We use LEA, the state, and reward models in conjunction with state-of-the-art RL recommenders and report experimental results on two publicly available datasets.

OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems

Shuyuan Xu
Wenyue Hua
Yongfeng Zhang

In recent years, the integration of Large Language Models (LLMs) into recommender systems has garnered interest among both practitioners and researchers. Despite this interest, the field is still emerging, and the lack of open-source R&D platforms may impede the exploration of LLM-based recommendations. This paper introduces OpenP5, an open-source platform designed as a resource to facilitate the development, training, and evaluation of LLM-based generative recommender systems for research purposes. The platform is implemented using the encoder-decoder LLMs (e.g., T5) and the decoder-only LLMs (e.g., LLaMA-2) across 10 widely recognized public datasets, catering to two fundamental recommendation tasks: sequential and straightforward recommendations. Recognizing the crucial role of item IDs in LLM-based recommendations, we have also incorporated three item indexing methods within the OpenP5 platform: random indexing, sequential indexing and collaborative indexing. Built on the Transformers library, the platform facilitates easy customization of LLM-based recommendations for users. OpenP5 boasts a range of features including extensible data processing, task-centric optimization, comprehensive datasets and checkpoints, efficient acceleration, and standardized evaluations, making it a valuable tool for the implementation and evaluation of LLM-based recommender systems. The open-source code and pre-trained checkpoints for the OpenP5 library are publicly available at https://github.com/agiresearch/OpenP5.

SESSION: Session: Fairness in RecSys

Fair Sequential Recommendation without User Demographics

Huimin Zeng
Zhankui He
Zhenrui Yue
Julian McAuley
Dong Wang

Much existing literature on fair recommendation (i.e., group fairness) leverages users' demographic attributes (e.g., gender) to develop fair recommendation methods. However, in real-world scenarios, due to privacy concerns and convenience considerations, users may not be willing to share their demographic information with the system, which limits the application of many existing methods. Moreover, sequential recommendation (SR) models achieve state-of-the-art performance compared to traditional collaborative filtering (CF) recommenders, and can represent users solely using user-item interactions (user-free). This leaves a wrong impression that SR models are free from group unfairness by design. In this work, we explore a critical question: how can we build a fair sequential recommendation system without even knowing user demographics? To address this problem, we propose Agnostic FairSeqRec (A-FSR): a model-agnostic and demographic-agnostic debiasing framework for sequential recommendation without requiring users' demographic attributes. Firstly, A-FSR reduces the correlation between the potential stereotypical patterns in the input sequences and final recommendations via Dirichlet neighbor smoothing. Secondly, A-FSR estimates an under-represented group of sequences via a gradient-based heuristic, and implicitly moves training focus towards the under-represented group by minimizing a distributionally robust optimization (DRO) based objective. Results on real-world datasets show that A-FSR achieves significant improvements on group fairness in sequential recommendation, while outperforming other state-of-the-art baselines.

CaDRec: Contextualized and Debiased Recommender Model

Xinfeng Wang
Fumiyo Fukumoto
Jin Cui
Yoshimi Suzuki
Jiyi Li
Dongjin Yu

Recommender models aimed at mining users' behavioral patterns have raised great attention as one of the essential applications in daily life. Recent work on graph neural networks (GNNs) or debiasing methods has attained remarkable gains. However, they still suffer from (1) over-smoothing node embeddings caused by recursive convolutions with GNNs, and (2) the skewed distribution of interactions due to popularity and user-individual biases. This paper proposes a contextualized and debiased recommender model (CaDRec). To overcome the over-smoothing issue, we explore a novel hypergraph convolution operator that can select effective neighbors during convolution by introducing both structural context and sequential context. To tackle the skewed distribution, we propose two strategies for disentangling interactions: (1) modeling individual biases to learn unbiased item embeddings, and (2) incorporating item popularity with positional encoding. Moreover, we mathematically show that the imbalance of the gradients to update item embeddings exacerbates the popularity bias, thus adopting regularization and weighting schemes as solutions. Extensive experiments on four datasets demonstrate the superiority of the CaDRec against state-of-the-art (SOTA) methods. Our source code and data are released at https://github.com/WangXFng/CaDRec.

Going Beyond Popularity and Positivity Bias: Correcting for Multifactorial Bias in Recommender Systems

Jin Huang
Harrie Oosterhuis
Masoud Mansoury
Herke van Hoof
Maarten de Rijke

Two typical forms of bias in user interaction data with recommender systems (RSs) are popularity bias and positivity bias, which manifest themselves as the over-representation of interactions with popular items or items that users prefer, respectively. Debiasing methods aim to mitigate the effect of selection bias on the evaluation and optimization of RSs. However, existing debiasing methods only consider single-factor forms of bias, e.g., only the item (popularity) or only the rating value (positivity). This is in stark contrast with the real world where user selections are generally affected by multiple factors at once. In this work, we consider multifactorial selection bias in RSs. Our focus is on selection bias affected by both item and rating value factors, which is a generalization and combination of popularity and positivity bias. While the concept of multifactorial bias is intuitive, it brings a severe practical challenge as it requires substantially more data for accurate bias estimation. As a solution, we propose smoothing and alternating gradient descent techniques to reduce variance and improve the robustness of its optimization. Our experimental results reveal that, with our proposed techniques, multifactorial bias corrections are more effective and robust than single-factor counterparts on real-world and synthetic datasets.

Adaptive Fair Representation Learning for Personalized Fairness in Recommendations via Information Alignment

Xinyu Zhu
Lilin Zhang
Ning Yang

Personalized fairness in recommendations has been attracting increasing attention from researchers. The existing works often treat a fairness requirement, represented as a collection of sensitive attributes, as a hyper-parameter, and pursue extreme fairness by completely removing information of sensitive attributes from the learned fair embedding, which suffer from two challenges: huge training cost incurred by the explosion of attribute combinations, and the suboptimal trade-off between fairness and accuracy. In this paper, we propose a novel Adaptive Fair Representation Learning (AFRL) model, which achieves a real personalized fairness due to its advantage of training only one model to adaptively serve different fairness requirements during inference phase. Particularly, AFRL treats fairness requirements as inputs and can learn an attribute-specific embedding for each attribute from the unfair user embedding, which endows AFRL with the adaptability during inference phase to determine the non-sensitive attributes under the guidance of the user's unique fairness requirement. To achieve a better trade-off between fairness and accuracy in recommendations, AFRL conducts a novel Information Alignment to exactly preserve discriminative information of non-sensitive attributes and incorporate a debiased collaborative embedding into the fair embedding to capture attribute-independent collaborative signals, without loss of fairness. Finally, the extensive experiments conducted on real datasets together with the sound theoretical analysis demonstrate the superiority of AFRL.

Configurable Fairness for New Item Recommendation Considering Entry Time of Items

Huizhong Guo
Dongxia Wang
Zhu Sun
Haonan Zhang
Jinfeng Li
Jie Zhang

Recommender systems tend to excessively expose longer-standing items, resulting in significant unfairness to new items with few interaction records, even though they may have the potential to attract a considerable number of users. The existing fairness-based solutions do not specifically consider the exposure fairness of new items, for which a systematic definition is also lacking, discouraging the promotion of new items or contents. In this work, we introduce a multi-degree new-item exposure fairness definition, which considers item entry-time and is configurable regarding different fairness requirements. We then propose a configurable new-item fairness-aware framework named CNIF, which employs two-stage training where fairness degrees are incorporated for guidance. Extensive experiments on multiple popular datasets and backbone models demonstrate that CNIF can effectively enhance fairness of the existing models regarding the exposure resources of new items (including the brand-new items with no interaction). Specifically, CNIF demonstrates a substantial advancement with a 65.59% improvement in fairness metric and a noteworthy 9.97% improvement in recommendation accuracy compared to backbone models on the KuaiRec dataset. In comparison to various fairness-based solutions, it stands out by achieving the best trade-off between fairness and recommendation accuracy, surpassing the best baseline by 14.20%.

Fair Recommendations with Limited Sensitive Attributes: A Distributionally Robust Optimization Approach

Tianhao Shi
Yang Zhang
Jizhi Zhang
Fuli Feng
Xiangnan He

As recommender systems are indispensable in various domains such as job searching and e-commerce, providing equitable recommendations to users with different sensitive attributes becomes an imperative requirement. Prior approaches for enhancing fairness in recommender systems presume the availability of all sensitive attributes, which can be difficult to obtain due to privacy concerns or inadequate means of capturing these attributes. In practice, the efficacy of these approaches is limited, pushing us to investigate ways of promoting fairness with limited sensitive attribute information. Toward this goal, it is important to reconstruct missing sensitive attributes. Nevertheless, reconstruction errors are inevitable due to the complexity of real-world sensitive attribute reconstruction problems and legal regulations. Thus, we pursue fair learning methods that are robust to reconstruction errors. To this end, we propose Distributionally Robust Fair Optimization (DRFO), which minimizes the worst-case unfairness over all potential probability distributions of missing sensitive attributes instead of the reconstructed one to account for the impact of the reconstruction errors. We provide theoretical and empirical evidence to demonstrate that our method can effectively ensure fairness in recommender systems when only limited sensitive attributes are accessible.

SESSION: Session: GenIR and The Future of Search with LLMs

Generative Retrieval via Term Set Generation

Peitian Zhang
Zheng Liu
Yujia Zhou
Zhicheng Dou
Fangchao Liu
Zhao Cao

Recently, generative retrieval has emerged as a promising alternative to the traditional retrieval paradigms. It assigns each document a unique identifier, known as the DocID, and employs a generative model to directly generate the relevant DocID for the input query. A common choice for the DocID is one or several natural language sequences, e.g., the title, synthetic queries, or n-grams, so that the pre-trained knowledge of the generative model can be effectively utilized. However, a sequence is generated token by token, where only the most likely candidates are kept and the rest are pruned at each decoding step; thus, retrieval fails if any token within the relevant DocID is falsely pruned. What's worse, during decoding, the model can only perceive preceding tokens in the DocID while being blind to subsequent ones, and is hence prone to making such errors. To address this problem, we present a novel framework for generative retrieval, dubbed Term-Set Generation (TSGen). Instead of sequences, we use a set of terms as the DocID. The terms are selected based on learned weights from relevance signals, so that they concisely summarize the document's semantics and distinguish it from others. On top of the term-set DocID, we propose a permutation-invariant decoding algorithm, with which the term set can be generated in any permutation yet will always lead to the corresponding document. Remarkably, TSGen perceives all valid terms rather than only the preceding ones at each decoding step. Given the constant decoding space, it can make more reliable decisions due to the broader perspective. TSGen is also resilient to errors: the relevant DocID will not be falsely pruned as long as the decoded term belongs to it. Moreover, TSGen can explore the optimal decoding permutation of the term set on its own, which further improves the likelihood of generating the relevant DocID. Lastly, we design an iterative optimization procedure to incentivize the model to generate the relevant term set in its favorable permutation. We conduct extensive experiments on popular benchmarks of generative retrieval, which validate the effectiveness, the generalizability, the scalability, and the efficiency of TSGen.
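The idea that a term-set identifier can be produced in any order and still resolve to the same document can be sketched with plain sets. The example below is a toy illustration only, not the TSGen decoding algorithm: the corpus `DOC_TERMS` and the helpers `valid_next_terms` and `resolve` are hypothetical, and no generative model or learned term weights are involved.

```python
from itertools import permutations

# Hypothetical toy corpus: each document is identified by a set of terms.
DOC_TERMS = {
    "doc_a": frozenset({"neural", "retrieval", "index"}),
    "doc_b": frozenset({"neural", "ranking", "query"}),
    "doc_c": frozenset({"sparse", "retrieval", "bm25"}),
}

def valid_next_terms(decoded):
    """Terms that keep at least one document reachable after `decoded`.

    A document stays reachable if every decoded term belongs to its term
    set, regardless of the order in which the terms were produced.
    """
    decoded = set(decoded)
    allowed = set()
    for terms in DOC_TERMS.values():
        if decoded <= terms:
            allowed |= terms - decoded
    return allowed

def resolve(decoded):
    """Documents whose term set equals the decoded terms, in any order."""
    key = frozenset(decoded)
    return [doc for doc, terms in DOC_TERMS.items() if terms == key]

# Every ordering of doc_a's terms resolves to the same document.
orders = list(permutations(["neural", "retrieval", "index"]))
all_resolve_to_doc_a = all(resolve(order) == ["doc_a"] for order in orders)
```

In this sketch the candidate terms at each step are drawn from all terms of the still-reachable documents, which mirrors the abstract's point that the decoder is not restricted to a single left-to-right continuation.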

Planning Ahead in Generative Retrieval: Guiding Autoregressive Generation through Simultaneous Decoding

Hansi Zeng
Chen Luo
Hamed Zamani

This paper introduces PAG, a novel optimization and decoding approach that guides autoregressive generation of document identifiers in generative retrieval models through simultaneous decoding. To this aim, PAG constructs a set-based and sequential identifier for each document. Motivated by the bag-of-words assumption in information retrieval, the set-based identifier is built on lexical tokens. The sequential identifier, on the other hand, is obtained via quantizing relevance-based representations of documents. Extensive experiments on MS MARCO and TREC Deep Learning Track data reveal that PAG outperforms the state-of-the-art generative retrieval model by a large margin (e.g., 15.6% MRR improvements on MS MARCO), while achieving a 22× speedup in terms of query latency.
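For readers unfamiliar with the MRR metric cited above, the following is a minimal sketch of mean reciprocal rank with a rank cutoff. The function name, the cutoff default `k=10`, and the toy rankings are assumptions made for illustration; the abstract does not specify the evaluation script or cutoff used.

```python
def mean_reciprocal_rank(rankings, relevant, k=10):
    """Average of 1/rank of the first relevant document within the top k.

    rankings: dict mapping query id -> ranked list of document ids
    relevant: dict mapping query id -> set of relevant document ids
    Queries with no relevant document in the top k contribute 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical rankings for three queries.
rankings = {
    "q1": ["d1", "d7", "d3"],   # first relevant at rank 1
    "q2": ["d9", "d2", "d4"],   # first relevant at rank 2
    "q3": ["d5", "d6", "d8"],   # no relevant document retrieved
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}
mrr = mean_reciprocal_rank(rankings, relevant)  # (1 + 0.5 + 0) / 3 = 0.5
```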

Large Language Models and Future of Information Retrieval: Opportunities and Challenges

ChengXiang Zhai

Recent years have seen great success of large language models (LLMs) in performing many natural language processing tasks with impressive performance, including tasks that directly serve users such as question answering and text summarization. They open up unprecedented opportunities for transforming information retrieval (IR) research and applications. However, concerns such as hallucination undermine their trustworthiness, limiting their actual utility when deployed in real-world applications, especially high-stakes applications where trust is vital. How can we both exploit the strengths of LLMs and mitigate any risk caused by their weaknesses when applying LLMs to IR? What are the best opportunities for us to apply LLMs to IR? What are the major challenges that we will need to address in the future to fully exploit such opportunities? Given the anticipated growth of LLMs, what will future information retrieval systems look like? Will LLMs eventually replace an IR system? In this perspective paper, we examine these questions and provide provisional answers to them. We argue that LLMs will not be able to replace search engines, and future LLMs would need to learn how to use a search engine so that they can interact with a search engine on behalf of users. We conclude with a set of promising future research directions in applying LLMs to IR.

SESSION: Session: Graphs and LLMs

GraphGPT: Graph Instruction Tuning for Large Language Models

Jiabin Tang
Yuhao Yang
Wei Wei
Lei Shi
Lixin Su
Suqi Cheng
Dawei Yin
Chao Huang

Graph Neural Networks (GNNs) have evolved to understand graph structures through recursive exchanges and aggregations among nodes. To enhance robustness, self-supervised learning (SSL) has become a vital tool for data augmentation. Traditional methods often depend on fine-tuning with task-specific labels, limiting their effectiveness when labeled data is scarce. Our research tackles this by advancing graph model generalization in zero-shot learning environments. Inspired by the success of large language models (LLMs), we aim to create a graph-oriented LLM capable of exceptional generalization across various datasets and tasks without relying on downstream graph data. We introduce the GraphGPT framework, which integrates LLMs with graph structural knowledge through graph instruction tuning. This framework includes a text-graph grounding component to link textual and graph structures and a dual-stage instruction tuning approach with a lightweight graph-text alignment projector. These innovations allow LLMs to comprehend complex graph structures and enhance adaptability across diverse datasets and tasks. Our framework demonstrates superior generalization in both supervised and zero-shot graph learning tasks, surpassing existing benchmarks. The open-sourced model implementation of our GraphGPT is available at https://github.com/HKUDS/GraphGPT.

Instruction-based Hypergraph Pretraining

Mingdai Yang
Zhiwei Liu
Liangwei Yang
Xiaolong Liu
Chen Wang
Hao Peng
Philip S. Yu

Pretraining has been widely explored to augment the adaptability of graph learning models to transfer knowledge from large datasets to a downstream task, such as link prediction or classification. However, the gap between training objectives and the discrepancy between data distributions in pretraining and downstream tasks hinders the transfer of the pre-trained knowledge. Inspired by instruction-based prompts widely used in pre-trained language models, we introduce instructions into graph pretraining. In this paper, we propose a novel pretraining framework named Instruction-based Hypergraph Pretraining (IHP). To overcome the discrepancy between pretraining and downstream tasks, text-based instructions provide explicit guidance on specific tasks for representation learning. Compared to learnable prompts, whose effectiveness depends on the quality and diversity of training data, text-based instructions intrinsically encapsulate task information and support the model's generalization beyond the structure seen during pretraining. To capture high-order relations with task information in a context-aware manner, a novel prompting hypergraph convolution layer is devised to integrate instructions into information propagation in hypergraphs. Extensive experiments conducted on three public datasets verify the superiority of IHP in various scenarios.

LLM-enhanced Cascaded Multi-level Learning on Temporal Heterogeneous Graphs

Fengyi Wang
Guanghui Zhu
Chunfeng Yuan
Yihua Huang

Learning on temporal heterogeneous graphs (THGs) has attracted substantial attention in applications of information retrieval. Such graphs are ubiquitous in real-world domains like recommender systems and social networks. However, the spatial heterogeneity, rich semantic information, and intricate evolution patterns of THGs make it still difficult to generate high-quality embeddings for graph nodes. In this paper, we focus on two valuable and understudied issues related to THG learning: (a) How to capture the specific evolutionary characteristics of diverse temporal heterogeneous graphs? (b) Due to the heterogeneous nature of the graph, how to capture the unique temporal patterns of different node types? We explore these questions and present our solution by proposing a new method named CasMLN (Cascaded Multi-level Learning Network) for THG learning. Through the multi-level learning structure and aggregation methods specifically designed for different levels, we obtain information of multiple levels and fuse them to improve embedding generation. Additionally, we pioneer the use of large language models (LLMs) in the THG field. By leveraging the universality and powerful capabilities of LLMs, our method introduces LLM-based external knowledge to effectively capture the implicit nature of graphs and node types, which helps to enhance type- and graph-level representations. We evaluate our method on several real-world THG datasets for different downstream tasks. Extensive experimental results show that CasMLN outperforms the state-of-the-art baselines in both accuracy and efficiency.

SESSION: Session: Domain Specific RecSys

Course Recommender Systems Need to Consider the Job Market

Jibril Frej
Anna Dai
Syrielle Montariol
Antoine Bosselut
Tanja Käser

Current course recommender systems primarily leverage learner-course interactions, course content, learner preferences, and supplementary course details like instructor, institution, ratings, and reviews, to make their recommendations. However, these systems often overlook a critical aspect: the evolving skill demand of the job market. This paper focuses on the perspective of academic researchers, working in collaboration with the industry, aiming to develop a course recommender system that incorporates job market skill demands. In light of the job market's rapid changes and the current state of research in course recommender systems, we outline essential properties for course recommender systems to address these demands effectively: being explainable, sequential, unsupervised, and aligned with the job market and user's goals. Our discussion extends to the challenges and research questions this objective entails, including unsupervised skill extraction from job listings, course descriptions, and resumes, as well as predicting recommendations that align with learner objectives and the job market and designing metrics to evaluate this alignment. Furthermore, we introduce an initial system that addresses some existing limitations of course recommender systems using Large Language Models (LLMs) for skill extraction and Reinforcement Learning (RL) for alignment with the job market. We provide empirical results using open-source data to demonstrate its effectiveness.

Leave No Patient Behind: Enhancing Medication Recommendation for Rare Disease Patients

Zihao Zhao
Yi Jing
Fuli Feng
Jiancan Wu
Chongming Gao
Xiangnan He

Medication recommendation systems have gained significant attention in healthcare as a means of providing tailored and effective drug combinations based on patients' clinical information. However, existing approaches often suffer from fairness issues, as recommendations tend to be more accurate for patients with common diseases compared to those with rare conditions. In this paper, we propose a novel model called Robust and Accurate REcommendations for Medication (RAREMed), which leverages the pretrain-finetune learning paradigm to enhance accuracy for rare diseases. RAREMed employs a transformer encoder with a unified input sequence approach to capture complex relationships among disease and procedure codes. Additionally, it introduces two self-supervised pre-training tasks, namely Sequence Matching Prediction (SMP) and Self Reconstruction (SR), to learn specialized medication needs and interrelations among clinical codes. Experimental results on two real-world datasets demonstrate that RAREMed provides accurate drug sets for both rare and common disease patients, thereby mitigating unfairness in medication recommendation systems. The implementation is available via https://github.com/zzhUSTC2016/RAREMed.

MIRROR: A Multi-View Reciprocal Recommender System for Online Recruitment

Zhi Zheng
Xiao Hu
Shanshan Gao
Hengshu Zhu
Hui Xiong

Reciprocal Recommender Systems (RRSs), which aim to satisfy the preferences of both service providers and seekers simultaneously, have attracted significant research interest in recent years. Existing studies on RRSs mainly focus on modeling the bilateral interactions between the users on both sides to capture the user preferences. However, due to the presence of exposure bias, modeling user preferences solely based on bilateral interactions often lacks precision. Additionally, in RRSs, users may exhibit varying preferences when acting in different roles, and how to effectively model users from multiple perspectives remains a substantial problem. To solve the above challenges, in this paper, we propose a novel MultI-view Reciprocal Recommender system for Online Recruitment (MIRROR). Specifically, we first propose to model the users from three different views, namely the search, active, and passive views, and we further design several Transformer-based sequential models to capture the user representation corresponding to each view. Then, we propose to divide the bilateral matching process into three stages, namely apply, reply, and match, and a multi-stage output layer is designed based on the above multi-view modeling results. To train our MIRROR model, we first design a multi-task learning loss based on the multi-stage output results. Moreover, to bridge the semantic gap between search queries and user behaviors, we additionally design a supplementary task for next-query prediction. Finally, we conduct both offline experiments on five real-world datasets and online A/B tests, and the experiment results clearly validate the effectiveness of our MIRROR model compared with several state-of-the-art baseline methods.

MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation

Andreea Iana
Goran Glavaš
Heiko Paulheim

Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages. Moreover, nearly all news recommendation efforts assume monolingual news consumption, whereas more and more users tend to consume information in at least two languages. Accordingly, the existing body of work on news recommendation suffers from a lack of publicly available multilingual benchmarks that would catalyze development of news recommenders effective in multilingual settings and for low-resource languages. Aiming to fill this gap, we introduce xMIND, an open, multilingual news recommendation dataset derived from the English MIND dataset using machine translation, covering a set of 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. Using xMIND, we systematically benchmark several content-based neural news recommenders (NNRs) in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer scenarios, considering both monolingual and bilingual news consumption patterns. Our findings reveal that (i) current NNRs, even when based on a multilingual language model, suffer from substantial performance losses under ZS-XLT and that (ii) inclusion of target-language data in FS-XLT training has limited benefits, particularly when combined with bilingual news consumption. Our findings thus warrant a broader research effort in multilingual and cross-lingual news recommendation. We release xMIND at https://github.com/andreeaiana/xMIND.

MealRec+: A Meal Recommendation Dataset with Meal-Course Affiliation for Personalization and Healthiness

Ming Li
Lin Li
Xiaohui Tao
Jimmy Xiangji Huang

Meal recommendation, as a typical health-related recommendation task, contains complex relationships between users, courses, and meals. Among them, meal-course affiliation associates user-meal and user-course interactions. However, an extensive literature review demonstrates that there is a lack of publicly available meal recommendation datasets including meal-course affiliation. Meal recommendation research has been constrained in exploring the impact of cooperation between two levels of interaction on personalization and healthiness. To pave the way for meal recommendation research, we introduce a new benchmark dataset called MealRec+. Due to constraints related to user health privacy and meal scenario characteristics, the collection of data that includes both meal-course affiliation and two levels of interactions is impeded. Therefore, a simulation method is adopted to derive meal-course affiliation and user-meal interaction from the user's dining sessions simulated based on user-course interaction data. Then, two well-known nutritional standards are used to calculate the healthiness scores of meals. Moreover, we experiment with several baseline models, including separate and cooperative interaction learning methods. Our experiment demonstrates that cooperating the two levels of interaction in appropriate ways is beneficial for meal recommendations. The dataset is available on GitHub (https://github.com/WUT-IDEA/MealRecPlus).

SESSION: Session: Multilingual Retrieval

Negative Sampling Techniques for Dense Passage Retrieval in a Multilingual Setting

Thilina Chaturanga Rajapakse
Andrew Yates
Maarten de Rijke

The bi-encoder transformer architecture has become popular in open-domain retrieval, surpassing traditional sparse retrieval methods. Using hard negatives during training can improve the effectiveness of dense retrievers, and various techniques have been proposed to generate these hard negatives. We investigate the effectiveness of multiple negative sampling methods based on lexical methods (BM25), clustering, and periodically updated dense indices. We examine techniques that were introduced for finding hard negatives in a monolingual setting and reproduce them in a multilingual setting. We discover a gap amongst these techniques that we fill by proposing a novel clustered training method. Specifically, we focus on monolingual retrieval using multilingual dense retrievers across a broad set of diverse languages. We find that negative sampling based on BM25 negatives is surprisingly effective in an in-distribution setting, but this finding does not generalize to out-of-distribution and zero-shot settings, where the newly proposed method achieves the best results. We conclude with recommendations on which negative sampling methods may be the most effective given different multilingual retrieval scenarios.
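To make the notion of lexical hard negatives concrete, the sketch below scores passages with a small self-contained BM25 implementation and keeps the top-scoring passages that are not labeled relevant. It is an illustration under stated assumptions rather than the authors' pipeline: the idf formula is one common BM25 variant, the whitespace tokenizer is deliberately naive, and the function names, parameters (`k1`, `b`, `num_negatives`), and passages are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def hard_negatives(query, docs, positive_ids, num_negatives=2):
    """Indices of the highest-scoring non-relevant docs with lexical overlap."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked
            if i not in positive_ids and scores[i] > 0][:num_negatives]

# Hypothetical passages; index 0 is the labeled positive.
docs = [
    "the capital of france is paris",
    "the capital of texas is austin",
    "france won the world cup",
    "bananas are rich in potassium",
]
negatives = hard_negatives("capital of france", docs, positive_ids={0})
```

Passages that share query terms but are not relevant (indices 1 and 2 here) are selected, while the unrelated passage is skipped because it has no lexical overlap with the query.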

Steering Large Language Models for Cross-lingual Information Retrieval

Ping Guo
Yubing Ren
Yue Hu
Yanan Cao
Yunpeng Li
Heyan Huang

In today's digital age, accessing information across language barriers poses a significant challenge, with conventional search systems often struggling to interpret and retrieve multilingual content accurately. Addressing this issue, our study introduces a novel integration of applying Large Language Models (LLMs) as Cross-lingual Readers in information retrieval systems, specifically targeting the complexities of cross-lingual information retrieval (CLIR). We present an innovative approach: Activation Steered Multilingual Retrieval (ASMR) that employs "steering activations", a method to adjust and direct the LLM's focus, enhancing its ability to understand user queries and generate accurate, language-coherent responses. ASMR adeptly combines a Multilingual Dense Passage Retrieval (mDPR) system with an LLM, overcoming the limitations of traditional search engines in handling diverse linguistic inputs. This approach is particularly effective in managing the nuances and intricacies inherent in various languages. Rigorous testing on established benchmarks such as XOR-TyDi QA and MKQA demonstrates that ASMR not only meets but surpasses existing standards in CLIR, achieving state-of-the-art performance. The results of our research hold significant implications for understanding the inherent features of how LLMs understand and generate natural languages, offering an attempt towards more inclusive, effective, and linguistically diverse information access on a global scale.

Multilingual Meta-Distillation Alignment for Semantic Retrieval

Meryem M'hamdi
Jonathan May
Franck Dernoncourt
Trung Bui
Seunghyun Yoon

Multilingual semantic retrieval involves retrieving semantically relevant content to a query irrespective of the language. Compared to monolingual and bilingual semantic retrieval, multilingual semantic retrieval requires a stronger alignment approach to pull the contents to be retrieved close to the representation of their corresponding queries, no matter their language combinations. Traditionally, this is achieved through more supervision in the form of multilingual parallel resources, which are expensive to obtain, especially for low-resource languages. In this work, on top of an optimization-based Model-Agnostic Meta-Learner (MAML), we propose a data-efficient meta-distillation approach, MAML-Align, specifically for low-resource multilingual semantic retrieval. Our approach simulates a gradual feedback loop from monolingual to bilingual and from bilingual to multilingual semantic retrieval. We systematically compare multilingual meta-distillation learning to different baselines and conduct ablation studies on the role of different sampling approaches in the meta-task construction. We show that MAML-Align's gradual feedback loop boosts the generalization to different languages, including zero-shot ones, better than naive fine-tuning and vanilla MAML.

SESSION: Session: NLP

DAC: Quantized Optimal Transport Reward-based Reinforcement Learning Approach to Detoxify Query Auto-Completion

Aishwarya Maheswaran
Kaushal Kumar Maurya
Manish Gupta
Maunendra Sankar Desarkar

Modern Query Auto-Completion (QAC) systems utilize natural language generation (NLG) using large language models (LLMs) to achieve remarkable performance. However, these systems are prone to generating biased and toxic completions due to inherent learning biases. Existing detoxification approaches exhibit two key limitations: (1) They primarily focus on mitigating toxicity for grammatically well-formed long sentences but struggle to adapt to the QAC task, where queries are short and structurally different (include spelling errors, do not follow grammatical rules and have relatively flexible word order). (2) These approaches often view detoxification through a binary lens where all text labeled as toxic is undesirable, and non-toxic is considered desirable. To address these limitations, we propose DAC, an intuitive and efficient reinforcement learning-based model to detoxify QAC. With DAC, we introduce an additional perspective of considering the third query class of addressable toxicity. These queries can encompass implicit toxicity, subjective toxicity, or non-toxic queries containing toxic words. We incorporate this three-class query behavior perspective into the proposed model through quantized optimal transport to learn distinctions and generate truly non-toxic completions. We evaluate toxicity levels in the generated completions by DAC across two real-world QAC datasets (Bing and AOL) using two classifiers: a publicly available generic classifier (Detoxify) and a search query-specific classifier, which we develop (TClassify). We find that DAC consistently outperforms all existing baselines on the Bing dataset and achieves competitive performance on the AOL dataset for query detoxification. We make the code publicly available.

Enhanced Packed Marker with Entity Information for Aspect Sentiment Triplet Extraction

You Li
Xupeng Zeng
Yixiao Zeng
Yuming Lin

Aspect sentiment triplet extraction (ASTE) is an emerging sentiment analysis task that aims to extract sentiment triplets from review sentences. Each sentiment triplet consists of an aspect, the corresponding opinion, and the sentiment. Although extensive research has been conducted on the ASTE task, existing methods use span representations to predict the relationship between spans, failing to consider the interrelation between span pairs. On the other hand, early fusion of entity information is critical for sentiment classification. In this paper, we propose an Enhanced Packed Marker with Entity Information (EPMEI) framework for the ASTE task to address the above limitations of existing works. Specifically, EPMEI consists of entity recognition and sentiment classification models. The entity information is obtained from the entity recognition model first. After that, we insert solid markers with entity information at the input layer of the sentiment classification model to highlight the subject span and improve subject span representation. Furthermore, we introduce a subject-oriented packing strategy, which packs each subject span and all its levitated markers of object spans to model the interrelation between the same-subject span pairs. Extensive experimental results on four ASTE benchmark datasets demonstrate that EPMEI achieves state-of-the-art performance. Our code can be found at https://github.com/MKMaS-GUET/EPMEI.

Exogenous and Endogenous Data Augmentation for Low-Resource Complex Named Entity Recognition

Xinghua Zhang
Gaode Chen
Shiyao Cui
Jiawei Sheng
Tingwen Liu
Hongbo Xu

Low-resource Complex Named Entity Recognition aims to detect entities that take the form of any linguistic constituent under scenarios with limited manually annotated data. Existing studies augment the text through the substitution of same-type entities or language modeling, but suffer from lower quality and the limited entity context patterns within low-resource corpora. In this paper, we propose a novel data augmentation method, E²DA, from both exogenous and endogenous perspectives. For exogenous augmentation, we treat the limited manually annotated data as anchors, and leverage the powerful instruction-following capabilities of Large Language Models (LLMs) to expand the anchors by generating data that are highly dissimilar from the original anchor texts in terms of entity mentions and contexts. For endogenous augmentation, we explore diverse semantic directions in the implicit feature space of the original and expanded anchors for effective data augmentation. Our complementary augmentation method from two perspectives not only continuously expands the global text-level space, but also fully explores the local semantic space for more diverse data augmentation. Extensive experiments on 10 diverse datasets across various low-resource settings demonstrate that the proposed method significantly outperforms prior state-of-the-art data augmentation methods.

C-Pack: Packed Resources For General Chinese Embeddings

Shitao Xiao
Zheng Liu
Peitian Zhang
Niklas Muennighoff
Defu Lian
Jian-Yun Nie

We introduce C-Pack, a package of resources that significantly advances the field of general text embeddings for Chinese. C-Pack includes three critical resources. 1) C-MTP is a massive training dataset for text embedding, which is based on the curation of vast unlabeled corpora and the integration of high-quality labeled corpora. 2) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 3) BGE is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by more than 10% at the time of release. We also integrate and optimize the entire suite of training methods for BGE. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models also achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. Both the Chinese and English datasets are the largest public releases of training data for text embeddings. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims

Venktesh V
Abhijit Anand
Avishek Anand
Vinay Setty

With the growth of misinformation on the web, automated fact checking has garnered immense interest as a means of detecting misinformation and disinformation. Current systems have made significant advancements in handling synthetic claims sourced from Wikipedia, and noteworthy progress has been achieved in addressing real-world claims that are verified by fact-checking organizations as well. We compile and release QuanTemp, a diverse, multi-domain dataset focused exclusively on numerical claims, encompassing comparative, statistical, interval, and temporal aspects, with detailed metadata and an accompanying evidence collection. This addresses the challenge of verifying real-world numerical claims, which are complex and often lack precise information, a gap not filled by existing works that mainly focus on synthetic claims. We evaluate and quantify these gaps in existing solutions for the task of verifying numerical claims. We also evaluate claim-decomposition-based methods and numerical-understanding-based natural language inference (NLI) models; our best baseline achieves a macro-F1 of 58.32. This demonstrates that QuanTemp serves as a challenging evaluation set for numerical claim verification.

ACE-2005-PT: Corpus for Event Extraction in Portuguese

Luís Filipe Cunha
Purificação Silvano
Ricardo Campos
Alípio Jorge

Event extraction is an NLP task that commonly involves identifying the central word (trigger) for an event and its associated arguments in text. ACE-2005 is widely recognised as the standard corpus in this field. While other corpora, like PropBank, primarily focus on annotating predicate-argument structure, ACE-2005 provides comprehensive information about the overall event structure and semantics. However, its limited language coverage restricts its usability. This paper introduces ACE-2005-PT, a corpus created by translating ACE-2005 into Portuguese, with European and Brazilian variants. To speed up the process of obtaining ACE-2005-PT, we rely on automatic translators. This, however, poses some challenges related to automatically identifying the correct alignments between multi-word annotations in the original text and in the corresponding translated sentence. To address these challenges, we developed an alignment pipeline that incorporates several alignment techniques: lemmatization, fuzzy matching, synonym matching, multiple translations and a BERT-based word aligner. To measure the alignment effectiveness, a subset of annotations from the ACE-2005-PT corpus was manually aligned by an expert linguist. This subset was then compared against our pipeline results, which achieved exact and relaxed match scores of 70.55% and 87.55%, respectively. As a result, we successfully generated a Portuguese version of the ACE-2005 corpus, which has been accepted for publication by LDC.

SESSION: Session: Multimodal RecSys

Who To Align With: Feedback-Oriented Multi-Modal Alignment in Recommendation Systems

Yang Li
Qi'Ao Zhao
Chen Lin
Jinsong Su
Zhilin Zhang

Multi-modal Recommendation Systems (MRSs) utilize diverse modalities, such as image and text, to enrich item representations and enhance recommendation accuracy. Current MRSs overlook the large misalignment between multi-modal content features and ID embeddings. While bidirectional alignment between visual and textual modalities has been extensively studied in large multi-modal models, this study suggests that multi-modal alignment in MRSs should be in a one-way direction. A plug-and-play framework is presented, called FEedback-orienTed mulTi-modal aLignmEnt (FETTLE). FETTLE contains three novel solutions: (1) it automatically determines item-level alignment direction between each pair of modalities based on estimated user feedback; (2) it coordinates the alignment directions among multiple modalities; (3) it implements cluster-level alignment from both user and item perspectives for more stable alignments. Extensive experiments on three real datasets demonstrate that FETTLE significantly improves various backbone models. Conventional collaborative filtering models are improved by 24.79%-62.79%, and recent MRSs are improved by 5.91%-20.11%.

Multimodality Invariant Learning for Multimedia-Based New Item Recommendation

Haoyue Bai
Le Wu
Min Hou
Miaomiao Cai
Zhuangzhuang He
Yuyang Zhou
Richang Hong
Meng Wang

Multimedia-based recommendation provides personalized item suggestions by learning the content preferences of users. With the proliferation of digital devices and apps, a huge number of new items are created rapidly over time. Quickly providing recommendations for new items at inference time is challenging. What's worse, real-world items exhibit varying degrees of modality missing (e.g., many short videos are uploaded without text descriptions). Though many efforts have been devoted to multimedia-based recommendations, they either could not deal with new multimedia items or assumed modality completeness in the modeling process.

In this paper, we highlight the necessity of tackling the modality missing issue for new item recommendation. We argue that users' inherent content preference is stable and is better kept invariant to arbitrary modality missing environments. Therefore, we approach this problem from a novel perspective of invariant learning. However, constructing environments from finite user behavior training data that generalize to any modality missing scenario is challenging. To tackle this issue, we propose a novel Multimodality Invariant Learning reCommendation (a.k.a. MILK) framework. Specifically, MILK first designs a cross-modality alignment module to keep semantic consistency from pretrained multimedia item features. After that, MILK designs multi-modal heterogeneous environments with cyclic mixup to augment training data, in order to mimic any modality missing for invariant user preference learning. Extensive experiments on three real datasets verify the superiority of our proposed framework. The code is available at https://github.com/HaoyueBai98/MILK.

IISAN: Efficiently Adapting Multimodal Representation for Sequential Recommendation with Decoupled PEFT

Junchen Fu
Xuri Ge
Xin Xin
Alexandros Karatzoglou
Ioannis Arapakis
Jie Wang
Joemon M. Jose

Multimodal foundation models are transformative in sequential recommender systems, leveraging powerful representation learning capabilities. While Parameter-efficient Fine-tuning (PEFT) is commonly used to adapt foundation models for recommendation tasks, most research prioritizes parameter efficiency, often overlooking critical factors like GPU memory efficiency and training speed. Addressing this gap, our paper introduces IISAN (Intra- and Inter-modal Side Adapted Network for Multimodal Representation), a simple plug-and-play architecture using a Decoupled PEFT structure and exploiting both intra- and inter-modal adaptation.

IISAN matches the performance of full fine-tuning (FFT) and state-of-the-art PEFT. More importantly, it significantly reduces GPU memory usage, from 47 GB to just 3 GB, for multimodal sequential recommendation tasks. Additionally, it reduces training time per epoch from 443s to 22s compared to FFT. This is also a notable improvement over Adapter and LoRA, which require 37-39 GB of GPU memory and 350-380 seconds per epoch for training.

Furthermore, we propose a new composite efficiency metric, TPME (Training-time, Parameter, and GPU Memory Efficiency), to alleviate the prevalent misconception that "parameter efficiency represents overall efficiency". TPME provides more comprehensive insights into practical efficiency comparisons between different methods. In addition, we give an accessible efficiency analysis of all PEFT and FFT approaches, which demonstrates the superiority of IISAN. We release our code and other materials to facilitate future research. Code is available at https://github.com/GAIR-Lab/IISAN

EEG-SVRec: An EEG Dataset with User Multidimensional Affective Engagement Labels in Short Video Recommendation

Shaorun Zhang
Zhiyu He
Ziyi Ye
Peijie Sun
Qingyao Ai
Min Zhang
Yiqun Liu

In recent years, short video platforms have gained widespread popularity, making the quality of video recommendations crucial for retaining users. Existing recommendation systems primarily rely on behavioral data, which faces limitations when inferring user preferences due to issues such as data sparsity and noise from accidental interactions or personal habits. To address these challenges and provide a more comprehensive understanding of user affective experience and cognitive activity, we propose EEG-SVRec, the first EEG dataset with User Multidimensional Affective Engagement Labels in Short Video Recommendation. The study involves 30 participants and collects 3,657 interactions, offering a rich dataset that can be used for a deeper exploration of user preference and cognitive activity. By incorporating self-assessment techniques and real-time, low-cost EEG signals, we offer a more detailed understanding of user affective experiences (valence, arousal, immersion, interest, visual and auditory) and the cognitive mechanisms behind their behavior. We establish benchmarks for rating prediction by the recommendation algorithm, showing significant improvement with the inclusion of EEG signals. Furthermore, we demonstrate the potential of this dataset in gaining insights into the affective experience and cognitive activity behind user behaviors in recommender systems. This work presents a novel perspective for enhancing short video recommendation by leveraging the rich information contained in EEG signals and multidimensional affective engagement scores, paving the way for future research in short video recommendation systems.

Dataset and Models for Item Recommendation Using Multi-Modal User Interactions

Simone Borg Bruun
Krisztian Balog
Maria Maistro

While recommender systems with multi-modal item representations (image, audio, and text) have been widely explored, learning recommendations from multi-modal user interactions (e.g., clicks and speech) remains an open problem. We study the case of multi-modal user interactions in a setting where users engage with a service provider through multiple channels (website and call center). In such cases, incomplete modalities naturally occur, since not all users interact through all the available channels. To address these challenges, we publish a real-world dataset that allows progress in this under-researched area. We further present and benchmark various methods for leveraging multi-modal user interactions for item recommendations, and propose a novel approach that specifically deals with missing modalities by mapping user interactions to a common feature space. Our analysis reveals important interactions between the different modalities and shows that a frequently occurring modality can enhance learning from a less frequent one.

SESSION: Session: Retrieval Augmented Generation

The Power of Noise: Redefining Retrieval for RAG Systems

Florin Cuconasu
Giovanni Trappolini
Federico Siciliano
Simone Filice
Cesare Campagnano
Yoelle Maarek
Nicola Tonellotto
Fabrizio Silvestri

Retrieval-Augmented Generation (RAG) has recently emerged as a method to extend beyond the pre-trained knowledge of Large Language Models by augmenting the original prompt with relevant passages or documents retrieved by an Information Retrieval (IR) system. RAG has become increasingly important for Generative AI solutions, especially in enterprise settings or in any domain in which knowledge is constantly refreshed and cannot be memorized in the LLM. We argue here that the retrieval component of RAG systems, be it dense or sparse, deserves increased attention from the research community, and accordingly, we conduct the first comprehensive and systematic examination of the retrieval strategy of RAG systems. We focus, in particular, on the type of passages IR systems within a RAG solution should retrieve. Our analysis considers multiple factors, such as the relevance of the passages included in the prompt context, their position, and their number. One counter-intuitive finding of this work is that the retriever's highest-scoring documents that are not directly relevant to the query (e.g., do not contain the answer) negatively impact the effectiveness of the LLM. Even more surprising, we discovered that adding random documents in the prompt improves the LLM accuracy by up to 35%. These results highlight the need to investigate the appropriate strategies when integrating retrieval with LLMs, thereby laying the groundwork for future research in this area.

IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues

Diji Yang
Jinmeng Rao
Kezhen Chen
Xiaoyuan Guo
Yawen Zhang
Jie Yang
Yi Zhang

Although the Retrieval-Augmented Generation (RAG) paradigms can use external knowledge to enhance and ground the outputs of Large Language Models (LLMs) to mitigate generative hallucinations and static knowledge base problems, they still suffer from limited flexibility in adopting Information Retrieval (IR) systems with varying capabilities, constrained interpretability during the multi-round retrieval process, and a lack of end-to-end optimization. To address these challenges, we propose a novel LLM-centric approach, IM-RAG, that integrates IR systems with LLMs to support multi-round RAG through learning Inner Monologues (IM, i.e., the human inner voice that narrates one's thoughts). During the IM process, the LLM serves as the core reasoning model (i.e., Reasoner) to either propose queries to collect more information via the Retriever or to provide a final answer based on the conversational context. We also introduce a Refiner that improves the outputs from the Retriever, effectively bridging the gap between the Reasoner and IR modules with varying capabilities and fostering multi-round communications. The entire IM process is optimized via Reinforcement Learning (RL) where a Progress Tracker is incorporated to provide mid-step rewards, and the answer prediction is further separately optimized via Supervised Fine-Tuning (SFT). We conduct extensive experiments with the HotPotQA dataset, a popular benchmark for retrieval-based, multi-step question-answering. The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules as well as strong interpretability exhibited in the learned inner monologue.

Towards a Search Engine for Machines: Unified Ranking for Multiple Retrieval-Augmented Large Language Models

Alireza Salemi
Hamed Zamani

This paper introduces uRAG, a framework with a unified retrieval engine that serves multiple downstream retrieval-augmented generation (RAG) systems. Each RAG system consumes the retrieval results for a unique purpose, such as open-domain question answering, fact verification, entity linking, and relation extraction. We introduce a generic training guideline that standardizes the communication between the search engine and the downstream RAG systems that engage in optimizing the retrieval model. This lays the groundwork for us to build a large-scale experimentation ecosystem consisting of 18 RAG systems that engage in training and 18 unknown RAG systems that use uRAG as new users of the search engine. Using this experimentation ecosystem, we answer a number of fundamental research questions that improve our understanding of the promises and challenges in developing search engines for machines.

Optimization Methods for Personalizing Large Language Models through Retrieval Augmentation

Alireza Salemi
Surya Kallumadi
Hamed Zamani

This paper studies retrieval-augmented approaches for personalizing large language models (LLMs), which potentially have a substantial impact on various applications and domains. We propose the first attempt to optimize the retrieval models that deliver a limited number of personal documents to large language models for the purpose of personalized generation. We develop two optimization algorithms that solicit feedback from the downstream personalized generation tasks for retrieval optimization: one based on reinforcement learning, whose reward function is defined using an arbitrary metric for personalized generation, and another based on knowledge distillation from the downstream LLM to the retrieval model. This paper also introduces a pre- and post-generation retriever selection model that decides which retriever to choose for each LLM input. Extensive experiments on diverse tasks from the language model personalization (LaMP) benchmark reveal statistically significant improvements in six out of seven datasets.

FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation

Shuai Wang
Ekaterina Khramtsova
Shengyao Zhuang
Guido Zuccon

Federated search systems aggregate results from multiple search engines, selecting appropriate sources to enhance result quality and align with user intent. With the increasing uptake of Retrieval-Augmented Generation (RAG) pipelines, federated search can play a pivotal role in sourcing relevant information across heterogeneous data sources to generate informed responses. However, existing datasets, such as those developed in the past TREC FedWeb tracks, predate the RAG paradigm shift and lack representation of modern information retrieval challenges.

To bridge this gap, we present FeB4RAG, a novel dataset specifically designed for federated search within RAG frameworks. This dataset, derived from 16 sub-collections of the widely used BEIR benchmarking collection, includes 790 information requests (akin to conversational queries) tailored for chatbot applications, along with top results returned by each resource and associated LLM-derived relevance judgements. Additionally, to support the need for this collection, we demonstrate the impact on response generation of a high quality federated search system for RAG compared to a naive approach to federated search. We do so by comparing answers generated by the RAG pipeline with a qualitative side-by-side comparison. Our collection fosters and supports the development and evaluation of new federated search methods, especially in the context of RAG pipelines. The resource is publicly available at https://github.com/ielab/FeB4RAG.

SESSION: Session: Conversational IR and Recommendation

Dynamic Demonstration Retrieval and Cognitive Understanding for Emotional Support Conversation

Zhe Xu
Daoyuan Chen
Jiayi Kuang
Zihao Yi
Yaliang Li
Ying Shen

Emotional Support Conversation (ESC) systems are pivotal in providing empathetic interactions, aiding users through negative emotional states by understanding and addressing their unique experiences. In this paper, we tackle two key challenges in ESC: enhancing contextually relevant and empathetic response generation through dynamic demonstration retrieval, and advancing cognitive understanding to grasp implicit mental states comprehensively. We introduce Dynamic Demonstration Retrieval and Cognitive-Aspect Situation Understanding (D²RCU), a novel approach that synergizes these elements to improve the quality of support provided in ESCs. By leveraging in-context learning and persona information, we introduce an innovative retrieval mechanism that selects informative and personalized demonstration pairs. We also propose a cognitive understanding module that utilizes four cognitive relationships from the ATOMIC knowledge source to deepen situational awareness of help-seekers' mental states. Our supportive decoder integrates information from diverse knowledge sources, underpinning response generation that is both empathetic and cognitively aware. The effectiveness of D²RCU is demonstrated through extensive automatic and human evaluations, revealing substantial improvements over numerous state-of-the-art models, with up to 13.79% enhancement in overall performance across ten metrics. Our code is available for public access to facilitate further research and development.

Broadening the View: Demonstration-augmented Prompt Learning for Conversational Recommendation

Huy Dao
Yang Deng
Dung D. Le
Lizi Liao

Conversational Recommender Systems (CRSs) leverage natural language dialogues to provide tailored recommendations. Traditional methods in this field primarily focus on extracting user preferences from isolated dialogues. This often yields responses with a limited perspective, confined to the scope of individual conversations. Recognizing the potential in collective dialogue examples, our research proposes an expanded approach for CRS models, utilizing selective analogues from dialogue histories and responses to enrich both generation and recommendation processes. This introduces significant research challenges, including: (1) How to secure high-quality collections of recommendation dialogue exemplars? (2) How to effectively leverage these exemplars to enhance CRS models?

To tackle these challenges, we introduce a novel Demonstration-enhanced Conversational Recommender System (DCRS), which aims to strengthen its understanding of the given dialogue contexts by retrieving and learning from demonstrations. In particular, we first propose a knowledge-aware contrastive learning method that adeptly taps into the mentioned entities and the dialogue's contextual essence for pretraining the demonstration retriever. Subsequently, we develop two adaptive demonstration-augmented prompt learning approaches, involving contextualized prompt learning and knowledge-enriched prompt learning, to bridge the gap between the retrieved demonstrations and the two end tasks of CRS, i.e., response generation and item recommendation, respectively. Rigorous evaluations on two established benchmark datasets underscore DCRS's superior performance over existing CRS methods in both item recommendation and response generation.

Doing Personal LAPS: LLM-Augmented Dialogue Construction for Personalized Multi-Session Conversational Search

Hideaki Joko
Shubham Chatterjee
Andrew Ramsay
Arjen P. de Vries
Jeff Dalton
Faegheh Hasibi

The future of conversational agents will provide users with personalized information responses. However, a significant challenge in developing models is the lack of large-scale dialogue datasets that span multiple sessions and reflect real-world user preferences. Previous approaches rely on experts in a wizard-of-oz setup that is difficult to scale, particularly for personalized tasks. Our method, LAPS, addresses this by using large language models (LLMs) to guide a single human worker in generating personalized dialogues. This method has proven to speed up the creation process and improve quality. LAPS can collect large-scale, human-written, multi-session, and multi-domain conversations, including extracting user preferences. When compared to existing datasets, LAPS-produced conversations are as natural and diverse as expert-created ones, which stands in contrast to fully synthetic methods. The collected dataset is suited to train preference extraction and personalized response generation. Our results show that responses generated explicitly using extracted preferences better match users' actual preferences, highlighting the value of using extracted preferences over simple dialogue history. Overall, LAPS introduces a new method to leverage LLMs to create realistic personalized conversational data more efficiently and effectively than previous methods.

Towards Human-centered Proactive Conversational Agents

Yang Deng
Lizi Liao
Zhonghua Zheng
Grace Hui Yang
Tat-Seng Chua

Recent research on proactive conversational agents (PCAs) mainly focuses on improving the system's capabilities in anticipating and planning action sequences to accomplish tasks and achieve goals before users articulate their requests. This perspectives paper highlights the importance of moving towards building human-centered PCAs that emphasize human needs and expectations, and that consider the ethical and social implications of these agents, rather than solely focusing on technological capabilities. The distinction between a proactive and a reactive system lies in the proactive system's initiative-taking nature. Without thoughtful design, proactive systems risk being perceived as intrusive by human users. We address the issue by establishing a new taxonomy concerning three key dimensions of human-centered PCAs, namely Intelligence, Adaptivity, and Civility. We discuss potential research opportunities and challenges based on this new taxonomy across the five stages of PCA system construction. This perspectives paper lays a foundation for the emerging area of conversational information retrieval research and paves the way towards advancing human-centered proactive conversational systems.

TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants

Mohammad Aliannejadi
Zahra Abbasiantaeb
Shubham Chatterjee
Jeffrey Dalton
Leif Azzopardi

Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agents (CSAs). The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided as assessments on relevance, as well as additional assessments on generated responses over four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSAs to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations.

The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.

ProCIS: A Benchmark for Proactive Retrieval in Conversations

Chris Samarinas
Hamed Zamani

The field of conversational information seeking, which is rapidly gaining interest in both academia and industry, is changing how we interact with search engines through natural language interactions. Existing datasets and methods mostly evaluate reactive conversational information seeking systems that solely provide a response to every query from the user. We identify a gap in building and evaluating proactive conversational information seeking systems that can monitor a multi-party human conversation and proactively engage in the conversation at an opportune moment by retrieving useful resources and suggestions. In this paper, we introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations. We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments through depth-k pooling. We also collect annotations identifying the parts of the conversation that are related to each document, enabling us to evaluate proactive retrieval systems. We introduce normalized proactive discounted cumulative gain (npDCG) for evaluating these systems, and further provide benchmark results for a wide range of models, including a novel model we developed for this task. We believe that the developed dataset, called ProCIS, paves the path towards developing proactive conversational information seeking systems.

An Empirical Analysis on Multi-turn Conversational Recommender Systems

Lu Zhang
Chen Li
Yu Lei
Zhu Sun
Guanfeng Liu

The rise of conversational recommender systems (CRSs) has brought an evolution of the recommendation paradigm, enabling users to interact with the system and receive dynamic recommendations. As one essential branch, multi-turn CRSs, built on the user simulator paradigm, have attracted great attention due to their ability to accomplish recommendations without real dialogue resources. Recent multi-turn CRS models, equipped with various delicately designed components (e.g., conversation module), achieve state-of-the-art (SOTA) performance. We present, for the first time, a comprehensive experimental evaluation of existing SOTA multi-turn CRSs to investigate three research questions: (1) reproducibility - are the designed components beneficial to target multi-turn CRSs? (2) scenario-specific adaptability - how do these components perform in various scenarios? and (3) generality - can the effective components from the target CRS be effectively transferred to other multi-turn CRSs? To answer these questions, we design and conduct experiments under different settings, including carefully selected SOTA baselines, components of CRSs, datasets, and evaluation metrics, thus providing an experimental overview of multi-turn CRSs. As a result, we derive several significant insights that yield effective guidelines for future multi-turn CRS model designs across diverse scenarios.

SESSION: Session: Multimodal

UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching

Quanxing Zha
Xin Liu
Yiu-ming Cheung
Xing Xu
Nannan Wang
Jianjia Cao

Cross-modal matching has recently gained significant popularity for facilitating retrieval across multi-modal data, and existing works rely heavily on an implicit assumption that the training data pairs are perfectly aligned. However, such an ideal assumption rarely holds in practice due to inevitably mismatched data pairs, a.k.a. noisy correspondence, which can wrongly force the mismatched data to be similar and thus degrade performance. Although some recent methods have attempted to address this problem, they still face two challenging issues: 1) unreliable data division, which causes training inefficiency, and 2) unstable prediction, which causes matching failure. To address these problems, we propose an efficient Uncertainty-Guided Noisy Correspondence Learning (UGNCL) framework to achieve noise-robust cross-modal matching. Specifically, a novel Uncertainty Guided Division (UGD) algorithm is designed to leverage the potential benefits of derived uncertainty to reliably divide the data into clean, noisy and hard partitions, which can effortlessly mitigate the impact of easily-determined noisy pairs. Meanwhile, an efficient Trusted Robust Loss (TRL) is explicitly designed to recast the soft margins, calibrated by confident yet erroneous soft correspondence labels, for the data pairs in the hard partition through the uncertainty, thereby increasing/decreasing the importance of matched/mismatched pairs and further alleviating the impact of noisy pairs for improved robustness. Extensive experiments conducted on three public datasets highlight the advantages of the proposed framework and show its competitive performance compared with state-of-the-art methods. The code is available at https://github.com/qxzha/UGNCL.

Universal Adversarial Perturbations for Vision-Language Pre-trained Models

Peng-Fei Zhang
Zi Huang
Guangdong Bai

Vision-language pre-trained (VLP) models have been the foundation of numerous vision-language tasks. Given their prevalence, it becomes imperative to assess their adversarial robustness, especially when deploying them in security-crucial real-world applications. Traditionally, adversarial perturbations generated for this assessment target specific VLP models, datasets, and/or downstream tasks. This practice suffers from low transferability and additional computation costs when transitioning to new scenarios.

In this work, we thoroughly investigate whether VLP models are commonly sensitive to imperceptible perturbations of a specific pattern for the image modality. To this end, we propose a novel black-box method to generate Universal Adversarial Perturbations (UAPs), called the Effective and Transferable Universal Adversarial Attack (ETU), aiming to mislead a variety of existing VLP models in a range of downstream tasks. The ETU comprehensively takes into account the characteristics of UAPs and the intrinsic cross-modal interactions to generate effective UAPs. Under this regime, the ETU encourages both global and local utilities of UAPs. This benefits the overall utility while reducing interactions between UAP units, improving the transferability. To further enhance the effectiveness and transferability of UAPs, we also design a novel data augmentation method named ScMix. ScMix consists of self-mix and cross-mix data transformations, which can effectively increase the multi-modal data diversity while preserving the semantics of the original data. Through comprehensive experiments on various downstream tasks, VLP models, and datasets, we demonstrate that the proposed method is able to achieve effective and transferable universal adversarial attacks.

Semi-supervised Prototype Semantic Association Learning for Robust Cross-modal Retrieval

Junsheng Wang
Tiantian Gong
Yan Yan

Semi-supervised cross-modal retrieval (SS-CMR) aims at learning modality invariance and semantic discrimination from labeled data and unlabeled data, which is crucial for practical applications in the real world. The key to addressing the SS-CMR task is to solve the semantic association and modality heterogeneity problems. To address these issues, in this paper, we propose a novel semi-supervised cross-modal retrieval method, namely Semi-supervised Prototype Semantic Association Learning (SPAL) for robust cross-modal retrieval. To be specific, we employ shared semantic prototypes to associate labeled and unlabeled data over both modalities to minimize intra-class and maximize inter-class variations, thereby improving discriminative representations on unlabeled data. More importantly, we propose a novel pseudo-label guided contrastive learning to refine cross-modal representation consistency in the common space, which leverages pseudo-label semantic graph information to constrain cross-modal consistent representations. Meanwhile, multi-modal data inevitably suffers from the cost and difficulty of data collection, resulting in the incomplete multimodal data problem. Thus, to strengthen the robustness of the SS-CMR, we propose a novel prototype propagation method for incomplete data to reconstruct completed representations that preserve the semantic consistency. Extensive evaluations using several baseline methods across four benchmark datasets demonstrate the effectiveness of our method.

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

Xinwei Li
Li Lin
Shuai Wang
Chen Qian

Multimodal content generation, which leverages visual information to enhance cross-modal understanding, plays a critical role in Multimodal Information Retrieval. With the development of large language models (LLMs), recent research has adopted visual instruction tuning to inject the knowledge of LLMs into downstream multimodal tasks. The high complexity and great demand for resources urge researchers to study efficient distillation solutions to transfer the knowledge from pre-trained multimodal models (teachers) to more compact student models. However, the instruction tuning for knowledge distillation in multimodal LLMs is resource-intensive and capability-restricted. The comprehension of students is highly reliant on the teacher models. To address this issue, we propose a novel Multimodal Distillation Calibration framework (MmDC). The main idea is to generate high-quality training instances that challenge student models to comprehend and prompt the teacher to calibrate the knowledge transferred to students, ultimately cultivating a better student model in downstream tasks. This framework comprises two stages: (1) multimodal alignment and (2) knowledge distillation calibration. In the first stage, parameter-efficient fine-tuning is used to enhance feature alignment between different modalities. In the second stage, we develop a calibration strategy to assess the student model's capability and generate high-quality instances to calibrate knowledge distillation from teacher to student. The experiments on diverse datasets show that our framework efficiently improves the student model's capabilities. Our 7B-size student model, after three iterations of distillation calibration, outperforms the current state-of-the-art LLaVA-13B model on the ScienceQA and LLaVA Test datasets and also exceeds other strong baselines in a zero-shot setting.

M3oE: Multi-Domain Multi-Task Mixture-of Experts Recommendation Framework

Zijian Zhang
Shuchang Liu
Jiaao Yu
Qingpeng Cai
Xiangyu Zhao
Chunxu Zhang
Ziru Liu
Qidong Liu
Hongwei Zhao
Lantao Hu
Peng Jiang
Kun Gai

Multi-domain recommendation and multi-task recommendation have demonstrated their effectiveness in leveraging common information from different domains and objectives for comprehensive user modeling. Nonetheless, practical recommendation usually faces multiple domains and tasks simultaneously, which cannot be well addressed by current methods. To this end, we introduce M3oE, an adaptive Multi-domain Multi-task Mixture-of-Experts recommendation framework. M3oE integrates multi-domain information, maps knowledge across domains and tasks, and optimizes multiple objectives. We leverage three mixture-of-experts modules to learn common, domain-aspect, and task-aspect user preferences respectively to address the complex dependencies among multiple domains and tasks in a disentangled manner. Additionally, we design a two-level fusion mechanism for precise control over feature extraction and fusion across diverse domains and tasks. The framework's adaptability is further enhanced by applying an AutoML technique, which allows dynamic structure optimization. To the best of the authors' knowledge, our M3oE is the first effort to solve multi-domain multi-task recommendation self-adaptively. Extensive experiments on two benchmark datasets against diverse baselines demonstrate M3oE's superior performance. The implementation code is available to ensure reproducibility.
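For readers unfamiliar with the building block named in this abstract, the following is a minimal sketch of a generic softmax-gated mixture-of-experts layer in plain Python. It is not the M3oE architecture (the three modules, the two-level fusion, and the AutoML step are not reproduced); the toy experts and the gate weights are assumptions chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(x, experts, gate_weights):
    """Combine expert outputs with input-dependent softmax gates.

    x            : input feature vector (list of floats)
    experts      : list of callables, each mapping x to an output vector
    gate_weights : one weight vector per expert; the gate logit of an expert
                   is the dot product of its weight vector with x
    """
    logits = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    gates = softmax(logits)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]


# Two toy experts (hypothetical): one doubles the input, one negates it.
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]

# Zero gate weights give equal gates of 0.5 for both experts.
mixed = mixture_of_experts([1.0, 2.0], experts, [[0.0, 0.0], [0.0, 0.0]])
```

In a multi-domain multi-task setting, separate gates of this kind could in principle be conditioned on the domain or the task, which is one way to read the "domain-aspect" and "task-aspect" modules mentioned above.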

SESSION: Session: Graphs and RecSys 1

Hypergraph Convolutional Network for User-Oriented Fairness in Recommender Systems

Zhongxuan Han
Chaochao Chen
Xiaolin Zheng
Li Zhang
Yuyuan Li

The service system involves multiple stakeholders, making it crucial to ensure fairness. In this paper, we take the example of a typical service system, the recommender system, to investigate how to identify and tackle fairness issues within the service system. Recommender systems often exhibit bias towards a small user group, resulting in pronounced unfairness in recommendation performance, specifically the User-Oriented Fairness (UOF) issue. Existing research on UOF faces limitations in addressing two pivotal challenges: CH1: Current methods fall short in addressing the root cause of the UOF issue, stemming from an unfair training process between advantaged and disadvantaged users. CH2: Current methods struggle to unveil compelling correlations among users in sparse datasets. In this paper, we propose a novel Hypergraph Convolutional Network for User-Oriented Fairness, namely HyperUOF, to address the aforementioned challenges. HyperUOF serves as a versatile framework applicable to various backbone recommendation models for achieving UOF. To address CH1, HyperUOF employs an in-processing method that enhances the training process of disadvantaged users during model training. To address CH2, HyperUOF incorporates a hypergraph-based approach, proven effective in sparse datasets, to explore high-order correlations among users. We conduct extensive experiments on three real-world datasets based on four backbone recommendation models to prove the effectiveness of our proposed HyperUOF.

DHMAE: A Disentangled Hypergraph Masked Autoencoder for Group Recommendation

Yingqi Zhao
Haiwei Zhang
Qijie Bai
Changli Nie
Xiaojie Yuan

Group recommendation aims to suggest items to a group of users that are suitable for the group. Although some existing powerful deep learning models have achieved improved performance, various aspects remain unexplored: (1) Most existing models using contrastive learning tend to rely on high-quality data augmentation which requires precise contrastive view generation; (2) There is multifaceted natural noise in group recommendation, and additional noise is introduced during data augmentation; (3) Most existing hypergraph neural network-based models over-entangle the information of members and items, ignoring their unique characteristics. In light of this, we propose a highly effective Disentangled Hypergraph Masked Auto Encoder-enhanced method for group recommendation (DHMAE), combining a disentangled hypergraph neural network with a graph masked autoencoder. This approach creates self-supervised signals without data augmentation by masking the features of some nodes and hyperedges and then reconstructing them. For the noise problem, we design a masking strategy that relies on pre-computed degree-sensitive probabilities for the process of masking features. Furthermore, we propose a disentangled hypergraph neural network for group recommendation scenarios to extract common messages of members and items and disentangle them during the convolution process. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art models and effectively addresses the noise issue.

SESSION: Session: Recommendation Systems

Are We Really Achieving Better Beyond-Accuracy Performance in Next Basket Recommendation?

Ming Li
Yuanna Liu
Sami Jullien
Mozhdeh Ariannezhad
Andrew Yates
Mohammad Aliannejadi
Maarten de Rijke

Next basket recommendation (NBR) is a special type of sequential recommendation that is increasingly receiving attention. So far, most NBR studies have focused on optimizing the accuracy of the recommendation, whereas optimizing for beyond-accuracy metrics, e.g., item fairness and diversity, remains largely unexplored. Recent studies into NBR have found a substantial performance difference between recommending repeat items and explore items. Repeat items contribute most of the users' perceived accuracy compared with explore items.

Informed by these findings, we identify a potential "short-cut" to optimize for beyond-accuracy metrics while maintaining high accuracy. To leverage and verify the existence of such short-cuts, we propose a plug-and-play two-step repetition-exploration (TREx) framework that treats repeat items and explore items separately, where we design a simple yet highly effective repetition module to ensure high accuracy, while two exploration modules target optimizing only beyond-accuracy metrics.

Experiments are performed on two widely-used datasets w.r.t. a range of beyond-accuracy metrics, viz. five fairness metrics and three diversity metrics. Our experimental results show that: (i) we can achieve state-of-the-art performance w.r.t. accuracy via the designed repetition module in TREx; and (ii) the simple TREx framework achieves "better" beyond-accuracy performance than existing sophisticated methods. Prima facie, this appears to be good news: we can achieve high accuracy and improved beyond-accuracy metrics at the same time. However, we argue that the real-world value of our algorithmic solution, TREx, is likely to be limited and reflect on the reasonableness of the evaluation setup. We end up challenging existing evaluation paradigms, particularly in the context of beyond-accuracy metrics, and provide insights for researchers to navigate potential pitfalls and determine reasonable metrics to consider when optimizing for accuracy and beyond-accuracy metrics.

NFARec: A Negative Feedback-Aware Recommender Model

Xinfeng Wang
Fumiyo Fukumoto
Jin Cui
Yoshimi Suzuki
Dongjin Yu

Graph neural network (GNN)-based models have been extensively studied for recommendations, as they can accurately extract the high-order collaborative signals that high-quality recommender systems require. However, they neglect the valuable information gained through negative feedback in two aspects: (1) different users might hold opposite feedback on the same item, which hampers optimal information propagation in GNNs, and (2) even when an item vastly deviates from users' preferences, they might still choose it and provide a negative rating. In this paper, we propose a negative feedback-aware recommender model (NFARec) that maximizes the leverage of negative feedback. To transfer information to multi-hop neighbors along an optimal path effectively, NFARec adopts a feedback-aware correlation that guides hypergraph convolutions (HGCs) to learn users' structural representations. Moreover, NFARec incorporates an auxiliary task - predicting the feedback sentiment polarity (i.e., positive or negative) of the next interaction - based on the Transformer Hawkes Process. The task is beneficial for understanding users by learning the sentiment expressed in their previous sequential feedback patterns and predicting future interactions. Extensive experiments demonstrate that NFARec outperforms competitive baselines. Our source code and data are released at https://github.com/WangXFng/NFARec.

Behavior-Contextualized Item Preference Modeling for Multi-Behavior Recommendation

Mingshi Yan
Fan Liu
Jing Sun
Fuming Sun
Zhiyong Cheng
Yahong Han

In recommender systems, multi-behavior methods have demonstrated their effectiveness in mitigating issues like data sparsity, a common challenge in traditional single-behavior recommendation approaches. These methods typically infer user preferences from various auxiliary behaviors and apply them to the target behavior for recommendations. However, this direct transfer can introduce noise to the target behavior in recommendation, due to variations in user attention across different behaviors. To address this issue, this paper introduces a novel approach, Behavior-Contextualized Item Preference Modeling (BCIPM), for multi-behavior recommendation. Our proposed Behavior-Contextualized Item Preference Network discerns and learns users' specific item preferences within each behavior. It then considers only those preferences relevant to the target behavior for final recommendations, significantly reducing noise from auxiliary behaviors. These auxiliary behaviors are utilized solely for training the network parameters, thereby refining the learning process without compromising the accuracy of the target behavior recommendations. To further enhance the effectiveness of BCIPM, we adopt a strategy of pre-training the initial embeddings. This step is crucial for enriching the item-aware preferences, particularly in scenarios where data related to the target behavior is sparse. Comprehensive experiments conducted on four real-world datasets demonstrate BCIPM's superior performance compared to several leading state-of-the-art models, validating the robustness and efficiency of our proposed approach.

AutoDCS: Automated Decision Chain Selection in Deep Recommender Systems

Dugang Liu
Shenxian Xian
Yuhao Wu
Chaohua Yang
Xing Tang
Xiuqiang He
Zhong Ming

Multi-behavior recommender systems (MBRS) have been commonly deployed on real-world industrial platforms for their superior advantages in understanding user preferences and mitigating data sparsity. However, the cascade graph modeling paradigm adopted in mainstream MBRS usually assumes that users will refer to all types of behavioral knowledge they have when making decisions about target behaviors, i.e., use all types of behavioral interactions indiscriminately when modeling and predicting target behaviors for each user. We call this a full decision chain constraint and argue that it may be too strict by ignoring that different types of behavioral knowledge have varying importance for different users. In this paper, we propose a novel automated decision chain selection (AutoDCS) framework to relax this constraint, which can consider each user's unique decision dependencies and select a reasonable set of behavioral knowledge to activate for the prediction of target behavior. Specifically, AutoDCS first integrates some existing MBRS methods in a base cascade module to obtain a set of behavior-aware embeddings. Then, a bilateral matching gating mechanism is used to select an exclusive set of behaviors for the current user-item pair to form a decision chain, and the corresponding behavior-augmented embeddings are selectively activated. Subsequently, AutoDCS combines the behavior-augmented and original behavior-aware embeddings to predict the target behavior. Finally, we evaluate AutoDCS and demonstrate its effectiveness through experiments over four public multi-behavior benchmarks.

Adaptive In-Context Learning with Large Language Models for Bundle Generation

Zhu Sun
Kaidong Feng
Jie Yang
Xinghua Qu
Hui Fang
Yew-Soon Ong
Wenyuan Liu

Most existing bundle generation approaches fall short in generating fixed-size bundles. Furthermore, they often neglect the underlying user intents reflected by the bundles in the generation process, resulting in less intelligible bundles. This paper addresses these limitations through the exploration of two interrelated tasks, i.e., personalized bundle generation and the underlying intent inference, based on different user sessions. Inspired by the reasoning capabilities of large language models (LLMs), we propose an adaptive in-context learning paradigm, which allows LLMs to draw tailored lessons from related sessions as demonstrations, enhancing the performance on target sessions. Specifically, we first employ retrieval augmented generation to identify nearest neighbor sessions, and then carefully design prompts to guide LLMs in executing both tasks on these neighbor sessions. To tackle reliability and hallucination challenges, we further introduce (1) a self-correction strategy promoting mutual improvements of the two tasks without supervision signals and (2) an auto-feedback mechanism for adaptive supervision based on the distinct mistakes made by LLMs on different neighbor sessions. Thereby, the target session can gain customized lessons for improved performance by observing the demonstrations of its neighbor sessions. Experiments on three real-world datasets demonstrate the effectiveness of our proposed method.

EasyRL4Rec: An Easy-to-use Library for Reinforcement Learning Based Recommender Systems

Yuanqing Yu
Chongming Gao
Jiawei Chen
Heng Tang
Yuefeng Sun
Qian Chen
Weizhi Ma
Min Zhang

Reinforcement Learning (RL)-Based Recommender Systems (RSs) have gained rising attention for their potential to enhance long-term user engagement. However, research in this field faces challenges, including the lack of user-friendly frameworks, inconsistent evaluation metrics, and difficulties in reproducing existing studies. To tackle these issues, we introduce EasyRL4Rec, an easy-to-use code library designed specifically for RL-based RSs. This library provides lightweight and diverse RL environments based on five public datasets and includes core modules with rich options, simplifying model development. It provides unified evaluation standards focusing on long-term outcomes and offers tailored designs for state modeling and action representation for recommendation scenarios. Furthermore, we share our findings from insightful experiments with current methods. EasyRL4Rec seeks to facilitate the model development and experimental process in the domain of RL-based RSs. The library is available for public use.

SM-RS: Single- and Multi-Objective Recommendations with Contextual Impressions and Beyond-Accuracy Propensity Scores

Patrik Dokoupil
Ladislav Peska
Ludovico Boratto

Recommender systems (RS) rely on interaction data between users and items to generate effective results. Historically, RS aimed to deliver the most consistent (i.e., accurate) items to the trained user profiles. However, the attention towards additional (beyond-accuracy) quality criteria has increased tremendously in recent years. Both the research and applied models are being optimized for diversity, novelty, or fairness, to name a few. Naturally, the proper functioning of such optimization methods depends on the knowledge of users' propensities towards interacting with recommendations having certain quality criteria. However, so far, no dataset that captures such propensities exists.

To bridge this research gap, we present SM-RS (single-objective + multi-objective recommendations dataset) that links users' self-declared propensity toward relevance, novelty, and diversity criteria with impressions and corresponding item selections. After presenting the dataset's collection procedure and basic statistics, we propose three tasks that existing RS datasets rarely support: impressions-aware click prediction, users' propensity scores prediction, and construction of recommendations proportional to the users' propensity scores. For each task, we also provide detailed evaluation procedures and competitive baselines. The dataset is available at https://osf.io/hkzje/.

SESSION: Session: Users and Simulations

Modeling User Fatigue for Sequential Recommendation

Nian Li
Xin Ban
Cheng Ling
Chen Gao
Lantao Hu
Peng Jiang
Kun Gai
Yong Li
Qingmin Liao

Recommender systems filter out information that meets user interests. However, users may tire of recommendations that are too similar to the content they have been exposed to in a short historical period, which is the so-called user fatigue. Despite its significance for a better user experience, user fatigue is seldom explored by existing recommenders. In fact, there are three main challenges to be addressed for modeling user fatigue: what features support it, how it influences user interests, and how its explicit signals are obtained. In this paper, we propose to model user Fatigue in interest learning for sequential Recommendations (FRec). To address the first challenge, based on a multi-interest framework, we connect the target item with historical items and construct an interest-aware similarity matrix as features to support fatigue modeling. Regarding the second challenge, built upon feature cross, we propose a fatigue-enhanced multi-interest fusion to capture long-term interest. In addition, we develop a fatigue-gated recurrent unit for short-term interest learning, with temporal fatigue representations as important inputs for constructing update and reset gates. For the last challenge, we propose a novel sequence augmentation to obtain explicit fatigue signals for contrastive learning. We conduct extensive experiments on real-world datasets, including two public datasets and one large-scale industrial dataset. Experimental results show that FRec can improve AUC and GAUC by up to 0.026 and 0.019, respectively, compared with state-of-the-art models. Moreover, large-scale online experiments demonstrate the effectiveness of FRec for fatigue reduction. Our code is released at https://github.com/tsinghua-fib-lab/SIGIR24-FRec.
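To make the two reported metrics concrete, here is a small plain-Python sketch of AUC and of GAUC (group AUC) under one common convention: per-user AUC weighted by the number of that user's impressions, skipping users whose samples are all positive or all negative. The abstract does not state the exact GAUC definition used for FRec, so this weighting is an assumption, and the toy data are invented.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None when one class is missing.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC (assumed convention)."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted, total = 0.0, 0
    for ys, ss in groups.values():
        user_auc = auc(ys, ss)
        if user_auc is None:  # user has only clicks or only non-clicks
            continue
        weighted += user_auc * len(ys)
        total += len(ys)
    return weighted / total if total else None


# Toy data: u1 is ranked perfectly, u2 is ranked the wrong way round.
users = ["u1", "u1", "u1", "u2", "u2"]
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.6]

global_auc = auc(labels, scores)
group_auc = gauc(users, labels, scores)
```

GAUC only compares items shown to the same user, so it is insensitive to score offsets between users, which is why it is often reported alongside the global AUC.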

Characterizing Information Seeking Processes with Multiple Physiological Signals

Kaixin Ji
Danula Hettiachchi
Flora D. Salim
Falk Scholer
Damiano Spina

Information access systems are getting complex, and our understanding of user behavior during information seeking processes is mainly drawn from qualitative methods, such as observational studies or surveys. Leveraging the advances in sensing technologies, our study aims to characterize user behaviors with physiological signals, particularly in relation to cognitive load, affective arousal, and valence. We conduct a controlled lab study with 26 participants, and collect data including Electrodermal Activities, Photoplethysmogram, Electroencephalogram, and Pupillary Responses. This study examines informational search with four stages: the realization of Information Need (IN), Query Formulation (QF), Query Submission (QS), and Relevance Judgment (RJ). We also include different interaction modalities to represent modern systems, e.g., QS by text-typing or verbalizing, and RJ with text or audio information. We analyze the physiological signals across these stages and report outcomes of pairwise non-parametric repeated-measure statistical tests. The results show that participants experience significantly higher cognitive loads at IN with a subtle increase in alertness, while QF requires higher attention. QS involves more demanding cognitive loads than QF. Affective responses are more pronounced at RJ than QS or IN, suggesting greater interest and engagement as knowledge gaps are resolved. To the best of our knowledge, this is the first study that explores user behaviors in a search process employing a more nuanced quantitative analysis of physiological signals. Our findings offer valuable insights into user behavior and emotional responses in information seeking processes. We believe our proposed methodology can inform the characterization of more complex processes, such as conversational information seeking.

To Search or to Recommend: Predicting Open-App Motivation with Neural Hawkes Process

Zhongxiang Sun
Zihua Si
Xiao Zhang
Xiaoxue Zang
Yang Song
Hongteng Xu
Jun Xu

Incorporating Search and Recommendation (S&R) services within a single application is prevalent in online platforms, leading to a new task termed open-app motivation prediction, which aims to predict whether users initiate the application with the specific intent of information searching, or to explore recommended content for entertainment. Studies have shown that predicting users' motivation to open an app can help to improve user engagement and enhance performance in various downstream tasks. However, accurately predicting open-app motivation is not trivial, as it is influenced by user-specific factors, search queries, clicked items, as well as their temporal occurrences. Furthermore, these activities occur sequentially and exhibit intricate temporal dependencies. Inspired by the success of the Neural Hawkes Process (NHP) in modeling temporal dependencies in sequences, this paper proposes a novel neural Hawkes process model to capture the temporal dependencies between historical user browsing and querying actions. The model, referred to as Neural Hawkes Process-based Open-App Motivation prediction model (NHP-OAM), employs a hierarchical transformer and a novel intensity function to encode multiple factors, and an open-app motivation prediction layer to integrate time and user-specific information for predicting users' open-app motivations. To demonstrate the superiority of our NHP-OAM model and construct a benchmark for the Open-App Motivation Prediction task, we not only extend the public S&R dataset ZhihuRec but also construct a new real-world Open-App Motivation Dataset (OAMD). Experiments on these two datasets validate NHP-OAM's superiority over baseline models. Further downstream application experiments demonstrate NHP-OAM's effectiveness in predicting users' Open-App Motivation, highlighting the immense application value of NHP-OAM.
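As background for the term "intensity function", the sketch below computes the intensity of a classical (non-neural) univariate Hawkes process with an exponential kernel: a base rate plus a decaying contribution from every past event. This is the textbook form, not the novel intensity function of NHP-OAM, which the abstract does not specify; the parameter values `mu`, `alpha`, and `beta` are arbitrary illustrative choices.

```python
import math


def hawkes_intensity(t, history, mu=0.1, alpha=0.5, beta=1.0):
    """Classical Hawkes intensity with an exponential kernel.

    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))

    t       : time at which the intensity is evaluated
    history : timestamps of past events (e.g. earlier searches or clicks)
    mu      : base rate; alpha: jump per event; beta: decay rate
    """
    return mu + sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in history if t_i < t
    )


# One past event at time 0: the excitation decays back toward the base rate.
soon_after = hawkes_intensity(1.0, [0.0])
much_later = hawkes_intensity(5.0, [0.0])
```

Neural variants replace the fixed exponential kernel with a learned function of the event history, which lets past actions of different types raise or lower the likelihood of the next one.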

UniSAR: Modeling User Transition Behaviors between Search and Recommendation

Teng Shi
Zihua Si
Jun Xu
Xiao Zhang
Xiaoxue Zang
Kai Zheng
Dewei Leng
Yanan Niu
Yang Song

Nowadays, many platforms provide users with both search and recommendation services as important tools for accessing information. The phenomenon has led to a correlation between user search and recommendation behaviors, providing an opportunity to model user interests in a fine-grained way. Existing approaches either model user search and recommendation behaviors separately or overlook the different transitions between user search and recommendation behaviors. In this paper, we propose a framework named UniSAR that effectively models the different types of fine-grained behavior transitions for providing users a Unified Search And Recommendation service. Specifically, UniSAR models the user transition behaviors between search and recommendation through three steps: extraction, alignment, and fusion, which are respectively implemented by transformers equipped with pre-defined masks, contrastive learning that aligns the extracted fine-grained user transitions, and cross-attentions that fuse different transitions. To provide users with a unified service, the learned representations are fed into the downstream search and recommendation models. Joint learning on both search and recommendation data is employed to utilize the knowledge and enhance each other. Experimental results on two public datasets demonstrated the effectiveness of UniSAR in terms of enhancing both search and recommendation simultaneously. The experimental analysis further validates that UniSAR enhances the results by successfully modeling the user transition behaviors between search and recommendation.

SESSION: Session: Explainability in Search and Recommendation

Explainability for Transparent Conversational Information-Seeking

Weronika Łajewska
Damiano Spina
Johanne Trippas
Krisztian Balog

The increasing reliance on digital information necessitates advancements in conversational search systems, particularly in terms of information transparency. While prior research in conversational information-seeking has concentrated on improving retrieval techniques, the challenge remains in generating responses useful from a user perspective. This study explores different methods of explaining the responses, hypothesizing that transparency about the source of the information, system confidence, and limitations can enhance users' ability to objectively assess the response. By exploring transparency across explanation type, quality, and presentation mode, this research aims to bridge the gap between system-generated responses and responses verifiable by the user. We design a user study to answer questions concerning the impact of (1) the quality of explanations enhancing the response on its usefulness and (2) ways of presenting explanations to users. The analysis of the collected data reveals lower user ratings for noisy explanations, although these scores seem insensitive to the quality of the response. Inconclusive results on the explanations presentation format suggest that it may not be a critical factor in this setting.

Evaluating Search System Explainability with Psychometrics and Crowdsourcing

Catherine Chen
Carsten Eickhoff

As information retrieval (IR) systems, such as search engines and conversational agents, become ubiquitous in various domains, the need for transparent and explainable systems grows to ensure accountability, fairness, and unbiased results. Despite recent advances in explainable AI and IR techniques, there is no consensus on the definition of explainability. Existing approaches often treat it as a singular notion, disregarding the multidimensional definition postulated in the literature. In this paper, we use psychometrics and crowdsourcing to identify human-centered factors of explainability in Web search systems and introduce SSE (Search System Explainability), an evaluation metric for explainable IR (XIR) search systems. In a crowdsourced user study, we demonstrate SSE's ability to distinguish between explainable and non-explainable systems, showing that systems with higher scores indeed indicate greater interpretability. We hope that aside from these concrete contributions to XIR, this line of work will serve as a blueprint for similar explainability evaluation efforts in other domains of machine learning and natural language processing.

Sequential Recommendation with Collaborative Explanation via Mutual Information Maximization

Yi Yu
Kazunari Sugiyama
Adam Jatowt

Current research on explaining sequential recommendations lacks reliable benchmarks and quantitative metrics, making it difficult to compare explanation performance between different models. In this work, we introduce a new explanation type, namely collaborative explanation, into sequential recommendation, allowing a unified approach for modeling user actions and assessing the performance of both recommendation and explanation. We accomplish this by framing the problem as a joint sequential prediction task, which takes a sequence of a user's past item-explanation pairs and predicts the next item along with its associated explanation. We propose a pipeline that comprises data preparation and a model adaptation framework called Sequential recommendation with Collaborative Explanation (SCE). This framework can be flexibly applied to any sequential recommendation model for this problem. Furthermore, to address the issue of inconsistency between item and explanation representations when learning both sub-tasks, we propose Sequential recommendation with Collaborative Explanation via Mutual Information Maximization (SCEMIM). Our extensive experiments demonstrate that: (i) the SCE framework is effective in enabling sequential models to make recommendations and provide accurate explanations; (ii) importantly, SCEMIM enhances the consistency between recommendations and explanations, leading to further improvements in the performance of both sub-tasks.

SESSION: Domain Specific

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Soumyadeep Roy
Aparup Khatua
Fatemeh Ghoochani
Uwe Hadler
Wolfgang Nejdl
Niloy Ganguly

GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, an error rate of around 14% remains. Additionally, current works use GPT-4 only to predict the correct option without providing any explanation, and thus do not provide any insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. These responses are quite long (258 words on average), containing detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 out of these 919 incorrect data points at a granular level for different classes and created a multi-label span to identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized as a "Reasonable response by GPT-4" by annotators. This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that it will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.

Hierarchical Semantics Alignment for 3D Human Motion Retrieval

Yang Yang
Haoyu Shi
Huaiwen Zhang

Text to 3D human Motion Retrieval (TMR) is a challenging task in information retrieval, aiming to query relevant motion sequences with a natural language description. The conventional approach for TMR is to represent the data instances as point embeddings for alignment. However, in real-world scenarios, multiple motions often co-occur and superimpose on a single avatar. Simply aggregating text and motion sequences into a single global embedding may be inadequate for capturing the intricate semantics of superimposing motions. In addition, most of the motion variations occur locally and subtly, which further presents considerable challenges in precisely aligning motion sequences with their corresponding text. To address the aforementioned challenges, we propose a novel Hierarchical Semantics Alignment (HSA) framework for text-to-3D human motion retrieval. Beyond global alignment, we propose the Probabilistic-based Distribution Alignment (PDA) and a Descriptors-based Fine-grained Alignment (DFA) to achieve precise semantic matching. Specifically, the PDA encodes the text and motion sequences into multidimensional probabilistic distributions, effectively capturing the semantics of superimposing motions. By optimizing the problem of probabilistic distribution alignment, PDA achieves a precise match between superimposing motions and their corresponding text. The DFA first applies fine-grained feature gating, selectively retaining the significant and representative local representations while excluding interference from meaningless features. Then we adaptively assign local representations from text and motion into a set of cross-modal local aggregated descriptors, enabling local comparison and interaction between fine-grained text and motion features. Extensive experiments on two widely used benchmark datasets, HumanML3D and KIT-ML, demonstrate the effectiveness of the proposed method. It significantly outperforms existing state-of-the-art retrieval methods, achieving Rsum improvements of 24.74% on HumanML3D and 23.08% on KIT-ML.

Enhancing Dataset Search with Compact Data Snippets

Qiaosheng Chen
Jiageng Chen
Xiao Zhou
Gong Cheng

In light of the growing availability and significance of open data, the problem of dataset search has attracted great attention in the field of information retrieval. Nevertheless, current metadata-based approaches have revealed shortcomings due to the low quality and availability of dataset metadata, while the magnitude and heterogeneity of actual data hindered the development of content-based solutions. To address these challenges, we propose to convert different formats of structured data into a unified form, from which we extract a compact data snippet that indicates the relevance of the whole data. Thanks to its compactness, we feed it into a dense reranker to improve search accuracy. We also convert it back to the original format to be presented for assisting users in relevance judgment. The effectiveness of our approach has been demonstrated by extensive experiments on two test collections for dataset search.

When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications

Qidong Liu
Xian Wu
Xiangyu Zhao
Yuanshao Zhu
Derong Xu
Feng Tian
Yefeng Zheng

The recent surge in Large Language Models (LLMs) has garnered significant attention across numerous fields. Fine-tuning is often required to fit general LLMs for a specific domain, like the web-based healthcare system. However, two problems arise when fine-tuning LLMs for medical applications. One is the task variety problem, which involves distinct tasks in real-world medical scenarios. The variety often leads to sub-optimal fine-tuning due to data imbalance and seesaw problems. Besides, the large number of parameters in LLMs makes fine-tuning costly in time and computation. To address these two problems, we propose a novel parameter efficient fine-tuning framework for multi-task medical applications, dubbed MOELoRA. The designed framework aims to absorb both the benefits of mixture-of-expert (MOE) for multi-task learning and low-rank adaptation (LoRA) for parameter efficient fine-tuning. For unifying MOE and LoRA, we devise multiple experts as the trainable parameters, where each expert consists of a pair of low-rank matrices to retain the small size of trainable parameters. Then, a task-motivated gate function for all MOELoRA layers is proposed, which can control the contributions of each expert and produce distinct parameters for various tasks. We conduct experiments on a multi-task medical dataset, indicating MOELoRA outperforms the existing parameter efficient fine-tuning methods. The code is available online.
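
The combination of experts and low-rank matrices described above can be sketched as follows. This is a minimal illustration, assuming each expert contributes an update B_i(A_i x) and that the gate is a softmax over per-task logits; the gate form, function names, and toy matrices are assumptions for illustration, not the authors' implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_lora_delta(x, experts, gate_logits):
    """Gate-weighted sum of low-rank updates: sum_i w_i * B_i (A_i x).

    experts: list of (A, B) pairs, A of shape r x d_in, B of shape d_out x r.
    gate_logits: one logit per expert for the current task (assumed gate).
    """
    weights = softmax(gate_logits)
    d_out = len(experts[0][1])
    delta = [0.0] * d_out
    for w, (A, B) in zip(weights, experts):
        update = matvec(B, matvec(A, x))  # rank-r update from one expert
        delta = [d + w * u for d, u in zip(delta, update)]
    return delta


# Two toy rank-1 experts on a 2-dimensional input.
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # passes the first coordinate through
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # passes the second coordinate through
]
print(moe_lora_delta([2.0, 4.0], experts, gate_logits=[0.0, 0.0]))
```

Changing the gate logits per task shifts weight between experts, so the same set of low-rank matrices yields a different effective update for each task.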

Resources for Combining Teaching and Research in Information Retrieval Coursework

Maik Fröbe
Harrisen Scells
Theresa Elstner
Christopher Akiki
Lukas Gienapp
Jan Heinrich Reimer
Sean MacAvaney
Benno Stein
Matthias Hagen
Martin Potthast

The first International Workshop on Open Web Search (WOWS) was held on Thursday, March 28th, at ECIR 2024 in Glasgow, UK. The full-day workshop had two calls for contributions: the first call aimed at scientific contributions to building, operating, and evaluating search engines cooperatively and the cooperative use of the web as a resource for researchers and innovators. The second call for implementations of retrieval components aimed to gain practical experience with joint, cooperative evaluation of search engines and their components. In total, 2 papers were accepted for the first call, and 11 software components were submitted for the second. The workshop ended with breakout sessions on how the OpenWebSearch.eu project can incorporate collaborative evaluations and a hub of search engines.

OEHR: An Orthopedic Electronic Health Record Dataset

Yibo Xie
Kaifan Wang
Jiawei Zheng
Feiyan Liu
Xiaoli Wang
Guofeng Huang

During the past decades, healthcare institutions continually amassed clinical data that is not intended to support research. Despite the increasing number of publicly available electronic health record (EHR) datasets, it is difficult to find publicly available datasets in Orthopedics that can be used to compare and evaluate downstream tasks. This paper presents OEHR, a healthcare benchmark dataset in Orthopedics, sourced from the EHR of real hospitals. Information available includes patient measurements, diagnoses, treatments, clinical notes, and medical images. OEHR is intended to support clinical research. To evaluate the quality of OEHR, we conduct extensive experiments by implementing state-of-the-art methods for performing downstream tasks. The results show that OEHR serves as a valuable extension to existing publicly available EHR datasets. The dataset is available at http://47.94.174.82/.

SuicidEmoji: Derived Emoji Dataset and Tasks for Suicide-Related Social Content

Tianlin Zhang
Kailai Yang
Shaoxiong Ji
Boyang Liu
Qianqian Xie
Sophia Ananiadou

Early suicidal ideation detection using social media is crucial for mental health surveillance. Simultaneously, emojis from the posts can help us better understand users' emotions and predict mental health conditions. However, research in emoji-based suicide analysis remains underexplored, with few resources available, which can restrict the development of studying emoji usage patterns among users with suicidal ideation. In this work, we build a derived suicide-related emoji dataset named SuicidEmoji, which contains 25k emoji posts (2,329 suicide-related posts and 22,722 posts for the control group users) filtered from about 1.3 million crawled Reddit data. To the best of our knowledge, SuicidEmoji is the first suicide-related emoji dataset. Based on SuicidEmoji, we propose two novel tasks: emoji-aware suicidal ideation detection and emoji prediction, for which we build two benchmark subdatasets from SuicidEmoji to evaluate the performance of advanced methods including pre-trained language models (PLMs) and large language models (LLMs). We analyze the experimental results of two PLMs and the highly capable LLMs, which reveal the significance and challenges of emoji-based suicide-related NLP tasks. The dataset is available at https://github.com/TianlinZhang668/SuicidEmoji.

A Large Scale Test Corpus for Semantic Table Search

Aristotelis Leventidis
Martin Pekár Christensen
Matteo Lissandrini
Laura Di Rocco
Katja Hose
Renée J. Miller

Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle-in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. Instead, table search tasks often arise in response to the need for retrieving new datasets or augmenting existing ones, e.g., for data augmentation within data science or machine learning pipelines. Existing table repositories and benchmarks are limited in their ability to test retrieval methods for table search tasks. Thus, to close this gap, we introduce a novel dataset for query-by-example Semantic Table Search. This novel dataset consists of two snapshots of the large-scale Wikipedia tables collection from 2013 and 2019 with two important additions: (1) a page and topic aware ground truth relevance judgment and (2) a large-scale DBpedia entity linking annotation. Moreover, we generate a novel set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource consists of 9,296 novel queries, 610,553 query-table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot. Similarly, on the 2019 snapshot, the resource consists of 2,560 queries, 958,214 relevance annotations, and 457,714 total tables. This makes our resource the largest annotated table-search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). We perform a user study among domain experts and prove that these annotators agree with the automatically generated relevance annotations. As a result, we can re-evaluate some basic assumptions behind existing table search approaches identifying their shortcomings along with promising novel research directions.

JDivPS: A Diversified Product Search Dataset

Zhirui Deng
Zhicheng Dou
Yutao Zhu
Xubo Qin
Pengchao Cheng
Jiangxu Wu
Hao Wang

The diversification of product search aims to offer diverse products to satisfy different user intents. Existing diversified product search approaches have mainly relied on datasets sourced from online platforms. However, these datasets often present challenges due to their restricted public access and the absence of manually labeled user intents. Such limitations may lead to irreproducible experimental results and unreliable conclusions, restricting the development of this field. To address these problems, this paper introduces a novel dataset JDivPS for diversified product search. To the best of our knowledge, JDivPS is the first publicly accessible dataset with human-annotated user intents. The dataset is collected from JD.com, a major Chinese e-commerce platform. It includes 10,000 queries, around 1,680,000 unique products, and an average of 10 human-labeled user intents for each query. We have extensively evaluated several diversified ranking models using the JDivPS dataset. The results of these models are recorded and presented, serving as a valuable benchmark for future research. More details about the dataset can be found at https://github.com/DengZhirui/JDivPS.

An E-Commerce Dataset Revealing Variations during Sales

Jianfu Zhang
Qingtao Yu
Yizhou Chen
Guoliang Zhou
Yawen Liu
Yawei Sun
Chen Liang
Guangda Huzhang
Yabo Ni
Anxiang Zeng
Han Yu

Since the development of artificial intelligence technology, E-Commerce has gradually become one of the world's largest commercial markets. Within this domain, sales events, which are based on sociological mechanisms, play a significant role. E-Commerce platforms frequently offer sales and promotions to encourage users to purchase items, leading to significant changes in live environments. Learning-To-Rank (LTR) is a crucial component of E-Commerce search and recommendations, and substantial efforts have been devoted to this area. However, existing methods often assume an independent and identically distributed data setting, which does not account for the evolving distribution of online systems beyond online finetuning strategies. This limitation can lead to inaccurate predictions of user behaviors during sales events, resulting in significant loss of revenue. In addition, models must readjust themselves once sales have concluded in order to eliminate any effects caused by the sales events, leading to further regret. To address these limitations, we introduce a long-term E-Commerce search data set specifically designed to incubate LTR algorithms during such sales events, with the objective of advancing the capabilities of E-Commerce search engines. Our investigation focuses on typical industry practices and aims to identify potential solutions to address these challenges.

LADy ᖡ: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Farinam Hemmatizadeh
Christine Wong
Alice Yu
Hossein Fani

We present LADy ᖡ, a Python-based benchmark toolkit to facilitate extracting aspects of products or services in reviews toward which customers target their opinions and sentiments. While there has been a significant increase in aspect-based sentiment analysis, the proposed methods' practical implications in real-world settings remain moot because of their closed and irreproducible codebases, inability to accommodate datasets from various domains, and poor evaluation methodologies. LADy is an open-source benchmark toolkit with a standard pipeline and experimental details to fill the gaps. It incorporates a host of canonical models along with benchmark datasets from varying domains, including unsolicited online reviews. Leveraging an object-oriented design, LADy readily extends to new models and training datasets. The first of its kind, LADy also features review augmentation via natural language backtranslation that can be integrated into the training phase of the models to boost efficiency and improve efficacy during inference. LADy's codebase, along with the installation instructions and case studies on five datasets for seven methods with backtranslation augmentation over ten languages, can be obtained under cc-by-nc-sa-4.0 license at https://github.com/fani-lab/LADy.

SESSION: CTR, Ads and Click Models

DDPO: Direct Dual Propensity Optimization for Post-Click Conversion Rate Estimation

Hongzu Su
Lichao Meng
Lei Zhu
Ke Lu
Jingjing Li

In online advertising, the sample selection bias problem is a major cause of inaccurate conversion rate estimates. Current mainstream solutions only perform causality-based optimization in the click space since the conversion labels in the non-click space are absent. However, optimization for unclicked samples is equally essential because the non-click space contains more samples and user characteristics than the click space. To exploit the unclicked samples, we propose a Direct Dual Propensity Optimization (DDPO) framework to optimize the model directly in impression space with both clicked and unclicked samples. In this framework, we specifically design a click propensity network and a conversion propensity network. The click propensity network is dedicated to ensuring that optimization in the click space is unbiased. The conversion propensity network is designed to generate pseudo-conversion labels for unclicked samples, thus overcoming the challenge of absent labels in non-click space. With these two propensity networks, we are able to perform causality-based optimization in both click space and non-click space. In addition, to strengthen the causal relationship, we design two causal transfer modules for the conversion rate prediction model with the attention mechanism. The proposed framework is evaluated on five real-world public datasets and one private Tencent advertising dataset. Experimental results verify that our method is able to improve the prediction performance significantly. For instance, our method outperforms the previous state-of-the-art method by 7.0% in terms of the Area Under the Curve on the Ali-CCP dataset.
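
To make the role of the click propensity concrete, the sketch below computes a generic inverse-propensity-weighted (IPW) conversion loss over the impression space, where only clicked impressions carry observed conversion labels. It illustrates the standard IPW idea that click-space debiasing builds on; it does not reproduce DDPO's pseudo-label branch for unclicked samples, and all names and numbers are illustrative assumptions.

```python
import math


def log_loss(label, pred, eps=1e-7):
    """Binary cross-entropy for one sample, with clipping for stability."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def ipw_cvr_loss(clicks, conversions, cvr_preds, click_propensities):
    """Average over all impressions of click * loss / click propensity.

    Unclicked impressions contribute zero because their conversion
    labels are unobserved; clicked ones are up-weighted by 1 / propensity.
    """
    total = 0.0
    for clicked, label, pred, propensity in zip(
        clicks, conversions, cvr_preds, click_propensities
    ):
        if clicked:
            total += log_loss(label, pred) / propensity
    return total / len(clicks)


# Two hypothetical impressions: one clicked and converted, one unclicked.
loss = ipw_cvr_loss(
    clicks=[1, 0],
    conversions=[1, 0],
    cvr_preds=[0.5, 0.9],
    click_propensities=[0.5, 0.2],
)
print(loss)
```

A clicked impression that was unlikely to be clicked receives a larger weight, which compensates for the under-representation of similar impressions in the click space.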

Deep Pattern Network for Click-Through Rate Prediction

Hengyu Zhang
Junwei Pan
Dapeng Liu
Jie Jiang
Xiu Li

Click-through rate (CTR) prediction plays a pivotal role in real-world applications, particularly in recommendation systems and online advertising. A significant research branch in this domain focuses on user behavior modeling. Current research predominantly centers on modeling co-occurrence relationships between the target item and items previously interacted with by users. However, this focus neglects the intricate modeling of user behavior patterns. In reality, the abundance of user interaction records encompasses diverse behavior patterns, indicative of a spectrum of habitual paradigms. These patterns harbor substantial potential to significantly enhance CTR prediction performance. To harness the informational potential within behavior patterns, we extend Target Attention (TA) to Target Pattern Attention (TPA) to model pattern-level dependencies. Furthermore, three critical challenges demand attention: the inclusion of unrelated items within patterns, data sparsity of patterns, and computational complexity arising from numerous patterns. To address these challenges, we introduce the Deep Pattern Network (DPN), designed to comprehensively leverage information from behavior patterns. DPN efficiently retrieves target-related behavior patterns using a target-aware attention mechanism. Additionally, it contributes to refining patterns through a pre-training paradigm based on self-supervised learning while promoting dependency learning within sparse patterns. Our comprehensive experiments, conducted across three public datasets, substantiate the superior performance and broad compatibility of DPN.
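
As background for the Target Attention mechanism that the abstract extends, the sketch below pools a user's behavior embeddings with weights derived from their similarity to the target item. It is a simplified softmax dot-product variant chosen for illustration; the scoring function, names, and toy embeddings are assumptions, and the pattern-level attention (TPA) of the paper is not reproduced.

```python
import math


def target_attention(target, behaviors):
    """Weight each behavior embedding by its softmax-normalized dot
    product with the target embedding and return (weights, pooled)."""
    scores = [sum(t * b for t, b in zip(target, beh)) for beh in behaviors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * beh[k] for w, beh in zip(weights, behaviors))
        for k in range(len(target))
    ]
    return weights, pooled


# Hypothetical 2-d embeddings: the first behavior matches the target.
weights, pooled = target_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, pooled)
```

Behaviors similar to the target dominate the pooled representation, which is the item-level dependency that pattern-level attention generalizes.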

Counterfactual Ranking Evaluation with Flexible Click Models

Alexander Buchholz
Ben London
Giuseppe Di Benedetto
Jan Malte Lichtenberg
Yannik Stein
Thorsten Joachims

Evaluating a new ranking policy using data logged by a previously deployed policy requires a counterfactual (off-policy) estimator that corrects for presentation and selection biases. Some estimators (e.g., the position-based model) perform this correction by making strong assumptions about user behavior, which can lead to high bias if the assumptions are not met. Other estimators (e.g., the item-position model) rely on randomization to avoid these assumptions, but they often suffer from high variance. In this paper, we develop a new counterfactual estimator, called Interpol, that provides a tunable trade-off in the assumptions it makes, thus providing a novel ability to optimize the bias-variance trade-off. We analyze the bias of our estimator, both theoretically and empirically, and show that it achieves lower error than both the position-based model and the item-position model, on both synthetic and real datasets. This improvement in accuracy not only benefits offline evaluation of ranking policies, we also find that Interpol improves learning of new ranking policies when used as the training objective for learning-to-rank.
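
The position-based model (PBM) correction mentioned above can be sketched as follows: each logged click is re-weighted by the ratio of the examination probability at the item's position under the new ranking to that at its logged position. This is a minimal sketch of the standard PBM estimator for deterministic rankings, not of Interpol itself; the examination probabilities and log format are illustrative assumptions.

```python
def pbm_estimate(logs, exam):
    """Estimate expected clicks per query under a new ranking policy.

    logs: list of (logged_ranking, new_ranking, clicked_items) tuples,
          where rankings are lists of item ids and clicked_items is a set.
    exam: examination probability for each 0-based position (assumed known).
    """
    total = 0.0
    for logged, new, clicked in logs:
        new_position = {item: k for k, item in enumerate(new)}
        for k, item in enumerate(logged):
            if item in clicked and item in new_position:
                # Re-weight the click by how much more (or less) often the
                # item would be examined at its position in the new ranking.
                total += exam[new_position[item]] / exam[k]
    return total / len(logs)


# One hypothetical query: item "b" was clicked at position 1 and the new
# policy would move it to position 0.
logs = [(["a", "b"], ["b", "a"], {"b"})]
print(pbm_estimate(logs, exam=[1.0, 0.5]))
```

The estimate is only as good as the assumption that clicks factor into position-dependent examination and item relevance, which is the source of bias the abstract attributes to this family of estimators.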

Deep Automated Mechanism Design for Integrating Ad Auction and Allocation in Feed

Xuejian Li
Ze Wang
Bingqi Zhu
Fei He
Yongkang Wang
Xingxing Wang

E-commerce platforms usually present an ordered list, mixed with several organic items and an advertisement, in response to each user's page view request. This list, the outcome of ad auction and allocation processes, directly impacts the platform's ad revenue and gross merchandise volume (GMV). Specifically, the ad auction determines which ad is displayed and the corresponding payment, while the ad allocation decides the display positions of the advertisement and organic items. The prevalent methods of segregating the ad auction and allocation into two distinct stages face two problems: 1) Ad auction does not consider externalities, such as the influence of actual display position and context on ad Click-Through Rate (CTR); 2) The ad allocation, which utilizes the auction-winning ad's payment to determine the display position dynamically, fails to maintain incentive compatibility (IC) for the advertisement. For instance, in the auction stage employing the traditional Generalized Second Price (GSP), even if the winning ad increases its bid, its payment remains unchanged. This implies that the advertisement cannot secure a better position and thus loses the opportunity to achieve higher utility in the subsequent ad allocation stage. Previous research often focused on one of the two stages, neglecting the two-stage problem, which may result in suboptimal outcomes.
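
The GSP observation above can be checked with a small sketch. It assumes a single ad slot and ranking by raw bid (production systems typically rank by a quality-weighted bid); the function name and bid values are hypothetical.

```python
def gsp_single_slot(bids):
    """Single-slot Generalized Second Price auction.

    bids: dict mapping advertiser id -> bid.
    Returns (winner, payment): the highest bidder wins and pays the
    second-highest bid (0.0 if there is no competitor).
    """
    ranked = sorted(bids.items(), key=lambda pair: pair[1], reverse=True)
    winner = ranked[0][0]
    payment = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, payment


# The winner raises its bid from 5 to 9, yet the payment stays at 3.
print(gsp_single_slot({"ad_a": 5.0, "ad_b": 3.0}))
print(gsp_single_slot({"ad_a": 9.0, "ad_b": 3.0}))
```

Because the payment is fixed by the runner-up, an allocation stage that places the ad according to its payment gives the winner no way to obtain a better position by bidding more.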

Therefore, this paper proposes a deep automated mechanism that integrates ad auction and allocation, ensuring both IC and Individual Rationality (IR) in the presence of externalities while maximizing revenue and GMV. The mechanism takes candidate ads and the ordered list of organic items as input. For each candidate ad, several candidate allocations are generated by inserting the ad in different positions of the ordered list of organic items. For each candidate allocation, a list-wise model takes the entire allocation as input and outputs the predicted result for each ad and organic item to model the global externalities. Finally, an automated auction mechanism, modeled by deep neural networks, is executed to select the optimal allocation. Consequently, this mechanism simultaneously decides the ranking, payment, and display position of the ad. Furthermore, the proposed mechanism results in higher revenue and GMV than state-of-the-art baselines in offline experiments and online A/B tests.
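
The candidate-generation step described above (inserting the ad at each position of the ordered organic list) can be sketched as follows. The selection step is shown with a caller-supplied scoring function standing in for the paper's list-wise model and learned auction; that scorer and all names are hypothetical placeholders.

```python
def candidate_allocations(organic_items, ad):
    """Return every list obtained by inserting the ad at one position of
    the ordered organic list, keeping the organic order unchanged."""
    return [
        organic_items[:i] + [ad] + organic_items[i:]
        for i in range(len(organic_items) + 1)
    ]


def select_allocation(candidates, score):
    """Pick the candidate with the highest score; `score` is a stand-in
    for a learned list-wise model and is purely illustrative."""
    return max(candidates, key=score)


candidates = candidate_allocations(["o1", "o2"], "ad")
print(candidates)

# Toy scorer: prefer showing the ad in the second slot.
print(select_allocation(candidates, score=lambda c: -abs(c.index("ad") - 1)))
```

With n organic items this yields n + 1 candidates per ad, so scoring whole lists stays tractable while still exposing position and context effects to the model.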

CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking

Josef Vonásek
Milan Straka
Rostislav Krč
Lenka Lasonová
Ekaterina Egorova
Jana Straková
Jakub Náplava

We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from search engine logs of Seznam.cz. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated Czech test for the relevance task, containing nearly 50k query-document pairs, each annotated by at least 2 annotators. Finally, we analyze how the user behavior data improve relevance ranking and show that models trained on data automatically harnessed at sufficient scale can surpass the performance of models trained on human annotated data. CWRCzech is published under an academic non-commercial license and is available to the research community at https://github.com/seznam/CWRCzech.

Exploring Multi-Scenario Multi-Modal CTR Prediction with a Large Scale Dataset

Zhaoxin Huan
Ke Ding
Ang Li
Xiaolu Zhang
Xu Min
Yong He
Liang Zhang
Jun Zhou
Linjian Mo
Jinjie Gu
Zhongyi Liu
Wenliang Zhong
Guannan Zhang
Chenliang Li
Fajie Yuan

Click-through rate (CTR) prediction plays a crucial role in recommendation systems, with significant impact on user experience and platform revenue generation. Despite the various public CTR datasets available due to increasing interest from both academia and industry, these datasets have limitations. They cover a limited range of scenarios and predominantly focus on ID-based features, neglecting the vital role of multi-modal features for effective multi-scenario CTR prediction. Moreover, their scale is modest compared to real-world industrial datasets, hindering robust and comprehensive evaluation of complex models. To address these challenges, we introduce a large-scale Multi-Scenario Multi-Modal CTR dataset named AntM²C, built from real industrial data from Alipay. This dataset offers an impressive breadth and depth of information, covering CTR data from four diverse business scenarios, including advertisements, consumer coupons, mini-programs, and videos. Unlike existing datasets, AntM²C provides not only ID-based features but also five textual features and one image feature for both users and items, supporting more delicate multi-modal CTR prediction. AntM²C is also substantially larger than existing datasets, comprising 100 million CTR data. This scale allows for robust and comprehensive evaluation and comparison of CTR prediction models. We employ AntM²C to construct several typical CTR tasks, including multi-scenario modeling, item and user cold-start modeling, and multi-modal modeling. Initial experiments and comparisons with baseline methods have shown that AntM²C presents both new challenges and opportunities for CTR models, with the potential to significantly advance CTR research. The AntM²C dataset is available at https://www.atecup.cn/OfficalDataSet.

SESSION: Session: Graphs and RecSys 2

AFDGCF: Adaptive Feature De-correlation Graph Collaborative Filtering for Recommendations

Wei Wu
Chao Wang
Dazhong Shen
Chuan Qin
Liyi Chen
Hui Xiong

Collaborative filtering methods based on graph neural networks (GNNs) have witnessed significant success in recommender systems (RS), capitalizing on their ability to capture collaborative signals within intricate user-item relationships via message-passing mechanisms. However, these GNN-based RS inadvertently introduce excess linear correlation between user and item embeddings, contradicting the goal of providing personalized recommendations. While existing research predominantly ascribes this flaw to the over-smoothing problem, this paper underscores the critical, often overlooked role of the over-correlation issue in diminishing the effectiveness of GNN representations and subsequent recommendation performance. Up to now, the over-correlation issue remains unexplored in RS. Meanwhile, how to mitigate the impact of over-correlation while preserving collaborative filtering signals is a significant challenge. To this end, this paper aims to address the aforementioned gap by undertaking a comprehensive study of the over-correlation issue in graph collaborative filtering models. Firstly, we present empirical evidence to demonstrate the widespread prevalence of over-correlation in these models. Subsequently, we dive into a theoretical analysis which establishes a pivotal connection between the over-correlation and over-smoothing issues. Leveraging these insights, we introduce the Adaptive Feature De-correlation Graph Collaborative Filtering (AFDGCF) framework, which dynamically applies correlation penalties to the feature dimensions of the representation matrix, effectively alleviating both over-correlation and over-smoothing issues. The efficacy of the proposed framework is corroborated through extensive experiments conducted with four representative graph collaborative filtering models across four publicly available datasets. Our results show the superiority of AFDGCF in enhancing the performance landscape of graph collaborative filtering models.
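
A correlation penalty on the feature dimensions of a representation matrix can be sketched as below: it measures the mean squared Pearson correlation between distinct embedding columns, which is zero when the dimensions are uncorrelated. This is a fixed, non-adaptive penalty written for illustration; how AFDGCF weights penalties dynamically is not specified in the abstract and is not reproduced here.

```python
import math


def column_correlation_penalty(embeddings):
    """Mean squared Pearson correlation over distinct column pairs.

    embeddings: list of rows (one per user or item), each with d features.
    Assumes d >= 2 and that no column is constant (non-zero variance).
    """
    n, d = len(embeddings), len(embeddings[0])
    columns = [[row[j] for row in embeddings] for j in range(d)]
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]

    total, pairs = 0.0, 0
    for i in range(d):
        for j in range(i + 1, d):
            cov = sum(a * b for a, b in zip(centered[i], centered[j]))
            corr = cov / (norms[i] * norms[j])
            total += corr * corr
            pairs += 1
    return total / pairs


# Perfectly correlated columns versus uncorrelated columns.
print(column_correlation_penalty([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]))
print(column_correlation_penalty([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]))
```

Adding such a term to the training loss discourages redundant feature dimensions, which is the over-correlation effect the abstract describes.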

Exploring the Individuality and Collectivity of Intents behind Interactions for Graph Collaborative Filtering

Yi Zhang
Lei Sang
Yiwen Zhang

Intent modeling has attracted widespread attention in recommender systems. As the core motivation behind user selection of items, intent is crucial for elucidating recommendation results. The current mainstream modeling method is to abstract the intent into unknowable but learnable shared or non-shared parameters. Despite considerable progress, we argue that it still confronts the following challenges: firstly, these methods only capture the coarse-grained aspects of intent, ignoring the fact that user-item interactions will be affected by collective and individual factors (e.g., a user may choose a movie because of its high box office or because of his own unique preferences); secondly, modeling believable intent is severely hampered by implicit feedback, which is incredibly sparse and devoid of true semantics. To address these challenges, we propose a novel recommendation framework designated as Bilateral Intent-guided Graph Collaborative Filtering (BIGCF). Specifically, we take a closer look at user-item interactions from a causal perspective and put forth the concepts of individual intent, which signifies private preferences, and collective intent, which denotes overall awareness. To counter the sparsity of implicit feedback, the feature distributions of users and items are encoded via a Gaussian-based graph generation strategy, and we implement the recommendation process through bilateral intent-guided graph reconstruction re-sampling. Finally, we propose graph contrastive regularization for both interaction and intent spaces to uniformize users, items, intents, and interactions in a self-supervised and non-augmented paradigm. Experimental results on three real-world datasets demonstrate the effectiveness of BIGCF compared with existing solutions.

Content-based Graph Reconstruction for Cold-start Item Recommendation

Jinri Kim
Eungi Kim
Kwangeun Yeo
Yujin Jeon
Chanwoo Kim
Sewon Lee
Joonseok Lee

Graph convolutions have been successfully applied to recommendation systems, utilizing high-order collaborative signals present in the user-item interaction graph. This idea, however, has not been applicable to the cold-start items, since cold nodes are isolated in the graph and thus do not take advantage of information exchange from neighboring nodes. Recently, there have been a few attempts to utilize graph convolutions on item-item or user-user attribute graphs to capture high-order collaborative signals for cold-start cases, but these approaches are still limited in that the item-item or user-user graph falls short in capturing the dynamics of user-item interactions, as their edges are constructed based on arbitrary and heuristic attribute similarity.

In this paper, we introduce Content-based Graph Reconstruction for Cold-start item recommendation (CGRC), employing a masked graph autoencoder structure and multimodal contents to directly incorporate interaction-based high-order connectivity, applicable even in cold-start scenarios. To address the cold-start items directly on the interaction graph, our approach trains the model to reconstruct plausible user-item interactions from masked edges of randomly chosen cold items, simulating fresh items without connection to users. This strategy enables the model to infer potential edges for unseen cold-start nodes. Extensive experiments on real-world datasets demonstrate the superiority of our model.
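The masking step described above (hiding all edges of randomly chosen items so that they behave like fresh, unconnected items) can be sketched as follows. This is a minimal illustration under assumed data structures: interactions are a list of `(user, item)` pairs, and `mask_item_edges` is a hypothetical helper, not a function from the authors' code.

```python
import random

def mask_item_edges(edges, num_masked_items, seed=0):
    """Hide every edge of a random subset of items.

    edges: list of (user, item) interaction pairs.
    Returns (visible_edges, target_edges, masked_items), where target_edges
    are the hidden interactions a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    items = sorted({item for _, item in edges})
    # Items chosen here are treated as simulated cold-start items.
    masked_items = set(rng.sample(items, num_masked_items))
    visible = [(u, i) for u, i in edges if i not in masked_items]
    targets = [(u, i) for u, i in edges if i in masked_items]
    return visible, targets, masked_items
```

A model trained this way sees only `visible` as graph structure plus the content features of the masked items, and is asked to recover `targets`.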

SIGformer: Sign-aware Graph Transformer for Recommendation

Sirui Chen
Jiawei Chen
Sheng Zhou
Bohao Wang
Shen Han
Chanfei Su
Yuqing Yuan
Can Wang

In recommender systems, most graph-based methods focus on positive user feedback, while overlooking the valuable negative feedback. Integrating both positive and negative feedback to form a signed graph can lead to a more comprehensive understanding of user preferences. However, the existing efforts to incorporate both types of feedback are sparse and face two main limitations: 1) They process positive and negative feedback separately, which fails to holistically leverage the collaborative information within the signed graph; 2) They rely on MLPs or GNNs for information extraction from negative feedback, which may not be effective. To overcome these limitations, we introduce SIGformer, a new method that applies the transformer architecture to sign-aware graph-based recommendation. SIGformer incorporates two innovative positional encodings that capture the spectral properties and path patterns of the signed graph, enabling the full exploitation of the entire graph. Our extensive experiments across five real-world datasets demonstrate the superiority of SIGformer over state-of-the-art methods. The code is available at https://github.com/StupidThree/SIGformer.

TransGNN: Harnessing the Collaborative Power of Transformers and Graph Neural Networks for Recommender Systems

Peiyan Zhang
Yuchen Yan
Xi Zhang
Chaozhuo Li
Senzhang Wang
Feiran Huang
Sunghun Kim

Graph Neural Networks (GNNs) have emerged as promising solutions for collaborative filtering (CF) through the modeling of user-item interaction graphs. The nucleus of existing GNN-based recommender systems involves recursive message passing along user-item interaction edges to refine encoded embeddings. Despite their demonstrated effectiveness, current GNN-based methods encounter challenges of limited receptive fields and the presence of noisy "interest-irrelevant" connections. In contrast, Transformer-based methods excel in aggregating information adaptively and globally. Nevertheless, their application to large-scale interaction graphs is hindered by inherent complexities and challenges in capturing intricate, entangled structural information. In this paper, we propose TransGNN, a novel model that integrates Transformer and GNN layers in an alternating fashion to mutually enhance their capabilities. Specifically, TransGNN leverages Transformer layers to broaden the receptive field and disentangle information aggregation from edges, which aggregates information from more relevant nodes, thereby enhancing the message passing of GNNs. Additionally, to capture graph structure information effectively, positional encoding is meticulously designed and integrated into GNN layers to encode such structural knowledge into node attributes, thus enhancing the Transformer's performance on graphs. Efficiency considerations are also alleviated by proposing the sampling of the most relevant nodes for the Transformer, along with two efficient sample update strategies to reduce complexity. Furthermore, theoretical analysis demonstrates that TransGNN offers increased expressiveness compared to GNNs, with only a marginal increase in linear complexity. Extensive experiments on five public datasets validate the effectiveness and efficiency of TransGNN. Our code is available at https://github.com/Peiyance/TransGNN-torch.

Lightweight Embeddings for Graph Collaborative Filtering

Xurong Liang
Tong Chen
Lizhen Cui
Yang Wang
Meng Wang
Hongzhi Yin

Graph neural networks (GNNs) are currently one of the most performant and versatile collaborative filtering methods. Meanwhile, like in traditional collaborative filtering, owing to the use of an embedding table to represent each user/item entity as a distinct vector, GNN-based recommenders have inherited its long-standing defect of parameter inefficiency. As a common practice for scalable embeddings, parameter sharing enables the use of fewer embedding vectors (which we term meta-embeddings), where each entity is represented by a unique combination of meta-embeddings instead. When assigning meta-embeddings, most existing methods use a heuristically designed, predefined mapping from each user/item entity's ID to the corresponding meta-embedding indexes (e.g., double hashing), thus simplifying the optimization problem into learning only the meta-embeddings. However, in the context of GNN-based collaborative filtering, such a fixed mapping omits the semantic correlations between entities that are evident in the user-item interaction graph, leading to suboptimal recommendation performance. To this end, we propose Lightweight Embeddings for Graph Collaborative Filtering (LEGCF), a parameter-efficient embedding framework dedicated to GNN-based recommenders. LEGCF innovatively introduces an assignment matrix as an additional learnable component on top of meta-embeddings. To jointly optimize these two heavily entangled components, aside from learning the meta-embeddings by minimizing the recommendation loss, LEGCF further performs efficient assignment update by enforcing a novel semantic similarity constraint and finding its closed-form solution based on matrix pseudo-inverse. The meta-embeddings and assignment matrix are alternately updated, where the latter is sparsified on the fly to ensure negligible storage overhead. Extensive experiments on three benchmark datasets have verified LEGCF's smallest trade-off between size and performance, with consistent accuracy gain over state-of-the-art baselines. The codebase of LEGCF is available at https://github.com/xurong-liang/LEGCF.

SESSION: Session: Dense Retrieval 1

Leveraging LLMs for Unsupervised Dense Retriever Ranking

Ekaterina Khramtsova
Shengyao Zhuang
Mahsa Baktashmotlagh
Guido Zuccon

In this paper we present Large Language Model Assisted Retrieval Model Ranking (LARMOR), an effective unsupervised approach that leverages LLMs for selecting which dense retriever to use on a test corpus (target). Dense retriever selection is crucial for many IR applications that rely on using dense retrievers trained on public corpora to encode or search a new, private target corpus. This is because when confronted with domain shift, where the downstream corpora, domains, or tasks of the target corpus differ from the domain/task the dense retriever was trained on, its performance often drops. Furthermore, when the target corpus is unlabeled, e.g., in a zero-shot scenario, the direct evaluation of the model on the target corpus becomes infeasible. Unsupervised selection of the most effective pre-trained dense retriever then becomes a crucial challenge. Current methods for dense retriever selection are insufficient in handling scenarios with domain shift.

Our proposed solution leverages LLMs to generate pseudo-relevant queries, labels and reference lists based on a set of documents sampled from the target corpus. Dense retrievers are then ranked based on their effectiveness on these generated pseudo-relevant signals. Notably, our method is the first approach that relies solely on the target corpus, eliminating the need for both training corpora and test labels. To evaluate the effectiveness of our method, we construct a large pool of state-of-the-art dense retrievers. The proposed approach outperforms existing baselines with respect to both dense retriever selection and ranking. We make our code and results publicly available at https://github.com/ielab/larmor/.

Dimension Importance Estimation for Dense Information Retrieval

Guglielmo Faggioli
Nicola Ferro
Raffaele Perego
Nicola Tonellotto

Recent advances in Information Retrieval have shown the effectiveness of embedding queries and documents in a latent high-dimensional space to compute their similarity. While operating on such high-dimensional spaces is effective, in this paper, we hypothesize that we can improve the retrieval performance by adequately moving to a query-dependent subspace. In more detail, we formulate the Manifold Clustering (MC) Hypothesis: projecting queries and documents onto a subspace of the original representation space can improve retrieval effectiveness. To empirically validate our hypothesis, we define a novel class of Dimension IMportance Estimators (DIME). Such models aim to determine how much each dimension of a high-dimensional representation contributes to the quality of the final ranking and provide an empirical method to select a subset of dimensions onto which to project the query and the documents. To support our hypothesis, we propose an oracle DIME, capable of effectively selecting dimensions and almost doubling the retrieval performance. To show the practical applicability of our approach, we then propose a set of DIMEs that do not require any oracular piece of information to estimate the importance of dimensions. These estimators allow us to carry out a dimensionality selection that enables performance improvements of up to +11.5% (moving from 0.675 to 0.752 nDCG@10) compared to the baseline methods using all dimensions. Finally, we show that, with simple and realistic active feedback, such as the user's interaction with a single relevant document, we can design a highly effective DIME, allowing us to outperform the baseline by up to +0.224 nDCG@10 points (+58.6%, moving from 0.384 to 0.608).
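To make the idea of query-dependent dimension selection concrete, here is a minimal sketch of one plausible estimator driven by a single relevant feedback document. The scoring rule (element-wise product of the query and feedback-document embeddings) and the names `select_dimensions` and `score_in_subspace` are assumptions for illustration; they are not claimed to be the estimators defined in the paper.

```python
import numpy as np

def select_dimensions(query: np.ndarray, feedback_doc: np.ndarray, keep: int) -> np.ndarray:
    """Return indices of the `keep` dimensions with the highest importance.

    Importance of dimension i is assumed to be query[i] * feedback_doc[i],
    i.e. that dimension's contribution to the query-document dot product.
    """
    importance = query * feedback_doc
    return np.argsort(-importance)[:keep]

def score_in_subspace(query: np.ndarray, docs: np.ndarray, dims: np.ndarray) -> np.ndarray:
    """Dot-product scores computed only over the selected dimensions."""
    return docs[:, dims] @ query[dims]
```

Documents would then be ranked by `score_in_subspace` instead of the full dot product; `keep` controls how aggressively dimensions are discarded.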

Graded Relevance Scoring of Written Essays with Dense Retrieval

Salam Albatarni
Sohaila Eltanbouly
Tamer Elsayed

Automated Essay Scoring automates the grading process of essays, providing a great advantage for improving the writing proficiency of students. While holistic essay scoring research is prevalent, a noticeable gap exists in scoring essays for specific quality traits. In this work, we focus on the relevance trait, which measures the ability of the student to stay on-topic throughout the entire essay. We propose a novel approach for graded relevance scoring of written essays that employs dense retrieval encoders. Dense representations of essays at different relevance levels then form clusters in the embedding space, such that their centroids are potentially separate enough to effectively represent their relevance levels. We hence use the simple 1-Nearest-Neighbor classification over those centroids to determine the relevance level of an unseen essay. As an effective unsupervised dense encoder, we leverage Contriever, which is pre-trained with contrastive learning and has demonstrated performance comparable to supervised dense retrieval models. We tested our approach on both task-specific (i.e., training and testing on same task) and cross-task (i.e., testing on unseen task) scenarios using the widely used ASAP++ dataset. Our method establishes a new state-of-the-art performance in the task-specific scenario, while its extension for the cross-task scenario exhibited a performance that is on par with the state-of-the-art model for that scenario. We also analyzed the performance of our approach in a more practical few-shot scenario, showing that it can significantly reduce the labeling cost while sacrificing only 10% of its effectiveness.
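The centroid-based 1-Nearest-Neighbor step can be sketched in a few lines. The example assumes essay embeddings have already been produced by some dense encoder and uses Euclidean distance; both the helper names and the distance choice are assumptions for illustration.

```python
import numpy as np

def fit_centroids(embeddings: np.ndarray, levels: np.ndarray) -> dict:
    """Average the embeddings of all training essays at each relevance level."""
    return {lvl: embeddings[levels == lvl].mean(axis=0) for lvl in np.unique(levels)}

def predict_level(embedding: np.ndarray, centroids: dict):
    """Assign the relevance level whose centroid is closest (1-NN over centroids)."""
    return min(centroids, key=lambda lvl: np.linalg.norm(embedding - centroids[lvl]))
```

Because only one centroid per level is stored, prediction cost is independent of the number of training essays.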

Scaling Laws For Dense Retrieval

Yan Fang
Jingtao Zhan
Qingyao Ai
Jiaxin Mao
Weihang Su
Jia Chen
Yiqun Liu

Scaling laws have been observed in a wide range of tasks, particularly in language generation. Previous studies have found that the performance of large language models adheres to predictable patterns with respect to the size of models and datasets. This helps us design training strategies effectively and efficiently, especially as large-scale training becomes increasingly resource-intensive. Yet, in dense retrieval, such scaling laws have not been fully explored. In this study, we investigate how scaling affects the performance of dense retrieval models. We implement dense retrieval models with different numbers of parameters, and train them with various amounts of annotated data. We propose to use the contrastive entropy as the evaluation metric, which is continuous compared with discrete ranking metrics and thus can accurately reflect model performance. Results indicate that the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations across different datasets and annotation methods. Additionally, we show that the scaling laws help optimize the training process, such as resolving the resource allocation problem under a budget constraint. We believe that these findings significantly contribute to understanding the scaling effect of dense retrieval models and offer meaningful guidance for future research.
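A power-law relationship between a loss-like metric and model size can be checked with a straight-line fit in log-log space. The sketch below assumes the simplest form, L(N) = a * N^(-alpha), with no irreducible-loss term; the exact functional form used in the paper may differ, and `fit_power_law` is a hypothetical helper.

```python
import numpy as np

def fit_power_law(sizes: np.ndarray, losses: np.ndarray) -> tuple:
    """Fit losses ~ a * sizes ** (-alpha) by linear regression on logs.

    Returns (a, alpha). Both inputs must be strictly positive.
    """
    # log(loss) = log(a) - alpha * log(size), a straight line in log-log space.
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)
```

With measured (model size, contrastive entropy) pairs in place of synthetic inputs, the fitted `alpha` summarizes how quickly the metric improves as the model grows; the same fit applies to the number of annotations.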

SESSION: Session: Diffusion in RecSys

Diffusion Models for Generative Outfit Recommendation

Yiyan Xu
Wenjie Wang
Fuli Feng
Yunshan Ma
Jizhi Zhang
Xiangnan He

Outfit Recommendation (OR) in the fashion domain has evolved through two stages: Pre-defined Outfit Recommendation and Personalized Outfit Composition. However, both stages are constrained by existing fashion products, limiting their effectiveness in addressing users' diverse fashion needs. Recently, the advent of AI-generated content provides the opportunity for OR to transcend these limitations, showcasing the potential for personalized outfit generation and recommendation.

To this end, we introduce a novel task called Generative Outfit Recommendation (GOR), aiming to generate a set of fashion images and compose them into a visually compatible outfit tailored to specific users. The key objectives of GOR lie in the high fidelity, compatibility, and personalization of generated outfits. To achieve these, we propose a generative outfit recommender model named DiFashion, which harnesses diffusion models to generate multiple fashion images in parallel. To ensure these three objectives, we design three kinds of conditions to guide the parallel generation process and adopt Classifier-Free-Guidance to enhance the alignment between the generated images and conditions. We apply DiFashion on both personalized Fill-In-The-Blank and GOR tasks and conduct extensive experiments on iFashion and Polyvore-U datasets. The quantitative and human-involved qualitative evaluations demonstrate the superiority of DiFashion over competitive baselines.
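For readers unfamiliar with classifier-free guidance, the core operation is a linear extrapolation from the unconditional noise prediction toward the conditional one. The sketch below shows the generic single-condition form; how DiFashion combines its three conditions is not specified in the abstract and is not reproduced here.

```python
import numpy as np

def classifier_free_guidance(eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float) -> np.ndarray:
    """Combine unconditional and conditional noise predictions.

    scale = 0 ignores the condition, scale = 1 is plain conditional sampling,
    and scale > 1 pushes the sample further toward the condition.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Some papers write the same rule as (1 + w) * eps_cond - w * eps_uncond, which is identical with scale = 1 + w.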

Collaborative Filtering Based on Diffusion Models: Unveiling the Potential of High-Order Connectivity

Yu Hou
Jin-Duk Park
Won-Yong Shin

A recent study has shown that diffusion models are well-suited for modeling the generative process of user-item interactions in recommender systems due to their denoising nature. However, existing diffusion model-based recommender systems do not explicitly leverage high-order connectivities that contain crucial collaborative signals for accurate recommendations. Addressing this gap, we propose CF-Diff, a new diffusion model-based collaborative filtering (CF) method, which is capable of making full use of collaborative signals along with multi-hop neighbors. Specifically, the forward-diffusion process adds random noise to user-item interactions, while the reverse-denoising process accommodates our own learning model, named cross-attention-guided multi-hop autoencoder (CAM-AE), to gradually recover the original user-item interactions. CAM-AE consists of two core modules: 1) the attention-aided AE module, responsible for precisely learning latent representations of user-item interactions while preserving the model's complexity at manageable levels, and 2) the multi-hop cross-attention module, which judiciously harnesses high-order connectivity information to capture enhanced collaborative signals. Through comprehensive experiments on three real-world datasets, we demonstrate that CF-Diff is (a) Superior: outperforming benchmark recommendation methods, achieving remarkable gains up to 7.29% compared to the best competitor, (b) Theoretically-validated: reducing computations while ensuring that the embeddings generated by our model closely approximate those from the original cross-attention, and (c) Scalable: proving the computational efficiency that scales linearly with the number of users or items.

Denoising Diffusion Recommender Model

Jujia Zhao
Wang Wenjie
Yiyan Xu
Teng Sun
Fuli Feng
Tat-Seng Chua

Recommender systems often grapple with noisy implicit feedback. Most studies alleviate the noise issue from a data-cleaning perspective, such as data resampling and reweighting, but they are constrained by heuristic assumptions. Another denoising avenue takes a model perspective: it proactively injects noise into user-item interactions and enhances the intrinsic denoising ability of models. However, this kind of denoising process poses significant challenges to the recommender model's representation capacity to capture noise patterns.

To address this issue, we propose Denoising Diffusion Recommender Model (DDRM), which leverages the multi-step denoising process of diffusion models to robustify user and item embeddings from any recommender model. DDRM injects controlled Gaussian noise in the forward process and iteratively removes noise in the reverse denoising process, thereby improving embedding robustness against noisy feedback. To achieve this target, the key lies in offering appropriate guidance to steer the reverse denoising process and providing a proper starting point for the forward-reverse process during inference. In particular, we propose a dedicated denoising module that encodes collaborative information as denoising guidance. In addition, in the inference stage, DDRM utilizes the average embeddings of users' historically liked items as the starting point rather than using pure noise, since pure noise lacks personalization, which increases the difficulty of the denoising process. Extensive experiments on three datasets with three representative backend recommender models demonstrate the effectiveness of DDRM.

Graph Signal Diffusion Model for Collaborative Filtering

Yunqin Zhu
Chao Wang
Qi Zhang
Hui Xiong

Collaborative filtering is a critical technique in recommender systems. It has been increasingly viewed as a conditional generative task for user feedback data, where newly developed diffusion models show great potential. However, existing studies on diffusion models lack effective solutions for modeling implicit feedback. Particularly, the standard isotropic diffusion process overlooks correlation between items and is thus misaligned with the graphical structure of the interaction space. Meanwhile, Gaussian noise destroys personalized information in a user's interaction vector, causing difficulty in its reconstruction. In this paper, we adapt the standard diffusion model and propose a novel Graph Signal Diffusion Model for Collaborative Filtering (named GiffCF). To better represent the correlated distribution of user-item interactions, we define a generalized diffusion process using the heat equation on the item-item similarity graph. Our forward process smooths interaction signals with an advanced family of graph filters, introducing the graph adjacency as beneficial prior knowledge for recommendation. Our reverse process iteratively refines and sharpens latent signals in a noise-free manner, where the updates are conditioned on the user's history and computed from a carefully designed two-stage denoiser, leading to high-quality reconstruction. Finally, through extensive experiments, we show that GiffCF effectively leverages the advantages of both diffusion models and graph signal processing, and achieves state-of-the-art performance on three benchmark datasets.
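The smoothing effect of the heat equation on a graph can be illustrated with its closed-form solution, x(t) = exp(-t L) x(0), where L is a graph Laplacian. The sketch below assumes the unnormalized Laplacian L = D - A of a symmetric item-item adjacency matrix; the paper's own family of graph filters and its choice of Laplacian may differ, and `heat_smooth` is a hypothetical helper.

```python
import numpy as np

def heat_smooth(adjacency: np.ndarray, signal: np.ndarray, t: float) -> np.ndarray:
    """Solve the graph heat equation dx/dt = -L x up to time t.

    adjacency: symmetric (num_items, num_items) similarity matrix.
    signal: a user's interaction vector over items.
    """
    # Unnormalized graph Laplacian (an assumption for this sketch).
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # L is symmetric, so it has an orthonormal eigenbasis.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # Each graph frequency decays as exp(-t * eigenvalue).
    return eigvecs @ (np.exp(-t * eigvals) * (eigvecs.T @ signal))
```

At t = 0 the signal is unchanged; as t grows, high-frequency components decay and the interaction mass spreads to similar items, which is the sense in which the forward process smooths rather than adds noise.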

SESSION: Session: Neural IR

Multi-granular Adversarial Attacks against Black-box Neural Ranking Models

Yu-An Liu
Ruqing Zhang
Jiafeng Guo
Maarten de Rijke
Yixing Fan
Xueqi Cheng

Adversarial ranking attacks have gained increasing attention due to their success in probing vulnerabilities, and, hence, enhancing the robustness, of neural ranking models. Conventional attack methods employ perturbations at a single granularity, e.g., word or sentence level, to target documents. However, limiting perturbations to a single level of granularity may reduce the flexibility of adversarial examples, thereby diminishing the potential threat of the attack. Therefore, we focus on generating high-quality adversarial examples by incorporating multi-granular perturbations. Achieving this objective involves tackling a combinatorial explosion problem, which requires identifying an optimal combination of perturbations across all possible levels of granularity, positions, and textual pieces. To address this challenge, we transform the multi-granular adversarial attack into a sequential decision-making process, where perturbations in the next attack step build on the perturbed document in the current attack step. Since the attack process can only access the final state without direct intermediate signals, we use reinforcement learning to perform multi-granular attacks. During the reinforcement learning process, two agents work cooperatively to identify multi-granular vulnerabilities as attack targets and organize perturbation candidates into a final perturbation sequence. Experimental results show that our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.

Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models

Catherine Chen
Jack Merullo
Carsten Eickhoff

Neural models have demonstrated remarkable performance across diverse ranking tasks. However, the processes and internal mechanisms along which they determine relevance are still largely unknown. Existing approaches for analyzing neural ranker behavior with respect to IR properties rely either on assessing overall model behavior or employing probing methods that may offer an incomplete understanding of causal mechanisms. To provide a more granular understanding of internal model decision-making processes, we propose the use of causal interventions to reverse engineer neural rankers, and demonstrate how mechanistic interpretability methods can be used to isolate components satisfying term-frequency axioms within a ranking model. We identify a group of attention heads that detect duplicate tokens in earlier layers of the model, then communicate with downstream heads to compute overall document relevance. More generally, we propose that this style of mechanistic analysis opens up avenues for reverse engineering the processes neural retrieval models use to compute relevance. This work aims to initiate granular interpretability efforts that will not only benefit retrieval model development and training, but ultimately ensure safer deployment of these models.

A Reproducibility Study of PLAID

Sean MacAvaney
Nicola Tonellotto

The PLAID (Performance-optimized Late Interaction Driver) algorithm for ColBERTv2 uses clustered term representations to retrieve and progressively prune documents for final (exact) document scoring. In this paper, we reproduce and fill in missing gaps from the original work. By studying the parameters PLAID introduces, we find that its Pareto frontier is formed of a careful balance among its three parameters; deviations beyond the suggested settings can substantially increase latency without necessarily improving its effectiveness. We then compare PLAID with an important baseline missing from the paper: re-ranking a lexical system. We find that applying ColBERTv2 as a re-ranker atop an initial pool of BM25 results provides better efficiency-effectiveness trade-offs in low-latency settings. However, re-ranking cannot reach peak effectiveness at higher latency settings due to limitations in recall of lexical matching and provides a poor approximation of an exhaustive ColBERTv2 search. We find that recently proposed modifications to re-ranking that pull in the neighbors of top-scoring documents overcome this limitation, providing a Pareto frontier across all operational points for ColBERTv2 when evaluated using a well-annotated dataset. Curious about why re-ranking methods are highly competitive with PLAID, we analyze the token representation clusters PLAID uses for retrieval and find that most clusters are predominantly aligned with a single token and vice versa. Given the competitive trade-offs that re-ranking baselines exhibit, this work highlights the importance of carefully selecting pertinent baselines when evaluating the efficiency of retrieval engines. https://github.com/seanmacavaney/plaidrepro

Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

Nandan Thakur
Luiz Bonifacio
Maik Fröbe
Alexander Bondarenko
Ehsan Kamalloo
Martin Potthast
Matthias Hagen
Jimmy Lin

The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (less than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are available at https://github.com/castorini/touche-error-analysis.
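The length-based part of the denoising step (excluding passages with fewer than 20 words) amounts to a simple corpus filter. The sketch below assumes whitespace tokenization and a corpus stored as a dict from passage ID to text; the authors' actual tokenization and data format may differ.

```python
def drop_short_passages(passages: dict, min_words: int = 20) -> dict:
    """Keep only passages with at least `min_words` whitespace-separated words."""
    return {
        pid: text
        for pid, text in passages.items()
        if len(text.split()) >= min_words
    }
```

The post-hoc relevance judgments described in the abstract are a manual annotation effort and have no code equivalent here.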

Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses

Ehsan Kamalloo
Nandan Thakur
Carlos Lassance
Xueguang Ma
Jheng-Hong Yang
Jimmy Lin

BEIR is a benchmark dataset originally designed for zero-shot evaluation of retrieval models across 18 different domain/task combinations. In recent years, we have witnessed the growing popularity of models based on representation learning, which naturally begs the question: How effective are these models when presented with queries and documents that differ from the training data? While BEIR was designed to answer this question, our work addresses two shortcomings that prevent the benchmark from achieving its full potential: First, the sophistication of modern neural methods and the complexity of current software infrastructure create barriers to entry for newcomers. To this end, we provide reproducible reference implementations that cover learned dense and sparse models. Second, comparisons on BEIR are performed by reducing scores from heterogeneous datasets into a single average that is difficult to interpret. To remedy this, we present meta-analyses focusing on effect sizes across datasets that are able to accurately quantify model differences. By addressing both shortcomings, our work facilitates future explorations in a range of interesting research questions.

SESSION: Session: Point-of-Interest Recommendation

Optimal Transport Enhanced Cross-City Site Recommendation

Xinhang Li
Xiangyu Zhao
Zihao Wang
Yang Duan
Yong Zhang
Chunxiao Xing

Site recommendation, which aims at predicting the optimal location for brands to open new branches, has demonstrated an important role in assisting decision-making in modern business. In contrast to traditional recommender systems that can benefit from extensive information, site recommendation starkly suffers from extremely limited information and thus leads to unsatisfactory performance. Therefore, existing site recommendation methods primarily focus on several specific name brands and heavily rely on fine-grained human-crafted features to avoid the data sparsity problem. However, such solutions are not able to fulfill the demand for rapid development in modern business. Therefore, we aim to alleviate the data sparsity problem by effectively utilizing data across multiple cities and thereby propose a novel Optimal Transport enhanced Cross-city (OTC) framework for site recommendation. Specifically, OTC leverages optimal transport (OT) on the learned embeddings of brands and regions separately to project the brands and regions from the source city to the target city. Then, the projected embeddings of brands and regions are utilized to obtain the inference recommendation in the target city. By integrating the original recommendation and the inference recommendations from multiple cities, OTC is able to achieve enhanced recommendation results. The experimental results on the real-world OpenSiteRec dataset, encompassing thousands of brands and regions across four metropolises, demonstrate the effectiveness of our proposed OTC in further improving the performance of site recommendation models.
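To illustrate how optimal transport can project embeddings from a source city into a target city's embedding space, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations and then applies a barycentric projection. Uniform marginals, the squared Euclidean cost, the regularization value, and both function names are assumptions for this example; the abstract does not state which OT solver or cost OTC uses.

```python
import numpy as np

def sinkhorn_plan(cost: np.ndarray, reg: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized OT plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on source points
    b = np.full(m, 1.0 / m)  # uniform mass on target points
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternately rescale to match the column and row marginals.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def project_to_target(source: np.ndarray, target: np.ndarray, reg: float = 0.1) -> np.ndarray:
    """Map each source embedding to a plan-weighted average of target embeddings."""
    cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / cost.max()  # normalize for numerical stability; assumes cost.max() > 0
    plan = sinkhorn_plan(cost, reg)
    return (plan @ target) / plan.sum(axis=1, keepdims=True)
```

In this reading, `source` would hold brand (or region) embeddings learned in one city and `target` the corresponding embeddings in another; the projected embeddings live in the target city's space and can be scored there.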

Disentangled Contrastive Hypergraph Learning for Next POI Recommendation

Yantong Lai
Yijun Su
Lingwei Wei
Tianqi He
Haitao Wang
Gaode Chen
Daren Zha
Qiang Liu
Xingxing Wang

Next point-of-interest (POI) recommendation has been a prominent and trending task to provide next suitable POI suggestions for users. Most existing sequential-based and graph neural network-based methods have explored various approaches to modeling user visiting behaviors and have achieved considerable performance. However, two key issues have received less attention: i) Most previous studies have ignored the fact that user preferences are diverse and constantly changing in terms of various aspects, leading to entangled and suboptimal user representations. ii) Many existing methods have inadequately modeled the crucial cooperative associations between different aspects, hindering the ability to capture complementary recommendation effects during the learning process. To tackle these challenges, we propose a novel framework Disentangled Contrastive Hypergraph Learning (DCHL) for next POI recommendation. Specifically, we design a multi-view disentangled hypergraph learning component to disentangle intrinsic aspects among collaborative, transitional and geographical views with adjusted hypergraph convolutional networks. Additionally, we propose an adaptive fusion method to integrate multi-view information automatically. Finally, cross-view contrastive learning is employed to capture cooperative associations among views and reinforce the quality of user and POI representations based on self-discrimination. Extensive experiments on three real-world datasets validate the superiority of our proposal over various state-of-the-art methods. To facilitate future research, our code is available at https://github.com/icmpnorequest/SIGIR2024_DCHL.

Large Language Models for Next Point-of-Interest Recommendation

Peibo Li
Maarten de Rijke
Hao Xue
Shuang Ao
Yang Song
Flora D. Salim

The next Point of Interest (POI) recommendation task is to predict users' immediate next POI visit given their historical data. Location-Based Social Network (LBSN) data, which is often used for the next POI recommendation task, comes with challenges. One frequently disregarded challenge is how to effectively use the abundant contextual information present in LBSN data. Previous methods are limited by their numerical nature and fail to address this challenge. In this paper, we propose a framework that uses pretrained Large Language Models (LLMs) to tackle this challenge. Our framework allows us to preserve heterogeneous LBSN data in its original format, hence avoiding the loss of contextual information. Furthermore, our framework is capable of comprehending the inherent meaning of contextual information due to the inclusion of commonsense knowledge. In experiments, we test our framework on three real-world LBSN datasets. Our results show that the proposed framework outperforms the state-of-the-art models in all three datasets. Our analysis demonstrates the effectiveness of the proposed framework in using contextual information as well as alleviating the commonly encountered cold-start and short trajectory problems.

CLLP: Contrastive Learning Framework Based on Latent Preferences for Next POI Recommendation

Hongli Zhou
Zhihao Jia
Haiyang Zhu
Zhizheng Zhang

Next Point-Of-Interest (POI) recommendation plays an important role in various location-based services. Its main objective is to predict the next POI a user will be interested in based on their previous check-in information. Most existing studies view the next POI recommendation as a sequence prediction problem but pay little attention to the fine-grained latent preferences of users, neglecting the diversity of user motivations for visiting the POIs. In this paper, we propose a contrastive learning framework based on latent preferences (CLLP) for next POI recommendation, which models the latent preference distributions of users at each POI and then yields disentangled latent preference representations. Specifically, we leverage the cross-local and global spatio-temporal contexts to learn POI representations for dynamically modeling user preferences. We also design a novel distillation strategy to make full use of the collaborative signals from other users for representation optimization. Then, we disentangle multiple latent preferences in POI representations using predefined preference prototypes, while leveraging preference-level contrastive learning to encourage independence of different latent preferences by improving the quality of the latent preference representation space. Meanwhile, we employ a multi-task training strategy to jointly optimize all parameters. Experimental results on two real-world datasets show that CLLP achieves state-of-the-art performance and significantly outperforms all existing solutions. Further investigations demonstrate the robustness of CLLP against sparse and noisy data.

As a representative information retrieval task, site recommendation, which aims at predicting the optimal sites for a brand or an institution to open new branches in an automatic data-driven way, is beneficial and crucial for brand development in modern business. However, there is no publicly available dataset so far and most existing approaches are limited to an extremely small scope of brands, which seriously hinders the research on site recommendation. Therefore, we collect, construct and release an open comprehensive dataset, namely OpenSiteRec, to facilitate and promote the research on site recommendation. Specifically, OpenSiteRec leverages a heterogeneous graph schema to represent various types of real-world entities and relations in four international metropolises. To evaluate the performance of the existing general methods on the site recommendation task, we conduct benchmarking experiments of several representative recommendation models on OpenSiteRec. Furthermore, we also highlight the potential application directions to demonstrate the wide applicability of OpenSiteRec. We believe that our OpenSiteRec dataset is significant and anticipated to encourage the development of advanced methods for site recommendation. OpenSiteRec is available online at https://OpenSiteRec.github.io/.

SESSION: Session: Fairness

A Taxation Perspective for Fair Re-ranking

Chen Xu
Xiaopeng Ye
Wenjie Wang
Liang Pang
Jun Xu
Tat-Seng Chua

Fair re-ranking aims to redistribute ranking slots among items more equitably to ensure responsibility and ethics. The exploration of redistribution problems has a long history in economics, offering valuable insights for conceptualizing fair re-ranking as a taxation process. Such a formulation provides us with a fresh perspective to re-examine fair re-ranking and inspire the development of new methods. From a taxation perspective, we theoretically demonstrate that most previous fair re-ranking methods can be reformulated as an item-level tax policy. Ideally, a good tax policy should be effective and conveniently controllable to adjust ranking resources. However, both empirical and theoretical analyses indicate that the previous item-level tax policy cannot meet two ideal controllability requirements: (1) continuity, ensuring minor changes in tax rates result in small accuracy and fairness shifts; (2) controllability over accuracy loss, ensuring precise estimation of the accuracy loss under a specific tax rate. To overcome these challenges, we introduce a new fair re-ranking method named Tax-rank, which levies taxes based on the difference in utility between two items. Then, we efficiently optimize such an objective by utilizing the Sinkhorn algorithm in optimal transport. A comprehensive analysis shows that our model Tax-rank offers a superior tax policy for fair re-ranking, theoretically demonstrating both continuity and controllability over accuracy loss. Experimental results show that Tax-rank outperforms all state-of-the-art baselines on two ranking tasks.
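
For readers unfamiliar with the Sinkhorn algorithm mentioned in this abstract, the following is a minimal sketch of generic Sinkhorn iterations for entropy-regularized optimal transport. It is not the Tax-rank objective, which the abstract does not specify. The function name `sinkhorn`, the regularization strength `epsilon`, and the toy cost matrix and marginals are all assumptions made for illustration.

```python
import math


def sinkhorn(cost, row_marginals, col_marginals, epsilon=0.5, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    Returns a transport plan whose row sums and column sums match the
    requested marginals (up to numerical tolerance after convergence).
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: K[i][j] = exp(-cost[i][j] / epsilon)
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so each row of the plan sums to its marginal.
        u = [row_marginals[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        # Rescale columns so each column of the plan sums to its marginal.
        v = [col_marginals[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy problem: 2 sources, 3 targets.
cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
row_marginals = [0.5, 0.5]
col_marginals = [0.25, 0.5, 0.25]
plan = sinkhorn(cost, row_marginals, col_marginals)
```

A smaller `epsilon` brings the plan closer to the unregularized optimum but typically needs more iterations to converge.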

Fairness-Aware Exposure Allocation via Adaptive Reranking

Thomas Jaenich
Graham McDonald
Iadh Ounis

In the first stage of a re-ranking pipeline, an inexpensive ranking model is typically deployed to retrieve a set of documents that are highly likely to be relevant to the user's query. The retrieved documents are then re-ranked by a more effective but expensive ranking model, e.g., a deep neural ranker such as BERT. However, in such a standard pipeline, no new documents are typically discovered after the first stage retrieval. Hence, the amount of exposure that a particular group of documents - e.g., documents from a particular demographic category - can receive is limited by the number of documents that are retrieved in the first stage retrieval. Indeed, if too few documents from a group are retrieved in the first stage retrieval, ensuring that the group receives a fair amount of exposure to the user may become infeasible. Therefore, it is useful to identify more documents from underrepresented groups that are potentially relevant to the query during the re-ranking stage. In this work, we investigate how deploying adaptive re-ranking, which enables the discovery of additional potentially relevant documents in the re-ranking stage, can improve the exposure that a given group of documents receives in the final ranking. We propose six adaptive re-ranking policies that can discover documents from underrepresented groups to increase the disadvantaged groups' exposure in the final ranking. Our experiments on the TREC 2021 and 2022 Fair Ranking Track test collections show that our policies consistently improve the fairness of the exposure distribution in the final ranking, compared to standard adaptive re-ranking approaches, resulting in increases of up to ~13% in Attention Weighted Ranked Fairness (AWRF). Moreover, our best performing policy, Policy 6, consistently maintains and frequently increases the utility of the search results in terms of nDCG.

The Impact of Group Membership Bias on the Quality and Fairness of Exposure in Ranking

Ali Vardasbi
Maarten de Rijke
Fernando Diaz
Mostafa Dehghani

When learning to rank from user interactions, search and recommender systems must address biases in user behavior to provide a high-quality ranking. One type of bias that has recently been studied in the ranking literature is when sensitive attributes, such as gender, have an impact on a user's judgment about an item's utility. For example, in a search for an expertise area, some users may be biased towards clicking on male candidates over female candidates. We call this type of bias group membership bias.

Increasingly, we seek rankings that are fair to individuals and sensitive groups. Merit-based fairness measures rely on the estimated utility of the items. With group membership bias, the utility of the sensitive groups is underestimated; hence, without correcting for this bias, a supposedly fair ranking is not truly fair. In this paper, first, we analyze the impact of group membership bias on ranking quality as well as merit-based fairness metrics and show that group membership bias can hurt both ranking and fairness. Then, we provide a correction method for group bias that is based on the assumption that the utility score of items in different groups comes from the same distribution. This assumption has two potential issues of sparsity and equality-instead-of-equity; we use an amortized approach to address these. We show that our correction method can consistently compensate for the negative impact of group membership bias on ranking quality and fairness metrics.

Optimizing Learning-to-Rank Models for Ex-Post Fair Relevance

Sruthi Gorantla
Eshaan Bhansali
Amit Deshpande
Anand Louis

Learning-to-rank (LTR) models rank items based on specific features, aiming to maximize ranking utility by prioritizing highly relevant items. However, optimizing only for ranking utility can lead to representational harm and may fail to address implicit bias in relevance scores. Prior studies introduced algorithms to train stochastic ranking models, such as the Plackett-Luce ranking model, that maximize expected ranking utility while achieving fairness in expectation (ex-ante fairness). Still, every sampled ranking may not satisfy group fairness (ex-post fairness). Post-processing methods ensure ex-post fairness; however, the LTR model lacks awareness of this step, creating a mismatch between the objective function the LTR model optimizes and the one it is supposed to optimize. In this paper, we first propose a novel objective where the relevance (or the expected ranking utility) is computed over only those rankings that satisfy given representation constraints for groups of items. We call this the ex-post fair relevance. We then give a framework for training Group-Fair LTR models to maximize our proposed ranking objective.

Leveraging an efficient sampler for ex-post group-fair rankings and efficient algorithms to train the Plackett-Luce LTR model, we demonstrate their use in training the Group-Fair Plackett-Luce model in our framework. Experiments on MovieLens and Kiva datasets reveal improved fairness and relevance with our group-fair Plackett-Luce model compared to post-processing. In scenarios with implicit bias, our algorithm generally outperforms existing LTR baselines in both fairness and relevance.
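
The distinction between ex-ante and ex-post fairness can be illustrated with a small sketch. It samples rankings from a Plackett-Luce model and keeps only those whose top-k satisfies a representation constraint. The abstract refers to an efficient sampler; the naive rejection sampling below is a stand-in chosen for brevity, not the authors' method. All function names, the toy scores, and the group labels are hypothetical.

```python
import math
import random


def sample_plackett_luce(scores, rng):
    """Sample a full ranking: at each position, pick a remaining item
    with probability proportional to exp(score)."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def satisfies_representation(ranking, groups, k, min_per_group):
    """Ex-post check: the top-k holds at least min_per_group[g] items of each group g."""
    counts = {}
    for item in ranking[:k]:
        counts[groups[item]] = counts.get(groups[item], 0) + 1
    return all(counts.get(g, 0) >= need for g, need in min_per_group.items())


def sample_ex_post_fair(scores, groups, k, min_per_group, rng, max_tries=10000):
    """Naive rejection sampling (an assumption, not the paper's sampler):
    redraw until the sampled ranking meets the constraint."""
    for _ in range(max_tries):
        ranking = sample_plackett_luce(scores, rng)
        if satisfies_representation(ranking, groups, k, min_per_group):
            return ranking
    raise RuntimeError("no ranking satisfying the constraint within max_tries")


# Hypothetical toy data: four items in two groups.
scores = [2.0, 1.5, 1.0, 0.5]
groups = ["a", "a", "b", "b"]
constraint = {"a": 1, "b": 1}
fair_ranking = sample_ex_post_fair(scores, groups, 2, constraint, random.Random(0))
```

Every ranking returned by `sample_ex_post_fair` satisfies the constraint, whereas a plain Plackett-Luce sample may place two items of the same group in the top two positions.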

Unbiased Learning-to-Rank Needs Unconfounded Propensity Estimation

Dan Luo
Lixin Zou
Qingyao Ai
Zhiyu Chen
Chenliang Li
Dawei Yin
Brian D. Davison

The logs of the use of a search engine provide sufficient data to train a better ranker. However, it is well known that such implicit feedback reflects biases, and in particular a presentation bias that favors higher-ranked results. Unbiased Learning-to-Rank (ULTR) methods attempt to optimize performance by jointly modeling this bias along with the ranker so that the bias can be removed. Such methods have been shown to provide theoretical soundness, and promise superior performance and low deployment costs. However, existing ULTR methods don't recognize that query-document relevance is a confounder -- it affects both the likelihood of a result being clicked because of relevance and the likelihood of the result being ranked high by the base ranker. Moreover, the performance guarantees of existing ULTR methods assume the use of a weak ranker -- one that does a poor job of ranking documents based on relevance to a query. In practice, of course, commercial search engines use highly tuned rankers, and desire to improve upon them using the implicit judgments in search logs. This results in a significant correlation between position and relevance, which leads existing ULTR methods to overestimate click propensities in highly ranked results, reducing ULTR's effectiveness. This paper is the first to demonstrate the problem of propensity overestimation by ULTR algorithms, based on a causal analysis. We develop a new learning objective based on a backdoor adjustment. In addition, we introduce the Logging-Policy-aware Propensity (LPP) model that can jointly learn LPP and a more accurate ranker. We extensively test our approach on two public benchmark tasks and show that our proposal is effective, practical and significantly outperforms the state of the art.
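
To see why propensity overestimation matters, consider a minimal sketch under the standard position-based click model, where a click requires both examination and relevance. It uses plain inverse propensity weighting rather than the LPP model or the backdoor-adjusted objective described in the abstract. The function names and the numeric values are assumptions for illustration.

```python
def expected_click_rate(relevance, propensity):
    """Position-based model: P(click) = P(examined) * P(relevant)."""
    return propensity * relevance


def ipw_estimate(click_rate, assumed_propensity):
    """Inverse propensity weighting: divide the click rate by the assumed
    examination propensity to estimate relevance."""
    return click_rate / assumed_propensity


# Hypothetical values for a highly ranked result.
true_relevance = 0.6
true_propensity = 0.5
click_rate = expected_click_rate(true_relevance, true_propensity)

# With the true propensity, the estimate recovers the relevance.
unbiased = ipw_estimate(click_rate, true_propensity)
# With an overestimated propensity, the relevance is underestimated.
underestimated = ipw_estimate(click_rate, 0.8)
```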

Unbiased Learning to Rank Meets Reality: Lessons from Baidu's Large-Scale Search Dataset

Philipp Hager
Romain Deffayet
Jean-Michel Renders
Onno Zoeter
Maarten de Rijke

Unbiased learning-to-rank (ULTR) is a well-established framework for learning from user clicks, which are often biased by the ranker collecting the data. While theoretically justified and extensively tested in simulation, ULTR techniques lack empirical validation, especially on modern search engines. The Baidu-ULTR dataset released for the WSDM Cup 2023, collected from Baidu's search engine, offers a rare opportunity to assess the real-world performance of prominent ULTR techniques. Despite multiple submissions during the WSDM Cup 2023 and the subsequent NTCIR ULTRE-2 task, it remains unclear whether the observed improvements stem from applying ULTR or other learning techniques.

In this work, we revisit and extend the available experiments on the Baidu-ULTR dataset. We find that standard unbiased learning-to-rank techniques robustly improve click predictions but struggle to consistently improve ranking performance, especially considering the stark differences obtained by choice of ranking loss and query-document features. Our experiments reveal that gains in click prediction do not necessarily translate to enhanced ranking performance on expert relevance annotations, implying that conclusions strongly depend on how success is measured in this benchmark.

SESSION: Session: Sequential Recommendation

Scenario-Adaptive Fine-Grained Personalization Network: Tailoring User Behavior Representation to the Scenario Context

Moyu Zhang
Yongxiang Tang
Jinxin Hu
Yu Zhang

As e-commerce has evolved, commercial platforms accommodate various scenarios to cater to the diverse shopping preferences of users. To conserve resources, current methods utilize a unified framework to deliver personalized recommendations across various scenarios. Given the overlap of users and items in multiple scenarios, current methods typically employ shared bottom representations, capturing similarities and differences between scenarios through adaptive adjustments. However, they adjust representations adaptively after aggregating user behavior sequences. This coarse-grained approach to re-weighting the entire user sequence hampers the model's ability to model the user interest migration across different scenarios. To enhance the model's capacity to capture user interests across scenarios, we develop a ranking framework named the Scenario-Adaptive Fine-Grained Personalization Network (SFPNet), which designs a fine-grained method for multi-scenario personalized recommendations. Specifically, SFPNet comprises a series of blocks, stacked sequentially. Each block initially deploys a parameter personalization unit to integrate scenario information into fundamental features at a coarse-grained level, where adjusted feature representations will serve as context information. By employing residual connection, we incorporate the context into the representation of each historical behavior, allowing for context-aware fine-grained customization of the behavior representations at the scenario-level, which supports scenario-aware user interest modeling. Ultimately, the effectiveness of our method is strongly substantiated by extensive experiments and online A/B testing.

Scaling Sequential Recommendation Models with Transformers

Pablo Zivic
Hernan Vazquez
Jorge Sánchez

Modeling user preferences has been mainly addressed by looking at users' interaction history with the different elements available in the system. Tailoring content to individual preferences based on historical data is the main goal of sequential recommendation. The nature of the problem, as well as the good performance observed across various domains, has motivated the use of the transformer architecture, which has proven effective in leveraging increasingly larger amounts of training data when accompanied by an increase in the number of model parameters. This scaling behavior has brought a great deal of attention, as it provides valuable guidance in the design and training of even larger models. Taking inspiration from the scaling laws observed in training large language models, we explore similar principles for sequential recommendation. Addressing scalability in this context requires special considerations as some particularities of the problem depart from the language modeling case. These particularities originate in the nature of the content catalogs, which are significantly larger than the vocabularies used for language and might change over time. In our case, we start from a well-known transformer-based model from the literature and make two crucial modifications. First, we pivot from the traditional representation of catalog items as trainable embeddings to representations computed with a trainable feature extractor, making the parameter count independent of the number of items in the catalog. Second, we propose a contrastive learning formulation that provides us with a better representation of the catalog diversity. We demonstrate that, under this setting, we can train our models effectively on increasingly larger datasets under a common experimental setup. We use the full Amazon Product Data dataset, which has only been partially explored in other studies, and reveal scaling behaviors similar to those found in language models. Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application. We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains. Our approach and findings provide a strategic roadmap for model training and deployment in real high-dimensional preference spaces, facilitating better training and inference efficiency. We hope this paper bridges the gap between the potential of transformers and the intrinsic complexities of high-dimensional sequential recommendation in real-world recommender systems. Code and models can be found at https://github.com/mercadolibre/srt.

A Generic Behavior-Aware Data Augmentation Framework for Sequential Recommendation

Jing Xiao
Weike Pan
Zhong Ming

Multi-behavior sequential recommendation (MBSR), which models multi-behavior sequentiality and heterogeneity to better learn users' multifaceted intentions, has achieved remarkable success. Though effective, the performance of these approaches may be limited due to the sparsity inherent in real-world data. Existing data augmentation methods in recommender systems focus solely on a single type of behavior, overlooking the variations in expressing user preferences via different types of behaviors. During the augmentation of samples, it is easy to introduce excessive disturbance or noise, which may mislead the next-item recommendation. To address this limitation, we propose a novel generic framework called multi-behavior data augmentation for sequential recommendation (MBASR). Specifically, we design three behavior-aware data augmentation operations to construct rich training samples. Each augmentation operation takes into account the correlations between behaviors and aligns with the users' behavior patterns. In addition, we introduce a position-based sampling strategy that can effectively reduce the perturbation brought by the augmentation operations to the original data. Note that our model is data-oriented and can thus be embedded in different downstream MBSR models, so the overall framework is generic. Extensive experiments on three real-world datasets demonstrate the effectiveness of our MBASR and its applicability to a wide variety of mainstream MBSR models. Our source code is available at https://github.com/XiaoJing-C/MBASR.

CMCLRec: Cross-modal Contrastive Learning for User Cold-start Sequential Recommendation

Xiaolong Xu
Hongsheng Dong
Lianyong Qi
Xuyun Zhang
Haolong Xiang
Xiaoyu Xia
Yanwei Xu
Wanchun Dou

Sequential recommendation models generate embeddings for items through the analysis of historical user-item interactions and utilize the acquired embeddings to predict user preferences. Despite being effective in revealing personalized preferences for users, these models heavily rely on user-item interactions. However, due to the lack of interaction information, new users face challenges when utilizing sequential recommendation models for predictions, which is recognized as the cold-start problem. Recent studies, while addressing this problem within specific structures, often neglect the compatibility with existing sequential recommendation models, making seamless integration into existing models infeasible. To address this challenge, we propose CMCLRec, a Cross-Modal Contrastive Learning framework for user cold-start RECommendation. This approach aims to solve the user cold-start problem by customizing inputs for cold-start users that align with the requirements of sequential recommendation models in a cross-modal manner. Specifically, CMCLRec adopts cross-modal contrastive learning to construct a mapping from user features to user-item interactions based on warm user data. It then generates a simulated behavior sequence for each cold-start user in turn for recommendation purposes. In this way, CMCLRec is theoretically compatible with any extant sequential recommendation model. Comprehensive experiments conducted on real-world datasets substantiate that, compared with state-of-the-art baseline models, CMCLRec markedly enhances the performance of conventional sequential recommendation models, particularly for cold-start users.

FineRec: Exploring Fine-grained Sequential Recommendation

Xiaokun Zhang
Bo Xu
Youlin Wu
Yuan Zhong
Hongfei Lin
Fenglong Ma

Sequential recommendation is dedicated to offering items of interest for users based on their historical behaviors. The attribute-opinion pairs, expressed by users in their reviews for items, offer the potential to capture user preferences and item characteristics at a fine-grained level. To this end, we propose a novel framework FineRec that explores the attribute-opinion pairs of reviews to finely handle sequential recommendation. Specifically, we utilize a large language model to extract attribute-opinion pairs from reviews. For each attribute, a unique attribute-specific user-opinion-item graph is created, where corresponding opinions serve as the edges linking heterogeneous user and item nodes. Afterwards, we devise a diversity-aware convolution operation to aggregate information within the graphs, enabling attribute-specific user and item representation learning. Ultimately, we present an interaction-driven fusion mechanism to integrate attribute-specific user/item representations across all attributes for generating recommendations. Extensive experiments conducted on several real-world datasets demonstrate the superiority of our FineRec over existing state-of-the-art methods. Further analysis also verifies the effectiveness of our fine-grained manner in handling the task.

SelfGNN: Self-Supervised Graph Neural Networks for Sequential Recommendation

Yuxi Liu
Lianghao Xia
Chao Huang

Sequential recommendation effectively addresses information overload by modeling users' temporal and sequential interaction patterns. To overcome the limitations of supervision signals, recent approaches have adopted self-supervised learning techniques in recommender systems. However, there are still two critical challenges that remain unsolved. Firstly, existing sequential models primarily focus on long-term modeling of individual interaction sequences, overlooking the valuable short-term collaborative relationships among the behaviors of different users. Secondly, real-world data often contain noise, particularly in users' short-term behaviors, which can arise from temporary intents or misclicks. Such noise negatively impacts the accuracy of both graph and sequence models, further complicating the modeling process. To address these challenges, we propose a novel framework called Self-Supervised Graph Neural Network (SelfGNN) for sequential recommendation. The SelfGNN framework encodes short-term graphs based on time intervals and utilizes Graph Neural Networks (GNNs) to learn short-term collaborative relationships. It captures long-term user and item representations at multiple granularity levels through interval fusion and dynamic behavior modeling. Importantly, our personalized self-augmented learning structure enhances model robustness by mitigating noise in short-term graphs based on long-term user interests and personal stability. Extensive experiments conducted on four real-world datasets demonstrate that SelfGNN outperforms various state-of-the-art baselines. Our model implementation codes are available at https://github.com/HKUDS/SelfGNN.

EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention

Zhen Tian
Wayne Xin Zhao
Changwang Zhang
Xin Zhao
Zhongrui Ma
Ji-Rong Wen

To capture user preference, transformer models have been widely applied to model sequential user behavior data. The core of transformer architecture lies in the self-attention mechanism, which computes the pairwise attention scores in a sequence. Due to the permutation-equivariant nature, positional encoding is used to enhance the attention between token representations. In this setting, the pairwise attention scores can be derived by both semantic difference and positional difference. However, prior studies often model the two kinds of difference measurements in different ways, which potentially limits the expressive capacity of sequence modeling.

To address this issue, this paper proposes a novel transformer variant with complex vector attention, named EulerFormer, which provides a unified theoretical framework to formulate both semantic difference and positional difference. The EulerFormer involves two key technical improvements. First, it employs a new transformation function for efficiently transforming the sequence tokens into polar-form complex vectors using Euler's formula, enabling the unified modeling of both semantic and positional information in a complex rotation form. Second, it develops a differential rotation mechanism, where the semantic rotation angles can be controlled by an adaptation function, enabling the adaptive integration of the semantic and positional information according to the semantic contexts. Furthermore, a phase contrastive learning task is proposed to improve the isotropy of contextual representations in EulerFormer. Our theoretical framework possesses a high degree of completeness and generality (e.g., RoPE can be instantiated as a special case). It is more robust to semantic variations and possesses superior theoretical properties (e.g., long-term decay) in principle. Extensive experiments conducted on four public datasets demonstrate the effectiveness and efficiency of our approach. Our code is available at https://github.com/RUCAIBox/EulerFormer.
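
The complex rotation form mentioned above can be illustrated with the RoPE-style special case. Multiplying each complex coordinate by exp(i * position * frequency) makes the attention score depend only on the relative offset between two positions. This sketch does not implement EulerFormer's adaptive semantic rotation or its phase contrastive task. The function names, frequencies, and toy vectors are assumptions.

```python
import cmath


def rotate(vec, position, freqs):
    """Rotate each complex coordinate by the angle position * freq,
    i.e. multiply by exp(i * position * freq) (Euler's formula)."""
    return [z * cmath.exp(1j * position * f) for z, f in zip(vec, freqs)]


def attention_score(q, k):
    """Real part of the Hermitian inner product sum_j q_j * conj(k_j)."""
    return sum(a * b.conjugate() for a, b in zip(q, k)).real


# Hypothetical two-dimensional complex query and key.
freqs = [1.0, 0.1]
q = [complex(0.3, -1.2), complex(0.7, 0.4)]
k = [complex(-0.5, 0.9), complex(1.1, 0.2)]

# Both pairs of positions have the same relative offset of 3.
score_a = attention_score(rotate(q, 5, freqs), rotate(k, 2, freqs))
score_b = attention_score(rotate(q, 13, freqs), rotate(k, 10, freqs))
# A different offset (0) generally gives a different score.
score_c = attention_score(rotate(q, 4, freqs), rotate(k, 4, freqs))
```

Here `score_a` and `score_b` agree up to floating-point error because the position-dependent phases cancel to exp(i * (m - n) * freq) in the product q_j * conj(k_j).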

SESSION: Session: Networks and Graphs

Bootstrap Deep Metric for Seed Expansion in Attributed Networks

Chunquan Liang
Yifan Wang
Qiankun Chen
Xinyuan Feng
Luyue Wang
Mei Li
Hongming Zhang

Seed expansion tasks play an important role in various network applications such as recommendation systems, social network analysis, and bioinformatics. Given a network and a small group of examples as seeds, these tasks involve identifying additional members of interest from the same community. While most existing expansion methods focus on defining a fixed metric function based on the network structure alone, they often overlook the rich content associated with nodes in attributed networks.

In this paper, we bridge the gap by learning a deep metric that takes into account both the network structure and node attributes, and by utilizing the recent advanced graph neural networks as encoding functions. The key challenge lies in the extreme scarcity of given positive examples (i.e., the seed nodes) in real-world applications and the absence of negatives (i.e., non-members of the target community). We introduce Bootstrap Deep Metric (BDM), a graph deep metric learning framework for seed expansion problems. BDM utilizes previous versions of representations to generate anchors for positive and unlabeled nodes, and learns enhanced node representations by minimizing the metric losses on both positive and unlabeled nodes. It eliminates the need for negative nodes, while producing closely aligned representations for members of target community and uniformly distributed representations for non-members, which effectively aid in selecting expansion nodes. Experimental results on real-life datasets show that our BDM not only substantially outperforms state-of-the-art approaches but also remarkably surpasses fully labeled classification models in most cases. Codes are available at https://github.com/wangyfnwsuaf/bdm.

Grand: A Fast and Accurate Graph Retrieval Framework via Knowledge Distillation

Lin Lan
Pinghui Wang
Rui Shi
Tingqing Liu
Juxiang Zeng
Feiyang Sun
Yang Ren
Jing Tao
Xiaohong Guan

Graph retrieval aims to find the most similar graphs in a graph database given a query graph, which is a fundamental problem with many real-world applications in chemical engineering, code analysis, etc. To date, existing neural graph retrieval methods generally fall into two categories: Embedding Based Paradigm (Ebp) and Matching Based Paradigm (Mbp). The Ebp models learn an individual vectorial representation for each graph and the retrieval process can be accelerated by pre-computing these representations. The Mbp models learn a neural matching function to compare graphs on a pair-by-pair basis, in which the fine-grained pairwise comparison leads to higher retrieval accuracy but severely degrades retrieval efficiency. In this paper, to combine the advantage of Ebp in retrieval efficiency with that of Mbp in retrieval accuracy, we propose a novel Graph RetrievAl framework via KNowledge Distillation, namely GRAND. The key point is to leverage the idea of knowledge distillation to transfer the fine-grained graph comparison knowledge from an Mbp model to an Ebp model, such that the Ebp model can generate better graph representations and thus yield higher retrieval accuracy. At the same time, we can still pre-compute and index the improved graph representations to retain the retrieval speed of Ebp. Towards this end, we propose to perform knowledge distillation from three perspectives: score, node, and subgraph levels. In addition, we propose to perform mutual two-way knowledge transfer between Mbp and Ebp, such that Mbp and Ebp complement and benefit each other. Extensive experiments on three real-world datasets show that GRAND improves the performance of Ebp by a large margin and the improvement is consistent for different combinations of Ebp and Mbp models. For example, GRAND achieves performance gains of mostly more than 10% and up to 16.88% in terms of Recall@K on different datasets.
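As a rough illustration of score-level distillation from a matching-based teacher to an embedding-based student, the sketch below compares the two models' score distributions over one query's candidate list. The abstract does not state the loss; KL divergence over softmax-normalized scores is a common distillation choice and is an assumption here, as are the function names and the `temperature` parameter.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate graphs of a single query."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy similarity scores for three candidate graphs.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.3, 0.2]
loss = score_distillation_loss(teacher, student)
```

Minimizing this quantity pushes the student's ranking of candidates toward the teacher's while the student's graph embeddings can still be pre-computed and indexed.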

Intent Distribution based Bipartite Graph Representation Learning

Haojie Li
Wei Wei
Guanfeng Liu
Jinhuan Liu
Feng Jiang
Junwei Du

Bipartite graph representation learning embeds users and items into a low-dimensional latent space based on observed interactions. Previous studies mainly fall into two categories: one reconstructs the structural relations of the graph through the representations of nodes, while the other aggregates neighboring node information using graph neural networks. However, existing methods only explore the local structural information of nodes during the learning process. This makes it difficult to represent the macroscopic structural information and leaves it easily affected by data sparsity and noise. To address this issue, we propose the Intent Distribution based Bipartite graph Representation learning (IDBR) model, which explicitly integrates node intent distribution information into the representation learning process. Specifically, we obtain node intent distributions through clustering and design an intent distribution based graph convolution neural network to generate node representations. Compared to traditional methods, we expand the scope of node representations, enabling us to obtain more comprehensive representations of global intent. When constructing the intent distributions, we effectively alleviated the issues of data sparsity and noise. Additionally, we enrich the representations of nodes by integrating potential neighboring nodes from both structural and semantic dimensions. Experiments on the link prediction and recommendation tasks illustrate that the proposed approach outperforms existing state-of-the-art methods. The code of IDBR is available at https://github.com/rookitkitlee/IDBR.

TGOnline: Enhancing Temporal Graph Learning with Adaptive Online Meta-Learning

Ruijie Wang
Jingyuan Huang
Yutong Zhang
Jinyang Li
Yufeng Wang
Wanyu Zhao
Shengzhong Liu
Charith Mendis
Tarek Abdelzaher

Temporal graphs, depicting time-evolving node connections through temporal edges, are extensively utilized in domains where temporal connection patterns are essential, such as recommender systems, financial networks, healthcare, and sensor networks. Despite recent advancements in temporal graph representation learning, performance degradation occurs with periodic collections of new temporal edges, owing to their dynamic nature and newly emerging information. This paper investigates online representation learning on temporal graphs, aiming for efficient updates of temporal models to sustain predictive performance during deployment. Unlike costly retraining or exclusive fine-tuning susceptible to catastrophic forgetting, our approach aims to distill information from previous model parameters and adapt it to newly gathered data. To this end, we propose TGOnline, an adaptive online meta-learning framework, tackling two key challenges. First, to distill valuable knowledge from complex temporal parameters, we establish an optimization objective that determines new parameters, either by leveraging global ones or by placing greater reliance on new data, where global parameters are meta-trained across various data collection periods to enhance temporal generalization. Second, to accelerate the online distillation process, we introduce an edge reduction mechanism that skips new edges lacking additional information and a node deduplication mechanism to prevent redundant computation within training batches on new data. Extensive experiments on four real-world temporal graphs demonstrate the effectiveness and efficiency of TGOnline for online representation learning, outperforming 18 state-of-the-art baselines. Notably, TGOnline not only outperforms the commonly utilized retraining strategy but also achieves a significant speedup of ~30x.

A Dual-Embedding Based DQN for Worker Recruitment in Spatial Crowdsourcing with Social Network

Yucen Gao
Wei Liu
Jianxiong Guo
Xiaofeng Gao
Guihai Chen

Spatial Crowdsourcing (SC) is a promising service that incentivizes workers to finish location-based tasks with high quality by providing rewards. Worker recruitment is a core issue in SC, for which most state-of-the-art algorithms focus on designing incentive mechanisms based on the existing SC worker pool. However, they may fail when the number of SC workers is not enough, especially for new SC platforms. In recent years, social networks have been found to be helpful for worker recruitment by selecting seed workers to spread the task information so as to inspire more social users to participate, but how to select seed workers remains a challenge. Existing methods typically require numerous iterative searches, leading to inefficiency in facing the big picture and failing to cope with dynamic environments.

In this paper, we formulate the Effective Coverage Maximization (ECM) problem. We prove that the ECM problem is NP-hard and propose a novel worker recruitment method, called DQNSelector, that combines dual embedding with a Rainbow Deep Q-network (DQN). The dual embedding extracts long-range social influence information from the social network using the inner-product method, and near-range coverage quality information from the geographic information map using our proposed efficient Path Increment Iterative Calculation (PIIC) algorithm. We then combine the dual embedding to design a Rainbow DQN-based reinforcement learning model that selects seed workers. Extensive experiments and ablation studies based on real-world datasets verify the superiority of DQNSelector.

Scalable Community Search over Large-scale Graphs based on Graph Transformer

Yuxiang Wang
Xiaoxuan Gou
Xiaoliang Xu
Yuxia Geng
Xiangyu Ke
Tianxing Wu
Zhiyuan Yu
Runhuai Chen
Xiangying Wu

Given a graph G and a query node q, community search (CS) aims to find a structurally cohesive subgraph from G that contains q. CS is widely used in many real-world applications, such as online recommendation and expert finding. Recently, the rise of learning-based CS methods has garnered extensive research interest, showcasing the promising potential of neural solutions. However, there remains room for optimization: (1) They initialize node features via classical methods, e.g., one-hot, random, and position encoding, which may fall short in capturing valuable community cohesiveness-related features. (2) The reliance on GCN or GCN-like models poses challenges in scaling to large graphs. (3) Existing methods do not adapt well to dynamic graphs, often requiring retraining from scratch. To address these issues, we present CSFormer, a scalable CS method based on Graph Transformer. First, we present a novel l-hop neighborhood community vector based on n-order h-index to represent each node's community features, generating a sequence of feature vectors by varying the neighborhood scope l. Then, we build a Transformer backbone to learn a good graph embedding that carries rich community features, based on which we perform a prediction-filtering-based online CS to efficiently return a community of q. We extend CSFormer to dynamic graphs and various community models. Extensive experiments on seven real-world graphs show our solution's superiority in effectiveness, e.g., we attain an average improvement of 20.6% in F1-score compared to the latest competitors.
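The n-order h-index mentioned above can be sketched from its standard definition: the 0-order value of a node is its degree, and each further order applies the h-index operator to the previous-order values of the node's neighbors. How CSFormer assembles these values into an l-hop community vector is not specified in the abstract, so the code below shows only this building block; the names `h_operator` and `n_order_h_index` and the adjacency-dict input format are assumptions.

```python
def h_operator(values):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    for rank, v in enumerate(sorted(values, reverse=True), start=1):
        if v >= rank:
            h = rank
        else:
            break
    return h

def n_order_h_index(adj, n):
    """adj maps each node to the set of its neighbors; order 0 is the degree."""
    current = {u: len(nbrs) for u, nbrs in adj.items()}
    for _ in range(n):
        current = {u: h_operator([current[v] for v in adj[u]]) for u in adj}
    return current

# Toy graph: a triangle a-b-c with a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
order0 = n_order_h_index(adj, 0)
order1 = n_order_h_index(adj, 1)
```

On this toy graph the first-order values already separate the triangle from the pendant node, which hints at why such features are informative about cohesiveness.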

Efficient Community Search Based on Relaxed k-Truss Index

Xiaoqin Xie
Shuangyuan Liu
Jiaqi Zhang
Shuai Han
Wei Wang
Wu Yang

Communities are prevalent in large graphs such as social networks and protein networks. Community search aims to find a cohesive subgraph that contains the query nodes. Existing community search algorithms often adopt community models to find target communities, and the k-truss model is a popular one that provides structural constraints. However, the structural constraints imposed by k-truss are so tight that the search algorithm often cannot find the target communities. There always exist some subgraphs that may not conform to the k-truss structure but do have cohesive characteristics that meet users' personalized requirements. Moreover, k-truss based community search algorithms cannot meet users' real-time demands on large graphs. To address the above problems, this paper proposes the relaxed k-truss community search problem for the first time. We then construct a relaxed k-truss index, which helps to find cohesive communities in linear time and provides flexible searching for nested communities. We also design an index maintenance algorithm to dynamically update the index. Furthermore, a community search algorithm based on the relaxed k-truss index is presented. Extensive experimental results on real datasets demonstrate the effectiveness and efficiency of our model and algorithms.
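For context on why the constraint is tight, here is a minimal sketch of the classical (non-relaxed) k-truss: the maximal subgraph in which every edge lies in at least k - 2 triangles. It uses simple repeated peeling rather than an index, and it is not the relaxed model or the index proposed in the paper; the function name and edge-list input are assumptions.

```python
def k_truss(edges, k):
    """Return the edges of the k-truss as a set of frozensets, by repeated peeling."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Support of (u, v) = number of common neighbors = triangles through the edge.
                if len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {frozenset((u, v)) for u in adj for v in adj[u]}

# Toy graph: a 4-clique on nodes 1..4 plus a pendant edge (4, 5).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
truss4 = k_truss(edges, 4)
truss5 = k_truss(edges, 5)
```

In the toy graph the pendant edge is dropped at k = 4 and nothing survives at k = 5, which shows how a single missing triangle excludes otherwise well-connected nodes.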

SESSION: Session: Privacy, Security and Federated Learning

Untargeted Adversarial Attack on Knowledge Graph Embeddings

Tianzhe Zhao
Jiaoyan Chen
Yanchi Ru
Qika Lin
Yuxia Geng
Jun Liu

Knowledge graph embedding (KGE) methods have achieved great success in handling various knowledge graph (KG) downstream tasks. However, KGE methods may learn biased representations on low-quality KGs that are prevalent in the real world. Some recent studies propose adversarial attacks to investigate the vulnerabilities of KGE methods, but their attackers are target-oriented, with the KGE method and the target triples to predict given in advance, which lacks practicability. In this work, we explore untargeted attacks with the aim of reducing the global performance of KGE methods over a set of unknown test triples and conducting systematic analyses of KGE robustness. Considering that logic rules can effectively summarize the global structure of a KG, we develop rule-based attack strategies to enhance the attack efficiency. In particular, we consider adversarial deletion, which learns rules, applies them to score triple importance, and deletes important triples, and adversarial addition, which corrupts the learned rules and applies them to generate negative triples as perturbations. Extensive experiments on two datasets over three representative classes of KGE methods demonstrate the effectiveness of our proposed untargeted attacks in diminishing the link prediction results. We also find that different KGE methods exhibit different robustness to untargeted attacks. For example, the robustness of methods engaged with graph neural networks and logic rules depends on the density of the graph. However, rule-based methods like NCRL are easily affected by adversarial addition attacks to capture negative rules.

Poisoning Decentralized Collaborative Recommender System and Its Countermeasures

Ruiqi Zheng
Liang Qu
Tong Chen
Kai Zheng
Yuhui Shi
Hongzhi Yin

To make room for privacy and efficiency, the deployment of many recommender systems is experiencing a shift from central servers to personal devices, where the federated recommender systems (FedRecs) and decentralized collaborative recommender systems (DecRecs) are arguably the two most representative paradigms. While both leverage knowledge (e.g., gradients) sharing to facilitate learning local models, FedRecs rely on a central server to coordinate the optimization process, yet in DecRecs, the knowledge sharing directly happens between clients. On the flip side, knowledge sharing also opens a backdoor for model poisoning attacks, where adversaries disguise themselves as benign clients and disseminate polluted knowledge to achieve malicious goals like promoting an item's exposure rate. Although research on such poisoning attacks provides valuable insights into finding security loopholes and corresponding countermeasures, existing attacks mostly focus on FedRecs, and are either inapplicable or ineffective for DecRecs. Compared with FedRecs where the tampered information can be universally distributed to all clients once uploaded to the cloud, each adversary in DecRecs can only communicate with neighbor clients of a small size, confining its impact to a limited range.

To fill the gap, we present a novel attack method named Poisoning with Adaptive Malicious Neighbors (PAMN). With item promotion in top-K recommendation as the attack objective, PAMN effectively boosts target items' ranks with several adversaries that emulate benign clients (i.e., users) and transfers adaptively crafted gradients conditioned on each adversary's neighbors. A diversity-driven regularizer is further designed in PAMN to allow the adversaries to reach a broader group of multifaceted benign users. Moreover, with the vulnerabilities of DecRecs uncovered, a dedicated defensive mechanism based on user-level gradient clipping with sparsified updating is proposed. Extensive experiments demonstrate the effectiveness of the poisoning attack and the robustness of our defensive mechanism.
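The defensive idea of gradient clipping with sparsified updating can be sketched generically as follows. The abstract names the mechanism but not its details, so the L2-norm clipping rule, the top-k sparsification, and the parameters `max_norm` and `keep` are all assumptions for illustration rather than the authors' procedure.

```python
import math

def clip_and_sparsify(grad, max_norm, keep):
    """Scale grad to L2 norm <= max_norm, then zero all but the `keep` largest-magnitude entries."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    # Indices of the `keep` entries with the largest absolute value.
    top = sorted(range(len(clipped)), key=lambda i: abs(clipped[i]), reverse=True)[:keep]
    kept = set(top)
    return [g if i in kept else 0.0 for i, g in enumerate(clipped)]

# A neighbor shares an unusually large gradient; bound and thin it before applying.
shared = clip_and_sparsify([3.0, 4.0, 0.0], max_norm=1.0, keep=1)
```

Clipping bounds how far any single neighbor can move the local model, and sparsification limits how many parameters one received update can touch.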

Revisit Targeted Model Poisoning on Federated Recommendation: Optimize via Multi-objective Transport

Jiajie Su
Chaochao Chen
Weiming Liu
Zibin Lin
Shuheng Shen
Weiqiang Wang
Xiaolin Zheng

Federated Recommendation (FedRec) is popularly investigated in personalized recommenders for preserving user privacy. However, due to the distributed training paradigm, FedRec is vulnerable to model poisoning attacks. In this paper, we focus on the targeted model poisoning attack against FedRec, which aims at effectively attacking the FedRec via uploading poisoned gradients to raise the exposure ratio of a multi-target item set. Previous attack methods excel with fewer target items but suffer performance decline as the number of target items increases, which reveals two perennially neglected issues: (i) The simple promotion of prediction scores without considering intrinsic collaborations between users and items is ineffective in multi-target cases. (ii) Target items are heterogeneous, which requires discriminative attacking users and strategies for different targets. To address these issues, we propose a novel Heterogeneous Multi-target Transfer Attack framework named HMTA, which consists of two stages, i.e., (1) diverse user agent generation and (2) optimal multi-target transport attack. The former stage leverages collaboration-aware manifold learning to extract latent associations among users and items, and develops a differentiable contrastive sorting to generate user agents from both difficulty and diversity scale. The latter stage conducts poisoning in a fine-grained and distinguishing way, which first completes distribution mapping from target items to generated user agents and then achieves a hybrid multi-target attack. Extensive experiments on benchmark datasets demonstrate the effectiveness of HMTA.

LoRec: Combating Poisons with Large Language Model for Robust Sequential Recommendation

Kaike Zhang
Qi Cao
Yunfan Wu
Fei Sun
Huawei Shen
Xueqi Cheng

Sequential recommender systems stand out for their ability to capture users' dynamic interests and the patterns of item transitions. However, the inherent openness of sequential recommender systems renders them vulnerable to poisoning attacks, where fraudsters are injected into the training data to manipulate learned patterns. Traditional defense methods predominantly depend on predefined assumptions or rules extracted from specific known attacks, limiting their generalizability to unknown attacks. To solve the above problems, considering the rich open-world knowledge encapsulated in Large Language Models (LLMs), we attempt to introduce LLMs into defense methods to broaden the knowledge beyond limited known attacks. We propose LoRec, an innovative framework that employs LLM-Enhanced Calibration to strengthen the robustness of sequential Recommender systems against poisoning attacks. LoRec integrates an LLM-enhanced CalibraTor (LCT) that refines the training process of sequential recommender systems with knowledge derived from LLMs, applying a user-wise reweighting to diminish the impact of attacks. Incorporating LLMs' open-world knowledge, the LCT effectively converts the limited, specific priors or rules into a more general pattern of fraudsters, offering improved defenses against poisons. Our comprehensive experiments validate that LoRec, as a general framework, significantly strengthens the robustness of sequential recommender systems.

Improving the Accuracy of Locally Differentially Private Community Detection by Order-consistent Data Perturbation

Taolin Guo
Shunshun Peng
Zhejian Zhang
Mengmeng Yang
Kwok-Yan Lam

Community detection refers to mechanisms that aim to identify groups of interacting nodes in a network according to the structural properties of the network. It has been used to analyze various graphs. In the context of social networks, it requires the collection of each user's social relations, posing the risk of user privacy intrusion caused by untrusted servers. Local differential privacy is a widely adopted approach for providing privacy protection while allowing acceptable utility of the protected data for analytics. There has been growing research interest in applying local differential privacy protection to community detection. However, such protection approaches typically suffer from poor accuracy due to the excessive noise in the protected data. This paper proposes LDP-Cd, a two-phase community detection framework under local differential privacy. LDP-Cd initializes the community groups using the Louvain community detection algorithm and iteratively refines the community in the second phase. Besides, we propose an order-consistent data perturbation method over the degree vector, thus ensuring the ordering consistency of the fitness between the user and community groups, thereby improving the accuracy of community detection. Experimental results on real datasets show that LDP-Cd has significant advantages over existing methods regarding community detection accuracy and a trade-off between user privacy and community detection utility.
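To make the noise problem concrete, the sketch below applies classical randomized response to a user's neighbor bit vector, a standard local differential privacy mechanism that protects each bit with budget epsilon. It is a baseline illustration only, not LDP-Cd's order-consistent perturbation of the degree vector; the function name and the injectable `rng` parameter are assumptions.

```python
import math
import random

def randomized_response(bits, epsilon, rng=random):
    """Keep each bit with probability e^eps / (1 + e^eps); otherwise flip it."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if rng.random() < p_keep else 1 - b for b in bits]

# One user's row of the adjacency matrix (1 = friend, 0 = not a friend).
neighbors = [1, 0, 0, 1, 0, 0, 0, 0]
noisy = randomized_response(neighbors, epsilon=1.0, rng=random.Random(0))
```

With a small epsilon the keep probability approaches one half, so in a sparse social graph most reported edges are flipped zeros rather than true relations, which is the accuracy loss the abstract refers to.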

Unmasking Privacy: A Reproduction and Evaluation Study of Obfuscation-based Perturbation Techniques for Collaborative Filtering

Alex Martinez
Mihnea Tufis
Ludovico Boratto

Recommender systems (RecSys) solve personalisation problems and therefore heavily rely on personal data - demographics, user preferences, user interactions - each bearing important privacy risks. It is also widely accepted that in RecSys performance and privacy are at odds, with the increase of one resulting in the decrease of the other. Among the diverse approaches in privacy enhancing technologies (PET) for RecSys, perturbation stands out for its simplicity and computational efficiency. It involves adding noise to sensitive data, thus hiding its real value from an untrusted actor. We reproduce and test a set of four randomization-based perturbation techniques developed by Batmaz and Polat for privacy-preserving collaborative filtering. While the framework presents great advantages - low computational requirements, several useful privacy-enhancing parameters - the supporting paper lacks conclusions drawn from empirical evaluation. We address this shortcoming by proposing - in the absence of an implementation by the authors - our own implementation of the obfuscation framework. We then develop an evaluation framework to test the main assumption of the reference paper - that RecSys privacy and performance are competing goals. We extend this study to understand how much we can enhance privacy within reasonable losses of RecSys performance. We reproduce and test the framework for the more realistic scenario where only implicit feedback is available, using two well-known datasets (MovieLens-1M and Last.fm-1K) and several state-of-the-art recommendation algorithms (NCF and LightGCN from the Microsoft Recommenders public repository).

ReFer: Retrieval-Enhanced Vertical Federated Recommendation for Full Set User Benefit

Wenjie Li
Zhongren Wang
Jinpeng Wang
Shu-Tao Xia
Jile Zhu
Mingjian Chen
Jiangke Fan
Jia Cheng
Jun Lei

As an emerging privacy-preserving approach to leveraging cross-platform user interactions, vertical federated learning (VFL) has been increasingly applied in recommender systems. However, vanilla VFL is only applicable to overlapped users: it ignores potential universal interest patterns hidden among non-overlapped users and benefits only a limited group of users, which hinders its application in real-world recommenders.

In this paper, we extend the traditional vertical federated recommendation problem (VFR) to a more realistic Fully-Vertical federated recommendation setting (Fully-VFR) which aims to utilize all available data and serve full user groups. To tackle challenges in implementing Fully-VFR, we propose a Retrieval-enhanced Vertical Federated recommender (ReFer), a groundbreaking initiative that explores retrieval-enhanced machine learning approaches in VFL. Specifically, we establish a general "retrieval-and-utilization" algorithm to enhance the quality of representations across all parties. We design a flexible federated retrieval augmentation (RA) mechanism for VFL: (i) Cross-RA to complement field missing and (ii) Local-RA to promote mutual understanding between user groups. We conduct extensive experiments on both public and industry datasets. Results on both sequential and non-sequential CTR prediction tasks demonstrate that our method achieves significant performance improvements over baselines and is beneficial for all user groups.

SESSION: Session: Prompts, Instructions and LLMs in Recommender Systems

GPT4Rec: Graph Prompt Tuning for Streaming Recommendation

Peiyan Zhang
Yuchen Yan
Xi Zhang
Liying Kang
Chaozhuo Li
Feiran Huang
Senzhang Wang
Sunghun Kim

In the realm of personalized recommender systems, the challenge of adapting to evolving user preferences and the continuous influx of new users and items is paramount. Conventional models, typically reliant on a static training-test approach, struggle to keep pace with these dynamic demands. Streaming recommendation, particularly through continual graph learning, has emerged as a novel solution, attracting significant attention in academia and industry. However, existing methods in this area either rely on historical data replay, which is increasingly impractical due to stringent data privacy regulations; or are unable to effectively address the over-stability issue; or depend on model-isolation and expansion strategies, which necessitate extensive model expansion and are hampered by time-consuming updates due to large parameter sets. To tackle these difficulties, we present GPT4Rec, a Graph Prompt Tuning method for streaming Recommendation. Given the evolving user-item interaction graph, GPT4Rec first disentangles the graph patterns into multiple views. After isolating specific interaction patterns and relationships in different views, GPT4Rec utilizes lightweight graph prompts to efficiently guide the model across varying interaction patterns within the user-item graph. Firstly, node-level prompts are employed to instruct the model to adapt to changes in the attributes or properties of individual nodes within the graph. Secondly, structure-level prompts guide the model in adapting to broader patterns of connectivity and relationships within the graph. Finally, view-level prompts are innovatively designed to facilitate the aggregation of information from multiple disentangled views. These prompt designs allow GPT4Rec to synthesize a comprehensive understanding of the graph, ensuring that all vital aspects of the user-item interactions are considered and effectively integrated. Experiments on four diverse real-world datasets demonstrate the effectiveness and efficiency of our proposal.

LLaRA: Large Language-Recommendation Assistant

Jiayi Liao
Sihang Li
Zhengyi Yang
Jiancan Wu
Yancheng Yuan
Xiang Wang
Xiangnan He

Sequential recommendation aims to predict users' next interaction with items based on their past engagement sequence. Recently, the advent of Large Language Models (LLMs) has sparked interest in leveraging them for sequential recommendation, viewing it as language modeling. Previous studies represent items within LLMs' input prompts as either ID indices or textual metadata. However, these approaches often fail to either encapsulate comprehensive world knowledge or exhibit sufficient behavioral understanding. To combine the complementary strengths of conventional recommenders in capturing behavioral patterns of users and LLMs in encoding world knowledge about items, we introduce Large Language-Recommendation Assistant (LLaRA). Specifically, it uses a novel hybrid prompting method that integrates ID-based item embeddings learned by traditional recommendation models with textual item features. Treating the "sequential behaviors of users" as a distinct modality beyond texts, we employ a projector to align the traditional recommender's ID embeddings with the LLM's input space. Moreover, rather than directly exposing the hybrid prompt to LLMs, a curriculum learning strategy is adopted to gradually ramp up training complexity. Initially, we warm up the LLM using text-only prompts, which better suit its inherent language modeling ability. Subsequently, we progressively transition to the hybrid prompts, training the model to seamlessly incorporate the behavioral knowledge from the traditional sequential recommender into the LLM. Empirical results validate the effectiveness of our proposed framework. Codes are available at https://github.com/ljy0ustc/LLaRA.

Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning

Yuyue Zhao
Jiancan Wu
Xiang Wang
Wei Tang
Dingxian Wang
Maarten de Rijke

Conventional recommender systems (RSs) face challenges in precisely capturing users' fine-grained preferences. Large language models (LLMs) have shown capabilities in commonsense reasoning and leveraging external tools that may help address these challenges. However, existing LLM-based RSs suffer from hallucinations, misalignment between the semantic space of items and the behavior space of users, or overly simplistic control strategies (e.g., whether to rank or directly present existing results). To bridge these gaps, we introduce ToolRec, a framework for LLM-empowered recommendations via tool learning that uses LLMs as surrogate users, thereby guiding the recommendation process and invoking external tools to generate a recommendation list that aligns closely with users' nuanced preferences.

We formulate the recommendation process as a process aimed at exploring user interests in attribute granularity. The process factors in the nuances of the context and user preferences. The LLM then invokes external tools based on a user's attribute instructions and probes different segments of the item pool. We consider two types of attribute-oriented tools: rank tools and retrieval tools. Through the integration of LLMs, ToolRec enables conventional recommender systems to become external tools with a natural language interface. Extensive experiments verify the effectiveness of ToolRec, particularly in scenarios that are rich in semantic content.

On Generative Agents in Recommendation

An Zhang
Yuxin Chen
Leheng Sheng
Xiang Wang
Tat-Seng Chua

Recommender systems are the cornerstone of today's information dissemination, yet a disconnect between offline metrics and online performance greatly hinders their development. Addressing this challenge, we envision a recommendation simulator, capitalizing on recent breakthroughs in human-level intelligence exhibited by Large Language Models (LLMs). We propose Agent4Rec, a user simulator in recommendation, leveraging LLM-empowered generative agents equipped with user profile, memory, and actions modules specifically tailored for the recommender system. In particular, these agents' profile modules are initialized using real-world datasets (e.g., MovieLens, Steam, Amazon-Book), capturing users' unique tastes and social traits; memory modules log both factual and emotional memories and are integrated with an emotion-driven reflection mechanism; action modules support a wide variety of behaviors, spanning both taste-driven and emotion-driven actions. Each agent interacts with personalized recommender models in a page-by-page manner, relying on a pre-implemented collaborative filtering-based recommendation algorithm. We delve into both the capabilities and limitations of Agent4Rec, aiming to explore an essential research question: "To what extent can LLM-empowered generative agents faithfully simulate the behavior of real, autonomous humans in recommender systems?" Extensive and multi-faceted evaluations of Agent4Rec highlight both the alignment and deviation between agents and user-personalized preferences. Beyond mere performance comparison, we explore insightful experiments, such as emulating the filter bubble effect and discovering the underlying causal relationships in recommendation tasks.

SESSION: Session: Dense Retrieval 2

Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval

Guangyuan Ma
Xing Wu
Zijia Lin
Songlin Hu

Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems. It generally utilizes additional Transformer decoder blocks to provide sustainable supervision signals and compress contextual information into dense representations. However, the underlying reasons for the effectiveness of such a pre-training technique remain unclear. The usage of additional Transformer-based decoders also incurs significant computational costs. In this study, we aim to shed light on this issue by revealing that masked auto-encoder (MAE) pre-training with enhanced decoding significantly improves the term coverage of input tokens in dense representations, compared to vanilla BERT checkpoints. Building upon this observation, we propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task. This modification enables the efficient compression of lexical signals into dense representations through unsupervised pre-training. Remarkably, our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters, which provides a 67% training speed-up compared to standard masked auto-encoder pre-training with enhanced decoding.
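A bag-of-word prediction objective of the kind described above can be sketched as follows: the dense representation is projected to vocabulary-sized logits, and the target is the set of token ids present in the input text. The abstract does not give the exact loss, so the averaged softmax cross-entropy over distinct input tokens used here is an assumption, as are the function names and the toy logits standing in for the projected representation.

```python
import math

def bow_target(token_ids, vocab_size):
    """Multi-hot vector marking which vocabulary entries occur in the input."""
    target = [0.0] * vocab_size
    for t in set(token_ids):
        target[t] = 1.0
    return target

def bow_loss(logits, token_ids):
    """Mean negative log-softmax of the logits over the distinct input tokens."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    distinct = set(token_ids)
    return -sum(logits[t] - log_z for t in distinct) / len(distinct)

# Toy example with a 5-word vocabulary; the input text contains tokens 1 and 3.
tokens = [1, 1, 3]
good_logits = [0.0, 5.0, 0.0, 5.0, 0.0]  # mass on the input tokens
bad_logits = [5.0, 0.0, 5.0, 0.0, 5.0]   # mass on absent tokens
```

Because the target ignores token order and needs no decoder layers, the only extra computation is the projection onto the vocabulary, which is consistent with the abstract's point about dropping the decoder.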

Generative Retrieval as Multi-Vector Dense Retrieval

Shiguang Wu
Wenda Wei
Mengqi Zhang
Zhumin Chen
Jun Ma
Zhaochun Ren
Maarten de Rijke
Pengjie Ren

For a given query, generative retrieval generates identifiers of relevant documents in an end-to-end manner using a sequence-to-sequence architecture. The relation between generative retrieval and other retrieval methods, especially those based on matching within dense retrieval models, is not yet fully comprehended. Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval. Accordingly, generative retrieval exhibits behavior analogous to hierarchical search within a tree index in dense retrieval when using hierarchical semantic identifiers. However, prior work focuses solely on the retrieval stage without considering the deep interactions within the decoder of generative retrieval.

In this paper, we fill this gap by demonstrating that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance of a document to a query. Specifically, we examine the attention layer and prediction head of generative retrieval, revealing that generative retrieval can be understood as a special case of multi-vector dense retrieval. Both methods compute relevance as a sum of products of query and document vectors and an alignment matrix. We then explore how generative retrieval applies this framework, employing distinct strategies for computing document token vectors and the alignment matrix. We have conducted experiments to verify our conclusions and show that both paradigms exhibit commonalities of term matching in their alignment matrix.

Our findings apply to many generative retrieval identifier designs and provide possible explanations on how generative retrieval can express query-document relevance. As multi-vector dense retrieval is the state-of-the-art dense retrieval method currently, understanding the connection between generative retrieval and multi-vector dense retrieval is crucial for shedding light on the underlying mechanisms of generative retrieval and for developing, and understanding the potential of, new retrieval models.
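The shared relevance form described above, a sum of products of query vectors, document vectors, and an alignment matrix, can be sketched as follows. The names `relevance` and `maxsim_alignment` are hypothetical, and the max-similarity alignment shown is the familiar late-interaction choice from multi-vector dense retrieval, used here only as one example of an alignment strategy; it is not the alignment the paper derives for generative retrieval.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relevance(query_vecs, doc_vecs, alignment):
    """rel(q, d) = sum over i, j of alignment[i][j] * <q_i, d_j>."""
    return sum(
        alignment[i][j] * dot(q, d)
        for i, q in enumerate(query_vecs)
        for j, d in enumerate(doc_vecs)
    )


def maxsim_alignment(query_vecs, doc_vecs):
    """Sparse alignment: each query vector is aligned to its best document vector."""
    alignment = []
    for q in query_vecs:
        scores = [dot(q, d) for d in doc_vecs]
        best = max(range(len(doc_vecs)), key=scores.__getitem__)
        alignment.append([1.0 if j == best else 0.0 for j in range(len(doc_vecs))])
    return alignment


# Toy vectors, chosen only for illustration.
Q = [[1.0, 0.0], [0.0, 1.0]]
D = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(relevance(Q, D, maxsim_alignment(Q, D)))
```

Under this framing, different retrieval paradigms differ in how the document vectors and the alignment matrix are computed, while the scoring form stays the same.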

I3: Intent-Introspective Retrieval Conditioned on Instructions

Kaihang Pan
Juncheng Li
Wenjie Wang
Hao Fei
Hongye Song
Wei Ji
Jun Lin
Xiaozhong Liu
Tat-Seng Chua
Siliang Tang

Recent studies indicate that dense retrieval models struggle to perform well on a wide variety of retrieval tasks that lack dedicated training data, as different retrieval tasks often entail distinct search intents. To address this challenge, in this work we leverage instructions to flexibly describe retrieval intents and introduce I3, a unified retrieval system that performs Intent-Introspective retrieval across various tasks, conditioned on Instructions without any task-specific training. I3 innovatively incorporates a pluggable introspector in a parameter-isolated manner to comprehend specific retrieval intents by jointly reasoning over the input query and instruction, and seamlessly integrates the introspected intent into the original retrieval model for intent-aware retrieval. Furthermore, we propose progressively-pruned intent learning. It utilizes extensive LLM-generated data to train I3 phase-by-phase, embodying two key designs: progressive structure pruning and drawback extrapolation-based data refinement. Extensive experiments show that in the BEIR benchmark, I3 significantly outperforms baseline methods designed with task-specific retrievers, achieving state-of-the-art zero-shot performance without any task-specific tuning.

SESSION: Session: Long-term and Session Recommendation

Reinforcing Long-Term Performance in Recommender Systems with User-Oriented Exploration Policy

Changshuo Zhang
Sirui Chen
Xiao Zhang
Sunhao Dai
Weijie Yu
Jun Xu

Reinforcement learning (RL) has gained popularity in recommender systems for improving long-term performance by effectively exploring users' interests. However, modern recommender systems face the challenge of different user behavioral patterns among millions of items, making exploration more difficult. For example, users with varying activity levels require different exploration intensities. Unfortunately, previous studies often overlook this aspect and apply a uniform exploration strategy to all users, which ultimately hampers long-term user experiences. To tackle these challenges, we propose User-Oriented Exploration Policy (UOEP), a novel approach that enables fine-grained exploration among user groups. We first construct a distributional critic that allows policy optimization based on varying quantile levels of cumulative reward feedback from users, representing user groups with different activity levels. Using this critic as a guide, we design a population of distinct actors dedicated to effective and fine-grained exploration within their respective user groups. To simultaneously enhance diversity and stability during the exploration process, we also introduce a population-level diversity regularization term and a supervision module. Experimental results on public recommendation datasets validate the effectiveness of our approach, as it outperforms all other baselines in terms of long-term performance. Moreover, further analyses reveal the benefits of our approach, including improved performance for low-activity users and increased fairness among users.

Treatment Effect Estimation for User Interest Exploration on Recommender Systems

Jiaju Chen
Wang Wenjie
Chongming Gao
Peng Wu
Jianxiong Wei
Qingsong Hua

Recommender systems learn personalized user preferences from user feedback like clicks. However, user feedback is usually biased towards partially observed interests, leaving many users' hidden interests unexplored. Existing approaches typically mitigate the bias, increase recommendation diversity, or use bandit algorithms to balance exploration-exploitation trade-offs. Nevertheless, they fail to consider the potential rewards of recommending different categories of items and lack the global scheduling of allocating top-N recommendations to categories, leading to suboptimal exploration. In this work, we propose an Uplift model-based Recommender (UpliftRec) framework, which regards top-N recommendation as a treatment optimization problem. UpliftRec estimates the treatment effects, i.e., the click-through rate (CTR) under different category exposure ratios, by using observational user feedback. UpliftRec calculates group-level treatment effects to discover users' hidden interests with high CTR rewards and leverages inverse propensity weighting to alleviate confounder bias. Thereafter, UpliftRec adopts a dynamic programming method to calculate the optimal treatment for overall CTR maximization. We implement UpliftRec on different backend models and conduct extensive experiments on three datasets. The empirical results validate the effectiveness of UpliftRec in discovering users' hidden interests while achieving superior recommendation accuracy.
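To make the inverse propensity weighting step concrete, here is a minimal sketch of a generic IPW estimate of CTR under one treatment, computed from logged feedback. The log layout `(treatment, click, propensity)`, the function name `ipw_ctr`, and the toy values are assumptions for illustration; this is the textbook estimator, not necessarily the exact one used in UpliftRec.

```python
def ipw_ctr(logs, treatment):
    """Inverse-propensity-weighted CTR estimate for one treatment.

    logs: iterable of (treatment, click, propensity) tuples, where
          `treatment` is a hypothetical label for a category exposure ratio,
          `click` is 0 or 1, and `propensity` is the probability that the
          logging policy assigned that treatment to that user.
    """
    logs = list(logs)
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for observed_treatment, click, propensity in logs:
        if propensity <= 0.0:
            raise ValueError("propensities must be positive")
        if observed_treatment == treatment:
            # Up-weight observations that the logging policy made rarely.
            total += click / propensity
    return total / len(logs)


# Toy log: two treatments, each assigned with probability 0.5.
toy_logs = [("a", 1, 0.5), ("a", 0, 0.5), ("b", 1, 0.5), ("b", 0, 0.5)]
print(ipw_ctr(toy_logs, "a"))
```

Dividing by the propensity corrects for treatments that the logging policy assigned unevenly across users, which is the kind of confounding the abstract refers to.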

Sequential Recommendation for Optimizing Both Immediate Feedback and Long-term Retention

Ziru Liu
Shuchang Liu
Zijian Zhang
Qingpeng Cai
Xiangyu Zhao
Kesen Zhao
Lantao Hu
Peng Jiang
Kun Gai

In Recommender System (RS) applications, reinforcement learning (RL) has recently emerged as a powerful tool, primarily due to its proficiency in optimizing long-term rewards. Nevertheless, it suffers from instability in the learning process, stemming from the intricate interactions among bootstrapping, off-policy training, and function approximation. Moreover, in multi-reward recommendation scenarios, designing a proper reward setting that reconciles the inner dynamics of various tasks is quite intricate. To this end, we propose a novel decision transformer-based recommendation model, DT4IER, to not only elevate the effectiveness of recommendations but also to achieve a harmonious balance between immediate user engagement and long-term retention. The DT4IER applies an innovative multi-reward design that adeptly balances short and long-term rewards with user-specific attributes, which serve to enhance the contextual richness of the reward sequence, ensuring a more informed and personalized recommendation process. To enhance its predictive capabilities, DT4IER incorporates a high-dimensional encoder to identify and leverage the intricate interrelations across diverse tasks. Furthermore, we integrate a contrastive learning approach within the action embedding predictions, significantly boosting the model's overall performance. Experiments on three real-world datasets demonstrate the effectiveness of DT4IER against state-of-the-art baselines in terms of both immediate user engagement and long-term retention. The source code is accessible online to facilitate replication.

Disentangling ID and Modality Effects for Session-based Recommendation

Xiaokun Zhang
Bo Xu
Zhaochun Ren
Xiaochen Wang
Hongfei Lin
Fenglong Ma

Session-based recommendation aims to predict intents of anonymous users based on their limited behaviors. Modeling user behaviors involves two distinct rationales: co-occurrence patterns reflected by item IDs, and fine-grained preferences represented by item modalities (e.g., text and images). However, existing methods typically entangle these causes, leading to their failure in achieving accurate and explainable recommendations. To this end, we propose a novel framework DIMO to disentangle the effects of ID and modality in the task. DIMO aims to disentangle these causes at both item and session levels. At the item level, we introduce a co-occurrence representation schema to explicitly incorporate co-occurrence patterns into ID representations. Simultaneously, DIMO aligns different modalities into a unified semantic space to represent them uniformly. At the session level, we present a multi-view self-supervised disentanglement, including proxy mechanism and counterfactual inference, to disentangle ID and modality effects without supervised signals. Leveraging these disentangled causes, DIMO provides recommendations via causal inference and further creates two templates for generating explanations. Extensive experiments on multiple real-world datasets demonstrate the consistent superiority of DIMO over existing methods. Further analysis also confirms DIMO's effectiveness in generating explanations.

Large Language Models are Learnable Planners for Long-Term Recommendation

Wentao Shi
Xiangnan He
Yang Zhang
Chongming Gao
Xinyue Li
Jizhi Zhang
Qifan Wang
Fuli Feng

Planning for both immediate and long-term benefits becomes increasingly important in recommendation. Existing methods apply Reinforcement Learning (RL) to learn planning capacity by maximizing cumulative reward for long-term recommendation. However, the scarcity of recommendation data presents challenges such as instability and susceptibility to overfitting when training RL models from scratch, resulting in sub-optimal performance. In this light, we propose to leverage the remarkable planning capabilities over sparse data of Large Language Models (LLMs) for long-term recommendation. The key to achieving the target lies in formulating a guidance plan following principles of enhancing long-term engagement and grounding the plan to effective and executable actions in a personalized manner. To this end, we propose a Bi-level Learnable LLM Planner framework, which consists of a set of LLM instances and breaks down the learning process into macro-learning and micro-learning to learn macro-level guidance and micro-level personalized recommendation policies, respectively. Extensive experiments validate that the framework facilitates the planning ability of LLMs for long-term recommendation.

SESSION: Session: Evaluation with and for LLMs

On the Evaluation of Machine-Generated Reports

James Mayfield
Eugene Yang
Dawn Lawrie
Sean MacAvaney
Paul McNamee
Douglas W. Oard
Luca Soldaini
Ian Soboroff
Orion Weller
Efsun Kayi
Kate Sanders
Marc Mason
Noah Hibbler

Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and, critically, a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These qualities, which are desirable, if not required, in many analytic report-writing settings, require rethinking how to build and evaluate systems that exhibit these qualities. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas found in various evaluations. To test completeness and accuracy, the framework uses nuggets of information, expressed as questions and answers, that need to be part of any high-quality generated report. Additionally, evaluation of citations that map claims made in the report to their source documents ensures verifiability.
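One simple way to read the nugget-based part of this framework is as a coverage score over question-answer nuggets. The sketch below assumes that some assessor (human or automatic) has already judged whether a report answers each nugget; the `Nugget` class, the `nugget_recall` function, and the example nuggets are hypothetical and only illustrate the bookkeeping, not the framework's actual scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nugget:
    """A unit of required information, expressed as a question and its answer."""
    question: str
    answer: str


def nugget_recall(nuggets, judgments):
    """Fraction of required nuggets that a report was judged to answer.

    judgments: dict mapping a Nugget to True/False, produced by a
               hypothetical assessor; missing nuggets count as unanswered.
    """
    if not nuggets:
        raise ValueError("at least one nugget is required")
    covered = sum(1 for n in nuggets if judgments.get(n, False))
    return covered / len(nuggets)


# Illustrative nuggets only; not drawn from any real test collection.
n1 = Nugget("When was the bridge completed?", "1932")
n2 = Nugget("Who designed the bridge?", "J. Bradfield")
print(nugget_recall([n1, n2], {n1: True}))
```

A verifiability check would additionally need each claim's citations to be judged against the cited source documents, which this sketch leaves out.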

Evaluating Generative Ad Hoc Information Retrieval

Lukas Gienapp
Harrisen Scells
Niklas Deckers
Janek Bevendorff
Shuai Wang
Johannes Kiesel
Shahbaz Syed
Maik Fröbe
Guido Zuccon
Benno Stein
Matthias Hagen
Martin Potthast

Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional document ranking, generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the established evaluation methodology for ranking-based ad hoc retrieval is not suited for the reliable and reproducible evaluation of generated responses. To lay a foundation for developing new evaluation methods for generative retrieval systems, we survey the relevant literature from the fields of information retrieval and natural language processing, identify search tasks and system architectures in generative retrieval, develop a new user model, and study its operationalization.

Large Language Models can Accurately Predict Searcher Preferences

Paul Thomas
Seth Spielman
Nick Craswell
Bhaskar Mitra

Much of the evaluation and tuning of a search system relies on relevance labels: annotations that say whether a document is useful for a given search and searcher. Ideally these come from real searchers, but it is hard to collect this data at scale, so typical experiments rely on third-party labellers who may or may not produce accurate annotations. Label quality is managed with ongoing auditing, training, and monitoring. We discuss an alternative approach. We take careful feedback from real searchers and use this to select a large language model (LLM), and prompt, that agrees with this feedback; the LLM can then produce labels at scale. Our experiments show LLMs are as accurate as human labellers and as useful for finding the best systems and hardest queries. LLM performance varies with prompt features, but also varies unpredictably with simple paraphrases. This unpredictability reinforces the need for high-quality "gold" labels.

Are Large Language Models Good at Utility Judgments?

Hengran Zhang
Ruqing Zhang
Jiafeng Guo
Maarten de Rijke
Yixing Fan
Xueqi Cheng

Retrieval-augmented generation (RAG) is considered to be a promising approach to alleviate the hallucination issue of large language models (LLMs), and it has received widespread attention from researchers recently. Due to the limitation in the semantic understanding of retrieval models, the success of RAG relies heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering.

In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain question answering (QA). Specifically, we introduce a benchmarking procedure and collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our experiments reveal that (i) well-instructed LLMs can distinguish between relevance and utility, and that LLMs are highly receptive to newly generated counterfactual passages. Moreover, (ii) we scrutinize key factors that affect utility judgments in the instruction design. Finally, (iii) to verify the efficacy of utility judgments in practical retrieval augmentation applications, we delve into LLMs' QA capabilities using the evidence judged with utility and direct dense retrieval results. (iv) We also propose a k-sampling, listwise approach to reduce the dependency of LLMs on the sequence of input passages, thereby facilitating subsequent answer generation. We believe that the way we formalize and study the problem along with our findings contributes to a critical assessment of retrieval-augmented LLMs. Our code and benchmark can be found at https://github.com/ict-bigdatalab/utility_judgments.

Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs

Clemencia Siro
Mohammad Aliannejadi
Maarten de Rijke

In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversational perception has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate that there is a distinct difference in ratings assigned by both annotator groups in the two setups, indicating that user feedback does influence system evaluation. Workers are more susceptible to user feedback on usefulness and interestingness compared to LLMs on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research.

A Workbench for Autograding Retrieve/Generate Systems

Laura Dietz

This resource paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional methods relying on passage-level judgments are no longer effective due to the diversity of responses generated by LLM-based systems. We provide a workbench to explore several alternative evaluation approaches to judge the relevance of a system's response that incorporate LLMs: 1. Asking an LLM whether the response is relevant; 2. Asking the LLM which set of nuggets (i.e., relevant key facts) is covered in the response; 3. Asking the LLM to answer a set of exam questions with the response. This workbench aims to facilitate the development of new, reusable test collections. Researchers can manually refine sets of nuggets and exam questions, observing their impact on system evaluation and leaderboard rankings. Resource available at https://github.com/TREMA-UNH/rubric-grading-workbench

SESSION: Session: Question Answering and Summarisation

CIQA: A Coding Inspired Question Answering Model

Mousa Arraf
Kira Radinsky

Methods in question-answering (QA) that transform texts detailing processes into an intermediate code representation, subsequently executed to generate a response to the presented question, have demonstrated promising results in analyzing scientific texts that describe intricate processes. The limitations of these existing text-to-code models are evident when attempting to solve QA problems that require knowledge beyond what is presented in the input text. We propose a novel domain-agnostic model to address the problem by leveraging domain-specific and open-source code libraries. We introduce an innovative QA text-to-code algorithm that learns to represent and utilize external APIs from code repositories, such as GitHub, within the intermediate code representation. The generated code is then executed to answer a question about a text. We present three QA datasets, focusing on scientific problems in the domains of chemistry, astronomy, and biology, for the benefit of the community. Our study demonstrates that our proposed method is a competitive alternative to current state-of-the-art (SOTA) QA text-to-code models and generic SOTA QA models.

Let Me Show You Step by Step: An Interpretable Graph Routing Network for Knowledge-based Visual Question Answering

Duokang Wang
Linmei Hu
Rui Hao
Yingxia Shao
Xin Lv
Liqiang Nie
Juanzi Li

Visual Question Answering based on external Knowledge Bases (KB-VQA) requires a model to incorporate knowledge beyond the content of the given image and question for answer prediction. Most existing works use graph neural networks or Multi-modal Large Language Models to incorporate external knowledge for answer generation. Despite the promising results, they have limited interpretability and exhibit a deficiency in handling questions with unseen answers. In this paper, we propose a novel interpretable graph routing network (GRN) which explicitly conducts entity routing over a constructed scene knowledge graph step by step for KB-VQA. At each step, GRN keeps an entity score vector representing how likely each entity is to be activated as the answer, and a transition matrix representing the transition probability from one entity to another. To answer the given question, GRN focuses on certain keywords of the question at each step and correspondingly conducts entity routing by transiting the entity scores according to the transition matrix computed with reference to the focused question keywords. In this way, it clearly provides the reasoning process of KB-VQA and can handle the questions with unseen answers without distinction. Experiments on the benchmark dataset KRVQA have demonstrated that GRN improves the performance of KB-VQA by a large margin, surpassing existing state-of-the-art KB-VQA methods and Multi-modal Large Language Models, and shows competent capability in handling unseen answers and good interpretability in KB-VQA.
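The step-by-step routing described above can be pictured as repeatedly pushing an entity score vector through a transition matrix. In the sketch below, the transition matrices are simply given as inputs, whereas in GRN they are computed from the question keywords focused on at each step; the function names `route_step` and `route` and the three-entity toy graph are assumptions for illustration only.

```python
def route_step(scores, transition):
    """One routing step: new_scores[j] = sum_i scores[i] * transition[i][j].

    scores:     current score of each entity being the answer.
    transition: transition[i][j] is the probability of moving from
                entity i to entity j at this step.
    """
    n = len(scores)
    return [sum(scores[i] * transition[i][j] for i in range(n)) for j in range(n)]


def route(scores, transitions):
    """Apply one transition matrix per reasoning step, in order."""
    for transition in transitions:
        scores = route_step(scores, transition)
    return scores


# Toy chain graph: entity 0 -> entity 1 -> entity 2 (entity 2 stays put).
chain = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
start = [1.0, 0.0, 0.0]
print(route(start, [chain]))
print(route(start, [chain, chain]))
```

Because every intermediate score vector is available, the entities visited at each step can be read off directly, which is the sense in which this style of routing exposes a reasoning path.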

MTMS: Multi-teacher Multi-stage Knowledge Distillation for Reasoning-Based Machine Reading Comprehension

Zhuo Zhao
Zhiwen Xie
Guangyou Zhou
Jimmy Xiangji Huang

As the field of machine reading comprehension (MRC) continues to evolve, it is unlocking enormous potential for its practical application. However, the currently well-performing models predominantly rely on massive pre-trained language models with at least several hundred million or even over one hundred billion parameters. These complex models not only require immense computational power but also extensive storage, presenting challenges for resource-limited environments such as online education. Current research indicates that specific capabilities of larger models can be transferred to smaller models through knowledge distillation. However, prior to our work, there were no small models specifically designed for MRC tasks with complex reasoning abilities. In light of this, we present a novel multi-teacher multi-stage distillation approach, MTMS. It facilitates the easier deployment of reasoning-based MRC tasks on resource-constrained devices, thereby enabling effective applications. In this method, we design a multi-teacher distillation framework that includes both a logical teacher and a semantic teacher. This framework allows MTMS to simultaneously extract features from different perspectives of the text, mitigating the limitations inherent in single-teacher information representations. Furthermore, we introduce a multi-stage contrastive learning strategy. Through this strategy, the student model can progressively align with the teacher models, effectively bridging the gap between them. Extensive experimental outcomes on two inference-based datasets from real-world scenarios demonstrate that MTMS requires nearly 10 times fewer parameters compared with the teacher model size while achieving competitive performance.

Exploring the Trade-Off within Visual Information for MultiModal Sentence Summarization

Minghuan Yuan
Shiyao Cui
Xinghua Zhang
Shicheng Wang
Hongbo Xu
Tingwen Liu

MultiModal Sentence Summarization (MMSS) aims to generate a brief summary based on the given source sentence and its associated image. Previous studies on MMSS have achieved success by either selecting the task-relevant visual information or filtering out the task-irrelevant visual information to help the textual modality to generate the summary. However, enhancing from a single perspective usually introduces over-preservation or over-compression problems. To tackle these issues, we resort to Information Bottleneck (IB), which seeks to find a maximally compressed mapping of the input information that preserves as much information about the target as possible. Specifically, we propose a novel method, T³, which adopts IB to balance the Trade-off between Task-relevant and Task-irrelevant visual information through the variational inference framework. In this way, the task-irrelevant visual information is compressed to the utmost while the task-relevant visual information is maximally retained. With the holistic perspective, the generated summary could maintain as many key elements as possible while discarding the unnecessary ones as far as possible. Extensive experiments on the representative MMSS dataset demonstrate the superiority of our proposed method. Our code is available at https://github.com/YuanMinghuan/T3.

Flexible and Adaptable Summarization via Expertise Separation

Xiuying Chen
Mingzhe Li
Shen Gao
Xin Cheng
Qingqing Zhu
Rui Yan
Xin Gao
Xiangliang Zhang

A proficient summarization model should exhibit both flexibility -- the capacity to handle a range of in-domain summarization tasks, and adaptability -- the competence to acquire new knowledge and adjust to unseen out-of-domain tasks. Unlike large language models (LLMs) that achieve this through parameter scaling, we propose a more parameter-efficient approach in this study. Our motivation rests on the principle that the general summarization ability to capture salient information can be shared across different tasks, while the domain-specific summarization abilities need to be distinct and tailored. Concretely, we propose MoeSumm, a Mixture-of-Expert Summarization architecture, which utilizes a main expert for gaining the general summarization capability and deputy experts that selectively collaborate to meet specific summarization task requirements. We further propose a max-margin loss to stimulate the separation of these abilities. Our model's distinct separation of general and domain-specific summarization abilities grants it notable flexibility and adaptability, all while maintaining parameter efficiency. MoeSumm achieves flexibility by managing summarization across multiple domains with a single model, utilizing a shared main expert and selected deputy experts. It exhibits adaptability by tailoring deputy experts to cater to out-of-domain few-shot and zero-shot scenarios. Experimental results on 11 datasets show the superiority of our model compared with recent baselines and LLMs. We also provide statistical and visual evidence of the distinct separation of the two abilities in MoeSumm. Our code is available at https://github.com/iriscxy/MoE_Summ.

Disentangling Instructive Information from Ranked Multiple Candidates for Multi-Document Scientific Summarization

Pancheng Wang
Shasha Li
Dong Li
Kehan Long
Jintao Tang
Ting Wang

Automatically condensing multiple topic-related scientific papers into a succinct and concise summary is referred to as Multi-Document Scientific Summarization (MDSS). Currently, while commonly used abstractive MDSS methods can generate flexible and coherent summaries, the difficulty in handling global information and the lack of guidance during decoding still make it challenging to generate better summaries. To alleviate these two shortcomings, this paper introduces summary candidates into MDSS, utilizing the global information of the document set and additional guidance from the summary candidates to guide the decoding process. Our insights are twofold: Firstly, summary candidates can provide instructive information from both positive and negative perspectives, and secondly, selecting higher-quality candidates from multiple options contributes to producing better summaries. Drawing on the insights, we propose a summary candidates fusion framework - Disentangling Instructive information from Ranked candidates (DIR) for MDSS. Specifically, DIR first uses a specialized pairwise comparison method towards multiple candidates to pick out those of higher quality. Then DIR disentangles the instructive information of summary candidates into positive and negative latent variables with Conditional Variational Autoencoder. These variables are further incorporated into the decoder to guide generation. We evaluate our approach with three different types of Transformer-based models and three different types of candidates, and consistently observe noticeable performance improvements according to automatic and human evaluation. More analyses further demonstrate the effectiveness of our model in handling global information and enhancing decoding controllability.

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Bhawna Piryani
Jamshid Mozafari
Adam Jatowt

Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale temporal QA dataset with 487K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from a cleaner, corrected version of the content, and answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it quite a unique and useful resource.

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

Abdelrahman Abdallah
Mahmoud Kasem
Mahmoud Abdalla
Mohamed Mahmoud
Mohamed Elkasaby
Yasser Elbendary
Adam Jatowt

In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research at https://github.com/DataScienceUIBK/ArabicaQA.

TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions

Jamshid Mozafari
Anubhav Jangra
Adam Jatowt

Nowadays, individuals tend to engage in dialogues with Large Language Models, seeking answers to their questions. In times when such answers are readily accessible to anyone, the stimulation and preservation of humans' cognitive abilities, as well as the assurance that humans maintain good reasoning skills, become crucial. This study addresses such needs by proposing hints (instead of final answers or before giving answers) as a viable solution. We introduce a framework for automatic hint generation for factoid questions, employing it to construct TriviaHG, a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset. Additionally, we present an automatic evaluation method that measures the Convergence and Familiarity quality attributes of hints. To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints. The effectiveness of hints varied, with success rates of 96%, 78%, and 36% for questions with easy, medium, and hard answers, respectively. Moreover, the proposed automatic evaluation methods showed a robust correlation with annotators' results. In conclusion, the findings highlight three key insights: the facilitative role of hints in resolving unknown questions, the dependence of hint quality on answer difficulty, and the feasibility of employing automatic evaluation methods for hint assessment.

SESSION: Session: Cross-Domain Recommendation

Pacer and Runner: Cooperative Learning Framework between Single- and Cross-Domain Sequential Recommendation

Chung Park
Taesan Kim
Hyungjun Yoon
Junui Hong
Yelim Yu
Mincheol Cho
Minsung Choi
Jaegul Choo

Cross-Domain Sequential Recommendation (CDSR) improves recommendation performance by utilizing information from multiple domains, in contrast to Single-Domain Sequential Recommendation (SDSR), which relies on historical interactions within a specific domain. However, CDSR may underperform compared to the SDSR approach in certain domains due to negative transfer, which occurs when there is a lack of relation between domains or different levels of data sparsity. To address the issue of negative transfer, our proposed CDSR model estimates the degree of negative transfer of each domain and adaptively assigns it as a weight factor to the prediction loss, to control gradient flows through domains with significant negative transfer. To this end, our model compares the performance of a model trained on multiple domains (CDSR) with a model trained solely on the specific domain (SDSR) to evaluate the negative transfer of each domain using our asymmetric cooperative network. In addition, to facilitate the transfer of valuable cues between the SDSR and CDSR tasks, we developed an auxiliary loss that maximizes the mutual information between the representation pairs from both tasks on a per-domain basis. This cooperative learning between SDSR and CDSR tasks is similar to the collaborative dynamics between pacers and runners in a marathon. Our model outperformed numerous previous works in extensive experiments on two real-world industrial datasets across ten service domains. We have also deployed our model in the recommendation system of our personal assistant app service, resulting in a 21.4% increase in click-through rate compared to existing models, which is valuable to real-world business.

Aiming at the Target: Filter Collaborative Information for Cross-Domain Recommendation

Hanyu Li
Weizhi Ma
Peijie Sun
Jiayu Li
Cunxiang Yin
Yancheng He
Guoqiang Xu
Min Zhang
Shaoping Ma

As recommender systems become pervasive in various scenarios, cross-domain recommenders (CDR) are proposed to enhance the performance of one target domain with data from other related source domains. However, irrelevant information from the source domain may instead degrade target domain performance, which is known as the negative transfer problem. Most existing efforts to tackle this issue primarily focus on designing adaptive representations for overlapped users. However, these methods rely on the learned representations of the model and lack explicit constraints to filter irrelevant source-domain collaborative information for the target domain, which limits their cross-domain transfer capability.

In this paper, we propose a novel Collaborative information regularized User Transformation (CUT) framework to tackle the negative transfer problem by directly filtering users' collaborative information. In CUT, target domain user similarity is adopted as a constraint for user transformation to filter user collaborative information from the source domain. First, CUT learns user similarity relationships from the target domain. Then, source-target information transfer is guided by the user similarity, where we design a user transformation layer to learn target-domain user representations and a contrastive loss to supervise the user collaborative information transferring. As a flexible and lightweight framework, CUT can be applied with various single-domain recommender systems as the backbone and extend them to multi-domain tasks. Empirical studies on two real-world datasets show that CUT effectively alleviates the negative transfer problem, and it significantly outperforms other SOTA single and cross-domain methods.

Identifiability of Cross-Domain Recommendation via Causal Subspace Disentanglement

Jing Du
Zesheng Ye
Bin Guo
Zhiwen Yu
Lina Yao

Cross-Domain Recommendation (CDR) seeks to enable effective knowledge transfer across domains. Most existing works rely on either representation alignment or transformation bridges, but they come with shortcomings regarding identifiability of domain-shared and domain-specific latent factors. Specifically, while CDR describes user representations as a joint distribution over two domains, these methods fail to account for its joint identifiability as they primarily fixate on the marginal distribution within a particular domain. Such a failure may overlook the conditionality between two domains and how it contributes to latent factor disentanglement, leading to negative transfer when domains are weakly correlated. In this study, we explore what should and should not be transferred in cross-domain user representations from a causality perspective. We propose a Hierarchical causal subspace disentanglement approach to explore the Joint IDentifiability of cross-domain joint distribution, termed HJID, to preserve domain-specific behaviors from domain-shared factors. HJID abides by the feature hierarchy and divides user representations into a generic shallow subspace and domain-oriented deep subspaces. We first encode the generic pattern in the shallow subspace by minimizing the Maximum Mean Discrepancy of initial layer activations. Then, to dissect how domain-oriented latent factors are encoded in deeper-layer activations, we construct a cross-domain causality-based data generation graph, which identifies cross-domain consistent and domain-specific components, adhering to the Minimal Change principle. This allows HJID to maintain stability whilst discovering unique factors for different domains, all within a generative framework of invertible transformations that guarantee the joint identifiability. With experiments on real-world datasets, we show that HJID outperforms SOTA methods on both strong- and weak-correlation CDR tasks.
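
The Maximum Mean Discrepancy (MMD) mentioned in the abstract above can be illustrated with a minimal sketch. This is not the HJID implementation: the RBF kernel, the `gamma` value, the biased estimator, and the toy activation vectors are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math

def rbf(u, v, gamma=1.0):
    """RBF kernel between two vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""
    def mean_k(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Hypothetical shallow-layer activations of users in two domains.
domain_a = [(0.0, 0.0), (0.1, 0.0)]
domain_b = [(3.0, 3.0), (3.1, 3.0)]

same = mmd2(domain_a, domain_a)   # identical samples give zero discrepancy
diff = mmd2(domain_a, domain_b)   # well-separated samples give a large value
```

Minimizing such a quantity over the shallow-layer activations pushes the two domains toward a shared distribution, which is the role the abstract assigns to the generic subspace.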

On the Negative Perception of Cross-domain Recommendations and Explanations

Denis Kotkov
Alan Medlar
Yang Liu
Dorota Glowacka

Recommender systems typically operate within a single domain, for example, recommending books based on users' reading habits. If such data is unavailable, it may be possible to make cross-domain recommendations and recommend books based on user preferences from another domain, such as movies. However, despite considerable research on cross-domain recommendations, no studies have investigated their impact on users' behavioural intentions or system perceptions compared to single-domain recommendations. Similarly, while single-domain explanations have been shown to improve users' perceptions of recommendations, there are no comparable studies for the cross-domain case.

In this article, we present a between-subject study (N=237) of users' behavioural intentions and perceptions of book recommendations. The study was designed to disentangle the effects of whether recommendations were single- or cross-domain from whether explanations were present or not. Our results show that cross-domain recommendations have lower trust and interest than single-domain recommendations, regardless of their quality. While these negative effects can be ameliorated by cross-domain explanations, they are still perceived as inferior to single-domain recommendations without explanations. Last, we show that explanations decrease interest in the single-domain case, but increase perceived transparency and scrutability in both single- and cross-domain recommendations. Our findings offer valuable insights into the impact of recommendation provenance on user experience and could inform the future development of cross-domain recommender systems.

DeCoCDR: Deployable Cloud-Device Collaboration for Cross-Domain Recommendation

Yu Li
Yi Zhang
Zimu Zhou
Qiang Li

Cross-domain recommendation (CDR) is a widely used methodology in recommender systems to combat data sparsity. It leverages user data across different domains or platforms to provide personalized recommendations. Traditional CDR assumes user preferences and behavior data can be shared freely between the cloud and users, which is now impractical due to strict data privacy restrictions. In this paper, we propose a Deployment-friendly Cloud-Device Collaboration framework for Cross-Domain Recommendation (DeCoCDR). It splits CDR into a two-stage recommendation model through cloud-device collaboration, i.e., item recall on the cloud and item re-ranking on the device. This design enables effective CDR while preserving data privacy for both the cloud and the device. Extensive offline and online experiments are conducted to validate the effectiveness of DeCoCDR. In offline experiments, DeCoCDR outperformed state-of-the-art methods on three large datasets. In real-world deployment, DeCoCDR improved the conversion rate by 45.3% compared with the baseline.

Mutual Information-based Preference Disentangling and Transferring for Non-overlapped Multi-target Cross-domain Recommendations

Zhi Li
Daichi Amagata
Yihong Zhang
Takahiro Hara
Shuichiro Haruta
Kei Yonekawa
Mori Kurokawa

Building high-quality recommender systems is challenging for new services and small companies, because of their sparse interactions. Cross-domain recommendations (CDRs) alleviate this issue by transferring knowledge from data in external domains. However, most existing CDRs leverage data from only a single external domain and serve only two domains. CDRs serving multiple domains require domain-shared entities (i.e., users and items) to transfer knowledge, which significantly limits their applications due to the hardness and privacy concerns of finding such entities. We therefore focus on a more general scenario, non-overlapped multi-target CDRs (NO-MTCDRs), which require no domain-shared entities and serve multiple domains. Existing methods require domain-shared users to learn user preferences and cannot work on NO-MTCDRs. We hence propose MITrans, a novel mutual information-based (MI-based) preference disentangling and transferring framework to improve recommendations for all domains. MITrans effectively leverages knowledge from multiple domains as well as learning both domain-shared and domain-specific preferences without using domain-shared users. In MITrans, we devise two novel MI constraints to disentangle domain-shared and domain-specific preferences. Moreover, we introduce a module that fuses domain-shared preferences in different domains and combines them with domain-specific preferences to improve recommendations. Our experimental results on two real-world datasets demonstrate the superiority of MITrans in terms of recommendation quality and application range against state-of-the-art overlapped and non-overlapped CDRs.

Multi-Domain Sequential Recommendation via Domain Space Learning

Junyoung Hwang
Hyunjun Ju
SeongKu Kang
Sanghwan Jang
Hwanjo Yu

This paper explores Multi-Domain Sequential Recommendation (MDSR), an advancement of Multi-Domain Recommendation that incorporates sequential context. A recent MDSR approach exploits domain-specific sequences, decoupled from mixed-domain histories, to model domain-specific sequential preference, and uses mixed-domain histories to model domain-shared sequential preference. However, the approach faces challenges in accurately obtaining domain-specific sequential preferences in the target domain, especially when users only occasionally engage with it. In such cases, the history of users in the target domain is limited or not recent, leading the sequential recommender system to capture inaccurate domain-specific sequential preferences. To address this limitation, this paper introduces Multi-Domain Sequential Recommendation via Domain Space Learning (MDSR-DSL). Our approach utilizes cross-domain items to supplement missing sequential context in domain-specific sequences. It involves creating a "domain space" to maintain and utilize the unique characteristics of each domain and a domain-to-domain adaptation mechanism to transform item representations across domain spaces. To validate the effectiveness of MDSR-DSL, this paper extensively compares it with state-of-the-art MD(S)R methods and provides detailed analyses.

SESSION: Session: Multimedia 2

Capability-aware Prompt Reformulation Learning for Text-to-Image Generation

Jingtao Zhan
Qingyao Ai
Yiqun Liu
Jia Chen
Shaoping Ma

Text-to-image generation systems have emerged as revolutionary tools in the realm of artistic creation, offering unprecedented ease in transforming textual prompts into visual art. However, the efficacy of these systems is intricately linked to the quality of user-provided prompts, which often poses a challenge to users unfamiliar with prompt crafting. This paper addresses this challenge by leveraging user reformulation data from interaction logs to develop an automatic prompt reformulation model. Our in-depth analysis of these logs reveals that user prompt reformulation is heavily dependent on the individual user's capability, resulting in significant variance in the quality of reformulation pairs. To effectively use this data for training, we introduce the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively integrates user capability into the reformulation process through two key components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM reformulates prompts according to a specified user capability, as represented by CCF. The CCF, in turn, offers the flexibility to tune and guide the CRM's behavior. This enables CAPR to effectively learn diverse reformulation strategies across various user capacities and to simulate high-capability user reformulation during inference. Extensive experiments demonstrate CAPR's superior performance over existing baselines and its remarkable robustness on unseen systems. Furthermore, comprehensive analyses validate the effectiveness of different components. CAPR can facilitate user-friendly interaction with text-to-image systems and make advanced artistic creation more achievable for a broader range of users.

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo

We present a Recipe for Effective and Efficient zero-shot video-text Retrieval, dubbed M²-RAAP. Built upon popular image-text models like CLIP, most current adaptation-based video-text pre-training methods are confronted by three major issues, i.e., noisy data corpus, time-consuming pre-training, and limited performance gain. To this end, we conduct a comprehensive study including four critical steps in video-text pre-training. Specifically, we investigate 1) data filtering and refinement, 2) video input type selection, 3) temporal modeling, and 4) video feature enhancement. We then summarize this empirical study into the M²-RAAP recipe, where our technical contributions lie in 1) the data filtering and text re-writing pipeline resulting in 1M high-quality bilingual video-text pairs, 2) the promotion of video inputs with key-frames to accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to enhance video features. We conduct extensive experiments by adapting three image-text foundation models on two refined video-text datasets from different languages, validating the robustness and reproducibility of M²-RAAP for adaptation-based pre-training. Results demonstrate that M²-RAAP yields superior performance with significantly less data (-90%) and time consumption (-95%), establishing a new SOTA on four English zero-shot retrieval datasets and two Chinese ones. Codebase and refined bilingual data annotations are available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.

Short Video Ordering via Position Decoding and Successor Prediction

Shiping Ge
Qiang Chen
Zhiwei Jiang
Yafeng Yin
Ziyao Chen
Qing Gu

Short video collection is an easy way for users to consume coherent content on various online short video platforms, such as TikTok, YouTube, Douyin, and WeChat Channel. These collections cover a wide range of content, including online courses, TV series, movies, and cartoons. However, short video creators occasionally publish videos in a disorganized manner due to various reasons, such as revisions, secondary creations, deletions, and reissues, which often result in a poor browsing experience for users. Therefore, accurately reordering videos within a collection based on their content coherence is a vital task that can enhance user experience and presents an intriguing research problem in the field of video narrative reasoning. In this work, we curate a dedicated multimodal dataset for this Short Video Ordering (SVO) task and present the performance of some benchmark methods on the dataset. In addition, we further propose an advanced SVO framework with the aid of position decoding and successor prediction. The proposed framework combines both pairwise and listwise ordering paradigms, which can get rid of the issues from both quadratic growth and cascading conflict in the pairwise paradigm, and improve the performance of existing listwise methods. Extensive experiments demonstrate that our method achieves the best performance on our open SVO dataset, and each component of the framework contributes to the final performance. Both the SVO dataset and code will be released at https://github.com/ShipingGe/SVO.

CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

Xintong Jiang
Yaxiong Wang
Mengjian Li
Yujiao Wu
Bingwen Hu
Xueming Qian

Composed image retrieval (CIR) is the task of searching target images using an image-text pair as a query. Given the straightforward relation of query pair-target image, the dominant methods follow the learning paradigm of common image-text retrieval and simply model this problem as the query-target matching problem. Particularly, the common practice first encodes the multi-modal query into one feature and then aligns it with the target image. However, such a learning paradigm only explores the naive relation in the triplets. We argue that CIR triplets encompass additional associations besides the primary query-target relation, which are overlooked in existing works. In this paper, we disclose two new relations residing in the triplets by viewing the triplet as a graph node. In analogy with the graph node, we mine two associations of text-bridged image alignment and complementary text reasoning. The text-bridged image alignment considers composed image retrieval as a specialized form of image retrieval, where the query text acts as a bridge between the query image and the target one, and a hinge-based cross attention is proposed to incorporate this relation into the network learning. On the other hand, the association of complementary text reasoning regards composed image retrieval as a specific type of cross-modal retrieval, where the composite two images are used to reason the complementary text. To integrate these views effectively, a twin attention-based compositor is designed. By combining these two types of complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for composed image retrieval. With the above designs, we finally developed our CaLa, a Complementary Association Learning framework for Augmenting Composed Image Retrieval. Experimental evaluations are conducted on the widely-used CIRR and FashionIQ benchmarks with multiple backbones to validate the effectiveness of our CaLa. The results demonstrate the superiority of our method in the composed image retrieval task. Our code and models are available at https://github.com/Chiangsonw/CaLa

CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora

Zijun Long
Xuri Ge
Richard McCreadie
Joemon M. Jose

Text-to-image retrieval aims to find the relevant images based on a text query, which is important in various use-cases, such as digital libraries, e-commerce, and multimedia databases. Although Multimodal Large Language Models (MLLMs) demonstrate state-of-the-art performance, they exhibit limitations in handling large-scale, diverse, and ambiguous real-world needs of retrieval, due to the computation cost and the injective embeddings they produce. This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective large-scale long-text to image retrieval. The first stage, Entity-based Ranking (ER), adapts to long-text query ambiguity by employing a multiple-queries-to-multiple-targets paradigm, facilitating candidate filtering for the next stage. The second stage, Summary-based Re-ranking (SR), refines these rankings using summarized queries. We also propose a specialized Decoupling-BEiT-3 encoder, optimized for handling ambiguous user needs and both stages, which also enhances computational efficiency through vector-based similarity inference. Evaluation on the AToMiC dataset reveals that CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively. We will release our code to facilitate future research at https://github.com/longkukuhi/CFIR.

SESSION: Session: Legal

CaseLink: Inductive Graph Learning for Legal Case Retrieval

Yanran Tang
Ruihong Qiu
Hongzhi Yin
Xue Li
Zi Huang

In case law, the precedents are the relevant cases that are used to support the decisions made by the judges and the opinions of lawyers towards a given case. This relevance is referred to as the case-to-case reference relation. To efficiently find relevant cases from a large case pool, retrieval tools are widely used by legal practitioners. Existing legal case retrieval models mainly work by comparing the text representations of individual cases. Although they obtain a decent retrieval accuracy, the intrinsic case connectivity relationships among cases have not been well exploited for case encoding, therefore limiting the further improvement of retrieval performance. In a case pool, there are three types of case connectivity relationships: the case reference relationship, the case semantic relationship, and the case legal charge relationship. Due to the inductive nature of the legal case retrieval task, using case references as input is not applicable at test time. Thus, in this paper, a CaseLink model based on inductive graph learning is proposed to utilise the intrinsic case connectivity for legal case retrieval. A novel Global Case Graph is incorporated to represent both the case semantic relationship and the case legal charge relationship. A novel contrastive objective with a regularisation on the degree of case nodes is proposed to leverage the information carried by the case reference relationship to optimise the model. Extensive experiments have been conducted on two benchmark datasets, which demonstrate the state-of-the-art performance of CaseLink. The code has been released at https://github.com/yanran-tang/CaseLink.

Explicitly Integrating Judgment Prediction with Legal Document Retrieval: A Law-Guided Generative Approach

Weicong Qin
Zelin Cao
Weijie Yu
Zihua Si
Sirui Chen
Jun Xu

Legal document retrieval and judgment prediction are crucial tasks in intelligent legal systems. In practice, determining whether two documents share the same judgments is essential for establishing their relevance in legal retrieval. However, existing legal retrieval studies either ignore the vital role of judgment prediction or rely on implicit training objectives, expecting a proper alignment of legal documents in vector space based on their judgments. Neither approach provides explicit evidence of judgment consistency for relevance modeling, leading to inaccuracies and a lack of transparency in retrieval. To address this issue, we propose a law-guided method, namely GEAR, within the generative retrieval framework. GEAR explicitly integrates judgment prediction with legal document retrieval in a sequence-to-sequence manner. Specifically, given the intricate nature of legal documents, we first extract rationales from documents based on the definition of charges in law. We then employ these rationales as queries, ensuring efficiency and producing a shared, informative document representation for both tasks. Second, in accordance with the inherent hierarchy of law, we construct a law structure constraint tree and represent each candidate document as a hierarchical semantic ID based on this tree. This empowers GEAR to perform dual predictions for judgment and relevant documents in a single inference, i.e., traversing the tree from the root through intermediate judgment nodes, to document-specific leaf nodes. Third, we devise the revision loss that jointly minimizes the discrepancy between the IDs of predicted and labeled judgments, as well as retrieved documents, thus improving accuracy and consistency for both tasks. Extensive experiments on two Chinese legal case retrieval datasets show the superiority of GEAR over state-of-the-art methods while maintaining competitive judgment prediction performance. Moreover, we validate the effectiveness of GEAR on a French statutory article retrieval dataset, reaffirming its robustness across languages and domains.

Event Grounded Criminal Court View Generation with Cooperative (Large) Language Models

Linan Yue
Qi Liu
Lili Zhao
Li Wang
Weibo Gao
Yanqing An

With the development of legal intelligence, Criminal Court View Generation has attracted much attention as a crucial task of legal intelligence, which aims to generate concise and coherent texts that summarize case facts and provide explanations for verdicts. Existing research explores the key information in case facts to yield the court views. Most of these studies employ a coarse-grained approach that partitions the facts into broad segments (e.g., verdict-related sentences) to make predictions. However, this approach fails to capture the complex details present in the case facts, such as various criminal elements and legal events. To this end, in this paper, we propose an Event Grounded Generation (EGG) method for criminal court view generation with cooperative (Large) Language Models, which introduces fine-grained event information into the generation. Specifically, we first design an LLM-based extraction method that can extract events in case facts without massive annotated events. Then, we incorporate the extracted events into court view generation by merging case facts and events. Besides, considering the computational burden posed by the use of LLMs in the extraction phase of EGG, we propose an LLM-free EGG method that can eliminate the requirement for event extraction using LLMs in the inference phase. Extensive experimental results on a real-world dataset clearly validate the effectiveness of our proposed method.

Legal Statute Identification: A Case Study using State-of-the-Art Datasets and Methods

Shounak Paul
Rajas Bhatt
Pawan Goyal
Saptarshi Ghosh

Legal Statute Identification (LSI) involves identifying the relevant statutes (articles of law) given the facts (evidence) of a legal case. There are several key challenges in LSI, such as (i) usage of label (statute) semantics, which can be complicated and confusing; (ii) the input text (i.e., the facts) is very long and noisy; (iii) the label distribution usually follows a long tail, making predictions for the rare labels challenging. Although multiple methods have been proposed to address these challenges, there has not been any comprehensive study to establish the effects of these factors on different models/approaches. In this work, we reproduce several LSI models on two popular LSI datasets and study the effect of the above-mentioned challenges. We conduct thorough experiments with transformer-based encoders such as BERT and Longformer. We further try out different combinations of these encoders with approaches devised specifically for LSI, which essentially use different mechanisms to model the statute texts to enhance fact representations. Our experiments yield several interesting insights into how the above-mentioned challenges are addressed by different models, the interplay of different encoding and statute text handling measures, and how the nature of the LSI datasets affects the model performances. Finally, we also analyze the explainability capabilities of different approaches using human-annotated rationales.

CivilSum: A Dataset for Abstractive Summarization of Indian Court Decisions

Manuj Malik
Zheng Zhao
Marcio Fonseca
Shrisha Rao
Shay B. Cohen

Extracting relevant information from legal documents is a challenging task due to the technical complexity and volume of their content. These factors also increase the costs of annotating large datasets, which are required to train state-of-the-art summarization systems. To address these challenges, we introduce CivilSum, a collection of 23,350 legal case decisions from the Supreme Court of India and other Indian High Courts paired with human-written summaries. Compared to previous datasets such as IN-Abs, CivilSum not only has more legal decisions but also provides shorter and more abstractive summaries, thus offering a challenging benchmark for legal summarization. Unlike other domains such as news articles, our analysis shows the most important content tends to appear at the end of the documents. We measure the effect of this tail bias on summarization performance using strong architectures for long-document abstractive summarization, and the results highlight the importance of long sequence modeling for the proposed task. CivilSum and related code are publicly available to the research community to advance text summarization in the legal domain.

LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset

Haitao Li
Yunqiu Shao
Yueyue Wu
Qingyao Ai
Yixiao Ma
Yiqun Liu

As an important component of intelligent legal systems, legal case retrieval plays a critical role in ensuring judicial justice and fairness. However, the development of legal case retrieval technologies in the Chinese legal system is restricted by three problems in existing datasets: limited data size, narrow definitions of legal relevance, and naive candidate pooling strategies used in data sampling.

To alleviate these issues, we introduce LeCaRDv2, a large-scale Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 candidates extracted from 4.3 million criminal case documents. To the best of our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval datasets, providing extensive coverage of criminal charges. Additionally, we enrich the existing relevance criteria by considering three key aspects: characterization, penalty, and procedure. These comprehensive criteria enrich the dataset and may provide a more holistic perspective. Furthermore, we propose a two-level candidate set pooling strategy that effectively identifies potential candidates for each query case. It's important to note that all cases in the dataset have been annotated by multiple legal experts specializing in criminal law. Their expertise ensures the accuracy and reliability of the annotations. We evaluate several state-of-the-art retrieval models on LeCaRDv2, demonstrating that there is still significant room for improvement in legal case retrieval. The details of LeCaRDv2 can be found at the anonymous website https://github.com/THUIR/LeCaRDv2.

SESSION: Session: Short Research Papers

A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search

Thomas Vecchiato
Claudio Lucchese
Franco Maria Nardini
Sebastian Bruch

A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of k data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters (a process known as routing), then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-k configuration, the ground-truth is the set of clusters that contain the exact top-k vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.
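
The routing setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the centroid-based baseline router are assumptions made for illustration. It shows how the ground-truth cluster set for a query falls out of the exact top-k inner-product search, and how a router can then be scored against it.

```python
from typing import List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: Sequence[float], points: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k points with the largest inner product with the query."""
    order = sorted(range(len(points)), key=lambda i: dot(query, points[i]), reverse=True)
    return order[:k]


def relevant_clusters(query, points, assignment: List[int], k: int) -> Set[int]:
    """Ground-truth routing labels: clusters that hold an exact top-k point."""
    return {assignment[i] for i in exact_top_k(query, points, k)}


def route(query, centroids: List[Sequence[float]], n_probe: int) -> List[int]:
    """Baseline router (an assumption here): rank clusters by query-centroid inner product."""
    order = sorted(range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True)
    return order[:n_probe]


def routing_recall(query, points, assignment, centroids, k: int, n_probe: int) -> float:
    """Fraction of the exact top-k points that live in the probed clusters."""
    probed = set(route(query, centroids, n_probe))
    top = exact_top_k(query, points, k)
    return sum(assignment[i] in probed for i in top) / k


# Hypothetical toy index: five 2-d points in three clusters, centroids are cluster means.
POINTS = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, 0.0)]
ASSIGNMENT = [0, 0, 1, 1, 2]
CENTROIDS = [(0.95, 0.05), (0.05, 0.95), (-1.0, 0.0)]
QUERY = (0.6, 1.0)
```

A learned router would replace `route` with a scoring function trained so that the clusters returned by `relevant_clusters` are ranked first; the recall function stays the same.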

A Persona-Infused Cross-Task Graph Network for Multimodal Emotion Recognition with Emotion Shift Detection in Conversations

Geng Tu
Feng Xiong
Bin Liang
Ruifeng Xu

Recent research in Multimodal Emotion Recognition in Conversations (MERC) focuses on multimodal fusion and modeling speaker-sensitive context. In addition to contextual information, personality traits also affect emotional perception. However, current MERC methods solely consider the personality influence of speakers, neglecting speaker-addressee interaction patterns. Additionally, the bottleneck problem of Emotion Shift (ES), where consecutive utterances by the same speaker exhibit different emotions, has long been neglected in MERC. Early ES research fails to distinguish diverse shift patterns and simply introduces whether shifts occur as knowledge into the MERC model without considering the complementary nature of the two tasks. Based on this, we propose a Persona-infused Cross-task Graph Network (PCGNet). It first models the speaker-addressee interactive relationships by the persona-infused refinement network. Then, it learns the auxiliary task of ES Detection and the main task of MERC using cross-task connections to capture correlations across two tasks. Finally, we introduce shift-aware contrastive learning to discern diverse shift patterns. Experimental results demonstrate that PCGNet outperforms state-of-the-art methods on two widely used datasets.

A Surprisingly Simple yet Effective Multi-Query Rewriting Method for Conversational Passage Retrieval

Ivica Kostric
Krisztian Balog

Conversational passage retrieval is challenging as it often requires the resolution of references to previous utterances and needs to deal with the complexities of natural language, such as coreference and ellipsis. To address these challenges, pre-trained sequence-to-sequence neural query rewriters are commonly used to generate a single de-contextualized query based on conversation history. Previous research shows that combining multiple query rewrites for the same user utterance has a positive effect on retrieval performance. We propose the use of a neural query rewriter to generate multiple queries and show how to integrate those queries in the passage retrieval pipeline efficiently. The main strength of our approach lies in its simplicity: it leverages how the beam search algorithm works and can produce multiple query rewrites at no additional cost. Our contributions further include devising ways to utilize multi-query rewrites in both sparse and dense first-pass retrieval. We demonstrate that applying our approach on top of a standard passage retrieval pipeline delivers state-of-the-art performance without sacrificing efficiency.

Analyzing and Mitigating Repetitions in Trip Recommendation

Wenzheng Shu
Kangqi Xu
Wenxin Tai
Ting Zhong
Yong Wang
Fan Zhou

Trip recommendation has emerged as a highly sought-after service over the past decade. Although current studies significantly understand human intention consistency, they struggle with undesired repetitive outcomes that need resolution. We make two pivotal discoveries using statistical analyses and experimental designs: (1) The occurrence of repetitions is intricately linked to the models and decoding strategies. (2) During training and decoding, adding perturbations to logits can reduce repetition. Motivated by these observations, we introduce AR-Trip (Anti Repetition for Trip Recommendation), which incorporates a cycle-aware predictor comprising three mechanisms to avoid duplicate Points-of-Interest (POIs) and demonstrates their effectiveness in alleviating repetition. Experiments on four public datasets illustrate that AR-Trip successfully mitigates repetition issues while enhancing precision.

Analyzing Fusion Methods Using the Condorcet Rule

Liron Tyomkin
Oren Kurland

The fusion task is to merge document lists retrieved from a corpus for a query. We use the Condorcet voting rule to theoretically and empirically analyze fusion methods. We also demonstrate the merits of a novel fusion method based on a different voting rule: Copeland.
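
As a rough illustration of voting-based fusion, the sketch below applies the textbook Copeland rule to ranked lists: a document gains a point for every pairwise majority contest it wins and loses a point for every one it loses. This is not the authors' fusion method, whose details are not given here. The function name, the treatment of unretrieved documents (ranked below everything in that list), and the alphabetical tie-break are assumptions.

```python
from itertools import combinations
from typing import Dict, List


def copeland_fuse(lists: List[List[str]]) -> List[str]:
    """Fuse ranked lists by Copeland score (pairwise wins minus pairwise losses)."""
    docs = sorted({d for lst in lists for d in lst})
    # Rank position of each document per list; a missing document gets rank len(list),
    # i.e. it is treated as ranked below every retrieved document (an assumption).
    ranks = [{d: i for i, d in enumerate(lst)} for lst in lists]
    score: Dict[str, int] = {d: 0 for d in docs}
    for a, b in combinations(docs, 2):
        a_pref = sum(r.get(a, len(r)) < r.get(b, len(r)) for r in ranks)
        b_pref = sum(r.get(b, len(r)) < r.get(a, len(r)) for r in ranks)
        if a_pref > b_pref:
            score[a] += 1
            score[b] -= 1
        elif b_pref > a_pref:
            score[b] += 1
            score[a] -= 1
    # Higher Copeland score first; ties broken alphabetically for determinism.
    return sorted(docs, key=lambda d: (-score[d], d))


# Hypothetical retrieved lists for one query.
RUNS = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
FUSED = copeland_fuse(RUNS)
```

In this toy example `d1` beats every other document in a pairwise majority contest, so it is also the Condorcet winner and is ranked first.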

Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommendation Systems

Dayu Yang
Fumian Chen
Hui Fang

Large Language Models (LLMs) have demonstrated great potential in Conversational Recommender Systems (CRS). However, the application of LLMs to CRS has exposed a notable discrepancy in behavior between LLM-based CRS and human recommenders: LLMs often appear inflexible and passive, frequently rushing to complete the recommendation task without sufficient inquiry. This behavior discrepancy can lead to decreased accuracy in recommendations and lower user satisfaction. Despite its importance, existing work on CRS has not studied how to measure such behavior discrepancy. To fill this gap, we propose Behavior Alignment, a new evaluation metric to measure how well the recommendation strategies made by an LLM-based CRS are consistent with human recommenders'. Our experiment results show that the new metric is better aligned with human preferences and can better differentiate how systems perform than existing evaluation metrics. As Behavior Alignment requires explicit and costly human annotations on the recommendation strategies, we also propose a classification-based method to implicitly measure the Behavior Alignment based on the responses. The evaluation results confirm the robustness of the method.

Behavior Pattern Mining-based Multi-Behavior Recommendation

Haojie Li
Zhiyong Cheng
Xu Yu
Jinhuan Liu
Guanfeng Liu
Junwei Du

Multi-behavior recommendation systems enhance effectiveness by leveraging auxiliary behaviors (such as page views and favorites) to address the limitations of traditional models that depend solely on sparse target behaviors like purchases. Existing approaches to multi-behavior recommendations typically follow one of two strategies: some derive initial node representations from individual behavior subgraphs before integrating them for a comprehensive profile, while others interpret multi-behavior data as a heterogeneous graph, applying graph neural networks to achieve a unified node representation. However, these methods do not adequately explore the intricate patterns of behavior among users and items. To bridge this gap, we introduce a novel algorithm called Behavior Pattern mining-based Multi-behavior Recommendation (BPMR). Our method extensively investigates the diverse interaction patterns between users and items, utilizing these patterns as features for making recommendations. We employ a Bayesian approach to streamline the recommendation process, effectively circumventing the challenges posed by graph neural network algorithms, such as the inability to accurately capture user preferences due to over-smoothing. Our experimental evaluation on three real-world datasets demonstrates that BPMR significantly outperforms existing state-of-the-art algorithms, showing an average improvement of 268.29% in Recall@10 and 248.02% in NDCG@10 metrics. The code of our BPMR is openly accessible for use and further research at https://github.com/rookitkitlee/BPMR.

Bi-Objective Negative Sampling for Sensitivity-Aware Search

Jack McKechnie
Graham McDonald
Craig Macdonald

Cross-encoders leverage fine-grained interactions between documents and queries for effective relevance ranking. Such ranking models are typically trained to satisfy the single objective of providing relevant information to the users. However, not all information should be made available. For example, documents containing sensitive information, such as personal or confidential information, should not be returned in the search results. Sensitivity-aware search (SAS) aims to develop retrieval models that can satisfy two objectives, namely: (1) providing the user with relevant search results, while (2) ensuring that no documents that contain sensitive information are included in the ranking. In this work, we propose three novel negative sampling strategies that enable cross-encoders to be trained to satisfy the bi-objective task of SAS. Additionally, we investigate and compare with filtering sensitive documents in ranking pipelines. Our experiments on a collection labelled for sensitivity show that our proposed negative sampling strategies lead to a ~37% increase in terms of cost-sensitive nDCG (nCSDCG) for SAS.

Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check

Linhao Ye
Zhikai Lei
Jianghao Yin
Qin Chen
Jie Zhou
Liang He

Retrieval-Augmented Generation (RAG) aims to generate more reliable and accurate responses by augmenting large language models (LLMs) with vast and dynamic external knowledge. Most previous work focuses on using RAG for single-round question answering, while how to adapt RAG to the complex conversational setting wherein the question is interdependent on the preceding context is not well studied. In this paper, we propose a conversation-level RAG (ConvRAG) approach, which incorporates fine-grained retrieval augmentation and self-check for conversational question answering (CQA). In particular, our approach consists of three components, namely conversational question refiner, fine-grained retriever and self-check based response generator, which work collaboratively for question understanding and relevant information acquisition in conversational settings. Extensive experiments demonstrate the great advantages of our approach over the state-of-the-art baselines. Moreover, we also release a Chinese CQA dataset with new features including reformulated question, extracted keyword, retrieved paragraphs and their helpfulness, which facilitates further research in RAG-enhanced CQA.

BRB-KMeans: Enhancing Binary Data Clustering for Binary Product Quantization

Suwon Lee
Sang-Min Choi

In Binary Product Quantization (BPQ), where product quantization is applied to binary data, the traditional k-majority method is used for clustering, with centroids determined based on Hamming distance and majority vote for each bit. However, this approach often leads to a degradation in clustering quality, negatively impacting BPQ's performance. To address these challenges, we introduce Binary-to-Real-and-Back K-Means (BRB-KMeans), a novel method that initially transforms binary data into real-valued vectors, performs k-means clustering on these vectors, and then converts the generated centroids back into binary data. This innovative approach significantly enhances clustering quality by leveraging the high clustering quality of k-means in the real-valued vector space, thereby facilitating future quantization for binary data. Through extensive experiments, we demonstrate that BRB-KMeans significantly enhances clustering quality and overall BPQ performance, notably outperforming traditional methods.
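
The binary-to-real-and-back idea can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the deterministic initialization (first k points), the fixed iteration count, and the 0.5 threshold used to turn real-valued centroids back into bits are all assumptions.

```python
from typing import List, Tuple

Bits = Tuple[int, ...]


def kmeans(points: List[List[float]], k: int, iters: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    for _ in range(iters):
        groups: List[List[List[float]]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def brb_kmeans(codes: List[Bits], k: int) -> List[Bits]:
    """Binary -> real -> k-means -> binary, following the three steps in the abstract."""
    real = [[float(b) for b in code] for code in codes]  # step 1: binary to real
    centroids = kmeans(real, k)                          # step 2: k-means in real space
    # Step 3: back to binary. Thresholding at 0.5 is an assumed conversion rule.
    return [tuple(1 if v >= 0.5 else 0 for v in c) for c in centroids]


# Hypothetical 4-bit codes forming two loose groups.
CODES = [(1, 1, 1, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0)]
BINARY_CENTROIDS = brb_kmeans(CODES, 2)
```

The resulting binary centroids can then be used like k-majority centroids, for example by assigning each code to the centroid at the smallest Hamming distance.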

Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors

Binzong Geng
Zhaoxin Huan
Xiaolu Zhang
Yong He
Liang Zhang
Fajie Yuan
Jun Zhou
Linjian Mo

With the rise of large language models (LLMs), recent works have leveraged LLMs to improve the performance of click-through rate (CTR) prediction. However, we argue that a critical obstacle remains in deploying LLMs for practical use: the efficiency of LLMs when processing long textual user behaviors. As user sequences grow longer, the current efficiency of LLMs is inadequate for training on billions of users and items. To break through the efficiency barrier of LLMs, we propose Behavior Aggregated Hierarchical Encoding (BAHE) to enhance the efficiency of LLM-based CTR modeling. Specifically, BAHE proposes a novel hierarchical architecture that decouples the encoding of user behaviors from inter-behavior interactions. Firstly, to prevent computational redundancy from repeated encoding of identical user behaviors, BAHE employs the LLM's pre-trained shallow layers to extract embeddings of the most granular, atomic user behaviors from extensive user sequences and stores them in the offline database. Subsequently, the deeper, trainable layers of the LLM facilitate intricate inter-behavior interactions, thereby generating comprehensive user embeddings. This separation allows the learning of high-level user representations to be independent of low-level behavior encoding, significantly reducing computational complexity. Finally, these refined user embeddings, in conjunction with correspondingly processed item embeddings, are incorporated into the CTR model to compute the CTR scores. Extensive experimental results show that BAHE reduces training time and memory by five times for CTR models using LLMs, especially with longer user sequences. BAHE has been deployed in a real-world system, allowing for daily updates of 50 million CTR data on 8 A100 GPUs, making LLMs practical for industrial CTR prediction.

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Ankit Satpute
Noah Gießing
André Greiner-Petter
Moritz Schubotz
Olaf Teschke
Akiko Aizawa
Bela Gipp

Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performances that surpass those of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this work, we follow a two-step approach to investigating the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) amongst existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our case analysis indicates that while GPT-4 can generate relevant answers, it isn't consistently accurate. This paper explores the current limitations of LLMs in navigating complex mathematical question-answering. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange

Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers?

Minghan Li
Honglei Zhuang
Kai Hui
Zhen Qin
Jimmy Lin
Rolf Jagerman
Xuanhui Wang
Michael Bendersky

Query expansion has been widely used to improve the search results of first-stage retrievers, yet its influence on second-stage, cross-encoder rankers remains under-explored. A recent study shows that current expansion techniques benefit weaker models but harm stronger rankers. In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve generalization of strong cross-encoder rankers? To answer this question, we first apply popular query expansion methods to different cross-encoder rankers and verify the deteriorated zero-shot effectiveness. We identify two vital steps in the experiment: high-quality keyword generation and minimally-disruptive query modification. We show that it is possible to improve the generalization of a strong neural ranker, by generating keywords through a reasoning chain and aggregating the ranking results of each expanded query via self-consistency, reciprocal rank weighting, and fusion. Experiments on BEIR and TREC Deep Learning 2019/2020 show that the nDCG@10 scores of both MonoT5 and RankT5 following these steps are improved, which points out a direction for applying query expansion to strong cross-encoder rankers.
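
One way to picture the aggregation of rankings from several expanded queries is standard reciprocal rank fusion (RRF), sketched below. The abstract does not give the exact weighting scheme, so this is a swapped-in, well-known technique rather than the method of the paper; the function name and the constant `k = 60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Sum 1 / (k + rank) over all rankings and sort documents by the total."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings produced by a ranker for three expanded versions of one query.
PER_QUERY_RANKINGS = [
    ["d1", "d2", "d3"],
    ["d2", "d3", "d1"],
    ["d2", "d1", "d3"],
]
FUSED_RANKING = reciprocal_rank_fusion(PER_QUERY_RANKINGS)
```

Because each expanded query only contributes a rank-based vote, a single poor expansion has limited influence on the fused ranking.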

Cluster-based Partial Dense Retrieval Fused with Sparse Text Retrieval

Yingrui Yang
Parker Carlson
Shanxiu He
Yifan Qiao
Tao Yang

Previous work has demonstrated the potential to combine document rankings from dense and sparse retrievers for higher relevance effectiveness. This paper proposes a cluster-based partial dense retrieval scheme guided by sparse retrieval results to optimize fusion between dense and sparse retrieval at a low space and CPU-time cost while retaining a competitive relevance. This scheme exploits the overlap of sparse retrieval results and document embedding clusters, and judiciously selects a limited number of clusters to probabilistically guarantee the inclusion of top sparse results. This paper provides an evaluation of this scheme on its in-domain and zero-shot retrieval performance for the MS MARCO and BEIR datasets.

Combining Large Language Models and Crowdsourcing for Hybrid Human-AI Misinformation Detection

Xia Zeng
David La Barbera
Kevin Roitero
Arkaitz Zubiaga
Stefano Mizzaro

Research on misinformation detection has primarily focused either on furthering Artificial Intelligence (AI) for automated detection or on studying humans' ability to deliver an effective crowdsourced solution. Each of these directions, however, shows different benefits. This motivates our work to study hybrid human-AI approaches jointly leveraging the potential of large language models and crowdsourcing, which is understudied to date. We propose novel combination strategies Model First, Worker First, and Meta Vote, which we evaluate along with baseline methods such as mean, median, hard- and soft-voting. Using 120 statements from the PolitiFact dataset, and a combination of state-of-the-art AI models and crowdsourced assessments, we evaluate the effectiveness of these combination strategies. Results suggest that the effectiveness varies with scale granularity, and that combining AI and human judgments enhances truthfulness assessments' effectiveness and robustness.

Contextualization with SPLADE for High Recall Retrieval

Eugene Yang

High Recall Retrieval (HRR), such as eDiscovery and medical systematic review, is a search problem that optimizes the cost of retrieving most relevant documents in a given collection. Iterative approaches, such as iterative relevance feedback and uncertainty sampling, are shown to be effective under various operational scenarios. Despite neural models demonstrating success in other text-related tasks, linear models such as logistic regression, in general, are still more effective and efficient in HRR since the model is trained and retrieves documents from the same fixed collection. In this work, we leverage SPLADE, an efficient retrieval model that transforms documents into contextualized sparse vectors, for HRR. Our approach combines the best of both worlds, leveraging both the contextualization from pretrained language models and the efficiency of linear models. It reduces 10% and 18% of the review cost in two HRR evaluation collections under a one-phase review workflow with a target recall of 80%. The experiment is implemented with TARexp and is available at https://github.com/eugene-yang/LSR-for-TAR.

Convex Feature Embedding for Face and Voice Association

Jiwoo Kang
Taewan Kim
Young-ho Park

Face-and-voice association learning poses significant challenges in the field of deep learning. In this paper, we propose a straightforward yet effective approach for cross-modal feature embedding, specifically targeting the correlation between facial and voice association. Previous studies have examined cross-modal association tasks in order to establish the relationship between voice clips and facial images, as well as the issue of cross-modal discrimination; however, they have not adequately recognized the importance of managing the heterogeneity in inter-modal features between audio and video. As a result, there is a significant prevalence of false positives and false negatives. To address the issue, the proposed method learns the embeddings of cross-modal features by introducing an additional feature that bridges the gap between these features. This facilitates the embedding of voice and face features belonging to the same individual within a convex hull. Through the utilization of cross-modal feature learning, cross-modal attention particularly reduces inter-class variance, resulting in a notable enhancement of the clustering power. We comprehensively evaluated our approach on cross-modal verification, matching, and retrieval tasks using the large-scale VoxCeleb dataset. Extensive experimental results demonstrate that the proposed method achieves notable improvements over existing state-of-the-art methods.

Counterfactual Augmentation for Robust Authorship Representation Learning

Hieu Man
Thien Huu Nguyen

Authorship attribution is a task that aims to identify the author of given pieces of writing. Authorship representation learning using neural networks has been shown to work in open-set environment settings with hundreds of thousands of authors. However, the performance of authorship attribution models often degrades significantly when texts are from different domains than the training data. In this work, we propose addressing this issue by adopting a novel causal framework for authorship representation learning. Our key insight is to use causal interventions during training to make models robust to differences in domains. Specifically, we introduce generating style-counterfactual examples by retrieving the most similar content texts by different authors on the same topics/domains. This exposes the model to challenging examples with similar content but distinct styles. Furthermore, we introduce causal masking of topic-indicative words to generate content-counterfactual examples. Content-counterfactuals hide topic content to encourage focusing on writing style. Experiments on three disparate domains - Amazon reviews, fanfiction stories, and Reddit comments - demonstrate that our approach significantly outperforms previous state-of-the-art methods for authorship attribution.

Cross-reconstructed Augmentation for Dual-target Cross-domain Recommendation

Qingyang Mao
Qi Liu
Zhi Li
Likang Wu
Bing Lv
Zheng Zhang

To alleviate the long-standing data sparsity issue in recommender systems, numerous studies in cross-domain recommendation (CDR) have been conducted to facilitate information transfer processes across domains. In recent years, dual-target CDR has been introduced to gain mutual improvements between two domains through more general bidirectional transfer rather than traditional one-way transit. Existing methods in dual-target CDR focus primarily on designing powerful encoders to learn representative cross-domain information, without tackling the fundamental issue of interaction data shortage. In this paper, we present CrossAug, a novel data augmentation approach to leverage interactions more efficiently in two domains. Specifically, we propose intra-domain and inter-domain augmentations based on cross-reconstructed representations in terms of sampled records. To reduce the harm of domain shift, we project domain-shared representations in two domains into a joint space with Householder transformations and apply center alignments. All these modules boost the utilization of interactions with little influence from negative transfer. Extensive experimental results over public datasets demonstrate the effectiveness of CrossAug and its components in dual-target CDR.

Dense Retrieval with Continuous Explicit Feedback for Systematic Review Screening Prioritisation

Xinyu Mao
Shengyao Zhuang
Bevan Koopman
Guido Zuccon

The goal of screening prioritisation in systematic reviews is to identify relevant documents with high recall and rank them in early positions for review. This saves reviewing effort if paired with a stopping criterion, and speeds up review completion if performed alongside downstream tasks. Recent studies have shown that neural models have good potential on this task, but their time-consuming fine-tuning and inference discourage their widespread use for screening prioritisation. In this paper, we propose an alternative approach that still relies on neural models, but leverages dense representations and relevance feedback to enhance screening prioritisation, without the need for costly model fine-tuning and inference. This method exploits continuous relevance feedback from reviewers during document screening to efficiently update the dense query representation, which is then applied to rank the remaining documents to be screened. We evaluate this approach across the CLEF TAR datasets for this task. Results suggest that the investigated dense query-driven approach is more efficient than directly using neural models and shows promising effectiveness compared to previous methods developed on the considered datasets. Our code is available at https://github.com/ielab/dense-screening-feedback.
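
The feedback loop described above, where reviewer judgments update a dense query vector that then re-ranks the unscreened documents, can be sketched with a Rocchio-style update. The abstract does not state the update rule, so Rocchio is an assumed stand-in; the function names, the weights `alpha`, `beta`, `gamma`, and the toy vectors are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product used as the dense relevance score."""
    return sum(x * y for x, y in zip(a, b))


def rocchio_update(
    query: Vector,
    relevant: List[Vector],
    non_relevant: List[Vector],
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 0.5,
) -> Vector:
    """Move the query toward judged-relevant and away from judged-non-relevant vectors."""

    def centroid(vectors: List[Vector]) -> Vector:
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    pos, neg = centroid(relevant), centroid(non_relevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


def rank_remaining(query: Vector, docs: Dict[str, Vector]) -> List[str]:
    """Rank unscreened documents by inner product with the current query vector."""
    return sorted(docs, key=lambda d: (-dot(query, docs[d]), d))


# Hypothetical 2-d embeddings of documents still waiting to be screened.
UNSCREENED = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.5, 0.5]}
INITIAL_QUERY = [1.0, 0.0]
# The reviewer judged one screened document relevant and one non-relevant.
UPDATED_QUERY = rocchio_update(INITIAL_QUERY, [[0.0, 1.0]], [[1.0, 0.0]])
```

No model weights change in this loop: only the query vector is updated after each judgment, which is what keeps the per-judgment cost low compared with fine-tuning.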

Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation

Yoori Oh
Yoseob Han
Kyogu Lee

There has been growing interest in audio-language retrieval research, where the objective is to establish the correlation between audio and text modalities. However, most audio-text paired datasets often lack rich expression of the text data compared to the audio samples. One of the significant challenges facing audio-text datasets is the presence of similar or identical captions despite different audio samples. Therefore, under many-to-one mapping conditions, audio-text datasets lead to poor performance of retrieval tasks. In this paper, we propose a novel approach to tackle the data imbalance problem in the audio-language retrieval task. To overcome the limitation, we introduce a method that employs a distance sampling-based paraphraser leveraging ChatGPT, utilizing a distance function to generate a controllable distribution of manipulated text data. For a set of sentences with the same context, the distance is used to calculate a degree of manipulation for any two sentences, and ChatGPT's few-shot prompting is performed using a text cluster with a similar distance defined by the Jaccard similarity. Therefore, ChatGPT, when applied to few-shot prompting with text clusters, can adjust the diversity of the manipulated text based on the distance. The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
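
A Jaccard-based distance between two captions, and the selection of caption pairs within a target distance band, can be sketched as below. This only illustrates the distance measure named in the abstract; the whitespace tokenization, the function names, and the band-selection helper are assumptions, and the ChatGPT prompting step is not shown.

```python
from typing import List, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercased whitespace tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def pairs_in_band(captions: List[str], low: float, high: float) -> List[Tuple[str, str]]:
    """Caption pairs whose distance lies in [low, high], e.g. as few-shot examples."""
    selected = []
    for i, first in enumerate(captions):
        for second in captions[i + 1:]:
            if low <= jaccard_distance(first, second) <= high:
                selected.append((first, second))
    return selected


# Hypothetical captions describing the same audio clip.
CAPTIONS = ["a dog barks", "a dog barks loudly", "birds sing in a forest"]
```

Choosing a narrow band of small distances yields near-duplicate pairs, while a band of larger distances yields more heavily rewritten pairs.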

Distillation for Multilingual Information Retrieval

Eugene Yang
Dawn Lawrie
James Mayfield

Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework that trains a cross-language neural dual-encoder model using translation and distillation. However, Translate-Distill only supports a single document language. Multilingual information retrieval (MLIR), which ranks a multilingual document collection, is harder to train than CLIR because the model must assign comparable relevance scores to documents in different languages. This work extends Translate-Distill and proposes Multilingual Translate-Distill (MTD) for MLIR. We show that ColBERT-X models trained with MTD outperform their counterparts trained with Multilingual Translate-Train, which is the previous state-of-the-art training approach, by 5% to 25% in nDCG@20 and 15% to 45% in MAP. We also show that the model is robust to the way languages are mixed in training batches. Our implementation is available on GitHub.

EASE-DR: Enhanced Sentence Embeddings for Dense Retrieval

Xixi Zhou
Yang Gao
Xin Jie
Xiaoxu Cai
Jiajun Bu
Haishuai Wang

Recent neural information retrieval models using dense text representations generated by pre-trained models commonly face two issues. First, a pre-trained model (e.g., BERT) usually truncates a long document before giving its representation, which may cause the loss of some important semantic information. Second, although pre-trained models like BERT have been widely used in generating sentence embeddings, a substantial body of literature has shown that these models often represent sentence embeddings in a homogeneous and narrow space, known as the problem of representation anisotropy, which hurts the quality of dense vector retrieval. In this paper, we split the query and the document in information retrieval into two sets of natural sentences and generate their sentence embeddings with BERT, the most popular pre-trained model. Before aggregating the sentence embeddings to get the entire embedding representations of the input query and document, to alleviate the usual representation degeneration problem of sentence embeddings from BERT, we sample the variational auto-encoder's latent space distribution to obtain isotropic sentence embeddings and utilize supervised contrastive learning to make the distribution of these sentence embeddings uniform in the representation space. Our proposed model undergoes training optimization for both the query and the document in the above-mentioned aspects. Our model performs well when evaluated on three extensively researched neural information retrieval datasets.

Enhancing Criminal Case Matching through Diverse Legal Factors

Jie Zhao
Ziyu Guan
Wei Zhao
Yue Jiang

Criminal case matching endeavors to determine the relevance between different criminal cases. Conventional methods predict the relevance solely based on instance-level semantic features and neglect the diverse legal factors (LFs), which are associated with diverse court judgments. Consequently, comprehensively representing a criminal case remains a challenge for these approaches. Moreover, extracting and utilizing these LFs for criminal case matching faces two challenges: (1) the manual annotations of LFs rely heavily on specialized legal knowledge; (2) overlaps among LFs may potentially harm the model's performance. In this paper, we propose a two-stage framework named Diverse Legal Factor-enhanced Criminal Case Matching (DLF-CCM). In stage one, DLF-CCM employs a multi-task learning framework to pre-train an LF extraction network on a large-scale legal judgment prediction dataset. In stage two, DLF-CCM introduces an LF de-redundancy module to learn shared LF and exclusive LFs. Moreover, an entropy-weighted fusion strategy is introduced to dynamically fuse the multiple relevance scores generated by all LFs. Experimental results validate the effectiveness of DLF-CCM and show its significant improvements over competitive baselines. Code: https://github.com/jiezhao6/DLF-CCM.

Enhancing Task Performance in Continual Instruction Fine-tuning Through Format Uniformity

Xiaoyu Tan
Leijun Cheng
Xihe Qiu
Shaojie Shi
Yuan Cheng
Wei Chu
Yinghui Xu
Yuan Qi

In recent advancements, large language models (LLMs) have demonstrated remarkable capabilities in diverse tasks, primarily through interactive question-answering with humans. This development marks significant progress towards artificial general intelligence (AGI). Despite their superior performance, LLMs often exhibit limitations when adapted to domain-specific tasks through instruction fine-tuning (IF). The primary challenge lies in the discrepancy between the data distribution in general and domain-specific contexts, leading to suboptimal accuracy in specialized tasks. To address this, continual instruction fine-tuning (CIF), particularly supervised fine-tuning (SFT), on targeted domain-specific instruction datasets is necessary. Our ablation study reveals that the structure of these instruction datasets critically influences CIF performance, with substantial data distributional shifts resulting in notable performance degradation. In this paper, we introduce a novel framework that enhances CIF by promoting format uniformity. We assess our approach using the Llama2 chat model across various domain-specific instruction datasets. The results demonstrate not only an improvement in task-specific performance under CIF but also a reduction in catastrophic forgetting (CF). This study contributes to the optimization of LLMs for domain-specific applications, highlighting the significance of data structure and distribution in CIF.

Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees

Jingwei Kang
Maarten de Rijke
Harrie Oosterhuis

Stochastic learning to rank (LTR) is a recent branch in the LTR field that concerns the optimization of probabilistic ranking models. Their probabilistic behavior enables certain ranking qualities that are impossible with deterministic models. For example, they can increase the diversity of displayed documents, increase fairness of exposure over documents, and better balance exploitation and exploration through randomization. A core difficulty in LTR is gradient estimation; for this reason, existing stochastic LTR methods have been limited to differentiable ranking models (e.g., neural networks). This is in stark contrast with the general field of LTR where Gradient Boosted Decision Trees (GBDTs) have long been considered the state-of-the-art. In this work, we address this gap by introducing the first stochastic LTR method for GBDTs. Our main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix, which is a requirement for effective GBDTs. To efficiently compute both the first and second-order derivatives simultaneously, we incorporate our estimator into the existing PL-Rank framework, which was originally designed for first-order derivatives only. Our experimental results indicate that stochastic LTR without the Hessian has extremely poor performance, whilst the performance is competitive with the current state-of-the-art with our estimated Hessian. Thus, through the contribution of our novel Hessian estimation method, we have successfully introduced GBDTs to stochastic LTR.

Evaluating Retrieval Quality in Retrieval-Augmented Generation

Alireza Salemi
Hamed Zamani

Evaluating retrieval-augmented generation (RAG) presents challenges, particularly for retrieval models within these systems. Traditional end-to-end evaluation methods are computationally expensive. Furthermore, evaluation of the retrieval model's performance based on query-document relevance labels shows a small correlation with the RAG system's downstream performance. We propose a novel evaluation approach, eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system. The output generated for each document is then evaluated based on the downstream task ground truth labels. In this manner, the downstream performance for each document serves as its relevance label. We employ various downstream task metrics to obtain document-level annotations and aggregate them using set-based or ranking metrics. Extensive experiments on a wide range of datasets demonstrate that eRAG achieves a higher correlation with downstream RAG performance compared to baseline methods, with improvements in Kendall's tau correlation ranging from 0.168 to 0.494. Additionally, eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
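
The per-document labelling idea described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generate` is a hypothetical stand-in for the LLM, `exact_match` is one possible downstream metric, and precision@k is used as an example of a set-based aggregation.

```python
def erag_document_labels(query, documents, generate, metric, gold):
    """Label each retrieved document with the downstream score of the output
    produced when the generator sees that document alone."""
    return [metric(generate(query, [doc]), gold) for doc in documents]


def precision_at_k(labels, k):
    """Aggregate document-level labels with a simple set-based metric."""
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0


def toy_generate(query, docs):
    # Hypothetical stand-in for an LLM: "answers" with the last word of the document.
    return docs[0].split()[-1]


def exact_match(output, gold):
    # One possible downstream metric (assumed here for illustration).
    return 1.0 if output.strip().lower() == gold.strip().lower() else 0.0


retrieved = [
    "The capital of France is Paris",
    "Berlin is in Germany",
    "The largest city in France is Paris",
]
labels = erag_document_labels(
    "What is the capital of France?", retrieved, toy_generate, exact_match, "Paris"
)
print(labels)                     # [1.0, 0.0, 1.0]
print(precision_at_k(labels, 2))  # 0.5
```

In a real setting the generator call would be the RAG system's own LLM, and the metric would be whatever the downstream task uses.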

Explainable Uncertainty Attribution for Sequential Recommendation

Carles Balsells-Rodas
Fan Yang
Zhishen Huang
Yan Gao

Sequential recommendation systems suggest products based on users' historical behaviours. The inherent sparsity of user-item interactions in a vast product space often leads to unreliable recommendations. Recent research addresses this challenge by leveraging auxiliary product relations to mitigate recommendation uncertainty, and by quantifying uncertainty in recommendation scores to modify candidate selection. However, such approaches may be inefficient because they require additional side information, or may yield suboptimal recommendations. To enhance sequential recommendation performance by leveraging uncertainty information, we introduce Explainable Uncertainty Attribution (ExUA). We employ gradient-based saliency attribution to identify sources of uncertainty stemming from sequential interactions. Experimental findings on Amazon and MovieLens datasets demonstrate ExUA's effectiveness in identifying interactions that induce uncertainty, resulting in a 6%+ improvement in NDCG@20 scores when the uncertainty information is integrated into a post-hoc training phase.

Fake News Detection via Multi-scale Semantic Alignment and Cross-modal Attention

Jiandong Wang
Hongguang Zhang
Chun Liu
Xiongjun Yang

The effective utilization of multimodal information is crucial for the task of fake news detection. The primary challenge lies in reducing the semantic distance between the news evidence of different modalities and accurately aligning them in the latent space. It has been observed that the relevance of news image content varies across coarse-to-fine scales, depending on the object and its class label. For instance, the visual evidence to justify a news item may sometimes be a very local detail, such as a tiny change in facial expression or limb position, and sometimes the global composition. Consequently, how to align news text with image at the most discriminative scale significantly impacts detection performance. However, very few studies have addressed this issue in fake news detection. In this paper, we delve deeper into this issue and propose a simple yet effective Multi-scale Semantic Alignment and Cross-modal Attention (MSACA) network. Specifically, we construct hierarchical multi-scale images for each news item, enhance the semantic consistency between text and images in the latent space, and employ an attention module to select the deterministic embeddings in an end-to-end manner. Extensive experiments on two real-world benchmarks demonstrate the superior performance of our proposed MSACA network.

Faster Learned Sparse Retrieval with Block-Max Pruning

Antonio Mallia
Torsten Suel
Nicola Tonellotto

Learned sparse retrieval systems aim to combine the effectiveness of contextualized language models with the scalability of conventional data structures such as inverted indexes. Nevertheless, the indexes generated by these systems exhibit significant deviations from the ones that use traditional retrieval models, leading to a discrepancy in the performance of existing query optimizations that were specifically developed for traditional structures. These disparities arise from structural variations in query and document statistics, including sub-word tokenization, leading to longer queries, smaller vocabularies, and different score distributions within posting lists. This paper introduces Block-Max Pruning (BMP), an innovative dynamic pruning strategy tailored for indexes arising in learned sparse retrieval environments. BMP employs a block filtering mechanism to divide the document space into small, consecutive document ranges, which are then aggregated and sorted on the fly, and fully processed only as necessary, guided by a defined safe early termination criterion or based on approximate retrieval requirements. Through rigorous experimentation, we show that BMP substantially outperforms existing dynamic pruning strategies, offering unparalleled efficiency in safe retrieval contexts and improved trade-offs between precision and efficiency in approximate retrieval tasks.
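
The block filtering and safe early termination described in this abstract can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes non-negative term weights, a fixed block size, and an in-memory dictionary index, and all function names are hypothetical. Each block of consecutive document ids gets a score upper bound from per-term block maxima; blocks are visited in decreasing bound order and processing stops once no remaining block can beat the current k-th score.

```python
import heapq
from collections import defaultdict


def build_index(doc_vectors, block_size):
    """Build postings and per-term block maxima; a document id is its list position."""
    postings = defaultdict(dict)   # term -> {doc_id: weight}
    block_max = defaultdict(dict)  # term -> {block_id: max weight in that block}
    for doc_id, vec in enumerate(doc_vectors):
        block = doc_id // block_size
        for term, weight in vec.items():
            postings[term][doc_id] = weight
            block_max[term][block] = max(weight, block_max[term].get(block, 0.0))
    return postings, block_max


def score(query, postings, doc_id):
    """Dot product between the query weights and one document."""
    return sum(qw * postings[t].get(doc_id, 0.0)
               for t, qw in query.items() if t in postings)


def bmp_search(query, postings, block_max, block_size, k):
    """Return the top-k (score, doc_id) pairs; assumes k >= 1 and non-negative weights."""
    bounds = defaultdict(float)  # block_id -> upper bound on any score in the block
    for term, qw in query.items():
        for block, m in block_max.get(term, {}).items():
            bounds[block] += qw * m
    top = []  # min-heap of (score, doc_id)
    for block, bound in sorted(bounds.items(), key=lambda item: -item[1]):
        if len(top) == k and bound <= top[0][0]:
            break  # safe termination: no remaining block can exceed the k-th score
        for doc_id in range(block * block_size, (block + 1) * block_size):
            s = score(query, postings, doc_id)
            if s <= 0.0:
                continue
            if len(top) < k:
                heapq.heappush(top, (s, doc_id))
            elif s > top[0][0]:
                heapq.heapreplace(top, (s, doc_id))
    return sorted(top, reverse=True)


docs = [{"a": 1.0}, {"b": 0.4}, {"a": 3.0, "b": 1.0}, {"c": 2.0}, {"a": 0.2}, {"b": 0.1}]
postings, block_max = build_index(docs, block_size=2)
results = bmp_search({"a": 1.0, "b": 2.0}, postings, block_max, block_size=2, k=2)
print(results)  # [(5.0, 2), (1.0, 0)]
```

In this toy run the third block has an upper bound of 0.4, which is below the current second-best score of 1.0, so it is skipped without scoring its documents.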

FedUD: Exploiting Unaligned Data for Cross-Platform Federated Click-Through Rate Prediction

Wentao Ouyang
Rui Dong
Ri Tao
Xiangzheng Liu

Click-through rate (CTR) prediction plays an important role in online advertising platforms. Most existing methods use data from the advertising platform itself for CTR prediction. As user behaviors also exist on many other platforms, e.g., media platforms, it is beneficial to further exploit such complementary information for better modeling user interest and for improving CTR prediction performance. However, due to privacy concerns, data from different platforms cannot be uploaded to a server for centralized model training. Vertical federated learning (VFL) provides a possible solution which is able to keep the raw data on respective participating parties and learn a collaborative model in a privacy-preserving way. However, traditional VFL methods only utilize aligned data with common keys across parties, which strongly restricts their application scope. In this paper, we propose FedUD, which is able to exploit unaligned data, in addition to aligned data, for more accurate federated CTR prediction. FedUD contains two steps. In the first step, FedUD utilizes aligned data across parties like traditional VFL, but it additionally includes a knowledge distillation module. This module distills useful knowledge from the guest party's high-level representations and guides the learning of a representation transfer network. In the second step, FedUD applies the learned knowledge to enrich the representations of the host party's unaligned data such that both aligned and unaligned data can contribute to federated model training. Experiments on two real-world datasets demonstrate the superior performance of FedUD for federated CTR prediction.

Fine-Tuning LLaMA for Multi-Stage Text Retrieval

Xueguang Ma
Liang Wang
Nan Yang
Furu Wei
Jimmy Lin

While large language models (LLMs) have shown impressive NLP capabilities, existing IR applications mainly focus on prompting LLMs to generate query expansions or generating permutations for listwise reranking. In this study, we leverage LLMs directly to serve as components in the widely used multi-stage text ranking pipeline. Specifically, we fine-tune the open-source LLaMA-2 model as a dense retriever (repLLaMA) and a pointwise reranker (rankLLaMA). This is performed for both passage and document retrieval tasks using the MS MARCO training data. Our study shows that finetuned LLM retrieval models outperform smaller models. They are more effective and exhibit greater generalizability, requiring only a straightforward training strategy. Moreover, our pipeline allows for the fine-tuning of LLMs at each stage of a multi-stage retrieval pipeline. This demonstrates the strong potential for optimizing LLMs to enhance a variety of retrieval tasks. Furthermore, as LLMs are naturally pre-trained with longer contexts, they can directly represent longer documents. This eliminates the need for heuristic segmenting and pooling strategies to rank long documents. On the MS MARCO and BEIR datasets, our repLLaMA-rankLLaMA pipeline demonstrates a high level of effectiveness.

From Text to Context: An Entailment Approach for News Stakeholder Classification

Alapan Kuila
Sudeshna Sarkar

Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.

General-Purpose User Modeling with Behavioral Logs: A Snapchat Case Study

Qixiang Fang
Zhihan Zhou
Francesco Barbieri
Yozen Liu
Leonardo Neves
Dong Nguyen
Daniel Oberski
Maarten Bos
Ron Dotsch

Learning general-purpose user representations based on user behavioral logs is an increasingly popular user modeling approach. It benefits from easily available, privacy-friendly yet expressive data, and does not require extensive re-tuning of the upstream user model for different downstream tasks. While this approach has shown promise in search engines and e-commerce applications, its fit for instant messaging platforms, a cornerstone of modern digital communication, remains largely uncharted. We explore this research gap using Snapchat data as a case study. Specifically, we implement a Transformer-based user model with customized training objectives and show that the model can produce high-quality user representations across a broad range of evaluation tasks, among which we introduce three new downstream tasks that concern pivotal topics in user research: user safety, engagement and churn. We also tackle the challenge of efficient extrapolation of long sequences at inference time, by applying a novel positional encoding method.

Generalizable Tip-of-the-Tongue Retrieval with LLM Re-ranking

Luís Borges
Rohan Jha
Jamie Callan
Bruno Martins

Tip-of-the-Tongue (ToT) retrieval is challenging for search engines because the queries are usually natural-language, verbose, and contain uncertain and inaccurate information. This paper studies the generalization capabilities of existing retrieval methods with ToT queries in multiple domains. We curate a multi-domain dataset and evaluate the effectiveness of recall-oriented first-stage retrieval methods across the different domains, considering in-domain, out-of-domain, and multi-domain training settings. We further explore the use of a Large Language Model (LLM), i.e., GPT-4, for zero-shot re-ranking in various ToT domains, relying solely on the item titles. Results show that multi-domain training enhances recall, and that LLMs are strong zero-shot re-rankers, especially for popular items, outperforming direct GPT-4 prompting without first-stage retrieval. Datasets and code can be found on GitHub: https://github.com/LuisPB7/TipTongue

Graph Diffusive Self-Supervised Learning for Social Recommendation

Jiuqiang Li
Hongjun Wang

Social recommendation aims at augmenting user-item interaction relationships and boosting recommendation quality by leveraging social information. Recently, self-supervised learning (SSL) has gained widespread adoption for social recommendation. However, most existing methods exhibit poor robustness when faced with sparse user behavior data and are susceptible to inevitable social noise. To overcome the aforementioned limitations, we introduce a new Graph Diffusive Self-Supervised Learning (GDSSL) paradigm for social recommendation. Our approach involves the introduction of a guided social graph diffusion model that can adaptively mitigate the impact of social relation noise commonly found in real-world scenarios. This model progressively introduces random noise to the initial social graph and then iteratively restores it to recover the original structure. Additionally, to enhance robustness against noise and sparsity, we propose graph diffusive self-supervised learning, which utilizes the denoised social relation graph generated by our diffusion model for contrastive learning. The extensive experimental outcomes consistently indicate that our proposed GDSSL outmatches existing advanced solutions in social recommendation.

Graph Reasoning Enhanced Language Models for Text-to-SQL

Zheng Gong
Ying Sun

Text-to-SQL parsing has attracted substantial attention recently due to its potential to remove barriers for non-expert end users interacting with databases. A key challenge in Text-to-SQL parsing is developing effective encoding mechanisms to capture the complex relationships between question words, database schemas, and their associated connections within the heterogeneous graph structure. Existing approaches typically introduce some useful multi-hop structures manually and then incorporate them into graph neural networks (GNNs) by stacking multiple layers, which (1) ignore the difficult-to-identify but meaningful semantics embedded in the multi-hop reasoning path, and (2) are limited by the expressive capability of GNN to capture long-range dependencies among the heterogeneous graph. To address these shortcomings, we introduce GRL-SQL, a graph reasoning enhanced language model, which innovatively applies structure encoding to capture the dependencies between node pairs, encompassing one-hop, multi-hop and distance information, subsequently enriched through self-attention for enhanced representational power over GNNs. Furthermore, GRL-SQL incorporates an interaction module that enables joint reasoning and fusion over the question-schema representations for enhancing global context modeling. Comprehensive experiments demonstrate the effectiveness and robustness of our proposed GRL-SQL.

Grasping Both Query Relevance and Essential Content for Query-focused Summarization

Ye Xiong
Hidetaka Kamigaito
Soichiro Murakami
Peinan Zhang
Hiroya Takamura
Manabu Okumura

Numerous effective methods have been developed to improve query-focused summarization (QFS) performance, e.g., pre-trained model-based and query-answer relevance-based methods. However, these methods still suffer from missing or redundant information due to the inability to capture and effectively utilize the interrelationship between the query and the source document, as well as between the source document and its generated summary, resulting in the summary being unable to answer the query or containing additional unrequired information. To mitigate this problem, we propose an end-to-end hierarchical two-stage summarization model that first predicts essential content and then generates a summary by emphasizing the predicted important sentences while maintaining separate encodings for the query and the source, so that it can comprehend not only the query itself but also the essential information in the source. We evaluated the proposed model on two QFS datasets, and the results indicated its overall effectiveness and that of each component.

IdmGAE: Importance-Inspired Dynamic Masking for Graph Autoencoders

Ge Chen
Yulan Hu
Sheng Ouyang
Zhirui Yang
Yong Liu
Cuicui Luo

Generative self-supervised learning, exemplified by masked graph autoencoders (GAEs), aims to reconstruct the masked graph characteristics, garnering increasing research interest and achieving promising results. Despite the progress made, existing research on masked GAEs predominantly relies on a random strategy to obscure input graph characteristics, disregarding the potential variability in node importance during the reconstruction process. In this study, we propose an efficient masking strategy termed importance-inspired dynamic masking to explore diverse node sampling. Our approach employs an auxiliary network to assess the importance of nodes within the graph. Subsequently, we derive sampling probabilities guided by the learned node importance. To enhance model training, we introduce a dynamic sampling strategy that adapts to nodes of varying importance across different training stages. This importance-inspired dynamic masking strategy empowers masked GAEs to acquire refined representations for graph-related tasks. We have conducted extensive experiments on node classification, which validate the efficacy of our proposed strategy.

Improving In-Context Learning via Sequentially Selection and Preference Alignment for Few-Shot Aspect-Based Sentiment Analysis

Qianlong Wang
Keyang Ding
Xuan Luo
Ruifeng Xu

In this paper, we leverage the in-context learning (ICL) paradigm to handle few-shot aspect-based sentiment analysis (ABSA). Previous works first rank candidate examples by some metrics and then independently retrieve examples similar to test samples. However, their effectiveness may be discounted because of two limitations: in-context example redundancy and example preference misalignment between retriever and LLM. To alleviate them, we propose a novel framework that sequentially retrieves in-context examples. It not only considers which example is useful for the test sample but also prevents its information from being duplicated by already retrieved examples. Subsequently, we exploit the rewards of LLMs on retrieved in-context examples to optimize parameters for bridging preference gaps. Experiments on four ABSA datasets show that our framework is significantly superior to previous works.
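
The general idea of picking examples one at a time while penalizing overlap with already chosen ones can be sketched with greedy maximal marginal relevance (MMR). MMR is a swapped-in, well-known technique used purely for illustration; the abstract does not state that the authors use it, and the embedding vectors and the `trade_off` parameter below are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_sequentially(test_vec, candidates, k, trade_off=0.5):
    """Greedily pick k candidate indices, trading relevance to the test
    sample against redundancy with examples that were already picked."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def gain(i):
            relevance = cosine(test_vec, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1.0 - trade_off) * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


test_vec = [1.0, 0.2]
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
print(select_sequentially(test_vec, candidates, k=2))                 # [1, 2]
print(select_sequentially(test_vec, candidates, k=2, trade_off=1.0))  # [1, 0]
```

With the redundancy penalty switched on, the near-duplicate candidate 0 is skipped in favour of the more distinct candidate 2; with `trade_off=1.0` the selection reduces to plain similarity ranking.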

Inferring Climate Change Stances from Multimodal Tweets

Nan Bai
Ricardo da Silva Torres
Anna Fensel
Tamara Metze
Art Dewulf

Climate change is a heated discussion topic in public arenas such as social media. Both texts and visuals play key roles in the debate, as they can complement, contradict, or reinforce each other in nuanced ways. There is therefore an urgent need to study the messages as multimodal objects to better understand the polarized debate about climate change impacts and policies. Multimodal representation models such as CLIP are known to be able to transfer knowledge across domains and modalities, enabling the investigation of textual and visual semantics together. Yet they are not directly able to distinguish the nuances between supporting and sceptic climate change stances. This paper explores a simple but effective strategy combining modality fusion and domain-knowledge enhancement to prepare CLIP-based models with knowledge of climate change stances. We collect a multimodal Dutch Twitter dataset and evaluate the proposed strategy on it; the strategy increases the macro-average F1 score across stances from 51% to 86%. The outcomes can be applied in both data science and public policy studies, to better analyse how the combined use of texts and visuals generates meanings during debates, in the context of climate change and beyond.

Information Diffusion Prediction via Cascade-Retrieved In-context Learning

Ting Zhong
Jienan Zhang
Zhangtao Cheng
Fan Zhou
Xueqin Chen

Information diffusion prediction, which aims to infer the infected behavior of individual users during information spread, is critical for understanding the dynamics of information propagation and users' influence on online social media. To date, existing methods either focus on capturing limited contextual information from a single cascade, overlooking the potentially complex dependencies across different cascades, or they are committed to improving model performance by using intricate technologies to extract additional features as supplements to user representations, neglecting the drift of model performance across different platforms. To address these limitations, we propose a novel framework called CARE (CAscade-REtrieved In-Context Learning) inspired by the concept of in-context learning in LLMs. Specifically, CARE first constructs a prompt pool derived from historical cascades, then utilizes ranking-based search engine techniques to retrieve prompts with similar patterns based on the query. Moreover, CARE also introduces two augmentation strategies alongside social relationship enhancement to enrich the input context. Finally, the transformed query-cascade representation from a GPT-type architecture is projected to obtain the prediction. Experiments on real-world datasets from various platforms show that CARE outperforms state-of-the-art baselines in terms of effectiveness and robustness in information diffusion prediction.

Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts

Subhendu Khatuya
Koushiki Sinha
Niloy Ganguly
Saptarshi Ghosh
Pawan Goyal

While automatic summarization techniques have made significant advancements, their primary focus has been on summarizing short news articles or documents that have clear structural patterns like scientific articles or government reports. There has not been much exploration into developing efficient methods for summarizing financial documents, which often contain complex facts and figures. Here, we study the problem of bullet point summarization of long Earnings Call Transcripts (ECTs) using the recently released ECTSum dataset. We leverage an unsupervised question-based extractive module followed by a parameter-efficient instruction-tuned abstractive module to solve this task. Our proposed model FLANFinBPS achieves new state-of-the-art performance, outperforming the strongest baseline with a 14.88% average ROUGE score gain, and is capable of generating factually consistent bullet point summaries that capture the important facts discussed in the ECTs. We make the codebase publicly available at https://github.com/subhendukhatuya/FLAN-FinBPS.

Label Hierarchical Structure-Aware Multi-Label Few-Shot Intent Detection via Prompt Tuning

Xiaotong Zhang
Xinyi Li
Han Liu
Xinyue Liu
Xianchao Zhang

Multi-label intent detection aims to recognize multiple user intents behind dialogue utterances. The diversity of user utterances and the scarcity of training data motivate multi-label few-shot intent detection. However, existing methods ignore the hybrid of verb and noun within an intent, which is essential to identify the user intent. In this paper, we propose a label hierarchical structure-aware method for multi-label few-shot intent detection via prompt tuning (LHS). Firstly, for the support data, we concatenate the original utterance with the label description generated by GPT-4 to obtain the utterance-level representation. Then we construct a multi-label hierarchical structure-aware prompt model to learn the label hierarchical information. To learn more discriminative class prototypes, we devise a prototypical contrastive learning method to pull the utterances close to their corresponding intent labels and away from other intent labels. Extensive experiments on two datasets demonstrate the superiority of our method.

Language Fairness in Multilingual Information Retrieval

Eugene Yang
Thomas Jänich
James Mayfield
Dawn Lawrie

Multilingual information retrieval (MLIR) considers the problem of ranking documents in several languages for a query expressed in a language that may differ from any of those languages. Recent work has observed that approaches such as combining ranked lists representing a single document language each or using multilingual pretrained language models demonstrate a preference for one language over others. This results in systematic unfair treatment of documents in different languages. This work proposes a language fairness metric to evaluate whether documents across different languages are fairly ranked through statistical equivalence testing using the Kruskal-Wallis test. In contrast to most prior work in group fairness, we do not consider any language to be an unprotected group. Thus our proposed measure, PEER (Probability of Equal Expected Rank), is the first fairness metric specifically designed to capture the language fairness of MLIR systems. We demonstrate the behavior of PEER on artificial ranked lists. We also evaluate real MLIR systems on two publicly available benchmarks and show that the PEER scores align with prior analytical findings on MLIR fairness. Our implementation is compatible with ir-measures and is available at http://github.com/hltcoe/peer_measure.
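
As background for the Kruskal-Wallis test mentioned in this abstract, the sketch below computes the standard H statistic over the rank positions of documents grouped by language. It is only the textbook test statistic under the assumption of untied ranks, not the PEER measure itself; the function name and the toy rankings are hypothetical.

```python
from collections import defaultdict


def kruskal_wallis_h(ranked_languages):
    """Kruskal-Wallis H statistic for a ranked list.

    ranked_languages[i] is the language of the document at rank i + 1.
    Rank positions are unique, so no tie correction is applied.
    """
    n = len(ranked_languages)
    rank_sums = defaultdict(int)
    counts = defaultdict(int)
    for rank, lang in enumerate(ranked_languages, start=1):
        rank_sums[lang] += rank
        counts[lang] += 1
    spread = sum(rank_sums[lang] ** 2 / counts[lang] for lang in counts)
    return 12.0 / (n * (n + 1)) * spread - 3.0 * (n + 1)


# One language occupies every top position versus an interleaved ranking.
separated = ["en", "en", "en", "zh", "zh", "zh"]
interleaved = ["en", "zh", "zh", "en", "en", "zh"]
print(kruskal_wallis_h(separated) > kruskal_wallis_h(interleaved))  # True
```

A larger H indicates that the languages occupy more clearly separated rank ranges. Turning H into a p-value requires a chi-squared distribution with one fewer degree of freedom than the number of languages, which is omitted here.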

Large Language Models Based Stemming for Information Retrieval: Promises, Pitfalls and Failures

Shuai Wang
Shengyao Zhuang
Guido Zuccon

Text stemming is a natural language processing technique that is used to reduce words to their base form, also known as the root form. In Information Retrieval (IR), stemming is used in keyword-based matching pipelines to normalise text before indexing and query processing to improve subsequent matching between document and query keywords. The use of stemming has been shown to often improve the effectiveness of keyword-matching models such as BM25. However, traditional stemming methods, focusing solely on individual terms, overlook the richness of contextual information.

Recognizing this gap, in this paper, we investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability of context understanding. In this respect, we identify three avenues, each characterised by different trade-offs in terms of computational cost, effectiveness and robustness: (1) use LLMs to stem the vocabulary for a collection, i.e., the set of unique words that appear in the collection (vocabulary stemming), (2) use LLMs to stem each document separately (contextual stemming), and (3) use LLMs to extract from each document entities that should not be stemmed, then use vocabulary stemming to stem the rest of the terms (entity-based contextual stemming). Through a series of empirical experiments, we compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text. We find that while vocabulary stemming and contextual stemming fail to achieve higher effectiveness than traditional stemmers, entity-based contextual stemming can achieve a higher effectiveness than using the Porter stemmer alone, under specific conditions. Code and results are made available at https://github.com/ielab/SIGIR-2024-LLM-Stemming.

MACA: Memory-aided Coarse-to-fine Alignment for Text-based Person Search

Liangxu Su
Rong Quan
Zhiyuan Qi
Jie Qin

Text-based person search (TBPS) aims to search for the target person in the full image through textual descriptions. The key to addressing this task is to effectively perform cross-modality alignment between text and images. In this paper, we propose a novel TBPS framework, named Memory-Aided Coarse-to-fine Alignment (MACA), to learn an accurate and reliable alignment between the two modalities. Firstly, we introduce a proposal-based alignment module, which performs contrastive learning to accurately align the textual modality with different pedestrian proposals at a coarse-grained level. Secondly, for the fine-grained alignment, we propose an attribute-based alignment module to mitigate unreliable features by aligning text-wise details with image-wise global features. Moreover, we introduce an intuitive memory bank strategy to supplement useful negative samples for more effective contrastive learning, improving the convergence and generalization ability of the model based on the learned discriminative features. Extensive experiments on CUHK-SYSU-TBPS and PRW-TBPS demonstrate the superiority of MACA over state-of-the-art approaches. The code is available at https://github.com/suliangxu/MACA.

Masked Graph Transformer for Large-Scale Recommendation

Huiyuan Chen
Zhe Xu
Chin-Chia Michael Yeh
Vivian Lai
Yan Zheng
Minghua Xu
Hanghang Tong

Graph Transformers have garnered significant attention for learning graph-structured data, thanks to their superb ability to capture long-range dependencies among nodes. However, the quadratic space and time complexity hinders the scalability of Graph Transformers, particularly for large-scale recommendation. Here we propose an efficient Masked Graph Transformer, named MGFormer, capable of capturing all-pair interactions among nodes with a linear complexity. To achieve this, we treat all user/item nodes as independent tokens, enhance them with positional embeddings, and feed them into a kernelized attention module. Additionally, we incorporate learnable relative degree information to appropriately reweigh the attentions. Experimental results show the superior performance of our MGFormer, even with a single attention layer.

Memory-Efficient Deep Recommender Systems using Approximate Rotary Compositional Embedding

Dongning Ma
Xun Jiao

Embedding tables in deep recommender systems (DRS) process categorical data, which can be memory-intensive due to the high feature cardinality. In this paper, we propose Approximate Rotary Compositional Embedding (ARCE), which intentionally trades off performance to aggressively reduce the size of the embedding tables. Specifically, ARCE uses compositional embedding to split large embedding tables into smaller compositions and replaces index look-ups with vector rotations. To regain the performance loss of this trade-off, ARCE features an input approximation where one index is mapped into multiple indices, creating a larger space for a potential increased learning capability. Experimental results show that using ARCE can reduce the memory overhead of embedding tables in DRS by more than 1000x with less than 3% performance loss, highlighting the potential of using ARCE for less memory intensive DRS designs. We open-source ARCE at https://github.com/VU-DETAIL/arce.

MKV: Mapping Key Semantics into Vectors for Rumor Detection

Yang Li
Liguang Liu
Jiacai Guo
Lap-Kei Lee
Fu Lee Wang
Zhenguo Yang

The cross-attention mechanism has been widely employed in the multimodal rumor detection task, but it is computation-intensive and suffers from a restricted modal receptive field. In this paper, we propose a multimodal rumor detection model (MKV), which maps multimodal key semantics with discrimination into feature vectors for rumor detection. More specifically, MKV extracts high-dimensional features for each modality separately by the Multimodal Feature Extractor (MFE). The mapping mechanism learns a low-dimensional mapping scheme (Map) and key semantics (Key) with discrimination from the different modal features, respectively. Subsequently, the Map and Key jointly construct a state matrix (State) containing all possible permutations of modalities. In particular, a max pooling operation is performed on State and produces a feature vector (Vector). The mapping mechanism is able to incrementally learn the discriminative semantics in a stacking manner. Vectors from the stacking process are leveraged in the Rumor Detection module (RD). Extensive experiments on two public datasets show that MKV achieves state-of-the-art performance.

Modeling Domains as Distributions with Uncertainty for Cross-Domain Recommendation

Xianghui Zhu
Mengqun Jin
Hengyu Zhang
Chang Meng
Daoxin Zhang
Xiu Li

In the field of dual-target Cross-Domain Recommendation (DTCDR), improving the performance in both the information-sparse domain and the information-rich domain has been a mainstream research trend. However, prior embedding-based methods are insufficient to adequately describe the dynamics of user actions and items across domains. Moreover, previous efforts frequently lacked a comprehensive investigation of the entire domain distributions. This paper proposes a novel framework entitled Wasserstein Cross-Domain Recommendation (WCDR) that captures uncertainty in Wasserstein space to address the above challenges. In this framework, we abstract user/item actions as Elliptical Gaussian distributions and divide them into local-intrinsic and global-domain parts. To further model the domain diversity, we adopt a shared-specific pattern for global-domain distributions and present a Masked Domain-aware Sub-distribution Aggregation (MDSA) module to produce informative and diversified global-domain distributions, which incorporates an attention-based aggregation method and a masking strategy that alleviates negative transfer issues. Extensive experiments on two public datasets and one business dataset are conducted. Experimental results demonstrate the superiority of WCDR over state-of-the-art methods.

Modeling Scholarly Collaboration and Temporal Dynamics in Citation Networks for Impact Prediction

Pengwei Yan
Yangyang Kang
Zhuoren Jiang
Kaisong Song
Tianqianjin Lin
Changlong Sun
Xiaozhong Liu

Accurately evaluating the impact of scientific papers is crucial. However, existing methodologies face certain challenges, including latent factors affecting citation behaviors and the intrinsically dynamic nature of citation networks. To address these challenges, this study introduces a novel framework named CoDy (modeling scholarly Collaboration and temporal Dynamics in citation networks for impact prediction). CoDy strategically predicts author collaborations as an auxiliary task, forecasting not only the number of current collaborations between scholars but also the number of future collaborations among them. Besides, CoDy proposes a fine-grained temporal encoding module to model the multiple different temporal patterns for publication and citation. Extensive experimental validations demonstrate CoDy's effectiveness in predicting citation counts and classifying impact levels. In-depth analyses provide further validation of its reliability and robustness. CoDy can significantly enhance impact prediction by explicitly modeling collaboration and temporal patterns and offer valuable insights into paper impact formation.

MoME: Mixture-of-Masked-Experts for Efficient Multi-Task Recommendation

Jiahui Xu
Lu Sun
Dengji Zhao

Multi-task learning techniques have attracted great attention in recommendation systems because they can meet the needs of modeling multiple perspectives simultaneously and improve recommendation performance. As promising multi-task recommendation system models, Mixture-of-Experts (MoE) and related methods use an ensemble of expert sub-networks to improve generalization and have achieved significant success in practical applications. However, they still face key challenges in efficient parameter sharing and resource utilization, especially when they are applied to real-world datasets and resource-constrained devices. In this paper, we propose a novel framework called Mixture-of-Masked-Experts (MoME) to address the challenges. Unlike MoE, expert sub-networks in MoME are extracted from an identical over-parameterized base network by learning binary masks. It utilizes a binary mask learning mechanism composed of neuron-level model masking and weight-level expert masking to achieve coarse-grained base model pruning and fine-grained expert pruning, respectively. Compared to existing MoE-based models, MoME achieves efficient parameter sharing and requires significantly less sub-network storage since it actually only trains a base network and a mixture of partially overlapped binary expert masks. Experimental results on real-world datasets demonstrate the superior performance of MoME in terms of recommendation accuracy and computational efficiency. Our code is available at https://github.com/Xjh0327/MoME.

Multi-intent-aware Session-based Recommendation

Minjin Choi
Hye-young Kim
Hyunsouk Cho
Jongwuk Lee

Session-based recommendation (SBR) aims to predict the following item a user will interact with during an ongoing session. Most existing SBR models focus on designing sophisticated neural-based encoders to learn a session representation, capturing the relationship among session items. However, they tend to focus on the last item, neglecting diverse user intents that may exist within a session. This limitation leads to significant performance drops, especially for longer sessions. To address this issue, we propose a novel SBR model, called Multi-intent-aware Session-based Recommendation Model (MiaSRec). It adopts frequency embedding vectors indicating the item frequency in session to enhance the information about repeated items. MiaSRec represents various user intents by deriving multiple session representations centered on each item and dynamically selecting the important ones. Extensive experimental results show that MiaSRec outperforms existing state-of-the-art SBR models on six datasets, particularly those with longer average session length, achieving up to 6.27% and 24.56% gains for MRR@20 and Recall@20. Our code is available at https://github.com/jin530/MiaSRec.

Multi-Layer Ranking with Large Language Models for News Source Recommendation

Wenjia Zhang
Lin Gui
Rob Procter
Yulan He

To seek reliable information sources for news events, we introduce a novel task of expert recommendation, which aims to identify trustworthy sources based on their previously quoted statements. To achieve this, we built a novel dataset, called NewsQuote, consisting of 23,571 quote-speaker pairs sourced from a collection of news articles. We formulate the recommendation task as the retrieval of experts based on their likelihood of being associated with a given query. We also propose a multi-layer ranking framework employing Large Language Models to improve the recommendation performance. Our results show that employing an in-context learning based LLM ranker and a multi-layer ranking-based filter significantly improve both the predictive quality and behavioural quality of the recommender system.

Multi-view Mixed Attention for Contrastive Learning on Hypergraphs

Jongsoo Lee
Dong-Kyu Chae

Hypergraphs are effective in learning high-order relationships between nodes, which naturally represent group interactions as hyperedges (i.e., arbitrary-sized subsets of nodes). However, most approaches currently used for learning hypergraph representations do not consider pairwise relationships between nodes. While high-order relationships provide insight into the general connections among nodes in a group, they do not reveal the pairwise relationships between individual nodes within that group. Considering that it is unlikely for all nodes in the same group to share identical relationships, we argue that considering pairwise relationships is a critical aspect. In this paper, we propose Multi-view Mixed Attention for Contrastive Learning (MMACL) to address the aforementioned problem. MMACL proposes Mixed-Attention, which blends high-order relationships derived from the hypergraph attention network and pairwise relationships derived from the graph attention network. Then, it performs node-level contrastive learning to the graph structure with different views learned at each layer to finally obtain an expressive node representation. Our extensive experimental results on several popular datasets validate the effectiveness of the proposed MMACL for hypergraph node classification. Our code is available at: https://github.com/JongsooLee-HYU/MMACL

Negative as Positive: Enhancing Out-of-distribution Generalization for Graph Contrastive Learning

Zixu Wang
Bingbing Xu
Yige Yuan
Huawei Shen
Xueqi Cheng

Graph contrastive learning (GCL), standing as the dominant paradigm in the realm of graph pre-training, has yielded considerable progress. Nonetheless, its capacity for out-of-distribution (OOD) generalization has been relatively underexplored. In this work, we point out that the traditional optimization of InfoNCE in GCL restricts the cross-domain pairs only to be negative samples, which inevitably enlarges the distribution gap between different domains. This violates the requirement of domain invariance under OOD scenario and consequently impairs the model's OOD generalization performance. To address this issue, we propose a novel strategy "Negative as Positive", where the most semantically similar cross-domain negative pairs are treated as positive during GCL. Our experimental results, spanning a wide array of datasets, confirm that this method substantially improves the OOD generalization performance of GCL.

Neural Click Models for Recommender Systems

Mikhail Shirokikh
Ilya Shenbin
Anton Alekseev
Anna Volodkevich
Alexey Vasilev
Andrey V. Savchenko
Sergey Nikolenko

We develop and evaluate neural architectures to model the user behavior in recommender systems (RS) inspired by click models for Web search but going beyond standard click models. Proposed architectures include recurrent networks, Transformer-based models that alleviate the quadratic complexity of self-attention, adversarial and hierarchical architectures. Our models outperform baselines on the ContentWise and RL4RS datasets and can be used in RS simulators to model user response for RS evaluation and pretraining.

Old IR Methods Meet RAG

Oz Huly
Idan Pogrebinsky
David Carmel
Oren Kurland
Yoelle Maarek

Retrieval augmented generation (RAG) is an important approach to provide large language models (LLMs) with context pertaining to the text generation task: given a prompt, passages are retrieved from external corpora to ground the generation with more relevant and/or fresher data. Most previous studies used dense retrieval methods for applying RAG in question answering scenarios. However, recent work showed that traditional information retrieval methods (a.k.a. sparse methods) can do as well as or even better than dense retrieval ones. In particular, it was shown that Okapi BM25 outperforms dense retrieval methods, in terms of perplexity, for the fundamental text completion task in LLMs. We extend this study and show, using two popular LLMs, that a broad set of sparse retrieval methods achieve better results than all the dense retrieval methods we experimented with, for varying lengths of queries induced from the prompt. Furthermore, we found that Okapi BM25 is substantially outperformed by a term-proximity retrieval method (MRF), which is in turn outperformed by a pseudo-feedback-based bag-of-terms approach (relevance model). Additional exploration sheds some light on the effectiveness of lexical retrieval methods for RAG. Our findings call for further study of classical retrieval methods for RAG.
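
For readers unfamiliar with the sparse baseline named in this abstract, the following is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It is not the authors' experimental setup: the function name `bm25_scores`, the toy corpus, the parameter defaults (`k1=1.2`, `b=0.75`), and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus and query.
corpus = [
    ["sparse", "retrieval", "bm25"],
    ["dense", "retrieval"],
    ["neural", "ranking", "model"],
]
scores = bm25_scores(["sparse", "retrieval"], corpus)
```

In a RAG setting, the top-scoring passages would be prepended to the prompt; how queries are induced from the prompt is specific to the paper and is not modeled here.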

On Backbones and Training Regimes for Dense Retrieval in African Languages

Akintunde Oladipo
Mofetoluwa Adeyemi
Jimmy Lin

The effectiveness of dense retrieval models trained with multilingual language models as backbones has been demonstrated in multilingual and cross-lingual information retrieval contexts. The optimal choice of a backbone model for a given retrieval task is dependent on the target retrieval domain as well as the pre-training domain of available language models and their generalization capabilities, the availability of relevance judgements, etc. In this work, we study the impact of these factors on retrieval effectiveness for African languages using three multilingual benchmark datasets: Mr. TyDi, MIRACL, and the newly released CIRAL dataset. We compare the effectiveness of mBERT as a backbone for dense retrieval models against multilingual language models such as AfriBERTa and AfroXLMR, which are specialized for African languages. Furthermore, we examine the impact of different training regimes on the effectiveness of dense retrieval in different domains for African languages. Our findings show that the pre-training domain of the backbone LM plays a huge role in retrieval effectiveness, especially in the absence of retrieval training data. Code artifacts are available at https://github.com/castorini/afridpr_backbones.

PAG-LLM: Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors

Vikas Yadav
Zheng Tang
Vijay Srinivasan

Large language models (LLMs) have achieved remarkable success in natural language generation, but less focus has been given to their applicability in key tasks such as intent classification. We show that LLMs like LLaMa can achieve high performance on intent classification tasks with a large number of classes but still make classification errors and, worse, generate out-of-vocabulary intent labels. To address these critical issues, we introduce the Paraphrase and AGgregate (PAG)-LLM approach, wherein an LLM generates multiple paraphrases of the input query (parallel queries), performs intent classification for the original query and each paraphrase, and at the end aggregates all the predicted intent labels based on their confidence scores. We evaluate PAG-LLM on two large intent classification datasets, CLINC and Banking, and show 22.7% and 15.1% error reduction, respectively. We show that PAG-LLM is especially effective for hard examples where the LLM is uncertain, and reduces the critical misclassification and hallucinated label generation errors.
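
The aggregation step described above could be sketched as confidence-weighted voting. The abstract does not specify the exact aggregation rule, so summing confidences per label and discarding labels outside the known label set are assumptions, and the function name `aggregate_intents` and the example labels are hypothetical.

```python
from collections import defaultdict


def aggregate_intents(predictions, valid_labels):
    """Pick the label with the highest summed confidence.

    `predictions` holds (label, confidence) pairs for the original query and
    its paraphrases. Labels outside `valid_labels` are dropped (assumption).
    Returns None when no valid label was predicted.
    """
    totals = defaultdict(float)
    for label, confidence in predictions:
        if label in valid_labels:
            totals[label] += confidence
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical predictions: original query first, then three paraphrases.
predictions = [
    ("transfer", 0.55),
    ("balance", 0.40),
    ("balance", 0.35),
    ("check_funds", 0.90),  # not in the label set, so it is ignored
]
chosen = aggregate_intents(predictions, {"transfer", "balance"})
```

Here two moderately confident votes for one label outweigh a single stronger vote for another, which is the kind of correction paraphrasing is meant to enable.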

PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval

Dawn Lawrie
Efsun Kayi
Eugene Yang
James Mayfield
Douglas W. Oard

PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effective in batch experiments, its performance degrades in streaming settings where documents arrive over time because representations of new tokens may be poorly modeled by the earlier tokens used to select cluster centroids. PLAID Streaming Hierarchical Indexing that Runs on Terabytes of Temporal Text (PLAID SHIRTTT) addresses this concern using multi-phase incremental indexing based on hierarchical sharding. Experiments on ClueWeb09 and the multilingual NeuCLIR collection demonstrate the effectiveness of this approach both for the largest collection indexed to date by the ColBERT architecture and in the multilingual setting, respectively.
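
The centroid-plus-residual representation mentioned above can be sketched as follows. This is a simplified illustration rather than the PLAID implementation: the residual is kept uncompressed, centroids are given rather than learned, and the names `encode`, `decode`, and `nearest_centroid` are hypothetical.

```python
def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector` in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))


def encode(vector, centroids):
    """Represent a token vector as (centroid id, residual from that centroid)."""
    cid = nearest_centroid(vector, centroids)
    residual = [x - c for x, c in zip(vector, centroids[cid])]
    return cid, residual


def decode(cid, residual, centroids):
    """Reconstruct the token vector from its centroid id and residual."""
    return [c + r for c, r in zip(centroids[cid], residual)]


# Hypothetical centroids chosen from an early batch of tokens.
centroids = [[0.0, 0.0], [1.0, 1.0]]
cid, residual = encode([0.9, 1.2], centroids)
restored = decode(cid, residual, centroids)

# A later token far from every centroid yields a large residual, which a
# lossy residual compressor would represent poorly.
late_cid, late_residual = encode([5.0, -4.0], centroids)
```

The last two lines hint at the streaming problem the abstract describes: centroids selected from early tokens may fit later tokens badly.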

Predicting Micro-video Popularity via Multi-modal Retrieval Augmentation

Ting Zhong
Jian Lang
Yifan Zhang
Zhangtao Cheng
Kunpeng Zhang
Fan Zhou

Accurately predicting the popularity of micro-videos is crucial for real-world applications such as recommender systems and identifying viral marketing opportunities. Existing methods often focus on limited cross-modal information within individual micro-videos, overlooking the potential advantages of exploiting the vast repository of past videos. We present MMRA, a multi-modal retrieval-augmented popularity prediction model that enhances prediction accuracy using relevant retrieved information. MMRA first retrieves relevant instances from a multi-modal memory bank, aligning video and text through transformation mechanisms involving a vision model and a text-based retriever. Additionally, a multi-modal interaction network is carefully designed to jointly capture cross-modal correlations within the target video and extract informative knowledge through retrieved instances, ultimately enhancing the prediction. Extensive experiments conducted on the real-world micro-video dataset demonstrate the superiority of MMRA when compared to state-of-the-art models. The code and data are available at https://github.com/ICDM-UESTC/MMRA.

Prediction of the Realisation of an Information Need: An EEG Study

Niall McGuire
Yashar Moshfeghi

One of the foundational goals of Information Retrieval (IR) is to satisfy searchers' Information Needs (IN). Understanding how INs physically manifest has long been a complex and elusive process. However, recent studies utilising Electroencephalography (EEG) data have provided real-time insights into the neural processes associated with INs. Unfortunately, they have yet to demonstrate how this insight can practically benefit the search experience. As such, within this study, we explore the ability to predict the realisation of IN within EEG data across 14 participants whilst partaking in a Question-Answering (Q/A) task. Furthermore, we investigate the combinations of EEG features that yield optimal predictive performance, as well as identify regions within the Q/A queries where a subject's realisation of IN is more pronounced. The findings from this work demonstrate that EEG data is sufficient for the real-time prediction of the realisation of an IN across all participants with an accuracy of 73.5% (SD 2.6%) and on a per-subject basis with an accuracy of 90.1% (SD 22.1%). This work helps to close the gap by bridging theoretical neuroscientific advancements with tangible improvements in information retrieval practices, paving the way for real-time prediction of the realisation of IN.

PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking

Yuzhang Xie
Jiaying Lu
Joyce Ho
Fadi Nahab
Xiao Hu
Carl Yang

Linking (aligning) biomedical concepts across diverse data sources enables various integrative analyses, but it is challenging due to the discrepancies in concept naming conventions. Various strategies have been developed to overcome this challenge, such as those based on string-matching rules, manually crafted thesauri, and machine learning models. However, these methods are constrained by limited prior biomedical knowledge and can hardly generalize beyond the limited amounts of rules, thesauri, or training samples. Recently, large language models (LLMs) have exhibited impressive results in diverse biomedical NLP tasks due to their unprecedentedly rich prior knowledge and strong zero-shot prediction abilities. However, LLMs suffer from issues including high costs, limited context length, and unreliable predictions. In this research, we propose PromptLink, a novel biomedical concept linking framework that leverages LLMs. Empirical results on the concept linking task between two EHR datasets and an external biomedical KG demonstrate the effectiveness of PromptLink. Furthermore, PromptLink is a generic framework without reliance on additional prior knowledge, context, or training data, making it well-suited for concept linking across various types of data sources. The source code of this study is available at https://github.com/constantjxyz/PromptLink.

R-ODE: Ricci Curvature Tells When You Will be Informed

Li Sun
Jingbin Hu
Mengjie Li
Hao Peng

Information diffusion prediction is fundamental to understanding the structure and organization of online social networks, and plays a crucial role in blocking rumor spread, influence maximization, political propaganda, etc. So far, most existing solutions primarily predict the next user who will be informed with historical cascades, but ignore an important factor in the diffusion process: the time. This limitation motivates us to pose the problem of the time-aware personalized information diffusion prediction for the first time, telling the time when the target user will be informed. In this paper, we address this problem from a fresh geometric perspective of Ricci curvature, and propose a novel Ricci-curvature regulated Ordinary Differential Equation (R-ODE). In the diffusion process, R-ODE considers that the inter-correlated users are organized in a dynamic system in the representation space, and the cascades give the observations sampled from the continuous realm. At each infection time, the message diffuses along the largest Ricci curvature, signifying less transportation effort. In the continuous realm, the message triggers users' movement, whose trajectory in the space is parameterized by an ODE with a graph neural network. Consequently, R-ODE predicts the infection time of a target user by the movement trajectory learnt from the observations. Extensive experiments evaluate the personalized time prediction ability of R-ODE, and show that R-ODE outperforms the state-of-the-art baselines.

ReCODE: Modeling Repeat Consumption with Neural ODE

Sunhao Dai
Changle Qu
Sirui Chen
Xiao Zhang
Jun Xu

In real-world recommender systems, such as in the music domain, repeat consumption is a common phenomenon where users frequently listen to a small set of preferred songs or artists repeatedly. The key point of modeling repeat consumption is capturing the temporal patterns between a user's repeated consumption of the items. Existing studies often rely on heuristic assumptions, such as assuming an exponential distribution for the temporal gaps. However, due to the high complexity of real-world recommender systems, these pre-defined distributions may fail to capture the intricate dynamic user consumption patterns, leading to sub-optimal performance. Drawing inspiration from the flexibility of neural ordinary differential equations (ODE) in capturing the dynamics of complex systems, we propose ReCODE, a novel model-agnostic framework that utilizes neural ODE to model repeat consumption. ReCODE comprises two essential components: a user's static preference prediction module and the modeling of user dynamic repeat intention. By considering both immediate choices and historical consumption patterns, ReCODE offers comprehensive modeling of user preferences in the target context. Moreover, ReCODE seamlessly integrates with various existing recommendation models, including collaborative-based and sequential-based models, making it easily applicable in different scenarios. Experimental results on two real-world datasets consistently demonstrate that ReCODE significantly improves the performance of base models and outperforms other baseline methods.

RLStop: A Reinforcement Learning Stopping Method for TAR

Reem Bin-Hezam
Mark Stevenson

We present RLStop, a novel Technology Assisted Review (TAR) stopping rule based on reinforcement learning that helps minimise the number of documents that need to be manually reviewed within TAR applications. RLStop is trained on example rankings using a reward function to identify the optimal point to stop examining documents. Experiments at a range of target recall levels on multiple benchmark datasets (CLEF e-Health, TREC Total Recall, and Reuters RCV1) demonstrated that RLStop substantially reduces the workload required to screen a document collection for relevance. RLStop outperforms a wide range of alternative approaches, achieving performance close to the maximum possible for the task under some circumstances.

SCM4SR: Structural Causal Model-based Data Augmentation for Robust Session-based Recommendation

Muskan Gupta
Priyanka Gupta
Jyoti Narwariya
Lovekesh Vig
Gautam Shroff

With mounting privacy concerns, and movement towards a cookie-less internet, session-based recommendation (SR) models are gaining increasing popularity. The goal of SR models is to recommend top-K items to a user by utilizing information from past actions within a session. Many deep neural network (DNN) based SR models have been proposed in the literature; however, they experience performance declines in practice due to inherent biases (e.g., popularity bias) present in training data. To alleviate this, we propose an underlying neural-network (NN) based Structural Causal Model (SCM) which comprises an evolving user behavior (simulator) and recommendation model. The causal relations between the two sub-models and variables at consecutive timesteps are defined by a sequence of structural equations, whose parameters are learned using logged data. The learned SCM enables the simulation of a user's response on a counterfactual list of recommended items (slate). For this, we intervene on recommendation slates with counterfactual slates and simulate the user's response through the learned SCM, thereby generating counterfactual sessions to augment the training data. Through extensive empirical evaluation on simulated and real-world datasets, we show that the augmented data mitigates the impact of sparse training data and improves the performance of the SR models.

Searching for Physical Documents in Archival Repositories

Tokinori Suzuki
Douglas W. Oard
Emi Ishita
Yoichi Tomiura

Early retrieval systems were used to search physical media (e.g., paper) using manually created metadata. Modern ranked retrieval techniques are far more capable, but they require that content be either born digital or digitized. For physical content, searching metadata remains the state of the art. This paper seeks to change that, using a textual-edge graph neural network to learn relations between items from available metadata and from any content that has been digitized. Results show that substantial improvement over the best prior method can be achieved.

Self-Explainable Next POI Recommendation

Kai Yang
Yi Yang
Qiang Gao
Ting Zhong
Yong Wang
Fan Zhou

Point-of-Interest (POI) recommendation involves predicting users' next preferred POI and is becoming increasingly significant in location-based social networks. However, users are often reluctant to trust recommended results due to the lack of transparency in these systems. While recent work on explaining recommender systems has gained attention, prevailing methods only provide post-hoc explanations based on results or rudimentary explanations according to attention scores. Such limitations hinder reliability and applicability in risk-sensitive scenarios. Inspired by information theory, we propose a self-explainable framework with an ante-hoc view for next POI recommendation, aimed at overcoming these limitations. Specifically, we endow self-explainability to POI recommender systems through compact representation learning using a variational information bottleneck approach. The learned representation further improves accuracy by reducing redundancy behind massive spatial-temporal trajectories, which, in turn, boosts the recommendation performance. Experiments on three real-world datasets show significant improvements in both model explainability and recommendation performance.

Self-Referential Review: Exploring the Impact of Self-Reference Effect in Review

Kyusik Kim
Hyungwoo Song
Bongwon Suh

The self-reference effect is a psychological phenomenon where information relating to oneself is processed more deeply and remembered more effectively than other information. We propose "self-referential reviews," crafted by merging personal information with existing reviews using the novel "Self-Referential ReviewMaker" prototype, which leverages Large Language Models (LLMs). The essence of the "self-referential review" lies in harnessing the self-reference effect, making the readers feel as if they are the protagonist of the review. To validate the efficacy of self-referential reviews, we conducted a user study focusing on online reviews with thirty-four participants. The contributions of our paper are centered around self-referential reviews, highlighting (1) the creation of these reviews using our new prototype, Self-Referential ReviewMaker, (2) their effectiveness in enhancing review helpfulness through the self-reference effect, and (3) the identification of additional factors influencing the self-reference effect with further discussion on enhancing user-focused review systems.

SpherE: Expressive and Interpretable Knowledge Graph Embedding for Set Retrieval

Zihao Li
Yuyi Ao
Jingrui He

Knowledge graphs (KGs), which store an extensive number of relational facts (head, relation, tail), serve various applications. While many downstream tasks rely heavily on the expressive modeling and predictive embedding of KGs, most current KG representation learning methods, where each entity is embedded as a vector in Euclidean space and each relation is embedded as a transformation, follow an entity ranking protocol. On one hand, such an embedding design cannot capture many-to-many relations. On the other hand, in many retrieval cases, users wish to get an exact set of answers without any ranking, especially when the results are expected to be precise, e.g., which genes cause an illness. Such scenarios are commonly referred to as "set retrieval". This work presents a pioneering study on the KG set retrieval problem. We show that set retrieval depends heavily on expressive modeling of many-to-many relations, and propose a new KG embedding model, SpherE, to address this problem. SpherE is based on rotational embedding methods, but each entity is embedded as a sphere instead of a vector. While inheriting the high interpretability of rotation-based models, SpherE can more expressively model one-to-many, many-to-one, and many-to-many relations. Through extensive experiments, we show that SpherE addresses the set retrieval problem well while still having a good predictive ability to infer missing facts. The code is available at https://github.com/Violet24K/SpherE.
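
The sphere idea can be illustrated with a toy sketch. It assumes a 2-D embedding stored as a complex number, a relation given as a single rotation angle, and an overlap test (distance between centers no larger than the sum of radii) as the membership rule. These choices, and the names `rotate` and `retrieve_set`, are illustrative assumptions and are not taken from the paper or its repository.

```python
import cmath
import math

def rotate(center, angle):
    """Rotate a 2-D point (stored as a complex number) about the origin."""
    return center * cmath.exp(1j * angle)

def retrieve_set(head, relation_angle, entities):
    """Return the unranked set of entities whose sphere overlaps the rotated head sphere.

    head: (center, radius); entities: {name: (center, radius)}.
    The overlap rule is an assumed membership criterion for illustration only.
    """
    h_center, h_radius = head
    rotated = rotate(h_center, relation_angle)
    return sorted(
        name
        for name, (center, radius) in entities.items()
        if abs(rotated - center) <= h_radius + radius
    )

# Hypothetical toy data: two tails lie near the rotated head, one lies far away.
entities = {
    "a": (0 + 1j, 0.1),
    "b": (0.15 + 1j, 0.1),
    "c": (-1 + 0j, 0.1),
}
answers = retrieve_set((1 + 0j, 0.1), math.pi / 2, entities)
```

Because membership depends on radii rather than on a single point, one head can match several tails at once, which is the one-to-many behaviour the abstract describes.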

SPLATE: Sparse Late Interaction Retrieval

Thibault Formal
Stéphane Clinchant
Hervé Déjean
Carlos Lassance

The late interaction paradigm introduced with ColBERT stands out in the neural Information Retrieval space, offering a compelling effectiveness-efficiency trade-off across many benchmarks. Efficient late interaction retrieval is based on an optimized multi-step strategy, where an approximate search first identifies a set of candidate documents to re-rank exactly. In this work, we introduce SPLATE, a simple and lightweight adaptation of the ColBERTv2 model which learns an "MLM adapter", mapping its frozen token embeddings to a sparse vocabulary space with a partially learned SPLADE module. This allows us to perform the candidate generation step in late interaction pipelines with traditional sparse retrieval techniques, making it particularly appealing for running ColBERT in CPU environments. Our SPLATE ColBERTv2 pipeline achieves the same effectiveness as the PLAID ColBERTv2 engine by re-ranking 50 documents that can be retrieved in under 10 ms.

Stochastic RAG: End-to-End Retrieval-Augmented Generation through Expected Utility Maximization

Hamed Zamani
Michael Bendersky

This paper introduces Stochastic RAG, a novel approach for end-to-end optimization of retrieval-augmented generation (RAG) models that relaxes the simplifying assumptions of marginalization and document independence made in most prior work. Stochastic RAG casts the retrieval process in RAG as a stochastic sampling without replacement process. Through this formulation, we employ straight-through Gumbel-top-k that provides a differentiable approximation for sampling without replacement and enables effective end-to-end optimization for RAG. We conduct extensive experiments on seven diverse datasets on a wide range of tasks, from open-domain question answering to fact verification to slot-filling for relation extraction and to dialogue systems. By applying this optimization method to a recent and effective RAG model, we advance state-of-the-art results on six out of seven datasets.
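
The sampling step behind Gumbel-top-k can be sketched in a few lines. The version below covers only the forward sampling (perturb each log-score with Gumbel noise, then keep the k largest); the straight-through gradient estimator used for end-to-end training needs an autodiff framework and is omitted. The function name `gumbel_top_k` and the example scores are hypothetical.

```python
import math
import random

def gumbel_top_k(log_scores, k, rng):
    """Sample k distinct indices without replacement via the Gumbel-top-k trick."""
    perturbed = []
    for idx, score in enumerate(log_scores):
        # random() may return exactly 0.0, so clamp before taking logs.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score + gumbel_noise, idx))
    # Keeping the k largest perturbed scores yields a sample without replacement.
    perturbed.sort(reverse=True)
    return [idx for _, idx in perturbed[:k]]

# Hypothetical unnormalized log retrieval scores for five documents.
doc_log_scores = [2.0, 1.5, 0.3, -0.2, -1.0]
sampled = gumbel_top_k(doc_log_scores, k=2, rng=random.Random(0))
```

Repeated calls with different random states return different document subsets, with higher-scoring documents selected more often.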

Synthetic Test Collections for Retrieval Evaluation

Hossein A. Rahmani
Nick Craswell
Emine Yilmaz
Bhaskar Mitra
Daniel Campos

Constructing test collections in Information Retrieval (IR) is vital for evaluating search algorithms. Obtaining a diverse set of user queries for test collection construction can be challenging, and acquiring relevance judgments, which indicate the appropriateness of retrieved documents, is often costly and resource-intensive. Generating synthetic datasets using Large Language Models (LLMs) has recently gained significant attention in various applications. In information retrieval, while previous work exploited the capabilities of LLMs to generate synthetic queries or documents to augment training data and improve the performance of ranking models, using LLMs for constructing synthetic test collections is relatively unexplored. Previous studies demonstrate that LLMs have the potential to generate synthetic relevance judgments for use in the evaluation of information retrieval systems. In this paper, we comprehensively investigate whether it is possible to use LLMs to construct fully synthetic test collections by generating not only synthetic judgments but also synthetic queries. To qualify the efficacy of synthetic queries for examining system ordering, we analyze how these synthetic data are suitable for building a reliable and reusable test collection and the potential risks of bias such test collections may exhibit towards LLM-based models. Our comprehensive experiments indicate that test collections generated using LLMs can effectively and reliably evaluate system performance.

The Surprising Effectiveness of Rankers trained on Expanded Queries

Abhijit Anand
Venktesh V
Vinay Setty
Avishek Anand

A significant challenge in text-ranking systems is handling hard queries that form the tail end of the query distribution. Difficulty may arise due to the presence of uncommon, underspecified, or incomplete queries. In this work, we improve the ranking performance of hard or difficult queries while maintaining the performance of other queries. First, we perform LLM-based query enrichment for training queries using relevant documents. Next, a specialized ranker is fine-tuned only on the enriched hard queries instead of the original queries. We combine the relevance scores from the specialized ranker and the base ranker, along with a query performance score estimated for each query. Our approach departs from existing methods that usually employ a single ranker for all queries, which is biased towards the easy queries that form the majority of the query distribution. In our extensive experiments on the DL-Hard dataset, we find that a principled query-performance-based scoring method using the base and specialized rankers offers a significant improvement of up to 48.4% on the document ranking task and up to 25% on the passage ranking task compared to the baseline performance of using original queries, even outperforming the SOTA model.

Timeline Summarization in the Era of LLMs

Daivik Sojitra
Raghav Jain
Sriparna Saha
Adam Jatowt
Manish Gupta

Timeline summarization is the task of automatically generating concise overviews of documents that capture the key events and their progression on timelines. While this capability is useful for quickly comprehending event sequences without reading lengthy descriptions, timeline summarization has remained relatively underexplored in recent years compared to traditional document summarization tasks and their evolution. The advent of large language models (LLMs) has led some to presume that summarization is a solved problem. However, timeline summarization poses unique challenges for LLMs. Our investigation is centered on evaluating the performance of LLMs against state-of-the-art models in this field. We employed three different approaches: chunking, knowledge graph-based summarization, and TimeRanker. Each of these methods was systematically tested on three benchmark datasets for timeline summarization to assess their effectiveness in capturing and condensing key events and their evolution within timelines. Our findings reveal that while LLMs show promise, timeline summarization remains a complex task that is not yet fully resolved.

TouchUp-G: Improving Feature Representation through Graph-Centric Finetuning

Jing Zhu
Xiang Song
Vassilis Ioannidis
Danai Koutra
Christos Faloutsos

How can we enhance the node features acquired from Pretrained Models (PMs) to better suit downstream graph learning tasks? Graph Neural Networks (GNNs) have become the state-of-the-art approach for many high-impact, real-world graph applications. For feature-rich graphs, a prevalent practice involves directly utilizing a PM to generate features. Nevertheless, this practice is suboptimal as the node features extracted from PMs are graph-agnostic and prevent GNNs from fully utilizing the potential correlations between the graph structure and node features, leading to a decline in GNN performance. In this work, we seek to improve the node features obtained from a PM for graph tasks and introduce TouchUp-G, a "Detect & Correct" approach for refining node features extracted from PMs. TouchUp-G detects the alignment using a novel feature homophily metric and corrects the misalignment through a simple touchup on the PM. It is (a) General: applicable to any downstream graph task; (b) Multi-modal: able to improve raw features of any modality; (c) Principled: it is closely related to a novel metric, feature homophily, which we propose to quantify the alignment between the graph structure and node features; (d) Effective: achieving state-of-the-art results on four real-world datasets spanning different tasks and modalities.

Towards Ethical Item Ranking: A Paradigm Shift from User-Centric to Item-Centric Approaches

Guilherme Ramos
Mirko Marras
Ludovico Boratto

Ranking systems are instrumental in shaping user experiences by determining the relevance and order of presented items. However, current approaches, particularly those revolving around user-centric reputation scoring, raise ethical concerns associated with scoring individuals. To counter such issues, in this paper, we introduce a novel item ranking system approach that strategically transitions its emphasis from scoring users to calculating item rankings relying exclusively on items' ratings information, to achieve the same objective. Experiments on three datasets show that our approach achieves higher effectiveness and efficiency than state-of-the-art baselines. Furthermore, the resulting rankings are more robust to spam and resistant to bribery, contributing to a novel and ethically sound direction for item ranking systems.

Turbo-CF: Matrix Decomposition-Free Graph Filtering for Fast Recommendation

Jin-Duk Park
Yong-Min Shin
Won-Yong Shin

A series of graph filtering (GF)-based collaborative filtering (CF) methods showcase state-of-the-art recommendation accuracy by using a low-pass filter (LPF) without a training process. However, conventional GF-based CF approaches mostly perform matrix decomposition on the item-item similarity graph to realize the ideal LPF, which results in a non-trivial computational cost and thus makes them less practical in scenarios where rapid recommendations are essential. In this paper, we propose Turbo-CF, a GF-based CF method that is both training-free and matrix decomposition-free. Turbo-CF employs a polynomial graph filter to circumvent the issue of expensive matrix decompositions, enabling us to make full use of modern computer hardware components (i.e., GPUs). Specifically, Turbo-CF first constructs an item-item similarity graph whose edge weights are effectively regulated. Then, our own polynomial LPFs are designed to retain only low-frequency signals without explicit matrix decompositions. We demonstrate that Turbo-CF is extremely fast yet accurate, achieving a runtime of less than 1 second on real-world benchmark datasets while achieving recommendation accuracies comparable to the best competitors.
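
A toy sketch of decomposition-free polynomial graph filtering is given below. It assumes a binary user-item matrix, a symmetric degree normalization for the item-item graph P, and the illustrative polynomial 2P - P^2; the exact normalization, edge-weight regulation, and filter coefficients of Turbo-CF are not reproduced here, and all function names are hypothetical. The point is that the filter is built from matrix products alone, with no eigendecomposition or SVD.

```python
import math

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def item_similarity(r):
    """Item-item graph P = Rn^T Rn, with Rn = D_u^{-1/2} R D_i^{-1/2} (assumed normalization)."""
    user_deg = [sum(row) for row in r]
    item_deg = [sum(col) for col in zip(*r)]
    r_norm = [
        [v / math.sqrt(user_deg[u] * item_deg[i]) if v else 0.0
         for i, v in enumerate(row)]
        for u, row in enumerate(r)
    ]
    return matmul(transpose(r_norm), r_norm)

def polynomial_lpf(p, coeffs):
    """Return coeffs[0]*P + coeffs[1]*P^2 + ..., using matrix products only."""
    n = len(p)
    out = [[0.0] * n for _ in range(n)]
    power = p
    for c in coeffs:
        out = [[o + c * q for o, q in zip(o_row, p_row)]
               for o_row, p_row in zip(out, power)]
        power = matmul(power, p)
    return out

def recommend_scores(r, coeffs=(2.0, -1.0)):
    """Score every item for every user by filtering the interaction signal."""
    return matmul(r, polynomial_lpf(item_similarity(r), coeffs))

# Two users, three items; user 0 has not interacted with item 2.
interactions = [[1, 1, 0], [0, 1, 1]]
scores = recommend_scores(interactions)
```

On a GPU the same computation reduces to a handful of dense matrix multiplications, which is why avoiding the decomposition pays off in runtime.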

Unbiased Validation of Technology-Assisted Review for eDiscovery

Gordon V Cormack
Maura R Grossman
Andrew Harbison
Tom O'Halloran
Bronagh McManus

Although it is well established that recall estimates are valid only when based on independent relevance assessments, and useful only to compare the relative effectiveness of competing methods, these conditions are seldom met when validating eDiscovery efforts in litigation. We present two unbiased validation strategies that embed blind relevance assessments into a technology-assisted review (TAR) process, so as to compare its recall to that which would have been achieved by exhaustive manual review. We illustrate the use of these strategies within the context of TAR occasioned by litigation over accounting practices preceding the collapse of a major insurance company.

Unifying Graph Retrieval and Prompt Tuning for Graph-Grounded Text Classification

Le Dai
Yu Yin
Enhong Chen
Hui Xiong

Text classification has long been researched as a fundamental problem in information retrieval. Since text data are frequently connected with graph structures, this connection opens new possibilities for more accurate and explainable classification. One common approach to this graph-text integration is to consider text as graph attributes and utilize GNNs to conduct a node classification task. While both text and graph data are modeled, GNNs treat text in a rather coarse-grained way, have limitations in preserving the detailed structures of a graph, and are less robust to graph sparsity. In this paper, we propose to take an alternative perspective instead, viewing the graph as the context of texts, as inspired by retrieval augmented generation. We propose a novel framework called Graph Retrieval Prompt Tuning (GRPT), consisting of a Graph Retrieval Module and a Prompt Tuning Module integrated with graph context. For graph retrieval, two retrieval strategies are designed to retrieve node context and path context, preserving both node proximity and detailed connectivity patterns. Extensive experiments on four real-world datasets show the effectiveness of our framework in both standard supervised and sparse settings.

USimAgent: Large Language Models for Simulating Search Users

Erhan Zhang
Xingzhu Wang
Peiyuan Gong
Yankai Lin
Jiaxin Mao

Due to its advantages in cost-efficiency and reproducibility, user simulation has become a promising solution to the user-centric evaluation of information retrieval systems. Nonetheless, accurately simulating user search behaviors has long been a challenge, because users' actions in search are highly complex and driven by intricate cognitive processes such as learning, reasoning, and planning. Recently, Large Language Models (LLMs) have demonstrated remarkable potential in simulating human-level intelligence and have been used in building autonomous agents for various tasks. However, the potential of using LLMs in simulating search behaviors has not yet been fully explored. In this paper, we introduce an LLM-based user search behavior simulator, USimAgent. The proposed simulator can simulate users' querying, clicking, and stopping behaviors during search, and thus is capable of generating complete search sessions for specific search tasks. Empirical investigation on a real user behavior dataset shows that the proposed simulator outperforms existing methods in query generation and is comparable to traditional methods in predicting user clicks and stopping behaviors. These results not only validate the effectiveness of using LLMs for user simulation but also shed light on the development of more robust and generic user simulators.

Using Large Language Models for Math Information Retrieval

Behrooz Mansouri
Reihaneh Maarefdoust

Large language models, such as Orca-2, have demonstrated notable problem-solving abilities in mathematics. However, their potential to enhance math information retrieval remains largely unexplored. This paper investigates the use of two large language models, LLaMA-2 and Orca-2, for three tasks in math information retrieval. First, the study explores the use of these models for relevance assessment, evaluating the relevance of answers to math questions. Then, the application of these models for math data augmentation is studied. Using the existing math information retrieval test collection, ARQMath, answers of different relevance degrees are generated for each topic. These answers are then used for fine-tuning a cross-encoder re-ranker and are compared against fine-tuning with answers that are manually labeled. Finally, the use of these models for ranking candidate answers to math questions is explored. The experimental results indicate that, while these models may not be effective for relevance assessment and ranking tasks, Orca-2 can be a valuable resource for math data augmentation.

Weighted KL-Divergence for Document Ranking Model Refinement

Yingrui Yang
Yifan Qiao
Shanxiu He
Tao Yang

Transformer-based retrieval and reranking models for text document search are often refined through knowledge distillation together with contrastive learning. A tight distribution matching between the teacher and student models can be hard as over-calibration may degrade training effectiveness when a teacher does not perform well. This paper contrastively reweights KL divergence terms to prioritize the alignment between a student and a teacher model for proper separation of positive and negative documents. This paper analyzes and evaluates the proposed loss function on the MS MARCO and BEIR datasets to demonstrate its effectiveness in improving the relevance of tested student models.
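
A reweighted KL term can be sketched as follows. The sketch assumes teacher and student each score the same candidate list for one query, converts both score lists to distributions with a softmax, and multiplies each per-document KL term by a weight. The weighting rule shown (a larger weight on the positive document) is purely hypothetical; the paper's actual contrastive weights are not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_kl(teacher_scores, student_scores, weights):
    """Sum over documents of w_i * p_i * log(p_i / q_i), teacher p and student q."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(w * pi * math.log(pi / qi) for w, pi, qi in zip(weights, p, q))

# Hypothetical scores for one positive document (index 0) and two negatives.
teacher = [2.0, 1.0, 0.0]
student = [0.0, 1.0, 2.0]
plain_kl = weighted_kl(teacher, student, [1.0, 1.0, 1.0])
emphasized = weighted_kl(teacher, student, [3.0, 1.0, 1.0])  # assumed weights
```

With uniform weights this reduces to the standard KL divergence. With non-uniform weights the sum is no longer guaranteed to be non-negative, since individual terms can be negative, so a practical loss would need to account for that.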

What do Users Really Ask Large Language Models? An Initial Log Analysis of Google Bard Interactions in the Wild

Johanne R. Trippas
Sara Fahad Dawood Al Lawati
Joel Mackenzie
Luke Gallagher

Advancements in large language models (LLMs) have changed information retrieval, offering users a more personalised and natural search experience with technologies like OpenAI ChatGPT, Google Bard (Gemini), or Microsoft Copilot. Despite these advancements, research into user tasks and information needs remains scarce. This preliminary work analyses a Google Bard prompt log with 15,023 interactions called the Bard Intelligence and Dialogue Dataset (BIDD), providing an understanding akin to query log analyses. We show that Google Bard prompts are often verbose and structured, encapsulating a broader range of information needs and imperative (e.g., directive) tasks distinct from traditional search queries. We show that LLMs can support users in tasks beyond the three main types based on user intent: informational, navigational, and transactional. Our findings emphasise the versatile application of LLMs across content creation, LLM writing style preferences, and information extraction. We document diverse user interaction styles, showcasing the adaptability of users to LLM capabilities.

SESSION: Session: Demo Papers

A Question-Answering Assistant over Personal Knowledge Graph

Lingyuan Liu
Huifang Du
Xiaolian Zhang
Mengying Guo
Haofen Wang
Meng Wang

We develop a Personal Knowledge Graph Question-Answering (PKGQA) assistant, seamlessly integrating information from multiple mobile applications into a unified and user-friendly query interface to offer users convenient information retrieval and personalized knowledge services. Based on a fine-grained schema customized for PKG, the PKGQA system in this paper comprises Symbolic Semantic Parsing, Frequently Asked Question (FAQ) Semantic Matching, and Neural Semantic Parsing modules, which are designed to take into account both accuracy and efficiency. The PKGQA system achieves high accuracy on the constructed dataset and demonstrates good performance in answering complex questions. Our system is implemented through an Android application, which is shown in https://youtu.be/p732U5KPEq4.

An Integrated Data Processing Framework for Pretraining Foundation Models

Yiding Sun
Feng Wang
Yutao Zhu
Wayne Xin Zhao
Jiaxin Mao

The ability of foundation models relies heavily on large-scale, diverse, and high-quality pretraining data. In order to improve data quality, researchers and practitioners often have to manually curate datasets from different sources and develop a dedicated data cleansing pipeline for each data repository. Lacking a unified data processing framework, this process is repetitive and cumbersome. To mitigate this issue, we propose a data processing framework that integrates a Processing Module, which consists of a series of operators at different granularity levels, and an Analyzing Module, which supports probing and evaluation of the refined data. The proposed framework is easy to use and highly flexible. In this demo paper, we first introduce how to use this framework with some example use cases and then demonstrate its effectiveness in improving the data quality with an automated evaluation with ChatGPT and an end-to-end evaluation in pretraining the GPT-2 model. The code and demonstration video are accessible on GitHub.

CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval

Christian Lülf
Denis Mayr Lima Martins
Marcos Antonio Vaz Salles
Yongluan Zhou
Fabian Gieseke

The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of images and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive and negative examples. Our framework involves training a classification model given the additional user feedback and essentially outputs all positively classified instances of the entire data catalog. By building upon recent techniques, this inference phase, however, is not implemented by scanning the entire data catalog, but by employing efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times.

ConvLogRecaller: Real-Time Conversational Lifelog Recaller

Yuan-Chi Lee
An-Zi Yen
Hen-Hsen Huang
Hsin-Hsi Chen

The popularization of networks fosters the convenience of communication. People can easily share their life experiences and thoughts with relatives and friends via instant messaging software. As time passes, individuals may forget certain details of life events, leading to difficulties in effectively communicating with others. The propensity of individuals to forget or mix up life events highlights the importance of services aimed at retrieving information about past experiences. This paper presents a conversational information recall system, ConvLogRecaller, which proactively supports real-time memory recall assistance during online conversations. Given a conversation of the user with others, ConvLogRecaller suggests a message if the user forgets the details of the life experiences. The services provided by our system can avoid hesitations or memory lapses that might hinder the efficiency of a conversation.

CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models

Peiyuan Gong
Jiamian Li
Jiaxin Mao

Collaborative search supports multiple users working together to accomplish a specific search task. Research has found that designing lightweight collaborative search plugins within instant messaging platforms aligns better with users' collaborative habits. However, due to the complexity of multi-user interaction scenarios, it is challenging to implement a fully functioning lightweight collaborative search system. Therefore, previous studies on lightweight collaborative search had to rely on the Wizard of Oz paradigm. In recent years, large language models (LLMs) have been demonstrated to interact naturally with users and achieve complex information-seeking tasks through LLM-based agents. Hence, to better support the research in collaborative search, in this demo, we propose CoSearchAgent, a lightweight collaborative search agent powered by LLMs. CoSearchAgent is designed as a Slack plugin that can support collaborative search during multi-party conversations on this platform. Equipped with the capacity to understand the queries and context in multi-user conversations and the ability to search the Web for relevant information via APIs, CoSearchAgent can respond to user queries with answers grounded on the relevant search results. It can also ask clarifying questions when the information needs are unclear. The proposed CoSearchAgent is highly flexible and would be useful for supporting further research on collaborative search. The code and demo are accessible at https://github.com/pygongnlp/CoSearchAgent

Detecting and Explaining Emotions in Video Advertisements

Joachim Vanneste
Manisha Verma
Debasis Ganguly

The use of video advertisements is a common marketing strategy in today's digital age. Extensive research is conducted by companies to comprehend the emotions conveyed in video advertisements, as they play a crucial role in crafting memorable commercials. Understanding and explaining these abstract concepts in videos is an unsolved problem. There is a large body of work that tries to predict human emotion or activity from videos; however, this is not sufficient. In this paper, we propose a novel framework for detecting and, most importantly, explaining emotions in video advertisements. Our framework consists of two main stages: emotion detection and explanation generation. We use a deep learning model to detect the underlying emotions of a video advertisement and generate visual explanations to give insight into our model's predictions. We demonstrate our system on a dataset of video advertisements and show that our framework can accurately detect and explain emotions in video advertisements. Our results suggest that our novel algorithm has the potential to explain decisions from any video classification model.

Embark on DenseQuest: A System for Selecting the Best Dense Retriever for a Custom Collection

Ekaterina Khramtsova
Teerapong Leelanupab
Shengyao Zhuang
Mahsa Baktashmotlagh
Guido Zuccon

In this demo we present a web-based application for selecting an effective pre-trained dense retriever to use on a private collection. Our system, DenseQuest, provides unsupervised selection and ranking capabilities to predict the best dense retriever among a pool of available dense retrievers, tailored to an uploaded target collection. DenseQuest implements a number of existing approaches, including a recent, highly effective method powered by Large Language Models (LLMs), which requires neither queries nor relevance judgments. The system is designed to be intuitive and easy to use for those information retrieval engineers and researchers who need to identify a general-purpose dense retrieval model to encode or search a new private target collection. Our demonstration illustrates conceptual architecture and the different use case scenarios of the system implemented on the cloud, enabling universal access and use. DenseQuest is available at https://densequest.ielab.io.

FactCheck Editor: Multilingual Text Editor with End-to-End fact-checking

Vinay Setty

We introduce 'FactCheck Editor', an advanced text editor designed to automate fact-checking and correct factual inaccuracies. Given the widespread issue of misinformation, often a result of unintentional mistakes by content creators, our tool aims to address this challenge. It supports over 90 languages and utilizes transformer models to assist humans in the labor-intensive process of fact verification. This demonstration showcases a complete workflow that detects text claims in need of verification, generates relevant search engine queries, and retrieves appropriate documents from the web. It employs Natural Language Inference (NLI) to predict the veracity of claims and uses LLMs to summarize the evidence and suggest textual revisions to correct any errors in the text. Additionally, the effectiveness of models used in claim detection and veracity assessment is evaluated across multiple languages.

Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation

Zhongliang Zhou
Jielu Zhang
Zihan Guan
Mengxuan Hu
Ni Lao
Lan Mu
Sheng Li
Gengchen Mai

Geolocating precise locations from images presents a challenging problem in computer vision and information retrieval. Traditional methods typically employ either classification (dividing the Earth's surface into grid cells and classifying images accordingly) or retrieval (identifying locations by matching images with a database of image-location pairs). However, classification-based approaches are limited by the cell size and cannot yield precise predictions, while retrieval-based systems usually suffer from poor search quality and inadequate coverage of the global landscape at varied scale and aggregation levels. To overcome these drawbacks, we present Img2Loc, a novel system that redefines image geolocalization as a text generation task. This is achieved using cutting-edge large multi-modality models (LMMs) like GPT-4V or LLaVA with retrieval augmented generation. Img2Loc first employs CLIP-based representations to generate an image-based coordinate query database. It then uniquely combines query results with the image itself, forming elaborate prompts customized for LMMs. When tested on benchmark datasets such as Im2GPS3k and YFCC4k, Img2Loc not only surpasses the performance of previous state-of-the-art models but does so without any model training. A video demonstration of the system can be accessed via this link: https://drive.google.com/file/d/16A6A-mc7AyUoKHRH3_WBRToRC13sn7tU/view?usp=sharing

JPEC: A Novel Graph Neural Network for Competitor Retrieval in Financial Knowledge Graphs

Wanying Ding
Manoj Cherukumalli
Santosh Chikoti
Vinay K. Chaudhri

Knowledge graphs have gained popularity for their ability to organize and analyze complex data effectively. When combined with graph embedding techniques, such as graph neural networks (GNNs), knowledge graphs become a potent tool in providing valuable insights. This study explores the application of graph embedding in identifying competitors from a financial knowledge graph. Existing state-of-the-art (SOTA) models face challenges due to the unique attributes of our knowledge graph, including directed and undirected relationships, attributed nodes, and minimal annotated competitor connections. To address these challenges, we propose a novel graph embedding model, JPEC (JPMorgan Proximity Embedding for Competitor Detection), which utilizes a graph neural network to learn from both first-order and second-order node proximity together with vital features for competitor retrieval. JPEC outperformed most existing models in extensive experiments, showcasing its effectiveness in competitor retrieval.

MACRec: A Multi-Agent Collaboration Framework for Recommendation

Zhefan Wang
Yuanqing Yu
Wendi Zheng
Weizhi Ma
Min Zhang

LLM-based agents have gained considerable attention for their decision-making skills and ability to handle complex tasks. Recognizing the current gap in leveraging agent capabilities for multi-agent collaboration in recommendation systems, we introduce MACRec, a novel framework designed to enhance recommendation systems through multi-agent collaboration. Unlike existing work on using agents for user/item simulation, we aim to deploy multi-agents to tackle recommendation tasks directly. In our framework, recommendation tasks are addressed through the collaborative efforts of various specialized agents, including Manager, User/Item Analyst, Reflector, Searcher, and Task Interpreter, with different working flows. Furthermore, we provide application examples of how developers can easily use MACRec on various recommendation tasks, including rating prediction, sequential recommendation, conversational recommendation, and explanation generation of recommendation results. The framework and demonstration video are publicly available at https://github.com/wzf2000/MACRec.

MeMemo: On-device Retrieval Augmentation for Private and Personalized Text Generation

Zijie J. Wang
Duen Horng Chau

Retrieval-augmented text generation (RAG) addresses the common limitations of large language models (LLMs), such as hallucination, by retrieving information from an updatable external knowledge base. However, existing approaches often require dedicated backend servers for data storage and retrieval, thereby limiting their applicability in use cases that require strict data privacy, such as personal finance, education, and medicine. To address the pressing need for client-side dense retrieval, we introduce MeMemo, the first open-source JavaScript toolkit that adapts the state-of-the-art approximate nearest neighbor search technique HNSW to browser environments. Developed with modern and native Web technologies, such as IndexedDB and Web Workers, our toolkit leverages client-side hardware capabilities to enable researchers and developers to efficiently search through millions of high-dimensional vectors in the browser. MeMemo enables exciting new design and research opportunities, such as private and personalized content creation and interactive prototyping, as demonstrated in our example application RAG Playground. Reflecting on our work, we discuss the opportunities and challenges for on-device dense retrieval. MeMemo is available at https://github.com/poloclub/mememo.

ModelGalaxy: A Versatile Model Retrieval Platform

Wenling Zhang
Yixiao Li
Zhaotian Li
Hailong Sun
Xiang Gao
Xudong Liu

With the growing number of available machine learning models and the emergence of model-sharing platforms, model reuse has become a significant approach to harnessing the power of artificial intelligence. One of the key issues in realizing model reuse is efficiently and accurately finding the target models that meet user needs in a model repository. However, the existing popular model-sharing platforms (e.g., Hugging Face) mainly support model retrieval based on model name matching and task filtering. Users who are not familiar with the platform or with specific models may suffer from low retrieval efficiency and a less user-friendly interaction experience. To address these issues, we have developed ModelGalaxy, a versatile model retrieval platform supporting multiple model retrieval methods, including keyword-based search, dataset-based search, and user-task-centric search. Moreover, ModelGalaxy leverages the power of large language models to help users easily retrieve and use models. Our source code is available at https://github.com/zwl906711886/ModelGalaxy.

RAG-Ex: A Generic Framework for Explaining Retrieval Augmented Generation

Viju Sudhi
Sinchana Ramakanth Bhat
Max Rudat
Roman Teucher

Owing to their size and complexity, large language models (LLMs) hardly explain why they generate a response. This effectively reduces the trust and confidence of end users in LLM-based applications, including Retrieval Augmented Generation (RAG) for Question Answering (QA) tasks. In this work, we introduce RAG-Ex, a model- and language-agnostic explanation framework that presents approximate explanations to the users revealing why the LLMs possibly generated a piece of text as a response, given the user input. Our framework is compatible with both open-source and proprietary LLMs. We report the significance scores of the approximated explanations from our generic explainer in both English and German QA tasks and also study their correlation with the downstream performance of LLMs. In the extensive user studies, our explainer yields an F1-score of 76.9% against the end user annotations and attains almost on-par performance with model-intrinsic approaches.

ResumeFlow: An LLM-facilitated Pipeline for Personalized Resume Generation and Refinement

Saurabh Bhausaheb Zinjad
Amrita Bhattacharjee
Amey Bhilegaonkar
Huan Liu

Crafting the ideal, job-specific resume is a challenging task for many job applicants, especially for early-career applicants. While it is highly recommended that applicants tailor their resume to the specific role they are applying for, manually tailoring resumes to job descriptions and role-specific requirements is often (1) extremely time-consuming, and (2) prone to human errors. Furthermore, performing such a tailoring step at scale while applying to several roles may reduce the quality of the edited resumes. To tackle this problem, in this demo paper, we propose ResumeFlow: a Large Language Model (LLM) aided tool that enables an end user to simply provide their detailed resume and the desired job posting, and obtain a personalized resume tailored to that specific job posting in a matter of a few seconds. Our proposed pipeline leverages the language understanding and information extraction capabilities of state-of-the-art LLMs such as OpenAI's GPT-4 and Google's Gemini, in order to (1) extract details from a job description, (2) extract role-specific details from the user-provided resume, and then (3) use these to refine and generate a role-specific resume for the user. Our easy-to-use tool leverages the user-chosen LLM in a completely off-the-shelf manner, thus requiring no fine-tuning. We demonstrate the effectiveness of our tool via a video demonstration (https://www.youtube.com/watch?v=Agl7ugyu1N4) and propose novel task-specific evaluation metrics to control for alignment and hallucination. Our tool is available at https://job-aligned-resume.streamlit.app.

Retrieval-Augmented Conversational Recommendation with Prompt-based Semi-Structured Natural Language State Tracking

Sara Kemper
Justin Cui
Kai Dicarlantonio
Kathy Lin
Danjie Tang
Anton Korikov
Scott Sanner

Conversational recommendation (ConvRec) systems must understand rich and diverse natural language (NL) expressions of user preferences and intents, often communicated in an indirect manner (e.g., "I'm watching my weight"). Such complex utterances make retrieving relevant items challenging, especially if only using often incomplete or out-of-date metadata. Fortunately, many domains feature rich item reviews that cover standard metadata categories and offer complex opinions that might match a user's interests (e.g., "classy joint for a date"). However, only recently have large language models (LLMs) let us unlock the commonsense connections between user preference utterances and complex language in user-generated reviews. Further, LLMs enable novel paradigms for semi-structured dialogue state tracking, complex intent and preference understanding, and generating recommendations, explanations, and question answers. We thus introduce a novel technology RA-Rec, a Retrieval-Augmented, LLM-driven dialogue state tracking system for ConvRec, showcased with a video, open source GitHub repository, and interactive Google Colab notebook.

ScholarNodes: Applying Content-based Filtering to Recommend Interdisciplinary Communities within Scholarly Social Networks

Md Asaduzzaman Noor
Jason A. Clark
John W. Sheppard

Detecting communities within dynamic academic social networks and connecting these community detection findings to search and retrieval interfaces presents a multifaceted challenge. We explore an information retrieval method that integrates both partition-based and similarity-based network analysis to identify and recommend communities within content-based datasets. Our prototype "ScholarNodes" web interface bridges the gap between community detection algorithms (Louvain, K-means, Spectral clustering) and the BM25 (Best Matching 25) ranking algorithm within a cohesive user interface. From free-text keyword queries, ScholarNodes recommends collaborations, identifies local and external researcher networks, and visualizes an interdisciplinarity graph for individual researchers using the OpenAlex dataset, a global collection of academic papers and authors. Beyond the specific information retrieval use case, we discuss the broader applicability of the methods to generic social network analysis, community detection, and recommender systems. Additionally, we delve into the technical aspects of generating topical terms, community alignment techniques, and interface design considerations for integrating community detection algorithms into a search experience.
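As a rough illustration of the BM25 ranking step mentioned above, the following is a minimal sketch of Okapi BM25 scoring over tokenized documents. It is not the ScholarNodes implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25).

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from OpenAlex.
docs = [
    "graph community detection in social networks".split(),
    "ranking documents with probabilistic retrieval models".split(),
    "community detection with spectral clustering".split(),
]
scores = bm25_scores(["community", "detection"], docs)
```

In this toy corpus the second document shares no query terms and scores zero, while the shorter of the two matching documents scores higher because of length normalization.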

Shadowfax: Harnessing Textual Knowledge Base Population

Maxime Prieur
Cédric Du Mouza
Guillaume Gadek
Bruno Grilheres

Knowledge base population (KBP) from texts involves the extraction and organization of information from unstructured textual data to enhance or create a structured knowledge base. This process is crucial for various applications, such as natural language understanding, question-answering systems, and knowledge-driven decision-making. However, the difficulty lies in the complexity of natural language, which is nuanced, ambiguous, and context-dependent. Extracting accurate and reliable information requires overcoming challenges such as entity disambiguation and relation extraction, which are time-consuming tasks for users. Shadowfax is an interactive platform designed to support users by streamlining the process of knowledge base population (KBP) from text documents. Unlike other existing tools, it relies on a unified machine learning model to extract relevant information from unstructured text, enabling operational agents to gain a quick overview. The proposed system supports a variety of natural language processing (NLP) tasks using a single architecture, while presenting information in the most comprehensive way possible to the end user.

SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks

Michael Shliselberg
Ashkan Kazemi
Scott A. Hale
Shiri Dori-Hacohen

Diaspora communities are disproportionately impacted by off-the-radar misinformation and often neglected by mainstream fact-checking efforts, creating a critical need to scale-up efforts of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation to leverage the capabilities of the largest frontier Large Language Models (LLMs) to train local, specialized language models. To the best of our knowledge, SynDy is the first paper utilizing LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy utilizes LLMs and social media queries to automatically generate distantly-supervised, topically-focused datasets with synthetic labels on these three tasks, providing essential tools to scale up human-led fact-checking at a fraction of the cost of human-annotated data. Training on SynDy's generated labels shows improvement over a standard baseline and is not significantly worse compared to training on human labels (which may be infeasible to acquire). SynDy is being integrated into Meedan's chatbot tiplines that are used by over 50 organizations, serve over 230K users annually, and automatically distribute human-written fact-checks via messaging apps such as WhatsApp. SynDy will also be integrated into our deployed Co·Insights toolkit, enabling low-resource organizations to launch tiplines for their communities. Finally, we envision SynDy enabling additional fact-checking tools such as matching new misinformation claims to high-quality explainers on common misinformation topics.

TextData: Save What You Know and Find What You Don't

Kevin Ros
Kedar Takwane
Ashwin Patil
Rakshana Jayaprakash
ChengXiang Zhai

In this demonstration, we present TextData, a novel online system that enables users to both "save what they know" and "find what they don't". TextData was developed based on the Community Digital Library (CDL) system. Although the CDL allowed users to bookmark webpages with plain text and provided search and recommendation, it fell short in key features. To better help users save what they know, TextData offers the addition of markdown to submissions for providing a richer method of note-taking. To better help users find what they don't, TextData provides methods for visualizing the relationships among submissions and provides in-context interactive search intent prediction with question-answering via a generative large language model. TextData is free-to-use, can be accessed online, and the source code is publicly available.

Towards Robust QA Evaluation via Open LLMs

Ehsan Kamalloo
Shivani Upadhyay
Jimmy Lin

Instruction-tuned large language models (LLMs) have been shown to be viable surrogates for the widely used, albeit overly rigid, lexical matching metrics in evaluating question answering (QA) models. However, these LLM-based evaluation methods are invariably based on proprietary LLMs. Despite their remarkable capabilities, proprietary LLMs are costly and subject to internal changes that can affect their output, which inhibits the reproducibility of their results and limits the widespread adoption of LLM-based evaluation. In this demo, we aim to use publicly available LLMs for standardizing LLM-based QA evaluation. However, open-source LLMs lag behind their proprietary counterparts. We overcome this gap by adopting chain-of-thought prompting with self-consistency to build a reliable evaluation framework. We demonstrate that our evaluation framework, based on 750M and 7B open LLMs, correlates competitively with human judgment, compared to the most recent GPT-3 and GPT-4 models. Our codebase and data are available at https://github.com/castorini/qa-eval.
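The self-consistency idea mentioned above can be sketched as a majority vote over several independently sampled judgments. This is only an illustrative sketch, not the qa-eval codebase: the function `self_consistent_verdict`, the label strings, and the canned sampler standing in for an LLM call are all hypothetical.

```python
from collections import Counter


def self_consistent_verdict(sample_verdict, n_samples=5):
    """Call `sample_verdict()` n_samples times; return the majority label and its vote share."""
    votes = Counter(sample_verdict() for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


# Stand-in for sampled chain-of-thought judgments from an LLM (hypothetical);
# a real sampler would prompt the model and parse the final verdict.
canned = iter(["correct", "incorrect", "correct", "correct", "incorrect"])
label, agreement = self_consistent_verdict(lambda: next(canned), n_samples=5)
```

The vote share gives a crude agreement signal: a low value suggests the sampled reasoning paths disagree and the verdict deserves less trust.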

Truth-O-Meter: Handling Multiple Inconsistent Sources Repairing LLM Hallucinations

Boris Galitsky
Anton Chernyavskiy
Dmitry Ilvovsky

Large Language Models (LLMs) often produce text with incorrect facts and hallucinations. To address this issue, we developed a fact-checking system, Truth-O-Meter, which verifies LLM results on the Internet and other sources of information to detect wrong claims/facts and proposes corrections for them. NLP and reasoning techniques such as Abstract Meaning Representation and syntactic alignment are applied to match hallucinating sentences with truthful ones. To handle inconsistent sources while fact-checking, we rely on argumentation analysis in the form of defeasible logic programming, selecting the most authoritative source. Our evaluation shows that LLM content can be substantially improved for factual correctness and meaningfulness on an industrial scale.

unKR: A Python Library for Uncertain Knowledge Graph Reasoning by Representation Learning

Jingting Wang
Tianxing Wu
Shilin Chen
Yunchang Liu
Shutong Zhu
Wei Li
Jingyi Xu
Guilin Qi

Recently, uncertain knowledge graphs (UKGs), where each relation between entities is associated with a confidence score, have gained much attention. Compared with traditional knowledge graphs, UKGs can express uncertain knowledge, which facilitates more reliable and precise knowledge graph reasoning by not only completing missing triples but also predicting triple confidences. In this paper, we release unKR, the first open-source Python library for uncertain Knowledge graph (UKG) Reasoning by representation learning. We design a unified framework to implement two types of representation learning models for UKG reasoning, i.e., normal and few-shot ones. Besides, we standardize the evaluation tasks and metrics for UKG reasoning to ensure fair comparisons, and report the detailed results of each model under a consistent test setting. With unKR, it is effortless for users to reproduce existing models, as well as efficiently customize their own models. The library, documentation, demo, and reimplementation results are all publicly released at https://github.com/seucoin/unKR.

SESSION: Session: SIRIP: LLMs 1

"Ask Me Anything": How Comcast Uses LLMs to Assist Agents in Real Time

Scott Rome
Tianwen Chen
Raphael Tang
Luwei Zhou
Ferhan Ture

Customer service is how companies interface with their customers. It can contribute heavily towards the overall customer satisfaction. However, high-quality service can become expensive, creating an incentive to make it as cost efficient as possible and prompting most companies to utilize AI-powered assistants, or "chat bots". On the other hand, human-to-human interaction is still desired by customers, especially when it comes to complex scenarios such as disputes and sensitive topics like bill payment.

This raises the bar for customer service agents. They need to accurately understand the customer's question or concern, identify a solution that is acceptable yet feasible (and within the company's policy), all while handling multiple conversations at once.

In this work, we introduce "Ask Me Anything" (AMA) as an add-on feature to an agent-facing customer service interface. AMA allows agents to ask questions to a large language model (LLM) on demand, as they are handling customer conversations. The LLM provides accurate responses in real time, reducing the amount of context switching the agent needs. In our internal experiments, we find that agents using AMA versus a traditional search experience spend approximately 10% fewer seconds per conversation containing a search, translating to millions of dollars of savings annually. Agents that used the AMA feature provided positive feedback nearly 80% of the time, demonstrating its usefulness as an AI-assisted feature for customer care.

A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Tempest A. van Schaik
Brittany Pugh

Large language models (LLMs) are rapidly being adopted for tasks such as text summarization in a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and there are widely different expectations about what kind of information evaluation will produce. Evaluation methods that were developed for traditional Natural Language Processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging evaluation methods that use LLMs to evaluate LLM output appear to be powerful but lack reliability. New elements of LLM-generated text that were not present in previous NLP tasks, such as the artifacts of hallucination, need to be considered. We outline the different types of LLM evaluation currently used in the literature but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods to avoid common pitfalls. Despite having promising strategies for evaluating LLM summaries, we highlight some open challenges that remain.

Synthetic Query Generation using Large Language Models for Virtual Assistants

Sonal Sannigrahi
Thiago Fraga-Silva
Youssef Oualil
Christophe Van Gysel

Virtual Assistants (VAs) are important Information Retrieval platforms that help users accomplish various tasks through spoken commands. The speech recognition system (speech-to-text) uses query priors, trained solely on text, to distinguish between phonetically confusing alternatives. Hence, the generation of synthetic queries that are similar to existing VA usage can greatly improve upon the VA's abilities, especially for use-cases that do not (yet) occur in paired audio/text data. In this paper, we provide a preliminary exploration of the use of Large Language Models (LLMs) to generate synthetic queries that are complementary to template-based methods. We investigate whether the methods (a) generate queries that are similar to randomly sampled, representative, and anonymized user queries from a popular VA, and (b) whether the generated queries are specific. We find that LLMs generate more verbose queries, compared to template-based methods, and reference aspects specific to the entity. However, the generated queries are similar to VA user queries, and are specific enough to retrieve the relevant entity. We conclude that queries generated by LLMs and templates are complementary.

Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models

Vinay Setty

In this paper, we explore the challenges associated with establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provides superior performance over large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel in generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and on complex claims that include numerical quantities.

LLMGR: Large Language Model-based Generative Retrieval in Alipay Search

Chen Wei
Yixin Ji
Zeyuan Chen
Jia Xu
Zhongyi Liu

A search system, which includes retrieval and ranking modules, aims to help users quickly find items according to the queries they enter. Traditional retrieval is a multi-stage process, including indexing and sorting, which cannot be optimized end-to-end. With the real data about mini-apps in the Alipay search, we find that many complex queries fail to display the relevant mini-apps, seriously threatening users' search experience. To address the challenges, we propose a Large Language Model-based Generative Retrieval (LLMGR) approach for retrieving mini-app candidates. The information of the mini-apps is encoded into the large model, and the title of the mini-app is directly generated. Through the online A/B test in Alipay search, LLMGR as a supplementary source has statistically significant improvements in the Click-Through Rate (CTR) of the search system compared to traditional methods. In this paper, we have deployed a novel retrieval method for the Alipay search system and demonstrated that generative retrieval methods based on LLMs can improve the performance of the search system, particularly for complex queries, which have an average increase of 0.2% in CTR.

SESSION: Session: SIRIP: Domain-Specific 1

Misinformation Mitigation Praxis: Lessons Learned and Future Directions from Co·Insights

Scott A. Hale
Kiran Garimella
Shiri Dori-Hacohen

Misinformation is a global challenge, but successful mitigations must come from the communities affected and not be imposed by external entities. Co·Insights is a multi-year NSF-funded initiative building capacity to respond to misinformation and harmful narratives in Asian American and Pacific Islander (AAPI) communities, based on a deep involvement with grassroots organizations and a co-construction of tools grounded on community needs. Co·Insights' unique cross-sectoral, cross-disciplinary collaboration is a convergence of information retrieval, computational social science, and ethnographic inquiry with a unique platform that enables community organizations, fact-checkers, and academics to work together to respond effectively to harmful content targeting communities. In this SIRIP talk designed for a technical audience, we will share lessons learned from the first 2.5 years of Co·Insights, and how we are bridging the academic-praxis divide by integrating state-of-the-art research developments into large-scale deployed systems. Topics covered will include: the challenges and joys of collaborating across disciplinary and sectoral boundaries; community-driven approaches that utilize information retrieval techniques such as claim matching in concert with emerging best practices in misinformation mitigation; open problems we have encountered in the space; and future directions we find promising. Co·Insights is led by Meedan, a global non-profit providing award-winning software solutions to mitigate misinformation, in partnership with AAPI community organizations and several academic institutions.

Enhancing Baidu Multimodal Advertisement with Chinese Text-to-Image Generation via Bilingual Alignment and Caption Synthesis

Kang Zhao
Xinyu Zhao
Zhipeng Jin
Yi Yang
Wen Tao
Cong Han
Shuanglong Li
Lin Liu

Recent advances in generative artificial intelligence have revolutionized information retrieval and content generation, opening up new opportunities for the e-commerce industry. In particular, text-to-image generation models offer a novel approach to guiding the image generation process using natural language input, which is inspiring for multimodal search advertising. Traditional multimodal search ads require advertisers to prepare ad creatives, such as ad images, which is time-consuming and requires uniform image specifications and content quality inspection. To this end, we propose a streamlined generation framework for search ad image creatives. First, we prepare a Chinese image caption model with high-quality image-caption pairs to bootstrap training data refinement. With curated high-quality images and synthesized descriptive captions, we then train a Chinese text-to-image generation model, the largest to date, using SDXL and a 10-billion multimodal text encoder. Specifically, we introduce a two-stage bilingual multimodal representation alignment process to seamlessly integrate the text encoder with the generation model. Extensive experiments validate the effectiveness of our framework, including assessments of image captioning and image generation. The implementation of our framework in Baidu Search Ads shows a significant revenue increase. For example, beauty industry ads with generated image creatives achieve a 29% higher click-through rate (CTR).

Relevance Feedback Method For Patent Searching Using Vector Subspaces

Sebastian Björkqvist

Searching for novelty-destroying prior art is an important part of patent application drafting and invalidation. The task is challenging due to the detailed information needed to determine whether a document is novelty-destroying or simply closely related, resulting in the original search results not always being fully on target. Allowing the user to provide feedback on the relevance of the initial search results and iterating on the search may thus improve the results significantly. We present a relevance feedback method based on computing the affine vector subspace spanned by the relevant document vectors. The method can be used with any dense retrieval system, and we demonstrate its effectiveness in improving recall in prior art searches. We compare the subspace-based method to the Rocchio algorithm and show that the method is less sensitive to changes in hyperparameters when the number of relevant documents increases.
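To make the geometric idea above concrete, here is a minimal sketch that builds the affine subspace through a set of relevant document vectors and measures how far a candidate vector lies from it. The abstract does not state how candidates are scored, so ranking by distance to the subspace is an assumption made for illustration, as are the helper names and the toy 3-dimensional vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt; drops vectors that are numerically linearly dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            c = dot(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def distance_to_affine_subspace(x, relevant):
    """Distance from x to the affine subspace through all vectors in `relevant`."""
    origin = relevant[0]
    # Directions of the subspace: differences to the first relevant vector.
    basis = orthonormal_basis([sub(r, origin) for r in relevant[1:]])
    residual = sub(x, origin)
    for e in basis:
        c = dot(residual, e)
        residual = [ri - c * ei for ri, ei in zip(residual, e)]
    return math.sqrt(dot(residual, residual))


# Hypothetical toy embeddings; real dense retrieval vectors are much larger.
relevant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = {"on_line": [0.5, 0.5, 0.0], "off_line": [0.5, 0.5, 2.0]}
ranked = sorted(
    candidates,
    key=lambda name: distance_to_affine_subspace(candidates[name], relevant),
)
```

With two relevant vectors the subspace is the line through them; each additional linearly independent relevant vector adds one dimension.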

A Study on Unsupervised Question and Answer Generation for Legal Information Retrieval and Precedents Understanding

Johny Moreira
Altigran da Silva
Edleno de Moura
Leandro Marinho

Traditional retrieval systems are hardly adequate for Legal Research, mainly because only returning the documents related to a given query is usually insufficient. Legal documents are extensive, and we posit that generating questions about them and detecting the answers provided by these documents helps the Legal Research journey. This paper presents a pipeline that relates Legal Questions with documents answering them. We align features generated by Large Language Models with traditional clustering methods to find convergent and divergent answers to the same legal matter. We performed a case study with 50 legal documents on the Brazilian judiciary system. Our pipeline found convergent and divergent answers to 23 major legal questions regarding the case law for daily fines in Civil Procedural Law. Manual evaluation of the pipeline shows it managed to group diverse similar answers to the same question with an average precision of 0.85. It also managed to detect two divergent legal matters with an average F1 Score of 0.94.

SESSION: Session: SIRIP: E-commerce

Homogeneous-listing-augmented Self-supervised Multimodal Product Title Refinement

Jiaqi Deng
Kaize Shi
Huan Huo
Dingxian Wang
Guandong Xu

Product titles on e-commerce marketplaces often suffer from verbosity and inaccuracy, hindering effective communication of essential product details to customers. Refining titles to be more concise and informative is crucial for better user experience and product promotion. Recent solutions to product title refinement follow the standard text extractive and generative methods. Some also leverage multimodal information, e.g. using product images to supplement original titles with visual knowledge. However, these generative methods often produce additional terms not endorsed by sellers. Thus, it remains challenging to incorporate visual information missing from original titles into refined titles without excessively introducing novel terms. Additionally, most existing methods require human-labeled datasets, which are laborious to construct. In response to the two challenges, we present a self-supervised multimodal framework (HLATR) for title refinement that comprises two key modules: (1) a perturbated sample generator that constructs training data by systematically mining homogeneous listing information and (2) a title refinement network that effectively harnesses visual information to refine the original titles. To explicitly balance the extraction from original titles and the generation of supplementary novel terms, we adapt the copy mechanism that is guided by a focused refinement loss. Extensive experiments demonstrate that our proposed framework consistently outperforms others in generating refined titles that contain essential multimodal semantics with minimal deviation from the original ones.

Optimizing E-commerce Search: Toward a Generalizable and Rank-Consistent Pre-Ranking Model

Enqiang Xu
Yiming Qiu
Junyang Bai
Ping Zhang
Dadong Miao
Songlin Wang
Guoyu Tang
Lin Liu
MingMing Li

In large e-commerce platforms, search systems are typically composed of a series of modules, including recall, pre-ranking, and ranking phases. The pre-ranking phase, serving as a lightweight module, is crucial for filtering out the bulk of products in advance for the downstream ranking module. Industrial efforts on optimizing the pre-ranking model have predominantly focused on enhancing ranking consistency, model structure, and generalization towards long-tail items. Beyond these optimizations, meeting the system performance requirements presents a significant challenge. Contrasting with existing industry works, we propose a novel method: a Generalizable and RAnk-ConsistEnt Pre-Ranking Model (GRACE), which achieves: 1) Ranking consistency by introducing multiple binary classification tasks that predict whether a product is within the top-k results as estimated by the ranking model, which facilitates the addition of learning objectives on common point-wise ranking models; 2) Generalizability through contrastive learning of representation for all products by pre-training on a subset of ranking product embeddings; 3) Ease of implementation in feature construction and online deployment. Our extensive experiments demonstrate significant improvements in both offline metrics and online A/B test: a 0.75% increase in AUC and a 1.28% increase in CVR.
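The ranking-consistency objective described above relies on binary labels that say whether the ranking model places a product within the top k. A minimal sketch of deriving such labels from ranking-model scores follows; the function name `topk_labels`, the cutoff values, and the toy scores are assumptions, and the sketch does not reproduce GRACE's training procedure.

```python
def topk_labels(ranking_scores, ks=(1, 3, 5)):
    """For each product, emit one binary label per cutoff k.

    A label is 1 if the ranking model's score places the product within
    the top k of the candidate list, else 0. The cutoffs are assumed values.
    """
    order = sorted(
        range(len(ranking_scores)),
        key=lambda i: ranking_scores[i],
        reverse=True,
    )
    rank_of = {idx: pos for pos, idx in enumerate(order)}  # 0-based rank
    return [
        [1 if rank_of[i] < k else 0 for k in ks]
        for i in range(len(ranking_scores))
    ]


# Hypothetical ranking-model scores for four candidate products.
scores = [0.2, 0.9, 0.5, 0.7]
labels = topk_labels(scores, ks=(1, 2))
```

Each column of `labels` could then serve as the target of one binary classification head on a point-wise pre-ranking model.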

A Large-scale Offer Alignment Model for Partitioning Filtering and Matching Product Offers

Wenyu Huang
André Melo
Jeff Z. Pan

Offer alignment is a key step in a product knowledge graph construction pipeline. It aims to align retailer offers of the same product for better coverage of product details. With the rapid development of online shopping services, the offer alignment task is applied to ever larger datasets. This work aims to build an offer alignment system that can be used efficiently on large-scale offer data. The key components of this system include: 1) common offer encoders for encoding text offer data into representations; 2) a trainable LSH partitioning module to divide similar offers into small blocks; 3) lightweight sophisticated late-interactions for efficient filtering and scoring of offer alignment candidate pairs. We evaluate the system on the public WDC offer alignment dataset, as well as DBLP-Scholar and DBLP-ACM.

ECAT: A Entire space Continual and Adaptive Transfer Learning Framework for Cross-Domain Recommendation

Chaoqun Hou
Yuanhang Zhou
Yi Cao
Tong Liu

In industrial recommendation systems, there are several mini-apps designed to meet the diverse interests and needs of users. Their sample space is merely a small subset of the entire space, making it challenging to train an efficient model. In recent years, there have been many excellent studies related to cross-domain recommendation aimed at mitigating the problem of data sparsity. However, few of them have simultaneously considered the adaptability of both sample and representation continual transfer setting to the target task. To overcome the above issue, we propose an Entire space Continual and Adaptive Transfer learning framework called ECAT, which includes two core components. First, for sample transfer, we propose a two-stage method that realizes a coarse-to-fine process. Specifically, we perform an initial selection through a graph-guided method, followed by a fine-grained selection using a domain adaptation method. Second, we propose an adaptive knowledge distillation method for continually transferring the representations from a model that is well-trained on the entire space dataset. ECAT enables full utilization of the entire space samples and representations under the supervision of the target task, while avoiding negative migration. Comprehensive experiments on real-world industrial datasets from Taobao show that ECAT advances state-of-the-art performance on offline metrics, and brings +13.6% CVR and +8.6% orders for Baiyibutie, a famous mini-app of Taobao.

A Unified Search and Recommendation Framework Based on Multi-Scenario Learning for Ranking in E-commerce

Jinhan Liu
Qiyu Chen
Junjie Xu
Junjie Li
Baoli Li
Sulong Xu

Search and recommendation (S&R) are the two most important scenarios in e-commerce. The majority of users typically interact with products in S&R scenarios, indicating the need and potential for joint modeling. Traditional multi-scenario models use shared parameters to learn the similarity of multiple tasks, and task-specific parameters to learn the divergence of individual tasks. This coarse-grained modeling approach does not effectively capture the differences between S&R scenarios. Furthermore, this approach does not sufficiently exploit the information across the global label space. These issues can result in the suboptimal performance of multi-scenario models in handling both S&R scenarios. To address these issues, we propose an effective and universal framework for Unified Search and Recommendation (USR), designed with S&R Views User Interest Extractor Layer (IE) and S&R Views Feature Generator Layer (FG) to separately generate user interests and scenario-agnostic feature representations for S&R. Next, we introduce a Global Label Space Multi-Task Layer (GLMT) that uses global labels as supervised signals of auxiliary tasks and jointly models the main task and auxiliary tasks using conditional probability. Extensive experimental evaluations on real-world industrial datasets show that USR can be applied to various multi-scenario models and significantly improve their performance. Online A/B testing also indicates substantial performance gains across multiple metrics. Currently, USR has been successfully deployed in the 7Fresh App.

A Preference-oriented Diversity Model Based on Mutual-information in Re-ranking for E-commerce Search

Huimu Wang
Mingming Li
Dadong Miao
Songlin Wang
Guoyu Tang
Lin Liu
Sulong Xu
Jinghe Hu

Re-ranking is the process of rearranging a ranking list to more effectively meet user demands by accounting for the interrelationships between items. Existing methods predominantly enhance the precision of search results, often at the expense of diversity, leading to outcomes that may not fulfill the varied needs of users. Conversely, methods designed to promote diversity might compromise the precision of the results, failing to satisfy the users' requirements for accuracy. To alleviate the above problems, this paper proposes a Preference-oriented Diversity Model Based on Mutual-information (PODM-MI), which considers both accuracy and diversity in the re-ranking process. Specifically, PODM-MI adopts Multidimensional Gaussian distributions based on variational inference to capture users' diversity preferences with uncertainty. Then we maximize the mutual information between the diversity preferences of the users and the candidate items using the maximum variational inference lower bound to enhance their correlations. Subsequently, we derive a utility matrix based on the correlations, enabling the adaptive ranking of items in line with user preferences and establishing a balance between the aforementioned objectives. Experimental results on real-world online e-commerce systems demonstrate the significant improvements of PODM-MI, and we have successfully deployed PODM-MI on an e-commerce search platform.

SESSION: Session: SIRIP: LLMs 2

Reflections on the Coding Ability of LLMs for Analyzing Market Research Surveys

Shi Zong
Santosh Kolagati
Amit Chaudhary
Josh Seltzer
Jimmy Lin

The remarkable success of large language models (LLMs) has drawn people's great interest in their deployment in specific domains and downstream applications. In this paper, we present the first systematic study of applying large language models (in our case, GPT-3.5 and GPT-4) for the automatic coding (multi-class classification) problem in market research. Our experimental results show that large language models could achieve a macro F1 score of over 0.5 for all our collected real-world market research datasets in a zero-shot setting. We also provide in-depth analyses of the errors made by the large language models. We hope this study sheds light on the lessons we learn and the open challenges large language models have when adapting to a specific market research domain.
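For readers unfamiliar with the metric reported above, the following is a minimal sketch of macro F1 under its standard definition (the unweighted mean of per-class F1 scores). The convention of averaging over every label seen in either the gold or the predicted labels is an assumption; the paper's exact evaluation script is not described here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Hypothetical survey-response codes.
    print(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"]))
```

Because every class contributes equally, rare codes weigh as much as frequent ones, which matters for survey coding schemes with long-tailed label distributions.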

Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering

Zhentao Xu
Mark Jerome Cruz
Matthew Guevara
Tie Wang
Manasi Deshpande
Xiaofeng Wang
Zheng Li

In customer service technical support, swiftly and accurately retrieving relevant past issues is critical for efficiently resolving customer inquiries. The conventional retrieval methods in retrieval-augmented generation (RAG) for large language models (LLMs) treat a large corpus of past issue tracking tickets as plain text, ignoring the crucial intra-issue structure and inter-issue relations, which limits performance. We introduce a novel customer service question-answering method that amalgamates RAG with a knowledge graph (KG). Our method constructs a KG from historical issues for use in retrieval, retaining the intra-issue structure and inter-issue relations. During the question-answering phase, our method parses consumer queries and retrieves related sub-graphs from the KG to generate answers. This integration of a KG not only improves retrieval accuracy by preserving customer service structure information but also enhances answering quality by mitigating the effects of text segmentation. Empirical assessments on our benchmark datasets, utilizing key retrieval (MRR, Recall@K, NDCG@K) and text generation (BLEU, ROUGE, METEOR) metrics, reveal that our method outperforms the baseline by 77.6% in MRR and by 0.32 in BLEU. Our method has been deployed within LinkedIn's customer service team for approximately six months and has reduced the median per-issue resolution time by 28.6%.
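Two of the retrieval metrics named above, MRR and Recall@K, can be sketched as follows using their standard definitions. The function names and the toy ticket identifiers are hypothetical, and this is not the authors' evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for pos, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / pos
                break
    return total / len(ranked_lists)


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


if __name__ == "__main__":
    # Hypothetical ticket IDs returned for two queries.
    runs = [["t1", "t2"], ["t7", "t8", "t9"]]
    gold = [{"t2"}, {"t9"}]
    print(mean_reciprocal_rank(runs, gold))
    print(recall_at_k(["t1", "t2", "t3"], {"t2", "t4"}, k=2))
```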

LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction

Chenhao Fang
Xiaohan Li
Zezhong Fan
Jianpeng Xu
Kaushiki Nag
Evren Korpeoglu
Sushant Kumar
Kannan Achan

Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, different LLMs exhibit varying strengths and weaknesses due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials.

In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).

SESSION: Session: SIRIP: Recsys and Social Media

Interest Clock: Time Perception in Real-Time Streaming Recommendation System

Yongchun Zhu
Jingwu Chen
Ling Chen
Yitan Li
Feng Zhang
Zuotao Liu

User preferences follow a dynamic pattern over a day, e.g., at 8 am, a user might prefer to read news, while at 8 pm, they might prefer to watch movies. Time modeling aims to enable recommendation systems to perceive time changes to capture users' dynamic preferences over time, which is an important and challenging problem in recommendation systems. In particular, streaming recommendation systems in the industry, with only samples of the current moment available, present greater challenges for time modeling. There is still a lack of effective time modeling methods for streaming recommendation systems. In this paper, we propose an effective and universal method, Interest Clock, to perceive time information in recommendation systems. Interest Clock first encodes users' time-aware preferences into a clock (hour-level personalized features) and then uses a Gaussian distribution to smooth and aggregate them into the final interest clock embedding according to the current time for the final prediction. By arming base models with Interest Clock, we conduct online A/B tests, obtaining +0.509% and +0.758% improvements on user active days and app duration respectively. Besides, the extended offline experiments show improvements as well. Interest Clock has been deployed on Douyin Music App.
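The Gaussian smoothing step described above can be sketched roughly as follows. The circular (wrap-around) hour distance, the normalization of the weights, and the `sigma` value are assumptions made for this illustration, and `interest_clock_embedding` is a hypothetical name, not the paper's API.

```python
import math


def interest_clock_embedding(hour_embeddings, current_hour, sigma=1.0):
    """Aggregate per-hour embeddings with Gaussian weights centred on now.

    hour_embeddings: list of equal-length vectors, one per hour slot.
    current_hour:    position on the clock (e.g. 0..23).
    """
    n = len(hour_embeddings)
    weights = []
    for h in range(n):
        d = abs(h - current_hour)
        d = min(d, n - d)  # assumed: distance wraps around the clock face
        weights.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
    total = sum(weights)
    dim = len(hour_embeddings[0])
    return [
        sum(w * e[j] for w, e in zip(weights, hour_embeddings)) / total
        for j in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical one-hot "preferences": hour h only activates dimension h.
    clock = [[1.0 if i == h else 0.0 for i in range(24)] for h in range(24)]
    print(interest_clock_embedding(clock, current_hour=8, sigma=1.5)[6:11])
```

Smoothing across neighbouring hours avoids abrupt changes in the feature when the clock ticks from one hour slot to the next.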

GATS: Generative Audience Targeting System for Online Advertising

Cong Jiang
Zhongde Chen
Bo Zhang
Yankun Ren
Xin Dong
Lei Cheng
Xinxing Yang
Longfei Li
Jun Zhou
Linjian Mo

This paper presents GATS (Generative Audience Targeting System for Online Advertising), a new framework using large language models (LLMs) to improve audience targeting in online advertising. GATS overcomes the shortcomings of rule-based, look-alike, and graph-based methods by facilitating flexible and interpretable audience criteria expression. The framework integrates intent recognition, knowledge mining, and Data Management Platform (DMP) mapping to translate advertiser demands into actionable user tags and correlate them within a DMP. A small, white-box model called LightGATS (based on QWen-14B), fine-tuned with a high-quality LLM corpus, ensures the framework's safety and efficiency, operating within a scalable hybrid online-offline architecture. GATS's effectiveness is validated through extensive experiments, marking a significant advancement in audience targeting technology.

Minimizing Live Experiments in Recommender Systems: User Simulation to Evaluate Preference Elicitation Policies

Chih-Wei Hsu
Martin Mladenov
Ofer Meshi
James Pine
Hubert Pham
Shane Li
Xujian Liang
Anton Polishko
Li Yang
Ben Scheetz
Craig Boutilier

Evaluation of policies in recommender systems typically involves A/B live experiments on real users to assess a new policy's impact on relevant metrics. This "gold standard" comes at a high cost, however, in terms of cycle time, user cost, and potential user retention. In developing policies for onboarding users, these costs can be especially problematic, since onboarding occurs only once. In this work, we describe a simulation methodology used to augment (and reduce) the use of live experiments. We illustrate its deployment for the evaluation of preference elicitation algorithms used to onboard new users of the YouTube Music platform. By developing counterfactually robust user behavior models, and a simulation service that couples such models with production infrastructure, we can test new algorithms in a way that reliably predicts their performance on key metrics when deployed live.

Improving Embedding-Based Retrieval in Friend Recommendation with ANN Query Expansion

Pau Perng-Hwa Kung
Zihao Fan
Tong Zhao
Yozen Liu
Zhixin Lai
Jiahui Shi
Yan Wu
Jun Yu
Neil Shah
Ganesh Venkataraman

Embedding-based retrieval in graph-based recommendation has shown great improvements over traditional graph walk retrieval methods, and has been adopted in large-scale industry applications such as friend recommendations [16]. However, it is not without its challenges: retraining graph embeddings frequently due to changing data is slow and costly, and producing high recall of approximate nearest neighbor search (ANN) on such embeddings is challenging due to the power law distribution of the indexed users. In this work, we address these issues by introducing a simple query expansion method in ANN, called FriendSeedSelection, where for each node query, we construct a set of 1-hop embeddings and run ANN search. We highlight that our approach does not require any model-level tuning, and is inferred from the data at test-time. This design choice effectively enables our recommendation system to adapt to the changing graph distribution without frequent heavy model retraining. We also discuss how we design our system to efficiently construct such queries online to support 10k+ QPS. For friend recommendation, our method shows improvements in recall, and 11% relative friend reciprocated communication metric gains, now serving over 800 million monthly active users at Snapchat.
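The query expansion idea can be sketched as below. Brute-force dot-product search stands in for a real ANN index, and several choices are assumptions made for illustration only: the query user is included as a seed alongside its 1-hop neighbours, existing friends are excluded from the results, and per-candidate scores are merged by taking the maximum. The names `top_k_neighbors` and `expanded_retrieval` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_neighbors(query_vec, embeddings, k, exclude):
    """Brute-force stand-in for an ANN lookup: k highest dot products."""
    scored = [(dot(query_vec, vec), node)
              for node, vec in embeddings.items() if node not in exclude]
    scored.sort(reverse=True)
    return scored[:k]


def expanded_retrieval(user, friends, embeddings, k=2):
    """Query with the user's and each 1-hop friend's embedding, then merge."""
    exclude = {user} | set(friends[user])
    seeds = [user] + list(friends[user])
    best = {}
    for seed in seeds:
        for score, cand in top_k_neighbors(embeddings[seed], embeddings, k, exclude):
            best[cand] = max(best.get(cand, float("-inf")), score)
    return sorted(best, key=lambda c: best[c], reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical 2-d user embeddings and friendship lists.
    emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1],
           "d": [0.1, 0.9], "e": [-1.0, -1.0]}
    print(expanded_retrieval("a", {"a": ["b"]}, emb, k=2))
```

Because the seeds are read from the current graph at query time, new friendships change the retrieved candidates without retraining the embeddings, which is the property the abstract emphasizes.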

Monitoring the Evolution of Behavioural Embeddings in Social Media Recommendation

Srijan Saket
Olivier Jeunen
Md. Danish Kalim

Emerging short-video platforms like TikTok, Instagram Reels, and ShareChat present unique challenges for recommender systems, primarily originating from a continuous stream of new content. ShareChat alone receives approximately 2 million pieces of fresh content daily, complicating efforts to assess quality, learn effective latent representations, and accurately match content with the appropriate user base, especially given limited user feedback. Embedding-based approaches are a popular choice for industrial recommender systems because they can learn low-dimensional representations of items, leading to effective recommendation that can easily scale to millions of items and users.

Our work characterizes the evolution of such embeddings in short-video recommendation systems, comparing the effect of batch and real-time updates to content embeddings. We investigate how embeddings change with subsequent updates, explore the relationship between embeddings and popularity bias, and highlight their impact on user engagement metrics. Our study unveils the contrast in the number of interactions needed to achieve mature embeddings in a batch learning setup versus a real-time one, identifies the point of highest information updates, and explores the distribution of l₂-norms across the two competing learning modes. Utilizing a production system deployed on a large-scale short-video app with over 180 million users, our findings offer insights into designing effective recommendation systems and enhancing user satisfaction and engagement in short-video applications.

Striking the Right Chord: A Comprehensive Approach to Amazon Music Search Spell Correction

Siddharth Sharma
Shiyun Yang
Ajinkya Walimbe
Tarun Sharma
Joaquin Delgado

Music and media search spell correction is distinct as it involves named entities like artist, album and podcast names, keywords from track titles and catchy phrases from lyrics. Users often mix artist names and keywords from track titles or lyrics, making spell correction highly contextual. Data drift in search queries, caused by calendar event days or a newly released music album, brings a unique challenge of quickly adapting to new data points. Scalability of the solution is an essential requirement as the Music catalog is extremely large. In this work, we build a multi-stage spell correction framework for music, media and named-entity-heavy search engines. We offer contextual spelling suggestions using a generative text transformer model and a mechanism to rapidly adapt to data drift as well as different market needs by using parameter-efficient fine-tuning techniques. Furthermore, using a reinforcement learning approach, our spell correction system can learn from a user's implicit and explicit feedback in real-time. Some key components of this system are being used in search at Amazon Music and are showing significant improvements in customer engagement rate and other relevant metrics.

SESSION: Session: SIRIP: Search Assistance

A Semantic Search Engine for Helping Patients Find Doctors and Locations in a Large Healthcare Organization

Mayank Kejriwal
Hamid Haidarian
Min-Hsueh Chiu
Andy Xiang
Deep Shrestha
Faizan Javed

Efficiently finding doctors and locations (FDL) is an important search problem for patients in the healthcare domain, for which traditional information retrieval (IR) methods tend to be sub-optimal. This paper introduces and defines FDL as an important healthcare industry-specific problem in IR. We then propose a semantic search engine as a robust solution to FDL in Kaiser Permanente (KP), a large healthcare organization with 12 million members. Our solution meets practical needs of data security and privacy, scalability, cost-effectiveness, backward compatibility with existing indexes and search infrastructure, and interpretability of outputs for patients. It uses a concept-rich ontology to model raw data from multiple sources as entities, relations, and attributes in a knowledge graph that is stored and indexed in an industry-scale graph database. We evaluate the solution on a real patient-query log and demonstrate its practical utility. The system has been implemented and deployed live to KP customers.

Clinical Trial Retrieval via Multi-grained Similarity Learning

Junyu Luo
Cheng Qian
Lucas Glass
Fenglong Ma

Clinical trial analysis is one of the main business directions and services in IQVIA, and reviewing past similar studies is one of the most critical steps before starting a commercial clinical trial. The current review process is manual and time-consuming, requiring a clinical trial analyst to manually search through an extensive clinical trial database and then review all candidate studies. Therefore, it is of great interest to develop an automatic retrieval algorithm to select similar studies given new study information. To achieve this goal, we propose a novel group-based trial similarity learning network named GTSLNet, consisting of two kinds of similarity learning modules. The pair-wise section-level similarity learning module aims to compare the query trial and the candidate trial from the abstract semantic level via the proposed section transformer. Meanwhile, a word-level similarity learning module uses the word similarity matrix to capture the low-level similarity information. Additionally, an aggregation module combines these similarities. To address potential false negatives and noisy data, we introduce a variance-regularized group distance loss function. Experimental results show that the proposed GTSLNet significantly and consistently outperforms state-of-the-art baselines.

Embedding Based Deduplication in E-commerce AutoComplete

Shaodan Zhai
Yuwei Chen
Yixue Li

Query AutoComplete (QAC) is an important feature in e-commerce search engines, aimed at enhancing user experience by offering relevant query suggestions. However, these suggestions often include semantically duplicate entries derived from user logs. While the existing literature has made significant progress in query similarity learning for e-commerce applications, the specific challenge of query deduplication has received less attention. To address this issue, this paper presents a new industry-scale framework for QAC deduplication at Coupang, utilizing diverse data augmentation techniques to enhance deduplication accuracy effectively. Our results reveal that this approach substantially outperforms existing query similarity methods, providing valuable insights into the utility of various pre-trained models and data augmentation strategies. Online A/B testing further validates the significant impact of our deduplication framework on improving the e-commerce search experience, highlighting the importance of addressing semantic duplicates in QAC suggestions and offering a practical solution with proven effectiveness in a live e-commerce environment.

Question Suggestion for Conversational Shopping Assistants Using Product Metadata

Nikhita Vedula
Oleg Rokhlenko
Shervin Malmasi

Digital assistants have become ubiquitous in e-commerce applications, following the recent advancements in Information Retrieval (IR), Natural Language Processing (NLP) and Generative Artificial Intelligence (AI). However, customers are often unsure or unaware of how to effectively converse with these assistants to meet their shopping needs. In this work, we emphasize the importance of providing customers a fast, easy to use, and natural way to interact with conversational shopping assistants. We propose a framework that employs Large Language Models (LLMs) to automatically generate contextual, useful, answerable, fluent and diverse questions about products, via in-context learning and supervised fine-tuning. Recommending these questions to customers as helpful suggestions or hints to both start and continue a conversation can result in a smoother and faster shopping experience with reduced conversation overhead and friction. We perform extensive offline evaluations, and discuss in detail the potential customer impact, and the type, length and latency of our generated product questions if incorporated into a real-world shopping assistant.

SESSION: Session: SIRIP: Domain-Specific 2

SLH-BIA: Short-Long Hawkes Process for Buy It Again Recommendations at Scale

Rankyung Park
Amit Pande
David Relyea
Pushkar Chennu
Prathyusha Kanmanth Reddy

Buy It Again (BIA) recommendations are a crucial component in enhancing the customer experience and site engagement for retailers. In this paper, we build a short (S) and long (L) term Hawkes (H) process for each item and use it to obtain BIA recommendations for each customer. The challenges of deploying into a production environment, including model scalability, an evolving item catalog, and real-time inference, are discussed along with solutions such as model compression, frequency-based item filtering, training data sampling, data parallelization, parallel execution and microservice-based real-time recommendations. We significantly reduced model training time from roughly 250 hours to about 3 hours by applying the solutions, while serving real-time inference with less than 70ms latency. We compare our BIA model against state-of-the-art baselines using three publicly available datasets and provide results from A/B tests with millions of live customers. On the three public datasets, our model outperforms SOTA baseline models in recall and NDCG metrics by around 85% and 10%, respectively, and in live A/B testing it exhibited more than 30% increase in click-through rate and roughly 30% revenue increase compared to other state-of-the-art models.
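As background for the model named above, a Hawkes process with an exponential kernel has intensity λ(t) = μ + Σ α·exp(−β(t − tᵢ)) over past events tᵢ < t. The sketch below combines a fast-decaying and a slow-decaying kernel by simple addition. How the paper actually combines its short- and long-term components is not stated in the abstract, so the additive form, the parameter values, and the function names are all assumptions.

```python
import math


def excitation(t, event_times, alpha, beta):
    """Sum of exponentially decaying boosts from purchases strictly before t."""
    return sum(alpha * math.exp(-beta * (t - ti))
               for ti in event_times if ti < t)


def short_long_intensity(t, event_times, mu=0.01,
                         short_term=(0.5, 1.0), long_term=(0.1, 0.05)):
    """Baseline rate plus a fast-decaying and a slow-decaying excitation.

    short_term / long_term are (alpha, beta) pairs: jump size and decay rate.
    """
    a_s, b_s = short_term
    a_l, b_l = long_term
    return (mu
            + excitation(t, event_times, a_s, b_s)
            + excitation(t, event_times, a_l, b_l))


if __name__ == "__main__":
    purchases = [1.0, 8.0]  # hypothetical purchase times (days) for one item
    for day in (8.5, 15.0, 40.0):
        print(day, round(short_long_intensity(day, purchases), 4))
```

Items could then be ranked per customer by their current intensity, so that recently or habitually repurchased items surface first.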

Graph-Based Audience Expansion Model for Marketing Campaigns

Md Mostafizur Rahman
Daisuke Kikuta
Yu Hirate
Toyotaro Suzumura

Audience Expansion is a technique for identifying new audiences with similar behaviors to the original target or seed users. The major challenges include a heterogeneous user base, intricate marketing campaigns, constraints imposed by sparsity, and limited seed users, which lead to overfitting. In this context, we propose a novel solution named AudienceLinkNet, specifically designed to address the challenges associated with audience expansion in the context of Rakuten's diverse services and its clients. Our approach formulates the audience expansion problem as a graph problem and explores the combination of a Pre-trained Knowledge Graph Embedding Model and a Graph Convolutional Network (GCN). It emphasizes the structural retention properties of GCNs, enabling the model to overcome challenges related to cross-service data usage, sparsity and limited seed data. AudienceLinkNet simplifies the targeting process for small and large marketing campaigns and better utilizes demographics and behavioral attributes for targeting. Extensive experiments on our advertising platform, Rakuten AIris Target Prospecting, demonstrate the effectiveness of our audience expansion model. Additionally, we present the limitations of AudienceLinkNet.

ScienceDirect Topic Pages: A Knowledge Base of Scientific Concepts Across Various Science Domains

Artemis Capari
Hosein Azarbonyad
Georgios Tsatsaronis
Zubair Afzal
Judson Dunham

From undergraduate students to renowned scholars, everyone occasionally encounters unknown concepts within their field of interest, especially when reading scientific articles. ScienceDirect Topic Pages (TP) are intended to facilitate learning and to provide users with a structured overview of sources to deepen their knowledge about such unfamiliar topics. Our free service provides insight into a vast set of technical topics across 20 different scientific domains. Designed to emulate the natural flow of learning, TPs are embedded within millions of articles so that users can click on unfamiliar concepts they come across whilst reading an article. This redirects the user to a TP, consisting of a definition of the concept, which provides the user with a basic understanding of the concept. The TP further presents a collection of relevant snippets extracted from books and review articles published by ScienceDirect for users interested in references and more detailed explanations and applications of the concept. Finally, a set of related topics is provided to extend the user's knowledge even further. To build TPs, we utilize various information retrieval methods across our product. We retrieve the most relevant snippets for each topic/concept using a semantic search model fine-tuned on our scientific database. We further leverage the power of Retrieval Augmented Generation to generate reliable definitions on the topics sourced from ScienceDirect's content. To retrieve a list of relevant concepts for each topic, we use the co-occurrence statistics of concepts within books and articles.

SESSION: Session: SIRIP: Panel

Are Embeddings Enough? SIRIP Panel on the Future of Embeddings in Industry IR Systems

Jon Degenhardt
Tracy Holloway King

The IR community as a whole is considering whether search and recommendations can move entirely to embedding-based technologies. This SIRIP panel discusses the future of embedding-based technologies in industry search given its broad range of document types, its specific query types, its performance requirements, and the features that accompany search. The panel comprises long-time industry experts and academics with industry ties. The panelists vary as to whether they believe that the industry in practice will move entirely to embeddings or will remain a hybrid domain.

SESSION: Session: Tutorials

Empowering Large Language Models: Tool Learning for Real-World Interaction

Hongru Wang
Yujia Qin
Yankai Lin
Jeff Z. Pan
Kam-Fai Wong

Since the advent of large language models (LLMs), the field of tool learning has remained very active in solving various tasks in practice, including but not limited to information retrieval. This half-day tutorial provides basic concepts of this field and an overview of recent advancements with several applications. Specifically, we start with some foundational components and architecture of tool learning (i.e., cognitive tool and physical tool), and then we categorize existing studies in this field into tool-augmented learning and tool-oriented learning, and introduce various learning methods to empower LLMs with this kind of capability. Furthermore, we provide several cases about when, what, and how to use tools in different applications. We end with some open challenges and several potential research directions for future studies. We believe this tutorial is suited for both researchers at different stages (introductory, intermediate, and advanced) and industry practitioners who are interested in LLMs and tool learning.

High Recall Retrieval Via Technology-Assisted Review

Lenora Gray
David D. Lewis
Jeremy Pickens
Eugene Yang

High Recall Retrieval (HRR) tasks, including eDiscovery in the law, systematic literature reviews, and sunshine law requests, focus on efficiently prioritizing relevant documents for human review. Technology-assisted review (TAR) refers to iterative human-in-the-loop workflows that combine human review with IR and AI techniques to minimize both time and manual effort while maximizing recall. This full-day tutorial provides a comprehensive introduction to TAR. The morning session presents an overview of the key technologies and workflow designs used, the basics of practical evaluation methods, and the social and ethical implications of TAR deployment. The afternoon session provides more technical depth on the implications of TAR workflows for supervised learning algorithm design, how generative AI can be applied in TAR, more sophisticated statistical evaluation techniques, and a wide range of open research questions.

Large Language Model Powered Agents for Information Retrieval

An Zhang
Yang Deng
Yankai Lin
Xu Chen
Ji-Rong Wen
Tat-Seng Chua

The vital goal of information retrieval today extends beyond merely connecting users with relevant information they search for. It also aims to enrich the diversity, personalization, and interactivity of that connection, ensuring the information retrieval process is as seamless, beneficial, and supportive as possible in the global digital era. Current information retrieval systems often encounter challenges like a constrained understanding of queries, static and inflexible responses, limited personalization, and restricted interactivity. With the advent of large language models (LLMs), there's a transformative paradigm shift as we integrate LLM-powered agents into these systems. These agents bring forth crucial human capabilities like memory and planning to make them behave like humans in completing various tasks, effectively enhancing user engagement and offering tailored interactions. In this tutorial, we delve into the cutting-edge techniques of LLM-powered agents across various information retrieval fields, such as search engines, social networks, recommender systems, and conversational assistants. We will also explore the prevailing challenges in seamlessly incorporating these agents and hint at prospective research avenues that can revolutionize the way of information retrieval.

Large Language Models for Recommendation: Past, Present, and Future

Keqin Bao
Jizhi Zhang
Xinyu Lin
Yang Zhang
Wenjie Wang
Fuli Feng

Large language models (LLMs) have significantly influenced recommender systems, spurring interest across academia and industry in leveraging LLMs for recommendation tasks. This includes using LLMs for generative item retrieval and ranking, and developing versatile LLMs for various recommendation tasks, potentially leading to a paradigm shift in the field of recommender systems. This tutorial aims to demystify the Large Language Model for Recommendation (LLM4Rec) by reviewing its evolution and delving into cutting-edge research. We will explore how LLMs enhance recommender systems in terms of architecture, learning paradigms, and functionalities such as conversational abilities, generalization, planning, and content generation. The tutorial will shed light on the challenges and open problems in this burgeoning field, including trustworthiness, efficiency, online training, and evaluation of LLM4Rec. We will conclude by summarizing key learnings from existing studies and outlining potential avenues for future research, with the goal of equipping the audience with a comprehensive understanding of LLM4Rec and inspiring further exploration in this transformative domain.

Large Language Models for Tabular Data: Progresses and Future Directions

Haoyu Dong
Zhiruo Wang

Tables contain a significant portion of the world's structured information. The ability to efficiently and accurately understand, process, reason about, analyze, and generate tabular data is critical for achieving Artificial General Intelligence (AGI) systems. However, despite their prevalence and importance, tables present unique challenges due to their structured nature and the diverse semantics embedded within them. Textual content, numerical values, visual formats, and even formulas in tables carry rich semantic information that is often underutilized due to the complexity of accurately interpreting and integrating it. Fortunately, the advent of Large Language Models (LLMs) has opened new frontiers in natural language processing (NLP) and machine learning (ML), showing remarkable success in understanding and generating text, code, etc. Applying these advanced models to the domain of tabular data holds the promise of significant breakthroughs in how we process and leverage structured information. Therefore, this tutorial aims to provide a comprehensive study of the advances, challenges, and opportunities in leveraging cutting-edge LLMs for tabular data. By introducing methods of prompting or training cutting-edge LLMs for table interpreting, processing, reasoning, analytics, and generation, we aim to equip researchers and practitioners with the knowledge and tools needed to unlock the full potential of LLMs for tabular data in their domains.

Preventing and Detecting Misinformation Generated by Large Language Models

Aiwei Liu
Qiang Sheng
Xuming Hu

As large language models (LLMs) become increasingly capable and widely deployed, the risk of them generating misinformation poses a critical challenge. Misinformation from LLMs can take various forms, from factual errors due to hallucination to intentionally deceptive content, and can have severe consequences in high-stakes domains. This tutorial covers comprehensive strategies to prevent and detect misinformation generated by LLMs. We first introduce the types of misinformation LLMs can produce and their root causes. We then explore two broad categories: Preventing misinformation generation: a) AI alignment training techniques to reduce LLMs' propensity for misinformation and refuse malicious instructions during model training. b) Training-free mitigation methods like prompt guardrails, retrieval-augmented generation (RAG), and decoding strategies to curb misinformation at inference time. Detecting misinformation after generation, including a) using LLMs themselves to detect misinformation through embedded knowledge or retrieval-enhanced judgments, and b) distinguishing LLM-generated text from human-written text through black-box approaches (e.g., classifiers, probability analysis) and white-box approaches (e.g., watermarking). We also discuss the challenges and limitations of detecting LLM-generated misinformation.

Recent Advances in Generative Information Retrieval

Yubao Tang
Ruqing Zhang
Zhaochun Ren
Jiafeng Guo
Maarten de Rijke

Generative retrieval (GR) has witnessed significant growth recently in the area of information retrieval. Compared to the traditional "index-retrieve-then-rank" pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications. We start by providing preliminary information covering foundational aspects and problem formulations of GR. Then, our focus shifts towards recent progress in docid design, training approaches, inference strategies, and applications of GR. We end by outlining challenges and issuing a call for future GR research. Throughout the tutorial we highlight the availability of relevant resources so as to enable a broad audience to contribute to this topic. This tutorial is intended to be beneficial to both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.
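
To make the query-to-docid mapping concrete, the sketch below decodes a docid token by token while restricting each step to tokens that keep the output inside a prefix trie of valid docids. This is one commonly used inference strategy, shown under stated assumptions rather than as the tutorial's method. The docids, the `TOY_TARGETS` table, and `toy_score` are hypothetical; `toy_score` merely stands in for the next-token scores of a trained sequence-to-sequence model.

```python
EOS = "<eos>"


def build_trie(docids):
    """Prefix trie over docid token sequences; EOS marks a complete docid."""
    trie = {}
    for docid in docids:
        node = trie
        for tok in docid:
            node = node.setdefault(tok, {})
        node[EOS] = {}
    return trie


def constrained_greedy_decode(score_fn, query, trie, max_len=16):
    """Greedy decoding restricted to continuations that exist in the trie."""
    prefix, node = [], trie
    for _ in range(max_len):
        allowed = sorted(node)  # only these tokens can still lead to a valid docid
        best = max(allowed, key=lambda tok: score_fn(query, prefix, tok))
        if best == EOS:
            return "".join(prefix)
        prefix.append(best)
        node = node[best]
    return None  # no complete docid within max_len steps


# Hypothetical stand-in for a trained model: it "knows" two query-docid pairs.
TOY_TARGETS = {"neural ranking": "123", "patent search": "45"}


def toy_score(query, prefix, tok):
    target = list(TOY_TARGETS.get(query, "")) + [EOS]
    pos = len(prefix)
    return 1.0 if pos < len(target) and target[pos] == tok else 0.0


docids = ["120", "123", "45"]
trie = build_trie(docids)
known = constrained_greedy_decode(toy_score, "neural ranking", trie)
unknown = constrained_greedy_decode(toy_score, "unseen query", trie)
print(known, unknown)
```

Even for a query the toy scorer has never seen, the decoder can only emit an identifier that exists in the corpus, which is the point of constraining generation with the trie.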

Robust Information Retrieval

Yu-An Liu
Ruqing Zhang
Jiafeng Guo
Maarten de Rijke

Beyond effectiveness, the robustness of an information retrieval (IR) system is increasingly attracting attention. When deployed, a critical technology such as IR should not only deliver strong performance on average but also have the ability to handle a variety of exceptional situations. In recent years, research into the robustness of IR has seen significant growth, with numerous researchers offering extensive analyses and proposing myriad strategies to address robustness challenges. In this tutorial, we first provide background information covering the basics and a taxonomy of robustness in IR. Then, we examine adversarial robustness and out-of-distribution (OOD) robustness within IR-specific contexts, extensively reviewing recent progress in methods to enhance robustness. The tutorial concludes with a discussion on the robustness of IR in the context of large language models (LLMs), highlighting ongoing challenges and promising directions for future research. This tutorial aims to generate broader attention to robustness issues in IR, facilitate an understanding of the relevant literature, and lower the barrier to entry for interested researchers and practitioners.

Search under Uncertainty: Cognitive Biases and Heuristics: A Tutorial on Testing, Mitigating and Accounting for Cognitive Biases in Search Experiments

Jiqun Liu
Leif Azzopardi

Understanding how people interact with search interfaces is core to the field of Interactive Information Retrieval (IIR). While various models have been proposed (e.g., Belkin's ASK, Berry picking, Everyday-life information seeking, Information foraging theory, Economic theory, etc.), they have largely ignored the impact of cognitive biases on search behaviour and performance. A growing body of empirical work exploring how people's cognitive biases influence search and judgments has led to the development of new models of search that draw upon Behavioural Economics and Psychology. This full-day tutorial will provide a starting point for researchers seeking to learn more about information seeking, search and retrieval under uncertainty. The tutorial will be structured into three parts. First, we will provide an introduction to the biases and heuristics program put forward by Tversky and Kahneman (1974) [60], which assumes that people are not always rational. The second part of the tutorial will provide an overview of the types and space of biases in search [5, 40], before doing a deep dive into several specific examples and the impact of biases on different types of decisions (e.g., health/medical, financial). The third part will focus on a discussion of the practical implications for the design and evaluation of human-centered IR systems in the light of cognitive biases, where participants will undertake some hands-on exercises.

Using and Evaluating Quantum Computing for Information Retrieval and Recommender Systems

Maurizio Ferrari Dacrema
Andrea Pasin
Paolo Cremonesi
Nicola Ferro

The field of Quantum Computing (QC) has gained significant popularity in recent years, due to its potential to provide benefits in terms of efficiency and effectiveness when employed to solve certain computationally intensive tasks. In both Information Retrieval (IR) and Recommender Systems (RS) we are required to build methods that apply complex processing on large and heterogeneous datasets; it is therefore natural to wonder whether QC could also be applied to boost their performance. The tutorial aims to provide first an introduction to QC for an audience that is not familiar with the technology, then to show how to apply the QC paradigm of Quantum Annealing (QA) to solve practical problems that are currently faced by IR and RS systems. During the tutorial, participants will be provided with the fundamentals required to understand QC and to apply it in practice by using a real D-Wave quantum annealer through APIs.
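
Quantum annealers are usually given problems in QUBO form (quadratic unconstrained binary optimization): find the binary vector x that minimizes the sum of Q[i, j] * x[i] * x[j]. The sketch below poses a toy feature-selection problem this way and solves it by classical brute force, which stands in for the annealer; no D-Wave API is used. The relevance and redundancy numbers are hypothetical and chosen only for illustration.

```python
from itertools import product


def qubo_energy(x, Q):
    """Energy of binary vector x under an upper-triangular QUBO dictionary Q."""
    n = len(x)
    return sum(Q.get((i, j), 0.0) * x[i] * x[j]
               for i in range(n) for j in range(i, n))


def solve_qubo_brute_force(Q, n):
    """Enumerate all 2**n assignments; feasible only for very small n."""
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(x, Q))


# Hypothetical toy instance: choose features that are relevant but not redundant.
relevance = [0.9, 0.8, 0.3]                              # reward for selecting feature i
redundancy = {(0, 1): 1.2, (0, 2): 0.1, (1, 2): 0.1}     # penalty for selecting both

Q = {(i, i): -r for i, r in enumerate(relevance)}        # diagonal: negative reward
Q.update(redundancy)                                     # off-diagonal: pair penalties

best = solve_qubo_brute_force(Q, len(relevance))
print(best, qubo_energy(best, Q))
```

Features 0 and 1 are individually the most relevant, but their large pairwise penalty makes selecting both worse than pairing feature 0 with the weaker, non-redundant feature 2.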

SESSION: Session: Workshops

5th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech2024)

Ralf Krestel
Hidir Aras
Linda Andersson
Florina Piroi
Allan Hanbury
Dean Alderucci

Information retrieval systems for the patent domain have a long history. They can support patent experts in a variety of daily tasks: from analyzing the patent landscape to support experts in the patenting process and large-scale information extraction. Advances in machine learning and natural language processing allow further automation of tasks, such as paragraph retrieval, question answering (QA) or even patent text generation. Uncovering the potential of semantic technologies for the intellectual property (IP) industry is just getting started. Investigating the use of artificial intelligence methods for the patent domain is therefore not only of academic interest, but also highly relevant for practitioners. Compared to other domains, high quality, semi-structured, annotated data is available in large volumes (a requirement for supervised machine learning models), making training large models easier. On the other hand, domain-specific challenges arise, such as very technical language or legal requirements for patent documents. With the 5th edition of this workshop we will provide a platform for researchers and industry to learn about novel and emerging technologies for semantic patent retrieval and big analytics employing sophisticated methods ranging from patent text mining, domain-specific information retrieval to large language models targeting next generation applications and use cases for the IP and related domains.

AgentIR: 1st Workshop on Agent-based Information Retrieval

Qingpeng Cai
Xiangyu Zhao
Ling Pan
Xin Xin
Jin Huang
Weinan Zhang
Li Zhao
Dawei Yin
Grace Hui Yang

Information retrieval (IR) systems have become an essential component in modern society to help users find useful information, which consists of a series of processes including query expansion, item recall, item ranking and re-ranking, etc. Based on the ranked information list, users can provide their feedback. Such an interaction process between users and IR systems can be naturally formulated as a decision-making problem, which can be either one-step or sequential. In the last ten years, deep reinforcement learning (DRL) has become a promising direction for decision-making, since DRL utilizes the high model capacity of deep learning for complex decision-making tasks. On the one hand, there have been emerging research works focusing on leveraging DRL for IR tasks. However, the fundamental information theory under DRL settings, the challenge of RL methods for industrial IR tasks, and the simulations of DRL-based IR systems have not been deeply investigated. On the other hand, the emerging LLM provides new opportunities for optimizing and simulating IR systems. To this end, we propose the first Agent-based IR workshop at SIGIR 2024, as a continuation from one of the most successful IR workshops, DRL4IR. It provides a venue for both academic researchers and industry practitioners to present the recent advances of both DRL-based IR systems and LLM-based IR systems from the agent-based IR's perspective, to foster novel research, interesting findings, and new applications.

Gen-IR @ SIGIR 2024: The Second Workshop on Generative Information Retrieval

Gabriel Bénédict
Ruqing Zhang
Donald Metzler
Andrew Yates
Ziyan Jiang

Generative information retrieval (Gen-IR) is a fast-growing interdisciplinary research area that investigates how to leverage advances in generative Artificial Intelligence (AI) to improve information retrieval systems. Gen-IR has attracted interest from the information retrieval, natural language processing, and machine learning communities, among others. Since the dawn of Gen-IR last year, there has been an explosion of Gen-IR systems that have launched and are now widely used. Interest in this area across academia and industry is only expected to continue to grow as new research challenges and application opportunities arise. The goal of this proposed workshop, The Second Workshop on Generative Information Retrieval (Gen-IR @ SIGIR 2024) is to provide an interactive venue for exploring a broad range of foundational and applied Gen-IR research. The workshop will focus on tasks such as generative document retrieval, grounded answer generation, generative recommendation, and generative knowledge graphs, all through the lens of model training, model behavior, and broader issues. The workshop will be highly interactive, favoring panel discussions, poster sessions, and roundtable discussions over one-sided keynotes and paper talks.

International Workshop on Algorithmic Bias in Search and Recommendation (BIAS)

Alejandro Bellogín
Ludovico Boratto
Styliani Kleanthous
Elisabeth Lex
Francesca Maridina Malloci
Mirko Marras

Creating efficient and effective search and recommendation algorithms has been the main objective of industry practitioners and academic researchers over the years. However, recent research has shown how these algorithms trained on historical data lead to models that might exacerbate existing biases and generate potentially negative outcomes. Defining, assessing, and mitigating these biases throughout experimental pipelines is a primary step for devising search and recommendation algorithms that can be responsibly deployed in real-world applications. This workshop aims to collect novel contributions in this field and offer a common ground for interested researchers and practitioners. More information about the workshop is available at https://biasinrecsys.github.io/sigir2024/

IR-RAG @ SIGIR24: Information Retrieval's Role in RAG Systems

Fabio Petroni
Federico Siciliano
Fabrizio Silvestri
Giovanni Trappolini

In recent years, Retrieval Augmented Generation (RAG) systems have emerged as a pivotal component in the field of artificial intelligence, gaining significant attention and importance across various domains. These systems, which combine the strengths of information retrieval and generative models, have shown promise in enhancing the capabilities and performance of machine learning applications. However, despite their growing prominence, RAG systems are not without their limitations and continue to be in need of exploration and improvement. This workshop seeks to focus on the critical aspect of information retrieval and its integral role within RAG frameworks. We argue that current efforts have undervalued the role of Information Retrieval (IR) in RAG and have concentrated their attention on the generative part. As the cornerstone of these systems, IR's effectiveness dramatically influences the overall performance and outcomes of RAG models. We call for papers that will seek to revisit and emphasize the fundamental principles underpinning RAG systems. At the end of the workshop, we aim to have a clearer understanding of how robust information retrieval mechanisms can significantly enhance the capabilities of RAG systems. The workshop will serve as a platform for experts, researchers, and practitioners. We intend to foster discussions, share insights, and encourage research that underscores the vital role of Information Retrieval in the future of generative systems.

LLM4Eval: Large Language Model for Evaluation in IR

Hossein A. Rahmani
Clemencia Siro
Mohammad Aliannejadi
Nick Craswell
Charles L. A. Clarke
Guglielmo Faggioli
Bhaskar Mitra
Paul Thomas
Emine Yilmaz

Large language models (LLMs) have demonstrated increasing task-solving abilities not present in smaller models. Utilizing the capabilities and responsibilities of LLMs for automated evaluation (LLM4Eval) has recently attracted considerable attention in multiple research communities. For instance, LLM4Eval models have been studied in the context of automated judgments, natural language generation, and retrieval augmented generation systems. We believe that the information retrieval community can significantly contribute to this growing research area by designing, implementing, analyzing, and evaluating various aspects of LLMs with applications to LLM4Eval tasks. The main goal of the LLM4Eval workshop is to bring together researchers from industry and academia to discuss various aspects of LLMs for evaluation in information retrieval, including automated judgments, retrieval-augmented generation pipeline evaluation, altering human evaluation, robustness, and trustworthiness of LLMs for evaluation in addition to their impact on real-world applications. We also plan to run an automated judgment challenge prior to the workshop, where participants will be asked to generate labels for a given dataset while maximising correlation with human judgments. The format of the workshop is interactive, including roundtable and keynote sessions, and tends to avoid the one-sided dialogue of a mini-conference.

MANILA24: SIGIR 2024 Workshop on Information Retrieval and Climate Impact

Bart van den Hurk
Maarten de Rijke
Flora Salim

The MANILA24 workshop brings together researchers and practitioners from academia, industry, governments, and NGOs to identify and discuss core research problems in information retrieval for climate impact. The workshop aims to foster collaboration by bringing communities together that have so far not been very well connected -- IR, systematic reviews, and climate change. The purpose is to help accelerate the development of IR technology that supports our understanding of climate impact publications and the articulation of recommended actions. Importantly, this includes introducing IR researchers to climate impact, introducing researchers in climate to state-of-the-art IR technology, and developing a shared research agenda.

Multimodal Representation and Retrieval [MRR 2024]

Xinliang Zhu
Arnab Dhua
Douglas Gray
I. Zeki Yalniz
Tan Yu
Mohamed Elhoseiny
Bryan Plummer

Multimodal data is available in many applications like e-commerce product listings, social media posts and short videos. However, existing algorithms dealing with those types of data still focus on uni-modal representation learning by vision-language alignment and cross-modal retrieval. In this workshop, we aim to introduce a new retrieval problem where both queries and documents are multimodal. With the popularity of vision language modeling, large language models (LLMs), retrieval augmented generation (RAG), and multimodal LLMs, we see a lot of new opportunities for multimodal representation and retrieval tasks. This event will be a comprehensive half-day workshop focusing on the subject of multimodal representation and retrieval. The agenda includes keynote speeches, oral presentations, and an interactive panel discussion.

ReNeuIR at SIGIR 2024: The Third Workshop on Reaching Efficiency in Neural Information Retrieval

Maik Fröbe
Joel Mackenzie
Bhaskar Mitra
Franco Maria Nardini
Martin Potthast

The Information Retrieval (IR) community has a rich history of empirically measuring novel retrieval methods in terms of effectiveness and efficiency. However, as the search ecosystem is developing rapidly, comparatively little attention has been paid to evaluating efficiency in recent years, which raises the question of the cost-benefit ratio between effectiveness and efficiency. In this regard, it has become difficult to compare and contrast systems in an empirically fair way. Factors including hardware configurations, software versioning, experimental settings, and measurement methods all contribute to the difficulty of meaningfully comparing search systems, especially where efficiency is a key component of the evaluation. Furthermore, efficiency is no longer limited to time and space but has found new, challenging dimensions that stretch to resource, sample, and energy efficiency and have implications for users, researchers, and the environment. Examining algorithms and models through the lens of efficiency and its trade-off with effectiveness requires revisiting and establishing new standards and principles, from defining relevant concepts, to designing measures, to creating guidelines for making sense of the significance of findings. The third iteration of ReNeuIR aims to bring the community together to debate these questions and collaboratively test and improve a benchmarking framework for efficiency derived from the discussions of the first two iterations of this workshop. We provide a first prototype of this framework by organizing a shared task track focused on comparability and reproducibility at the workshop.

SIGIR 2024 Workshop on eCommerce (ECOM24)

Surya Kallumadi
Yubin Kim
Tracy Holloway King
Maarten de Rijke
Vamsi Salaka

ECOM24 brings together researchers and practitioners from academia and industry to identify and discuss core research problems in eCommerce search and recommendation. The workshop aims to foster collaboration, to attract research funding, and to introduce IR researchers and postgraduate students to eCommerce product discovery. The workshop features a special theme of eCommerce search in the age of Generative AI and LLMs and a data challenge in collaboration with TREC on how end-to-end retrieval systems can be built and evaluated given a large set of products.

SIGIR 2024 Workshop on Simulations for Information Access (Sim4IA 2024)

Philipp Schaer
Christin Katharina Kreutz
Krisztian Balog
Timo Breuer
Norbert Fuhr

Simulations in various forms have been used to evaluate information access systems, like search engines, recommender systems, or conversational agents. In the form of the Cranfield paradigm, a simulation setup is well-known in the IR community, but user simulations have recently gained interest. While user simulations help to reduce the complexity of evaluation experiments and help with reproducibility, they can also contribute to a better understanding of users. Building on recent developments in methods and toolkits, the Sim4IA workshop aims to bring together researchers and practitioners to form an interactive and engaging forum for discussions on the future perspectives of the field. An additional aim is to plan an upcoming TREC/CLEF campaign.

The Second Workshop on Large Language Models for Individuals, Groups, and Society

Michael Bendersky
Cheng Li
Qiaozhu Mei
Vanessa Murdock
Jie Tang
Hongning Wang
Hamed Zamani
Mingyang Zhang
Xingjian Zhang

This is the second workshop in the series which discusses the cutting-edge developments in research and applications of personalizing large language models (LLMs) and adapting them to the demands of diverse user populations and societal needs. The full-day workshop plan includes several keynotes and invited talks, a poster session and a panel discussion.

Third Workshop on Personalization and Recommendations in Search (PaRiS)

Sudarshan Lamkhede
Hamed Zamani
Moumita Bhattacharya
Hongning Wang

With the proliferation of personal computing devices and the large number of logged-in experiences, search has evolved to a stage with many different product scenarios where personalization plays a crucial role for relevance quality and user satisfaction. The purpose of this workshop is to have a forum where the latest research and advancements specifically on Personalization and Recommendations in Search (PaRiS) can be discussed in conjunction with SIGIR 2024. This will be the third instance of this workshop. We held two very successful instances of this workshop at the WebConf 2023 [5] and WSDM 2022. This year we will especially focus on applications of LLMs and Generative AI to enable personalization and recommendations [1] in the context of search, for example, conversational assistants, while continuing to use this workshop for discussing other advances and applications in personalized search and recommendations in the context of search.

SESSION: Session: Doctoral Consortium

A Predictive Framework for Query Reformulation

Reyhaneh Goli

Web search services are widely employed for various purposes. After identifying information needs, users attempt to articulate them in web queries that express their intentions. Then, they submit these queries to the chosen search engine with the hope of obtaining relevant results to meet their needs. In some cases, users may not immediately find precisely what they are seeking, prompting them to rewrite the query to obtain a greater number of relevant results or results that are perhaps more related to their intent. While significant work has been done on developing features such as query auto-completion, query suggestion, and query recommendation, the majority of these efforts were based on query co-occurrence or query similarity by clustering them or constructing query flow graphs to capture query connections. These approaches operate under the assumption that frequently observed follow-up queries are more likely to be submitted by users [1, 2, 4].

In this research, we investigate user query reformulation behavior. To achieve this, we will utilize the TripClick dataset, a large-scale collection of user click data within the context of a health web search engine [3]. The log data from 2018 to 2020 will be considered, comprising 1,803,493 records representing the clicks that occurred across 527,749 sessions. Specifically, the focus will be on the impact of user interactions with the search result page when forming subsequent queries.
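
As a quick reading aid for the figures above, the following snippet computes the average number of click records per session implied by the two stated totals. Only the two totals come from the abstract; the variable names are ours, and the average is a derived figure, not one reported by the author.

```python
# Totals stated in the abstract for the 2018 to 2020 log data.
click_records = 1_803_493
sessions = 527_749

# Derived figure (not reported in the abstract): mean click records per session.
clicks_per_session = click_records / sessions
print(f"{clicks_per_session:.2f}")  # prints 3.42
```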

Axiomatic Guidance for Efficient and Controlled Neural Search

Andrew Parry

Pre-trained language models based on the transformer architecture provide solutions to general ad-hoc search tasks--ranging from news search to question-answering--vastly outperforming statistical approaches in terms of both precision and recall. These models operate over "semantics", removing the need for bespoke features based on proprietary data (e.g., interaction logs). In doing so, this paradigm may lead to further adoption of the idealised "end-to-end" retrieval system as an elegant and powerful search solution. However, outside of sanitised benchmarks, these models present exploitable and untrustworthy biases, relinquishing any control over inference due to their black-box nature.

Such biases threaten the viability of neural models in production. Without greater control over model output, stakeholders could raise concerns hindering the adoption of effective and efficient search. Today, feature-based search systems are still performant relative to state-of-the-art neural search and can adapt to a changing corpus and the needs of system stakeholders. As agency over information access is further reduced via emerging paradigms such as Retrieval-Augmented-Generation, we must retain control over the output of a search system. We consider that bias in neural search systems is an artefact of the training and underlying mechanisms of current pre-trained models but is not present in statistical models. Features such as statistical models are principled and arbitrarily controllable; these features can adapt to a corpus and meet the demands of a given search task. Conversely, the output of a current neural system can only be changed by post hoc constraints or by re-training the underlying model. We posit that by allowing external features to influence the semantic interactions within neural search at inference time, we can not only allow control over system output but reduce the need to model corpus-specific priors, which can instead be modelled by external features, allowing for greater generalisation and training efficiency gains. We aim to reduce the complexity of neural ranker training and inference, applying classical IR principles and systems that align with such principles as a generalisable process as opposed to the ad-hoc constraint of prior work. Such an approach can reduce the need for larger models whilst improving generalisation. Axiomatic signals can guide and control neural ranking models to reduce spurious factors in semantic relevance estimation by compensating for the frozen priors of neural systems whilst still operating over flexible latent space. Given the biases observed in current systems, this may satiate the concerns of multiple stakeholders, leading to broader adoption of the paradigm.

GOLF: Goal-Oriented Long-term liFe tasks supported by human-AI collaboration

Ben Wang

The advent of ChatGPT and similar large language models (LLMs) has revolutionized the human-AI interaction and information-seeking process. Leveraging LLMs as an alternative to search engines, users can now access summarized information tailored to their queries, significantly reducing the cognitive load in navigating vast information resources. This shift underscores the potential of LLMs in redefining information access paradigms [1]. Drawing on the foundation of task-focused information retrieval and LLMs' task planning ability, this research extends the scope of LLM capabilities beyond short-term task automation (i.e., smaller-scale and routine tasks that LLM agents can automate with less human intervention) to support users in navigating long-term and significant life tasks. The long-term tasks encompass broader personal life goals or development in aspects like health, finances, education, and professional development, which cannot be fully completed by LLM agents but require significant human involvement.

This study introduces the GOLF framework (Goal-Oriented Long-term liFe tasks), which focuses on enhancing LLMs' ability to assist in significant life decisions through goal orientation and long-term planning. Figure 1 presents the GOLF framework, including a task taxonomy and the process for task management. The GOLF framework envisions the completion of complex tasks as a strategic journey toward a final goal, incorporating a sequence of activities, tasks, and subtasks, adopting the task taxonomy in Figure 1a [2]. Figure 1b illustrates the task process within the GOLF framework, which operates on AutoGen [3], a sophisticated multi-agent system, and involves multiple LLM agents to facilitate user support and workload distribution for achieving long-term goals. The multi-agent system processes the task through the following steps: Initial Planning, Step Planning, Task Assignment, Multi-Agent Coordination, User Engagement, and Evaluation and Iteration.

The methodology encompasses a comprehensive simulation study to test the framework's efficacy, followed by model and human evaluations to develop a dataset benchmark for long-term life tasks, and experiments across different models and settings. By shifting the focus from short-term tasks to the broader spectrum of long-term life goals, this research underscores the transformative potential of LLMs in enhancing human decision-making processes and task management, marking a significant step forward in the evolution of human-AI collaboration.

Leveraging LLMs for Detecting and Modeling the Propagation of Misinformation in Social Networks

Payel Santra

Recent success in language generation capabilities of large language models (LLMs), such as GPT, Llama, etc., can potentially lead to concern about their possible misuse in inducing mass agitation and communal hatred via generating fake news and spreading misinformation. Traditional means of developing a misinformation ground-truth dataset do not scale well because of the extensive manual effort required to annotate the data. It is crucial to anticipate and counteract potential adversarial fake information to mitigate detrimental effects and promote societal harmony. To this end, this PhD proposal spans three main research directions. The first concerns investigating ways of developing unsupervised models for fake news identification leveraging retrieval augmented generation (RAG) approaches. In our second thread of work, we explore ways of creating synthetic datasets to eventually train supervised or few-shot example-based models. Another direction of research work involves tracking the propagation of fake information through social networks to develop preventive measures against it.

Machine Generated Explanations and Their Evaluation

Edward Richards

Rapid adoption of a new generation of LLMs has demonstrated their considerable capabilities. However, these models are far from infallible, raising significant ethical concerns, especially in decision-making applications, prompting calls for increased restraint [2].

The Augmented Intelligence paradigm is one proposed mitigation. Therein LLMs are tools used by human decision makers to improve performance without corresponding loss of accountability.

However, this mitigation imposes requirements on models that are not the primary focus of existing evaluation approaches. In particular, current explanation evaluation approaches tend to prioritize premises and conclusions over reasoning quality. It is evident that logical soundness is a crucial aspect of system operation, as the output must be interpretable to the user.

This work therefore proposes adopting a technique from programming language theory, wherein intermediate representations are employed to simplify the evaluation of code [3]. Rather than the model mapping directly from queries to solutions, code generation is used to produce an executable intermediate. An effect of this design is to shift the LLM from being a producer of solutions to being the creator of delegation plans. Production of a more structured output is expected to ease estimation of model reasoning quality via comparison to golden solutions. Use of a bespoke representation, designed to take advantage of the particulars of automated code generation, aims to reduce the difficulty of this estimation. Use of a novel syntax is, however, made challenging by the absence of existing examples. It is impractical and undesirable, for reasons of both cost and flexibility, to create the large numbers of examples that would be needed for conventional training.

Early efforts have therefore been concentrated on two main areas: developing a syntax and interpreter, and addressing the challenge of data sparsity. A well-designed syntax is crucial, not only because updates will necessitate revising an increasing number of established solutions, but also due to its expected impact on the overall system utility. The expressiveness of the syntax is particularly significant in this context. Excessive constraint sacrifices generality, while too much leniency results in a proliferation of semantically equivalent solutions, complicating comparisons with the gold solutions. The generated intermediate representation is executed by an interpreter to produce the end solution. Other than a small number of control flow statements, all statements in the language are parameterized calls to external tools such as retrieval systems or math expression evaluators. External tool usage is the primary motivator behind avoiding training via extensive, manually curated examples: the ability to easily add or remove tools is highly desirable. In case of an error, the interpreter output may include explanations of constructs and available tools, error messages, and task-specific metrics. On success, by contrast, tool output is substituted into the appropriate section of the representation, such that a completely evaluated intermediate includes all information necessary to construct the end natural language explanation. Transformation into this natural language explanation can then be undertaken by another LLM. In terms of solutions for data sparsity, an approach similar to that used by LLM agents in environment exploration is suggested [1]. For each query, a ranking of model generations is produced based on interpreter output in conjunction with a scoring function. Pairwise selection of stronger and weaker responses is used thereafter in a modified form of iterative Direct Preference Optimisation (DPO) [4].
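The ranking and pairwise selection step can be illustrated with a minimal sketch. It is not the author's implementation: the `Generation` type, the `score` field (standing in for a scoring function applied to interpreter output), and the `margin` threshold are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Generation:
    """One model generation and its (hypothetical) score from interpreter output."""
    text: str
    score: float


def preference_pairs(generations, margin=0.0):
    """Return (stronger, weaker) pairs whose score gap exceeds `margin`."""
    # Rank generations from highest to lowest score.
    ranked = sorted(generations, key=lambda g: g.score, reverse=True)
    # Every earlier item in the ranking scores at least as high as a later one;
    # keep only pairs with a strict gap so that ties do not become preferences.
    return [
        (better, worse)
        for better, worse in combinations(ranked, 2)
        if better.score - worse.score > margin
    ]


gens = [
    Generation("plan A", 0.9),
    Generation("plan B", 0.4),
    Generation("plan C", 0.4),
]
pairs = preference_pairs(gens)
for chosen, rejected in pairs:
    print(f"prefer {chosen.text!r} over {rejected.text!r}")
```

Pairs of this shape (a preferred and a dispreferred response to the same query) are the usual input to preference-optimisation objectives; tied generations are skipped here because they carry no preference signal.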

Given the preliminary system, we aim to test it across a range of tasks that correspond to the augmented intelligence paradigm, such as multi-hop question answering.

Mosaicing Prevention in Declassification

Nathaniel Rollings

Multiple methods can be used to infer as-yet unrecorded information. However, this ability can place confidentiality at risk when some inferences, although correct, could cause harm. We therefore flip the problem, seeking not to enable but to prevent specific inferences. This inference prevention task is motivated by what has been called the "mosaicing" problem in declassification review for documents that in the past were withheld from public access for national security reasons (Pozen, 2005). The goal of such a review is to reveal as much as can now be safely revealed but to also withhold things that could be used to infer facts that require continued protection. This problem is modeled using three primary components: (1) currently public information, (2) a set of secrets (information that is not public and requires continuing protection), and (3) a review set (other information now being reviewed for possible release). The inference prevention task is to determine what in the review set would substantially increase the inference risk for a secret.

Our initial work investigated the use of knowledge graphs for keeping secrets using Knowledge Graph Completion (KGC) techniques. While declassification is typically text-based, we expect a structured analog to that problem can provide some useful insights. There also are applications where prevention of inference in a knowledge graph is the actual task, such as protecting against specific drug discovery inferences when augmenting the Hetionet knowledge graph. Our mosaicing problem is the inverse of KGC: rather than inferring a link, we need to prevent inference of a link. This challenge is distinct from anonymization for social media graphs because we can't alter most relationships, only those in the review set. Using the FB15K-237 knowledge graph, we analyzed three KGC models to identify the relation in a defined review set most critical to inference of a missing secret relation (thus "nominating" a relation for redaction). We evaluated the impact of redactions nominated by one model on inference by other models by ranking a secret with some selected confounds, finding that our simplest model (RuleN) produced the best nominations, despite being the least effective of the three on the KGC task. Future work will use graphs more closely modeling declassification, and other KGC models. It will also explore areas in which differences between the traditional KGC task and the declassification problem may be exploited, most notably the focus on specific secrets for declassification, which may allow more focused training of models and improve scalability.
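The idea of nominating a review-set relation for redaction can be sketched as a leave-one-out comparison. This is only an illustration under stated assumptions: the abstract does not say how nominations are computed, `two_hop_score` is a toy stand-in for a KGC model (it is not RuleN), and all entity and relation names are hypothetical.

```python
def two_hop_score(triples, secret):
    """Toy stand-in for a KGC model: count two-hop paths head -> x -> tail."""
    head, _, tail = secret
    out_of_head = {t for h, _, t in triples if h == head}
    into_tail = {h for h, _, t in triples if t == tail}
    return len(out_of_head & into_tail)


def nominate(public, review, secret, score=two_hop_score):
    """Return the review triple whose removal lowers the secret's score the most."""
    full = set(public) | set(review)
    baseline = score(full, secret)
    # Only review-set triples may be withheld; public triples are fixed.
    return max(review, key=lambda t: baseline - score(full - {t}, secret))


public = [("a", "r1", "x"), ("y", "r2", "b")]
review = [("x", "r2", "b"), ("a", "r1", "z")]
secret = ("a", "works_with", "b")

nominated = nominate(public, review, secret)
print(nominated)
```

In this toy graph the only path supporting the secret runs through `("x", "r2", "b")`, so withholding that triple removes the evidence while the other review triple can be released.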

Our ultimate goal is to perform redaction directly on text. We will explore two sets of techniques, one building on traditional Multi-Hop Question Answering (MHQA) and a second using Large Language Models (LLMs), which now constitute a major element of text-based inference methods. Both approaches to MHQA typically operate over limited document sets, so a retrieval step is needed for preselection. This retrieval step adds challenges because we must accommodate redundant information spread across the collection. We can evaluate nomination generalizability across model classes and the impact of alternative retrieval approaches using the same confound ranking technique, but ultimately we will also need absolute measures of effectiveness, not just relative comparisons, because we must balance the benefit of releasing information with the cost imposed by the risk of revealing a secret.

While our work begins the exploration of the mosaicing problem, it has limitations. We must use analogs for our problem as working with classified information is challenging in access and distribution. While these are selected to serve as reasonable representations of our problem, they will exhibit differences from the actual classified datasets. Furthermore, the performance of the model classes used for inference in both the text and KG scenarios may not generalize against novel approaches developed in the future. The framework established in testing the current models would still be applicable but would have to be rerun with these new classes of models.

Personalized Large Language Models through Parameter Efficient Fine-Tuning Techniques

Marco Braga

Personalization of the search experience according to the users and their context is an important topic in Information Retrieval (IR) that the research community has studied for a long time. The IR field has witnessed a transformation with the recent availability of pre-trained Large Language Models. Typically, personalization requires the model to incorporate user-specific information, either through an appropriate prompt or by injecting user knowledge into the model and then fine-tuning it. However, with prompting, we do not know where and how much the model is personalizing the output. Furthermore, fine-tuning such systems is computationally expensive: since they are characterized by billions of parameters, the fine-tuning process introduces profound computational challenges. For these reasons, we propose a novel approach that combines personalization and Parameter Efficient Fine-Tuning methods.

Query Performance Prediction for Conversational Search and Beyond

Chuan Meng

Query performance prediction (QPP) is a key task in information retrieval (IR) [1]. The QPP task is to estimate the retrieval quality of a search system for a query without human relevance judgments. In summary, I aim to solve 4 limitations identified in previous QPP studies: I have published 3 papers that address 3 of these limitations, while the remaining one is the focus of my future work.

While extensively explored for traditional ad-hoc search, QPP for conversational search (CS) [4] has been little studied. I have identified limitation 1 in previous QPP studies: There is a lack of a comprehensive investigation into how well existing QPP methods designed for ad-hoc search perform in the context of CS. To fill this research gap, I have conducted a comprehensive reproducibility study [5], where I examined various QPP methods that were designed for ad-hoc search in the CS setting. I have made the code and data publicly available on https://github.com/ChuanMeng/QPP4CS.

Moreover, I have identified limitation 2 in previous studies on QPP for CS: There is a lack of research in investigating and leveraging the CS-specific features that do not exist in ad-hoc search to improve QPP quality for CS. I have authored a paper to fill this research gap [3]. Specifically, my empirical analysis indicates a correlation between query rewriting quality in CS and the actual retrieval quality. Based on this finding, I have proposed a perplexity-based pre-retrieval QPP framework (PPL-QPP) for CS, which integrates query rewriting quality into existing QPP methods. Experimental results show that PPL-QPP improves QPP quality.

Beyond the scope of QPP for CS, I have identified drawbacks in general QPP methods. Existing QPP methods typically return a single scalar value that indicates the retrieval quality, which results in two issues: (i) relying on a single value to represent different IR metrics leads to a "one size fits all" issue, and (ii) a single value constrains the interpretability of QPP. Thus, I have identified limitation 3: there is a shortage of QPP methods that are capable of effectively predicting various IR evaluation metrics while maintaining interpretability. To address the limitation, I have proposed a QPP framework using automatically generated relevance judgments (QPP-GenRE); it decomposes QPP into independent subtasks of judging the relevance of each item in a ranked list to a given query [6]. QPP-GenRE enables the prediction of any IR metric using generated relevance judgments as pseudo-labels, and enables the interpretation of predicted IR metrics based on generated judgments. I have fine-tuned an open-source large language model (LLM) for judging relevance. Experimental results show that QPP-GenRE achieves state-of-the-art QPP quality; my fine-tuned LLM demonstrates a high relevance judgment agreement with human assessors. I have made the code and data publicly available on https://github.com/ChuanMeng/QPP-GenRE.
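The step from generated relevance judgments to predicted metric values can be illustrated with a short sketch. It assumes binary pseudo-labels (1 = judged relevant, 0 = not) listed in rank order; the function names and the example labels are hypothetical and are not taken from the QPP-GenRE code base.

```python
def precision_at_k(judgments, k):
    """Fraction of the top-k ranked items whose generated label is relevant."""
    if k <= 0:
        return 0.0
    return sum(judgments[:k]) / k


def reciprocal_rank(judgments):
    """1 / rank of the first item judged relevant, or 0.0 if there is none."""
    for rank, rel in enumerate(judgments, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


# Hypothetical generated judgments for one query's top-5 ranked list.
pseudo_labels = [1, 0, 1, 0, 0]

predicted_p5 = precision_at_k(pseudo_labels, 5)
predicted_rr = reciprocal_rank(pseudo_labels)
print(predicted_p5, predicted_rr)
```

Because the per-item labels are kept, a predicted value can be traced back to the items that produced it, which is the interpretability benefit the abstract describes. Metrics that depend on the total number of relevant items would need judgments deeper in the ranked list than this sketch covers.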

As part of my future work, I plan to solve limitation 4: No study has explored the application of QPP in retrieval-augmented generation (RAG) to predict when not to rely on low-quality retrieved items that have the potential to hurt RAG's text generation.

Towards a Framework for Legal Case Retrieval

Tebo Leburu-Dingalo

Legal case reports detail the main points of a decided case, findings and decisions of the court. The reports are a fundamental source for Case law, a law which requires judges to align their rulings with previous judicial decisions on similar cases [1]. Timely and reliable access to case reports is thus of critical importance to legal practitioners working on a current case, and laymen interested in the outcome of cases. However, ensuring effective retrieval of previous case reports is still proving a challenge, even with the use of retrieval technologies already proven effective in other Information Retrieval (IR) domains. This has been attributed to factors such as lack of structure, and lengthiness of case report documents, and queries formulated to represent an ongoing case for which the reports are being sought [4].

To address these factors, we propose an IR framework that focuses on infusing structure into the documents and queries through the identification of legal rhetorical roles such as arguments and facts in the text. Furthermore, we aim to explore the use of selected groupings of these rhetorical roles as representations for the documents and queries. The benefit of using selected content is illustrated in recent research where, for instance, segments of documents such as abstracts, case headers, specific paragraphs, and sentences have been used to build effective legal IR systems. We thus hypothesize that we can attain markedly improved performance when we build a case retrieval system using only a section of a case report or a query, such as arguments or facts. However, in contrast to these studies, we posit that utilizing rhetorical role information to extract content will lead to more effective representations that can enhance the performance of case retrieval systems.

The proposed framework will consist of a set of components needed to process both query and case report text: first to infuse structure, then to extract effective representative content, and finally to perform retrieval. To aid the development of the framework, several empirical investigations will be conducted on publicly accessible datasets and on a self-curated test collection derived from Botswana legal case reports.

Key research questions to assist in our investigation are as follows:

RQ1: Can we successfully detect the implicit elements of a legal text reflecting rhetorical roles significant to legal case documents?
RQ2: In comparison to human formulated queries, do whole case queries give better performance?
RQ3: Can we improve retrieval performance by only retaining textual units representing specific rhetorical roles from an entire query text (current case)?
RQ4: Does indexing only textual units representing specific rhetorical roles from prior case documents improve retrieval performance?
RQ5: Do the selected approaches result in performance improvement for our local corpus in terms of precision, recall and user satisfaction?

Some preliminary work has been done and published towards investigating the viability of using summaries in legal case retrieval and the identification of rhetorical roles in case documents. We submitted results of a system that utilized expanded summarized queries for an AILA precedent retrieval task competition, and it outperformed other submissions [2]. Furthermore, our approach that utilized TagCrowd for summarization performed well on a task of Statute retrieval [5]. To assess the feasibility of rhetorically labelling legal text, we experimented with the fastText classifier for an AILA organized task. While our methods did not attain state-of-the-art results, they gave insights into the performance of the different roles and the factors that can affect performance in the task [3].

SIGIR 2024 Proceedings