publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2024
- Jonathan Stray, Alon Halevy, Parisa Assar, Dylan Hadfield-Menell, Craig Boutilier, Amar Ashar, Chloe Bakalar, Lex Beattie, Michael Ekstrand, Claire Leibowicz, Connie Moon Sehat, Sara Johansen, Lianne Kerlin, David Vickrey, Spandana Singh, Sanne Vrijenhoek, Amy Zhang, Mckane Andrus, Natali Helberger, Polina Proutskova, Tanushree Mitra, and Nina Vasan. ACM Transactions on Recommender Systems, Jun 2024
Recommender systems are the algorithms which select, filter, and personalize content across many of the world’s largest platforms and apps. As such, their positive and negative effects on individuals and on societies have been extensively theorized and studied. Our overarching question is how to ensure that recommender systems enact the values of the individuals and societies that they serve. Addressing this question in a principled fashion requires technical knowledge of recommender design and operation, and also critically depends on insights from diverse fields including social science, ethics, economics, psychology, policy and law. This paper is a multidisciplinary effort to synthesize theory and practice from different perspectives, with the goal of providing a shared language, articulating current design approaches, and identifying open problems. We collect a set of values that seem most relevant to recommender systems operating across different domains, then examine them from the perspectives of current industry practice, measurement, product design, and policy approaches. Important open problems include multi-stakeholder processes for defining values and resolving trade-offs, better values-driven measurements, recommender controls that people use, non-behavioral algorithmic feedback, optimization for long-term outcomes, causal inference of recommender effects, academic-industry research collaborations, and interdisciplinary policy-making.
- Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Alexander D’Amour, and Chenhao Tan. May 2024
The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.
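For readers unfamiliar with the frameworks this review covers, the sketch below computes demographic parity, one of the group-fairness criteria mentioned, on a toy binary classifier. The data, decision threshold, and group labels are all invented for illustration; the paper's point is precisely that such metrics face inherent limits when carried over to LLMs.

```python
# Sketch: demographic parity difference, one of the group-fairness metrics
# reviewed in the paper. Scores, groups, and the threshold are all synthetic.
import numpy as np

rng = np.random.default_rng(1)
scores = rng.uniform(size=1000)              # model scores for 1000 instances
group = rng.choice(["g1", "g2"], size=1000)  # hypothetical sensitive attribute
pred = scores > 0.5                          # assumed decision threshold

# Demographic parity compares positive-prediction rates across groups.
rate = {g: pred[group == g].mean() for g in ("g1", "g2")}
dp_gap = abs(rate["g1"] - rate["g2"])
print(f"positive rate by group: {rate}")
print(f"demographic parity gap: {dp_gap:.3f}")
```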
- Michael D. Ekstrand, Ben Carterette, and Fernando Diaz. ACM Transactions on Recommender Systems, Mar 2024
Current practice for evaluating recommender systems typically focuses on point estimates of user-oriented effectiveness metrics or business metrics, sometimes combined with additional metrics for considerations such as diversity and novelty. In this paper, we argue for the need for researchers and practitioners to attend more closely to various distributions that arise from a recommender system (or other information access system) and the sources of uncertainty that lead to these distributions. One immediate implication of our argument is that both researchers and practitioners must report and examine more thoroughly the distribution of utility between and within different stakeholder groups. However, distributions of various forms arise in many more aspects of the recommender systems experimental process, and distributional thinking has substantial ramifications for how we design, evaluate, and present recommender systems evaluation and research results. Leveraging and emphasizing distributions in the evaluation of recommender systems is a necessary step to ensure that the systems provide appropriate and equitably-distributed benefit to the people they affect.
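As a rough illustration of the distributional reporting this paper advocates (not the authors' own code), the following sketch moves from a single mean metric to a bootstrap distribution of that mean and to per-group percentiles; the per-user nDCG values and group labels are synthetic.

```python
# Sketch: reporting distributions of per-user utility, overall and between
# two hypothetical user groups, instead of a single point estimate.
import numpy as np

rng = np.random.default_rng(7)
ndcg = rng.beta(2, 5, size=500)                     # synthetic per-user metric
group = rng.choice(["A", "B"], size=500, p=[0.7, 0.3])

print(f"point estimate (mean nDCG): {ndcg.mean():.3f}")

# Bootstrap the mean to expose the uncertainty hidden by the point estimate.
boot = [rng.choice(ndcg, ndcg.size, replace=True).mean() for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")

# Then report how utility is distributed within and between groups.
for g in ("A", "B"):
    q = np.percentile(ndcg[group == g], [10, 50, 90])
    print(f"group {g}: 10th/50th/90th percentile nDCG = {np.round(q, 3)}")
```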
- Michael D. Ekstrand, Lex Beattie, Maria Soledad Pera, and Henriette Cramer. In Proceedings of the 46th European Conference on Information Retrieval, Mar 2024
Information Retrieval (IR) systems have a wide range of impacts on consumers. We offer maps to help identify goals IR systems could—or should—strive for, and guide the process of scoping how to gauge a wide range of consumer-side impacts and the possible interventions needed to address these effects. Grounded in prior work on scoping algorithmic impact efforts, our goal is to promote and facilitate research that (1) is grounded in impacts on information consumers, contextualizing these impacts in the broader landscape of positive and negative consumer experience; (2) takes a broad view of the possible means of changing or improving that impact, including non-technical interventions; and (3) uses operationalizations and strategies that are well-matched to the technical, social, ethical, legal, and other dimensions of the specific problem in question.
- Ngozi Ihemelandu and Michael D. Ekstrand. In Proceedings of the 46th European Conference on Information Retrieval, Mar 2024
While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that control the false discovery rate (FDR).
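For readers who want a concrete picture of an FDR-controlling procedure of the kind studied here, the following sketch runs paired t-tests of several systems against a baseline and adjusts the p-values with Benjamini-Hochberg; the per-user scores and system names are synthetic, and this is not the paper's exact experimental pipeline.

```python
# Sketch: FDR-controlled comparison of several systems against a baseline.
# All data is synthetic; the paper's experiments use TREC and recommender data.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_users = 200
baseline = rng.beta(2, 5, n_users)  # per-user metric (e.g., nDCG) for baseline
systems = {f"sys{i}": baseline + rng.normal(0.01 * i, 0.05, n_users)
           for i in range(1, 6)}    # five hypothetical competing systems

# One paired t-test per system yields one p-value per comparison.
pvals = [ttest_rel(scores, baseline).pvalue for scores in systems.values()]

# Benjamini-Hochberg controls the false discovery rate across the family of
# comparisons, in contrast to familywise-error methods such as Bonferroni.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for name, p, r in zip(systems, p_adj, reject):
    print(f"{name}: adjusted p = {p:.4f}, significant = {r}")
```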
- In Proceedings of the 46th European Conference on Information Retrieval, Mar 2024
Information access systems, such as search engines and recommender systems, order and position results based on their estimated relevance. These results are then evaluated for a range of concerns, including provider-side fairness: whether exposure to users is fairly distributed among items and the people who created them. Several fairness-aware ranking and re-ranking techniques have been proposed to ensure fair exposure for providers, but this work focuses almost exclusively on linear layouts in which items are displayed in a single ranked list. Many widely-used systems use other layouts, such as the grid views common in streaming platforms, image search, and other applications. Providing fair exposure to providers in such layouts is not well-studied. We seek to fill this gap by providing a grid-aware re-ranking algorithm to optimize layouts for provider-side fairness by adapting existing re-ranking techniques to grid-aware browsing models, and an analysis of the effect of grid-specific factors such as device size on the resulting fairness optimization.
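One ingredient the abstract refers to is a grid-aware browsing model that assigns exposure to grid slots. The sketch below is a plausible stand-in rather than the paper's model: it assumes row-major filling, logarithmic attention decay down rows, and a milder assumed decay across columns, and shows how column count, a proxy for device size, reshapes exposure.

```python
# Sketch: exposure weights for a grid layout under an assumed browsing model
# (row-wise scanning, attention decaying down rows and across columns).
import numpy as np

def grid_exposure(n_items, n_cols):
    """Exposure weight for each ranked slot when items fill a grid row-major."""
    slots = np.arange(n_items)
    rows, cols = slots // n_cols, slots % n_cols
    row_w = 1.0 / np.log2(rows + 2)    # log decay down rows (nDCG-style)
    col_w = 1.0 / (1.0 + 0.25 * cols)  # milder decay across a row (assumed)
    return row_w * col_w

# Device size changes the number of columns, which changes how much exposure
# each ranked position receives, a grid-specific fairness factor.
for n_cols in (2, 4, 6):
    w = grid_exposure(12, n_cols)
    print(f"{n_cols} cols: {np.round(w / w.sum(), 3)}")
```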
2023
- Alexandra Olteanu, Michael Ekstrand, Carlos Castillo, and Jina Suh. Nov 2023
All types of research, development, and policy work can have unintended, adverse consequences; work in responsible artificial intelligence (RAI), ethical AI, or ethics in AI is no exception.
- Ngozi Ihemelandu and Michael D. Ekstrand. In Proceedings of the 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, Oct 2023
The strategy for selecting candidate sets – the set of items that the recommendation system is expected to rank for each user – is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user’s test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data that is typically unavailable in ordinary experiments.
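A minimal sketch of the candidate-set construction the abstract describes: each user's candidates are the union of their test items and sampled decoys, drawn either uniformly or popularity-weighted. All item IDs and popularity counts here are synthetic, and the sampling choices are illustrative assumptions rather than the paper's protocol.

```python
# Sketch: building a per-user candidate set as the union of the user's test
# items and sampled decoys. All IDs and popularity counts are synthetic.
import numpy as np

rng = np.random.default_rng(0)
catalog = np.arange(10_000)                    # all item IDs
popularity = rng.zipf(1.5, size=catalog.size)  # synthetic interaction counts

def candidate_set(test_items, n_decoys=100, popularity_weighted=False):
    """Union of the user's test items and n_decoys non-relevant decoy items."""
    eligible = np.setdiff1d(catalog, test_items)
    if popularity_weighted:
        # Sampling decoys proportional to popularity makes the candidate set
        # harder and interacts with popularity bias in the resulting metrics.
        w = popularity[eligible].astype(float)
        p = w / w.sum()
    else:
        p = None  # uniform sampling over non-relevant items
    decoys = rng.choice(eligible, size=n_decoys, replace=False, p=p)
    return np.union1d(test_items, decoys)

test_items = np.array([3, 17, 256])
print(candidate_set(test_items, n_decoys=10))
print(candidate_set(test_items, n_decoys=10, popularity_weighted=True))
```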