publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2025
- Michael D. Ekstrand, Afsaneh Razi, Aleksandra Sarcevic, Maria Soledad Pera, Robin Burke, and Katherine Landau Wright. ACM Trans. Recomm. Syst., Aug 2025. Just Accepted
Recommender systems are usually designed by engineers, researchers, designers, and other members of development teams. These systems are then evaluated based on goals set by the aforementioned teams and other business units of the platforms operating the recommender systems. This design approach emphasizes the designers’ vision for how the system can best serve the interests of users, providers, businesses, and other stakeholders. Although designers may be well-informed about user needs through user experience and market research, they are still the arbiters of the system’s design and evaluation, with other stakeholders’ interests less emphasized in user-centered design and evaluation. When extended to recommender systems for social good, this approach results in systems that reflect the social objectives as envisioned by the designers and evaluated as the designers understand them. Instead, social goals and operationalizations should be developed through participatory and democratic processes that are accountable to their stakeholders. We argue that recommender systems aimed at improving social good should be designed by and with, not just for, the people who will experience their benefits and harms. That is, they should be designed in collaboration with their users, creators, and other stakeholders as full co-designers, not only as user study participants.
- Fernando Diaz, Michael D. Ekstrand, and Bhaskar Mitra. ACM Trans. Recomm. Syst., Jul 2025
Although originally developed to evaluate sets of items, recall is often used to evaluate rankings of items, including those produced by recommender, retrieval, and other machine learning systems. The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure. In light of this debate, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define “recall orientation” as the sensitivity of a metric to a user interested in finding every relevant item. Second, we analyze recall orientation from the perspective of robustness with respect to possible content consumers and providers, connecting recall to recent conversations about fair ranking. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across multiple recommendation and retrieval tasks, we establish that our new evaluation method, lexirecall, has convergent validity (i.e., it is correlated with existing recall metrics) and exhibits substantially higher sensitivity in terms of discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
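To make the lexicographic idea concrete, here is a minimal sketch (not the paper's exact lexirecall procedure): two rankings are compared by the sorted positions of their relevant items, and the ranking that surfaces relevant items earlier is preferred. The item names are invented for illustration.

```python
def relevant_positions(ranking, relevant):
    """Sorted 1-based positions at which relevant items appear."""
    return sorted(i + 1 for i, item in enumerate(ranking) if item in relevant)

def lexi_prefer(ranking_a, ranking_b, relevant):
    """Prefer the ranking whose relevant items appear earlier, comparing
    position lists lexicographically. Assumes both rankings are full
    permutations of the same corpus, so the position lists have equal length."""
    pos_a = relevant_positions(ranking_a, relevant)
    pos_b = relevant_positions(ranking_b, relevant)
    if pos_a < pos_b:   # Python compares lists lexicographically
        return "a"
    if pos_b < pos_a:
        return "b"
    return "tie"

print(lexi_prefer(["x", "q", "y"], ["q", "x", "y"], relevant={"x", "y"}))  # -> a
```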
- Mohammad Namvarpour, Elham Aghakhani, Michael D. Ekstrand, Rezvaneh Rezapour, and Afsaneh Razi. In Proceedings of the 17th ACM Web Science Conference, May 2025
There have been various efforts to understand how youth online safety is reflected in the news, as news coverage plays an important role in shaping public opinion. However, these efforts have focused on specific contexts, such as individual countries or particular online risk types, leaving a need for a holistic view of news trends across the stakeholders involved and the full range of online risks. In this work, we seek to understand how discussions of online safety for youth have evolved in news publications over the last two decades. We applied quantitative media content analysis and sentiment analysis to 3.9K English-language news articles from 2002–2024, documenting shifts in the portrayal of key stakeholders. Our results showed increased media focus on technology companies and government in youth safety discussions, particularly highlighting cyberbullying as a key risk. We found a generally negative trend in sentiment toward the perceived safety of youth online, fluctuating with societal concerns and policy changes. The significance of this work lies in its analysis of how media discourse has illuminated public perceptions and policy directions concerning youth safety in digital spaces.
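As a rough illustration of this kind of pipeline (VADER sentiment scores averaged by year; the tiny data frame is invented, and the paper's actual coding scheme is richer than this):

```python
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

# Hypothetical article frame with 'year' and 'text' columns.
articles = pd.DataFrame({
    "year": [2002, 2002, 2024],
    "text": ["Parents praise new online safety tools.",
             "Cyberbullying cases rise sharply among teens.",
             "Regulators crack down on social platforms."],
})
# Compound score in [-1, 1]; average per year gives a sentiment trend.
articles["compound"] = articles["text"].map(lambda t: sia.polarity_scores(t)["compound"])
print(articles.groupby("year")["compound"].mean())
```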
- Jacy Reese Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, and Chenhao Tan. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, May 2025
Large language models (LLMs) have great potential for social benefit, but their general-purpose capabilities have raised pressing questions about bias and fairness. Researchers have documented significant disparities in model output when different demographics are specified, but it remains unclear how more systematic fairness metrics—those developed in technical frameworks such as group fairness and fair representations—can be applied. In this position paper, we analyze each framework and find inherent challenges that make the development of a generally fair LLM intractable. We show that each framework either does not logically extend to the general-purpose LLM context or is infeasible in practice, primarily due to the large amounts of unstructured data and the many potential combinations of human populations, use cases, and sensitive attributes. These inherent challenges would persist even if empirical roadblocks were overcome, but there are still promising practical directions, particularly the development of context-specific evaluations, standards for the responsibility of LLM developers, and methods for iterative and participatory evaluation.
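For readers unfamiliar with the frameworks under discussion, a minimal sketch of one classic group-fairness metric (a demographic parity gap over binary decisions) shows the structured setting such metrics assume, and which open-ended LLM output typically lacks. The data here is invented for illustration.

```python
import numpy as np

def demographic_parity_gap(decisions, groups):
    """Largest difference in positive-decision rate between any two groups.
    decisions: 0/1 model outputs; groups: parallel array of group labels."""
    decisions, groups = np.asarray(decisions), np.asarray(groups)
    rates = [decisions[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

# Group "a" receives positive decisions at 2/3, group "b" at 1/3.
print(demographic_parity_gap([1, 1, 0, 1, 0, 0],
                             ["a", "a", "a", "b", "b", "b"]))  # ~0.33
```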
- Samira Vaez Barenji, Sushobhan Parajuli, and Michael D. Ekstrand. In Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, May 2025
Data is an essential resource for studying recommender systems. While there has been significant work on improving and evaluating state-of-the-art models and measuring various properties of recommender system outputs, less attention has been given to the data itself, particularly how data has changed over time. Such documentation and analysis provide guidance and context for designing and evaluating recommender systems, particularly for evaluation designs making use of time (e.g., temporal splitting). In this paper, we present a temporal exploratory analysis of the UCSD Book Graph dataset scraped from Goodreads, a social reading and recommendation platform active since 2006. We measure the book interaction data using a set of activity, diversity, and fairness metrics; we then train a set of collaborative filtering algorithms on rolling training windows to observe how the same measures evolve over time in the recommendations. Additionally, we explore whether the introduction of algorithmic recommendations in 2011 was followed by observable changes in user or recommender system behavior.
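A rough sketch of the rolling-window design described above (the column name and window sizes are assumptions for illustration, not the paper's exact protocol):

```python
import pandas as pd

def rolling_windows(interactions, train_months=12, test_months=3):
    """Yield (train, test) splits from a timestamped interaction log.
    Assumes a datetime 'timestamp' column; window sizes are illustrative."""
    ts = interactions["timestamp"]
    cursor, end = ts.min(), ts.max()
    while cursor + pd.DateOffset(months=train_months) < end:
        train_end = cursor + pd.DateOffset(months=train_months)
        test_end = train_end + pd.DateOffset(months=test_months)
        yield (interactions[(ts >= cursor) & (ts < train_end)],
               interactions[(ts >= train_end) & (ts < test_end)])
        cursor += pd.DateOffset(months=test_months)  # slide the window forward

log = pd.DataFrame({"user": [1, 2, 1, 3],
                    "item": [10, 11, 12, 13],
                    "timestamp": pd.to_datetime(
                        ["2007-01-05", "2007-06-01", "2008-02-10", "2008-05-20"])})
for train, test in rolling_windows(log):
    print(len(train), "train rows /", len(test), "test rows")
```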
2024
- Andres Ferraro, Michael D. Ekstrand, and Christine Bauer. In Proceedings of the 18th ACM Conference on Recommender Systems, Oct 2024
As recommender systems are prone to various biases, mitigation approaches are needed to ensure that recommendations are fair to various stakeholders. One particular concern in music recommendation is artist gender fairness. Recent work has shown that the gender imbalance in the sector translates to the output of music recommender systems, creating a feedback loop that can reinforce gender biases over time. In this work, we examine that feedback loop to study whether algorithmic strategies or user behavior is the greater contributor to ongoing improvement (or loss) in fairness as models are repeatedly re-trained on new user feedback data. We simulate user interaction and re-training to investigate the effects of ranking strategies and user choice models on gender fairness metrics. We find that re-ranking strategies have a greater effect than user choice models on recommendation fairness over time.
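The simulation loop can be sketched in miniature. This toy version (an invented catalog, a popularity "model", and a simple choice rule) only illustrates the recommend, choose, re-train cycle, not the paper's actual choice models or re-rankers.

```python
import random
from collections import Counter

random.seed(0)
# Invented catalog: every fourth artist labeled "F" to mimic sector imbalance.
items = {i: ("F" if i % 4 == 0 else "M") for i in range(100)}
plays = Counter({i: random.randint(1, 10) for i in items})

def recommend(n=10):
    return [i for i, _ in plays.most_common(n)]         # popularity "model"

def rerank(recs):
    return sorted(recs, key=lambda i: items[i] != "F")  # push "F" items forward

for _ in range(200):                                    # recommend -> choose -> retrain
    recs = rerank(recommend())
    pick = recs[0] if random.random() < 0.7 else random.choice(recs)
    plays[pick] += 1                                    # feedback updates the model

top10 = recommend(10)
print("female-artist share of top 10:", sum(items[i] == "F" for i in top10) / 10)
```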
- Jonathan Stray, Alon Halevy, Parisa Assar, Dylan Hadfield-Menell, Craig Boutilier, Amar Ashar, Chloe Bakalar, Lex Beattie, Michael Ekstrand, Claire Leibowicz, Connie Moon Sehat, Sara Johansen, Lianne Kerlin, David Vickrey, Spandana Singh, Sanne Vrijenhoek, Amy Zhang, Mckane Andrus, Natali Helberger, Polina Proutskova, Tanushree Mitra, and Nina Vasan. ACM Transactions on Recommender Systems, Jun 2024
Recommender systems are the algorithms which select, filter, and personalize content across many of the world’s largest platforms and apps. As such, their positive and negative effects on individuals and on societies have been extensively theorized and studied. Our overarching question is how to ensure that recommender systems enact the values of the individuals and societies that they serve. Addressing this question in a principled fashion requires technical knowledge of recommender design and operation, and also critically depends on insights from diverse fields including social science, ethics, economics, psychology, policy and law. This paper is a multidisciplinary effort to synthesize theory and practice from different perspectives, with the goal of providing a shared language, articulating current design approaches, and identifying open problems. We collect a set of values that seem most relevant to recommender systems operating across different domains, then examine them from the perspectives of current industry practice, measurement, product design, and policy approaches. Important open problems include multi-stakeholder processes for defining values and resolving trade-offs, better values-driven measurements, recommender controls that people use, non-behavioral algorithmic feedback, optimization for long-term outcomes, causal inference of recommender effects, academic-industry research collaborations, and interdisciplinary policy-making.
- Jacy Anthis, Kristian Lum, Michael Ekstrand, Avi Feller, Alexander D’Amour, and Chenhao Tan. May 2024
The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.
- Michael D. Ekstrand, Ben Carterette, and Fernando Diaz. ACM Transactions on Recommender Systems, Mar 2024
Current practice for evaluating recommender systems typically focuses on point estimates of user-oriented effectiveness metrics or business metrics, sometimes combined with additional metrics for considerations such as diversity and novelty. In this paper, we argue for the need for researchers and practitioners to attend more closely to various distributions that arise from a recommender system (or other information access system) and the sources of uncertainty that lead to these distributions. One immediate implication of our argument is that both researchers and practitioners must report and examine more thoroughly the distribution of utility between and within different stakeholder groups. However, distributions of various forms arise in many more aspects of the recommender systems experimental process, and distributional thinking has substantial ramifications for how we design, evaluate, and present recommender systems evaluation and research results. Leveraging and emphasizing distributions in the evaluation of recommender systems is a necessary step to ensure that the systems provide appropriate and equitably-distributed benefit to the people they affect.
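A small sketch of what attending to distributions can mean in practice: report quantiles of per-user utility and bootstrap the mean rather than a bare point estimate. The scores below are synthetic, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
per_user_ndcg = rng.beta(2, 5, size=1000)   # synthetic per-user utility scores

# A point estimate hides who benefits; report the spread as well.
print("mean nDCG:", per_user_ndcg.mean().round(3))
print("10/50/90% quantiles:", np.quantile(per_user_ndcg, [0.1, 0.5, 0.9]).round(3))

# Bootstrap the mean to surface sampling uncertainty.
boots = [rng.choice(per_user_ndcg, per_user_ndcg.size, replace=True).mean()
         for _ in range(2000)]
print("95% CI for the mean:", np.quantile(boots, [0.025, 0.975]).round(3))
```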
- Ngozi Ihemelandu and Michael D. Ekstrand. In Proceedings of the 46th European Conference on Information Retrieval, Mar 2024
While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to an inflated rate of false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommender system evaluation data as well as multiple comparison procedures that control the false discovery rate (FDR).
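For example, a Benjamini-Hochberg adjustment, one standard FDR-controlling procedure, can be applied to a family of pairwise p-values with statsmodels. The p-values below are invented for illustration.

```python
from statsmodels.stats.multitest import multipletests

# p-values from pairwise comparisons of several systems against a baseline.
pvals = [0.001, 0.008, 0.012, 0.041, 0.09, 0.20]

# Benjamini-Hochberg controls the false discovery rate (FDR), a less strict
# criterion than the familywise error rate controlled by e.g. Bonferroni.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for p, q, r in zip(pvals, p_adj, reject):
    print(f"p={p:.3f}  adjusted={q:.3f}  significant={r}")
```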
- In Proceedings of the 46th European Conference on Information Retrieval, Mar 2024
Information access systems, such as search engines and recommender systems, order and position results based on their estimated relevance. These results are then evaluated for a range of concerns, including provider-side fairness: whether exposure to users is fairly distributed among items and the people who created them. Several fairness-aware ranking and re-ranking techniques have been proposed to ensure fair exposure for providers, but this work focuses almost exclusively on linear layouts in which items are displayed in a single ranked list. Many widely-used systems use other layouts, such as the grid views common in streaming platforms, image search, and other applications. Providing fair exposure to providers in such layouts is not well-studied. We seek to fill this gap by providing a grid-aware re-ranking algorithm to optimize layouts for provider-side fairness by adapting existing re-ranking techniques to grid-aware browsing models, and an analysis of the effect of grid-specific factors such as device size on the resulting fairness optimization.
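One way to see why grids need their own treatment: a sketch of a row-major browsing model in which attention decays by row, so column count (a device-size proxy) changes how exposure is spread. This is an illustrative stand-in, not the browsing models used in the paper.

```python
import numpy as np

def grid_exposure(n_items, n_cols, row_decay=0.7):
    """Illustrative exposure weights for a grid browsed row by row: slots in
    a row share that row's attention, and attention decays geometrically
    with each successive row (a stand-in browsing model)."""
    n_rows = -(-n_items // n_cols)  # ceiling division
    weights = np.repeat(row_decay ** np.arange(n_rows), n_cols)[:n_items]
    return weights / weights.sum()

# A 2-column phone pushes items into lower rows faster than a 6-column TV,
# concentrating exposure on fewer items: one device-size effect.
print(grid_exposure(12, n_cols=2).round(3))
print(grid_exposure(12, n_cols=6).round(3))
```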
- Michael D. Ekstrand, Lex Beattie, Maria Soledad Pera, and Henriette Cramer. In Proceedings of the 46th European Conference on Information Retrieval, Mar 2024
Information Retrieval (IR) systems have a wide range of impacts on consumers. We offer maps to help identify goals IR systems could—or should—strive for, and guide the process of scoping how to gauge a wide range of consumer-side impacts and the possible interventions needed to address these effects. Grounded in prior work on scoping algorithmic impact efforts, our goal is to promote and facilitate research that (1) is grounded in impacts on information consumers, contextualizing these impacts in the broader landscape of positive and negative consumer experience; (2) takes a broad view of the possible means of changing or improving that impact, including non-technical interventions; and (3) uses operationalizations and strategies that are well-matched to the technical, social, ethical, legal, and other dimensions of the specific problem in question.
2023
- Alexandra Olteanu, Michael Ekstrand, Carlos Castillo, and Jina Suh. Nov 2023
All types of research, development, and policy work can have unintended, adverse consequences; work in responsible artificial intelligence (RAI), ethical AI, or ethics in AI is no exception.
- Ngozi Ihemelandu and Michael D. Ekstrand. In Proceedings of the 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, Oct 2023
The strategy for selecting candidate sets – the set of items that the recommendation system is expected to rank for each user – is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user’s test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets yield metric estimates that are less biased with respect to the true values that would be computed under complete data, which is typically unavailable in ordinary experiments.
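A minimal sketch of the candidate-set construction described above, using uniform decoy sampling, one common strategy (the function and variable names here are ours, for illustration):

```python
import random

def candidate_set(test_items, all_items, interacted, n_decoys=100, rng=random):
    """Union of a user's held-out test items with uniformly sampled decoys
    (items the user never interacted with). The paper studies how such
    sampling strategies interact with popularity bias."""
    pool = [i for i in all_items if i not in interacted]
    decoys = rng.sample(pool, min(n_decoys, len(pool)))
    return set(test_items) | set(decoys)

all_items = range(1000)
interacted = {1, 2, 3, 42}          # includes the held-out test item 42
print(len(candidate_set({42}, all_items, interacted, n_decoys=100)))  # 101
```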