Lucy Li

PhD Candidate, University of California, Berkeley

lucy3_li [AT] berkeley.edu

About Me

I am a PhD student at the University of California, Berkeley, affiliated with Berkeley Artificial Intelligence Research (BAIR) and the School of Information. My research intersects natural language processing with computational sociolinguistics and digital humanities (e.g. cultural analytics), and a significant segment of my work emphasizes equity and fairness. I am advised by David Bamman.

Underpinning my research is the premise that nearly all language, whether AI- or human-generated, is sociocultural data. In particular, my work investigates how social groups are represented in language models and textual media such as books and online forums. I'm also interested in developing content analysis approaches that can be applied at scale, and I enjoy working with billion-word datasets. Though I publish primarily in computing venues, I'm passionate about bridging NLP with the humanities and social sciences, and my current collaborators span fields including psychology, education, and English literature.

My work has been recognized by EECS Rising Stars, Rising Stars in Data Science, an American Educational Research Association (AERA) Best Paper Award, and an NSF Graduate Research Fellowship. I've interned at Microsoft Research and the Allen Institute for AI, and at the latter I was awarded Outstanding Intern of the Year. Before my PhD, I completed my M.S. and B.S. at Stanford.

I am on the job market during the 2024-2025 academic year. Here is my CV.

Katie Keith, Naitian Zhou, and I have a podcast, Diaries of Social Data Research, where we chat with researchers on the process behind interdisciplinary papers.

Prospective PhD applicants, especially those from underrepresented backgrounds, are welcome to email me questions about the application process or the PhD experience.

Pronouns: she/her

Publications

I publish with my name backwards, so citations should refer to "L. Lucy". I do this because my last name is one of the most common in the world, researchers are often recognized and remembered by last name, and computer vision researcher Fei-Fei Li does this, too. More thoughts from others about names and academia are here.

*equal contribution.

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters.

Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge.

Association for Computational Linguistics (ACL) 2024.

"One-Size-Fits-All"? Examining Expectations around What Constitute "Fair" or "Good" NLG System Behaviors.

Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu.

North American Chapter of the Association for Computational Linguistics (NAACL) 2024.

Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications.

Li Lucy, Jesse Dodge, David Bamman, Katherine A. Keith.

Findings of the Association for Computational Linguistics (ACL) 2023.

Discovering Differences in the Representation of People using Contextualized Semantic Axes.

Li Lucy, Divya Tadimeti, David Bamman.

Empirical Methods in Natural Language Processing (EMNLP) 2022.

Gender and Representation Bias in GPT-3 Generated Stories.

Li Lucy, David Bamman.

Workshop on Narrative Understanding (WNU) at the North American Chapter of the Association for Computational Linguistics (NAACL) 2021.

Characterizing English variation across social media communities with BERT.

Li Lucy, David Bamman.

Transactions of the Association for Computational Linguistics (TACL) 2021.

Content Analysis of Textbooks via Natural Language Processing: Findings on Gender, Race, and Ethnicity in Texas U.S. History Textbooks.

Li Lucy*, Dora Demszky*, Patricia Bromley, Dan Jurafsky.

American Educational Research Association (AERA) Open 2020.

On Classification with Large Language Models in Cultural Analytics.

David Bamman, Kent K. Chang, Li Lucy, Naitian Zhou.

Computational Humanities Research (CHR) 2024.

Evaluating Language Model Math Reasoning via Grounding in Educational Curricula.

Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, Kyle Lo.

Findings of Empirical Methods in Natural Language Processing (EMNLP) 2024.

"Othering" through War: Depiction of Asians/Asian Americans in U.S. History Textbooks from California and Texas.

Minju Choi*, Li Lucy*, Patricia Bromley, David Bamman.

Educational Researcher 2024 (forthcoming).

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo.

Association for Computational Linguistics (ACL) 2024.

Investigating Causal Effects of Instructions in Crowdsourced Claim Matching.

Emma Lurie, Li Lucy, Masha Belyi, Sofia Dewar, Daniel Rincón, John Baldwin, Rajvardhan Oak.

Computation + Journalism Symposium (C+J) 2020.

Using Sentiment Induction to Understand Variation in Gendered Online Communities.

Li Lucy, Julia Mendelsohn.

Society for Computation in Linguistics (SCiL) 2019.

Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning.

Li Lucy, Jon Gauthier.

Language Grounding for Robotics (RoboNLP) Workshop at the Association for Computational Linguistics (ACL) 2017.

Miscellaneous

I was born and raised in Minnesota. My cat's name is Toast. When I was a kid, I wanted to be an ornithologist and a fiction writer.

Looking for something to read? Check out this list of papers from subfields that care about social aspects of NLP.

Thanks to Martin Saveski for this website template.