Lucy Li

PhD Candidate, University of California, Berkeley

lucy3_li [AT] berkeley.edu

About Me

I am a PhD student at the University of California, Berkeley, affiliated with Berkeley Artificial Intelligence Research (BAIR) and the School of Information. My research intersects natural language processing (NLP) with computational social science and digital humanities (e.g. cultural analytics), with an emphasis on equity and fairness. I am advised by David Bamman.

Underpinning my research is the premise that nearly all language, whether AI- or human-generated, is social and cultural data. Though I publish primarily in NLP venues, my work is deeply interdisciplinary, and my collaborators span fields including psychology, education, and English literature.

I'm graduating in Spring 2025, and I'm on the academic job market. Here is my CV.

I'm on Bluesky.

Teaching, Mentoring, & Outreach:

Prospective undergraduate research assistants should apply for a URAP position in David Bamman's group instead of emailing me. These positions are posted at the beginning of each semester, and in the past, they have targeted students from a range of disciplinary backgrounds (e.g. EECS, media studies).

Looking for something to read? Check out this list of papers from subfields that care about social aspects of NLP.

Prospective PhD applicants, especially those from underrepresented backgrounds, are welcome to email me questions about the application process or the PhD experience.

Katie Keith, Naitian Zhou, and I have a podcast, Diaries of Social Data Research, where we chat with researchers about the process behind interdisciplinary papers.

Publications

I publish with my name backwards, so citations should refer to "L. Lucy". I do this because my last name is one of the most common in the world, researchers are often recognized and remembered by last name, and computer vision researcher Fei-Fei Li does this, too. More thoughts from others about names and academia are here.

*equal contribution.

On Classification with Large Language Models in Cultural Analytics.

David Bamman, Kent K. Chang, Li Lucy, Naitian Zhou.

Computational Humanities Research (CHR) 2024.

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters.

Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge.

Association for Computational Linguistics (ACL) 2024.

"One-Size-Fits-All"? Examining Expectations around What Constitute "Fair" or "Good" NLG System Behaviors.

Li Lucy, Su Lin Blodgett, Milad Shokouhi, Hanna Wallach, Alexandra Olteanu.

North American Chapter of the Association for Computational Linguistics (NAACL) 2024.

Words as Gatekeepers: Measuring Discipline-specific Terms and Meanings in Scholarly Publications.

Li Lucy, Jesse Dodge, David Bamman, Katherine A. Keith.

Findings of the Association for Computational Linguistics (ACL) 2023.

Discovering Differences in the Representation of People using Contextualized Semantic Axes.

Li Lucy, Divya Tadimeti, David Bamman.

Empirical Methods in Natural Language Processing (EMNLP) 2022.

Gender and Representation Bias in GPT-3 Generated Stories.

Li Lucy, David Bamman.

Workshop on Narrative Understanding (WNU) at the North American Chapter of the Association for Computational Linguistics (NAACL) 2021.

Characterizing English variation across social media communities with BERT.

Li Lucy, David Bamman.

Transactions of the Association for Computational Linguistics (TACL) 2021.

Content Analysis of Textbooks via Natural Language Processing: Findings on Gender, Race, and Ethnicity in Texas U.S. History Textbooks.

Li Lucy*, Dora Demszky*, Patricia Bromley, Dan Jurafsky.

American Educational Research Association (AERA) Open 2020.

Racial and Ethnic Representation in Literature Taught in US High Schools.

Li Lucy, Camilla Griffiths, Claire Ying, JJ Kim-Ebio, Sabrina Baur, Sarah Levine, Jennifer Eberhardt, David Bamman, Dora Demszky.

Journal of Cultural Analytics (forthcoming) 2025.

DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images.

Sami Baral*, Li Lucy*, Ryan Knight, Alice Ng, Luca Soldaini, Neil Heffernan, Kyle Lo.

NeurIPS Workshop on Mathematical Reasoning and AI 2024.

Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula.

Li Lucy, Tal August, Rose E. Wang, Luca Soldaini, Courtney Allison, Kyle Lo.

Findings of Empirical Methods in Natural Language Processing (EMNLP) 2024.

"Othering" through War: Depiction of Asians/Asian Americans in U.S. History Textbooks from California and Texas.

Minju Choi*, Li Lucy*, Patricia Bromley, David Bamman.

Educational Researcher (forthcoming) 2025.

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo.

Association for Computational Linguistics (ACL) 2024.

Investigating Causal Effects of Instructions in Crowdsourced Claim Matching.

Emma Lurie, Li Lucy, Masha Belyi, Sofia Dewar, Daniel Rincón, John Baldwin, Rajvardhan Oak.

Computation + Journalism Symposium (C+J) 2020.

Using Sentiment Induction to Understand Variation in Gendered Online Communities.

Li Lucy, Julia Mendelsohn.

Society for Computation in Linguistics (SCiL) 2019.

Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning.

Li Lucy, Jon Gauthier.

Language Grounding for Robotics (RoboNLP) Workshop at the Association for Computational Linguistics (ACL) 2017.

Miscellaneous

I was born and raised in Minnesota. When I was 7 years old, I wanted to be an ornithologist, and when I was 9, I wanted to be a fiction writer. I have a cat named Toast.

Thanks to Martin Saveski for this website template.