Readings
Table of contents
- Introduction
- Data & model pitfalls
- “Good” research practices
- Computational social science
- NLP for conversations
- NLP for communities
- Computational sociolinguistics
- Misinformation, factuality, and toxicity
- NLP for literature & history
- NLP for education & political science
- Quantifying bias across domains
- Biases in the LLM pipeline
- HCI & NLP
- NLP & HCI
Introduction
Required:
Nguyen, D., Liakata, M., DeDeo, S., Eisenstein, J., Mimno, D., Tromble, R., & Winters, J. (2020). How we do things with words: Analyzing text as social and cultural data. Frontiers in Artificial Intelligence, 3, 62.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021, March). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
Data & model pitfalls
Required:
Olteanu, A., Castillo, C., Diaz, F., & Kıcıman, E. (2019). Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2, 13.
Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020, July). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 6282-6293).
Fiesler, C., & Proferes, N. (2018). “Participant” perceptions of Twitter research ethics. Social Media + Society, 4(1), 2056305118763366.
D’Ignazio, C., & Klein, L. F. (2020). “What Gets Counted Counts.” In Data Feminism. MIT Press.
Optional:
Klein, L., & D’Ignazio, C. (2024). Data Feminism for AI. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency.
Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., … & Wallace, E. (2023). Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23) (pp. 5253-5270).
Metcalf, J., & Crawford, K. (2016). Where are human subjects in big data research? The emerging ethics divide. Big Data & Society, 3(1), 2053951716650211.
Baden, C., Pipal, C., Schoonvelde, M., & van der Velden, M. A. G. (2022). Three gaps in computational text analysis methods for social sciences: A research agenda. Communication Methods and Measures, 16(1), 1-18.
Jacobs, A. Z., & Wallach, H. (2021, March). Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 375-385).
Speer, R. (2017). “How to make a racist AI without really trying.” Blog post.
“Good” research practices
Required:
Olah, C. (2021). Research Taste Exercises. https://colah.github.io/notes/taste/
Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., & Bao, M. (2022). The Values Encoded in Machine Learning Research. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency.
Leins, K., Lau, J. H., & Baldwin, T. (2020, July). Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2908-2913).
Jakesch, M., Buçinca, Z., Amershi, S., & Olteanu, A. (2022, June). How different groups prioritize ethical values for responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 310-323).
Optional:
Green, B. (2021). Data Science as Political Action: Grounding Data Science in a Politics of Justice. Journal of Social Computing.
Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018, July). The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1383-1392).
Dodge, J., Gururangan, S., Card, D., Schwartz, R., & Smith, N. A. (2019, November). Show Your Work: Improved Reporting of Experimental Results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 2185-2194).
Computational social science
Required:
Wallach, H. (2018). Computational social science ≠ computer science + social data. Communications of the ACM, 61(3), 42-44.
Lazer, D. M., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., … & Wagner, C. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060-1062.
Ziems et al. (2023). Can Large Language Models Transform Computational Social Science? arXiv preprint.
Optional:
Evans, J. A., & Aceves, P. (2016). Machine translation: Mining text for social theory. Annual Review of Sociology, 42, 21-50.
Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). Computational social science and sociology. Annual Review of Sociology, 46, 61-81.
Lazer, D., Brewer, D., Christakis, N., Fowler, J., & King, G. (2009). Life in the network: The coming age of computational social science. Science, 323(5915), 721-723.
Bail, C. A. (2014). The cultural environment: Measuring culture with big data. Theory and Society, 43, 465-482.
Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.
Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). How to make causal inferences using texts. Science Advances, 8(42), eabg2652.
Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., … & Yang, D. (2022). Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10, 1138-1158.
NLP for conversations
Required:
Voigt, R., Camp, N. P., Prabhakaran, V., Hamilton, W. L., Hetey, R. C., Griffiths, C. M., … & Eberhardt, J. L. (2017). Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25), 6521-6526.
Althoff, T., Clark, K., & Leskovec, J. (2016). Large-scale analysis of counseling conversations: An application of natural language processing to mental health. Transactions of the Association for Computational Linguistics, 4, 463-476.
Optional:
Chang, J. P., Chiam, C., Fu, L., Wang, A., Zhang, J., & Danescu-Niculescu-Mizil, C. (2020, July). ConvoKit: A Toolkit for the Analysis of Conversations. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 57-60).
Danescu-Niculescu-Mizil, C., & Lee, L. (2011, June). Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (pp. 76-87).
Mayfield, E., & Black, A. W. (2019). Analyzing Wikipedia Deletion Debates with a Group Decision-Making Forecast Model. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1-26.
Zhang, J., Chang, J., Danescu-Niculescu-Mizil, C., Dixon, L., Hua, Y., Taraborelli, D., & Thain, N. (2018, July). Conversations Gone Awry: Detecting Early Signs of Conversational Failure. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1350-1361).
Saveski, M., Roy, B., & Roy, D. (2021, April). The structure of toxic conversations on Twitter. In Proceedings of the Web Conference 2021 (pp. 1086-1097).
Luu, K., Tan, C., & Smith, N. A. (2019). Measuring online debaters’ persuasive skill from text over time. Transactions of the Association for Computational Linguistics, 7, 537-550.
NLP for communities
Required:
Bruckman, A. (2006, April). A new perspective on “community” and its implications for computer-mediated communication systems. In CHI ’06 Extended Abstracts on Human Factors in Computing Systems (pp. 616-621).
Yang, D., Kraut, R. E., Smith, T., Mayfield, E., & Jurafsky, D. (2019, May). Seekers, providers, welcomers, and storytellers: Modeling social roles in online health communities. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
Optional:
Gilbert, E., & Karahalios, K. (2009, April). Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 211-220).
Danescu-Niculescu-Mizil, C., West, R., Jurafsky, D., Leskovec, J., & Potts, C. (2013, May). No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the 22nd International Conference on World Wide Web (pp. 307-318).
Zhang, J. S., Keegan, B., Lv, Q., & Tan, C. (2021, May). Understanding the diverging user trajectories in highly-related online communities during the COVID-19 pandemic. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 15, pp. 888-899).
Chandrasekharan, E., Samory, M., Jhaver, S., Charvat, H., Bruckman, A., Lampe, C., … & Gilbert, E. (2018). The internet’s hidden rules: An empirical study of Reddit norm violations at micro, meso, and macro scales. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 1-25.
Cunha, T., Jurgens, D., Tan, C., & Romero, D. (2019, May). Are all successful communities alike? Characterizing and predicting the success of online communities. In The World Wide Web Conference (pp. 318-328).
Antoniak, M., Mimno, D., & Levy, K. (2019). Narrative paths and negotiation of power in birth stories. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1-27.
Vilhena, D. A., Foster, J. G., Rosvall, M., West, J. D., Evans, J., & Bergstrom, C. T. (2014). Finding cultural holes: How structure and culture diverge in networks of scholarly communication. Sociological Science, 1, 221-238.
Computational sociolinguistics
Required:
Nguyen, D., Doğruöz, A. S., Rosé, C. P., & De Jong, F. (2016). Computational sociolinguistics: A survey. Computational Linguistics, 42(3), 537-593.
Bucholtz, M., & Hall, K. (2004). Language and identity. A Companion to Linguistic Anthropology, 1, 369-394.
Hovy, D., Bianchi, F., & Fornaciari, T. (2020, July). “You Sound Just Like Your Father”: Commercial Machine Translation Systems Include Stylistic Biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1686-1690).
Ziems et al. (2023). Multi-VALUE: A Framework for Cross-Dialectal English NLP. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (pp. 744–768).
Optional:
Nguyen, D., Rosseel, L., & Grieve, J. (2021, June). On learning and representing social meaning in NLP: a sociolinguistic perspective. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 603-612).
Eisenstein, J., O’Connor, B., Smith, N. A., & Xing, E. P. (2014). Diffusion of lexical change in social media. PLOS ONE, 9(11), e113114.
Demszky, D., Sharma, D., Clark, J. H., Prabhakaran, V., & Eisenstein, J. (2021, June). Learning to Recognize Dialect Features. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2315-2338).
Pavalanathan, U., & Eisenstein, J. (2015). Audience-modulated variation in online social media. American Speech, 90(2), 187-213.
Grieve, J., Nini, A., & Guo, D. (2018). Mapping lexical innovation on American social media. Journal of English Linguistics, 46(4), 293-319.
Jones, T. (2015). Toward a description of African American vernacular English dialect regions using “Black Twitter”. American Speech, 90(4), 403-440.
Misinformation, factuality, and toxicity
Required:
Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., … & Huang, P. S. (2021, November). Challenges in Detoxifying Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 2447-2469).
Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359(6380), 1146-1151.
Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019, July). The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1668-1678).
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., … & Clark, J. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
Optional:
Waseem, Z., Davidson, T., Warmsley, D., & Weber, I. (2017, August). Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proceedings of the First Workshop on Abusive Language Online (pp. 78-84).
Chandrasekharan, E., Pavalanathan, U., Srinivasan, A., Glynn, A., Eisenstein, J., & Gilbert, E. (2017). You can’t stay here: The efficacy of Reddit’s 2015 ban examined through hate speech. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW), 1-22.
Jurgens, D., Hemphill, L., & Chandrasekharan, E. (2019, July). A Just and Comprehensive Strategy for Using NLP to Address Online Abuse. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3658-3666).
Friggeri, A., Adamic, L., Eckles, D., & Cheng, J. (2014, May). Rumor cascades. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 8, No. 1).
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020, November). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3356-3369).
Kreps, S., McCain, R. M., & Brundage, M. (2022). All the news that’s fit to fabricate: AI-generated text as a tool of media misinformation. Journal of Experimental Political Science, 9(1), 104-117.
Schuster, T., Schuster, R., Shah, D. J., & Barzilay, R. (2020). The limitations of stylometry for detecting machine-generated fake news. Computational Linguistics, 46(2), 499-510.
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending against neural fake news. Advances in Neural Information Processing Systems, 32.
NLP for literature & history
Required:
Marche, S. (2012). Literature Is Not Data: Against Digital Humanities. Los Angeles Review of Books.
Nelson, L. K., Getman, R., & Haque, S. A. (2022). And the Rest Is History: Measuring the Scope and Recall of Wikipedia’s Coverage of Three Women’s Movement Subgroups. Sociological Methods & Research, 51(4), 1788-1825.
Soni, S., Klein, L. F., & Eisenstein, J. (2021). Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers. Journal of Cultural Analytics, 6(1).
Underwood, T., Bamman, D., & Lee, S. (2018). The transformation of gender in English-language fiction. Journal of Cultural Analytics, 3(2).
Optional:
Piper, A., So, R. J., & Bamman, D. (2021, November). Narrative theory for computational narrative understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 298-311).
Walsh, M., & Antoniak, M. (2021). The Goodreads “Classics”: A computational study of readers, Amazon, and crowdsourced amateur criticism. Journal of Cultural Analytics, 6(2).
Lee, B. C. G., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., … & Weld, D. S. (2020, October). The Newspaper Navigator dataset: Extracting headlines and visual content from 16 million historic newspaper pages in Chronicling America. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 3055-3062).
Boyd-Graber, J., Hu, Y., & Mimno, D. (2017). Applications of topic models, Chapter 4 (Historical Documents) or Chapter 6 (Fiction and Literature). Foundations and Trends in Information Retrieval, 11(2-3), 143-296.
Reagan, A. J., Mitchell, L., Kiley, D., Danforth, C. M., & Dodds, P. S. (2016). The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science, 5(1), 1-12.
NLP for education & political science
Required:
McFarland, D. A., Khanna, S., Domingue, B. W., & Pardos, Z. A. (2021). Education data science: Past, present, future. AERA Open, 7, 23328584211052055.
Wang, R. E., & Demszky, D. (2023). Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction. arXiv preprint arXiv:2306.03090.
Hofstra, B., Kulkarni, V. V., Munoz-Najar Galvez, S., He, B., Jurafsky, D., & McFarland, D. A. (2020). The diversity–innovation paradox in science. Proceedings of the National Academy of Sciences, 117(17), 9284-9291.
Rodriguez, P. L., & Spirling, A. (2022). Word embeddings: What works, what doesn’t, and how to tell the difference for applied research. The Journal of Politics, 84(1), 101-115.
Card, D., Chang, S., Becker, C., Mendelsohn, J., Voigt, R., Boustan, L., … & Jurafsky, D. (2022). Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration. Proceedings of the National Academy of Sciences, 119(31), e2120510119.
Optional:
Demszky, D., Liu, J., Mancenido, Z., Cohen, J., Hill, H., Jurafsky, D., & Hashimoto, T. B. (2021, August). Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1638-1653).
Liu, J., & Cohen, J. (2021). Measuring teaching practices at scale: A novel application of text-as-data methods. Educational Evaluation and Policy Analysis, 43(4), 587-614.
Card, D., Boydstun, A., Gross, J. H., Resnik, P., & Smith, N. A. (2015, July). The Media Frames Corpus: Annotations of frames across issues. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 438-444).
Mathur, A., Wang, A., Schwemmer, C., Hamin, M., Stewart, B. M., & Narayanan, A. (2023). Manipulative tactics are the norm in political emails: Evidence from 300K emails from the 2020 US election cycle. Big Data & Society, 10(1), 20539517221145371.
Barberá, P., Boydstun, A. E., Linn, S., McMahon, R., & Nagler, J. (2021). Automated text classification of news articles: A practical guide. Political Analysis, 29(1), 19-42.
Vafa, K., Naidu, S., & Blei, D. (2020, July). Text-Based Ideal Points. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5345-5357).
Quantifying bias across domains
Required:
Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020, July). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5454-5476).
Field, A., Blodgett, S. L., Waseem, Z., & Tsvetkov, Y. (2021, August). A Survey of Race, Racism, and Anti-Racism in NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1905-1925).
Antoniak, M., Field, A., Mun, J., Walsh, M., Klein, L., & Sap, M. (2023, July). Riveter: Measuring Power and Social Dynamics Between Entities. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (pp. 377-388).
Optional:
Field, A., Park, C. Y., Lin, K. Z., & Tsvetkov, Y. (2022, April). Controlled analyses of social biases in Wikipedia bios. In Proceedings of the ACM Web Conference 2022 (pp. 2624-2635).
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635-E3644.
Fraser, K. C., Nejadgholi, I., & Kiritchenko, S. (2021, August). Understanding and Countering Stereotypes: A Computational Approach to the Stereotype Content Model. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 600-616).
Bailey, A. H., Williams, A., & Cimpian, A. (2022). Based on billions of words on the internet, people = men. Science Advances, 8(13), eabm2463.
Biases in the LLM pipeline
Required:
Wang, A., Morgenstern, J., & Dickerson, J. P. (2024). Large language models cannot replace human participants because they cannot portray identity groups. arXiv preprint arXiv:2402.01908.
Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., & Wallach, H. (2021, August). Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1004-1015).
Cheng, M., Durmus, E., & Jurafsky, D. (2023). Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (pp. 1504–1532).
Luccioni, A., & Viviano, J. (2021, August). What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 182-189).
Optional:
Pawar, S., Park, J., Jin, J., Arora, A., Myung, J., Yadav, S., … & Augenstein, I. (2024). Survey of Cultural Awareness in Language Models: Text and Beyond. arXiv preprint arXiv:2411.00860.
Kumar, S., Balachandran, V., Njoo, L., Anastasopoulos, A., & Tsvetkov, Y. (2023, May). Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (pp. 3291-3313).
Sheng, E., Chang, K. W., Natarajan, P., & Peng, N. (2019, November). The Woman Worked as a Babysitter: On Biases in Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3407-3412).
Sap, M., Swayamdipta, S., Vianna, L., Zhou, X., Choi, Y., & Smith, N. A. (2022, July). Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 5884-5906).
Santy, S., et al. (2023). NLPositionality: Characterizing Design Biases of Datasets and Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (pp. 9080–9102).
Kirk, H. R., Jun, Y., Volpin, F., Iqbal, H., Benussi, E., Dreyer, F., … & Asano, Y. (2021). Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Advances in Neural Information Processing Systems, 34, 2611-2624.
Ovalle, A., Goyal, P., Dhamala, J., Jaggers, Z., Chang, K. W., Galstyan, A., … & Gupta, R. (2023, June). “I’m fully who I am”: Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (pp. 1246-1266).
Cao, Y. T., Pruksachatkun, Y., Chang, K. W., Gupta, R., Kumar, V., Dhamala, J., & Galstyan, A. (2022, May). On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 561-570).
HCI & NLP
Required:
Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., … & Horvitz, E. (2019, May). Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-13).
Birhane, A., Isaac, W., Prabhakaran, V., Diaz, M., Elish, M. C., Gabriel, I., & Mohamed, S. (2022). Power to the people? Opportunities and challenges for participatory AI. In Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1-8).
Lee, M., Liang, P., & Yang, Q. (2022, April). CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (pp. 1-19).
Hancock, J. T., Naaman, M., & Levy, K. (2020). AI-mediated communication: Definition, research agenda, and ethical considerations. Journal of Computer-Mediated Communication, 25(1), 89-100.
Optional:
Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., … & Weld, D. (2021, May). Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-16).
Gordon, M. L., Lam, M. S., Park, J. S., Patel, K., Hancock, J., Hashimoto, T., & Bernstein, M. S. (2022, April). Jury learning: Integrating dissenting voices into machine learning models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (pp. 1-19).
Wu, T., Terry, M., & Cai, C. J. (2022, April). AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (pp. 1-22).
August, T., Wang, L. L., Bragg, J., Hearst, M. A., Head, A., & Lo, K. (2022). Paper Plain: Making medical research papers approachable to healthcare consumers with natural language processing. ACM Transactions on Computer-Human Interaction.
Wu, T., Jiang, E., Donsbach, A., Gray, J., Molina, A., Terry, M., & Cai, C. J. (2022, April). PromptChainer: Chaining large language model prompts through visual programming. In CHI Conference on Human Factors in Computing Systems Extended Abstracts (pp. 1-10).
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Ashktorab, Z., Jain, M., Liao, Q. V., & Weisz, J. D. (2019, May). Resilient chatbots: Repair strategy preferences for conversational breakdowns. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-12).
Jakesch, M., Bhat, A., Buschek, D., Zalmanson, L., & Naaman, M. (2023, April). Co-writing with opinionated language models affects users’ views. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
NLP & HCI
Required:
Lee, M., Srivastava, M., Hardy, A., Thickstun, J., Durmus, E., Paranjape, A., … & Liang, P. (2022). Evaluating human-language model interaction. arXiv preprint arXiv:2212.09746.
Nekoto, W., Marivate, V., Matsila, T., Fasubaa, T., Fagbohungbe, T., Akinola, S. O., … & Bashir, A. (2020, November). Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 2144-2160).
Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., & Smith, N. A. (2021, August). All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 7282-7296).
Kirk, H. R., Whitefield, A., Röttger, P., Bean, A., Margatina, K., Ciro, J., … & Hale, S. A. (2024). The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models. arXiv preprint arXiv:2404.16019.
Optional:
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020, July). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902-4912).
Wang, Z. J., Choi, D., Xu, S., & Yang, D. (2021, April). Putting Humans in the Natural Language Processing Loop: A Survey. In Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing (pp. 47-52).
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
Kreiss, E., Bennett, C., Hooshmand, S., Zelikman, E., Morris, M. R., & Potts, C. (2022, December). Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 4685-4697).
Röttger, P., Vidgen, B., Hovy, D., & Pierrehumbert, J. (2022, July). Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 175-190).
Akoury, N., Wang, S., Whiting, J., Hood, S., Peng, N., & Iyyer, M. (2020, November). STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6470-6484).
Li, H., et al. (2022). Ditch the Gold Standard: Re-evaluating Conversational Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (pp. 8074–8085).