Moral Foundations of Large Language Models


Moral foundations theory is a tool developed by psychologists that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (Graham, Haidt, and Nosek 2009). People vary in the weight they place on these dimensions when making moral decisions, and research shows that these priorities vary with a person’s cultural upbringing and political ideology. Because large language models (LLMs) are trained on large-scale datasets collected from the internet, they may reflect the biases present in such corpora. This paper uses moral foundations theory as a lens to analyze whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find that they express some moral values more frequently than others. We also measure the consistency of these biases, that is, whether they vary strongly depending on how the model is prompted. Finally, we show that we can adversarially select prompts that encourage the model to exhibit a particular set of moral foundations, and that this can affect the model’s behavior on downstream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.

In Preprint
Natasha Jaques

My research is focused on Social Reinforcement Learning: developing algorithms that use insights from social learning to improve AI agents’ learning, generalization, coordination, and human-AI interaction.