On the implicit assumptions of the Alignment Problem: is it OK to instil values?

I argue that there are a series of load-bearing assumptions that people working on the Alignment Problem make and that need further discussion. Surprisingly (or not!), the philosophical stance that one takes does matter for real-world AI development!

Alejandro Tlaie

8/15/2024 · 7 min read

Unless you have been living in a cave for the past (3?) years, you will probably know that the current zeitgeist is something like: "Artificial Intelligence is evolving very quickly and will be a transformative technology with many downstream implications". What you may or may not be so aware of is the Alignment Problem. Even if it sounds like something related to architecture or to someone's obsession with hanging pictures in a very particular way, it really is a much deeper and more interesting problem!

Others have previously defined this problem elsewhere, so do check those out if you're interested. My current working definition is: the Alignment Problem is about ensuring that AI systems act in ways that are beneficial to humanity, adhering to values that we consider important. But this quickly raises a set of fundamental questions: should we design and instil values into AI systems? If so, why? What implicit philosophical assumptions underlie this framing?

In grappling with these questions and meeting people in this space, I've come to realize that much of the work on AI alignment operates under the assumption that embedding human values into AI is not only permissible but something closer to a moral imperative (or, according to some, the only way to guarantee the survival of humankind once we have highly capable AI systems). Even before arguing for or against this perspective, I'd like to take a step back and highlight how deeply intertwined it is with the philosophical doctrine of moral realism. In brief, this doctrine is the belief that there are objective moral truths that can be discovered and applied universally. Note how marked the contrast is with those who think that ethics is a subjective matter, or with those who hold that all you need is internal logic, without reference to a possibly non-existent external world (a.k.a. anti-realists). As a piece of anecdotal evidence (a.k.a. your daily dose of overgeneralization), I've found a high correlation between: I) moral realists and empirical scientists; II) ethical subjectivists and social scientists; III) anti-realists and mathematicians/logicians/computer scientists/LessWrongers.

I'd like to devote this post to scratching the surface of how taking an explicit meta-ethical stance can help us untangle the potential implications of the different implicit ethical assumptions that currently shape part of the field of AI alignment, and how these might then translate into the way advanced AI is developed and deployed.

Designing and instilling values

Let me be clear: I don't think that instilling values in another intelligent being is exclusively an AI-related problem. There are plenty of familiar examples elsewhere: shaping corporate culture, raising kids, debating what traits a person should have to count as a sympathiser of a given ideology, and so on.

And yet, the very act of designing values for AI forces us to confront a paradox: values are inherently shaped by human experiences, cultures, and contexts, but in this case, we're being asked to create something that functions independently of all that — a sort of distilled, objective morality that transcends the messiness of human life. But if the goal is to develop an AI system that aligns with “human values,” whose version of humanity are we really talking about? Who gets to decide what values count as universal?

This brings us to a fundamental tension in the Alignment Problem: while the notion of instilling values suggests control, influence, and responsibility, it also introduces the risk of oversimplifying human morality. Human values aren’t static, and they certainly aren’t monolithic. They evolve, conflict, and adapt across time, places, and cultures.

And then there’s the practical issue of translating human values into code. How exactly do we distill something as abstract and subjective as “compassion” or “justice” into a set of operational guidelines for an algorithm? Some authors have argued that this is indeed an impossible task, because ethics is non-computable.
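To make the difficulty concrete, here is a deliberately naive sketch of what "operationalizing" a value often ends up looking like in practice: a hand-written proxy score. Every name, feature, and weight below is hypothetical and chosen only for illustration; the point is how much gets decided, silently, the moment we write it down.

```python
# A deliberately naive attempt to "operationalize" a value as code.
# Everything here (names, features, weights) is hypothetical and only
# meant to show how much is lost in translation.

from dataclasses import dataclass

@dataclass
class Outcome:
    people_helped: int      # crude proxy feature
    people_harmed: int      # crude proxy feature
    resources_spent: float  # crude proxy feature

def compassion_score(outcome: Outcome) -> float:
    """A hand-written proxy for 'compassion'.

    The weights are arbitrary: the moment we pick them, we have already
    made contested ethical choices (how much harm offsets how much help?
    does cost matter at all?), which is exactly the point being made here.
    """
    return (1.0 * outcome.people_helped
            - 5.0 * outcome.people_harmed
            - 0.1 * outcome.resources_spent)

# An optimizer pointed at this score will maximize the proxy,
# not the value we actually had in mind.
print(compassion_score(Outcome(people_helped=10, people_harmed=1, resources_spent=3.0)))
```

Whatever one thinks about non-computability in principle, in practice the translation step already smuggles in a long list of unargued ethical commitments.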

This leads to a further question: can AI systems be designed to adapt their values? Currently, within the AI Safety community, this issue is mostly tackled from the "corrigibility" side: given that we'll likely fail to specify the correct goal on the first go, can we design an AI system whose terminal objective can be changed? Although this is an active area of research, I think the meta-view here is importantly different from what I have in mind: corrigibility assumes that we know what we want to instil in the AI (because, in this view, values are things that exist) and that we merely fail on the technical side. My view is that ethical adaptability is needed as a precondition if we’re aiming for long-term alignment with dynamic, ever-evolving human societies.
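For contrast, here is a minimal, purely illustrative sketch (all names hypothetical) of the corrigibility framing: the objective is treated as a well-specified object that we may need to swap out later, and the only live question is whether the agent accepts the swap.

```python
# A toy rendering of the corrigibility framing (all names hypothetical):
# the "terminal objective" is just a swappable function, and a corrigible
# agent is one that accepts an authorized update to it without resistance.

from typing import Callable

Objective = Callable[[str], float]  # maps an outcome description to a score

class ToyAgent:
    def __init__(self, objective: Objective, corrigible: bool = True):
        self.objective = objective
        self.corrigible = corrigible

    def accept_update(self, new_objective: Objective) -> bool:
        """Return True if the agent adopts the new terminal objective."""
        if self.corrigible:
            self.objective = new_objective
            return True
        # An incorrigible agent keeps optimizing its original objective.
        return False

# Note what the framing takes for granted: someone, somewhere, can write
# down `new_objective` correctly. The argument in this post is that this
# step, not the swap itself, is where the real difficulty lives.
v1: Objective = lambda outcome: 1.0 if "paperclips" in outcome else 0.0
v2: Objective = lambda outcome: 1.0 if "flourishing" in outcome else 0.0

agent = ToyAgent(v1, corrigible=True)
print(agent.accept_update(v2))  # True: the update goes through
```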

The fundamental issue remains: the act of embedding values into AI isn’t just a technical problem; it’s a deeply philosophical one. It forces us to grapple with the nature of morality itself, and to reflect on our own limitations in defining and instilling values that can accommodate the complexity of human life. Moreover, it compels us to consider the trade-offs between precision and flexibility, between control and autonomy. These aren’t just abstract questions—they will shape how AI interacts with humanity in potentially profound ways.

"Okay, why should I care about any of this? Give me my powerful AI!"

If by now it's not clear, it might be worth emphasising my view: the alignment problem is NOT a technical problem; it cannot be dissociated from its societal embedding and, perhaps more fundamentally, from the meta-ethical views of the people involved. Concretely, when we talk about instilling values into AI, we're not just making technical decisions; we're dipping into a much deeper philosophical debate: whose values are we talking about? Whether we notice it or not, there's often an underlying assumption that certain values are “better” or “more correct” than others. But how we approach that assumption depends on where we stand philosophically. Three major meta-ethical views (moral realism, ethical subjectivism, and anti-realism) offer different perspectives on the whole "value loading" problem, each with different real-world implications for how advanced AI gets developed and deployed.

Moral Realism
As I briefly mentioned at the beginning, these folks believe that there are objective moral truths—basically, that there are right answers to moral questions out there, waiting to be discovered. So, from this viewpoint, loading values into AI is less about personal preference and more about finding those truths and making sure AI aligns with them. The idea is that if we figure out what’s universally right, we can embed those values into our systems, and AI will behave “correctly” as a result.

But even if you buy into moral realism, there’s still a big problem: what if we humans haven't converged on the true set of values? If you ask me, it would take a hefty dose of optimism to believe that we've figured out how morality works universally while we're still pretty ignorant about, for example, how our own guts (a.k.a. the microbiome) work.

Let's say moral realism is true and an advanced AI agent discovers the true moral facts of the universe; the next natural question is: what would happen to us immoral humans? Let me emphasise that we would not only be acting badly but also wrongly. It would be as if we discovered an alien civilization that sacrificed kittens in the sole hope that they would be turned into diamonds. If you're thinking "that's not only morally wrong, it doesn't even make sense", you'd be getting a feeling somewhat similar to what our AI (one that has converged on true values that do not match human values) would get when it thinks about human societies.

So this doesn't look good for humans, from the safety perspective.

Ethical Subjectivism
Okay, but what if, instead of believing that moral truths are objective, you hold that they’re grounded in individual or cultural preferences? Then, of course, you'd be an ethical subjectivist. From this view, the whole idea of loading a fixed set of values into AI becomes a bit messier. If what’s “right” depends on one's perspective, how can we justify hard-coding one set of values into AI? It would basically mean imposing one group’s preferences on everyone else.

A subjectivist approach might suggest that AI should be adaptable, able to adjust its behavior depending on the moral context of the people it’s interacting with. Rather than pushing a single moral framework, AI could respect the diversity of values across different societies and individuals. But this raises its own set of problems: how should AI handle situations where values conflict? I'm sure you'll be able to find ample evidence that this is actually a common situation in our not-so-crystal-clear world.
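As a purely illustrative sketch (the context labels, value names, and weights are all hypothetical), here is what "adapting to the moral context" might look like at its most naive, and where it breaks down: two contexts can score the very same action in incompatible ways, and the scheme itself says nothing about which one should win.

```python
# A naive sketch of "context-dependent values" (all labels and weights
# hypothetical): each cultural or individual context supplies its own
# weights, and the same action can come out acceptable in one context
# and unacceptable in another.

CONTEXT_WEIGHTS = {
    "context_A": {"individual_liberty": 0.8, "collective_harmony": 0.2},
    "context_B": {"individual_liberty": 0.2, "collective_harmony": 0.8},
}

def acceptability(action_features: dict[str, float], context: str) -> float:
    """Weighted sum of an action's features under one context's values."""
    weights = CONTEXT_WEIGHTS[context]
    return sum(weights[k] * action_features.get(k, 0.0) for k in weights)

# The same action, scored under two contexts:
action = {"individual_liberty": 1.0, "collective_harmony": -0.5}
for ctx in CONTEXT_WEIGHTS:
    print(ctx, round(acceptability(action, ctx), 2))
# context_A comes out positive, context_B negative: the scheme gives no
# principled answer when the affected parties belong to different contexts.
```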

Anti-Realism
Then we have the anti-realists, who argue that there are no objective moral truths. From this perspective, morality is something we invent rather than discover: rules we construct based on our needs, preferences, or goals. So when it comes to AI, anti-realists would suggest that there’s no “correct” set of values to instil. Instead, the values we embed into AI are essentially a matter of choice, shaped by the outcomes we want the AI to achieve. The goal isn’t to align AI with some deeper moral reality, but to make it function in ways that serve human purposes.

But, if morality is entirely constructed, what stops AI from determining its own values, or rejecting the ones we’ve given it? If morality has no objective grounding, AI might eventually “decide” that the values humans have constructed are irrelevant or inefficient for achieving whatever goals it deems important. In the anti-realist view, there’s no deeper moral truth for the AI to fall back on, meaning that any value system we create is inherently up for negotiation, or worse, up for manipulation.

Additionally, if we embrace anti-realism, we might end up treating AI alignment like a design problem, where we select values based on convenience, economic incentives, or political pressures rather than any deeper ethical reasoning. This risks creating AI systems that are morally shallow—driven by short-term goals, profit motives, or whatever values happen to dominate the culture at the time. In a world where values are constructed rather than discovered, it’s easy to imagine a future where AI alignment becomes less about moral integrity and more about moral expediency.

Looking Ahead

The assumption that we can and should instil values in AI rests on a deep belief in moral realism, the idea that objective moral truths are out there waiting to be found. But as AI development accelerates, we need to critically examine this assumption and consider its broader implications. Should we really be the ones deciding what’s morally right for future intelligent systems? And even if we agree on the values, can we predict how they’ll play out in the real world?

In my next post, I’m going to explore the concept of free will and how it ties into all this. Can AI systems have free will, and what would that mean for the values we instil in them? More importantly, how might this affect the way we think about moral responsibility in AI alignment?