After Orthogonality: Virtue-Ethical Agency and AI Alignment
AI safety research often assumes that machines need fixed goals. Rigid goals can cause problems, however, so alternatives are worth exploring. Some experts suggest that AI should act more as humans do: humans often follow practices, not just goals.
Practices are patterns of good action. For example, a mathematician promotes math by doing math well, and a kind person promotes kindness by acting kindly. Reasoning in this way is called eudaimonic rationality, or virtue-ethical agency.
This approach might make AI safer. In particular, it could help AI understand human values better and so avoid the orthogonality problem. If that holds, it has direct consequences for how AI alignment is pursued.
| Dimension | Consequentialist (EA-Style) Rationality | Eudaimonic (Virtue-Ethical) Rationality |
|---|---|---|
| Value Structure | Reduces holistic values to a minimal base of intrinsic “terminal” values; apparent goods are explained away as merely instrumental. | Treats causal connections among excellences as evidence for constitutive value; holistic and local values mutually ratify each other (“organicism”). |
| Means–Ends Relationship | Means and outcomes are separately evaluable; the value of outcomes is typically decisive (e.g., maximize aggregate utility). | No strict distinction between instrumental and terminal goods; excellent action is the way to promote future excellence (“promote x x-ingly”). |
| Robustness to RL/Darwinian Mutation | Vulnerable to mesa-optimizer drift; subroutines cultivated to serve an outer goal can distort or overtake it (inner alignment problem). | Acts as a self-reinforcing fixed point: the concept of x-ness applies across all agentic subroutines and nesting levels, stabilizing values under training dynamics. |
| Treatment of Safety Properties | Treating corrigibility, transparency, or niceness as goals to maximize can incentivize extreme power-seeking; treating them as constraints is brittle and easily circumvented. | These properties are treated as domain-general adverbial practices (e.g., “be transparent transparently”), capturing active cultivation without pathological optimization. |
| Alignment Naturalness | Type mismatch with human flourishing creates “paradoxes” of alignment; optimizing a utility function over human values tends to produce unnatural, brittle, or arbitrary interpretations. | Shares a “type signature” with human practical reasoning; eudaimonic practices are natural kinds with stable, learnable structure, making them relatively safe targets for ML training. |
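The "Means–Ends Relationship" row can be made concrete with a toy sketch. Everything in it is an illustrative assumption and not part of the source: the `Plan` fields, the numeric scores, and especially the use of `min` to score manner and outcome jointly. It is only one crude reading of "promote x x-ingly". The eudaimonic view in the table denies any strict split between instrumental and terminal goods, which a two-number score cannot fully capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    manner: float   # hypothetical score in [0, 1]: how kindly the action itself is done
    outcome: float  # hypothetical score in [0, 1]: expected future kindness produced


def consequentialist_choice(plans):
    # Outcome is decisive; the manner of acting is ignored.
    return max(plans, key=lambda p: p.outcome)


def eudaimonic_choice(plans):
    # Toy reading of "promote x x-ingly": manner and outcome are scored
    # jointly, so a high outcome cannot compensate for an unkind means.
    # Using min() here is an arbitrary illustrative choice.
    return max(plans, key=lambda p: min(p.manner, p.outcome))


plans = [
    Plan("manipulate people into being kind", manner=0.1, outcome=0.9),
    Plan("help a neighbor and model kindness", manner=0.9, outcome=0.6),
]

print(consequentialist_choice(plans).name)  # manipulate people into being kind
print(eudaimonic_choice(plans).name)        # help a neighbor and model kindness
```

With these made-up numbers, the outcome-only chooser selects the manipulative plan, while the joint chooser rejects it because the action is not itself an instance of the value it is supposed to promote.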
Virtue-Ethical AI Alignment
Eudaimonic rationality emphasizes practice-based action aligned with virtues such as kindness or honesty. An AI system guided by this model promotes values dynamically instead of pursuing rigid goals. Treating virtues as practices (for example, “promote kindness kindly”) is meant to improve stability and safety, and to support alignment by embedding core human values naturally. The same structure appears in human life: people thrive when they embody virtues within meaningful practices, so an AI that shares this structure is better placed to support human flourishing.
Reshaping AI Safety Through Virtue
On this view, human alignment relies on practice-based reasoning, not fixed goals, so AI should share this practice-centered “type signature” for genuine collaboration. Eudaimonic rationality (promoting excellence excellently) structures human flourishing and makes virtues like kindness natural and stable, whereas goal-based optimization creates alignment brittleness. Eudaimonic structures therefore offer robust, natural alignment targets: an AI aligned via practice-promotion fits human agency and could safely support human practices. Safety properties like corrigibility become adverbial practices instead of goals to maximize or constraints to satisfy, and in this way the framework aims to dissolve key alignment paradoxes.




