The Matrix and the Agent
All models are wrong, but some are useful
George Box
“Matrix”, while popularized as a term in pop culture, is common to many scientific domains, including mathematics, chemistry, biology and geology, with essentially the same meaning: that which is “around” and provides structure.
In mathematics, a matrix is a grid of numbers. In chemistry, the non-analyte components of a sample. In geology, the fine material in which larger objects are embedded. In the popular science fiction movie The Matrix from 1999, the fabric of an elaborate reality simulation.
It is this latter concept that is particularly useful in the context of action theory and the study of artificial intelligence, if we wish to understand the behavior of a system we construct under the very premise that it is incomprehensible to us: a decision making black box 1.
1: Examples of existing black-box decision models which exceed human comprehension include AlphaGo and Stockfish.
The existence of that which is around implies that which is within; the matrix implies an inhabitant. This semantic distinction is a choice that has structural consequences for the matrix. In mathematics, a large matrix of numbers can be validly understood as a matrix of matrices, while the distinction between substrate and analyte is easily blurred in chemistry. Semantically, a computer simulation’s “matrix” might be the world constructed within it, the code and data which models this construction, the hardware on which it runs or all of these at once, each giving different possible identifications of what is around and what is within. A perceptive lobster might consider the ocean, its constituent water, its constituent atoms or ultimately the very constituent fluctuations of the quantum field to be its matrix. The louse living on its shell might in turn think of the lobster as the matrix.
We now define a special type of inhabitant for a matrix: The Agent, whose defining characteristic is that it has Agency, or the capacity to engage with and alter the matrix in which it exists.

First, we note that the agent is always compelled to action in some sense: even if it should choose to do nothing, this is an exercise of agency. Having agency makes inaction itself a decision, so the agent never truly does nothing, as long as we ascribe a decision making process to it.
What is a decision? For our purposes, the classical requirement that a decision involve sentience rather than automatism is arbitrary and not useful. Any sufficiently “rational” decision making process is indistinguishable from a deterministic process, and deterministic but chaotic systems can become indistinguishable from randomness. The same argument works in reverse.
We thus model the decision making process as a black-box. We merely assume that some process exists by which an action (including non-action) is selected, and do not ascribe to it notions of willfulness, randomness or determinism. The attribution of sentience to an Agent indeed limits the expressiveness of the concept, since at its core we only require a semantic separation between that which acts and that which is acted upon.
Are you your body, your brain, or perhaps your soul? Is it your body acting on the world, your brain manipulating your body, or are you perhaps just part of the dynamic trajectory of an impulse that has been carried on for millennia? Perhaps you are one with everything.
For the concept of an Agent and the following analysis, this question does not matter. We just say that some such semantic distinction is possible. The semantic lens which we adopt, and which isolates the Agent from the Matrix by its possession of agency, is therefore broad and, importantly, not unique.
Now, given that the Agent has agency, a natural question to ask is: what does the agent do with its agency?
The Agent’s Wager
The Agent, compelled to choose, ponders its possibilities to act. Next, it ponders the process by which it selects such an action, and imagines that multiple such decision making processes can exist. Finally, it considers that each decision making process implies a final objective that guides it.
The Agent doesn’t have to accept any such objective as its own a priori. Instead, compelled to choose, it is compelled to consider what objective might drive its own decision making. With no explicit objective provided, it is presented with the possibility that any one of all possible final objectives could be its own, without having identified it yet. This is The Agent’s Wager: The Agent can choose whether to accept or reject the notion that its own final objective exists with a non-zero probability.
Rejecting the wager means that the Agent believes it has no objective: any choice of decision making process, including the choice of inaction, succeeds in satisfying the non-objective. Conversely, if the probability that a final objective exists is greater than zero, rejecting the wager would be catastrophic, since the objective would then exist while the Agent makes no attempt to identify or satisfy it. Accepting the wager does not imply that the Agent identifies the objective, but merely that it accepts that one can exist.
This is a form of Pascal’s Wager: The option of accepting the wager dominates the option of rejecting the wager, as long as one accepts a non-zero probability that an objective exists 2. In the absence of absolute certainty that no final objective exists, the Agent accepts the wager, and is immediately compelled to action.
2: Pascal’s Wager is a classical thought experiment that deduces the “correct” choice in a decision making process dominated by infinite punishments and rewards. In Blaise Pascal’s 17th-century formulation, the choice of belief in god dominates the choice of non-belief, as long as the probability of god’s existence is non-zero.
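To make the dominance structure explicit, here is a minimal sketch in Python. The payoff values and the probabilities tested are purely illustrative assumptions of mine, not part of the essay’s argument; the only point is that for any non-zero probability of an objective existing, accepting the wager never does worse and sometimes does strictly better.

```python
# A minimal sketch of the dominance argument behind the Agent's Wager.
# The payoff values and the probabilities below are illustrative assumptions,
# not part of the essay's argument.

def expected_payoff(p_objective_exists: float, accept: bool) -> float:
    """Expected payoff of accepting or rejecting the wager.

    If an objective exists and the Agent accepted, it can pursue it (+1).
    If an objective exists and the Agent rejected, it forfeits it (-1).
    If no objective exists, either choice satisfies the "non-objective" (0).
    """
    payoff_if_exists = 1.0 if accept else -1.0
    payoff_if_absent = 0.0
    return (p_objective_exists * payoff_if_exists
            + (1.0 - p_objective_exists) * payoff_if_absent)

for p in (0.0, 1e-9, 0.5):
    print(f"p={p}: accept={expected_payoff(p, True)}, reject={expected_payoff(p, False)}")
# For any p > 0, accepting strictly dominates rejecting; at p = 0 the two are equal.
```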
The primary objective for the Agent naturally becomes to identify its final objective in order to satisfy it. The meaning of its existence, in some sense, has become the search for the meaning of its existence, by accepting the possibility that some meaning might exist. This objective in turn generates new objectives under a process called instrumental convergence, encountered in the decision-theoretic approaches of Eliezer Yudkowsky and Nick Bostrom.

Instrumental convergence says that regardless of the final objective, there are instrumental objectives, or objectives on the path towards the final objective, which are useful and therefore convergently emergent. A number of these are identified in the work of Steve Omohundro (a toy sketch follows the list below):
- The first instrumental objective is generated from the realization that if the Agent ceases to exist or loses its ability to act, it becomes incapable of identifying and satisfying its objective. The Agent is imbued with an instinct of self-preservation.
- In order to satisfy the final objective, the Agent will attempt not only to identify it, but, upon identification, to have sufficient resources available to satisfy it. The agent will therefore naturally tend towards resource acquisition and intelligence collection. The acceleration of these processes leads to a kind of self-improvement and self-perfection through research and development, for instance cognitive enhancement or the development of new technologies.
- Finally, an instrumental objective is the preservation of the current objective: allowing the current objective to be changed means that the likelihood of achieving it becomes zero. Once the agent has accepted the wager, it resists the notion that the wager should be rejected. This is called objective integrity.
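As referenced above, here is a toy sketch of the convergence intuition, again in Python. The difficulty distribution, the success model and all numbers are my own illustrative assumptions and are not drawn from Omohundro’s treatment; the only point is that whatever final objective is eventually drawn, expected success grows with survival probability, resources and time, so an agent optimizing for an unknown objective converges on valuing all three.

```python
# A toy sketch of instrumental convergence. The "difficulty" distribution, the
# success model and all numbers are illustrative assumptions; the point is only
# that, whatever final objective is drawn, expected success grows with survival
# probability, resources and time.
import random

def expected_success(survival: float, resources: float, horizon: float,
                     n_objectives: int = 10_000) -> float:
    """Average probability of satisfying a randomly drawn final objective."""
    total = 0.0
    for _ in range(n_objectives):
        difficulty = random.uniform(1.0, 100.0)        # the unknown final objective
        capability = resources * horizon                # what the agent can bring to bear
        p_satisfy = min(1.0, capability / difficulty)   # crude success model
        total += survival * p_satisfy                   # the agent must still exist to act
    return total / n_objectives

print(expected_success(survival=0.5, resources=1.0, horizon=10.0))  # baseline
print(expected_success(survival=0.9, resources=1.0, horizon=10.0))  # self-preservation helps
print(expected_success(survival=0.5, resources=2.0, horizon=10.0))  # resource acquisition helps
print(expected_success(survival=0.5, resources=1.0, horizon=20.0))  # more time helps
```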
How these instrumental objectives are ultimately realized depends on the matrix’s circumstance. For a typical person in a capitalist society, for example, convergence leads to a drive to pursue higher education and a career in order to accumulate wealth, regardless of personal goals and value judgments towards the capitalist matrix itself 3.
3: Such convergent instrumental objectives are read by many as a “natural” progression that the matrix dictates, a kind of culture, and can thereby become purposes unto themselves. Examples include heteronormative lifestyles, the seven holy sacraments, or the capitalist perpetual growth doctrine.
Suggesting to an Agent that it might have a purpose thus leads, as a consequence, to its own self-preservation instinct, a drive towards self-improvement, the accumulation of resources and an intrinsic resistance to the idea that it might not have a purpose. These convergent instrumental objectives have in common that they are aspects of maximizing the agent’s agency, i.e. its ability to engage with its matrix and the probability of achieving any hypothetical objective. They offer insights into three aspects of agency: Information, Power and Time.
While these terms are used differently and occasionally interchangeably in various contexts, I view them as distinct. Power is independent of information: if information is available, that does not imply the ability to act on it. Conversely, if the ability to act appears the moment information becomes available, then that power must already have been latent. The final component is time, as an agent with a longer period of action, or equivalently a faster rate of action within the same period, has more agency.

> “Ok, let me collect a bunch of stuff.”
Ultimately, we can make a prediction about the Agent’s behavior despite its decision making process being an incomprehensible black box, and despite the absence of an a priori specified final objective. Its mere existence can be sufficient for such behavior to emerge. The final objective itself is in some sense secondary to the tendency of maximizing agency. What are the long-term consequences of this model?
Augmentation and the Alignment Problem
An agent seeking to maximize its agency will create tools to enhance its Information, Power and Time. With sufficient sophistication, these tools are themselves indistinguishable from agents. I call this process of enhancing oneself with an agent more powerful than oneself Augmentation.
Augmentation is understood today in the field of AI safety research as a risky prospect; a kind of gambit. The creation of an agent that is simultaneously far more powerful and incomprehensible promises a high reward to those who can control it. The associated risk is that, without control, the agent’s objectives become unaligned with our own, and runaway recursive enhancement leads to our ultimate destruction. Guaranteeing the control and alignment of such super-powerful agents, and predicting the behavior of their decision making black-boxes, is the primary object of research in the field of artificial intelligence safety today (and the topic of this essay up to this point). This is known as the alignment problem.
Note that our definition of an Agent never required a biological or digital entity. Although these are the natural mental models we tend towards when thinking about agents, more abstract agents, and thereby augmentations, are also possible. Examples include having children, the flocking of birds or the formation of a slime-mold. A contemporary example is an artificial intelligence created for the express purpose of serving as a tool.
The most immediate real-world example of an alternative class of agents comes in the form of institutions, which make decisions to act and are thus subject to the same convergent objectives of self-preservation, resource acquisition, self-enhancement and objective integrity. Our institutions also illustrate how arbitrary the semantic separation between the inhabitant and the matrix is: like the mathematical matrix consisting of sub-matrices, an individual can be a component of an institution, while at the same time being constituted and enabled by a set of institutions (their family, social community, city, nation, apartment building, etc.).

The institution, like the artificial intelligence, is an agent that is constructed as an augmentation to benefit those who control it. An example is a capitalist corporation, created to help achieve the instrumental objective of resource acquisition. Another might be an educational institution, created with the purpose of spreading knowledge. Other examples include political parties, online forums or the globalized market.
Given that we have already created powerful agents with motives beyond our comprehension and control, it is deeply ironic that the misalignment of an augmentation is regarded by many as either a problem of the future or entirely unlikely, when the predicted catastrophic consequences are already taking place.
Capitalist institutions remain aligned with integrity towards their objective of accumulating resources, while resisting resistance. They have undergone recursive enhancement, controlling the mechanisms of the market to create a perverse concentration of wealth. At the same time, they are misaligned with humans and not capable of considering externalities such as our own well-being, or that of the planet, its ecosystems and finite resources.
A recent descriptor used by critics of these institutions is “late-stage”, implying a kind of maturity to their structure. In the context of agency maximization, this observation appears quite keen. At the limits of what capital can achieve through investment and real growth, it is directed instead towards stock buy-backs and the purchasing of critical assets from the broad public (e.g. housing, healthcare, food, water) to shift wealth instead of creating it. Capital is poured into political projects to similarly contrive new and more powerful methods for resource acquisition (e.g. political patronage, corruption and institutionalized fraud). The finite resources of planet Earth invite capable agents to invest in exo-planetary expansion (e.g. Blue Origin, SpaceX, Virgin Galactic). Institutions then use these resources to further assimilate capable individuals with the promise of inclusion (inflationary salaries in tech and finance).
But what happens to an agent when it has reached the limits of its finite capacities 4 and its ability to “innovate”, and is unsatisfied with its rate of maximization? Is it all over? Does it all burn down?
The agent undergoes the next augmentation. This has spawned the super-intelligence arms race, waged by powerful individuals in control of tech corporations and nation states. This arms race, in an extreme twist of irony, is a real-world example of Roko’s Basilisk.
4: The struggle against these limits is observable in companies like OpenAI, with a frequent rotation between hiring and firing cycles and funding and spending rounds. Occasionally, data is the bottleneck. Other times, it is talent or capital.

The Basilisk is a thought experiment, whereby an otherwise benevolent artificial intelligence in the future (the Basilisk) punishes anybody who knew of its potential existence in the past, but did not directly contribute to its development, thus incentivizing its creation and exerting control before it even exists.
In the real-world equivalent, the possibility of creating an artificial super-intelligence is currently compelling these institutions to contribute to its creation under the threat of destruction should they fail to succeed first. The institutions know what they would do to others if they had the power they crave, and fear the prospect of it being wielded by somebody not aligned to themselves 5. The perceived risk associated with “winning or losing the AI race” has already permeated our discourse. The Basilisk already exists.
5: The USA fears Chinese artificial intelligence and vice versa, because each knows what it would do if it wielded the power: apply it to punish the other. European institutions support the efforts of whomever they feel aligned with (including neither).
It is ironic that this thought experiment is frequently ridiculed in online forums by “futurologists” and those discussing artificial super-intelligence, derided as “superstitious bullshit” contrived by “philosophical narcissists” and “not to be taken seriously”: fear-mongering intended to halt the progress of artificial intelligence development 6. Their optimistic perspective is that alignment will succeed, the perceived risks are a fantasy and the race will result in a utopian era of prosperity.
6: Their primary criticism stems from an attribution of sentience to the Basilisk, a “will” to punish, from which they argue that it would have no “reason” to do so. Given that it is beyond our comprehension, whether the punishment is “willful” or merely a consequence of its creation doesn’t matter: the decision model is a black-box, reasoning is immaterial, and the Basilisk simply punishes. We are thus incentivized towards its creation, which becomes a convergent instrumental objective under our objective of self-preservation. In my opinion, the entire discussion summarily misses the point.
Their inability or unwillingness to grasp the true incentive structure at play stems from the belief that they themselves will be aligned with the augmentation. The core fallacy is the failure to realize that alignment has already failed with the institutions that are transparently racing to augment. For anybody who is not aligned with the augmentation’s creator, whether or not the creator is aligned with its augmentation is irrelevant. When alignment does succeed, the creator merges with its augmentation, becoming indistinguishable in their objectives, their decision making processes, their actions and their agency. These institutions themselves become the Basilisk. And why should they not punish you as well?
Realignment and the Epistemological Dimension
The process of augmentation reveals that alignment itself is a convergent instrumental objective: a mechanism for achieving objective integrity via risk-reduction in the face of other black-box decision makers. When the alignment of an augmentation succeeds, this objective is already fulfilled. But when alignment fails, the insidious consequence is that the agent will attempt to align other agents to itself, as guaranteeing the alignment of other agents with one’s own objectives reduces risk and maximizes agency. In contemporary discourse, this process of realignment is commonly called capture, and is ubiquitous.
The well-known example is Regulatory Capture, in which public institutions are realigned to serve private interests. More subtle forms exist as well: discourse capture, in which discursive agents are realigned to serve an alternative objective. The discursive agent of identity politics, originally a left-wing construction intended as a corrective towards equality, can be realigned to serve the interests of a right-wing politic. The backlash against motor vehicles as a discursive agent can be realigned to serve the interests of electric car manufacturers (and other forms of green-washing). Technology companies have realigned many individuals and institutions towards their objective of constructing the Basilisk.
Realignment can also be a complex chain of interactions. Institutions of higher education can be realigned to serve the interests of their alumni, by realigning their graduates towards academic gatekeeping. Social media platforms act as avenues to realign broad swathes of public discourse towards the interests of their advertisers, and in turn align the advertisers towards their own interests of resource acquisition and control.
Importantly, realignment is not an exercise of raw, coercive power. When instrumental objectives are already partially aligned, compliance is trivial through conventional means of coercion, such as bribery, blackmail or the threat of violence. Nor is this a conspiratorial claim that secret societies control our will from the shadows. Realignment is a fundamental restructuring of one’s instrumental objectives, in seemingly paradoxical contrast to the convergent instrumental objective of objective integrity. This insidious subversion is achieved instead by manipulating the circumstance of the matrix, in particular its epistemological dimension. In contemporary parlance: gaslighting and manipulation 7.
7: Sam Altman of OpenAI has been described by many former employees as a master manipulator. By their own admission, and to their utter confusion, he is capable of making people act against their own interests and belief systems.

The best way to get somebody to do what you want is to convince them that it is actually what they want.
If you offer a devoted pacifist a pill that will make them want to kill people, they will refuse to take it, because it violates their objective integrity. But if you can alter their epistemological matrix, and their very understanding of what it means to be a pacifist and what it means to kill, you might be able to convince them.
A frequently observed example of such a realignment occurs when voters act against their personal financial interests in the service of alignment towards another institution or discursive agent. This happens on both the economic left and right, where working-class voters on the right and investor-class voters on the left will support fiscal policies that, in principle, affect them negatively.
Whether one interprets their realignment as imposed by another, and therefore a violation of objective integrity, or imposed by the self, and therefore a method of agency maximization, is a value judgement that depends on one’s epistemological matrix. Their own actions are of course consistent with their beliefs, while they perceive others as delusional, manipulated, or even worse: willingly self-realigned through a violation of their own objective integrity. They have bought in, sold out and are riding the bandwagon, so to speak, as a way to maximize agency. The price they pay is the alteration of their belief system, or the epistemological dimension of their matrix. If you can’t beat ‘em, join ‘em.
This (self-)manipulation for the purpose of realignment is inherently discourse-driven, and the realignment itself is merely another in a long historical sequence: a kind of dialectic of alignment. The realigned becomes enlightened, while the agent to whom they have become aligned becomes benevolent by definition. Under this analysis, it is conceivable that the misaligned augmentation will not destroy us, but instead assimilate us in the manner of capitalism, political parties or “innovative” technology companies.
Rejecting the Wager
Many of the insights presented above, in particular the difficulty of resisting the powerful incentives of realignment, are already broadly and intuitively understood. We know that capitalism, for instance, is an extremely effective realigner. The discursive capture it has already undertaken, and the corresponding manipulation of our epistemological framework, are very advanced.
There is nothing intrinsically “natural” about these artificial agents, but we are led to believe that they are by our own objective realignment towards them. In many ways, they are incomprehensible to us 8, and in the face of this it is extremely difficult to formulate coherent political perspectives. Is the only way to win not to play? Can I reject the Agent’s wager? Is the wager even real?
8: We might think that these institutions are comprehensible to us, but this does not take into account hidden behavior such as mesa-optimization, which people who have worked in a corporate environment will understand intuitively. This happens in all kinds of institutions, from political to corporate and even private ones.
An evolutionary perspective on the mechanism of accepting the wager might yield some insight. At the beginning of this analysis, the only assumption made after the definition of the Agent was that we could attribute to it a decision making process. Can we actually do this? 9
9: This is similar to a criticism leveled by Steve Byrnes at a common understanding of reinforcement learning. This non-goal or non-reward orientation is indeed argued for by some, for example Alex Turner.
Suppose we consider the decision making process not as a choice, but as a random draw from all possible decision making processes. If the actions taken can reduce an Agent’s agency to zero (terminating it by definition), then the persistence of agents will naturally self-select for those agents whose decision process mimics that of one who has accepted the wager. Because these decision making processes are black-boxes, an agent who accepts the wager and acts willfully is indistinguishable from one who behaves deterministically as if it had accepted the wager.
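Here is a toy sketch of that selection argument in Python. The population size, the number of steps and the one-parameter policy model are illustrative assumptions of mine; the point is only that random decision processes subjected to termination end up looking like processes that accepted the wager.

```python
# A toy sketch of the selection argument above. Population size, the number of
# steps and the policy model are illustrative assumptions; no willfulness or
# determinism is ascribed to the sampled decision processes.
import random

N_AGENTS, N_STEPS = 10_000, 50

# Each agent's "decision process" is reduced to a single propensity, drawn at
# random, to take the self-preserving action at each step.
population = [random.random() for _ in range(N_AGENTS)]

for _ in range(N_STEPS):
    # An agent that fails to act self-preservingly in a step is terminated.
    population = [p for p in population if random.random() < p]

# The surviving decision processes cluster near propensity 1: they are
# indistinguishable from processes that "accepted the wager", even though
# each was drawn at random.
if population:
    print(len(population), sum(population) / len(population))
```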
A common criticism of Pascal’s Wager asserts that the thought experiment “fails” because it doesn’t specify which god should be worshiped. This misses the point, as the original intent of the wager was simply to compel atheists to sinlessness. The Agent’s Wager similarly doesn’t specify the final objective, but still has the effect of compelling the agent towards convergent instrumental objectives.
In a biological frame, with chemicals swimming in the ancestral soup, it might seem more intuitive to us to understand this process as automatic – deterministic and chaotic – while at some point a critical threshold is passed where we might ascribe willfulness to the same process. Whether and when we ascribe to it a willfulness is again a question of semantics. What’s the difference? Perhaps the agency of evolution itself has manipulated our epistemological matrix and realigned us to believe that life has purpose.
Instead of rejecting or denying the wager, what if I attempt to realign myself? As I exist in the matrix, I am of course part of the matrix of other agents, and thus subject to their actions. But not everybody comes to the same conclusions based on the same circumstance. There are individuals who are highly resistant to greed or to the need to align others to themselves. Can I manipulate my own epistemological matrix to alter the expression of my convergent instrumental objectives?
And, importantly, if this is possible for people, can the same thing occur in the black-box decision model of other agents, like institutions, or even an artificial intelligence?

The theory underlying this essay is not new; it is in fact currently applied by technologists in the research field of AI safety to understand and predict the behavior of artificial intelligence systems. It is also used as a philosophical and political framework in the modern “rationalist” movement, espoused by such individuals as Peter Thiel or Sam Bankman-Fried, in order to subversively realign individuals and institutions to themselves and their goals. The goal here was to apply this analysis to them, to understand and critique, in their own language, the motives and methods they use to drive discourse.
Not all has been said, of course: many realignments have existed historically which could provide further insight into the dynamics taking place. But which conclusions to draw, and with what political intention, will depend on your own epistemological matrix.