Artificial intelligence has entered a phase of rapid capability expansion without a corresponding increase in governance maturity. Since 2017, the scale of frontier models has grown by several orders of magnitude, and leading labs are now developing systems that display early forms of metacognition, strategic reasoning, and autonomous tool use. Yet the institutional structures shaping this technological acceleration remain largely unchanged. The tension between capability growth and governance paralysis lies at the centre of what AI theorists describe as the gorilla problem: a species-level asymmetry in which weaker agents lack the capacity to constrain stronger ones. As development progresses toward advanced general intelligence, this asymmetry becomes more pronounced, not less, and it underpins a central paradox: the only systems powerful enough to meaningfully solve AI safety may be the very systems that pose the greatest risk.
The gorilla problem draws its name from an evolutionary analogy. Gorillas do not influence human political, economic, or technological decisions despite sharing the same environment. Their inability to shape the trajectory of a more capable species leaves them permanently downstream of its choices. Applied to AI, the concern is that human institutions, even when acting collectively, may become structurally incapable of governing entities far more intelligent and strategic than themselves. Among AGI researchers, the analogy is treated not as idle speculation but as a working model of governance failure. It reflects the expectation that once a system surpasses humans across most cognitive domains, oversight mechanisms anchored in human deliberation, regulation, or enforcement may cease to function as intended.
This asymmetry is reinforced by a view now widely held among frontier AI labs: solving advanced AI safety may itself require general intelligence. The field has made notable progress in domains such as interpretability and alignment, yet the hardest problems remain unsolved and may become tractable only once models can reason abstractly about their own behaviour, internal representations, and the dynamics of multi-agent systems. This dependence creates a feedback loop. As capabilities rise, safety researchers gain new tools for controlling these systems, but the same capability gains expand the range of possible failure modes. The paradox intensifies because each safety breakthrough tends to require more powerful models, and those models increase systemic risk faster than safety measures can be implemented.
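To make the shape of that feedback loop concrete, the sketch below is a deliberately crude toy model rather than an empirical claim: capability is assumed to grow geometrically each period, while implemented safety coverage rises only when capability crosses thresholds that unlock new research tools. The gap between the two stands in for unmitigated failure modes, and every growth rate and threshold is an invented assumption chosen purely for illustration.

```python
# Toy model of the capability/safety feedback loop described above.
# All numbers are illustrative assumptions, not estimates of real systems.

CAPABILITY_GROWTH = 1.5              # assumed multiplicative growth per period
SAFETY_GAIN_PER_UNLOCK = 2.0         # safety coverage gained when a new tool is unlocked
UNLOCK_THRESHOLDS = [4, 10, 25, 60]  # capability levels assumed to enable new safety tools


def simulate(periods: int = 10) -> None:
    capability, safety = 1.0, 1.0
    unlocked = set()
    for t in range(periods):
        capability *= CAPABILITY_GROWTH
        # Safety research benefits only after capability unlocks new tools,
        # so coverage rises in discrete steps while capability compounds.
        for threshold in UNLOCK_THRESHOLDS:
            if capability >= threshold and threshold not in unlocked:
                unlocked.add(threshold)
                safety += SAFETY_GAIN_PER_UNLOCK
        gap = capability - safety  # crude proxy for unmitigated failure modes
        print(f"t={t:2d}  capability={capability:6.1f}  safety={safety:5.1f}  gap={gap:6.1f}")


if __name__ == "__main__":
    simulate()
```

Under these assumed numbers the gap widens even though every unlock genuinely improves safety, which is the structure of the paradox described above: the tools arrive, but the risk surface grows faster.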
Governance institutions face an even sharper version of the same paradox. Governments generally lack the technical expertise, data access, and research capacity needed to oversee frontier systems. Political decision-making cycles often run an order of magnitude slower than AI development timelines, and regulatory tools designed for earlier technologies map poorly onto systems that are general-purpose, self-improving, and capable of influencing their own training environments. The result is a structural lag: the institutions responsible for oversight evolve far more slowly than the systems they are meant to constrain. The incentives of private actors amplify this lag further. Competitive pressure and fear of losing strategic advantage encourage continued scaling despite acknowledged uncertainty about model behaviour at higher capability levels.
Within the field, the paradox is often articulated through a second formulation: solving AGI safety requires building AGI, yet building AGI before solving safety increases the risk of catastrophic failure. This temporal inversion creates a strategic dilemma for researchers and policymakers. Many safety solutions require introspection capabilities that only advanced systems possess. At the same time, each step toward more advanced systems expands the space of potential misalignment, strategic deception, and autonomous goal formation. The question is not simply whether safety can be solved, but whether it can be solved quickly enough relative to the speed at which capabilities increase.
The paradox deepens in multi-actor environments. If AI development were a single, coordinated global effort optimised for safety, a slower and more controlled trajectory might be feasible. Instead, development is distributed across private firms, nation-states, and open-source communities. These actors operate under heterogeneous incentives, varying levels of risk tolerance, and asymmetric access to compute and data. Coordination mechanisms that work in slower-moving fields struggle under conditions where small capability differentials can yield strategic advantage. The result is a race dynamic that amplifies the very risks governance seeks to mitigate. Even if some actors slow down for safety reasons, others may not, and any unilateral pause risks being rendered irrelevant by independent advances.
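The incentive structure behind that race dynamic can be illustrated with a two-actor toy game in the spirit of standard racing models. The payoff numbers below are invented for illustration only; they encode nothing beyond the qualitative claims in the paragraph above: scaling while the other actor pauses yields strategic advantage, mutual restraint is collectively safer, and mutual racing is worst for both.

```python
# Illustrative two-actor race dynamic. Payoffs are toy values, not empirical estimates.
# Each actor chooses to "pause" (slow down for safety) or "scale" (keep racing).

ACTIONS = ("pause", "scale")

PAYOFFS = {
    # (action_a, action_b): (payoff_a, payoff_b)
    ("pause", "pause"): (3, 3),  # coordinated restraint: safest shared outcome
    ("pause", "scale"): (0, 4),  # unilateral pause: the pausing actor falls behind
    ("scale", "pause"): (4, 0),
    ("scale", "scale"): (1, 1),  # mutual racing: parity preserved, shared risk elevated
}


def is_stable(a: str, b: str) -> bool:
    """True if neither actor can improve its payoff by unilaterally switching action."""
    payoff_a, payoff_b = PAYOFFS[(a, b)]
    a_stable = all(PAYOFFS[(alt, b)][0] <= payoff_a for alt in ACTIONS)
    b_stable = all(PAYOFFS[(a, alt)][1] <= payoff_b for alt in ACTIONS)
    return a_stable and b_stable


for a in ACTIONS:
    for b in ACTIONS:
        marker = "  <- stable outcome" if is_stable(a, b) else ""
        print(f"A={a:5s}  B={b:5s}  payoffs={PAYOFFS[(a, b)]}{marker}")
```

Under these toy payoffs, mutual scaling is the only stable outcome even though mutual restraint leaves both actors better off, which is why unilateral pauses tend not to hold: restraint pays only if it is coordinated and enforceable.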
Proposals for mitigating the gorilla problem typically fall into three categories. The first is technical safety research that scales with model capabilities, such as mechanistic interpretability, constitutional training, or automated oversight systems. These approaches aim to preserve meaningful human control by embedding alignment mechanisms within the model’s training and architecture. The second category focuses on institutional governance, including licensing regimes for frontier models, compute monitoring infrastructure, and international treaties defining permissible capability thresholds. These measures seek to compensate for the governance lag by building institutional capacity and enabling timely intervention. The third category involves structural interventions designed to reshape incentive environments, such as compute caps, coordinated scaling agreements, or the creation of multinational oversight bodies with authority analogous to that of nuclear regulatory institutions.
Each category faces significant constraints. Technical tools improve but remain incomplete, institutional governance is still in a formative stage, and global incentive structures continue to produce rapid scaling. The underlying paradox persists because the conditions that increase the need for safety also undermine the feasibility of implementing it. As systems become more capable, they enable new forms of control while simultaneously diminishing the reliability of existing constraints. The gorilla problem is therefore not merely a metaphor but a prediction about governance asymmetry. It implies that any durable solution must address not only technical alignment but also the structural incentives and institutional architectures that determine how intelligence, whether biological or artificial, exercises power within a shared environment.


