Democratic AI Benchmarking

Why Tocqueville Would Embrace AI Benchmarking: Charting a Path for the Future of Democracy in the Age of Artificial Intelligence

Benjamin Jensen and Ian Reynolds | 2025.07.28

Benchmarking and evaluating AI requires a bottom-up, associational model of governance that emphasizes accountability, transparency, and the public interest to ensure that the technology furthers democratic values.

Drawing inspiration from the work of Alexis de Tocqueville, CSIS Futures Lab sees an emergent role for artificial intelligence (AI) benchmarking—independent studies that test and evaluate AI performance in domain-specific tasks—as a new mechanism for ensuring accountability in a free society. Through adapting Tocqueville’s concept of association, AI agents can be held accountable, thus ensuring a more open and transparent society. For Tocqueville, association was the “mother science” of life in a democracy that illuminated moral truths, fostered public virtue, and guarded against social isolation. In a world of algorithmic reasoning, association is no longer confined to human debates in the public square: It has taken on a technical dimension that involves testing and evaluating AI foundation models to help the body politic understand their strengths and weaknesses as well as push firms to fine-tune their models to meet specific user needs. This process must be independent and transparent to ensure the free exchange of ideas in a democracy.

As a result, AI benchmarking is a central task in a free society that embraces AI technology. Furthermore, to preserve its independence, the process should be open and free of direct government or business sector control. True association is bottom-up and maximizes transparency and accountability. In an era where individual actors—whether analysts or citizens—are increasingly reliant on opaque agentic workflows, collective action through transparent, domain-specific model evaluation is essential. Without such associations to mediate the influence of AI in national security and foreign policy, decisionmaking risks becoming centralized, brittle, and divorced from democratic oversight. Worse still, the polarization that defines online discourse in the modern United States will be overrun by narrow, self-interested factions convinced of the rightness of their causes by never-ending feedback loops of tailored information.

Seen in this light, there is a mix of public and private sector action needed to create a world where technology and democracy are co-constituted and thrive, a new town square where algorithms help mitigate—as opposed to reinforce—bias and create the possibility of open dialogue and civic association. First, the public sector needs to pursue legislation that encourages private sector, independent model benchmarking. This push should include a mix of funding for efforts at nonprofits, including universities, and exploring which congressional committee has the authorities to call for routine hearings on model benchmarks. Second, legislators should encourage collaboration between civil society and AI firms through experimenting with legal incentives. Third, and most importantly, the United States’ foundations need to come together to fund multiple benchmarking initiatives to preserve the independence of this modern form of association and accountability. To ensure its role in holding algorithms accountable, benchmarking should be free of both government and industry interests.

Introduction

It is 20XX. A defense analyst struggles to keep up with the array of incoming information on adversary troop movements near a contested border region. A combination of autonomous systems, infantry units, and air assets are already in the area. Moreover, some reports suggest that the adversary may be preparing precision long-range strike capabilities targeting the contested border, should a conflict break out. Pressured to demonstrate their resolve by an array of computational propaganda bots amplifying extreme nationalist views, political leaders on both sides have implied that they will not back down in the dispute. However, the analyst has been made aware that backdoor negotiations are underway in an attempt to avoid further crisis escalation. Circumstances are unfolding rapidly, uncertainty is high, and defense leaders are demanding a recommended course of action from the analyst immediately.

The analyst is equipped with a decision support agent leveraging live intelligence feeds, other relevant information, and a large language model (LLM)-enabled natural language chat interface. Based on the current information and its underlying training data, the decision support system recommends that an escalatory policy, in which use of force occurs, is the best course of action. The analyst, however, is hesitant. Her gut says to avoid escalatory behavior, but the decision support system is pushing for the exact opposite. Her superiors are increasingly demanding of her recommendation due to tight operational timelines, but she is unsure of the exact reasoning behind the decision support system’s recommendation to use force. Moreover, she is not abreast of the data and training processes the system has undergone. While the final decision on what to recommend does lay in her hands, she is worried that her decision is being nudged toward a course of action by a complicated system she does not entirely understand.

Luckily in this case, the situation was simply a crisis simulation helping benchmark a new decision support agent within an AI-enabled military command system. However, the tensions that are a part of this fictional scenario are increasingly very real. And they extend beyond national security to economics, energy policy, and even online discourses that shape how people mobilize in their communities and exercise their right to vote.

Both citizens and government institutions around the world are experimenting with integrating AI agents into their daily lives, including in decisionmaking contexts. AI agents that perceive the environment, make decisions, and recommend actions are increasingly ubiquitous, further blurring the line between human and machine. As a consequence, it is likely that, in the near future, defense and foreign policy analysts, as well as city councils and even individual voters, will be faced with the question of whether to trust AI agents in critical circumstances.

The resulting situation leaves society at large with a choice. Citizens and institutions could passively rely on opaque and complicated systems that harness AI agents to augment decisionmaking. Alternatively, a free people could endeavor to achieve an active approach to broad and robust civic engagement with such systems to better understand, evaluate, and shape human-machine interactions in the future.

This report argues for the second option and suggests that it is unwise to leave either the public or the private sector to be the sole arbitrators for evaluating the use of AI agents, especially when such agents inform political and national security decisions. A fundamental element to implementing this approach will be successfully developing methodologies for continuous and robust benchmarking and evaluation of AI foundation models and derivative agents in contexts where ground truth may not exist, information is subject to abrupt change, and uncertainty reigns. Moreover, such approaches must be transparent, plural, and feature shared responsibility across a broad spectrum of U.S. society. No one party, no one government agency, and no single corporate cabal should control the ability of free people to exchange ideas and shape their society.

Processes of benchmarking and evaluation can be developed and practiced not only as abstract technical or regulatory tasks, but also—and primarily—as a form of civil association essential to preserving democratic judgment in an era of increasing digital dependency. That is the core proposition of this report and its recommendations for ensuring that the benchmarking process is free from government or industry control and influence.

To make its case, this report revisits Tocqueville’s arguments about U.S. society from his work Democracy in America, as well as other writings in democratic theory, political philosophy, and governance, suggesting that such perspectives can inform an approach on model benchmarking and evaluation that incorporates civil society and leverages the associative potential of different forms of actors across the United States to build a robust and accountable space for agentic AI to flourish.

Tocqueville’s Warning and the Civil Society Imperative

Though first published in the nineteenth century, Tocqueville’s Democracy in America holds contemporary relevance for thinking about how to ensure democratic accountability in the era of agentic AI. Two related factors are most pertinent to this report: (1) the possible detrimental effects of isolation, and (2) the corrective and beneficial consequences of associations. These factors are relevant in the context of AI agents in that delegating social action to AI could increase isolation and reduce human agency in governance decisions absent proper benchmarking and fine-tuning. However, robust collective human involvement within processes of agentic AI development and evaluation hold the potential to make the integration of agentic AI into society an associational process, thus reducing isolation rather than exacerbating it.

In his study, Tocqueville expressed concern regarding isolation in democratic life. Because democracies emphasize equality among citizens, each person is free to pursue their own interests and desires. The result is a generally individualistic social structure. Individualism, in the context of agentic AI, further incentivizes the risk of abdicating governance decisions to technology as citizens see to their own individual lives and daily priorities. The consequence could include leaving technological development and implementation to private companies and government actors with little-to-no public input.

For Tocqueville, despite citizen equality, left to their own devices, equity and individualism can also lead to isolation—and isolation can lead to weakness and the threat of despotism. As he argued, “the vices fostered by tyranny are exactly those supported by equality. These two things are complementary and mutually supportive, with fatal results.” Moreover, apart from the risk of tyranny, Tocqueville wrote that isolation leads to impotence in that alone, “citizens can achieve almost nothing.” As AI agents are increasingly deployed across public and private life, further isolation could easily lead to a public that simply—even unconsciously—accepts a version of AI’s integration into practices of governance that challenge democratic principles and limit human capacity to exercise political agency.

For Tocqueville, the counterweight to the consequences of isolation and its threat to democratic governance was the formation of civil associations between citizens to take on problems of mutual concern, generate social cohesion, and foster care for others within an otherwise individualistic context. The goals of associations can range from building local community centers to variable commercial interests, among many others, in which citizens realize and pursue common bonds. Tocqueville emphasized the critical nature of associations, writing “in democratic countries, the knowledge of how to form associations is the mother of all knowledge since the success of all others depends on it.” Thus, the fear is a surrender to a despotic state if associations, and their various positive effects, fail to form. Accordingly, within the context of agentic AI, associations can serve as a key check against the narrow interests of society’s most powerful actors and even foster collective democratic practices guiding the technology’s development and deployment in a fashion that has diffuse benefits across the U.S. public.

Tocqueville’s discussion of the risk of isolation and the need for association creates a framework for thinking about how to balance the efficiency gains of AI agents with the desire to maintain a free society. In a range of contexts, AI agents are replacing expert judgments and determining important life outcomes, including in the domains of medicine, bank loans, and even governance decisions, among others. If broader society passively—and uncritically—accepts algorithmically shaped decisionmaking procedures, it risks surrendering to an antidemocratic version of algorithmic governance, or a way of “social ordering” that incorporates algorithmic procedures into decisionmaking in which these computational systems may opaquely shape government processes and life outcomes. As one assessment suggests, “advances in machine-learning—or artificial intelligence—portend a future in which many governmental decisions will no longer be made by people, but [rather] by computer-processed algorithms,” presenting an “emerging threat” to liberty and democracy. To map this on to Tocqueville’s framework, instead of surrendering to the pull of tyrannical government, absent the corrective forces of associations, broader democratic society risks surrendering to an antisocial, nondemocratic version of algorithmic governance.

Importantly, scholars have shown that civil society can act as a pillar of accountability and an additional form of checks and balances within governance structures. For this reason, a brief review of social scientific work on the relationship between civil society and governance, as well as the possible impact of technology, is a useful exercise. While the modern version of the term “civil society” has been deployed at least as far back as the late eighteenth century, more contemporary expressions of the term, particularly in the work of U.S. political scientist Robert Putnam, have had broader impact in popular discussions of modern democracy. As U.S. sociologist Larry Diamond argued in 1994, “no phenomenon has more vividly captured the imagination of democratic scholars, observers, and activists alike than ‘civil society.’” For the purposes of this discussion, following Larry Diamond’s definition, civil society refers to sets of organized social groups that members voluntarily join, create, and support, and that are generally autonomous from the state.

Scholars have attempted to demonstrate the important role that civil society can play in shaping political outcomes. For example, Putnam’s work on civil society in Italy argues that robust networks of civic engagement hold a range of benefits for social cooperation, including increasing defection costs, fostering norms of reciprocity, and demonstrating the benefits of community collaboration. In turn, he suggests that this form of civic activity—and the subsequent development of social capital—supports high-functioning democratic institutions. Putnam expands his argument beyond Italy. Drawing from Tocqueville, Putnam argues, “Tocqueville was right: democratic government is strengthened, not weakened, when it faces vigorous civil society.” As summed up by sociologist Sydney Tarrow, Putnam’s basic argument is that “for where there is no social capital … democracy cannot flourish.”

Other scholars also emphasize the important links between civil association, democracy, and the spread of liberal norms and values. The study of nongovernmental organizations (NGOs) is a prominent example of the relationship between governance and civil society organizations. Scholars focusing on international politics have argued, for instance, that NGOs work as key norm entrepreneurs for the expansion of liberal human rights regimes. Moreover, NGOs can constrain government action by applying public pressure through advocacy (i.e., naming and shaming), increasing the normative costs for pursuing policies that result in things such as human rights abuses.

Apart from internationally oriented NGOs, researchers contest that, under the right conditions—such as sufficient levels of free press and political competition—robust and active civil society organizations can reduce government corruption and increase accountability. Moreover, civil society organizations can participate in processes of public oversight and information sharing that can increase the transparency and accountability of governing institutions.

The picture, however, is not always so clear, and the impact of technology, particularly digital technology, on civil associations and democracy remains murky. Spanish social theorist Manuel Castells, for example, notes that technological changes have introduced a “networked society” that has fundamentally altered communicative practices and transformed “space and time in the human experience.” As such, technological changes have altered the basic ways in which individuals and groups in societies are linked together. Scholars express different perspectives on how technological factors will shape the role and structure of civil society. Some, such as technology expert Lee Rainie and sociologist Barry Wellman, suspect that such technological changes, particularly in the form of information and communication technologies (ICTs), improve the prospects of robust socialization and community building. Others argue that ICTs offer the prospect of more impactful activism in restricted political environments due to ICTs’ capacity to “facilitate open and inclusive participation.” Still other scholars express concern over the role that technology will play in social organizations and democratic structures. Returning to Castells, he suggests that one consequence of the networked society could be a destruction of organizations and the “delegitimating” of institutions, leading to the prospect of social alienation. Moreover, research finds that, in some contexts, social media can decrease satisfaction with democracy, and scholars studying authoritarianism have argued that digital technologies can increase state capacity for repression of democratic freedoms.

Of additional importance is the co-constitutive relationship between science, technology, and society. All too commonly, technology is treated simply as a variable (or tool) that impacts the social world in a direct way. Thus, in terms of civil society, technologies such as the internet will either help or hurt such organizations. Yet, the relationship is far more dynamic and dependent “on a complex pattern of interactions” between the social and technological. The majority of peoples’ interactions with each other and their environment are increasingly mediated through technology in general, and AI agents in particular. Humans may have made AI, but AI agents shape how people interact with their world.

A few examples from the daily practices of scientific laboratories as well as broader implementations of technology in public policy domains will help to illustrate this point. Social theorists Bruno Latour and Steve Woolgar demonstrate that even within the most pure scientific location—i.e., the laboratory—social factors still shape and direct scientific findings. Others, such as political theorist Langdon Winner, discuss how technologies are not neutral artifacts, but, in fact, can have crucial implications for power relationships and political outcomes within broader society. The intentional construction of bridges in Long Island, New York, to be too short for buses to pass under illustrates the role technology can play in power (and political) relationships. The goal in limiting the availability of public transportation in this way was to restrict poorer, typically minority, populations from accessing beaches and public areas served by the roads, as these groups commonly relied on public transit over more expensive modes of transportation, such as the automobile. Moreover, technological discoveries are not divorced from the social structures of power in which they are produced, and broader social imaginations of technology can shape the direction of technological developments. The upshot of these wider effects is that there are critical choices to be made in technological design, implementation, and execution that, as those decisions are shaped by broader social structures and sequences of interactions, also have far-reaching social and political impacts. As Winner writes, “technologies are ways of building order in our world.” Thus, when applied to the question of the relationship between AI—specifically agentic AI decisionmaking—and civil society, associational politics can play a crucial role in building socio-technical orders that favor democratic outcomes and processes. However, this will not happen automatically. In simple terms, it is imperative to proactively create a relationship between AI and society that enables democracy and responds effectively to the political interests of everyday people.

The relationship between robust civil association and democracy is further complicated by empirical work that responded to Putnam’s initial arguments suggesting robust civil society was the tonic for democratic governance. For example, a range of studies have documented cases in which civil society organizations have had detrimental effects on democratic outcomes. U.S. political scientist Sheri Berman’s work on civil society in the German Weimar period, for example, illustrates how the Nazi party was able to co-opt aspects of robust German civil organizations in a fashion that supported the rise of fascism. Further research demonstrates similar dynamics in Spain and Italy during the end of the nineteenth and into the twentieth century. Consequently, contextual and historical factors matter in how strong civic associations shape democratic outcomes. While Putnam did emphasize this fact, he was perhaps wrong in then generalizing the relationship of civic association, social capital, and democracy to other cases so deterministically. Yet, while we might not be able to isolate a consistent democratic effect of associations across time and space, this does not mean that civic associations and civil society cannot serve as practical tools in a larger tool kit for addressing broader issues of accountability and transparency related to artificial intelligence. Here we can draw lessons from the political theory of American pragmatism, which emphasizes social problem-solving oriented around joint public goals.

Pragmatism highlights the “diversity of perspectives that different individuals and organizations bring to the definition of the problems, and to the generation of possible solutions.” As U.S. political scientist Christopher Ansell notes, “pragmatism is usefully described as a philosophy of evolutionary learning. It emphasizes the ability of both individuals and communities to improve their knowledge and problem-solving capacity over time through continuous inquiry, reflection, deliberation, and experimentation.” While pragmatism is dynamic in its broader philosophical commitments, a constant thread, particularly when connected to public-oriented governance, is an “emphasis on the open-ended process of refining values and knowledge.” In addition, this line of theory suggests the critical nature of learning through encountering tangible problems that require resolution, sometimes referred to as a “problem situation,” a context in which actors must creatively and experimentally find resolutions to new dilemmas. In fact, some scholars have emphasized how pragmatist philosophy can link practical, solution-oriented approaches to “grand problems” with high levels of complexity and uncertainty by offering a “situated, distributed, and processual approach to problem solving.”

This deliberative, action-oriented, experimental view can be supported by associational politics emphasized by Tocqueville, as well as by more contemporary lessons from scholarship on civil society. Consequently, it’s worth synthesizing some of the above discussion into more concise takeaways. Fundamentally, from Tocqueville to modern research on democracy, under the right conditions, civil society groups have been shown to be important component parts of governance due to their capacity to unify citizen interests and improve the accountability and transparency of governing organizations. As a caution, however, research also demonstrates that civic organizations are not simply vacant objects of social good; they need specific social forces oriented toward democratic action. To draw on U.S. philosopher John Dewey, democracy is not something that perpetuates itself automatically.

Accordingly, civil associations need to be positioned around democratic goals and social action. One way is to orient organizations toward joint problems facing large cross-sections of democratic society, including the case of governing and evaluating AI systems. Deliberation and communication will be fundamental to this form of associative politics, as through these processes actors can realize joint interests and civil associations can vocalize and propose creative solutions to communal problems. While technology can complicate these relationships, experimental problem-solving can create a link between society and technology that nudges the development of AI toward accountability, transparency, and democracy. Critically, if people, firms, and government entities are going to use AI agents to inform decisions, of paramount importance is transparency and accountability related to the information and technological systems used to make those decisions.

Model Benchmarking as a New Art of Association

New forms of association can play important roles in combating the consequences of a passively accepted version of algorithmic governance, and its subsequent detrimental impact on democratic society. One such connection is in the practice of benchmarking and evaluating AI models. Though this report leverages the empirical domain of national security as an example because it is the research team’s area of expertise, this argument applies to other areas of governance and society more broadly speaking.

To begin this conversation, it’s worth reviewing what benchmarking is. Benchmarks are datasets designed to evaluate model performance on a specific set of tasks. The processes of benchmarking and evaluating models have become increasingly important as models are deployed in a range of situations that have real-world consequences. Successful benchmarking and evaluation of models not only allow for tracking model improvements but can also identify risks of models that do not perform at an adequate level for the desired use-case.

For example, benchmarks have been developed to test model performance on tasks such as knowledge recall, quantitative reasoning, and other academic tasks. Other benchmarks focus on harmful social biases with respect to gender or race. Additionally, the research team at the CSIS Futures Lab has developed a benchmark, and associated methodology, for tracking model preferences with respect to critical foreign policy decisions in contexts such as crisis escalation scenarios.

Successful evaluation processes can be made more robust through associative practices that unite diverse teams of researchers, policymakers, and civil society actors to steer technological development away from opaque technological systems and toward a form of social organization in which AI public literacy is high, and a wide range of players have a say in the form of technology that is deployed in the public domain. In more tangible terms, this will require ongoing processes of domain-specific data creation and the testing and evaluation of models before and after deployment in contextually relevant scenarios. Key to the practice of benchmarking is the fact that, as Raji et al. argue, “the imagined artifact of a ‘general’ benchmark does not actually exist … presenting any single dataset in this way is ultimately dangerous and deceptive.” Moreover, issues of construct validity—“the degree to which a test or measurement tool accurately measures the construct it intends to measure”—and certain private sector actors successfully gaming benchmark results can problematically skew the reality of model performance. Such conclusions point to the critical nature of having a wide range of experts and actors in the evaluative process of benchmarking highly specialized, yet broadly critical, domains.

Furthermore, such processes must be overseen by organizations with the public’s interest at heart. Here, the information environment will be fundamental. Returning to Tocqueville, information sharing is critical for successful association to occur. Only through making interests and opinions of individuals available for consumption in the public domain can joint interests be realized and successful associations forged. In Tocqueville’s era, local newspapers, as well as townhalls, were the key vector for information transfer. While a broader range of communicative technologies exists today, a core lesson is clear—without communicating facts of interest to broader publics, the risk of isolation increases. When applied to AI, and specifically the task of benchmarking and evaluation, this means transparency and broad-based communication of technological risks and uses are essential to ensuring the public’s interest. Binding interested actors into a dense, yet transparent, web of association has the potential to shape AI’s use in governance away from the consequences of “dead hand” algorithmic governance.

Absent a form of Tocquevillian association related to AI development, the risk is that social passivity gives way to co-option and control of this general-purpose technology to private interests, ceding technological development to a narrower set of goals driven by a select few. Robust associative politics centered around the technology of AI, specifically with respect to evaluation and benchmarking, increase the probability of a strong democratic process working in coordination with efforts of integrating and developing new technologies. This is important as research illustrates that socio-technical relationships structure human agency and can have path-dependent effects as certain relational structures become stabilized, a process in science and technology studies frequently called closure. Because the contemporary era is one in which socio-technical relations with respect to AI are still somewhat flexible, broader U.S. society has its greatest opportunity to structure a relationship more favorable to public—democratic—interests.

A General Review of U.S. AI Policy

Any attempt to initiate a process of associational benchmarking must embed itself within ongoing policy developments in AI. The last three presidential administrations have proposed a range of approaches to both governing AI and incentivizing its integration across the private and public sectors. A review of these developments will provide a basis for building out the associative model of benchmarking governance discussed below. In 2019, Trump signed an executive order (EO) entitled “Maintaining American Leadership in Artificial Intelligence.” This order had a number of goals, including providing a foundation for AI innovation; integrating AI across federal agencies; fostering collaboration between government, the private sector, and academia; and developing technical standards, with the National Institute of Science and Technology (NIST) acting as the coordinating agency. The order sought to balance an innovation-friendly context for AI with the need to protect civil liberties and U.S. values. While the EO established a set of AI-related objectives, some criticized it for lacking details on both funding and practical implementation. That said, later in 2019, the Trump administration released the National AI R&D Strategic Plan: 2019, which outlined a strategy of federal investment into the research and development of AI. The plan included eight specific goals supporting the earlier EO: making long-term investments in AI research; better understanding how AI and humans can work together; addressing ethical and societal implications of the technology; advancing the security and safety of AI; developing data sets for technology development; evaluating AI through standards and benchmarks; seeing to workforce implications; and fostering public-private partnerships.

While the first Trump administration’s efforts related to AI policy mention the need to establish fair, transparent, accountable, and ethical AI, the Biden administration furthered such efforts through the release of what the administration termed a “Blueprint for an AI Bill of Rights” in 2022. This framework underscored the threats advances in AI posed to democratic processes and the rights of U.S. citizens, specifically related to bias and privacy. To attempt to address such issues, the Biden administration, led by the Office of Science and Technology Policy, identified five principles guiding “the design, use, and deployment of automated systems.” These principles were that AI systems should be safe and effective, that AI systems should not propagate biases or discrimination, that there must be protections from abusive data practices, that users should be aware when they are interacting with an automated system, and finally, that there should be clear human alternatives and fallbacks for any problems users have when interacting with AI-enabled systems.

This served as a platform for the eventual signing of the Biden administration’s 2023 EO “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.” The 2023 EO focused on ensuring that the United States remained the leader in “safe, ethical, and responsible” AI use and, to that end, directed federal agencies to establish processes “managing dual-use AI models, implementing rigorous testing protocols for high-risk AI systems, enforcing accountability measures, safeguarding civil rights, and promoting transparency across the AI lifecycle.” An additional key event occurring alongside the Biden EO was the establishment of the U.S. AI Safety Institute (U.S. AISI), a part of NIST, which was tasked with testing models, conducting red-teaming, and attempting to identify risks from increasingly advanced systems, including those with national security implications. Moreover, following guidance from the initial EO, in 2024, the administration released a memo focused on integrating AI into the national security enterprise in a fashion that both leveraged technological advances while also protecting human and civil rights. The Biden administration’s efforts continued through the final days of his presidency, with a second EO released in January of 2025, focusing on AI infrastructure and directing federal agencies to find eligible lands that could be used for “frontier AI” data centers.

Following Trump’s election to a second term, the new administration quickly sought to stamp its authority on the direction of U.S. AI policy. The Trump administration quickly revoked elements of the Biden administration’s executive action, suggesting that the latter had imposed “onerous and unnecessary government control over the development of AI.” The Trump team replaced Biden’s executive guidance with an EO entitled “Removing Barriers to American Leadership in Artificial Intelligence.” This late-January 2025 EO positioned the Trump administration, at least in terms of rhetoric, as far more innovation friendly than its predecessors. Furthermore, the EO provided guidance that relevant agency heads, as well as various presidential technology advisers, should review actions described within the Biden EO for revision or rescindment. The 2025 Trump EO, while short and relatively scant on details, directed the Office of Management and Budget (OMB) to release additional guidance on AI implementation throughout the federal government.

Following this directive, in the spring of 2025, the OMB released two memos further laying out the Trump administration’s approach to AI policy. In general terms, the memos contain guidance for federal agencies to speed up AI innovation and adoption in relevant use cases while also improving public trust in the technology. Moreover, and in contrast to initial indications from the Trump administration, the OMB memos shared many elements with previous Biden policies. One significant change in the policy, however, is the combining of issues related to “rights” and “safety” impacting AI into a single category called “high-impact AI.” That said, as described by a Brookings report, “the continuity [between administrations] is more striking than the change” as both administrations’ approaches appear similar on issues such as technology acquisition and the promotion of AI use throughout government. Yet, a June 2025 change may illustrate the Trump team’s desire to appear more innovation friendly with respect to AI: The AISA has been renamed the Center for AI Standards and Innovation (CAISI). According to the CAISI, the organization will act as the “primary point of contact” for industry members in the U.S. government to assist with testing, securing, and implementing AI systems.

Thus, as indicated by the above review, AI policy in the United States is progressing. However, civil society and robust benchmarking practices must serve as checks and balances to ensure that the technology functions as intended for broader society. Positioning the United States as a leading AI superpower should be accompanied less by regulation than by a robust benchmarking effort that brings civil society into the process.

The Case of AI in Foreign Policy Decisionmaking

To make this more tangible, the following section relies on an empirical discussion of the Futures Lab’s efforts at benchmarking model preferences in foreign policy decisionmaking contexts as well as a review of literature in the field of international relations focusing on the relationship between civil society and foreign policy decisionmaking. Research in international relations has long explored the links between democratic regime type and foreign policy outcomes. For example, democratic peace theory suggests that democracies, whether as a result of institutional structures or normative elements, tend to not go to war with one another. Other scholars have illustrated that, due to fear of electoral punishment, public opinion related to foreign policy can shape leader preferences, particularly when issue salience is high. However, research has also illustrated that the type of information environment—specifically how information flows from political elites to publics—is critical to the extent to which publics can constrain decisionmakers.

That said, advances in digital technology have led some to argue that the constraining effects of democratic publics on elite decisionmakers may be eroded as constituents are further “fragmented and siloed” so that voters support their leaders, regardless of the leaders’ policy decisions. Thus, drawing on the field of international relations, research illustrates that mechanisms for constraining foreign policy decisionmakers can be created within democracies; however, information environments, particularly in the context of the internet, can challenge these processes of constraint. To put this into simple terms, while there are ways to ensure that democratic publics can influence foreign policy decisions, digital technologies present novel challenges to that relationship.

Importantly, advances in AI have the potential to further erode feedback mechanisms between democratic publics and foreign policy experts as the technology is integrated into foreign policy decisionmaking. The Futures Lab’s findings related to its benchmarking of LLMs in foreign policy decision contexts can assist in illustrating this point. Results from the Critical Foreign Policy Decisions (CFPD) benchmark demonstrate that some LLMs are significantly more escalatory in decision scenarios when compared to other models. For example, models such as DeepSeekV2, Qwen2 72B, and Llama 3.1 8B Instruct all tend to prefer recommending the use of force when compared to other LLMs included in the study. These escalatory preferences are particularly salient in scenarios where models are prompted to recommend courses of action for democratic countries, like the United States and United Kingdom, versus autocratic countries, such as China and Russia.

Imagine a scenario in which models are integrated into decision pathways (and are not continuously and robustly evaluated for their risk profiles in a transparent fashion) and nudge decisionmakers toward escalation, in the context of, for example, a Taiwan Strait crisis. While it is highly unlikely that LLMs would ever make such a critical decision alone, if models are leveraged to generate possible courses of action, and they are primed to favor the use of force, decisionmakers could be unknowingly constrained in looking for more peaceful pathways to crisis resolution. Moreover, consider a situation in which private organizations and government institutions are left to evaluate model performance, without any oversight. Absent robust transparent processes of evaluatingmodels, such circumstances could lead to pathological outcomes as private companies seek to overpromise on system performance for financial gains and government institutions endeavor to avoid political consequences from deploying unreliable systems. For example, scholars have noted that private companies may be incentivized to keep critical information, such as code, in machine learning algorithms private both to increase profit and to avoid regulatory roadblocks. Moreover, bureaucratic organizations can, at times, have issues with accountability due to political calculations and the pursuit of narrow institutional interests.

Much like Tocqueville’s warning about the adverse effects of isolation and passive obedience, broader civil society, in a nonassociative form and restricted from involvement in practices of evaluating AI agents, may be left to believe claims that systems work reliably and in an unbiased, rational fashion. In such a situation, the public may lack the capabilities to go behind the curtain and access data in a transparent and accountable way. Or, to draw on Tocqueville, “a tyrant is relaxed enough to forgive his subjects for failing to love him, provided they do not love one another.”

Expanding on Tocqueville’s analysis, the solution will involve tightly binding together civil society actors (ranging from local organizations to mainstream think tanks, research organizations, and beyond) into the processes of benchmarking and evaluating models for use cases relevant to their interests. While regulations, in terms of incentivizing government and private organizations to participate in the process, will be an element of such a vision, a more critical factor will be establishing a culture of civil involvement in influencing how the technology of AI will shape governance in the United States. This applies to cases as diverse as foreign policy to issues of local government. As the following section will outline, a civil culture dedicated to open and transparent forms of AI development is more likely to shape technological—and social—developments in a direction more conducive to robust democratic governance in a context in which AI permeates many decisionmaking domains.

Civil Society in the Age of AI: Indirect Impacts and Democratic Renewal

Tocqueville argued that “as soon as communal affairs are treated as belonging to all, every man realizes that he is not as separate from his fellows as he first imagined.” This broad social realization is key to driving the development of associative action and the building up of a plural set of actors in U.S. society that are involved in the iterative and transparent evaluation of models. Associative behavior, however, must be cultivated. That said, its impacts can be recursive in form. Not only do associations mitigate the threats of isolation, but they also train citizens to be engaged in issues of joint concern and to hold a civic mindset. To draw on John Dewey, such practices must become “a way of life,” or as Tocqueville put it, “an imperceptible influence of habit.” In other words, a collective effort of Americans to ensure technology works for the average citizen must be encouraged and normalized.

In his work on U.S. civil society, Robert Putnam asked, “is technology … driving a wedge between our individual interests and our collective interests?” This is a critical question, and while technology certainly can have this effect—see for example research illustrating that social media can contribute to problematic political polarization—there must be room for an alternative path.

Benchmarking thus can serve as a practical vector for cultivating an associative, civic mindset related to the technology of AI and its relationship with governance to achieve meaningful results. In other words, benchmarking holds the potential to become a critical site of association in which collaborative spaces are generated for policy experimentation matched with public oversight through robust and dynamic evaluation methodologies. Simply put, benchmarking is a tool to ensure the public is aware of the impact that AI is having on their everyday lives.

For example, if AI is going to be involved in making decisions on how certain social benefits are provided, there is no reason why interested members or representatives of the community should not be involved in the evaluative process of system development and be kept abreast of what evaluation results mean for the AI-enabled decisions that could shape life outcomes. This claim is not new; some researchers have already committed to a vision of community-oriented AI development in which AI systems are cocreated with local communities. This vision will undoubtably require a substantial measure of civic responsibility. Therefore, the benefits for communities of interest should not only be made clear; they must also be tangible. AI-enabled systems must feature accountable and transparent feedback loops that are responsive when algorithmic decision systems are failing. Absent such processes, broader democratic society is likely to fail the “tragic double bind” of governing algorithms while “governing by algorithms,” as technology researchers Maciej Kuziemski and Gianluca Misuraca put it. In such processes, large private tech organizations cannot be ignored. Yet, it is possible that, despite economic interests and with the right incentives, companies may prove to be amenable partners. Large AI firms have already set up industry organizations to share best practices in model assessment and agreed to a set of voluntary commitments regarding model performance in public domains, although issues of transparency and accountability remain. Mapping market incentives to civic responsibility will undoubtably be a future challenge in generating an open, plural, and transparent evaluation environment.

That said, there is potential for domain-specific, diverse associational methodologies of evaluation to emerge from below. As suggested by literature focusing on the role of civil society, such associations can serve as an injection of collective democratic spirit into the adoption of AI. These practices must be focused on collaboratively ensuring that governance-related decisions involving AI are continuously evaluated in a fashion that demystifies technological outputs and that ensures that citizens are not driven to simply “accept” algorithmic outputs out of either a lack of understanding or the absence of social structures that allow them to meaningfully contest seemingly unjust or incorrect algorithmic decisions. Moreover, people must feel that they receive some practical benefit of working within such associational models, implying that the socio-technical governance structures that are developed need to actually work for people at all levels of government and must respond to their feedback, ideas, and concerns. The results of evaluations cannot simply be meaningless reports left to rot in untouched data repositories. Importantly, research has demonstrated that there are consequential benefits both in terms of social trust and in terms of economic development for communities with high levels of civic engagement. There are thus clear advantages to achieving this collaborative ideal through AI benchmarking and to cementing benchmarking’s role in empowering new forms of civic association.

Democratic AI Governance: Lessons from Tocqueville and Today

In many mainstream conversations related to AI, scaling is presented as the fundamental solution to advancing the technology. In technical terms, scaling can refer to increasing model parameters, the amount of training data, or computing capabilities to achieve performance benefits. Scaling models in such a fashion takes significant financial resources, and, as a result, has contributed to the centralization of AI development in the hands of private organizations that already have the capacity to collect data, develop larger models, and afford significant computing resources. As a result of this understanding of scaling, the technical capabilities, model development, and overall direction of AI development are concentrated. This form of socio-technical relationship, and its political and economic results, presents a challenge to a more democratic vision for how AI can be integrated into governance. It is also misaligned with Tocqueville’s preference for local initiative and community leadership to address issues of public concern. More democratic-focused efforts surrounding AI will require local participation and association that can assist in generating the robust—and democratically oriented—civil society that this report’s approach highlights.

CSIS Futures Lab researchers are not the first to recognize this issue. Civil society organizations have already called for more distributed models of AI research and deployment. Legislative activity in the United States, such as the CREATE AI Act, is attempting to democratize access to computing resources. Moreover, a robust community of open-source AI development is highly active, illustrated by organizations such as Hugging Face, which hosts a repository of open-source models available for public use. Within open-source AI development, some level of model weights, code, and model parameters are available to the public. Proponents of an open-source approach suggest that it is far more likely to democratize access to the technology and, furthermore, is more in line with scientific practices of transparency and the evaluation of research findings. In addition, within the research community, scholars have begun to pose the question: “How can scientists co-create AI systems with local communities to address context-specific concerns?”

The picture that emerges centers on a range of actors, operating within a regulatory environment that incentivizes participatory action and a democratization of access, coming together to address and experiment with solving practical governance problems that are both posed by AI and, potentially, solved through the targeted use of the same technology. Such a view is supported by modern legal theorists suggesting that a diverse set of actors with clear, vocal interests can contribute to localized policy innovation. Moreover, this methodology emphasizes the associational and civic-minded approach that scholars such as James and Deborah Fallows have revealed through their writings on the United States. Thus, experimentation on, and subsequent technical evaluation of, how AI will interact with democratic governance cannot simply be a top-down driven process but instead must pursue associational pluralism.

While the domains of international relations and foreign policy may seem too abstract for such an associative politics related to general AI development, and the evaluation process specifically, this does not have to be the case. Much mainstream international relations research—particularly from the 1970s and 1980s—treated states as like units, or billiard balls which interacted to either cooperate or compete. Complicating the picture, more recent work, such as that cited above on the relationship between democracy and foreign policy, as well as other research in international affairs emphasizing notions of “human security”—i.e., the desire to shift the lens of security away from states and toward the needs of individuals—has expanded the aperture for what, and who, counts when it comes to international affairs. Moreover, there exists a diverse array of actors with a stake in the field, ranging from citizen organizations, universities, think tanks, NGOs, and others. The point is that even the domain of international relations—at times called the realm of high politics—has an existing grammar and set of actors with which to discuss and experiment with the issues addressed in this paper.

At its core, this paper advocates for a set of AI governance and model evaluation procedures that are multi-scalar, participatory, and grounded in civil society—not simply rooted in top-down regulatory regimes that, although well meaning, may fail to understand the complexity of certain local or domain-specific problems, or that are corporate controlled, and thus driven by market incentives rather than through robust civil involvement. This effort should be independent of government institutions and corporate interests and should be nonpartisan in form. The goal is to promote the maximum exchange of information by holding providers that power agentic systems accountable. Illustrating problematic tendencies or biases can pressure firms to correct and improve model performance, thereby creating a feedback system that incentivizes open information exchange. This argument is not without basis, as research on corporate social responsibility has demonstrated that, in some circumstances, firms respond to public pressures and update business practices.

Toward an Associational Model of AI Benchmarking Governance

Drawing from the prior sections on the critical nature of civic associations in forming more publicly accountable, transparent, and effective practices of AI benchmarking, this section develops a model for associational practices related to AI governance, specifically focusing on AI evaluation and benchmarking. Many proposed AI governance models are broad based and offer wholistic views of the governance environment. They do so by incorporating elements or layers that are as general as categories such as society, technical factors, and ethics. Though these concepts are useful for identifying key considerations for successful AI governance, without more direct conceptualization, they are less practical in terms of generating practices in specific contexts. The broad scope of ethical AI is a good example, as scholarly reviews of the emerging AI governance literature have noted that some discussions of AI ethics struggle to identify the details of real-world implementation, possibly limiting ethical AI’s practical effectiveness. Yet, problematically, some proposed governance approaches fail to discuss ethical dimensions at all, including fairness, transparency, and trust. In fact, some reviews illustrate a general failure to broadly operationalize processes related to AI governance. Thus, focusing on tangible practices within governing structures can help to illustrate what broader categories, such as ethics, mean for practical implementation purposes.

Some researchers address the specificity of organizational governance, offering important insight as to how organizations, such as private firms, play a role in governance structures. However, due to their focus on the level of the organization, these approaches can miss the critical set of players that must be involved in successful governance beyond how individual organizations set up and implement their own AI-related standards. For example, levels of AI governance touch on a range of players (frequently referred to as stakeholders in the governance literature) as diverse as small teams within organizations, large international governing bodies, and even individuals impacted by the technology. Moreover, a diverse array of actors has been instrumental in proposing various models and principals of AI governance, including the U.S. National Institute for Standards and Technology (NIST) AI Risk Framework, the EU AI Act, private organization frameworks from the likes of Microsoft and Google, and civil society groups, among many others.

As such, not only are there multiple levels in which AI governance occurs, but there are also multiple domains of practice in which AI governance must be implemented. As some have noted, governance solutions must be implemented at all stages of the AI development lifecycle, from model development to deployment. This includes factors in which benchmarking plays a role, for example, those related to testing, evaluation verification, and validation.

To make this discussion more practical, the following section focuses specifically on the critical area of benchmarking and its possible role in AI governance, while also acknowledging the complex array of actors involved in any truly associational model of governance. The fundamental focus here is grounding benchmarking in transparency and accountability based on domain-specific, expert-crafted data, whether the context of the evaluation process be benchmarking model reasoning capabilities on tasks such as foreign policy decisions or evaluating model performance on local community implementation of AI to assist in providing efficient services. It is through increasing transparency and accountability that U.S. citizens will be able to better decide how (and if) the technology is working to improve their lives in meaningful ways.

Rather than merely suggest that critical factors such as transparency are important in AI governance, this report posits a model for implementation that puts transparency and accountability into practice, thus contributing to calls from researchers to ensure that there is a “how for every what” involved in the governance of AI and ML technologies. In other words, this report attempts to translate abstract principals into specific practices, roles, and benefits. Thus, the proposal outlined here is both micro-focused in that this model isolates one aspect of AI governance (evaluation and benchmarking) and macro-focused as it applies to a general range of domains from local community organizations to the governmental level. Simply put, it is a model that, while focused chiefly on benchmarking, can apply to many real-world use cases.

Figure 1 depicts this model. It starts by describing the constituent elements located at the central part of the model and then laying out their interactions—the most fundamental piece of the associational model. Three component parts contribute directly to the proposed model’s main purpose—a robust AI evaluation and benchmarking cycle—which is further described below. Table 1 summarizes the roles and functions of the various elements involved in this cycle, as well as the benefits each element receives from participation in the process.

Civil Society. The first component part, as emphasized most directly in the above discussion, consists of civil society actors. With respect to a technology like AI that is largely a general-purpose technology, actors comprising this constituent part of the model are intentionally diverse. For example, civil society actors with interests in the evaluation and benchmarking of AI could range from think tanks to universities and to local community organizations concerned with mitigating adverse impacts of poor performing AI on local services.

Within this associational model, civil society serves three key roles. The first is through the expression of interested community actors and organized groups in shaping technological performance. Due to the likely wide impacts of AI integration (from computer vision to language models), there are obvious incentives for civil society groups to become engaged in signaling their interests regarding model performance. Groups interested in AI evaluation also provide inputs related to how civil society organizations see the technology as benefiting their specific communities, along with the worries and concerns the same groups have regarding possible technological abuse and undesirable performance. As a result of this relationship, public institutions and technology companies receive important inputs on what broader social groups want—and do not want—from AI-enabled technology. These lines of communication between civil society and government and technology developers will be fundamental to ensuring that technology firms and government are aware of how civil society organizations imagine AI integrating into their daily lives.

▲ Figure 1: Associational Benchmarking Concept

Second, because civil society organizations can operate outside the incentive structures of private businesses and public institutions, they are able to provide important checks and balances by demanding transparency and accountability from other constituent parts of evaluation processes. Of course, transparency and accountability do not come from external demands alone, thus requiring important legislative enablers, but they can provide pressures for governments and private organizations to act or change behavior due to, for example, invoking reputational concerns or illustrating broad social desires for action. Additionally, civil society organizations can serve as key interlocutors for publishing evaluation metrics, writing public-facing reports on benchmarking studies, and conducting reviews of evaluation results related to real-world use cases. In basic terms, civil society can push the development and deployment of AI in a direction that serves broader social interests.

Third, as discussed above, valid benchmarking datasets and evaluation processes require coherent, domain-specific knowledge that can be updated and refined as contexts change. In many circumstances, civil society organizations can assist in providing domain-relevant expertise and knowledge that typically cannot be replicated within governing institutions and private technology providers. Moreover, the organizations can serve as key partners in generating specific metrics and operationalizations of domain-specific knowledge that can be integrated into benchmarking procedures. Without valid metrics, the construct validity of any evaluation process will be dubious. For instance, academic and scientific organizations retain a great deal of critical systematic knowledge relevant to real-world AI use cases that, absent transparent dialogue, cannot be obtained in a manner workable for designing model evaluations. The basic point is that expert-created data and evaluations will improve the quality of AI models and better demonstrate their strengths and weaknesses. To facilitate access to this data, government institutions, such as the Library of Congress, could serve as publicly accessibly data repositories for high-quality, curated, pooled data training that public organizations can access for free.

Importantly, as discussed in greater detail below, government must invest in broader AI literacy to improve the feedback mechanisms between civil society and other constituent parts of the model. Not all relevant civil society organizations with interest in participating in associative processes related to model benchmarking will have the specific skill set to make necessary contributions from the onset. If the U.S. public does not have a basic level of AI literacy, as well as access to resources to experiment with AI, civil society will be a less effective form of checks and balances.

Technical Providers and Systems. The second component part comprises technical providers and systems. As with civil society addressed above, this component can be multidimensional, including, first off, the critical actor set of private technology organizations. These organizations include the major AI developers: OpenAI, Anthropic, Meta, XAI, Google, and others. These private businesses are in the unique position of having the financial and compute resources, as well as the technical expertise, to build, train, and offer public-facing AI products. Here, the commonly discussed “black box” of AI- and ML-related technologies poses the highest risk, as these actors at times have incentives to restrict outsider access to algorithms, model weights, and the like. There is also generally a wide gap between technical knowledge within these organizations compared to the general public. The consequence is high levels of opacity due to technical literacy differences between communities. Moreover, these companies typically offer application programing interfaces (APIs) that allow businesses and individuals to integrate high-performing models into their own workflows, which then may be converted into consumer-facing products, making technical providers key actors in broader business (and other organization) processes in which AI is increasingly deployed. Yet technology companies are not the only actor that could qualify within this constituent part of the broader model. Open-source approaches to AI development allow for individuals and institutions to host models and to train them to their own specific purposes without access to a particular corporation’s API, thus complicating the direct roles of technology provider and system developer. The basic point is that developers of AI products must be involved in governance processes for them to be effective.

Within evaluation processes, the role of these groups is critical as, in general, companies will likely continue to retain the highest degree of combined technical expertise in crafting model evaluations along with the related financial resources. Moreover, from an operational perspective, owing to their technical expertise, the technical providers and systems are important in offering guidance on converting local domain and institutional knowledge into a data structure on which a benchmark can be created and model performance measured. Due to the aforementioned motivations driving many private companies comprising this element of the model, the enabling legislative environment, discussed in greater detail below, must structure incentives in a fashion that drives benefits for technology providers toward open and accountable participation in the benchmarking and evaluation processes. This effort at associational benchmarking will be most effective if friction between technology providers, government, and civil society is minimized through the right legal incentive structure.

Public Institutions. The third component part of the associational model presented here is public institutions. This includes organizations ranging from federal departments that are part of the executive branch to local governments that may be interested in implementing AI tools in their own daily workflows or service provision. Although organizations can pursue their own bureaucratic political interests, the ideal function of public institutions, at least in a democratic context, is to implement policy in service of the public. What these institutions contribute in the context of this governance model is that they are likely integrating AI into service provision and thereby impacting governing practices and policy implementation in a range of domains. In fact, recent executive guidance has directed federal departments to develop institutional policies and procedures for implementing AI. The involvement of public institutions is unlikely to decrease. As such, these institutions will serve as a major touchpoint between the public and AI’s real-world use cases in circumstances ranging from obtaining government benefits to courts making decisions on the likelihood of recidivism. Moreover, such institutions retain significant public data on which institutional models are likely to be trained (and privacy concerns made acute) as well as specific expertise related to the institution’s particular purview (for example, the Environmental Protection Agency on environmental issues or the National Institutes of Health on issues related to public health). As such, they will be critical in developing relevant benchmarks and metrics associated with public service performance.

Finally, public institutions, such as NIST, are critical in creating regulatory and standard-setting apparatuses in many scientific and technical contexts from which private organizations will derive their own institutional policies. In this space, CAISI (previously the U.S. AISI), as part of NIST, will play a key role as it operates as a standards-setting body and works with industry members to set voluntary commitments on model evaluation and security. CAISI has already been an active participant in model evaluations. For example, in late 2024, alongside the UK AI Safety Institute, CAISI conducted a pre-deployment evaluation of Open AI’s o1 model, focusing specifically on cyber, biological, and software and AI development. Moreover, in early 2025, CAISI released a technical report related to initial evaluation results on how AI agents risk being subject to “high jacking,” in which malicious actors inject unwanted instructions into AI agent workflows. NIST and CAISA must continue to play a meaningful part within this benchmarking governance model.

Consequently, in terms of the associational evaluation model presented here, public institutions play a fundamental role. First, they are key targets of transparency and accountability feedback loops that are driven by civil society and civic association. If effective AI governance is to be implemented, it is up to such institutions to act on and respond to benchmarking results in meaningful ways by updating training data, workflows, and other relevant practices in response to feedback. Second, such organizations are likely to act as key data brokers in their specific domains as they work with private companies to train and deploy models on institutional use cases. In simple terms, without the commitment of public institutions, effective implementation and governance of AI agents will remain out of reach.

▲ Table 1: Summary of Model Elements

Evaluation cycle. The three constituent parts discussed above are linked within the associational evaluation cycle, with each element providing feedback and incentives to the other component parts. Links between these component parts are representative of both communication and accountability. Responsiveness to impacted communities is fundamental to this associational model. This section will briefly outline this recursive relationship along three parameters: (1) benchmark creation; (2) implementation, analysis, and dissemination of results; and (3) real-world responsiveness.

First, benchmark creation requires critical inputs from all constituent parts of the evaluation cycle. Civil society groups, for example, can be instrumental in expressing local interests, and, in certain circumstances (particularly in the case of think tanks or universities where specified knowledge is available), creating scenario-specific data on which to evaluate model performance. Thus, along this parameter, information flows to both technical providers and public institutions (as indicated by the arrows in Figure 1). Civil society groups transfer information in the forms of expressing a civil association’s specific interest when it comes to model performance, assisting in generating data for a benchmark, and providing input on which metrics account for a fair and realistic measure of performance on the relevant task. For benchmark creation, technical providers and systems offer important technical knowledge on how to design benchmarks valid for domain-specific data structures that they can share with partners in civil society and public institutions. Moreover, they can provide feedback on how technological progress in AI, largely driven by private firms, may shape requirements for updating relevant benchmarking practices. Finally, public institutions will be key in both listening to and engaging with relevant civic groups interested in involvement in benchmarking. Moreover, due to domain expertise and data access, public institutions will be crucial in generating contextual knowledge to include in benchmarking and, working recursively with interested civil society groups, in developing relevant metrics on which to assess evaluation results. Proper metrics thus will be the outcome of deliberative processes between stakeholders and based on domain-specific knowledge. Importantly, without the participation of multiple stakeholders in benchmarking, the results of benchmarking studies may lack real-world applicability.

Second, with respect to implementation, analysis, and information dissemination, civil society interacts with the other constituent parts largely in terms of interpreting benchmarking results according to their own local or domain-specific knowledge, as well as by acting as a key driver of transparency and accountability. Within the context of associational benchmarking, civil society organizations should participate in public advocacy and serve as essential providers of data and metrics to the public, particularly through creating and disseminating accessible information relevant to both their own association members and to the broader interested public. Moreover, because civil society operates outside of the for-profit incentives of private corporations and the bureaucratic pressures of public institutions, civil society organizations (particularly universities and think tanks) can offer objective analyses of evaluation results and their implications. During implementation and analysis, technology providers and systems typically should offer compute, model hosting, and other related technical skills to make evaluation processes methodologically consistent and relevant for the most advanced AI technologies (whether in computer vision, natural language processing, or others). Finally, public institutions within this parameter should focus on analyzing results specific to institutional goals based on agreed upon metrics. Along these lines, they should provide access to information to civil society organizations on evaluation results to ensure transparency and accountability to public interests. As with benchmark creation, meaningful participation of all stakeholders is crucial. This means matching local expertise with technical experience in interpreting benchmarking results and government actors maintaining awareness of how results link to their institutional goals.

Finally, as emphasized above, benchmarks should not be seen as simple technical resources, but rather as a set of socio-technical relations. As such, linking evaluation results to real-world processes of responsiveness by updating social and institutional practices as well as data resources is fundamental to this associational model functioning outside of theoretical terms. Thus, civil society actors must provide indicators to technical systems and providers, as well as public institutions, regarding how downstream AI applications are shaping their specific interests. In addition, such groups must be prepared to apply pressure to public institutions and technical providers who may fail to commit to transparent practices and who lack accountability. When applicable, technology providers and systems must be prepared to share how their models perform on benchmarks relevant to a broad range of civic actors. In the long run, the technology providers and systems will benefit from group feedback on how their products perform in a range of contexts, thereby allowing them to make product improvements and increase user trust. Lastly, public institutions need to be responsive to updating and curating relevant institutional data while also retaining enough institutional and technical flexibility to adapt practices according to the results of evaluation. This is critical, as maintaining a stagnant approach within the highly fluid context of governance will likely lead to undesirable technological performance, and thereby worse public services. Notably, it is perfectly possible that evaluation results may demonstrate technological limitations, thus illustrating that certain use cases are not appropriate, or remain too high risk, for AI integration into institutional practices at this stage. Even with all the technical best practices imaginable, if evaluation results do not result in required change, AI will likely fail to have a positive impact on broader society.

Apart from the constituent elements of the central evaluation cycle, two constraining and enabling conditions feature in the model: the domain and the legislative environment. While each will vary based on the context in which benchmarking is needed, both will necessarily exist in some form. Moreover, both generate the type of data that will be needed for evaluation purposes and for the incentive structure under which key actors operate.

Domain. All benchmarking and evaluation processes, particularly those in contexts relevant to governing, must be shaped by domain-specific considerations. These specific considerations should orient and guide the entire benchmarking and evaluation cycle discussed above, including the technical processes of implementing evaluations; different use cases will require unique considerations as well. Moreover, the domain will influence what sort of organizations (e.g., environmental, legal, and human rights organizations) participate in the evaluation processes of AI technologies. Data curation and subsequent model evaluation should include a range of domain-specific experts, depending on the desired AI use case. In addition, this should involve leveraging civil society associations, among other possible pools of experts, in the entire evaluation lifecycle. Ignoring this critical element risks evaluation results not properly corresponding to the real-world contexts in which the technology may be used, thus seriously threatening a benchmark’s construct validity, or put simply, its applicability to the real world.

Legislative Environment. All elements of the model are embedded within the legislative environment and, therefore, this environment must serve as an enabler of the associative model proposed here. While legislation and regulation are unlikely to solve all governance issues related to AI, they can create incentive structures that give value for participation to a broad range of stakeholders. Moreover, legislation can enable civic groups to have a real impact on model benchmarking and, thus, on how models are deployed. This includes supporting efforts aimed at increasing AI literacy among civil associations in local government contexts in order to ensure more robust and effective benchmarking processes and creating funding pools to enable expert civil society organizations to commit labor and other resources to creating robust evaluation cycles.

Moreover, a truly enabling legislative environment must manage power differentials between organizations that may be involved in benchmarking processes. Technological development and deployment take place within large social power structures, and the gap between the financial resources of exsisting AI companies and, for instance, locally oriented civil society groups will be vast. Thus, it is important to create an environment in which civil associations, the heart of this model of governance, do not get steamrolled by more powerful interests, compromising the validity of evaluative processes and their benefits. Within this model, truly robust benchmarking can only take place if civil organizations’ concerns are heard and acted upon rather than ignored in favor of more powerful corporate or political interests.

The final element of the model centers on the components that emphasize the process outputs and the requirements of updating inputs.

Process outputs. The evaluation cycle, per the stated goals of transparency and accountability, will result in multiple outputs, including publicly available metrics and reports that indicate model performance on domain-specific tasks. However, process outputs must also lead to iterative updating of institutional practices that align AI workflows and use cases with evaluation results. Moreover, new technical methodologies in domain-specific evaluation are likely to emerge through this process, and civil society organizations will gain insight into how AI impacts their direct interests, along with increasing their organizational knowledge about evaluation and benchmarking as a practice. That said, as discussed below, leaving these outputs stagnant sets up the broader process for failure; as such, recursive updating of process inputs is key. The data and metrics that result from evaluation studies must be leveraged to inform future AI development.

Updating Inputs. At its heart, the model presented here emphasizes associational politics, deliberative processes, and experimentalism. Each aspect has roots in the literatures of democratic political theory, philosophy, and experimental governance. As such, this model includes the essential function of updating inputs recursively, underscoring the need for constant analysis and iterative learning through reaction to prior operations and deliberation between constituent components of the evaluation cycles. Moreover, this model emphasizes experimentation in that it calls for “ongoing, reciprocal readjustments of ends and means.” Actors must reflexively learn and scrutinize their own actions and interests in pursuing the collective goal. This model builds these requirements as new inputs as the process cycles through continuous iteration. Without adapting to new information, actors risk deploying technology to use cases it is not built for, resulting in undesirable consequences for everyday users.

In summary, each component part of the model described above must work in conjunction with the others to achieve the desired outcome of responsive, domain-specific, benchmarking efforts. By including the proper legislative incentive structures and tangible benefits, each actor within the model can be incorporated into a cycle of positive feedback loops that can sustain the model’s relevance and success. The following section will identify three recommendations that could serve as enablers to implementing this model in real-world contexts.

Policy Implications and Recommendations

It is worth linking the above discussion to ongoing developments in U.S. AI policy. As addressed in more detail above, recent guidance from the Trump administration has rearticulated much of the work done by the Biden administration in establishing national AI policy. In doing so, the Trump administration, particularly in terms of rhetoric, has preferred a more innovation-forward approach to implementing AI across the U.S. government. This is seen in a range of policy guidance from the administration, including the aforementioned executive order entitled “Removing Barriers to American Leadership in Artificial Intelligence” as well as guidance from the Office of Management and Budget (OMB) in the form of “Accelerating Federal Use of AI through Innovation, Governance, and Public Trust” and “Driving Efficient Acquisition of Artificial Intelligence in Government.” While innovation and adoption of AI throughout the U.S. government may be a priority of the Trump administration, the current approach risks failure without reliable and iterative evaluation practices that emphasize input from civil actors. Voices from the bottom-up need to be key drivers of technology implementation.

To be fair, the above-mentioned federal guidance appears to be amenable to many of the items proposed in the associational evaluation model of this report. For example, the OMB’s memorandum on “Accelerating Federal Use of AI through Innovation, Governance, and Public Trust” notes the importance of protecting civil rights and liberties and developing responsible AI, and it calls out the critical nature of transparency, governance, and public trust. Moreover, it identifies the need for continuous monitoring of AI systems, particularly in high-impact use cases, along with supporting broader AI literacy among agency employees. Finally, in line with aspects drawn out in the model discussed here, this OMB memo highlights the importance of numerous and diverse stakeholders, including private actors and external experts in AI, along with the need for the independent review of AI systems, particularly in high-impact use cases.

There are a few issues within the OMB guidance worth further assessment, as well as discussion regarding how the model proposed here can support and improve recent policy guidance. For example, independent reviews within the OMB document are stated to be conducted by a reviewer “within the agency” who has not been a part of system development. While this is certainly an important step, within-agency review may not sufficiently isolate the reviewer from political and bureaucratic pressures. As such, incorporating civil society organizations into the process of model assessment and evaluation can add an additional layer of separation between bureaucratic political pressures and AI evaluation, thereby contributing to more objective and transparent analyses of benchmarking results. Additionally, while innovation is key, emphasizing this factor to too great a degree may undermine other components of federal guidance, including developing and implementing AI responsibly, improving public trust, and avoiding harm to U.S. citizens. Depending on the domain, incorporating a range of civil society actors into the evaluation cycle and integrating robust and transparent feedback mechanisms could facilitate a pragmatic balance between technological innovation and building systems that work as intended within specific contexts. Importantly, increasing trust and accountability within AI governance will lead to more robust and sustainable development of AI-associated technologies. Finally, while sensible in general terms, proposed efforts at improving efficiency and reusing elements of AI development through agency coordination could lead to issues in downstream technology deployment. Unrestrained efforts at improving implementation efficiency and the reuse of technical systems could lead to circumstances in which systems and data designed for one context are applied to a different, non-applicable case, increasing the risk that the system will not perform as desired due to contextual variation. This is why continuous benchmarking on domain-specific tasks, including constant iteration and feedback loops, is key.

This discussion leads to three substantive recommendations (summarized in Table 2 below). First, legislators at the federal and state levels should pursue legislation that encourages and enables civil society actors to become involved in independent benchmarking efforts, while working with private technology providers and public institutions. This could include offering funding resources for think tanks, universities, and even local community organizations to become meaningful stakeholders in the process of benchmarking and evaluation. This could also incorporate increasing resources for upscaling broader AI literacy in existing civil society organizations and encouraging these organizations to position themselves as key stakeholders within conversations on AI governance, benchmarking, and evaluation.

Moreover, Congress should begin to explore which congressional committees have the authority to hold routine hearings on model benchmarking and evaluation to ensure that iterative, inclusive, and robust processes remain in play across private and public sector contexts. Currently, a range of committees have at least some form of jurisdiction related to AI policy development, including committees as diverse as the Judiciary Committee in the Senate and the Foreign Affairs Committee in the House, reflecting the wide impact AI will have on questions of governance. Congressional committees with interests in AI should begin the process of holding benchmarking-related hearings to hold AI firms and federal agencies accountable, while also looking to the best interests of the people they are elected to represent. Within such a process, civil society organizations can leverage their traditional advocacy roles to make local interests heard by congressional representatives, as well as offer expert testimony on the evaluation results and technological impact they observe in their local contexts.

Second, legislative bodies at all levels of government should pursue creating meaningful incentives for AI firms and public institutions to coordinate their benchmarking efforts with civic society organizations. This can help to match civil society organizations’ domain expertise, local concerns, and user feedback with any deployed AI agent. Incentives could include formal legislation requiring AI companies to operate in specific ways or incentives which hope to shape corporate behavior without specific standards. Legislators could even use state or federal tax law in order to manage AI firms’ incentives. Historically, these tools have been used successfully in other contexts, such as influencing corporate hiring practices. Of course, such a strategy would forgo tax revenue from some of the United States’ largest companies, requiring a careful legislative strategy that balances desired corporate behavior with necessary tax revenue. Moreover, legislation should encourage open, transparent, and accountable evaluation practices, and could even establish standard practices for benchmark reporting. Similar reporting incentives have been used with regard to corporate disclosures of supply chains and labor abuses, although their effectiveness remains a point of a debate. Private firms have already offered a variety of pledges toward building more robust and safe AI systems, for example, in the form of creating nonprofit industry groups and voluntary commitments to shared safety practices in model development and deployment. Additionally, legislation could encourage these private organizations to pursue additional commitments aimed at integrating civil society actors—as well as committing to greater degrees of transparency—within the evaluation processes in line with the general model of associational benchmarking proposed here. A key consideration in legislative processes needs to be harmonizing local, state, and federal AI policies into a functioning mosaic of governance. Excessive fragmentation of AI policy could lead to difficulties as organizations attempt to navigate different rules for model evaluation and deployment in different localities. Efforts to disincentivize state-level regulations on AI look to address this issue; however, these efforts challenge core components of the U.S. federalist system. Moreover, lawmakers will need to balance the need for oversight with the desire for further innovation.

Third, and perhaps most critically, the United States’ philanthropic foundations can play a fundamental role in coordinating the funding of multiple major cross-disciplinary benchmarking initiatives to enhance and preserve the independence of a modern form of accountable, associational model evaluation. Without independent financial support, there is a risk that evaluation processes will be co-opted by narrow interests, thus turning a potentially productive tool of measuring the technologies’ performances into something far more superficial. In fact, recent research has demonstrated that some model leaderboards related to popular AI benchmarks can be gamed by large private companies, which are incentivized to present their products as the best-performing models. Involving independent actors with interests in the public good and not financial profit, and providing them with the resources to meaningfully participate, could help to mitigate such problems and make evaluation practices more robust.

While it is important to recognize that foundations, and their philanthropic contributions, have a more complex history than can be explored here, there is a clear precedent for their inclusion in this process. Researchers studying foundations suggest that philanthropic contributions commonly focus on the creation of “something new,” whether that be related to social arrangements, the arts, or science. Moreover, foundations can serve as social entrepreneurs that “respond to needs or problems that are beyond the reach” of current market incentives or government capacity. While the grant-making capabilities of foundations are key, particularly in the specific contexts of supporting research initiatives, these organizations also have the ability to leverage social mechanisms that legitimate new forms of organization and that develop professionals to support that same organizational infrastructure.

Foundations also play a key role in supporting civil society actors, such as NGOs. For example, the MacArthur Foundation and the Mott Foundation both provide grants to various NGOs working in the United States and globally. The grants range from support for organizations working on managing global crises to those addressing climate change. Historically, foundations have enabled key research in the social and hard sciences. The Rockefeller Foundation played a key role in the establishment of biomedical research and molecular biology. Grants by the MacArthur Foundation helped to build up a range of research institutions working on managing risks from nuclear weapons, such as Stanford’s Center for International Security and Cooperation and Harvard’s Belfer Center. The Russell Sage Foundation has focused on supporting social science researchers and contributed to the development of “social indicators” to help inform data-driven social policy. The editors of a foundation-supported volume asserted that such indicators are “imperative for the guidance of social policy.” As a final example, the Gates Foundation is a critical cog in foundation-based global health and development initiatives. Foundations thus support a wide range of pursuits in the social and hard sciences across a range of domains.

Critical for this discussion is that the broader fields of computer science and artificial intelligence also have roots in foundation support. As a significant example, the Macy Foundation contributed to a series of conferences beginning in the 1940s that set the stage for the development of the field of “cybernetics.” While the field is far reaching in scope, impacting disciplines as diverse as neuroscience and various social sciences, cybernetics was highly influential in early research into computing and AI. Its influence largely lies in the introduction of concepts such as feedback mechanisms, control theory, and complex systems. In basic terms, cybernetics focuses on how feedback loops shape behavioral patterns in both human and machine systems. Key attendees of the Macy conferences included Norbert Wiener, generally considered the father of cybernetics; John von Neumann, the creator of the Von Neumann architecture, which remains an influential paradigm in modern computer design; and Warren McCulloch and Walter Pitts, key figures in the development of the initial model of the neural network. Foundation support can be said to have helped create the conditions for today’s advances in AI and computing.

Foundation grants in the domain of AI continue to be influential across a diverse spectrum of viewpoints. Philanthropic organizations, such as Open Philanthropy and the Future of Life Institute, provide grants for AI-related research, particularly focusing on managing “existential risks.” Moreover, established foundations such as the Ford and MacArthur Foundations have funded AI-related organizations, including the AI Now Institute, which focuses on issues of AI-related surveillance and power consolidation in the hands of technology companies.

The history and current funding profile of U.S. foundations have clear relevance for the CSIS Futures Lab’s proposed associational model of AI evaluation and benchmarking. While the foundation approach to funding is not perfect, it does play a critical role in how research organizations and civil society groups gain support for their initiatives. For example, foundation contributions could encourage the emergence of new civil society organizations oriented toward transparent benchmarking practices across a range of domains. This type of foundation support could fill in the gaps where market or government incentives for such funding may be limited. Moreover, foundation-based funding efforts could assist in creating new research fellowship streams that are focused directly on bringing domain-specific experts into benchmarking processes, whether these be individuals from small local organizations focused on community issues or experienced researchers with specific disciplinary skills and knowledge. However, effective relationships will need to result in long-term funding commitments to avoid funding gaps and to reduce the risk that funding organizations rapidly change priorities.

The recommendations outlined above seek to create a policy environment that enables AI firms and civil society groups to work together in a productive fashion to produce more robust and dynamic benchmarking outputs. Such cooperation will improve the likelihood that AI works for broader segments of society on tasks that require specific expertise and knowledge. Moreover, it suggests that U.S. foundations can play a crucial part in establishing civil society groups that have the talent and resources to participate in this associational model of AI benchmarking and evaluation.

▲ Table 2: Policy Recommendations Summary

Conclusion

In the near term, AI agents will not simply function as a tool but will shape (and to a degree already are shaping) how decisions are made and who gets to make them. Whether it be in the context of national security and foreign policy decisions or related to getting access to government benefits, decisions shaped by AI systems are set to influence people’s life outcomes. To avoid a scenario in which society relinquishes a detrimental amount of agency to opaque AI systems, thereby challenging democratic principles of governance, a renewed form of associative action is required. This form of association must bring many stakeholders to the table—including policymakers, researchers, civic leaders, and interested members of impacted communities—to generate robust, dynamic, and responsive benchmarking practices. Critically, this process must be transparent and open. As others have argued, “transparency is an essential precondition for public accountability, scientific innovation, and effective governance of digital technologies. Without adequate transparency, stakeholders cannot understand foundation models, who they affect, and the impact they have on society.” Moreover, researchers have suggested that transparency can be fundamental for “reducing the mystique and opaqueness of AI to the general public.” While research has shown that transparency is not a direct cause of accountability, it is a necessary condition.

In line with this goal and drawing from the work of Tocqueville, scholarship in deliberative democracy, and experimentalist approaches to governance, this analysis has presented an associational model of benchmarking to assist in visualizing our broader argument. Moreover, this report has offered a variety of recommendations related to government and foundation policy that could assist in enabling this vision to become a reality. Fundamentally, in an open and democratic society, it is not enough to build powerful and capable models—we must also build the civic institutions and practices capable of questioning them.

Benjamin Jensen is director of the Futures Lab and a senior fellow for the Defense and Security Department at the Center for Strategic and International Studies (CSIS), where Dr. Jensen leads research initiatives on applying data science and AI and machine learning to study the changing character of war and statecraft. Under his leadership, Futures Lab has pioneered building AI applications into wargames and innovative scenario exercises. The exercise topics range from major war, competitive strategy, and national mobilization to economic security, energy politics, and national resilience. He is also the Frank E. Petersen Chair for Emerging Technology and a professor of strategic studies at the Marine Corps University School of Advanced Warfighting (MCU), where he leads a research program on future war and teaches seminars on modern operational art and joint-all domain operations.

Ian Reynolds is the postdoctoral fellow for the Futures Lab in the International Security Program at CSIS. His research focuses on the intersection of technology, science, and international security. Ian’s dissertation addressed the history and cultural politics of integrating artificial intelligence into military decision-making processes in the United States.

The Republic of Agora