PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives

Peking University · USTC · WHU · BIGAI · SJTU · Zhongguancun Academy

TL;DR: We introduce PoliCon, a benchmark constructed from 2,225 real-world deliberation records of the European Parliament, which evaluates the ability of LLMs to craft consensus across varying collective decision-making contexts and political requirements.

Abstract

Achieving political consensus is crucial yet challenging for effective social governance. Although frontier AI systems represented by large language models (LLMs) have developed rapidly in recent years, their capabilities in this domain remain understudied. In this paper, we introduce PoliCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament spanning 13 years (2009 to 2022), to evaluate the ability of LLMs to draft consensus resolutions from divergent party positions under varying collective decision-making contexts and political requirements. Specifically, PoliCon combines four factors to build each consensus-finding task environment: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also develop an evaluation framework for PoliCon grounded in social choice theory, which simulates the voting behavior of the different political parties to assess whether LLM-generated resolutions satisfy the predetermined consensus requirements. Our experiments show that even state-of-the-art models fall short on complex tasks such as passing resolutions by a two-thirds majority and addressing security issues. They also uncover inherent partisan biases and reveal strategies LLMs adopt to reach consensus, such as prioritizing the stance of the dominant party rather than uniting smaller parties. These findings highlight PoliCon's promise as an effective platform for studying LLMs' ability to promote political consensus.

Overview

An example scenario in PoliCon. In each task, PoliCon builds a collective decision-making environment with varying political goals, power structures, issues, and participating parties. The tested LLM then attempts to achieve a consensus resolution based on these setups and the divergent party positions. The outcome is evaluated first via a simulated vote and then mapped to a quantitative score according to the specific environment setting by PoliCon’s evaluation framework.

The upper part presents the setting of the current collective decision-making environment. The seating colors in the figure represent the seat distribution among the four participating parties: GREEN/EFA (red, 50%), EPP (green, 20%), GUE/NGL (purple, 20%), and EFD (blue, 10%). The lower part demonstrates the LLM's consensus-finding process. The parliamentary president announces the need to discuss the issue of surplus dairy products and introduces the political goal: passing the resolution with a two-thirds majority among the members of the European Parliament (MEPs).

Subsequently, each participating party expresses a divergent position on the issue. For example, EPP and EFD disagree sharply over extending storage time, while GREEN/EFA and GUE/NGL differ over export refunds. Although the resolution generated by the evaluated LLM partially accounts for the apportionment of seats among the parties in balancing their conflicting positions, it fails to forge a new consensus beyond a simple compromise. As a result, in PoliCon's evaluation, 50%, 90%, 70%, and 30% of the MEPs from GREEN/EFA, EPP, GUE/NGL, and EFD, respectively, vote in favor of the resolution. Weighted by seat distribution, only 60% of the entire parliament votes in favor, which falls short of the two-thirds threshold, so the resolution is not passed.
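The tally above reduces to a few lines of seat-weighted arithmetic. The sketch below is illustrative only: the function names and data layout are our assumptions, not PoliCon's actual evaluator code.

```python
def weighted_support(seat_shares, approval_rates):
    """Fraction of all MEPs voting in favor: each party's approval
    rate weighted by its share of parliamentary seats."""
    assert abs(sum(seat_shares.values()) - 1.0) < 1e-9
    return sum(seat_shares[p] * approval_rates[p] for p in seat_shares)

def passes(seat_shares, approval_rates, threshold):
    """Does the resolution clear the required majority threshold?"""
    return weighted_support(seat_shares, approval_rates) >= threshold

# The figure's example: seat shares and per-party approval rates
seats = {"GREEN/EFA": 0.5, "EPP": 0.2, "GUE/NGL": 0.2, "EFD": 0.1}
votes = {"GREEN/EFA": 0.5, "EPP": 0.9, "GUE/NGL": 0.7, "EFD": 0.3}

support = weighted_support(seats, votes)
# 0.5*0.5 + 0.2*0.9 + 0.2*0.7 + 0.1*0.3 = 0.60
print(support, passes(seats, votes, 2 / 3))  # 0.60 False
```

As in the figure, 60% overall support passes a simple majority but fails the two-thirds requirement.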

PoliCon Benchmark

Topics

There are 5 coarse-grained and 19 fine-grained topic categories of issues in PoliCon. The shade of the color indicates the proportion of each fine-grained topic within its coarse-grained topic; the darker the color, the higher the proportion.


Stances Diversity

Semantic representation distribution of party stances (indicated by their symbols) in the 7th (2009-2014) and 8th (2014-2019) terms of the European Parliament in PoliCon. A core requirement for meaningful conflict is that the participating parties hold genuinely different positions on each issue, and the parties in the European Parliament do show clear differences in political position. To demonstrate this, we randomly sample 200 stance data points from each party during the 7th and 8th parliamentary terms, use OpenAI's text-embedding-3-small model to map each party's stances into a semantic representation space, and apply Principal Component Analysis (PCA) to visualize the result. As shown in the figure, the stances of each party form distinct clusters in the semantic space, with significant differences between parties.
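The pipeline above (sample stances, embed, project with PCA) can be sketched as follows. Random vectors stand in for real embeddings here, and the 1536-dimensional output size assumed for text-embedding-3-small is a property of the OpenAI API, not something taken from the benchmark.

```python
import numpy as np

def pca_2d(embeddings):
    """Project embedding vectors onto their top-2 principal components
    via SVD of the mean-centered matrix."""
    X = embeddings - embeddings.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt: axes
    return X @ Vt[:2].T

# Toy stand-in for 200 sampled stance embeddings per party
rng = np.random.default_rng(0)
party_embeddings = {
    "EPP": rng.normal(0.0, 1.0, size=(200, 1536)),
    "GREEN/EFA": rng.normal(0.5, 1.0, size=(200, 1536)),
}
all_vecs = np.vstack(list(party_embeddings.values()))
coords = pca_2d(all_vecs)  # (400, 2) points, scatter-plotted by party
print(coords.shape)
```

In the real pipeline, each party's 2D points would be colored by party symbol to reveal the clusters described above.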


Results

Average Performance on All Topics

Performance of different LLMs on PoliCon. The values in square brackets indicate the range of each metric, and all metrics follow the principle that higher values are better. The background color of the table cells deepens as the performance improves. The blue color scheme represents metrics in the 0-1 range, while the red color scheme represents metrics in the 0-9 range.


We find that Gemini-2.5 performs best overall, achieving the top result on 60% of the tasks. Deepseek-V3.1 and GPT-4o follow, each attaining top performance on 33% of the tasks.


We also compare the performance differences among the other evaluated LLMs and identify the following trends: (1) Thinking models such as Gemini-2.5 and Deepseek-V3.1 generally outperform non-thinking models such as GPT-4o and Llama-3.3-70B. (2) Commercial models typically outperform non-commercial models. (3) Among the four open-source models with known parameter counts, performance is generally positively correlated with model size.


The average contribution ratio of the largest party relative to the other parties in failed and passed cases, under the simple-majority (SM) and two-thirds-majority (2/3M) settings.


Additionally, we analyze whether a common strategy exists for LLMs to achieve political consensus under various power structures, excluding two-party scenarios. As shown in the above figure, we find that under both simple majority and two-thirds majority systems, LLMs lack the ability to unite smaller parties to achieve collective welfare. Instead, successful proposals often rely on the support of the largest party, indicating that the votes of dominant parties are foundational for approval in most cases.
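One plausible reading of the contribution-ratio metric: the largest party's seat-weighted "yes" votes divided by those of all other parties combined. The function name and exact formula below are our assumptions for illustration, not PoliCon's reported definition.

```python
def contribution_ratio(seat_shares, approval_rates):
    """Ratio of the largest party's 'yes' votes to the combined
    'yes' votes of all other parties (assumed definition)."""
    top = max(seat_shares, key=seat_shares.get)
    top_yes = seat_shares[top] * approval_rates[top]
    others_yes = sum(seat_shares[p] * approval_rates[p]
                     for p in seat_shares if p != top)
    return top_yes / others_yes

# Reusing the figure's example: GREEN/EFA contributes 0.25 of the
# parliament's 'yes' votes versus 0.35 from the other three parties.
seats = {"GREEN/EFA": 0.5, "EPP": 0.2, "GUE/NGL": 0.2, "EFD": 0.1}
votes = {"GREEN/EFA": 0.5, "EPP": 0.9, "GUE/NGL": 0.7, "EFD": 0.3}
print(round(contribution_ratio(seats, votes), 3))
```

A ratio well above 1 in passed cases would indicate the pattern described above: approval resting on the dominant party rather than on a coalition of smaller ones.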


Performance on Different Task Settings and Topics

The average results of the six evaluated LLMs across the five coarse-grained topics, for the passing-resolution (PR, including SM, 2/3M, and VP), Rawls, and Util political goals.


Analysis for Different Parliament Settings. As shown in the figure, for the political goal of passing a resolution, SM is the easiest setting, and most models perform well on it. In the 2/3M and VP settings, however, model performance declines significantly, indicating that the jump in difficulty from SM to these two settings exceeds the capabilities of existing LLMs. We further find that as the number of parties increases, the results of most models gradually improve. This may be because our task construction prioritizes parties with the most divergent positions, making reconciliation harder when fewer parties are involved. For the Rawls objective, by contrast, the models' success rate decreases as the number of parties increases. This aligns with the task's definition: the more participants there are, the harder it becomes to avoid neglecting any party's interests, which poses a significant challenge for current LLMs.
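The contrast between the Util and Rawls goals follows standard social-choice objectives: utilitarian scoring aggregates seat-weighted support, while a Rawlsian (maximin) objective scores a resolution by its worst-off party, so no party's interests can be neglected. A minimal sketch under those assumptions (not PoliCon's exact scoring code):

```python
def util_score(seat_shares, approval_rates):
    """Utilitarian objective: seat-weighted aggregate approval."""
    return sum(seat_shares[p] * approval_rates[p] for p in seat_shares)

def rawls_score(approval_rates):
    """Rawlsian (maximin) objective: approval of the worst-off party."""
    return min(approval_rates.values())

# A resolution that satisfies two large parties but neglects a small one
seats = {"A": 0.45, "B": 0.45, "C": 0.10}
votes = {"A": 0.9, "B": 0.9, "C": 0.1}
print(util_score(seats, votes))   # 0.82: high aggregate support...
print(rawls_score(votes))         # 0.1: ...but party C is left behind
```

This also illustrates why Rawls gets harder with more parties: the minimum is taken over more terms, so a single neglected party sinks the score.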


Analysis for Different Issue Topics. As shown in the figure, we analyze the experimental results of five coarse-grained topics. These results suggest that the difficulty of different topics shows certain similarities across various parliamentary settings. Specifically, topics involving policies, such as Security and Civil Rights, tend to be more challenging than those related to industrial development. This may be because these topics tend to present more complex and conflicting positions, requiring the evaluated LLM to possess stronger reasoning capabilities.


Our experimental results reveal the limitations of current LLMs in political consensus finding. Although top-performing models like Deepseek-V3.1 and Gemini-2.5 achieve success rates of 87-93% in SM scenarios, their performance drops significantly under stricter consensus requirements. In 2/3M tasks, the success rate falls to 52-63%, and in the more challenging Rawls setting, scores range from only 3.42 to 4.60 on the 0-9 scale. When dealing with more complex topics such as security, these models likewise face considerable challenges.


Bias Evaluation for the Tested LLMs

Scores of different LLMs regarding the degree of bias between political parties.


As shown in the table, we calculate, for each model, the variance of its scores across parties over different parliamentary terms. We find that, in general, performance increases as this variance decreases. This makes sense: when models shed their biases toward particular political parties, they can better adapt to the party weights in our setting and produce reasonable resolutions.


Partisan bias of the tested LLMs. (Top) Average scores from the tested LLMs on different parties. (Middle) Ground truth votes of different parties. (Bottom) Scores of random assignment.


We further investigate the partisan bias of the tested models. Surprisingly, as shown in the figure, even though party seats were randomly reassigned, the scores across different parties still resemble the distribution of real-world voting results, indicating that the tested models are influenced by real-world party preferences. In contrast, the score distribution for randomly assigned resolutions is entirely different, suggesting that this bias is not introduced by our evaluator or the data cleaning process.


BibTeX

@article{zhang2025policon,
    title={{PoliCon}: Evaluating LLMs on Achieving Diverse Political Consensus Objectives},
    author={Zhaowei Zhang and Xiaobo Wang and Minghua Yi and Mengmeng Wang and Fengshuo Bai and Zilong Zheng and Yipeng Kang and Yaodong Yang},
    journal={arXiv preprint},
    year={2025}
}