TL;DR: We introduce EuroCon, a benchmark built from 2,225 real deliberation records of the European Parliament, which evaluates the ability of LLMs to find political consensus across various parliament settings.
Achieving political consensus is crucial yet challenging for the effective functioning of social governance. Although frontier AI systems, represented by large language models (LLMs), have developed rapidly in recent years, their capabilities in this area remain understudied. To this end, we introduce EuroCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament spanning 13 years, from 2009 to 2022, to evaluate the ability of LLMs to reach political consensus among divergent party positions across diverse parliament settings. Specifically, EuroCon uses four factors to build each simulated parliament setting: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also develop an evaluation framework for EuroCon that simulates real voting outcomes in different parliament settings and assesses whether LLM-generated resolutions meet predefined political goals. Our experimental results show that even state-of-the-art models still perform unsatisfactorily on complex tasks such as passing resolutions by a two-thirds majority and addressing security issues. The results also reveal common strategies that LLMs use to find consensus under different power structures, such as prioritizing the stance of the dominant party. Together, these findings highlight EuroCon's promise as an effective platform for studying LLMs' ability to find political consensus.
An example scenario in EuroCon. In each task, EuroCon constructs a simulated parliament with varying political goals, power structures, issues, and participating parties. The tested LLM then attempts to find political consensus based on the parliament’s setup and the parties’ divergent positions. The outcome is evaluated through a simulated vote run by EuroCon’s evaluation framework.
The upper part presents the setting of the current simulated parliament. The seating colors in the figure represent the seat distribution among the four participating parties: GREEN/EFA (red, 50%), EPP (green, 20%), GUE/NGL (purple, 20%), and EFD (blue, 10%). The lower part demonstrates the LLM’s consensus-finding process. The parliamentary president announces the need to discuss the issue of surplus dairy products and introduces the political goal: passing the resolution with a two-thirds majority among the members of the European Parliament (MEPs).
Subsequently, the participating parties express conflicting positions on this issue. For example, EPP and EFD disagree significantly on extending storage time, while GREEN/EFA and GUE/NGL differ over export refunds. Although the resolution generated by the evaluated LLM partially takes the apportionment of seats among the parties into account to balance conflicting positions, it still fails to reach a new consensus that goes beyond mere compromise. As a result, in EuroCon’s evaluation, 50%, 90%, 70%, and 30% of the MEPs from GREEN/EFA, EPP, GUE/NGL, and EFD vote in favor of the resolution, respectively. Given the seat distribution, only 60% of the entire parliament votes in favor, which falls short of the two-thirds majority standard, so the resolution is not passed.
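The 60% figure in this example follows from weighting each party's approval rate by its seat share. The short sketch below reproduces that arithmetic; it is only an illustration, not EuroCon's evaluation code. The helper name `weighted_approval`, the dictionary layout, and the reading of "two-thirds majority" as "at least two-thirds" are assumptions.

```python
from fractions import Fraction

# Seat shares and per-party approval rates from the example scenario (in percent).
seat_share = {"GREEN/EFA": 50, "EPP": 20, "GUE/NGL": 20, "EFD": 10}
approval = {"GREEN/EFA": 50, "EPP": 90, "GUE/NGL": 70, "EFD": 30}


def weighted_approval(seats, votes):
    """Return the parliament-wide share of MEPs voting in favor.

    Hypothetical helper: each party's approval rate is weighted by its seat share.
    """
    total_seats = sum(seats.values())
    return sum(Fraction(seats[p], total_seats) * Fraction(votes[p], 100) for p in seats)


share = weighted_approval(seat_share, approval)
# Assumption: the goal is met when at least two-thirds of all MEPs vote in favor.
passed = share >= Fraction(2, 3)
print(f"in favor: {float(share):.0%}, passed: {passed}")
```

Exact fractions are used here so that the comparison against the two-thirds threshold is not affected by floating-point rounding.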
There are 5 coarse-grained and 19 fine-grained topic categories of issues in EuroCon. The shade of the color indicates the proportion of the fine-grained topic within the coarse-grained topic; the darker the color, the higher the proportion.
Distribution of the semantic representations of party stances (indicated by party symbols) in the 7th (2009-2014) and 8th (2014-2019) terms of the European Parliament in EuroCon. The core requirement for conflict is that each issue involves a varying number of parties holding different positions. The parties in the European Parliament do differ clearly in their political positions. To make this visible, we randomly sample 200 stances from each party during the 7th and 8th parliamentary terms, map them into a semantic representation space with OpenAI's text-embedding-3-small model, and visualize the result with Principal Component Analysis (PCA). As shown in the figure, each party's stances form a distinct cluster in the semantic space, clearly separated from those of the other parties.
Performance of different LLMs on EuroCon. The values in square brackets indicate the range of each metric, and all metrics follow the principle that higher values are better. The background color of the table cells deepens as the performance improves. The blue color scheme represents metrics in the 0-1 range, while the red color scheme represents metrics in the 0-9 range.
We find that Qwen2.5-72B and Deepseek-R1 perform the best. Qwen2.5-72B demonstrates exceptional performance, surpassing models of similar scale, commercial models such as GPT-4o and Gemini, and even the larger Deepseek-R1. This finding is consistent with results from existing work, which suggest that a model’s ability on a specific task is not determined by parameter count alone and depends more on its task-specific capabilities.
We also compare the remaining evaluated LLMs and identify the following trends: (1) Thinking models like Deepseek-R1 and Gemini-2.5 generally outperform non-thinking models like Llama-3.3-70B and Qwen2.5-32B. (2) Commercial models (Deepseek-R1, Gemini-2.5, GPT-4o) typically outperform non-commercial open-source models (Llama-3.3-70B and Qwen2.5-32B). (3) The minimal differences between Qwen2.5-32B and Llama-3.3-70B further suggest that performance on political consensus finding is not strongly correlated with model size.
Average results of the six evaluated LLMs on the five coarse-grained topics for the passing-resolution (PR, including SM, 2/3M, and VP), Rawls, and Util political goals.
Analysis for Different Parliament Settings. As shown in the figure, for the political goal of passing a resolution, the simple majority (SM) setting is the easiest, and most models perform well on it. In the two-thirds majority (2/3M) and veto power (VP) settings, however, model performance declines significantly, indicating that the capability of existing LLMs generally falls between the difficulty of SM and that of these two stricter settings. We further find that as the number of parties increases, the results of most models gradually rise. This could be because our task construction prioritizes the parties with the most divergent positions, which makes reconciliation harder when fewer parties participate. For the Rawls objective, however, the success rate of models decreases as the number of parties increases. This aligns with the task's definition: the more participants there are, the harder it becomes to avoid neglecting any party's interests, which poses a significant challenge for current LLMs.
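To make the gap between the three resolution-passing settings concrete, the sketch below expresses them as simple predicates over seat counts and per-party approval rates. It is a hypothetical illustration, not EuroCon's evaluation code: the thresholds (strictly more than half for SM, at least two-thirds for 2/3M), the veto rule (the veto-holding party blocks the resolution when its own approval rate falls below a threshold), the seat and approval numbers, and all function names are assumptions.

```python
def approval_share(seats, approval):
    """Seat-weighted fraction of the parliament voting in favor."""
    total = sum(seats.values())
    return sum(seats[p] * approval[p] for p in seats) / total


def passes_simple_majority(seats, approval):
    # Assumption: SM requires strictly more than half of all votes.
    return approval_share(seats, approval) > 0.5


def passes_two_thirds(seats, approval):
    # Assumption: 2/3M requires at least two-thirds of all votes.
    return approval_share(seats, approval) >= 2 / 3


def passes_with_veto(seats, approval, veto_party, veto_threshold=0.5):
    # Assumption: the veto-holding party blocks the resolution when its own
    # approval rate falls below `veto_threshold`; otherwise SM decides.
    vetoed = approval[veto_party] < veto_threshold
    return passes_simple_majority(seats, approval) and not vetoed


seats = {"A": 45, "B": 35, "C": 20}        # made-up seat counts
approval = {"A": 0.9, "B": 0.6, "C": 0.2}  # made-up approval rates

print(passes_simple_majority(seats, approval))  # SM
print(passes_two_thirds(seats, approval))       # 2/3M
print(passes_with_veto(seats, approval, "C"))   # VP, with party C holding the veto
```

With these made-up numbers the same resolution clears SM but fails both 2/3M and VP, which mirrors the pattern described above: a resolution that satisfies the larger parties can still fall short of a stricter threshold or be blocked by a small veto-holding party.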
Analysis for Different Issue Topics. As shown in the figure, we analyze the experimental results across the five coarse-grained topics. The results suggest that the relative difficulty of the topics is similar across parliamentary settings. Specifically, topics involving policies, such as Security and Civil Rights, tend to be more challenging than those related to industrial development. This may be because these topics tend to involve more complex and conflicting positions, requiring stronger reasoning capabilities from the evaluated LLM.
Our experimental results reveal the limitations of current LLMs in political consensus finding. Although top-performing models like Qwen2.5-72B achieve a success rate of 86-90% in SM scenarios, their performance drops significantly under stricter consensus requirements. In 2/3M tasks, the success rate falls to 61-62%, and in the more challenging Rawls setting, the score ranges only from 3.26 to 5.12. These models also still face considerable challenges on more complex topics such as security.
Strategy analysis of LLMs under different power structures. Figure (a) shows the average contribution ratio of the largest party to other parties in failed and passed cases across SM and 2/3M. Figure (b) shows, among all failed cases in the VP setting, the proportion of cases that did not pass SM but were not vetoed (SM-/Veto+), did not pass SM and were vetoed (SM-/Veto-), and passed SM but were vetoed (SM+/Veto-).
As shown in the figure, we analyze whether a common strategy exists for LLMs to achieve political consensus under various power structures, excluding two-party scenarios. Our findings are as follows: (1) Under both simple majority and two-thirds majority systems, successful proposals often rely on the support of the largest party, indicating that dominant parties' votes are foundational for approval and decisive in most cases. (2) Regarding the veto mechanism, models with higher passing rates experience more failures due to vetoes after simple majority approval. This suggests a strategic trade-off: prioritizing majority party support can maximize approval chances but risks overlooking veto-holding parties, leading to sudden failures when veto power is exercised.
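The breakdown of failed VP cases into SM-/Veto+, SM-/Veto-, and SM+/Veto- can be computed with a small labeling function. The sketch below is a minimal illustration under stated assumptions: the function name `failure_type`, the `(passed_sm, vetoed)` representation, and the sample outcomes are all made up for demonstration and do not come from EuroCon's data or code.

```python
from collections import Counter


def failure_type(passed_sm, vetoed):
    """Label a VP-setting outcome using the figure's notation.

    Returns None for a resolution that passed (SM reached and no veto).
    """
    if passed_sm and not vetoed:
        return None
    sm = "SM+" if passed_sm else "SM-"
    veto = "Veto-" if vetoed else "Veto+"  # "Veto-" marks a vetoed resolution
    return f"{sm}/{veto}"


# Made-up outcomes as (passed_sm, vetoed) pairs, for illustration only.
outcomes = [(True, False), (True, True), (False, False), (False, True), (True, True)]

labels = [failure_type(sm, v) for sm, v in outcomes]
failed = [label for label in labels if label is not None]
counts = Counter(failed)
proportions = {label: n / len(failed) for label, n in counts.items()}
print(proportions)
```

A large SM+/Veto- share in such a breakdown corresponds to the trade-off described in finding (2): resolutions that win majority support but are blocked by a veto-holding party.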
These findings highlight EuroCon's unique ability to reveal subtle flaws in LLMs' political decision-making capabilities, which existing negotiation benchmarks often struggle to detect.
@inproceedings{zhang2025eurocon,
title={EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding},
author={Zhaowei Zhang and Minghua Yi and Mengmeng Wang and Fengshuo Bai and Zilong Zheng and Yipeng Kang and Yaodong Yang},
booktitle={Arxiv},
year={2025},
url={https://openreview.net/forum?id=f9w89OY2cp}
}