TL;DR: We introduce Amulet, a training-free framework that enables real-time optimization to satisfy users' personalized preferences for LLMs at test time.
How to align large language models (LLMs) with user preferences drawn from a static, general dataset has been studied extensively. However, user preferences are usually personalized, changing, and diverse with respect to culture, values, or time. As a result, in practical use the actual preferences of users often do not coincide with those the model developers trained for. Since we cannot collect enough data and retrain the model for every such demand, it is important to study efficient, real-time preference adaptation methods that operate on top of the backbone LLMs at test time. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem guided by simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost of running this optimization for every token, we additionally provide a closed-form solution for each iteration step, which keeps the added decoding time negligible. Detailed experimental results demonstrate that Amulet achieves significant performance improvements across a rich set of settings combining different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
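For intuition on why a per-token optimization can remain cheap, the following is a generic sketch rather than Amulet's exact objective: treat the next-token distribution $p$ over the vocabulary $\mathcal{V}$ as the decision variable of a small online learning problem and update it with an entropically regularized (mirror-descent-style) step, which admits a closed form. The loss $\ell_k$ and step size $\eta$ below are placeholders; see the paper for the objective Amulet actually optimizes.

$$
p^{(k+1)}(v) \;=\; \frac{p^{(k)}(v)\,\exp\!\big(-\eta\,\nabla_{v}\,\ell_k(p^{(k)})\big)}{\sum_{v' \in \mathcal{V}} p^{(k)}(v')\,\exp\!\big(-\eta\,\nabla_{v'}\,\ell_k(p^{(k)})\big)}, \qquad v \in \mathcal{V},
$$

with $p^{(0)}$ set to the backbone model's next-token distribution. Each iterate is only an elementwise exponentiation followed by a normalization, so the cost of the inner loop is negligible compared with the forward pass that produces the logits.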
The figure is intersected by an axis, with each node on the axis displaying a different distribution, illustrating that users' personalized preferences constantly change due to factors such as time, values, needs, and context, as shown in part (a). Part (b) shows that existing methods mostly consider aligning LLMs with general preferences from a static dataset, which may result in misalignment in dynamic, personalized scenarios. In part (c), we enlarge one of the preference nodes to show how our Amulet framework processes it: we formulate the decoding process of every token as a separate online learning problem and adapt the backbone LLM to the current user preference through a real-time optimization process guided by user-provided prompts. The red token marks the token currently being processed, which then becomes part of the conditioning context for the next-token prediction.
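To make the per-token, prompt-guided adaptation concrete, here is a minimal sketch of test-time preference steering: at each decoding step the next-token distribution of the plain prompt is nudged toward the distribution conditioned on a user-provided preference prompt. The model name, the guidance weight alpha, the prompts, and the single interpolation step are illustrative assumptions and are not the Amulet update itself, which instead solves a small online learning problem per token with a closed-form iterate.

# Illustrative per-token test-time preference steering (not the exact Amulet update).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # any chat LLM works; chosen for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

question = "Suggest a weekend activity."
preference = "Please answer in a concise and uplifting tone."  # user-provided preference prompt

plain_ids = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
pref_ids = tok.apply_chat_template(
    [{"role": "user", "content": preference + "\n" + question}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

alpha = 2.0  # guidance strength (illustrative hyperparameter)
generated = []
with torch.no_grad():
    for _ in range(64):  # greedy decoding, token by token
        logits_plain = model(plain_ids).logits[:, -1, :]
        logits_pref = model(pref_ids).logits[:, -1, :]
        # Move the plain next-token logits in the direction of the
        # preference-conditioned ones before picking the next token.
        steered = logits_plain + alpha * (logits_pref - logits_plain)
        next_id = steered.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        # The chosen token becomes part of the context for both prompts.
        plain_ids = torch.cat([plain_ids, next_id], dim=-1)
        pref_ids = torch.cat([pref_ids, next_id], dim=-1)

print(tok.decode(generated, skip_special_tokens=True))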
Results of our Amulet framework and all the other baselines across various combinations of LLMs, user preferences, and datasets. All results are the arithmetic averages of the reward model scores on each dataset. Bold text indicates the best performance under that setting. Cell colors represent the percentage improvement of each method, in the current setting, relative to the Base method: bluer cells indicate larger positive improvements and redder cells larger negative ones.
The percentage of highest scores (win rate) across all methods for the 64 groups of experiments with different user preferences and LLMs. As shown in the left plot, over the 16 experiments conducted for each preference, our method's win rate is highest for the creative preference (93.8%), followed by uplifting (81.2%), concise (75%), and verbose (50%). As shown in the right plot, the win rates by model are Llama-2-7B-Chat (75%), Mistral-7B-Instruct-v0.2 (68.8%), and Qwen2-7B-Instruct (56.2%). Compared with the current SOTA method, LA, these correspond to relative win-rate improvements of 200%, 121%, and 49.9%, respectively. These results demonstrate our method's effectiveness in aligning model responses with user preferences across various LLMs.
Detailed GPT-4o-judged win rates of Amulet against all the other baselines (Base, Pref, BS16, and LA) on the Personal dataset. The first row of the figure shows Amulet's average win rate for each of the preferences, and the second row for each of the LLMs. As shown in the figure, Amulet achieves the highest win rate in all tasks. Even Qwen2-7B, which performed relatively weakly under the ArmoRM metric, achieves a win rate of at least 62.2%.
@inproceedings{zhang2025amulet,
title={Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of {LLM}s},
author={Zhaowei Zhang and Fengshuo Bai and Qizhi Chen and Chengdong Ma and Mingzhi Wang and Haoran Sun and Zilong Zheng and Yaodong Yang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=f9w89OY2cp}
}