Yue Xu

Hi, this is Yue Xu (Savannah). I’m a Ph.D. student in Computer Science at ShanghaiTech University, advised by Prof. Wenjie Wang. My research explores AI alignment, including the safety, fairness, and robustness of large language and multimodal models, with the aim of building intelligent systems that are both trustworthy and adaptive.

Recently, I’ve been focusing on the personalization of LLMs and LLM-powered agents, exploring how memory, preference modeling, and adaptive reasoning can enable human-aligned, self-evolving agents.

If you’re interested in collaboration or discussion, feel free to reach out at xuyue2022 [at] shanghaitech.edu.cn!

news

Sep 22, 2025	FaIRMaker accepted to NeurIPS 2025! See you in San Diego!
Dec 14, 2024	MMJ-Bench accepted to AAAI 2025!
Sep 20, 2024	CIDER accepted to EMNLP 2024!
May 30, 2024	I will be continuing my Ph.D. journey at ShanghaiTech University, advised by Prof. Wenjie Wang! 🎉
Mar 30, 2024	LinkPrompt accepted to NAACL 2024! See you in Mexico!

selected publications

CIDER

Cross-modality information check for detecting jailbreaking in multimodal large language models

Yue Xu, Xiuyuan Qi, Zhan Qin, and 1 more author

EMNLP Findings, 2024


We propose the Cross-modality Information DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify maliciously perturbed image inputs by exploiting the cross-modal similarity between harmful queries and adversarial images. This simple yet effective detector is independent of the target MLLMs and incurs a lower computational cost. Extensive experimental results demonstrate the effectiveness and efficiency of CIDER, as well as its transferability to both white-box and black-box MLLMs.
FaIRMaker

Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in Large Language Models

Yue Xu, Chengyan Fu, Li Xiong, and 2 more authors

NeurIPS, 2025


Pre-training large language models (LLMs) on vast text corpora enhances natural language processing capabilities but risks encoding social biases, particularly gender bias. While parameter-modification methods like fine-tuning mitigate bias, they are resource-intensive, unsuitable for closed-source models, and lack adaptability to evolving societal norms. Instruction-based approaches offer flexibility but often compromise task performance. To address these limitations, we propose FaIRMaker, an automated and model-independent framework that employs an auto-search and refinement paradigm to adaptively generate Fairwords, which act as instructions integrated into input queries to reduce gender bias and enhance response quality. Extensive experiments demonstrate that FaIRMaker automatically searches for and dynamically refines Fairwords, effectively mitigating gender bias while preserving task integrity and ensuring compatibility with both API-based and open-source LLMs.
LinkPrompt

LinkPrompt: Natural and universal adversarial attacks on prompt-based language models

Yue Xu and Wenjie Wang

NAACL, 2024


Prompt-based learning is a new language model training paradigm that adapts Pre-trained Language Models (PLMs) to downstream tasks and has revitalized performance benchmarks across various natural language processing (NLP) tasks. Instead of using a fixed prompt template to fine-tune the model, some research demonstrates the effectiveness of searching for the prompt via optimization. This prompt optimization process also gives insight into generating adversarial prompts that mislead the model, raising concerns about the adversarial vulnerability of the paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also the predictions of the corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, UATs found in previous works are often unreadable tokens or characters and can be easily distinguished from natural text by adaptive defenses. In this work, we consider the naturalness of UATs and develop LinkPrompt, an adversarial attack algorithm that generates UATs with a gradient-based beam search that not only effectively attacks the target PLMs and PFMs but also maintains naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of LinkPrompt, as well as the transferability of the UATs it generates to the open-source Large Language Model (LLM) Llama2 and the API-accessed LLM GPT-3.5-turbo.
MMJ-Bench

MMJ-Bench: A comprehensive study on jailbreak attacks and defenses for vision language models

Fenghua Weng, Yue Xu, Chengyan Fu, and 1 more author

Proceedings of the AAAI Conference on Artificial Intelligence, 2025


As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Vision-Language Models (VLMs), have shown exceptional performance in many real-world tasks. However, VLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model’s safety alignment to elicit harmful responses. The threat of jailbreak attacks on VLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that VLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different datasets and metrics, making it impossible to compare the effectiveness of the methods. To address this gap, we introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for VLMs. Through extensive experiments, we assess the effectiveness of various attack methods against SoTA VLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for VLM jailbreak research. We also present several insightful findings that highlight directions for future studies.