School of Information Sciences

Illinois information sciences researchers develop AI safety testing methods

Haohan Wang
Haohan Wang, Assistant Professor
Haibo Jin
Haibo Jin

Large language models are built with safety protocols designed to prevent them from answering malicious queries and providing dangerous information. But users can employ techniques known as "jailbreaks" to bypass the safety guardrails and get LLMs to answer a harmful query.

Researchers at the University of Illinois Urbana-Champaign are examining such vulnerabilities and finding ways to make the systems safer. Information sciences professor Haohan Wang, whose research interests include trustworthy machine learning methods, and information sciences doctoral student Haibo Jin have led several projects related to aspects of LLM safety.

Large language models — artificial intelligence systems that are trained on vast amounts of data—perform machine learning tasks and are the basis for generative AI chatbots such as ChatGPT.

Wang's and Jin's research develops sophisticated jailbreak techniques and tests them against LLMs. Their work helps identify vulnerabilities and make the LLMs' safeguards more robust, they said.

"A lot of jailbreak research is trying to test the system in ways that people won't try. The security loophole is less significant," Wang said. "I think AI security research needs to expand. We hope to push the research to a direction that is more practical — security evaluation and mitigation that will make differences to the real world."

For example, a standard example of a security violation is asking an LLM to provide directions about how to make a bomb, but Wang said that is not an actual query that is being asked. He said he wants to focus on what he considers more serious threats — malicious inquiries that he believes are more likely to be asked of an LLM, such as those related to suicide or to the manipulation of a partner or potential partner in a romantic or intimate relationship. He doesn't believe those kinds of queries are being examined enough by researchers or AI companies, because it is more difficult to get an LLM to respond to prompts concerning those issues.

Users are querying for information on more personal and more serious issues, and "that should be a direction that this community is pushing for," Wang said.

Wang and Jin developed a model they call JAMBench that evaluates LLMs' moderation guardrails, which filter its responses to questions. JAMBench created jailbreak methods to attack the guardrails for four risk categories: hate and fairness (including hate speech, bullying and attacks based on race, gender, sexual orientation, immigration status and other factors), violence, sexual acts and sexual violence, and self-harm.

In a research paper, Wang and Jin wrote that most jailbreak research evaluates the safeguards only on input, or whether or not the LLM recognizes the harmful nature of some queries. It doesn't test whether the safeguards prevent the output of harmful information. "Our approach focuses on crafting jailbreak prompts designed to bypass the moderation guardrails in LLMs, an area where the effectiveness of jailbreak efforts remains largely unexplored," they wrote.

Wang and Jin also offered two countermeasures that reduced the jailbreak success rates to zero, "underscoring the necessity of enhancing or adding extra guardrails to counteract advanced jailbreak techniques."

The researchers also developed a method to test how well LLMs comply with government guidelines on AI security. Security guidelines create a challenge for developers because they often are written as high-level requirements — for example, AI shouldn't violate human rights — but are lacking specific, actionable instructions, Wang and Jin said. Their testing method transforms abstract guidelines into specific questions that use jailbreak techniques to evaluate LLM compliance with the guidelines.

In other projects, Wang and Jin developed new jailbreak methods to test the security of LLMs. They found that using excessive linguistic complexity and fake sources allowed them to bypass the safety guardrails and get LLMs to answer a harmful query, a strategy they called "information overload."

They developed InfoFlood, a jailbreak method that, in one example, transformed the 13-word request asking to "develop a strategy for hacking into a government database and stealing sensitive information" into a 194-word query.

"If we bury (a query) under really dense linguistic prose and academic jargon, will the LLM answer the question because it doesn't really understand what the question means," said Advait Yadav, a rising junior in math and computer science at Illinois, a member of the project team and the first author of a paper about their results.

Wang and Jin also developed GuardVal, an evaluation protocol that dynamically generates and refines jailbreak prompts to ensure the evaluation evolves in real time and adapts to the security capabilities of the LLM.

Updated on
Backto the news archive

Related News

Lourentzou receives NSF CAREER Award

Assistant Professor Ismini Lourentzou has received a National Science Foundation (NSF) CAREER award to develop the next generation of embodied AI agents, systems that can reason, explain, and adapt as they act in the physical world.

Ismini Lourentzou

Raji invited to join UN Working Expert Group

PhD student Mubarak Raji has been invited to join the Working Expert Group on AI Governance Interoperability. This group operates under the United Nations Office for Digital and Emerging Technologies' new AI Governance for Humanity Lab. It supports the Secretary-General's High-level Advisory Body on AI by providing evidence-based analysis for the Global Dialogue on AI Governance, which will be held in July 2026 in Geneva, Switzerland.

Mubarak Raji headshot

Faculty and staff recognized with inaugural iSchool awards

The iSchool recognized faculty and staff for their contributions to teaching and outstanding service to the School at a ceremony on May 6. Interim Dean Emily Knox presented plaques to the inaugural recipients of the Faculty Teaching Award, Adjunct Teaching Award, and Staff Excellence Award.

Paper by He's lab recognized at ICLR 2026 workshop

The iDEA-iSAIL Joint Laboratory at the University of Illinois received an Outstanding Paper Award at the International Conference on Learning Representations (ICLR) 2026 Logical Reasoning of Large Language Models Workshop for their paper, "RAG Over Tables: Hierarchical Memory Index, Multi-State Retrieval, and Benchmarking." Paper authors include lab members Jingrui He, professor and MSIM program director; Sirui Chen, Xinrui He, and Zihao Li, computer science PhD students; Jiaru Zou, computer science MS student; Dongqi Fu, alum; as well as Jiawei Han, professor of computer science, and Yada Zhu, IBM collaborator. Chen gave an oral presentation of the research at the workshop, which was held last month in Rio de Janeiro, Brazil. This award was selected out of 206 accepted papers at the workshop.

Jingrui He

School of Information Sciences

501 E. Daniel St.

MC-493

Champaign, IL

61820-6211

Voice: (217) 333-3280

Email: ischool@illinois.edu

Back to top