School of Information Sciences

Illinois information sciences researchers develop AI safety testing methods

Haohan Wang
Haohan Wang, Assistant Professor
Haibo Jin
Haibo Jin

Large language models are built with safety protocols designed to prevent them from answering malicious queries and providing dangerous information. But users can employ techniques known as "jailbreaks" to bypass the safety guardrails and get LLMs to answer a harmful query.

Researchers at the University of Illinois Urbana-Champaign are examining such vulnerabilities and finding ways to make the systems safer. Information sciences professor Haohan Wang, whose research interests include trustworthy machine learning methods, and information sciences doctoral student Haibo Jin have led several projects related to aspects of LLM safety.

Large language models — artificial intelligence systems that are trained on vast amounts of data—perform machine learning tasks and are the basis for generative AI chatbots such as ChatGPT.

Wang's and Jin's research develops sophisticated jailbreak techniques and tests them against LLMs. Their work helps identify vulnerabilities and make the LLMs' safeguards more robust, they said.

"A lot of jailbreak research is trying to test the system in ways that people won't try. The security loophole is less significant," Wang said. "I think AI security research needs to expand. We hope to push the research to a direction that is more practical — security evaluation and mitigation that will make differences to the real world."

For example, a standard example of a security violation is asking an LLM to provide directions about how to make a bomb, but Wang said that is not an actual query that is being asked. He said he wants to focus on what he considers more serious threats — malicious inquiries that he believes are more likely to be asked of an LLM, such as those related to suicide or to the manipulation of a partner or potential partner in a romantic or intimate relationship. He doesn't believe those kinds of queries are being examined enough by researchers or AI companies, because it is more difficult to get an LLM to respond to prompts concerning those issues.

Users are querying for information on more personal and more serious issues, and "that should be a direction that this community is pushing for," Wang said.

Wang and Jin developed a model they call JAMBench that evaluates LLMs' moderation guardrails, which filter its responses to questions. JAMBench created jailbreak methods to attack the guardrails for four risk categories: hate and fairness (including hate speech, bullying and attacks based on race, gender, sexual orientation, immigration status and other factors), violence, sexual acts and sexual violence, and self-harm.

In a research paper, Wang and Jin wrote that most jailbreak research evaluates the safeguards only on input, or whether or not the LLM recognizes the harmful nature of some queries. It doesn't test whether the safeguards prevent the output of harmful information. "Our approach focuses on crafting jailbreak prompts designed to bypass the moderation guardrails in LLMs, an area where the effectiveness of jailbreak efforts remains largely unexplored," they wrote.

Wang and Jin also offered two countermeasures that reduced the jailbreak success rates to zero, "underscoring the necessity of enhancing or adding extra guardrails to counteract advanced jailbreak techniques."

The researchers also developed a method to test how well LLMs comply with government guidelines on AI security. Security guidelines create a challenge for developers because they often are written as high-level requirements — for example, AI shouldn't violate human rights — but are lacking specific, actionable instructions, Wang and Jin said. Their testing method transforms abstract guidelines into specific questions that use jailbreak techniques to evaluate LLM compliance with the guidelines.

In other projects, Wang and Jin developed new jailbreak methods to test the security of LLMs. They found that using excessive linguistic complexity and fake sources allowed them to bypass the safety guardrails and get LLMs to answer a harmful query, a strategy they called "information overload."

They developed InfoFlood, a jailbreak method that, in one example, transformed the 13-word request asking to "develop a strategy for hacking into a government database and stealing sensitive information" into a 194-word query.

"If we bury (a query) under really dense linguistic prose and academic jargon, will the LLM answer the question because it doesn't really understand what the question means," said Advait Yadav, a rising junior in math and computer science at Illinois, a member of the project team and the first author of a paper about their results.

Wang and Jin also developed GuardVal, an evaluation protocol that dynamically generates and refines jailbreak prompts to ensure the evaluation evolves in real time and adapts to the security capabilities of the LLM.

Updated on
Backto the news archive

Related News

Perkins defends dissertation

PhD candidate Jana M. Perkins successfully defended her dissertation, "Scholarship writ large: A data-rich analysis of professionalization in English literary scholarship from 1940 to the present."

Jana Perkins

Yu receives 2025 Google PhD Fellowship

PhD student Yaman Yu has been named a recipient of the 2025 Google PhD Fellowship in Privacy, Safety, and Security. The fellowship program recognizes outstanding graduate students who are conducting exceptional and innovative research in computer science and related fields, with a special focus on candidates who seek to influence the future of technology. Google PhD fellowships include tuition and fees, a stipend, and mentorship from a Google Research Mentor for up to two years. Google.org is providing over $10 million to support 255 PhD students across 35 countries and 12 research domains.

Yaman Yu

iSchool researchers to present at ASSETS 2025

iSchool faculty and students will present their research at the 27th International Association for Computing Machinery (ACM) Special Interest Group (SIG) ACCESS Conference on Computers and Accessibility (ASSETS 2025), which will be held in Denver, Colorado, October 26–29, 2025. This conference allows researchers to present their scholarship on design, evaluation, use, and education related to computing for people with disabilities and older adults.

Chan to give an invited talk on "Predatory Data"

Professor Anita Say Chan will give an invited lecture at the American University of Beirut (AUB) on October 23. The talk, part of the "Confronted with America" series hosted by the Center for American Studies and Research, will be moderated by Jihad Touma, founding director of AUB's School of Computing and Data Sciences.

Anita Say Chan

iSchool researchers present at ILA 2025

School faculty, staff, and students will present their research at the 2025 Illinois Library Association (ILA) Annual Conference, which will be held on October 14–16 in Rosemont. The theme of this year's conference is "You Belong Here."

School of Information Sciences

501 E. Daniel St.

MC-493

Champaign, IL

61820-6211

Voice: (217) 333-3280

Fax: (217) 244-3302

Email: ischool@illinois.edu

Back to top