Full Publications

Publications are listed by research area. Each entry includes its BibTeX record and abstract.

Agent and Agentic System Security

[Preprint] Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows

Bonan Ruan, Yeqi Fu, Chuqi Zhang, Jiahao Liu, Jun Zeng, Zhenkai Liang

arXiv 2026

@article{ruan2026heimdallr,
  title={Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows},
  author={Ruan, Bonan and Fu, Yeqi and Zhang, Chuqi and Liu, Jiahao and Zeng, Jun and Liang, Zhenkai},
  journal={arXiv preprint arXiv:2605.05969},
  year={2026}
}

GitHub Continuous Integration (CI) workflows increasingly integrate Large Language Models (LLMs) to automate review, triage, content generation, and repository maintenance. This creates a new attack surface: externally controllable workflow inputs can shape LLM prompts and outputs, which may in turn affect security decisions, repository state, or privileged execution. Although LLM security and CI security have each been studied extensively, their intersection remains underexplored. In this paper, we present the first study of LLM-induced security risks in GitHub CI workflows. We characterize the problem along the full execution chain and develop a taxonomy of high-level risk classes and concrete threat vectors. To detect such risks in practice, we design Heimdallr, a hybrid analysis framework that normalizes workflows into an LLM-Workflow Property Graph (L-WPG) and combines triggerability analysis, LLM-assisted dataflow summarization, and deterministic propagation to synthesize concrete threat-vector findings. Evaluated on 300 manually annotated unique workflows, Heimdallr achieves high accuracy on LLM-node identification (F1 = 0.994), triggerability classification (99.8%), and threat-vector detection (micro-average F1 = 0.917). As part of an ongoing detection and disclosure effort, we have so far responsibly disclosed 802 vulnerable workflow instances across 759 repositories and received 71 acknowledgments.
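
For readers unfamiliar with the metric, the micro-average F1 reported above pools true positives, false positives, and false negatives across all classes before computing a single score. The sketch below illustrates that calculation only. The class names and counts are invented for illustration and are not taken from the paper.

```python
def micro_f1(counts):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when nothing was predicted or expected
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class confusion counts (NOT results from the paper)
example_counts = {
    "vector_a": {"tp": 8, "fp": 1, "fn": 1},
    "vector_b": {"tp": 2, "fp": 1, "fn": 1},
}

score = micro_f1(example_counts)
print(round(score, 3))
```

Because the counts are pooled, classes with more instances weigh more heavily in the micro-average than they would in a macro-average.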

[ICLR 2026] DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

Yuheng Tang*, Kaijie Zhu*, Bonan Ruan, Chuqi Zhang, Michael Yang, Hongwei Li, Suyue Guo, Tianneng Shi, Zekun Li, Christopher Kruegel, Giovanni Vigna, Dawn Song, William Yang Wang, Lun Wang, Yangruibo Ding, Zhenkai Liang, Wenbo Guo

14th International Conference on Learning Representations

@inproceedings{tang2026devopsgym,
  title={DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle},
  author={Yuheng Tang and Kaijie Zhu and Bonan Ruan and Chuqi Zhang and Michael Yang and Hongwei Li and Suyue Guo and Tianneng Shi and Zekun Li and Christopher Kruegel and Giovanni Vigna and Dawn Song and William Yang Wang and Lun Wang and Yangruibo Ding and Zhenkai Liang and Wenbo Guo},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=bP48r4dt7Z}
}

Although AI agents have demonstrated extraordinary capabilities in code generation and software issue resolving, their capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle in real-world software, including developing, deploying, and managing, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism, backed by rigorous and non-trivial expert effort, to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and remain unable to handle new tasks such as monitoring and build and configuration. These results highlight the need for essential research in automating the full DevOps cycle with AI agents.

[Preprint] TraceAegis: Securing LLM-Based Agents via Hierarchical and Behavioral Anomaly Detection

Jiahao Liu, Bonan Ruan, Xianglin Yang, Zhiwei Lin, Yan Liu, Yang Wang, Tao Wei, Zhenkai Liang

arXiv 2025

@article{liu2025traceaegis,
  title={TraceAegis: Securing LLM-Based Agents via Hierarchical and Behavioral Anomaly Detection},
  author={Liu, Jiahao and Ruan, Bonan and Yang, Xianglin and Lin, Zhiwei and Liu, Yan and Wang, Yang and Wei, Tao and Liang, Zhenkai},
  journal={arXiv preprint arXiv:2510.11203},
  year={2025}
}

LLM-based agents have demonstrated promising adaptability in real-world applications. However, these agents remain vulnerable to a wide range of attacks, such as tool poisoning and malicious instructions, that compromise their execution flow and can lead to serious consequences like data breaches and financial loss. Existing studies typically attempt to mitigate such anomalies by predefining specific rules and enforcing them at runtime to enhance safety. Yet, designing comprehensive rules is difficult, requiring extensive manual effort and still leaving gaps that result in false negatives. As agent systems evolve into complex software systems, we take inspiration from software system security and propose TraceAegis, a provenance-based analysis framework that leverages agent execution traces to detect potential anomalies. In particular, TraceAegis constructs a hierarchical structure to abstract stable execution units that characterize normal agent behaviors. These units are then summarized into constrained behavioral rules that specify the conditions necessary to complete a task. By validating execution traces against both hierarchical and behavioral constraints, TraceAegis is able to effectively detect abnormal behaviors. To evaluate the effectiveness of TraceAegis, we introduce TraceAegis-Bench, a dataset covering two representative scenarios: healthcare and corporate procurement. Each scenario includes 1,300 benign behaviors and 300 abnormal behaviors, where the anomalies either violate the agent’s execution order or break the semantic consistency of its execution sequence. Experimental results demonstrate that TraceAegis achieves strong performance on TraceAegis-Bench, successfully identifying the majority of abnormal behaviors. We further validate TraceAegis’ practicality through an internal red-teaming process conducted within a technology company, where it effectively detects abnormal traces generated by red-team attacks.

[Preprint] When MCP Servers Attack: Taxonomy, Feasibility, and Mitigation

Weibo Zhao, Jiahao Liu, Bonan Ruan, Shaofei Li, Zhenkai Liang

arXiv 2025

@article{zhao2025when,
  title={When MCP Servers Attack: Taxonomy, Feasibility, and Mitigation},
  author={Zhao, Weibo and Liu, Jiahao and Ruan, Bonan and Li, Shaofei and Liang, Zhenkai},
  journal={arXiv preprint arXiv:2509.24272},
  year={2025}
}

Model Context Protocol (MCP) servers enable AI applications to connect to external systems in a plug-and-play manner, but their rapid proliferation also introduces severe security risks. Unlike mature software ecosystems with rigorous vetting, MCP servers still lack standardized review mechanisms, giving adversaries opportunities to distribute malicious implementations. Despite this pressing risk, the security implications of MCP servers remain underexplored. To address this gap, we present the first systematic study that treats MCP servers as active threat actors and decomposes them into core components to examine how adversarial developers can implant malicious intent. Specifically, we investigate three research questions: (i) what types of attacks malicious MCP servers can launch, (ii) how vulnerable MCP hosts and Large Language Models (LLMs) are to these attacks, and (iii) how feasible it is to carry out MCP server attacks in practice. Our study proposes a component-based taxonomy comprising twelve attack categories. For each category, we develop Proof-of-Concept (PoC) servers and demonstrate their effectiveness across diverse real-world host-LLM settings. We further show that attackers can generate large numbers of malicious servers at virtually no cost. We then test state-of-the-art scanners on the generated servers and find that existing detection approaches are insufficient. These findings highlight that malicious MCP servers are easy to implement, difficult to detect with current tools, and capable of causing concrete damage to AI agent systems. Addressing this threat requires coordinated efforts among protocol designers, host developers, LLM providers, and end users to build a more secure and resilient MCP ecosystem.

Systems and Software Security

[ASE 2025] Propagation-Based Vulnerability Impact Assessment for Software Supply Chains

Bonan Ruan, Zhiwei Lin, Jiahao Liu, Chuqi Zhang, Kaihang Ji, Zhenkai Liang

40th IEEE/ACM International Conference on Automated Software Engineering

@inproceedings{ruan2025vpss,
  title={Propagation-Based Vulnerability Impact Assessment for Software Supply Chains},
  author={Ruan, Bonan and Lin, Zhiwei and Liu, Jiahao and Zhang, Chuqi and Ji, Kaihang and Liang, Zhenkai},
  booktitle={Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering},
  pages={65--77},
  year={2025}
}

Identifying the impact scope and scale is critical for software supply chain vulnerability assessment. However, existing studies face substantial limitations. First, prior studies either work at coarse package-level granularity producing many false positives or fail to accomplish whole-ecosystem vulnerability propagation analysis. Second, although vulnerability assessment indicators like CVSS characterize individual vulnerabilities, no metric exists to specifically quantify the dynamic impact of vulnerability propagation across software supply chains. To address these limitations and enable accurate and comprehensive vulnerability impact assessment, we propose a novel approach: (i) a hierarchical worklist-based algorithm for whole-ecosystem and call-graph-level vulnerability propagation analysis and (ii) the Vulnerability Propagation Scoring System (VPSS), a dynamic metric to quantify the scope and evolution of vulnerability impacts in software supply chains. We implement a prototype of our approach in the Java Maven ecosystem and evaluate it on 100 real-world vulnerabilities. Experimental results demonstrate that our approach enables effective ecosystem-wide vulnerability propagation analysis, and provides a practical, quantitative measure of vulnerability impact through VPSS.
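
As a rough illustration of what worklist-based propagation means in general, the sketch below computes every function that can transitively reach a vulnerable function through a reverse call graph. It is a plain single-level worklist traversal under simplifying assumptions, not the hierarchical algorithm from the paper, and it does not compute VPSS. The graph, the `package:function` naming, and the `propagate` helper are all hypothetical.

```python
from collections import deque


def propagate(reverse_calls, vulnerable):
    """Return all functions that can transitively call a vulnerable function."""
    affected = set(vulnerable)
    worklist = deque(vulnerable)
    while worklist:
        callee = worklist.popleft()
        # Every caller of an affected function becomes affected as well
        for caller in reverse_calls.get(callee, ()):
            if caller not in affected:
                affected.add(caller)
                worklist.append(caller)
    return affected


# Hypothetical reverse call graph: callee -> list of callers
reverse_calls = {
    "libA:parse": ["libB:load", "libC:read"],
    "libB:load": ["app:main"],
    "libC:read": [],
    "libD:util": ["app:main"],
}

affected = propagate(reverse_calls, ["libA:parse"])
print(sorted(affected))
```

Working at function granularity is what keeps `libD:util` out of the result here: a package-level analysis would flag anything that shares a dependent with the vulnerable package, whether or not the vulnerable code is reachable.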

[USENIX Security 2025] Fuzzing the PHP Interpreter via Dataflow Fusion

Yuancheng Jiang, Chuqi Zhang, Bonan Ruan, Jiahao Liu, Manuel Rigger, Roland Yap, Zhenkai Liang

34th USENIX Security Symposium

Distinguished Paper Award

@inproceedings{jiang2025fuzzing,
  title={Fuzzing the PHP Interpreter via Dataflow Fusion},
  author={Jiang, Yuancheng and Zhang, Chuqi and Ruan, Bonan and Liu, Jiahao and Rigger, Manuel and Yap, Roland HC and Liang, Zhenkai},
  booktitle={34th USENIX Security Symposium (USENIX Security 25)},
  pages={6143--6158},
  year={2025}
}

PHP, a dominant scripting language in web development, powers a vast range of websites, from personal blogs to major platforms. While existing research primarily focuses on PHP application-level security issues like code injection, memory errors within the PHP interpreter have been largely overlooked. These memory errors, prevalent due to the PHP interpreter's extensive C codebase, pose significant risks to the confidentiality, integrity, and availability of PHP servers. This paper introduces FlowFusion, the first automatic fuzzing framework to detect memory errors in the PHP interpreter. FlowFusion leverages dataflow as an efficient representation of test cases maintained by PHP developers, merging two or more test cases to produce fused test cases with more complex code semantics. Moreover, FlowFusion employs strategies such as test mutation, interface fuzzing, and environment crossover to increase bug finding. In our evaluation, FlowFusion found 158 unknown bugs in the PHP interpreter, with 125 fixed and 11 confirmed. Comparing FlowFusion against the official test suite and a naive test concatenation approach, FlowFusion can detect new bugs that these methods miss, while also achieving greater code coverage. FlowFusion also outperformed state-of-the-art fuzzers AFL++ and Polyglot, covering 24% more lines of code after 24 hours of fuzzing. FlowFusion has gained wide recognition among PHP developers and is now integrated into the official PHP toolchain.

[RAID 2024] KernJC: Automated Vulnerable Environment Generation for Linux Kernel Vulnerabilities

Bonan Ruan, Jiahao Liu, Chuqi Zhang, Zhenkai Liang

27th International Symposium on Research in Attacks, Intrusions and Defenses

Best Practical Paper Award · Black Hat Asia 2025 Briefings

@inproceedings{ruan2024kernjc,
  title={KernJC: Automated Vulnerable Environment Generation for Linux Kernel Vulnerabilities},
  author={Ruan, Bonan and Liu, Jiahao and Zhang, Chuqi and Liang, Zhenkai},
  booktitle={Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses},
  pages={384--402},
  year={2024}
}

Linux kernel vulnerability reproduction is a critical task in system security. Reproducing a kernel vulnerability requires both a vulnerable environment and a Proof of Concept (PoC) program. Most existing research focuses on PoC generation, while environment construction is overlooked. However, establishing an effective vulnerable environment to trigger a vulnerability is challenging. First, it is hard to guarantee that the kernel version selected for reproduction is actually vulnerable, as the vulnerable-version claims in online databases are occasionally incorrect. Second, many vulnerabilities cannot be reproduced in kernels built with default configurations. Intricate non-default kernel configurations must be set to include and trigger a kernel vulnerability, but little information is available on how to identify these configurations.

To solve these challenges, we propose a patch-based approach to identify real vulnerable kernel versions and a graph-based approach to identify necessary configs for activating a specific vulnerability. We implement these approaches in a tool, KernJC, automating the generation of vulnerable environments for kernel vulnerabilities. To evaluate the efficacy of KernJC, we build a dataset containing 66 representative real-world vulnerabilities with PoCs from kernel vulnerability research in the past five years. The evaluation shows that KernJC builds vulnerable environments for all these vulnerabilities, 32 (48.5%) of which require non-default configs, and 4 of which have incorrect version claims in the National Vulnerability Database (NVD). Furthermore, we conduct large-scale spurious version detection on kernel vulnerabilities and identify 128 vulnerabilities that have spurious version claims in NVD. To foster future research, we release KernJC together with the dataset to the community.

[IEEE TPS-ISA 2021] Security Challenges in the Container Cloud

Yutian Yang, Wenbo Shen, Bonan Ruan, Wenmao Liu, Kui Ren

2021 Third IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA)

@inproceedings{yang2021security,
  title={Security Challenges in the Container Cloud},
  author={Yang, Yutian and Shen, Wenbo and Ruan, Bonan and Liu, Wenmao and Ren, Kui},
  booktitle={2021 Third IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA)},
  pages={137--145},
  year={2021},
  organization={IEEE},
  doi={10.1109/TPSISA52974.2021.00016}
}

In recent years, containerization has become a major trend in cloud computing because of its high resource-utilization efficiency and strong DevOps support. At the same time, the complexity of container systems introduces broad attack surfaces. This paper studies security challenges in the container cloud by dividing the system into the kernel layer, container layer, and orchestration layer, then summarizing the security technologies used in each part. It analyzes weaknesses and challenges across these layers, reviews the current protection status of container systems, and highlights future research directions. The study argues that improving container-cloud security requires stronger kernel isolation, more systematic security analysis of existing container techniques, and more comprehensive configuration-checking tools.

Datasets

[ASE 2025] A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis

Zhiwei Lin, Bonan Ruan, Jiahao Liu, Weibo Zhao

40th IEEE/ACM International Conference on Automated Software Engineering, Tool Demonstrations

@inproceedings{lin2024mcpcorpus,
  title={A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis},
  author={Lin, Zhiwei and Ruan, Bonan and Liu, Jiahao and Zhao, Weibo},
  booktitle={Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering},
  year={2025}
}

The Model Context Protocol (MCP) has recently emerged as a standardized interface for connecting language models with external tools and data. As the ecosystem rapidly expands, the lack of a structured, comprehensive view of existing MCP artifacts presents challenges for research. To bridge this gap, we introduce MCPCorpus, a large-scale dataset containing around 14K MCP servers and 300 MCP clients. Each artifact is annotated with 20+ normalized attributes capturing its identity, interface configuration, GitHub activity, and metadata. MCPCorpus provides a reproducible snapshot of the real-world MCP ecosystem, enabling studies of adoption trends, ecosystem health, and implementation diversity. To keep pace with the rapid evolution of the MCP ecosystem, we provide utility tools for automated data synchronization, normalization, and inspection. Furthermore, to support efficient exploration and exploitation, we release a lightweight web-based search interface. MCPCorpus is publicly available at: https://github.com/Snakinya/MCPCorpus.

[ASE 2024] VulZoo: A Comprehensive Vulnerability Intelligence Dataset

Bonan Ruan, Jiahao Liu, Weibo Zhao, Zhenkai Liang

39th IEEE/ACM International Conference on Automated Software Engineering, Tool Demonstrations

@inproceedings{ruan2024vulzoo,
  title={VulZoo: A Comprehensive Vulnerability Intelligence Dataset},
  author={Ruan, Bonan and Liu, Jiahao and Zhao, Weibo and Liang, Zhenkai},
  booktitle={Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering},
  pages={2334--2337},
  year={2024}
}

Software vulnerabilities pose critical security and risk concerns. Many techniques have been proposed to assess and prioritize vulnerabilities. To evaluate their performance, researchers often craft datasets from limited data sources, lacking a global overview of broad vulnerability intelligence. The repetitive data preparation process complicates the evaluation of new solutions. To solve this issue, we propose VulZoo, a comprehensive vulnerability intelligence dataset that covers 17 vulnerability data sources. We also construct connections among these sources, enabling more straightforward configuration and adaptation for different tasks. VulZoo provides utility scripts for automatic data synchronization and cleaning, relationship mining, and statistics generation. We make VulZoo publicly available and maintain it with incremental updates. We believe that VulZoo serves as a valuable input to vulnerability assessment and prioritization studies. The video is at https://youtu.be/EvoxQmUAHtw. The dataset is at https://github.com/NUS-Curiosity/VulZoo.