TM-Bench

A Benchmark for LLM-Based Threat Modeling

About TM-Bench

The story behind the world's first LLM threat modeling benchmark

Hi, I'm Matt Adams, the creator of TM-Bench and STRIDE GPT. As a cybersecurity professional with a passion for artificial intelligence, I've long been fascinated by the potential of Large Language Models to transform how we approach security challenges.

My journey into AI-powered threat modeling began with the creation of STRIDE GPT, an open-source tool that uses LLMs to generate threat models based on the STRIDE methodology. The enthusiastic adoption of this tool by the security community showed me both the promise and the limitations of current AI models in security contexts.
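
To give a flavour of how this works, here's a minimal sketch, not STRIDE GPT's actual implementation: it prompts an LLM to enumerate threats across the six STRIDE categories. The model name and prompt wording are placeholder assumptions, and it assumes the openai Python package with an OPENAI_API_KEY set in the environment.

```python
# Illustrative sketch only -- not STRIDE GPT's actual code.
# Assumes the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

STRIDE_CATEGORIES = [
    "Spoofing", "Tampering", "Repudiation",
    "Information Disclosure", "Denial of Service", "Elevation of Privilege",
]

def generate_threat_model(system_description: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to propose threats for a system, organised by STRIDE category."""
    client = OpenAI()
    prompt = (
        "You are a security architect. For the system described below, identify "
        f"plausible threats in each STRIDE category ({', '.join(STRIDE_CATEGORIES)}). "
        "For each threat, name the affected component and a suggested mitigation.\n\n"
        f"System description:\n{system_description}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```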


From STRIDE GPT to TM-Bench

When I created STRIDE GPT, I wanted to explore how LLMs could assist security professionals in the often time-consuming process of threat modeling. The tool quickly gained traction, with hundreds of stars on GitHub and adoption by security teams worldwide.

As I continued to develop and refine STRIDE GPT, I noticed significant variations in how different LLMs approached threat identification. Some models excelled at identifying certain types of threats while missing others entirely. This observation led me to a critical question: how could we objectively measure and compare LLM performance in threat modeling tasks?

This question was the genesis of TM-Bench. I realized that without a standardized benchmark, it would be impossible to meaningfully compare models or track improvements over time. So I set out to create the first benchmark specifically designed to evaluate LLM capabilities in security threat modeling.

Focus on Local LLMs

Since releasing STRIDE GPT, one of the most common questions I've received is: "Which local LLMs perform best for threat modeling?" Many security professionals and organizations prefer to run models locally for privacy or compliance reasons, or because they operate in air-gapped environments where cloud-based LLMs aren't an option.

This feedback directly shaped the focus of TM-Bench. The benchmark evaluates models that can run on consumer-grade hardware, meaning a single Nvidia GeForce RTX 4090 GPU or smaller. This makes it particularly relevant for security teams looking to implement threat modeling capabilities without relying on cloud APIs.

Beyond deployment constraints, benchmarking these smaller models helps teams find the right balance between accuracy, cost, and speed. By focusing on models that run on accessible hardware, TM-Bench gives security professionals the data to choose an LLM that delivers sufficient threat modeling performance for their needs while minimizing computational overhead and operational costs.

A First-of-its-Kind Benchmark

TM-Bench is the first benchmark in the world to evaluate the capability of Large Language Models for threat modeling. While other benchmarks exist for general AI capabilities or even some security tasks, TM-Bench is uniquely focused on the complex task of identifying security threats in system designs.

Creating this benchmark required developing a diverse set of application scenarios, establishing ground truth threat models with security experts, and designing a scoring methodology that could fairly evaluate model outputs. It's been a challenging but rewarding process, and I'm proud to contribute this tool to the security and AI communities.
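
To make "scoring methodology" concrete, here's a simplified sketch of one way such scoring can work. It is illustrative only; TM-Bench's actual test cases and scoring implementation are private, as discussed below. It treats a threat as a (component, STRIDE category) pair and scores a model's output against an expert ground truth using precision, recall, and F1.

```python
# Illustrative sketch only -- TM-Bench's real scoring implementation is private.
# Scores predicted threats against expert ground truth by exact matching on
# (component, STRIDE category) pairs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Threat:
    component: str  # e.g. "auth-service"
    category: str   # one of the six STRIDE categories

def score(predicted: set[Threat], ground_truth: set[Threat]) -> dict[str, float]:
    matched = predicted & ground_truth
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: a model that finds two of three ground-truth threats plus one
# spurious extra scores precision 2/3, recall 2/3, F1 2/3.
```

In practice, scoring also has to handle fuzzy matching, since models describe the same threat in many different ways, which is part of what makes designing a fair methodology challenging.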

My hope is that TM-Bench will not only help organizations make more informed decisions about which LLMs to use for security tasks, but also drive improvements in how these models are trained and fine-tuned for security applications.

Mission & Vision

My mission with TM-Bench is to provide an open, transparent benchmark for evaluating and improving LLM-based threat modeling capabilities. I believe that rigorous evaluation is essential for responsible AI deployment in security contexts.

The vision for TM-Bench extends beyond evaluation: I aim to drive improvements in AI security capabilities by highlighting areas where models excel and where they fall short. By identifying these strengths and weaknesses, we can guide the development of more effective AI security tools.

Collaboration & Community

TM-Bench is a project I created to address the need for rigorous evaluation of threat modeling capabilities in LLMs. I developed the methodology and evaluation criteria independently, drawing on my experience in both security research and AI evaluation.

Like many rigorous AI benchmarks, TM-Bench maintains a balance between transparency and integrity. While the methodology is publicly documented, the specific test cases and evaluation implementation remain private to prevent optimization specifically for the benchmark rather than for actual threat modeling capabilities. This approach aligns with best practices in AI evaluation, particularly in security domains where benchmark integrity is crucial.

I welcome feedback from the security community and am committed to continuously improving the benchmark based on input and emerging research in AI security. If you have suggestions for improvement or questions about the methodology, please reach out!