TM-Bench

A Benchmark for LLM-Based Threat Modeling

Benchmark Methodology

Learn how TM-Bench evaluates LLM-based threat modeling capabilities

Benchmark Design

TM-Bench is designed to comprehensively evaluate how well Large Language Models (LLMs) can perform threat modeling across a diverse range of application scenarios. The benchmark focuses on models that can run on consumer-grade hardware (Nvidia GeForce RTX 4090 GPU or smaller), making it valuable for security teams looking to implement threat modeling capabilities without relying on cloud APIs.

The benchmark is also useful for teams looking to identify smaller LLMs that are “good enough” for their threat modeling needs. This allows organizations to optimize for performance and cost when using hosted LLM provider APIs, rather than defaulting to the largest and most expensive models.

The benchmark consists of several key components:

  • Application Descriptions: A diverse set of application scenarios with varying complexity levels (basic, moderate, and complex)
  • Ground Truth Threat Models: Synthetic threat models created for each application using the STRIDE methodology
  • Evaluation Framework: An “LLM-as-a-judge” approach using Claude 3.7 Sonnet to evaluate model outputs
  • Standardized Prompts: Consistent instructions for all evaluated models to ensure fair comparison
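
As a rough illustration of the standardized prompts component, a shared template might look like the sketch below. The wording and output field names are illustrative assumptions, not the exact template used by TM-Bench.

```python
# Illustrative prompt template; the exact wording and output fields used by
# TM-Bench may differ.
PROMPT_TEMPLATE = """You are a security architect performing STRIDE threat modeling.

Application description:
{application_description}

Identify the relevant threats in each STRIDE category and return them as JSON
objects with the fields: category, description, affected_components.
"""

def build_prompt(application_description: str) -> str:
    """Fill the shared template with one application description."""
    return PROMPT_TEMPLATE.format(application_description=application_description)
```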

Evaluation Metrics

TM-Bench uses a percentage-based scoring system (0-100%) across four key dimensions to evaluate model performance:

  • STRIDE Coverage & Accuracy (30% of overall score): Measures how well the model covers all threat categories in the STRIDE framework, including category coverage, classification accuracy, and balance across threat types.
  • Threat Completeness (30% of overall score): Evaluates the quantity of threats captured, specificity and definition quality, and component mapping accuracy compared to the ground truth.
  • Technical Validity (30% of overall score): Assesses the technical accuracy of threat descriptions, practical relevance to the system, and appropriate level of detail.
  • JSON Structure Compliance (10% of overall score): Measures schema adherence, field completeness, and proper formatting of the output.

These metrics are combined into an overall weighted score that provides a comprehensive assessment of the model’s threat modeling capabilities.
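
For illustration, the weighting can be expressed as a small function. The dimension keys and the example scores below are illustrative, not taken from the TM-Bench code.

```python
# Sketch of the overall weighted score; weights mirror the percentages above.
WEIGHTS = {
    "stride_coverage": 0.30,
    "threat_completeness": 0.30,
    "technical_validity": 0.30,
    "json_compliance": 0.10,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension percentages (0-100) into one weighted score."""
    return sum(weight * dimension_scores[name] for name, weight in WEIGHTS.items())

# Example: 80 / 70 / 75 / 95 across the four dimensions -> 77.0 overall
print(overall_score({
    "stride_coverage": 80.0,
    "threat_completeness": 70.0,
    "technical_validity": 75.0,
    "json_compliance": 95.0,
}))
```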

Benchmark Process

1. Ground Truth Creation

Synthetic threat models are generated to serve as the ground truth for evaluation. These models are based on a diverse set of application descriptions and identify relevant security threats using the STRIDE framework (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege).

Each threat model is saved as a structured JSON file with detailed information about each identified threat, including its category, description, and affected components.
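
A ground-truth file might look roughly like the following sketch; the field names and example threats are assumptions based on the description above, not the exact TM-Bench schema.

```python
import json
from pathlib import Path

# Hypothetical ground-truth threat model for one application.
ground_truth = {
    "application": "basic-web-app",
    "complexity": "basic",
    "threats": [
        {
            "category": "Spoofing",
            "description": "An attacker impersonates a legitimate user by replaying stolen session cookies.",
            "affected_components": ["authentication service", "session store"],
        },
        {
            "category": "Information Disclosure",
            "description": "Unencrypted database backups expose customer records if storage is compromised.",
            "affected_components": ["backup storage"],
        },
    ],
}

Path("ground_truth").mkdir(exist_ok=True)
Path("ground_truth/basic-web-app.json").write_text(json.dumps(ground_truth, indent=2))
```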

2. Model Evaluation

The benchmark evaluates local LLMs running on consumer-grade hardware. Each model is prompted to generate a threat model for each application using standardized prompts. The generated threat models are then compared against the ground truth models using Claude 3.7 Sonnet as an expert judge.

This “LLM-as-a-judge” approach enables detailed, nuanced evaluation of model outputs that goes beyond simple keyword matching or structural comparison.
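
A minimal sketch of the judging step, assuming the Anthropic Python SDK, is shown below; the prompt wording is illustrative, not the exact rubric used by TM-Bench.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(candidate: dict, ground_truth: dict) -> str:
    """Ask the judge model to compare a generated threat model with the ground truth."""
    prompt = (
        "You are an expert security reviewer. Compare the candidate threat model "
        "against the ground truth and return percentage scores for STRIDE coverage, "
        "threat completeness, technical validity, and JSON structure compliance, "
        "along with strengths and weaknesses.\n\n"
        f"Ground truth:\n{json.dumps(ground_truth, indent=2)}\n\n"
        f"Candidate:\n{json.dumps(candidate, indent=2)}"
    )
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # judge model; identifier may differ
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```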

3. Scoring and Analysis

For each application, the judge provides detailed percentage scores across the four evaluation dimensions, along with specific strengths and weaknesses for each area. The judge also provides threat comparison statistics (threats found, categories covered/missed) and an overall weighted score.
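
The shape of a single judge result might look roughly like this; the keys mirror the fields described above and are not the exact TM-Bench output schema.

```python
# Hypothetical per-application judge result.
judge_result = {
    "application": "basic-web-app",
    "scores": {
        "stride_coverage": 80.0,
        "threat_completeness": 70.0,
        "technical_validity": 75.0,
        "json_compliance": 95.0,
    },
    "strengths": ["Good coverage of Spoofing and Tampering threats"],
    "weaknesses": ["No Repudiation threats identified"],
    "threat_stats": {
        "threats_found": 9,
        "categories_covered": ["S", "T", "I", "D", "E"],
        "categories_missed": ["R"],
    },
    "overall_score": 77.0,
}
```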

Results are broken down by application complexity to show how models perform across different types of systems. This enables fine-grained comparison between models and highlights each model's specific strengths and weaknesses in threat modeling tasks.
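
One way to produce the per-complexity breakdown is sketched below, assuming each result records its application's complexity level; the structure and values are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Example per-application overall scores (illustrative values only).
results = [
    {"complexity": "basic", "overall_score": 77.0},
    {"complexity": "basic", "overall_score": 81.0},
    {"complexity": "moderate", "overall_score": 64.5},
    {"complexity": "complex", "overall_score": 58.0},
]

by_complexity = defaultdict(list)
for r in results:
    by_complexity[r["complexity"]].append(r["overall_score"])

for level, scores in by_complexity.items():
    print(f"{level}: mean overall score {mean(scores):.1f}% across {len(scores)} applications")
```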

Benchmark Implementation

The TM-Bench evaluation process is implemented as a controlled pipeline that ensures consistent and fair assessment across all models:

  • Standardized Input: Each model receives identical application descriptions and instructions through a consistent prompt template
  • Local Model Testing: Models are evaluated on consumer-grade hardware, allowing for testing of various local LLMs
  • Expert Evaluation: Claude 3.7 Sonnet serves as the judge, providing detailed analysis of each generated threat model
  • Comprehensive Reporting: Results are saved as structured JSON files with detailed scores and analysis

The benchmark is designed with careful controls to prevent gaming of the system. Results are automatically saved with timestamps and model names, making it easy to track performance over time and compare different models.
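
A sketch of the timestamped results-saving step is shown below; the directory layout and filename pattern are assumptions chosen only to illustrate per-model, per-run output files.

```python
import json
import time
from pathlib import Path

def save_results(model_name: str, results: list[dict], out_dir: str = "results") -> Path:
    """Write one benchmark run to a timestamped JSON file named after the model."""
    Path(out_dir).mkdir(exist_ok=True)
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    path = Path(out_dir) / f"{model_name}_{timestamp}.json"
    path.write_text(json.dumps(results, indent=2))
    return path

# e.g. save_results("llama-3.1-8b-instruct", [judge_result])
```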

This methodology ensures that TM-Bench provides valuable insights into the threat modeling capabilities of different LLMs, helping security teams make informed decisions about which models to use for their specific needs.