CTI-REALM: A new benchmark for end-to-end detection rule generation with AI agents

CTI-REALM is Microsoft’s open-source benchmark for evaluating AI agents that generate detection rules from threat intelligence reports. Unlike existing benchmarks that test parametric knowledge (for example, classifying techniques), CTI-REALM covers the end-to-end workflow of turning narrative CTI into operational detections. It uses 37 curated CTI reports and evaluates models across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
Summary: Microsoft has released CTI-REALM, an open-source benchmark for evaluating AI agents in generating detection rules from threat intelligence reports. It focuses on operationalizing threat insights into actionable detections.
Key facts
- Unlike benchmarks that test parametric knowledge, such as classifying techniques, CTI-REALM measures how well agents turn narrative CTI into operational detections.
- CTI-REALM evaluates the end-to-end workflow, including reading CTI reports, exploring telemetry, writing KQL queries, and producing Sigma rules.
- The benchmark uses 37 curated CTI reports across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
- Results from evaluating 16 frontier model configurations on CTI-REALM-50 show that Anthropic models lead across the board.
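As a sketch of the kind of artifact the workflow above produces, here is a minimal Sigma rule for a Linux endpoint. This example is illustrative only: the title, log source, field names, and detection logic are assumptions for demonstration, not content drawn from CTI-REALM or its reports.

```yaml
# Illustrative Sigma rule skeleton (not from the CTI-REALM benchmark).
# A CTI-REALM agent would derive fields like these from a narrative report.
title: Suspicious Download to /tmp via curl (illustrative)
status: experimental
description: Example of the Sigma rule format an agent might emit for a Linux endpoint.
logsource:
  product: linux
  category: process_creation
detection:
  selection:
    Image|endswith: '/curl'        # process executable path
    CommandLine|contains: '/tmp/'  # writing payloads to a world-writable dir
  condition: selection
level: medium
```

In practice, such a rule would be compiled to a backend query language (for example KQL, which the benchmark also exercises) before deployment against live telemetry.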
Why it matters
CTI-REALM matters to businesses because it measures how well AI operationalizes security workflows, showing where human review and guardrails are still needed. This supports safer adoption by letting teams assess model performance before deploying agents in production environments.
Key metrics
- Model performance: Anthropic leads, with Claude models occupying the top three positions (scores of 0.587–0.637)