RootScout — AI-Powered Root Cause Analysis

The Problem

Incidents are expensive. Manual investigation is worse.

Every minute your on-call engineer spends reading logs across 20 services is a minute your users are experiencing downtime. Traditional approaches are slow, noisy, and error-prone.

Without RootScout

The old way is broken

When an alert fires, engineers must manually:

Triage which services are affected
Correlate logs across multiple dashboards
Check for recent deployments
Guess at root cause under pressure
Write post-mortems from scratch

With RootScout

Automated, focused investigation

RootScout does the legwork instantly:

Graph traversal scopes impacted services
Telemetry automatically collected and ranked
GitHub PRs and commits correlated
LLM generates structured RCA with reasoning
Slack alert delivered with recommended actions

How It Works

The detective and the blueprint

Think of a building on fire. The old approach sends a detective to search every room. RootScout hands the detective a blueprint first — so they go straight to the source.

Ingest Telemetry

Services emit OpenTelemetry traces, metrics, and logs to RootScout's OTLP endpoints in real time.

Build the Graph

Spans are parsed to automatically construct a live service dependency graph — nodes, edges, and health status.
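As a rough illustration of how spans can yield a dependency graph, the sketch below links a caller service to a callee service whenever a span's parent belongs to a different service. The span field names (`span_id`, `parent_span_id`, `service`) are simplified assumptions, and a plain dict of sets stands in for the NetworkX graph so the example has no dependencies.

```python
def build_service_graph(spans):
    """Return {caller_service: {callee_service, ...}} from a flat list of spans."""
    # Map every span to the service that emitted it (hypothetical field names).
    service_of = {s["span_id"]: s["service"] for s in spans}
    graph = {}
    for span in spans:
        callee = span["service"]
        graph.setdefault(callee, set())  # register the node even without edges
        caller = service_of.get(span.get("parent_span_id"))
        # A parent span in a different service implies a cross-service call.
        if caller is not None and caller != callee:
            graph.setdefault(caller, set()).add(callee)
    return graph


spans = [
    {"span_id": "a", "parent_span_id": None, "service": "api-gateway"},
    {"span_id": "b", "parent_span_id": "a", "service": "backend-service"},
    {"span_id": "c", "parent_span_id": "b", "service": "database"},
]
graph = build_service_graph(spans)
```

Health status, latency, and error rates would be attached as node or edge attributes; they are left out here to keep the sketch short.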

Scope the Blast Radius

When an alert fires, a breadth-first search (BFS) over the dependency graph selects only the services related to the alerting one, so unrelated telemetry is never sent to the LLM.
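The scoping step can be sketched as a hop-limited BFS from the alerting service. The page does not say which edge direction or hop limit RootScout uses, so this version assumes edges are followed in both directions (callers and dependencies) and takes a hypothetical `max_hops` parameter.

```python
from collections import deque


def blast_radius(graph, start, max_hops=2):
    """Return {service: hop_distance} for services within max_hops of start."""
    # Assumption: treat edges as undirected so both upstream callers and
    # downstream dependencies of the alerting service are in scope.
    neighbors = {}
    for src, dsts in graph.items():
        neighbors.setdefault(src, set())
        for dst in dsts:
            neighbors[src].add(dst)
            neighbors.setdefault(dst, set()).add(src)

    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


# Hypothetical topology used only for illustration.
graph = {
    "api-gateway": {"cart-service", "search-service"},
    "cart-service": {"database"},
    "search-service": {"search-index"},
}
scope = blast_radius(graph, "cart-service", max_hops=1)
```

With `max_hops=1` the scope contains only `cart-service`, its caller `api-gateway`, and its dependency `database`; the search services are left out of the LLM context.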

Enrich with Code Context

Recent GitHub PRs and commits are automatically attached — correlating deployments with the outage window.
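One simple way to correlate changes with an outage is a time-window filter. The sketch below assumes each change record carries a `merged_at` timestamp and uses a hypothetical two-hour lookback; the timestamps and PR #41 are made up for illustration, and the page does not state the window RootScout applies.

```python
from datetime import datetime, timedelta, timezone


def changes_in_window(changes, alert_time, lookback=timedelta(hours=2)):
    """Return changes merged within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    in_window = [c for c in changes if window_start <= c["merged_at"] <= alert_time]
    return sorted(in_window, key=lambda c: c["merged_at"], reverse=True)


alert_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative
changes = [
    {
        "ref": "PR #47",
        "title": "Reduce max_pool_size for cost savings",
        "merged_at": alert_time - timedelta(minutes=25),
    },
    {
        "ref": "PR #41",  # hypothetical unrelated change
        "title": "Update README",
        "merged_at": alert_time - timedelta(days=3),
    },
]
suspects = changes_in_window(changes, alert_time)
```

Only the change merged shortly before the alert survives the filter, which is the kind of candidate the LLM would then weigh against the telemetry.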

LLM Analysis

A structured prompt is sent to an LLM. The model reasons step-by-step and returns a JSON RCA report.
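A minimal sketch of the prompt-and-parse step follows. The prompt wording and the JSON key names (`root_cause_service`, `confidence`, `reasoning`, `recommended_actions`) are assumptions, not RootScout's actual schema, and the model call itself is omitted; a hard-coded string stands in for the LLM response.

```python
import json

# Hypothetical schema: the keys a well-formed RCA report is expected to contain.
REQUIRED_FIELDS = {"root_cause_service", "confidence", "reasoning", "recommended_actions"}


def build_prompt(alert, services, changes):
    """Assemble a structured prompt from the scoped context."""
    return "\n".join([
        "You are assisting with root cause analysis for a production incident.",
        f"Alert: {alert}",
        "Services in scope: " + ", ".join(services),
        "Recent changes: " + ("; ".join(changes) if changes else "none"),
        "Reason step by step, then answer with a single JSON object containing "
        "the keys: " + ", ".join(sorted(REQUIRED_FIELDS)) + ".",
    ])


def parse_rca(raw):
    """Parse and validate the model's JSON answer before it is acted upon."""
    report = json.loads(raw)
    if not isinstance(report, dict):
        raise ValueError("RCA report must be a JSON object")
    missing = REQUIRED_FIELDS - report.keys()
    if missing:
        raise ValueError(f"RCA report is missing fields: {sorted(missing)}")
    if not 0.0 <= report["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return report


prompt = build_prompt(
    "High error rate on cart-service",
    ["cart-service", "api-gateway", "database"],
    ["PR #47: Reduce max_pool_size for cost savings"],
)
raw_response = json.dumps({  # stand-in for a real model response
    "root_cause_service": "cart-service",
    "confidence": 0.92,
    "reasoning": "Pool size was reduced shortly before connection errors began.",
    "recommended_actions": ["Revert PR #47 or increase max_pool_size"],
})
report = parse_rca(raw_response)
```

Validating the response before posting it guards against malformed or partial model output reaching the on-call channel.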

Notify and Act

The RCA is posted to Slack with root cause, confidence, and recommended actions.

Your Microservices: API Gateway → Backend Service → Database

▼ OTel Traces · Metrics · Logs via OTLP

RootScout Platform

OTel Ingester (Traces · Metrics · Logs) + GitHub Ingester (Push Events · PRs)

▼

Service Dependency Graph (Live Topology Built from Spans · NetworkX)

▼ BFS scope

Context Retriever (Logs · Metrics · Git Diffs · Deployment Window)

▼

LLM Agent (Chain-of-Thought → JSON RCA)

▼

Slack Alert: RCA report posted to #rootscout-alerts with root cause, confidence, and recommended actions

Features

Everything an SRE needs. Automated.

RootScout combines industry-standard observability protocols with AI analysis into a single, production-ready platform.

Graph

Live Service Dependency Graph

Built automatically from OTel traces using NetworkX. Tracks service health, latency, and error rates in real time.

OTel

OTLP Protocol Support

Native ingestion of OpenTelemetry traces, metrics, and logs. Drop-in compatible with any OTel-instrumented stack.

LLM

Multi-LLM Backend

Supports Google Gemini and Anthropic Claude. Structured Chain-of-Thought prompting with JSON-formatted RCA output.

GitHub

GitHub Integration

Webhook-based ingestion of push events and pull requests. Recent deployments are automatically correlated with incidents.
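To give a sense of what webhook ingestion involves, the sketch below flattens a GitHub push event into change records. The payload keys used (`repository.full_name`, `commits[].id`, `message`, `timestamp`) reflect GitHub's push event format as I understand it and should be checked against the webhook documentation; the record shape and the sample payload are hypothetical.

```python
from datetime import datetime


def commits_from_push(payload):
    """Flatten a GitHub push event payload into a list of change records."""
    repo = payload["repository"]["full_name"]
    return [
        {
            "repo": repo,
            "sha": commit["id"],
            "message": commit["message"],
            # Push events carry ISO 8601 timestamps with a UTC offset.
            "committed_at": datetime.fromisoformat(commit["timestamp"]),
        }
        for commit in payload.get("commits", [])
    ]


# Trimmed, made-up payload for illustration only.
payload = {
    "ref": "refs/heads/main",
    "repository": {"full_name": "example-org/cart-service"},
    "commits": [
        {
            "id": "abc123",
            "message": "Reduce max_pool_size for cost savings",
            "timestamp": "2024-01-01T11:35:00+00:00",
        }
    ],
}
records = commits_from_push(payload)
```

A production handler would also verify the webhook signature before trusting the payload; that step is omitted here.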

Slack

Slack Bot

Real-time alerts with severity indicators. Incidents are automatically posted to your configured channel with actionable context.

API

FastAPI Service

Production-ready REST API with OTLP collector endpoints, GitHub webhooks, and background processing. Docker-ready.

Signal

Focused Context, Not Noise

Graph-scoped BFS traversal means the LLM only sees relevant services — reducing hallucinations and API costs.

Output

Structured RCA Reports

JSON output with root cause service, confidence score, reasoning chain, and recommended actions.

Eval

Built-in Evaluation Framework

Benchmark against 10 synthetic scenarios with ground-truth root causes and a rigorous three-axis scoring rubric.

Sample Slack alert delivered by RootScout:

Critical: Root Cause Analysis Complete

Service: cart-service

Root cause: cart-service (DB connection pool exhausted)

Confidence: 0.92

Linked PR: #47 — Reduce max_pool_size for cost savings

Actions:

1. Revert PR #47 or increase max_pool_size

2. Restart cart-service pods

3. Monitor db latency for 10 min
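An alert like the one above could be rendered from a parsed report roughly as follows. The report keys are the same hypothetical ones as elsewhere on this page plus an assumed `summary` field; the returned `{"text": ...}` dict is the simple payload shape that Slack incoming webhooks accept, and no network call is made here.

```python
def format_slack_alert(report, severity="Critical"):
    """Render an RCA report as a plain-text Slack message payload."""
    lines = [
        f"{severity}: Root Cause Analysis Complete",
        f"Root cause: {report['root_cause_service']} ({report['summary']})",
        f"Confidence: {report['confidence']:.2f}",
        "Actions:",
    ]
    # Number the recommended actions starting from 1.
    lines += [
        f"{i}. {action}"
        for i, action in enumerate(report["recommended_actions"], start=1)
    ]
    return {"text": "\n".join(lines)}


report = {
    "root_cause_service": "cart-service",
    "summary": "DB connection pool exhausted",
    "confidence": 0.92,
    "recommended_actions": [
        "Revert PR #47 or increase max_pool_size",
        "Restart cart-service pods",
        "Monitor db latency for 10 min",
    ],
}
payload = format_slack_alert(report)
```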

Evaluation

Measured, not assumed.

RootScout ships with a rigorous evaluation suite — synthetic benchmarks with known root causes scored across three independent axes.

Dataset	Strengths	Limitations	Best model	Component match score	RCA cosine similarity score
OpenRCA (Microsoft)	Emulates real-life production incidents	No codebase included	Claude Opus 4.6	45%	18%
RCAEvals	Includes both telemetry and codebase, enabling deeper RCA	Doesn't emulate real-life incidents well	Claude Opus 4.6	56%	28%
Synthetic data	Easy to generate; covers varied test scenarios	Doesn't emulate real-life incidents well	Claude Opus 4.6	100%	91%
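The two scores in the table can be illustrated with a small sketch. The page does not describe how RootScout computes them, so this is an assumption: component match is treated as a case-insensitive exact match on the service name, and cosine similarity is computed over bag-of-words token counts rather than whatever text representation the real suite uses.

```python
import math
import re
from collections import Counter


def component_match(predicted, expected):
    """True when the predicted root-cause service equals the ground truth."""
    return predicted.strip().lower() == expected.strip().lower()


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words token counts of two texts."""
    vec_a = Counter(re.findall(r"[a-z0-9_-]+", text_a.lower()))
    vec_b = Counter(re.findall(r"[a-z0-9_-]+", text_b.lower()))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = math.sqrt(sum(v * v for v in vec_a.values())) * math.sqrt(
        sum(v * v for v in vec_b.values())
    )
    return dot / norm if norm else 0.0


matched = component_match("Cart-Service", "cart-service")
similarity = cosine_similarity(
    "cart-service connection pool exhausted",
    "connection pool exhausted in cart-service",
)
```

Averaging each score over all scenarios in a dataset would give percentages comparable in form to those in the table.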

Stop hunting. Start finding root causes.