Building LogSleuth: An AI-Powered Incident Investigator with Elastic Agent Builder
The 3 AM Problem
Picture this: It’s 3 AM. Your phone buzzes with a PagerDuty alert. “CRITICAL: Checkout error rate above 10%.”
You stumble to your laptop, connect to VPN, and begin the familiar dance:
- Open Kibana
- Start typing queries
- Scroll through thousands of logs
- Copy a trace ID, search again
- Try to piece together what happened
- Repeat for 45 minutes until you finally find the needle in the haystack
We’ve all probably been there. I decided to fix it.
Introducing LogSleuth
LogSleuth is an AI-powered incident investigation agent that automates the entire root cause analysis workflow. Built on Elastic Agent Builder, it reduced Mean Time To Resolution (MTTR) in my tests from 45 minutes to under 5 minutes, a 91% improvement.
91% Faster MTTR
Automated investigation in seconds, not hours
5-Step Methodology
Understand → Search → Analyze → Correlate → Synthesize
Learning System
Gets smarter from every investigation
Visual Insights
Sankey diagrams show error propagation
The Architecture
LogSleuth is designed around a simple but powerful idea: let AI do what humans repeatedly do, but faster.
When an SRE investigates an incident, they follow a mental workflow:
- Understand the alert
- Search for related logs
- Look for patterns (spikes, anomalies)
- Correlate across services
- Synthesize findings into a root cause
I encoded this workflow into an orchestration layer that calls specialized tools:
┌──────────────────────────────────────────────────────────────┐
│                     ORCHESTRATION LAYER                      │
│                                                              │
│                  Investigation Orchestrator                  │
│                                                              │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │UNDERSTAND│ → │  SEARCH  │ → │ ANALYZE  │ → │CORRELATE │   │
│  └──────────┘   └──────────┘   └──────────┘   └────┬─────┘   │
│                                                    │         │
│                                                    ▼         │
│                                             ┌──────────┐     │
│                                             │SYNTHESIZE│     │
│                                             └──────────┘     │
└──────────────────────────────────────────────────────────────┘
Why Elastic Agent Builder?
I chose Elastic Agent Builder because it provides capabilities I’d struggle to build from scratch:
1. Native ES|QL Integration
The tools execute ES|QL queries directly against Elasticsearch. No middleware, no translation layer—just fast, native queries.
FROM logs-logsleuth
| WHERE @timestamp >= NOW() - ?time_range
| WHERE log.level == "error"
| STATS error_count = COUNT(*) BY service.name, error.type
| SORT error_count DESC
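In the tools, a query like this goes straight to the cluster through the Python client's native ES|QL endpoint. Here is a minimal sketch, assuming the 8.x `elasticsearch` Python client; `build_error_stats_query` and `run_error_stats` are illustrative names, and whether a duration such as the time range can be bound as a named `?time_range` parameter depends on your Elasticsearch version:

```python
def build_error_stats_query(index: str = "logs-logsleuth") -> str:
    """The ES|QL query above, assembled as a single string for the client."""
    return (
        f"FROM {index} "
        "| WHERE @timestamp >= NOW() - ?time_range "
        '| WHERE log.level == "error" '
        "| STATS error_count = COUNT(*) BY service.name, error.type "
        "| SORT error_count DESC"
    )

def run_error_stats(client, time_range: str = "2 hours"):
    """client: an elasticsearch.Elasticsearch instance.

    esql.query executes the query natively; params binds the placeholder.
    """
    return client.esql.query(
        query=build_error_stats_query(),
        params=[{"time_range": time_range}],
    )
```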
2. Intelligent Tool Selection
The agent doesn’t blindly execute all tools. It reasons about which tools to call based on intermediate results.
Found no errors in the initial search? It broadens the query. Found a trace ID? It automatically correlates across services.
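That broaden-then-correlate behaviour can be sketched in a few lines. This is illustrative, not LogSleuth's actual control flow; `investigate_step` is a hypothetical helper and the two callables stand in for the agent's tools:

```python
def investigate_step(search_logs, find_correlated_logs, query: str):
    """Search for errors, broaden if nothing matches, follow any trace IDs."""
    results = search_logs(search_query=query, log_level="error")
    if not results:
        # Nothing at error level: broaden the search to all log levels
        results = search_logs(search_query=query, log_level=None)
    # Any trace ID found is automatically correlated across services
    trace_ids = sorted({r["trace_id"] for r in results if r.get("trace_id")})
    correlated = [find_correlated_logs(trace_id=t) for t in trace_ids]
    return results, correlated
```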
3. Built-in Context Management
Agent Builder handles conversation memory, token management, and result summarization—things that would take weeks to build properly.
4. Production-Ready Deployment
Deploy to Kibana with RBAC, audit logging, and rate limiting built in. No infrastructure to manage.
The Tools
LogSleuth has 4 custom tools, each designed for a specific investigation task:
search_logs(
    client,
    search_query="Connection refused",
    time_range="2h",
    service_name="payment-service",
    log_level="error"
)
Searches for logs matching criteria. Returns formatted results with trace IDs for correlation.
get_error_frequency(
    client,
    time_range="2h",
    interval="5m"
)
Analyzes error patterns over time. Detects spikes when error count exceeds 2x average.
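The 2x-average rule from the tool description reduces to a few lines. A hypothetical sketch (`detect_spikes` is an illustrative name; the real tool works on ES|QL histogram buckets):

```python
def detect_spikes(counts, threshold: float = 2.0):
    """Return indices of time buckets whose error count exceeds
    `threshold` times the average across all buckets."""
    if not counts:
        return []
    avg = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if avg > 0 and c > threshold * avg]
```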
find_correlated_logs(
    client,
    trace_id="abc123def456"
)
Traces a request across services. Identifies the root cause service (first to error).
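"First to error" is simply the error log with the earliest timestamp within the trace. A minimal sketch (`root_cause_service` is an illustrative name, operating on simplified log dicts):

```python
def root_cause_service(trace_logs):
    """Given all logs sharing one trace ID, return the service that
    errored first, i.e. the likely root cause."""
    errors = [log for log in trace_logs if log["level"] == "error"]
    if not errors:
        return None
    # ISO-8601 timestamps sort lexicographically, so min() finds the earliest
    return min(errors, key=lambda log: log["timestamp"])["service"]
```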
search_past_incidents(
    client,
    search_terms="connection pool exhaustion"
)
Searches the knowledge base for similar past incidents. Returns previous resolutions.
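Under the hood this is a full-text query over saved investigation documents. A hedged sketch of what that request body might look like (`build_incident_search` is an illustrative helper; the field names follow the saved-investigation schema shown later in this post):

```python
def build_incident_search(search_terms: str, size: int = 3) -> dict:
    """Elasticsearch query body for finding similar past incidents."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": search_terms,
                # Match on both the diagnosed cause and the suggested fix
                "fields": ["findings.root_cause", "remediation.suggestions"],
            }
        },
    }
```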
The Visualization
Raw data isn’t enough. I added visual insights to make findings immediately understandable.
Sankey Diagram
The Sankey diagram shows how requests flow between services—and where errors propagate:
api-gateway ──────────► checkout-service ──────────► payment-service
     ▲                         ▲                            │
     │                         │                     [ERROR ORIGIN]
     │                         │                            │
     └───[ERROR CASCADE]◄──────┴────────────────────────────┘
The links indicate error paths. You can see at a glance that payment-service was the origin, and the error cascaded back through checkout-service to api-gateway.
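If, like my Streamlit dashboard, you render this with Plotly (an assumption; `sankey_data` is an illustrative helper), the observed service-to-service hops just need to be aggregated into the parallel index lists that a `go.Sankey` trace expects:

```python
from collections import Counter

def sankey_data(edges):
    """edges: (source_service, target_service) pairs, one per observed hop.

    Returns node labels plus parallel source/target/value lists in the
    shape Plotly's go.Sankey trace consumes."""
    labels = sorted({s for s, _ in edges} | {t for _, t in edges})
    index = {name: i for i, name in enumerate(labels)}
    counts = Counter(edges)  # how many times each hop occurred
    sources = [index[s] for s, _ in counts]
    targets = [index[t] for _, t in counts]
    values = list(counts.values())
    return labels, sources, targets, values
```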
Progress Stepper
The 5-step investigation methodology is visualized in real-time:
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│    ✓     │ → │    ✓     │ → │    ✓     │ → │    ●     │ → │    ○     │
│UNDERSTAND│   │  SEARCH  │   │ ANALYZE  │   │CORRELATE │   │SYNTHESIZE│
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
Engineers can see exactly where the investigation is and what’s happening.
The Knowledge Base
Here’s where LogSleuth truly differentiates: it learns from every investigation.
When an investigation is saved, it becomes part of a searchable knowledge base:
{
  "investigation": {
    "id": "INV-20260120-A7B3C9",
    "status": "completed"
  },
  "findings": {
    "root_cause": "Database primary failover caused connection pool exhaustion",
    "root_cause_service": "payment-service",
    "affected_services": ["payment-service", "checkout-service", "api-gateway"]
  },
  "remediation": {
    "suggestions": "Enable circuit breaker, increase connection pool timeout",
    "resolution_applied": "Restarted pods, enabled circuit breaker"
  }
}
Next time a similar incident occurs, LogSleuth surfaces this past investigation automatically. No more solving the same problem twice.
Performance Results
I measured MTTR across simulated incidents:
| Scenario | Manual Investigation | With LogSleuth | Time Saved |
|---|---|---|---|
| Database failover | 52 minutes | 4 min 30 sec | 47.5 min |
| Payment processor outage | 45 minutes | 3 min 45 sec | 41.25 min |
| Timeout cascade | 38 minutes | 4 min 15 sec | 33.75 min |
| Average | 45 minutes | 4 min 10 sec | 40.8 min (91%) |
A 91% reduction in Mean Time To Resolution: from 45 minutes on average to just over 4.
Technical Implementation
The Orchestrator
The heart of LogSleuth is the InvestigationOrchestrator class:
from typing import Any, Dict

class InvestigationOrchestrator:
    async def investigate(
        self,
        incident_description: str,
        time_range: str = "2h",
        save_results: bool = False,
    ) -> Dict[str, Any]:
        """Run a complete incident investigation."""
        # Shared context carries findings forward from step to step
        context = InvestigationContext(
            incident_description=incident_description,
            time_range=time_range,
        )

        # Execute the five investigation steps in order
        await self._step_understand(context)
        await self._step_search(context)
        await self._step_analyze(context)
        await self._step_correlate(context)
        await self._step_synthesize(context)

        return self._build_final_report(context)
Each step builds on the previous one. If the search finds trace IDs, they’re passed to correlation. If analysis finds a spike, it’s highlighted in the final report.
Async & Streaming
I added async support for non-blocking execution:
async for update in investigation.investigate_stream("payment errors"):
    print(f"Step: {update['step']}, Status: {update['status']}")
    if update.get('data'):
        process_intermediate_results(update['data'])
This enables real-time progress updates in the dashboard.
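The streaming side can be backed by a plain async generator. A minimal sketch, not the production code: the step names come from the methodology, and the `sleep` stands in for the real awaited tool calls:

```python
import asyncio

async def investigate_stream(description: str):
    """Yield a progress update as each investigation step starts and finishes."""
    for step in ["understand", "search", "analyze", "correlate", "synthesize"]:
        yield {"step": step, "status": "running"}
        await asyncio.sleep(0)  # the real implementation awaits Elasticsearch here
        yield {"step": step, "status": "complete"}
```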
Query Caching
Repeated queries hit a cache instead of Elasticsearch:
from datetime import datetime, timedelta
from typing import Any, Dict, Optional

class QueryCache:
    def __init__(self, default_ttl_seconds: int = 60):
        self._ttl = default_ttl_seconds
        self._cache: Dict[str, Dict[str, Any]] = {}

    def _make_key(self, query_type: str, **kwargs) -> str:
        # Deterministic key from the query type and its parameters
        return f"{query_type}:{sorted(kwargs.items())}"

    def set(self, query_type: str, data: Any, **kwargs) -> None:
        expires = datetime.utcnow() + timedelta(seconds=self._ttl)
        self._cache[self._make_key(query_type, **kwargs)] = {"data": data, "expires": expires}

    def get(self, query_type: str, **kwargs) -> Optional[Any]:
        entry = self._cache.get(self._make_key(query_type, **kwargs))
        if entry and datetime.utcnow() <= entry["expires"]:
            return entry["data"]
        return None
This reduces load on Elasticsearch and speeds up repeated investigations.
Lessons Learned
1. Tool Design Matters
The granularity of your tools determines how intelligently the agent can reason. I started with 3 large tools and refactored to 4 focused ones. The agent’s reasoning improved dramatically.
2. Context is King
The orchestrator maintains context across steps. Without this, each tool call would be independent, and the agent couldn’t build on previous findings.
3. Visualizations Beat Text
Engineers understand a Sankey diagram faster than 20 lines of log output. Invest in visualization.
4. Learn from History
The knowledge base feature wasn’t in my original plan, but it became the key differentiator. Teams don’t just want to solve incidents—they want to prevent solving the same incident twice.
Try It Yourself
LogSleuth is open source:
- Clone the repository:
  git clone https://github.com/AbisoyeAlli/logsleuth
  cd logsleuth
- Set up the environment:
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
- Configure Elasticsearch:
  cp .env.example .env
  # Edit .env with your Elasticsearch credentials
- Load the demo data:
  python scripts/setup_elasticsearch.py
- Run the dashboard:
  streamlit run src/dashboard.py
What’s Next
I am exploring:
- Slack/PagerDuty integration for chat-based investigations
- Anomaly detection with Elastic ML for smarter spike detection
- Automated remediation for common issues
- Multi-cluster support for organizations with distributed infrastructure
Conclusion
Incident response doesn’t have to be painful. With Elastic Agent Builder, I built an AI agent that:
- Investigates incidents in seconds, not hours
- Shows exactly how errors propagate with visual diagrams
- Learns from every investigation to prevent repeat issues
- Integrates natively with Elasticsearch and Kibana
The result? 91% faster incident resolution. More sleep for on-call engineers. Less revenue lost to downtime.
That’s LogSleuth. Stop drowning in logs. Start solving incidents.