Building LogSleuth: An AI-Powered Incident Investigator with Elastic Agent Builder
The 3 AM Problem
Picture this: It’s 3 AM. Your phone buzzes with a PagerDuty alert. “CRITICAL: Checkout error rate above 10%.”
You stumble to your laptop, connect to VPN, and begin the familiar dance:
- Open Kibana
- Start typing queries
- Scroll through thousands of logs
- Copy a trace ID, search again
- Try to piece together what happened
- Repeat for 45 minutes until you finally find the needle in the haystack
We’ve all probably been there. I decided to fix it.
Introducing LogSleuth
LogSleuth is an AI-powered incident investigation agent that automates the entire root cause analysis workflow. Built on Elastic Agent Builder, it reduced Mean Time To Resolution (MTTR) in my tests from 45 minutes to under 5 minutes, a 91% improvement.
91% Faster MTTR
Automated investigation in seconds, not hours
5-Step Methodology
Understand → Search → Analyze → Correlate → Synthesize
Learning System
Gets smarter from every investigation
Visual Insights
Sankey diagrams show error propagation
The Architecture
LogSleuth is designed around a simple but powerful idea: let AI do what humans repeatedly do, but faster.
When an SRE investigates an incident, they follow a mental workflow:
- Understand the alert
- Search for related logs
- Look for patterns (spikes, anomalies)
- Correlate across services
- Synthesize findings into a root cause
I encoded this workflow into an orchestration layer that calls specialized tools:
┌──────────────────────────────────────────────────────────────┐
│                     ORCHESTRATION LAYER                      │
│                                                              │
│                  Investigation Orchestrator                  │
│                                                              │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │UNDERSTAND│ → │  SEARCH  │ → │ ANALYZE  │ → │CORRELATE │   │
│  └──────────┘   └──────────┘   └──────────┘   └────┬─────┘   │
│                                                    │         │
│                                                    ▼         │
│                                             ┌──────────┐     │
│                                             │SYNTHESIZE│     │
│                                             └──────────┘     │
└──────────────────────────────────────────────────────────────┘
Why Elastic Agent Builder?
I chose Elastic Agent Builder because it provides capabilities I’d struggle to build from scratch:
1. Native ES|QL Integration
The tools execute ES|QL queries directly against Elasticsearch. No middleware, no translation layer—just fast, native queries.
FROM logs-logsleuth
| WHERE @timestamp >= NOW() - ?time_range
| WHERE log.level == "error"
| STATS error_count = COUNT(*) BY service.name, error.type
| SORT error_count DESC
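In the tools, a query like this goes straight to the cluster through the Python client's native ES|QL endpoint. Here is a minimal sketch, assuming the 8.x `elasticsearch` Python client; `build_error_stats_query` and `run_error_stats` are illustrative names, and whether a duration such as the time range can be bound as a named `?time_range` parameter depends on your Elasticsearch version:

```python
def build_error_stats_query(index: str = "logs-logsleuth") -> str:
    """The ES|QL query above, assembled as a single string for the client."""
    return (
        f"FROM {index} "
        "| WHERE @timestamp >= NOW() - ?time_range "
        '| WHERE log.level == "error" '
        "| STATS error_count = COUNT(*) BY service.name, error.type "
        "| SORT error_count DESC"
    )

def run_error_stats(client, time_range: str = "2 hours"):
    """client: an elasticsearch.Elasticsearch instance.

    esql.query executes the query natively; params binds the placeholder.
    """
    return client.esql.query(
        query=build_error_stats_query(),
        params=[{"time_range": time_range}],
    )
```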
2. Intelligent Tool Selection
The agent doesn’t blindly execute all tools. It reasons about which tools to call based on intermediate results.
Found no errors in the initial search? It broadens the query. Found a trace ID? It automatically correlates across services.
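That broaden-then-correlate behaviour can be sketched in a few lines. This is illustrative, not LogSleuth's actual control flow; `investigate_step` is a hypothetical helper and the two callables stand in for the agent's tools:

```python
def investigate_step(search_logs, find_correlated_logs, query: str):
    """Search for errors, broaden if nothing matches, follow any trace IDs."""
    results = search_logs(search_query=query, log_level="error")
    if not results:
        # Nothing at error level: broaden the search to all log levels
        results = search_logs(search_query=query, log_level=None)
    # Any trace ID found is automatically correlated across services
    trace_ids = sorted({r["trace_id"] for r in results if r.get("trace_id")})
    correlated = [find_correlated_logs(trace_id=t) for t in trace_ids]
    return results, correlated
```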
3. Built-in Context Management
Agent Builder handles conversation memory, token management, and result summarization—things that would take weeks to build properly.
4. Production-Ready Deployment
Deploy to Kibana with RBAC, audit logging, and rate limiting built in. No infrastructure to manage.
The Tools
LogSleuth has 4 custom tools, each designed for a specific investigation task:
search_logs(
    client,
    search_query="Connection refused",
    time_range="2h",
    service_name="payment-service",
    log_level="error"
)
Searches for logs matching criteria. Returns formatted results with trace IDs for correlation.
get_error_frequency(
    client,
    time_range="2h",
    interval="5m"
)
Analyzes error patterns over time. Detects spikes when error count exceeds 2x average.
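The 2x-average rule from the tool description reduces to a few lines. A hypothetical sketch (`detect_spikes` is an illustrative name; the real tool works on ES|QL histogram buckets):

```python
def detect_spikes(counts, threshold: float = 2.0):
    """Return indices of time buckets whose error count exceeds
    `threshold` times the average across all buckets."""
    if not counts:
        return []
    avg = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if avg > 0 and c > threshold * avg]
```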
find_correlated_logs(
    client,
    trace_id="abc123def456"
)
Traces a request across services. Identifies the root cause service (first to error).
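"First to error" is simply the error log with the earliest timestamp within the trace. A minimal sketch (`root_cause_service` is an illustrative name, operating on simplified log dicts):

```python
def root_cause_service(trace_logs):
    """Given all logs sharing one trace ID, return the service that
    errored first, i.e. the likely root cause."""
    errors = [log for log in trace_logs if log["level"] == "error"]
    if not errors:
        return None
    # ISO-8601 timestamps sort lexicographically, so min() finds the earliest
    return min(errors, key=lambda log: log["timestamp"])["service"]
```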
search_past_incidents(
    client,
    search_terms="connection pool exhaustion"
)
Searches the knowledge base for similar past incidents. Returns previous resolutions.
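Under the hood this is a full-text query over saved investigation documents. A hedged sketch of what that request body might look like (`build_incident_search` is an illustrative helper; the field names follow the saved-investigation schema shown later in this post):

```python
def build_incident_search(search_terms: str, size: int = 3) -> dict:
    """Elasticsearch query body for finding similar past incidents."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": search_terms,
                # Match on both the diagnosed cause and the suggested fix
                "fields": ["findings.root_cause", "remediation.suggestions"],
            }
        },
    }
```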
The Visualization
Raw data isn’t enough. I added visual insights to make findings immediately understandable.
Sankey Diagram
The Sankey diagram shows how requests flow between services—and where errors propagate:
api-gateway ──────────► checkout-service ──────────► payment-service
     ▲                         ▲                            │
     │                         │                     [ERROR ORIGIN]
     │                         │                            │
     └───[ERROR CASCADE]◄──────┴────────────────────────────┘
The links indicate error paths. You can see at a glance that payment-service was the origin, and the error cascaded back through checkout-service to api-gateway.
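If, like my Streamlit dashboard, you render this with Plotly (an assumption; `sankey_data` is an illustrative helper), the observed service-to-service hops just need to be aggregated into the parallel index lists that a `go.Sankey` trace expects:

```python
from collections import Counter

def sankey_data(edges):
    """edges: (source_service, target_service) pairs, one per observed hop.

    Returns node labels plus parallel source/target/value lists in the
    shape Plotly's go.Sankey trace consumes."""
    labels = sorted({s for s, _ in edges} | {t for _, t in edges})
    index = {name: i for i, name in enumerate(labels)}
    counts = Counter(edges)  # how many times each hop occurred
    sources = [index[s] for s, _ in counts]
    targets = [index[t] for _, t in counts]
    values = list(counts.values())
    return labels, sources, targets, values
```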
Progress Stepper
The 5-step investigation methodology is visualized in real-time:
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│    ✓     │ → │    ✓     │ → │    ✓     │ → │    ●     │ → │    ○     │
│UNDERSTAND│   │  SEARCH  │   │ ANALYZE  │   │CORRELATE │   │SYNTHESIZE│
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
Engineers can see exactly where the investigation is and what’s happening.
The Knowledge Base
Here’s where LogSleuth truly differentiates: it learns from every investigation.
When an investigation is saved, it becomes part of a searchable knowledge base:
{
  "investigation": {
    "id": "INV-20260120-A7B3C9",
    "status": "completed"
  },
  "findings": {
    "root_cause": "Database primary failover caused connection pool exhaustion",
    "root_cause_service": "payment-service",
    "affected_services": ["payment-service", "checkout-service", "api-gateway"]
  },
  "remediation": {
    "suggestions": "Enable circuit breaker, increase connection pool timeout",
    "resolution_applied": "Restarted pods, enabled circuit breaker"
  }
}
Next time a similar incident occurs, LogSleuth surfaces this past investigation automatically. No more solving the same problem twice.
Performance Results
I measured MTTR across simulated incidents:
| Scenario | Manual Investigation | With LogSleuth | Time Saved |
|---|---|---|---|
| Database failover | 52 minutes | 4 min 30 sec | 47.5 min |
| Payment processor outage | 45 minutes | 3 min 45 sec | 41.25 min |
| Timeout cascade | 38 minutes | 4 min 15 sec | 33.75 min |
| Average | 45 minutes | 4 min 10 sec | 40.8 min (91%) |
A 91% reduction in Mean Time To Resolution: from 45 minutes on average to just over 4.
Technical Implementation
The Orchestrator
The heart of LogSleuth is the InvestigationOrchestrator class:
from typing import Any, Dict

class InvestigationOrchestrator:
    async def investigate(
        self,
        incident_description: str,
        time_range: str = "2h",
        save_results: bool = False,
    ) -> Dict[str, Any]:
        """Run a complete incident investigation."""
        # Shared context carries findings forward from step to step
        context = InvestigationContext(
            incident_description=incident_description,
            time_range=time_range,
        )

        # Execute the five investigation steps in order
        await self._step_understand(context)
        await self._step_search(context)
        await self._step_analyze(context)
        await self._step_correlate(context)
        await self._step_synthesize(context)

        return self._build_final_report(context)
Each step builds on the previous one. If the search finds trace IDs, they’re passed to correlation. If analysis finds a spike, it’s highlighted in the final report.
Async & Streaming
I added async support for non-blocking execution:
async for update in investigation.investigate_stream("payment errors"):
    print(f"Step: {update['step']}, Status: {update['status']}")
    if update.get('data'):
        process_intermediate_results(update['data'])
This enables real-time progress updates in the dashboard.
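The streaming side can be backed by a plain async generator. A minimal sketch, not the production code: the step names come from the methodology, and the `sleep` stands in for the real awaited tool calls:

```python
import asyncio

async def investigate_stream(description: str):
    """Yield a progress update as each investigation step starts and finishes."""
    for step in ["understand", "search", "analyze", "correlate", "synthesize"]:
        yield {"step": step, "status": "running"}
        await asyncio.sleep(0)  # the real implementation awaits Elasticsearch here
        yield {"step": step, "status": "complete"}
```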
Query Caching
Repeated queries hit a cache instead of Elasticsearch:
from datetime import datetime, timedelta
from typing import Any, Dict, Optional

class QueryCache:
    def __init__(self, default_ttl_seconds: int = 60):
        self._ttl = default_ttl_seconds
        self._cache: Dict[str, Dict[str, Any]] = {}

    def _make_key(self, query_type: str, **kwargs) -> str:
        # Deterministic key from the query type and its parameters
        return f"{query_type}:{sorted(kwargs.items())}"

    def set(self, query_type: str, data: Any, **kwargs) -> None:
        expires = datetime.utcnow() + timedelta(seconds=self._ttl)
        self._cache[self._make_key(query_type, **kwargs)] = {"data": data, "expires": expires}

    def get(self, query_type: str, **kwargs) -> Optional[Any]:
        entry = self._cache.get(self._make_key(query_type, **kwargs))
        if entry and datetime.utcnow() <= entry["expires"]:
            return entry["data"]
        return None
This reduces load on Elasticsearch and speeds up repeated investigations.
Lessons Learned
1. Tool Design Matters
The granularity of your tools determines how intelligently the agent can reason. I started with 3 large tools and refactored to 4 focused ones. The agent’s reasoning improved dramatically.
2. Context is King
The orchestrator maintains context across steps. Without this, each tool call would be independent, and the agent couldn’t build on previous findings.
3. Visualizations Beat Text
Engineers understand a Sankey diagram faster than 20 lines of log output. Invest in visualization.
4. Learn from History
The knowledge base feature wasn’t in my original plan, but it became the key differentiator. Teams don’t just want to solve incidents—they want to prevent solving the same incident twice.
Try It Yourself
LogSleuth is open source:
- Clone the repository:
  git clone https://github.com/AbisoyeAlli/logsleuth
  cd logsleuth
- Set up the environment:
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
- Configure Elasticsearch:
  cp .env.example .env
  # Edit .env with your Elasticsearch credentials
- Load the demo data:
  python scripts/setup_elasticsearch.py
- Run the dashboard:
  streamlit run src/dashboard.py
What’s Next
I am exploring:
- Slack/PagerDuty integration for chat-based investigations
- Anomaly detection with Elastic ML for smarter spike detection
- Automated remediation for common issues
- Multi-cluster support for organizations with distributed infrastructure
Conclusion
Incident response doesn’t have to be painful. With Elastic Agent Builder, I built an AI agent that:
- Investigates incidents in seconds, not hours
- Shows exactly how errors propagate with visual diagrams
- Learns from every investigation to prevent repeat issues
- Integrates natively with Elasticsearch and Kibana
The result? 91% faster incident resolution. More sleep for on-call engineers. Less revenue lost to downtime.
That’s LogSleuth. Stop drowning in logs. Start solving incidents.