How to Detect and Handle Byzantine Failures in Distributed Systems
Your API returns 200 OK. The response body is valid JSON. The schema matches. And the data is completely wrong.
This is a Byzantine failure, named after the Byzantine Generals Problem, where participants in a distributed system can’t distinguish between correct and incorrect information from their peers.
In a previous article on resilient API design, I covered patterns for handling failures you can see: timeouts, 500s, connection resets. This article covers the harder problem: failures you can’t see until something downstream breaks. You will learn how Byzantine failures happen, five strategies for detecting them (checksums, domain invariants, read-after-write verification, cross-service consistency checks, and contract testing), patterns for handling suspected bad data, and prevention techniques including idempotency keys and version vectors.
The examples below assume a few common abstractions:
- a `priceHistory` store for looking up previous values,
- a `quarantineStore` for persisting flagged records,
- an `alerting` client for paging on-call teams,
- a `messageBus` for inter-service communication,
- an `idempotencyStore` for tracking multi-step operations,
- a `recordStep` helper for logging operation progress,
- a `deepEqual` function for structural comparison (for example, `fast-deep-equal`),
- and `metrics`/`logger` as before.
These are left unimplemented to keep focus on the detection and handling patterns themselves.
What Makes Byzantine Failures Dangerous
A 503 is honest. It tells you something is wrong. A Byzantine failure is dishonest — the system looks healthy while serving bad data.
| Failure Type | Signal | Detection | Impact |
|---|---|---|---|
| Crash | Connection refused, timeout | Immediate | Obvious, contained |
| Error | 500, 502, 503 | Immediate | Explicit, retriable |
| Byzantine | 200 OK, valid schema | Delayed or never | Silent, propagates |
The danger is in the third column. Byzantine failures propagate downstream before anyone notices. A pricing service returning stale prices doesn’t crash your checkout; it processes orders at the wrong price. A permissions service returning an outdated role mapping doesn’t throw; it grants access that should have been revoked. By the time you notice, the damage has compounded.
What to Protect and How
Not every data path needs the same level of defense. Here’s a map of the patterns covered in this article, matched to the data types where they matter most:
| Data Type | Risk of Wrong Data | Recommended Patterns |
|---|---|---|
| Financial (prices, payments) | Revenue loss, legal liability | Checksums, invariant checks, read-after-write, quarantine |
| Authorization (roles, permissions) | Security breach | Read-after-write, contract tests, version vectors |
| User-facing content (listings, descriptions) | Poor UX, support tickets | Invariant checks, staleness monitoring |
| Analytics/logging | Misleading dashboards | Cross-service consistency checks (periodic) |
| Internal metadata | Operational confusion | Contract tests between services |
If you’re starting from zero, the single highest-leverage thing you can do is add a domain invariant check to your most critical endpoint, the one where wrong data costs real money or creates a security hole. Everything else in this article builds on that foundation.
How Byzantine Failures Happen
These aren’t hypothetical scenarios. Every distributed system encounters them eventually.
Stale Caches
The most common source. A cache returns data that was correct an hour ago but has since changed at the source of truth.
// The dangerous version: cache with no staleness awareness
async function getProductPrice(productId: string): Promise<number> {
const cached = await cache.get(`price:${productId}`);
if (cached !== undefined) return cached; // Could be hours old
const price = await pricingService.getPrice(productId);
await cache.set(`price:${productId}`, price, TTL_1_HOUR);
return price;
}
If the price changes 5 minutes after caching, you serve the wrong price for 55 minutes. The cache doesn’t know it’s wrong. The caller doesn’t know it’s wrong. The checkout processes at the old price.
Split-Brain Reads
When a database has replicas, a write to the primary may not have propagated to the replica you’re reading from.
Client writes to primary: user.role = "admin"
Client reads from replica: user.role = "member" ← stale
The read succeeds. The response is valid. The data is wrong. This is especially dangerous for permission checks; the user’s role was updated seconds ago, but the replica hasn’t caught up.
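One mitigation is read-your-writes routing: pin reads to the primary for a short window after a write, long enough for replication to catch up. A minimal sketch, assuming hypothetical `primary` and `replica` clients sharing a `get`/`set` interface; the 2-second window is an assumed bound on replication lag:

```typescript
// Sketch: route reads to the primary briefly after each write, so this
// process never reads a replica that hasn't caught up with its own writes.
interface KVClient {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): Promise<void>;
}

class ReadYourWritesRouter {
  private lastWrite = new Map<string, number>();

  constructor(
    private primary: KVClient,
    private replica: KVClient,
    private lagWindowMs = 2_000 // assumed upper bound on replication lag
  ) {}

  async write(key: string, value: unknown): Promise<void> {
    await this.primary.set(key, value);
    this.lastWrite.set(key, Date.now());
  }

  async read(key: string): Promise<unknown> {
    const writtenAt = this.lastWrite.get(key);
    // Recently written by this process? The replica may be stale: read primary.
    if (writtenAt !== undefined && Date.now() - writtenAt < this.lagWindowMs) {
      return this.primary.get(key);
    }
    return this.replica.get(key);
  }
}
```

Note this only protects against writes made by the same process; read-your-writes across processes needs session tokens or a datastore that offers causal consistency natively.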
Partial Writes
A multi-step write operation completes some steps but not others. Each service’s data is internally consistent, but the aggregate state is wrong.
1. Order service: order.status = "confirmed" ✓
2. Inventory service: stock.count -= 1 ✓
3. Payment service: charge.status = "pending" ✗ (network error)
The order is confirmed. The stock is decremented. But the payment never went through. Each service returns 200 for its own data, but the combined state is inconsistent.
Silent Data Corruption
Bit rot in storage, serialization bugs, encoding mismatches, or schema drift between services. The data was correct when written but is no longer correct when read.
// Service A writes a timestamp as ISO 8601
await store.set("event:123", { timestamp: "2026-03-15T14:30:00Z" });
// Service B reads it and parses as a Unix timestamp
const event = await store.get("event:123");
const time = parseInt(event.timestamp); // NaN - silent corruption
No error is thrown. NaN propagates through calculations, producing garbage results that look like real numbers.
Clock Skew
Distributed systems rely on time for ordering events, expiring tokens, and enforcing TTLs. When clocks drift between machines, time-dependent logic breaks silently.
// Machine A (clock is 30 seconds behind): generates token
const token = { userId: "abc", expiresAt: Date.now() + 60_000 };
// Machine B (accurate clock): validates token
if (token.expiresAt < Date.now()) {
// Token appears to expire 30 seconds early on this machine
throw new UnauthorizedError("Token expired");
}
The token is valid. The logic is correct. The clocks disagree.
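A common defensive measure is to build a skew tolerance into time comparisons rather than trusting `Date.now()` to the millisecond. A sketch; the 30-second leeway is an assumed bound on drift, and note that tolerating skew on expiry checks means briefly accepting slightly-expired tokens, a deliberate trade-off:

```typescript
// Sketch: validate token expiry with a skew tolerance, so a clock that is a
// few seconds off doesn't reject valid tokens at the boundary.
const CLOCK_SKEW_TOLERANCE_MS = 30_000; // assumed upper bound on clock drift

interface Token {
  userId: string;
  expiresAt: number; // epoch millis, stamped by the issuing machine's clock
}

function isExpired(token: Token, now: number = Date.now()): boolean {
  // Only treat the token as expired once it is past expiry by more than the
  // tolerance; otherwise a fast local clock rejects valid tokens early.
  return now - token.expiresAt > CLOCK_SKEW_TOLERANCE_MS;
}
```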
Detection Strategies
You can’t prevent all Byzantine failures, but you can build systems that detect them quickly.
Strategy 1: Checksums and Content Hashing
Verify that data hasn’t been corrupted in transit or storage by computing a hash at the source and validating it at the consumer.
import { createHash } from "node:crypto";
interface VerifiableResponse<T> {
data: T;
checksum: string; // SHA-256 of canonicalized JSON
}
// Recursively sort object keys for deterministic serialization
function canonicalize(value: unknown): unknown {
if (value === null || typeof value !== "object") return value;
if (Array.isArray(value)) return value.map(canonicalize);
return Object.keys(value as Record<string, unknown>)
.sort()
.reduce<Record<string, unknown>>((acc, key) => {
acc[key] = canonicalize((value as Record<string, unknown>)[key]);
return acc;
}, {});
}
// Producer: attach a checksum to outgoing data
function withChecksum<T>(data: T): VerifiableResponse<T> {
const serialized = JSON.stringify(canonicalize(data));
const checksum = createHash("sha256").update(serialized).digest("hex");
return { data, checksum };
}
// Consumer: verify integrity before using the data
function verifyChecksum<T>(response: VerifiableResponse<T>): T {
const serialized = JSON.stringify(canonicalize(response.data));
const expected = createHash("sha256").update(serialized).digest("hex");
if (expected !== response.checksum) {
throw new DataIntegrityError(
`Checksum mismatch: expected ${response.checksum}, got ${expected}`
);
}
return response.data;
}
class DataIntegrityError extends Error {
constructor(message: string) {
super(message);
this.name = "DataIntegrityError";
}
}
Key detail: The canonicalize function recursively sorts object keys at every level before serializing. This is critical because JSON.stringify emits keys in property insertion order, which depends on how each object was constructed and can differ between producer and consumer, producing different hashes for semantically identical data. For production use, consider a well-tested library like json-stable-stringify instead of rolling your own.
Checksums catch corruption in transit and at rest, but they can’t tell you if the data was wrong at the source. For that, you need domain-level validation.
Strategy 2: Domain Invariant Checks
Verify that the data makes semantic sense, not just structural sense. Schema validation tells you “this is a number.” Invariant checks tell you “this number is plausible.”
interface PriceData {
productId: string;
price: number;
currency: string;
lastUpdated: string;
}
function validatePriceInvariant(data: PriceData): void {
// Price should be positive
if (data.price <= 0) {
throw new InvariantViolation(`Non-positive price: ${data.price}`);
}
// Price shouldn't change by more than 50% from the last known value
// In practice this would be async (await priceHistory.getLastKnown(...))
const lastKnown = priceHistory.getLastKnown(data.productId);
if (lastKnown !== undefined) {
const changePercent = Math.abs(data.price - lastKnown) / lastKnown;
if (changePercent > 0.5) {
logger.warn("Suspicious price change", {
productId: data.productId,
previousPrice: lastKnown,
newPrice: data.price,
changePercent: (changePercent * 100).toFixed(1) + "%",
});
// Don't necessarily throw (log, alert, and let a human decide)
metrics.increment("price.suspicious_change");
}
}
// Data shouldn't be older than 24 hours
const age = Date.now() - new Date(data.lastUpdated).getTime();
if (age > 86_400_000) {
throw new InvariantViolation(
`Stale price data: ${(age / 3_600_000).toFixed(1)} hours old`
);
}
}
class InvariantViolation extends Error {
constructor(message: string) {
super(message);
this.name = "InvariantViolation";
}
}
The 50% price change check is the critical one. A structurally valid response like { price: 0.01, currency: "USD" } passes every schema validator. But if the product was $49.99 yesterday, something is wrong. The system can’t tell you what’s wrong, but it can tell you something is wrong, which is enough to halt propagation and alert a human.
Calibrate thresholds to your domain. Commodity prices might fluctuate 5% daily. Concert tickets might legitimately double in price. The invariant must reflect what’s plausible for your data, not what’s possible in the schema.
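One way to keep those thresholds domain-specific without scattering magic numbers is a per-category configuration map. A sketch with illustrative categories and values, not recommendations:

```typescript
// Sketch: per-category plausibility thresholds, so invariant checks read
// their limits from configuration instead of hard-coded constants.
// Categories and numbers here are illustrative, not recommendations.
interface InvariantConfig {
  maxDailyChangePercent: number; // flag changes larger than this fraction
  maxAgeMs: number;              // flag data older than this
}

const INVARIANTS: Record<string, InvariantConfig> = {
  commodity:       { maxDailyChangePercent: 0.05, maxAgeMs: 3_600_000 },
  "event-tickets": { maxDailyChangePercent: 1.0,  maxAgeMs: 86_400_000 },
  default:         { maxDailyChangePercent: 0.5,  maxAgeMs: 86_400_000 },
};

function isPlausibleChange(
  category: string,
  previous: number,
  next: number
): boolean {
  const cfg = INVARIANTS[category] ?? INVARIANTS.default;
  return Math.abs(next - previous) / previous <= cfg.maxDailyChangePercent;
}
```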
Strategy 3: Read-After-Write Verification
After writing data, read it back from the same path consumers will use and verify it matches what you wrote.
async function verifiedWrite<T>(
key: string,
value: T,
writer: (key: string, value: T) => Promise<void>,
reader: (key: string) => Promise<T>,
options: { maxRetries?: number; delayMs?: number } = {}
): Promise<void> {
const { maxRetries = 3, delayMs = 100 } = options;
await writer(key, value);
// Read back through the same path consumers use
for (let attempt = 0; attempt < maxRetries; attempt++) {
await sleep(delayMs * Math.pow(2, attempt)); // Backoff for replication lag
const readBack = await reader(key);
if (deepEqual(value, readBack)) {
return; // Write verified, consumers will see the correct value
}
if (attempt < maxRetries - 1) {
logger.warn("Read-after-write mismatch, retrying", {
key,
expected: value,
actual: readBack,
attempt: attempt + 1,
});
}
}
throw new ConsistencyError(
`Write to "${key}" not visible after ${maxRetries} read attempts`
);
}
class ConsistencyError extends Error {
constructor(message: string) {
super(message);
this.name = "ConsistencyError";
}
}
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
This is especially important for systems with replication lag (databases with read replicas), caches that populate asynchronously, or CDNs with propagation delays. The write succeeds, but consumers reading from a replica might see stale data for seconds or minutes.
Trade-off: latency vs. correctness. Read-after-write verification adds at least one extra read (and potentially a delay) to every write. Use it selectively for permission changes, price updates, and other writes where serving stale data has real consequences. Don’t use it for every write; the latency cost will add up quickly on high-throughput paths.
Strategy 4: Cross-Service Consistency Checks
Periodically verify that related data across services agrees. This catches the partial write problem where each service is internally consistent but the aggregate state is wrong.
interface ConsistencyCheck {
name: string;
check: () => Promise<ConsistencyResult>;
schedule: string; // cron expression
}
interface ConsistencyResult {
consistent: boolean;
discrepancies: Array<{
entity: string;
serviceA: { name: string; value: unknown };
serviceB: { name: string; value: unknown };
}>;
}
// Example: verify orders and payments agree
async function checkOrderPaymentConsistency(): Promise<ConsistencyResult> {
const recentOrders = await orderService.getConfirmedOrders({
since: Date.now() - 3_600_000, // Last hour
});
const discrepancies = [];
// For large volumes, consider batching or parallelising payment lookups
for (const order of recentOrders) {
const payment = await paymentService.getPayment(order.paymentId);
if (!payment) {
discrepancies.push({
entity: `order:${order.id}`,
serviceA: { name: "orders", value: "confirmed" },
serviceB: { name: "payments", value: "missing" },
});
continue;
}
if (payment.amount !== order.total) {
discrepancies.push({
entity: `order:${order.id}`,
serviceA: { name: "orders", value: order.total },
serviceB: { name: "payments", value: payment.amount },
});
}
}
return {
consistent: discrepancies.length === 0,
discrepancies,
};
}
Run these as scheduled jobs, not inline with request handling. They’re audits, not validations. When a discrepancy surfaces, the check doesn’t fix it automatically (that’s a separate, domain-specific process). It surfaces the problem and alerts the right people.
Start with your most critical data relationships. Orders and payments. Users and permissions. Inventory and listings. You don’t need to check everything, just the joins where inconsistency has real consequences.
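A minimal runner for these checks might look like the following. This is a simplified sketch: the cron `schedule` field is elided (scheduling belongs to your job runner), and `report` stands in for your alerting integration:

```typescript
// Sketch: run a registered consistency check and report any discrepancies.
// Cron scheduling is left to your job runner; `report` is a placeholder
// for whatever alerting client you use.
interface Discrepancy {
  entity: string;
  serviceA: { name: string; value: unknown };
  serviceB: { name: string; value: unknown };
}

interface ConsistencyResult {
  consistent: boolean;
  discrepancies: Discrepancy[];
}

interface ConsistencyCheck {
  name: string;
  check: () => Promise<ConsistencyResult>;
}

async function runCheck(
  check: ConsistencyCheck,
  report: (name: string, discrepancies: Discrepancy[]) => void
): Promise<boolean> {
  const result = await check.check();
  if (!result.consistent) {
    // Surface the problem; fixing it is a separate, domain-specific process.
    report(check.name, result.discrepancies);
  }
  return result.consistent;
}
```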
Strategy 5: Contract Testing Between Services
Schema validation catches structural changes (a field was renamed). Contract tests catch semantic changes (a field now means something different, or an edge case that used to return null now returns an empty string).
// Contract: what the consumer expects from the pricing API
describe("pricing service contract", () => {
it("returns a positive price for valid products", async () => {
const response = await pricingApi.getPrice("known-product-id");
expect(response.status).toBe(200);
expect(response.data.price).toBeGreaterThan(0);
expect(response.data.currency).toMatch(/^[A-Z]{3}$/);
expect(new Date(response.data.lastUpdated).getTime()).not.toBeNaN();
});
it("returns 404 for unknown products, not a zero price", async () => {
const response = await pricingApi.getPrice("nonexistent-id");
// The contract says: unknown product = 404
// A Byzantine failure: unknown product = 200 with price 0
expect(response.status).toBe(404);
});
it("prices are consistent with bulk endpoint", async () => {
const single = await pricingApi.getPrice("product-a");
const bulk = await pricingApi.getBulkPrices(["product-a"]);
// Two endpoints that should return the same data
expect(bulk.data["product-a"].price).toBe(single.data.price);
});
});
The second test is the important one. A service that returns { price: 0 } for unknown products instead of a 404 passes every structural check. The consumer’s code path for “product not found” never fires, and a $0 line item propagates into invoices.
Run contract tests against a real environment, not mocks. The point is to catch what has changed in the actual service, not to verify your assumptions about it. Most teams run these against a staging environment on a schedule, or as a post-deploy verification step.
Handling Byzantine Failures
Detection is half the problem. Once you suspect bad data, what do you do with it?
Fail Loudly, Not Silently
The worst response to suspected bad data is to swallow it and continue. If an invariant check fails, make noise.
const verifiedPriceCache = new VerifiedCache<PriceData>(cache, validatePriceInvariant);
async function getPrice(productId: string): Promise<number> {
const response = await pricingService.getPrice(productId);
try {
validatePriceInvariant(response);
} catch (error) {
if (error instanceof InvariantViolation) {
// Don't silently return the suspicious value
metrics.increment("byzantine.invariant_violation", {
service: "pricing",
check: error.message,
});
// Escalate! This is not a normal error
logger.error("Byzantine failure detected", {
service: "pricing",
productId,
response,
violation: error.message,
});
// Fall back to last-known-good value if available (see VerifiedCache below)
const lastKnown = await verifiedPriceCache.getVerified(`price:${productId}`);
if (lastKnown !== undefined) {
logger.warn("Using last-known-good price", {
productId,
lastKnown: lastKnown.price,
suspicious: response.price,
});
return lastKnown.price;
}
// No fallback? Fail rather than serve suspicious data
throw new Error(
`Cannot serve price for ${productId}: invariant violation and no fallback`
);
}
throw error;
}
return response.price;
}
Fail Open vs. Fail Closed
This is the most important architectural decision in Byzantine failure handling: when you suspect bad data, do you serve it or reject it?
Fail open means serving the data anyway, possibly with a warning or degraded confidence marker. This is acceptable when wrong data is annoying but not dangerous: showing a slightly stale product description, returning cached search results, or displaying an outdated notification count. The user experience of “something” is better than “nothing.”
Fail closed means rejecting the data and returning an error or a known-safe fallback. This is required when wrong data causes real damage: charging the wrong price, granting unauthorized access, processing a duplicate payment, or reporting incorrect financial figures. An error is recoverable; a wrong transaction often isn’t.
As a rule of thumb: if wrong data triggers an action that’s hard to reverse (a charge, a permission grant, a stock trade), fail closed. If wrong data is purely informational, fail open. When in doubt, fail closed: you can always relax later, but you can’t un-charge a customer.
The getPrice example above demonstrates fail closed: it tries a last-known-good fallback, and if none exists, it throws rather than serving suspicious data.
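The decision can also be made explicit at the call site with a small policy wrapper. A sketch; the type and function names are illustrative:

```typescript
// Sketch: wrap a validated fetch with an explicit fail-open / fail-closed
// policy, so the decision is visible at the call site instead of buried
// in error handling.
type FailurePolicy<T> =
  | { mode: "open"; fallback: T } // serve something, even if degraded
  | { mode: "closed" };           // refuse rather than risk bad data

async function fetchWithPolicy<T>(
  fetcher: () => Promise<T>,
  validate: (value: T) => void, // throws on suspected bad data
  policy: FailurePolicy<T>
): Promise<T> {
  const value = await fetcher();
  try {
    validate(value);
    return value;
  } catch (error) {
    if (policy.mode === "open") return policy.fallback; // degrade gracefully
    throw error; // fail closed: propagate, don't serve suspicious data
  }
}
```

A product-description path might pass `{ mode: "open", fallback: cachedDescription }`, while the pricing path passes `{ mode: "closed" }`.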
Last-Known-Good Values
When current data is suspect, fall back to the last value that passed all invariant checks.
class VerifiedCache<T> {
constructor(
private cache: CacheClient,
private validate: (value: T) => void
) {}
async getVerified(key: string): Promise<T | undefined> {
const current = await this.cache.get(key);
if (current === undefined) return undefined;
try {
this.validate(current);
return current;
} catch {
// Current cache value fails validation, try last-known-good
const lastGood = await this.cache.get(`${key}:last_known_good`);
if (lastGood !== undefined) {
logger.warn(`Serving last-known-good for ${key}`);
return lastGood;
}
return undefined;
}
}
async setVerified(key: string, value: T, ttlMs: number): Promise<void> {
try {
this.validate(value);
// Value passes validation, update both current and last-known-good
await this.cache.set(key, value, ttlMs);
await this.cache.set(`${key}:last_known_good`, value, ttlMs * 10);
} catch {
// Value fails validation, don't cache it at all.
// The next getVerified call will find no current value and
// fall through to last-known-good automatically.
await this.cache.delete(key);
logger.warn(`Rejected invalid value for ${key}, last-known-good preserved`);
}
}
}
Notice the asymmetric TTL: last-known-good lives 10x longer than the current value. You want the fallback to survive even when the current data is cycling through bad values.
Quarantine and Manual Review
For high-stakes data, sometimes the right answer is “stop and ask a human.”
interface QuarantinedRecord {
entity: string;
suspectedIssue: string;
detectedAt: Date;
sourceData: unknown;
status: "pending_review" | "confirmed_valid" | "confirmed_invalid";
}
async function quarantine(
entity: string,
issue: string,
data: unknown
): Promise<void> {
await quarantineStore.insert({
entity,
suspectedIssue: issue,
detectedAt: new Date(),
sourceData: data,
status: "pending_review",
});
// Alert the responsible team (illustrative: replace with your alerting system)
await alerting.page({
severity: "high",
team: "data-integrity",
message: `Quarantined ${entity}: ${issue}`,
runbook: "https://wiki.internal/runbooks/byzantine-failure-triage", // your internal runbook URL
});
}
// Usage
// lastKnownPrice: fetched earlier via verifiedPriceCache.getVerified(...)
const price = await pricingService.getPrice(productId);
if (price.amount < lastKnownPrice * 0.5) {
await quarantine(
`product:${productId}`,
`Price dropped ${((1 - price.amount / lastKnownPrice) * 100).toFixed(0)}% — possible Byzantine failure`,
price
);
// Serve last-known-good while the quarantined value is reviewed
return lastKnownPrice;
}
Quarantine works best when the cost of wrong data is high and the volume of suspicious events is low. If you’re quarantining 10,000 records a day, you don’t have a quarantine process, you have a broken upstream.
Prevention Patterns
Detection and handling are reactive. These patterns reduce the likelihood of Byzantine failures occurring in the first place.
Idempotency Keys for State Changes
Prevent the partial write problem by making every state change idempotent and trackable.
async function processOrderConfirmation(
orderId: string,
idempotencyKey: string
): Promise<void> {
// Check if this exact operation has already been processed
const existing = await idempotencyStore.get(idempotencyKey);
if (existing?.status === "completed") {
return; // Already done, safe to return without re-executing
}
// Record that the operation is starting
await idempotencyStore.set(idempotencyKey, {
status: "in_progress",
startedAt: new Date(),
steps: [],
});
try {
await orderService.confirm(orderId);
await recordStep(idempotencyKey, "order_confirmed");
await inventoryService.decrementStock(orderId);
await recordStep(idempotencyKey, "stock_decremented");
await paymentService.capturePayment(orderId);
await recordStep(idempotencyKey, "payment_captured");
await idempotencyStore.set(idempotencyKey, { status: "completed" });
} catch (error) {
// Record which step failed, enables targeted recovery
await idempotencyStore.set(idempotencyKey, {
status: "failed",
failedAt: new Date(),
error: (error as Error).message,
});
throw error;
}
}
When a failure occurs mid-sequence, the idempotency record tells you exactly which steps completed. Recovery can resume from the point of failure rather than re-executing the entire sequence (or worse, guessing).
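Recovery itself can be sketched as replaying the step list while skipping anything the record shows as done. This assumes each step is individually idempotent, the same property the pattern above relies on:

```typescript
// Sketch: replay a multi-step operation, skipping steps the idempotency
// record says already completed. Step names correspond to the recordStep
// calls in the original sequence.
async function resumeOperation(
  completedSteps: string[], // loaded from the idempotency record
  steps: Array<{ name: string; run: () => Promise<void> }>
): Promise<string[]> {
  const done = new Set(completedSteps);
  for (const step of steps) {
    if (done.has(step.name)) continue; // already applied, safe to skip
    await step.run();
    done.add(step.name);
  }
  return [...done]; // persist this back as the new completed-step list
}
```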
Version Vectors for Cache Consistency
Attach a version to every piece of cached data. Consumers can detect when they’re reading stale data and decide how to handle it.
interface VersionedData<T> {
data: T;
version: number;
updatedAt: string;
source: string; // Which service produced this
}
async function getCachedWithVersion<T>(
key: string,
fetcher: () => Promise<VersionedData<T>>,
ttlMs: number
): Promise<VersionedData<T>> {
const cached = await cache.get(key);
// Cache hit, serve from cache without hitting upstream
if (cached) return cached;
// Cache miss, fetch from source
const fresh = await fetcher();
await cache.set(key, fresh, ttlMs);
return fresh;
}
// Run on cache refresh (for example, background revalidation or TTL expiry)
// to detect version regressions before consumers see them
async function refreshWithVersionCheck<T>(
key: string,
fetcher: () => Promise<VersionedData<T>>,
ttlMs: number
): Promise<VersionedData<T>> {
const cached = await cache.get(key);
const fresh = await fetcher();
if (cached && cached.version > fresh.version) {
// Source returned an older version than what's cached, regression
logger.error("Version regression detected", {
key,
cachedVersion: cached.version,
freshVersion: fresh.version,
source: fresh.source,
});
metrics.increment("byzantine.version_regression");
// Keep the cached (newer) version and alert
return cached;
}
await cache.set(key, fresh, ttlMs);
return fresh;
}
getCachedWithVersion handles the normal read path: serve from cache when available, fetch on miss. refreshWithVersionCheck runs on background revalidation or TTL expiry, and that’s where version regression detection lives. If the source returns an older version than what’s cached, it means the source rolled back (intentionally or not) or you’re hitting a stale replica. Both are worth investigating immediately.
Causal Ordering with Logical Clocks
When physical clocks can’t be trusted (and across distributed services, they can’t), use logical clocks to establish ordering guarantees.
class LamportClock {
private counter = 0;
// Increment on local events
tick(): number {
return ++this.counter;
}
// Update when receiving a message from another service
receive(remoteTimestamp: number): number {
this.counter = Math.max(this.counter, remoteTimestamp) + 1;
return this.counter;
}
current(): number {
return this.counter;
}
}
// Usage: attach logical timestamps to inter-service messages
const clock = new LamportClock();
async function sendEvent(target: string, payload: unknown): Promise<void> {
const timestamp = clock.tick();
await messageBus.publish(target, {
payload,
logicalTimestamp: timestamp,
sourceService: SERVICE_NAME,
});
}
async function handleEvent(event: IncomingEvent): Promise<void> {
clock.receive(event.logicalTimestamp);
// Process with causal ordering guarantee (see caveats below)
}
Lamport clocks don’t give you wall-clock time, but they guarantee causal ordering: if event A happened before event B, A’s timestamp is strictly less than B’s. This is often all you need to detect out-of-order updates without relying on synchronized clocks.
Important caveats. This guarantee only holds if:
- Every participating service uses a Lamport clock and calls `tick()` before sending messages
- Each service instance has exactly one clock (not per-request; the clock must be shared across the process)
- The message bus preserves ordering within a channel
If you’re running multiple instances of a service with no shared state between them, a single Lamport clock per instance won’t give you global ordering. For that case, you need vector clocks, which track causal relationships across multiple concurrent participants, at the cost of more complexity and larger message headers.
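For orientation, a minimal vector clock looks like this. A sketch: each participant keeps one counter per node, and two events are concurrent when neither causally precedes the other:

```typescript
// Sketch: a vector clock tracks one counter per participant, so two events
// can be compared for causal order, or detected as concurrent.
type Vector = Record<string, number>;

class VectorClock {
  private vector: Vector = {};

  constructor(private nodeId: string) {}

  // Increment our own entry on local events; return a snapshot to attach.
  tick(): Vector {
    this.vector[this.nodeId] = (this.vector[this.nodeId] ?? 0) + 1;
    return { ...this.vector };
  }

  // Merge on receive: component-wise max, then count the receive itself.
  receive(remote: Vector): Vector {
    for (const [node, count] of Object.entries(remote)) {
      this.vector[node] = Math.max(this.vector[node] ?? 0, count);
    }
    return this.tick();
  }
}

// a happened-before b iff every component of a <= b and at least one is <
function happenedBefore(a: Vector, b: Vector): boolean {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlyLess = false;
  for (const node of nodes) {
    const av = a[node] ?? 0;
    const bv = b[node] ?? 0;
    if (av > bv) return false;
    if (av < bv) strictlyLess = true;
  }
  return strictlyLess;
}

function concurrent(a: Vector, b: Vector): boolean {
  return !happenedBefore(a, b) && !happenedBefore(b, a);
}
```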
Observability for Byzantine Failures
Standard monitoring (error rates, latency) misses Byzantine failures entirely: the requests succeed and the latency is normal. You need different signals.
What to Monitor
// Track all invariant validations, not just failures
function monitoredValidation<T>(
name: string,
data: T,
validate: (data: T) => void
): T {
try {
validate(data);
metrics.increment("invariant.checked", { name, result: "pass" });
return data;
} catch (error) {
metrics.increment("invariant.checked", { name, result: "fail" });
throw error;
}
}
Key signals that indicate Byzantine failures:
- Invariant violation rate: Any non-zero rate deserves investigation
- Version regressions: The source returned an older version than cached
- Cross-service consistency check failures: Scheduled audits finding discrepancies
- Cache hit rate spikes: If cache hit rate suddenly jumps, it might mean the source is failing and you’re serving increasingly stale cache
- Data age distribution: Track how old your served data is; a shift toward older data suggests the refresh path is broken
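The data-age signal is cheap to emit wherever you serve cached data. A sketch of the pure parts; wiring the bucket label into a counter or histogram depends on your metrics client:

```typescript
// Sketch: compute the age of served data so a freshness dashboard can spot
// a broken refresh path. Emit the bucket label as a metric tag.
function dataAgeMs(lastUpdatedIso: string, now: number = Date.now()): number {
  return now - new Date(lastUpdatedIso).getTime();
}

// Coarse buckets make a freshness heatmap buildable from counters alone.
function ageBucket(ageMs: number): string {
  if (ageMs < 60_000) return "<1m";
  if (ageMs < 3_600_000) return "<1h";
  if (ageMs < 86_400_000) return "<1d";
  return ">=1d";
}
```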
Dashboards That Surface Byzantine Issues
Standard dashboards show request volume and error rates. For Byzantine detection, add:
- Data freshness heatmap: Visualize the age of served data across endpoints. A cluster of stale responses in one region or for one product category reveals partial failures.
- Invariant violation timeline: Correlate with deploys, config changes, and infrastructure events. Most Byzantine failures have a clear “this started when…” moment.
- Cross-service agreement rate: What percentage of consistency checks pass? Trending this over time catches gradual drift before it becomes an incident.
Where to Start
If you’re looking at the patterns above and wondering where to begin, here’s a prioritized sequence:
1. Add one invariant check to your most critical endpoint. Pick the endpoint where wrong data costs real money or creates a security hole. Add a domain-level plausibility check (Strategy 2). This alone will catch most Byzantine failures before they propagate.
2. Implement last-known-good fallbacks for that same endpoint. Once you can detect bad data, you need somewhere to fall back. The `VerifiedCache` pattern gives you a safety net.
3. Add a scheduled cross-service consistency check for your most important data relationship (for example, orders and payments). This catches the partial write problem that no single service can detect.
4. Layer in checksums and contract tests as you expand coverage to more endpoints.
Refer back to the table in “What to Protect and How” to match patterns to your data types. Add detection incrementally; a single invariant check on your pricing endpoint catches more real-world issues than a comprehensive framework you never ship.
Conclusion
In this article, you learned how to detect, handle, and prevent Byzantine failures in distributed systems: the class of bugs where your API returns 200 OK but the data is wrong.
- Byzantine failures return 200 OK; you can’t catch them with error handling alone
- Checksums catch corruption, invariants catch implausibility; use both for defense in depth
- Read-after-write verification catches replication lag before consumers do
- Cross-service consistency checks surface the partial write problem that no single service can detect on its own
- Fail closed for high-stakes data; serving no price is better than serving the wrong price
- Last-known-good fallbacks need asymmetric TTLs so the fallback outlives the bad data
- Version regressions are strong signals; if the source returns older data than your cache, something is wrong
- Monitor data freshness, not just errors; Byzantine failures don’t show up in error rates
The uncomfortable truth about distributed systems is that correctness is harder than availability. Making a system stay up is well-understood (retries, circuit breakers, failover). Making a system stay right requires you to define what “right” means for your data, encode those definitions as executable checks, and run them continuously. The patterns in this article won’t prevent every Byzantine failure, but they’ll ensure you find out in minutes rather than months.