AegisFlow

Distributed Systems / LLM Infrastructure | 2026

https://aegisflow.dev

Overview

Deterministic reliability for non-deterministic AI systems.

My Role

Solo Architect & Engineer

Designed the microservice topology and all service contracts. Implemented all seven services, the shared aegis_core library (circuit breaker, Pydantic schemas, Prometheus metric definitions, structured logging), the full OpenTelemetry → Prometheus → Grafana → Tempo observability stack, the chaos-injection engine, and the three ADRs that document every infrastructure decision.

Stack

Python 3.12 · FastAPI · NATS JetStream · Postgres / pgvector · Redis · OpenTelemetry · Prometheus · Grafana · Tempo · Docker Compose · Kubernetes (Kustomize)

AegisFlow sits between your application and any LLM provider. Every model output passes through a 4-axis confidence scorer (structural validity + semantic grounding + validator critique + provider history) combined with a diminishing-returns anomaly penalty into a single [0, 1] score. Based on the score, the system accepts, repairs, retries, falls back, or rejects — without the calling application knowing any of it.

Timeline

Feb → May 2026 · ~3 months · solo

Full 14-container stack boots via make up (docker compose up -d --build), all services pass /healthz + /readyz, chaos scenarios verified end-to-end, Grafana dashboards rendering live data from the mock provider. Kubernetes manifests ship in infra/k8s/ with dev + prod overlays. Open-source under MIT at github.com/MustakimFS/aegisflow.

Highlights

AegisFlow — reliability engineering for models that lie.

Seven microservices, a four-axis confidence model, and a chaos engine that stress-tests fallback paths before production traffic ever does. Every execution is event-sourced: given the same trace ID and a frozen model snapshot, the system reproduces the run bit-for-bit.

7 microservices · 14 containers · 1 make up

4-axis confidence model + diminishing-returns anomaly penalty

6 chaos failure modes · injectable per-provider, on demand

7 microservices, 14 containers, 1 command.

Gateway, orchestrator, reliability, guardrails, memory, replay, and chaos — each with its own Dockerfile, /healthz, /readyz, and Prometheus /metrics. Plus Postgres (pgvector), Redis, NATS, OTEL collector, Prometheus, Tempo, Grafana. make up brings up everything.

4-axis confidence scoring.

structural_score (JSON parse) + grounding_score (token-Jaccard vs retrieved context) + critique_score (validator rubric) + historical_provider_score (rolling 5-min success rate), minus a diminishing-returns anomaly_penalty → 1 - exp(-0.5 * n_flags). Weights are workflow-configurable.

6 injectable failure modes.

Latency spikes, timeouts, synthetic 5xx, malformed JSON, hallucinations, refusals — injectable on any provider via the chaos service API without touching provider code. Disabled by default; opt-in per test run.

Event-sourced replay.

Every input, output, score, retry, and fallback decision is appended to the NATS JetStream event log. GET /v1/replay/{trace_id} walks the full trace for post-incident debugging.
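A minimal in-memory sketch of the append-then-walk idea behind the replay endpoint. The `Event` and `EventLog` names and the event kinds are hypothetical, chosen only for illustration; the actual service persists events elsewhere and exposes them over GET /v1/replay/{trace_id}.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    trace_id: str
    seq: int        # position of the event within its trace
    kind: str       # hypothetical kinds: "input", "score", "fallback", ...
    payload: dict


class EventLog:
    """Append-only log, indexed by trace ID (in-memory stand-in for a durable store)."""

    def __init__(self) -> None:
        self._by_trace = defaultdict(list)

    def append(self, trace_id: str, kind: str, payload: dict) -> Event:
        events = self._by_trace[trace_id]
        event = Event(trace_id, len(events), kind, dict(payload))
        events.append(event)  # events are only ever appended, never mutated
        return event

    def replay(self, trace_id: str) -> list:
        # Return a copy so callers cannot reorder or drop recorded events.
        return list(self._by_trace.get(trace_id, []))


log = EventLog()
trace = "703e552f7a2cc84580ff3eb9fc9dc35b"
log.append(trace, "input", {"workflow": "research_summarize"})
log.append(trace, "score", {"confidence": 0.379})
log.append(trace, "fallback", {"fallback_depth": 1})
for event in log.replay(trace):
    print(event.seq, event.kind, event.payload)
```

Keeping a per-trace sequence number is one simple way to make the walk order explicit rather than relying on arrival order.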

Context

LLMs are in production — reliability is still ad-hoc.

Traditional backend infrastructure assumes failures are categorical — a request either succeeded or it didn't. LLM systems break that assumption: a “successful” 200 response can still be wrong, malformed, or unsafe. SDKs had retry logic for rate limits but nothing for semantic failures. No open-source tool addressed the full surface: structural validation + grounding checks + provider failover + deterministic trace replay.

GitHub · langchain-ai/langchain #15808

“Output parsers silently swallow malformed JSON and return None.”

Top-voted open issue — default = fail closed, no repair, no signal

Reddit · r/LocalLLaMA

“How do you handle hallucinations in production? — Top-voted answer: 'prompt harder and add a retry loop.'”

The state of the art before reliability-as-infrastructure

a16z 'State of AI' · 2024

“Inconsistent or unexpected model outputs is the #1 production pain point for teams running LLMs at scale.”

Survey from a16z — the pain is industry-wide

OpenAI API · status + community · 2024–2025

“Unexpected content-type, truncated JSON mid-object, output structure changed across model versions.”

Forums full of teams discovering this only after it broke production

AegisFlow README — 'Why this exists'

“Service meshes, retries, circuit breakers, and schema validators all assume failures are categorical — a request either succeeded or it didn't. LLM systems break that assumption.”

The thesis the whole platform is built around

1.0 Demand signals. (Diagram)

The Problem

Five hard constraints, one architecture that survives them.

8 GB VRAM ceiling

Local inference tops out at 7–8B parameter models. The entire reliability pipeline had to add under 200 ms of overhead or it would dominate inference latency on the RTX 4070.

No shared in-process state

Services communicate only over HTTP + NATS. Every confidence score, guardrail check, and memory retrieval is a network round-trip. Each service has to fail open gracefully.

Local model JSON non-compliance

Qwen3 and DeepSeek R1 wrap outputs in ```json fences ~40% of the time and prepend prose prefixes another 10–15%. The guardrail repairer had to handle these before any structural repair.

No ground-truth labels

Confidence scoring had no labeled dataset to calibrate against. Grounding uses token Jaccard as a fast proxy (not embedding similarity) — a deliberate trade-off documented in the code.

Solo build constraint

Every service had to be completable and fully understandable by one person. Architecture complexity was a liability, not an asset. The shared aegis_core library kept the surface consistent.
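The token-Jaccard grounding proxy named under "No ground-truth labels" can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the tokenizer (lowercased alphanumeric runs) and the function names are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation ignored.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(output: str, context: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between output and context tokens."""
    out_tokens, ctx_tokens = tokenize(output), tokenize(context)
    if not out_tokens or not ctx_tokens:
        return 0.0  # nothing to compare against: treat as ungrounded
    return len(out_tokens & ctx_tokens) / len(out_tokens | ctx_tokens)


print(grounding_score("Paris is the capital of France",
                      "France's capital is Paris"))
```

The trade-off is visible in the code: the score rewards shared vocabulary, so a paraphrase with different words scores low even when it is faithful, which is the cost of skipping embedding similarity.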

Engineering principles (from the README)

Failure is the default.

Every cross-service call goes through a circuit breaker; every agent output is treated as untrusted until validated. If a downstream is unreachable, the orchestrator returns a neutral score and continues.

Determinism through replay.

Every execution is event-sourced. Given the same trace ID and a frozen model snapshot, the system reproduces the run bit-for-bit.

Observability is not optional.

A code path without a trace span, a metric, and a structured log doesn't get merged. Every /healthz, /readyz, /metrics endpoint was live from day one.
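The "neutral score and continue" behavior from the first principle can be sketched as a small fail-open wrapper. Everything here is an assumption for illustration: the `fail_open` helper, the neutral value of 0.5, and the use of built-in `ConnectionError`/`TimeoutError` in place of the HTTP client exceptions a real service would catch.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("fail_open_sketch")


def fail_open(call: Callable[[], T], neutral: T, what: str) -> T:
    """Run a downstream call; on a transport failure, log and return a neutral value."""
    try:
        return call()
    except (ConnectionError, TimeoutError) as exc:
        log.warning("%s unavailable, continuing with neutral value: %s", what, exc)
        return neutral


def reliability_down() -> float:
    raise ConnectionError("connection refused")


def memory_down() -> list:
    raise TimeoutError("retrieval timed out")


score = fail_open(reliability_down, neutral=0.5, what="reliability")  # 0.5 is an assumed neutral score
chunks = fail_open(memory_down, neutral=[], what="memory")
print(score, chunks)
```

Only transport-level errors are swallowed; any other exception still propagates, so programming bugs are not silently masked by the fail-open path.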

Process

Five iterations, each one earned by a failure.

The monolith that couldn't be chaos-tested.

V1 was a single FastAPI app with reliability scoring, guardrails, and memory in-process. The immediate problem: you can't inject chaos into part of a monolith without affecting the whole thing. I wanted to test the guardrail repairer with 100% malformed JSON, but doing so in-process corrupted the reliability scorer's provider history stats. The split into microservices came directly from that failure — each service now gets its own chaos surface. ADR-0001 documents the decision.

The confidence formula that couldn't catch hallucinations.

First formula weighted all four components equally at 0.25. A model returning perfect JSON but hallucinating scored 0.75 — well above the 0.30 minimum threshold. Fix was structural: downweight history (0.10), add a separate anomaly penalty subtracted after the weighted sum, and use diminishing returns 1 - exp(-0.5 * n) so a single anomaly doesn't kill the score but five anomalies can't be overcome by perfect JSON. Final weights: 0.30·structural + 0.30·grounding + 0.20·critique + 0.10·history − 0.30·anomaly.
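The arithmetic can be checked with a short sketch. The weights and penalty curve come from the text; the function names, the clamp to [0, 1], and the specific inputs for the "perfect JSON but hallucinating" case (grounding 0.0, one anomaly flag, everything else 1.0) are assumptions made for illustration.

```python
import math

# Final default weights quoted in the text.
WEIGHTS = {"structural": 0.30, "grounding": 0.30, "critique": 0.20,
           "history": 0.10, "anomaly": 0.30}


def anomaly_penalty(n_flags: int) -> float:
    # Diminishing returns: each extra flag adds less penalty than the last.
    return 1.0 - math.exp(-0.5 * n_flags)


def equal_weight_score(structural, grounding, critique, history) -> float:
    # The first formula: four components at 0.25 each, no penalty term.
    return 0.25 * (structural + grounding + critique + history)


def confidence(structural, grounding, critique, history, n_flags, w=WEIGHTS) -> float:
    raw = (w["structural"] * structural
           + w["grounding"] * grounding
           + w["critique"] * critique
           + w["history"] * history
           - w["anomaly"] * anomaly_penalty(n_flags))
    return max(0.0, min(1.0, raw))  # clamp is an assumption, to keep the score in [0, 1]


print(equal_weight_score(1.0, 0.0, 1.0, 1.0))   # the original failure case
print(confidence(1.0, 0.0, 1.0, 1.0, n_flags=1))
print(confidence(1.0, 1.0, 1.0, 1.0, n_flags=1))
print(confidence(1.0, 1.0, 1.0, 1.0, n_flags=5))
```

Under these assumed inputs the equal-weight formula gives 0.75, while the final formula gives about 0.48 for the same output. With every component perfect, one flag still leaves about 0.78, and five flags cap the score near 0.62, below the 0.75 accept threshold.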

NATS JetStream over Postgres LISTEN/NOTIFY.

Started with Postgres LISTEN/NOTIFY for the event bus: already in the stack, one fewer service. Hit the wall when the replay service needed fan-out to multiple consumers simultaneously. LISTEN/NOTIFY drops any notification sent while a consumer is disconnected and has no replay semantics. NATS JetStream solved both in one binary with no ZooKeeper: at-least-once delivery plus hierarchical subject wildcards (workflow.*.completed). The full rationale is in ADR-0002:

ADR-0002 · NATS JetStream over Kafka

1. Operational footprint. NATS runs as a single binary with no ZooKeeper / KRaft to manage. For a platform that targets self-hosting in customer K8s clusters, the lower op cost wins.

2. Latency. NATS pub-sub round-trip is sub-millisecond. Kafka's batching adds 5–50 ms even at low throughput.

3. Subjects vs. topics. NATS supports hierarchical wildcards (workflow.*.completed), which maps cleanly onto trace-driven fan-out.

4. Alternatives rejected: Apache Kafka (heavyweight, 5% of capability used), AWS SQS+SNS (vendor lock-in), Postgres LISTEN/NOTIFY (no fan-out, no disconnect survival).

Chaos that made testing impossible.

First chaos config had 30% failure probability across all providers. The happy path became unreachable — every test run hit at least one failure, making baselines impossible. Pulled back to conservative defaults: primary-blip 5%, json-corruption 10%, latency-spike 20% — all disabled by default and opt-in per test run via the chaos service API.

Discovering local-model JSON behavior empirically.

The assumption was that ```json fence wrapping was an edge case. In practice, ~40% of Qwen3 outputs and a similar fraction of DeepSeek R1 outputs arrived fenced. The prose prefix ("Here is the JSON:") was another surprise at 10–15%. Both are now first-order operations in the repair pipeline, not afterthoughts.

JSON error handling

Before

json.loads(raw) → JSONDecodeError → return None. Silent failure, no signal to caller.

After

6-step repair pipeline returning RepairResult(parsed=…, was_repaired=True, repairs=[…]). Caller knows exactly what was fixed.
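A reduced sketch of a syntax-only repairer in the shape described above. It covers four operations rather than the full six, which the text does not enumerate. `RepairResult` and three of the repair labels appear in this write-up; the `repair_json` name, the `stripped_markdown_fence` label, and the regexes are assumptions.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any

FENCE = "`" * 3  # a markdown code fence, built indirectly so this listing stays fence-safe
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")


@dataclass
class RepairResult:
    parsed: Any
    was_repaired: bool
    repairs: list = field(default_factory=list)


def repair_json(raw: str) -> RepairResult:
    repairs = []
    text = raw.strip()

    # 1. Unwrap a fenced block if the model emitted one.
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
        repairs.append("stripped_markdown_fence")  # hypothetical label

    # 2 + 3. Cut prose before the first bracket and text after the last one.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found")
    start = min(starts)
    if start > 0:
        repairs.append("stripped_prose_prefix")
    if end < len(text) - 1:
        repairs.append("stripped_trailing_text")
    text = text[start:end + 1]

    # 4. Drop trailing commas before a closing bracket (naive: ignores string contents).
    without_commas = TRAILING_COMMA_RE.sub(r"\1", text)
    if without_commas != text:
        text = without_commas
        repairs.append("removed_trailing_commas")

    # Syntax only: if it still does not parse, json.loads raises instead of guessing.
    return RepairResult(json.loads(text), bool(repairs), repairs)


result = repair_json('Here is the JSON: {"ok": true} Hope that helps!')
print(result)
```

The function never invents keys or values; every step removes characters, and a payload that still fails to parse surfaces as an exception to the caller.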

3.0 (Diagram)

Provider failure path

Before

raise UpstreamTimeout → unhandled exception → 500. No recovery, no observability.

After

Circuit breaker records the failure → reliability engine scores → routing tree evaluates → fallback chain advances (primary → secondary → rule-based) → Prometheus FALLBACKS counter increments → event-sourced to NATS.

3.1 (Diagram)

Architecture

The reliability loop — one request, seven services.

The data flow below is from ARCHITECTURE.md §3 “Happy path.” Nine steps, synchronous gRPC/HTTP on the request path, asynchronous NATS JetStream for fan-out, audit, replay, and chaos triggers.

aegisflow: ~/request-lifecycle

client@caller:/v1/workflows$ POST { "workflow": "research_summarize", "policies": {...} }

─── 9-step happy path (ARCHITECTURE.md §3) ─────────────

[1] gateway · JWT verify · rate-limit (Redis token bucket) · trace ID minted

[2] orchestrator · resolve workflow DAG · run record PENDING in Postgres

[3] memory · pgvector top-k retrieval · rerank · attach to context

[4] LLM invoke · circuit breaker · adaptive retry (full jitter) · timeout budget

[5] reliability · 4-axis score → ACCEPT / RETRY / FALLBACK / REJECT

[6] guardrails · JSON repair · schema validation · PII sanitization

[7] decision · ≥0.75 accept · ≥0.50 repair-retry · ≥0.30 fallback · else reject

[8] replay · event-source full trace to NATS JetStream

[9] response · trace_id + confidence + fallback_depth + retries

{ "run_id": "01KT5PJWTS8PYS3RSKBM29849P", "status": "succeeded",
  "confidence": 0.379, "fallback_depth": 1, "retries": 0,
  "trace_id": "703e552f7a2cc84580ff3eb9fc9dc35b" }

6.0 Full request lifecycle (ARCHITECTURE.md §3). (Diagram)
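Step [4] pairs full-jitter retries with a timeout budget. A minimal sketch of that combination follows, using the standard full-jitter rule (a uniform delay between 0 and the capped exponential backoff). The function names and the base, cap, and budget values are assumptions, not the project's settings.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 5.0, rng=random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def plan_retries(budget_s: float, max_attempts: int = 5,
                 base: float = 0.1, cap: float = 5.0, rng=random) -> list:
    """Return the backoff delays that fit inside the remaining timeout budget."""
    delays = []
    remaining = budget_s
    for attempt in range(max_attempts):
        delay = full_jitter_delay(attempt, base, cap, rng)
        if delay > remaining:
            break  # the next wait would overrun the budget: stop retrying
        delays.append(delay)
        remaining -= delay
    return delays


rng = random.Random(0)  # seeded only to make the demo repeatable
print(plan_retries(budget_s=1.0, rng=rng))
```

Drawing the whole delay at random, rather than adding a small jitter to a fixed backoff, spreads retries from many callers more evenly after a shared provider outage.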

System decomposition (ARCHITECTURE.md §2)

Service	Process model	Persistence	Scaling axis
gateway	stateless · async	Redis (rate-limit)	request rate
orchestrator	stateless · async	Postgres (runs) · NATS	concurrent workflows
reliability	stateless · CPU-bound	— (in-memory windows)	scoring throughput
guardrails	stateless · CPU-bound	—	validation throughput
memory	stateful read replicas	Postgres + pgvector · S3	retrieval QPS
replay	stateful append-only	Postgres (event store) · S3	event ingestion
chaos	stateless	Redis (active scenarios)	—

6.1 Service table: each service owns one reliability concern. (Diagram)

Confidence formula (ARCHITECTURE.md §4)

confidence =
   w1 · structural_score       # JSON parse: 1.0 / 0.5 / 0.0
 + w2 · grounding_score        # token-Jaccard vs retrieved ctx
 + w3 · critique_score         # validator rubric [0,1]
 + w4 · historical_provider    # rolling 5-min success rate
 - w5 · anomaly_penalty        # 1 - exp(-0.5 * n_flags)

# Defaults (workflow-configurable):
#   w1=0.30  w2=0.30  w3=0.20  w4=0.10  w5=0.30

decision:
  ≥ 0.75  → ACCEPT
  ≥ 0.50  → REPAIR_AND_RETRY
  ≥ 0.30  → FALLBACK_PROVIDER
  < 0.30  → REJECT
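The decision thresholds above translate directly into a small routing function. A sketch, with the `decide` name chosen here for illustration:

```python
def decide(confidence: float) -> str:
    """Map a confidence score to a routing action using the default thresholds."""
    if confidence >= 0.75:
        return "ACCEPT"
    if confidence >= 0.50:
        return "REPAIR_AND_RETRY"
    if confidence >= 0.30:
        return "FALLBACK_PROVIDER"
    return "REJECT"


for score in (0.90, 0.62, 0.379, 0.10):
    print(score, decide(score))
```

A score of 0.379, the value in the sample response, falls in the FALLBACK_PROVIDER band, which is consistent with the fallback_depth of 1 reported there.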

Circuit breaker (aegis_core)

CLOSED ─(failure_ratio > 0.5 in window)─► OPEN
OPEN ──(cooldown elapsed: 15s→30s→…→120s)──► HALF_OPEN
HALF_OPEN ─(probe success)─► CLOSED
HALF_OPEN ─(probe failure)─► OPEN (double cooldown)

# Failure = 5xx, timeout, connection error.
# Low-confidence outputs are NOT failures at this
# layer — they're handled by the reliability engine.

# Per-provider, per-model.
# Implemented in libs/aegis_core/circuit_breaker.py

6.2 Scoring + recovery primitives. (Diagram)
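The circuit-breaker transitions above can be sketched as a small state machine. This is not the code in libs/aegis_core/circuit_breaker.py: the count-based window, the `min_calls` guard, the method names, and the injectable clock are assumptions; only the 0.5 failure ratio and the 15s to 120s doubling cooldown come from the diagram.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, window=20, min_calls=5, threshold=0.5,
                 base_cooldown=15.0, max_cooldown=120.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # assumed: window of the last N call outcomes
        self.min_calls = min_calls
        self.threshold = threshold
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.clock = clock
        self.state = State.CLOSED
        self.cooldown = base_cooldown
        self.opened_at = 0.0

    def allow(self) -> bool:
        """True if a call may proceed; moves OPEN to HALF_OPEN once the cooldown elapses."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record(self, success: bool) -> None:
        """Record a call outcome (failure = 5xx, timeout, or connection error)."""
        if self.state is State.HALF_OPEN:
            if success:  # probe succeeded: close and reset the cooldown
                self.state = State.CLOSED
                self.cooldown = self.base_cooldown
                self.results.clear()
            else:        # probe failed: reopen with a doubled, capped cooldown
                self.cooldown = min(self.cooldown * 2, self.max_cooldown)
                self._open()
            return
        self.results.append(success)
        failures = self.results.count(False)
        if len(self.results) >= self.min_calls and failures / len(self.results) > self.threshold:
            self._open()

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()


breaker = CircuitBreaker()
for ok in (True, True, False, False, False):
    breaker.record(ok)
print(breaker.state, breaker.allow())
```

This sketch lets every call through while HALF_OPEN; a stricter version would admit a single probe at a time. One instance per provider and model pair would match the per-provider, per-model scoping noted above.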

Prometheus metric families (ARCHITECTURE.md §7)

aegisflow_workflow_duration_seconds{workflow,status}     # histogram
aegisflow_agent_invocations_total{agent,provider,outcome} # counter
aegisflow_reliability_confidence{workflow}                # histogram
aegisflow_circuit_state{provider,model}                   # gauge 0/1/2
aegisflow_retries_total{provider,reason}                  # counter
aegisflow_fallback_total{from_provider,to_provider}       # counter
aegisflow_tokens_total{provider,direction}                # counter
aegisflow_memory_recall_at_k{k}                           # histogram
aegisflow_chaos_injections_total{scenario}                # counter

6.3 Every metric the Grafana dashboard consumes. (Diagram)

pyservices/chaos/scenarios.py

from enum import StrEnum  # Python 3.11+

class FailureMode(StrEnum):
  LATENCY         = "latency"         # inject N ms latency spike
  TIMEOUT         = "timeout"         # drop the request on the floor
  PROVIDER_5XX    = "provider_5xx"    # synthetic upstream failure
  MALFORMED_JSON  = "malformed_json"  # wrap or truncate the output
  HALLUCINATION   = "hallucination"   # valid shape, fabricated content
  REFUSAL         = "refusal"         # 'I can't help with that' patterns

6.4 Six injectable failure modes. (Diagram)
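A sketch of how probability-gated, opt-in injection could work on top of these failure modes. The `Scenario` dataclass, `pick_injection`, and `corrupt_json` are hypothetical; the scenario names and probabilities are the conservative defaults quoted earlier, and the mapping of primary-blip to a synthetic 5xx is an assumption. The enum is abbreviated and uses `str, Enum` here only so the sketch runs on older interpreters.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(str, Enum):  # subset of the six modes, for brevity
    LATENCY = "latency"
    PROVIDER_5XX = "provider_5xx"
    MALFORMED_JSON = "malformed_json"


@dataclass
class Scenario:
    name: str
    mode: FailureMode
    probability: float
    enabled: bool = False  # disabled by default; a test run opts in explicitly


def pick_injection(scenarios, rng) -> Optional[FailureMode]:
    """Return the first enabled scenario whose dice roll fires, else None."""
    for scenario in scenarios:
        if scenario.enabled and rng.random() < scenario.probability:
            return scenario.mode
    return None


def corrupt_json(text: str) -> str:
    # One way to produce malformed JSON: truncate the payload mid-object.
    return text[: max(1, len(text) // 2)]


scenarios = [
    Scenario("primary-blip", FailureMode.PROVIDER_5XX, 0.05),
    Scenario("json-corruption", FailureMode.MALFORMED_JSON, 0.10),
    Scenario("latency-spike", FailureMode.LATENCY, 0.20),
]
rng = random.Random(42)  # seeded so a chaos run can be repeated
print(pick_injection(scenarios, rng))  # nothing is enabled yet

scenarios[1].enabled = True
hits = sum(pick_injection(scenarios, rng) is not None for _ in range(10_000))
print(hits / 10_000)
```

Because the injection decision sits outside the provider call, the provider code stays untouched, and a seeded generator keeps a failing chaos run reproducible.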

Final Designs

What shipped — and what shipping means here.

Everything below is captured from the running stack — no mockups. The boot transcript, the Grafana dashboard, a real workflow response, and the chaos-driven fallback + guardrails repair, in order.

7.0 make up → 14/14 aegisflow containers healthy in 41.8s · docker ps with real port mappings. (Image)

7.1 Grafana aegisflow-overview, auto-provisioned from aegisflow-overview.json: throughput, fallback counter, confidence histogram, circuit breaker gauge, token counters. (Image)

Every panel reads a real Prometheus metric: workflow throughput, fallback counter, reliability rejection rate, average + histogram of confidence, P50/P95 duration, circuit-breaker state gauge, hallucination flags/sec, and token counters. Avg confidence sits at 0.376 here because the demo runs against the mock provider with chaos enabled — the system is doing exactly what it should: scoring low and falling back.

7.2 make demo → curl POST /v1/workflows: a real workflow response with run_id, trace_id, confidence 0.379, fallback true. (Image)

Real fields, real values. confidence: 0.379 lands below the accept and retry thresholds, so the routing tree walks the fallback chain (fallback_depth: 1) and the orchestrator returns the deterministic rule-based fallback — “All primary providers exhausted; returning deterministic fallback.” The caller still gets a structured status: succeeded with a trace_id that keys straight into the replay event stream. No exception, no 500 — the failure is handled, not leaked.

7.3 Chaos json-corruption enabled, demo run → guardrails /v1/validate returns repaired=true with the exact repairs applied. (Image)

Structure over values. Enabling the json-corruption chaos scenario (probability 0.1, all providers) forces malformed output through the guardrails repair path. The /v1/validate endpoint returns repaired: true and lists every operation it applied: here stripped_prose_prefix, stripped_trailing_text, and removed_trailing_commas. The repairer only ever fixes syntax; if the structure still won't parse, it returns a hard error rather than fabricating a payload.

Launch · honest framing

This is a portfolio project, open-source under MIT. No external users, no production traffic. Verified working on developer hardware, all services pass /healthz + /readyz, Grafana renders live metrics from the mock provider, and the chaos scenarios above were captured end-to-end. Saying so directly is stronger than inflating.

Retrospective

What worked, what I'd change.

Worked

'Fail open' as a design rule, not a guideline.

Every except httpx.RequestError handler returns a neutral value and logs rather than raising. I never needed all 14 containers running to work on one service: reliability down = neutral scores, memory down = empty retrieval. Isolation was free.

The shared aegis_core library.

CircuitBreaker, Pydantic schemas, Prometheus metric definitions, and structured logging in one package. Every service had consistent instrumentation from the first line — Grafana dashboards had real data the first time they loaded.

ADRs before the code.

ADR-0002 forced me to articulate the NATS vs. Kafka vs. LISTEN/NOTIFY tradeoffs in writing before implementing. When LISTEN/NOTIFY hit the fan-out wall three weeks later, the decision was already documented and the switch took a day instead of a week.

Would change

Start with integration tests, not unit tests.

The interesting bugs were all at service boundaries: repairer changing the output shape the scorer expected, replay event payload not matching the diff endpoint's assumption. Unit tests per service missed all of these.

Implement the NATS bus wrapper first.

JetStream is wired in docker-compose but aegis_core.bus is still marked TODO (per ADR-0002 'Revisit when' section). The replay service currently receives events over HTTP rather than subscribing to the event stream.

Use a real embedding model from day one.

Memory service has pgvector infra and cosine similarity, but the dev stack uses mock embeddings, so the retrieved context is not semantically related to the query and grounding scores computed against it are meaningless in practice. Wiring all-MiniLM-L6-v2 via Ollama would make the confidence scores semantically real.

The biggest surprise

Local Ollama models wrap JSON in markdown fences far more often than any documentation suggests. The assumption was "occasional edge case — maybe 5%." Empirically it was closer to 40% from Qwen3 and DeepSeek R1. The prose prefix ("Here is the JSON:") hit another 10–15%. The guardrails service isn't a last-resort fallback — it's a required post-processor for local models.
