METY Legal Chatbot

AI / Full Stack | 2026

https://mety.legal

Overview

What it is, and why I built it.

My Role

AI Pipeline Architect

Sole engineer on the QnA backend, FSPR knowledge-profiling system, rolling summarization, document-generation feature, and production architecture across 7 sprints on a 6-person team. Rebuilt a 6-node LangGraph pipeline into a privacy-first 5-node topology, cut per-query LLM cost 82% ($0.024 → $0.0044), and added implicit Most Critical Gap targeting on every response with zero user-facing latency.

Stack

Django · FastAPI · LangGraph · Qdrant · MongoDB · GPT-4o / 4o-mini · spaCy · React

Two-layer architecture: Django owns state and data, FastAPI is stateless and has zero database credentials. The 5-node LangGraph pipeline runs RAG → Extraction → Anonymization → Reasoning → Formatter; FSPR profiling and rolling summarization fire as background daemon threads after every response so the user never waits for them.

Timeline

Jan → May 2026 · 7 sprints

Sponsor engagement with MyEdMaster — they had already beaten ChatGPT on a benchmark with their health chatbot and wanted the same result for a legal product. Shipped end of May 2026, handed to a continuation team, demoed at the ASU SER517 Innovation Showcase. No public deployment — IP belongs to MyEdMaster under signed NDA.

Highlights

METY in three modes, one knowledge profile.

Rebuilt a 6-node LangGraph pipeline into a privacy-first, cost-optimized 5-node architecture that serves personalized legal guidance across three user modes, cutting per-query LLM cost by 82% while adding implicit FSPR knowledge profiling on every message with zero user-facing latency impact.

82%

LLM cost reduction

$0.024 → $0.0044 / message

5

Pipeline nodes

down from 6, linear

4 × 20

FSPR × topics

profiled implicitly

82% LLM cost reduction.

Removed the clarification node (one full GPT-4o call that fired unconditionally on every message) and switched FSPR inference and summarization from GPT-4o to GPT-4o-mini (17× cheaper). Old: ~$0.024/msg (3 GPT-4o calls). New: ~$0.0044/msg (1 GPT-4o + 2 GPT-4o-mini background).
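
The headline percentage follows from the two per-message figures quoted above. A quick check, using only those two numbers and nothing about the underlying token counts:

```python
# Per-message cost figures quoted in the text (USD).
OLD_COST_PER_MSG = 0.024   # 3 GPT-4o calls
NEW_COST_PER_MSG = 0.0044  # 1 GPT-4o + 2 GPT-4o-mini background calls


def cost_reduction(old: float, new: float) -> float:
    """Return the fractional reduction when moving from old to new."""
    return (old - new) / old


reduction = cost_reduction(OLD_COST_PER_MSG, NEW_COST_PER_MSG)
print(f"{reduction:.1%}")  # 81.7%, which rounds to the 82% headline
```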

5-node linear pipeline.

RAG → Extraction → Anonymization → Reasoning → Formatter. Each node reads from a typed LegalChatState dict and writes back partial updates that LangGraph merges automatically.
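
A minimal sketch of the read-state, write-partial-update pattern. This is a plain-Python stand-in, not LangGraph itself: the `run_linear` helper mimics the merge that LangGraph performs, the field names in `LegalChatState` are illustrative assumptions rather than the real schema, and only three of the five nodes are shown for brevity.

```python
from typing import Callable, List, TypedDict


class LegalChatState(TypedDict, total=False):
    # Field names are assumptions for illustration, not the project's schema.
    query: str
    kb_chunks: List[str]
    anonymized_query: str
    answer: str


Node = Callable[[LegalChatState], LegalChatState]


def rag(state: LegalChatState) -> LegalChatState:
    # Each node returns only the keys it produces.
    return {"kb_chunks": [f"chunk about: {state['query']}"]}


def anonymize(state: LegalChatState) -> LegalChatState:
    return {"anonymized_query": state["query"].replace("Alice", "[PERSON]")}


def reason(state: LegalChatState) -> LegalChatState:
    return {"answer": f"Based on {len(state['kb_chunks'])} source(s): ..."}


def run_linear(nodes: List[Node], state: LegalChatState) -> LegalChatState:
    """Run nodes in order, merging each partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


final = run_linear([rag, anonymize, reason], {"query": "Alice asks about a lease"})
print(final["anonymized_query"])  # [PERSON] asks about a lease
```

Because every node returns only what it changed, a node can be removed (as the clarification node was) without touching the others.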

4-dimension FSPR profiling across 20 legal topics.

Facts · Strategies · Procedures · Rationales. Built implicitly via async background inference on every message — daemon thread fires after the HTTP response is returned so the user never waits for profiling.
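
The fire-and-forget pattern can be sketched with the standard library alone. The function names and the in-memory `profile_updates` list are hypothetical stand-ins; in this sketch the thread is started just before the handler returns, so the answer is never delayed by the profiling work.

```python
import threading
import time
from typing import List, Tuple

profile_updates: List[str] = []  # stand-in for the persisted FSPR profile


def infer_fspr(message: str) -> None:
    """Slow background work; the sleep stands in for an LLM call."""
    time.sleep(0.05)
    profile_updates.append(message)


def handle_message(message: str) -> Tuple[str, threading.Thread]:
    answer = f"answer to: {message}"
    # daemon=True means the thread never blocks process shutdown.
    worker = threading.Thread(target=infer_fspr, args=(message,), daemon=True)
    worker.start()
    # Return immediately; the caller gets the answer without waiting.
    return answer, worker


answer, worker = handle_message("Can my landlord keep my deposit?")
worker.join(timeout=2)  # only for the demo; a request handler would not join
```

The trade-off of a daemon thread is that in-flight work is dropped if the process exits, which is one reason a task queue is the safer long-term choice.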

Context

Legal help is inaccessible for most people.

Options were: pay $300/hour for a lawyer, search generic legal websites that assume zero or complete knowledge, or use a generic chatbot that gives the same answer regardless of background. No existing tool profiled legal knowledge at an individual level and targeted responses to close specific knowledge gaps. LegalZoom and Rocket Lawyer are form-fillers, not educators.

r/legaladvice — pinned post

“We are not your lawyers. Advice here is general. Your situation may be completely different.”

4.2M members · generic answers by design

John Leddo (sponsor) — LinkedIn, Jan 2026

“A big thanks to our PR person Tony Berry who got our latest chatbot success story (our health chatbot greatly outperformed ChatGPT) featured in 30 newspapers.”

Direct sponsor signal — benchmark beat was the target

Stanford Access to Justice Tech Review · 2023

“86% of civil legal problems reported by low-income Americans receive inadequate or no legal help.”

Market size signal

LangChain GitHub — context-management issues

“Long-session coherence is the recurring developer pain — context windows blow out, history gets truncated.”

Exactly the problem rolling summarization solves

OpenAI API community

“There's no application-layer way to signal what a user already knows.”

FSPR addresses this directly

1.0 Demand signals. DIAGRAM

The Problem

Six engineers, seven sprints, and six hard constraints.

NDA-bound deliverable

All IP belongs to MyEdMaster per signed agreement. Nothing proprietary can be disclosed publicly.

6 engineers, 7 two-week sprints

Features scoped and delivered in strict cycles with sponsor review after each sprint.

OpenAI API cost pressure

Every additional LLM call had to be justified. The clarification node was killed specifically because it fired even when no clarification was needed.

No existing legal corpus

RAG KB built from scratch. Documents chunked at 1200 chars / 200 token overlap, embedded with legal-bert (768d), indexed in Qdrant with payload indexes for domain + jurisdiction filtering.
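
A minimal sketch of fixed-size chunking with overlap. It assumes both the 1200 chunk size and the 200 overlap are measured in characters, which is a simplification of the figures above; `chunk_text` is a hypothetical helper, not the project's ingestion code.

```python
from typing import List


def chunk_text(text: str, size: int = 1200, overlap: int = 200) -> List[str]:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


document = "".join(str(i % 10) for i in range(3000))
chunks = chunk_text(document)
print([len(c) for c in chunks])  # [1200, 1200, 1000]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.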

PII in every user input

Users share real legal situations — names, addresses, financial details. spaCy en_core_web_lg anonymization was required before every LLM call. No exceptions.
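
The replacement step can be sketched independently of the NER model. The sketch below assumes entity spans have already been produced (for example from spaCy's `doc.ents`, whose spans expose `start_char`, `end_char`, and `label_`); `EntitySpan` and `redact` are hypothetical names, and the label set matches the one listed in the request lifecycle.

```python
from typing import Iterable, NamedTuple


class EntitySpan(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    label: str  # NER label, e.g. "PERSON"


SENSITIVE_LABELS = {"PERSON", "ORG", "GPE", "LOC"}


def redact(text: str, spans: Iterable[EntitySpan]) -> str:
    """Replace sensitive entities with a bracketed label placeholder."""
    # Work right to left so earlier offsets stay valid after each edit.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if span.label in SENSITIVE_LABELS:
            text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


sample = "Maria Lopez sued Acme Corp in Phoenix."
spans = [
    EntitySpan(0, 11, "PERSON"),
    EntitySpan(17, 26, "ORG"),
    EntitySpan(30, 37, "GPE"),
]
print(redact(sample, spans))  # [PERSON] sued [ORG] in [GPE].
```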

Local-only deployment

No cloud budget. Full stack on Docker Compose — but architected to lift cleanly into production whenever the sponsor wanted.

North-star principles

Stateless AI layer.

FastAPI holds no state. Django decides what context to send and what to store. This rule was never violated, no matter what shortcut breaking it would have offered.

Zero user-facing latency for enrichment.

FSPR and summarization fire after the HTTP response is sent. The user never waits for knowledge profiling.

Anonymize before LLM, always.

No modes or exceptions where real PII reaches OpenAI. Anonymization is a data-provenance question, not a node decision.

Process

Three pivots that earned every dollar back.

Killing the clarification node.

The original pipeline had a dedicated clarification node that ran before reasoning on every message, using GPT-4o to decide whether to ask a clarifying question — but it fired unconditionally, even when the query was completely clear. Removed entirely in Sprint 6, flattening the pipeline to 5 nodes. Lawyer-style probing was rebuilt inline in the reasoning prompt instead. This was the single biggest cost reduction — one full GPT-4o call eliminated per message.

submit_self_assessment running 4 LLM calls per submission.

The self-assessment evaluation ran the KB fetch, the AI evaluation call, and the MongoDB persist inside a for-loop over the four FSPR dimensions, so each operation executed four times per submission. Caught during Sprint 5 testing when the API bill for a single session came in at 4× the expected amount. All three operations moved outside the loop: the KB is fetched once, the evaluation is called once with all four dimensions passed together, and the result is persisted once.
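
The shape of the bug and the fix, sketched with call counters. `fetch_kb`, `evaluate`, and `persist` are hypothetical stand-ins for the real KB fetch, LLM evaluation, and MongoDB write.

```python
from collections import Counter
from typing import Dict, Sequence

DIMENSIONS = ("facts", "strategies", "procedures", "rationales")
calls: Counter = Counter()  # counts how often each expensive step runs


def fetch_kb(topic: str) -> str:
    calls["kb"] += 1
    return f"kb for {topic}"


def evaluate(answers: Dict[str, str], kb: str, dims: Sequence[str]) -> Dict[str, float]:
    calls["llm"] += 1
    return {dim: 0.5 for dim in dims}  # placeholder scores


def persist(scores: Dict[str, float]) -> None:
    calls["db"] += 1


def submit_before(topic: str, answers: Dict[str, str]) -> None:
    # Bug: every expensive step sits inside the per-dimension loop.
    for dim in DIMENSIONS:
        kb = fetch_kb(topic)
        persist(evaluate(answers, kb, [dim]))


def submit_after(topic: str, answers: Dict[str, str]) -> Dict[str, float]:
    # Fix: one fetch, one evaluation covering all dimensions, one write.
    kb = fetch_kb(topic)
    scores = evaluate(answers, kb, DIMENSIONS)
    persist(scores)
    return scores


submit_before("leases", {})
before = dict(calls)  # {'kb': 4, 'llm': 4, 'db': 4}
calls.clear()
submit_after("leases", {})
after = dict(calls)   # {'kb': 1, 'llm': 1, 'db': 1}
```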

Document generation producing anonymized names.

The first document-generation attempt produced PDFs with fictional party names because conversation_summary (which had passed through the anonymization node) was used as context. Fix: pull raw message content directly from the Message model before anonymization runs, bypassing the anonymized summary entirely. Real names appeared immediately.

The JSON code-fence bug

In Sprint 7 testing, ready_to_generate kept returning false at the Django layer even though the AI service was correctly setting it to true. Three hours of debugging traced it to the LLM wrapping its JSON response in markdown code fences (```json), causing the JSON parser to fail silently and fall through to a raw-text fallback that had no ready_to_generate field. Fix: regex strip of code fences before JSON parsing plus a JSON extraction fallback using re.search(r'{[\s\S]*}', raw) for mixed-content responses.
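
A minimal sketch of that two-step parse. `parse_llm_json` is a hypothetical helper name, and the extraction regex escapes the braces for clarity. Unlike the original silent fallback, this version raises when no JSON object can be found.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep the listing readable
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*|\s*" + FENCE + r"\s*$", re.IGNORECASE
)
OBJECT_RE = re.compile(r"\{[\s\S]*\}")  # first "{" through last "}"


def parse_llm_json(raw: str) -> dict:
    """Parse LLM output that may be fenced or surrounded by prose."""
    cleaned = FENCE_RE.sub("", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fallback for mixed content: pull out the outermost object.
        match = OBJECT_RE.search(raw)
        if match is None:
            raise ValueError("no JSON object found in LLM output")
        return json.loads(match.group(0))


fenced = FENCE + 'json\n{"ready_to_generate": true}\n' + FENCE
mixed = 'Sure, here it is: {"ready_to_generate": true} Hope that helps.'
print(parse_llm_json(fenced), parse_llm_json(mixed))
```

Failing loudly here is deliberate: the three hours of debugging came from a parser that failed quietly and handed back a payload without the expected field.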

Pipeline topology

Before — 6 nodes

Conditional clarification branch firing GPT-4o on every message regardless of necessity.

After — 5 linear nodes

RAG → Extraction → Anonymization → Reasoning → Formatter. Lawyer-style probing inline in the reasoning prompt.

3.0 DIAGRAM

Per-message cost

Before

~$0.024 / message · 3 full GPT-4o calls (reason + clarify + history-summarize)

After

~$0.0044 / message · 1 GPT-4o (reason) + 2 GPT-4o-mini background (FSPR + summarize)

3.1 DIAGRAM

Architecture

State, pipeline, and the algorithms behind it.

The full request lifecycle from browser to response: Django preprocesses every message (sanitizes it with bleach, fetches context sized by a tiktoken token count, and builds user_context), then calls FastAPI /query. The 5-node LangGraph pipeline runs and returns the response, Django persists the assistant message to MongoDB, and two daemon threads fire: one for FSPR inference, one for token-aware summarization. Both call GPT-4o-mini and never block the user.

mety: ~/request-lifecycle

browser@client:/chat$ POST /api/chat/{userId}/{chatId}/messages { "content": "..." }

─── Django layer ───────────────────────────────────────

mustakim@portfolio:~$ sanitize(bleach) → tokenize(tiktoken) → fetch_user_context()

mustakim@portfolio:~$ POST fastapi:8001/query { history, kb_hint, anonymize:true }

─── FastAPI · LangGraph 5-node pipeline ────────────────

[1] RAG · Qdrant search · top-k by domain + jurisdiction

[2] Extraction · S3 presigned URL · pull KB chunk content

[3] Anonymization · spaCy NER strips PERSON, ORG, GPE, LOC

[4] Reasoning · GPT-4o · structured JSON response · lawyer probing inline

[5] Formatter · validate JSON · strip fences · fallback re.search

─── back to Django ─────────────────────────────────────

mustakim@portfolio:~$ save(Message, role='assistant') → return JsonResponse to browser

─── after response · 2 daemon threads ──────────────────

↳ thread /fspr/infer · GPT-4o-mini · update 4 dim scores

↳ thread /query/summarize · GPT-4o-mini · if tiktoken > threshold

6.0 Full request lifecycle. DIAGRAM
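
The token-aware summarization step at the end of the lifecycle can be sketched as a threshold check. Everything here is an assumption for illustration: the whitespace `count_tokens` is a crude stand-in for a real tokenizer such as tiktoken, the `max_tokens` and `keep_recent` defaults are made up, and `summarize` would be a GPT-4o-mini call in practice.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in: a real tokenizer would be used in production.
    return len(text.split())


def compact_history(
    messages: List[str],
    summarize: Callable[[List[str]], str],
    max_tokens: int = 50,
    keep_recent: int = 2,
) -> List[str]:
    """Fold older messages into one summary once the history is too long."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens or len(messages) <= keep_recent:
        return messages  # under budget, nothing to do
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent


def fake_summarize(older: List[str]) -> str:
    return f"summary of {len(older)} message(s)"


long_message = "word " * 30
history = compact_history([long_message] * 3, fake_summarize)
print(history[0])  # summary of 1 message(s)
```

Keeping the most recent turns verbatim preserves the wording the user is actively responding to, while older turns survive only as a summary.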

FSPR scoring (background, per message)

# Signals (LLM classifies which dim is most revealed)
Know–Know            → 0.80
Know–Don't Know      → 0.40
False Knowledge      → 0.20
Omission             → 0.30

# EMA update per dimension
new_score = current × 0.7  +  signal × 0.3

# Priority weights for Most Critical Gap
False Knowledge       0.40
Omission              0.30
Know–Don't Know       0.20
Irrelevant Knowledge  0.10
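
The EMA update above, written out as a small sketch. The signal values and the 0.7 / 0.3 weights come from the listing; the dictionary keys, function names, and the 0.5 starting score are assumptions for illustration. How the priority weights combine into a Most Critical Gap is not specified here, so it is left out.

```python
from typing import Dict

ALPHA = 0.3  # weight on the new signal; the remaining 0.7 stays on the current score

SIGNAL_VALUES: Dict[str, float] = {
    "know_know": 0.80,
    "know_dont_know": 0.40,
    "false_knowledge": 0.20,
    "omission": 0.30,
}


def ema_update(current: float, signal: float, alpha: float = ALPHA) -> float:
    """new_score = current * 0.7 + signal * 0.3 when alpha is 0.3."""
    return current * (1 - alpha) + signal * alpha


def update_profile(
    profile: Dict[str, float], dimension: str, signal_name: str
) -> Dict[str, float]:
    """Return a copy of the profile with one dimension updated."""
    updated = dict(profile)
    updated[dimension] = ema_update(profile[dimension], SIGNAL_VALUES[signal_name])
    return updated


profile = {"facts": 0.5, "strategies": 0.5, "procedures": 0.5, "rationales": 0.5}
profile = update_profile(profile, "facts", "know_know")
print(round(profile["facts"], 2))  # 0.59
```

With a 0.3 weight on each new signal, a single message nudges a score rather than overwriting it, so one misread reply cannot swing the profile.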

Self-assessment parallel pipeline

LLM1  classify_domain()
       │
       ├──┐  (parallel)
LLM2  generate_kb()         ──┐
LLM4  generate_examples()   ──┤
       │                      │
       │       audit (disabled by default)
       │
       └─▶ 4 parallel dimension evaluators
              ↑
       single KB fetch · single persist
       (moved out of the for-loop in V2)
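
The fan-out in the first half of the diagram can be sketched with a thread pool. The three functions reuse the names from the diagram but are stubs; their real bodies are LLM calls, and the return shapes here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def classify_domain(text: str) -> str:
    return "tenancy"  # stub for the LLM1 call


def generate_kb(domain: str) -> str:
    return f"kb:{domain}"  # stub for the LLM2 call


def generate_examples(domain: str) -> str:
    return f"examples:{domain}"  # stub for the LLM4 call


def prepare_assessment(text: str) -> Dict[str, str]:
    """Classify first, then run the two generators side by side."""
    domain = classify_domain(text)
    with ThreadPoolExecutor(max_workers=2) as pool:
        kb_future = pool.submit(generate_kb, domain)
        examples_future = pool.submit(generate_examples, domain)
        # result() blocks until each call finishes; total wait is the slower one.
        return {
            "domain": domain,
            "kb": kb_future.result(),
            "examples": examples_future.result(),
        }


prepared = prepare_assessment("My landlord will not return my deposit.")
```

Threads suit this stage because the work is network-bound: both calls spend their time waiting on the API, not on the CPU.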

6.1 Algorithms behind the modes. DIAGRAM

Final Designs

Three modes, one cohesive product.

The product shipped as three entry points under one knowledge profile. Below are the three frames that show the end-to-end flow, plus a short demo recorded from the sponsor-handoff build.

METY service hub — three modes: tutoring, chat session, generate document

7.0 Service hub (three-card entry). IMAGE

Document type + describe + jurisdiction form

7.1 Intake form. IMAGE

Split-panel document generation — chat on the left, formatted rental agreement on the right

7.2 Split-panel draft. IMAGE

Intake-as-form, not free text. The user picks a document type, describes what they need in a sentence, confirms jurisdiction, and the chat takes over — lawyer-style probing, targeted follow-ups, and a context-aware draft that updates in place. Document chats are saved separately from regular legal chats so a user can return weeks later, find the in-progress NDA, and continue exactly where they left off.

7.3 End-to-end demo (sponsor-handoff build). VIDEO LOOP

Refresh recovery: On reload, React state is lost, so a useEffect on chatId calls getDocuments and rehydrates the split panel from MongoDB if a draft exists for that chat.

Launch: Delivered to MyEdMaster at the end of May 2026 and handed to a continuation team. Demoed at the ASU SER517 Innovation Showcase. No public deployment; IP belongs to MyEdMaster.

Retrospective

What worked, what I'd change.

Worked

Stateless FastAPI from day one.

Every debugging session, rebuild, and feature addition benefited from the AI service holding no state. Restart the container and nothing breaks. It looked like over-engineering in Sprint 2 and paid off in every sprint after.

Structured JSON from the LLM.

Returning answer, detected_topics, confidence_delta, ready_to_generate, and document_context in one call eliminated what would have been 4-5 separate LLM calls per message and made the entire pipeline observable in LangSmith.

Background threads with daemon=True.

FSPR and summarization fire after the HTTP response with zero user-facing latency. close_old_connections() discipline was established early, so the background threads caused no DB connection issues.

Would change

Celery + Redis from Sprint 1.

The fspr_update_in_progress boolean flag works but isn't production-safe — cross-process race conditions are only partially prevented. Celery would have given proper coordination, retries, and monitoring.

JWT auth from Sprint 1.

A bare user_id in the URL path was inherited by every new endpoint. Retrofitting proper auth now would require touching every view and every frontend API call.

Prompt versioning from the start.

Prompts changed in every sprint with no way to A/B test or roll back. A version field on prompt constants and LangSmith experiment tagging would have made prompt iteration data-driven instead of intuition-driven.

The biggest surprise

Anonymization was the source of every weird bug. I expected it to be a straightforward privacy layer. What I didn't expect was how many features silently depended on whether they were reading anonymized or raw text — document generation used the anonymized summary and produced fictional names; FSPR inference passed anonymized LLM responses to the evaluator; the reasoning prompt received anonymized history that then produced responses with placeholder names. Every feature that touched persisted text had to be individually audited. Anonymization isn't a node decision — it's a data-provenance question that affects every field in every model.
