
Missing Persons Knowledge Graph

Research / Knowledge Graph | 2024

https://missing-persons-knowledge-graph.vercel.app
Overview

A knowledge graph for the people we lose.

My Role
Continuation engineer & deployer

Originally a six-person SER531 team project at ASU under the Semantic Web Technologies course. I continued the work solo after the semester: rewrote the query layer from a paid Java/Jena GraphDB stack into FastAPI + RDFLib, ported the React client to Vite, shipped the whole thing to Vercel + Render free tier, and prepared the resulting paper for IEEE COMPSAC 2025.

Stack
Protégé · OWL · RDFLib · SPARQL · FastAPI · React · Vite · Tailwind · NamUs

The ontology is authored in Protégé and serialized to Turtle (result-triples.ttl). FastAPI loads the graph into an in-memory rdflib.Graph at boot, then maps incoming REST filters to SPARQL queries against it. The React client is a thin consumer — table / card toggle, detail view, Google Maps embed for last-known location.

Timeline
Spring 2025 (team) · summer continuation · IEEE COMPSAC Aug 2025

The project shipped on a paid GraphDB deployment at the end of SER531 and was rebuilt on a $0 infra budget over the summer. The paper was accepted at IEEE COMPSAC 2025 (Toronto, 27% acceptance rate) under the title “Enhanced Tracking and Reporting of Missing Persons Using Semantic Web Technologies.”

Highlights

One ontology, thousands of cases, zero dollars.

The point of the rebuild was reach: the original GraphDB / Azure stack was excellent but cost ~$50 / month and would die the moment the team stopped paying. Migrating to an in-process RDF graph kept the same SPARQL semantics, ran at the same speed, and dropped the operating cost to $0 / month on free tiers — a precondition for it staying alive long enough to be cited.

3,559: NamUs cases indexed (California · Texas · Alaska)
< 100 ms: SPARQL query latency (in-memory rdflib.Graph)
$0 / mo: Operating cost ($50 → $0 vs the GraphDB stack)
IEEE COMPSAC 2025 publication.
Accepted at the 49th annual IEEE Computer Software and Applications Conference (Toronto, 27% acceptance rate). The paper documents the ontology, the in-memory query architecture, and the lessons from the GraphDB → RDFLib migration.
A custom OWL ontology, not a relational schema.
Cases, persons, locations, demographics, and case events are all RDF classes with typed properties, not foreign keys. That preserves the domain model the original team built in Protégé and makes the data usable by other semantic-web tools without translation.
REST that speaks SPARQL underneath.
The FastAPI layer accepts familiar query params (?county=Fresno&sex=F&age_min=30) and translates them into SPARQL filters against the in-memory graph. The client never sees SPARQL, but a separate developer endpoint exposes the raw graph for researchers.
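A minimal sketch of how that parameter-to-SPARQL translation could look, using only the standard library. The property names (:sex, :age, :lastSeenIn, :county) come from the request-lifecycle diagram later on this page; the namespace URI and the function names are assumptions made for illustration, not the project's actual code.

```python
from typing import Mapping

PREFIX = "PREFIX : <http://example.org/mp#>"  # hypothetical namespace


def sparql_string(value: str) -> str:
    """Quote a Python string as a SPARQL string literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_case_query(params: Mapping[str, str]) -> str:
    """Translate REST query params into a SPARQL SELECT query string."""
    patterns = [
        "?case a :MissingPerson ;",
        "      :sex ?sex ; :age ?age ;",
        "      :lastSeenIn ?loc .",
    ]
    filters = []
    if "county" in params:
        patterns.append(f"?loc :county {sparql_string(params['county'])} .")
    if "sex" in params:
        filters.append(f"?sex = {sparql_string(params['sex'])}")
    if "age_min" in params:
        # int() raises ValueError on anything that is not a plain integer.
        filters.append(f"?age >= {int(params['age_min'])}")
    if "age_max" in params:
        filters.append(f"?age <= {int(params['age_max'])}")
    if filters:
        patterns.append("FILTER (" + " && ".join(filters) + ")")
    body = "\n  ".join(patterns)
    return f"{PREFIX}\nSELECT ?case WHERE {{\n  {body}\n}}"


query = build_case_query({"county": "Fresno", "sex": "F", "age_min": "30"})
print(query)
```

Escaping string values and coercing numeric ones before they reach the query text keeps a crafted query parameter from altering the SPARQL itself.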
Context

The data exists. The questions are hard to ask.

NamUs — the National Missing and Unidentified Persons System — is the canonical public dataset, but its search UI was built around looking up a case you already know. Researchers, families, and journalists who want to ask compound questions (“missing women in their thirties last seen in central California”) end up scraping pages or pulling CSVs. A knowledge-graph layer is the right abstraction: it speaks the language the data was always going to live in.

NamUs · NIJ
Tens of thousands of new missing-persons cases are entered into NamUs each year, and a large share remain open.
Scale signal — the dataset only grows
W3C · Semantic Web for Public Data
Public-interest datasets benefit from RDF/OWL when querying patterns matter more than row lookups.
The ontology-first argument
SER531 course brief · ASU
Model a real-world domain in OWL. Demonstrate end-to-end queries via SPARQL. Justify the ontology.
The course constraint that started it
GraphDB pricing — Ontotext, 2025
GraphDB Free runs locally; production hosting starts ~$50/mo on standard cloud.
The cost cliff we walked off
1.0 Demand signals. (Diagram)
The Problem

A research-grade graph that has to run on no money.

1. The ontology was non-negotiable
SER531 required OWL classes modeled in Protégé. Relational alternatives were off the table: the deliverable had to be semantic web from the source up.
2. No paid cloud budget after the semester
The original team deployment used GraphDB on Azure at ~$50/mo. Nobody was going to keep paying. For the published results to stay reproducible, the system had to run on Vercel + Render free tiers indefinitely.
3. Real (not synthetic) NamUs data
3,559 cases across California, Texas, and Alaska, pulled from NamUs and shaped to fit the ontology. The graph had to load every record at startup without exhausting memory on a free-tier dyno.
4. SPARQL semantics had to survive the rewrite
The original Java/Jena queries used SPARQL FILTER + OPTIONAL clauses extensively. The Python rewrite had to keep the exact same query semantics: same results, same ordering, same nulls.
5. Reproducible for the IEEE submission
The paper had to be re-runnable. Triples in the repo, ontology in the repo, queries documented. Anyone reading the paper had to be able to clone the repo and curl /docs.
6. A team handoff with no follow-on contract
The original SER531 team disbanded at the end of the semester. Continuation work was solo and unfunded, so every decision optimized for low maintenance, low cost, and long uptime.
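The "same results, same ordering, same nulls" requirement in constraint 4 can be checked mechanically. The sketch below is one way to do it, assuming result rows from both stacks have already been exported as sequences of strings, with None standing in for an unbound OPTIONAL variable. The function name and sample rows are hypothetical.

```python
from typing import Iterable, Optional, Sequence

Row = Sequence[Optional[str]]


def first_difference(expected: Iterable[Row], actual: Iterable[Row]):
    """Return (index, expected_row, actual_row) for the first mismatch, else None.

    Comparison is positional, so a reordered result set counts as a mismatch,
    and None (unbound) is never treated as equal to an empty string.
    """
    exp = [tuple(r) for r in expected]
    act = [tuple(r) for r in actual]
    for i, (e, a) in enumerate(zip(exp, act)):
        if e != a:
            return (i, e, a)
    if len(exp) != len(act):
        i = min(len(exp), len(act))
        return (i, exp[i] if i < len(exp) else None, act[i] if i < len(act) else None)
    return None


# Hypothetical rows: (caseId, county), county unbound for the second case.
jena_rows = [("MP-1", "Fresno"), ("MP-2", None)]
rdflib_rows = [("MP-1", "Fresno"), ("MP-2", None)]
print(first_difference(jena_rows, rdflib_rows))
```

A positional comparison is stricter than a set comparison, which is what this constraint calls for: it catches ordering drift and a null that silently became an empty string.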
North-star principles
The ontology is the source of truth.
Protégé .owl + Turtle triples sit at the center. Everything else — FastAPI, React, hosting — is replaceable around it.
In-memory beats networked.
At this dataset size, loading the entire graph into rdflib at boot removes the network round trip to an external triplestore, and it fits comfortably in a free-tier 512 MB dyno.
REST on the outside, SPARQL on the inside.
The client doesn't need to know about RDF. Every public endpoint is a normal query string; SPARQL is an implementation detail behind the API.
Process

Three deployments that each killed the previous one.

V1

GraphDB on Azure (the SER531 hand-in).

The team submission ran GraphDB Free in a container on an Azure VM with a Java/Jena query frontend. It worked, the queries were fast, the SPARQL was textbook clean. It also cost about $50 / month and required someone to keep paying — a guarantee that the public demo would silently disappear once the semester ended.

V2

Apache Jena Fuseki on a free VPS.

First migration attempt: keep Jena, drop the cloud bill. Stood up Fuseki on an Oracle Cloud free-tier VM. SPARQL parity was perfect, but free tiers go to sleep — cold starts pushed first-query latency past 5 s after any quiet period. Not acceptable for a public-facing demo a journalist might land on once.

V3

rdflib in-process, FastAPI as the query layer.

Replaced the whole external triplestore with rdflib.Graph loaded into FastAPI memory at startup. The ontology, instance triples, and inferred axioms all sit in process. SPARQL runs against the in-memory graph, returning bindings in < 100 ms from a warm container. No external service, no second hop, no monthly bill.

The cold-start trick

Render free-tier dynos still sleep after 15 minutes idle, and the first request after wake-up has to re-parse result-triples.ttl before serving. We added a small /health endpoint and a Vercel cron that pings it every 10 minutes — the dyno stays warm, the graph stays loaded, the public demo keeps answering instantly. The cron itself runs free.
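The margin behind the heartbeat is simple arithmetic. The sketch below works it through using the two numbers stated above (15-minute idle timeout, 10-minute ping interval). It assumes the pings are the only traffic and treats a gap equal to the timeout as asleep; both are simplifications, and the function names are illustrative.

```python
def longest_idle_gap(ping_interval_min: int, missed_pings: int) -> int:
    """Minutes between two successful pings when `missed_pings` in a row fail."""
    return ping_interval_min * (missed_pings + 1)


def stays_warm(ping_interval_min: int, idle_timeout_min: int, missed_pings: int = 0) -> bool:
    """True if the longest gap between requests is shorter than the idle timeout."""
    return longest_idle_gap(ping_interval_min, missed_pings) < idle_timeout_min


# Every ping lands: 10-minute gaps against a 15-minute timeout.
print(stays_warm(10, 15))
# One ping missed: the gap doubles to 20 minutes.
print(stays_warm(10, 15, missed_pings=1))
```

Under these assumptions the schedule has no slack for a missed run: a single skipped ping already exceeds the idle timeout, so the next visitor pays the re-parse cost.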

Query backend
Before — V1 GraphDB on Azure
External triplestore container, separate VM, $50/mo, manual restart on OOM.
After — V3 rdflib in-process
Loaded into FastAPI memory at startup. Zero external dependencies, $0/mo, container restart auto-reloads the graph.
3.0 (Diagram)
Operating cost
Before
~$50 / month on Azure for a GraphDB VM that would die when the team stopped paying.
After
$0 / month on Vercel (frontend) + Render (API free tier) — has stayed live continuously since the IEEE submission.
3.1 (Diagram)
Architecture

From Protégé ontology to a query in your URL bar.

The system is intentionally short — one ontology, one TTL serialization, one in-memory graph, one API layer, one client. There's no database and no message bus. The simplicity is the point: the paper has to be reproducible by a single reader with one git clone and one uvicorn invocation.

missing-persons: ~/request-lifecycle
browser@client:/$ GET /api/cases?county=Fresno&sex=F&age_min=30&age_max=40
─── FastAPI · query layer ──────────────────────────────
mustakim@portfolio:~$ parse query params → build SPARQL FILTER clauses
mustakim@portfolio:~$ graph.query(sparql) # rdflib.Graph (in memory)
    ?case a :MissingPerson ;
          :sex ?sex ; :age ?age ;
          :lastSeenIn ?loc .
    ?loc :county "Fresno" .
    FILTER (?sex = "F" && ?age >= 30 && ?age <= 40) .
mustakim@portfolio:~$ serialize bindings → JSON # demographics + photo + namus URL
─── back to client ─────────────────────────────────────
mustakim@portfolio:~$ 200 OK [ { case: "MP-12041", ... }, ... ] # < 100 ms
6.0 Request lifecycle. (Diagram)
Ontology classes (excerpt)
:MissingPerson  a owl:Class .
   :caseId            xsd:string
   :name              xsd:string
   :sex               xsd:string
   :dateOfBirth       xsd:date
   :lastContactDate   xsd:date
   :lastSeenIn        :Location
   :hasDemographics   :Demographics
   :namusUrl          xsd:anyURI

:Location       a owl:Class .
   :city              xsd:string
   :county            xsd:string
   :state             xsd:string
   :lat / :lon        xsd:decimal

:Demographics   a owl:Class .
   :race              xsd:string
   :height_cm         xsd:integer
   :weight_kg         xsd:integer
   :eyeColor          xsd:string
Deployment topology
Protégé (.owl)
   │  serialize
   ▼
result-triples.ttl   (in repo)
   │  load at boot
   ▼
FastAPI  ──  rdflib.Graph  (in process)
   │
   ▼
Render free tier  ←—  /health ping (Vercel cron, 10 min)
   ▲
   │  GET /api/cases?…
   │
React + Vite + Tailwind
   │
Vercel  (static frontend)
6.1 Deployment topology. (Diagram)
Final Designs

The public surface ships at missing-persons-knowledge-graph.vercel.app.

The deployed product is two surfaces. The frontend at missing-persons-knowledge-graph.vercel.app serves the search + table + detail UI; the API at missing-persons-knowledge-graph-1.onrender.com/docs exposes the same data via Swagger. Both are live as of the IEEE submission and are kept warm by the cron heartbeat described above.

Case detail — Maria Munoz, demographics, Google Maps embed, NamUs deep link
7.0 Case detail view. (Image)

Each case renders as a full-page detail card. Photo on the left, demographics + case information on the right — case number, date of last contact, location, biological sex, race. Below that: missing age vs. computed current age, full circumstance-of-disappearance narrative, and a Google Maps embed pinned to the lat/lon coordinates from the ontology.
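The page does not say how the two ages are derived. One plausible approach, assuming they come from the :dateOfBirth and :lastContactDate properties in the ontology excerpt, is a whole-years difference that accounts for whether the birthday has passed. The function name and sample dates are hypothetical.

```python
from datetime import date


def age_on(date_of_birth: date, on: date) -> int:
    """Whole years elapsed between date_of_birth and the date `on`."""
    years = on.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in `on`'s year.
    if (on.month, on.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


dob = date(1990, 6, 15)
last_contact = date(2015, 3, 1)

missing_age = age_on(dob, last_contact)        # age at last contact
current_age = age_on(dob, date(2025, 6, 15))   # age on a fixed reference date
print(missing_age, current_age)
```

In a live view the reference date would be `date.today()`; a fixed date is used here so the result is repeatable.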

The filter surface maps REST params to SPARQL. Name, Case ID, sex (radio), race (multi-check), missing age range (check-bucket), county, city, and cause of disappearance — all compound-queryable in one submission. The backend translates every selected filter into a SPARQL FILTER clause against the in-memory graph and returns results in < 100 ms.

This example searches for a white male named John, aged 18–35, last seen in Santa Barbara under suspicious circumstances — the kind of compound query NamUs's own search UI was not built for.

The “View More on NamUs” button at the bottom deep-links back to the official NamUs case page so the graph serves as a discovery layer, not a replacement for the primary source.

Filter form — John, Male, White, 18-35, Santa Barbara, Suspicious circumstances
7.1 Advanced filter: compound SPARQL query. (Image)
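Multi-select controls on the filter form, such as the race checkboxes and the age buckets, need more than a single equality test. A sketch of one way to express them, assuming SPARQL 1.1's IN operator for the checkboxes and an OR of inclusive ranges for the buckets; the variable names ?race and ?age and all function names are illustrative.

```python
from typing import List, Sequence, Tuple


def quote(value: str) -> str:
    """Quote a Python string as a SPARQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def in_clause(var: str, values: Sequence[str]) -> str:
    """Multi-check filter: the variable must equal one of the selected values."""
    return f"?{var} IN ({', '.join(quote(v) for v in values)})"


def bucket_clause(var: str, buckets: Sequence[Tuple[int, int]]) -> str:
    """Age buckets: the variable must fall inside at least one inclusive range."""
    ranges = [f"(?{var} >= {int(lo)} && ?{var} <= {int(hi)})" for lo, hi in buckets]
    return "(" + " || ".join(ranges) + ")"


def build_filter(
    races: Sequence[str] = (),
    age_buckets: Sequence[Tuple[int, int]] = (),
) -> str:
    """Combine the selected filters with &&; return '' when nothing is selected."""
    clauses: List[str] = []
    if races:
        clauses.append(in_clause("race", races))
    if age_buckets:
        clauses.append(bucket_clause("age", age_buckets))
    return f"FILTER ({' && '.join(clauses)})" if clauses else ""


print(build_filter(races=["White", "Asian"], age_buckets=[(18, 25), (26, 35)]))
```

Options within one control are alternatives (OR), while separate controls narrow the result (AND), which matches how a compound search form is usually read.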
Retrospective

What survived, what I'd do differently.

Worked

Ontology-first paid off.
Because the OWL model was the source of truth, swapping Java/Jena → Python/rdflib was a translation, not a redesign. Every query and every result row matched.
In-memory at this scale is fine.
3,559 cases fit comfortably in a free-tier dyno with the whole ontology + inferred axioms. The query path is one process; there's no network hop to a triplestore.
The cron heartbeat is silly and effective.
Pinging /health every 10 minutes kept the Render free tier warm for the entire IEEE review window without anybody paying anything.

Didn't

Cold starts still bite if the cron fails.
If the Vercel cron misses a run (free-tier rate limits), the first request after a long idle period has to re-parse the TTL, and a response that normally feels sub-second takes about 5 seconds.
The data scrape is brittle.
NamUs page structure changes occasionally and the original scrape script needs hand-fixes. A proper data-refresh pipeline would have been nice.
No federation across states.
Cases live in three states only because that's what SER531 needed for the deliverable. A nationwide graph requires either a more aggressive scrape budget or an actual partnership with NamUs.

Next

Public SPARQL endpoint.
Expose a read-only /sparql endpoint so external researchers can run their own queries without going through the REST wrapper. Already prototyped — just needs rate-limiting before it ships.
Face-similarity recall.
Add a per-case image embedding and a 'similar appearance' button on the detail page — many missing-persons searches start with a photograph, not a name.
Federated identity across datasets.
The same person can show up in NamUs, ChariotPI, and local-PD bulletins under slightly different demographics. An owl:sameAs layer over multiple sources would let the graph answer questions no single dataset can.
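For the rate-limiting that the public /sparql endpoint still needs, a token bucket is one common option. The sketch below is a generic, standard-library version and is not taken from the prototype. The capacity, refill rate, and injectable clock are assumptions chosen so that the behaviour is easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity`, then a steady `refill_per_sec` rate."""

    def __init__(
        self,
        capacity: int,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Deterministic demo with a fake clock: burst of 2, then 1 request per second.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=1.0, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0
after_wait = bucket.allow()
print(burst, after_wait)
```

In a deployed API there would typically be one bucket per client key (for example per IP address), with a rejected request answered by HTTP 429.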