De Bruijn Genome Assembler

Systems / Algorithms | 2022

mustakim@portfolio: ~/projects/genome-assembler

mustakim@portfolio:~$java -jar genome-toolkit-1.0.0.jar assemble data/dataset1.txt --stats

Reading reads from: data/dataset1.txt

Loaded 33,609 reads

Assembling genome with k=20 …

✔ de Bruijn graph built · 111,283 v · 111,624 e

✔ tip removal · 18 dead-ends pruned

✔ bubble resolution · 2 resolved by coverage

✔ Eulerian cycle · Hierholzer's, iterative

✔ circular trim · overlap detected

Assembly Result:
  Genome length:    5,396 bases    (phiX174 ref: 5,386 → 99.9% coverage)
  Input reads:      33,609
  Graph vertices:   111,283
  Graph edges:      111,624
  Tips removed:     18
  Bubbles resolved: 2
  Assembly time:    1,812 ms

Overview

Reassemble a genome from thirty-three thousand fragments.

My Role

Sole engineer

End-to-end build: Java CLI assembler, Spring Boot REST API, React + Vite frontend, Docker packaging, deployed split-stack to Render (backend) + Vercel (frontend). Implements de Bruijn graph construction, Hierholzer's Eulerian cycle, and error correction via tip removal + bubble resolution — all from the bioinformatics primary literature, no library shortcuts.

Stack

Java 17 · Spring Boot 3 · Maven · React · Vite · Docker · Render · Vercel

One assembler jar drives a CLI ( java -jar genome-toolkit-1.0.0.jar assemble ...) and the REST endpoint inside the Spring Boot service. The React frontend is a thin drag-and-drop client over the same API. Two assembly modes ship — the de Bruijn pipeline as primary, and a greedy maximum suffix-prefix overlap assembler as a baseline.

Timeline

Apr 2026 · solo

Lives at debruijn-genome-assembler.vercel.app. Validates on the phi X174 bacteriophage reference — the canonical small-bacteriophage genome for sequencing tools — reconstructing 5,396 bp from 33,609 reads in under two seconds.

Highlights

A de Bruijn assembler that finishes in under two seconds.

On the phi X174 benchmark (5,386 bp reference, 33,609 reads) the pipeline reconstructs 5,396 bases at 99.9% coverage, traverses a graph of ~111K edges via an iterative Hierholzer's implementation, and finishes in ~2 seconds on a laptop. Tip removal prunes 18 dead-end branches; bubble resolution picks the higher-coverage path on 2 competing parallels.

99.9%

Coverage (phi X174)

5,396 bp assembled · 5,386 bp ref

33,609

Input reads → graph

111K vertices · 111K edges

~2 s

End-to-end runtime

laptop · k=20 · Java 17

The pipeline is the algorithm, not a library wrapper.

K-mer extraction, graph construction, tip removal, bubble resolution, Eulerian traversal, and circular trim are all hand-implemented from the source bioinformatics literature. SPAdes and Velvet are referenced for verification, not imported.

Two assemblers, one CLI.

The same jar runs the de Bruijn pipeline (assemble) and a greedy overlap baseline (overlap). The overlap mode exists so the graph-based mode has something to be compared against.

Web UI for non-engineers.

Drop a FASTA or plain-reads file into the React frontend, pick a k-mer size, watch the assembly result and download the genome. Same algorithm, no terminal required.

Context

Genome assembly is graph algorithms with a dictionary.

Modern sequencers don't produce genomes — they produce millions of short reads, each 100–300 bp, that have to be stitched back into the original sequence. The classic computer-science answer is the de Bruijn graph: every k-mer becomes a vertex, every (k+1)-mer becomes an edge, and an Eulerian traversal of the graph reconstructs the source. Production tools like SPAdes and Velvet are de Bruijn at their core, with a couple of decades of engineering on top of the basic structure.

Compeau, Pevzner, Tesler · Nature Biotechnology, 2011

“How to apply de Bruijn graphs to genome assembly — the canonical reference for the algorithmic approach this project implements.”

The paper this toolkit follows

Hierholzer, 1873

“Construct an Eulerian circuit in a graph by depth-first edge traversal with backtracking on dead ends — used in iterative form here for memory safety on the 111K-edge graph.”

The traversal algorithm at the core

SPAdes / Velvet · production assemblers

“State-of-the-art short-read assemblers use de Bruijn graphs with error correction (tip removal + bubble resolution) and multi-k assembly.”

The blueprint this project follows on the toolkit side

phi X174 — Sanger, 1977

“First DNA-based genome ever sequenced — 5,386 bp. Still the canonical small benchmark for any new assembler.”

The reference this toolkit validates against

1.0The algorithmic lineage.DIAGRAM

The Problem

A de Bruijn assembler from the paper, not a wrapper.

Implement the algorithm, not import it

The whole point of the build was understanding-through-implementation. No SPAdes wrapper, no Bio4j shortcut. K-mer extraction, graph construction, tip removal, bubble resolution, and Eulerian traversal are all hand-rolled in Java.

Memory safety on ~111K edges

A recursive Hierholzer's blows the JVM stack on a graph this size. The traversal has to be iterative — explicit stack + edge-consumed bitmap.

Error correction without overcorrection

Real reads have sequencing errors that become tips and bubbles in the graph. Remove them too aggressively and the genome gets fragmented; not enough and the assembly stays wrong. Configurable thresholds, validated against phi X174 coverage.

Two input formats, one pipeline

Plain reads (one per line) and FASTA (with sliding-window read simulation from a reference). Both must produce the same graph for the same content.

N-bases break k-mer indexing

Reads containing N (ambiguous base) corrupt the (k-1)-mer vertex set if passed through. Filter before graph construction; transparent to the user.

One toolkit, two surfaces

Same jar serves the CLI for power users and the REST API behind the React frontend. Drag-and-drop UI for non-engineers; --stats output for everyone else.

North-star principles

Iterative over recursive.

On graphs this size, recursion is a stack-overflow waiting for the right input. Every traversal in the pipeline is iterative with explicit state.

Verifiable against a reference.

The whole reason phi X174 is the benchmark is that you can compare assembled output to a known ground truth. 99.9% coverage isn't a vibe; it's a string match.

Two assemblers for one comparison.

A greedy overlap baseline exists only so the de Bruijn pipeline has something to be measurably better than. Algorithms without baselines are folklore.

Process

Build the graph, walk the cycle, clean the noise.

Greedy overlap baseline.

Started with the simpler algorithm — pairwise maximum suffix-prefix overlap merging. Works on short read sets and is easy to verify by hand, but scales O(n²) on the number of reads. Kept it in the jar under the overlap subcommand as the comparison baseline.

De Bruijn graph + recursive Hierholzer.

First de Bruijn implementation. K-mer extraction, vertex / edge construction, and a textbook recursive Hierholzer's traversal. Assembled correctly on toy inputs (~1K edges) and promptly stack-overflowed at full phi X174 scale (~111K edges). The algorithm was right; the implementation was wrong.

Iterative Hierholzer + error correction.

Rewrote the traversal as an explicit-stack iterative implementation. Added tip removal (dead-end branches shorter than a configurable depth threshold) and bubble resolution (parallel paths between the same two vertices, resolved by coverage comparison). Now finishes phi X174 in ~2 seconds at 99.9% coverage, prunes 18 tips, resolves 2 bubbles.

The circular-trim gotcha

phi X174 is a circular genome — the assembled string starts and ends with the same k-1 prefix because the Eulerian cycle closes on itself. The first version of the output left those duplicated bases in place and reported 5,416 bp instead of 5,396. Fix: detect the overlap window between the head and tail of the assembled string and trim it, controllable via --no-trim for non-circular references.

Hierholzer's implementation

Before — V2 recursive

Recursive depth-first traversal. Beautiful 12-line implementation that stack-overflows on any non-trivial graph. JVM gives up around 64K depth.

After — V3 iterative

Explicit Deque<Edge> stack + visited bitmap. Same algorithm, no recursion. Handles 111K-edge phi X174 graph in 1.8 seconds with stable memory footprint.

3.0DIAGRAM

Reported genome length

Before

5,416 bp on phi X174. Algorithm correct, output wrong — duplicated head/tail bases from the circular topology.

After

5,396 bp at 99.9% coverage. Circular-trim detects the head/tail overlap and removes it. --no-trim flag for non-circular inputs.

3.1DIAGRAM

Architecture

Seven stages, one Eulerian walk.

The de Bruijn pipeline is seven deterministic stages — every stage consumes a typed input from the previous one and produces a typed output for the next. No hidden mutation, no global state. Same code powers the CLI and the REST endpoint.

genome-toolkit: ~/pipeline

cli@laptop:/$assemble data/dataset1.txt --stats # k=20

─── assembly pipeline ───────────────────────────────────

[1] extract k-mers from reads · filter N-bases · k=20

[2] build de Bruijn graph · (k-1)-mer vertices, k-mer edges

[3] tips remove dead-end branches · depth threshold

[4] bubbles resolve parallel paths · keep higher-coverage

[5] euler iterative Hierholzer · explicit-stack DFS

[6] trim detect circular head/tail overlap · trim duplicate

[7] emit assembled genome · graph stats · runtime

done5,396 bp · 99.9% cov · 1,812 ms · 18 tips · 2 bubbles

6.0Pipeline lifecycle.DIAGRAM

Iterative Hierholzer's

Deque<Vertex>  stack  = new ArrayDeque<>();
List<Edge>     trail  = new ArrayList<>();
BitSet         used   = new BitSet(edges.size());

stack.push(startVertex);

while (!stack.isEmpty()) {
    Vertex v = stack.peek();
    Edge next = graph.unusedOutgoing(v, used);

    if (next == null) {                // dead-end → backtrack
        trail.add(graph.incoming(v));
        stack.pop();
    } else {                           // walk forward
        used.set(next.id());
        stack.push(next.dest());
    }
}

// trail is the Eulerian circuit in reverse
Collections.reverse(trail);

CLI surface

# Primary mode
genome-toolkit assemble <reads-file> [options]

  -k <size>     k-mer size (default 20)
  -o <file>     write assembled genome to file
  --no-trim     disable circular trim
  --no-tips     disable tip removal
  --stats       print assembly statistics

# Baseline mode
genome-toolkit overlap <reads-file>
  # greedy maximum suffix-prefix overlap merging
  # O(n²) on reads — for small-set comparison only

# Inputs
  reads.txt     one read per line
  ref.fasta     FASTA — sliding-window simulated reads

6.1Traversal + CLI surface.DIAGRAM

Final Designs

A drag-and-drop UI in front of the same assembler.

Two surfaces, one engine. The CLI is the truth-teller — every flag, every stat, full reproducibility. The web UI at debruijn-genome-assembler.vercel.app wraps the same jar behind a Spring Boot REST endpoint, accepts file drops, and renders the same assembly result + downloadable genome.

Genome Assembler — drag-and-drop landing with k-mer size and demo data button

7.0Web UI — drag-and-drop landing.IMAGE

Demo data loaded — dataset1.txt, 0.02 KB, ready to assemble

7.1Demo data loaded — one click to assemble.IMAGE

Assembly complete — 5,396 bp, 33,609 reads, 26,455 ms, genome sequence preview

7.2Assembly complete — 5,396 bp from 33,609 reads, sequence preview + download.IMAGE

debruijn-genome-assembler.vercel.app

https://debruijn-genome-assembler.vercel.app

Open live app ↗

7.3Live at debruijn-genome-assembler.vercel.app — click to open.IMAGE

Retrospective

What survived, what I'd assemble differently.

Worked

Iterative traversal scaled where recursion didn't.

Same Hierholzer's algorithm, two implementations. Recursive stack-overflowed at 111K edges; iterative finishes in 1.8 seconds and uses bounded heap. The lesson generalizes well beyond bioinformatics.

The greedy baseline justified the de Bruijn build.

Without an in-repo comparison, 'I implemented de Bruijn' is folklore. With it, you can show the O(n²) overlap behavior on the same input and explain why the graph pays off.

One jar, two surfaces.

The CLI and the REST API share the same assembler class. Bug fixes land in both at once; behavior is provably identical.

Didn't

Only validated on phi X174.

Small bacteriophage at ~5.4kb is the toy dataset. Anything closer to a real bacterial genome (~5 Mb) needs more careful memory management — and probably multi-k assembly.

No paired-end support.

Real Illumina sequencing produces paired reads with known insert sizes — invaluable for resolving repeats. The current pipeline treats every read independently.

Render cold start hurts the demo.

The Spring Boot backend on Render free tier sleeps after 15 minutes idle and takes ~10 seconds to warm. First-load impression of the demo is worse than the assembler deserves.

Multi-k assembly.

SPAdes-style — assemble at several k values, merge results. Larger k for long unique stretches, smaller k where coverage drops. Standard move for scaling beyond toy genomes.

Paired-read scaffolding.

Use insert-size information to bridge across repeats that single reads can't resolve. The classic next-step after a working de Bruijn core.

Always-warm backend.

Migrate the Render service to a paid tier or a fly.io always-warm container so the first demo click responds in ~200ms, not 10 seconds.

Next Project

Launch Parameters

Interactive launch planning from any point on Earth.

Open