Semiconductor Yield Predictor

Machine Learning | 2025

https://localhost:8501/secom

Overview

Catch the wafers that fail — even when failure is six percent.


My Role

Sole engineer

End-to-end ownership: data audit, preprocessing pipeline, L1 feature selection, model comparison across five candidates, threshold tuning, and a Streamlit operator UI with a live recall / false-alarm knob. The deliverable is a single-line decision, PASS or FAIL with a failure probability, that an operator can trust, set their own threshold against, and re-tune as their fab's cost asymmetry shifts.

Stack

Python 3.10 · scikit-learn · imbalanced-learn (SMOTE baselines) · Streamlit · joblib · matplotlib

A four-stage pipeline serialized to joblib artifacts — imputer, scaler, L1 selector, Random Forest — plus a Streamlit UI that loads the artifacts at boot and exposes the decision threshold as a live slider. Notebook for training; app.py for inference; nothing more.

Timeline

Mar 2026 · solo

Trained, evaluated, and packaged with the deployed secom/app/app.py Streamlit demo. The final Random Forest + L1 model beats four baselines (incl. XGBoost + SMOTE) on the metric that actually matters for a fab: recall on the failure class.

Highlights

Five models, one metric that matters: recall on the rare class.

On a dataset where ~6% of wafers fail and 14 out of 15 are good ones, classifiers that look great on accuracy are useless — you can predict “PASS” for every wafer and hit 94%. The right question is how many of the actual failures did you catch? Final model catches 76% at a 0.35 threshold, against a naive-LogReg baseline of 14%.
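The accuracy trap can be checked with plain arithmetic. The sketch below uses the 1,567-wafer and 104-failure counts quoted elsewhere in this write-up; the function name is illustrative and not part of the project code.

```python
# Counts quoted in this write-up: 1,567 wafers, 104 labeled failures.
TOTAL_WAFERS = 1567
FAILURES = 104


def all_pass_metrics(total: int, failures: int) -> dict:
    """Metrics for a classifier that predicts PASS for every wafer."""
    true_negatives = total - failures   # every good wafer is correctly passed
    false_negatives = failures          # every failing wafer is missed
    true_positives = 0                  # nothing is ever flagged
    accuracy = (true_negatives + true_positives) / total
    recall = true_positives / (true_positives + false_negatives)
    return {"accuracy": accuracy, "recall": recall}


metrics = all_pass_metrics(TOTAL_WAFERS, FAILURES)
print(f"accuracy={metrics['accuracy']:.3f}, recall={metrics['recall']:.3f}")
# prints: accuracy=0.934, recall=0.000
```

High accuracy and zero recall come out of the same predictions, which is why accuracy is not used as the headline metric here.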

76%

Recall (failure class)

vs 14% baseline · threshold 0.35

0.81

ROC-AUC

severely imbalanced · 14:1 ratio

590 → 113

Feature dimensionality

L1 selector · ~80% pruned

Model comparison — README results table

Model	Recall	Precision	F1	ROC-AUC
Naive Logistic Regression	0.14	0.14	0.14	—
Cost-Sensitive Logistic Regression	0.29	0.15	0.20	—
ROS Logistic Regression	0.29	0.15	0.20	—
XGBoost + L1 + SMOTE	0.57	0.15	0.24	—
★ Random Forest + L1 (Ours)	0.76	0.23	0.35	0.81
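As a sanity check on the table, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below copies the table values by hand; because the inputs are already rounded to two decimals, the recomputed values are only expected to agree within rounding.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F1) copied from the table above.
rows = {
    "Naive Logistic Regression": (0.14, 0.14, 0.14),
    "Cost-Sensitive Logistic Regression": (0.29, 0.15, 0.20),
    "ROS Logistic Regression": (0.29, 0.15, 0.20),
    "XGBoost + L1 + SMOTE": (0.57, 0.15, 0.24),
    "Random Forest + L1": (0.76, 0.23, 0.35),
}

recomputed = {name: f1_score(r, p) for name, (r, p, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed[name]:.3f}")
```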

Recall is the metric that matches the cost.

A missed defect escapes to downstream processes — rework, yield loss, field returns. A false alarm is one extra inspection. The metric, the threshold, and the loss are all chosen with this asymmetry in mind.

L1 before Random Forest beats Random Forest alone.

At 590 features and only 104 failure examples, feeding the raw dimensionality into trees produces unstable splits on noise. L1 (Lasso LogReg, C=0.1) narrows the input to ~113 statistically relevant sensors before any tree gets built.
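To make the pruning step concrete: an L1 penalty drives many coefficients to exactly zero, and selection then keeps only the columns whose coefficient survives. The sketch below shows that masking step in plain Python with invented coefficients. The nonzero-coefficient rule and the 1e-5 tolerance are assumptions for illustration; the training recipe in this write-up passes threshold='median' to scikit-learn's SelectFromModel instead.

```python
def select_nonzero(coefficients, tolerance=1e-5):
    """Indices of features whose absolute coefficient exceeds the tolerance."""
    return [i for i, c in enumerate(coefficients) if abs(c) > tolerance]


def apply_mask(row, kept_indices):
    """Reduce one wafer's readings to the selected sensors."""
    return [row[i] for i in kept_indices]


# Hypothetical coefficients from an L1-penalized model over five sensors.
coefficients = [0.0, -0.8, 0.0, 0.3, 0.0]
kept = select_nonzero(coefficients)

wafer = [10.0, 11.0, 12.0, 13.0, 14.0]
reduced = apply_mask(wafer, kept)
print(kept, reduced)
# prints: [1, 3] [11.0, 13.0]
```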

The threshold is a knob, not a constant.

The Streamlit app exposes the decision threshold as a live slider so an operator can match it to their facility's real cost of a missed defect vs a false alarm. The model supplies the failure probability; the operator sets the threshold.

Context

Semiconductor yield is the difference between a $400 wafer and a paperweight.

Modern fabs run hundreds of sensor streams across every step of wafer fabrication — etch, lithography, deposition, polishing. A failing wafer usually leaves a fingerprint in the sensor data, but the fingerprint is buried inside hundreds of correlated dimensions with most failures concentrated in a handful of them. The SECOM dataset is the canonical public benchmark for exactly this problem.

UCI ML Repository · SECOM dataset card

“A typical wafer fabrication process is a complex sequence of operations. Continuous-valued sensor signals are collected throughout — but only a subset are useful for predicting yield.”

Why feature selection is non-optional

SEMI E10 standard (manufacturing equipment)

“Equipment reliability and overall equipment effectiveness (OEE) are measured in part by the rate of out-of-spec product.”

The KPI on the operator side

Common ML pitfall · class-imbalance literature

“With a 14:1 class imbalance, naive classifiers default to predicting the majority class. Accuracy looks great; recall on the rare class is near zero.”

Exactly what the baseline LogReg shows: 14% recall

TSMC / Intel internal yield-engineering practice

“Cost asymmetry: a missed defect can cost 10-100x what a false alarm costs, depending on where in the process it escapes.”

Why recall, not F1, is the headline number

Figure 1.0: Demand signals.

The Problem

A 14:1 class imbalance, 590 sensors, 104 failures.

The minority class is 6%

Out of 1,567 wafers in SECOM, only 104 are labeled failures. Vanilla classifiers drift toward PASS-everything: accuracy looks great at about 94%, recall is 14%, and the model is useless.

590 features, mostly noise

Most sensor channels carry no failure signal. Constant-value features need dropping, missing values need imputing, correlated features need de-duplicating, and the rest need narrowing to the ones that actually predict failure.
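A minimal, dependency-free sketch of the first two cleaning steps (dropping constant channels, then median imputation) is shown below. The function names and toy rows are illustrative only; the project itself uses scikit-learn's SimpleImputer, and correlated-feature de-duplication is not covered here.

```python
import math
from statistics import median


def is_missing(value):
    """True for None or NaN readings."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_constant_columns(rows):
    """Keep only columns with more than one distinct observed value."""
    kept = []
    for j in range(len(rows[0])):
        observed = {row[j] for row in rows if not is_missing(row[j])}
        if len(observed) > 1:
            kept.append(j)
    return [[row[j] for j in kept] for row in rows], kept


def impute_median(rows):
    """Replace missing entries with the median of the column's observed values."""
    n_cols = len(rows[0])
    medians = [
        median(row[j] for row in rows if not is_missing(row[j]))
        for j in range(n_cols)
    ]
    return [
        [medians[j] if is_missing(row[j]) else row[j] for j in range(n_cols)]
        for row in rows
    ]


nan = float("nan")
raw = [
    [1.0, 5.0, nan],
    [1.0, 7.0, 2.0],
    [1.0, nan, 4.0],
    [1.0, 9.0, 6.0],
]
pruned, kept = drop_constant_columns(raw)   # column 0 never changes, so it is dropped
clean = impute_median(pruned)
print(kept, clean)
# prints: [1, 2] [[5.0, 4.0], [7.0, 2.0], [7.0, 4.0], [9.0, 6.0]]
```

In a real pipeline the medians would be learned on the training split only and reused at inference, which is what the serialized imputer artifact is for.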

SMOTE over-correction

XGBoost + L1 + SMOTE pushed predicted failure rates to 80%+ — useful in a paper, useless in a fab. The class-imbalance fix can't be more aggressive than the imbalance itself.

No universal threshold

A leading-edge logic fab has a different cost-of-missed-defect than a memory fab. The same model has to serve both — the threshold has to be exposed as a knob, not baked into the artifact.
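One way to see why no single cutoff fits every fab is to price the two error types and pick the cheapest threshold. The sketch below does that on invented scores; the cost figures, candidate thresholds, and function names are all hypothetical and are not taken from the project.

```python
def expected_cost(scores, labels, threshold, cost_miss, cost_false_alarm):
    """Total cost of missed failures plus false alarms at one threshold."""
    misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return misses * cost_miss + false_alarms * cost_false_alarm


def cheapest_threshold(scores, labels, candidates, cost_miss, cost_false_alarm):
    """Candidate threshold with the lowest total cost (first one wins ties)."""
    return min(
        candidates,
        key=lambda t: expected_cost(scores, labels, t, cost_miss, cost_false_alarm),
    )


# Toy holdout scores, invented for illustration (1 = failing wafer).
scores = [0.10, 0.20, 0.30, 0.40, 0.42, 0.45, 0.47, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 0, 0, 1, 1]
candidates = [0.25, 0.35, 0.50, 0.70]

# Fab A: a missed defect costs 50x a false alarm. Fab B: only 2x.
fab_a = cheapest_threshold(scores, labels, candidates, cost_miss=50, cost_false_alarm=1)
fab_b = cheapest_threshold(scores, labels, candidates, cost_miss=2, cost_false_alarm=1)
print(fab_a, fab_b)
# prints: 0.35 0.5
```

The same scores yield different best thresholds once the cost ratio changes, which is the argument for keeping the threshold out of the serialized artifact.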

Reproducible end-to-end

The notebook has to train the model and the Streamlit app has to load the same artifacts unchanged. No drift between training and inference.

Honest comparison or none

The headline model has to beat the baselines on the same split with the same preprocessing. No quietly favorable splits, no cherry-picked thresholds. The results table in the README is the audit.

North-star principles

Cost-asymmetry first.

The metric you optimize against has to match the metric the operator cares about. Recall on the rare class for a yield-loss problem; F1 only when the costs are symmetric.

Prune before you predict.

On a noisy high-dimensional dataset, L1 selection isn't a nice-to-have; it's the move that makes downstream training stable. Raise the signal-to-noise ratio before you reach for the heavier model.

One artifact set, two surfaces.

The notebook trains and saves the artifacts; the Streamlit app loads and uses them. No re-training in the app, no separate preprocessing — both surfaces see the same model state.

Process

Five candidates, four design decisions, one operator-ready slider.

Naive baselines.

First sprint was deliberately under-engineered: logistic regression, no class weighting, no oversampling. It got 14% recall on the failure class. The point wasn't to ship this; the point was to anchor the floor: what you get if you don't treat the imbalance.

Three imbalance fixes — all hit a wall.

Cost-sensitive LogReg (class-weighted), ROS LogReg (random over-sampling), and XGBoost + L1 + SMOTE. Recall climbed — 0.29 / 0.29 / 0.57 respectively — but each in a way that hurts production: XGBoost + SMOTE over-corrected and flagged 80%+ of wafers as failures. Useful recall, unusable precision floor.

L1 → Random Forest with class_weight='balanced'.

The winning recipe. Preprocess (drop constants, impute median, standard-scale), Lasso LogReg (C=0.1) for feature selection (590 → ~113), then RandomForestClassifier(n_estimators=500, class_weight='balanced', max_depth=6). Lands at 0.76 recall · 0.81 ROC-AUC at the 0.35 threshold. No SMOTE — the built-in class weighting handles the imbalance without over-correcting.
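For intuition about what class_weight='balanced' does, scikit-learn documents the weight for each class as n_samples / (n_classes * class_count). The sketch below applies that formula to the full-dataset counts quoted in this write-up; the actual weights depend on the training split, so these numbers are illustrative.

```python
def balanced_class_weights(class_counts):
    """Weights per class: n_samples / (n_classes * count_of_that_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Full-dataset counts from this write-up: 1,567 wafers, 104 failures.
counts = {"pass": 1567 - 104, "fail": 104}
weights = balanced_class_weights(counts)
ratio = weights["fail"] / weights["pass"]
print(f"pass={weights['pass']:.3f}, fail={weights['fail']:.3f}, ratio={ratio:.1f}")
# prints: pass=0.536, fail=7.534, ratio=14.1
```

Each failing wafer counts roughly 14 times as much as a passing one during training, mirroring the 14:1 imbalance without adding synthetic rows.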

Why 0.35, not 0.50

The default classification threshold (0.50) optimizes for accuracy. On a 14:1 imbalance, accuracy and recall pull in opposite directions. Lowering the threshold to 0.35 trades a few percentage points of precision for a meaningful jump in recall, exactly the trade you want when a missed defect is more expensive than a false alarm. The Streamlit slider exposes this so the operator can move it themselves; 0.35 is just the tuned default.

Recall on the failure class

Before — V1 naive LogReg

0.14 recall. The model predicts PASS for nearly every wafer and misses 86% of failures. Useless in production.

After — V3 RF + L1

0.76 recall at threshold 0.35. Catches ~3 of every 4 failing wafers — a 5.4× improvement over the naive baseline.


Feature space

Before

590 raw sensors. Random Forest fits unstable splits on noise dimensions. XGBoost over-fits to bootstrap subsets.

After

L1 Lasso (C=0.1) prunes 590 → ~113 features. Trees are now built on the statistically relevant subset; variance across folds drops noticeably.


Architecture

Four serialized stages, one decision.

Every stage of the pipeline lives as its own joblib artifact in models/. The Streamlit app loads the four artifacts at boot, applies them in order to any wafer the operator pastes in, and returns a failure probability plus the PASS / FAIL verdict against the live threshold.

secom: ~/inference-lifecycle

streamlit@operator:/$ input = wafer (1, 590) sensor readings

─── load .pkl artifacts (boot) ──────────────────────────

mustakim@portfolio:~$ imputer = joblib.load('models/imputer.pkl')

mustakim@portfolio:~$ scaler = joblib.load('models/scaler.pkl')

mustakim@portfolio:~$ l1 = joblib.load('models/feature_selector.pkl')

mustakim@portfolio:~$ model = joblib.load('models/xgb_model.pkl')  # actually a RandomForest

─── per-prediction path ────────────────────────────────

[1] impute: median fill on missing sensors → (1, 590)

[2] scale: standardize (mean=0, std=1) → (1, 590)

[3] L1 select: drop ~80% noise dimensions → (1, 113)

[4] predict: RandomForestClassifier.predict_proba → P(fail)

[5] threshold: P(fail) ≥ ui.threshold → FAIL, else PASS

mustakim@portfolio:~$ return { verdict: 'FAIL', p_fail: 0.42, threshold: 0.35 }

Figure 6.0: Inference lifecycle.
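The five-step path in the diagram can be sketched as one function. This is a hedged illustration, not the code in app.py: predict_wafer and Decision are hypothetical names, and the stand-in classes at the bottom exist only so the sketch runs without the real .pkl artifacts. Any fitted objects exposing transform and predict_proba (such as the four joblib artifacts) could be passed instead, assuming the failure class is the second probability column.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    verdict: str
    p_fail: float
    threshold: float


def predict_wafer(readings, imputer, scaler, selector, model, threshold=0.35):
    """Apply the four fitted stages in order, then compare P(fail) with the threshold."""
    batch = [list(readings)]                              # one wafer, shape (1, n_sensors)
    batch = imputer.transform(batch)                      # [1] fill missing sensors
    batch = scaler.transform(batch)                       # [2] standardize
    batch = selector.transform(batch)                     # [3] keep the selected columns
    p_fail = float(model.predict_proba(batch)[0][1])      # [4] second column = failure class
    verdict = "FAIL" if p_fail >= threshold else "PASS"   # [5] threshold comes from the UI
    return Decision(verdict, p_fail, threshold)


# Stand-ins so the sketch runs without the real artifacts (illustration only).
class _Identity:
    def transform(self, batch):
        return batch


class _KeepFirstTwo:
    def transform(self, batch):
        return [row[:2] for row in batch]


class _FixedModel:
    def predict_proba(self, batch):
        return [[0.58, 0.42] for _ in batch]


stages = (_Identity(), _Identity(), _KeepFirstTwo(), _FixedModel())
default = predict_wafer([0.1, 0.2, 0.3], *stages)
strict = predict_wafer([0.1, 0.2, 0.3], *stages, threshold=0.50)
print(default.verdict, strict.verdict)
# prints: FAIL PASS
```

The same probability of 0.42 flips from FAIL to PASS when the threshold moves from 0.35 to 0.50, which is the behavior the slider exposes.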

Training recipe (RF + L1)

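# Imports for the recipe below (scikit-learn)
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# X, y: training split. X_val: held-out split with the same columns
# dropped as in step 1. drop_constant_features is a helper not shown here.

# 1. Preprocess (keep the fitted transformers so X_val can reuse them)
X = drop_constant_features(X)           # 590 → ~474
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X = scaler.fit_transform(imputer.fit_transform(X))

# 2. L1 feature selection
selector = SelectFromModel(
    LogisticRegression(
        penalty='l1', solver='liblinear', C=0.1
    ),
    threshold='median'
)
X = selector.fit_transform(X, y)        # ~474 → ~113

# 3. Random Forest
clf = RandomForestClassifier(
    n_estimators=500,
    class_weight='balanced',
    max_depth=6,
    random_state=42,
)
clf.fit(X, y)

# 4. Tune threshold for recall (X_val goes through the same fitted stages)
X_val = selector.transform(scaler.transform(imputer.transform(X_val)))
proba = clf.predict_proba(X_val)[:, 1]  # P(fail), assuming failure is the second class
threshold = 0.35    # cost-asymmetry default
y_pred = (proba >= threshold).astype(int)   # 1 = FAIL, 0 = PASS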

Why each choice (from the README)

L1 before RF
  noise dimensions cause unstable splits;
  Lasso narrows feature space before trees.

RF over XGBoost
  XGBoost + SMOTE over-corrects, predicting
  80%+ failure rate; RF + class_weight handles
  the 14:1 ratio gracefully at this dataset size.

Adjustable threshold
  no universal cutoff — costs differ per fab.
  Streamlit slider lets the operator tune live.

Recall as the primary metric
  cost asymmetry: a missed defect is far more
  expensive than a false alarm.

Figure 6.1: The four key design decisions, verbatim.

Final Designs

The operator UI is one slider and one decision.

The Streamlit app at secom/app/app.py opens with the threshold slider front and center, a wafer-input pane, and a verdict tile that switches between green PASS and amber FAIL depending on where P(fail) lands relative to the slider. Below that: a per-feature contribution panel and the historical confusion matrix from the holdout set.

Streamlit landing page — wafer input pane with 590 sensor values and FAIL verdict tile at threshold 0.35

Figure 7.0: Streamlit landing, slider + input pane.

Threshold sweep visualization showing live recall, precision, and false-alarm rate curves as the decision slider moves

Figure 7.1: Threshold sweep, live holdout-set tradeoffs.

Holdout confusion matrix at threshold 0.35: TN=238, FP=55, FN=5, TP=16 with recall 0.76 and ROC-AUC 0.81

Figure 7.2: Confusion matrix at threshold 0.35.
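The headline numbers follow directly from the four holdout counts in the confusion matrix. A short worked check, using only the counts reported above (variable names are illustrative):

```python
# Holdout counts at threshold 0.35, as reported in the confusion matrix.
TN, FP, FN, TP = 238, 55, 5, 16

recall = TP / (TP + FN)               # share of real failures that were caught
precision = TP / (TP + FP)            # share of FAIL verdicts that were real failures
false_alarm_rate = FP / (FP + TN)     # share of good wafers flagged anyway
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"false_alarm_rate={false_alarm_rate:.3f} accuracy={accuracy:.3f}")
# prints: recall=0.762 precision=0.225 false_alarm_rate=0.188 accuracy=0.809
```

Accuracy here is lower than the all-PASS baseline would score, yet 16 of 21 failures are caught, which is the trade the project is built around.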

Top L1-selected features ranked by coefficient magnitude — 113 sensors with sign of contribution and absolute coefficient value

Figure 7.3: Top L1 features, 113 sensors ranked by coefficient magnitude.

Retrospective

What I'd keep, what I'd model differently.

Worked

L1 → trees beats trees alone.

On a 590-feature dataset with ~100 failure rows, narrowing the input space before fitting trees materially stabilized cross-validation scores and improved generalization.

class_weight='balanced' beat SMOTE here.

Built-in class weighting matched the cost-asymmetry without inventing synthetic minority points. SMOTE overcorrected and pushed predicted failure rates well above the true base rate.

Exposing the threshold is the right UX.

A live slider gives the operator more value than any single 'optimal' threshold. Cost asymmetry differs across fabs; the model can be the same.

Didn't

Precision is still low.

At 0.76 recall the precision sits at 0.23 — for every real failure caught, ~3 false alarms get flagged. Acceptable when missed defects are the expensive failure mode, but it's still the obvious axis to improve.

No probability calibration check.

The predict_proba output is treated as calibrated, but I didn't run a reliability diagram. For a threshold-tuning UI to be meaningful, the probabilities should be calibrated (Platt or isotonic) on a held-out fold.
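The missing check could look something like the sketch below: bin the held-out probabilities and compare each bin's mean prediction with the observed failure rate. This is only a diagnostic in plain Python on invented scores, not a calibration method; reliability_table is a hypothetical helper name, and the bin count is an arbitrary choice.

```python
def reliability_table(probabilities, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed failure rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, labels):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_p, observed, len(members)))
    return table


# Toy held-out scores, invented for illustration (1 = failing wafer).
probabilities = [0.05, 0.10, 0.15, 0.30, 0.35, 0.55, 0.65, 0.90]
labels = [0, 0, 0, 0, 1, 1, 0, 1]

table = reliability_table(probabilities, labels)
for mean_p, observed, count in table:
    print(f"predicted ~{mean_p:.2f}  observed {observed:.2f}  n={count}")
```

For well-calibrated probabilities the two columns track each other; large gaps would suggest fitting Platt or isotonic scaling on a held-out fold before trusting the slider values as probabilities.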

The artifact filename is a lie.

The serialized model file is named xgb_model.pkl but actually contains a RandomForestClassifier — a leftover from earlier iterations. Harmless but technically misleading. Worth renaming on the next pass.

Next

Calibrate and recalibrate.

Run isotonic or Platt scaling on a calibration fold and re-export the model. Operator-facing probabilities should mean what they say.

Try a gradient-boosted tree without SMOTE.

LightGBM or XGBoost with built-in class weighting instead of SMOTE (class_weight in LightGBM's scikit-learn API, scale_pos_weight in XGBoost) might split the difference between RF's recall and XGBoost's nominally tighter splits. Same training surface, fairer comparison.

Push the threshold-sweep into the UI.

The Streamlit slider already moves; what's missing is a live ROC curve that highlights where the slider currently sits. Cheap visualization win, high decision-quality return.
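The data behind such a plot is cheap to compute. The sketch below builds the ROC points and the single point for the current slider position from held-out scores; the scores are invented, the function names are hypothetical, and the plotting itself (for example with matplotlib inside Streamlit) is left out.

```python
def operating_point(scores, labels, threshold):
    """(false_alarm_rate, recall) when wafers with score >= threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    recall = tp / positives if positives else 0.0
    false_alarm_rate = fp / negatives if negatives else 0.0
    return false_alarm_rate, recall


def roc_curve_points(scores, labels, steps=20):
    """(threshold, false_alarm_rate, recall) on an evenly spaced threshold grid."""
    thresholds = [i / steps for i in range(steps + 1)]
    return [(t, *operating_point(scores, labels, t)) for t in thresholds]


# Toy holdout scores, invented for illustration (1 = failing wafer).
scores = [0.10, 0.20, 0.30, 0.40, 0.42, 0.45, 0.47, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 0, 0, 1, 1]

curve = roc_curve_points(scores, labels)            # points to draw as the ROC curve
slider_value = 0.35                                 # would come from the UI slider
marker = operating_point(scores, labels, slider_value)
print(marker)
# prints: (0.5, 1.0)
```

Recomputing only the marker on each slider move keeps the interaction cheap, since the curve itself depends on the holdout set and not on the slider.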
