Engineering · Build logDecember 2025 — presentSolo build · still shipping

Engineering at PrepAtlas

Hi, I'm Anirudh. I build and run PrepAtlas — a Next.js + Supabase exam-prep platform for Indian students. It's currently strongest on railway and government-job tracks, and expanding to NCERT, JEE/NEET, banking, and SSC. This page is the build log: three architecture phases (EC2 → ECS → EKS), the five decisions that shaped them, and the triggers behind each migration.

If you came here from my LinkedIn to verify I built this end-to-end: the entire production stack runs on roughly $10/mo — one small EC2 box behind nginx, Supabase Free, Route 53 — with no Vercel, no Pinecone, no Kubernetes. v1 lives on AWS today; v2 (ECS Fargate + ALB) is the containerized scale-out I'm migrating into as load justifies; v3 (EKS + service mesh) is the Kubernetes story for when there's a team. Every claim on this page is real and defensible.

Section 01

Architecture evolution · 3 phases on AWS

Three architectures. One product. Each migration triggered by a specific load / team / reliability requirement — not by resume-driven engineering. Diagrams below show what was, what is being built, and what comes next.

EC2 + nginx

Live today

ECS + ALB

Migrating

EKS + mesh

Planned

Lean MVP · EC2 + nginx

Live · December 2025 — present

PrepAtlas v1 architecture: web browser and Android TWA reach a single EC2 instance in ap-south-1 running nginx (HTTP/2, TLS via Let’s Encrypt, static caching) which proxies to a pm2-supervised Next.js 15 standalone process on port 3001. The Next.js process talks to Supabase (Postgres + pgvector + Auth + RLS) and the Anthropic Claude API for the AI tutor.

What serves traffic today. 6 weeks from idea to live. One VM, one process manager, one Postgres. Every web-perf optimization in scope before any infra scale-out.

Single EC2 t3.small in ap-south-1 (Mumbai) — Ubuntu 24.04, Node 20, pm2
nginx HTTP/2, immutable static caching, upstream keepalive
Supabase Free · Postgres + pgvector + RLS · idempotent migrations
Deploy: git push → SSH pull → pnpm build → pm2 reload (< 60 s, zero downtime)

Containerized scale-out · ECS Fargate + ALB

Designed · migrating

PrepAtlas v2 architecture: clients reach Route 53 → CloudFront → WAF + Shield → Application Load Balancer (multi-AZ) → two ECS Fargate tasks in private subnets across AZ-1a and AZ-1b. The cluster talks to Supabase (Postgres + pgvector), ElastiCache Redis, S3 with SQS and Lambda, Secrets Manager, and CloudWatch. The Anthropic Claude API serves the AI tutor.

Triggered when traffic + reliability requirements outgrow a single VM. Multi-AZ HA, edge CDN, WAF for abuse, blue/green rollouts with automatic rollback.

ECS Fargate tasks across AZ-1a + AZ-1b behind an ALB
CloudFront at the edge, WAF for L7 rules, Secrets Manager for env
ElastiCache Redis for sessions + rate limiting · SQS + Lambda for async jobs
Deploy: GitHub Actions → ECR push → CodeDeploy blue/green · automatic rollback on alarm

Kubernetes at scale · EKS + service mesh

Planned · post-PMF / team scale

PrepAtlas v3 architecture: clients reach Global Accelerator → CloudFront with Shield Advanced → an EKS cluster across 3 node groups in multiple AZs. The cluster runs separate pods for web (HPA-scaled Next.js), ai-tutor (sidecar embeddings worker on a GPU pool), and scoring (background SQS workers on Karpenter spot). App Mesh handles mTLS between services; ArgoCD GitOps deploys with Flagger canaries. Data tier: Aurora PostgreSQL Multi-AZ with cross-region replica + pgvector, ElastiCache cluster with DAX. Anthropic Claude in multi-model tiered routing.

Triggered by team size, multi-region latency targets, or the need to isolate services. Pod-level autoscaling, GitOps, service mesh, multi-region data. Premature Kubernetes is the classic resume-driven mistake — v3 waits for the actual signal.

EKS pods with HPA + KEDA + Karpenter spot nodes
AWS App Mesh / Istio for service-to-service mTLS
ArgoCD GitOps · Flagger metric-based canary deploys per namespace
Aurora PostgreSQL Multi-AZ + cross-region replica · ElastiCache cluster mode

AWS pillar

Operational excellence

Reload < 60 s · idempotent migrations

AWS pillar

Security

RLS · TLS · IRSA-ready

AWS pillar

Reliability

Multi-AZ in v2 · multi-region in v3

AWS pillar

Performance

HTTP/2 · 104 KB JS · cookie-only auth

AWS pillar

Cost

$25/mo today · scale only when paid

Section 02

Stack

Frontend

·Next.js 15 App Router · React 19 · TypeScript
·Tailwind CSS v3 · shadcn/ui · Framer Motion · Lucide
·React Hook Form + Zod for validated forms

Backend & data

·Supabase: Postgres + Auth + RLS + Storage
·PostgREST + custom RPCs for the hot paths
·pgvector (1536-dim) on chapter notes
·@supabase/ssr for cookie-based session handling

Mobile

·Bubblewrap TWA — same web codebase, real APK
·@serwist/next service worker · offline shell
·Digital Asset Links for verified-origin trust

AI

·Anthropic Claude API for the doubt-clearing tutor
·Grounded on retrieved chapter chunks (RAG)
·Prompt caching for cost + latency

AWS — running today (v1)

·EC2 t3.small · Ubuntu 24.04 · Node 20 standalone build
·Route 53 DNS · TLS via Let’s Encrypt + Certbot
·Region: ap-south-1 (Mumbai) · single VM, pm2-supervised
·nginx HTTP/2 + per-class immutable static caching

AWS — migration path (v2 → v3)

·v2: ECS Fargate + ALB + CloudFront + WAF (designed)
·v3: EKS + ArgoCD GitOps + App Mesh mTLS (planned)
·ElastiCache · SQS + Lambda · CloudWatch + X-Ray
·CodeDeploy blue/green → ArgoCD canary deploys

Section 03

Five engineering decisions

These are the ones I'd want a senior platform engineer to push back on first.

Grounded RAG over a plain LLM call

The AI tutor experiment was the part I least wanted to be a black box. A student asking “why is the answer C on this question about thermodynamics” deserves more than a confident hallucination.

The flow:

Chapter notes live in public.content_items, each chunked at write time and embedded with a 1536-dim model. The vector lives in a vector(1536) column alongside the chunk text.
When a doubt is submitted, I embed the doubt + the question text, run a cosine-similarity search on the chapter pool the student's exam covers, and take the top 5 chunks.
Those chunks go to Claude with a system prompt that explicitly forbids answering outside the retrieved context. The response carries citation markers that map back to topic_ids, and the UI renders them as “→ see Chapter X” links the student can open.

Tradeoffs I accept

Per-question retrieval adds ~150 ms before the Claude call. I’m not pre-fetching speculatively because the doubt set is highly variable.
Cosine similarity in Postgres, not a hybrid BM25 + vector rerank. At my current corpus size (single-digit thousands of chunks) recall is fine. If the corpus crosses ~100k chunks I’ll add a tsvector lexical layer and merge scores.
I cache the system prompt with the Anthropic prompt-cache headers — the cheapest reliability win I’ve shipped.

A 200 KB mobile data budget per session

The real audience is a student in a Tier-2 Indian city on patchy 4G. I set myself a budget: a student should be able to complete a 30-question warm-up mock without burning more than 200 KB of data, ignoring images they choose to view.

That number drives a lot of unglamorous decisions:

Server components by default. The test runner is a client island because it needs local state for the answer palette and timer — almost everything else is rendered on the server, ships zero JS, and hydrates only what genuinely needs interactivity.
auth.getSession() instead of auth.getUser() in middleware. The former reads the JWT from the cookie locally; the latter hits Supabase per request. Skipping that round-trip on every navigation alone saved 100–200 ms and an unnecessary egress request. Security note: middleware uses getSession() only for the routing-layer redirect decision. Every protected page and server action re-verifies identity with getUser() (via a React-cached helper, getCachedUser) before any data access — so a forged session cookie that sneaks past the redirect still fails at the data layer.
Immutable caching on _next/static. First mock is hot; the second mock the student takes is mostly a few KB of JSON answers.
No third-party JS on the auth shell. No analytics, no chat widget, no webfonts loaded synchronously. The full Inter family is self-hosted from one woff2 with font-display: swap.

It is occasionally annoying — I've had to walk back a client-component implementation more than once — but the constraint forced a saner architecture than I'd have written without it.

TWA, not React Native, for the Android app

I built the Android app as a Trusted Web Activity using Bubblewrap. The package is in.prepatlas.app and the APK is ~3 MB.

Why not React Native or a Capacitor wrap:

One codebase, one deploy. A site push lands in the app within seconds of pm2 reload. There is no app-store-update cycle for everything that isn't the Android container itself.
Same SSR, same server actions, same auth cookies. The TWA is Chrome pretending to be an app, so every server-side optimization on the web shows up on mobile for free.
Honest tradeoffs. I don't have native APIs (camera, biometrics, push). When I need them — most likely for proctored mocks — I'll either add a thin Capacitor bridge or move that subset to native. The TWA covers 98% of the surface today.

The non-obvious work was Digital Asset Links: the assetlinks JSON at /.well-known/assetlinks.json is the only thing that hides Chrome's URL bar in the TWA. Get the SHA-256 fingerprint wrong and the app looks like a website. I have a release-time check that verifies the served JSON matches the production keystore.

Question content as a pipeline, not a CMS

PrepAtlas needs tens of thousands of questions across math, reasoning, general awareness, and science. A traditional CMS would have made me hand-write each one. Instead I built three feeders into the same public.questions table:

Template generator — Python scripts in scripts/ that produce formula-driven MCQs (e.g. percentage problems with parameterized inputs and randomized answer order). About 17k of the current pool came from here.
Corpus-driven generator — for general awareness and science where the answers come from a curated facts corpus. About 900 questions, fully deterministic, regenerable.
Sonnet-assisted batches — for the long tail, currently around 2,100 questions. A separate Claude Code session runs a generation prompt with the topic taxonomy as context, emits JSON, and a Python importer validates the schema, dedupes by hashed prompt, and upserts via the Supabase service-role key.

17k templated + 900 corpus-driven + ~2,100 Sonnet-assisted ≈ 20k total.

Every question carries a topic_id. Every topic carries an exam_id. The admin UI can override difficulty, mark items as PYQ (previous-year question), or unpublish them; everything else is generated and re-runnable. If a generator script changes, I can roll forward by regenerating that batch without touching the rest.

pgvector instead of a dedicated vector store

This is the decision I think the senior infrastructure crowd will most want to challenge, so I'll lead with the constraints.

I evaluated Pinecone, Weaviate, Qdrant, and the obvious “embed yourself” path with FAISS. The deciding factor was operational surface area: every external vector store would have meant a second source of truth, a second backup story, a second outage to monitor, and a second set of credentials in env. With pgvector:

The vectors live in the same Postgres that has the chapter rows. A join is select chunk_text, topic_id from content_items order by embedding <=> $1 limit 5; — that's the entire retrieval call.
RLS applies to vectors the same way it applies to every other row.
The nightly Supabase backup includes embeddings without thinking about it.
Adding pgvector cost me one extension install and one composite index.

Tradeoffs I accept

pgvector with ivfflat is not faster than a dedicated store at large scale. At my current ~10k chunks the search is sub-10ms; up to 100k synthetic chunks I stay under 40ms. Past that I’d add an hnsw index or move to Qdrant — but I’d want the actual signal first.
I don’t get a hosted UI for inspecting embeddings. I built a small admin route that lists nearest neighbors for a sample query, which has paid for itself debugging retrieval misses.

Section 04

Numbers

The section recruiters skim, so I keep it honest and load-bearing.

Paying users

20+

in beta · zero paid marketing

Questions in pool

~20,000

maths · reasoning · GA · science

Mocks published

5 exams · 4 difficulty tiers

p95 server (warm)

~250 ms

measured via curl, HTTP/2

First Load JS

104 KB

shared, from next build

Monthly infra

~$10

EC2 t3.small + Supabase Free + Route 53

Deploys / week

15+

includes content imports · every push: SSH-pull-build-reload

Downtime since launch

0 min

pm2 reload is zero-downtime

Section 05

What’s next

Two tracks running in parallel. Horizontal product expansion — SSC CGL/CHSL first, then banking (IBPS PO/Clerk, SBI PO), then state-PSC. The content pipeline already takes a (category, exam) tuple, so adding SSC is mostly seeding exams/subjects/chapters and pointing the generators at the new topic taxonomy.

Infrastructure migration — v2 components ship one at a time as load + risk justify each piece. The order I'll move in:

CloudFront in front of static assets first — cheap latency win, zero risk.
WAF on the login/signup paths — bot protection before paid signups land.
ECS Fargate behind an ALB once vertical scaling on EC2 stops being enough.
ElastiCache + Secrets Manager + CloudWatch dashboards land alongside ECS.

Each migration is a feature flag away from being a rollback, which is the whole point of doing them one at a time. v3 (EKS + service mesh) waits for a team and a real multi-region requirement — premature Kubernetes is the classic resume-driven mistake.

In parallel I'm building adaptive practice (weak-topic detection from past attempt accuracy auto-builds remediation drills) and laying down a proper observability layer — CloudWatch + OpenTelemetry with structured logs. pm2 logs are fine for one person; they won't scale to a team.