Engineering · Build log
December 2025 to present
Solo build · still shipping

Engineering at PrepAtlas

Hi, I'm Anirudh. I build and run PrepAtlas — a Next.js + Supabase exam-prep platform for Indian students. It's currently strongest on railway and government-job tracks, and expanding to NCERT, JEE/NEET, banking, and SSC. This page is the build log: three architecture phases (EC2 → ECS → EKS), the five decisions that shaped them, and the triggers behind each migration.

If you came here from my LinkedIn to verify I built this end-to-end: the entire production stack runs on roughly $10/mo — one small EC2 box behind nginx, Supabase Free, Route 53 — with no Vercel, no Pinecone, no Kubernetes. v1 lives on AWS today; v2 (ECS Fargate + ALB) is the containerized scale-out I'm migrating into as load justifies; v3 (EKS + service mesh) is the Kubernetes story for when there's a team. Every claim on this page is real and defensible.

Section 01

Architecture evolution · 3 phases on AWS

Three architectures. One product. Each migration is triggered by a specific load, team, or reliability requirement, not by resume-driven engineering. The diagrams below show what is live, what is being built, and what comes next.

v1 · EC2 + nginx · Live today
v2 · ECS + ALB · Migrating
v3 · EKS + mesh · Planned
v1

Lean MVP · EC2 + nginx

Live · December 2025 — present
PrepAtlas v1 architecture: web browser and Android TWA reach a single EC2 instance in ap-south-1 running nginx (HTTP/2, TLS via Let’s Encrypt, static caching) which proxies to a pm2-supervised Next.js 15 standalone process on port 3001. The Next.js process talks to Supabase (Postgres + pgvector + Auth + RLS) and the Anthropic Claude API for the AI tutor.

What serves traffic today. 6 weeks from idea to live. One VM, one process manager, one Postgres. Every web-perf optimization in scope before any infra scale-out.

  • Single EC2 t3.small in ap-south-1 (Mumbai) — Ubuntu 24.04, Node 20, pm2
  • nginx HTTP/2, immutable static caching, upstream keepalive
  • Supabase Free · Postgres + pgvector + RLS · idempotent migrations
  • Deploy: git push → SSH pull → pnpm build → pm2 reload (< 60 s, zero downtime)
v2

Containerized scale-out · ECS Fargate + ALB

Designed · migrating
PrepAtlas v2 architecture: clients reach Route 53 → CloudFront → WAF + Shield → Application Load Balancer (multi-AZ) → two ECS Fargate tasks in private subnets across AZ-1a and AZ-1b. The cluster talks to Supabase (Postgres + pgvector), ElastiCache Redis, S3 with SQS and Lambda, Secrets Manager, and CloudWatch. The Anthropic Claude API serves the AI tutor.

Triggered when traffic + reliability requirements outgrow a single VM. Multi-AZ HA, edge CDN, WAF for abuse, blue/green rollouts with automatic rollback.

  • ECS Fargate tasks across AZ-1a + AZ-1b behind an ALB
  • CloudFront at the edge, WAF for L7 rules, Secrets Manager for env
  • ElastiCache Redis for sessions + rate limiting · SQS + Lambda for async jobs
  • Deploy: GitHub Actions → ECR push → CodeDeploy blue/green · automatic rollback on alarm
v3

Kubernetes at scale · EKS + service mesh

Planned · post-PMF / team scale
PrepAtlas v3 architecture: clients reach Global Accelerator → CloudFront with Shield Advanced → an EKS cluster across 3 node groups in multiple AZs. The cluster runs separate pods for web (HPA-scaled Next.js), ai-tutor (sidecar embeddings worker on a GPU pool), and scoring (background SQS workers on Karpenter spot). App Mesh handles mTLS between services; ArgoCD GitOps deploys with Flagger canaries. Data tier: Aurora PostgreSQL Multi-AZ with cross-region replica + pgvector, ElastiCache in cluster mode. Anthropic Claude in multi-model tiered routing.

Triggered by team size, multi-region latency targets, or the need to isolate services. Pod-level autoscaling, GitOps, service mesh, multi-region data. Premature Kubernetes is the classic resume-driven mistake — v3 waits for the actual signal.

  • EKS pods with HPA + KEDA + Karpenter spot nodes
  • AWS App Mesh / Istio for service-to-service mTLS
  • ArgoCD GitOps · Flagger metric-based canary deploys per namespace
  • Aurora PostgreSQL Multi-AZ + cross-region replica · ElastiCache cluster mode
AWS pillar · Operational excellence · Reload < 60 s · idempotent migrations
AWS pillar · Security · RLS · TLS · IRSA-ready
AWS pillar · Reliability · Multi-AZ in v2 · multi-region in v3
AWS pillar · Performance · HTTP/2 · 104 KB JS · cookie-only auth
AWS pillar · Cost · ~$10/mo today · scale only when paid
Section 02

Stack

Frontend

  • Next.js 15 App Router · React 19 · TypeScript
  • Tailwind CSS v3 · shadcn/ui · Framer Motion · Lucide
  • React Hook Form + Zod for validated forms

Backend & data

  • Supabase: Postgres + Auth + RLS + Storage
  • PostgREST + custom RPCs for the hot paths
  • pgvector (1536-dim) on chapter notes
  • @supabase/ssr for cookie-based session handling

Mobile

  • Bubblewrap TWA: same web codebase, real APK
  • @serwist/next service worker · offline shell
  • Digital Asset Links for verified-origin trust

AI

  • Anthropic Claude API for the doubt-clearing tutor
  • Grounded on retrieved chapter chunks (RAG)
  • Prompt caching for cost + latency

AWS — running today (v1)

  • EC2 t3.small · Ubuntu 24.04 · Node 20 standalone build
  • Route 53 DNS · TLS via Let’s Encrypt + Certbot
  • Region: ap-south-1 (Mumbai) · single VM, pm2-supervised
  • nginx HTTP/2 + per-class immutable static caching

AWS — migration path (v2 → v3)

  • v2: ECS Fargate + ALB + CloudFront + WAF (designed)
  • v3: EKS + ArgoCD GitOps + App Mesh mTLS (planned)
  • ElastiCache · SQS + Lambda · CloudWatch + X-Ray
  • CodeDeploy blue/green → ArgoCD canary deploys
Section 03

Five engineering decisions

These are the ones I'd want a senior platform engineer to push back on first.

01

Grounded RAG over a plain LLM call

The AI tutor experiment was the part I least wanted to be a black box. A student asking “why is the answer C on this question about thermodynamics” deserves more than a confident hallucination.

The flow:

  1. Chapter notes live in public.content_items, each chunked at write time and embedded with a 1536-dim model. The vector lives in a vector(1536) column alongside the chunk text.
  2. When a doubt is submitted, I embed the doubt + the question text, run a cosine-similarity search on the chapter pool the student's exam covers, and take the top 5 chunks.
  3. Those chunks go to Claude with a system prompt that explicitly forbids answering outside the retrieved context. The response carries citation markers that map back to topic_ids, and the UI renders them as “→ see Chapter X” links the student can open.
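
As an illustration only (not the production code), the retrieve-then-ground flow above can be sketched in plain Python. The `Chunk` class, the `[topic:N]` marker format, and the wording of `SYSTEM_PROMPT` are assumptions made for this sketch; in production the similarity search runs inside Postgres rather than in application code.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    topic_id: int           # maps a citation back to a chapter/topic
    text: str               # chunk text stored next to the vector
    embedding: list[float]  # 1536-dim in production; any length works here


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query: list[float], pool: list[Chunk], k: int = 5) -> list[Chunk]:
    """Return the k chunks most similar to the query embedding."""
    ranked = sorted(
        pool,
        key=lambda chunk: cosine_similarity(query, chunk.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_context(chunks: list[Chunk]) -> str:
    """Prefix each chunk with a marker the model can cite (hypothetical format)."""
    return "\n\n".join(f"[topic:{c.topic_id}] {c.text}" for c in chunks)


# Hypothetical wording; the real system prompt is not shown in this log.
SYSTEM_PROMPT = (
    "Answer only from the context below. Cite the [topic:N] marker for every "
    "claim. If the context does not cover the question, say so."
)
```

The marker carries the `topic_id`, so the UI can turn a cited marker back into a chapter link without a second lookup.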
Tradeoffs I accept
  • Per-question retrieval adds ~150 ms before the Claude call. I’m not pre-fetching speculatively because the doubt set is highly variable.
  • Cosine similarity in Postgres, not a hybrid BM25 + vector rerank. At my current corpus size (single-digit thousands of chunks) recall is fine. If the corpus crosses ~100k chunks I’ll add a tsvector lexical layer and merge scores.
  • I cache the system prompt with Anthropic prompt caching (a cache_control marker on the system block). It is the cheapest reliability win I’ve shipped.
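
One way the "merge scores" step could look, if the lexical layer is ever added, is reciprocal rank fusion (RRF). This is a swapped-in technique chosen for the sketch, not a decision recorded in this log; the function name and the constant `k = 60` (a commonly used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    ranked highly by both the vector search and the lexical search rise
    to the top. Ties are broken by the smaller id for determinism.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical ids: one ranking from pgvector, one from a tsvector query.
vector_ranking = [10, 20, 30]
lexical_ranking = [20, 40, 10]
merged = reciprocal_rank_fusion([vector_ranking, lexical_ranking])
```

RRF only needs ranks, not comparable scores, which avoids normalizing cosine distance against a lexical rank score.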
02

A 200 KB mobile data budget per session

The real audience is a student in a Tier-2 Indian city on patchy 4G. I set myself a budget: a student should be able to complete a 30-question warm-up mock without burning more than 200 KB of data, ignoring images they choose to view.

That number drives a lot of unglamorous decisions:

  • Server components by default. The test runner is a client island because it needs local state for the answer palette and timer — almost everything else is rendered on the server, ships zero JS, and hydrates only what genuinely needs interactivity.
  • auth.getSession() instead of auth.getUser() in middleware. The former reads the JWT from the cookie locally; the latter hits Supabase per request. Skipping that round-trip on every navigation alone saved 100–200 ms and an unnecessary egress request. Security note: middleware uses getSession() only for the routing-layer redirect decision. Every protected page and server action re-verifies identity with getUser() (via a React-cached helper, getCachedUser) before any data access — so a forged session cookie that sneaks past the redirect still fails at the data layer.
  • Immutable caching on _next/static. The first mock pays for the JS bundle once; the second mock the student takes is mostly a few KB of JSON answers.
  • No third-party JS on the auth shell. No analytics, no chat widget, no webfonts loaded synchronously. The full Inter family is self-hosted from one woff2 with font-display: swap.
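
The budget arithmetic can be made concrete with a small checker. Only the 200 KB budget and the 104 KB first-load JS figure come from this page; the HTML shell size and per-question payload below are hypothetical numbers chosen for illustration.

```python
from dataclasses import dataclass

BUDGET_KB = 200.0  # per-session budget stated in the text


@dataclass
class Transfer:
    name: str
    size_kb: float
    immutable: bool  # True for long-cached assets such as _next/static


def session_kb(transfers: list[Transfer], warm_cache: bool) -> float:
    """Sum bytes on the wire; immutable assets cost nothing once cached."""
    return sum(
        t.size_kb for t in transfers if not (warm_cache and t.immutable)
    )


mock_session = [
    Transfer("first-load JS", 104.0, immutable=True),        # from next build
    Transfer("HTML shell", 20.0, immutable=False),           # hypothetical
    Transfer("30 questions JSON", 30 * 1.5, immutable=False),  # hypothetical
]

first_mock_kb = session_kb(mock_session, warm_cache=False)
second_mock_kb = session_kb(mock_session, warm_cache=True)
```

Under these assumed sizes the first mock lands at 169 KB and a repeat mock at 65 KB, both inside the budget; the point of the checker is that any new client bundle shows up immediately as a budget regression.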

It is occasionally annoying — I've had to walk back a client-component implementation more than once — but the constraint forced a saner architecture than I'd have written without it.

03

TWA, not React Native, for the Android app

I built the Android app as a Trusted Web Activity using Bubblewrap. The package is in.prepatlas.app and the APK is ~3 MB.

Why not React Native or a Capacitor wrap:

  • One codebase, one deploy. A site push lands in the app within seconds of pm2 reload. There is no app-store update cycle for anything outside the Android container itself.
  • Same SSR, same server actions, same auth cookies. The TWA is Chrome pretending to be an app, so every server-side optimization on the web shows up on mobile for free.
  • Honest tradeoffs. I don't have native APIs (camera, biometrics, push). When I need them — most likely for proctored mocks — I'll either add a thin Capacitor bridge or move that subset to native. The TWA covers 98% of the surface today.

The non-obvious work was Digital Asset Links: the assetlinks JSON at /.well-known/assetlinks.json is the only thing that hides Chrome's URL bar in the TWA. Get the SHA-256 fingerprint wrong and the app looks like a website. I have a release-time check that verifies the served JSON matches the production keystore.
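
A release-time check of this kind could look roughly like the sketch below. It assumes the served file has already been fetched as text; the function name and the placeholder fingerprint are hypothetical, and the real check is not shown in this log. The JSON shape (`relation`, `target.namespace`, `target.package_name`, `target.sha256_cert_fingerprints`) follows the Digital Asset Links statement format.

```python
import json

HANDLE_ALL_URLS = "delegate_permission/common.handle_all_urls"


def assetlinks_matches(assetlinks_json: str, package_name: str, fingerprint: str) -> bool:
    """Return True if any statement grants URL handling to this app and key."""
    wanted = fingerprint.strip().upper()
    for statement in json.loads(assetlinks_json):
        target = statement.get("target", {})
        if target.get("namespace") != "android_app":
            continue
        if target.get("package_name") != package_name:
            continue
        if HANDLE_ALL_URLS not in statement.get("relation", []):
            continue
        served = {fp.strip().upper() for fp in target.get("sha256_cert_fingerprints", [])}
        if wanted in served:
            return True
    return False


# Placeholder fingerprint, not a real key.
SAMPLE = """[{
  "relation": ["delegate_permission/common.handle_all_urls"],
  "target": {
    "namespace": "android_app",
    "package_name": "in.prepatlas.app",
    "sha256_cert_fingerprints": ["AA:BB:CC:DD"]
  }
}]"""

ok = assetlinks_matches(SAMPLE, "in.prepatlas.app", "aa:bb:cc:dd")
```

Comparing case-insensitively avoids a false failure when the keystore tool and the served file disagree only on letter case.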

04

Question content as a pipeline, not a CMS

PrepAtlas needs tens of thousands of questions across math, reasoning, general awareness, and science. A traditional CMS would have made me hand-write each one. Instead I built three feeders into the same public.questions table:

  1. Template generator — Python scripts in scripts/ that produce formula-driven MCQs (e.g. percentage problems with parameterized inputs and randomized answer order). About 17k of the current pool came from here.
  2. Corpus-driven generator — for general awareness and science where the answers come from a curated facts corpus. About 900 questions, fully deterministic, regenerable.
  3. Sonnet-assisted batches — for the long tail, currently around 2,100 questions. A separate Claude Code session runs a generation prompt with the topic taxonomy as context, emits JSON, and a Python importer validates the schema, dedupes by hashed prompt, and upserts via the Supabase service-role key.
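
To make the first feeder concrete, here is a minimal sketch of a template generator for percentage problems. The function name, the parameter ranges, and the distractor rules are all assumptions for illustration; the actual scripts in scripts/ are not reproduced here.

```python
import random


def percentage_question(rng: random.Random) -> dict:
    """Build one 'What is P% of N?' MCQ with shuffled options."""
    percent = rng.choice([5, 10, 20, 25, 50])
    base = rng.randrange(100, 1000, 20)  # multiples of 20 keep answers whole
    answer = base * percent // 100

    # Simple distractors: a decimal-point slip and two near misses.
    # They are always distinct from each other and from the answer.
    distractors = [answer * 10, answer + 10, answer + 20]

    options = [answer] + distractors
    rng.shuffle(options)  # randomized answer order
    return {
        "prompt": f"What is {percent}% of {base}?",
        "options": options,
        "answer": answer,
        "answer_index": options.index(answer),
    }


# Seeding the generator makes a batch fully reproducible.
batch = [percentage_question(random.Random(seed)) for seed in range(3)]
```

Passing in a seeded `random.Random` rather than using the module-level functions is what makes a batch regenerable: the same seed yields the same question.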

17k templated + 900 corpus-driven + ~2,100 Sonnet-assisted ≈ 20k total.

Every question carries a topic_id. Every topic carries an exam_id. The admin UI can override difficulty, mark items as PYQ (previous-year question), or unpublish them; everything else is generated and re-runnable. If a generator script changes, I can roll forward by regenerating that batch without touching the rest.
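
The validate-and-dedupe step of the importer can be sketched as follows. The field names in `REQUIRED_FIELDS`, the whitespace and case normalization, and the choice of SHA-256 are assumptions for this sketch; the upsert through the Supabase service-role key is left out.

```python
import hashlib

# Hypothetical schema; the real importer's required fields are not listed here.
REQUIRED_FIELDS = {"prompt", "options", "answer_index", "topic_id"}


def prompt_hash(prompt: str) -> str:
    """Hash a prompt after collapsing whitespace and case differences."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def prepare_batch(rows: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Keep rows that pass the schema check and have a prompt not seen before."""
    accepted = []
    for row in rows:
        if not REQUIRED_FIELDS.issubset(row):
            continue  # missing a required field
        if not 0 <= row["answer_index"] < len(row["options"]):
            continue  # answer index points outside the options
        digest = prompt_hash(row["prompt"])
        if digest in seen_hashes:
            continue  # duplicate of an existing or earlier question
        seen_hashes.add(digest)
        accepted.append({**row, "prompt_hash": digest})
    return accepted
```

Storing the hash alongside the row lets the upsert use it as a conflict key, so re-running a batch is a no-op rather than a source of duplicates.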

05

pgvector instead of a dedicated vector store

This is the decision I think the senior infrastructure crowd will most want to challenge, so I'll lead with the constraints.

I evaluated Pinecone, Weaviate, Qdrant, and the obvious “embed yourself” path with FAISS. The deciding factor was operational surface area: every external vector store would have meant a second source of truth, a second backup story, a second outage to monitor, and a second set of credentials in env. With pgvector:

  • The vectors live in the same Postgres that has the chapter rows. Retrieval is one query: select chunk_text, topic_id from content_items order by embedding <=> $1 limit 5; and that is the entire retrieval call.
  • RLS applies to vectors the same way it applies to every other row.
  • The nightly Supabase backup includes embeddings without thinking about it.
  • Adding pgvector cost me one extension install and one vector index (ivfflat).
Tradeoffs I accept
  • pgvector with ivfflat is not faster than a dedicated store at large scale. At my current ~10k chunks the search is sub-10 ms; in tests with up to 100k synthetic chunks I stay under 40 ms. Past that I’d add an hnsw index or move to Qdrant, but I’d want the actual signal first.
  • I don’t get a hosted UI for inspecting embeddings. I built a small admin route that lists nearest neighbors for a sample query, which has paid for itself debugging retrieval misses.
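
For sizing an ivfflat index at these corpus sizes, the pgvector README suggests starting points of rows / 1000 lists for up to 1M rows (sqrt(rows) beyond that) and sqrt(lists) probes at query time. The helper below just encodes that rule of thumb; the function name is hypothetical, and the values actually used in production are not stated in this log.

```python
import math


def ivfflat_starting_params(row_count: int) -> tuple[int, int]:
    """Return (lists, probes) starting points per the pgvector README heuristic.

    These are starting values to tune from, not guaranteed optima.
    """
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = max(1, round(math.sqrt(row_count)))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


current = ivfflat_starting_params(10_000)     # roughly today's corpus
synthetic = ivfflat_starting_params(100_000)  # the synthetic benchmark size
```

More probes raise recall at the cost of latency, which is the knob to reach for before changing index type.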
Section 04

Numbers

The section recruiters skim, so I keep it honest and load-bearing.

  • Paying users: 20+ (in beta · zero paid marketing)
  • Questions in pool: ~20,000 (maths · reasoning · GA · science)
  • Mocks published: 60 (5 exams · 4 difficulty tiers)
  • p95 server (warm): ~250 ms (measured via curl, HTTP/2)
  • First Load JS: 104 KB (shared, from next build)
  • Monthly infra: ~$10 (EC2 t3.small + Supabase Free + Route 53)
  • Deploys / week: 15+ (includes content imports · every push: SSH-pull-build-reload)
  • Downtime since launch: 0 min (pm2 reload is zero-downtime)
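
For readers who want to reproduce a p95 figure from their own timing samples, a nearest-rank percentile is enough. This is a generic sketch, not the measurement script behind the number above; the sample latencies are made up.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical warm-request latencies in milliseconds (10, 20, ..., 200).
latencies_ms = [10.0 * i for i in range(1, 21)]
p95_ms = percentile(latencies_ms, 95)
```

With few samples the p95 is effectively one of the slowest requests, so the figure only stabilizes with a few hundred timings.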
Section 05

What’s next

Two tracks running in parallel. Horizontal product expansion — SSC CGL/CHSL first, then banking (IBPS PO/Clerk, SBI PO), then state-PSC. The content pipeline already takes a (category, exam) tuple, so adding SSC is mostly seeding exams/subjects/chapters and pointing the generators at the new topic taxonomy.

Infrastructure migration — v2 components ship one at a time as load + risk justify each piece. The order I'll move in:

  1. CloudFront in front of static assets first — cheap latency win, zero risk.
  2. WAF on the login/signup paths — bot protection before paid signups land.
  3. ECS Fargate behind an ALB once vertical scaling on EC2 stops being enough.
  4. ElastiCache + Secrets Manager + CloudWatch dashboards land alongside ECS.

Each migration is one feature flag away from a rollback, which is the whole point of doing them one at a time. v3 (EKS + service mesh) waits for a team and a real multi-region requirement.

In parallel I'm building adaptive practice (weak-topic detection from past attempt accuracy auto-builds remediation drills) and laying down a proper observability layer — CloudWatch + OpenTelemetry with structured logs. pm2 logs are fine for one person; they won't scale to a team.
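
A first cut of weak-topic detection could be as simple as per-topic accuracy with a minimum sample size. The thresholds, the function name, and the `(topic_id, is_correct)` attempt shape below are assumptions for illustration; the feature is still being built and its real design may differ.

```python
from collections import defaultdict


def weak_topics(
    attempts: list[tuple[int, bool]],
    min_attempts: int = 5,
    threshold: float = 0.6,
) -> list[int]:
    """Return topic ids with accuracy below threshold, weakest first.

    Topics with fewer than min_attempts answers are skipped so that one
    unlucky question does not trigger a remediation drill.
    """
    totals: dict[int, list[int]] = defaultdict(lambda: [0, 0])  # [correct, seen]
    for topic_id, is_correct in attempts:
        totals[topic_id][0] += int(is_correct)
        totals[topic_id][1] += 1

    flagged = [
        (correct / seen, topic_id)
        for topic_id, (correct, seen) in totals.items()
        if seen >= min_attempts and correct / seen < threshold
    ]
    return [topic_id for _, topic_id in sorted(flagged)]
```

The returned order can feed the drill builder directly: take the first few topic ids and pull questions by `topic_id` from the pool.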

If you've read this far

…and want to talk infrastructure — migrations, IAM, observability, anything in the diagrams above — I'm happy to walk through it in a call. The production code is closed for now, but every claim on this page is real and defensible. — Anirudh

Currently exploring

Senior DevOps · Platform Engineer · SRE roles in Europe (Netherlands · Germany · Ireland) or fully remote. Available immediately.

Reach out: anirudhvaka@gmail.com