Skip to content

NFRs And Design Decisions#

Every NFR forces specific design decisions — and every design decision costs something. State both.


Availability#

The system must stay up even when components fail.

Forces:
  Redundancy at every layer     → no single point of failure
  Multi-AZ deployment           → survive data center failure
  Active-Passive failover       → standby takes over automatically (CP systems)
  Active-Active                 → all nodes serve traffic (AP systems only)
  Load balancer with health checks → detects and routes around failures
  Auto failover                 → no manual intervention when a node dies

Trade-off:

More redundancy = more cost + more operational complexity
Active-Active = availability but conflicts with strong consistency

What to say:

"To meet the high availability NFR I need to eliminate every SPOF — app servers, databases, even the load balancer itself. I'd deploy across at least two availability zones with automatic failover."


Consistency#

Every read must return the latest write. No stale data.

Forces:
  Quorum reads/writes           → R + W > N, majority must agree
  Synchronous replication       → replica confirmed before returning success
  CP database                   → Postgres, Spanner — not Cassandra
  SERIALIZABLE isolation        → no phantom reads, no lost updates at DB level

Trade-off:

Strong consistency = higher latency on every write
Quorum = majority of nodes must respond — slower, less available during partitions

What to say:

"This is financial data — wrong balance is catastrophic. I need strong consistency via quorum writes and synchronous replication. I'm accepting higher latency as the trade-off."


Latency#

The system must respond fast — typically P99 < 200ms for user-facing systems.

Forces:
  Caching (Redis, Memcached)    → serve from memory, skip DB entirely
  CDN                           → serve static content from edge, close to user
  Read replicas                 → spread read load, read from nearest replica
  Async processing              → don't make user wait for non-critical work
                                   return response, process in background
  Denormalization               → precompute joins, avoid expensive queries at read time

Trade-off:

Cache = fast but potentially stale → eventual consistency
Read replicas = fast but replica lag → stale reads possible
Async = responsive but user doesn't get confirmation immediately

What to say:

"The latency NFR forces me towards caching and read replicas. I'm accepting eventual consistency as the trade-off — for a social feed, slightly stale data is fine."


Throughput / Scalability#

The system must handle massive load — millions of requests per second, hundreds of millions of users.

Forces:
  Horizontal scaling            → more app servers, stateless services
  Database sharding             → split data across multiple DB nodes by key
  Partitioning                  → user_id, region — each partition handles a subset
  Read replicas                 → offload reads from primary
  Message queues (Kafka)        → absorb write spikes, process at DB's own pace
  Batching                      → group small writes into larger ones
  Caching                       → reduce DB hits entirely

Trade-off:

Sharding = cross-shard queries are expensive, joins across partitions painful
Queues = writes are async, user doesn't get immediate confirmation
More replicas = more cost

What to say:

"At 500M users the DB is the bottleneck. I'd shard by user_id and put Kafka in front of writes to absorb spikes. Cross-shard queries become expensive — I'd denormalize to avoid them."


Durability#

Data must survive crashes, power loss, disk failure, and data center outages.

Forces:
  WAL (Write-Ahead Log)         → write to log before anything else
                                   crash mid-write? replay the log on recovery
  Replication factor 3+         → one copy isn't enough, three is minimum
  Synchronous replication       → RPO = 0, no data loss on failover
  Cross-region backup           → data center gone? another region has it
  Regular backups               → full + incremental, point-in-time recovery

Trade-off:

Synchronous replication = higher write latency
Cross-region replication = higher cost + network latency
More replicas = more storage cost

What to say:

"Durability is non-negotiable here. I'd use WAL for crash safety, replication factor 3 across availability zones, and async cross-region backups for disaster recovery."


Security#

Data must be protected from unauthorized access, interception, and abuse.

Forces:
  Authentication                → JWT / OAuth2 — verify identity on every request
  Authorization                 → RBAC — verify permissions after identity confirmed
  Encryption in transit         → TLS/HTTPS — data can't be read if intercepted
  Encryption at rest            → AES-256 — DB compromised? data is unreadable
  Rate limiting                 → prevent abuse, brute force at API gateway level
  Input validation              → SQL injection, XSS — validate at system boundaries

Trade-off:

Encryption = slight CPU overhead on every request
Rate limiting = legitimate users may get throttled under high load

What to say:

"Since this handles sensitive data I'd enforce TLS in transit, encryption at rest, JWT-based auth, and rate limiting on all public endpoints."


The NFR → Decision → Trade-off Pattern#

This three-step move is what separates strong hire answers from average ones.

Step 1: State the NFR
  "The NFR here is low latency — feed must load under 200ms"

Step 2: State what it forces
  "That forces me towards caching and read replicas"

Step 3: State the trade-off
  "I'm accepting eventual consistency — for a feed, slightly stale data is fine"

Never just say "I'll use Redis." Say why, and say what you're giving up.