Thuan: Alex, here’s something that blows my mind. Netflix has over 200 million subscribers. At peak time — like Sunday night — millions of people are streaming video simultaneously. How does that even work? If I put that much load on my server, it would catch fire.
Alex: It would literally melt. But here’s the thing — Netflix doesn’t use “a server.” They use thousands of servers, working together, spread across the planet. And the architecture behind it is what we call “distributed systems at scale.” It sounds fancy, but the concepts are actually pretty intuitive once you see them.
Thuan: Walk me through it. Pretend I’m designing a streaming service from scratch.
Alex: Perfect. Let’s start with the most basic version. You have one server. Users connect to it. It sends them video. This works for 100 users. Maybe 1,000. But what happens when user number 10,001 connects?
Thuan: The server runs out of memory, CPU, or bandwidth. Requests start failing. Users see errors.
Alex: Right. You’ve hit a wall. And there are only two ways to get past that wall.
Scaling Up vs. Scaling Out
Alex: Scaling up means buying a bigger server. More CPU, more RAM, more bandwidth. It’s like replacing your sedan with a bus. Simple, but there’s a limit. You can’t buy a server with a million CPUs.
Thuan: And it’s expensive. The jump from a regular server to a high-performance one isn’t linear. It might cost 10 times more for just 2 times the capacity.
Alex: Exactly. So option two: scaling out. Instead of one big server, you use many small servers. Instead of one bus, you have a fleet of cars. Each car carries some passengers. Need more capacity? Add more cars.
Thuan: That’s horizontal scaling. But now you have a new problem — how do users know which server to connect to?
Alex: That’s where the load balancer comes in.
Load Balancers: The Traffic Cop
Alex: Imagine a busy intersection with five roads leading to five parking garages. Without traffic direction, everyone tries to go to the nearest garage. Garage 1 is packed, garages 4 and 5 are empty. Chaos.
Thuan: So the load balancer is like a traffic cop standing at the intersection?
Alex: Exactly. Every request comes to the load balancer first. The load balancer decides which server to send it to. The simplest approach is round robin — request 1 goes to server A, request 2 goes to server B, request 3 goes to server C, then back to A. Like dealing cards.
Thuan: But what if server A is already struggling and server C is idle? Round robin doesn’t account for that.
Alex: Right. Smarter load balancers use least connections — send the request to whichever server currently has the fewest active connections. Or weighted approaches, where beefier servers get more traffic than smaller ones. Some even check server health every few seconds and avoid sending traffic to servers that are slow or failing.
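The two strategies Alex mentions can be sketched in a few lines of Python. This is an illustrative toy, not a production load balancer; the server names and connection counts are made up.

```python
from itertools import cycle

# Hypothetical server pool; names are invented for illustration.
servers = ["server-a", "server-b", "server-c"]

# Round robin: deal requests out in order, like dealing cards.
round_robin = cycle(servers)

def pick_round_robin():
    return next(round_robin)

# Least connections: track active connections, pick the least-loaded server.
active_connections = {"server-a": 12, "server-b": 3, "server-c": 7}

def pick_least_connections():
    return min(active_connections, key=active_connections.get)

print([pick_round_robin() for _ in range(4)])  # a, b, c, then back to a
print(pick_least_connections())                # server-b: fewest connections
```

A real load balancer would also run the health checks Alex describes, dropping a server from the pool when it stops responding.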
Thuan: What happens if the load balancer itself fails?
Alex: Great question. That’s a single point of failure. In production, you run multiple load balancers with automatic failover. If load balancer A dies, load balancer B takes over. Think of it as having two traffic cops who take shifts.
Caching: Don’t Cook the Same Meal Twice
Thuan: OK, so we’ve got multiple servers behind a load balancer. What’s next?
Alex: Caching. This is probably the single most impactful optimization in system design. Here’s the analogy. Imagine you run a restaurant, and the most popular dish is a Caesar salad. Every day, 500 people order it. Do you prepare each salad from scratch every time? Go to the garden, pick the lettuce, make the dressing…
Thuan: Of course not. You’d prep a big batch of dressing in the morning, wash and chop the lettuce in advance, and assemble quickly when ordered.
Alex: That’s caching. You pre-compute or store frequently accessed data so you don’t have to fetch it from the database every time. The database is the garden. The cache is the prep station.
Thuan: What tools do people use for caching?
Alex: Redis is the most popular. It’s an in-memory data store — incredibly fast because data lives in RAM instead of on disk. It can respond in under a millisecond. Compare that to a database query that might take 10 to 100 milliseconds.
Thuan: Where does caching fit in the architecture?
Alex: Multiple layers. Browser cache — the user’s browser stores static files locally. CDN cache — servers distributed globally cache content close to users. Application cache — your server checks Redis before hitting the database. Database cache — the database itself caches frequently accessed data in memory.
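The "check Redis before hitting the database" step is usually called the cache-aside pattern. Here is a minimal Python sketch; a plain dict stands in for Redis (real code would use a Redis client such as redis-py), and the user data and TTL are invented for illustration.

```python
import time

# A dict stands in for Redis here; real code would call a Redis client.
cache = {}
CACHE_TTL_SECONDS = 60

def slow_database_query(user_id):
    # Stand-in for a 10-100 ms database round trip.
    return {"user_id": user_id, "plan": "premium"}

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the database on a miss."""
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and entry["expires"] > time.time():
        return entry["value"]                 # cache hit: sub-millisecond
    value = slow_database_query(user_id)      # cache miss: hit the database
    cache[key] = {"value": value, "expires": time.time() + CACHE_TTL_SECONDS}
    return value

get_user(42)   # first call: miss, populates the cache
get_user(42)   # second call: served from the cache
```

The TTL matters: without an expiry, the cache would serve stale data forever after the database changes.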
Thuan: So when I visit Netflix, I’m probably never directly hitting their main servers?
Alex: Almost never. When you press play on a show, the video file is served from a CDN server physically close to you. Maybe in your city. The CDN handles over 90% of Netflix’s bandwidth. Their origin servers handle account management, recommendations, search — the intelligent stuff. But the heavy lifting — streaming actual video — is all CDN.
CDNs: Putting Your Kitchen Everywhere
Thuan: Explain CDNs to me like I don’t know what they are.
Alex: Imagine you have one amazing pizza restaurant in New York. People love it. But now you’re getting orders from London, Tokyo, and Sydney. Do you fly every pizza from New York to Sydney? That would take 20 hours. The pizza would be cold. The customer would be angry.
Thuan: So you open branch locations?
Alex: Exactly. You open a branch in London, one in Tokyo, one in Sydney. Each branch has the recipes and ingredients to make the same pizza locally. When someone in Sydney orders, the local branch makes it. Fresh. Fast. No intercontinental flight needed.
That’s a CDN. Companies like Cloudflare, AWS CloudFront, or Akamai have servers all over the world — hundreds of locations. When you deploy your website or video content, copies get distributed to these locations. When a user in Vietnam requests your page, they’re served from a server in Singapore — not from your origin server in the US.
Thuan: What kind of latency difference are we talking about?
Alex: Without a CDN, a user in Vietnam accessing a US server might experience 200 to 300 milliseconds of latency. With a CDN serving from Singapore, that drops to 20 to 40 milliseconds. For video streaming, that’s the difference between instant playback and a loading spinner.
Database Sharding: Splitting the Phone Book
Thuan: OK, so we’ve handled server scaling, load balancing, caching, and CDNs. But what about the database? Even with caching, the database is still a bottleneck eventually, right?
Alex: Absolutely. The database is often the hardest part to scale. You can add more application servers easily — they’re stateless. But the database holds all your data. You can’t just run two copies of it because they’d go out of sync.
Thuan: So what’s the solution?
Alex: Database sharding. Think of a phone book. If you have a small town, one phone book works fine. But if you have a city with 10 million people, one phone book is way too big to search through.
Thuan: So you split it by letter? A through G in book 1, H through N in book 2?
Alex: Exactly. That’s sharding. You split the data across multiple databases, each holding a subset. Common approaches: shard by user ID — users 1 through 1 million in database 1, users 1,000,001 through 2 million in database 2. Or shard by geography — European users in a European database, Asian users in an Asian database.
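Both routing schemes boil down to a function from user ID to shard number. A toy Python sketch, with invented shard counts and sizes:

```python
NUM_SHARDS = 4
USERS_PER_SHARD = 1_000_000

def shard_by_range(user_id):
    """Range sharding: users 1..1M on shard 0, the next million on shard 1, ..."""
    return (user_id - 1) // USERS_PER_SHARD

def shard_by_hash(user_id):
    """Hash sharding: spreads users evenly, but loses range locality."""
    return hash(user_id) % NUM_SHARDS

print(shard_by_range(1))          # 0
print(shard_by_range(1_500_000))  # 1
```

Range sharding keeps nearby IDs together (nice for range scans) but can create hot shards; hash sharding balances load but scatters related rows.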
Thuan: That sounds simple. What’s the catch?
Alex: Several catches. Cross-shard queries are hard. If you want to find all users who spent more than $100 last month, you need to query every shard and combine the results. That’s slow and complex. Rebalancing is painful. If shard 1 gets too big, you need to move some data to a new shard. That’s a complex migration while the system is running. And joins across shards are basically impossible. If user data is in shard 1 and order data is in shard 3, you can’t do a simple SQL join.
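The cross-shard "spent more than $100" query Alex describes is a scatter-gather: ask every shard, then merge the results. A toy Python sketch, with in-memory lists standing in for real shard databases and all the user data invented:

```python
# Hypothetical shards; in production each would be a separate database.
shards = [
    [{"user": "ann", "spent": 140}, {"user": "bo", "spent": 30}],
    [{"user": "cam", "spent": 220}],
    [{"user": "dee", "spent": 75}, {"user": "eli", "spent": 101}],
]

def big_spenders(min_spent):
    """Scatter-gather: query every shard, then combine and sort the results."""
    results = []
    for shard in shards:                        # scatter: one query per shard
        results.extend(r for r in shard if r["spent"] > min_spent)
    return sorted(results, key=lambda r: r["spent"], reverse=True)  # gather

print([r["user"] for r in big_spenders(100)])  # ['cam', 'ann', 'eli']
```

Note that the work scales with the number of shards, and the slowest shard sets the overall response time — which is exactly why cross-shard queries hurt.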
Thuan: So sharding is a last resort?
Alex: Absolutely. First, try read replicas — one master database for writes, multiple replicas for reads. Then try better indexing and query optimization. Then try caching more aggressively. Only shard when you’ve exhausted all other options. Because once you shard, you can’t easily go back.
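The read-replica setup can be sketched as a tiny query router: writes go to the master, reads are spread across replicas. The connection names here are placeholders, and real routing would inspect transactions rather than just the first SQL keyword.

```python
import random

# Placeholder names; a real router would hold actual database connections.
PRIMARY = "primary-db"
REPLICAS = ["replica-1", "replica-2", "replica-3"]

def route(query):
    """Send writes to the primary; spread reads across the replicas."""
    is_write = query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    return PRIMARY if is_write else random.choice(REPLICAS)

print(route("INSERT INTO users VALUES (1)"))  # always the primary
print(route("SELECT * FROM users"))           # one of the replicas
```

One caveat: replicas lag slightly behind the primary, so a read right after a write may see stale data. That trade-off is still far simpler than sharding.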
Putting It All Together: The Flow
Thuan: Can you walk me through a complete request? Like, what happens when I open Netflix?
Alex: Sure. Step by step.
Step 1: You open the Netflix app on your phone. Your phone sends a request to netflix.com. DNS resolves that to a load balancer IP address.
Step 2: The load balancer routes your request to one of hundreds of application servers.
Step 3: The application server checks your session. Who are you? What’s your subscription? This data might come from a cache in Redis. If it’s not in the cache, the server queries the database.
Step 4: The server builds your personalized home page — what shows to recommend. This uses machine learning models that were pre-computed offline. The recommendations are stored in a fast data store.
Step 5: The server sends back the page data. Your app renders the homepage. The images — show thumbnails, banners — are loaded from a CDN server near you.
Step 6: You click play on a show. The app requests the video stream from the CDN. Netflix has its own CDN called Open Connect with servers inside ISP networks. The video comes from a server literally inside your internet provider’s building.
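The six steps above can be condensed into a toy request handler. Everything here is invented for illustration — the CDN URL, the session token, and the stand-in database and recommendation lookups — it just shows the order of operations:

```python
CDN_BASE = "https://cdn.example-stream.com"  # hypothetical CDN hostname

# Step 3's session cache, seeded with one fake session.
session_cache = {"token-abc": {"user": "thuan", "plan": "premium"}}

def lookup_session_in_db(token):
    return {"user": "unknown", "plan": "basic"}  # stand-in for the database

def precomputed_recommendations(user):
    return ["Show A", "Show B"]  # stand-in for the offline ML results

def home_page(token):
    # Steps 3-5: session from cache (or database), recommendations,
    # and thumbnails pointed at the CDN rather than the origin servers.
    session = session_cache.get(token) or lookup_session_in_db(token)
    shows = precomputed_recommendations(session["user"])
    return {
        "user": session["user"],
        "rows": [{"title": s, "thumbnail": f"{CDN_BASE}/thumbs/{s}.jpg"}
                 for s in shows],
    }

print(home_page("token-abc")["user"])  # 'thuan', served from the session cache
```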
Thuan: Wait — Netflix puts servers inside ISPs?
Alex: Yes! They give ISPs custom servers pre-loaded with popular content. That way, when you stream a show, the data travels a few hundred meters within your ISP’s network instead of crossing the ocean. Genius, right?
Key Takeaways You Can Explain to Anyone
Thuan: OK, summary time. If someone asks me “how do big systems handle millions of users?” — what do I say?
Alex:

- Scale out, not up. Add more small servers instead of buying one giant server. Use a load balancer to distribute traffic.
- Cache everything you can. The fastest database query is the one you never make. Redis is your best friend.
- Put content close to users. CDNs serve files from locations near the user. This is the single biggest performance win for global applications.
- Shard the database as a last resort. Try read replicas, better indexes, and caching first. Sharding adds huge complexity.
- Design for failure. At scale, something is always broken. Build systems that work even when servers die. Redundancy is not optional.
Thuan: That last point — “something is always broken” — that’s the mindset shift, isn’t it? You stop trying to prevent failure and start designing to survive it.
Alex: That’s the Netflix philosophy. They literally have a tool called Chaos Monkey that randomly kills servers in production. If your system can survive random failures, it’s truly resilient.
Thuan: I love and hate Chaos Monkey at the same time.
Alex: Every engineer does. Next time — containers. I promise you, Docker is simpler than people make it sound.
This is Part 3 of the Tech Coffee Break series — casual conversations about real tech concepts, designed for listening and learning.