I am trying to understand how massive sites like Facebook or Wikipedia work, for my intellectual curiosity. I read about various techniques for building scalable sites, but I am still puzzled about one particular detail.
The part that confuses me is that ultimately, the DNS will map the entire domain to a single IP address, or a handful of IP addresses in the case of round-robin DNS.
For example, wikipedia.org has only one type-A DNS record. So, people from all over the world visiting Wikipedia have to send a request to the one IP address specified in DNS.
What is the piece of hardware that listens on the IP address for a massive site, and how can it possibly handle all the load coming from the requests from users all over the world?
Edit 1: Thanks for all the responses! Anycast seems like a feasible answer... Does anyone know of a way to check whether a particular IP address is anycast-routed, so that I could verify that this really is the trick used in practice by large sites?
Edit 2: After more reading on the topic, it appears that anycast is not typically used for dynamic web content. Anycast is usually used for UDP (e.g., DNS lookups), or sometimes for static content.
One interesting thing to note is that Facebook uses profile.ak.fbcdn.net to host static content like style sheets and javascript libraries. Each time I ping this name, I get a response from a different IP address. However, I can't tell whether this is anycast in action, or a completely different technique.
Back to my original question: as far as I can tell, even a large site will have a single expensive piece of load-balancing hardware listening on its handful of public IP addresses.
Answer
It isn't necessarily a single piece of hardware doing this, but a complete system that has been designed to scale. This encompasses not only the hardware but, more importantly, the application design, database design (relational or otherwise), networking, storage and how they all fit together.
A good starting point for finding out how some of the large sites scale is High Scalability - Start Here, along with the High Scalability articles on the Wikimedia, Facebook and Twitter architectures.
Regarding your question about DNS, single IP addresses and round-robin: these types of sites will often use load balancing as a method of presenting a single IP address. This can be done either by specialised hardware load balancers or through software running on general-purpose servers. The incoming requests to the IP address managed by the load balancer are then distributed across a pool of servers, transparently to the end user.
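To make the idea concrete, here is a minimal sketch of round-robin distribution behind one public address. It is only an illustration, not how any particular site or product implements it: the class name `RoundRobinBalancer`, the `route` method and all the IP addresses are hypothetical, and real load balancers add health checks, connection tracking, weighting and more.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: one public address, many backend servers."""

    def __init__(self, public_ip, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self.public_ip = public_ip        # the single address published in DNS
        self._backends = cycle(backends)  # endless rotation over the pool

    def route(self, request):
        """Return (backend, request), picking the next backend in rotation."""
        backend = next(self._backends)
        return backend, request


# Hypothetical addresses: 203.0.113.10 is from a documentation range,
# and the 10.0.0.x backends are private addresses.
balancer = RoundRobinBalancer("203.0.113.10", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])

# Six requests arrive at the one public IP and are spread over three servers.
assignments = [balancer.route(f"GET /page/{i}")[0] for i in range(6)]
print(assignments)
```

Clients only ever see the public address. Which backend answers a given request is decided inside the balancer, so servers can be added to or removed from the pool without touching DNS.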
For a good explanation of this topic, including a comparison of hardware and software load balancers/proxies and how they compare to DNS round robin, have a read of Load Balancing Web Applications.