We have two servers:
host1.example.com
host2.example.com
One of these acts as the primary web server for our two main websites, example.com and example2.com. The other acts as a backup, which we can switch traffic to by changing the DNS records.
example.com and example2.com are two separate sites, but each relies on the other's APIs. Pages on example.com regularly make curl requests to endpoints of the form https://example2.com/api/endpointa, and pages on example2.com make curl requests to endpoints of the form https://example.com/api/endpointb. These curl requests are made from back-end PHP code.
Until recently, this all worked without issue. Lately, however, these requests have occasionally been failing. We log approximately 5 failed inter-site API requests of this sort per day, while each site makes on the order of 100k such requests per day.
Looking at the server's dom logs, no incoming requests are logged during the failures, so they are not actually reaching Apache. On the sending side, I first thought the curl requests were erroring out almost instantly with no HTTP status code, but it actually looks like they're timing out. Normally they return near instantly, but the failing ones hit (long) timeouts. Again, this only happens very intermittently.
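To tell an instant failure apart from a timeout, it can help to time a bare TCP connection attempt separately from the HTTP request. The following is a minimal sketch in Python rather than the PHP the sites actually use; the function name `probe_connect` and the default timeout are hypothetical choices for illustration.

```python
import socket
import time


def probe_connect(host, port, timeout=5.0):
    """Attempt one TCP connection; report how it ended and how long it took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        # No answer at all within `timeout`: consistent with dropped packets.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The host answered immediately, but nothing is listening on the port.
        outcome = "refused"
    except OSError as exc:
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start
```

Calling something like `probe_connect("example.com", 443)` in a loop from each host and logging the non-"connected" results would show whether the failures happen at the TCP connection stage, before Apache could log anything.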
These failures are only happening for requests sent to host1, regardless of whether they originate at host1 itself, or at host2. (I tried running example.com on host1 and example2.com on host2, and vice-versa, as well as both sites on each of the two hosts, to confirm this.)
They do not appear to be a symptom of server load, as far as I can tell. CPU load and memory use are both much lower than the server has successfully handled in the past. So is the number of Apache threads (although if that were the problem, I would expect to see some indication of the request having been received in the Apache dom log and error log).
It seems like a network problem, since it's intermittent and the request never seems to reach the server, and since the servers are nearly identical but it only happens on one of them. What doesn't fit is that it happens even when both sites are hosted on the same server. In that case, I'm not sure why the request would be routed through the external network at all.
So, I'm at a bit of a loss as to what to test. When making a curl request to a site hosted on the same server, using its external domain, will the external network play a role? I.e., could it be a switch in the data center dropping packets or something like that? If not, what else could I check?
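One way to approach the same-server question is to check whether the domain resolves to an address configured on the machine itself. As far as I know, Linux normally delivers traffic for a locally assigned IP internally without putting it on the wire, but that would not hold if the public IP lives on a NAT device in front of the server. A minimal Python sketch, with hypothetical helper names:

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def is_assigned_locally(ip):
    """True if this machine can bind a socket to `ip`.

    A successful bind normally means the address is configured on a local
    interface. Assumption: the net.ipv4.ip_nonlocal_bind sysctl is off,
    otherwise binding to foreign addresses also succeeds.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, 0))
        except OSError:
            return False
    return True
```

Running `[is_assigned_locally(ip) for ip in resolve_ipv4("example.com")]` on host1 would indicate whether same-server requests could be leaving the machine at all.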
Edit: one other clue is that these failures do not coincide with the busy portion of the day, when traffic is ~double average, and much higher than the overnight level. They seem to happen just as often when traffic is low, which again suggests to me that it's something outside the server. Just trying to figure out what outside the server could cause a curl request from host1.example.com to a page on example.com hosted on that same server to be dropped.
Answer
It turns out the issue was that several times a day we download large feed files and import them into MariaDB. The downloads themselves weren't causing a problem, since they're naturally throttled by the internet. However, we also have replication set up between our servers, and each time one of these massive tables was imported, it put a large amount of data into the binary log, which was then pulled by the other servers. These network traffic spikes, which were very large because the servers are adjacent with no external bottleneck to slow the transfer, coincided with the dropped connections we were seeing.
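A correlation like this can be checked by comparing the timestamps of the logged failures with the import windows. The sketch below is only illustrative: the timestamps are invented, and the two-minute slack for replication lagging behind the import is an assumption.

```python
from datetime import datetime, timedelta


def failures_in_windows(failures, windows, slack=timedelta(minutes=2)):
    """Return the failures that fall inside any (start, end) window, widened by `slack`."""
    return [
        ts for ts in failures
        if any(start - slack <= ts <= end + slack for start, end in windows)
    ]


# Hypothetical data: invented timestamps, not taken from the real logs.
imports = [
    (datetime(2018, 1, 1, 3, 0), datetime(2018, 1, 1, 3, 10)),
    (datetime(2018, 1, 1, 15, 0), datetime(2018, 1, 1, 15, 10)),
]
failures = [
    datetime(2018, 1, 1, 3, 4),
    datetime(2018, 1, 1, 15, 11),
    datetime(2018, 1, 1, 20, 30),
]

matched = failures_in_windows(failures, imports)
print(f"{len(matched)} of {len(failures)} failures fall inside an import window")
```

If nearly all real failures land inside the windows, the replication burst is a strong suspect; a failure far outside any window would point to a second cause.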
If necessary, we can remove this database from replication and distribute the imported files another way. First though I'm going to look into whether we can throttle or de-prioritize database replication traffic so it doesn't max out the connection.
Edit: Looks like we can use this brand new MariaDB setting to throttle the binlog read speed: https://mariadb.com/kb/en/library/restricting-speed-of-reading-binlog-from-master-by-a-slave/
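As far as I can tell, the setting on that page is the `read_binlog_speed_limit` system variable, given in KB/s with 0 meaning no limit. A rough way to pick a value is to take a fraction of the link speed; the sketch below assumes a hypothetical 1 Gbit/s link between the servers and a 25% share, neither of which comes from our actual setup.

```python
def binlog_limit_kb_per_s(link_mbit_per_s, fraction):
    """Convert a share of the link bandwidth into KB/s (1 KB = 1024 bytes here)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    bytes_per_s = link_mbit_per_s * 1_000_000 / 8
    return int(bytes_per_s * fraction / 1024)


def transfer_seconds(size_mb, limit_kb_per_s):
    """Estimate how long `size_mb` megabytes (1 MB = 1024 KB) take at the given limit."""
    return size_mb * 1024 / limit_kb_per_s


# Assumed values: 1 Gbit/s link, replication allowed a quarter of it.
limit = binlog_limit_kb_per_s(1000, 0.25)
print(limit)  # 30517
print(round(transfer_seconds(2048, limit)), "seconds for a 2 GB binlog burst")
```

The trade-off is that a lower limit leaves more headroom for web traffic but lets the replica fall further behind the master during each import.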