Friday, March 10, 2017

hardware - Something is burning in the server room; how can I quickly identify what it is?



The other day, we noticed a terrible burning smell coming out of the server room. Long story short, it ended up being one of the battery modules that was burning up in the UPS unit, but it took a good couple of hours before we were able to figure it out. The main reason we were able to figure it out is that the UPS display finally showed that the module needed to be replaced.



Here was the problem: the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light-headed). We almost mistakenly took our production database server down because that's where the smell was strongest. The vitals appeared to be OK (CPU temps showed 60 degrees C, and fan speeds were OK), but we weren't sure. It just so happened that the battery module that burnt up was about the same height as the server on the rack and only 3 ft away. Had this been a real emergency, we would have failed miserably.



Realistically, actual server hardware burning up is fairly rare, and most of the time we'll be looking at the UPS as the culprit. But with several racks, each holding several pieces of equipment, it can quickly become a guessing game. How does one quickly and accurately determine which piece of equipment is actually burning up? I realize this question is highly dependent on environmental variables such as room size, ventilation, location, etc., but any input would be appreciated.



Answer



The general consensus seems to be that the answer to your question comes in two parts:



How do we find the source of the funny burning smell?



You've got the "How" pretty well nailed down:




  • The "Sniff Test"

  • Look for visible smoke/haze


  • Walk the room with a thermal (IR) camera to find hot spots

  • Check monitoring and device panels for alerts



You can improve your chances of finding the problem quickly in a number of ways - improved monitoring is often the easiest. Some questions to ask:




  • Do you get temperature and other health alerts from your equipment?

  • Are your UPS systems reporting faults to your monitoring system?

  • Do you get current-draw alarms from your power distribution equipment?


  • Are the room smoke detectors reporting to the monitoring system? (and can they?)
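As a rough illustration of the monitoring questions above, here is a minimal sketch of ranking alerts so that the most out-of-range device is checked first. Everything in it is an assumption made for illustration: the `Reading` type, the metric names, the device names, and especially the threshold values, which would have to come from your own hardware and vendor guidance.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    device: str   # hypothetical device name, e.g. "ups-1"
    metric: str   # hypothetical metric name, e.g. "temp_c"
    value: float


# Hypothetical thresholds; real limits depend on your equipment.
THRESHOLDS = {
    "temp_c": 75.0,        # component temperature in degrees C
    "current_a": 16.0,     # current draw on a PDU circuit in amps
    "battery_fault": 0.5,  # fault flag reported as 0.0 or 1.0
    "smoke": 0.5,          # smoke detector flag reported as 0.0 or 1.0
}


def find_alerts(readings):
    """Return readings at or above their threshold, worst overshoot first."""
    alerts = []
    for r in readings:
        limit = THRESHOLDS.get(r.metric)
        if limit is not None and r.value >= limit:
            # Ratio of value to limit gives a crude cross-metric ranking.
            alerts.append((r.value / limit, r))
    alerts.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in alerts]


if __name__ == "__main__":
    sample = [
        Reading("db-prod-1", "temp_c", 60.0),      # below threshold
        Reading("ups-1", "battery_fault", 1.0),    # fault flag set
    ]
    for alert in find_alerts(sample):
        print(alert.device, alert.metric, alert.value)
```

With readings like those in the question (a server at 60 degrees C next to a UPS reporting a battery fault), a check along these lines would point at the UPS rather than the server, provided the UPS actually reports its faults to the monitoring system.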






When should we troubleshoot versus hitting the Big Red Switch?



This is a more interesting question.
Hitting the big red switch can cost your company a huge amount of money in a hurry: Clean agent releases can run into the tens of thousands of dollars, and the outage / recovery costs after an emergency power off (EPO, "dropping the room") can be devastating.
You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.



Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.




The guidelines that follow are my personal limitations that I apply in the absence of (or in addition to) any other clearly defined procedure/rules - they've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.




  1. If you see smoke or fire, drop the room
    This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system.
    Exceptions may exist (exercise some common sense), but this is almost always the correct action.


  2. If you're proceeding to troubleshoot, always have at least one other person involved
    This is for two reasons. First, you do not want to be wandering around in a datacenter and all of a sudden have a rack go up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room, and should you make the call to hit the Big Red Switch you have the benefit of having a second person concur with the decision (helps to avoid the career-limiting aspects of such a decision if someone questions it later).


  3. Exercise prudent safety measures while troubleshooting
    Make sure you always have an escape path (an open end of a row and a clear path to an exit).
    Keep someone stationed at the EPO / fire suppression release.
    Carry a fire extinguisher with you (Halon or other clean-agent, please).
    Remember rule #1 above.
    When in doubt, leave the room.
    Protect your breathing: use a respirator or an oxygen mask. This might save your health in case of a chemical fire.


  4. Set a limit and stick to it
    More accurately, set two limits:





    • Condition ("How much worse will I let this get?"), and

    • Time ("How long will I keep trying to find the problem before it's too risky?").



    The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so that when you DO pull power you're not crashing a bunch of active machines and your recovery time will be much shorter. Remember, though, that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.


  5. Trust your gut
    If you are concerned about safety at any time, call the troubleshooting off and clear the room.
    You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
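Rules 1 and 4 above can be sketched as a small decision helper, mainly to show that both limits are fixed before troubleshooting starts and are checked on every pass. This is only an illustration under stated assumptions: the class name, the 0-5 severity scale, and the returned strings are hypothetical, and no script replaces the judgment of the people in the room.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TroubleshootingLimits:
    max_seconds: float   # time limit: how long to keep looking
    max_severity: int    # condition limit on a hypothetical 0-5 scale
    started_at: float = field(default_factory=time.monotonic)

    def decide(self, severity, smoke_or_fire=False, now=None):
        """Return what to do next given the current observations."""
        now = time.monotonic() if now is None else now
        # Rule 1: visible smoke or fire overrides everything else.
        if smoke_or_fire:
            return "drop the room"
        # Rule 4, condition limit: things got worse than we agreed to accept.
        if severity > self.max_severity:
            return "stop: condition limit exceeded"
        # Rule 4, time limit: we have been looking for too long.
        if now - self.started_at >= self.max_seconds:
            return "stop: time limit exceeded"
        return "continue"


if __name__ == "__main__":
    # Hypothetical limits: 15 minutes, severity no worse than 2.
    limits = TroubleshootingLimits(max_seconds=15 * 60, max_severity=2)
    print(limits.decide(severity=1))
```

The point of the design is that the limits are set once, up front, and not renegotiated in the middle of the incident.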




If there isn't imminent danger you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: Their mandate is to protect people, then property, but they're obviously the experts in dealing with fires so you should do what they say!)





We've addressed this in comments, but it may as well get summarized in an answer too -- @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion


