Monday, May 25, 2015

troubleshooting - Diagnosing a system that persistently recognises SATA drives but refuses to recognise a SAS drive or its replacement?



I've done a fair bit of troubleshooting but I'm completely at a loss as to what could be going on.



Hardware / platform




  • Supermicro X10SRi-F motherboard

  • EVGA 850W G2 PSU (one of the highest rated for power quality at Jonnyguru.com)


  • 128GB Crucial DDR4 RDIMM

  • LSI 9211-8i PCIe HBA flashed to latest P20-IT (from LSI/Avago website)

  • 8087-to-quad-SAS connector cable (new; the kind of cable where the power side accepts a SATA PSU connector)

  • Seagate 6TB SAS drive (new: ST6000NM0054)

  • Various other Seagate 3TB - 6TB SATA drives (test purposes)



Problem / troubleshooting so far



This is a new server being set up, so all components are new, although some have been tested already before now.




On booting, the HBA didn't recognise or report the 6TB SAS drive (either via the main BIOS or via its own OROM -> SAS topology) and the 6TB drive was cold and not spinning up. No other drives were connected. The rest of the system works fine, so on the face of it the issue is limited to one or more of bad HBA, bad cable, or bad drive.



Troubleshooting steps so far:




  1. Connected 6TB SAS drive using different terminator on the quad cable, and the quad cable to both 8087 ports. No change - implies the issue isn't one specific terminator or port.

  2. Connected various Seagate 3TB-6TB SATA drives using same cable (same manufacturer and similar modern range to eliminate subtle compatibility issues if any). All recognised, reported, and spun up perfectly as normal on boot, on both 8087 ports and on all 4 terminators, and over multiple reboots - implies HBA and cable both work fine, at least for SATA. (Would be odd if they worked perfectly for SATA but not SAS.)

  3. Kept identical connections but replaced SATA drives by 6TB SAS drive, not changing anything else. As before, 6TB SAS drive wasn't recognised or reported by HBA, and wasn't spun up.

  4. Tried exactly the same with a different card and platform: an LSI 9260-8i RAID controller on an ASUS-based desktop. Again all SATA drives were immediately recognised and spun up, but the 6TB SAS drive wasn't recognised and didn't spin up.


  5. Reluctantly concluded that however unlikely, the most likely issue was 6TB SAS drive DOA and RMA'ed it. ("Reluctantly" is because I've never actually had a DOA before, the drives are usually reliable, and if it is dead then by far more usual/expected would be that it's at least recognised but non-functional. I just couldn't figure a more likely issue than complete DOA.)

  6. Just received the warranty replacement, and I'm getting exactly identical symptoms with the replacement as well: (a) When the 6TB SAS drive and any SATA drive are connected to 2 terminators and the system boots, the SATA drive is immediately recognised, reported and spins up, while the 6TB SAS drive stays cold and still. (b) When the 6TB SAS and any SATA drive are connected to the 9260-8i RAID card in the other ASUS desktop, the SATA drive is likewise immediately recognised but the 6TB SAS drive stays cold.

  7. As a last step, re-re-read the 9211-8i HBA user guide in case I missed anything first time, and re-checked the BIOS. Can't find anything that would seem to explain this, or any statement that SAS drives will not be recognised unless/until .



Didn't really believe it was DOA first time. Definitely don't believe it's a DOA now. But if not, then what is it, and what can I be missing?




I've tested everything (AFAIK) in the component chain. The HBA just doesn't have much OROM interface that can go wrong, or any options to recognise SATA/SAS/both, or anything like that, and the main PC/server in both cases just leaves detection to the HBA/RAID card. I've tested on two completely different platforms, with two different models of controller card, with SAS vs SATA drives, and I'm utterly stumped.



(Note: I'm slightly limited as I'm starting to transition from SATA to SAS, with the intent being to replace SATA by SAS as they wear out, so at the moment I don't have any other SAS disks or cables to test with, which I would otherwise have done too. But I think I've probably covered that by testing the cards+cables while varying SATA/SAS)



Updated with a more accurate title to help others, now that more info has been obtained. See answer.


Answer



I spoke to LSI (now Avago) tech support for storage, in Germany. Their view was that if two different kinds of "known good" controller in two different machines both recognised all the SATA drives but not this SAS drive (on any port and connector), then the fault was very likely the drive.



They also suggested a further test - to connect the power side of the drive only (NOT HBA/motherboard/data wires) and turn on the server. (He warned me it would "sound crazy"!) Apparently like SATA, SAS drives spin up when they are first powered if the data side isn't connected (I didn't know that, wonder how staggered start works then?), providing a very good test that relies only on the PSU and power feed to the drive, and nothing else.




Sure enough, the SATA drives all spin up; this drive doesn't. He felt that was enough to be "almost certain" it's a second bad drive, however unlikely, without spending cash. The serial number was also almost identical to the original dud drive's (one digit different), so he also suggested speaking to the manufacturer and asking whether they have any similar reports for this drive, as it could be a bad batch.



Update April 2017:



I thought for a while that the fix was to disable the LSI 9211 BIOS, based on an online thread. I disabled the BIOS and it did work... but later, when I moved the box, it stopped working and I couldn't figure out why. I took this info back to LSI tech support and they said it wasn't possible that the BIOS could be the issue or that disabling it could help. They felt that moving the box had probably disturbed a cable that had only been working by luck, and turned it back to not working.



They said to try a new "forward" or "fanout" cable, and specifically recommended Adaptec cables (on the side, as it's a competitor!!), which they said are more reliable than most for SAS. They said that it wasn't always clear or marked whether a cable was the right kind or not, and to check carefully.



The exact SAS cable one would need will vary depending on what interface the HDD and card have. The 9211 has an SFF-8087 connection and my HDD has an SFF-8482 connection (looks a bit like SATA, but with the power and data ports joined).




I was dubious about it being the cable (since the cable did work fine on electrically similar SATA), but went ahead and contacted Adaptec who commented that getting the cable right can be quite challenging in the sense of being sure exactly which kind of cable is needed. They checked the card specs and HDD specs and recommended their 2275300-R on Amazon, and much to my amazement it worked first time, so I guess they must have known what they were talking about.
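In hindsight, the elimination logic used in the question can be sketched in Python to show why it pointed at the drive. This is only an illustrative sketch: the `Trial` record, the `suspects` function and the labels in `TRIALS` are hypothetical names, and it assumes the same quad cable was used in every trial, including on the 9260-8i. A component is suspected if it appears in every failed trial and in no successful one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    controller: str
    cable: str
    drive: str
    interface: str  # "SAS" or "SATA"
    detected: bool


def parts(trial, pairwise):
    """Components exercised by one trial, optionally including the
    cable/interface combination as a component in its own right."""
    items = {
        ("controller", trial.controller),
        ("cable", trial.cable),
        ("drive", trial.drive),
    }
    if pairwise:
        items.add(("cable+interface", trial.cable + " with " + trial.interface))
    return items


def suspects(trials, pairwise=False):
    """Components present in every failing trial and in no passing trial."""
    failing = [parts(t, pairwise) for t in trials if not t.detected]
    passing = [parts(t, pairwise) for t in trials if t.detected]
    if not failing:
        return set()
    in_every_failure = set.intersection(*failing)
    cleared = set().union(*passing)
    return in_every_failure - cleared


# Trials as described in the question (labels are illustrative; the
# same-cable-everywhere detail is an assumption).
TRIALS = [
    Trial("9211-8i", "quad cable", "6TB SAS", "SAS", False),
    Trial("9211-8i", "quad cable", "SATA test drive", "SATA", True),
    Trial("9260-8i", "quad cable", "6TB SAS", "SAS", False),
    Trial("9260-8i", "quad cable", "SATA test drive", "SATA", True),
]

print(sorted(suspects(TRIALS)))
print(sorted(suspects(TRIALS, pairwise=True)))
```

With each component treated independently, only the drive is left as a suspect, because the SATA successes "clear" the cable. With `pairwise=True` the cable-with-SAS combination is left as a suspect too, since no trial ever showed that cable working with a SAS drive. That is the gap a second SAS cable or a second SAS drive would have closed.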

