I have bought a used SSHD (Seagate Laptop SSHD, ST500LM000-1EJ162) on eBay. According to S.M.A.R.T the disk might be damaged somehow, but I am not sure. I need your help to interpret the S.M.A.R.T values correctly.
According to S.M.A.R.T I have a tremendous number of raw read errors and seek errors. I have read a lot of threads about this topic, and what I have found is that these two values are almost irrelevant, because there is no standard for what kind of errors must occur for these two values (Raw_Read_Error_Rate and Seek_Error_Rate) to rise. The manufacturer decides this. Generally speaking, Seagate drives tend to show high RAW values for raw read and seek errors, while Western Digital drives tend to show low RAW values here. I have read that, because of this, it is useless to try to interpret the RAW values of these two attributes; instead I should compare the column VALUE with WORST and THRESH.
And here the next problem comes in. For these columns it is the opposite: higher is better, so a VALUE above THRESH is preferred.
To make things clearer, have a look at the smartctl -a /dev/sdb snippet below:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 237676480
According to S.M.A.R.T, I have a Raw_Read_Error_Rate with a RAW value of 237676480. This looks dangerous at first glance. But looking at the columns VALUE, WORST and THRESH, I have a current VALUE of 120. The WORST value ever recorded was 099, and if VALUE falls below THRESH (006) the disk should be considered broken.
The same goes for Reallocated_Sector_Ct: the closer the column values get to THRESH, the worse the disk's condition.
So according to the S.M.A.R.T snippet below, my disk has never reallocated anything.
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
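To compare VALUE, WORST and THRESH without reading the columns by eye, an attribute line can be split into its fields. This is only a minimal sketch: it assumes the ten-column layout shown in the header line above, and the names SmartAttr, parse_attr_line and headroom are made up for this illustration.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attr_line(line: str) -> SmartAttr:
    """Split one attribute line of `smartctl -a` output into fields.

    Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
    WHEN_FAILED RAW_VALUE. maxsplit=9 keeps a RAW_VALUE that contains
    spaces (as the temperature attribute does) in one piece.
    """
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"unexpected attribute line: {line!r}")
    attr_id, name, _flag, value, worst, thresh, _type, _upd, failed, raw = parts
    return SmartAttr(int(attr_id), name, int(value), int(worst),
                     int(thresh), failed, raw.strip())


def headroom(attr: SmartAttr) -> int:
    """Distance between the current normalized VALUE and THRESH."""
    return attr.value - attr.thresh


# The two attribute lines quoted above.
lines = [
    "1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 237676480",
    "5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0",
]
for line in lines:
    a = parse_attr_line(line)
    print(f"{a.name}: VALUE {a.value}, WORST {a.worst}, "
          f"THRESH {a.thresh}, headroom {headroom(a)}")
```

For the two lines above this reports a headroom of 114 and 90, so both attributes are far away from their thresholds.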
Now let's have a look at Reported_Uncorrect. As far as I understand, these errors are counted whenever the disk fails to reallocate a bad sector, with the result that the data stored in that sector is lost.
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
187 Reported_Uncorrect 0x0032 099 099 000 Old_age Always - 1
According to the S.M.A.R.T snippet above, the disk has had one uncorrected sector in its lifetime. Judging by the columns VALUE and WORST, there is no need to be afraid of a disk failure.
Another attribute is Airflow_Temperature_Cel. At first I installed the disk in my 12-year-old laptop and ran badblocks to check it. While badblocks was running for several hours, I checked the S.M.A.R.T temperature attribute and saw that the column VALUE was equal to WORST and that both had fallen below THRESH. As RAW_VALUE I had a statement like: DISK IS FAILING. So I decided to turn off my laptop, install the SSHD in my home server, which has better airflow, and restart badblocks. When I check this S.M.A.R.T attribute now, the column WORST describes what happened the day before in my laptop, while the column VALUE reflects the current temperature. Comparing VALUE with THRESH, the temperature is fine. The RAW_VALUE is what I have trouble interpreting. Here is the snippet:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 068 037 045 Old_age Always In_the_past 32 (0 120 37 26 0
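The In_the_past entry fits the story above: VALUE (068) is above THRESH (045) again, but WORST (037) once dropped below it. A minimal sketch of that logic follows. It assumes that an attribute counts as failed when the normalized number is at or below THRESH, and that a THRESH of 0 means the attribute can never trip; that is my reading of how the WHEN_FAILED column behaves, not something taken from the smartctl source. The function name when_failed is made up.

```python
def when_failed(value: int, worst: int, thresh: int) -> str:
    """Guess the WHEN_FAILED column from VALUE, WORST and THRESH.

    Assumption: "failed" means at or below THRESH, and THRESH == 0
    disables the check.
    """
    if thresh == 0:
        return "-"
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


# Airflow_Temperature_Cel as shown above (home server, after cooling down).
print(when_failed(68, 37, 45))   # In_the_past
# The laptop situation described above: VALUE equal to WORST, both below THRESH.
print(when_failed(37, 37, 45))   # FAILING_NOW
# Reported_Uncorrect: THRESH is 000, so it cannot trip under this assumption.
print(when_failed(99, 99, 0))    # -
```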
Last but not least, there is some S.M.A.R.T information I have never seen in any S.M.A.R.T output before, and I have absolutely no clue how to interpret it:
Error 4 occurred at disk power-on lifetime: 521 hours (21 days + 17 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 03 80 04 11 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ea 00 00 00 00 00 00 00 00:13:30.508 FLUSH CACHE EXT
61 00 08 00 09 9c 40 00 00:13:30.507 WRITE FPDMA QUEUED
61 00 08 78 e1 42 40 00 00:13:30.507 WRITE FPDMA QUEUED
61 00 28 f0 44 9d 40 00 00:13:30.507 WRITE FPDMA QUEUED
61 00 08 00 6f 71 47 00 00:13:29.805 WRITE FPDMA QUEUED
Error 3 occurred at disk power-on lifetime: 519 hours (21 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 a0 25 e7 06
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ea 00 00 00 00 00 00 00 00:11:47.000 FLUSH CACHE EXT
61 00 08 88 c4 a0 40 00 00:11:45.863 WRITE FPDMA QUEUED
60 00 08 40 d4 08 49 00 00:11:45.863 READ FPDMA QUEUED
61 00 08 00 09 9c 40 00 00:11:45.863 WRITE FPDMA QUEUED
60 00 12 19 47 5a 40 00 00:11:45.863 READ FPDMA QUEUED
Error 2 occurred at disk power-on lifetime: 519 hours (21 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 40 d4 08 09 Error: WP at LBA = 0x0908d440 = 151573568
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 08 78 e1 42 40 00 00:10:28.019 WRITE FPDMA QUEUED
61 00 08 e0 96 a0 40 00 00:10:27.914 WRITE FPDMA QUEUED
61 00 08 98 95 a0 40 00 00:10:27.914 WRITE FPDMA QUEUED
61 00 08 70 95 a0 40 00 00:10:27.914 WRITE FPDMA QUEUED
61 00 08 58 95 a0 40 00 00:10:27.914 WRITE FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 426 hours (17 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 03 80 04 11 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ea 00 00 00 00 00 00 00 00:35:26.857 FLUSH CACHE EXT
61 00 08 00 09 9c 40 00 00:35:26.856 WRITE FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:35:26.161 WRITE FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:35:26.161 WRITE FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:35:26.160 WRITE FPDMA QUEUED
From the postings I have read on different forums, people tend to advise replacing disks before things get worse. I have also read a few comments from people who were able to use such disks for several years before they died. For me, this is new territory. I have never had a disk with so many errors. Probably the previous owner handled the disk badly, for example by shaking his laptop a lot, or the SATA connectors did not fit perfectly, which can cause errors too. As I said, I have no clue how to interpret these parameters. It is like an experiment I am going to do with this disk.
I checked the disk with badblocks -wvs -b 4096 -o badblox.result /dev/sdb and had no errors. DO NOT COPY&PASTE THAT BADBLOCKS COMMAND!!! (The -w option runs a destructive write test that erases everything on the disk.) But when comparing the output of smartctl -a /dev/sdb before and after running badblocks, the RAW values of Raw_Read_Error_Rate and Seek_Error_Rate increased a lot, while all the other attribute values remained the same. Check the snippets below.
Before running badblocks:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 104 099 006 Pre-fail Always - 6995776
7 Seek_Error_Rate 0x000f 059 055 030 Pre-fail Always - 107395771838
After badblocks had finished:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 120 099 006 Pre-fail Always - 237676480
7 Seek_Error_Rate 0x000f 059 055 030 Pre-fail Always - 107395783395
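To put numbers on the change, the two snapshots can simply be subtracted. This is plain arithmetic on the figures quoted above; the dictionary layout and the function name deltas are only for illustration.

```python
# (normalized VALUE, RAW_VALUE) per attribute, copied from the two snippets.
before = {
    "Raw_Read_Error_Rate": (104, 6995776),
    "Seek_Error_Rate": (59, 107395771838),
}
after = {
    "Raw_Read_Error_Rate": (120, 237676480),
    "Seek_Error_Rate": (59, 107395783395),
}


def deltas(old: dict, new: dict) -> dict:
    """Return (change in VALUE, change in RAW_VALUE) for each attribute."""
    return {
        name: (new[name][0] - old[name][0], new[name][1] - old[name][1])
        for name in old
    }


for name, (d_value, d_raw) in deltas(before, after).items():
    print(f"{name}: VALUE {d_value:+d}, RAW {d_raw:+d}")
# Raw_Read_Error_Rate: VALUE +16, RAW +230680704
# Seek_Error_Rate: VALUE +0, RAW +11557
```

So while the RAW numbers grew, the normalized VALUE of Raw_Read_Error_Rate went up from 104 to 120 and the one of Seek_Error_Rate did not move at all.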
The whole S.M.A.R.T Output can be reviewed on PasteBin:
So my questions are:
- How serious is the damage to this disk?
- Is my interpretation of the raw read and seek errors correct?
- Is having zero reallocated sectors a good thing?
- Is having only one uncorrected error not too bad?
- Do zero errors from badblocks mean that the disk is in good shape?
- How do I have to interpret Error 1 to Error 4?
- Are there any more tests I should run, apart from the self-test smartctl -t long /dev/sdb that is currently running?
Answer
Very quickly:
Raw values mean nothing. They can vary from firmware to firmware, and unless you know exactly what a raw value means for your specific hardware, don't try to interpret it. Sometimes it's obvious (temperature in Celsius), often it isn't.
The values are normalized to 100, and lower is worse. If it's 100 or above, no need to worry. If it's below 100, the hard disk is showing a bit of wear. If it gets close to the threshold, or under it, start to worry.
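As a rough sketch, that rule of thumb could be written down like this. The function name health and the margin parameter near are made up; how close is "close to the threshold" is a judgment call, and 10 is just a placeholder.

```python
def health(value: int, thresh: int, near: int = 10) -> str:
    """Apply the rule of thumb above to one normalized attribute.

    `near` is a hypothetical margin for "close to the threshold".
    """
    if value <= thresh + near:
        return "worry"
    if value < 100:
        return "a bit of wear"
    return "fine"


# VALUE / THRESH pairs from the question.
print(health(120, 6))    # Raw_Read_Error_Rate     -> fine
print(health(100, 10))   # Reallocated_Sector_Ct   -> fine
print(health(59, 30))    # Seek_Error_Rate         -> a bit of wear
print(health(68, 45))    # Airflow_Temperature_Cel -> a bit of wear
```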
All harddisks have raw read errors. That's a consequence of the high density of today's drives, and that's what the inbuilt error-correction is for.
So: Your raw read rate looks fine. Your reallocated sector rate is excellent, meaning nothing seriously happened yet. A few reallocated sectors are nothing to worry about.
Your temperature is too high for some reason, so check that the hard drive is cooled properly. The seek error rate is too high; this may be a consequence of the temperature being too high, causing the metal to expand a bit, which may move the head position out of spec.
So the one bit you need to worry about is proper cooling. If you can make that work, the seek errors should go down, and in your place I'd keep the harddisk. (But, of course, you are doing backups, aren't you?)
Edit
Error 1-4 come from a log of the five most recent errors that were communicated on the ATA layer. Usually you get a header like
SMART Error Log Version: 1
ATA Error Count: xxx (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
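For example, the LBA that smartctl printed for Error 2 can be rebuilt from those registers. The sketch below assumes the classic 28-bit LBA layout: SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27. The function name lba28 is made up.

```python
def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine the register bytes into a 28-bit logical block address.

    Assumed layout: SN = bits 0-7, CL = bits 8-15, CH = bits 16-23,
    low nibble of DH = bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register dump of Error 2 from the question: ER ST SC SN CL CH DH
regs = "40 51 00 40 d4 08 09"
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in regs.split())

lba = lba28(sn, cl, ch, dh)
print(f"{lba:#010x} {lba}")   # 0x0908d440 151573568
```

That matches the "LBA = 0x0908d440 = 151573568" printed next to the registers of Error 2.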
So one could look up command and feature values in the ATA standard to find out more details about what happened. But having errors occur from time to time is by itself nothing to worry about: the embedded controller is complex, the interaction with the host is complex, the timing is complex; if some odd circumstances happen, that's one way to get an error. Other ways are bugs in the embedded controller firmware that only trigger under these odd circumstances.
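As a hedged illustration of such a lookup, the ER and ST bytes can be split into named bits. The bit names below are the ones I remember from the ATA register definitions (ABRT, IDNF, UNC, ICRC for the error register; ERR, DRQ, DF, DRDY, BSY for the status register, with DSC as the legacy name of status bit 4). Treat the tables as an assumption and check them against the standard. Note that smartctl labeled the 0x40 error bit as WP in Error 2 of the question, so the label it shows may differ from the generic name used here.

```python
# Assumed bit names; verify against the ATA standard before relying on them.
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC", 3: "DRQ", 0: "ERR"}


def decode(byte: int, names: dict) -> list:
    """List the names of all set bits, highest bit first.

    Bits without a known name are shown as bitN.
    """
    return [names.get(bit, f"bit{bit}")
            for bit in range(7, -1, -1)
            if byte & (1 << bit)]


# Error 1 and Error 4 from the question: ER = 04, ST = 71
print(decode(0x04, ERROR_BITS))    # ['ABRT']
print(decode(0x71, STATUS_BITS))   # ['DRDY', 'DF', 'DSC', 'ERR']
# Error 3: ER = 04, ST = 51
print(decode(0x51, STATUS_BITS))   # ['DRDY', 'DSC', 'ERR']
```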
Only when errors occur frequently, are recent, and keep occurring is it time to worry, especially if it's always the same error.
You have three errors that occurred after a cache flush, and one after a write (LBA = logical block address). Two happened together, probably as a consequence of the same problem, and the one before and the one after happened independently of that. In your place, I'd completely ignore those: whatever caused them is over, and it's not happening again.