OT Ubuntu drive/ directory/ NFS issue

Hi All,

I have all my files stored on a central Ubuntu based server with 3 drives

  1. the OS
  2. all my data
  3. local backup

It has been fine for a few years, but annoyingly, recently when accessing the data through an NFS mount it times out while reading the directory. If I log on to the server remotely and try to "ls" that directory, it takes something like 30 minutes to complete. Once done, subsequent "ls" commands return immediately and NFS works correctly again.

I initially thought it was because drive 2 is starting to fail, but looking at smartctl (I ran the long and short tests) and then reading each block with "dd", it seems like there are 2 dodgy blocks, but besides that I think it is OK?
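The self-tests were started along these lines (a rough sketch; /dev/sdb is drive 2, as in the dd command further down):

  sudo smartctl -t short /dev/sdb     # quick self-test, a couple of minutes
  sudo smartctl -t long /dev/sdb      # extended self-test, several hours
  sudo smartctl -a /dev/sdb           # full report, including the self-test log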

smartctl -a gives
==========================================
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 199   199   051    Pre-fail Always  -           92225
  3 Spin_Up_Time            0x0027 186   171   021    Pre-fail Always  -           5683
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           427
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 010   010   000    Old_age  Always  -           66060
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           424
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           226
193 Load_Cycle_Count        0x0032 192   192   000    Old_age  Always  -           25655
194 Temperature_Celsius     0x0022 119   102   000    Old_age  Always  -           31
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           1
198 Offline_Uncorrectable   0x0030 100   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 100   253   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description     Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline     Completed: read failure  90%        489              1326848392
# 2  Short offline        Completed: read failure  90%        489              1326848392
# 3  Conveyance offline   Completed without error  00%        0                -
# 4  Short offline        Completed without error  00%        0                -
=============================================

sudo dd if=/dev/sdb1 of=/dev/null bs=64k conv=noerror
====================================================
dd: error reading '/dev/sdb1': Input/output error

43920419+1 records in
43920419+1 records out
2878368583680 bytes (2.9 TB, 2.6 TiB) copied, 24480.1 s, 118 MB/s
45785391+1 records in
45785391+1 records out
3000591388672 bytes (3.0 TB, 2.7 TiB) copied, 26169.8 s, 115 MB/s
=================

Running smartctl on disk 1 (the OS) seems clear, although having run the "short" test overnight it is stuck at 90%.

So I am thinking the drives are not the cause of this issue. Anyone have any ideas?

Thanks

Lee.

Reply to
leen...

On 23/02/2023 at 07:03, snipped-for-privacy@yahoo.co.uk wrote:

Have you tried using sshfs instead?
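Something along these lines, as a rough sketch (user, host and mount point are made up):

  sudo apt install sshfs
  sshfs lee@server:/data /mnt/data    # mount the server's /data over ssh
  fusermount -u /mnt/data             # unmount when finished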

Reply to
Ottavio Caruso

Have you run an fsck on the partitions? You will need to unmount each partition and then fsck it. You'd need to stop NFS exporting any relevant partitions, or the umount will report the mount point busy.
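Roughly this, assuming the data partition is /dev/sdb1 and is mounted somewhere like /srv/data (names are placeholders):

  sudo exportfs -ua          # stop exporting everything over NFS
  sudo umount /srv/data      # unmount the data partition
  sudo fsck -f /dev/sdb1     # force a full check even if it is marked clean
  sudo mount /srv/data       # remount (as per /etc/fstab)
  sudo exportfs -ra          # re-export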

Reply to
Jim Jackson

Someone is getting a new hard drive for Valentine's Day. And that was nine days ago.

Power_On_Hours   66060    # I do have one this old. One drive of thirty-two drives.
Load_Cycle_Count 25655    # head park every two or three hours
                          # it is not an aggressive parking drive

Current_Pending_Sector 1  # No spare is available nearby, by the looks of it.
                          # Normally, Current_Pending never accumulates a count.
                          # That means we can't make the error go away.
                          # But you can certainly try. It could, for example,
                          # be a high-fly error.

If I were a parish priest, I would tell you to "do a write pass followed by a read pass". That would be about 6 hours to write the entire drive and 6 hours to read the entire drive. Then run smartctl again and see if the Current Pending is gone. It's like a Hail Mary for having sinned.
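In command terms, something like this (only an illustration; it assumes the drive is /dev/sdb and that everything on it has been copied off first, because the write pass wipes the whole drive):

  sudo dd if=/dev/zero of=/dev/sdb bs=1M status=progress   # write pass - destroys all data!
  sudo dd if=/dev/sdb of=/dev/null bs=1M status=progress   # read pass
  sudo smartctl -A /dev/sdb | grep -i pending              # has Current_Pending_Sector cleared?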

This may cause the things that were "sticking" or "slow" before to perk up a tiny bit, as you're no longer waiting 15 seconds for a timeout. If it's a high-fly error, a rewrite can "fix" the sector.

But as a shoot-from-the-hip comment, at 66,000 hours on a 3TB drive it "has served you well". The only way I could get that hour count was on a 500GB drive. Some of the less-dense drives were exceptionally good on hours. The bigger ones tend to be more shitty. I've had drives start to show their true self at 5,000 hours.

Just don't buy the cheapest SKU. There are exceptions to that rule, but then again, they are not the absolute cheapest. My WD Blue, now that was crap. The recent WD Black 1TB have been low-cost for some reason, but usually cost a tiny bit more than a WD Blue.

Seagate can vary from generation to generation. You need some customer reviews that haven't been fudged, to capture the essence of the product.

You want a Perpendicular Magnetic Recording (PMR) drive, not a Shingled Magnetic Recording (SMR) drive. SMR drives are not good as boot drives. They may be used as data drives... if you are "desperate for trouble". The manufacturers do not want to identify the SMR ones, and they have had to apologize on at least one occasion for slipping SMR into applications where it does not belong (as a near-line NAS drive).

Helium drives start at either 6TB or 8TB capacity. There still isn't a good idea as to how long the Helium stays inside the drive. Apparently there is a sensor inside the drive, and some SMART parameter may cover that. I have some 6TB drives here, and those are air breather drives (the normal kind), rather than (sealed) Helium drives. Helium drives have two covers and no breather hole. (A breather hole is marked as "do not cover this hole", although some models do not have a warning on the label any more.)

Paul

Reply to
Paul

I normally pension drives off at about 50,000 hours, but noticed recently that I had some that had reached longer values - in one case nearly 90,000. I think they were 1TB or 2TB.

I have replaced them all. Easy enough as they were paired in mirrors - take one out of the mirror, change it, insert it back, wait for a sync. Then do it with the other one.
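On a Linux md mirror that would look roughly like this (a sketch only; /dev/md0, /dev/sdb1 and /dev/sdc1 are placeholder names):

  sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # drop the old disk from the mirror
  # ...physically swap the drive and partition the new one, then:
  sudo mdadm /dev/md0 --add /dev/sdc1                       # add the replacement
  cat /proc/mdstat                                          # watch the resync progress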

I tend to use WD Red, but always Plus or Pro as they are not SMR.

WD tend to call the non-shingled ones 'CMR' - Conventional Magnetic Recording. Probably some marketing thing to make them sound old-fashioned.

SMR drives are a disaster as part of a RAID array.

WD don't make it crystal clear, but they do tell you if you look carefully.

Reply to
Bob Eager

Remains to be seen if that fixes the problem.

One of mine is at 101,724 hours.

Wrong. It means that that sector has not been written to yet.

Nope, it's what the drive is waiting for before it reallocates the sector.

Mine is a 2TB drive.

Reply to
Rod Speed

No. Your raw error rate should be zero.

Replace the disk.

Reply to
The Natural Philosopher

Nope.

Nope.

Reply to
Rod Speed

ID# ATTRIBUTE_NAME      FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x002f 199   199   051    Pre-fail Always  -           92225
                               ^^^         ^^^

The "Value" is higher than the "Threshold", so the drive passes.

Let's look at my ST4000DM000-2AE1 drive, a drive I bought to see if Seagate had learned their lesson yet. The "Value" is still higher than the "Threshold". The drive is not in any trouble (the drive has 550 hours on it). You can see the "Worst" it has ever been, is a bit closer to the failure threshold (so we know which direction the statistic goes in).

ID# ATTRIBUTE_NAME      FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 076   064   006    Pre-fail Always  -           35656066

The field that has been removed and is not visible is:

Hardware_ECC_Corrected 35656066

Every bloody error was fixed! Yet, they don't tell you that. That is why the Raw Value in question is "not a death sentence". It is a quality metric. If it is 10^7, that's still OK. If it is 2*10^8, that rates as "failed".

There is yet another missing field.

Recorded_Uncorrected_Errors 1078

What does that mean ?

Should not (Raw - Corrected) be equal to Uncorrected?

Apparently not!

Even the fields with names are NOT being used for their intended purpose. Is the purpose defined in the standard? I've never seen ANY standards text leak, so I don't even know whether good-quality definitions are available or not.

There is a field called "Current Pending", which was supposed to be a queue of questionable sectors that needed "processing". As far as I know, the processing might happen around a "write event" to the sector. Either the write works, or it doesn't. You could do a read verify. You could use automatic sparing, and replace the sector with another, if attempts to make the existing sector work failed.
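If you wanted to force that write event on just the suspect sector, hdparm can do it. Purely an illustration, using the LBA from the OP's self-test log and assuming the drive is /dev/sdb; the write destroys whatever data was in that sector:

  sudo hdparm --read-sector 1326848392 /dev/sdb     # confirm the sector really errors out
  sudo hdparm --yes-i-know-what-i-am-doing --write-sector 1326848392 /dev/sdb
  sudo smartctl -A /dev/sdb | grep -i pending       # see if Current_Pending_Sector clears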

Now, I watched a whole bunch of HDTune SMART tables (I own 32 drives). I watched as a drive declined in health. The Reallocated Sector Count field was accumulating counts: 200 one day for raw data, 300 the next day. Yet, while that was going on, Current Pending stayed at 0 like an obedient puppy! It was quite obvious that the *actual* queue of dodgy sectors was hidden from us.

Then one day, finally, Current Pending went to a raw value of 1. What was that? Well, since a hard CRC error appeared after that, it occurred to me that spares in the local area were exhausted, and that seemed to correlate with Current Pending *finally* going off the peg. THIS is why I am recommending that the OP replace the drive. It is based on me correlating Current Pending activity with "all spares exhausted" in that area.

As it is, Reallocated Sector Count is thresholded. The first hundred thousand corrected sectors are ignored. At some point, they start displaying the reallocations, and there is a finite number of those remaining to be counted. On one drive, if I had accumulated 5500 reallocations past the thresholded value, the drive would likely be declared "failed" at that point.

Summary: while it is fun to poke your finger at "Raw" field values, the interpretation is "not possible" due to fields missing and fields being used for the wrong purpose, so all we can continue to do is look at "Value" and "Threshold" as indicators. From the smartctl output that generated the OP's table we don't know the drive model number, and the summary field above that table undoubtedly says "Good". Which in many cases is bullshit, because the drive is not "Good"; if you did the analysis properly the drive would be "Fair".

I use a benchmark transfer curve to spot trouble. On Windows, there is HDTune for this. On Linux, gnome-disks does have a benchmark (make SURE to turn off the write tick box, as you don't want the bench overwriting the drive). The graphical resolution of the gnome-disks benchmark is too crude for determining drive health. If you spot a 50GB-wide swath of disk surface running the bench at only 10MB/sec, that is an indicator to replace the drive as well, even though smartctl has rated the drive "Good". SMART works best when defects are uniformly spread over the surface. If the defects concentrate in one spot on the disk, then the metrics in the table will not work properly to declare a "Fail". Thus, if you the user use a good-quality read benchmark curve, you can spot trouble before smartctl does.
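A crude command-line substitute, if you want numbers rather than a graph (a sketch, assuming the drive is /dev/sdb): read 256 MiB at 10% steps across the disk and see whether any region is drastically slower than the rest.

  DEV=/dev/sdb
  SIZE=$(sudo blockdev --getsize64 "$DEV")          # drive size in bytes
  for pct in 0 10 20 30 40 50 60 70 80 90; do
      offset=$(( SIZE / 100 * pct / 1048576 ))      # start of this sample, in MiB
      printf '%3d%%  ' "$pct"
      sudo dd if="$DEV" of=/dev/null bs=1M count=256 skip="$offset" iflag=direct 2>&1 \
          | grep -o '[0-9.]* [MG]B/s'
  done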

Paul

Reply to
Paul
