Troubleshooting Sable's SSD
Early last Sunday afternoon I noticed that the battery-charge indicator had vanished from the Gnome panel on Sable, my main laptop. (The panel is sort of like the row of icons and such you see along the bottom of the screen on a Mac, except that I've configured it to go vertically down the left-hand edge, where it doesn't reduce the height of my browser window too much.)
Hmm, says I to myself, maybe it will come back after a reboot. So I did
that, and logging in presented me with an empty screen background. ??? A
little more experimentation showed that only the Gnome-2 desktop was
affected; the Ubuntu one (which I detest) worked fine. So did a console
terminal, and SSH. The obvious next step was to run fsck, the
file-system checker (and many hackers' favorite stand-in for a certain
four-letter expletive).
Well, not quite the next step. Since I figured that fixing file-system
corruption might possibly make things worse, I moved over to one of my
spare laptops, Raven, sat Sable on the shelf next to my desk, and logged
in on Sable with SSH. Then I went to the top of my working tree and ran
make status to see what needed to be checked in. I think I've mentioned
MakeStuff before -- it's basically a multi-function build tool based on
GNU Make, and one of the things it can do is find every git repository
under the top-level directory, and do things like check its status, or
pull. (Commit takes a little more thought, so you don't want to do it
indiscriminately.)
Then I ran MakeStuff/scripts/scripts/pull-all on Raven. Done.
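For anyone without MakeStuff handy, the "find every git repository and check its status" idea can be sketched in a few lines of shell. This is only an illustration, not MakeStuff's actual implementation: list_repos and status_all are names made up for the example, and it assumes each repository has an ordinary .git directory (submodules and linked worktrees, which use a .git file, are skipped).

```shell
#!/bin/sh
# Sketch only: NOT MakeStuff's code. list_repos and status_all are
# hypothetical names invented for this example.

# Print the working-tree directory of every git repository under $1
# (default: the current directory), one per line, sorted.
list_repos() {
    find "${1:-.}" -type d -name .git -prune -print | sed 's|/\.git$||' | sort
}

# Print a heading and the short status for each repository that has
# uncommitted changes; repositories with nothing to report stay silent.
status_all() {
    list_repos "${1:-.}" | while IFS= read -r repo; do
        changes=$(git -C "$repo" status --short)
        if [ -n "$changes" ]; then
            printf '== %s\n%s\n' "$repo" "$changes"
        fi
    done
}
```

Running status_all on the top of a working tree gives a quick list of what still needs to be checked in. Committing is deliberately left out, for the reason given above.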
Well, almost. There are a few things in my home directory that aren't
under my working tree, mostly Desktop, Documents, Downloads, my Firefox
bookmarks, and my Gnome Panel configuration. I hauled out a USB stick and
fired up tar (like zip, except that it can save everything about a file,
not just what DOS knows about). The command I actually ended up using,
because I probably forgot a few things (and should have excluded a few
more, like Ruby and Perl), was
rsync -a --exclude vv --exclude ?cache --exclude ?golang . \
    nova:/vv/backups/steve\@sable
And ran straight into the fact that USB sticks are usually formatted with a FAT filesystem, which limits files to 4GB. Growf! Faced with the unappetizing prospect of shipping 17GB of backups over WiFi, I carried Sable over to my server and plugged in the ethernet cable that I leave hanging off the router for just such occasions. After that finished, I fired up Firefox, bookmarked all my tabs, and exported tabs and bookmarks to an HTML file. Should have done that before I backed up everything, but I didn't think of it.
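A common workaround for the FAT file-size limit, though not the route taken here, is to cut the tar stream into pieces that each fit on the stick. A minimal sketch, assuming GNU tar and split; pack_split, unpack_split, and the 3900M default chunk size are choices made up for this example:

```shell
#!/bin/sh
# Sketch only: pack_split and unpack_split are hypothetical helper names.

# pack_split SRC_DIR DEST_PREFIX [CHUNK_SIZE]
# Archive SRC_DIR with tar and cut the stream into pieces of at most
# CHUNK_SIZE bytes (default 3900M, comfortably under the FAT limit).
# Pieces are named DEST_PREFIXaa, DEST_PREFIXab, and so on.
pack_split() {
    src=$1 prefix=$2 chunk=${3:-3900M}
    tar -C "$src" -cf - . | split -b "$chunk" - "$prefix"
}

# unpack_split DEST_PREFIX TARGET_DIR
# Concatenate the pieces in name order and unpack them into TARGET_DIR.
unpack_split() {
    prefix=$1 target=$2
    mkdir -p "$target"
    cat "$prefix"* | tar -C "$target" -xf -
}
```

For example, pack_split "$HOME/Documents" /media/usb/docs.tar.part- would write the pieces to a stick mounted at /media/usb (a made-up mount point), and unpack_split with the same prefix restores them. The pieces are useless individually, so all of them have to survive the trip.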
Finally, I was ready to run fsck and find out the bad news.
I plugged in the USB stick with the Ubuntu live installer (one does
not run fsck on a mounted filesystem!), brought up a terminal, and ran
e2fsck -cfp /dev/sda5 # check for bad blocks, force, preen
(Force means to do a full check even if the disk claims it's okay; "preen"
means to make all repairs that can be done without human approval.)
Naturally, after turning up a few dozen bad blocks, it told me that I had
to run it manually. I could have replaced the -p option with -y, to say
"yes" to all requests for approval; instead I left it off and hit Enter a
hundred times or so. Almost all the problems were "multiply-claimed
blocks", mostly shared between some other file and the swapfile. Of
course. Fsck offered to clone those blocks, and I took it up on that
offer. Then I ran it again to make sure it hadn't missed anything. It
hadn't. But Sable's desktop was still broken, no doubt because of all
those corrupted files.
So this morning, after a couple of searches, I installed the debsums
program, which finds all of the files you've installed, and compares
their checksums against the ones in the packages they came from. The
following command then takes that list, and re-installs any package
containing a file with a bad checksum:
apt-get install --reinstall $(dpkg -S $(debsums -c) \
    | cut -d : -f 1 | sort -u)
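That pipeline is easier to follow when pulled apart, and it can be reassuring to preview the package list before reinstalling anything. The sketch below is one way to do that; packages_from_owners and damaged_packages are hypothetical names, and damaged_packages assumes a Debian-style system with debsums and GNU xargs installed:

```shell
#!/bin/sh
# Sketch only: both function names are invented for this example.

# Read "package: /path" lines, as printed by `dpkg -S`, on stdin and
# print each owning package once. For multi-arch packages dpkg prints
# "package:arch: /path"; cutting at the first colon drops the arch too.
packages_from_owners() {
    cut -d : -f 1 | sort -u
}

# List the packages that own at least one file whose checksum no longer
# matches. `debsums -c` prints changed files, one path per line; xargs
# hands those paths to `dpkg -S` (and runs nothing if the list is empty).
damaged_packages() {
    debsums -c 2>/dev/null | xargs -r -d '\n' dpkg -S | packages_from_owners
}
```

Running damaged_packages on its own shows what would be reinstalled; if the list looks sane, it can be fed to apt-get install --reinstall in place of the nested command substitution.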
Sable now "works" again. I know at least one zip file was corrupted (it
was a download, and I was able to find it again), and fsck doesn't
appear to have kept a log, so broken files will keep turning up for a
while. I know there aren't any bad zip files left because there's an
option in unzip, -t, that compares checksums, just the way debsums does,
so I could loop through all my downloads with:
for f in *.zip; do echo -n "$f: "; unzip -tqq "$f"; echo; done
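If eyeballing that loop's output gets tedious, the same test can be driven by unzip's exit status so that only the failures are printed. A rough sketch, with bad_zips as a made-up name; it treats any non-zero exit from unzip, warnings included, as a failure:

```shell
#!/bin/sh
# Sketch only: bad_zips is a hypothetical helper name.

# Print the name of every .zip file in directory $1 (default: current
# directory) for which `unzip -tqq` exits non-zero. Prints nothing when
# every archive passes or when the directory has no .zip files.
bad_zips() {
    for f in "${1:-.}"/*.zip; do
        [ -e "$f" ] || continue    # glob matched nothing and stayed literal
        unzip -tqq "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

An empty result from bad_zips ~/Downloads means every archive there passed the test. Since warnings count as failures, a listed file is worth a second look with plain unzip -t before deleting it.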
I have two remaining tasks, I think: one is to validate all of my Git working trees (worst case -- just blow them away and re-clone them), and then comes the really hard one: deciding whether I still trust Sable's SSD, or need to get a new one. And if I get a new one, how big? Sable and its 500GB drive were purchased together, used, from eBay, and brand-new 1TB SSDs are pretty cheap right now. So there's that.