
Tobi Waves



Recovering from Harddisk Disasters

Friday, September 03, 2004 09:13 // SUCON'04, Technopark, Zurich, Switzerland

Tutorial by Theodore Ts'o

What to do when data gets lost

Don't panic!

Stop, think, and figure out what happened. Then create a backup with

|dd if=/dev/hda1 of=/dev/hdb1 bs=1k conv=sync,noerror|

If you have no spare disk, buy one. The disks are way cheaper than the data.
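What dd does with the conv=sync,noerror options can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for dd: the function name and its parameters are made up for this example, and a real rescue run should use dd or a dedicated rescue tool.

```python
import os

def image_copy(src_path, dst_path, block_size=1024):
    """Copy src to dst block by block; return the list of unreadable blocks."""
    bad_blocks = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            for block in range((size + block_size - 1) // block_size):
                offset = block * block_size
                try:
                    data = os.pread(src, block_size, offset)
                except OSError:
                    # Read error: remember the block and keep going (noerror).
                    bad_blocks.append(block)
                    data = b""
                # Pad short or failed reads with zero bytes (sync), so that
                # every later block stays at its original offset.
                dst.write(data.ljust(block_size, b"\0"))
    finally:
        os.close(src)
    return bad_blocks
```

The padding is the important part: without it, everything after a failed block would shift and the filesystem structures in the image would no longer line up.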

Disks in heavy use have a life span of 2-3 years, so you might want to swap them preventively just to be sure.

Physical issues

Hard drives can experience not only head crashes but also "high rides": incidents where a head flies higher than normal while writing. The problem is only noticed when the data is read back. The new Solaris ZFS tries to catch it by reading back recently written data whenever it has spare time.

Hard drives survive only about 50,000 power-downs, because each one ends in a controlled head crash in the landing zone. This can be a real issue for laptop configurations where frequent disk spin-downs are used to save battery power.
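To get a feeling for that number, here is a rough back-of-the-envelope calculation. The usage pattern (six spin-downs per hour, eight hours a day) is an assumption picked for illustration, not a figure from the talk.

```python
def days_until_limit(rated_cycles, spin_downs_per_hour, hours_per_day):
    """Days of use before the rated number of power-down cycles is reached."""
    cycles_per_day = spin_downs_per_hour * hours_per_day
    return rated_cycles / cycles_per_day

# Assumed laptop usage: a spin-down every ten minutes, eight hours a day.
days = days_until_limit(50_000, spin_downs_per_hour=6, hours_per_day=8)
print(f"{days:.0f} days, about {days / 365:.1f} years")
# prints "1042 days, about 2.9 years"
```

With an aggressive power-saving setup the budget is used up in roughly the same 2-3 years that a heavily used disk lasts anyway.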

Some hard drives are not designed for continuous use. This will be noted in the spec sheet, so make sure you check the specs of cheap disks you plan to use in your web server.

A small head crash will not necessarily cause an immediate disk failure. It may just chip off small amounts of material from the disk surface, which will then fly around inside the hard drive case. This causes a growing number of additional head crashes, which again chip off material ... It is therefore good practice to take a full linear image backup of a 'damaged' disk as soon as possible. Errors may well increase as you work on fixing it.

Get a new disk if you find any bad blocks on a disk.

Modern disks

Disks used to have a simple physical geometry. This is still visible in the cylinder, head, sector geometry specifications. Modern disks use a constant bit rate and multiple long spiraling tracks to fit more data. This is all hidden by the controller and exposed to the OS as a simple linear block number.

The only thing one can assume about the physical disk layout is that two blocks which are numerically close together will normally have a short seek time between them.
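For reference, the classic mapping between the old cylinder/head/sector addressing and a linear block number looks like this. The geometry values in the usage note are just example numbers; on a modern disk they are a fiction maintained by the controller and say nothing about the physical layout.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Convert a CHS address to a linear block number (sectors start at 1)."""
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)

def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: linear block number back to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1
```

With an assumed geometry of 16 heads and 63 sectors per track, `chs_to_lba(1, 0, 1, 16, 63)` gives block 1008, and `lba_to_chs(1008, 16, 63)` gives back `(1, 0, 1)`.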

EFI/GUID partitioning schemes

_Universally Unique IDs (aka GUID)_

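A GUID is a 16-byte number. One way to create it is to pick a random number; for a 128-bit random number the birthday paradox puts the point where collisions become likely at around 2^64 generated IDs. Another method is to combine the MAC address of the computer with a high-resolution time stamp. A version code (4 bits in RFC 4122 UUIDs) inside the UUID/GUID shows which method was used to create it.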
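Python's standard uuid module implements both generation methods, so the size and the method code are easy to look at. The collision estimate below uses the usual birthday approximation and assumes 122 random bits, which is what an RFC 4122 random UUID has left after the fixed version and variant bits; the helper name is made up for this example.

```python
import math
import uuid

random_id = uuid.uuid4()   # random-number method
time_id = uuid.uuid1()     # node (MAC address) plus time stamp method

# Both are 16 bytes; the version field records the method used.
print(len(random_id.bytes), random_id.version, time_id.version)

def collision_probability(count, random_bits=122):
    """Birthday approximation: p ~ 1 - exp(-count^2 / 2^(bits + 1))."""
    return -math.expm1(-(count * count) / 2 ** (random_bits + 1))

print(collision_probability(10 ** 12))   # a trillion IDs: still tiny
print(collision_probability(2 ** 61))    # roughly 0.39
```

Note that uuid1() falls back to a random node ID when it cannot determine a MAC address.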

The EFI/GUID partitioning scheme uses a GUID to identify each disk as well as each partition. Partition types are "well-known" GUIDs, but still 16-byte GUIDs. This gives every filesystem type a unique identifier without the need for a central registry.

An EFI/GUID partitioned disk contains an old-style MBR partition table in the first sector, which claims that the whole disk is covered by a special partition type. This prevents old OSes from messing with an EFI disk. Linux can handle EFI partitions on any machine it runs on. The only special problem is having a boot loader which is able to deal with them.
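The protective MBR is easy to recognise in a dump of the first sector. The sketch below relies on the standard MBR layout (four 16-byte entries starting at offset 446, the type byte at offset 4 of each entry, the 0x55 0xAA signature at the end) and on 0xEE as the type used for the protective entry; the function names are made up for this example.

```python
SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # start of the four MBR partition entries
ENTRY_SIZE = 16
TYPE_OFFSET = 4                # position of the type byte inside an entry
GPT_PROTECTIVE_TYPE = 0xEE

def mbr_partition_types(sector):
    """Return the type bytes of the four partition entries in an MBR sector."""
    if len(sector) < SECTOR_SIZE or bytes(sector[510:512]) != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    return [
        sector[PARTITION_TABLE_OFFSET + i * ENTRY_SIZE + TYPE_OFFSET]
        for i in range(4)
    ]

def has_protective_mbr(sector):
    """True if one of the entries carries the protective partition type."""
    return GPT_PROTECTIVE_TYPE in mbr_partition_types(sector)
```

To try it on a real disk, read the first 512 bytes of the device (or of an image file) and pass them to `has_protective_mbr`.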

About the FAT FS

Because all files are stored as singly linked lists, random access is very slow. This also makes file fragmentation very costly. On top of that, FAT uses a first-free-block allocation scheme, which further promotes fragmentation.
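A toy model shows why random access is expensive: to find the cluster holding a given byte offset, the chain has to be walked link by link from the start of the file. The table layout and the end-of-chain marker below are simplified assumptions, not the real on-disk FAT encoding.

```python
END_OF_CHAIN = -1   # simplified marker, real FAT uses reserved cluster values

def cluster_for_offset(fat, first_cluster, byte_offset, cluster_size):
    """Walk the chain from the first cluster to the one holding byte_offset."""
    cluster = first_cluster
    for _ in range(byte_offset // cluster_size):
        cluster = fat[cluster]          # one table lookup per cluster skipped
        if cluster == END_OF_CHAIN:
            raise ValueError("offset is past the end of the file")
    return cluster

# A fragmented four-cluster file: 2 -> 5 -> 3 -> 9
fat = {2: 5, 5: 3, 3: 9, 9: END_OF_CHAIN}
print(cluster_for_offset(fat, 2, 3 * 4096, 4096))   # prints 9
```

The number of lookups grows with the offset, so seeking near the end of a large file means traversing almost the whole chain.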

Inode based Filesystems (FFS)

An inode-based filesystem stores only the filename and a link to the inode in the directory. The inode then stores all the meta information about the file. This is what makes hard links possible.

For short files all blocks are linked directly from the inode. Longer files are created with indirect blocks. Even longer ones are stored with double or even triple indirect blocks.
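The effect of the indirection levels can be worked out with a little arithmetic. The numbers below assume an ext2-style inode with 12 direct pointers and 4-byte block numbers; other inode-based filesystems use different values, and real filesystems have additional limits that cap the file size below this pure addressing capacity.

```python
DIRECT_POINTERS = 12   # assumption: ext2-style inode
POINTER_SIZE = 4       # assumption: 32-bit block numbers

def level_capacities(block_size):
    """Number of data blocks reachable through each level of indirection."""
    per_block = block_size // POINTER_SIZE
    return [
        ("direct", DIRECT_POINTERS),
        ("single indirect", per_block),
        ("double indirect", per_block ** 2),
        ("triple indirect", per_block ** 3),
    ]

def level_for_block(logical_block, block_size):
    """Which level of indirection holds the given block of a file."""
    remaining = logical_block
    for name, capacity in level_capacities(block_size):
        if remaining < capacity:
            return name
        remaining -= capacity
    raise ValueError("block number beyond triple indirect range")

def addressable_bytes(block_size):
    """Total bytes addressable through all four levels."""
    return sum(capacity for _, capacity in level_capacities(block_size)) * block_size

print(level_for_block(11, 1024), "/", level_for_block(12, 1024))
print(addressable_bytes(1024))   # prints 17247252480, about 16 GiB
```

With 1k blocks only the first 12 KiB of a file are reached without any indirect block, which is why small files are so cheap to read.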

Inode based filesystems are very fragmentation resistant. This is the reason why there are no defragmenters for Linux.

Old FFS filesystems like UFS allow you to specify the physical geometry of the disks in order to optimize the physical allocation of the filesystem elements. Newer FFS implementations do not bother with this anymore, as there is nothing to be known about disk geometry anyway.

How to recover from accidents

_Overview_

Ask yourself: what has happened?

What is the lowest level where you have problems? Always fix the lowest level first.

How important is the data?

When was the last backup performed?

Create a plan of attack before you do anything else.

_Hardware Level_

The first indications are often console messages from the IDE/SCSI driver. If you catch a correctable error, you may be able to replace the drive before it actually breaks.

If you see BadCRC errors on a new system it may indicate a simple cabling problem.

The "dev xx:yz" elements in disk errors give the major and minor number of the device file affected by the error, and thus identify the partition.
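Translating such a field back into a device name is a table lookup. The sketch below assumes that the two numbers are printed in hexadecimal as major:minor and covers only the two classic IDE majors (3 for hda/hdb, 22 for hdc/hdd, with 64 minors per drive); the helper names are made up for this example.

```python
# Classic IDE numbering: each major covers two drives of 64 minors each.
IDE_MAJORS = {3: ("hda", "hdb"), 22: ("hdc", "hdd")}

def decode_dev(field):
    """Split a 'xx:yz' field into (major, minor), assuming hexadecimal."""
    major_text, minor_text = field.split(":")
    return int(major_text, 16), int(minor_text, 16)

def ide_device_name(major, minor):
    """Map an IDE major/minor pair to a name like 'hdb1'."""
    if major not in IDE_MAJORS or not 0 <= minor < 128:
        raise ValueError("not a classic IDE device number")
    drive = IDE_MAJORS[major][minor // 64]
    partition = minor % 64            # 0 means the whole disk
    return drive if partition == 0 else f"{drive}{partition}"

print(ide_device_name(*decode_dev("03:41")))   # prints hdb1
```

For anything beyond IDE, the device list in the kernel documentation is the place to look up the numbers.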

Use |e2fsck -c| to mark bad blocks and see what files are affected.

Check S.M.A.R.T. logs.

In any case make a full image (dd) backup of the disk.

For the image backup you may use |dd_rescue| (from www.garloff.de ...). It adapts its block size when it hits a problem, so it recovers as much data as possible without losing speed while reading is easy, and it has a progress bar.

Partition Table Corruption

If the filesystem can not be found, it may be "only" a problem with the partition table.

|fdisk -l| will show what is there.

Make a backup copy of the MBR. (dd is your friend)

|gpart -W /part.table /dev/hda| can scan the disk for filesystems and reconstruct the partition table. Old filesystems from old partitions still sitting on the disk may confuse gpart.

Filesystem Corruption Problem

Errors may be reported by |e2fsck| during quick boot check or during a full check.

EXT2/3 can also detect errors as it runs ... the action it should take in this case can be configured at mount time or through tune2fs. For laptops 'remount-ro' is advisable. Servers are better off with 'panic', as this allows the system to get back into a sensible state instead of limping along. Often such minor corruptions are fixed in the |e2fsck| phase.

In general, running |e2fsck| with -y (yes to everything) is fine, as you can normally not do anything other than say yes anyway. But |e2fsck| may move orphaned inodes and disconnected directories into 'lost+found', and this should be cleaned up before booting the system fully. The 'file' command can help to identify files. The locate database can help to identify the original location of a directory.

e2fsck will not notice blocks with wrong data which are part of a file as it does not maintain any CRCs.

Undeleting Files

In EXT3 unlink zeroes out the inode, so deleted files cannot be recovered. (This may be 'fixed' at some point.)

Undelete on a system level is not possible with EXT3.

Use userspace delete/undelete tools.

Oh and make backups.

|grep -ab regexp /dev/hda1 | awk -F: '{printf("%x\n", ($1+1023)/1024);}'| (use 4095,4096 for 4k blocks)

Gives the disk blocks where the regexp was found. Then use |lde| to examine the blocks visually.
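The same search can be sketched in Python, for example to try it out on an image file rather than on the live disk. This is an illustration only: the function name is made up, it works on data already read into memory, and it rounds down to the block containing the first byte of the match, which may differ by one from the value the one-liner prints.

```python
import re

def blocks_matching(data, pattern, block_size=1024):
    """Return the sorted block numbers in which a match of pattern starts."""
    return sorted({m.start() // block_size for m in re.finditer(pattern, data)})

# Tiny fake "disk": two occurrences of the search string between zero bytes.
disk = b"\0" * 2048 + b"secret" + b"\0" * 3000 + b"secret"
print(blocks_matching(disk, rb"secret"))                    # prints [2, 4]
print(blocks_matching(disk, rb"secret", block_size=4096))   # prints [0, 1]
```

For a real partition image the data would have to be read in chunks or memory-mapped instead of loaded in one piece.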

e2image

The |e2image| tool lets you create a backup of the inode table.

The latest (not yet released) debugfs can use the inode table from an e2image backup, which makes it possible to recover lost files. Even an accidental mkfs can be reverted to a large extent (the contents of the root directory will end up in lost+found).

It is good practice to run e2image every night.

S.M.A.R.T.

This is the internal health monitoring system of modern hard disks. It gives early warning of impending disk problems.

|smartctl| and |smartd| are your friends here.

Conclusion

Make backups. Save your sanity.

 

udev, a way to manage /dev from userspace

Friday, September 03, 2004 17:03 // SUCON'04, Technopark, Zurich, Switzerland

by Greg Kroah-Hartman

Most Unix systems have a device filesystem. So does Linux with devfs. There are three main problems with it.

The code is ugly and beyond repair

The namespace is not LSB compliant

The author of the code has been out of the loop for about two years.

A new solution has to be found, as the state of the /dev tree without some automatic management is not tenable. In Debian, for example, there are 18,000 static entries in there. On the other hand there are USB plug and play devices, which tend to get a different device name every time they are plugged in.

The only thing udev cannot do is detect a process trying to access a device node that does not exist and then load the relevant driver. This feature of devfs does not seem crucial though.

In the kernel 2.6 there are two main components which make a new and simple solution possible:

The kernel can call a program called |/sbin/hotplug| whenever new devices are connected to the system.

The sysfs filesystem (mounted under /sys) contains all information about devices known to the kernel.

Udev provides a small userspace daemon which manages the /dev tree. It can populate it with a small set of default devices like ttys at boot time and then go on to add all other devices known to the system. It is configurable via a simple text file with rules about the naming of the devices. These rules can be pretty sophisticated. USB devices can be identified by their vendor or product string as well as by any other property they provide. It is even possible to make udev run an external program which examines the device and then decides what the /dev entry should be called.

All distributions have adopted udev for their Linux 2.6 editions. There are some teething problems with distros not using the official udev helper scripts. The author himself maintains the Gentoo package.

Udev has to be started VERY early in the boot process, so that other programs can access the devices. Depending on the setup it may be necessary to add udev to the initrd. Volume managers and RAID setups were mentioned in this context.


 
