30 March 2008

NTFS vs. FATxx Data Recovery

By now, I've racked up some mileage with data recovery in FATxx and NTFS, using R-Studio (paid), GetDataBack (demo), manually via ye olde Norton DiskEdit (paid), and free Restoration, File Recovery 4 and Handy Recovery, and a pattern emerges.

Dispelling some myths 

NTFS has features that allow transactions to be reversed, and there's much talk of how it "preserves data" in the face of corruption.  But all it really preserves is the sanity of the file system and metadata; your actual file contents are not included in this scheme of things.

Further, measures such as the above, plus automated file system repair after bad exits from Windows, are geared to the interruption of sane file system activity.  They can do nothing to minimize the impact of insane file system activity, as happens when bad RAM corrupts addresses and contents of sector writes, nor can they ameliorate the impact of bad sectors encountered on reads (when the data is not in memory to write somewhere else, it can only be lost).

From an OS vendor's perspective, there's no reason to consider it a failing to not be able to handle bad RAM and bad sectors; after all, it's not the OS vendor's responsibility to work properly under these conditions.  But they occur in the real world, and from a user's perspective, it's best if they are handled as well as possible.

The best defence against this sort of corruption is redundancy of critical information, such as the duplication of FATs, or less obviously, the ability to deduce one set of metadata from another set of cues.  Comparison of these redundant metadata allows the integrity of the file system to be checked, and anomalies detected.

Random sector loss

Loss of sector contents to corruption or physical disk defects is often not random, but weighted towards those parts of the disk that are accessed (bad sectors) or written (bad sectors and corruption) the most often.  This enlarges the importance of critical parts of the file system that do not change location and that are often accessed.

When this happens, there are generally three levels of recovery.

The first level, and easiest, is to simply copy off the files that are not corrupted.  Before doing so, you have to exclude bad hardware that can corrupt the process (e.g. bad RAM), and then you make a beeline for your most important files, copying them off the stricken hard drive - even before you surface scan the drive to see if there are in fact failing sectors on it, or attempt a full partition image copy.  This way, you get at least some data off even if the hard drive has less than an hour before dying completely.

You may find some locations can't be copied for various reasons that break down to invalid file system structure, physical bad sectors, overwritten contents, or cleanly missing files that have been erased.  If the hard drive is physically bad, you'd then attempt a partition copy to a known-good drive.  If you want to recover cleanly erased files, or attempt correction of corrupted file systems, then this partition copy must include everything in that space, rather than just the files as defined by the existing file system.

The second level of recovery is where you regain access to lost files by repairing the file system's logic and structure.  This includes finding partitions and rebuilding partition tables, finding lost directory trees and rebuilding missing root directories, repairing mismatched FATs and so on.  I generally do this manually for FATxx, whereas tools like R-Studio, GetDataBack etc. attempt to automate the process for both FATxx and NTFS. 

In the case of FATxx, the most common requirements are to rebuild a lost root directory by creating scratch entries pointing to all discovered directories that have .. (i.e. root) as their parent, and to build a matched and valid pair of FATs by selectively copying sectors from one FAT to the other.
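As a rough sketch of the FAT-matching step, the following assumes both FAT copies have already been read from a partition image into memory, and that sectors are 512 bytes.  The function names are made up for illustration; deciding which copy of a mismatched sector is the sane one is still a manual judgement call that the code does not attempt.

```python
SECTOR = 512  # bytes per sector (assumed)


def mismatched_sectors(fat1: bytes, fat2: bytes, sector_size: int = SECTOR) -> list:
    """Return the sector indexes (within the FAT) at which the two copies differ."""
    if len(fat1) != len(fat2):
        raise ValueError("FAT copies must be the same length")
    return [
        offset // sector_size
        for offset in range(0, len(fat1), sector_size)
        if fat1[offset:offset + sector_size] != fat2[offset:offset + sector_size]
    ]


def merge_fats(fat1: bytes, fat2: bytes, take_from_second, sector_size: int = SECTOR) -> bytes:
    """Build one FAT: the listed sectors come from fat2, everything else from fat1."""
    merged = bytearray(fat1)
    for index in take_from_second:
        start = index * sector_size
        merged[start:start + sector_size] = fat2[start:start + sector_size]
    return bytes(merged)
```

The merged result would then be written back as both FAT copies on a working copy of the volume, never on the stricken original.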

Recovered data is often in perfect condition, but may be corrupted if file system cues are incomplete, or if material was overwritten.  Bad sectors announce their presence, but if bad RAM had corrupted the contents of what was written to disk, then these files will pass file system structural checks, yet contain corrupted data.

The third level of logical data recovery is the most desperate, with the poorest results.  This is where you have lost file system structural cues to cluster chaining and/or the directory entries that describe the files.

Where cluster chaining information is lost, one generally assumes sequential order of clusters (i.e. no fragmentation) terminated by the start of other files or directories, as cued by found directories and the start cluster addresses defined by the entries within these.  In the case of FATxx, I generally chain the entire volume as one contiguous cross-linked file by pasting "flat FATs" into place.  Files can be copied off a la first level recovery once this is done, but no file system writes should be allowed.
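To make the "flat FAT" idea concrete, here is a minimal sketch that builds a FAT16 table in which every data cluster simply points to the next one, with the last cluster marked as end of chain.  The function name and the default media descriptor byte are assumptions for illustration; a real repair would also have to pad the table to the FAT's full sector length.

```python
import struct

FAT16_EOC = 0xFFFF          # end-of-chain marker
FAT16_MAX_CLUSTERS = 65524  # largest data cluster count a FAT16 volume can have


def flat_fat16(total_clusters: int, media_descriptor: int = 0xF8) -> bytes:
    """Build a FAT16 table that chains every data cluster to the next one."""
    if not 1 <= total_clusters <= FAT16_MAX_CLUSTERS:
        raise ValueError("cluster count out of range for FAT16")
    last = total_clusters + 1  # data clusters are numbered 2 .. total_clusters + 1
    entries = [0xFF00 | media_descriptor, 0xFFFF]  # reserved entries 0 and 1
    for cluster in range(2, last + 1):
        entries.append(cluster + 1 if cluster < last else FAT16_EOC)
    return struct.pack(f"<{len(entries)}H", *entries)
```

With such a table in place, any directory entry's start cluster leads into one long chain that runs to the end of the volume, which is why the volume must then be treated as read-only.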

If directory entries are lost, then the start of files and directories can be detected by cues within the missing material itself.  Subdirectories in FATxx start with . and .. entries defining self and parent, respectively, and these are the cues that "search for directories" generally use in DiskEdit and others.  Many file types contain known header bytes at known offsets (e.g. MZ for Windows code files) and this is used to recover "files" from raw disk by Handy Recovery and others - a particularly useful tactic for recovering photos from camera storage, especially if the size is typical and known.
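Both kinds of cue can be sketched in a few lines.  The code below is an illustration only, not how DiskEdit or Handy Recovery actually work: it assumes the data area of the volume has been read into memory as one image, that the cluster size is known, and it checks only two example signatures (MZ, and the FF D8 FF bytes that begin JPEG files).

```python
DOT = b".".ljust(11)       # 8.3 name field of the "." entry, space padded
DOTDOT = b"..".ljust(11)   # 8.3 name field of the ".." entry
ATTR_DIRECTORY = 0x10      # directory bit in the attribute byte at offset 11

# Example signatures only; a real tool would know many more.
SIGNATURES = {b"MZ": "exe/dll", b"\xff\xd8\xff": "jpeg"}


def looks_like_subdirectory(cluster: bytes) -> bool:
    """True if the cluster begins with '.' and '..' directory entries."""
    if len(cluster) < 64:
        return False
    first, second = cluster[0:32], cluster[32:64]
    return bool(
        first[:11] == DOT
        and second[:11] == DOTDOT
        and first[11] & ATTR_DIRECTORY
        and second[11] & ATTR_DIRECTORY
    )


def scan_clusters(image: bytes, cluster_size: int):
    """Yield (index, kind) for each cluster in the image that starts with a known cue."""
    for offset in range(0, len(image), cluster_size):
        cluster = image[offset:offset + cluster_size]
        if looks_like_subdirectory(cluster):
            yield offset // cluster_size, "directory"
            continue
        for magic, kind in SIGNATURES.items():
            if cluster.startswith(magic):
                yield offset // cluster_size, kind
                break
```

The indexes are relative to the start of the image, so they still need the usual offset applied to become FAT cluster numbers.  Short signatures such as MZ will also produce false hits inside unrelated data.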

Results

I have found that when a FATxx volume suffers bad sectors, it is typical to lose 5M to 50M of material from a file set ranging from 20G to 200G in size.  The remainder is generally perfectly recovered, and most recovery is level one stuff, complicated only by the need to step over "disk error" messages and retry bog-downs.

When level two recovery is needed, the results are often as good as the above, but the risks of corrupted contents within recovered files are higher.  The risk is higher if bad RAM has been a factor, and is particularly high if a "flat FAT" has to be assumed.

In contrast, when I use R-Studio and similar tools to recover files from NTFS volumes with similar damage, I typically get a very small directory tree that contains little that is useful.  Invariably I have to use level three methods to find the data I want.  Instead of getting 95% of files back in good (if not perfect) condition, I'll typically lose 95%, and the 5% I get is typically not what I am looking for anyway.

Level three recovery is generally a mess.  Flat-FAT assumptions ensure multi-cluster files are often corrupted, and loss of meaningful file names, directory paths and actual file lengths often make it hard to interpret and use the recovered files (or "files").

Why does mild corruption of FATxx typically return 90%+ of material in good condition, whereas NTFS typically returns garbage?  It appears as if the directory information is particularly easy to lose in NTFS.  I don't believe that all the tools I've used are unable to match the manual logic I use when repairing FATxx file systems via DiskEdit.

Survivability strategies

Sure, backups are the best way to mitigate future risks of data loss, but realistically, folks ask for data recovery so often that one should look beyond that, and set up file systems and hard drive volumes with an eye to survivability and recovery.

Data corruption occurs during disk writes, and there may be a relationship between access and bad sectors.  So the first strategy is to keep your data where there is less disk write activity, and disk access in general.  That means separating the OS partition, with its busy temp, swap and web cache writes, from the data you wish to survive.

At this point, you have opposing requirements.  For performance, you'd want to locate the data volume close to the system partition, but survivability would be best if it was far away, where the heads seldom go.  The solution to this is to locate the data close, and automate a daily unattended backup that zips the data set into archives kept on a volume at the far end of the hard drive, keeping the last few of these on a FIFO basis.
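One way such a FIFO backup might look is sketched below.  The function name, the "data-" archive prefix and the default of five retained archives are all assumptions for illustration; scheduling it to run daily is left to whatever task scheduler the system provides.

```python
import shutil
import time
from pathlib import Path


def rotate_backup(source: Path, dest: Path, keep: int = 5, stamp: str = "") -> Path:
    """Zip `source` into `dest`, then delete all but the newest `keep` archives."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    dest.mkdir(parents=True, exist_ok=True)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    archive = shutil.make_archive(str(dest / f"data-{stamp}"), "zip", root_dir=str(source))
    archives = sorted(dest.glob("data-*.zip"))  # timestamped names sort oldest first
    for old in archives[:-keep]:
        old.unlink()
    return Path(archive)
```

A call such as rotate_backup(Path("D:/Data"), Path("F:/Backups")) would then be made once a day, with both paths being placeholders for the data volume and the far-end volume respectively.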

One strategy to simplify data recovery is to use a small volume to contain only your most important files.  That means level three recovery has less chaff to wade through (consider picking out your 1 000 photos from 100 000 web cache pictures in the same mass of recovered nnnnn.JPG files), and you can peel off the whole volume as a manageable slab of raw sectors to paste onto a known-good hard drive for recovery while the rest of the system goes back to work in the field.

The loss of cluster chaining information means that any file longer than one cluster may contain garbage.  FATxx stores this chaining information within the FATs, which also cue which clusters are unused, which are bad, and which terminate data cluster chains.  NTFS stores this information more compactly; cluster runs are stored as (start, length) value pairs, while a single bitmap holds the used/free status of all data clusters, somewhat like a 1-bit FAT.
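The difference between the two schemes can be shown with a simplified sketch.  It treats the FAT as a plain list of 16-bit entries and the NTFS run list as already-decoded pairs with absolute start clusters; real NTFS run lists are packed in a variable-length encoding with relative offsets, which is not modelled here.  The function names are illustrative.

```python
def clusters_from_fat(fat, start: int, eoc_min: int = 0xFFF8) -> list:
    """Walk a FAT16-style chain from `start` until an end-of-chain marker."""
    chain, seen = [], set()
    cluster = start
    while cluster < eoc_min:
        # A loop, a reserved value or an out-of-range link means the chain is damaged.
        if cluster in seen or cluster < 2 or cluster >= len(fat):
            raise ValueError(f"broken chain at cluster {cluster}")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


def clusters_from_runs(runs) -> list:
    """Expand (start, length) runs into a flat list of cluster numbers."""
    return [start + i for start, length in runs for i in range(length)]
```

For a file occupying clusters 2 to 4 and then 7 to 8, the FAT needs five correct entries, each in its own slot, while the run list needs only the two pairs (2, 3) and (7, 2).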

Either way, this chaining information is frequently written and may not move on the disk, and both of these factors increase the risk of loss.  A strategy to mitigate this common scenario is to deliberately favour large cluster size for small yet crucial files, so that ideally, all data is held in the first and only data cluster.  This is why I still often use FAT16, rather than FAT32, for small data volumes holding small files.

Another strategy is to avoid storing material in the root directory itself (for some reason, this is often trashed, especially by some malware payloads on C:) and to also avoid long and deeply-nested paths.  Some recovery methods, e.g. using ReadNTFS on a stricken NTFS volume, require you to navigate through each step of a long path, which is tedious due to ReadNTFS's slowness, the need to step over bad sector retries along the way, and the risks of the path being broken by a trashed directory along the way.

Some recovery tools (including anything DOS-based, such as DiskEdit and ReadNTFS) can't be safely used beyond the 137G line, so it is best to keep crucial material within this limit.  Because ReadNTFS is one of the few tools that access NTFS files independently of the NTFS.sys driver, it may be the only way into NTFS volumes corrupted in ways that crash NTFS.sys!
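Assuming the 137G line here is the usual 28-bit LBA addressing limit (the text above does not say so explicitly), the figure comes from 2^28 addressable sectors of 512 bytes each.  The helper below is a hypothetical check of whether a volume, given its start sector and length in sectors, lies wholly within that limit.

```python
SECTOR_BYTES = 512
LBA28_SECTORS = 2 ** 28                           # sectors addressable with 28 bits
LBA28_LIMIT_BYTES = LBA28_SECTORS * SECTOR_BYTES  # 137,438,953,472 bytes


def within_lba28(start_sector: int, sector_count: int) -> bool:
    """True if the whole span of sectors lies below the 28-bit LBA limit."""
    return start_sector + sector_count <= LBA28_SECTORS
```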

Given the poor results I see when recovering data from NTFS, I'd have to recommend using FATxx rather than NTFS as a data survivability strategy.  If readers can attain better results with other recovery tools for NTFS, then please describe your mileage with these in the comments section!

2 comments:

gggg said...

Your post is truly excellent, and shows original thought. Most all forums and articles push NTFS because they base all their info on what MS feeds them. But I often have problems of file corruption on NTFS, and NEVER have on FAT32. Have you noticed this greater frequency of problems as well? NTFS is supposed to be more reliable but never has been for me. ( Some of the blame might be due to 'converting' from FAT32 to NTFS..??? But on other systems, I began with NTFS from the beginning, and still encountered problems. I have never seen a blue screen on Win2000/FAT32, - and I use it all the time for many years!! ) I will point others to your blog.

Chris Quirke said...

Hi gggg!

On NTFS vs. FATxx, see...

http://cquirke.mvps.org/ntfs.htm

I think you hit the nail on the head when you said "...'converting' from FAT32 to NTFS". There are two problems with that:

1) Default permissions

When you install an NT-family OS on NTFS, it sets appropriate access control for the various parts of the OS etc. and this helps protect the OS against attack. This doesn't happen on FATxx of course, so when you then convert to NTFS, you can't get appropriate access control.

2) Mis-aligned volumes

When volumes and partitions are created, they may or may not be aligned correctly, from NTFS's perspective. If not, then when converted to NTFS, you get 512 byte clusters (i.e. each cluster is one sector). This impacts performance and undermines stability.

I suspect (2) is the cause of your bad mileage with NTFS. Normally, the mileage I'd expect you to get is an apparent lack of problems with NTFS (because they are papered over), while in reality it will be as likely to get corrupted from bad hardware as FATxx (no more or less).

The worse mileage kicks in when you try to repair the file system (poor tools) and recover data (invariably, loss of most directory info, in my experience).

So if you want to create FATxx volumes that may be converted to NTFS later, you need to ensure they are "properly aligned". If you use BING as your partitioning tool, you will see this option.

What pattern of problems with NTFS have you been seeing? Are there bad exits involved, and if so, why? What hardware? Are these "one big C: volume" systems?