30 March 2008

NTFS vs. FATxx Data Recovery


By now, I've racked up some mileage with data recovery in FATxx and NTFS, using R-Studio (paid), GetDataBack (demo), manual work via ye olde Norton DiskEdit (paid), and the free Restoration, File Recovery 4 and Handy Recovery - and a pattern emerges.

Dispelling some myths 

NTFS has features that allow transactions to be reversed, and there's much talk of how it "preserves data" in the face of corruption.  But all it really preserves is the sanity of the file system and metadata; your actual file contents are not part of this scheme of things.

Further, measures such as the above, plus automated file system repair after bad exits from Windows, are geared to the interruption of sane file system activity.  They can do nothing to minimize the impact of insane file system activity, as happens when bad RAM corrupts addresses and contents of sector writes, nor can they ameliorate the impact of bad sectors encountered on reads (when the data is not in memory to write somewhere else, it can only be lost).

From an OS vendor's perspective, there's no reason to consider it a failing not to be able to handle bad RAM and bad sectors; after all, it's not the OS vendor's responsibility to work properly under these conditions.  But these things do occur in the real world, and from a user's perspective, it's best if they are handled as well as possible.

The best defence against this sort of corruption is redundancy of critical information, such as the duplication of FATs, or less obviously, the ability to deduce one set of metadata from another set of cues.  Comparison of these redundant metadata allows the integrity of the file system to be checked, and anomalies detected.
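
As a small illustration of how far that redundancy can take you, here is a minimal sketch in Python - assuming a raw image of a FAT12/16 volume named fat16.img, and the standard BIOS Parameter Block field offsets - that compares the two FAT copies sector by sector and lists where they disagree:

    # Minimal sketch: compare the two FAT copies on a FAT12/16 volume
    # image and report which FAT sectors disagree.  Field offsets are the
    # standard BIOS Parameter Block; "fat16.img" is a hypothetical raw dump.
    import struct

    def compare_fats(image_path):
        with open(image_path, "rb") as f:
            boot = f.read(512)
            bytes_per_sector = struct.unpack_from("<H", boot, 11)[0]
            reserved_sectors = struct.unpack_from("<H", boot, 14)[0]
            num_fats         = boot[16]
            sectors_per_fat  = struct.unpack_from("<H", boot, 22)[0]
            assert num_fats >= 2, "only one FAT - nothing to compare"

            mismatches = []
            for s in range(sectors_per_fat):
                f.seek((reserved_sectors + s) * bytes_per_sector)
                fat1 = f.read(bytes_per_sector)
                f.seek((reserved_sectors + sectors_per_fat + s) * bytes_per_sector)
                fat2 = f.read(bytes_per_sector)
                if fat1 != fat2:
                    mismatches.append(s)
            return mismatches

    bad = compare_fats("fat16.img")
    print("FAT sectors that disagree:", bad or "none - the FATs are in sync")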

Random sector loss

Loss of sector contents to corruption or physical disk defects is often not random, but weighted towards those parts of the disk that are accessed (bad sectors) or written (bad sectors and corruption) the most often.  This magnifies the importance of critical parts of the file system that do not change location and that are often accessed.

When this happens, there are generally three levels of recovery.

The first level, and easiest, is to simply copy off the files that are not corrupted.  Before doing so, you have to exclude bad hardware that can corrupt the process (e.g. bad RAM), and then you make a beeline for your most important files, copying them off the stricken hard drive - even before you surface scan the drive to see if there are in fact failing sectors on it, or attempt a full partition image copy.  This way, you get at least some data off even if the hard drive has less than an hour before dying completely.
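
As a rough sketch of that "beeline" pass (the folder list and destination below are placeholders, and a real job would want logging and retries as well), the idea reduces to: copy what reads cleanly, note what doesn't, and never stall on an error:

    # Sketch: grab a priority list of folders first, skipping anything
    # that won't read instead of stalling on it.  Paths are placeholders.
    import os, shutil

    PRIORITY = [r"X:\Docs", r"X:\Photos", r"X:\Mail"]
    DEST     = r"D:\Rescued"

    def rescue(folders, dest):
        failed = []
        for folder in folders:
            for root, _dirs, files in os.walk(folder):
                for fn in files:
                    src = os.path.join(root, fn)
                    out = os.path.join(dest, os.path.basename(folder),
                                       os.path.relpath(src, folder))
                    os.makedirs(os.path.dirname(out), exist_ok=True)
                    try:
                        shutil.copy2(src, out)
                    except OSError:
                        failed.append(src)      # note it and move on
        return failed

    print("Could not read:", rescue(PRIORITY, DEST))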

You may find some locations can't be copied for various reasons that break down to invalid file system structure, physical bad sectors, overwritten contents, or cleanly missing files that have been erased.  If the hard drive is physically bad, you'd then attempt a partition copy to a known-good drive.  If you want to recover cleanly erased files, or attempt correction of corrupted file systems, then this partition copy must include everything in that space, rather than just the files as defined by the existing file system.
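
For the raw partition copy itself, a dedicated tool such as GNU ddrescue is the sane choice, but the core idea is simple enough to sketch: read in large chunks, fall back to sector-by-sector on errors, and pad unreadable sectors with zeroes so everything else stays at its correct offset.  The device and image paths below are placeholders:

    # Sketch of a "copy everything, pad what won't read" imaging pass.
    # "/dev/sdb1" and "rescue.img" are placeholders; GNU ddrescue does
    # this job far better in practice (retry logs, reverse passes, etc.).
    SECTOR = 512

    def image_with_skips(src_path, dst_path, sectors_per_read=128):
        chunk = SECTOR * sectors_per_read
        bad = 0
        with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
            pos = 0
            while True:
                src.seek(pos)
                try:
                    data = src.read(chunk)
                except OSError:
                    # Bulk read failed: retry sector by sector, padding bad ones
                    data = b""
                    for s in range(sectors_per_read):
                        src.seek(pos + s * SECTOR)
                        try:
                            data += src.read(SECTOR)
                        except OSError:
                            data += b"\x00" * SECTOR   # keep later data aligned
                            bad += 1
                if not data:
                    break                              # end of the source device
                dst.write(data)
                pos += chunk
        return bad

    print("Unreadable sectors padded:", image_with_skips("/dev/sdb1", "rescue.img"))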

The second level of recovery is where you regain access to lost files by repairing the file system's logic and structure.  This includes finding partitions and rebuilding partition tables, finding lost directory trees and rebuilding missing root directories, repairing mismatched FATs and so on.  I generally do this manually for FATxx, whereas tools like R-Studio, GetDataBack etc. attempt to automate the process for both FATxx and NTFS. 

In the case of FATxx, the most common requirements are to rebuild a lost root directory by creating scratch entries pointing to all discovered directories that have .. (i.e. root) as their parent, and to build a matched and valid pair of FATs by selectively copying sectors from one FAT to the other.
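
To give a flavour of the root-rebuild step, here is a rough sketch that scans cluster starts for the . and .. signature entries that begin every FATxx subdirectory, and reports those whose .. entry points at cluster 0 (i.e. the root) - exactly the directories a rebuilt root would need entries for.  The geometry constants are placeholders that would come from the real boot sector:

    # Sketch: scan cluster starts for orphaned FATxx subdirectories whose
    # ".." entry points at the root (cluster 0) - the candidates a rebuilt
    # root directory needs entries for.  Entry offsets are the standard
    # 32-byte FAT directory layout; the geometry values are placeholders.
    import struct

    BYTES_PER_SECTOR    = 512
    SECTORS_PER_CLUSTER = 64            # example geometry: 32K clusters
    DATA_AREA_OFFSET    = 0x40000       # byte offset of cluster 2 (placeholder)
    CLUSTER_BYTES       = BYTES_PER_SECTOR * SECTORS_PER_CLUSTER

    def find_root_children(image_path, max_clusters=100000):
        hits = []
        with open(image_path, "rb") as f:
            for n in range(max_clusters):
                f.seek(DATA_AREA_OFFSET + n * CLUSTER_BYTES)
                head = f.read(64)                    # first two 32-byte entries
                if len(head) < 64:
                    break
                # A subdirectory starts with "." then ".." 8.3 entries
                if head[0:11] == b".          " and head[32:43] == b"..         ":
                    parent = struct.unpack_from("<H", head, 32 + 26)[0]
                    if parent == 0:                  # ".." cluster 0 means root
                        hits.append(2 + n)           # data clusters number from 2
        return hits

    print("Directories parented by root:", find_root_children("volume.img"))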

Recovered data is often in perfect condition, but may be corrupted if file system cues are incomplete, or if material was overwritten.  Bad sectors announce their presence, but if bad RAM had corrupted the contents of what was written to disk, then these files will pass file system structural checks, yet contain corrupted data.

The third level of logical data recovery is the most desperate, with the poorest results.  This is where you have lost file system structural cues to cluster chaining and/or the directory entries that describe the files.

Where cluster chaining information is lost, one generally assumes sequential order of clusters (i.e. no fragmentation) terminated by the start of other files or directories, as cued by found directories and the start cluster addresses defined by the entries within these.  In the case of FATxx, I generally chain the entire volume as one contiguous cross-linked file by pasting "flat FATs" into place.  Files can be copied off a la first level recovery once this is done, but no file system writes should be allowed.
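
For the curious, a "flat FAT" for FAT16 amounts to the following sketch - every entry simply points at the next cluster, with the usual reserved entries at the front.  The cluster count below is a placeholder; the result gets pasted over both FAT copies with a disk editor, after which the volume is treated as strictly read-only:

    # Sketch: build a "flat FAT" for FAT16 - every entry points at the
    # next cluster, so the whole data area chains as one contiguous,
    # cross-linked file.  The cluster count is a placeholder; the output
    # would be pasted over both FAT copies with a disk editor, and the
    # volume treated as read-only from then on.
    import struct

    def flat_fat16(total_clusters):
        fat = bytearray()
        fat += struct.pack("<H", 0xFFF8)        # entry 0: media descriptor
        fat += struct.pack("<H", 0xFFFF)        # entry 1: reserved
        for cluster in range(2, total_clusters + 2):
            nxt = cluster + 1
            if nxt >= total_clusters + 2:
                nxt = 0xFFFF                    # last cluster: end of chain
            fat += struct.pack("<H", nxt)
        return bytes(fat)

    with open("flat_fat.bin", "wb") as out:
        out.write(flat_fat16(65524))            # 65524 = FAT16's cluster ceiling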

If directory entries are lost, then the start of files and directories can be detected by cues within the missing material itself.  Subdirectories in FATxx start with . and .. entries defining self and parent, respectively, and these are the cues that the "search for directories" functions in DiskEdit and other tools generally use.  Many file types contain known header bytes at known offsets (e.g. MZ for Windows code files), and this is what Handy Recovery and others use to recover "files" from the raw disk - a particularly useful tactic for recovering photos from camera storage, especially if the size is typical and known.
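
A minimal header-carving sketch along those lines might look like this; the signature list, the cluster size and the 8M carve cap are assumptions, and real carvers also parse per-format footers and embedded lengths to get file sizes right:

    # Minimal header-carving sketch in the spirit of the raw-recovery
    # tools above: check each cluster boundary for a known signature and
    # dump a fixed-size slab per hit.  The signatures, 32K cluster size
    # and 8M cap are assumptions, not a complete carver.
    SIGNATURES = {
        b"\xFF\xD8\xFF": ".jpg",    # JPEG
        b"MZ":           ".exe",    # Windows code files
        b"PK\x03\x04":   ".zip",    # ZIP and friends
    }
    CLUSTER   = 32 * 1024
    MAX_CARVE = 8 * 1024 * 1024

    def carve(image_path, out_prefix="carved"):
        count = 0
        with open(image_path, "rb") as f:
            pos = 0
            while True:
                f.seek(pos)
                head = f.read(8)
                if not head:
                    break
                for sig, ext in SIGNATURES.items():
                    if head.startswith(sig):
                        f.seek(pos)
                        with open(f"{out_prefix}{count:05d}{ext}", "wb") as out:
                            out.write(f.read(MAX_CARVE))
                        count += 1
                        break
                pos += CLUSTER
        return count

    print(carve("volume.img"), "candidate files carved")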

Results

I have found that when a FATxx volume suffers bad sectors, it is typical to lose 5M to 50M of material from a file set ranging from 20G to 200G in size.  The remainder is generally perfectly recovered, and most recovery is level one stuff, complicated only by the need to step over "disk error" messages and retry bog-downs.

When level two recovery is needed, the results are often as good as the above, but the risks of corrupted contents within recovered files are higher.  The risk is higher if bad RAM has been a factor, and is particularly high if a "flat FAT" has to be assumed.

In contrast, when I use R-Studio and similar tools to recover files from NTFS volumes with similar damage, I typically get a very small directory tree that contains little that is useful.  Invariably I have to use level three methods to find the data I want.  Instead of getting 95% of files back in good (if not perfect) condition, I'll typically lose 95%, and the 5% I get is typically not what I am looking for anyway.

Level three recovery is generally a mess.  Flat-FAT assumptions ensure multi-cluster files are often corrupted, and loss of meaningful file names, directory paths and actual file lengths often make it hard to interpret and use the recovered files (or "files").

Why does mild corruption of FATxx typically return 90%+ of material in good condition, whereas NTFS typically returns garbage?  It appears as if the directory information is particularly easy to lose in NTFS.  I find it hard to believe that every tool I've used is simply unable to match the manual logic I apply when repairing FATxx file systems via DiskEdit.

Survivability strategies

Sure, backups are the best way to mitigate future risks of data loss, but realistically, folks ask for data recovery so often that one should look beyond that, and set up file systems and hard drive volumes with an eye to survivability and recovery.

Data corruption occurs during disk writes, and there may be a relationship between access and bad sectors.  So the first strategy is to keep your data where there is less disk write activity, and disk access in general.  That means separating the OS partition, with its busy temp, swap and web cache writes, from the data you wish to survive.

At this point, you have opposing requirements.  For performance, you'd want to locate the data volume close to the system partition, but survivability would be best if it was far away, where the heads seldom go.  The solution to this is to locate the data close, and automate a daily unattended backup that zips the data set into archives kept on a volume at the far end of the hard drive, keeping the last few of these on a FIFO basis.
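
As a sketch of that unattended backup - the paths and retention count are placeholders, and on a real system this would be launched daily by Task Scheduler:

    # Sketch of the daily unattended backup: zip the data set to a volume
    # at the far end of the drive, keeping only the last few archives.
    # D:\Data, E:\Backups and KEEP are placeholders.
    import os, glob, time, zipfile

    SRC, DST, KEEP = r"D:\Data", r"E:\Backups", 5

    def daily_backup():
        name = os.path.join(DST, time.strftime("data-%Y%m%d.zip"))
        with zipfile.ZipFile(name, "w", zipfile.ZIP_DEFLATED) as z:
            for root, _dirs, files in os.walk(SRC):
                for fn in files:
                    full = os.path.join(root, fn)
                    z.write(full, os.path.relpath(full, SRC))
        # FIFO retention: drop the oldest archives beyond the last KEEP
        for old in sorted(glob.glob(os.path.join(DST, "data-*.zip")))[:-KEEP]:
            os.remove(old)

    daily_backup()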

One strategy to simplify data recovery is to use a small volume to contain only your most important files.  That means level three recovery has less chaff to wade through (consider picking out your 1 000 photos from 100 000 web cache pictures in the same mass of recovered nnnnn.JPG files), and you can peel off the whole volume as a manageable slab of raw sectors to paste onto a known-good hard drive for recovery while the rest of the system goes back to work in the field.

The loss of cluster chaining information means that any file longer than one cluster may contain garbage.  FATxx stores this chaining information within the FATs, which also cue which clusters are unused, which are bad, and which terminate data cluster chains.  NTFS stores this information more compactly; cluster runs are stored as (start, length) pairs, and a single bitmap holds the used/free status of all data clusters, somewhat like a 1-bit FAT.
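
To make the contrast concrete, here is a toy sketch with made-up cluster numbers, showing the two bookkeeping styles side by side: a FAT chain walked entry by entry, versus an NTFS-style run list expanded from (start, length) pairs:

    # Toy contrast of the two bookkeeping styles, with made-up numbers:
    # FATxx chains clusters entry by entry inside the FAT, while NTFS
    # stores compact (start, length) runs plus a volume-wide bitmap.

    # FATxx: fat[n] holds the cluster after n (0xFFFF marks end of chain)
    fat = {100: 101, 101: 102, 102: 0xFFFF}     # a three-cluster file

    def walk_fat_chain(fat, start):
        chain, c = [], start
        while c != 0xFFFF:
            chain.append(c)
            c = fat[c]
        return chain

    # NTFS: the same file described as a single run of (start, length)
    runs = [(100, 3)]

    def expand_runs(runs):
        return [start + i for start, length in runs for i in range(length)]

    assert walk_fat_chain(fat, 100) == expand_runs(runs) == [100, 101, 102]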

Either way, this chaining information is frequently written and may not move on the disk, and both of these factors increase the risk of loss.  A strategy to mitigate this common scenario is to deliberately favour large cluster size for small yet crucial files, so that ideally, all data is held in the first and only data cluster.  This is why I still often use FAT16, rather than FAT32, for small data volumes holding small files.
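
A quick worked example of that trade-off, using the FAT16 ceiling of roughly 65524 clusters, which is what forces cluster size up as the volume grows:

    # Worked example: FAT16 tops out around 65524 clusters, so volume
    # size fixes cluster size - and any file no bigger than one cluster
    # needs no chaining information at all.
    for mb in (511, 1023, 2047):
        volume_bytes = mb * 1024 * 1024
        cluster = 512
        while volume_bytes // cluster > 65524:   # grow cluster size to fit
            cluster *= 2
        print(f"{mb:>5} MB FAT16 volume -> {cluster // 1024}K clusters; "
              f"files up to {cluster // 1024}K live in their first cluster")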

Another strategy is to avoid storing material in the root directory itself (for some reason, this is often trashed, especially by some malware payloads on C:) and to also avoid long and deeply-nested paths.  Some recovery methods, e.g. using ReadNTFS on a stricken NTFS volume, require you to navigate through each step of a long path, which is tedious due to ReadNTFS's slowness and the need to step over bad sector retries, and risky in that a single trashed directory along the way breaks the whole path.

Some recovery tools (including anything DOS-based, such as DiskEdit and ReadNTFS) can't be safely used beyond the 137G line, so it is best to keep crucial material within this limit.  Because ReadNTFS is one of the only tools that accesses NTFS files independently of the NTFS.sys driver, it may be the only way into NTFS volumes corrupted in ways that crash NTFS.sys!

Given the poor results I see when recovering data from NTFS, I'd have to recommend using FATxx rather than NTFS as a data survivability strategy.  If readers can attain better results with other recovery tools for NTFS, then please describe your mileage with these in the comments section!

27 March 2008

Why "One Bad Sector" Often Kills You


Has it ever seemed to you, that if there's "one bad sector" on a hard drive, it will often be where it can hurt you the most?

Well, there may be reasons for that - and the take-home should affect the way file systems such as NTFS are designed.

As it is, when I see early bad sectors, they are often in frequently-accessed locations.  This isn't because I only look for bad sectors when a PC fails; I routinely do surface scans whenever PCs come in for any sort of work.  It's good CYA practice to do this, saving you from making excuses when what you were asked to do causes damage due to unexpected pre-existing hardware damage.

Why might frequently-accessed sectors fail?

You could postulate physical wear of the disk surface, especially if the air space is polluted with particulate matter, e.g. from a failed filter or seal, or debris thrown up from a head strike.  This might wear the disk surface most wherever the heads were most often positioned.

You could postulate higher write traffic to increase the risk of a poor or failed write that invalidates the sector.

Or you could note that if a head crash is going to happen, it's most likely to happen where the heads are most often positioned.

All of the above is worse if the frequently-accessed material is never relocated by file updates, or defrag.  That may apply to files that are always "in use", as well as structural elements of the file system such as FATs, NTFS MFT, etc. 

Core code files may also be candidates if they have to be repeatedly re-read after being paged out of RAM - suggesting a risk mechanism that involves access rather than writes, if so.

As it is, I've often seen "one bad sector" within a crucial registry hive, or one of the core code files back in the Win9x days.  Both of these cause particular failure patterns that I've seen often enough to recognize, e.g. the Win9x system that rolls smoothly from boot to desktop and directly to shutdown, with no error messages, that happens when one of the core code files is bent.

I've often seen "one bad sector" within frequently-updated file system elements, such as FATs, NTFS "used sectors" bitmap, root directory, etc. which may explain why data recovery from bad-sector-stricken NTFS is so often unsatisfactory. 

But that's another post...

24 March 2008

Google Desktop vs. Vista Search


Google accuses Microsoft of anti-competitive behaviour, in that Vista currently leverages its own desktop search over Google Desktop and other alternatives.  This issue is well-covered elsewhere, but some thoughts come to mind...

Isn't Google hardwired as the search engine within Apple's Safari?

Isn't Apple pushing Safari via the "software update" process as bundled with iTunes and QuickTime, even if the user didn't have Safari installed to begin with?

I'm seeing a lot of black pots and kettles here.

More to the point: If an alternate search is chosen by the user or system builder, is the built-in Microsoft indexer stripped out?  This article suggests it won't be.

That's the ball to watch, because so far, Microsoft's approach to enabling competing subsystems has been to redirect UI to point to the 3rd-party replacement, without removing the integrated Microsoft alternative. 

That means the code bloat and exploitability risks of the Microsoft stuff remain, and that in turn makes it impossible for competitors to reduce the overall "cost" of that functionality (as using something else still incurs the "cost" of the Microsoft subsystem as well).

This is particularly onerous when the Microsoft subsystem is still running underfoot. 

For an example of the sort of problems that can arise: if you have an edition of Vista that does not offer the "Previous Versions" feature, you still have that code running underfoot, maintaining previous versions of your data files.  If someone subsequently upgrades Vista to an edition that does include "Previous Versions", then they can recover "previous versions" of your data files, even though those files were altered before Vista was upgraded.

So it's not enough to give Google (and presumably others; this complaint is not just for the benefit of the search king, is it?) equal or pre-eminent UI space.  If one has to accept the runtime overhead of some 3rd-party's indexer, then it's imperative that Microsoft's indexer is not left running as well.

As it is, indexer overhead is a big performance complaint with Vista.  If 3rd-party desktop search has to suffer the overhead of two different indexers, the dice are still loaded against the competition, because no matter how much more efficient the 3rd-party indexer may be, the overall result is worse performance.