14 January 2006

Bad File System or Incompetent OS?

"Use NTFS instead of FAT32, it's a better file system", goes the knee-jerk. NTFS is a better file system, but not in a sense that every norm in FAT32 has been improved; depending on how you use your PC and what infrastructure you have, FATxx may still be a better choice. All that is discussed here.

The assertion is often made that NTFS is "more robust" than FAT32, and that FAT32 "always has errors and gets corrupted" in XP. There are two apparent aspects to this; NTFS's transaction rollback capability, and inherent file system robustness. But there's a third, hidden factor as well.

Transaction Rollback

A blind spot is that the only thing expected to go wrong with file systems is the interruption of sane write operations. All of the strategies and defaults in Scandisk and ChkDsk/AutoChk (and the automated handling of "dirty" file system states) are based on this.

When sane file system writes are interrupted in FATxx, you are either left with a length mismatch between the FAT chaining and the directory entry (in which case the file data will be truncated) or a FAT chain that has no directory entry (in which case the file data may be recovered as a "lost cluster chain"). It's very rare that the two FATs will disagree (the benign "mismatched FAT", and the only case where blindly copying one FAT over the other is safe). After repair, you are left with a sane file system, the data you were writing is flagged and logged as damaged (and therefore repaired), and you know to treat that data with suspicion.
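
To make that FATxx case concrete, here is a minimal sketch (Python, run against a raw FAT32 volume image) of the length check a repair tool applies after an interrupted write: walk the FAT chain from the file's start cluster and compare the cluster count with what the directory entry's recorded size implies. The image name, start cluster and size are illustrative placeholders; the field offsets are the published FAT32 ones.

    import struct

    IMAGE = "fat32.img"          # hypothetical raw FAT32 volume image
    START_CLUSTER = 123          # first cluster, from the file's directory entry
    DIR_ENTRY_SIZE = 70_000      # file size recorded in the directory entry

    with open(IMAGE, "rb") as f:
        boot = f.read(512)
        bytes_per_sec, = struct.unpack_from("<H", boot, 11)
        sec_per_clus = boot[13]
        reserved_secs, = struct.unpack_from("<H", boot, 14)
        fat_offset = reserved_secs * bytes_per_sec      # first FAT starts here

        def fat_entry(cluster):
            f.seek(fat_offset + cluster * 4)            # 4 bytes per FAT32 entry
            return struct.unpack("<I", f.read(4))[0] & 0x0FFFFFFF

        # Follow the chain until an end-of-chain marker (or any out-of-range value).
        count, cluster = 0, START_CLUSTER
        while 2 <= cluster < 0x0FFFFFF7:
            count += 1
            cluster = fat_entry(cluster)

    cluster_bytes = bytes_per_sec * sec_per_clus
    needed = -(-DIR_ENTRY_SIZE // cluster_bytes)        # ceiling division
    if count != needed:
        print(f"length mismatch: chain has {count} clusters, size needs {needed}")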

When sane file system writes are interrupted in NTFS, transaction rollback "undoes" the operation. This assures file system sanity without having to "repair" it (in essence, the repair is automated and hidden from you). It also means that all data that was being written is smoothly and seamlessly lost. The small print in the articles on Transaction Rollback makes it clear that only the metadata is preserved; "user data" (i.e. the actual content of the file) is not preserved.

Inherent Robustness

What happens when other things cause file system corruption, such as insane writes to disk structures, arbitrary sectors written to the wrong addresses, physically unrecoverable bad sectors, or malicious malware payloads a la Witty? That is the true test of file system robustness, and survivability pivots on four things; redundant information, documentation, OS accessibility, and data recovery tools.

FATxx redundancy includes the comparison of file data length as defined in the directory entry vs. the FAT cluster chaining, and the dual FATs that protect chaining information which could not otherwise be deduced if lost. Redundancy is required not only to guide repair, but to detect errors in the first place - each cluster address should appear only once across the FAT and the collected directory entries, i.e. each cluster should be either part of one file's chain or the start of one file's data, so it is easy to detect anomalies such as cross-links and lost cluster chains.
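
A minimal sketch of that reference-counting idea, again against a raw FAT32 image (the image name is a placeholder): clusters pointed at by more than one FAT entry are cross-link candidates, and allocated clusters pointed at by no FAT entry are chain heads that must each match a directory entry, or else they head a lost chain.

    import struct
    from collections import Counter

    IMAGE = "fat32.img"          # hypothetical raw FAT32 volume image

    with open(IMAGE, "rb") as f:
        boot = f.read(512)
        bytes_per_sec, = struct.unpack_from("<H", boot, 11)
        reserved_secs, = struct.unpack_from("<H", boot, 14)
        fat_size_secs, = struct.unpack_from("<I", boot, 36)
        f.seek(reserved_secs * bytes_per_sec)
        fat = f.read(fat_size_secs * bytes_per_sec)     # first FAT copy

    entries = [struct.unpack_from("<I", fat, i * 4)[0] & 0x0FFFFFFF
               for i in range(len(fat) // 4)]

    # Count how many FAT entries point at each data cluster.
    referenced = Counter(v for v in entries[2:] if 2 <= v < 0x0FFFFFF7)

    for cluster, refs in referenced.items():
        if refs > 1:
            print(f"cluster {cluster} is cross-linked ({refs} incoming pointers)")

    # Allocated clusters with no incoming pointer are chain heads; each should
    # match exactly one directory entry, otherwise it heads a lost chain.
    allocated = {i for i, v in enumerate(entries) if i >= 2 and v != 0}
    heads = allocated - set(referenced)
    print(f"{len(heads)} chain heads to reconcile against directory entries")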

NTFS redundancy isn't quite as clear-cut, extending as it does to duplication of the first 5 records in the Master File Table (MFT). It's not clear what redundancy there is for anything else, nor are there tools that can harness this in a user-controlled way.

FATxx is a well-documented standard, and there are plenty of repair tools available for it. It can be read from a large number of OSs, many of which are safe for at-risk volumes, i.e. they will not initiate writes to the at-risk volume of their own accord. Many OSs will tolerate an utterly deranged FATxx volume because, unless you initiate an action on that volume, the OS will simply ignore it. Such OSs can be used to safely platform your recovery tools, which include interactively-controllable file system repair tools such as Scandisk.

NTFS is undocumented at the raw bytes level because it is proprietary and subject to change. This is an unavoidable side-effect of deploying OS features and security down into the file system (essential if such security is to be effective), but it does make it hard for tools vendors. There is no interactive NTFS repair tool such as Scandisk, and what data recovery tools there are, are mainly of the "trust me, I'll do it for you" kind. There's no equivalent of Norton DiskEdit, i.e. a raw sector editor with an understanding of NTFS structure.

More to the point, accessibility is fragile with NTFS. Almost all OSs depend on NTFS.SYS to access NTFS, whether these be XP (including Safe Command Only), the bootable XP CD (including the Recovery Console), Bart PE CDR, MS WinPE, Linux that uses the "Captive" approach of shelling NTFS.SYS, or Sysinternals' "Pro" (writable) feeware NTFS drivers for DOS mode and the Win9x GUI.

This came to light when a particular NTFS volume started crashing NTFS.SYS with STOP 0x24 errors in every context tested (I didn't test Linux or feeware DOS/Win9x drivers). For starters, that makes ChkDsk impossible to run, washing out MS's advice to "run ChkDsk /F" to fix the issue, possible causes of which are sanguinely described as including "too many files" and "too much file system fragmentation".

The only access I could acquire was via BING (www.bootitng.com), which tested the file system as a side-effect of imaging it off and resizing it (it passed with no errors), and two DOS mode tactics; the LFN-unaware ReadNTFS utility, which allows files and subtrees to be copied off one at a time, and full LFN access by loading first an LFN TSR, then the freeware (read-only) NTFS TSR. Unfortunately, XCopy doesn't see LFNs via the LFN TSR, and Odi's LFN Tools don't work through drivers such as the NTFS TSR, so files had to be copied one directory level at a time.

These tools are described and linked to from here.

FATxx concentrates all "raw" file system structure at the front of the disk, making it possible to back up and drop in variations of this structure while leaving file contents undisturbed. For example, if the FATs are botched, you can drop in alternate FATs (i.e. using different repair strategies) and copy off the data under each. It also means the state of the file system can be snapshotted in quite a small footprint.

In contrast, NTFS sprawls its file system structure all over the place, mixed in with the data space. This may remove the performance impact of "back to base" head travel, but it means the whole volume has to be raw-imaged off to preserve the file system state. This is one of several compelling arguments in favor of small volumes, if planning for survivability.
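
For comparison, here's how small that FATxx "front of the disk" snapshot actually is - a minimal sketch against a raw FAT32 image (file names are placeholders) that copies off the boot sector, reserved area and every FAT copy in one read.

    import struct

    IMAGE = "fat32.img"              # hypothetical raw FAT32 volume image
    SNAPSHOT = "fat32-front.bin"     # where the structure snapshot goes

    with open(IMAGE, "rb") as f:
        boot = f.read(512)
        bytes_per_sec, = struct.unpack_from("<H", boot, 11)
        reserved_secs, = struct.unpack_from("<H", boot, 14)
        num_fats = boot[16]
        fat_size_secs, = struct.unpack_from("<I", boot, 36)

        # Boot sector + reserved area (FSInfo etc.) + every FAT copy.
        front_bytes = (reserved_secs + num_fats * fat_size_secs) * bytes_per_sec
        f.seek(0)
        with open(SNAPSHOT, "wb") as out:
            out.write(f.read(front_bytes))

    print(f"saved {front_bytes} bytes of file system structure")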

OS Competence

From reading the above, one wonders whether NTFS really is more survivable or robust than FATxx. One also wonders why NTFS advocates are having such bad mileage with FATxx, given there's little inherent in the file system's structural design to account for this. The answer may lie here.

We know XP is incompetent in managing FAT32 volumes over 32G in size, in that it is unable to format them. If you do trick XP into formatting a volume larger than 32G as FAT32, it fails in the dirtiest, most destructive way possible; it begins the format (thus irreversibly clobbering whatever was there before), grinds away for ages, and then dies with an error when it gets to 32G. This standard of coding is so bad as to look like a deliberate attempt to create the impression that FATxx is inherently "bad".

But try this on a FATxx volume; run ChkDsk on it from an XP command prompt and see how long it takes, then right-click the volume, go Properties, Tools, "check the file system for errors", and note how long that takes. Yep, the second process is magically quick; so quick, it may not even have time to recalculate free space (count all FAT entries of zero) and compare that to the free space value cached in the FAT32 FSInfo sector.
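
That free-space recalculation is simple enough to sketch; assuming a raw FAT32 image (the name is a placeholder), count the zero entries in the FAT and compare with the cached count in the FSInfo sector. A thorough checker does at least this much; a cosmetic one may not even get this far.

    import struct

    IMAGE = "fat32.img"              # hypothetical raw FAT32 volume image

    with open(IMAGE, "rb") as f:
        boot = f.read(512)
        bytes_per_sec, = struct.unpack_from("<H", boot, 11)
        sec_per_clus = boot[13]
        reserved_secs, = struct.unpack_from("<H", boot, 14)
        num_fats = boot[16]
        total_secs, = struct.unpack_from("<I", boot, 32)
        fat_size_secs, = struct.unpack_from("<I", boot, 36)
        fsinfo_sec, = struct.unpack_from("<H", boot, 48)

        # How many clusters the data area actually holds.
        data_secs = total_secs - reserved_secs - num_fats * fat_size_secs
        cluster_count = data_secs // sec_per_clus

        f.seek(reserved_secs * bytes_per_sec)
        fat = f.read(fat_size_secs * bytes_per_sec)
        free = sum(1 for i in range(2, cluster_count + 2)
                   if struct.unpack_from("<I", fat, i * 4)[0] & 0x0FFFFFFF == 0)

        f.seek(fsinfo_sec * bytes_per_sec + 488)        # FSI_Free_Count field
        cached, = struct.unpack("<I", f.read(4))

    print(f"counted {free} free clusters, FSInfo claims {cached}")
    if cached not in (free, 0xFFFFFFFF):                # 0xFFFFFFFF means "unknown"
        print("cached free-space value is stale")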

Now test what this implies; deliberately hand-craft errors in a FATxx file system, do the right-click "check for errors", note that it finds none, then get out to DOS mode and do a Scandisk and see what that finds. Riiight... perhaps the reason FATxx "always has errors" in XP is because XP's tools are too brain-dead to fix them?
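
If you want to hand-craft such an error safely, do it on a disposable image rather than a live volume. One way, sketched below under that assumption, is to mark a free cluster as end-of-chain with no directory entry pointing at it; a competent checker should report it as a lost cluster chain (and, since only the first FAT is touched, a mismatched FAT as well).

    import struct

    IMAGE = "scratch-fat32.img"      # disposable FAT32 image, never a live volume

    with open(IMAGE, "r+b") as f:
        boot = f.read(512)
        bytes_per_sec, = struct.unpack_from("<H", boot, 11)
        reserved_secs, = struct.unpack_from("<H", boot, 14)
        fat_size_secs, = struct.unpack_from("<I", boot, 36)
        fat_offset = reserved_secs * bytes_per_sec

        # Find the first free cluster (FAT entry of zero) from cluster 2 upward.
        f.seek(fat_offset)
        fat = f.read(fat_size_secs * bytes_per_sec)
        for cluster in range(2, len(fat) // 4):
            if struct.unpack_from("<I", fat, cluster * 4)[0] & 0x0FFFFFFF == 0:
                break
        else:
            raise SystemExit("no free cluster found")

        # Mark it end-of-chain in the first FAT copy only: with no directory
        # entry pointing at it, that is a lost cluster, and the untouched second
        # FAT now disagrees with the first as well.
        f.seek(fat_offset + cluster * 4)
        f.write(struct.pack("<I", 0x0FFFFFFF))

    print(f"orphaned cluster {cluster}; now see which checker notices")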

My strategy has always been to build on FATxx rather than NTFS, and retain a Win9x DOS mode as an alternate boot via Boot.ini - so when I want to check and fix file system errors, I use DOS mode Scandisk, rather than XP's AutoChk/ChkDsk (I suppress AutoChk). Maybe that's why I'm not seeing the "FATxx always has errors" problem? Unfortunately, DOS mode and Scandisk can't be trusted > 137G, so there's one more reason to prefer small volumes.

2 January 2006

WMF Exposes Bad Design

Crisis of the day; an unpatched vulnerability that allows malformed .WMF files to run as raw code, i.e. the classic "insane code" scenario that can explode anywhere, any time.

See elsewhere for evolving details of the defect, workarounds, vulnerability detection tools and so on. DEP is mooted as a protection, but I am not certain that all exploits will trip DEP; in any case, DEP's only fully effective on XP SP2 systems with DEP-capable processors, and where other software issues haven't required it to be disabled.

Code defects can arise anywhere, any time, regardless of what the code is supposed to be doing by design. The hallmark of the pure code defect is that the results bear no relation to design intentions, and can thus be considered insane.

So it follows that any part of the OS may need to be amputated (or bulkheaded off) at any time.

When the problem is an inessential associated file type, this should be as easy as redirecting that file type away from the defective engine that processes it - and this is where bad OS design comes to light.

File associations are not simply there to "make things work"; they are also the point at which the user exerts control. The problem is, the OS often blurs file association linkages based on information hidden from the user, such as header information embedded in the file's data. If anything, the trend is getting worse, with the OS sniffing file content and changing its management according to what it finds hidden there, even if this differs from the information on which the user judged the risk of "opening" the file.

This is unsafe design. Surely by 2005, it should be obvious to mistrust content that misrepresents its nature? Even when the risk significance is less obvious than a .PIF containing raw .EXE code, the fact that any file type can suddenly become high-risk due to an exploitable code defect implies that every file type should be "opened" only by the code appropriate for the file type claimed.

As it is, simply changing (or killing) the association for .WMF files may be ineffective, because if the OS is presented with a file with a different file name extension and it recognises the content as WMF, it will pass it to the (defective) WMF handler.

The lesson here goes beyond fixing this particular defect, resolving to code better in future (again), and always swallowing patches as soon as the vendor releases them (a sore point in this case, as exploitation precedes patch). We should also ensure that file content is processed only as expected by the file name extension; any variance between this and the content should be considered hostile, and blocked as such.
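
As a closing illustration of that policy, here is a minimal sketch that trusts the file name extension to choose the handler and flags any file whose content claims to be WMF while its name does not; the signature bytes are the commonly documented placeable and standard metafile headers (treat them as an assumption to verify), and the file names are placeholders.

    import os

    PLACEABLE_WMF = bytes([0xD7, 0xCD, 0xC6, 0x9A])     # placeable metafile key
    STANDARD_WMF = (b"\x01\x00\x09\x00", b"\x02\x00\x09\x00")

    def looks_like_wmf(path):
        with open(path, "rb") as f:
            head = f.read(4)
        return head == PLACEABLE_WMF or head in STANDARD_WMF

    def is_hostile(path):
        # Content says WMF but the name does not: variance, so treat as hostile.
        ext = os.path.splitext(path)[1].lower()
        return looks_like_wmf(path) and ext != ".wmf"

    for name in ("holiday.jpg", "invoice.wmf"):         # illustrative file names
        if os.path.exists(name):
            print(name, "-> hostile" if is_hostile(name) else "-> consistent")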