How bad RAM presents
If RAM was originally OK, then goes bad, you'd start to see random errors, crashes, lockups, reports of corrupted registry or other files and operations, and perhaps some spontaneous resets. This random pattern may develop some reproducible errors, where the contents of the hard drive have been corrupted, either from bad RAM per se or from recurrent bad exits.
RAM crashes at full speed, so you won't notice any slowdown of the system. This contrasts with bad sectors on the hard drive, which slows the system due to attempts to retry the operation, and/or copy contents of failing sectors to spare sectors. On most consumer PCs, there's no attempt to detect RAM errors after the BIOS boot phase; where such detection is present, the system will usually halt as soon as a RAM error is detected.
Why bad RAM matters
RAM errors can not only corrupt what is written to disk, but also where it is written to disk, at a level beneath that of the file system. A sector intended to be written to the contents of a file may instead be written over some core file system structure, e.g. if a hi-order bit in the raw sector address is flipped from 1 to 0.
RAM errors can corrupt code, causing crashes, but a greater risk may arise where the code does not crash. Many disk operation calls may use a status byte in a register to differentiate between read and write operations, so a bit-flip there could cause a write instead of a read. This is why no disk access can be considered safe; any disk access starts with reading crucial areas of the file system, and if those reads become writes, the disk contents could be trashed.
Why bad RAM may be tough to exclude
When I started out with PCs in the era of DOS 3.3, 286 processors etc. I wondered why there were so many RAM testing utilities around. Surely you would just copy data to a spare register, write it into RAM, read it back from RAM, and compare with the spare register?
I had found that in practice, several testers would pass RAM as "OK" even though swap testing would clearly demonstrate that problems would clear up on the suspect PC with good RAM and appear on a known-good PC with the suspect RAM added.
So I though a bit more about how RAM can fail; not just by returning different data compared what was written to it, but altering data in other addresses when certain addresses are accessed, or behaving differently according to whether the RAM is read for instructions vs. data, or whether it is being accessed by the processor, AGP, or some other device via DMA.
Also, some failures can crash, lock up or reset the system, instead of being presented as a nice list of bad addresses. If the RAM testing boot disk is left in the system during the test, a spontaneous reset may be missed, unless you happen to notice the test has been running for fewer hours than have elapsed since you started the test.
For a long time, I gave up on RAM testing utilities, and just did swap testing as above. My faith in RAM testers started to return with SIMM Tester, and grew stronger with MemTest86 and MemTest86+. But I find that even with these tools, either one of the two MemTest86 projects may detect errors where SIMM Tester does not, and 8 hours of MemTest86 may pass, only to throw errors somewhere in the next 12 hours of testing.
How to design a RAM testing utility
This isn't about test sequence and data intended to provoke errors due to local power starvation or whatever. Instead, it's about how this core of test routines should be wrapped into a safe and usable utility - as illuminated by issues raised earlier in this post.
Microsoft have a free stand-alone RAM tester that is called the "Windows Memory Diagnostic". But why is "Windows" in the tool's name, given this is a tool that should run at the system level, before any OS has booted up or is left running in the background?
I used this stand-alone form of the tool, and noticed something rather nasty about it - when set to repeat the test sequence, it clears the results of all previous test passes! It also does not indicate elapsed clock time, so if the tester disk is left in the boot drive, the test will restart and look exactly the same as if it had been running without any interruptions.
Any RAM failure is significant, even if it shows up only once in 24 hours of testing. If you use MemTest86 and one such error occurs, you will see it listed when you return after an overnight unattended test - whereas even if Microsoft's tester flagged it at the time, you will only see the "OK" result of the last test pass when you return in the morning.
There's no point in doing 24 hours of testing, if only the last pass (possibly the last 20 minutes of testing) is reported! Who is going to sit and watch an "unattended" RAM test loop for 24 hours, just in case one pass fleetingly shows an error on screen?
How to integrate RAM testing with a mOS
I'd love to include RAM testing within my maintenance OS, but I can't see a way to fully automate this. The mOS boot disk should not boot past the RAM testing component into loading the full mOS, because that involves a lot of disk operations that may be unsafe when RAM is bad. There's no safe and standard way that the RAM tester can set a flag that it is in session, that will persist after a spontaneous reset. The best I can think of would be to boot the mOS to a menu that defaults to testing RAM, but that does not timeout but will wait forever for a keypress.
So I can't see a safe way to incorporate RAM testing into a wizard-driven mOS intended for unskilled use. It would be lovely to have a boot disk that would do x hours of RAM testing, then test the hard drive for physical errors, then test and possibly fix file system logical errors, before commencing with formal scanning for malware. But without a safe way for the mOS boot to detect whether RAM had been recently (define "recently") tested without errors, the best I could design would be a mOS that booted to a 3-item menu (test RAM, continue with the wizard, or display help) and stayed there until a selection was made.
How to get all this sooo wrong
The good news is that the Vista DVD has RAM testing incorporated into the mOS. The bad news is that Microsoft made just about every mOS design mistake possible:
- mOS boot will fall through to hard drive boot unless key is pressed
- mOS runs a lot of code before the UI from which RAM can be tested
- mOS looks for a Vista installation on hard drive before anything else
- mOS drops RAM tester on hard drive, then reboots to run it
- RAM tester does one pass only, unless this is overridden by user
- RAM tester displays no results on screen
- RAM tester writes results to Vista installation's logs on hard drive
If RAM is so bad that one test pass will always catch it, then it is surely too dangerous to run large complex GUI code (mistake 2), or to read into the logic of a Vista installation on the hard drive (mistake 3). If BIOS standard practice is to halt a system whenever bad RAM is detected, irrespective of what the OS was doing at the time, then surely it is foolhardy to boot up a complex OS from the hard drive (mistake 1), write material to hard drive (mistake 4), especially if the RAM has been proven to be bad (mistake 7)?
What happens if the nature of the defective hardware causes the system to reset, rather than lock up or carry on running so the tester can flag the error? Well, the Vista DVD will chain into the Vista installation on the hard drive and boot that (mistake 1), which is about the worst possible thing one can do - and this will happen even if you had explicitly excluded the hard drive from BIOS bootability, because the DVD boot chains directly into it irrespective of any BIOS-level settings you may have applied.
If the RAM did test bad, how would you know? It seems as if the only way would be by booting Vista from the hard drive and scratching around in Event Viewer. If the process of writing those results into Vista's logs didn't corrupt the contents of the hard drive, then booting Vista (with all the attendant paging, temp files and registry updates this may imply) to reach Event Viewer may well do so.
This is a bit like being a driving license tester faced with a pupil who immediately tries to mash down pedestrians a la Carmageddon at the start of the test. It's nice to see Microsoft (at last!) taking an interest in maintaining sick systems, but the lack of insight displayed is scary.