28 January 2007

Blogger! Can't Update Site Links!

When I first set up this blog, I entered links to my main web site and older Win9x web site, while leaving in the Google News link that was there.

Hmm... according to Google News, "Palestinian Leaders Plead for Clam"... not clear if they're asking for clam to be served them as food, or perhaps clamency for an aquatic comrade in arms?

Anyway, what I wanted to do was add my new Vista blog, but I can't see anywhere to edit that list of links - and I did look everywhere. Strange!

Internet Explorer 7 Exits When Started?

I see Microsoft has an article describing this behavior. In my experience the problem may be more general than the article suggests, though the mechanism may be the same.

What happens is that whenever you run Internet Explorer 7 (IE 7), or things that invoke it such as Windows or Microsoft Update, it vanishes off the screen as soon as it appears, with no error messages as to why it has done so.

I've seen this in previously-working IE 7 installations, but more commonly after something has interrupted an IE 7 installation. What would do that? Automatic Update, that's what... as seen in the "Windows Bugs" photo set at my other blog.

I've seen this regularly, as others have not, so I wondered why this might be. Perhaps it's because the PCs in question have a lot of old patches to catch up on, so Automatic Update gets active as soon as they reach the Internet, plus I often install IE 7 from a saved copy off CDR at around the same time.

What is supposed to happen is that the Internet Explorer 7 install does its thing, then prompts you to restart Windows within its own series of successive blue dialog boxes. Instead, these dialogs are still indicating files being installed etc. while Automatic Updates pops up its usual grey dialog telling you to restart Windows, and if you cancel that, it will pop up again and again.

I've always wondered whether Automatic Update co-ordinates itself with what Windows or Microsoft Update are doing, or whether the same material gets downloaded by each at the same time, doubling the bandwidth consumed. This case suggests problems of that nature; the IE 7 installer should trap and disallow (or gracefully clean up) software-initiated shutdown requests, and/or prevent other items installing themselves while the IE 7 install is in progress. Similarly, Automatic Updates should detect Microsoft's own installation activity, be this locally or as managed from update web sites.

Sun Java JRE Bloat

Well, it took Microsoft long enough to finally scale down Internet Explorer's ridiculously bloated cache allocation; Internet Explorer 7 follows other browsers in sizing this to 50M, irrespective of hard drive volume size, and it may complain that the present cache size is too large (e.g. if the user had set it that way).

However, what Microsoft has finally learned, Sun is still getting wrong. After installing the new Sun Java JRE 1.6, I saw a Java SysTray icon, and poked around; there's a slider for temporary file cache to be allocated to Java (separate from browser caches) and the duhfault is 1G! Needless to say, that got scaled back to 20M pretty quickly.

So, there's a new bloat factor to remember when folks run out of space on C:, over and above the bloat of multiple Sun Java JREs, as discussed earlier in this blog. At least new JREs no longer pass control to older (exploitable) ones when requested to do so by Java malware (sorry, "valued Java applets"); still, at 100M+ a pop, old JREs aren't too cheap.

13 January 2007

Bad RAM, Bad RAM Tester Design

This long post covers Vista's mOS, MemTest86 and Microsoft's stand-alone RAM testing utility.

How bad RAM presents

If RAM was originally OK, then goes bad, you'd start to see random errors, crashes, lockups, reports of corrupted registry or other files and operations, and perhaps some spontaneous resets. This random pattern may develop some reproducible errors, where the contents of the hard drive have been corrupted, either from bad RAM per se or from recurrent bad exits.

RAM crashes at full speed, so you won't notice any slowdown of the system. This contrasts with bad sectors on the hard drive, which slow the system due to attempts to retry the operation, and/or copy contents of failing sectors to spare sectors. On most consumer PCs, there's no attempt to detect RAM errors after the BIOS boot phase; where such detection is present, the system will usually halt as soon as a RAM error is detected.

Why bad RAM matters

RAM errors can corrupt not only what is written to disk, but also where it is written, at a level beneath that of the file system. A sector meant to hold the contents of a file may instead be written over some core file system structure, e.g. if a high-order bit in the raw sector address is flipped from 1 to 0.

RAM errors can corrupt code, causing crashes, but a greater risk may arise where the code does not crash. Many disk operation calls may use a status byte in a register to differentiate between read and write operations, so a bit-flip there could cause a write instead of a read. This is why no disk access can be considered safe; any disk access starts with reading crucial areas of the file system, and if those reads become writes, the disk contents could be trashed.
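
To make these two failure modes concrete, here is a minimal Python sketch of what a single flipped bit can do. The sector address is made up for illustration. The command codes 0x02 (read) and 0x03 (write) are borrowed from the BIOS INT 13h convention, purely as an example of two commands that differ by one bit. No real disk is touched.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, as a single-bit RAM error would."""
    return value ^ (1 << bit)


# Hypothetical raw sector address somewhere in the middle of a file.
intended_sector = 0x00801234

# Bit 23 is the only high-order bit set here; flipping it from 1 to 0
# redirects the write to a low-numbered sector near the start of the disk,
# where file system structures often live.
actual_sector = flip_bit(intended_sector, 23)

# Illustrative command codes (BIOS INT 13h style): they differ only in bit 0.
CMD_READ = 0x02
CMD_WRITE = 0x03
corrupted_command = flip_bit(CMD_READ, 0)

print(f"intended sector {intended_sector:#010x} -> actual sector {actual_sector:#010x}")
print("read became write:", corrupted_command == CMD_WRITE)
```

Neither corruption makes the code crash, which is the point: the operation completes "successfully" against the wrong sector or in the wrong direction.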

Why bad RAM may be tough to exclude

When I started out with PCs in the era of DOS 3.3, 286 processors etc. I wondered why there were so many RAM testing utilities around. Surely you would just copy data to a spare register, write it into RAM, read it back from RAM, and compare with the spare register?

I had found that in practice, several testers would pass RAM as "OK" even though swap testing would clearly demonstrate that problems would clear up on the suspect PC with good RAM and appear on a known-good PC with the suspect RAM added.

So I thought a bit more about how RAM can fail; not just by returning different data compared to what was written to it, but by altering data in other addresses when certain addresses are accessed, or behaving differently according to whether the RAM is read for instructions vs. data, or whether it is being accessed by the processor, AGP, or some other device via DMA.
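
As a rough illustration of why the write-then-read-back idea can pass bad RAM, here is a small Python simulation. Everything in it is hypothetical: `CoupledFaultRAM` models just one of the failure types mentioned above (accessing one address alters another), and the addresses, sizes and test pattern are arbitrary. It is a sketch of the principle, not of how any real tester works.

```python
class CoupledFaultRAM:
    """Simulated RAM where writing one address disturbs a bit at another."""

    def __init__(self, size: int, aggressor: int, victim: int):
        self.cells = [0] * size
        self.aggressor = aggressor
        self.victim = victim

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value & 0xFF
        if addr == self.aggressor:
            # The fault: touching the aggressor flips bit 0 of the victim.
            self.cells[self.victim] ^= 0x01

    def read(self, addr: int) -> int:
        return self.cells[addr]


def naive_test(ram, size, pattern=0xAA):
    """Write each cell and read it straight back; return failing addresses."""
    bad = []
    for addr in range(size):
        ram.write(addr, pattern)
        if ram.read(addr) != pattern:
            bad.append(addr)
    return bad


def write_all_then_verify(ram, size, pattern=0xAA):
    """Fill all of RAM first, then read everything back in a second sweep."""
    for addr in range(size):
        ram.write(addr, pattern)
    return [addr for addr in range(size) if ram.read(addr) != pattern]


SIZE = 16
naive_result = naive_test(CoupledFaultRAM(SIZE, aggressor=5, victim=2), SIZE)
sweep_result = write_all_then_verify(CoupledFaultRAM(SIZE, aggressor=5, victim=2), SIZE)
print("naive test reports:", naive_result)
print("two-sweep test reports:", sweep_result)
```

The naive test never revisits address 2 after address 5 has disturbed it, so it reports nothing. Even the two-sweep version only catches this fault because the victim sits below the aggressor; with the positions reversed, the later write to the victim would mask the damage. That may be part of why there are so many testers, each with its own patterns and sweep orders.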

Also, some failures can crash, lock up or reset the system, instead of being presented as a nice list of bad addresses. If the RAM testing boot disk is left in the system during the test, a spontaneous reset may be missed, unless you happen to notice the test has been running for fewer hours than have elapsed since you started the test.

For a long time, I gave up on RAM testing utilities, and just did swap testing as above. My faith in RAM testers started to return with SIMM Tester, and grew stronger with MemTest86 and MemTest86+. But I find that even with these tools, either one of the two MemTest86 projects may detect errors where SIMM Tester does not, and 8 hours of MemTest86 may pass, only to throw errors somewhere in the next 12 hours of testing.

How to design a RAM testing utility

This isn't about test sequence and data intended to provoke errors due to local power starvation or whatever. Instead, it's about how this core of test routines should be wrapped into a safe and usable utility - as illuminated by issues raised earlier in this post.

Microsoft have a free stand-alone RAM tester that is called the "Windows Memory Diagnostic". But why is "Windows" in the tool's name, given this is a tool that should run at the system level, before any OS has booted up or is left running in the background?

I used this stand-alone form of the tool, and noticed something rather nasty about it - when set to repeat the test sequence, it clears the results of all previous test passes! It also does not indicate elapsed clock time, so if the tester disk is left in the boot drive, the test will restart and look exactly the same as if it had been running without any interruptions.

Any RAM failure is significant, even if it shows up only once in 24 hours of testing. If you use MemTest86 and one such error occurs, you will see it listed when you return after an overnight unattended test - whereas even if Microsoft's tester flagged it at the time, you will only see the "OK" result of the last test pass when you return in the morning.

There's no point in doing 24 hours of testing, if only the last pass (possibly the last 20 minutes of testing) is reported! Who is going to sit and watch an "unattended" RAM test loop for 24 hours, just in case one pass fleetingly shows an error on screen?
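
What I'd want instead is a wrapper that keeps a cumulative record across passes and shows elapsed time. The Python sketch below shows the bookkeeping only; the class name, fields and the error address are all invented for illustration, and the actual test passes are simulated by hand.

```python
from dataclasses import dataclass, field


@dataclass
class RamTestSession:
    """Cumulative record of an unattended multi-pass RAM test."""
    started_at: float                            # wall-clock start, in seconds
    passes_completed: int = 0
    errors: list = field(default_factory=list)   # (pass number, address) pairs

    def record_pass(self, failing_addresses):
        """Add one pass's results without discarding earlier ones."""
        self.passes_completed += 1
        for addr in failing_addresses:
            self.errors.append((self.passes_completed, addr))

    def summary(self, now: float) -> str:
        """One status line covering the whole session, not just the last pass."""
        hours = (now - self.started_at) / 3600
        status = "FAIL" if self.errors else "OK"
        return (f"{status}: {self.passes_completed} passes, "
                f"{len(self.errors)} error(s) logged, {hours:.1f} h elapsed")


# Simulated run: only the second of three passes sees an error.
session = RamTestSession(started_at=0.0)
session.record_pass([])
session.record_pass([0x1F3C00])
session.record_pass([])
print(session.summary(now=7200.0))
```

Here the final pass was clean, yet the summary still reads FAIL because the earlier error stays on record. A spontaneous reset would still wipe an in-memory record like this, but the short elapsed time shown on return would at least give it away.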

How to integrate RAM testing with a mOS

I'd love to include RAM testing within my maintenance OS, but I can't see a way to fully automate this. The mOS boot disk should not boot past the RAM testing component into loading the full mOS, because that involves a lot of disk operations that may be unsafe when RAM is bad. There's no safe and standard way that the RAM tester can set a flag that it is in session, that will persist after a spontaneous reset. The best I can think of would be to boot the mOS to a menu that defaults to testing RAM, but that does not time out and will wait forever for a keypress.

So I can't see a safe way to incorporate RAM testing into a wizard-driven mOS intended for unskilled use. It would be lovely to have a boot disk that would do x hours of RAM testing, then test the hard drive for physical errors, then test and possibly fix file system logical errors, before commencing with formal scanning for malware. But without a safe way for the mOS boot to detect whether RAM had been recently (define "recently") tested without errors, the best I could design would be a mOS that booted to a 3-item menu (test RAM, continue with the wizard, or display help) and stayed there until a selection was made.
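
The menu logic itself is trivial; the only thing that matters is that nothing happens until someone chooses. A minimal Python sketch of that idea follows, with `read_key` standing in for whatever blocking keyboard read the boot environment would really provide (an assumption, as are the key assignments).

```python
MENU = {
    "1": "test RAM",
    "2": "continue with the wizard",
    "3": "display help",
}
DEFAULT_KEY = "1"   # Enter alone selects the RAM test


def choose(read_key):
    """Block until a valid selection is made; there is deliberately no timeout."""
    while True:
        key = read_key()
        if key in ("\r", "\n"):
            key = DEFAULT_KEY
        if key in MENU:
            return MENU[key]
        # Anything else is ignored and the menu keeps waiting.


# Simulated key presses: two stray keys, then Enter.
presses = iter(["x", "9", "\r"])
selection = choose(lambda: next(presses))
print("selected:", selection)
```

Because the loop never falls through on its own, a spontaneous reset during RAM testing lands back at this menu rather than carrying on into disk-heavy code.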

How to get all this sooo wrong

The good news is that the Vista DVD has RAM testing incorporated into the mOS. The bad news is that Microsoft made just about every mOS design mistake possible:
  1. mOS boot will fall through to hard drive boot unless key is pressed
  2. mOS runs a lot of code before the UI from which RAM can be tested
  3. mOS looks for a Vista installation on hard drive before anything else
  4. mOS drops RAM tester on hard drive, then reboots to run it
  5. RAM tester does one pass only, unless this is overridden by user
  6. RAM tester displays no results on screen
  7. RAM tester writes results to Vista installation's logs on hard drive

OK, let's walk through what happens when you test a system that may have bad RAM. Microsoft seems to expect this RAM to be so bad that a single test pass will catch it, even though we know from experience that you may only see one error in 24 hours of testing (mistake 5).

If RAM is so bad that one test pass will always catch it, then it is surely too dangerous to run large complex GUI code (mistake 2), or to read into the logic of a Vista installation on the hard drive (mistake 3). If BIOS standard practice is to halt a system whenever bad RAM is detected, irrespective of what the OS was doing at the time, then surely it is foolhardy to boot up a complex OS from the hard drive (mistake 1), write material to hard drive (mistake 4), especially if the RAM has been proven to be bad (mistake 7)?

What happens if the nature of the defective hardware causes the system to reset, rather than lock up or carry on running so the tester can flag the error? Well, the Vista DVD will chain into the Vista installation on the hard drive and boot that (mistake 1), which is about the worst possible thing one can do - and this will happen even if you had explicitly excluded the hard drive from BIOS bootability, because the DVD boot chains directly into it irrespective of any BIOS-level settings you may have applied.

If the RAM did test bad, how would you know? It seems as if the only way would be by booting Vista from the hard drive and scratching around in Event Viewer. If the process of writing those results into Vista's logs didn't corrupt the contents of the hard drive, then booting Vista (with all the attendant paging, temp files and registry updates this may imply) to reach Event Viewer may well do so.

This is a bit like being a driving license tester faced with a pupil who immediately tries to mash down pedestrians a la Carmageddon at the start of the test. It's nice to see Microsoft (at last!) taking an interest in maintaining sick systems, but the lack of insight displayed is scary.

Learning Vista

See http://cquirke.spaces.live.com, which is where I'll blog my initial bewilderment and hopefully progress as I actually work with (as opposed to look at) Vista.