26 September 2020

Invisible BCD Boot Menu; Intel Graphics Driver

Geek summary: First post-install Win10 update of Intel Graphics drivers for i5-1035G1 renders the BCD Boot Menu invisible, although it still works.  Fixed if Device Manager, Display Adapter is Disabled; problem reproduced if Enabled, effects taking place after Windows restart.

I suspect the cause is failure of the driver to obtain color values when started in the raw EFI context, as using the Win10 Settings, Recovery, Advanced UI will show the boot menu in proper color.  That UI reaches a different boot menu, with the normal boot menu seen via Other Operating Systems UI, without restarting through raw EFI boot.  Either the first menu applies the needed color settings, or bypassing the raw EFI phase preserves the successful Win10 OS context.

Test system where problem encountered; brand new Asus laptop based on new 10nm 10xxGx series processor, specifically i5-1035G1.  Not encountered in a new desktop PC built on Gigabyte motherboard with Pentium Gold G6400 processor, also as set up last week.

Background

EFI boot from internal storage enters that storage via {bootmgr}, which displays a boot menu if there is more than one OSLoader entry in the "DisplayOrder".  By default there's only one entry to boot Windows 10, so this boot menu is normally bypassed, and the bug is thus unobserved.

As part of my standard setup, I add boot entries for Safe Mode and Safe Cmd, to float these less-destructive troubleshooting opportunities above the deceptively-named "Refresh Your PC" (a bit more than an F5 web page "refresh") and "Reset Your PC" (far beyond pressing the Reset button to force a bad-exit Restart) bear-traps that you'd have to walk past to eventually find the Safe Modes.  This causes {bootmgr} to display the BCD Boot Menu for the Timeout seconds, thus revealing the bug.
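
As a rough sketch of how such entries are typically added (not necessarily the exact commands used on this system), the Python below only builds the BCDEdit command lines; it does not run them.  The function names and the "{NEW-GUID}" placeholder are hypothetical; the real identifier is whatever "bcdedit /copy" prints, and the commands would need an elevated prompt.

```python
# Sketch only: builds bcdedit command lines as strings, runs nothing.
# "{NEW-GUID}" is a placeholder for the identifier printed by "bcdedit /copy".

def safe_mode_commands(new_guid, description, alternate_shell=False, timeout=None):
    """Return the bcdedit command lines for one extra Safe Mode boot entry."""
    cmds = [
        # Clone the current Windows loader entry under a new description.
        'bcdedit /copy {current} /d "%s"' % description,
        # Mark the clone as a minimal Safe Mode boot.
        'bcdedit /set %s safeboot minimal' % new_guid,
    ]
    if alternate_shell:
        # Safe Mode with Command Prompt instead of the Explorer shell.
        cmds.append('bcdedit /set %s safebootalternateshell yes' % new_guid)
    if timeout is not None:
        # Seconds the boot menu waits before booting the default entry.
        cmds.append('bcdedit /timeout %d' % timeout)
    return cmds


def menu_is_shown(display_order):
    """The boot menu appears only when DisplayOrder holds more than one entry."""
    return len(display_order) > 1


if __name__ == "__main__":
    for line in safe_mode_commands("{NEW-GUID}", "Safe Cmd",
                                   alternate_shell=True, timeout=10):
        print(line)
```

With only the default Windows 10 entry, menu_is_shown() is False, which is why the bug stays hidden on a stock install.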

Failure pattern

This particular system displays a GUI "Asus" image during the EFI firmware phase of the boot process, which fades before the BCD Boot Menu appears.  As this logo fades, the color undergoes a subtle shift to a less-blue hue of white; possibly a switch to greyscale, rather than a Win10 "night light" setting (as changing that setting does not change this behavior).  When the failure pattern is not in effect, the Asus logo does not change hue as it fades.

Normally, you'd then see the Boot Menu, but instead, the screen stays black.  There's still display signal present, and if you blindly use the arrow keys before pressing Enter, the menu works; you'd load whichever menu item you'd blindly selected.  If you use the trackpad or a mouse to move the mouse pointer, it will appear as the expected white arrow, and blindly clicking will also succeed in selecting and launching a menu entry.

If you do nothing, the screen remains black for Timeout seconds and then boots normally.  The initial impression is that the system has "hung" or "crashed" (untrue, as safely tested by pressing Caps Lock to toggle the keyboard LED) or that the system is way slower to boot than expected, especially for an NVMe SSD.

Problem onset

I set up systems offline, to limit problems to one system rather than whatever is being pushed from the entire Internet.  During this phase, the BCD Boot Menu worked normally as expected, both before and after upgrading the "new laptop" Windows 10 version to a freshly-made version 2004.

Problem only appeared after attempting to disable Asus's aggressive underfootware, and initially I ascribed it to this and quickly reversed changes back to the default non-Microsoft Services, Startup entries, and Scheduled Tasks. However, this was also the first Restart after going online and letting Windows Update pull down and install updates, which included "driver updates", which in turn included OEM programs now pushed as "drivers" to evade user management via Settings, Apps or Control Panel, Programs and Features.

The fix

BIOS update, re-defaulting CMOS Setup settings, power off at the mains, holding down Power switch (part of keyboard) for 20+ seconds, BCDEdit nudge to {bootmgr} do not fix.  Device Manager, Display Adapter, Update Driver reports the latest (thus surely the "best") driver is already installed, and the Rollback Driver button is greyed out.

What fixes the problem is Device Manager, Display Adapter, Disable and then a Shutdown UI, Restart to put this change into effect across the EFI boot phase.  Enabling the Display Adapter reproduces the failure pattern after the Restart; the problem remains present until Display Adapter is Disabled again.

Note; I also disable the Windows 10 "Fast Startup" setting via the convoluted Settings, Power UI required.  So at least we know we're not resuming a flawed system runtime after a fake "shutdown".

Likely cause

I suspect the Intel graphics driver depends on context established by Windows, which is absent (null pointer, anyone?) when the driver is run from raw EFI.  It either sets an incorrect graphics mode, or draws color values from zero'd memory such that "ink" and "paper" are both black.
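
To make the "ink and paper both black" guess concrete, here is a toy model in Python.  It is purely illustrative: the six-byte "context" layout and every name in it are invented for the example and have nothing to do with the real Intel driver or with EFI.

```python
# Toy model of the suspected failure; all names and layouts are hypothetical.
BLACK = (0, 0, 0)


def menu_colors(context):
    """Read text ("ink") and background ("paper") RGB values from a context blob."""
    ink = tuple(context[0:3])
    paper = tuple(context[3:6])
    return ink, paper


def is_visible(ink, paper):
    """Text can only be seen if it differs from what it is drawn on."""
    return ink != paper


# Context as a running OS might have set it up: white ink on black paper.
windows_context = bytes([255, 255, 255, 0, 0, 0])
# Context that was never initialised: six zero bytes.
zeroed_context = bytes(6)

if __name__ == "__main__":
    print(is_visible(*menu_colors(windows_context)))
    print(is_visible(*menu_colors(zeroed_context)))
```

In this model the menu logic is untouched and input still works; only the colors collapse to black on black, which matches the observed "invisible but functional" behavior.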

Safety implications

Class 3 UEFI forces EFI boot, and thus all the flaky complexities of "Extensibility".  Whereas the ancient BIOS/MBR code was sufficiently trivial to be free of bugs, EFI is not, and adds the risk of malware positioning itself to run before any OS or storage device can boot.

The fact that a Windows driver can poison the pre-OS EFI boot process is worrying, especially as the choice of driver to load is either read by pre-OS EFI from Windows, or has been latched into pre-OS EFI behavior by a setting applied from within Windows.

Scenario 1

EFI executable .efi files are able to read the Windows registry, and do so, as the BCD is in fact a Windows registry hive in structure.  However, {bootmgr} is expected to be OS-agnostic, as at the time the Boot Menu is displayed, no decision has been taken as to what OS to boot - could be any version of installed Windows, a PreOS WinPE, a Linux, anything.  So the code that runs before the Boot Menu should not dip into Windows registry hives, e.g. to load drivers or pull variables such as the colors to use for the boot menu, etc.

In fact, safest would be for pre-OS {bootmgr} code to use the lowest default screen resolution, rather than loading any 3rd-party "drivers" for a "better visual experience".  This is a similar safety issue to code integration into "safe modes" (e.g. screen savers).

Scenario 2

When a device driver is selected in Windows, e.g. by disabling or enabling a Display Adapter, Windows may also be changing drivers within firmware EFI.  If so, then a different EFI driver will load, depending on that Windows setting, and a buggy EFI display driver could cause the problem directly, rather than via using null data.

All this is hard to assess, as modern systems blur hardware, firmware, "BIOS", drivers and OSs.  Everything is now likely to contain non-trivial and thus buggy code, and everything is treated as a black-box object that may "leak".  The interface programming model is supposed to blacken the boxes of the object-orientated model, hiding the gooey details more effectively; instead of the "calling code" examining exposed variables (object Properties), it now asks the object to return these variables (object Methods), trusting the object's code to do that - which is not a great safety/security idea.

04 September 2020

The Clutch Effect

I'll start from the familiar, then delve into the implications.

You have a stack of paper, with a sheet near the bottom peeking out.  You grab that, and pull gently and slowly; the whole stack moves towards you.  You grab it and pull hard and fast; just that sheet emerges, leaving the rest of the pile where it is.

So you scale that up to the "tablecloth trick".  Disaster!

You have an old car with a shot clutch.  If you accelerate slowly and gently, the clutch "works" and the car accelerates proportionally in line with your expectations, based on the engine's revs and selected gear.  But if you stomp the gas, the engine revs speed up nicely but the car doesn't move much faster.
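
A deliberately crude model of that worn clutch, sketched in Python: the clutch passes torque up to some capacity and slips away the rest.  The function names and the torque figures are made up for illustration; a real clutch is messier than a single min().

```python
# Crude saturation model of a slipping clutch; the numbers are illustrative only.

def transmitted_torque(engine_torque_nm, clutch_capacity_nm):
    """Torque that reaches the gearbox; anything above capacity is lost to slip."""
    return min(engine_torque_nm, clutch_capacity_nm)


def is_slipping(engine_torque_nm, clutch_capacity_nm):
    """The clutch slips once the engine asks for more than it can hold."""
    return engine_torque_nm > clutch_capacity_nm


WORN_CAPACITY_NM = 80  # hypothetical capacity of the "shot" clutch

if __name__ == "__main__":
    for engine_torque in (50, 200):  # gentle pull-away vs stomping the gas
        print(engine_torque,
              transmitted_torque(engine_torque, WORN_CAPACITY_NM),
              is_slipping(engine_torque, WORN_CAPACITY_NM))
```

Below the capacity, output tracks input and the simple "engine drives wheels" model holds; above it, the output flattens and the model stops describing the car.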

You have Diabetes Mellitus ("peeing lots of sweet urine", as per the tyranny of the measurable... a topic for another day). If you digest carbs slowly, you may be OK; if you digest fast, less so.

You're piloting a fighter aircraft, pursued by a guided missile, and you try to turn and climb to evade it - but at 9G, you black out.  The human-free missile has no such limitations.

You're a 1kg block of whatever, moving through space a nudge above zero.  Your experience is vastly different to that of a similar block moving a nudge below c.  See "The universe has two speeds c, and zero; everything else is just a rounding error" for that and more.

You're a man carrying a bucket of water, and at that familiar scale, a liquid generally sinks to fill a container, with a flat surface on top that usually curls upwards a bit round the rim, unless it's mercury, which curves downwards instead.  Contrast that with an ant carrying a bead of water; though the scale isn't that much smaller than our familiar, the experience is already very different, and the opposite of how liquid helium climbs out of its container.

You're a steady DC current, flowing effortlessly through a coil but stopping dead at a capacitor.  You're perfect (i.e. highest frequency) AC, skipping effortlessly across a capacitor but stopping dead at a coil.  So far, so good... now you're a bolt of lightning striking a phone line, frying almost everything in an old 286 PC, yet the monitor remains unscathed - because the 90 degree bend of the thick copper wires at the graphics card's signal port melted before the current could reach it.  What's going on here, if lightning is DC?  Yes, but that very fast rise time behaves more like perfect AC... so to protect against lightning, tie knots in your cables - big ju-ju, works good!
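
The coil and capacitor behavior follows from the standard ideal reactance formulas, X_L = 2*pi*f*L and X_C = 1/(2*pi*f*C).  A small Python sketch, with component values (1 mH, 1 uF) picked arbitrarily for illustration and real-world resistance ignored:

```python
import math


def inductive_reactance(freq_hz, inductance_h):
    """Opposition of an ideal coil to AC, in ohms: X_L = 2*pi*f*L."""
    return 2 * math.pi * freq_hz * inductance_h


def capacitive_reactance(freq_hz, capacitance_f):
    """Opposition of an ideal capacitor to AC, in ohms: X_C = 1/(2*pi*f*C)."""
    if freq_hz == 0:
        return math.inf  # steady DC never gets across an ideal capacitor
    return 1 / (2 * math.pi * freq_hz * capacitance_f)


COIL_H = 1e-3  # 1 mH, arbitrary example value
CAP_F = 1e-6   # 1 uF, arbitrary example value

if __name__ == "__main__":
    for freq in (0, 50, 1_000_000):  # DC, mains, and something fast
        print(freq,
              inductive_reactance(freq, COIL_H),
              capacitive_reactance(freq, CAP_F))
```

As frequency climbs, the coil's reactance grows without limit while the capacitor's shrinks towards zero.  A pulse with a very fast rise time is rich in high-frequency content, which is one way to read the lightning anecdote: to that edge, a sharp bend or knot with even a little inductance looks like a wall.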

What's common to all these scenarios that I've clumped together as "the clutch effect"?  It ties in with layers of abstraction, within which certain models work (e.g. Newtonian motion and speeds near 0) but beyond which, different models may be needed (e.g. Relativity at speeds near c).
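
The Newton versus Relativity case can be put in numbers.  The Python sketch below compares the two standard kinetic energy formulas for the 1kg block; the function names are mine, and the relativistic one is rearranged to avoid floating-point cancellation at low speeds (which would otherwise be a rounding error of its own).

```python
import math

C = 299_792_458.0  # speed of light, m/s


def newtonian_ke(mass_kg, v):
    """Classical kinetic energy in joules: 0.5 * m * v^2."""
    return 0.5 * mass_kg * v * v


def relativistic_ke(mass_kg, v):
    """Relativistic kinetic energy in joules: (gamma - 1) * m * c^2.

    gamma - 1 is computed as beta^2 / (s * (1 + s)) with s = sqrt(1 - beta^2),
    which is algebraically identical but numerically stable for small v.
    """
    beta_sq = (v / C) ** 2
    s = math.sqrt(1.0 - beta_sq)
    return mass_kg * C * C * beta_sq / (s * (1.0 + s))


def model_ratio(v, mass_kg=1.0):
    """How far the relativistic answer has drifted from the Newtonian one."""
    return relativistic_ke(mass_kg, v) / newtonian_ke(mass_kg, v)


if __name__ == "__main__":
    for v in (30.0, 0.001 * C, 0.5 * C, 0.99 * C):
        print(v, model_ratio(v))
```

Near zero the ratio is indistinguishable from 1 and Newton's model is all you need; near c it runs away, and the simpler model has let go, much like the clutch.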