I think I've figured out the data-killing KB5063878 bug: it's what happens when UAC tangles with UI-less activities, as described in this report. It throws back to Vista's birth pains: the attempt to add back lost immutability to the many-to-many relationship between things that happen and what should not be allowed :-)
The report speaks of unexpected UAC prompts that now pop up due to changes added by KB5063878.
If that collides with last month's changes to UI-less "pre-Windows" code, e.g. before BCD is processed to menu OSLoaders (or blunder into {default}), or when WinRE boots instead, or when a "mini-Windows" is applying deep code changes before Windows loads, then you'll never see an error message, let alone a UAC prompt to which the user can respond.
This is vendor-knows-best territory, locking out users and administrators alike.
Now if those changes were attempting to change partitioning, e.g. to shrink C: for space to be assigned to a new WinRE Recovery partition, then things could get messy - especially in a multi-threaded environment, and/or if the "black box" of steps fails to be properly atomic.
For example, imagine that one part of the process is allowed to message updated partition info to other threads, but another part of the process is blocked from actually applying those changes. With luck, the runtime will screw up and crash out of functioning before it writes raw data to the wrong storage addresses. If less lucky, such writes will trash file systems and/or partitions, so the storage is latched into a corrupted state.
So, is an apparent automatic recovery on reboot actually safe? Well, if NTFS C: is foreshortened, everything may still appear valid and work. If the runtime crashes out, then the "dirty bit" should remain set, prompting the next boot's AutoChk to "kill, bury, deny" the file system's partition-end mismatch, "fixing" it to something at least valid for future file system operations.
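To make that failure mode concrete, here is a toy sketch in Python. It is my illustration only, not anything taken from the update's actual code: the names `Layout`, `run_resize` and `apply_allowed`, the 100-sector disk and the 10-sector shrink are all hypothetical. It shows a resize that announces the new layout before applying it, and what a second worker does when the apply step is silently blocked.

```python
from dataclasses import dataclass

TOTAL_SECTORS = 100  # toy disk size, purely illustrative


@dataclass(frozen=True)
class Layout:
    c_end: int  # C: spans sectors [0, c_end); the rest is meant for WinRE


def run_resize(apply_allowed: bool) -> int:
    """Return how many live C: sectors the WinRE writer overwrites."""
    sectors = ["C-data"] * TOTAL_SECTORS      # C: initially fills the disk
    committed = Layout(c_end=TOTAL_SECTORS)   # what is really on disk
    wanted = Layout(c_end=90)                 # shrink C: by 10 sectors

    # Step 1: the new layout is announced to other workers.
    announced = wanted

    # Step 2: the change is applied, unless something (say, an elevation
    # check that nobody can answer) silently blocks it.
    if apply_allowed:
        for s in range(wanted.c_end, TOTAL_SECTORS):
            sectors[s] = "free"               # C: data moved out of the tail
        committed = wanted

    # Step 3: another worker trusts the announcement and writes WinRE data.
    clobbered = 0
    for s in range(announced.c_end, TOTAL_SECTORS):
        if s < committed.c_end and sectors[s] == "C-data":
            clobbered += 1                    # this sector still belonged to C:
        sectors[s] = "WinRE"
    return clobbered


print(run_resize(apply_allowed=True))   # 0
print(run_resize(apply_allowed=False))  # 10
```

When both steps complete, nothing is lost; when only the announcement gets through, every write to the "new" partition lands on live C: data. Making announce-and-apply a single all-or-nothing step is what "properly atomic" would mean here.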
The user may or may not see an AutoChk prompt to press a key to skip checking drive C:, but with fast-enough SSDs and the trend to hide details from users ("don'tcha wurry your pretty littul haid, Sue-Ellen, ever'thing's gonna be fine-just-fine"), perhaps that will be hidden, too.
This assumes Fast Startup doesn't just resume the doomed runtime and botch everything not already botched thus far, but that's likely to have been disabled for the next boot by some sort of "update in progress, boot properly, run this OSLoader instead" or similar logic.
Anyway, I think that is where I'd dig next, if trying to fix this mess, rather than just claiming "not my dog" after testing to clear a path to legally disclaim responsibility.
Why does bulk testing miss this?
The previous post explains that; the bug may only arise when the full set of real-world conditions applies. Simply thrashing an SSD controller with bulk writes from a KB5063878-updated Win11 won't do it, and neither may updating a virginal Win11 24H2 to KB5063878, if bulk writes don't happen at a time to overlap other factors that may not be present, etc.
Why only with bulk operations?
There may be race conditions involving various levels of OS and component cache management that arise only in the context of cache saturation and flush periods extending beyond safe time-outs or blind wait periods. There may also be interplay with Delayed Start and Scheduled Tasks, especially when certain OEMs trigger underfootware to run every few minutes.
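A back-of-envelope sketch of the "blind wait" problem, again in Python and again with made-up numbers: the function name, the 500 MiB/s drain rate and the 2-second wait are assumptions for illustration, not measured values. The idea is that code which waits a fixed time instead of confirming the flush is fine for small writes and wrong for bulk ones.

```python
def unflushed_after_blind_wait(pending_mib: int,
                               drain_mib_per_s: int,
                               wait_s: int) -> int:
    """MiB still sitting in cache after a fixed wait, with no completion check."""
    drained = min(pending_mib, drain_mib_per_s * wait_s)
    return pending_mib - drained


# Everyday write: 64 MiB pending, drained well within the wait.
print(unflushed_after_blind_wait(64, 500, 2))         # 0

# Bulk write: 50 GiB pending, most of it still in flight when the wait ends.
print(unflushed_after_blind_wait(50 * 1024, 500, 2))  # 50200
```

Anything that proceeds on the assumption that the cache is empty after the wait would only be wrong in the second case, which fits a bug that shows up only under bulk operations.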
Vendor disclosure attempted
I've alerted Microsoft via Feedback Hub, and my ex-MVP colleagues via private email list, in case they don't experience the same lightbulb effect I did, when reading the "feed seed" article linked above.