Following up on reports, and post-test denials, of the August Cumulative + SSU KB5063878 corrupting and possibly destroying storage under a 50G+ bulk operation load, some likely scenarios come to mind that may be missed during artificial accelerated testing sessions.
The initial focus was on SSDs based on Phison controllers, prompting Phison to test and exclude their controllers as a cause of the problem, while recommending heat sinks to protect SSDs against load-related failures. Subsequent reports suggest other SSDs, and even hard drives, can also be affected.
Accelerated testing?
Phison claims 4,500 cumulative testing hours across the drives reported as potentially impacted and conducted over 2,200 test cycles, which would be 187 days of testing if done on a single device. Testing 1,000 devices in parallel would reduce testing clock time to 4.5 hours per device, each iterating to a bit over 2 test cycles per device. You can shift the numbers around, e.g. 100 devices etc. limited by the number of clock days since the problems were first reported - but it's unlikely testing would have been real-world, i.e. based on individually-installed Windows 11 with a wide range of co-installed software, etc.
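To make that arithmetic easy to shift around, here is a small Python sketch. The two totals are the figures Phison quoted; the device counts are purely hypothetical, and the even split across devices is my assumption, not anything Phison has stated.

```python
# Totals as quoted by Phison; everything else here is an assumption.
TOTAL_HOURS = 4500
TOTAL_CYCLES = 2200


def per_device(devices: int) -> tuple[float, float, float]:
    """Return (clock hours, clock days, test cycles) per device,
    assuming the total load is split evenly across all devices."""
    hours = TOTAL_HOURS / devices
    return hours, hours / 24, TOTAL_CYCLES / devices


# Hypothetical fleet sizes, to see how short each device's test may have been.
for n in (1, 10, 100, 1000):
    hours, days, cycles = per_device(n)
    print(f"{n:>5} devices: {hours:8.1f} h ({days:6.1f} days), {cycles:7.1f} cycles each")
```

The more devices you assume, the less clock time any one device spent under test, which is the point: a large cumulative total says little about how long any single system was exercised.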
From Phison's perspective, all those software variables are irrelevant; as long as the hardware itself can be shown to work, it's some other vendor's problem if it's a software thing. In any case, attention shifts off Phison, and hardware specifics, once reports of other storage devices are considered. This spectrum of affected devices also suggests this isn't limited to overheating high-performance SSDs.
The MemTest86 experience
A familiar type of artificial accelerated testing is MemTest86, in search of "bad RAM", but also as proof of hardware ability to not crash, power off, reset or lock up over a "long enough" clock-time period. I've done this for decades of PC builds, troubleshooting, laptop pre-acceptance testing, etc. and have settled on 24 hours as the shortest 99%-certain test period.
I've seen one case where the first error showed up at around 25 hours, and one where the first error showed up in an over-weekend 100+hour unattended test run. In both cases, the first error was the only error, and neither system latched into a persistent error state thereafter.
Shorter test periods, e.g. 18 hours, would be more convenient, e.g. allowing in and out turnaround within the same time of day, but I saw too many first errors within the 18 to 24 hour period. Clearly, this makes the typical default 4-pass loops, which complete in an hour or two, unfit to be trusted as exclusionary.
Even so, "burn-in" testing with MemTest86 is not real-world, as it exercises a very limited subset of what the tested hardware has to do. It doesn't test GPU or DMA access to RAM, localized heat related to different kinds of CPU activity, and obviously anything to do with storage or other components.
Cache and Race Conditions
Microsoft's test methods are reportedly thorough, but have not been described in detail, and are likely also to involve accelerated automated test methods that may be as narrow in their way as MemTest86's testing of processor and RAM.
Variables may include how soon after Windows 11 boot the tests are started, bearing in mind how "underfootware" can be triggered at arbitrary times. Consider Delayed Start, which seeks to pretend Windows boots faster than it actually completes all inits and startups; application pre-loading, which seeks to pretend applications aren't slow, but your Windows and hardware may be; stuff triggered via Scheduled Tasks; and hidden ServiceWorkers that may be triggered remotely.
Bulk operations will saturate caches, possibly revealing lower raw direct transfer speeds. These caches will be full at the "end" of file operations, needing time to spool out to where the data has already pretended to have been written... can you guess the problems that may happen next?
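As a minimal sketch of what "pretended to have been written" means at the application level, the Python below pushes data through each software cache layer in turn. The helper name write_durably and the temporary file are my own illustrative choices. Note the limit of what it shows: os.fsync only asks the OS to flush to the device, and it cannot prove that the device's own firmware cache has actually committed the data.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)         # lands in Python's user-space buffer
        f.flush()             # hands the buffer to the OS file cache
        os.fsync(f.fileno())  # asks the OS to flush its cache to the device


# Demo against a throwaway file; a real bulk copy would be far larger.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "bulk.bin")
    write_durably(target, b"x" * (1024 * 1024))
    print(os.path.getsize(target))
```

Without the flush and fsync steps, the write call can return long before the data has left RAM, and that gap is where the trouble starts.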
Modern PCs are more like networks of DOS-sized systems. A modern CPU has enough cache RAM to run a Windows 9x installation, while firmware logic within black-box devices such as hard and solid-state "disk" storage is at least a 20th-century DOS or BIOS.
Computational scale
There's a certain size of code that we may expect to be bug-free, at least if kept fully encapsulated; I'd guess somewhere between a DOS in a 1M memory map, and a Win9x in 4M or so.
Beyond that, things rapidly bog down such that attempts to complete a project will fail and have to be abandoned (WinAmp 3, Netscape, the original Microsoft Edge, even Windows 11 24H2's attempts to become acceptably reliable before 25H2 is due), or it will become a bunch of separate boxes of code linked together, as is the case with modern PC hardware subsystems and "web apps", or a thin layer of new top-soil over a decades-old mass of existing code, e.g. just about every OS other than Windows that is based on ancient *NIX, or how Windows 95 had to re-use solid 16-bit Assembly code to dance within 4M of RAM.
Whether it's a team of human workers, or a map of code black-boxes, new challenges and inefficiencies arise in how these interact. Add race conditions that arise when critical periods shift in phase, and exclusionary testing becomes a really hard problem that may defy automation.
So yes; Phison may prove their controllers are OK, and Microsoft may conclude KB5063878 is OK, but neither may satisfy our need to be sure the KB will be safe on our particular systems, for reasons that both vendors can blow off as "not our problem". And what happens with this KB, may happen again with others, so we need a systematic fix for future scenarios.
Power (mis-)management
One very likely scenario involves power management, when power to a subsystem is cut before that subsystem has actually done the work that it claimed to have finished.
Consider an external USB hard drive and "Safe To Remove". To be aware that an external device is connected, it helps to see the relevant icon in the SysTray (sorry, "notification area"), but it's hidden under the More... by default. To click on it in order to initiate a "Safe To Remove" data flush to storage, you have to see the icon, which is instead hidden on the assumption you only need to see it when it has something to "notify" you. So far, so... not good.
Let's say you do remember you have an external plugged in and you do click the icon, then await the feedback that the device is safe to remove, or should not be removed because it is still in use. If you have Focus Assist enabled, you will never see that feedback, because that Notification is not considered sufficiently important, even though it's a part of your Focus that is to be Assisted. So, how are you supposed to know you can safely unplug the storage device?
We're told this "doesn't matter", but I've seen enough corrupted external storage to know that it does. The damage may be hidden by the "kill, bury, deny" logic of NTFS transaction rollback, AutoChk and ChkDsk, but you're still losing data that you expected to have been written to storage.
Finally, listen to how external USB hard drives often burble on after the "Safe To Remove" notification pops up. Has the drive firmware really flushed its cache to platters, or did it lie when it told the parent subsystem that it had finished all pending writes? Which drives do you think will look faster when tested by hardware reviewers, and feel faster to users (at least while all appears to be well)?
The same glitches that can corrupt external drives, can lose data if a component is expected to be idle, having completed all pending tasks, thus safe to be powered off.
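Here is a toy Python model of that failure, purely illustrative: ToyDrive and survives_power_cut are hypothetical names, and no real drive firmware is being described. It shows only the logic of the problem, where a host cuts power because the device said it was done.

```python
class ToyDrive:
    """Hypothetical drive: acknowledges writes once they are in volatile cache."""

    def __init__(self, honest_flush: bool):
        self.honest_flush = honest_flush
        self.cache = {}     # volatile, lost on power-off
        self.platters = {}  # persistent

    def write(self, block: int, data: str) -> str:
        self.cache[block] = data
        return "ok"  # acknowledged before anything is persistent

    def flush(self) -> str:
        if self.honest_flush:
            self.platters.update(self.cache)
            self.cache.clear()
        return "ok"  # a dishonest drive reports success without committing

    def power_off(self) -> None:
        self.cache.clear()  # whatever was still cached is gone


def survives_power_cut(drive: ToyDrive) -> bool:
    drive.write(7, "payload")
    if drive.flush() == "ok":  # host believes the drive is idle...
        drive.power_off()      # ...so it cuts power
    return drive.platters.get(7) == "payload"


print(survives_power_cut(ToyDrive(honest_flush=True)))   # True
print(survives_power_cut(ToyDrive(honest_flush=False)))  # False
```

From the host's side the two drives are indistinguishable until the data is read back, which is why a short test that never cuts power at the wrong moment would pass both.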
Fast Startup and partition changes
So far, we've considered loss of pending write operations when caches take too long to flush, and/or when subsystems are prematurely disconnected and/or powered off - but there's another aspect to KB5063878 that could trash file systems and partitioning, if factors excluded from automated and/or accelerated testing were to pop up in real-world scenarios as occasional race conditions.
By duhfault, Windows 11 fakes "shutdown" as part of the Fast Startup "feature". Specifically, instead of doing a true shutdown (which has its own risks when wait time is shorter than time needed to complete tasks), Fast Startup hibernates the system state after all users are logged out.
The next startup then appears to be faster, because the previous runtime state is Resumed, persisting any runtime glitches, resource depletions, etc. More to the point, all sorts of sanity-checks and initializations are bypassed, on the assumption that the way things were at (fake) "shutdown", are still holding true when the runtime session is resumed.
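If you want to check whether Fast Startup is switched on, the sketch below reads the registry value where this toggle is commonly reported to live (HiberbootEnabled under the Session Manager\Power key). Treat that location as an assumption to verify on your own build. The function name is my own, and on anything other than Windows it simply returns None.

```python
import sys

# Commonly reported location of the Fast Startup toggle; verify on your build.
KEY_PATH = r"SYSTEM\CurrentControlSet\Control\Session Manager\Power"
VALUE_NAME = "HiberbootEnabled"


def fast_startup_enabled():
    """Return True/False on Windows, or None if it cannot be determined."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except OSError:
        return None  # key or value missing, or access denied
    return value != 0


print("Fast Startup enabled:", fast_startup_enabled())
```

A True result only says the setting is on; whether a given "shutdown" was actually a hibernated one may also depend on whether hibernation itself is available on that system.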
So, if an external USB drive was disconnected while the system was "shut down", anything that was still to be saved within the hibernated runtime, will be lost. And if any changes to partitioning were made in between the (fake) "shutdown" and "startup", then the continued runtime will be unaware, and will write raw storage data blocks to where the partitions and file systems... used to be.
It was this that alerted me to the dangers of "Fast Startup", after booting USB partitioning tools to resize and shift partitions while the system was "shut down". The next Windows boot then promptly destroyed C: and other partitions, by overwriting the raw areas of storage where these partitions and their file systems were defined. Fast Startup is Not Your Friend.
The failure pattern I saw when Fast Startup missed out-of-runtime partition changes is similar to that reported for KB5063878: drives "stop responding" and/or vanish once write operations can no longer carry on using stale and invalid in-memory assumptions about the raw locations of partitions and file systems. That may only happen once those raw storage blocks are lost from cache and have to be reloaded, only to find the structures trashed by preceding mis-directed writes.
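The stale-layout problem can be sketched with a toy disk in Python. Everything here is hypothetical (sector counts, layouts, function names); it models only the logic of a resumed session trusting a partition map that was changed behind its back, not how Windows actually tracks volumes.

```python
DISK_SECTORS = 16


def make_disk(layout: dict[str, tuple[int, int]]) -> list[str]:
    """Build a toy disk: each sector is labelled with the partition that owns it."""
    disk = ["free"] * DISK_SECTORS
    for name, (start, length) in layout.items():
        for sector in range(start, start + length):
            disk[sector] = name
    return disk


def write_file(disk: list[str], layout: dict[str, tuple[int, int]],
               partition: str, offset: int, length: int) -> list[tuple[int, str]]:
    """Write via the given layout; return (sector, owner) for every sector
    written that actually belonged to something else on the disk."""
    start, _ = layout[partition]
    clobbered = []
    for sector in range(start + offset, start + offset + length):
        if disk[sector] != partition:
            clobbered.append((sector, disk[sector]))
        disk[sector] = partition
    return clobbered


# Layout as (start, length), held in memory by the hibernated session.
cached_layout = {"C": (1, 6), "D": (7, 8)}
# An offline tool shrinks C: and moves the start of D: while "shut down".
actual_layout = {"C": (1, 4), "D": (5, 10)}

disk = make_disk(actual_layout)
# The resumed session writes near the old end of C:, using its stale map.
print(write_file(disk, cached_layout, "C", offset=4, length=2))
# [(5, 'D'), (6, 'D')]
```

Writes that stay within the part of C: that did not move do no harm, which fits the delayed onset: nothing looks wrong until a write lands in the region that changed hands.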
The next boot may or may not silently "fix" things, either via AutoChk file system "repair", or by WinRE's startup recovery, or even the new "call home and auto-fix" facility that may or may not yet be in play. These "fixes" may cover up the damage and data loss, but is that enough for you?
KB5063878 and WinRE
KB5063878 does more than change code within Windows; it also changes WinRE. The previous monthly Cumulative also changed pre-OS code, such as the mini-Windows that hosts the installer for Windows, and/or that which processes the BCD before the decision to boot Windows is taken, as well as the Servicing Stack. These deeper changes make it harder to uninstall the KB, as the code that manages the uninstallation is itself subject to changes imposed by the KB being uninstalled!
When setting up a new laptop already running 24H2, I noted an 850M Recovery partition as expected. From within Windows, I shrank C: to 150G, creating a new D: partition to fill the remaining space on the 500G SSD up to the 850M Recovery and vendor-specific 260M MyAsus partitions at the "end".
Windows Update installed only one Cumulative update, being KB5063878, along with the usual Defender and Dot Net updates. This may or may not have included changes added by the July 2025 Cumulative, thus creating a "YMMV" for those who had already installed the July Cumulative separately, to become the baseline to which uninstalling KB5063878 would return.
So right there, we have a divergence that would probably be missed by automated testing by Microsoft and Phison, as they seek to disclaim responsibility for reported problems.
After these updates, the space previously occupied by the 850M Recovery partition was left empty, while C: was now smaller, with space between C: and D: being allocated to a new 950M Recovery partition. It's unclear as to when these partitioning changes were applied, and there may be opportunities for these changes to be mis-merged with bulk file operations, lost to un-flushed caches in prematurely disconnected subsystems and/or hardware devices, tangled up with Modern Connected Standby and/or fake "Shutdown" of Fast Startup, etc.
"Many a slip between cup and the lip", as they say. These are the specific scenarios I'd set out to test, if I had the resources to do so - KB5063878 and/or Phison may not be "to blame" when considered in isolation, but nothing exists in isolation in today's sprawling, over-connected infosphere.
What next?
Microsoft is still pushing KB5063878 while still investigating reports of significant data loss.
So, as we can't trust vendors to block dangerous updates from the server side, we need ways to block specific updates before they install, especially when these are too entangled to be uninstalled once injected into the system.
And yes, we can expect scenarios where a malicious FUD campaign may socially-engineer users into delaying updates, to hold the door open to exploit code defects the updates would have fixed.
As it is, this risk is greater when we have to advise users to Pause all updates altogether, as the only way to avoid a specific update reported to be toxic.