09 September 2025

KB5063878: Too many NTFS Extents?

Still thinking about KB5063878 (when you get to remember a KB number, it's usually a bad one), and several things in the code stack may apply: Device Encryption, Ring -2, big.LITTLE-style cores and threads, file system resource depletion, motherboard and device firmware, processor microcode, motherboard chipset, and those elusive "Hardware Error" items that turn up in Reliability.

Disable Sandbox?

This recent article holds a clue, if you scroll about a third of the way down, and I paste:

“I myself was able to recreate the same initial error I got while copying the 151G file. Not only that, but the epic fail originated a WHEA hardware error in the event viewer related to the PCIe controller, which eventually forced me to restart. I then disabled sandbox, uninstalled the update, and the file copied just fine without a hitch… no errors, no freezes, no hangs.”

“I have a Crucial T710 2T, and I also suffered a glitch. Not as serious, but nevertheless, a glitch. I tried transferring a 151G file; it failed, and it lingered in my SSD as a ‘ghost’ file. I could not delete it, access it, or anything. After 3 attempts, I was able to delete it via Safe Boot Minimal,” another tester told Windows Latest.

After that, the article blandly states...

We don’t know how some people have a botched-up SSD after the recent Windows updates, but it appears to affect a very small number of users, and unless Microsoft finds something in telemetry data, we’ll never know what really happened.

No, it's not OK to shrug off data and storage loss as JOOTT (Just One Of Those Things), even if it affects "a very small number of users".

Good to know that disabling Sandbox may be a workaround, and a lot cleaner than "just" trying to uninstall the face-hugging KB; indeed, disabling the Sandbox may be required before that uninstall gets beyond an error and a failure to uninstall.

Dead runtimes don't talk

Forget telemetry; it can't tell you anything about the most significant failures, the ones that kill the runtime, if not the entire system.  A bullet through the brain means you can't even log "Something went wrong"; don't get distracted by the tyranny of the measurable!

Reliability: Hardware Error

Part of the support ritual is to check Reliability, a useful feature prototyped in Vista and maturing somewhat thereafter, as a manageable tap into the fire-hose of Event Spewer.

There, I often see "Hardware error" in systems that are otherwise fine, with nothing amiss from the DISM and SFC do-it-for-me code fixers.  There are no further details for these entries, and so far my limited efforts to link them to Event Viewer items have not shed any light.  Perhaps they are related to GPU glitches, or something else too deep in the hardware, such as... PCIe.

Ring -2

I can't find links for this, or clearly recall the name of the subsystem involved (most likely SMM, System Management Mode), but I remember what I read: a deep processor ring -2 mode is how "BIOS" presents USB keyboard and mouse to software as if they were PS/2 devices, and by implication, possibly legacy hardware emulation in general.

In this mode, regular CPU execution (including kernel Ring 0) is paused while the Ring -2 code does its thing.  Any bugs here are likely to hard-hang the system, but could delay return to the point that time-sensitive code may time out or fail.  This code is so deep under the kernel carpet, perhaps Windows can only report "Hardware error"?

Under-the-rug stuff like this, and "remote admin" opportunities, are great places for malware to dabble.

Current favorite: NTFS Extents

We know not to defrag SSDs, as that "just moves the junk around" and hammers the flash memory cells' limited write life; it's better to ask the SSD firmware to Trim, and hope it will eventually do so.  
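As a quick sanity check on the Trim side, Windows exposes whether delete notifications (Trim) are even being sent via fsutil's DisableDeleteNotify behaviour flag; a minimal sketch (Python, assuming fsutil is on the PATH and you may need an elevated prompt):

  import subprocess

  # Ask Windows whether delete notifications (Trim/Unmap) are being issued.
  # A reply of "DisableDeleteNotify = 0" means Trim requests ARE being sent.
  result = subprocess.run(
      ["fsutil", "behavior", "query", "DisableDeleteNotify"],
      capture_output=True, text=True
  )
  print(result.stdout.strip())

Whether the SSD firmware then acts on those Trim requests, and when, remains its own affair.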

Keeping track of where the file system thinks things are, and where the SSD firmware chooses to (eventually) write them, is the black art of SSD firmware, and likely a big reason why SSDs cost more than the bare flash memory sold as camera cards and USB flash drives.  There are very likely to be resource limitations and opportunities for things to go wrong in this space, which may be why Phison found themselves in the cross-hairs after KB5063878 brought our new crisis du jour.

Upstairs in NTFS, there's a known resource depletion risk: the cluster chaining info held as "extents".  Whereas FATxx dedicates slabs of pre-booked space for cluster chaining info (i.e. which storage block is next after reading the current one), NTFS stores the start of each run of contiguous clusters and the length of that run, with the next extent continuing the chain.

This avoids the scalability impact of FATxx File Allocation Tables, at the risk of adding the "lie to me" meta-bugs of "thin provisioning", e.g. where assumed compression, sparse files etc. fail to actually fit within available space.  There's also a lot of hop, skip and jump when the MFT and other files have to be extended to arbitrary fragments in the storage map, inviting further resource depletions and errors elsewhere, boosting write amplification, and widening critical periods while our digital Superman is poised mid-leap between skyscrapers.
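To make the contrast concrete, here's a toy model (Python; the file layout, names and numbers are mine, purely illustrative) of the two bookkeeping styles - FATxx's one-entry-per-cluster "next" table versus NTFS-style extents, each recording a run's start cluster and length:

  # Toy comparison of cluster bookkeeping styles (illustrative only).

  def fat_chain(start, fat):
      """Follow a FAT-style chain: each cluster entry names the next cluster."""
      clusters, cur = [], start
      while cur is not None:           # None marks end-of-chain
          clusters.append(cur)
          cur = fat[cur]
      return clusters

  def extent_chain(extents):
      """Expand NTFS-style extents: each entry is (start_cluster, run_length)."""
      clusters = []
      for start, length in extents:
          clusters.extend(range(start, start + length))
      return clusters

  # A file occupying clusters 10..13 then 40..41, described both ways:
  fat = {10: 11, 11: 12, 12: 13, 13: 40, 40: 41, 41: None}
  extents = [(10, 4), (40, 2)]

  assert fat_chain(10, fat) == extent_chain(extents)
  print(extent_chain(extents))   # [10, 11, 12, 13, 40, 41]

The point is that a badly fragmented file needs one extent per run; smash a huge file into enough runs and the extent metadata itself has to grow and be stored somewhere, which is exactly the resource-depletion territory above.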

The mystery is then not why things go wrong during massive file ops on a busy NTFS that has never been defragged, but why this only reared its head after KB5063878.

  • What code has KB5063878 changed?  
  • Is it specific to the Sandbox subsystem?  
  • Is it affected by Intel's big.LITTLE-style mix of "real" and "eco" cores, and how Windows assigns threads to these? 
  • Does it happen less if stealth Device Encryption is not imposed?  
  • Does it still happen when offline, excluding incoming pokes?  
  • Does it still happen if all Power Management is disabled and flattened, including Modern Connected Standby and network "magic packet" wakes?  
  • Is it related to any particular hardware or firmware, aside from SSD controllers?  
  • Does it happen to low-spec SSDs, eMMCs and hard drives, or over SATA or USB?

These may be the next set of questions to test, now that we may have repro(ducibility) at last.


04 September 2025

KB5063878 Bug: Gotcha!!

I think I've figured out the data-killing KB5063878 bug; it's when UAC tangles with UI-less activities, as described in this report.  Throws back to Vista's birth pains, when attempting to add back lost control over the many-to-many relationship between things that happen, and what should not be allowed  :-)

The report speaks of unexpected UAC prompts that now pop up due to changes added by KB5063878.  

If that collides with last month's changes to UI-less "pre-Windows" code, e.g. before the BCD is processed to offer a menu of OSLoaders (or blunder straight into {default}), or when WinRE boots instead, or when a "mini-Windows" is applying deep code changes before Windows loads, then you'll never see an error message, let alone a UAC prompt to which the user can respond.

This is vendor-knows-best territory, locking out users and administrators alike.

Now if those changes were attempting to change partitioning, e.g. to shrink C: for space to be assigned to a new WinRE Recovery partition, then things could get messy - especially in a multi-threaded environment, and/or if the "black box" of steps fails to be properly atomic.

For example, imagine one part of the process is allowed to message updated partition info to other threads, while another part is blocked from actually applying those changes.  If we're lucky, the runtime screws up and crashes out of functioning before it writes raw data to the wrong storage addresses; if less lucky, such writes trash file systems and/or partitions, latching the storage into a corrupted state.
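As a toy illustration of that kind of torn, non-atomic update (Python; the names and numbers are entirely hypothetical, with no relation to the real update code), here one thread revises a "partition map" in two steps while another keeps snapshotting it for writes:

  import threading, time

  # Hypothetical, simplified model: a partition record, and a writer that trusts it.
  partition = {"start_lba": 1000, "length": 5000}   # where C: "is"

  def repartition():
      partition["start_lba"] = 1200   # step 1: announce the new start...
      time.sleep(0.01)                # window where the record is half-updated
      partition["length"] = 4800      # step 2: ...only later correct the length

  def bulk_writer(log):
      for _ in range(5):
          snap = dict(partition)      # reader takes a (possibly torn) snapshot
          log.append((snap["start_lba"], snap["length"]))
          time.sleep(0.005)

  log = []
  t_write = threading.Thread(target=bulk_writer, args=(log,))
  t_part = threading.Thread(target=repartition)
  t_write.start(); t_part.start()
  t_part.join(); t_write.join()
  print(log)   # may contain the torn state (1200, 5000): writes aimed at the wrong place

Real kernels have locks and transactions for exactly this reason; the question is whether every path through the update code takes them.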

So, is an apparent automatic recovery on reboot actually safe?  Well, if NTFS C: is foreshortened, everything may still appear valid and work.  If the runtime crashes out, then the "dirty bit" should remain set, prompting the next boot's AutoChk to "kill, bury, deny" the file system's partition-end mismatch, "fixing" it to something at least valid for future file system operations. 
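For what it's worth, you can ask NTFS whether that dirty bit is currently set; a small check (Python, calling the standard fsutil tool, which will likely want an elevated prompt):

  import subprocess

  # Query the volume "dirty bit" that triggers AutoChk at the next boot.
  out = subprocess.run(["fsutil", "dirty", "query", "C:"],
                       capture_output=True, text=True)
  print(out.stdout.strip())   # something like "Volume - C: is NOT Dirty"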

The user may or may not see an AutoChk prompt to press a key to skip checking drive C:, but with fast-enough SSDs and the trend to hide details from users ("don'tcha wurry your pretty littul haid, Sue-Ellen, ever'thing's gonna be fine-just-fine"), perhaps that will be hidden, too.

This assumes Fast Startup doesn't just resume the doomed runtime and botch everything not already botched thus far, but that's likely to have been disabled for the next boot by some sort of "update in progress, boot properly, run this OSLoader instead" or similar logic.

Anyway, I think that is where I'd dig next, if trying to fix this mess, rather than just claiming "not my dog" after testing to clear a path to legally disclaim responsibility.

Why does bulk testing miss this?

The previous post explains that; the bug may only arise when the full set of real-world conditions applies.  Simply trashing an SSD controller with bulk writes from a KB5063878-updated Win11 won't do it, and neither may a virginal Win11 24H2 updated to KB5063878, if the bulk writes don't happen at a time that overlaps other factors that may not be present, etc.

Why only with bulk operations?

There may be race conditions involving various levels of OS and component cache management that arise only in the context of cache saturation, and of flush periods extending beyond sane time-outs or blind wait periods.  There may also be interplay with Delayed Start and Scheduled Tasks, especially when certain OEMs trigger underfootware to run every few minutes.
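A back-of-envelope way to picture the blind-wait failure (Python; the rate, wait and sizes are invented for illustration only): a device that can only drain its saturated cache at some steady rate, and a caller that waits a fixed period before declaring the job done:

  # Illustrative only: how a fixed "blind wait" loses data once caches saturate.
  FLUSH_RATE_MB_S = 400    # assumed sustained drain rate once the cache is full
  BLIND_WAIT_S    = 5      # assumed fixed wait before the caller moves on

  def left_unflushed(pending_mb):
      return max(0, pending_mb - FLUSH_RATE_MB_S * BLIND_WAIT_S)

  for pending in (500, 2_000, 50_000, 151_000):   # e.g. roughly that 151G copy
      print(f"{pending:>8} MB pending -> {left_unflushed(pending):>8} MB still "
            f"unwritten when the caller stops waiting")

The small copies always fit inside the wait; the bulk ones never do, which is one way "only with 50G+ operations" falls out naturally.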

Vendor disclosure attempted

I've alerted Microsoft via Feedback Hub, and my ex-MVP colleagues via private email list, in case they don't experience the same lightbulb effect I did, when reading the "feed seed" article linked above.


02 September 2025

KB5063878 Storage Corruption: External Factors?

Following up on reports and post-test denials of the August Cumulative + SSU KB5063878 corrupting and possibly destroying storage when under a 50G+ bulk operation load, some likely scenarios come to mind that may be missed during artificial, accelerated testing sessions.

The initial focus was on SSDs based on Phison controllers, prompting Phison to test and exclude their controllers as a cause of the problem, while recommending heat sinks to protect SSDs against load-related failures.  Subsequent reports suggest other SSDs, and even hard drives, can also be affected.

Accelerated testing?

Phison claims 4,500 cumulative testing hours across the drives reported as potentially impacted, conducted over 2,200 test cycles, which would be about 187 days of testing if done on a single device.  Testing 1,000 devices in parallel would reduce the clock time to 4.5 hours per device, each running a bit over 2 test cycles.  You can shift the numbers around, e.g. 100 devices etc., limited by the number of clock days since the problems were first reported - but it's unlikely the testing would have been real-world, i.e. based on individually-installed Windows 11 with a wide range of co-installed software, etc.
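For the sake of the arithmetic (Python; the device counts are my own what-ifs, not Phison's published figures):

  # Rough arithmetic on Phison's stated totals, under assumed parallelism.
  TOTAL_HOURS  = 4_500
  TOTAL_CYCLES = 2_200

  for devices in (1, 100, 1_000):
      hours  = TOTAL_HOURS / devices
      cycles = TOTAL_CYCLES / devices
      print(f"{devices:>5} devices: {hours:>7.1f} h ({hours/24:>6.1f} days) "
            f"and {cycles:.1f} cycles each")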

From Phison's perspective, all those software variables are irrelevant; as long as the hardware itself can be shown to work, it's some other vendor's problem if it's a software thing.  In any case, attention shifts off Phison, and off hardware specifics, once reports of other storage devices are considered.  That spectrum of affected devices also suggests this isn't limited to overheating high-performance SSDs.

The MemTest86 experience

A familiar type of artificial accelerated testing is MemTest86, in search of "bad RAM", but also as proof of hardware ability to not crash, power off, reset or lock up over a "long enough" clock-time period.  I've done this for decades of PC builds, troubleshooting, laptop pre-acceptance testing, etc. and have settled on 24 hours as the shortest 99%-certain test period.  

I've seen one case where the first error showed up at around 25 hours, and one where the first error showed up in an over-weekend 100+ hour unattended test run.  In both cases, the first error was the only error, and neither system latched into a persistent error state thereafter.

Shorter test periods, e.g. 18 hours, would be more convenient, e.g. allowing in-and-out turnaround within the same time of day, but I saw too many first errors within the 18 to 24 hour window.  Clearly, this makes the typical default 4-pass loop, completing in an hour or two, unfit to be trusted as exclusionary.

Even so, "burn-in" testing with MemTest86 is not real-world, as it exercises a very limited subset of what the tested hardware has to do.  It doesn't test GPU or DMA access to RAM, localized heat related to different kinds of CPU activity, and obviously anything to do with storage or other components.

Cache and Race Conditions

Microsoft's test methods are reportedly thorough, but not publicly detailed, and are likely also to involve accelerated, automated test methods that may be as narrow in their way as MemTest86's testing of processor and RAM.

Variables may include how soon after Windows 11 boot the tests are started, bearing in mind how "underfootware" can be triggered at arbitrary times - consider Delayed Start, which seeks to pretend Windows boots faster than it actually completes all inits and startups; application pre-loading, which seeks to pretend applications aren't slow (but your Windows and hardware may be); stuff triggered via Scheduled Tasks; and hidden ServiceWorkers that may be triggered remotely.

Bulk operations will saturate caches, possibly revealing lower raw direct transfer speeds.  These caches will still be full at the "end" of the file operations, needing time to spool out data that has already pretended to have been written... can you guess the problems that may happen next?
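The classic demonstration of "pretended to have been written" is that a write() call returning says nothing about data being on the medium; a minimal sketch (Python, which also works on Windows; the function name and chunk size are mine) of a copy that at least forces the OS cache out before trusting the result:

  import os

  def copy_with_real_flush(src, dst, chunk=16 * 1024 * 1024):
      """Copy src to dst, then push it through the OS cache before returning."""
      with open(src, "rb") as fin, open(dst, "wb") as fout:
          while True:
              buf = fin.read(chunk)
              if not buf:
                  break
              fout.write(buf)       # returns once the data is in OS cache, not on media
          fout.flush()              # push Python's own buffer down to the OS
          os.fsync(fout.fileno())   # ask the OS to push its cache to the device
      # Note: the drive's own firmware cache can still lie beyond this point.

Even then, as the final comment says, the device firmware has its own cache and its own honesty policy - which is where the next few sections go.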

Modern PCs are more like networks of DOS-sized systems.  A modern CPU has enough cache RAM to run a Windows 9x installation, while the firmware logic within black-box devices such as hard and solid-state "disk" storage is at least on the scale of a 20th-century DOS or BIOS.

Computational scale

There's a certain size of code that we may expect to be bug-free, at least if kept fully encapsulated; I'd guess somewhere between a DOS in a 1M memory map, and a Win9x in 4M or so.  

Beyond that, things rapidly bog down.  Either attempts to complete a project fail and have to be abandoned (WinAmp 3, Netscape, the original Microsoft Edge, even Windows 11 24H2's attempts to become acceptably reliable before 25H2 is due), or the project becomes a bunch of separate boxes of code linked together, as is the case with modern PC hardware subsystems and "web apps", or a thin layer of new top-soil over a decades-old mass of existing code - e.g. just about every OS other than Windows, based as they are on ancient *NIX, or how Windows 95 had to re-use solid 16-bit Assembly code to dance within 4M of RAM.

Whether it's a team of human workers, or a map of code black-boxes, new challenges and inefficiencies arise in how these interact.  Add race conditions that arise when critical periods shift in phase, and exclusionary testing becomes a really hard problem that may defy automation.

So yes; Phison may prove their controllers are OK, and Microsoft may conclude KB5063878 is OK, but neither may satisfy our need to be sure the KB will be safe on our particular systems, for reasons that both vendors can blow off as "not our problem".  And what happens with this KB may happen again with others, so we need a systematic fix for future scenarios.

Power (mis-)management

One very likely scenario involves power management, when power to a subsystem is cut before that subsystem has actually done the work that it claimed to have finished.

Consider an external USB hard drive and "Safe To Remove".  To be aware that an external device is connected, it helps to see the relevant icon in the SysTray (sorry, "notification area"), but it's hidden under the "More" chevron by default.  To click on it in order to initiate a "Safe To Remove" data flush to storage, you have to see the icon, which is instead hidden on the assumption you only need to see it when it has something to "notify" you.  So far, so... not good.

Let's say you do remember you have an external plugged in and you do click the icon, then await the feedback that the device is safe to remove, or should not be removed because it is still in use.  If you have Focus Assist enabled, you will never see that feedback, because that Notification is not considered sufficiently important, even though it's a part of your Focus that is to be Assisted.  So, how are you supposed to know you can safely unplug the storage device?

We're told this "doesn't matter", but I've seen enough corrupted external storage to know that it does.  The damage may be hidden by the "kill, bury, deny" logic of NTFS transaction rollback, AutoChk and ChkDsk, but you're still losing data that you expected to have been written to storage.

Finally, listen to how external USB hard drives often burble on after the "Safe To Remove" notification pops up.  Has the drive firmware really flushed its cache to platters, or did it lie when it told the parent subsystem that it had finished all pending writes?  Which drives do you think will look faster when tested by hardware reviewers, and feel faster to users (at least while all appears to be well)?

The same glitches that can corrupt external drives can lose data whenever a component is expected to be idle, having completed all pending tasks, and thus assumed safe to power off.

Fast Startup and partition changes

So far, we've considered loss of pending write operations when caches take too long to flush, and/or when subsystems are prematurely disconnected and/or powered off - but there's another aspect to KB5063878 that could trash file systems and partitioning, if factors excluded from automated and/or accelerated testing were to pop up in real-world scenarios as occasional race conditions.

By duhfault, Windows 11 fakes "shutdown" as part of the Fast Startup "feature".  Specifically, instead of doing a true shutdown (which has its own risks when wait time is shorter than time needed to complete tasks), Fast Startup hibernates the system state after all users are logged out.  

The next startup then appears to be faster, because the previous runtime state is Resumed, persisting any runtime glitches, resource depletions, etc.  More to the point, all sorts of sanity-checks and initializations are bypassed, on the assumption that the way things were at (fake) "shutdown", are still holding true when the runtime session is resumed.

So, if an external USB drive was disconnected while the system was "shut down", anything that was still to be saved within the hibernated runtime, will be lost.  And if any changes to partitioning were made in between the (fake) "shutdown" and "startup", then the continued runtime will be unaware, and will write raw storage data blocks to where the partitions and file systems... used to be.

It was this that alerted me to the dangers of "Fast Startup", after booting USB partitioning tools to resize and shift partitions while the system was "shut down".  The next Windows boot then promptly destroyed C: and other partitions, by overwriting the raw areas of storage where these partitions and their file systems were defined.  Fast Startup is Not Your Friend.
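For anyone wanting to check their own machine, Fast Startup is (as far as I know) governed by the HiberbootEnabled value under the Session Manager's Power key; a small read-only peek (Python, Windows-only):

  import winreg

  # Read-only check of the Fast Startup setting, assuming the usual location:
  # HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Power, HiberbootEnabled.
  key_path = r"SYSTEM\CurrentControlSet\Control\Session Manager\Power"
  with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
      value, _ = winreg.QueryValueEx(key, "HiberbootEnabled")

  print("Fast Startup is", "ENABLED" if value else "disabled")

Turning it off is better done via Control Panel's power-button settings, or by disabling hibernation outright with powercfg /h off, than by poking the registry directly.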

The failure pattern I saw when Fast Startup missed out-of-runtime partition changes is similar to that reported for KB5063878: drives "stop responding" and/or vanish once write operations can no longer proceed on stale and invalid in-memory assumptions about raw partition and file system locations - which may only happen once those raw storage blocks are lost from cache and have to be reloaded, only to find the structures trashed by the preceding mis-directed writes.

The next boot may or may not silently "fix" things, either via AutoChk file system "repair", or by WinRE's startup recovery, or even the new "call home and auto-fix" facility that may or may not yet be in play.  These "fixes" may cover up the damage and data loss, but is that enough for you?

KB5063878 and WinRE

KB5063878 does more than change code within Windows; it also changes WinRE.  The previous monthly Cumulative also changed pre-OS code, such as the mini-Windows that hosts the installer for Windows, and/or that which processes the BCD before the decision to boot Windows is taken, as well as the Servicing Stack.  These deeper changes make it harder to uninstall the KB, as the code that manages the uninstallation is itself subject to changes imposed by the KB being uninstalled!

When setting up a new laptop already running 24H2, I noted an 850M Recovery partition, as expected.  From within Windows, I shrunk C: to 150G, creating a new D: partition to fill the remaining space on the 500G SSD up to the 850M Recovery and vendor-specific 260M MyAsus partitions at the "end".

Windows Update installed only one Cumulative update, being KB5063878, along with the usual Defender and Dot Net updates.  This may or may not have included changes added by the July 2025 Cumulative, thus creating a "YMMV" versus those who had already installed the July Cumulative separately, which then becomes the baseline to which uninstalling KB5063878 would return.

So right there, we have a divergence that would probably be missed by automated testing by Microsoft and Phison, as they seek to disclaim responsibility for reported problems.

After these updates, the space previously occupied by the 850M Recovery partition was left empty, while C: was now smaller, with the space between C: and D: being allocated to a new 950M Recovery partition.  It's unclear when these partitioning changes were applied, and there may be opportunities for these changes to be mis-merged with bulk file operations, lost to un-flushed caches in prematurely disconnected subsystems and/or hardware devices, tangled up with Modern Connected Standby and/or the fake "Shutdown" of Fast Startup, etc.  

"Many a slip between cup and the lip", as they say.  These are the specific scenarios I'd set out to test, if I had the resources to do so - KB5063878 and/or Phison may not be "to blame" when considered in isolation, but nothing exists in isolation in today's sprawling, over-connected infosphere.

What next?

Microsoft is still pushing KB5063878, even while it investigates reports of significant data loss.

So, as we can't trust vendors to block dangerous updates from the server side, we need ways to block specific updates before they install, especially when these are too entangled to be uninstalled once injected into the system.

And yes, we can expect scenarios where a malicious FUD campaign may socially-engineer users into delaying updates, to hold the door open to exploit code defects the updates would have fixed.

As it is, this risk is greater when we have to advise users to Pause all updates altogether, as the only way to avoid a specific update reported to be toxic.