Still thinking about KB5063878 (when you get to remember a KB number, it's usually a bad one) and several things in the code stack may apply: Device Encryption, Ring -2, bigLITTLE-style hybrid cores and threads, file system resource depletion, motherboard and device firmware, processor microcode, motherboard chipset, and those elusive "Hardware Error" items that turn up in Reliability.
Disable Sandbox?
This recent article holds a clue, if you scroll about a third of the way down, and I quote:
“I myself was able to recreate the same initial error I got while copying the 151G file. Not only that, but the epic fail originated a WHEA hardware error in the event viewer related to the PCIe controller, which eventually forced me to restart. I then disabled sandbox, uninstalled the update, and the file copied just fine without a hitch… no errors, no freezes, no hangs.”
“I have a Crucial T710 2T, and I also suffered a glitch. Not as serious, but nevertheless, a glitch. I tried transferring a 151G file; it failed, and it lingered in my SSD as a ‘ghost’ file. I could not delete it, access it, or anything. After 3 attempts, I was able to delete it via Safe Boot Minimal,” another tester told Windows Latest.
After that, the article blandly states...
“We don’t know how some people have a botched-up SSD after the recent Windows updates, but it appears to affect a very small number of users, and unless Microsoft finds something in telemetry data, we’ll never know what really happened.”
No, it's not OK to shrug off data and storage loss as JOOTT (Just One Of Those Things), even if affecting "a very small number of users".
Good to know that disabling Sandbox may be a workaround, and a lot cleaner than "just" trying to uninstall the face-hugging KB; the uninstall itself may fail with an error unless Sandbox is disabled first.
Dead runtimes don't talk
Forget telemetry; it can't tell you anything about the most significant failures, the ones that kill the runtime, if not the entire system. A bullet through the brain means you can't even log "Something went wrong"; don't get distracted by the tyranny of the measurable!
Reliability: Hardware Error
Part of the support ritual is to check Reliability, a useful feature prototyped in Vista and maturing somewhat thereafter, as a manageable tap into the fire-hose of Event Spewer.
There, I often see "Hardware error" in systems that are otherwise fine, with nothing amiss on DISM and SFC do-it-for-me code fixers. There are no further details for these entries, and so far my limited efforts to link them to Event Viewer items have not shed any light. Perhaps they are related to GPU glitches, or something else too deep in the hardware, such as... PCIe.
Ring -2
I can't find links for this, or recall the name of the subsystem involved, but I remember what I read: that a deep processor "Ring -2" mode is how "BIOS" presents USB keyboard and mouse to software as if they were PS/2, and by implication, possibly legacy hardware emulation in general.
In this mode, regular CPU execution (including kernel Ring 0) is paused while the Ring -2 code does its thing. Serious bugs here are likely to hard-hang the system, but lesser ones could merely delay the return to normal execution, long enough that time-sensitive code times out or fails. This code is so deep under the kernel carpet, perhaps Windows can only report "Hardware error"?
Under-the-rug stuff like this, or "remote admin" opportunities, are a great place for malware to dabble.
Current favorite: NTFS Extents
We know not to defrag SSDs, as that "just moves the junk around" and hammers the flash memory cells' limited write life; it's better to ask the SSD firmware to Trim, and hope it will eventually do so.
Keeping track of where the file system thinks things are, and where the SSD firmware chooses to (eventually) write them, is the black art of the SSD firmware, and likely a big reason why SSDs cost more than the bare flash memory sold as camera cards and USB flash drives. There are very likely resource limitations and opportunities for things to go wrong in this space, which may be why Phison found themselves in the cross-hairs after KB5063878 brought our new crisis du jour.
Upstairs in NTFS, there's a known resource depletion risk: the cluster chaining info known as "extents". Whereas FATxx dedicates slabs of pre-booked space for cluster chaining info (i.e. which storage block is next after reading the current one), NTFS stores the start of each run of contiguous clusters, and presumably how long that run will be before the next extent continues the chain.
This avoids the scalability impact of FATxx File Allocation Tables, at the risk of adding the "lie to me" meta-bugs of "thin provisioning", e.g. where assumed compression, sparse files etc. fail to actually fit within available space. There's also a lot of hop, skip and jump when MFT and other files have to be extended into arbitrary fragments in the storage map, inviting further resource depletions and errors elsewhere, boosting write amplification, and widening critical periods while our digital Superman is poised mid-leap between skyscrapers.
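To make the FATxx-versus-extents contrast concrete, here is a toy Python sketch. It is a simplified model and not the on-disk format of either file system; the cluster numbers, the `fat` table and both function names are made up for illustration.

```python
# Toy model only: NOT the real FAT or NTFS on-disk structures.

def fat_chain(fat, start):
    """FAT-style: one table entry per cluster, each naming the next cluster."""
    clusters = []
    cluster = start
    while cluster is not None:
        clusters.append(cluster)
        cluster = fat[cluster]
    return clusters

def to_extents(clusters):
    """Extent-style: collapse the same clusters into (start, length) runs."""
    extents = []
    for c in clusters:
        if extents and extents[-1][0] + extents[-1][1] == c:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # extend the current run
        else:
            extents.append((c, 1))  # a fragment boundary starts a new run
    return extents

# A hypothetical 8-cluster file split into three fragments.
fat = {10: 11, 11: 12, 12: 40, 40: 41, 41: 7, 7: 8, 8: 9, 9: None}
chain = fat_chain(fat, 10)
extents = to_extents(chain)
print(len(chain), "chain entries vs", len(extents), "extents:", extents)
# prints: 8 chain entries vs 3 extents: [(10, 3), (40, 2), (7, 3)]
```

In this model the chain costs one entry per cluster regardless of layout, while the extent list costs one entry per fragment. That is cheap for a contiguous file, but it grows with every new fragment, which is where the depletion risk would come from.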
The mystery is then not why things go wrong during massive file ops on a busy NTFS that has never been defragged, but why this is only rearing its head after KB5063878?
- What code has KB5063878 changed?
- Is it specific to the Sandbox subsystem?
- Is it affected by Intel's bigLITTLE-style hybrid mix of "real" (performance) and "eco" (efficiency) cores, and how Windows assigns threads to these?
- Does it happen less if stealth Device Encryption is not imposed?
- Does it still happen when offline, excluding incoming pokes?
- Does it still happen if all Power Management is disabled and flattened, including Modern Connected Standby and network "magic packet" wakes?
- Is it related to any particular hardware or firmware, aside from SSD controllers?
- Does it happen to low-spec SSDs, eMMCs and hard drives, or over SATA or USB?
These may be the next set of questions to test, now that we may have repro(ducibility) at last.
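For anyone wanting to run such tests repeatably, a minimal copy-and-verify harness might look like the Python sketch below. It is an assumption-laden starting point, not a proven repro of the KB5063878 failures: the file names, the chunk size and the `copy_and_verify` helper are all hypothetical, and the size used here is tiny compared to the 151G file in the quoted report.

```python
import hashlib
import os
import shutil
import tempfile

CHUNK = 1024 * 1024  # 1 MiB per write; an arbitrary choice

def sha256_of(path):
    """Hash a file in chunks so large files need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

def copy_and_verify(work_dir, size_bytes):
    """Write size_bytes of random data, copy it, and compare hashes."""
    src = os.path.join(work_dir, "repro_src.bin")  # hypothetical names
    dst = os.path.join(work_dir, "repro_dst.bin")
    remaining = size_bytes
    with open(src, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(os.urandom(n))
            remaining -= n
    shutil.copyfile(src, dst)
    ok = sha256_of(src) == sha256_of(dst)
    os.remove(src)  # a "ghost" file would show up as a failure here
    os.remove(dst)
    return ok

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(copy_and_verify(d, 8 * CHUNK))
```

Point `work_dir` at the drive under test and raise `size_bytes` to taste, changing one variable from the list above per run. Note that the read-back may be served from the OS cache, so a matching hash does not prove the data reached the flash.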