First, the fine print: This is suggestive content to be used at your discretion. Products, processes, and procedures below are not sanctioned, supported, warrantied, or otherwise recommended by WTG, Dell, or other OEMs.
We’ve researched this and are making the following suggestion. For modern Dell servers, for example, you will need 2 SD cards per server; we also suggest keeping a couple of spares per data center/location. Start with one (and only one) server to verify the process and test the SD card(s). This is the SD card WTG suggests you contemplate (we can assist if needed) to alleviate issues related to low-endurance boot media: https://shop.westerndigital.com/products/memory-cards/sandisk-max-endurance-uhs-i-microsd#SDSQQVR-064G-AN6IA. We encourage you to consider this suggestive advice as a workaround and not necessarily a permanent fix. Dell, for example, manufactures the BOSS card for higher-endurance boot media.
- SanDisk is a “known” in the industry – solid brand reputation
- Designed for high endurance (per SanDisk advertising, it “is engineered not only for continuous recording and re-recording, but also for continuous peace of mind for years to come”)
- U3 (UHS Speed Class 3) = a minimum sustained sequential write speed of 30 MB/s
- V30 (Video Speed Class 30) = a minimum sustained video write speed of 30 MB/s
The replacement process is as follows (for each server, non-disruptive):
Pre-work: Obtain installation media (ISO) for the exact same build/OEM version of vSphere currently installed – read the VMware KBs listed below.
- Enter maintenance mode.
- Ensure the vSphere / ESXi server is in a healthy state; reboot if needed.
- Leverage the ESXi CLI to take a backup of the configuration.
- vim-cmd hostsvc/firmware/sync_config
- vim-cmd hostsvc/firmware/backup_config
- Download the backup bundle via the link that the CLI outputs
- Shut down server.
- Open server and remove SD cards.
- Label them with the server/slot they came from and set them aside as a fallback.
- Insert two new SD cards and close server.
- Start the server and ensure the SD cards are detected and configured for mirror mode.
- Reinstall ESXi with defaults (using the exact version/build the original server had) and configure the IP address the original server had.
- SSH to server and restore configuration from backup (copy backup bundle to server with scp/ssh):
- vim-cmd hostsvc/firmware/restore_config /backup_location/configBundle.tgz
- Check server out and add back to vCenter (or reconnect).
- Exit maintenance mode.
- Repeat for other servers in cluster.
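The backup-and-restore portion of the steps above can be sketched from the ESXi shell as follows. This is a sketch, not a supported procedure: the bundle path is a placeholder, and the restore command reboots the host automatically.

```shell
# Put the host into maintenance mode before touching its configuration
vim-cmd hostsvc/maintenance_mode_enter

# Flush any pending configuration changes to the boot device
vim-cmd hostsvc/firmware/sync_config

# Create the backup bundle; this prints a download URL for configBundle.tgz
vim-cmd hostsvc/firmware/backup_config

# --- after reinstalling ESXi on the new SD cards ---
# Copy the saved bundle back to the host (e.g. with scp), then restore it.
# The host must be in maintenance mode; it reboots after the restore.
vim-cmd hostsvc/firmware/restore_config /tmp/configBundle.tgz
```

The restored host comes back with its original identity, so reconnecting it in vCenter afterwards is usually all that remains.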
** July 2021 Update **
VMware has published additional guidance on the original KB (https://kb.vmware.com/s/article/83376). A workaround has been provided that moves VMware Tools to a ramdisk, helping to reduce I/O on the SD cards. The “fix” is still to replace the boot media with higher-endurance media.
A future ESXi 7.0.x version will set this advanced option [relocation of VMware Tools to ramdisk] automatically. Refer to the VMware KB: “ToolsRamdisk option is not available with ESXi 7.0.x releases”.
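Per KB 83376, the workaround on releases that expose the option is to flip the ToolsRamdisk advanced setting so VMware Tools is served from a ramdisk instead of the SD card. A sketch (a reboot is required for the change to take effect):

```shell
# Enable the ToolsRamdisk advanced option (workaround from KB 83376);
# VMware Tools packages are then loaded into a ramdisk at boot,
# reducing read I/O against the SD/USB boot media
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1

# Verify the current value, then reboot the host
esxcli system settings advanced list -o /UserVars/ToolsRamdisk
```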
This issue may occur on several releases of ESXi; however, the likelihood of experiencing the behavior is higher on ESXi 7.0 due to product changes that demand better performance and endurance from the boot device, as noted below:
Starting in ESXi 7.0, the boot partition is formatted as VMFS-L instead of FAT (previous releases) to improve I/O performance.
The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 19, specifically says “As even read-only workloads can cause problems on low-end flash devices, you should install ESXi only on high-endurance flash media“.
However, information about the internal SD cards can’t currently be checked on the VMware Compatibility Guide, as they are not listed separately from their servers. Please contact your server hardware vendor.
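To see how a given host’s boot device is laid out, the filesystem list is a quick check. A sketch; the output varies by host and release:

```shell
# List mounted filesystems; on ESXi 7.0 the OSData volume appears
# with type VMFS-L (earlier releases used FAT-based partitions)
esxcli storage filesystem list

# /bootbank is a symlink into /vmfs/volumes, identifying the
# volume the host actually booted from
ls -l /bootbank
```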
VMware recently released a fairly important update to vCenter (details here: https://blogs.vmware.com/vsphere/2021/05/vmsa-2021-0010.html). Many folks are probably thinking: “Well, since I need to update vCenter anyway I might as well take the vSphere 7 plunge”. It’s a great thought but, like any release, please make sure you carefully read the release notes – especially if you boot from SD card (USB or other low-endurance media). We’ve been seeing a significant increase in boot-disk corruption and/or inaccessibility related to vSphere 7 U1/U2.
For background, vSphere 7 introduces a new disk layout and partition table for the OS/boot drive. It was (likely) a long-overdue enhancement, allowing for greater flexibility and capability in ESXi / vSphere itself. The details are in this VMware blog: https://blogs.vmware.com/vsphere/2020/05/vsphere-7-esxi-system-storage-changes.html.
When vSphere 7.0 U1 (and U2) was released, the disk (partition) formatting changed from the fairly typical (old-fashioned) FAT to VMFS-L. Per VMware: “This new format allows much more and faster I/O to the partition.” In addition, vSphere 7.0 U1/U2 is “no longer throttling I/O to local boot drives”. This can easily cause lower-endurance media to “become overwhelmed” and possibly corrupt (https://kb.vmware.com/s/article/83376). VMware has been actively updating the KB article, even over the last few days.
In the field, we’ve observed this manifesting as disconnected hosts, “hung” hosts, PSODs, etc. Unfortunately, a reboot only masks the problem and (if the server boots) the wear on the media continues. There is also no HCL/matrix for SD-type media.
While every situation is different, WTG is suggesting:
- Apply the critical vCenter security patch (even if you don’t take the vSphere 7 upgrade at the same time).
- Work with your hardware vendor (“OEM”) for specific guidance for this issue.
- Leverage suggested workarounds included in the KB (linked above).
- Consider acquiring SDXC / UHS-I / Class 10 SD cards (typically designed for professional video recording/production).
- If your server has a disk controller, look into appropriate SSD or HDD media (ideally mirrored).
- (similar to other releases) “If you install ESXi on M.2 or other non-USB low-end flash media, delete the VMFS datastore on the device immediately after installation to prevent the storage of virtual machine data” (p.19 “VMware ESXi Installation and Setup, Update 2 vSphere 7.0 / ESXi 7.0”)
In any case where you swap media, we suggest backing up the ESXi configuration, performing a clean install of ESXi, and then restoring the configuration from backup. From SSH/shell, this is how to generate a system backup (if you’re using host profiles and distributed switching, you might not need to do this):
(Download the backup via the link that the CLI outputs; you can use it to rebuild later, if needed.)
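A minimal sketch of that backup from an SSH session, using the same two commands as the replacement process above:

```shell
# Persist any in-memory configuration changes to the boot device,
# then create the backup bundle; the second command prints the
# download URL for configBundle.tgz
vim-cmd hostsvc/firmware/sync_config
vim-cmd hostsvc/firmware/backup_config
```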
** November 2021 [Final] Update **
VMware has released a mitigating fix for this in vSphere 7.0U3. All customers that are impacted by this issue should consider updating to vSphere 7.0U3 and, if required, replacing physically failed media. VMware has final guidance posted and detailed in this KB: https://kb.vmware.com/s/article/85685
7.0U3 release notes here: https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-esxi-703-release-notes.html. Of note:
“Deprecation of SD and USB devices for the ESX-OSData partition: The use of SD and USB devices for storing the ESX-OSData partition, which consolidates the legacy scratch partition, locker partition for VMware Tools, and core dump destinations, is being deprecated. SD and USB devices are supported for boot bank partitions. For warnings related to the use of SD and USB devices during ESXi 7.0 Update 3 update or installation, see VMware Knowledge Base Article 85615. For more information, see VMware knowledge base article 85685.”
Installation, Upgrade and Migration Issues
- The /locker partition might be corrupted when the partition is stored on a USB or SD device
Due to the I/O sensitivity of USB and SD devices, the VMFS-L locker partition on such devices that stores VMware Tools and core dump files might get corrupted.
This issue is resolved in this release. By default, ESXi loads the locker packages to the RAM disk during boot.
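After updating, a quick sanity check from the shell (a sketch; exact output depends on the build and the host’s configuration):

```shell
# Confirm the host is running 7.0 Update 3 or later
vmware -vl

# /locker (VMware Tools and core-dump area) is a symlink; with the fix
# in place its packages are loaded to a RAM disk during boot instead of
# being read repeatedly from the SD/USB device
ls -l /locker
```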
If you’ve already upgraded to vSphere 7.0 (especially U1 or U2), you’re going to want to do this sooner or later – before you find a hung host (or hosts) in production. WTG is happy to provide professional services to assist – just contact us for more details.