This post is less a deep dive into a bug I ran into while upgrading an x86_64 machine from FreeBSD 14.3 to FreeBSD 15 and more of a PSA: I have a possible workaround for anyone who runs into the same thing, but I don’t have a full root-cause analysis or a proper diagnosis of the underlying issue.
FreeBSD 15.0 was released a week ago, and I decided to try to upgrade one of my ZFS appliance servers (running nothing more than ZFS and some scripts) to it as a possible low-stakes trial run. The machine in question is a fairly old (but very reliable) Dell PowerEdge R720, running FreeBSD 14.3p2 and booting in UEFI mode from a ZFS zroot pool at the time.
As always, I started my FreeBSD upgrade with the usual `sudo zfs snap -r zroot@freebsd-14.3p2` before anything else (yes, I know about ZFS boot environments, but I also know that ZFS snapshots are fast and free). The first part of the upgrade went swimmingly: I installed the newest version of `freebsd-rustdate` from the pkg repos and ran `freebsd-rustdate upgrade -r 15.0-RELEASE` followed by `freebsd-rustdate install`. The initial upgrade of the kernel components to 15.0-RELEASE went well, and I was prompted to restart the system… and that’s when the troubles began.
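For reference, the steps in that paragraph can be sketched as a small script. The commands are the ones quoted above (with `zfs snapshot` spelled out instead of the `snap` alias); the `run` wrapper, the function names, and the `DRY_RUN` switch are hypothetical helpers added only so the sequence can be printed without touching a real system.

```shell
#!/bin/sh
# Sketch of the first upgrade stage described above.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 and
# run as root to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Take a recursive snapshot of the pool ($1), labelled with the release ($2)
# that is about to be left behind.
snapshot_before_upgrade() {
    run zfs snapshot -r "$1@freebsd-$2"
}

# Fetch the target release ($1) and install the first (kernel) stage.
upgrade_first_stage() {
    run freebsd-rustdate upgrade -r "$1"
    run freebsd-rustdate install
}

snapshot_before_upgrade zroot 14.3p2
upgrade_first_stage 15.0-RELEASE
# A reboot follows once the install step asks for one.
```

Left in dry-run mode, the script only echoes the three commands, which makes it a harmless way to double-check the pool name and target release before committing to anything.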
This was a headless machine racked elsewhere, and to my dismay I found that I wasn’t able to SSH into it after waiting the requisite amount of time. After discovering I wasn’t even able to ping the machine, I resigned myself to attaching a display and keyboard to it and went to see what was going on. Although at least ten minutes (and probably more) had passed since the initial part of the upgrade had succeeded, the display I connected caught the final moments of the machine as it was syncing disks in prep for a reboot. Had some service really taken that long to shut down after I had executed `sudo reboot`?
Despite seeing what finally looked like progress, I figured I might as well watch the system come back online, since I was already there with the display and keyboard hooked up. After the agonizingly long POST process, followed by Dell’s inventory manager startup and the HBA’s additional boot-time payload execution (and Intel’s iSCSI firmware, and, and, and), the machine began to boot FreeBSD. I looked away for a second… only to see the last vestiges of a shutdown notice as the system began to reboot just after coming up! Ten minutes later, this time with my iPhone recording video in slow motion,[1] I was unfortunately able to verify that it wasn’t a fluke: the machine was indeed stuck in a reboot loop that kicked off just after all the services were brought up successfully and the kernel’s writes to the vtty quiesced. It didn’t look like anything had gone wrong: there was no kernel panic, services like sshd were being shut down in an orderly fashion, and the disks were synced before the machine completely power cycled.
I naturally tried to boot up in single-user mode, both by pressing `2` at the UEFI bootloader menu and by manually executing `boot -s` at the bootloader prompt: both failed in exactly the same manner. Stymied and really frustrated with how long each attempt took, thanks to all the boot-time firmware applications and processes that had to run before the system actually began to boot, I shut down the machine, yanked out the OS drive, and connected it to a spare desktop to be able to iterate more quickly.
And imagine my surprise when the FreeBSD 15 install booted up just fine with the disk in the test rig, with no hint of a reboot loop even in normal/multi-user mode!
I took the opportunity of having the machine fully boot up in normal mode to investigate the situation from the comfort of my regular environment, with my preferred keyboard layout, tools, and editors at my service (much better than a rescue shell!), and was surprised to see absolutely nothing out of the ordinary. The logs revealed no errors: no watchdog timers kicking in, no kernel panics (as I’d already surmised), no filesystem errors, and no hardware failures (none that were logged, at any rate). At this point, my best guess was that something was amiss with the ACPI support for this machine in the FreeBSD 15 release.[2]
Before trying to disable ACPI in the bootloader configuration, I figured I would try one last thing: finish the upgrade by updating the FreeBSD userland from 14.3 to 15.0, so I ran the requisite commands to bootstrap pkg, upgrade all installed packages, and then finish the freebsd-rustdate/freebsd-update install process. With the update fully complete, I ejected the root disk from the test machine and connected it back to the R720 server, though not with much hope of success.
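The "requisite commands" are not spelled out above, so the following is only a plausible reconstruction, in the order the paragraph gives: re-bootstrap `pkg`, upgrade the installed packages, then finish the `freebsd-rustdate` install. The `-f` on `pkg bootstrap` (force a re-bootstrap) is an assumption, as are the `run` wrapper, the function name, and the `DRY_RUN` switch, which exist only to make the sketch safe to execute as-is.

```shell
#!/bin/sh
# Sketch of finishing the upgrade once the 15.0 kernel is running.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 and
# run as root to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

finish_userland_upgrade() {
    # Assumption: force pkg to re-bootstrap itself for the new release.
    run pkg bootstrap -f
    # Upgrade every installed package against the new release's repository.
    run pkg upgrade
    # Finish the remaining stage of the base system install.
    run freebsd-rustdate install
}

finish_userland_upgrade
```

The exact number of `install` passes needed may differ from this sketch; the tool itself reports when another pass is required.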
But my skepticism was unwarranted: the fully upgraded system didn’t exhibit the symptoms I’d seen earlier and booted right into multi-user mode without a hint of a reboot loop! I had been prepared to really dive into the issue, disabling ACPI to see if that fixed the problem and bisecting my way through the system services to see if any were responsible (even though the loop happened in single-user mode as well).
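For anyone who does need to go down the ACPI route (it was never actually tried here), a minimal sketch of that experiment follows. It relies on the `hint.acpi.0.disabled` loader hint documented in acpi(4); the function name and the scratch-file demonstration are hypothetical, and whether a given machine will boot at all with ACPI disabled is a separate question.

```shell
#!/bin/sh
# Append the ACPI-disable hint to a loader.conf-style file, at most once.
# $1 is the file to edit; it defaults to the real /boot/loader.conf.
disable_acpi_hint() {
    conf="${1:-/boot/loader.conf}"
    line='hint.acpi.0.disabled="1"'
    # Only append if the exact line is not already present.
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Demonstrate against a scratch file rather than the real /boot/loader.conf.
scratch="$(mktemp)"
disable_acpi_hint "$scratch"
disable_acpi_hint "$scratch"   # second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

Removing the line again (or setting the hint to `0`) restores the default behavior, so the experiment is easy to undo from the test rig if the server refuses to boot.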
So if anyone finds themselves stuck with a reboot loop after upgrading to FreeBSD 15, try sticking the disk in another machine and completely upgrading the userland to FreeBSD 15 to see if that takes care of your problem. It did for me – though it certainly left me feeling particularly dissatisfied that I wasn’t able to figure out what actually caused the issue and, of course, left me wondering whether it will start happening again.[3]
[1] A really dumb – yet undeniably useful – trick I learned for capturing everything that takes place prior to a kernel panic or other sudden reboot when the machine isn’t connected over serial. Don’t knock it until you’ve tried it: the iPhone camera is fast enough to capture everything that gets dumped to the screen, and you can scrub through it as slowly as you like.
[2] Although an errant interpretation of the ACPI data would normally result in a full shutdown, not a system reboot.
[3] Which is the only healthy reaction to heisenbugs if you’re not able to determine what’s causing them!