My UAG is Down!

Hi folks. I recently worked a case with a customer running a pair of UAG 2103 servers that were down. Both were pingable, but neither was serving clients, and the administrators could not reach the Management Interface (port 9443). They had already rebooted them a couple of times to try to get them back in service. Notably, their UAG1 went down first, the load balancer shifted traffic to UAG2, and then UAG2 failed about 2 hours later.

I hopped on a Zoom session with our customer and got a web console session to the UAG1 appliance.

First check: I ran netstat -ano and noted there were no listeners for ports 443, 6443, or 9443, so it was pretty much dead in the water. Fortunately, the customer was running a pair of UAGs with a load balancer in front, so business wasn’t impacted (they had fewer than 1,000 Horizon users).
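For reference, that listener check can be sketched on any Linux box like this. I'm using ss here (the modern replacement for netstat) with a netstat fallback, and checking the client (443) and admin (9443) listeners; adjust the port list for your own deployment:

```shell
# A healthy UAG should show listeners on its service ports
# (e.g. 443 for clients, 9443 for the admin UI).
# Try ss first; fall back to netstat on older images.
(ss -tln 2>/dev/null || netstat -tln 2>/dev/null) \
  | grep -E ':(443|8443|9443)\b' \
  || echo "no UAG listeners found"
```

An empty match (the "no UAG listeners found" path) is the dead-in-the-water state we saw.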

After continuing basic troubleshooting, the answer came in the form of df -h. The volume /dev/sda2 was at 100%, which killed the system. Checking through the filesystem for large files, we found the culprit in /var/log: auth.log had grown to 6.8GB and filled up the root partition. Once that happens, several things will be seen:
– Services will start failing (including the UAG services).
– System logging facilities will fail (no space left to write logs).
– Possible system/service corruption if a critical configuration file was being written to disk when the partition filled up.

All three are ‘bad things’ in the IT world.
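The hunt we did can be sketched with standard Linux tools. The paths below mirror our case (root partition full, /var/log the suspect); run as root for the full picture:

```shell
# 1. Confirm which filesystem is full
df -h

# 2. Find the biggest space hogs under /var/log
#    (-x keeps du on one filesystem, since /dev/sda2 was the full one here)
du -xh /var/log 2>/dev/null | sort -rh | head -10

# 3. Or sweep the whole root filesystem for any file over 500MB
find / -xdev -type f -size +500M -exec ls -lh {} \; 2>/dev/null
```

In our case, step 2 surfaced the 6.8GB auth.log almost immediately.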

The Initial Fix:
So, the short-term fix is to clear out space on the filesystem. In this situation, it was the auth.log file that was taking up all the free space. This log tracks system authorization information, including user logins and the authentication mechanism that was used. That covers local users authorized via PAM and sudo, as well as onboard processes that use service accounts on the system. On inspection of the file, every authorization event generates three entries, so if there’s a lot going on with your UAG, the file can grow substantially over time. In this instance, the UAG had accumulated 6.8GB of logging in 8 months; that’s a lot of logging.
I wasn’t sure if the customer had an auditing requirement in place, but figured it would be safer to copy the file out instead of just killing it. After using WinSCP to copy the file off the appliance, we ran truncate -s 0 auth.log to “zero” out the file without deleting it. This keeps the file’s ownership and permissions intact, and it frees the space immediately even if a daemon still holds the file open; deleting an open file doesn’t release its blocks until the process closes it.
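Here’s a self-contained demo of the copy-then-truncate approach on a scratch file. On the real appliance the target was /var/log/auth.log and the copy went off-box via WinSCP; everything below stays in a temp directory:

```shell
# Demo of copy-then-truncate on a scratch file standing in for auth.log
tmp=$(mktemp -d)
printf 'sshd: session opened for user demo\n' > "$tmp/auth.log"
chmod 640 "$tmp/auth.log"

cp -p "$tmp/auth.log" "$tmp/auth.log.saved"  # preserve a copy for auditing
truncate -s 0 "$tmp/auth.log"                # zero the live file in place

ls -l "$tmp"  # auth.log is now 0 bytes; mode 640 and ownership unchanged
```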

Still Not Working:
After clearing out the space and rebooting, the system came back up and we had a console to look at in vCenter, but services still weren’t coming up. Back to the command line.

We verified that there was now plenty of free space on the system, then ran netstat -ano again and noted that none of the listener services were up and running (ports 443, 8443, etc.). We then headed to the system logs to discover what was failing.

The answer was in /var/log/messages, where there were numerous startup errors for Java configuration files (I didn’t get a snapshot of those to share). Java is the engine that runs the interfaces clients connect to inside the UAG, so if Java ain’t happy – nobody’s happy!
Fortunately, the customer had a recent backup of the UAG from a couple of days back, so we didn’t have to figure out how to repair a Java configuration and a likely file-corruption issue. They restored the prior UAG, did a little reconfiguration of their environment, and were back up and running again. We then walked back through the truncation of the auth.log file to ensure it wouldn’t crash again in a couple of days, and they ran through the same process on the second UAG server to get the environment back to full redundancy.

Long Term Fix:
I pulled the logs for a review of what could be spamming auth.log, but everything looked like valid entries. Nothing was generating excessive or concerning authentications; it was just a UAG doing its business. From that, it would seem this file should have been included in standard log rotation (via logrotate or a cron job) to manage the storage space better. It’s possible that VMware excluded this file from log rotation as an auditing ‘best practice’: by default, you don’t delete log entries that could indicate or trace a compromise, and you instead make the administrator or audit team clear the logs intentionally, to preserve any potential evidence.
In our case, the customer was OK with adding the auth.log file to the standard log rotation job.
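As a sketch, such a policy could look like the logrotate drop-in below. The file name, size cap, and retention count are illustrative assumptions, not the customer’s actual values; in production this would live at /etc/logrotate.d/, but the demo writes to a temp file:

```shell
# Hypothetical logrotate policy for auth.log (illustrative values).
# In production this would be e.g. /etc/logrotate.d/auth-log.
conf=$(mktemp)
cat > "$conf" <<'EOF'
/var/log/auth.log {
    weekly
    rotate 4
    maxsize 100M
    compress
    missingok
    notifempty
    copytruncate
}
EOF
cat "$conf"
```

copytruncate matters here: it rotates the file the same way our manual fix did, so whatever daemon holds auth.log open never loses its file handle.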

So, some key takeaways from this event:
– True backups (not snapshots) of your infrastructure are a fantastic fall-back position!!
– Horizon admins that use UAG, check in on your free space occasionally!!

Questions/comments are always welcome here.
Hope this helps!

VMware Horizon: Internal Error Occurred – How the FLEX console saved us.

Hey folks. Again, it’s been a while since I’ve posted something here, but I recently found something worth sharing. A customer ran into an issue while running Horizon 7.12 and trying to do a Recompose on a pool.

Background: The customer had installed Horizon 7.12 and created a pool of GPU-enabled desktops using NVIDIA GRID. Unfortunately, after a 2-month deployment, he found that the virtual desktops (VDs) were experiencing lag, screen artifacts, and overall slowness. The admin did some research and found more optimal settings for the pool (not going to discuss the changes here), including “allocate all memory” for the VD pool to help with video processing. After making the changes to the base image and taking a snapshot, the admin kicked off a Recompose on the pool and got a very vague “An Internal Error Occurred”, and that was it. No useful errors in the vCenter or Horizon console at all; the Recompose was just failing.

Troubleshooting: What’s the first thing you do in this situation? Pull a log bundle on Horizon and find the problem! That’s what we did, having the customer timestamp when the Recompose failed and relay that along with the log bundle for review. Digging through the logs, nothing was obviously failing. Going to my favorite “needle in a haystack” analysis style, I pulled up BareGrep and started doing targeted grepping of the log bundle for “fail”, “error”, “internal error”, etc. What I came up with in a Debug log was a _literally_ cryptic message:

2020-10-30T15:13:49.440-05:00 TRACE (2564-1EF8) [Event] Raising windows event ([VLSI_DESKTOP_RECOMPOSE_FAILED] “domain.org\username failed to request a recompose of 99 machine(s) in desktop Graphics users no video card. Full Adobe Suite”: Node=server.domain.org, DesktopId=graphicsusers, Severity=AUDIT_FAIL, Time=Fri Oct 30 15:13:49 CDT 2020, MachinesCount=99, ViewAPIDesktopId=Desktop/Yjg3YTVlNTYtNTVhMy00YWIxLTkyOTEtMTc3YjAxMThmOTZl/Z3JhcGhpY3N1c2Vycw, DesktopDisplayName=Graphics users no video card. Full Adobe Suite, Source=com.vmware.vdi.vlsi.server.resources.DesktopViewComposerManager, UserSID=########, Module=Vlsi, UserDisplayName=#####.###\######, Acknowledged=true)
2020-10-30T15:13:49.440-05:00 ERROR (2564-2210) [RestApiServlet] Unexpected fault:(vdi.fault.EntityNotFound) {
errorMessage = BaseImageVm does not exist on VC VirtualCenter/Yjg3YTVlNTYtNTVhMy00YWIxLTkyOTEtMTc3YjAxMThmOTZl/MjY3NzJkN2QtODBkYi00OWI1LTkxMmMtMTM0MDNjZTY1OGEw,
id = (vdi.EntityId) {
dynamicType = null,
dynamicProperty = null,
id = BaseImageVm/Yjg3YTVlNTYtNTVhMy00YWIxLTkyOTEtMTc3YjAxMThmOTZl/MjY3NzJkN2QtODBkYi00OWI1LTkxMmMtMTM0MDNjZTY1OGEw/L0RhdGFjZW50ZXIvdm0vVkRJIC0gR3JhcGhpY3M/dm0tMTA5
}
} for uri /view-vlsi/rest/v1/desktop/recompose
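BareGrep is a Windows tool; the same pass over an unpacked log bundle can be sketched with plain grep. The bundle contents below are a tiny stand-in, not the customer’s logs:

```shell
# Build a stand-in "log bundle" and sweep it for failure keywords,
# the command-line equivalent of the BareGrep pass described above.
bundle=$(mktemp -d)
printf '2020-10-30T15:13:49 ERROR Unexpected fault: EntityNotFound\n' > "$bundle/debug.log"
printf '2020-10-30T15:13:48 INFO heartbeat ok\n' >> "$bundle/debug.log"

# -r recurse, -i case-insensitive, -n line numbers, -E extended regex
grep -rinE 'fail|error|internal error' "$bundle"
```

Only the ERROR line matches, which is exactly how the EntityNotFound fault surfaced out of the noise.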

The internals of Horizon obfuscate the BaseImage name in those long ID strings, so at the time we couldn’t figure out what it was trying to look for (although the error message itself was a clue – keep reading). After chatting with a colleague at VMware, it was noted that we were using the HTML5 interface for Horizon management (as you should these days) and that we might get more information by doing the Recompose in the FLEX interface. Although FLEX is going away, it still has some features and reporting/feedback that are not in the HTML5 interface yet. Per my source, FLEX was deprecated as of vSphere 7.0, since the HTML5 client had been built out well enough to be the only management interface for ESXi and vCenter. However, other products’ HTML5 management consoles are still being developed, Horizon’s in particular. So, as long as your version of Horizon shipped with a FLEX console, you have access to that alternate console.
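As it turns out, those long ID segments aren’t hashes at all; they’re Base64. Decoding the tail of the BaseImageVm id from the fault above (with a trailing ‘=’ added to pad it to a valid Base64 length) shows exactly what Horizon was looking for:

```shell
# Decode the inventory-path segment of the BaseImageVm id from the fault
echo 'L0RhdGFjZW50ZXIvdm0vVkRJIC0gR3JhcGhpY3M=' | base64 -d; echo
# -> /Datacenter/vm/VDI - Graphics

# ...and the final segment, the vCenter managed object reference
echo 'dm0tMTA5' | base64 -d; echo
# -> vm-109
```

That decoded path matches the error’s complaint: the BaseImageVm no longer existed at that inventory location.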

So, bringing up the FLEX console and walking through a Recompose got us some additional information: the FLEX error dialog showed the actual file reference the Recompose was failing on.

That file reference was clue #2 of the puzzle: we now knew what file the Recompose was looking for. We drilled into vCenter to verify that the VM was there, and sure enough, under “VMs & Templates” it was located at /DataCenter/vSAN/VDI-Graphics. But that still didn’t look right…
On a hunch, I asked the customer to clear the error and use the “Change/Browse” button on the “Parent VM” field. Once we did, we found the problem: in the pop-up for locating the Parent VM, we were presented with a “/Datacenter/vm/Parent/…” folder structure, where all the Parent VMs were now located – not “/Datacenter/vm/…”.
Apparently, some process, or someone, had moved the Parent VMs down one folder level, consequently breaking all pool Recompose operations until the Parent VM field was repointed to the new location.

Once we re-pointed to the new location of the Parent VM, the Recompose process went off without a hitch! The customer went back, checked the configuration of the other pools, and found they were affected by this issue as well. Fortunately, you can re-point in either the HTML5 or FLEX interface, but this error wasn’t surfaced well in the HTML5 console. The HTML5 console is still a work in progress, so when you run into error conditions that aren’t explained well – give the FLEX console a shot!

Hope this helps.