failed – VDI Matters! (things that deal with VDI)

Hi folks. I recently worked a case with a customer whose pair of UAG v.2103 servers were down. Both were pingable, but neither was serving clients, and the administrators could not get to the Management Interface (port 9443). They had already rebooted them a couple of times to try to get them back in service. It was noteworthy that their UAG1 went down and the load balancer shifted the traffic to UAG2, but UAG2 failed about 2 hours later.

I hopped on to a Zoom session with our customer and got a Webconsole session to the UAG1 appliance.

First check: we ran netstat -ano and noted there were no listeners for ports 443, 6443 or 9443, so it was pretty much dead in the water. Fortunately, the customer was running a pair of UAGs with a load balancer in front, so business wasn’t impacted (they had fewer than 1,000 Horizon users).
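
For anyone who wants to repeat that listener check without eyeballing the whole netstat dump, here is a minimal sketch. The function name missing_listeners is my own (hypothetical), and it assumes the Linux netstat column layout where the local address is the 4th field and the state is the 6th.

```shell
# Print every expected port that has no TCP socket in LISTEN state.
# Reads `netstat -an`-style text on stdin; ports are passed as arguments.
missing_listeners() {
    input=$(cat)
    for port in "$@"; do
        # Field 4 is the local address (e.g. 0.0.0.0:443 or :::443),
        # field 6 is the socket state.
        if ! printf '%s\n' "$input" |
            awk -v p="$port" '$1 ~ /^tcp/ && $4 ~ (":" p "$") && $6 == "LISTEN" { found = 1 }
                              END { exit !found }'; then
            echo "$port"
        fi
    done
}

# Usage on the appliance (ports are the ones mentioned in this post):
#   netstat -ano | missing_listeners 443 6443 8443 9443
```

An empty result means every port in the list has a listener. In our case, every port would have been printed.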

After some more basic troubleshooting, the answer came in the form of: df -h. The volume /dev/sda2 was at 100%, and that killed the system. Checking through the filesystem for large files, we found the culprit in /var/log: auth.log had grown to 6.8GB and filled up the root partition. Once that happens, several things will be seen:
– Services will start failing (including services for UAG).
– System logging facilities will fail (due to no space to write logs)
– Possible system/services corruption if a critical file/configuration file was in the process of being written to disk when the partition became full.

All three are ‘bad things’ in the IT world.
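
If you need to hunt down the space hog yourself, the "check for large files" step can be sketched roughly like this. The helper name largest_files is hypothetical, and the -printf option assumes GNU find is available on the box.

```shell
# List the N largest regular files under a directory, biggest first.
# -xdev keeps find on one filesystem so other mounts are not counted.
# Output format: size in bytes, a tab, then the path.
largest_files() {
    dir=$1
    count=${2:-10}
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null |
        sort -rn |
        head -n "$count"
}

# Usage:
#   df -h /                    # which filesystem is full?
#   largest_files /var/log 5   # what is eating it?
```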

The Initial Fix:
So, the short-term fixup is to clear out space on the file system. In this situation, it was the auth.log file that was taking up all the free space. This log tracks system authorization information, including user logins and the authentication mechanism that was used. That covers authorizing local users via PAM, sudo, and also onboard processes that use service accounts on the system. On inspection of the file, there are 3 entries for every authorization event, so if there’s a lot going on with your UAG, it can grow over time. In this instance, the UAG had accumulated 6.8GB of logging in 8 months. That’s a lot of logging.
I wasn’t sure if the customer had an auditing requirement in place, but figured it would be safer to copy out the file instead of just killing it. After using WinSCP to copy the file out of the partition, we then ran: truncate -s 0 auth.log to “zero” out the file without deleting the actual file name, so actual file permissions would remain unchanged.
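
The same copy-then-truncate idea can be sketched as a small shell helper. This is only a sketch: archive_and_truncate is a made-up name, and it assumes the destination directory lives on a different filesystem with free space (we copied off-box with WinSCP precisely because the root partition was full).

```shell
# Copy a log somewhere safe, then empty the original in place.
# Truncating (instead of deleting) keeps the same file, owner and mode,
# so the process writing to it does not need to reopen anything.
# ASSUMPTION: dest_dir is on another filesystem that has room for the copy.
archive_and_truncate() {
    src=$1
    dest_dir=$2
    copy="$dest_dir/$(basename "$src").$(date +%Y%m%d%H%M%S)"
    # Only truncate if the copy succeeded.
    cp -p "$src" "$copy" && truncate -s 0 "$src" && echo "$copy"
}

# Usage (destination path is hypothetical):
#   archive_and_truncate /var/log/auth.log /mnt/backup
```

The function prints the path of the saved copy, which is handy to hand over to whoever owns the auditing requirement.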

Still Not Working:
After clearing out the space and rebooting, the system came back up and we had a console to look at in vCenter, but services still weren’t showing up. Back to the command line.

We verified that there was plenty of free space on the system now and then ran: netstat -ano again, and noted that none of the listener services were up and running (ports 443, 8443, etc). We then headed to the system logs to discover what is failing us here.

The answer was in /var/log/messages, where there were numerous startup errors with Java configuration files (I didn’t get a snapshot of those to share). Java is the engine that runs the GUI that clients connect to inside of UAG, so if Java ain’t happy, nobody’s happy!
Fortunately, the customer had a recent backup of the UAG from a couple of days back, so we didn’t have to figure out how to fix a Java configuration and a likely file corruption problem. After a restore of the prior UAG and a little reconfiguration of their environment, they were back up and running again. We walked back through the truncation of the auth.log file to ensure that it wouldn’t crash again in a couple of days. They then ran through the same process on the 2nd UAG server to get the environment back to full redundancy.

Long Term Fix:
I pulled the logs to review what could be spamming auth.log, but everything looked like valid entries. Nothing was generating excessive or concerning authentications; it was just a UAG doing its business. From that, it would seem this file should’ve been included in standard log rotation via a cron job or something similar to manage the storage space better. It’s possible that VMware excluded this file from log rotation due to an auditing ‘best practice’: do not delete log entries that could indicate or trace a compromise _by default_, and instead make the Administrator or Audit team clear the logs by intent, to preserve any potential evidence.
In our case, the customer was OK with adding the auth.log file to the standard cron job log rotations.
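
To give a feel for what a cron-driven rotation could look like, here is a rough sketch of a size-based rotate. It is NOT the appliance's built-in rotation mechanism (I don't have that config to share); the function name, threshold and crontab line are all assumptions for illustration. It follows the same copy-then-truncate idea as logrotate's copytruncate option.

```shell
# Rotate a log once it grows past max_bytes: keep a compressed,
# timestamped copy next to it, then truncate the live file in place.
rotate_if_large() {
    log=$1
    max_bytes=$2
    size=$(wc -c < "$log") || return 1
    if [ "$size" -gt "$max_bytes" ]; then
        # Only truncate if the compressed copy was written successfully.
        gzip -c "$log" > "$log.$(date +%Y%m%d%H%M%S).gz" && truncate -s 0 "$log"
    fi
}

# Hypothetical nightly crontab entry calling a script that wraps this function:
#   0 2 * * * /root/rotate_auth.sh
# where the script would run something like:
#   rotate_if_large /var/log/auth.log 104857600   # 100 MB, an arbitrary threshold
```

Note that this sketch never prunes old archives, so someone (or another job) still has to move or delete the .gz files, which also fits the "clear logs by intent" auditing mindset.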

So, some key takeaways from this event:
– True backups (not snapshots) of your infrastructure are a fantastic fall-back position!!
– Horizon admins that use UAG, check in on your free space occasionally!!
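
For the free space check-in, a small sketch like the following could be run by hand or from a monitoring job. The function names and the 80% threshold are my own choices, not anything from VMware.

```shell
# Print the use% (as a bare integer) of the filesystem that holds a path.
# df -P gives the portable one-line-per-filesystem format; column 5 is capacity.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a warning and return 1 when usage is at or above the limit.
check_usage() {
    path=$1
    limit=$2
    pct=$(usage_pct "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is at ${pct}% (limit ${limit}%)"
        return 1
    fi
}

# Usage (threshold is arbitrary):
#   check_usage / 80
```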

Questions/comments are always welcome here.
Hope this helps!


My UAG is Down!