From noon on Friday, Jan. 10, until around 1 p.m. on Saturday, Jan. 11, the Campus Data Center suffered a severe service outage that persisted intermittently.
For most of the outage, the Private Cloud Service (PCS) was unavailable. The service is hosted at the Quest Data Center on the UC Davis campus. Clients saw all associated servers and services become unavailable. Almost all Information and Educational Technology (IET) central services were affected, leaving SmartSite and related websites inaccessible.
“The first outage of the PCS was Friday from noon till about 3 p.m. Once the PCS was restored from this outage, it took several more hours to restore all services. Possibly related to the outage, the uConnect firewall stopped working properly,” said Dave Zavatson, a principal engineer at the Campus Data Center. “While the PCS service and most dependent services were restored by 5:30 p.m., uConnect continued to be unavailable until 8 p.m. because of firewall issues.”
At around midnight on Jan. 11, the PCS started exhibiting similar symptoms on a smaller scale. While the earlier outage eventually affected all hosts, this new outage initially affected only two VMware hosts (VMware is virtualization software).
Workers at IET and the Campus Data Center began troubleshooting again, but at around 3 a.m. the second outage spread to more hosts, making critical services unavailable.
Administrators continued working on the problem, and PCS service was restored around 7 a.m. Restoration of dependent services then began, and all services were back by around 1 p.m.
“We know that the outage is due to storage area network software code. We have cases open with both VMware and NetApp to isolate precisely what caused the service to be unable to communicate with the SAN,” said Babette Schmitt, chief information officer for IET.
Mark Redican, Telecom and Data Center director, sees very little logical relation between the uConnect firewall problems and the PCS failure.
“We are still investigating whether the uConnect firewall problems are related to the PCS failure. The timing of the failure is certainly suspect, but there is very little interaction between the two services except that some uConnect guest and external clients that rely on uConnect are hosted in the PCS,” Redican said. “We tried various troubleshooting steps to resolve the issue and ultimately needed to completely power off both firewalls and bring them back up sequentially to restore proper service.”
Because the outage was so widespread, IET had difficulty communicating with clients about the status of services.
IET has a Twitter account, which posted outage and status notifications; however, the account has only 338 followers, a small fraction of the 33,300 students who use these services.
“An outage such as this demonstrates just how critical these campus services are,” Zavatson said. “IET recognizes this and takes these services very seriously. Once the root cause analysis is complete we will make an incident report available to campus.”
The Campus Data Center’s next steps will be to work with campus leadership and Data Center clients to identify priorities and solutions for business continuity.
“Solutions exist to address these shortcomings, and no doubt we’ll have much campus discussion about implementing them,” Schmitt said.