BART provides update on Saturday's computer network failure


Tamar Allen, Assistant General Manager, Operations, provided the following update to the BART Board of Directors today on Saturday’s computer problems, which prevented trains from being dispatched between 6 am and 9 am:

The type of failure that occurred this weekend is very rare. The last time a network switch failure caused a major disruption to service was in March of 2006. In that case, the failure was triggered by human error during a software update. This weekend’s disruption was caused by a fault in the switch itself. Two efforts are underway that will protect against the type and magnitude of failure experienced this weekend. We are accelerating both of these efforts:

1.     First, upgrading computer hardware and the network infrastructure to take advantage of emerging technology, protocols and standards for data management and cyber security.

2.     Second, establishing a remote redundant network disaster recovery center which would protect revenue service in the event of a network failure. This data center is expected to be fully built out within a month and fully operational within a couple of months.

More specifically this is what occurred on Saturday morning:

At approximately 02:45 am, a single network switch, one of many in a complex system, failed. Essentially, instead of processing and passing on data, the switch kept recirculating it, generating an unmanageable data spike. The number of data packets requiring processing quickly increased from a norm of about 200 per millisecond to more than 54,000 per millisecond. This overwhelmed the failed switch and had a cascading impact on other switches in the network, resulting in a loss of communication between the Operations Control Center and all systems and devices in the field.
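To put the size of the spike in perspective, the short sketch below works out the ratio between the two rates quoted above. The alarm rule and its factor of 10 are hypothetical and purely illustrative; they do not describe BART's or Cisco's actual monitoring.

```python
# Rates quoted in the update, per millisecond.
NORMAL_RATE = 200      # typical load ("a norm of about 200")
SPIKE_RATE = 54_000    # load during the failure ("more than 54K")


def amplification(normal: float, spike: float) -> float:
    """Return how many times larger the spike is than the normal rate."""
    return spike / normal


def is_storm(rate: float, baseline: float = NORMAL_RATE, factor: float = 10) -> bool:
    """Hypothetical alarm rule: flag any rate above `factor` times the baseline.

    The factor of 10 is an assumption chosen for illustration only.
    """
    return rate > baseline * factor


if __name__ == "__main__":
    print(amplification(NORMAL_RATE, SPIKE_RATE))  # 270.0
    print(is_storm(SPIKE_RATE))                    # True
```

In other words, the switch was asked to handle at least 270 times its normal load.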

Cisco, our network vendor, has confirmed that the failure was due to a fault within the switch itself. Because of this failure, the Operations Control Center was unable to safely run service. It was also unable to issue BART Service Advisories, which resulted in inaccurate mobile and web-based trip planning and Real Time Departure information until staff were able to remotely cancel all BART trips in the Trip Planner on our website and app.

We worked around these customer notification problems by:

·        Sending BART Service Advisory emails and texts to weekend subscribers

·        Posting updated information on our website including a big red alert box

·        Working closely with the media to frequently broadcast the status of service

·        Using our extensive social media network to update customers

Engineering and Maintenance teams responded to troubleshoot the problem. By 06:45 am, the problem had been identified and isolated. Once the computer system was stable, engineers began to methodically bring associated field devices back online. This required coordination with field staff and safety validation by the Operations Control Center (OCC). Priority was given to those devices and systems that support safe train operations.

At 09:00 am, train service commenced everywhere except south of Daly City. By 10:40 am, most station systems had been restored. At 11:08 am, full train service was restored. We continued to have minor platform digital sign problems throughout the weekend.
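For readers who want the elapsed times, the sketch below computes them from the clock times given in this update. The dictionary keys and the helper name are illustrative, and because the failure time is stated as approximate, the results are approximate as well.

```python
from datetime import datetime

# Clock times as stated in the update (same day, 24-hour format).
TIMELINE = {
    "switch_failed": "02:45",          # approximate, per the update
    "problem_isolated": "06:45",
    "service_resumed": "09:00",        # everywhere except south of Daly City
    "most_station_systems": "10:40",
    "full_service": "11:08",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from `start` to `end`, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    start = TIMELINE["switch_failed"]
    for event, clock in TIMELINE.items():
        if event != "switch_failed":
            print(event, minutes_between(start, clock), "minutes after failure")
```

By this count, roughly 4 hours passed before the problem was isolated, and about 8 hours 23 minutes passed between the switch failure and full restoration of train service.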

The failed switch has been replaced.

Again, we are accelerating the two efforts which were already underway prior to this event:

·        Upgrading computer hardware and the network infrastructure

·        Establishing a remote redundant network disaster recovery center