Federal technology news and analysis over the last month have been dominated by the OPM cybersecurity failures, and with good cause. We should all continue to track and learn from the lessons there. But there is also a danger that we will all focus too hard on that single topic when there are so many other important enterprise IT issues to dive into. One topic I believe federally focused enterprise technologists should pay more attention to is the recent reporting of issues at the Department of State’s Bureau of Consular Affairs.
The Bureau of Consular Affairs' Consular Consolidated Database (CCD) is a system critical to the nation. When it works, the unsung heroes who keep it running do not get nearly enough credit. When it fails, the disruption to U.S. citizens abroad and at home and to business interests everywhere is almost too high to calculate.
The CCD is fundamental to almost all of the services provided by the Bureau. It is key to processing passport applications for citizens who wish to travel as well as applications from visa applicants. The review of data associated with the more than 50,000 visa applications a day is in many ways the first line of defense for the nation. So really, the data held and operated on by this group is of paramount importance to our national security.
Imagine if the system of systems supporting the Bureau’s missions were to have outages. Our citizens overseas would not be served as well as they could be, overseas emergency passport applications would be delayed, visa applications would be more backlogged, and important collaboration with the national security community on issues like counterterrorism would be jeopardized.
Now imagine if the data in this system were made available to our nation’s adversaries. Imagine, for example, if the same actor that had over 100 million health records of U.S. citizens and 18 million investigation and employment records of federal employees from the OPM breach now had access to every passport ever applied for by a U.S. citizen and every visa ever applied for by any foreign citizen.
Imagine if that data included addresses, contact info and even photos of everyone involved. Imagine the correlation of data between and among those data sets and the knowledge that could be extracted.
Imagining this failure scenario matters because it makes clear why we must maintain both the availability and the security of the CCD.
I read everything I could regarding outages impacting the CCD over the last year, including the official statements by the Department of State (the July 2014 outage was reported on here, the June 2015 outage was reported on here) as well as the additional color and context provided by NPR, US News and World Report, and Nextgov. I then contacted the Bureau to ask a few clarifying questions, seeking to extract information that will be of use to the more technical of us in the community.
The following is what I believe the current situation to be:
- The outages of 2014 and 2015 had different causes. The 2014 outages were due to a patch that was intended to resolve a known issue but instead had unintended consequences. The 2015 outage was due to a hardware failure, and recovery was complicated because both the production system and its backup were affected.
- The 2015 outage was painful, but the system is back online and the backlog the outage caused has for the most part been worked off.
- There is no indication that either outage was caused by hostile actors. There are certainly attack scenarios to think through and to plan to mitigate, but all the information I have seen or heard indicates that these were nothing other than software (2014) and hardware (2015) failures.
- The plan to improve the system is to upgrade it so that it provides reliable service while at the same time executing a modernization effort. Upgrading a large, complex system of systems like this is done in ways that cause brief outages. Sometimes this is just totally unavoidable. Due to unforeseen events (the Nepal earthquake and the collapse and war in Yemen were cited by State), the upgrades were delayed. Had they been made, the system might have continued to function.
Opinions:
- Outages occur in complex systems (like the 8 July NYSE and United Airlines outages). However, important systems should not suffer outages.
- The fact that there are outages of a mission-critical system means it is in need of modernization. It also means the senior leadership team at the Bureau should consider changes to their plans.
- The modernization needed may well be redesign vice incremental updates. With new data approaches available from industry (including now proven, production-ready Hadoop-based solutions), State should look at a total redesign, ASAP. A key criterion for the new system should be no outages, ever. Hardware and software both fail, that is true, but well-designed systems account for this in ways that deliver continuous services for users. This system is important enough to have that design criterion met.
- Another design criterion for the new system should be enhanced security, to include absolutely extreme instrumentation and automation.
- A system of this importance could use external help, and I don’t mean help from the same contractors that have been tasked with building it. I mean help from external, independent technologists who can provide insights from across industry and government.
- Reaching out to other government agencies for design assistance is also a best practice. There are lessons at NSA, CIA, DIA, DHS, DoJ and FBI that can inform the design of the CCD.
What are your thoughts?