How network automation could have shortened an all-day customer-visible outage

A friend of mine recently related a tale of woe about network problems at his startup, a cloud service provider. Unfortunately, because they lacked a network automation system, they suffered a day-long customer-visible service outage; if they'd had an appropriate network automation system, they could have dealt with the problem in less than an hour.

It all started with a failing Ethernet switch, one of the pair of core switches in their data center installation. The failing switch would simply drop its 10Gb Ethernet connection to the other core switch, with no warning and no explanation. They tried the obvious quick fixes (try a different port on the failing switch, try a different cable between the switches, etc.), with no success; no matter what they tried, they couldn't resurrect the connection to the other core switch.

For various reasons, a drop-in replacement switch wasn't immediately available. After a physical inspection, counting open and used ports on both switches, they determined that they had just enough open ports on the working switch to allow them to re-home all the connections from the failing switch. "All" they needed to do was configure those ports on the working switch, along with associated VLAN definitions, access control lists, and so forth. Essentially, they needed to merge the functionality from the two switch configs (failing and working) into a single switch config.

Manual Pain and Suffering

Unfortunately, they had to do this configuration work by hand, because they don't use an automated configuration management tool such as NCG. Moving two dozen port configurations (plus associated VLAN definitions, access control lists, and so forth) from one switch to another by hand poses a number of problems:

  • The process is slow and error prone; it took them quite a while (many hours) and several iterations to get it right.
  • The process is complicated by inconsistencies and artifacts from past manual configuration of the devices. For example, they discovered that some of the nominally-unused ports on the "working" switch had been grouped into a port-channel group; they had to take time to understand that, figure out whether it was still needed or not, and then clean up those ports and associated virtual interfaces.
  • The process is risky. While they were making these changes on the working switch, they were risking inadvertently disrupting what was left of their network if they accidentally typo'd a command or applied something to the wrong port.
  • The process is intricate. The changes on the switch necessitate other changes beyond the switch. Even once they had the switch reconfigured, for example, they still needed to update their monitoring systems to monitor all the newly-activated ports on the switch. Since updating the monitoring systems is also a manual process, it too is slow, error-prone, and complicated.

Automated Nirvana

If they had been using an automated configuration management tool such as NCG, they could have been back in service much sooner (probably in less than an hour), with a much higher degree of confidence in the new config for the remaining switch.

A hypothetical automated configuration management system for their network would probably have the following characteristics:

  • A data file for each switch, describing the switch and listing its ports. Each port would probably be described by a single line in this file, containing the following information about the port:
    • name -- e.g., "GigabitEthernet0/3/1".
    • class -- What is this port used for? For example, is it an inter-switch trunk carrying all VLANs? An access port on a particular VLAN? An unused port?
    • description -- a human-meaningful word or phrase describing the port, for use in interface labels, usage graphs, and so forth.
  • A set of master config templates for the switches. Since the two core switches are similar in make/model and in function, the same master config template would likely be used for both, thus ensuring consistency between the two switches.
  • A set of sub-templates for particular classes of ports on the switches; for instance, given the classes described above, you would have sub-templates for classes "trunk", "access", and "unused". In addition to making the appropriate settings for a particular class of port, these sub-templates would also make any necessary additions to related things such as access control lists.
  • A set of templates for configuring the monitoring system (or systems) such as MRTG, Nagios, or similar. These would be used to generate monitoring configs that completely and correctly correspond to the switch configs.
  • An automated mechanism for getting configs onto the switches, such as RANCID or ZipTie.
  • A revision control mechanism such as RCS, CVS, Subversion or Git, to provide a history of the templates and data files that are inputs to the config generation process, as well as of the generated and installed configs.
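
To make the data-file and sub-template idea concrete, here is a minimal sketch in Python. It is not NCG's actual file format or template language: the "name | class | description" line format, the "access:20" way of naming a VLAN, the port descriptions, and the IOS-style commands are all assumptions made for illustration.

```python
from string import Template

# Hypothetical port list: one line per port, "name | class | description".
# An access port carries its VLAN after a colon ("access:20"). This layout is
# an assumption for illustration, not the format of any particular tool.
PORT_LIST = """\
GigabitEthernet0/3/1 | trunk     | uplink to core-sw2
GigabitEthernet0/3/2 | access:20 | web server 1
GigabitEthernet0/3/3 | unused    | spare
"""

# One sub-template per port class (IOS-style syntax, illustrative only).
SUB_TEMPLATES = {
    "trunk": Template(
        "interface $name\n description $description\n switchport mode trunk\n"
    ),
    "access": Template(
        "interface $name\n description $description\n"
        " switchport mode access\n switchport access vlan $vlan\n"
    ),
    "unused": Template(
        "interface $name\n description $description\n shutdown\n"
    ),
}


def parse_ports(text):
    """Parse the port list into a list of dicts, one per port."""
    ports = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, port_class, description = (f.strip() for f in line.split("|"))
        port_class, _, vlan = port_class.partition(":")
        ports.append({
            "name": name,
            "class": port_class,
            "vlan": vlan,
            "description": description,
        })
    return ports


def render_switch_config(ports):
    """Render each port with the sub-template that matches its class."""
    return "!\n".join(SUB_TEMPLATES[p["class"]].substitute(p) for p in ports)


def monitored_ports(ports):
    """Names of the ports the monitoring config should cover (all in-use ports)."""
    return [p["name"] for p in ports if p["class"] != "unused"]


if __name__ == "__main__":
    ports = parse_ports(PORT_LIST)
    print(render_switch_config(ports))
    print("monitor:", ", ".join(monitored_ports(ports)))
```

The point of the design is that the switch config and the list of monitored ports are both derived from the same port list, so they cannot drift apart the way hand-maintained configs do.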

Here are the steps they could have followed instead of doing everything by hand, had they been using such an automated system:

  1. Review the switch port lists to simply count the number of ports used on the failing switch and the number of ports available on the remaining switch, to quickly determine that there were enough open ports available on the remaining switch to accommodate everything.
  2. Edit the "port" list for the remaining switch, cutting and pasting the lines from the list for the failing switch, and making minor adjustments as necessary (in particular, to port names, since it's unlikely that the open ports on the remaining switch exactly correspond to the used ports on the failing switch).
  3. Generate the new config file for the remaining switch, as well as all dependent config files (i.e., for the monitoring systems).
  4. Inspect the newly-generated config files for reasonability, likely by comparing them to the previously-generated config files from before this change.
  5. Install the newly-generated config files on the relevant systems, using tools such as RANCID or ZipTie.
  6. Check all the updates into the revision control system (RCS, CVS, Subversion, Git, or whatever) so that there's a record of changes and a fallback position.
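
Steps 1, 2, and 4 above are mechanical enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: ports are represented as dictionaries with hypothetical "name", "class", and "description" keys, the function names are invented here, and the sketch ignores cases that need a human decision (for example, the trunk between the two core switches, which no longer needs to be carried over).

```python
import difflib


def rehome_ports(failing, remaining):
    """Steps 1 and 2: check that the remaining switch has enough unused
    ports, then move every in-use port from the failing switch onto one.

    Returns a new port list for the remaining switch; inputs are not modified.
    """
    to_move = [p for p in failing if p["class"] != "unused"]
    open_names = [p["name"] for p in remaining if p["class"] == "unused"]
    if len(to_move) > len(open_names):
        raise ValueError(
            f"need {len(to_move)} open ports, only {len(open_names)} available"
        )
    # Pair each open port name on the remaining switch with a port to move.
    assignment = dict(zip(open_names, to_move))
    merged = []
    for port in remaining:
        moved = assignment.get(port["name"])
        if moved is None:
            merged.append(dict(port))  # unchanged port
        else:
            # Keep the moved port's class and description, but use the
            # name of the physical port on the remaining switch.
            merged.append({**moved, "name": port["name"]})
    return merged


def config_diff(old_config, new_config):
    """Step 4: unified diff of the previous and newly generated configs."""
    return "".join(
        difflib.unified_diff(
            old_config.splitlines(keepends=True),
            new_config.splitlines(keepends=True),
            fromfile="before",
            tofile="after",
        )
    )


if __name__ == "__main__":
    failing = [
        {"name": "Gi0/1", "class": "access", "description": "db server 1"},
        {"name": "Gi0/2", "class": "unused", "description": "spare"},
    ]
    remaining = [
        {"name": "Gi0/1", "class": "access", "description": "web server 1"},
        {"name": "Gi0/2", "class": "unused", "description": "spare"},
    ]
    for port in rehome_ports(failing, remaining):
        print(port)
```

Reviewing a diff of generated configs, rather than typing commands on the one switch still carrying traffic, is what removes most of the risk described earlier.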

Comparison of manual and automated results

Using network automation tools such as NCG, RANCID, and ZipTie:

  • The incident could have been resolved in less than an hour, rather than the outage lasting all day while the incident was resolved by hand.
  • You could be much more confident that the resulting configs were complete, consistent, and correct.
  • All related configurations (i.e., the switch configurations and the monitoring system configurations) could be updated together, maintaining consistency between them.

In my experience, it only takes a week or two of work to use open source tools to assemble a network automation system for an existing network such as this (i.e., a handful of related switches and associated monitoring systems, all of which you already have working manually-created configs for).

Hopefully, my friend's company will see the light, and automate their network management so that they're better prepared for next time; maybe they'll even offer me a consulting contract to help them get there... ;-)

Please contact us to discuss how Netomata can help you avoid problems like this with your network, while making your network more cost-effective, reliable, and flexible.