A friend of mine recently related a tale of woe
about network problems at his startup, a cloud service provider.
Unfortunately, because they lacked a network automation system,
they suffered a day-long customer-visible service outage; if
they'd had an appropriate network automation system, they could
have dealt with the problem in less than an hour.
It all started with a failing Ethernet switch, one of the pair
of core switches in their data center installation. The failing
switch would simply drop its 10Gb Ethernet connection to the other
core switch, with no warning and no explanation. They tried the
obvious quick fixes (try a different port on the failing switch,
try a different cable between the switches, etc.), with no success;
no matter what they tried, they couldn't resurrect the connection
to the other core switch.
For various reasons, a drop-in replacement switch wasn't immediately
available. After a physical inspection, counting open and used ports
on both switches, they determined that they had just enough open ports
on the working switch to allow them to re-home all the connections
from the failing switch. "All" they needed to do was configure
those ports on the working switch, along with associated VLAN
definitions, access control lists, and so forth. Essentially, they
needed to merge the functionality from the two switch configs (failing
and working) into a single switch config.
Manual Pain and Suffering
Unfortunately, they had to do this configuration work by hand,
because they don't use an automated configuration management
tool such as NCG.
Moving two dozen port configurations (plus associated VLAN definitions,
access control lists, and so forth) from one switch to another by
hand poses a number of problems:
- The process is slow and error prone; it took them quite a
while (many hours) and several iterations to get it right.
- The process is complicated by inconsistencies
and artifacts from past manual configuration of the devices.
For example, they discovered that some of the nominally-unused
ports on the "working" switch had been grouped into a port-channel
group; they had to take time to understand that, figure out whether
it was still needed or not, and then clean up those ports and associated
virtual interfaces.
- The process is risky. While they were making these changes
on the working switch, they were risking inadvertently disrupting
what was left of their network if they accidentally typo'd a command
or applied something to the wrong port.
- The process is intricate. The changes on the switch
necessitate other changes beyond the switch. Even once they had
the switch reconfigured, for example, they still needed to update
their monitoring systems to monitor all the newly-activated ports
on the switch. Since updating the monitoring systems is also a
manual process, it too is slow, error-prone, and complicated.
Automated Nirvana
If they had been using an automated configuration management tool
such as NCG,
they could have been back in service much sooner (probably in less
than an hour), with a much higher degree of confidence in the new
config for the remaining switch.
A hypothetical automated configuration management system for their
network would probably have the following characteristics:
- A data file for each switch, describing the switch and listing its ports.
Each port would probably be described by a single line in this file,
containing the following information about the port:
- name -- e.g., "GigabitEthernet0/3/1".
- class -- What is this port used for? For example, is it
an inter-switch trunk carrying all VLANs? An access port on a
particular VLAN? An unused port?
- description -- a human-meaningful word or phrase describing the port,
for use in interface labels, usage graphs, and so forth.
- A set of master config templates for the switches. Since the two
core switches are similar in make/model and in function, the same master
config template would likely be used for both, thus ensuring consistency
between the two switches.
- A set of sub-templates for particular classes of ports on the
switches; for instance, given the classes described above, you would have
sub-templates for classes "trunk", "access", and "unused". In addition
to making the appropriate settings for a particular class of port, these
sub-templates would also make any necessary additions to related things
such as access control lists.
- A set of templates for configuring the monitoring system (or systems),
such as MRTG, NAGIOS, or similar. These would
be used to generate monitoring configs that completely and correctly
correspond to the switch configs.
- An automated mechanism for getting configs onto the switches, such as
RANCID or ZipTie.
- A revision control mechanism such as RCS, CVS, Subversion or Git,
to provide a history of the templates and data files that are
inputs to the config generation process, as well as of the generated
and installed configs.
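To make the data-file and sub-template ideas above a bit more concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the `name | class | description` line format, the function names, the single `access_vlan` parameter (a real system would record the VLAN per port), and the IOS-style commands in the templates. It is not NCG's actual file format or template language.

```python
from string import Template

# Hypothetical port list: one line per port, "name | class | description".
PORT_LIST = """\
# core-sw1 ports
GigabitEthernet0/3/1 | trunk  | uplink to core-sw2
GigabitEthernet0/3/2 | access | web01 eth0
GigabitEthernet0/3/3 | unused | spare
"""

# Hypothetical sub-templates, one per port class (IOS-style syntax assumed).
SUB_TEMPLATES = {
    "trunk": Template(
        "interface $name\n description $description\n switchport mode trunk\n"
    ),
    "access": Template(
        "interface $name\n description $description\n"
        " switchport mode access\n switchport access vlan $vlan\n"
    ),
    "unused": Template(
        "interface $name\n description $description\n shutdown\n"
    ),
}


def parse_ports(text):
    """Turn the port list into a list of dicts, skipping blanks and comments."""
    ports = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, port_class, description = (f.strip() for f in line.split("|"))
        ports.append(
            {"name": name, "class": port_class, "description": description}
        )
    return ports


def render_switch_config(ports, access_vlan=10):
    """Render every port through the sub-template for its class."""
    sections = []
    for port in ports:
        template = SUB_TEMPLATES[port["class"]]  # KeyError flags an unknown class
        sections.append(template.substitute(port, vlan=access_vlan))
    return "!\n".join(sections)


def render_monitoring_targets(ports):
    """List the ports the monitoring system should watch (every port in use)."""
    return [p["name"] for p in ports if p["class"] != "unused"]
```

Because the switch config and the monitoring target list are both derived from the same parsed port list, a change to one line of the data file shows up in both outputs, which is the consistency property described above.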
Here are the steps they could have followed instead of doing everything
by hand, had they been using such an automated system:
- Review the switch port lists to simply count the number of
ports used on the failing switch and the number of ports available
on the remaining switch, to quickly determine that there were enough
open ports available on the remaining switch to accommodate everything.
- Edit the port list for the remaining switch, cutting and
pasting the lines from the list for the failing switch, and making minor
adjustments as necessary (in particular, to port names, since
it's unlikely that the open ports on the remaining switch exactly
correspond to the used ports on the failing switch).
- Generate the new config file for the remaining switch, as well
as all dependent config files (e.g., for the monitoring systems).
- Inspect the newly-generated config files for reasonability, likely
by comparing them to the previously-generated config files from before
this change.
- Install the newly-generated config files on the relevant systems,
using tools such as RANCID or ZipTie.
- Check all the updates into the revision control system (RCS, CVS,
Subversion, Git, or whatever) so that there's a record of changes
and a fallback position.
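The first few steps above (counting ports, re-homing them, and reviewing the result) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not how NCG or any particular tool works: ports are represented as hypothetical dicts with `name`, `class`, and `description` keys, and the sample port names and function names are invented. In practice a person would still decide which ports really need to move (for example, the dead inter-switch link would not).

```python
import difflib

# Hypothetical port lists for the two core switches.
FAILING = [
    {"name": "Gi0/1", "class": "access", "description": "db01 eth0"},
    {"name": "Gi0/2", "class": "unused", "description": "spare"},
]
REMAINING = [
    {"name": "Gi0/1", "class": "access", "description": "web01 eth0"},
    {"name": "Gi0/2", "class": "unused", "description": "spare"},
]


def count_by_use(ports):
    """Return (used, open) port counts for one switch."""
    used = sum(1 for p in ports if p["class"] != "unused")
    return used, len(ports) - used


def rehome(failing_ports, remaining_ports):
    """Build a new port list for the remaining switch.

    Each in-use port from the failing switch takes over the next open port
    on the remaining switch, keeping its class and description.
    """
    to_move = [p for p in failing_ports if p["class"] != "unused"]
    open_names = [p["name"] for p in remaining_ports if p["class"] == "unused"]
    if len(to_move) > len(open_names):
        raise ValueError("not enough open ports on the remaining switch")
    assignment = dict(zip(open_names, to_move))
    merged = []
    for port in remaining_ports:
        moved = assignment.get(port["name"])
        if moved is None:
            merged.append(dict(port))  # untouched port, copied as-is
        else:
            merged.append({
                "name": port["name"],  # the port name stays local
                "class": moved["class"],
                "description": moved["description"],
            })
    return merged


def config_diff(previous, generated):
    """Unified diff of two config texts, for a quick sanity check."""
    return "".join(difflib.unified_diff(
        previous.splitlines(keepends=True),
        generated.splitlines(keepends=True),
        fromfile="previous",
        tofile="generated",
    ))
```

Calling `rehome(FAILING, REMAINING)` yields the edited port list for the remaining switch. Feeding the old and new generated configs to `config_diff` gives the side-by-side check described in the inspection step, before anything is installed or committed to revision control.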
Comparison of manual and automated results
Using network automation tools such as NCG, RANCID, and ZipTie:
- The incident could have been resolved in less than an hour, rather than
turning into a day-long outage while it was resolved by hand.
- You could be much more confident that the resulting configs were complete,
consistent, and correct.
- All related configurations (i.e., the switch configurations and the monitoring system configurations) could be updated together, maintaining consistency
between them.
In my experience, it only takes a week or two of work to use
open source tools to assemble a network automation system for an
existing network such as this (i.e., a handful of related switches
and associated monitoring systems, for all of which you already have
working, manually created configs).
Hopefully, my friend's company will see the light, and automate their network
management so that they're better prepared for next time; maybe they'll even
offer me a consulting contract to help them get there... ;-)
Please contact us to
discuss how Netomata can help you avoid problems like this with your
network, while making your network more cost-effective, reliable, and
flexible.