Brent Chapman's blog

Netomata releases web-based Config Review Tool

6:41pm 20 Aug 2010

Once you've used a tool like the Netomata Config Generator (NCG) to generate configs for a bunch of devices on your network, how do you convince yourself that those new configs are complete and correct and ready to deploy? How do you determine that the newly-generated configs differ from the old configs in only the ways that you want, and that you haven't inadvertently introduced unintended changes?

Wouldn't it be great if you could, say, compare the newly-generated configs to the original (hand-created) configs for those devices, or to the previous generated configs? And how cool would it be if there was some sort of "approval" mechanism wrapped around this, so that you could easily identify the files that had been reviewed and approved as good-to-go for installation?

We've got a tool for you!

We've just released the Netomata Config Review Tool, which addresses these issues. It is a simple web-based tool for reviewing NCG-generated config files and approving them for installation on devices. It is written in Ruby as a web CGI program; it should work fine on any web server that supports CGI programs, such as Apache. We're releasing it as open source under a GPLv3 license (the same as NCG).

This tool is an outgrowth of a recent consulting project that we did for Netflix, helping them install NCG and set it up to generate configs for the routers at their dozens of shipping hubs throughout the USA. We'd love to do a project like this for your organization, too!

How it works

For each device, the tool keeps track of 3 config files (if they exist):

  • Original: the config that the device was originally running (which was presumably created by hand)
  • Generated: the most recent config generated by NCG
  • Approved: the most recent generated config that has been "approved" via this process

For each device, this tool lets you:

  • View the Original, Generated, and (if it exists) Approved config
  • See diffs between pairs of configs:
    • Original => Generated
    • Generated => Approved
    • Original => Approved
  • Approve a Generated config (i.e., make it the Approved config for the device)
  • Unapprove a currently approved config (i.e., delete the Approved config for the device)

The tool does not (yet) install approved configs on devices; the assumption is that you will use a tool such as RANCID to do that, from the files in the "approved" directory.
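To give a feel for the kind of comparison involved, here is a minimal sketch of an "Original => Generated" diff using Python's standard difflib module. It is not the tool's actual code (the tool itself is written in Ruby); the function name and the sample config lines are invented for illustration.

```python
import difflib


def config_diff(old_text, new_text, old_label="original", new_label="generated"):
    """Return a unified diff between two config files as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Hypothetical configs, invented for this example.
original = "hostname sw1\ninterface Gi0/1\n description uplink\n"
generated = "hostname sw1\ninterface Gi0/1\n description uplink to core\n"

for line in config_diff(original, generated):
    print(line)
```

An empty diff between the Generated and Approved configs would mean that the approved file is still current.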

How to get it

You can read all about it, see screen shots, and download the code at http://www.netomata.com/wiki/config_review_tool

Presenting a free webinar about Network Automation, Wed 23 June 2010

1:24pm 11 Jun 2010

On Wed 23 June 2010, I'll be presenting a 30-minute overview of network automation benefits and tools as part of a free online webinar produced by SearchNetworking, entitled Optimizing and Managing the Dynamic Enterprise Network:

Today, as more applications and IT functions converge on the network infrastructure, user expectations are higher than ever. The advent of cloud computing and virtualization demands a solid yet flexible network that can instantly adjust to changing conditions. Unfortunately, many IT departments today find themselves facing this technology challenge with lean networking teams and low budgets. That makes choosing the right network management and optimization tools critical.

In this free, one-day virtual seminar our experts will cover how to rethink your management strategy and implement techniques that allow networking teams to understand performance, make the most of the infrastructure, and offload low-level tasks so that they can focus on improving performance and making progress.

Attend and gain insight on how to:

  • Manage your network in the age of the dynamic network
  • Ensure application performance on the WAN
  • Use network automation to make your network more cost-effective, reliable, and flexible

And much more!

My part of the webinar is scheduled to start at 10:30am PDT (1:30pm EDT). After the webinar, I'll be online to answer questions from the audience.


Seats still available at *free* DevOps Day conference, Fri 25 June 2010 in Mountain View

10:14am 11 Jun 2010

A group of folks who are active in the emerging "devops" field are putting together DevOps Day, a free one-day conference on Friday 25 June 2010, in Mountain View, CA, hosted by LinkedIn:

DevOps Day is an open event for discussing all topics around improving the interaction between what is traditionally considered development activity and that which is traditionally considered operations activity.

...

DevOps Day US is a single-track conference organized around a series of panels where open discussion amongst all conference participants is encouraged.

This is a one-day "hmm, we're all facing similar issues; let's get together and talk about this" event being put together by practitioners, not a "conference" being sponsored by folks who are trying to sell you something. I expect it to be more like an extended user group meeting than anything else, and I'm looking forward to some very interesting discussions.

Planned discussions include:

  • Your mileage may vary: Experiences and lessons learned facing DevOps problems in the IT trenches (even if they weren’t calling it DevOps!). The good, the bad, the surprises, and ideas for the future.
  • Infrastructure as code: Automation is essential to DevOps. The infrastructure as code concept drives many of today’s cutting edge automation techniques. What is it all about? Where are its limitations?
  • Changing culture to enable DevOps: Changing tools is easy when compared to changing people and processes. How can we cultivate an organization’s culture to identify and solve DevOps problems?
  • Does the Cloud need DevOps? Does DevOps need the Cloud?: Examining the role that cloud technologies can play in solving DevOps problems and the role that DevOps solutions can play in getting the most out of cloud technologies.
  • DevOps requires visibility: monitoring, testing, and performance: Examining the (often overlooked) role of monitoring and testing techniques in solving DevOps problems.
  • DevOps outside of Web Operations: Much of the public discussion about DevOps focuses on Web Operations. This panel is about taking the lessons of DevOps to other types of IT.
  • Making the business case: We know that solving DevOps problems improves your business operations and improves the bottom line, but how do you explain that to your CEO or CFO? How do you get the executives to buy in and invest in DevOps solutions?

Expected participants include Luke Kanies (creator of Puppet) and Adam Jacob (creator of Chef), as well as practitioners from organizations such as LinkedIn, Shopzilla, Etsy, Cisco, ITA Software, and Tripwire.

DevOpsDays 2010 US

All in all, it's a very interesting topic, and this looks like it will be a fascinating event. I'll be there, and I hope to see you there too!

Speaking about Automating Network Configuration at NANOG in SF, Sun 13 Jun 2010

5:24pm 2 Jun 2010

This quarter's NANOG meeting is in San Francisco, and I'll be presenting a 90-minute tutorial on Automating Network Configuration:

You've been using tools like Puppet and cfengine to corral the complexity on your servers. You revel in the scalability, reliability, and ease of maintenance of doing it The Right Way. You don't fear the next change because you know the tools will just get it Right. But you still tremble at an 'enable' prompt, hoping you remembered all the bits that need to be twiddled, on all the networking devices everywhere. Is your DNS tied on straight - both ways? Is it all *really* being monitored by Nagios? As your network's complexity increases, so do the errors, inconsistencies, and omissions caused by manual configuration, and brokenness abounds. But wait - there's a way out of the swamp! Come hear Brent Chapman as he reveals methods and tools for automating the mind-numbing task of configuring network devices and services. Among other things, he'll talk about his cool new open source Netomata Config Generator, which addresses some of these problems.

Brent Chapman is the founder, CEO, and technical lead of Netomata, Inc. He is the coauthor of the highly regarded O'Reilly & Associates book Building Internet Firewalls. He is also the founder of the Firewalls, List-Managers, and Network-Automation Internet mailing lists, and the creator of the Majordomo mailing list management package. In 2004, Brent was honored with the annual SAGE Outstanding Achievement Award 'for outstanding sustained contributions to the community of system administrators'. He has been a frequent and popular speaker at USENIX, LISA, BayLISA, and many other events over the past 15 years.

I expect to be there for the full NANOG meeting, from Sun 13 Jun 2010 through Wed 16 Jun 2010; if you're there, too, I hope you'll come to my talk, or at least catch me and say hello.

And if you haven't registered for NANOG yet, it's not too late... As the NANOG web site says:

NANOG49 will feature presentations on networking advancements and techniques, educational tutorials, interesting tracks, and more. Whether you are new to the networking profession or a seasoned veteran, NANOG49 will educate and inform with a full agenda of interesting topics.

I highly recommend it, and I hope to see you there!

O'Reilly offering 25% Memorial Day discount for Velocity conference

6:28pm 27 May 2010

The O'Reilly Velocity conference is only in its third year, but it has rapidly become one of my favorite events. If you do web operations or architecture, I'd say it's a "must do" conference; the amount of info you'll pick up in 2 short days (3 if you attend the workshops) is amazing.

Even better, O'Reilly has just announced a special 25% discount on registration, good from now through Memorial Day weekend (until Tue 1 Jun 2010); just use the discount code "MEMORIALDAY" when you register.

I hope to see you there!

ZipTie versus RANCID

8:59am 26 Mar 2010

Someone recently asked me to share my thoughts on ZipTie (now officially known as "AlterPoint NetworkAuthority Inventory" or "AlterPoint NAI") versus RANCID as network configuration management tools.

To begin with, what are these tools?

RANCID is a command line tool which handles configuration communications with various types of networking devices (most major brands of routers, switches, load balancers, firewalls, etc.). You can use it to copy config files to and from devices, or to execute a series of commands on the device. Essentially, RANCID pretends to be a human user of the device's command line interface, and you give RANCID a simple "script" to follow in dealing with the device (i.e., "when you see the 'login:' prompt, send 'admin'; then, when you see the 'password:' prompt, send 'opensesame'; then, when you see the 'alibabascave>' prompt, send 'enable'; then ..."). RANCID is sometimes used by itself, but more often used as a building block in larger, custom-built automated network management systems; people use it in conjunction with tools to manage an archive of config files (such as CVSweb), or in conjunction with tools to programmatically generate config files (such as our own Netomata Config Generator (NCG) tool), or in a wide variety of other ways.
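The prompt-and-response idea is easy to sketch. The following is not RANCID's own script format or code; it is a small Python illustration of the "when you see X, send Y" pattern, with a fake device session standing in for a real one. The names SCRIPT and run_script are made up for this example.

```python
import re

# Hypothetical prompt/response script, mirroring the example in the text.
SCRIPT = [
    (r"login:", "admin"),
    (r"password:", "opensesame"),
    (r"alibabascave>", "enable"),
]


def run_script(script, read_output, send_line):
    """Walk the script: wait for each prompt, then send the matching response."""
    for prompt, response in script:
        output = read_output()
        if not re.search(prompt, output):
            raise RuntimeError(f"expected {prompt!r}, got {output!r}")
        send_line(response)


# A stand-in for a real device session, for demonstration only.
device_prompts = iter(["login:", "password:", "alibabascave>"])
sent = []
run_script(SCRIPT, lambda: next(device_prompts), sent.append)
print(sent)
```

In real use, read_output and send_line would wrap an SSH or telnet session to the device rather than a canned list of prompts.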

ZipTie, on the other hand, has a slick web-based user interface, and is designed to be a complete "environment" for managing the devices on your network. According to its web page:

NetworkAuthority Inventory provides continuous discovery and tracking of your network devices. Using a simple, web-based interface you can backup and restore device configurations, detect configuration changes and compare configurations between devices. NetworkAuthority Inventory generates an accurate, real-time, detailed view of every device in your network and keeps it up to date.

So, what are the key differences between RANCID and ZipTie?

  • As already discussed, RANCID is a command line based tool that can also be used from shell scripts and other programs, while ZipTie is a web-based tool that is designed for interactive use (there are ways to drive ZipTie programmatically, but that's not its main purpose).
  • ZipTie includes a "discovery" mode, to find the manageable devices on your network; with RANCID, you have to tell it what you want it to manage.
  • Both ZipTie and RANCID will move configs to and from network devices. ZipTie gives you a web interface to do that, while RANCID is command line driven. Which of those is "better" depends on your situation, and your team's skills and preferences.
  • ZipTie has lots of different reports and graphs and such; RANCID has none of that.
  • ZipTie is largely self-contained; it probably already does most of what you might want, and there are various extensions (some provided by AlterPoint, and some by the community) to make it do even more, but integrating it with other tools might be more challenging. RANCID, on the other hand, does very little (just moves configs on and off devices, really, although you can also use it to run scripted commands on those devices) by itself, but is easier to integrate with other systems that you're building yourself.
  • ZipTie has a cool "compare config" tool, that shows you how two config files (from different devices, or from different times on the same device) differ. With RANCID, you have to extract the right versions of the right files from CVS and then compare them yourself with "diff".
  • RANCID is some pretty ugly Perl code; it's hack piled upon hack atop other hack, haphazardly and occasionally supported by its user community, most of whom are excellent network engineers but only so-so programmers. ZipTie, on the other hand, is developed and supported by professional programmers at a "real" company, which uses it as the core of their money-making product, so they have a strong incentive to maintain and improve it. The flip side of that is the whole "open source versus commercial" debate; RANCID is open source, and ZipTie is commercial, although the basic package (which might be enough to meet your needs) is free.

So, essentially, I suggest the following approach to comparing these two tools for your situation:

  • Try ZipTie, to see if it does what you need, since it's already got so much functionality built-in (discovery, graphs, reports, config comparisons, etc.)
  • If ZipTie and its various add-ons don't do what you need, and you feel that you need to build your own solution, then building it on top of RANCID probably makes sense.

So you think you know traceroute...

11:43am 13 Nov 2009

Most network engineers and sysadmins would probably say that they're intimately familiar with 'traceroute', and consider it one of their fundamental network troubleshooting tools... I certainly do. But you might be amazed to learn, as I did, how much you don't know about traceroute.

Richard Steenbergen of nLayer Communications, Inc., did an excellent presentation on traceroute at this month's NANOG (North American Network Operators Group) meeting.

Among other things, this presentation shows you:

  • How traceroute works
  • What you can learn from the DNS hostnames returned by traceroute
    • Where the ISP/carrier boundaries are
    • Where the equipment is located, geographically (do you know what a CLLI code is?)
    • What type of equipment the ISP/carrier is using
  • What the round trip times reported by traceroute really mean
  • How you can be led astray by ICMP prioritization, rate limiting, asymmetric paths, and load balancing

One of the coolest tricks I learned from this presentation: to find out more about what's at the other end of a hop that appears to be a point-to-point link, assume that the IP address you see is one of the two addresses in a /30 subnet (as is commonly assigned to point-to-point links), and do a DNS reverse lookup of the other address in the /30.

This is useful, for example, in figuring out which egress port a packet went out on, since traceroute normally only shows you the ingress ports for each device along the way. For example, let's say I was looking at the following traceroute output, and wanted to know the egress port on router #3, as the packet moved to router #4:

brent% traceroute www.google.com
traceroute: Warning: www.google.com has multiple addresses; using 208.67.219.230
traceroute to google.navigation.opendns.com (208.67.219.230), 64 hops max, 40 byte packets
 1  192.168.0.1 (192.168.0.1)  3.145 ms  2.573 ms  2.382 ms
 2  75-101-29-1.dsl.static.sonic.net (75.101.29.1)  9.555 ms  9.054 ms  9.089 ms
 3  127.at-X-X-X.gw3.200p-sf.sonic.net (208.106.96.193)  9.510 ms  9.871 ms  9.194 ms
 4  200.ge-0-1-0.gw.equinix-sj.sonic.net (64.142.0.210)  11.965 ms  11.870 ms  11.839 ms
 5  0.as0.gw2.equinix-sj.sonic.net (64.142.0.150)  11.928 ms  12.519 ms  12.394 ms
 6  GigabitEthernet3-1.GW2.SJC7.ALTER.NET (157.130.194.17)  11.360 ms  16.257 ms  11.268 ms
 7  0.so-0-0-1.XL4.SJC7.ALTER.NET (152.63.51.50)  11.729 ms  11.679 ms  11.403 ms
 8  0.so-7-0-0.XL2.PAO1.ALTER.NET (152.63.113.21)  14.775 ms  17.455 ms 0.so-5-0-0.XL2.PAO1.ALTER.NET (152.63.48.9)  15.548 ms
 9  POS7-0.GW6.PAO1.ALTER.NET (152.63.55.14)  12.886 ms  13.143 ms  13.029 ms
10  65.203.37.46 (65.203.37.46)  13.517 ms  14.708 ms  16.566 ms
11  * * *
12  * * *
^C

To find out more about router #3's egress port, I look at the IP address for router #4 (64.142.0.210) and figure out the other IP address in the same /30 (64.142.0.209). Hint: the lower usable address in a /30 pair always ends in an odd number, and the higher always ends in an even number, so if the address you know is odd, the other address is the next higher number, and if it's even, the other is the next lower number. Then I do a DNS reverse lookup of that address:

brent% dig -x 64.142.0.209

; <<>> DiG 9.4.3-P3 <<>> -x 64.142.0.209
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49382
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;209.0.142.64.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
209.0.142.64.in-addr.arpa. 259200 IN	PTR	200.ge-6-3-0.gw3.200p-sf.sonic.net.

;; Query time: 31 msec
;; SERVER: 208.67.222.222#53(208.67.222.222)
;; WHEN: Fri Nov 13 09:42:05 2009
;; MSG SIZE  rcvd: 91
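If you'd rather not do the /30 arithmetic in your head, Python's standard ipaddress module can do it for you. This is a minimal sketch; the function names are my own, and it assumes the link really is numbered out of a /30 (some point-to-point links use other prefix lengths, in which case the answer will be wrong).

```python
import ipaddress


def p2p_peer(address):
    """Return the other usable host address in the /30 containing `address`."""
    network = ipaddress.ip_network(f"{address}/30", strict=False)
    hosts = list(network.hosts())  # the two usable addresses in a /30
    me = ipaddress.ip_address(address)
    if me not in hosts:
        raise ValueError(f"{address} is a network or broadcast address in {network}")
    return str(hosts[0] if me == hosts[1] else hosts[1])


def reverse_name(address):
    """Return the in-addr.arpa name that a PTR lookup (dig -x) would query."""
    return ipaddress.ip_address(address).reverse_pointer


peer = p2p_peer("64.142.0.210")
print(peer, reverse_name(peer))
```

The printed name is the one that appears in the QUESTION SECTION of the dig output above.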

Another handy tip from the presentation is that, since light travels through fiber optic cable at about 200 km (or 125 miles, if you prefer) per millisecond, each 1 ms of delay shown by traceroute (which, remember, is round trip delay) should represent about 100 km (62.5 mi) of distance if the delay were due entirely to the distance travelled (i.e., no queuing or processing delays). Using that fact, you can see that 40ms for a packet to go from San Francisco to New York (about 2500 miles, or 4000km) would be "normal", but 40ms for a packet to go from San Francisco to San Jose (about 50 miles, or 80km) would indicate a problem; it should take the packet less than 1ms to cover that distance and back, so something else (congestion or processing delays, for example) must account for the other 39ms.
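The same rule of thumb as a few lines of Python, in case you want to sanity-check a traceroute hop. It's only a sketch: the 200 km/ms figure is the approximation from the presentation, the distances are the rough ones quoted above, and the function names are mine.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in fiber, from the text


def min_rtt_ms(distance_km):
    """Lowest plausible round-trip time for a one-way fiber distance."""
    return 2 * distance_km / FIBER_KM_PER_MS


def unexplained_ms(observed_rtt_ms, distance_km):
    """Portion of an observed RTT that distance alone cannot account for."""
    return max(0.0, observed_rtt_ms - min_rtt_ms(distance_km))


print(min_rtt_ms(4000))        # San Francisco to New York, about 4000 km
print(unexplained_ms(40, 80))  # San Francisco to San Jose, about 80 km
```

Keep in mind that real fiber paths are rarely straight lines, so the true minimum is usually somewhat higher than this estimate.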

There's a lot more in this presentation, about more complex issues such as

  • how the way in which routers handle traceroute packets can produce biased results (most routers handle traceroute packets much more slowly than they handle "real" data packets, which can make things look much worse than they are)
  • how asymmetric paths can lead you astray (traceroute only shows you the path to a system, but if you're pulling lots of bytes from the system, as would typically be the case with a remote server, you probably care more about the path back from the system, which might be totally different)
  • how using MPLS, which is increasingly common in carrier networks, can lead to very confusing round-trip times in traceroute

Anyway, if you ever use traceroute, I highly recommend that you review this excellent presentation. I think you'll be pleasantly surprised at how much you learn.

Thanks to Strata Chalup of Virtual.net for bringing this very informative presentation to my attention.

Quadruple whammy for IT as a startup grows

1:38pm 27 Oct 2009

At some point during their growth, usually around the 50-100 employee stage, most startups face a "quadruple whammy" of IT infrastructure challenges. If the startup doesn't recognize that this is happening (or, better yet, anticipate and prevent it from happening), IT can quickly become a major drag on the startup's continued growth.

Early on, a startup's IT needs are generally handled internally on an ad hoc basis by a de facto IT team of various personnel, acting in addition to their primary responsibilities as engineers, managers, and so forth. This works fine for a while, often for several years. At some point, though, as the startup continues to grow, several factors all come together:

  • The IT workload is increasing, as the number of employees, offices, and customers all increase simultaneously.
  • Expectations for the company's IT infrastructure are also increasing, faster even than the numerical growth of the company might suggest. As the company grows, everyone (new hires, long-time employees, management, customers, investors, regulators, etc.) expects more of the company's IT infrastructure, and becomes less tolerant of deficiencies that were accepted earlier on.
  • The de facto IT team are all getting busier with their "real" jobs, leaving less time to "help out" with IT, at exactly the time when IT problems are becoming more complex because of the company's growth and rising expectations.
  • New hires are less able to fulfill their own IT needs, both because of changes in the nature of hiring over time (i.e., early hires in a startup tend to be more self-sufficient, while later hires have a higher expectation of what is already in place), and because of the ever-growing complexity of the company's IT infrastructure.

As a result of these factors, bad things start happening:

  • Routine IT infrastructure requests (moves, adds, changes, and troubleshooting) are an increasing burden on the de facto IT infrastructure team, all of whom have other primary responsibilities, to the point where the IT work is beginning to interfere with those other responsibilities.
  • Despite the best efforts of the de facto IT infrastructure team, the IT infrastructure isn’t living up to expectations throughout the company, and is beginning to become an obstacle.
  • Many IT infrastructure decisions are being made in an expedient and ad hoc fashion, without adequate contemplation of future needs, growth path, maintainability, and so forth, due to lack of a coherent IT infrastructure architecture and road map.

Essentially, at this point, the startup needs to put in place the framework of IT architectures, systems, processes, and people that will enable its IT infrastructure to facilitate the company's growth, rather than impede that growth.

Netomata's staff have helped many startups through this transition; if this situation sounds all too familiar to you, contact us, and we can help you too!

Perils of treating network management as a second-class service

3:07pm 5 Oct 2009

Too many organizations treat network management as a "nice to have" part of their operational toolkit, rather than a "must-have" capability. You can usually get away with this for a while, but eventually your luck runs out...

Last week, I related an all-too-typical tale of woe about how a startup suffered an all-day customer-visible outage because of a network problem, explaining how network automation could have shortened the outage from hours to minutes. Well, it turns out that lack of network automation wasn't their only problem...

As it happened, at the time of the outage, they didn't have any network management capability, because their sole network management host had suffered a disk failure several days before and they hadn't gotten around to restoring the host yet because it was "just the network management system".

Unfortunately for them and their customers, the failed system that was "just the network management system" would have:

  • enabled them to detect the failing ethernet switch (which was the root cause of the outage) much sooner, perhaps even before the switch totally failed, because that was where they were running their network status and performance monitoring tools such as Nagios and MRTG.
  • helped them diagnose the switch failure much more quickly, once the outage began, by referring to those same network status and performance monitoring tools.
  • quickly and efficiently paged everybody on the operations team when the outage began, instead of diverting somebody (who could otherwise have been working on resolving the problem) to alert everybody by phone, because their paging system was part of the status monitoring tool.
  • helped them quickly swap out the failed switch with a replacement, because the failed switch's last-saved configuration was backed up on the network management system.

In retrospect, I'm sure they wish that they had engineered "just the network management system" with the same level of service reliability as their customer-visible "production" systems. I'm sure they wish that they had treated the failure of "just the network management system" with the same sort of urgency as they would a failure of one of their customer-visible "production" systems.

Once the network management system failed, they were living on borrowed time. When something else failed (i.e., the ethernet switch), they were severely hampered in their ability to detect and deal with that failure, which resulted in an extended customer-visible outage. Even though the network management system isn't itself customer-visible, it is an essential part of providing a reliable service, and needs to be treated as such.

Netomata can help you avoid problems like this with your network, while making your network more cost-effective, reliable, and flexible; please contact us to discuss how.

How network automation could have shortened an all-day customer-visible outage

4:02pm 29 Sep 2009

A friend of mine recently related a tale of woe about network problems at his startup, a cloud service provider. Unfortunately, because they lacked a network automation system, they suffered a day-long customer-visible service outage; if they'd had an appropriate network automation system, they could have dealt with the problem in less than an hour.

It all started with a failing Ethernet switch, one of the pair of core switches in their data center installation. The failing switch would simply drop its 10Gb Ethernet connection to the other core switch, with no warning and no explanation. They tried the obvious quick fixes (try a different port on the failing switch, try a different cable between the switches, etc.), with no success; no matter what they tried, they couldn't resurrect the connection to the other core switch.

For various reasons, a drop-in replacement switch wasn't immediately available. After a physical inspection, counting open and used ports on both switches, they determined that they had just enough open ports on the working switch to allow them to re-home all the connections from the failing switch. "All" they needed to do was configure those ports on the working switch, along with associated VLAN definitions, access control lists, and so forth. Essentially, they needed to merge the functionality from the two switch configs (failing and working) into a single switch config.

Manual Pain and Suffering

Unfortunately, they had to do this configuration work by hand, because they don't use an automated configuration management tool such as NCG. Moving two dozen port configurations (plus associated VLAN definitions, access control lists, and so forth) from one switch to another by hand poses a number of problems:

  • The process is slow and error prone; it took them quite a while (many hours) and several iterations to get it right.
  • The process is complicated by inconsistencies and artifacts from past manual configuration of the devices. For example, they discovered that some of the nominally-unused ports on the "working" switch had been grouped into a port-channel group; they had to take time to understand that, figure out whether it was still needed or not, and then clean up those ports and associated virtual interfaces.
  • The process is risky. While they were making these changes on the working switch, they were risking inadvertently disrupting what was left of their network if they accidentally typo'd a command or applied something to the wrong port.
  • The process is intricate. The changes on the switch necessitate other changes beyond the switch. Even once they had the switch reconfigured, for example, they still needed to update their monitoring systems to monitor all the newly-activated ports on the switch. Since updating the monitoring systems is also a manual process, it too is slow, error-prone, and complicated.

Automated Nirvana

If they had been using an automated configuration management tool such as NCG, they could have been back in service much sooner (probably in less than an hour), with a much higher degree of confidence in the new config for the remaining switch.

A hypothetical automated configuration management system for their network would probably have the following characteristics:

  • A data file for each switch, describing the switch and listing its ports. Each port would probably be described by a single line in this file, containing the following information about the port:
    • name -- e.g., "GigabitEthernet0/3/1".
    • class -- What is this port used for? E.g., is it an inter-switch trunk carrying all VLANs? An access port on a particular VLAN? An unused port?
    • description -- a human-meaningful word or phrase describing the port, for use in interface labels, usage graphs, and so forth.
  • A set of master config templates for the switches. Since the two core switches are similar in make/model and in function, the same master config template would likely be used for both, thus ensuring consistency between the two switches.
  • A set of sub-templates for particular classes of ports on the switches; for instance, given the classes described above, you would have sub-templates for classes "trunk", "access", and "unused". In addition to making the appropriate settings for a particular class of port, these sub-templates would also make any necessary additions to related things such as access control lists.
  • A set of templates for configuring the monitoring system (or systems) such as MRTG, Nagios, or similar. These would be used to generate monitoring configs that completely and correctly correspond to the switch configs.
  • An automated mechanism for getting configs onto the switches, such as RANCID or ZipTie.
  • A revision control mechanism such as RCS, CVS, Subversion or Git, to provide a history of the templates and data files that are inputs to the config generation process, as well as of the generated and installed configs.
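To make the data-file-plus-templates idea concrete, here is a small Python sketch. It is not NCG's actual file format or template language; the "name|class|description" layout, the template contents, and the function names are all assumptions made up for this illustration.

```python
from string import Template

# Hypothetical port list: one line per port, "name|class|description".
PORT_LIST = """\
GigabitEthernet0/3/1|trunk|link to core-2
GigabitEthernet0/3/2|access|web-01
GigabitEthernet0/3/3|unused|
"""

# Hypothetical sub-templates, one per port class.
PORT_TEMPLATES = {
    "trunk": Template("interface $name\n description $description\n switchport mode trunk\n"),
    "access": Template("interface $name\n description $description\n switchport mode access\n"),
    "unused": Template("interface $name\n shutdown\n"),
}


def parse_ports(text):
    """Parse the port list into dicts with name, class, and description."""
    ports = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, port_class, description = line.split("|")
        ports.append({"name": name, "class": port_class, "description": description})
    return ports


def render_switch_config(ports):
    """Render each port through the sub-template for its class."""
    return "".join(
        PORT_TEMPLATES[port["class"]].substitute(
            name=port["name"], description=port["description"]
        )
        for port in ports
    )


print(render_switch_config(parse_ports(PORT_LIST)))
```

The same parsed port list could feed a second set of templates for the monitoring system, which is what keeps the switch configs and the monitoring configs consistent with each other.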

Here are the steps they could have followed instead of doing everything by hand, had they been using such an automated system:

  1. Review the switch port lists to simply count the number of ports used on the failing switch and the number of ports available on the remaining switch, to quickly determine that there were enough open ports available on the remaining switch to accommodate everything.
  2. Edit the "port" list for the remaining switch, cutting and pasting the lines from the list for the failing switch, and making minor adjustments as necessary (in particular, to port names, since it's unlikely that the open ports on the remaining switch exactly correspond to the used ports on the failing switch).
  3. Generate the new config file for the remaining switch, as well as all dependent config files (i.e., for the monitoring systems).
  4. Inspect the newly-generated config files for reasonability, likely by comparing them to the previously-generated config files from before this change.
  5. Install the newly-generated config files on the relevant systems, using tools such as RANCID or ZipTie.
  6. Check all the updates into the revision control system (RCS, CVS, Subversion, Git, or whatever) so that there's a record of changes and a fallback position.
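Steps 1 and 2 above are mechanical enough to sketch in code. The following Python is only an illustration under my own assumptions: each port is represented as a dict with "name", "class", and "description" keys, unused ports have class "unused", and the function name and sample port names are hypothetical.

```python
def rehome_ports(failing_ports, remaining_ports):
    """Move every in-use port from the failing switch onto unused ports
    of the remaining switch, returning the remaining switch's new port list."""
    in_use = [p for p in failing_ports if p["class"] != "unused"]
    free = [p["name"] for p in remaining_ports if p["class"] == "unused"]
    if len(in_use) > len(free):
        raise ValueError(f"need {len(in_use)} ports, only {len(free)} free")

    # Pair each free port name with the class/description being moved onto it.
    moved = {
        name: {"name": name, "class": p["class"], "description": p["description"]}
        for name, p in zip(free, in_use)
    }
    return [moved.get(p["name"], p) for p in remaining_ports]


# Hypothetical port lists, invented for this example.
failing = [
    {"name": "Gi1/1", "class": "access", "description": "db-01"},
    {"name": "Gi1/2", "class": "unused", "description": ""},
]
remaining = [
    {"name": "Gi2/1", "class": "trunk", "description": "uplink"},
    {"name": "Gi2/2", "class": "unused", "description": ""},
]

new_ports = rehome_ports(failing, remaining)
print(new_ports)
```

The resulting port list would then go through the normal generate, inspect, install, and check-in steps, exactly like any other change.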

Comparison of manual and automated results

Using network automation tools such as NCG, RANCID, and ZipTie:

  • The incident could have been resolved in less than an hour, rather than the outage lasting several hours while the incident was resolved by hand.
  • You could be much more confident that the resulting configs were complete, consistent, and correct.
  • All related configurations (i.e., the switch configurations and the monitoring system configurations) could be updated together, maintaining consistency between them.

In my experience, it only takes a week or two of work to use open source tools to assemble a network automation system for an existing network such as this (i.e., a handful of related switches and associated monitoring systems, all of which you already have working manually-created configs for).

Hopefully, my friend's company will see the light, and automate their network management so that they're better prepared for next time; maybe they'll even offer me a consulting contract to help them get there... ;-)

Please contact us to discuss how Netomata can help you avoid problems like this with your network, while making your network more cost-effective, reliable, and flexible.
