Perils of treating network management as a second-class service

Too many organizations treat network management as a "nice-to-have" part of their operational toolkit, rather than a "must-have" capability. You can usually get away with this for a while, but eventually your luck runs out...

Last week, I related an all-too-typical tale of woe about how a startup suffered an all-day customer-visible outage because of a network problem, explaining how network automation could have shortened the outage from hours to minutes. Well, it turns out that lack of network automation wasn't their only problem...

As it happened, they had no network management capability at the time of the outage. Their sole network management host had suffered a disk failure several days earlier, and they hadn't gotten around to restoring it because it was "just the network management system".

Unfortunately for them and their customers, the failed system that was "just the network management system" would have:

  • enabled them to detect the failing Ethernet switch (the root cause of the outage) much sooner, perhaps even before it failed completely, because that host was where they ran their network status and performance monitoring tools, such as Nagios and MRTG.
  • helped them diagnose the switch failure much more quickly, once the outage began, by referring to those same network status and performance monitoring tools.
  • quickly and efficiently paged everybody on the operations team when the outage began, because their paging system was part of the status monitoring tool. Instead, they had to divert somebody (who could otherwise have been working on resolving the problem) to alert everybody by phone.
  • helped them quickly swap out the failed switch with a replacement, because the failed switch's last-saved configuration was backed up on the network management system.

In retrospect, I'm sure they wish that they had engineered "just the network management system" with the same level of service reliability as their customer-visible "production" systems. I'm sure they wish that they had treated the failure of "just the network management system" with the same sort of urgency as they would a failure of one of their customer-visible "production" systems.

Once the network management system failed, they were living on borrowed time. When something else failed (in this case, the Ethernet switch), they were severely hampered in their ability to detect and deal with that failure, which resulted in an extended customer-visible outage. Even though the network management system isn't itself customer-visible, it is an essential part of providing a reliable service, and needs to be treated as such.

Netomata can help you avoid problems like this with your network, while making your network more cost-effective, reliable, and flexible; please contact us to discuss how.