8.1 What is Network Management?
Having made our way through the first seven chapters of this text, we're
now well aware that a network consists of many complex, interacting
pieces of hardware and software - from the links, bridges, routers, hosts
and other devices that comprise the physical components of the network
to the many protocols (in both hardware and software) that control and
coordinate these devices. When hundreds or thousands of such components
are cobbled together by an organization to form a network, it is not surprising
that components will occasionally malfunction, that network elements will
be misconfigured, that network resources will be overutilized, or that
network components will simply "break" (e.g., a cable will be cut, a can
of soda will be spilled on top of a router). The network administrator,
whose job it is to keep the network "up and running," must be able to respond
to (and better yet, avoid) such mishaps. With potentially thousands
of network components spread out over a wide area, the network administrator
in a network operations center (NOC) clearly needs tools to help monitor,
manage, and control the network. In this chapter, we'll examine the
architecture, protocols, and information base used by a network administrator
in this task.
Before diving into network management itself, let's first consider
a few illustrative "real-world" non-networking scenarios in which a complex
system with many interacting components must be monitored, managed, and controlled
by an administrator. Electrical power-generation plants (at least as portrayed
in the popular media, e.g., movies such as The China Syndrome) have a control
room where dials, gauges, and lights monitor the status (temperature,
pressure, flow) of remote valves, pipes, vessels, and other plant components.
These devices allow the operator to monitor the plant's many components,
and may alert the operator (the famous flashing red warning light) when
trouble is imminent. Actions are taken by the plant operator to control
these components. Similarly, an airplane cockpit is instrumented
to allow a pilot to monitor and control the many components that make up
an airplane. In these two examples, the "administrator" monitors
remote devices and analyzes their data to ensure that they are operational
and operating within prescribed limits (e.g., that a core meltdown of a
nuclear power plant is not imminent, or that the plane is not about to
run out of fuel), reactively controls the system by making adjustments
in response to changes within the system or its environment, and proactively
manages the system, e.g., by detecting trends or anomalous behavior
so that action can be taken before serious problems arise. In
a similar sense, the network administrator will actively monitor,
manage and control the system with which s/he is entrusted.
In the early days of networking, when computer networks were research
artifacts rather than a critical infrastructure used by millions of people
a day, "network management" was unheard of. If one encountered
a network problem, one might run a few pings to locate the source of the
problem and then modify system settings, reboot hardware or software, or
call a remote colleague to do so. (For a very readable discussion of the first
major "crash" of the ARPAnet on October 27, 1980, long before network management
tools were available, and of the efforts taken to recover from and understand
the crash, see [RFC 789].) As the public Internet
and private intranets have grown from small networks into a large global
infrastructure, the need to more systematically manage the huge number
of hardware and software components within these networks has grown more
important as well.
Figure 8.1-1: A simple scenario illustrating the uses of network
management
In order to motivate our study of network management, let's begin with
a simple example. Figure 8.1-1 illustrates a small network consisting
of three routers, and a number of hosts and servers. Even in such
a simple network, there are many scenarios in which a network administrator
might benefit tremendously from having appropriate network management tools:
-
Failure of an interface card at a host (e.g., H1) or a router (e.g.,
A). With appropriate network management tools, a network entity (e.g., router
A) may report to the network administrator that one of its interfaces has
gone down (which is certainly preferable to a phone call to the NOC from
an irate user who says the network connection is down). A network administrator
who actively monitors and analyzes network traffic may be able to really
impress the would-be irate user by actually detecting problems in the interface
ahead of time and replacing the interface card before it fails. This
could be done, for example, if the administrator noted an increase in checksum
errors in frames being sent by the soon-to-die interface.
-
Monitoring traffic to aid in resource deployment. A network administrator
might monitor source-to-destination traffic patterns and notice, for example,
that by switching servers between LAN segments, the amount of traffic that
crosses multiple LANs could be significantly decreased. Imagine the
happiness all around (especially in higher administration) when better
performance is achieved with no new equipment costs. Similarly, by monitoring
link utilization, a network administrator might determine that a LAN segment,
or the external link to the outside world is overloaded and a higher-bandwidth
link should thus be provisioned (alas, at an increased cost). The
network administrator might also want to be notified automatically when
congestion levels on a link exceed a given threshold value in order to
address a provisioning problem before it becomes serious.
-
Detecting rapid changes in routing tables. Route flapping - frequent
changes in the routing tables - may indicate instabilities in the routing
or a misconfigured router. Certainly, the network administrator who
has improperly configured a router would prefer to discover the error
himself or herself, before the network goes down.
-
Monitoring for SLAs. With the advent of Service Level
Agreements (SLAs) - contracts that define specific performance metrics
and acceptable levels of network provider performance with respect to these
metrics - interest in traffic monitoring has increased significantly over
the past few years [Larsen 1997].
UUnet and AT&T are just two of the many network providers
that guarantee SLAs [UUnet 1999, AT&T
1998] to their customers. These SLAs include service availability
(outage), latency, throughput, and outage notification requirements.
Clearly, if performance criteria are to be part of a service agreement
between a network provider and its users, then measuring and managing performance
will be of great importance to the network administrator.
-
Intrusion detection. A network administrator may want
to be notified when network traffic arrives from, or is destined to, a
suspicious source (e.g., host or port number). Similarly, a network
administrator may want to detect (and in many cases filter) the existence
of certain types of traffic (e.g., source-routed packets, or a large
number of SYN packets directed to a given host) that are known to be characteristic
of certain attacks.
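The threshold-based notification mentioned in the traffic-monitoring scenario above can be made concrete with a short sketch. It assumes, purely for illustration, that a monitored device exposes a cumulative byte counter for a link and that the administrator's tool reads it periodically; the names CounterSample, utilization, and over_threshold, the 32-bit counter width, and the 80% default threshold are all hypothetical choices, not part of any particular management protocol.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a link's cumulative byte counter (hypothetical format)."""
    timestamp_s: float   # time of the reading, in seconds
    octets: int          # total bytes counted on the link so far


def utilization(prev, curr, link_bps, counter_bits=32):
    """Fraction of link capacity used between two counter readings."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # Cumulative counters wrap around; modular subtraction handles one wrap.
    delta_octets = (curr.octets - prev.octets) % (2 ** counter_bits)
    return (delta_octets * 8) / (elapsed * link_bps)


def over_threshold(samples, link_bps, threshold=0.8):
    """Return (timestamp, utilization) for each interval above the threshold."""
    alerts = []
    for prev, curr in zip(samples, samples[1:]):
        u = utilization(prev, curr, link_bps)
        if u > threshold:
            alerts.append((curr.timestamp_s, u))
    return alerts


if __name__ == "__main__":
    # Made-up readings taken 60 s apart on a 10 Mbps link.
    readings = [
        CounterSample(0.0, 0),
        CounterSample(60.0, 15_000_000),   # 2 Mbps average, 20% utilization
        CounterSample(120.0, 82_500_000),  # 9 Mbps average, 90% utilization
    ]
    for t, u in over_threshold(readings, link_bps=10_000_000):
        print(f"t={t:.0f}s utilization={u:.0%} exceeds threshold")
```

In this sketch only the second interval is reported, since the first stays well below the threshold. A real tool would also have to decide how often to sample and how to avoid repeated alarms for the same congested period; those decisions are left out here.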
The ISO, the organization that gave us the well-known 7-layer OSI
reference model (see Chapter 1), has also created a network management
model that is useful for placing the above anecdotal scenarios in a more
structured framework. Five areas of network management are defined:
-
Performance management. The goal of performance management
is to quantify, measure, report, analyze and control the performance (e.g.,
utilization, throughput) of different network components. These components
include individual devices (e.g., links, routers, and hosts) as well as
end-end abstractions such as a path through the network. We will see shortly
that protocol standards such as the Simple Network Management Protocol
(SNMP) [RFC 2570] play a central role in performance
management.
-
Fault management. The goal of fault management is to log,
detect, and respond to fault conditions in the network. The line between
fault management and performance management is rather blurred. We can think
of fault management as the immediate handling of transient network failures
(e.g., link, host or router hardware or software outages), while performance
management takes the longer term view of providing acceptable levels of
performance in the face of varying traffic demands and (hopefully rare)
network device failures. As with performance management, the SNMP protocol
plays a central role in fault management of IP networks.
-
Configuration management. Configuration management allows a network
manager to track which devices are on the managed network and the hardware
and software configurations of these devices.
-
Accounting management. Accounting management allows the network
manager to specify, log, and control user and device access to network
resources. Usage quotas, usage-based charging, and the allocation of resource
access privileges all fall under accounting management.
-
Security management. The goal of security management is to control
access to network resources according to some well-defined policy.
The key distribution centers and certificate authorities that we studied
in section 7.4 are components of security management. The use of
firewalls to monitor and control external access points to one's network,
a topic we will study in section 8.4, is another crucial component.
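To see how the five areas can give structure to the earlier anecdotal scenarios, the following sketch records one plausible classification. The assignments are our own reading of the area descriptions above, not part of the ISO model; as noted, the line between areas is blurred, so a scenario may reasonably fall under more than one. The names ManagementArea, SCENARIO_AREAS, and scenarios_in are hypothetical.

```python
from enum import Enum


class ManagementArea(Enum):
    """The five areas of network management listed above."""
    PERFORMANCE = "performance management"
    FAULT = "fault management"
    CONFIGURATION = "configuration management"
    ACCOUNTING = "accounting management"
    SECURITY = "security management"


# One plausible (assumed) classification of the earlier scenarios.
SCENARIO_AREAS = {
    "interface card failure": {ManagementArea.FAULT},
    "traffic monitoring for resource deployment": {ManagementArea.PERFORMANCE},
    "rapid routing-table changes": {ManagementArea.FAULT,
                                    ManagementArea.CONFIGURATION},
    "SLA monitoring": {ManagementArea.PERFORMANCE},
    "intrusion detection": {ManagementArea.SECURITY},
}


def scenarios_in(area):
    """Return the scenario names classified under the given area, sorted."""
    return sorted(name for name, areas in SCENARIO_AREAS.items()
                  if area in areas)


if __name__ == "__main__":
    for area in ManagementArea:
        print(f"{area.value}: {scenarios_in(area)}")
```

Under this reading, none of the five scenarios falls under accounting management, which reflects the choice of examples rather than the importance of that area.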
In this chapter, we'll cover only the rudiments of network management.
Our focus will be purposefully narrow - we'll examine only the infrastructure
for network management - the overall architecture, network management
protocols, and information base through which a network administrator "keeps
the network up and running." We'll not cover the decision-making
processes of the network administrator, who must plan, analyze,
and respond to the management information that is conveyed to the NOC.
In this area, topics such as fault identification and management [Katzela
1995, Mehdi 1997], proactive anomaly detection
[Thottan 1998], alarm correlation [Jakobson
1993], and more come into consideration. Nor will we cover the
broader topic of service management [Saydam 1996]
- the provisioning of resources such as bandwidth, server capacity and
the other computational/communication resources needed to meet the mission-specific
service requirements of an enterprise. In this latter area, standards such
as TMN [Glitho 1995, Sidor
1998] and TINA [Hamada 1997] are larger,
more encompassing (and arguably much more cumbersome) standards that address
this larger issue. TINA, for example, is described as "a set of common
goals, principles, and concepts cover the management of services, resources,
and parts of the Distributed Processing Environment" [Hamada
1997]. Clearly, all of these topics are enough for a separate
text, and would take us a bit far afield from the more technical aspects
of computer networking. So, as noted above, our more modest
goal here will be to cover the important "nuts and bolts" of the infrastructure
through which the network administrator keeps the bits flowing smoothly.
An often-asked question is "What is network management?" Our discussion
above has motivated the need for, and illustrated a few of the uses of,
network management. We'll conclude this section with a single-sentence
(albeit a rather long, run-on sentence) definition of network management
from [Saydam 1996]:
"Network management includes the deployment, integration and coordination
of the hardware, software and human elements to monitor, test, poll, configure,
analyze, evaluate and control the network and element resources to meet
the real-time, operational performance, and Quality of Service requirements
at a reasonable cost."
It's a mouthful, but it's a good workable definition. In the following
sections, we'll add some meat to this rather bare-bones definition of network
management.
References
[AT&T 1998] AT&T, "AT&T raises
the bar on data networking guarantees," http://www.att.com/press/0198/980127.bsc.html
[Glitho 1995] R. Glitho and S. Hayes
(eds.), special issue on Telecommunications Management Network, IEEE
Communications Magazine, Vol. 33, No. 3 (March 1995).
[Hamada 1997] T. Hamada, H. Kamata,
S. Hogg, "An Overview of the TINA Management Architecture," Journal
of Network and Systems Management, Vol. 5, No. 4 (Dec. 1997), pp. 411-435.
[Jakobson 1993] G. Jakobson and
M. Weissman, "Alarm Correlation," IEEE Network Magazine, 1993, pp.
52-59.
[Katzela 1995] I. Katzela and
M. Schwartz, "Schemes for Fault Identification in Communication Networks,"
IEEE/ACM
Transactions on Networking, Vol. 3, No. 6 (Dec. 1995), pp. 753-764.
[Larsen 1997] A. Larsen, "Guaranteed
Service: Monitoring Tools," Data Communications, June 1997, pp.
85-94.
[Mehdi 1997] D. Mehdi and D. Tipper
(eds.), Special Issue: Fault Management in Communication Networks, Journal
of Network and Systems Management, Vol. 5, No. 2 (June 1997).
[RFC 789] E. Rosen, "Vulnerabilities
of Network Control Protocols," RFC 789.
[RFC 2570] J. Case, R. Mundy,
D. Partain, B. Stewart, "Introduction to Version 3 of the Internet-standard
Network Management Framework" RFC 2570, May 1999.
[Saydam 1996] T. Saydam and T. Magedanz,
"From Networks and Network Management into Service and Service Management,"
Journal of Network and Systems Management, Vol. 4, No. 4 (Dec.
1996), pp. 345-348.
[Sidor 1998] D. Sidor, "TMN Standards:
Satisfying Today's Needs While Preparing for Tomorrow," IEEE Communications
Magazine, Vol. 36, No. 3 (March 1998), pp. 54-64.
[Thottan 1998] M. Thottan and C.
Ji, "Proactive Anomaly Detection Using Distributed Intelligent Agents,"
IEEE
Network Magazine, Vol. 12, No. 5 (Sept./Oct. 1998), pp. 21-28.
[UUnet 1999] UUnet, "Service Level
Agreement," http://www.uk.uu.net/support/sla/
Copyright 1999. James F. Kurose and Keith W. Ross. All Rights Reserved.