
Copyright and Licensing Information

SNAP is (c) Jonathan T. Moore, 1999-2002 and licensed under the GNU General Public License (GPL).

All other parts of Splash are (c) Willem de Bruijn, 2002-2003 and licensed under the BSD Open Source License.

All source code is made publicly available.

Acknowledgment

Splash and the Splash website are hosted by SourceForge.net


Splash - Documentation

SNMP Plus a Lightweight API for SNAP Handling

Applying High Speed Active Networking to Network Management
A Master's Thesis

Willem de Bruijn
Universiteit Leiden

2003

Abstract

We describe a practical solution to network management issues based on combining traditional infrastructure with active networking technology. A software platform, Splash, is presented that unites the widely used net-snmp package with SNAP, an active network optimized for safe, predictable execution. Special care is taken to ensure high processing speed. For this purpose the active network interface is compared to standard SNMP on a quantitative basis. Performance is shown to be comparable for low-level operations, while under more elaborate scenarios Splash outperforms SNMP in terms of response time, network bandwidth utilization and general flexibility. By combining both interfaces in a single process, Splash can be used as a drop-in replacement for net-snmp, allowing network operators to use the monitoring paradigm (polling with SNMP, or an agent-based approach) most appropriate for a given situation.

Preface

This thesis marks the end of a six-year period of enrollment at the Universiteit Leiden for me. During these years I've gotten to know and appreciate student life in every aspect. While I was far from academically inclined when I first arrived, I hope to have made up for that later on. I'll let this work speak for itself on that subject.
During my stay in Leiden I became increasingly interested in the field of Computer Science. Only after completing the preliminary courses, most notably those dealing with mathematics, did the subject really start to come alive. Gratitude goes to Walter Kosters for pushing me when necessary and to Igor Grubisic for showing me that mathematics can be quite entertaining after all.
In the later years I had the chance to work on some interesting novel projects. This was made possible mainly by the open attitude towards student research among the academic staff at the Leiden Institute of Advanced Computer Science. I would hereby like to thank all of the people I've gotten to know at LIACS for their help. A personal thanks goes to Michael Lew for helping me get my first accepted conference paper.
During the last year I worked solely on the Splash project detailed here. Luckily I could fall back on excellent assistance from Herbert Bos here at LIACS and Jonathan Moore at the University of Pennsylvania. At first I hardly knew what the abbreviation SNMP stood for, let alone what active networking meant, but they guided me through the process at all times. The most enticing reason for taking up this research was the opportunity to actually contribute to the scientific community. Even if this proves to be my last exercise in this domain, it has been a great stimulant for finishing what I started and undoubtedly a learning experience I can benefit from for times to come. I would therefore like to thank Jonathan for this opportunity and especially Herbert for all the help he's given me throughout the last year. A big thanks also goes to Doug DeGroot, who was kind enough to take the time to read through and comment on my thesis in an extremely short timespan.
Academics aside, I wouldn't be where I am without my flatmates' friendship. You can't always be on your toes, and the relaxed attitude at Flanor 4c made sure I never got, and will not soon get, too overstressed. It was a lot of fun, guys! I'm sorry for the times I did get a bit worked up, such as after my failed first attempt at obtaining a driver's license and during the last weeks before handing in my thesis.
Naturally, the people I have depended on the most are the last to mention. Without my family, especially my parents, I wouldn't even have had the chance to enroll, let alone obtain a degree. Thank you for supporting me and letting me make my own choices in these years.

Part 1
Background


Chapter 1
Introduction

The purpose of this study is to show that so-called active networks (AN) can be applied practically to network management (NM). For this purpose we will compare an AN-based toolkit with a reference SNMP implementation. An often-heard argument against the use of active networks is that they increase processing overhead considerably. Experiments will show that active networks can extend NM functionality without incurring a performance penalty on basic operations. We introduce a network management toolkit that combines SNMP and AN interpreters in a single executable. The resulting program can serve as a drop-in replacement for a standard SNMP daemon.
We all rely on the proper functioning of computer networks for our everyday tasks, be it directly, by accessing the Internet through PCs and mobile phones, or indirectly, by withdrawing money from cash machines and booking flights at our travel agencies. We take the existence of a backbone network for granted and only notice our dependence on it when it fails. Unfortunately, cables do sometimes break, power will fail once in a while and software isn't infallible, as recent problems such as the California power shortage, Y2K and the crash of the Ariane 5 rocket demonstrated.
Small glitches in an environment can have widespread consequences. That is why critical applications are equipped with backup systems: power stations run below their maximum output, banks have multiple overlapping transaction systems and jets have ejection seats. Depending on the severity of a problem, a suitable action must be taken. Increasing power output by a few percent is a gradual solution; abandoning an airplane is a more definitive one.
Increasing network availability is not merely a case of laying more cables in the ground and buying more hardware. It is equally necessary to maintain the existing resources in working order. Tasks scale from fixing loose network connections to governing the quality of service of a national telephone system in an emergency situation. The ad hoc fixing of problems can be called network maintenance. Naturally, a countermeasure must be proportional to its problem. Selecting an appropriate action constitutes the difference between maintenance and management. The real task of keeping networked systems running smoothly therefore is not confined to taking care of instantly emerging issues, but deals with governing the entire decision making process. This area is generally referred to as network management.
With the increase in online activity in the last few decades the task of managing networks has grown in complexity. So has our dependence on it being executed correctly. New applications and environments are deployed every day, for instance online banking, wireless Ethernet and 3G mobile services. To be able to serve both present and future needs, network management tools must be secure, reliable, efficient and flexible. However, most frequently used network management tools date back at least a decade and the gap between what they offer and what we need has been widening since [,1.3, 3.0].
Research into network management has progressed through the years. Solutions have been found to deal with new problems on a regular basis. Yet we still rely on relatively old tools. The alternative solutions put forward so far must therefore suffer from another weakness than pure ineffectiveness. Considering the expenses put into the existing infrastructure, it is impractical to come up with a completely new system and expect quick global acceptance. This has often been attempted, however, and has most probably been the main weakness of many challengers. Instead of replacing proven tools in the network administrator's toolkit with untested alternatives, a new system should be able to interoperate with the features already in place.
A specific branch of network management research deals with the use of so-called active networks. Active networks are based on the notion that programmable network packets, as opposed to the static data packets used in traditional networking, can solve many issues by moving computation to the most useful location. A large body of work on AN-based systems exists and their advantages in terms of flexibility have been demonstrated many times. Programmable packets do have a downside, however. So far, all attempts to adapt an active network to network management have failed due to the increased processing cost they incur and the resulting drop in perceived performance. This performance argument has also often been used for general criticism of active networks. It is therefore an important barrier to cross if active networks are to be used in practical situations.
It is our thesis that a new, more resource efficient class of active networks can, contrary to popular views, help to increase network management functionality without incurring a performance penalty on basic operations.
In the following chapters we will discuss the present state of network management and active networking (chapter ), and identify their current weaknesses (chapter ). Then, in part , a software design will be introduced that is based on the insights gathered previously. Also, specific test scenarios will be discussed that can show the advantages of the new platform over an existing reference system. The suggested design has been implemented. Details of this product are discussed in part . The augmented product will be compared to a standard reference system in part . Finally, we will draw conclusions from these experiments and suggest follow-on research in part .

Chapter 2
Domain Background

The research discussed in the following chapters is based on network management and active networking technologies. These are specialized fields of science; we therefore cannot assume a working knowledge of the domain from the reader. This chapter should give any reader acquainted with general computer science terminology enough background information to be able to follow the argumentation detailed in the following chapters.

2.1  Network Management

2.1.1  Overview

The term Network Management (NM) refers to the task of maintaining computer networks in working order. It consists of both monitoring and configuration tasks and encompasses many areas, of which the most commonly mentioned are:
  • Fault Management
  • Configuration Management
  • Accounting
  • Performance Management
  • Security Management
Together these areas are referred to as FCAPS in the Open Systems Interconnection (OSI) model [].
Computer networks are composed of a mix of general purpose computer systems and specialized network connection elements, e.g. routers, switches and firewalls. Traditionally, network management focuses on maintaining the network links and therefore primarily on the networking components. This said, network management tools can also be used for monitoring and configuring general purpose computer systems.
When viewing a network as an abstract structure we can define the connections between two components as edges and the components themselves as nodes. Workstations and PCs mostly have only a single edge between them and the rest of the network. We define these types of components as end nodes, while components with more than one connection will be referred to as routers.
Managing a network can in essence be performed by manually controlling each machine. However, since we are dealing with a communication medium it is possible to relieve ourselves of this burden by accessing nodes through the network. This is especially useful in networks containing many nodes or those that cover a large area. When talking about network management we imply the use of remote administration tools.

2.1.2  History

The global increase in the number of computers and the growth in system complexity over the last few decades have greatly increased the demand for network management tools. Since the advent of the Internet computer networks have grown quickly, and so has the task of managing all connected devices. Although the task became increasingly difficult and time consuming, networks also made it possible to automate and reduce some of the work.
In the early days of networking remote shell access to most systems was made possible using tools such as Telnet[] and Rlogin[]. Since shell login allows access to all parts of the system this has long been a preferred tool for Network Management. Even today shell access is actively in use, most probably also for network management tasks.
During the 1980s many suppliers of networking components began adding proprietary management tools to their hardware, such as Netview by IBM, Accumaster by AT&T and DMA by Digital Equipment []. Failing to agree on a common communication method for these tools meant different hardware systems could not cooperate easily within the same network. Naturally, this gave way to interoperability problems. Furthermore, staff had to be trained specifically for each system, making network management a costly service.
Recognizing the need for a standardized communication framework, the Internet Architecture Board initiated a process of discussion[] in 1988 that led to the Simple Network Management Protocol[]. SNMP has been around since 1989, when the first version was defined as a protocol that was expected to be replaced soon. As is often the case, the system became so widespread that it is still in use today and will probably remain so for quite a while.
As a replacement the Common Management Information Protocol (CMIP) and the Common Management Information Services (CMIS) standards were developed. CMIP/CMIS addressed many of the problems of SNMP and offered a more elaborate system of data structures and communication mechanisms. Despite the benefits CMIP/CMIS brought in terms of scalability and flexibility, it never really caught on, probably partly due to the increase in processing overhead and system complexity its use entailed. In the past decade no competing technology has been able to succeed SNMP as the primary NM interface. Supplementary tools have helped in overcoming some of its shortcomings. Among the most popular are Remote Monitoring (RMON) [] for transferring network statistics and the SNMP Agent eXtensibility framework (AgentX)[] for connecting SNMP with other pieces of software.

2.1.3  SNMP

Since SNMP is the current standard in the field, one cannot talk about network management without discussing it. SNMP is a framework for network management standardized in Request For Comments (RFC) documents. It consists of definitions for [,1.10.01]:
  • an overall architecture
  • the structure and identification of management information
  • communication protocols
  • operations for accessing management information
  • a set of fundamental applications

Management Information Base

In SNMP, data is structured in a hierarchy. All objects in this hierarchy have a globally unique identifier called an Object Identifier (OID). We call the complete data structure the Management Information Base or MIB. However, we should note that, in NM literature, subtrees of the global structure are often also referred to as MIBs. To avoid confusion we will only use the term subtree when discussing specific parts of the hierarchy.
An OID consists of a string of numbers. The complete string denotes a unique element in the MIB hierarchy. For example, the system name of a node is encoded in the OID with sequence .1.3.6.1.2.1.1.5. These numbers correspond to subtrees in the MIB. For the system name example this gives .iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1).system(1).sysName(5). To simplify the process of retrieving information we can also use a human readable form, giving RFC1213-MIB:system::SysName or even sysName. As an example, the system subtree is shown in figure . The listing was obtained from the net-snmp tutorial[,tutorial: SNMPtranslate]
+--system(1)
   |
   +-- -R-- String    sysDescr(1)
   |        Textual Convention: DisplayString
   +-- -R-- ObjID     sysObjectID(2)
   +-- -R-- TimeTicks sysUpTime(3)
   +-- -RW- String    sysContact(4)
   |        Textual Convention: DisplayString
   +-- -RW- String    sysName(5)
   |        Textual Convention: DisplayString
   +-- -RW- String    sysLocation(6)
   |        Textual Convention: DisplayString
   +-- -R-- Integer   sysServices(7)
   +-- -R-- TimeTicks sysORLastChange(8)
   |        Textual Convention: TimeStamp
   |
   +--sysORtable(9)
      |
      +--sysOREntry(1)
         |
         +-- ---- Integer   sysORIndex(1)
         +-- -R-- ObjID     sysORID(2)
         +-- -R-- String    sysORDescr(3)
         |        Textual Convention: DisplayString
         +-- -R-- TimeTicks sysORUpTime(4)
                  Textual Convention: TimeStamp

Figure 2.1: MIB example: the system subtree
A subset of the MIB has been standardized by the IAB. Examples of this subset are access to Internet Protocol statistics, the route table and the system MIB mentioned above. Full RFC references can be found on the web [].
For other entries, mostly platform dependent trees, unique OIDs have to be requested from the Internet Assigned Numbers Authority (IANA)[]. Trees have been registered for Cisco Routers, Oracle databases and other proprietary environments.
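
As an illustration of the numbering scheme above, the following Python sketch translates a numeric OID into the named path along the MIB tree. The hand-coded dictionary is only a tiny excerpt for this example; a real agent such as net-snmp derives this mapping from the MIB definition files, so this is a sketch of the idea rather than of any actual implementation.

# Minimal sketch: map a numeric OID onto the named MIB path shown above.
# The tree excerpt is hand-coded for illustration only.

MIB_NAMES = {
    (1,): "iso",
    (1, 3): "org",
    (1, 3, 6): "dod",
    (1, 3, 6, 1): "internet",
    (1, 3, 6, 1, 2): "mgmt",
    (1, 3, 6, 1, 2, 1): "mib-2",
    (1, 3, 6, 1, 2, 1, 1): "system",
    (1, 3, 6, 1, 2, 1, 1, 5): "sysName",
}

def oid_to_names(oid: str) -> str:
    """Translate e.g. '.1.3.6.1.2.1.1.5' into '.iso(1).org(3)...sysName(5)'."""
    numbers = tuple(int(part) for part in oid.strip(".").split("."))
    labels = []
    for depth in range(1, len(numbers) + 1):
        prefix = numbers[:depth]
        name = MIB_NAMES.get(prefix, "unknown")
        labels.append(f"{name}({prefix[-1]})")
    return "." + ".".join(labels)

if __name__ == "__main__":
    print(oid_to_names(".1.3.6.1.2.1.1.5"))
    # .iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1).system(1).sysName(5)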

Communication Protocols

The SNMP protocol is sometimes confused with the overall framework because they share the same name. However, it is important to distinguish the protocol from the framework. The protocol has been revised a number of times. When we refer solely to the protocol we will mention the name and the version. In many cases this is abbreviated to SNMPvX, where X stands for the version number.
The original protocol, still in use, has only basic functionality. It is possible to retrieve system information and set variables remotely by directly addressing them through their OID. The protocol consists of essentially two messages: a GET and a SET request. Requests are sent over UDP to the server in a packet called a Protocol Data Unit (PDU). Subsequent requests can be shortened by issuing a GETNEXT request instead of a full GET. Multiple requests can be bundled in a single PDU, but with SNMPv1 that is as far as data aggregation goes.
Unfortunately, the simple GET/SET structure means that for each referenced variable or set of variables a request and a response packet have to be sent, creating a lot of network traffic. This problem has been dealt with in part in the second version of the protocol.
SNMPv2 allows single requests for sets of related data by issuing a bulk request. GETBULK requests can be used to retrieve complete subtrees of the MIB. For instance, the example tree segment displayed in figure 2.1 can be obtained by issuing one GETBULK system request. Network bandwidth consumption is further reduced by creating a hierarchy of SNMP servers called proxies. This replaces the one-to-many topology of the NM infrastructure with a more flexible layout that can mimic the topology of the underlying network. Obviously, only hierarchical networks benefit from this feature.
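
To make the difference concrete, the following back-of-the-envelope Python sketch compares the number of PDUs and the byte counts for reading the system subtree variable by variable with GET against a single GETBULK exchange. The header and varbind sizes are purely illustrative assumptions, not measurements of any SNMP implementation.

# Rough sketch of the traffic difference described above: reading N leaf
# variables one at a time with GET versus one GETBULK that returns the whole
# subtree. Sizes are illustrative placeholders.

HEADER_BYTES = 60        # assumed per-PDU overhead (UDP/IP + SNMP header)
VARBIND_BYTES = 30       # assumed average size of one encoded variable binding

def snmpv1_get_traffic(num_vars: int) -> int:
    """One GET request and one response per variable."""
    per_exchange = 2 * (HEADER_BYTES + VARBIND_BYTES)
    return num_vars * per_exchange

def snmpv2_getbulk_traffic(num_vars: int) -> int:
    """A single GETBULK request; the response carries all variable bindings."""
    request = HEADER_BYTES + VARBIND_BYTES          # one binding names the subtree
    response = HEADER_BYTES + num_vars * VARBIND_BYTES
    return request + response

if __name__ == "__main__":
    n = 8  # roughly the number of scalar leaves in the system subtree
    print("per-variable GET :", snmpv1_get_traffic(n), "bytes, in", 2 * n, "PDUs")
    print("single GETBULK   :", snmpv2_getbulk_traffic(n), "bytes, in 2 PDUs")
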
During the standardization process of version 2 partly compliant implementations spread to bridge the gap with SNMPv1. Protocols such as SNMPv2p, SNMPv2u, SNMPv2* and SNMPv1.5, can all be considered pseudo standards. One derived version deserves extra attention, namely SNMPv2c. Version 2c is the most widespread implementation of the v2 series, consisting of all SNMPv2 features excluding the advanced security infrastructure. Instead, it still relies on the community-string based authentication found in v1, hence the `c'.
The latest version of the protocol, v3, added much-requested authentication and encryption. Although using these features comes with a performance penalty, securing access to configuration options can be necessary.

2.1.4  Practical Situations

Having discussed the tasks of network management and how SNMP carries these out we should stop and ask a second question. How important is SNMP to the average network administrator? Is it really the central tool for network management? Some tasks listed in the FCAPS model are extremely hard to perform using SNMP, for instance accounting. How do administrators deal with these issues, if SNMP cannot help them with this?
Many distributed administration tasks, for instance user accounting, cannot be handled practically by SNMP. Distinct tools have been created for individual tasks. Microsoft's Active Directory, a tool for domain-wide user and resource control, for instance, was hailed as a holy grail when it was first introduced. We suspect that SNMP is steadily losing ground to these proprietary applications.
The worst case scenario is that a standardized framework is rendered irrelevant because of its sheer inability to deal with administrators' greatest headaches. Judging from the number of non-SNMP based NM solutions available, SNMP is indeed moving down the queue. A real concern is that command line scripts and proprietary tools will again be utilized for tasks where standard toolkits could equally well be used.
It is hard to say how urgent this concern is in the field of network management. Indications that SNMP cannot handle the tasks that it should are abundant, however.

2.2  Active Networking

2.2.1  Introduction

To be able to explain the term active networking (AN) we must first talk about everyday networking practices. The telephone system can be seen as the first large scale remote communications network. Originally, each telephone connection had to be manually set up by an operator. Establishing a call between two people literally meant connecting two wires to each other. These types of networks are called circuit switched networks. Digital data oriented networks used circuit switched connections until the American Defense Advanced Research Projects Agency (DARPA)[] began constructing a more fault tolerant form of communication in the late 1960s. The resulting packet switched network, the ARPANET, allowed communication to take multiple routes from the source to the destination node. Instead of setting up dedicated links between communicating nodes, communication streams were cut up into small packets that could each be sent across a different network fiber. Linking networks together and breaking up communication into small packets increases network redundancy, since the failure of a single link doesn't necessarily entail a break in connectivity. Packets can be rerouted around an error zone and reach their destination safely. The Internet Protocol (IP) is the predominant example of a packet switched network. Our present Internet actually originated from this DARPA initiative.
Cutting up communications into small packets, as is done in packet switched networks, permits traveling across multiple edges in the network. To facilitate this, all edges are - in theory - kept continually open. Directing a packet through a network consisting of multiple connections entails making forwarding decisions at intermediate locations. Routing, as this is called, is generally executed using extra logic at the intermediate nodes. In the networks we use today, such as the Internet, this logic is pre-configured into the individual nodes. Based on information attached to each data packet a node chooses the outgoing edge on which it will resend the packet.
Active networking, by contrast, places the necessary logic inside the packet itself. In the extreme case, the software running on the nodes only executes programs embedded in the data packets and exposes its information to the packets. This behaviour can increase network flexibility by changing policies based on the nature of the data.
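
The following Python sketch illustrates this inversion of control in its simplest form; all class and field names are hypothetical and do not correspond to any existing AN implementation. The node merely exposes its local information and interprets the program carried by each packet.

# Hypothetical sketch of the packet-carries-the-code model described above.
# The node does not apply its own forwarding policy; it interprets the program
# embedded in each capsule against the information it exposes.

from dataclasses import dataclass
from typing import Callable, Dict, Any, Optional

@dataclass
class Capsule:
    program: Callable[[Dict[str, Any]], Optional[str]]  # node info -> next hop (or None)
    ttl: int = 8                                         # guard against forwarding loops

class ActiveNode:
    def __init__(self, name: str, info: Dict[str, Any],
                 neighbours: Dict[str, "ActiveNode"]):
        self.name = name
        self.info = info                # information the node exposes to packets
        self.neighbours = neighbours

    def receive(self, capsule: Capsule) -> None:
        if capsule.ttl <= 0:
            return
        capsule.ttl -= 1
        next_hop = capsule.program(self.info)   # execute the embedded program
        if next_hop in self.neighbours:
            self.neighbours[next_hop].receive(capsule)
        # if the program returns None or an unknown name, the capsule stops here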

2.2.2  Approaches in Active Networking

How program code is fetched can differ between AN implementations. In the previous section we mentioned the case where executable code is embedded directly into a network packet. This method is generally called the revolutionary approach. A related but different attempt uses indirect access to program code. Instead of carrying the code itself, packets then contain links to foreign code. When a packet enters a node the linked-to code is downloaded from a remote location. The transitional approach, as this is called, has the advantage of reduced packet size. Especially when foreign code is cached on a node for repeated use, the transitional approach can increase efficiency.
The revolutionary approach is, we believe, the more intuitive one. Self-contained capsules truly follow the idea of programmable autonomous packets. Also, if executed programs change frequently, caching code will be a useless exercise. For practical reasons a transitional approach may be warranted in certain circumstances.
The two approaches have more commonalities than differences and most of the time implementation details will be largely invisible. For the remainder of this document we will follow the revolutionary approach. In spite of this, many remarks, if not all, could equally well correspond to the transitional approach.

2.2.3  Applicability

Active networks change the way we think about network processing. Moving control from the nodes to the packets is a far reaching paradigm shift. The topic has spurred much controversy among scientists and developers, and the practice therefore has many critics. We will briefly go into the arguments most often heard and try to give counterarguments.
The discussion is not just a nuisance to active network proponents, although some of them might think it is. In reality, discussion helps to clearly mark the boundaries of applicability for the field. We do not, contrary to some proponents, believe active networking to be a revolutionary successor to passive networking. Instead, we will try to show where the critics have made a substantial contribution to the discussion and where their statements can be countered. The result of this should be a justification of research into active networks as tools for certain tasks, not as a holy grail of computing. We will begin by giving an example task for which active networking might make a good tool. This intuitive justification will then be verified by a close inspection of active networks' inherent characteristics.

Example Use

In the case of general processing, active networking can be used to let packets make their own routing decisions. For instance, realtime communication streams need low latency links, while software downloads prefer high bandwidth over latency. If these two streams could choose between a narrow, yet short distance modem connection and a wide, but unresponsive satellite link, they would probably choose differently. In the current setup, where the intermediate nodes govern the decision making process, discriminating between these types of streams, while possible, needs cooperation from all participants in the process. Chances are that at least one intermediate router uses a one-size-fits-all strategy, in which case one of the streams will have a suboptimal connection. Active networking, instead, would allow the end users to control their stream, a more natural approach. By moving data processing options from the nodes to the data itself we can not only alter the way data is routed through a network. Other possible uses include on-the-fly data compression, conversion, aggregation and multiplexing. These features can have uses for fields such as realtime multimedia delivery, multicasting and network management, among potentially many others.
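
A minimal, purely hypothetical sketch of such a per-stream choice is given below; the two link descriptions and the selection rules are made-up example values, not a model of any deployed system.

# Sketch of the per-stream link choice described above. An active packet could
# carry this policy and evaluate it against the links a node exposes.

LINKS = {
    "modem":     {"latency_ms": 30,  "bandwidth_kbps": 56},
    "satellite": {"latency_ms": 600, "bandwidth_kbps": 2048},
}

def choose_link(stream_type: str) -> str:
    if stream_type == "realtime":      # e.g. voice: latency dominates
        return min(LINKS, key=lambda l: LINKS[l]["latency_ms"])
    else:                              # e.g. bulk download: bandwidth dominates
        return max(LINKS, key=lambda l: LINKS[l]["bandwidth_kbps"])

print(choose_link("realtime"))   # modem
print(choose_link("download"))   # satellite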

Discussion

Active networking has not yet caught on as an established technology. This can be expected when we look at the arguments speaking against AN deployment. Out of possibly many, we will here discuss what we believe are the main contributing factors to the slow adoption of active networking in everyday practice.
Resource Consumption   Firstly, AN systems consume more resources, both in terms of network bandwidth and processing time, than regular packet switching. In the past, the trade-off between flexibility and efficiency has always led to optimizing the latter. This trade-off has changed in the last few years. Increased processing power has enabled us to start focusing more on flexibility. Researchers working in the AN field obviously believe this process will continue in the next decades, but this reliance on increased processing power is a gamble. Even they must admit as much. Relying on an increase in processing power is therefore not enough to implement AN-based software systems, since the relative performance penalty remains the same. Concurrently with the drop in processing costs, new research has led to more efficient AN implementations. If both processes continue, the current technological barrier will cease to be an issue at some point in the future. Exactly at which point in time ANs can compete with traditional systems depends on the rules of the game. Each field of application has its own demands. Especially those fields where network latency is a lesser issue are likely candidates.
Traditionally, processing power has been an expensive resource. To maximize efficiency, network resources have been designed to minimize per packet processing overhead, which has led to inflexible machines running one-size-fits-all policies. Since the price of processing power has dropped considerably in the last decades, and most probably will continue to do so in the near future, it is now becoming possible to increase processing. This does not mean that we can increase per packet processing across the board. Although processing power has increased considerably, network throughput has grown even faster and per packet processing time has actually decreased. Despite this, it has become possible to allow more complex processing on the remote node for a subset of tasks that could benefit greatly.
Dataplane vs. Controlplane   We already suggested that active networking might not be suited to all networking tasks. To further investigate to which tasks it might be applicable we will introduce a distinction frequently made between two classes of networking tasks: dataplane tasks and controlplane tasks.
The dataplane is the domain of general purpose networking. All processing that occurs inside the network is generally regarded as part of the dataplane, with the exception of those messages needed for administering the dataflow. The controlplane is the logical domain of these `meta' messages. Examples of controlplane tasks are therefore connection setup and destruction, synchronization and network management.
One of the practical outcomes of this distinction is that dataplane processing, the forwarding of packets, has to be carried out at top speed to allow for high performance tasks. Controlplane processing, on the other hand, is generally less time critical. This suggests employing active networking first in the controlplane. We should only try to incorporate active networking into dataplane processing when it has been shown to be able to handle the less demanding controlplane operations.
We should point out that the distinction between dataplane and controlplane is becoming blurred with the increase in complex processing used in the dataplane, e.g. QoS negotiation and task-specific routing, and the growth in single frequency networks, where both types of communication are handled through the same circuit. Active networks have so far been suggested mostly for tasks traditionally found on the controlplane, we believe with good reason. Again, it is not our intent to propose active networking as a general replacement for traditional practices.
Young Technology   Another reason for the limited deployment of active networks worth mentioning is the fact that it is still a relatively young research domain. In the last years efficiency has increased, as we will show in section . Other design issues, most notably safety concerns, have still not been resolved satisfactorily. Research into these issues is underway, but until they are dealt with many applications cannot be replaced with AN systems.
Situations in which AN systems can already excel are being explored. In [] Ramanujan and Thurber investigate the use of active networks for dynamically scaling multicast video streams. Using active networks for management tasks has been researched in [], [], [], [], [] and [], among many others. More successful case studies will be necessary if we are to convince network administrators that they should replace proven technology with AN alternatives.
Interoperability   Simultaneously with the drop in processing power costs, the application domain for computer networks has expanded from basic tasks such as emailing to fields including realtime conferencing and streaming media broadcasting. These new applications place extra demands on the underlying network infrastructure, e.g. realtime delivery constraints and increased effective data throughput. Specialized technologies for quality of service negotiation and multicasting can be implemented in network nodes to allow for such services, but each of these only fixes a single problem. From past experience we can learn that agreeing on a common standard can take a long time. Therefore it is preferable to limit the number of agreements necessary. This is where AN comes into play. By transferring the logic from the network devices to the data packets we only have to define a single framework, once, to address both present and future problems.
Increasing interoperability between rivaling implementations is another issue for which a proper solution must be found. Tennenhouse and Wetherall make the case for a flexible approach, which leaves the possibility of adding newer designs open:
[...] we are not suggesting that a single model be immediately standardized. The tensions between available programming models and implementation technologies can sort themselves out in the research marketplace as diverse experimental systems are developed, fielded, and accepted or rejected by users. For example, if the marketplace identifies two or three encodings as viable, then nodes that concurrently support all of them will emerge. As systems evolve to incorporate the best features of their competitors, we expect that a few schemes will become dominant.[...] []
The Active Network Encapsulation Protocol (ANEP) is an example of ongoing work on an encapsulation layer that further reduces transition problems.
The End-to-End Argument   A notorious problem holding back the deployment of AN is the extra processing it incurs in highly connected nodes. Increasing processing in the Wide Area Network (WAN) or backbone can decrease overall performance. If only a subset of all packets incur this overhead, a skewed situation arises where all packets suffer from a cost that is only acceptable for some. It is therefore considered bad practice to move services to the core of the network that are not beneficial to all packets. This counterargument to processing in the network's core is generally referred to as the end-to-end argument [].
One service often rejected following the end-to-end argument is the interpretation of programmable packets: active networking. However, in the case of active networks this argument has been dealt with in part. In [], Bhattacharjee et al. demonstrate that by localizing data processing overall resource consumption can actually decrease. Their counterargument is that central nodes in the physical network are themselves often endnodes in view of specific tasks. Moving processing to these nodes is therefore not only legitimate, but even preferable. From this discussion it becomes clear that not all networked tasks are equally suited to be ported to active networking environments. Network management is, as we will see in chapter . The experiments will show resource consumption savings similar to those in [].
Security Concerns   A major concern with all remote invocation frameworks is, or at least should be, security. Many holes have been found, for instance, in the Microsoft Windows operating systems relating to automatic execution of unsafe ActiveX components or VBScript routines.
Active networks have so far been proposed mostly as research projects and security has often been neglected because of that. While not a problem for a research project, this limits adoption of existing frameworks for critical tasks. A production environment will need to have tools for authentication and encryption. One simple short-term solution would be to use IPSec [] to secure the underlying transport layer. In the case of network management one could use the SNMPv3 security framework. Script MIB is a remote invocation scheme that relies on SNMP's infrastructure for safety. In any case, securing an active network, even if it is itself insecure, should pose no technical barrier.
Overview   Some arguments pleading against active network deployment can be refuted; the impact of others can be reduced. However, we believe active networks should be implemented only in those situations where we can expect them to be directly of value. Since AN research is a relatively young field we should first focus on tasks in the controlplane. The next section will briefly discuss the current state of AN research by discussing the best known implementations. The following chapter then deals with applying active networks to network management and tries to answer these questions: Is it doable? Is it preferable? How is it to be accomplished?

2.2.4  Available Frameworks

Many active networking frameworks have been introduced over the last five years. Giving a complete listing of all these packages is not our goal at the moment. The following will only introduce those packages that we deem important contributions to the research domain. Many of the following have been made possible through funding by DARPA.
We will follow the thorough overview written by Jon Moore in []. In this work he identifies the most actively developed initiatives and discusses them based on their safety, efficiency and flexibility. For a more detailed overview of the field we kindly direct the reader to the references section or the previously cited document ([]).
At the Massachusetts Institute of Technology research is underway on two related environments. The Active Network Transport System, or ANTS [] for short, is one of the oldest ANs. ANTS is set apart from the rest by its code-referencing interface, i.e. packets do not carry the actual code, but links to this code. The need for increased efficiency led to the Practical Active Network (PAN) []. PAN started as a follow-on to ANTS to research practicality based on computation overhead. Its special characteristics are therefore efficiency and speed optimizations, such as in-kernel execution and code caching.
Another early entry has been and is still being developed at the University of Pennsylvania: the Packet Language for Active Networks []. PLAN is based on the notion that safety can be ensured - at least partly - by reducing the expressiveness of the language. Continuing on this work, the Safe and Nimble Active Packets (SNAP) environment was introduced by members of the PLAN team. SNAP will be discussed in greater detail in chapter . Its distinguishing feature is the controlled execution environment created by reducing the allowable language constructs. Efficient execution was another key point of the design. Both PLAN and SNAP are part of the so-called Switchware project [], a program for researching different approaches to active networking. One of its other subprojects is the Secure Active Network Environment, SANE []. As the name implies, it mostly researches security in connection with active networks. For this purpose it utilizes public/private key cryptography.
A non academic player, BBN Technologies, created the Smart Packets environment []. Smart Packets were especially tailored to network management tasks. Since these tasks often have to take place in partly failing networks robustness was a key concern.
Another approach is taken by the MØ mobile agent environment []. MØ uses agents to create all the functionality in a network for which synchronization is necessary. It is based on the MØ language and is being developed at the University of Uppsala in cooperation with the University of Geneva.
The SafetyNet [] initiative being developed by Wakeman et al. at the University of London is another example of a network where safety is being enforced by language constructs.
Finally, StreamCode [,], developed by NEC in cooperation with ETH-Zürich, is a high performance AN system. The distinguishing feature of the system is the hardware implementation used to execute code at the speed at which it is read from the network. It seems that research into StreamCode has stopped, however.
The environments mentioned are but a subset of the research domain. Research is also underway in fields that, while using a different terminology, share a lot of features with active networks. The boundaries between these can sometimes be very thin. Mobile code, for example, can be seen as an active network without the attention to intermediate node execution. Extensible kernels are also a research domain where foreign code is allowed to execute on a networked node. On the networking side research is taking place into ad-hoc and mesh networks, systems designed to dynamically reconfigure themselves. These fields encounter some of the same problems as active network research, specifically the performance and security issues.
Most closely related to active networks is the notion of mobile agents. Agents are defined as small programs running in the background. Mobile agents carry out requests by traversing a network and executing at cooperating nodes. In contrast with active networks agents are mostly geared at application level processing. The term agent has been used in relationship with active networks, however.

With the information provided in this chapter one should be able to follow the line of thought set out in the rest of this work. The next chapter starts with the inspection of problems related to traditional network management. We will also make the case for using active networking technology to overcome these problems. Finally, our main thesis will be discussed in detail.

Chapter 3
Issues and Contribution

Network Management has seen a transition from ad hoc fixes to standardized methods in the 1980's. Since then advances in NM have nearly stood still. Changes in application networks have not. The gap between what's needed from network management tools and what they can deliver has necessarily grown wider.
We will explore some of the problems currently facing NM in this chapter. Previous solutions to these problems will be analyzed. Finally, we will discuss a novel approach combining the useful features of the alternatives with the performance and stability of SNMP.

3.1  Network Management Issues

Since regular users aren't concerned with network management, one can say that it is a necessary evil. In an ideal world, all of the network's bandwidth would be available for application communication. Minimizing the footprint of network management on resources is one issue that will always exist. It is questionable, however, whether this is a priority to administrators. Following are a number of network management concerns; some have to do with optimization, others pinpoint specific shortcomings of the SNMP infrastructure. This overview is based in part on a previous inspection by German Goldszmidt [,1.3, 3.0].
Performance   SNMP has been optimized for performance. Since it only allows relatively basic operations this results in local optimizations in view of the entire NM process. Depending on the complexity of the task at hand further optimizations can be envisioned. Alternatives have proved to outperform SNMP in specific tasks, some of which use active networking as a basis. References are given in the Related Work section ().
Bandwidth Consumption   While performance is of value, it is not the only feature worth optimizing. The communication methods of SNMP are intentionally low-level, so that the system can be used as a basis for building complex operations. A consequence of the limited communication options is a bloat in data transfer. The rigid structure necessarily placed on a network infrastructure when using SNMP increases the amount of traffic generated for a request. The SNMPv2 protocol revision eased the communication restrictions somewhat by providing the option of creating hierarchies of management nodes. Channeling data in this fashion can reduce bandwidth needs considerably. Still, many other topologies that can offer further savings aren't catered for.
A note worth making about bandwidth consumption is that optimizing it has value only in specific situations. First and foremost, optimizations in network management throughput will only benefit end-user applications when the controlplane and the dataplane use the same network circuit. In many Internet connections this is the case, e.g. Ethernet and telephone modems. However, when protocols allow for multiple independent circuits, as is the case in ISDN or frequency division (FD) multiplexed fiber-optic networks, reducing control channel bandwidth consumption will do nothing for the other channels. When the control channel is not utilized maximally there is no need whatsoever to reduce bandwidth consumption.
Scalability   Predefined topologies can be suboptimal in certain situations. This can render them impractical to use. Especially the centralized topology of SNMPv1 has the problem that bandwidth consumption scales linearly with the number of nodes. A centralized topology is - from a scalability point of view - the single worst option. Not only do bandwidth needs increase dramatically when centralized networks scale; overloading the management station with information may also increase response times. Software complexity can also increase considerably. To a lesser extent the same argumentation holds for hierarchical approaches.
Proponents of existing infrastructure often resort to the statement that practical examples show that current infrastructures can handle the load. Countering this claim is simple: the question isn't whether it is doable. Better would be to ask whether it is preferable. The obvious answer is no, it is not.
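
The scaling argument can be made concrete with a small back-of-the-envelope sketch. The figures below are illustrative assumptions (bytes per node, number of proxies, aggregation ratio), not measurements; they merely show how traffic at the management station grows linearly in the centralized case and is damped by aggregation in a hierarchy.

# Rough sketch: traffic arriving at the top-level management station for a
# centralized topology versus a two-level proxy hierarchy that condenses
# results before forwarding them. All numbers are illustrative assumptions.

def central_station_load(nodes: int, bytes_per_node: int) -> int:
    # every node reports directly to the management station
    return nodes * bytes_per_node

def hierarchical_station_load(nodes: int, bytes_per_node: int,
                              proxies: int, aggregation_ratio: float) -> int:
    # each proxy polls nodes/proxies nodes and forwards a condensed summary
    per_proxy_summary = (nodes / proxies) * bytes_per_node * aggregation_ratio
    return int(proxies * per_proxy_summary)

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, "nodes:",
              central_station_load(n, 200), "bytes centralized vs",
              hierarchical_station_load(n, 200, proxies=10, aggregation_ratio=0.1),
              "bytes at the top of a 10-proxy hierarchy")
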
Functionality   Partly due to its standardized nature SNMP also imposes a rigid structure on the data format and on data access methods. Functionality of the system is clearly defined, but for most modern cases simply not in an appropriate fashion. A very simple example of this is the fact that SNMP cannot easily show us information regarding relative data error rates. It's even harder to request concise information from a complete network. With SNMP we must resort to sending low-level GET requests for each individual value to each individual node. For devices that do not have much computing power this situation is preferable. However, many networked devices have ample computational power that can be used for network management. SNMP simply does not allow differentiation between devices, while almost all networks are heterogeneous in nature.
The crux of the matter lies in the fact that, since all processing must take place at the management station, overly complex code is necessary at that site to control the entire process. From the low-level communication mechanisms central to SNMP it follows that the management station must engage in tedious micro management tasks. In [] Bhattacharjee et al. argue the case for implementing code inside the network nodes "because certain functions can be most effectively implemented with information that is only available inside the network".
We should not misidentify these functionality concerns as performance or bandwidth issues. Although the concerns sometimes coincide, the functionality argument also holds for scenarios in which performance and bandwidth are unaffected. From a performance perspective it is unimportant whether requests have to be sent to multiple nodes, since getting the complete response only takes as long as the longest link when executed in parallel. Obviously, bandwidth occupation can equally remain unchanged.
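
As a small illustration of the micro management point, the sketch below contrasts computing a derived error rate at the management station, which needs one low-level request per raw counter, with evaluating the same expression at the node and returning a single value. The counter names follow the standard interfaces subtree, but the in-memory node is a hypothetical stand-in, not an SNMP agent.

# Sketch only: the station must fetch raw counters and derive the rate itself,
# whereas an active packet could evaluate the expression at the node.

node_counters = {                      # values an agent would expose per interface
    "ifInErrors": 42,
    "ifInUcastPkts": 183211,
}

# Management-station side (traditional SNMP): two GETs, then local arithmetic.
def error_rate_at_station(get):
    errors = get("ifInErrors")
    packets = get("ifInUcastPkts")
    return errors / packets if packets else 0.0

# Active-packet side: the same expression shipped to the node, one result back.
def error_rate_in_packet(counters):
    return counters["ifInErrors"] / max(counters["ifInUcastPkts"], 1)

print(error_rate_at_station(node_counters.get))
print(error_rate_in_packet(node_counters))
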
Reliability   Reliability concerns, too, point to introducing new methods for managing network systems. A problem inherent to using a centralized scheme is that such a method introduces a single point of failure, namely the management station. The same holds for extensions of the centralized topology. Hierarchical approaches also suffer from the single point of failure problem and introduce more intermediate `choke points'.
Even if availability of the management station is so high that the previous problem does not crop up, overall network reliability can be crippled by imposing a centralized or hierarchical management topology on the network. Apart from placing a greater responsibility on individual nodes, the significance of a small number of edges in the network is also increased. Network redundancy pays off especially on occasions where a speedy reaction is important, e.g. under high bandwidth consumption. However, if the management topology itself has no access to redundant resources it will fail to resolve certain issues. For instance, errors in the route tables can remove access to parts of the network, even when there are backup physical connections. Being able to route around error spots, or right into them, can be necessary to fix such issues.
Security   Remote configuration of devices needs a trustworthy communication protocol. Malicious attacks are carried out on networked devices all the time. It is therefore highly necessary to secure the management infrastructure tightly. That this is a responsibility of an NM toolkit has only recently been recognized, with the addition of encryption and authentication to SNMP, leading to SNMPv3.
According to a survey carried out in December 2000, most administrators still use the extremely insecure versions 1 or 2c []. Why would people keep using the older versions when a secure alternative exists? From the survey we learn that lack of SNMPv3 support in existing devices is hampering adoption. Therefore we can expect the problem to go away for SNMP in time. An interesting short term solution can be found in using a secure layer below SNMP. Secure IP has been made possible by the IPSec infrastructure. IPSec is a standardization initiative managed by a taskforce of the IETF []. A study on how SNMPv2 with IPSec compares to SNMPv3 has been carried out [].
While perhaps a concern in the short term, in the long term security is not a main concern in the SNMP world. However, we decided to mention it, since securing a network management infrastructure is of vital importance. Applications rivaling SNMP must have security features in place before they can be seen as adequate challengers. Recent experiences in using IPSec to add security to insecure SNMP show promising results. The same solution can perhaps be applied to other insecure NM toolkits.
Concluding Remarks   The weaknesses of the SNMP system are well known. Therefore its status as predominant NM tool has regularly been under attack. Yet for various reasons none of the challenging technologies has been able to surpass it. Failure due to lack of industry support is not the main issue in most cases. However, in a situation where a single entity has such a strong presence as SNMP we should take this practical consideration into account. To overcome the problems inherent to SNMP, one could suggest designing a new system from scratch. Considering the enormous initial costs of having to replace existing software and even hardware, widespread adoption of such a system will be very unlikely in the near future. Therefore a more gradual transition has to be made.
First and foremost, a new network management infrastructure should be able to work together with SNMP. Any new features can then be built upon the infrastructure already in place, mixing SNMP queries where necessary with improved communication where possible. Finally, a unified interface is necessary, seamlessly integrating results obtained the old-fashioned way with those obtained with the augmented system.

3.2  Related Work

3.2.1  Overview

Since the inception of SNMP alternative technologies have tried to overcome some of its shortcomings. Already mentioned are the CMIP/CMIS infrastructure, RMON and AgentX extensions and, of course, the revisions of the platform itself.
A lesson we can learn from these technologies is that extensions to SNMP are much more easily adopted than a complete overhaul of the existing infrastructure. Research initiatives carried out in the last decade point to the same heuristic.
Hierarchical Topologies   Various types of contenders have been in the spotlight during the 1990s. First off were the hierarchical networks, of which CMIP/CMIS was the prime contender. Most of these have died a silent death. Instead of overthrowing SNMP, the idea of hierarchy was simply adopted with the launch of version 2 of the protocol in 1993. One interesting paper concerning hierarchical layouts that we'd like to point out places not the management objects but the policies in a hierarchy []. Doing so can increase automation of network management tasks, something still under investigation today.
Two-Tier Software Suites   The next big thing in network management research has been the creation of two-tier networks. Based on the insight that micro management can complicate high level tasks, alternative solutions have been sought in adding higher level languages to the low level SNMP interface. Generally, these middleware tools can be seen as an extension of the hierarchical approach, a point made very clear by the title of one paper discussing a two-tier approach: "Hierarchical Network Management: a concept and its prototype in SNMPv2" []. A number of publications introduce an object oriented middleware layer to hide the lower level details from the manager [,]. At the moment we can safely say that two-tier networks have largely failed. While management stations often have sophisticated software packages for handling SNMP micro management, no toolkit has seen wide adoption in legacy devices.
Web-Based Management   Perhaps connected to the Internet boom of the late nineties is the idea of moving management tools from specialized software packages to the World Wide Web. Web-based approaches to network management, too, have been proposed []. Often, the web-based approaches are no more than a specific instance of the two-tier examples discussed previously. Web-based management is still being considered for various tasks, for instance directory services. The Web-Based Enterprise Management initiative [] by the Distributed Management Taskforce (DMTF) seems to be widely supported. However, this initiative is mainly concerned with accounting and web technology support, not with the tasks that are handled by SNMP.
Remote Invocation   Moving processing from the management station to the remote nodes has been attempted many times. Remote code loading can result in an unsafe execution environment, therefore most implementations use domain specific scripting languages, for instance IDEAL []. Standardized in RFC 2593 by the Distributed Management Workgroup of the IETF, script MIB [] is the de facto standard for building scripting environments around SNMP. As the name suggests, script MIB adds a new subtree to the MIB that contains known scripts and allows remote invocation of these scripts. Once more SNMP has not been overthrown by rivaling technology. Instead, we've seen an incorporation of scripting technology into the base framework.
Overview   The historical progression of network management tools in the last ten years shows two strong heuristics for future research. First, there seems to be a continuing growth in flexibility of suggested paradigms. Starting from the centralized approach, we've seen hierarchical layouts, weak distribution of tasks and strongly distributed remote processing. A number of papers discuss this progression in detail and can thus serve as a thorough background introduction into the field [,]. In short, topology growth has seen the following trend:
unconnected → centralized → hierarchical → multi-tier → topology independent
The second rule of thumb we can distill from the previous overview is that environments coupled to the existing SNMP infrastructure always seem to be preferred over their autonomous counterparts. Taking the financial cost of hardware replacement into account, this is to be expected.
Before we continue with an overview of active network based solutions it is necessary to name some key terms used often in the referenced literature. Terms such as Management by Delegation, Distributed Management and Decentralized Management have been used throughout the decade to denote different network management extensions. Used to describe hierarchical solutions in the early nineties, they are still often cited to point to remote invocation schemes or intelligent agent based designs. As such, these terms give an indication of the general approach taken to network management research.
The last group of research projects we need to discuss are those that are completely topology independent. Extending the idea of remote method invocation is that of roaming processes or mobile code. In this setup, scripts are executed remotely not by direct orders from a central authority, but algorithmically. Scripts can roam more or less freely through the network and exhibit some intelligent behaviour. Again blurred by the use of independent, yet overlapping terms, such as Mobile Code, Mobile Agents, Intelligent Agents and Active Networks, this field encompasses a wide range of projects. We will refer to all such solutions, perhaps contrary to their developers' original definitions, as active networks.

3.2.2  Using Active Networks

Active networking allows for far greater flexibility than predefined data processing. This can, if implemented right, reduce the amount of network bandwidth occupied by management-related traffic. Pre-processing data on the remote hosts will increase demands for processing power, but can result in less transferred data, since aggregated results can be sent to the remote administrator. With more remote processing power becoming available, active networks are increasingly a viable alternative to existing techniques. An introductory overview of agent based management paradigms can be found in [].
Programmable packets can decrease response time by directly reacting on the remote node, as discussed in []. By using simple algebraic operations we can compute derived results at the node. These results can be directly used as input to decision making algorithms that respond locally. Especially under high network load direct response will be of use. But even under normal circumstances this can help decrease NM overhead.
ANs allow a greater degree of cooperation between nodes. Since packets can traverse the network as they see fit we can let a packet flow through a subnet, gather global instead of local results and react accordingly, all without intervention from a management station.
Unfortunately, most AN systems in place today cannot exploit the features described above. Processing costs of AN-based tools can be orders of magnitude larger than pure packet processing. SNMP code runs as a local executable, whereas active packets need to be interpreted. Systems based on high level interpreted languages, e.g. Java, simply cannot offer the performance needed in the general case to profit from these additional features. A second class of systems has dropped security constraints in favour of performance. These cannot offer the stability needed for a network management application.
Employing active networks for network management has been frequently proposed [,,,,,,,,,], yet up until now the community has only succeeded in surpassing SNMP performance in specific scenarios or for large networks [,].

3.3  Definitions

Some of the abstract terms we will use in the remainder of this work can have an ambiguous meaning. Therefore we will discuss them in detail and present concise definitions.

3.3.1  Functionality

According to the Oxford English Dictionary, functionality can be described as "of a functional character", where function is "the mode of action by which it fulfils its purpose". Increasing a system's functionality in this sense means either (1) increasing the extent to which it fulfils its purpose or (2) expanding on the purposes it is designed for. We will strive to do both. The first, increasing the efficiency of a network management environment, relates to such issues as decreasing response time and bandwidth utilization. The second goal is less well defined; we will use the term flexibility for it, for reasons discussed below.
These goals cannot be seen independently of one another, since they are not orthogonal values. Increasing a system's efficiency will naturally lead to opening up new uses for it. Similarly, extending the reach of a network management system will not only allow it to handle more tasks, but can - as we will see - give it more options for solving existing tasks, thereby possibly increasing the efficiency of the system.
Efficiency   To increase the extent to which a system fulfills its purpose is to improve upon the existing situation. SNMP's weaknesses have been mentioned previously. Especially in complex scenarios, SNMP usage can entail unnecessary micro management at the monitoring station and correspondingly poor system-wide response times. This Achilles heel reduces overall functionality, since it makes certain tasks impractical to carry out.
Flexibility   Flexibility can be defined as the degree to which a system can adapt to its surroundings. A flexible system, therefore, can more easily adapt to a new environment than a rigid one. As such, it can expand beyond those environments for which it was originally devised. It can expand beyond the purposes for which it was designed.
In our case this boils down to (1) how well a NM environment can adopt the logical topology of the underlying network and (2) how well a solution can be tailored to a problem at hand within a certain framework.
Topology adoption in SNMP is restricted to either a basic client/server structure or a hierarchy of proxy servers. Improving on this using an active network is elementary considering the programmability of network traversal on the nodes.
Adapting to a specific task will most likely also be straightforward thanks to the programmability of the individual solutions, i.e. the packets. In short, it is precisely the flexibility argument which points us in the direction of using active networks. Flexibility in itself is not a virtue, however. We should utilize the flexibility sported by active networks to execute network management tasks that cannot be executed using SNMP.

3.3.2  Performance

Performance is the speed at which tasks are executed on average. As similar attempts have been frustrated by a severe performance penalty on basic tasks, it is imperative that we optimize for low-level responses as well as for complex scenarios. A twofold solution to this problem has been chosen. Firstly, we rely on an active networking environment that has been proven to handle other simple tasks as well as traditional software does. Secondly, and in line with the interoperability argument discussed in 3.2.1, we have no intention to replace SNMP but mean to cooperate with the existing system.

3.4  Contribution

In theory, active networks can surpass SNMP both in terms of performance and functionality. Increased processing cost has so far held back successful introduction of an AN based system.
Recent advances in active networking technology have reduced the overhead they incur dramatically, while maintaining the necessary safety measures. Since this new class of active networking environments will not necessarily suffer from the performance penalty inherent to earlier systems it should now be possible to provide a general alternative to SNMP.
It is our thesis that this new class of active networks can, contrary to popular views, help to increase network management functionality without incurring a performance penalty on basic operations.
To improve on the existing functionality an alternative must decrease system response time, decrease network bandwidth consumption and allow far greater flexibility than is achievable using SNMP. At the same time, basic operations must not suffer a severe performance penalty, as was the case with previous attempts.
We will implement an AN-based network management application by combining a traditional SNMP interpreter and an active network interface. Experiments will be carried out to show that the AN interface is superior to SNMP in terms of functionality and equivalent regarding performance. By combining the existing infrastructure with an augmented interface, interoperability concerns are also addressed.

3.5  Reliability and Security Concerns

In the introduction we stated that a network management tool should be secure, reliable, efficient and flexible. The latter two terms have been discussed and will recur often. By focusing specifically on increasing functionality we have chosen to disregard reliability and security concerns. These concerns, for instance safety of execution, robustness, stability and authentication will be mentioned, but are not our primary concerns. Where appropriate, these terms are used as indications of valuable production-ready features, not as elements of our research.

The terms introduced above can serve as testing guidelines, but are too vague to serve as a direct basis for comparing the merits of network management environments. In the following part we will introduce a software design that - in theory - could help us attain our goals. Also, test scenarios that translate the discussed terms into quantifiable values will be introduced.

Part 2
Design


Chapter 4
Software Design

The goals set forward can lead to conflicting requirements. The design of the software should be such that an acceptable trade-off between opposite values is made. We will outline the design below. Decisions will be discussed in relation to the goals that must be satisfied.

4.1  Dual Interface

Interoperability with SNMP is one of our primary concerns. Performance is another. Creating a system which fulfils both goals at the same time poses some problems. Walter Eaves of University College London has created an interface between an AN and SNMP using Inter Process Communication []. Doing so improves interoperability, but the IPC adds considerable latency.
Instead, we opt for combining the active networking and SNMP interfaces into the same executable. This should give the same benefits, but without the performance penalty. The result is shown in figure .
Figure 4.1: A framework design: a common MIB back-end services requests from both SNMP and mobile agent front-ends.
The resulting application will be able to act as a drop-in replacement for the traditional SNMP server. Additional constraints are necessary to ensure that AN behaviour does not interfere with SNMP handling. It is essential that the SNMP interface can listen for connections concurrently with the AN handler. This can be accomplished either by implementing the software using multiple threads of execution or by merging the connection handling routines. Practical limitations have led to the latter; details will be discussed in section .
Merging the two interfaces into a single application must not alter their original behaviour in any way. The original systems will undergo changes over time. It is therefore important to separate the two systems as much as possible. Practically, this boils down to only sharing the connection handler between the two platforms. All other code must remain independent. Selecting which connection handler to keep is a practical issue and therefore implementation dependent.

4.2  Common Back-end

Because both interfaces need access to the same information it is only logical to share the data repository. As discussed, SNMP relies on the standardized Management Information Base. Instead of creating additional access methods for the active networking environment we opt for combining the back-end access methods similarly to how we combine the front-end connection handler.
Giving both interfaces access to the exact same information repository has some additional benefits. By doing so we can compare the two systems solely on communication metrics. Therefore our claims are necessarily data independent. Furthermore, configuration changes made with one system are automatically and instantaneously visible to the other. This is highly beneficial for interoperation with existing systems.

4.3  Modules

The separation of concerns discussed in the previous section naturally leads to a modular system design approach. By separating the functional parts from each other, updates to the individual parts will most likely be less disruptive to the whole. Inspection of the task list suggests dividing the system into the following modules:
  • a connection handler
  • an SNMP request handler
  • an active network packet handler
  • a connection between SNMP and the MIB
  • a connection between the AN and the MIB
  • the Management Information Base software
Depending on the software we select as our reference SNMP system some of these may already be implemented. The necessary extensions will have to be inserted somewhere into the available package. Precisely two connection points are of concern, namely the integration of the handlers and the integration of the MIB access methods. How this is to be accomplished is not a design issue but an implementation issue and will be discussed in section .

4.4  Client Application

On the monitoring station a last piece of software will be needed. A client application has to send data to either an SNMP aware server or to a node in the active network. For this purpose two separate applications with identical behaviour can be used, hiding the communication taking place in the background. Merging the two into a single application can be a next step for productivity, but is not strictly necessary for our research.
For the SNMP client a standard application can be selected. SNMP requests are normally sent over UDP directly to the server. In essence, an AN client could be constructed to follow the same principle. Doing so would ensure the shortest delay times. Instead, we will only use UDP communication between the client and a local server. From there on, server-to-server communication relying on active packets takes over. Results are sent back to the client using the reverse procedure. From a performance point of view it might be wiser to use direct communication between client and server. However, in doing so we feel we would be simplifying the test environment to such an extent that it might not even be classified as an active network. For the claims set forward it is enough to prove that performance of an AN-based tool is equal to that of SNMP. We will see in the experimental chapters that even with the selected, suboptimal solution comparable results can be obtained.
In line with the quality concerns outlined in section both client applications will have to be able to process multiple identical requests and report extensive benchmark results.

The identification of a client application wraps up our design of a network management system. In the following chapter we will present test cases that can be used to compare an implementation of this design with a reference SNMP application. Thereafter one such implementation will be introduced.

Chapter 5
Test Design

In this chapter situations will be outlined that can serve as benchmarks for proving the claims set forward in section 3.4. Applicability of each situation in relation to the goals will be discussed. Furthermore, a suitable network topology for testing will be displayed.

5.1  Quantifying our Claims

In section 3.4 the two goals underlying this thesis and methods for achieving them have been mentioned. To obtain quantifiable results it is imperative that for each method an accompanying metric is found.
Functionality   Recall that we split the functionality argument in two: efficiency and flexibility. We will now have to find case studies for both values. Starting off with efficiency, metrics have to be found for efficient processing. Two measures for processing overhead can be identified: consumption of processing resources and consumption of network bandwidth. The first can be quantified by counting used processor cycles for abstract cases or by simply keeping time in a more real world scenario. For the second, bandwidth calculations can be carried out using only pen and paper if the traversal of the data packets is known a priori, or by reading the traffic from the network. In both cases we will opt for an experimental approach, selecting round-trip time in microseconds and actual traffic in bytes.
In the next section we will make the case for a set of scenarios that should mimic real world practices as closely as possible. If the model we select for benchmarking resembles actual use closely enough, performance and bandwidth measures should suffice to make the case for efficiency.
Flexibility, the degree to which a system can adapt to its surroundings, is difficult to translate into a metric. Instead, reasoned case studies are given to show that: (1) increasing flexibility has a positive effect on efficiency and (2) active network based systems are inherently more flexible than SNMP. Examples are given of recurring network management tasks that cannot be handled by SNMP due to its rigid structure.
Performance   Compared to traditional software, AN-based tools will always incur additional processing. Also, tasks that can be handled by issuing a single SNMP request do not benefit from the increase in functionality active networks can offer. From these two observations it follows that an AN-based management tool will naturally be less efficient for simple tasks. Our goal is not to outperform SNMP for these low-level requests, but to minimize the penalty. For this, benchmark timing results are necessary, comparing SNMP with an augmented system under simple low-level request handling scenarios. Comparisons will be made on the round-trip time in microseconds.
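As an illustration of how such round-trip times can be collected, the sketch below times a single request in microseconds using gettimeofday. The routines send_request and recv_response are placeholders for the actual client code and are not part of any existing package.

#include <sys/time.h>

/* Hypothetical client routines; placeholders for the code discussed in section 4.4. */
extern void send_request(void);
extern void recv_response(void);

/* Return the round-trip time of one request in microseconds. */
static long time_single_request(void)
{
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    send_request();
    recv_response();
    gettimeofday(&stop, NULL);

    return (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}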

5.2  Test Selection

5.2.1  From Claim to Scenario

The thesis statement defines externally observable goals. Testing performance can be carried out in a `clean room' environment, but doing so makes no sense when analyzing functionality. In the experiments the systems should therefore be viewed as black boxes. Real world behaviour is our primary concern.
In order to measure the relative functionality of a network management tool we must identify the various classes of tasks it must be able to perform. Subsequently, experiments are to be carried out that compare SNMP functionality with the functionality of the active network based environment.

5.2.2  Functionality Tests

At this point it should be mentioned that every class enumeration can be debated. The FCAPS model introduced in 2.1.1 could, for example, be used to identify different classes of scenarios based on execution domain. Instead, complexity will be used to partition the use space. This decision stems from the observation that, since each variable is stored and retrieved in the same manner under SNMP, partitioning into execution-domain-based classes does not necessarily divide the method space. Executing the same request for multiple types of variables will not show more interesting results than running the request once. Selecting test cases based on their complexity will allow us to compare multiple methods of communication and, if designed well, span the entire field of network management tasks.
Tasks can be divided into levels of complexity based on:
  • the need for postprocessing, i.e. computation of derived results
  • the number of actors involved; are we dealing with a single client/server request, a number of distinct servers or a distributed problem?
  • the type of actions performed based on obtained results
From this list a set of scenarios can be derived that demand increasingly more complex operations for successful execution. This gives:
  1. requesting directly available information from a single node; no response
  2. requesting derived information from a single node; no response
  3. requesting directly available information from multiple nodes; no response
  4. requesting derived information from multiple nodes; no response
  5. requesting information derived from data spread across multiple nodes; no response
  6. requesting information derived from data spread across multiple nodes; executing additional actions based on these results
Since scenarios 3 and 4 consist of no more than the resending of the requests from scenarios 1 and 2, their metrics can be computed from those scenarios. Performance can be reduced to the performance of the slowest connection by executing requests concurrently, while network bandwidth will simply be the sum of all individual connections. Therefore these scenarios will not be explicitly carried out.
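For completeness, this reduction can be written as a simple formula. With $T_k$ denoting the response time and $B_k$ the network bandwidth of scenario $k$, and $i$ ranging over the $n$ queried nodes (notation introduced here for illustration only):

T_3 = \max_{1 \le i \le n} T_1(i), \qquad B_3 = \sum_{i=1}^{n} B_1(i)

and analogously for scenario 4 in terms of scenario 2.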

5.2.3  Performance Tests

The previous set of tests cannot fully expose the performance penalty under basic operations. Only scenario 1 deals with low-level requests. Since SNMP has multiple types of requests and can bundle multiple requests in a single PDU we have to expand the set of basic tests. For clarity these tests will be kept separate from the others: the previous set deals mostly with the functionality of the tools, while this set focuses on performance. The set of experiments is subdivided into these two categories as well. Naturally, scenario 1 will not be reproduced in the functionality tests since it already features extensively in the performance tests. To keep with the increasing order of complexity, this also fixes the order in which the tests are carried out.
SNMP requests mostly deal with retrieving information using either the GET, GETNEXT or GETBULK requests. Furthermore variables can be set using SET requests and traps can be created, allowing a node to send a result only when a predefined threshold has been reached. An application must be able to handle all GET and SET requests and, indirectly, also GETNEXT and GETBULK requests. Since our interpreter will be a stateless device we will not handle traps.
Ideally, a comparison could be made on a weighted set of operations mimicking a dynamic SNMP workload. Unfortunately, no figures are available on the relative frequency of the request types. In vague terms we already made the statement that GET requests are most probably more abundant than SET requests. This is intuitively clear, since SETs only occur when conditions change, while GETs are usually executed at fixed intervals to identify the current conditions. Most of the time no changes will occur, therefore most of the time a number of GETs will be sent before a single SET is executed. Since no figures on relative usage can be found we will make no statements regarding combined performance gains. Instead, we will discuss each request type as an individual test case. To obtain a wide range of statistics the following requests have been selected.
  1. a single GET request (GET1)
  2. a single SET request (SET1)
  3. a request containing five GETs (GET5)
  4. a request containing one SET and one GET (GETSET)
The GET1 and SET1 queries have been chosen for obtaining standard benchmark results. Executing the GETSET and GET5 instructions in addition can reveal trends not visible from a single request, e.g. a performance increase from bundling requests. We have selected these and not GETBULK or GETNEXT instructions because they can be more naturally compared to the single GET1 and SET1 instructions.

5.3  Performance Breakdown

Issuing multiple requests increases the possibility for trend analysis based on subtasks in the communication process. Knowledge of the performance of these subtasks is of vital importance for optimizing system throughput and latency. As an extension to the previous tests we will discuss a breakdown of a single request transfer for various network distances. Since we are only concerned with optimizing AN performance the SNMP case will not be dealt with. Performance statistics for the individual subtasks will be displayed and analyzed. The knowledge we obtain here can be put to good use directly when optimizing the functionality tests.
The communication process design was discussed in section 4.4. Based on this, a single request procedure can be broken down into the following subtasks:
in the client application:
  • preprocessing: data structure preparation and connection build-up
  • delay: the waiting time between sending the request and retrieving the response
  • postprocessing: response processing and connection tear-down
on the intermediate node:
  • preprocessing: translation from UDP request to AN packets
  • delay: the waiting time between sending the request and retrieving the response
  • postprocessing: translation from AN packets to UDP datagram
on the server node:
  • preprocessing: execution of pre-SNMP code and translation into PDU
  • delay: waiting for the MIB access handler
  • postprocessing: execution of post SNMP code

5.4  Network model

As we are concerned with testing flexibility it is imperative that the testing takes place in a network on which multiple topologies can be modelled. This rules out simple layouts consisting of solely linear or hierarchical connections. Instead we selected a honeycomb-like structure. The precise layout is depicted in figure .
Figure 5.1: Network topology used for experiments.
During the various stages of the test process the network will be used to model different types of connections. For the performance tests only direct client to server connections are of interest to us. In order to distinguish network latency from per node processing time all tests will be carried out on servers at varying distances from the client. To be precise, a linear list of nodes will be selected in the network to which all requests will be sent. The final results are closely related to the number of intermediate connections. From now on this will be referred to as the `hopcount'.
When testing functionality, network complexity scales with the increase in scenario complexity. Linear networks are no longer sufficient in this case. Especially the distributed problems have to be carried out on networks containing multiple routes from one node to another. A honeycomb-like network configuration allows packets to choose a route through the network from a large set of applicable routes. More specifically, there are no single points of failure, or choke points, to be found in the center of such a network. Choke points would limit the option space, since all packets would have to route through them, and therefore reduce the quality of the tests.

5.5  Quality of Results

Due to fluctuations in network latency, processor scheduling and other unmanageable issues the outcome of a single test can become skewed. To rule out incidental effects as much as possible averaging can be applied to a number of identical runs.
Computing an average can be accomplished in various ways. The mean is computed by taking the sum of all values and dividing this by the number of values. An objection to taking the mean is that it does not really suppress erroneous results in the input set. Another averaging method employed regularly in scientific work is taking the median. The median is defined as the middle element of the (sorted) input set. It can be thought of as the discrete counterpart of the peak of a normal distribution function. Stochastic distributions are defined not only by the location of their peak, but equally by their spread. In terms of median calculation we will give an estimate of the spread by calculating the accompanying `quartiles'. Quartiles (Q1 and Q3) are defined as the medians of the sets that are created by splitting the input at the original median (Q2). As such they give an indication of the spread of the distribution.
When running performance tests the median will be used as the value to be minimized. All results displayed will be, unless otherwise specified, the median of 101 runs. The stability of the test environment can be read from the offset between the quartiles and the median. Therefore the quartile offset for our test application is not allowed to grow beyond that of its SNMP reference counterpart.
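As a sketch of this post-processing step (assuming the raw timings have been collected into an array; function names are illustrative and not part of the test applications), the median and quartiles can be computed as follows:

#include <stdlib.h>

/* qsort comparison function for doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of the sorted array segment [lo, hi). */
static double median_of(const double *v, size_t lo, size_t hi)
{
    size_t n = hi - lo;
    return (n % 2) ? v[lo + n / 2]
                   : (v[lo + n / 2 - 1] + v[lo + n / 2]) / 2.0;
}

/* Compute Q1, Q2 (the median) and Q3 of n measurements, e.g. n = 101.
 * Q1 and Q3 are the medians of the halves below and above Q2. */
static void quartiles(double *v, size_t n, double *q1, double *q2, double *q3)
{
    qsort(v, n, sizeof(double), cmp_double);
    *q2 = median_of(v, 0, n);
    *q1 = median_of(v, 0, n / 2);        /* lower half, excluding the median */
    *q3 = median_of(v, n - n / 2, n);    /* upper half, excluding the median */
}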

The presented tests can be used to compare network management tools. The next part discusses how the AN-based test application was implemented. How well the specific implementation compares with a reference system will be handled thereafter.

Part 3
Implementation


Chapter 6
Preliminaries

The previous chapter introduced a number of hardware and software features necessary for implementing the active network management environment. We intentionally delayed discussing implementational issues in-depth until after the design.
In the following three sections we will discuss the details of the software packages. To start off, we will identify all prerequisites and select hard- and software based on these. The next sections will then delve deeper into the selected packages' implementation and explain our alterations to them.

6.1  Software Selection

Building a complete environment that fulfills all our goals from the ground up is a cumbersome and time-consuming process. Due largely to the abundance of freely available open source software packages it is most probably also an unnecessary one. While there does not currently exist a system capable of performing all the tasks we defined, there are packages that can relieve us of part of the work. We also need to select an appropriate reference SNMP system to compare our results with.

6.1.1  The net-snmp Package

To obtain useful experimental results it will be imperative that we select as reference system an SNMP package that is widely in use. For reasons discussed in section 4.2, this same system should also serve as the SNMP back-end of our active network. Therefore an open source solution is to be preferred.
The predominant SNMP package on UNIX systems, especially open source platforms such as FreeBSD and Linux, is the net-snmp[] package originally developed at the University of California at Davis. Previous versions of this package were named ucd-SNMP for obvious reasons and are still in use today. Aside from being widely supported and actively developed, the net-snmp package's source code is distributed under a BSD-like open source license. This allows us to fully inspect and alter the code where necessary.
In line with the argument given in 4.1 we have minimized code sharing between this package and our extensions. No more than thirty lines of code had to be inserted into the main SNMP agent codebase to make it accept additional packets. Separating the concerns in this way has already proven its use, since during the development phase multiple updates to the SNMP package have been incorporated into the source tree. We stopped incorporating updates once the first tests were run, to ensure that all results remain comparable. The final version of net-snmp we incorporated was version 5.0.6.

6.1.2  The SNAP Package

With the SNMP basis in place all we need now is an active networking environment that complies with our demands. It must be able to process requests at high speeds, since one of our goals is to handle network management tasks in the same time as can be achieved with SNMP. Section 2.2.4 outlined a number of active networks. Most of these can be directly eliminated from our selection list based on the performance requirement. Of those left, most have not been designed to enforce safe execution, another feature highly important for a network management application.
The active network we have used suffers neither from performance nor from safety issues. Safe and Nimble Active Packets, SNAP for short, has proven to be able to handle general networking tasks in roughly the same time as standard packages. Basic networking has been compared by implementing a ping request []. An introductory test of network management was undertaken by comparing the latency of ping, SNAP and SNMP in a Distributed Denial of Service use case []. Put together, the results of these two tests suggest that we can accomplish our goals with the help of SNAP.
Our work has been based on version 1.1 of the SNAP interpreter. At the time of writing this package was not freely available on the Internet. However, an earlier version can be found on the SNAP website []. In contrast to our approach with net-snmp, we did alter much of the codebase of the SNAP package. Combining the two applications in a single executable meant we had to reimplement one as a library. Being less actively maintained, SNAP was selected to be repackaged. The final layout of the modules is depicted in figure .
Figure 6.1: Splash module layout
So far we have made a number of claims about the safety and performance of SNAP. From its design follow characteristics not found, at least not together, in other active networking environments. Understanding how SNAP implements the interpreter and how programs are encoded into packets is necessary if we want to build a speed optimized NM package on top of it. These and other implementational issues related to SNAP will be discussed in detail in section .
The original interpreter was in some aspects too limited to meet our demands. Therefore we created a new codebranch and altered several important aspects, e.g. the service library interface and client interface. These changes to the original design we will discuss in section .

6.2  Hardware Selection

Finally, an appropriate network had to be selected. For the tests we obtained access to a cluster of twenty 200 MHz Pentium Pros running Redhat Linux 7.1 with kernel version 2.4.7. The nodes are interconnected in a multiply redundant honeycomb-like structure of which the layout is shown in figure 5.1. Communication with the outside world was limited to secure shell login. This should suppress most superfluous bandwidth usage. On the local nodes, the dynamic route configuration daemon routed was the only other bandwidth occupying service.
All nodes have been configured identically. Each node accepts both vanilla SNMP requests and SNAP packets. Both handlers can be started as independent daemons. The SNMP daemon, snmpd, can also be configured at runtime to accept SNAP packets.

The brief overview of our framework given here hides many of its intricacies. SNMP has been discussed in detail in section 2.1.1. To complement this, the next section will feature the SNAP active networking environment. Design goals and implementational features will be discussed and an example program will be shown.

Chapter 7
An in-depth look at the SNAP Active Networking Environment

Safe and Nimble Active Packets (SNAP) is an active network designed around three goals: safety of execution, efficiency in resource utilization and flexibility of the platform. These three goals make it a perfect candidate for basing our research on. Before we employ the system in the field of network management we first need to understand its inner workings. In this section we will show how SNAP handles packets and what its design implies in the current context. To acquaint the reader with the field a simple example is given.

7.1  Language

In section 2.2 active networks have been introduced. As any AN, SNAP consists of a language specification and a software package capable of interpreting programs adhering to the language. Implementation of the software will be discussed later on. First we will give an overview of the SNAP language.
Like many other interpreted languages, e.g. Java bytecode, the SNAP language consists of simple assembly-like instructions. The main advantage of having a low level language is the relatively high processing speed obtainable. Any drawbacks, most notably development difficulties, can be overcome by using a higher level language and compiling its code to SNAP bytecode. A PLAN to SNAP compiler [] has been created for this purpose.
Language Constructs   Although SNAP code resembles machine level assembly language, it was developed especially for interpreting network packets. Therefore it has a number of distinguishing features. Most prominent is the lack of backward jumps. As mentioned previously, SNAP has been designed with safety of execution in mind. One of the methods used to reach this goal was to create so-called linear execution time. Since SNAP does not allow jumping to previously executed code - and thus lacks constructs for expressing unlimited loops - execution of a SNAP packet will take at most a fixed amount of time that scales linearly with the number of instructions in the packet. Through limiting the expressiveness of the language in this way SNAP can guarantee safe execution without the need for dynamic runtime checks or other CPU intensive overhead. However, one can easily see that by resending a packet to the same computer we can emulate backward jumps, since execution would start all over again. This and other side effects of network programming are dealt with by using a special construct: the resource bound.
Resource Bound   A resource bound, or limit on the amount of resources a packet may consume, is implemented as a simple counter inside the packet. All instructions that are not intrinsically execution safe consume an amount of this resource bound. Once a packet has used up all of its resource bound it is dropped by the interpreter. Currently resource consumption is limited to packet sending, but the backward jump example above shows that other language extending constructs could also be implemented in this fashion.
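Conceptually, the mechanism amounts to a counter check guarding every resource-consuming instruction. The following fragment is an illustrative sketch only; the field and function names are invented and do not mirror the actual SNAP-ee source:

/* Illustrative only: a packet with a resource-bound counter. */
struct packet {
    int resource_bound;          /* decremented by resource-consuming instructions */
    /* ... code, stack and heap would follow ... */
};

/* Called before executing e.g. a send instruction.  Returns 0 when the
 * packet has exhausted its budget and must be dropped by the interpreter. */
static int charge_resource(struct packet *p, int cost)
{
    if (p->resource_bound < cost)
        return 0;                /* budget exhausted: drop the packet */
    p->resource_bound -= cost;
    return 1;                    /* instruction may proceed */
}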
Instruction Set   By having low level instructions a relatively small set of instructions can be used to perform a wide array of operations. SNAP consists of a base instruction set containing operators for control flow, stack manipulation, heap manipulation, relational comparison, basic arithmetic, network processing and packet inspection. A complete overview of the base instruction set in SNAP v1.0 can be found in Appendix A of [], an updated reference is located on the project website [].
Contributing to the flexibility of the environment, another key point, is the inclusion of a service infrastructure. Operations not part of the core language can be added to the system as services. Our SNMP connection is a prime example of such an extending service. Services can also increase processing speed, since often used high-level operations can be compiled into machine code and called as a service. This can reduce the necessary number of SNAP instructions, thus decreasing packet size and increasing processing speed.
Data Access   Accessing data from network packets poses new problems. Since packets are designed to travel from node to node, direct memory access cannot be used. Instead, datastructures are encapsulated inside the packet itself. Doing so limits the maximum size of the data, since packets may not exceed the network's Maximum Transfer Unit size.
The main data access mechanism is a simple stack. Contrary to machine dependent stacks, however, the SNAP stack allows elements of different datatypes and lengths to be processed identically. The SNAP core language has support for integer, floating point, IP address and character string basetypes. Mimicking machine hardware, SNAP packets also contain a heap-like structure. However, this heap cannot be addressed directly. The heap merely serves as a storage medium accessed through pointers on the stack.
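To make this concrete, a tagged stack element supporting the four base types could look roughly as follows. The layout and names are assumptions made for illustration and are not the actual SNAP-ee data structures:

#include <stdint.h>
#include <netinet/in.h>

enum value_type { VAL_INT, VAL_FLOAT, VAL_ADDR, VAL_STR };

/* One element of the packet stack: a type tag plus the value itself.
 * Strings live on the packet heap and are referenced by an offset. */
struct stack_value {
    enum value_type type;
    union {
        int32_t        i;         /* integer */
        float          f;         /* floating point */
        struct in_addr addr;      /* IP address */
        uint16_t       heap_off;  /* offset of a string on the packet heap */
    } u;
};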
In the core implementation, SNAP packets cannot use any storage medium on the executing nodes. In addition to the aforementioned CPU utilization, memory overflow is another cause of system instability. Limiting a packet's access to the executing machine's memory therefore increases SNAP's innate safety.

7.2  Example program

We will now discuss a simple SNAP coded example program. The program, shown in figure , is used to retrieve an SNMP value from a remote location. A high-level overview of the program operations identifies three tasks: network travel, request handling and result delivery.
; part 1: travel
forw
bne athome-pc

; part 2: call SNMP
push "sysName.0"
calls "snmp_getsingle"

; part 3: reverse direction
getsrc
ishere
bne athome-pc

push 1
getsrc
forwto

; part 4: return results
athome:
push 7777
demux

#data 0

Figure 7.1: Example SNAP Packet : an SNMP GET request
Network Travel   Firstly, the packet has to travel from the source, the monitoring agent, to the destination. Secondly, after completion of its remote task it must travel back to the source to deliver the results. In this elementary example travel through the network is expressed using the two instructions in part 1. The forw instruction tells the interpreter at the executing node to compare the current location with the destination and resend the packet on an applicable network interface when they do not match. If the two do match the next instruction is executed.
The second instruction, bne athome-pc executes a branch on not equal operation on the top stack element. This is part of the elementary control flow we implemented in the code. A packet has to carry out a return-trip. To distinguish between the two paths the top stack element is used. On its initial route to the destination the top element carries a zero. This value is initialized at compile-time by the last line in the program: #data 0. At the destination the element has to be swapped for a 1 to express that the packet is going home. The bne instruction will therefore branch when the value is 1, i.e. when it is back home. The number of instructions it will jump over is athome-pc, or an offset calculated from the distance between label athome: and the current instruction (pc is an abbreviation of program counter).
The swapping of zero to one is handled in part 3. The first three lines retrieve the source field from the packet and compare it with the current location. If they match, the source and destination are the same node and we can jump directly to part 4. Otherwise a one is pushed onto the stack, the source field is retrieved and the packet is forwarded to the source.
Request Handling   Part 2 contains the request handling. In this example, request handling is limited to a single service call for brevity and clarity. Note, however, that any arbitrarily complex operation can be executed here. The push operation pushes a string onto the stack. This string is then read by the subsequent operation. The calls instruction searches for a service, in this case snmp_getsingle, and executes the accompanying compiled function. It first needs to convert stack elements into arguments. After the call ends, it must convert the return values back into stack elements. After executing this call the top stack element should contain a string representation of the sysName.0 object.
Result Delivery   Finally, when the packet is back at the source node the SNMP result has to be transferred to the client. A demux, short for demultiplex, instruction is executed for this. The instruction takes as arguments the top two stack values, which must contain the receiving port number and the return value, respectively. The preceding instruction, push 7777, adds the port number to satisfy this precondition. Notice that we do not explicitly remove the control flow value, 1, from the stack. This has been taken care of by the bne instruction.
After a successful run the client should have received the destination node's sysName.0 value.
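To make the role of the calls instruction more concrete, the sketch below shows a name-based service lookup with the described argument conversion. The table layout and function names are hypothetical; the actual, revised service infrastructure is discussed in section .

#include <string.h>

/* Hypothetical service table entry: a name plus a compiled C function
 * that takes and returns string values exchanged with the packet stack. */
struct service {
    const char *name;
    const char *(*fn)(const char *arg);
};

/* Assumed to be provided by the SNMP glue code; signature is illustrative. */
extern const char *snmp_getsingle(const char *oid);

static const struct service services[] = {
    { "snmp_getsingle", snmp_getsingle },
    { NULL, NULL }
};

/* Sketch of 'calls': look the service up by name, pass it the top stack
 * element as argument, and hand the result back to be pushed on the stack. */
static int do_calls(const char *svc_name, const char *stack_top, const char **result)
{
    const struct service *s;

    for (s = services; s->name != NULL; s++) {
        if (strcmp(s->name, svc_name) == 0) {
            *result = s->fn(stack_top);
            return 0;
        }
    }
    return -1;    /* unknown service */
}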
Advanced Example   To show the strength of using an active network we will now look at a more interesting scenario. Suppose we want the names of all systems on the way to the destination: a traceroute. Using SNMP this would entail sending the same request to all hosts. Even standard ICMP based traceroute sends multiple requests. Instead, using SNAP we can modify the previous example packet slightly to obtain the same result in a single go.
Recall that the forw instruction stops execution and resends a packet to its destination when it is not already there. If we were to move this instruction to below the service call, the preceding three instructions would in effect be executed at every intermediate hop. The only problem is that the branch instruction removes the zero at each run. Therefore we need to insert a push 0 between the service call and the forwarding command. To deal with the same problem on the way home we must implement a similar extra check. Inserting
getdest
ishere
bne athome2
push 1
forw

athome2:

after the athome: label should do the trick. We simply add a new label and check if the current location is equal to the destination. If it isn't, we forward the packet. Otherwise the result delivery process is handled as previously.
From these two examples it may seem as if control flow takes up most of the packet code. However, this is only because little actual program logic is executed here, while the control flow in this instance is relatively awkward to code. Other examples, discussed in section , will show that this is not the general case.

7.3  Practicality Framework

Having handled the basics of SNAP it is time to assess SNAP's distinctive qualities. SNAP is optimized for safety, flexibility and efficiency. We will discuss each of these in turn.
Safety   Safety should be a key concern to any networked application. SNAP has been developed with the following design goals concerning safety:
SNAP packets should not be able to subvert or crash a node (robustness); SNAP packets should not be able to directly interfere with other packets without permission (isolation); and SNAP packets' resource usage should be predictable, both for individual packet executions as well as globally across multiple nodes (resource predictability) [].
A detailed overview of these features can be found in the cited work. Since safety is especially of concern when dealing with network management we will briefly go into each of the mentioned subgoals.
Robustness of the system is assured by removing operations that are a potential hazard to the stability of the system. Accessing the actual computer system underlying a running SNAP interpreter is not possible using the core instruction set. Memory is shielded from the packets since they can only alter values inside the packet. Denying access to the CPU in the same manner would present an unworkable situation. Instead of removing access, it is closely guarded. A worst case estimate of CPU cost can be calculated a priori by reviewing packet size. Packets that are deemed too large can be dropped by the interpreter. Limiting access to system hardware in this manner guarantees safety of packet execution, regardless of the actual code it carries.
Isolation is guaranteed because packets cannot communicate without the help of additional services. By incorporating applicable services communication can be allowed, but for security reasons this is not part of the base package.
A strict maximum on resource consumption and the lack of backward jumps enforce resource predictability. Upon arrival of a packet a decision can be made to execute or drop the packet based on guaranteed behaviour.
Flexibility   An active network is only useful if it can be applied quickly and easily to specific use cases. SNAP consists of two mechanisms for handling operations. Firstly, the core set of instructions is extremely low level. This allows complex procedures to be built on top of the system. Due to its confined nature it cannot, however, natively express all known types of algorithms, i.e. the language is not Turing complete. A complete formal proof of this is left to the reader, but the result becomes plausible when we compare our language with a set of other languages. In [], Douglas R. Hofstadter identifies the fine line between a Turing complete and an incomplete language (Bloop and Floop). In essence, SNAP can be made to accept Bloop programs, but not Floop programs. The difference between Bloop and Floop is the addition of unbounded loops. SNAP could only accept Floop programs if it accepted unbounded loops; however, for this a SNAP program would need an infinite resource bound. Restricting the resource bound thus directly restricts the language to handling fixed-length algorithms. An extended version of this argument can be found in [].
The flexibility of SNAP is not only limited by the fact that it cannot handle all mathematical problems; it is equally restricted by the fact that little interaction with the execution platform is possible. However, these shortcomings are necessities for guaranteeing safety. If a user is willing to sacrifice some degree of safety these problems can easily be overcome. The service infrastructure allows arbitrarily complex programs written in general purpose programming languages to be accessed from the SNAP interpreter. Hypothetically, one can even use SNAP solely as a tunnel to another environment, thus allowing the same level of expressiveness as is possible under any other existing execution environment. An interesting approach has been taken by Kind et al [], who rewrote the resource bound implementation to incorporate safe execution of loops.
The degree to which flexibility is traded off for safety is an issue that can be dealt with on a case by case basis. Also, many active networking applications do not need language constructs that go beyond what is possible in SNAP. The program shown in figure 7.1 can serve as an example of this statement. More examples are given in [].
Efficiency   Efficiency in the case of active networking boils down to execution overhead. Active networks necessarily introduce increased overhead over traditional networks. To be practical, however, they should be able to handle Internet Protocol like functionality at IP-like performance. Proof that SNAP is relatively efficient has been given in [], where SNAP was compared to ICMP ping. Efficiency of SNAP in the network management domain will be discussed in section .

7.4  Interpreter

The software package used to handle SNAP packets is essentially an interpreter. Currently, an implementation exists only for the Linux operating system. To distinguish between SNAP programs and the software package on which they execute we will refer to the latter as the execution environment or SNAP-ee. The SNAP-ee is built around the interpreter, a large switch statement that carries out the operations defined in the SNAP core instruction set on the packet's datastructures. To help fulfill its tasks additional code is necessary. Most notably, the SNAP-ee contains interfaces to the SNAP network, the client and the services.
Networking   The SNAP network interface serves as the access point to other SNAP daemons. It is not used for communication with client applications; for that purpose IPC is used instead. The specifics will be explained in detail in the next section. Inter-server communication is carried out over the Internet Protocol. Just as UDP and TCP, the SNAP network protocol (SNAP-np) is positioned directly on top of what is called raw IP, the level in the TCP/IP stack that handles wide area networking, but lacks support for reliable connection oriented networking. Similarly to UDP, SNAP-np adds only a thin layer on top of raw networking. Support for connections, transport safety and receive order preservation are not part of the protocol. Implementing these features is not necessary since SNAP-np is only used for sending individual packets. SNAP-np packets can be distinguished from other types of data by the protocol number encoded in their IP header and by the use of the IPv4 router alert option. Packets with the router alert bit set can be taken out of the general queue by the kernel for additional processing in a user level application. Several different methods have been tried for filtering out SNAP-np packets. The pure SNAP implementation reads all incoming packets in the user space application and filters out the appropriate packets at that location. This incurs a large performance penalty, probably mostly due to the large number of user-space/kernel-space context switches necessary and the relatively small number of SNAP packets in a normal network datastream. For our tests we have altered this setup considerably. A detailed description of the alterations can be found in section .
The SNAP-np packet, stripped of the IP header, consists of a twelve byte header containing information about the packet's version, currently unused flags, the program counter, the source port and the lengths of the three main datastructures, i.e. code, heap and stack. The all-important resource bound value is computed on the fly from the IP header's Time To Live field. The rest of the datagram is filled with code, heap data and stack data. This layout permits nearly instantaneous execution upon arrival at a node. Only a handful of integrity checks are performed prior to execution. Most importantly, no copying of data is necessary. Removing the need for so-called unmarshalling in this fashion was one of the means used for increasing efficiency.
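A possible C rendering of this twelve-byte header is given below. The field order follows the description above, but the individual field widths are an assumption made purely for illustration; the authoritative layout is the SNAP packet format specification.

#include <stdint.h>

/* Sketch of the SNAP-np header: version, unused flags, program counter,
 * source port and the lengths of the code, heap and stack segments.
 * Field widths are assumed, not taken from the actual specification. */
struct snapnp_header {
    uint8_t  version;
    uint8_t  flags;        /* currently unused */
    uint16_t pc;           /* program counter */
    uint16_t source_port;
    uint16_t code_len;
    uint16_t heap_len;
    uint16_t stack_len;
};                         /* 12 bytes; code, heap and stack data follow directly */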
Since SNAP-np does not contain support for various features found in reliable protocols (e.g. TCP), special precautions have to be taken regarding packet size. The total size of a SNAP packet, including heap and stack, must always stay below the Maximum Transfer Unit of a node to avoid fragmentation. This means that the packet programmer must either estimate the worst case size of a packet or implement an auto return-home feature when a packet is growing too large. For instance, in the case of the advanced example given earlier, the maximum number of SNAP aware nodes traversed must be known upon injection of the packet into the network stream, or else the packet might get dropped or destroyed along the way.
Client Communication   Injecting a request into the network is taken care of by sending a packet to a known SNAP server in the specified format. A client must have access to the SNAP-np send instruction for this. Presently, this function is simply copied into the client application. The revised Splash implementation, on the other hand, uses a more elegant solution. In most cases it is sufficient to send a packet to a local SNAP server. From there on the packet is forwarded according to the control flow implemented in the packet itself. This way the strengths of the active network can be maximally exploited. While not possible under the original configuration, it is also possible to send a packet directly to the destination or a designated intermediate hop in our augmented environment. Specifics are discussed in section . Doing so will limit expressiveness, but increases responsiveness, since it reverts active network general case processing back to special case passive packet handling at the intermediate hops. Being able to finely select the transport mechanism in this way increases the flexibility of the system.
The transfer of data from a SNAP packet to a client is handled by the demux statement. This statement creates a UDP connection from the daemon to a client application. Again, the augmented application uses a different mechanism for daemon to client communication.
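A minimal sketch of what the demux step amounts to on the daemon side is shown below, using standard BSD sockets. The port number 7777 comes from the earlier example; everything else, including the function name, is illustrative.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Deliver a result string to a client listening on UDP port 'port' at
 * the given dotted-quad address.  Illustrative only. */
static int demux_to_client(const char *client_ip, int port, const char *result)
{
    struct sockaddr_in dst;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);              /* e.g. 7777 in figure 7.1 */
    inet_aton(client_ip, &dst.sin_addr);

    sendto(fd, result, strlen(result), 0,
           (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return 0;
}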
Services   For any extensible system the service infrastructure is an essential element. By enabling access to all parts of the local computer, services can expand upon the base instruction set. A service infrastructure introduces potential security hazards and should be implemented with due care. In the case of SNAP these security concerns are extremely important, since some of the system's characteristics are based on the notions of resource predictability and safety of execution. Service calls operate outside of the SNAP-ee and therefore do not fall under its restrictions.
The original service infrastructure was hard-coded into the application. Moreover, the calling mechanism placed certain restrictions on the accepted services, which proved unworkable for our research. Because of this a new and more flexible service infrastructure has been implemented. The old infrastructure will not be discussed here in much detail.
We should note that since services originally had to be coded into the application there existed no actual difference between a service call and a SNAP instruction. A consequence of this fact is that services had to be written in the same language as the interpreter, C. In the next section we will discuss the new infrastructure and the advantages it brings to our research.

Chapter 8
Splash: Combining SNMP with SNAP

Although our primary task was to combine the SNAP and SNMP daemons into a single executable, multiple issues arose along the way that made us decide to change key aspects of the execution environment. In this section we will discuss the integration of the two software packages as well as the alterations made to the original systems. The combined efforts led to the application used in the experiments: Splash.

8.1  Connecting SNMP and SNAP

Having selected two applications that meet our initial demands, we now have to connect them. This task must be carried out in a way that minimizes extra processing overhead while keeping in mind the other design goals set forth in chapter 4. The resulting program is supposed to carry out both original applications' tasks and expand upon these. Therefore we have given it a distinct name: Splash. Since SNAP itself has been revised to be part of Splash, we have to differentiate between the vanilla implementation and our revised version. For this purpose we will use the name SNAP-wjdb when we talk about our revised implementation.
Creating a single executable from two distinct software packages can be smoothly carried out by adopting a library oriented approach. Acknowledging this rule of thumb, the net-snmp developers have split their package into a daemon application and a set of function libraries. The SNAP implementation, on the other hand, collects all functionality in a single executable. One of the tasks at hand was therefore to split SNAP into one or more libraries and a thin wrapper daemon.
Repackaging SNAP as a wrapper executable and library can be carried out relatively easily by exporting the old main(...) function from the library. We have chosen a somewhat more complex interface, mostly for performance reasons. The SNMP library is used both by a SNAP service and by the original SNMP daemon. This gave rise to a new problem: in a single executable it is not possible to create multiple runtime MIBs, nor is multi-threading supported by the SNMP library. To overcome this issue we have merged the network select and receive loops of the two systems. The new SNAP library exports functions similar to the standard libc FD_ISSET, FD_SET and recv(..) routines. This makes it possible to add a SNAP-capable handler to any other network daemon.
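The following sketch shows how such a handler could be merged into an existing select() loop. The function names snap_fd_set(), snap_fd_isset() and snap_recv() are illustrative stand-ins for the exported routines described above; they are assumptions, not the actual library API.

    /* Minimal sketch of a merged select/receive loop, assuming the SNAP
     * library exports FD_SET/FD_ISSET/recv-like routines (names assumed). */
    #include <sys/select.h>

    extern void snap_fd_set(fd_set *set, int *maxfd);   /* assumed library call */
    extern int  snap_fd_isset(fd_set *set);             /* assumed library call */
    extern void snap_recv(void);                        /* assumed library call */
    extern void snmp_read(fd_set *set);                 /* net-snmp style read  */

    void event_loop(fd_set *base, int maxfd)
    {
        for (;;) {
            fd_set rset = *base;
            snap_fd_set(&rset, &maxfd);          /* add the SNAP socket(s)   */
            if (select(maxfd + 1, &rset, NULL, NULL, NULL) <= 0)
                continue;
            if (snap_fd_isset(&rset))
                snap_recv();                     /* execute incoming packets */
            snmp_read(&rset);                    /* regular SNMP handling    */
        }
    }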
Because it expects an initialized MIB exported by the SNMP daemon, the current Splash implementation will not work as a vanilla stand-alone SNAP daemon. Stand-alone operation could be implemented, but by disabling SNMP network access from the command line essentially the same situation can already be reached with our augmented snmpd.

8.2  Interfaces

The packet format and core instruction set of SNAP were not altered for the SNAP-wjdb implementation. Aside from the repackaging of code mentioned above, the core of the interpreter could therefore remain largely unchanged. Connecting the interpreter with the other elements of Splash did give rise to some inconveniences, and the various SNAP interfaces have been revised to fit our needs. We now revisit each of the interfaces introduced in section 7.4.

8.2.1  SNAP-np interface

SNAP-np, the interserver networking protocol, performed extremely poorly in the initial tests. The main cause of this proved to be the way active packets were being filtered out of the general network queue. Since SNAP runs directly on top of raw IP it has no access to the socket infrastructure commonly used for differentiating packet destinations. Instead, an alternate selection based on bits in the packet header is used. Originally, this was envisioned to be taken care of by the kernel. Using, for instance, the netfilter package in Linux 2.4 [], high processing speed can be combined with fine-grained control. In practice, however, SNAP dealt with this in another way. By opening up a low-level ETH_SOCK ethernet interface, all packets were read off the interface by SNAP itself. Since vanilla SNAP could run in kernelspace this was not necessarily problematic. When running in userspace, however, performance was considerably below expectations. Because local access to SNMP is a design goal, the application had to be able to run in userspace at higher speeds.
In SNAP-wjdb, packet filtering is carried out differently. While filtering on port numbers is not possible at the level at which SNAP operates, filtering on protocol number is. Protocol-based filtering is executed in kernelspace, so SNAP packets can be accepted in a userspace application without that application having to read all other packets. A disadvantage of this method is that each intermediate node must accept SNAP packets, otherwise they get dropped.
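A minimal sketch of this approach: a raw socket bound to a dedicated IP protocol number lets the kernel deliver only packets of that protocol to the application. The protocol number 150 used below is purely illustrative and not an official assignment.

    /* Protocol-based filtering: the kernel hands this socket only datagrams
     * carrying the given IP protocol number; all other traffic is untouched. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define IPPROTO_SNAP_EXAMPLE 150   /* assumed, illustrative value only */

    int open_snap_socket(void)
    {
        int fd = socket(AF_INET, SOCK_RAW, IPPROTO_SNAP_EXAMPLE);
        if (fd < 0)
            perror("socket");
        return fd;   /* reads on this fd now yield SNAP datagrams only */
    }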
To circumvent this problem, network instructions have been added that do not resend packets on a hop-by-hop basis (as forw, forwto and send do), but instead send a packet directly to its destination using passive IP packet handling en route. The dforw, dforwto and dsend instructions can be used to skip past conventional nodes in the network. A side effect is that response times decrease when intermediate packet handling is not needed. The simple example program in figure 7.1 would, for instance, work equally well if we replaced the forw operator with dforw. The advanced version of it, on the other hand, would not.

8.2.2  Client interface

The client interface in SNAP also uses a network tunnel to transfer results from the SNAP daemon to the client application. During testing we noticed that delivering data through the original interface could incur a performance penalty on the system. To minimize this penalty the infrastructure has been rewritten from the ground up. SNAP instructions were not altered, however, to ensure compatibility with older packets.
The codebase for the new daemon/client interface can be shared between clients and the SNAP daemon. Transport protocols are largely abstracted from the developer by an API that exports simple send(..) and recv(..) functions similar to other network interfaces. The underlying transport protocol can be toggled at runtime by setting an environment variable. Currently, the interface supports UDP, UNIX and raw IP protocols as transport layers, but others can be added later.
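The runtime selection could look like the sketch below. The variable name SPLASH_TRANSPORT and the string values are assumptions for illustration; the thesis only states that the transport can be toggled through an environment variable and that UDP, UNIX and raw IP are supported.

    /* Sketch of runtime transport selection for the daemon/client channel. */
    #include <stdlib.h>
    #include <string.h>

    enum transport { TRANSPORT_UDP, TRANSPORT_UNIX, TRANSPORT_RAWIP };

    enum transport select_transport(void)
    {
        const char *t = getenv("SPLASH_TRANSPORT");   /* assumed variable name */
        if (t == NULL || strcmp(t, "udp") == 0)
            return TRANSPORT_UDP;                     /* sensible default      */
        if (strcmp(t, "unix") == 0)
            return TRANSPORT_UNIX;
        return TRANSPORT_RAWIP;
    }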

8.2.3  Networking similarities

During subsequent code revisions the SNAP-np interface and the client interface have grown toward each other. Although the version of SNAP-wjdb used for our experiments still consisted of separate codebases, it is our intention to eventually merge the two. There are clearly conceivable use cases in which SNAP-np would benefit from running over UDP or TCP. Due to the reliance on a protocol-based selection mechanism this is currently not possible. By merging the interfaces, however, we could clean up the SNAP-wjdb codebase and further extend SNAP's modi operandi.

8.2.4  Service interface

The third and last of SNAP's interfaces concerns access to services. Criticism of the old interface was already given in section 7.4. We wanted to be able to update the SNMP service more frequently than the SNAP library, since most of our development time went into creating this service. Where possible, increasing the flexibility of SNAP by extending its reach on the underlying system was a second concern.
Plug-in architecture   The service interface in SNAP-wjdb is based on a plug-in architecture. Plug-ins are prevalent in systems where extensibility is a concern. Prime examples are the Windows multimedia system sublayer that accepts additional encoders and decoders and the Mozilla/Netscape plug-in architecture for web-enabled applications, such as Macromedia's Flash, Sun's Java and Microsoft's aforementioned multimedia system.
In any implementation plug-in libraries must adhere to certain rules. For instance, there must be an agreement on how the calling application accesses the routines in the library. In SNAP-wjdb services can be written as normal Linux ELF libraries, but must export certain additional functions. Note that the standard library format would also allow us to link to the services at compile time instead of using the plug-in architecture. However, since the plug-in architecture uses fundamentally the same low-level interface as shared libraries we saw no reason to do so. In any case, had we encountered serious drops in performance we could have reverted to the previous situation.
function                  purpose
init                      initializes static datastructures
getnextfunc               loops through the exported service functions
getlastresult             optionally returns a structured dataset
free_local_returnstruct   frees the dataset
fini                      cleans up leftover datastructures
Table 8.1: A service's minimal set of exported functions
Aside from its actual service functions a SNAP-wjdb service must export the functions outlined in table 8.1 to be recognized by the main application. These various functions are needed by the service interface to handle service initialization and destruction as well as structure copying. Through the use of conversion functions the service infrastructure is capable of translating stack values directly into function arguments. Similarly, return values are automatically placed on top of the stack. However, when multiple return values can be expected, for instance by executing an SNMP request, an indirect variable passing method can be applied. The getlastresult and free_local_returnstruct functions are used to retrieve and optionally destroy extended sets of data.
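A skeleton service library could therefore look like the sketch below. Only the function names come from table 8.1; the signatures, the example service and its registration details are assumptions made for illustration.

    /* Skeleton of a SNAP-wjdb service plug-in (signatures assumed). */
    #include <stddef.h>

    static int counter_svc(int delta) { static int total; return total += delta; }

    int  init(void) { return 0; }     /* initialize static datastructures */
    void fini(void) { }               /* clean up leftover datastructures */

    /* Iterate over the exported service calls: return the SNAP name, a
     * handler pointer, its argument count N and return type R. */
    int getnextfunc(const char **name, void **handler, int *nargs, char *rettype)
    {
        static int pos;
        if (pos++ > 0) return 0;               /* only one service exported  */
        *name    = "counter_add";              /* name visible to SNAP calls */
        *handler = (void *)counter_svc;
        *nargs   = 1;
        *rettype = 'i';                        /* integer return value       */
        return 1;
    }

    void *getlastresult(void)              { return NULL; } /* no extended data */
    void  free_local_returnstruct(void *p) { (void)p; }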
Initialization   A plug-in based architecture can, if not implemented correctly, impose a performance penalty on the system. To overcome this problem we have moved all plug-in specific code to the initialization stage of the SNAP-wjdb package. Three functions are used for handling services. In its default behaviour, SNAP-wjdb executes an init call prior to accepting packets. When called, the function searches a number of standard library directories for files that match the pattern snap_svc_[name].so. Matching libraries are then loaded and scanned for the obligatory routines. Service handlers are found by calling a library's mandatory getnextfunc function. This function is expected to iterate through all service calls in the library and return the name of the SNAP call together with a pointer to the actual handler, the number of arguments N and the type of return value R. This information has to be supplied by the library developer. Although this slightly increases development complexity, there are numerous advantages. The most important one, and the reason for implementing this scheme, is that nearly all existing functions can be called directly, i.e. without the need for wrapper functions. With the help of the service infrastructure we can tap into the extensive array of libraries available for Linux quickly and easily. Only functions that take no arguments cannot presently be called directly.
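The scanning step might be sketched as follows, using the standard glob and dlopen/dlsym calls. The registration is reduced to a printout here; a real implementation would insert each handler into the service hashtable.

    /* Sketch of the plug-in scanning step: open libraries matching
     * snap_svc_*.so and resolve their mandatory getnextfunc routine. */
    #include <dlfcn.h>
    #include <glob.h>
    #include <stdio.h>
    #include <stddef.h>

    typedef int (*getnextfunc_t)(const char **, void **, int *, char *);

    void load_services(const char *dir)
    {
        char pattern[256];
        glob_t g;

        snprintf(pattern, sizeof pattern, "%s/snap_svc_*.so", dir);
        if (glob(pattern, 0, NULL, &g) != 0)
            return;

        for (size_t i = 0; i < g.gl_pathc; i++) {
            void *lib = dlopen(g.gl_pathv[i], RTLD_NOW);
            if (lib == NULL)
                continue;
            getnextfunc_t next = (getnextfunc_t)dlsym(lib, "getnextfunc");
            if (next == NULL) {              /* not a valid service library */
                dlclose(lib);
                continue;
            }
            const char *name; void *handler; int nargs; char rettype;
            while (next(&name, &handler, &nargs, &rettype))
                printf("registering %s (%d args)\n", name, nargs);
            /* a real implementation would add each handler to the
               service hashtable here instead of printing it */
        }
        globfree(&g);
    }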
After initialization all accepted services have been linked to and can be used as if they were linked at compile time. Because a vtable must be used for function lookup, performance will inevitably be somewhat worse than with statically linked libraries. There is, however, no reason to expect that a plug-in based architecture performs worse than an implementation based on ordinary shared libraries.
Execution   When a SNAP packet executes a calls instruction, the service interface looks up the handler in a hashtable, taking the SNAP argument as service name. If a service with that name exists, the interface converts the top N stack values into arguments and calls the appropriate handler with them. Note that the infrastructure does not check whether these values match the types of the function's arguments. After a call completes, the return value is automatically converted by looking up the handler's return type R, and the resulting variable is placed on the stack. Lastly, the getlastresult function is called to see if extended datastructures were prepared. If so, these values are also converted and placed on the stack. One can see that this process is not particularly type-safe. Successfully calling a function relies on proper use of getnextfunc by the library developer and on correct value placement on the stack, a responsibility of the packet developer. Behaviour under error conditions is largely undefined, a cause of possible problems. Increasing the robustness of the service interface is necessary before using it in critical applications; a standardized exception catching framework should be decided upon for this purpose.
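The dispatch step could be sketched as below, with stack values and arguments simplified to plain integers. The casts make the deliberate lack of type safety visible; the stack representation and calling convention are assumptions for illustration only.

    /* Sketch of the (non-type-safe) dispatch of a service call. */
    typedef int (*svc1)(int);
    typedef int (*svc2)(int, int);

    int dispatch(void *handler, int nargs, int *stack, int *sp)
    {
        int result;

        switch (nargs) {                     /* N as reported by getnextfunc */
        case 1:
            result = ((svc1)handler)(stack[--*sp]);
            break;
        case 2: {
            int a = stack[--*sp];            /* pop arguments explicitly     */
            int b = stack[--*sp];
            result = ((svc2)handler)(a, b);
            break;
        }
        default:
            return -1;                       /* unsupported argument count   */
        }
        stack[(*sp)++] = result;             /* place return value on top    */
        return 0;
    }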
Library development   The service infrastructure is designed for rapid functionality development. While services can be programmed from the ground up in any language there already exist support routines and examples for C and C++ developers. The snap_svc.c file implements most of the functions in table 8.1. Developers of new services need only supply a new file that contains the getnextfunc function tailored to the exported services. If these services are functions of other libraries they can be directly accessed by linking to these libraries, otherwise new functions can be added in the same file. A template has been created for rapid development purposes. Additionally, a basic example can be found in the snap_svc_test source and header files.

8.3  Services

For the experiments we have had to implement a number of services. The most important is the SNMP connection library, but we will also discuss the others here. The mechanisms underlying the services have been discussed previously and will not be repeated here.
SNMP service   The SNMP connection library has been separated from the main SNAP-wjdb codebase since we expected to update it frequently. The library serves as a wrapper around the necessary initialization, PDU creation, execution and teardown parts of the SNMP process. Executing an SNMP request can be a tedious exercise because of the many datastructures that have to be set up. We do not want to place this logic inside the SNAP packets themselves, so most of the work must be abstracted from the user. On the other hand, too much abstraction can limit the usefulness of the connection library itself. We therefore chose to export many functions, including low-level ones dealing with general SNMP functionality. Reducing SNAP packet size is made possible by exporting a second set of functions that automate much of the low-level process, in essence handling a special case. For instance, retrieving a single variable can be accomplished by executing one service call. Retrieving multiple variables, however, can be sped up by resorting to lower-level service calls, thus removing unnecessary duplicate instructions. Where possible, checks were implemented in the library to deal with low-level functions that have not been executed. For instance, the initialization routines will be executed automatically if the necessary structures have not been initialized prior to adding a variable to the PDU. Exploiting these safety checks can reduce a packet's size; one should, however, be thoroughly aware of which safety checks exist before skipping instructions, otherwise undefined program behaviour may be encountered.
service name                 functionality
snmp_init                    initializes the library
snmp_init_ip                 idem, but uses IP to connect to the SNMP server
snmp_pdu_init                initializes an empty protocol data unit
snmp_pdu_addvar_null         adds a value to be retrieved
snmp_pdu_addvar_withvalue    adds a value to be updated
snmp_send                    sends the prepared PDU to the server
snmp_close                   closes an open connection
Table 8.2: Low-level SNMP service calls
The list of exported low-level functions is shown in table 8.2. From this list it can be seen that the connection between SNAP and SNMP is a pure client/server one. However, one of our initial demands was that they could coexist in the same execution space. The first function, snmp_init, is set up to handle this. It creates a connection to an SNMP server not through the use of a regular transport protocol, but by simply passing pointers between the two systems. The PDU created earlier is therefore not marshalled, transferred and unmarshalled, but directly referenced from the SNMP library. Naturally, this greatly reduces processing overhead.
The other initialization routine does use a standard UDP connection to an SNMP server and should therefore not be used to connect to the local executable. Its intended use is the querying of other nearby SNMP servers not running Splash. A Splash daemon can thus serve as a proxy for SNMP servers, another bandwidth saving utility.
From these low-level services more advanced ones can be constructed. Doing so does not increase flexibility, but it can help limit packet size and increase code reusability. The two most basic examples we created were snmp_getsingle and snmp_setsingle. The first was used in the example shown in figure 7.1. Both functions call - in consecutive order - the init, pdu_init, addvar, send and close functions. Since we know that only one variable has to be retrieved or set, we can reduce the number of necessary calls from five to one, taking as arguments the combined arguments of the low-level functions.
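A sketch of how such a convenience call could be composed is shown below. Only the function names and call order follow the text and table 8.2; the C signatures and return conventions are assumptions.

    /* Illustrative composition of snmp_getsingle from the low-level calls
     * of table 8.2 (signatures assumed). */
    extern int snmp_init(void);                    /* local, pointer-passing connection */
    extern int snmp_pdu_init(void);                /* empty GET PDU                     */
    extern int snmp_pdu_addvar_null(const char *oid);
    extern int snmp_send(void);
    extern int snmp_close(void);

    /* Retrieve a single variable: five low-level steps collapsed into one. */
    int snmp_getsingle(const char *oid)
    {
        snmp_init();
        snmp_pdu_init();
        snmp_pdu_addvar_null(oid);     /* variable to retrieve */
        int rc = snmp_send();          /* execute the request  */
        snmp_close();
        return rc;
    }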
A more advanced use of SNMP calls can be seen in another set of services. Two control flow problems have been dealt with by requesting information from SNMP. First, to travel across the network without prior knowledge of the underlying topology and without the use of datastructures, a round-robin interface selection scheme has been implemented. By calling snmp_getnexthop with the IP address of the incoming interface as argument, a next hop for a packet is generated from SNMP information. Second, snmp_getneighbours places all local IP addresses, except 127.0.0.1 and the incoming address, on the stack for a flood send. Both of these functions can also be implemented without the use of SNMP by directly querying the kernel, but in some instances SNMP can serve as a useful abstraction of the raw kernel data. Note that, for the experiments, we eventually decided to use kernel calls for these tasks. The SNMP requests proved to be very fragile with regard to exception handling. One of the underlying reasons is that one request's response may be necessary to construct the following OIDs, an undesirable situation, since if the first request fails we have no input for the next. Using kernel calls, many steps can be purged from the process, resulting in a cleaner and sleeker implementation. Kernel call services will be discussed in the next section, along with a few others.
Other implemented services   During the development process issues arose that were hard to implement using only SNMP access. We could have blindly created SNAP instructions for each problem, but this would surely lead to a bloat in SNAP-wjdb code. Alternatively, we created a number of services to deal with each specific type of problem.
Configuration of nodes can in theory be performed through SNMP calls. The net-snmp package, however, does not export many variables that can be SET. In those cases where we needed to alter behaviour, communication with the subsystem was therefore implemented through kernel calls. More specifically, services have been created for read/write access to the route table, the interface table and the /proc interface. We chose to implement this functionality in Splash services instead of an SNMP MIB for two reasons: (1) to reduce development time and (2) to experiment with the service library layout.
A different kind of extension is the data dictionary. Following the security guidelines of the SNAP language, we did not want to add shared memory directly to the interpreter. However, there are tasks that need a communication mechanism between packets; examples are given in chapter . We therefore implemented a simple mechanism that can be used to store SNAP stack values on a node. A hashtable wrapper is used that exports basic GET, SET and DEL functionality. No resource control constructs exist, so this would certainly not be an ideal implementation for critical systems. In a production environment some sort of timed automatic data release, a maximum imposed on the amount of stored data and a cost function eating up resource bound should be added. For our research purposes the simple library proved acceptable, however.
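Functionally, the dictionary amounts to a small named-value store. The sketch below captures that idea with a fixed array standing in for the hashtable; it deliberately mirrors the lack of resource control described above and is not the actual Splash implementation.

    /* Minimal sketch of a data dictionary exporting SET, GET and DEL. */
    #include <string.h>

    #define DICT_SLOTS 64

    static struct { char key[32]; int value; int used; } dict[DICT_SLOTS];

    int dict_set(const char *key, int value)
    {
        int free_slot = -1;
        for (int i = 0; i < DICT_SLOTS; i++) {
            if (dict[i].used && strcmp(dict[i].key, key) == 0) {
                dict[i].value = value;            /* overwrite existing entry */
                return 0;
            }
            if (!dict[i].used && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;                            /* store full, no eviction  */
        strncpy(dict[free_slot].key, key, sizeof dict[free_slot].key - 1);
        dict[free_slot].key[sizeof dict[free_slot].key - 1] = '\0';
        dict[free_slot].value = value;
        dict[free_slot].used = 1;
        return 0;
    }

    int dict_get(const char *key, int *value)
    {
        for (int i = 0; i < DICT_SLOTS; i++)
            if (dict[i].used && strcmp(dict[i].key, key) == 0) {
                *value = dict[i].value;
                return 0;
            }
        return -1;                                /* unknown key              */
    }

    void dict_del(const char *key)
    {
        for (int i = 0; i < DICT_SLOTS; i++)
            if (dict[i].used && strcmp(dict[i].key, key) == 0)
                dict[i].used = 0;
    }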
Lastly, we have created kernel call based versions of the control flow services mentioned in the SNMP service library. Internally relying on the route and proc services mentioned above, the if_getnexthop and if_getneighbours services replace their SNMP-based counterparts in our experiments.

This last chapter on Splash completes the overview of our test system's implementation. In the following chapters we will discuss the experiments carried out to support our thesis. In the first (), Splash's low-level request performance is compared with SNMP's, while in chapter  the functionality claims are reviewed.

Part 4
Research


Chapter 9
Experiments: Performance

Having discussed theoretically obtainable NM improvements and the framework with which these are to be accomplished, what remains is to prove the stated claims. In this and the following chapter we will experimentally establish the comparative qualities of SNMP and the active network. This first chapter compares processing speed, while the next deals with the functionality argument.

9.1  Theory Recap

The experimental results displayed hereafter are based on the test cases introduced in section 5.2.3. Contrary to the functionality tests discussed in section 5.2.2, the following results are not intended to back our claim of superiority over traditional SNMP. One should recall that to improve upon SNMP we first have to establish that a rivaling environment is comparable to SNMP in terms of SNMP's main utility: low-level request processing speed. While the implemented tool, Splash, contains both an SNMP and an AN interface, we will use the name Splash in the experiments to denote the active network interface as opposed to its SNMP counterpart.
We will first make the case for Splash in general terms by comparing the round trip time results of identical requests executed under SNMP and Splash. Then, a single request is examined more closely to distill the relative cost of the subprocesses involved. With this information we can select situations in which Splash will excel as well as those for which the system is not suitable. The insights gathered will help explain the comparative quality of Splash in the more elaborate functionality tests discussed in the next chapter.

9.2  Considering Pre- and Postprocessing

The results in figure  show the overhead of SNAP and SNMP for the various scenarios without `preprocessing', i.e. without the time it takes to establish an SNMP session. The SNMP platform suffers from a disproportionately long preprocessing time. During preprocessing several datastructures have to be set up, for instance the Protocol Data Unit used for sending the requests. To accomplish this, the net-snmp application has to open the extensive SNMP library, which in turn must initialize several data structures. A lightweight package such as SNAP has fewer internal data structures than SNMP and can therefore respond more quickly. For simple requests, preprocessing can have a large impact on performance. For a single SNMP GET request the overhead of preprocessing is approximately 500 times that of the actual request; SNAP only needs around 30 request durations. With this preprocessing taken into account, Splash outperforms SNMP by an order of magnitude. Caching of structures can, however, largely hide this weakness. Therefore we chose to keep it out of the direct comparison.

9.3  Round trip Results

As discussed previously, we have selected a set of 4 requests to obtain general purpose performance statistics. These requests are: a single GET, a single SET, a combination of GET and SET, and a combination of 5 distinct GETs. All of these have been sent to 7 servers at varying distances from the monitoring host. As explained in 5.4, the 7 queried nodes are linked in a linear fashion. Therefore the network delay for a node n is the accumulated delay for hopcount n-1 plus the delay incurred by traveling from n-1 to n. While hopcounts 2 to 7 include actual network links, the hopcount 1 case executes a request at the monitoring client's own node. Results for this case differ from the others because all processing takes place on the same host, as we will see later on.
Figure 9.1: Low-level Requests Round trip Results
Although net-snmp supports SNMPv3, we used only SNMPv2c for our comparative tests. Version 3 only adds authentication and encryption features not present in SNMPv2, and using these technologies has a severe impact on performance. Since SNAP currently lacks such features, comparing it with version 3 would be inappropriate.
For clarity, the results in figure 9.1 are displayed in two separate graphs. Figure  displays the trend in performance improvement when combining multiple similar requests, while figure  shows the differences in speed between retrieving and setting values. All results are medians over 101 test runs. For all tests, the upper and lower quartiles were within 2.57% of the median.

9.3.1  Retrieving One or More Values

Retrieving values from a remote node is arguably the most often used feature of SNMP. It is important, therefore, to be able to execute especially these requests at comparable speeds.
From figure  we can see that executing a single request using Splash takes only fractionally longer than executing it using SNMP. When sending the request to a remote host the penalty of using Splash lies around 30% for the worst case, i.e. when the network delay plays no significant role. When network delay increases both platforms scale at what appears to be the same rate. Therefore the comparative penalty drops with each hop. After 7 hops the penalty is reduced to approximately 10%. In any case, performance results for this important case fall in the same order of magnitude for the two systems, although SNMP still outperforms Splash by a percentage depending on the network delay.
When executing a request at the local host the outcome changes dramatically. In this case Splash actually outperforms SNMP. In practice this is not important, since querying of local data can be handled more easily through normal shell tools or the /proc file system. Nevertheless, from this observation it must follow that SNMP consumes more computing resources. If all external factors remain the same, the program that performs slower under time-constrained circumstances must do so because it consumes more resources. It is uncertain whether this has to do with CPU time or I/O, but it is clear that the heavyweight net-snmp daemon slows a system down more than the lightweight SNAP interpreter, even though both share the same back-end.
Both environments allow for the bundling of multiple similar requests into a single packet, thus saving bandwidth by combining the packet headers. SNMP does this by allowing multiple OIDs with the same request type to be added to an initialized PDU. By having procedural packets, Splash can innately express any number of combined requests, without the need for predefined structures. The 5GET case makes it immediately clear that the specialized framework used by SNMP outperforms Splash on all levels. Splash performs worse than SNMP even at the closest remote hop. This could be expected, since the remote workload is higher for an AN based system. What is striking, however, is the fact that the two environments scale differently with hopcount, again to the disadvantage of Splash. The underlying reason is not immediately apparent. We would expect Splash to respond fractionally slower on the remote host, but since intermediate handling remains the same for all distances the difference in slope cannot be explained by differences in the methods of request handling. It appears that an external factor is influencing the outcome of our tests.
In the following section the subprocesses will be discussed more closely. To be able to explain the slope anomaly occurring in the 5GET case we will briefly touch on one of the factors here. Contrary to initial expectations, packet size, even in the limited scope of our research, plays a significant role in network delay. The specialized data structure used by SNMP (the PDU) enables the system to minimize packet size. Splash, on the other hand, has to send complete programs in bytecode format, resulting in a discrepancy in packet size. For the specific 5GET request we executed, SNMP packets were 101 bytes long on the way to the server and 267 on the return trip. Splash packets weighed in at 444 and 652 bytes respectively. A weighted ping test revealed that delays are indeed approximately a factor of 2 longer for packets of size 545 than for packets of size 180. The chosen packet sizes are estimates of the average size of the round trip packets.
Acknowledging the influence packet size has on network delay and the fact that Splash packets will always be larger than their SNMP equivalents, we must conclude that Splash requests will always scale worse than their SNMP counterparts. This is not obvious in the 1GET case, since other factors are more important for such small packets. However, the larger the number of additional instructions, the larger the impact of this size related delay. Optimizing Splash packets for these instances can reduce this weakness. We did not look into this when obtaining these results, however.
Network delay plays no significant role when processing requests locally. As with the single GET case, Splash can therefore still outperform SNMP in the 1 hop case.

9.3.2  Setting vs. Retrieving Values

While retrieving values from a remote host is the most frequently executed operation, the setting of variables is also an important factor in determining comparative processing speed. Figure  depicts the tests concerning SET requests. We can immediately see that altering variables is considerably slower than retrieving them, in the current case by approximately a factor of 3. This holds for both environments.
Comparing SNMP with Splash in the SET case follows the same line of reasoning as in the GET case. While Splash can outperform SNMP on the local host it suffers a small penalty remotely. Both request types scale with network delay, thereby decreasing Splash's disadvantage when the distance increases.
Combining SET and GET requests presents a different picture altogether. Since SNMP cannot combine SET and GET requests into a single PDU, two separate packets have to be sent to the remote node. Using parallelization it is possible to send these two simultaneously, thus reverting the test case to that of the slowest subtask, in this case the SET request. However, in the test case the variable that is to be retrieved is the same variable that is being set. This means that the two have to be carried out in consecutive order. Splash can combine these requests in a single packet, while SNMP has to wait for the SET to finish before it can issue the GET. Splash easily outperforms SNMP.
The fact that SNMP has to send two requests explains the different shape of its curve: it grows twice as fast, since it encounters twice as much delay. Splash GETSET, on the other hand, scales with a single network delay. More importantly, the combination of GET and SET into a single packet improves considerably over sending separate packets, regardless of the packet distance. Even though the Splash packet is significantly larger than the SNMP PDU, just as in the 5GET case, it outperforms SNMP in all instances. Apparently, the penalty of having to send two packets completely outweighs the size-based delay. This is the most elementary example of how the flexibility of the SNAP language allows Splash to outperform SNMP by combining multiple interdependent requests into a single packet. The next chapter follows through on this line of thought, widening the gap between Splash and SNMP performance.

9.4  Subprocesses

9.4.1  Overview

The previous results do not reflect the performance of SNAP and SNMP handling alone. As noted earlier, part of the overhead can be ascribed to network transfer time.
Decomposing the overhead reveals that it is composed of overhead due to `pure' SNMP and SNAP handling time, and overhead due to the factors that are common to both implementations, i.e. network transfer, internal processing in the MIB and data conversion.
An important subtask is remote MIB processing. We noticed early in the experiments that the internal processing time for a request in the MIB is directly dependent on the variable that is being requested. Response times of identical request type but for different variables (e.g. GET sysName.0 and GET ifNumber.0) may vary widely. The factor that determines this is the location of the variable. Requesting kernel values is a more computationally intensive task than copying in-memory variables. This effect does not show in our results directly. Instead, we have tried to select requests with minimal internal processing time by requesting values that reside in memory. Naturally, the same values have been requested using SNAP and SNMP.
Figure  shows us how the different subprocesses that come into play affect the overall results. For this breakdown into subprocesses we selected the 1GET request from the previous tests. Since it executes a minimal SNMP request the relative overhead is maximized.
Contrary to the previous tests, these results were obtained from a single test run. We use them solely to discuss the comparative processing times of individual subtasks. The request was sent to five servers of different distance to the client. From this we can distill the scalability with regard to network delay. Total processing time cannot be compared with previous results because the system had to export additional debugging information to obtain these results. However, the relative results are still applicable to the previous runs.
From figure  we can see that the redirection and client steps include a large amount of idle time, in which the program is waiting for another task to complete. For instance, the waiting stage of the redirection server takes as long as the time in which the packet is sent through the network combined with the time needed to process the packet at the remote host. Similarly, the waiting stage of the client is equal to the total processing time of the redirection server. Due to an overlap in processing during the communication phases, the figures do not depict this behaviour exactly. However, one can distill this fact when taking the timing overlap into account.
   
Figure 9.2: Single GET Subprocess Results
Apparently, back-end MIB data retrieval plays only a minor part in the entire process. Because back-end processing is especially limited for a single in-memory GET request, the current situation can be taken as a worst case estimate of processing overhead. From previous experiments we can deduce that the same holds for SNMP. In the case of Splash it is relatively easy to improve upon these results. For inter-server communication a local redirection server is used, but there is no technical reason for doing so. If we were to use Splash solely as a remote interpreter we could do away with the redirection server, reverting all processing back to the 1 hop case, where no intermediate server comes into play. In section 4.4 we discussed why we chose the current, from a performance point of view less-than-optimal, solution. Combining multiple SNMP requests and dealing with them on site, as we will do in the next chapter, is another means of reducing the relative cost of the overhead.

9.4.2  Individual Tasks

Figure 9.2 displays a somewhat coarse-grained view of the subprocesses. To better clarify the process we will briefly discuss each of the intermediate steps.
Remote Server   In the general case, hops 2 to 5, processing on the remote server takes approximately 1 millisecond. At most half of this time is used by the SNMP daemon. The other half is used for creating the necessary structures, retrieving results and executing the packet's travel logic. Naturally, the hopcount plays no role in this part of the request.
Redirection Server   All requests are sent through a local server. This redirection server executes the special stages of the packet program used for initialization and finalization. During the latter stage, it unpacks the return values and hands them to the client through a local UDP connection. The rationale behind this was discussed in section 4.4. Figure  shows that this setup imposes a severe penalty on the processing speed. In particular, postprocessing (unpacking and redirecting of the return values) takes up a large amount of time. We can also see that the difference in processing time between 2 and 5 hops is approximately equal to the difference in ping times. This has been discussed previously, but again it shows that total processing time scales linearly with network delay.
Client   The client, finally, adds another 100 ms to the final result. This is mostly due to printing performance statistics and the result to the screen, a penalty that does not exist under normal operation.
Special Case: Local Processing   For hopcount 1 we sent the request to the local server. Statistics for this case are quite different from the general case. No redirection is necessary, which decreases response time. However, remote computation increases since postprocessing is now being executed on the remote server.

9.5  Performance Overview

A number of conclusions can be drawn from the performance tests. Advantageous to our case is the observation that SNMP appears to consume more processing power than Splash. Also, SNMP suffers from much longer preprocessing overhead.
Putting Splash at a disadvantage, SNMP still outperforms Splash by a percentage in the general case. Compared to previous results obtained with AN systems, where performance was several factors or even orders of magnitude worse than the reference system, this relatively small penalty is nevertheless a great improvement. Another drawback is the increased packet size inherent to using programmable packets and especially the effect this has on network delay. Under identical network delay Splash will always perform marginally slower than SNMP, but since the significance of remote host processing diminishes with network delay, a small extra overhead on the remote host is reduced when intermediate delays grow. If the system scales worse, as it does, this may not hold: in the experiments, comparative processing speed sometimes actually dropped when network distance increased. Since we have not tried to optimize packet size we expect that improvements can be made in this area.
From examining the subprocesses it becomes clear that time spent on actually retrieving the single SNMP value from the MIB can be small compared to the overall process. It is essential to decrease the relative overhead if we want to increase Splash's effectiveness. We will do so in the next chapter by aggregating response data and reducing network travel.
The most important outcome of these tests is that Splash can execute the requests to which SNMP is especially tailored with only a minor performance penalty. As the SNAP and SNMP results are in the same order of magnitude, Splash could conceivably be used as a replacement for SNMP. With further optimization of the codebase, as has probably happened over the long lifespan of net-snmp, even better results should be obtainable. As we stated earlier, however, there is no reason not to use the SNMP "half" of Splash. The results here merely show that SNAP is not far behind SNMP for any type of request, even those in which SNMP obviously excels.
Following on these experiments, the next chapter gives a functional comparison of the two platforms as discussed in the functionality test design of section 5.2.2.

Chapter 10
Experiments: Functionality

Most of SNMP's drawbacks have to do with the rigid structure it imposes on the network topology and the inefficiency in data transfer and system response time this brings with it. Discussed here are several techniques that can reduce this overhead. For each of these we will present an implementation in Splash.

10.1  Introduction

The previous chapter showed that Splash performance is near that of SNMP, but not completely on par. However, the tests run to obtain those results were tailored especially toward SNMP, consisting solely of the low-level atomic operations in which it excels.
While SNMP may outperform more flexible solutions in quick retrieval requests, the limitations of the framework quickly become apparent when implementing more elaborate NM solutions. In section 5.2.2 NM tasks have been presented that cannot be handled practically by issuing these atomic requests. The list is by no means exhaustive, but it contains enough everyday use-cases that are difficult to carry out using SNMP.
For the presented use-cases we will disregard the issue of performance altogether. The reason for this is simple: using SNMP, it is possible to retrieve large amounts of data in parallel, execute the necessary computations on the monitoring host and update the necessary values in a single step. By utilizing large scale parallelization the time the entire process takes is reverted to the case of retrieval and subsequent setting of the slowest value. Nevertheless, the amount of data that has to be transported may be of such proportions that this is deemed an inapplicable solution to many problems.
Although it is hard to prove such a claim, we suspect that SNMP's excessive network bandwidth utilization in high-level tasks limits the exploration of practical network management solutions. With the anticipated growth of network-enabled appliances the need for more flexible and resource friendly management tools shall almost certainly increase in the coming years. The introduction of new dynamic networking paradigms (e.g. ad hoc networking) especially calls for more intelligent communication [].
The selected techniques can significantly reduce network traffic and improve network-wide system responsiveness. For each of the test cases an example has been implemented in Splash. To save space we refrain from presenting the complete programs below. Instead, we will show instructive pseudo-code snippets. The complete programs, together with other prototype packets and background information, can be found on the Splash website at http://splash-snap.sourceforge.net/papers/dsom2003/. A detailed description of SNAP's semantics can be found in [] and on the project website [].

10.2  Serverside Processing

Reducing data transfer and response time can be accomplished by removing superfluous communication between the nodes. In the first set of use-cases we will only look at single client to server communication.

10.2.1  Serverside Data Aggregation

In [], a SNAP-based program is introduced that travels to a predefined list of hosts. This application, called a surveyor, has the advantage over simple polling that it minimizes the distance traveled through the network and thereby reduces overall management traffic. Figure  shows an example of a simplified surveyor packet created for visiting only a single host. A more general solution called a list-based surveyor is presented in figure .
  ; continue to dest
  forw
  ; last host on list?
  bne atdest

  ; operational code
  push "ifInErrors.2"
  calls "snmp_getsingle"
  push "ifInErrors.3"
  calls "snmp_getsingle"
  add

  ; move to next host
  pull 1
  forwto

  ; return data
atdest:
  demux
Figure 10.1: A simple surveyor packet
   
  ; test against hard coded threshold value
  gti 30000
  bez normalrun-pc

  push 0
  calls "proc_setipforward"

normalrun:
  ; continue normal processing here
Figure 10.2: Serverside reacting

The simple surveyor program travels to a remote host, executes a number of instructions and returns home. Figure 7.1 shows the same packet tailored to a single SNMP get request. One of the advantages of the programmability of AN packets is that it is possible to retrieve a number of values in this fashion, compute a derived result based on those intermediate values and then return to the host with only the derived value.
Figure 10.1 gives an example of this functionality by computing the aggregate incoming error count of two interfaces. This specific example is deliberately straightforward for the sake of clarity. The technique can, of course, be generalized to situations where more elaborate calculations are needed to obtain useful statistical data. With the basic arithmetic in place, any number of calculations can be carried out in this fashion using Splash.
Considering that many monitored statistics are in fact derived values, aggregating data on the server can greatly limit the occupied network bandwidth. Splash packets are larger than their SNMP counterparts, as could be seen in the 5GET request discussed in section 9.3. By bundling data results prior to transmitting them, this problem can be overcome.
Theoretically, the bandwidth reduction obtainable by serverside data aggregation is, for m raw input values and a single requested derived value, 100 - 100/m %. If we define the packet size in bytes as P(packettype), choosing Splash over SNMP can be considered a viable option when m - 1 >= P_Splash - P_SNMP.
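As a purely illustrative calculation with a hypothetical value of m: aggregating m = 5 raw counters into a single derived value saves 100 - 100/5 = 80% of the result traffic, provided the extra size of the Splash request packet does not cancel out the gain.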

10.2.2  Serverside Reacting

In the previous example we exploited only Splash's ability to compute derived results and initiate a return trip to the monitoring client. For use-cases in which reactions are predefined, returning home before reacting is unnecessary. Especially in situations where a speedy reaction is vital, we would much rather take immediate action on the spot.
  ; calculate new value
  push "ifInUcastPkts.2"
  calls "snmp_getsingle"
  push "sysUpTime.0"
  calls "snmp_getsingle"
  div

  ; get previous value
  push "intrafficdx"
  calls "memmap_getint"

  ; push new value
  push "intrafficdx"
  pull 1
  calls "memmap_addint"

  ; compare new and old values
  multi 2
  gt
  bez isok-pc
  push "error"
Figure 10.3: Serverside Trend Analysis
   
  ; forward to destination
  dforw
  pullstack
  bez atsource-pc

  ; execute requests here

  ; forward to next dest
  pullstack
  dsend
  exit

  ; return data
atsource:
  push 7777
  demux

#data 1
#data 10.0.0.34
#data 1
#data 127.0.0.1
#data 0
#data 0
#data 0
Figure 10.4: A list based surveyor packet

As an extension to serverside data aggregation, we will now extend the simple surveyor packet to perform direct actions based on the computed values. We added access to the underlying operating system through back-end services that carry out basic network maintenance tasks. Examples are functions to set interface status (UP or DOWN), forwarding rules (TRUE or FALSE), /proc file system values and route table entries. As net-snmp does not directly support this, the corresponding functions were encapsulated in SNAP services. These functions can be called from the surveyor packets, for instance to shut down interfaces on nodes from which too much traffic originates or to disable forwarding in nodes where error rates are unacceptably high.
A common test case is the following: network throughput is considered too high if a predefined threshold is reached. In the surveyor packet we can test against this threshold and execute a special code section. The snippet in figure  shows this example of serverside reacting. In the example we disable IP forwarding if the total amount of incoming traffic exceeds a predefined threshold.
Serverside reacting can play an important role in increasing system responsiveness. By taking action on location we remove a complete round trip through the network, which can take a disproportionate amount of time, as became clear from the performance tests. Serverside reacting pays off especially when traveling through the network is difficult due to an already high network load. Automation of tasks in this manner can only be used if actions are known a priori, however.
The savings related to employing this technique can be written down as a saving of the return trip time t_net per action test. Defining the remote calculation time as t_calc and the client side calculation time as t_local gives the following threshold test for using Splash: t_net + t_local > t_calc. From the performance results we can see that network traversal, including processing, can take up a disproportionate amount of time, both for Splash and for SNMP. Therefore remote calculation can be preferable in many instances. It is most advantageous in complex decision making situations, however, since each action test adds a new saving in time.
Each remote action test removes a return trip to the monitoring client. Network bandwidth consumption is therefore also reduced, by 2*P*i, where i is the number of action test branches resulting in serverside reaction.

10.2.3  Serverside Trend Analysis

An often recurring NM task is trend analysis; RMON was created as an extension to SNMP for this purpose. We now present an element of Splash that mimics such behaviour by storing historical data on location.
For this next example we added a data dictionary service to demonstrate the application of serverside trend analysis. The data dictionary consists of simple set, get and delete instructions for manipulating named variables. Keeping track of previously calculated information on the remote nodes reduces the need for copying intermediate statistics to the monitoring agent. Consequently, time-based derived results can be calculated on location.
Figure 10.3 displays the code for comparing a traffic rate over time: a straightforward trend analysis example. The depicted snippet could be inserted in any one of the packets discussed earlier.
Savings are in this case twofold: network savings scale according to the serverside data aggregation example, while response time behaves identically to the serverside reacting example.

10.3  Network Wide Processing

The previous examples showed some advantages of choosing an alternative over SNMP. Nevertheless, the displayed tasks could just as well be carried out using on-demand code loading or mobile code environments. Active networks have the added ability to traverse a network autonomously. The following techniques exploit the added capability of AN based systems to make various decisions on location, functionality that sets it apart from the other environments.

10.3.1  Network wide Data Aggregation

Statistical data frequently consists of raw values that are found scattered through the network. In this case serverside processing will not reduce traffic significantly, since each server has to be addressed individually. With network-wide aggregation, however, distributed derived values can be compared with a threshold without having to access all hosts. For instance, comparing a global value against a threshold can stop as soon as the threshold is reached. The benefit of this technique is most apparent when the threshold is easily reached, for instance when a single boolean value has to be tested on each machine. In this case a hop-by-hop approach, where totals are computed over the visited nodes, is preferable to a centralized approach.
Again, let us exemplify this idea by altering the simple surveyor packet. The original packet only aggregates data on a single host. By using the list surveyor, depicted in figure 10.4, we can compute an aggregate over the visited hops. Testing the aggregate against a threshold at each hop quickly reveals a possible error condition. Not only is network bandwidth minimized, the problem itself is also localized, since we know at least one section of the network that triggers a response. If necessary, immediate action can be performed on location.
Retrieving a value from n hosts will always take 2n single network traversals (n round trips) using simple polling. The number of messages sent drops to at most n+1 when using hop-by-hop traversal. Implementing action tests along the way can further reduce the data bandwidth, depending on the chance of triggering a global action test. Bandwidth savings therefore lie between 100 - 100(n+1)/(2n) % and 100 - 100/(2n) %.
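Purely as an illustration (the node count is hypothetical): for n = 10 remote nodes, hop-by-hop traversal sends at most n + 1 = 11 messages instead of 2n = 20, so the bandwidth savings lie between 100 - 100*11/20 = 45% and 100 - 100/20 = 95%, depending on how early an action test triggers.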
Response times may increase when using hop-by-hop traversal, since no parallelization is possible. Whether employing this technique is sensible in time-constrained situations depends on two factors: (1) whether remote reaction is expected and (2) network delay. In NM situations the monitoring client is often relatively far away from the remote nodes, while these are heavily interconnected. This holds, for instance, when scanning workstation subnets. If the link between the MC and the nodes incurs a delay of t_main, while the node links incur - on average - a delay of t_inter, response time will not decrease until 2n*t_main > 2*t_main + (n-1)*t_inter for a network of n remote nodes. Applicability is therefore dependent on the number of participating nodes and the relative cost of network traversal.

10.3.2  Network wide Reacting

One of the problems considered hard to perform using SNMP is that of resolving distributed problems. In this section, we consider the example of a Distributed Denial of Services (DDoS) attack. Discovering the originating network nodes and taking action on these nodes is essential to stop the network from becoming flooded. Using SNMP, the only way to find out where a problem occurs is by sending messages to a large number (possibly all) of the nodes in the network. This increases the load considerably in an already overloaded network and, as a side-effect, transmits a lot of data, much of which is useless.
Instead, using Splash, we altered the original surveyor program to react locally when a problem is spotted. At each hop, the network load is compared to a predefined threshold value. If this value is exceeded, normal execution is halted. The packet requests the IP addresses of all neighbouring nodes and forwards itself to these nodes. It then immediately returns to the management station where it delivers an error message. The resulting program, shown in figure , is inserted directly after the operational code of figure 10.1. Since SNAP does not allow unlimited loops, we have to resend a packet to the current host to execute the special case instruction for an a priori unknown number of neighbouring hosts.
By using this algorithm the DDoS test is recursively copied throughout the errorzone and dies out only at nodes that operate under normal load. The management station therefore only receives error reports from those areas that need extra attention. Optionally, error reporting could also be replaced with serverside reacting to even further alleviate network stress.
While successful in specific situations, this technique can incur even more bandwidth utilization than simple polling if used incorrectly. The crux lies in the use of a so-called flood-fill send. For a network of n remote nodes, having on average s network connections each, n*s packets will be sent in the worst case. This is greater than simple polling when s > 2. However, since the process starts only when a possible error is detected and spreads only through the errorzone plus 1 additional hop, savings can occur in practice. For an errorzone of r neighbouring nodes, r*s packets are sent inside the network plus an additional 1 + r between the MC and the network if serverside reaction is not used. Worst case this results in a total of r*(s+1) + 1 messages, which is more efficient than n*s only if r + r/s + 1/s < n. Depending on the number of interconnections, this threshold value will lie somewhere between r = (n-1)/2 for s = 1 and r = n for s = infinity. Naturally, the upper boundary is always satisfied, so we can say that the expected bandwidth savings increase with the interconnectedness of the network and decrease with the size of the errorzone. The size of a possible errorzone depends on the topology of the network and the number of external connections. Finding a clear threshold is therefore not possible in the general case. As a heuristic we can say that in sparsely connected networks other solutions should probably be considered.
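As an illustration with hypothetical numbers: in a network of n = 20 nodes with on average s = 3 connections each, an errorzone of r = 4 nodes leads to at most r*(s+1) + 1 = 17 messages against n*s = 60 for a network-wide flood; the threshold test gives 4 + 4/3 + 1/3, roughly 5.7 < 20, so the flood-fill approach pays off. For an errorzone of r = 18 nodes the test fails (18 + 6 + 1/3 > 20) and simple polling becomes the cheaper option.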
Response time savings for this technique depend on the type of action that is to be undertaken. For a basic action, e.g. the shutting down of an interface, response time decreases by a single return trip for each affected machine. This pays off particularly in circumstances where the network load is already extremely high, as is the case with DDoS attacks. The response time calculation for the previous example still holds, with the additional factor that delays will be greater than normal in these circumstances. Secondly, using serverside reacting, each affected node can be repaired before communication takes place, thereby reducing overall bandwidth utilization incrementally. This bonus recursively copies itself through the network, speeding up the entire recovery process possibly exponentially. The precise speed-up depends on the order in which connections are overloaded. A linear chain of connections will necessarily become available only at a linear rate. This can be considered a worst case situation, however.

10.4  Self organizing Networks

Computing totals and executing existing functions on location have direct applications in the current network management domain. Nevertheless, these applications can hardly be called revolutionary. With the ever increasing abundance of networked devices, and the correspondingly increasing cost of maintaining them, research is currently underway to automate NM tasks as much as possible. Automated networks are popularly referred to as self-organizing networks. The following examples serve as a primer on how active networks can assist in exploring the new NM algorithms needed to automate much of the network administrator's work. These examples make greater use of the remote interpreters' functionality, but employ no bandwidth or response time saving techniques other than those already discussed. Savings estimates will therefore not be given.

10.4.1  Autonomous Network Roaming

We now present a more intelligent descendant of the surveyor family of network programs: the autonomous surveyor. The autonomous surveyor can travel through a network without prior knowledge of the underlying topology. Selecting a next hop based on local data is in essence a simple extension of the serverside reacting case, but it allows for new classes of algorithms. The distributed surveyor already showed how locality can be exploited to reduce global problem-solving complexity. Autonomous processing generalizes this idea.
1010
		; threshold test
		gti 1000                
		bez normalrun-pc 

		; send to neighbour
	      specialcase:
		getdst                      
		calls "if_getallneighbours"  
		push 0                  
		send               
		; resend to this host (loop)
		push specialcase        
		getdst             
		forwto        
	
Figure 10.5: Distributed processing
   
1010
		forw

		; go home if resources
		; are used up
		getrb
		lti 2
		bne gohome-pc

		; get and goto a next host
		getdst
		calls "if_getnexthop"
		forwto

		; go to client
	      gohome:
		push athome
		getsrc                   
		send 

		; return info
	      athome:
		demux
	
Figure 10.6: An autonomous surveyor packet

The example program shown in figure 10.6 uses autonomous processing to select a next hop based on local data. Sent into the network without knowledge of the topology, the autonomous surveyor has to design its own route through the network. The algorithm used for selecting the next destination directly impacts which nodes can be accessed. We chose a simple heuristic: select as the outgoing interface the entry listed directly after the incoming interface in the iFace table. A round robin scheme is used to connect the list's outer elements. This will only allow us to traverse the outer edges of a densely connected network. In the honeycomb-like structure we selected for testing there will therefore be a number of nodes that are not traversed. We will provide a more robust solution in the next example. By placing the interface selection algorithm in a service the resulting SNAP packet stays very simple, as shown in figure 10.6.
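The selection rule itself can be stated in a few lines. The Python sketch below is only an illustration of the heuristic; in Splash the equivalent logic lives inside the if_getnexthop service and consults the node's actual interface table, and the interface names used here are invented.

    # Round-robin next-hop heuristic: leave through the interface listed
    # directly after the one the packet arrived on, wrapping around at the
    # end of the table. The interface names are invented for illustration.
    def next_hop(interfaces, incoming):
        i = interfaces.index(incoming)
        return interfaces[(i + 1) % len(interfaces)]

    iface_table = ["eth0", "eth1", "eth2"]            # stand-in for the iFace table
    print(next_hop(iface_table, "eth0"))              # -> 'eth1'
    print(next_hop(iface_table, "eth2"))              # -> 'eth0' (wrap around)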
The most apparent use of an autonomous agent is network discovery. This in itself has a number of applications, mainly in highly dynamic and self-reorganizing networks, such as ad hoc and peer-to-peer nets. Hop-by-hop destination selection has yet another practical benefit over polling. Because this application selects its next destination from the set of neighbouring nodes, disabling forwarding has no impact on its ability to move through the network. Similarly, as long as we do not disable all interfaces, access to a node is retained, even when the routing tables do not reveal this. In such situations, hop-by-hop traversal is necessary to fix problems that cannot be solved with SNMP.

10.4.2  Stateful Network

So far, we have discussed approaches to network management that act directly on the available data. When no additional factors influence the execution of a program we call the program stateless, i.e. it does not depend on an internal state. The applications discussed are not completely stateless, as execution depends on external data, namely SNMP variables, and on the packet's internal stack. However, so far we have not used these values to actively guide a program's execution beyond taking immediate counteractions based on a threshold value. A stateless environment has inherent limitations. One shortcoming became apparent in the previous example: since the autonomous surveyor had no knowledge of its current location, it used a simple heuristic to select the next hop. The result was that the surveyor could not reach some of the internal nodes of our test network.
Guaranteeing access to all nodes can be accomplished by tracking previous behaviour. A single variable containing the last exited interface on a node can be used to select a different outgoing interface each time we visit this node. Remembering all visited hops on the packet's internal stack is practically impossible due to the maximum size of the packet and the processing overhead it would entail. More importantly, each time a packet is destroyed the accumulated state is wiped completely. Instead, we will use the previously mentioned data dictionary to add state to the network.
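A minimal sketch of this idea follows, assuming the data dictionary is visible as a simple key/value store; the key name and the dictionary interface are illustrative, not the actual Splash API.

    # One variable of state per node is enough to cycle through all interfaces.
    # The dictionary interface and the key name are assumptions for illustration.
    def stateful_next_hop(dictionary, interfaces):
        last = dictionary.get("surveyor.last_if", -1)   # last exited interface
        nxt = (last + 1) % len(interfaces)
        dictionary["surveyor.last_if"] = nxt            # remember for the next visit
        return interfaces[nxt]

    node_dict = {}                                      # the node's data dictionary
    ifaces = ["eth0", "eth1", "eth2"]
    print([stateful_next_hop(node_dict, ifaces) for _ in range(4)])
    # -> ['eth0', 'eth1', 'eth2', 'eth0']: repeated visits cover every interface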
To demonstrate the advantage of using a stateful network we have added new NM functionality to our SNMP interface. The augmented library can access a remote SNMP daemon as well as the local one. In accordance with the interoperability argument, it would be useful if not every node in a network needed a local Splash daemon. By using remote SNMP access, Splash can serve as a proxy server for devices that for some reason are unable to run Splash themselves. However, to do so one needs to know the location of these systems.
This example uses the previously mentioned techniques to create a localized view of the NM topology. A discovery packet based on the distributed surveyor searches for all Splash-enabled neighbouring nodes. Before jumping to a new node it writes its current knowledge of the network to the local data dictionary. When a node has no Splash daemon running, the packet gets lost and cannot update the dictionary. Otherwise, the intermediate results are overwritten by the packet on its return trip. An intermediate result left in a dictionary therefore refers to a system running either stand-alone SNMP or no NM tools whatsoever. A second packet can distinguish between the two by sending a test SNMP request to the system.
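The marking scheme can be illustrated with a short sketch. The status values, the dictionary layout and the recursive traversal below are modelling assumptions; they only mimic the write-before-jump and overwrite-on-return behaviour described above.

    # Model of the discovery marking scheme: mark a neighbour before jumping,
    # upgrade the mark on the return trip. Entries that stay 'unconfirmed'
    # therefore point at systems without a Splash daemon. All names and
    # status values are invented for illustration.
    def discover(node, network, splash_nodes, dictionaries):
        for nb in network[node]:
            if nb in dictionaries[node]:
                continue                              # neighbour already probed
            dictionaries[node][nb] = "unconfirmed"    # write before jumping
            if nb in splash_nodes:                    # the jump only succeeds on
                discover(nb, network, splash_nodes, dictionaries)
                dictionaries[node][nb] = "splash"     # overwritten on the return trip
            # otherwise the packet is lost and the entry stays 'unconfirmed'

    network      = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    splash_nodes = {"A", "B"}                         # C runs stand-alone SNMP only
    dicts        = {n: {} for n in network}
    discover("A", network, splash_nodes, dicts)
    print(dicts["A"])                                 # {'B': 'splash', 'C': 'unconfirmed'}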
After the discovery phase, an autonomous agent can be sent to retrieve data from any node N in the network. Traveling through the network based on the information found in the dictionaries along the way, the packet knows which nodes are running Splash. When the agent lands on the Splash-enabled node closest to N, it sends the SNMP request and continues its internal program as if it were being executed on node N itself.
One can think of more advanced scenarios where this two-tier topology can be used to bridge the functionality gap between flexible active networks and inflexible existing infrastructure. In any case, adding state to the network can increase the option space of a roaming application. Since SNMP relies on a pre-configured network it is not suited to highly dynamic networks. The previous example has shown that active networks can be applied naturally to these environments and that an implementation can be created in Splash.
At this point we should make clear that Splash was not originally devised to support inter-agent communication. The implemented data dictionary can be used as a method of indirect communication, but other agent platforms may include more graceful solutions.

10.5  Functionality Overview

Comparing the functionality of two systems is harder than comparing performance due to a lack of hard quantifiable data. However, we have supplied a number of use-cases where SNMP usage either entails too much overhead or cannot be applied at all. Table 10.1 gives an overview of the discussed techniques and the improvements each one brings over simple polling. We have also demonstrated that these techniques can be implemented with Splash. Considering that Splash can also carry out all of SNMP's functions aside from traps, we state that it is superior in terms of functionality. Adding traps to the system has not been undertaken, but it should be apparent that doing so brings no new technical hurdles. The lack of traps is therefore, in our view, not a convincing counterargument to this claim.
Technique                          Improvement over simple polling
serverside data aggregation        reduces network bandwidth utilization
serverside reacting                reduces response time
serverside trend analysis          improves on both metrics
network wide data aggregation      reduces network bandwidth utilization
network wide reacting              reduces response time
autonomous network roaming         expands upon application space
stateful network                   expands upon application space
Table 10.1: Overview of functional techniques
Each of the techniques mentioned offers an independent advantage, for which we tried to select an insightful example. The presented use-cases are kept brief for the purpose of clarity. Many of them can also be expressed in other networked environments, such as mobile code systems. The real power lies in combining the extended functionality with the flexibility offered by active network environments. AN-based systems allow the user to combine useful techniques into specialized problem solvers on a case-by-case basis.
One should note that the algorithms discussed here do not necessarily exhaust the option space offered by our system. Other methods for improving network management currently implemented in legacy applications, for instance those discussed in [] and [], can also be ported to Splash. The obvious benefit is freedom of selection: an administrator can trade off functionality against performance for each individual use-case, scaling from simple polling to the latest advances in network management research. As such, AN systems can be used to experiment with new algorithms, contrary to SNMP. Furthermore, it is unnecessary to add new software packages for each new use-case.
These tests conclude the experimental part of our research. What remains is to point out issues that have not been dealt with and to draw final conclusions. In the following chapter we will pinpoint shortcomings of Splash's current implementation and introduce follow-on research projects.

Part 5
Inference


Chapter 11
Future Work

Splash was designed first and foremost as an experimental testing platform. In its current state the system can handle all requests described in this document and possibly many more. Splash can therefore serve as a platform for various interesting academic pursuits. In this chapter we will discuss a number of issues that might make up interesting follow-up research and software development projects.

11.1  Research

In our research we've mainly been comparing performance figures between SNMP and Splash. Building on previous related work, we have tried to lay a foundation for a new round of research into active networking by showing that high performance processing is possible using AN technology.
Considering the current status of the Splash platform and the directions in which academic research is heading, we will now suggest research topics in two fields that we believe can benefit from active networking. Two other pointers are also given concerning research into active networking itself, naturally using Splash as a starting point.

11.1.1  Zero Configuration Network Management

Advances in network management technology have lagged behind primarily because of the undisputed position of SNMP. In the last few years various research projects, including this one, have shown experimental results indicating that the main reason for using SNMP, its unparalleled performance, no longer holds.
At the same time the research community has broken new ground in network technology by shifting its attention from the well known static networks to more dynamic systems. Ad-hoc, mesh and peer-to-peer networks are but a subset of the new networking paradigms under investigation. These fluid networks demand features from the management infrastructure not envisioned years ago and therefore not catered for in today's NM systems.
One of the more promising directions in network management is called zero configuration. Suited especially to environments where connections are highly volatile, zeroconf tools handle network management tasks without human intervention. Research in this area is also followed closely by traditional players in the field because of the huge savings in personnel costs it can bring to an organization. Zero configuration IP networking is being investigated by the IETF zeroconf working group [,]. A notable example of zero configuration in practice is Apple's Rendezvous []. A more general framework for on-demand networking, including automation tools for higher level management tasks such as resource handling negotiation, is the Open Grid Services Architecture [].
We've tried to cater for expansion into this field. The back-end functionality is already largely in place: Splash's access to the complete OS subsystem surpasses the functionality of traditional SNMP, and the service infrastructure allows rapid inclusion of additional required tool sets. Finally, Splash is in the unique position of being able to work on different levels of the network, bypassing traditional structures such as the routing table if necessary.
Research into zero configuration using Splash will deal primarily with writing the packets that carry out resource negotiation and network discovery, not with modifying the underlying software platform. All in all it should be possible to show substantial improvements in this field in a relatively short timespan using Splash. The fact that zero configuration encompasses many small issues makes it a candidate for work ranging from a few weeks up to at least a full semester.

11.1.2  Technology Integration

Another subject that might be of interest is the adaptation of Splash to other domains. We have thus far only looked at the field of network management, while other AN research groups have taken up specialized fields such as realtime multimedia delivery.
With the help of, among others, SNAP it is now safe to say that performance is not an obstacle to everyday active network deployment. It is therefore possible to start exploring uses of AN systems outside the narrow scope they've been confined to so far.
Distributed applications are currently being developed using a so-called web services infrastructure. The Open Grid Services Architecture [] has been defined to allow for standardized web services in the near future. Web services can be seen as a special kind of remote execution. In keeping with the end-to-end argument, it is perfectly understandable that the emphasis in this field lies on remote execution at the end nodes. However, active networks can be used to ease the development of such initiatives by allowing greater flexibility in the underlying environment.
We believe Splash makes a great candidate for this kind of task by virtue of its extensibility. It already allows for on-demand loading and execution of legacy code, runtime alteration of the environment and migration of agents. The missing features are mostly practical in nature: services or instructions should be added to actually download and execute legacy code. Once this is added it should not be hard to show that Splash can underlie a grid architecture. The next step is to demonstrate the advantages of using an active network as opposed to using passive IP and custom-built tools to handle this functionality.
Naturally, instead of embedding Splash in a multi-tier grid environment, one could also search for applications that can directly be added to the Splash service architecture. The workload for this type of research therefore cannot be estimated at this point.

11.1.3  Flexible Agents

The design of SNAP packets allows them to be handled very quickly, but it also holds back deployment of Splash in numerous situations. While we do not intend to overthrow the current Splash implementation, we observe that the platform would benefit greatly from a more flexible agent infrastructure.
Naturally, any extension to the core SNAP-ee should at all times remain backward compatible to stay relevant in the field of high performance active networking. Many flexible, yet relatively slow, alternatives already exist. Merging the advantages of both systems could nevertheless make an interesting research project.
This is not the right place to go into detail on how such a task should be accomplished, but we will give a few pointers. During our research we observed that task expressibility in Splash was limited by the lack of agent cooperation. Missing language constructs are another problem that frustrates the programming of Splash. To overcome these issues we suggest the inclusion of various features that could be considered perilous to resource consumption. Inter-agent communication and extended language constructs could be governed similarly to network utilization. An interesting solution to this problem has been implemented by Kind et al. [].
We believe these two examples are merely a subset of the useful constructs already found in higher level mobile agent approaches. Taken from another, yet related, field is the concept of remote code loading: allowing the execution of machine-dependent platform code through on-demand services can speed up processing and open up a vast array of useful tools to Splash developers.
SNAP has found a niche in the AN world by providing high performance and secure operation. Yet, completely locking out certain advanced features has limited its scope somewhat. There are good indications that at least a number of these features can be added to the system without sidestepping the original goals. The precise how and what could make up an interesting research project, consisting of substantial reading into advanced mobile agent approaches combined with the implementation of an, at least partly, governed version of these constructs. No such mix of robust, fast and high level constructs currently exists. Research in this direction will probably result in a truly original contribution to the scientific community. One should therefore also expect it to take up at least a few months of work, possibly up to a full semester.

11.1.4  Security

The preceding examples hopefully showed some interesting new directions for Splash. During their discussion we conveniently disregarded security issues. However, security is a major concern for a networked platform and Splash currently contains no such features. It can be expected that authentication and encryption are features that will be demanded for everyday deployment of active networks. The SafetyNet [] initiative was started as an acknowledgment of this fact.
For Splash or a production ready relative to be accepted as a viable solution to networking problems it is imperative that the security issue is dealt with in a suitable way. Especially the emphasis on high performance calls for an implementation that differs from earlier attempts at securing AN platforms.
This study is largely open ended. Powerful cryptography must necessarily be a part of any security system. Some kind of scaling from unsafe yet fast to secure and relatively slow execution will be necessary to keep true to Splash's original intents. For a single person we expect this work to take at least three months. However, the workload can be distributed among a number of assignments, both academic and engineering. For instance, implementing a security infrastructure while disregarding its implications for performance shouldn't take more than a few weeks.

11.2  Software Platform

While Splash is mainly a research platform, there are a number of practical issues, ranging from sloppy coding to missing features, that could make up a nice software project. Setting aside the question of originality, the following proposals deal mostly with polishing the available platform. The primary goal here is to strengthen Splash's position as a research and development platform.

11.2.1  Performance

Performance issues were of primary importance in the development of both SNAP and Splash. Nevertheless, further optimizations are most definitely within reach. The current installment of Splash has been optimized locally where strictly necessary. A more general solution can be found that removes certain subtasks altogether. Upon close inspection one can see that the current code base is far from lean. For example, data conversion needed between various subtasks reduces overall processing speed considerably. Creating a global data handling standard would be beneficial on its own. Other problem areas include startup time due to service loading and client communication waiting times.

11.2.2  Network

As discussed in section 8.2.3, the inter-server and client communication protocols overlap in theory, yet in the current implementation the two are strictly separated. One programming task would therefore be to merge the two and possibly place them in a separate library decoupled from the daemon code. A preliminary implementation has been made for the more elementary client communication. The new infrastructure, SNAP_demux_handler, separates the Splash communication layer from the underlying transport protocol and can currently handle raw IP, UDP and UNIX pipe protocols. Extending it to encompass TCP, IPv6 or lower level protocols shouldn't pose any problems.
Apart from extending the handler's protocols, we would also like the handler library to serve as a basis for inter-server communication. This can be accomplished easily by moving the marshalling and unmarshalling functions - which are extremely simple - into the library. Furthermore, the separate recv(..) loops, currently existing for inter-server, client-to-server and server-to-client communication, should be merged as well. The networking code is the most duplicated code in the existing distribution and should therefore, for maintainability purposes, be rewritten as soon as possible.
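As a sketch of the direction this could take, the Python fragment below outlines a single transport-agnostic receive loop shared by all message types. The class names, the method signatures and the choice of UDP as the example transport are our own illustrative assumptions; they do not reflect the existing SNAP_demux_handler interface.

    # Hypothetical sketch of a unified handler: one transport abstraction and
    # one recv(..) loop shared by inter-server and client/server traffic.
    # All names are invented; they do not mirror the existing SNAP_demux_handler.
    import socket

    class UdpTransport:
        """One possible transport; raw IP or UNIX pipes would expose the same
        two methods, so the loop below never needs to know the difference."""
        def __init__(self, port):
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            self.sock.bind(("", port))
        def recv(self):
            return self.sock.recvfrom(65535)          # -> (payload, peer address)
        def send(self, data, addr):
            self.sock.sendto(data, addr)

    def serve(transport, handle):
        """The single receive loop; (un)marshalling is hidden behind handle()."""
        while True:
            data, addr = transport.recv()
            reply = handle(data)
            if reply is not None:
                transport.send(reply, addr)

    # Example (port number invented): serve(UdpTransport(7777), lambda data: data)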
A subject concerning both this and the previous task is data formatting. Ideally, an identical datastructure would be used throughout the entire system, including the networking code. The preliminary networking framework was created for handling demux statements and as such can only handle raw text strings. A more advanced framework should replace this, for instance the one already in use in the service interface.

11.2.3  Small issues

The issues discussed above are in our opinion the most problematic in the present distribution of Splash. That said, many smaller problems exist that users may need to work around for the time being. Both as a guide for future work and as an introduction to the practical problems one might encounter when using Splash, we will quickly deal with a number of these smaller issues here.
Instructions and Services   There exists a clear, yet unnecessary, distinction between SNAP instructions and services. The experiments showed that Splash packets can be relatively large due to their service-calling code. One way to avoid this would be to create bytecode instructions based on the service name. A hashtable is already used as a lookup structure, but the keys are encoded inefficiently as raw text strings. Instructions, on the other hand, are encoded into bytecode by the assembler. These two solutions could be merged to create a consistent bytecode implementation; a one-way function could be used for the service name to bytecode translation.
Following such an operation, the gap between instruction and service handling could be bridged, since all bytecode instructions could then also be referenced through the hashtable, instead of through a separate switch statement, as is now the case.
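One possible shape of such a one-way translation is sketched below. The hash function, the opcode range and the collision handling are assumptions chosen for illustration; they do not describe the existing assembler or daemon.

    # Sketch of a one-way service-name-to-bytecode translation. The opcode
    # range and the use of SHA-1 are illustrative assumptions only.
    import hashlib

    SERVICE_OPCODE_BASE  = 0x80        # assumed start of an unused opcode range
    SERVICE_OPCODE_SLOTS = 128         # assumed number of free opcode slots

    def service_opcode(name):
        digest = hashlib.sha1(name.encode()).digest()
        return SERVICE_OPCODE_BASE + digest[0] % SERVICE_OPCODE_SLOTS

    # Assembler and daemon apply the same function, so the raw text key never
    # has to travel inside the packet; collisions must be caught when a
    # service is registered.
    table = {}
    for svc in ["if_getallneighbours", "if_getnexthop"]:
        op = service_opcode(svc)
        if op in table:
            print("opcode collision for", svc, "- remap or rename the service")
        else:
            table[op] = svc
    print(table)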
Unifying Front-End   This work has dealt with comparing Splash with SNMP. On multiple occasions we've stressed the importance of interoperability between the two systems. While possible in theory, no framework for interoperation exists so far. A unifying front-end capable of handling both Splash and SNMP results is therefore desirable for practical deployment. Similarly, which of the two platforms is preferable for sending a specific request could be determined automatically. Both the preprocessing and the postprocessing involved could therefore be abstracted away from the user through a special purpose front-end user interface.
Code Cleanup and Documentation   Pushed through the development cycle a number of times, the Splash code base looks far from consistent at the moment. A serious tidying up would be useful. Many stale macros and duplicated functions can easily be removed, for instance one of the two hash tables used by the daemon.
A more rigorous cleanup is the removal of a large amount of stale code related to
  1. a previous kernelspace implementation. This behaviour is broken in the -wjdb build and all existing references should be removed.
  2. alternative packet formats. A previous implementation had support for multiple packet formats. For practical reasons we only use the most efficient format. Some handlers still exist for the alternate layouts, as well as their complete definitions. However, the new interfaces have not been designed with multiple packet formats in mind. Therefore the code base should either be purged of references to these formats, or their handlers should be extended to the new interfaces. For efficiency as well as readability reasons we suggest the former.
Related is the issue of documentation. As of now, the Splash internals have been documented using an automated tool, doxygen. The output of this tool can be found on the Splash website []. However, we only started using this tool after completion of the project, so instructive commentary is sparse. Tidying up the code base should go hand in hand with the addition of useful commentary, preferably in a syntax recognized by doxygen.
The projects discussed here identified some shortcomings in the current implementation of Splash. From these observations and the outcomes of the experiments conclusions will be drawn in the next - final - chapter.

Chapter 12
Conclusion

The research executed in this work should have demonstrated that an active networking environment can be used as a network management tool. Our goal was twofold: first, to show that such a system can surpass the current de facto standard tool, SNMP, in terms of functionality; second, to show that it need not incur a large performance penalty.
In chapter 2, a number of active networks were surveyed, most of which were deemed inadequate for network management tasks due to the relatively high processing overhead they incurred. After presenting some weaknesses of SNMP and how an active network might be used to overcome these in chapter 3, we presented a blueprint for such an AN-based application in chapter 4 and identified, in chapter 5, a number of test scenarios with which network management toolkits can be compared.
In part 5.5 an implementation of the aforementioned design, Splash, was presented. Splash implements an architecture that combines mobile agent support with standard SNMP. This architecture minimizes the overlap of the code bases for the underlying SNMP daemon and active network while providing the functionality of both within a single process. Splash combines the widely deployed net-snmp package with a user-space SNAP active packet environment. The hybrid nature of Splash allows it to circumvent a main drawback of other SNMP alternatives, i.e. lack of interoperability.
The experiments discussed in part 8.3 serve to verify the claims set forward in our thesis. To do so properly, they must show that (1) Splash can, contrary to previous AN-based management tools, execute low-level requests at roughly the same speed as SNMP, while (2) at the same time reducing management traffic or improving overall responsiveness compared to SNMP in use-cases where derived or distributed results come into play. The first task is tackled in chapter 9, the second in chapter 10.
The performance experiments showed that Splash's performance is indeed only fractionally lower than SNMP's for simple requests. Furthermore, it follows from the design of our system that it can interoperate with the reference system. If the incurred slowdown is too high a burden, one can always choose to use SNMP requests for specific low-level tasks and reserve SNAP agents for complicated scenarios.
The functionality tests showed that with Splash it is possible to deploy algorithms that can (1) reduce network traffic, (2) decrease response times and (3) solve problems SNMP is incapable of handling. The techniques used, discussed in chapter 10, are independent of one another and each applies to specific situations. The flexible active network framework allows one to combine these and other techniques on a case-by-case basis.
Finally, shortcomings of our implementation were discussed in chapter 11. They showed that Splash cannot yet replace existing systems. The most notable weakness is its lack of security. While the underlying SNAP interpreter ensures safety with regard to resource consumption, security and robustness are qualities not yet found in, but easily added to, Splash.
One can see that Splash is not a production-ready management application as it stands. However, we never set out to create such a system. Splash's main contribution is that it shows that active networks can be applied to network management in practice. Proving our thesis, Splash is able to surpass SNMP in terms of functionality while at the same time performing nearly as well as SNMP in traditional tasks. The underlying active network and extensible service infrastructure will allow it to adapt to unforeseen surroundings, both inside and outside of the network management domain. By doing so, Splash creates a safe environment for the adoption of new networking practices. A trial project of the technology in a metropolitan area wireless network, demonstrating the practical value of active networking, is currently underway. To further encourage research in this field we have made Splash publicly available [].