Interviews by Stephen Ibaraki, FCIPS, I.S.P., MVP, DF/NPA, CNP
Neal Allen, International Top-Ranking Network Service/Troubleshooting Authority; Level 3 Escalation Engineer, Fluke Networks
This week, Stephen Ibaraki, FCIPS, I.S.P. has an exclusive interview with Neal Allen, Fluke Networks.
Neal Allen has been working for Fluke since 1989, and has been in the Fluke Networks networking products group since 1992 when it formed. Special focus has been placed on technical marketing, product development, beta testing, and special projects. In 2002 the Technical Assistance Center was restructured, and in the capacity of a level 3 escalation engineer, Neal has dealt with many of the issues surrounding analysis of the higher OSI layers while still maintaining awareness of the lower layers and how they affect monitoring and troubleshooting.
Several of the many special projects include:
Opening Comment: Neal, you bring a lifetime of proven experience and substantial contributions to the networking field. We thank you for doing this interview with us and sharing your deep insights with our audience.
A: I am always happy to share my discoveries with others.
Q1: Can you profile your current role with Fluke?
A: In my current capacity in the Technical Assistance Center (TAC), I usually only see the more unusual or intractable problems related to our diagnostic and monitoring products. Although it is possible to reach me on the first call into the TAC for some products - we use a skill based queue system - it is much more likely that I will be working with the technician with whom you first speak. Internally we describe escalations as falling into two categories: hand-off escalations where I become the primary contact, and "drive-by" escalations where one of the other TAC technicians walks over to my desk and asks questions about an open issue. I get a lot of drive-by issues.
I also have many secondary tasks which I call "hobbies". My hobbies include everything from working with engineering on product planning and beta testing to trade show support, sales support on key accounts, training development, and mentoring/tutoring other TAC staff.
Q2: Going back in history, what lessons do you want to share from your work at the Olympic Games?
A: That was a lot of fun, and provides an excellent example of what I have noticed as a significant and disturbing trend in the 40-50 new-hire interviews I have conducted in the last year or so. Almost everyone involved in network support seems to have either forgotten all of the basics, or never really knew them.
At the Olympics I was first tolerated as a necessary evil. Our corporate participation in the Games came fairly close to the actual event and all of the network planning had been in process for well over a year before we got involved. There was a core of high level network architects and systems analysts who designed and implemented the Games network. When I arrived on-site as part of the Fluke Networks delegation we were butting into the existing organization. At first I had to get fairly pushy to be permitted to accompany the other support staff on trouble calls. That changed fairly fast though, as I was able to provide some value added services: I knew how the deployed technology actually worked, and when OSI Layers 1 - 3 were the source of reported problems I was able to not only help isolate the problem quickly, but I could also interpret the symptoms and often trace them back to the root cause. In short order I was sought out by some of the analysts when they were called out of the NOC for troubleshooting the venues.
An example of one problem was at a fairly remote venue where track and field competitions were being held. The network at the venue was experiencing considerable slowness and occasionally dropped connections. The first stop for the network analysts was to go to the site with a notebook PC and log into the router console. While they were poking around the configurations I took a walk and examined the physical network installation. As I walked along one of the tracks, I kicked the network cables off the top of about 20 1" generator fed power cables (to make sure they were a minimum of three feet from the power cable electromagnetic fields). My walk around the field took about 15 minutes. When I returned to the on-site NOC they reported that the network problems had gone away, and they had no idea why. They were very concerned that the respite was temporary. I spent about 30 minutes explaining why the problem was actually solved. During that time the explanation progressed from a high level on down to the inner workings of the MAC Layer protocol as the questions became more and more specific as to why I would interpret a particular error the way I did. The analysts had been focused on higher Layer issues for so long they had forgotten how the MAC Layer protocol worked, so the errors did not point them toward the source of the problem.
This is typical of what I have seen in recent interviews. Almost everyone I talk to starts "at the keyboard" and troubleshoots from the router console upwards in the OSI Model. Layers 1 and 2, as well as part of Layer 3 are treated like your average electrical outlet. "I plugged it in, of course power is there. What could possible go wrong with the power?" Nobody seems to be able to offer a working description of how the technologies actually work "below the keyboard" (at OSI Layer 3 and down) and that creates frustration and inefficiency in troubleshooting. Often the root problem was never exposed or understood at all.
I largely attribute this situation to the ubiquitous presence of switches. In a switched network any cable fault, marginal NIC or switch port, or environmental influence will almost always limit its effects to a single connection. Thus it is not thought of as a network problem at all; it is an end-user PC problem. Add to that the reliability increases that have been made over the years and you have a situation where router and VLAN configurations occupy most of the time and effort of those people who have the router password. All of the other problems tend to be handled through the Help Desk. The Help Desk doesn't know how the network operates and doesn't have access to network failure or configuration data, and the people with access to the network don't care about end-user problems. And we wonder why there is finger-pointing.
Q3: Do you have useful tips to add from your work at the Pentagon after the 9/11 attack and your troubleshooting efforts on the aircraft carrier?
A: The Pentagon was pretty straight-forward. Pieces of the network were missing and had to be replaced. Everyone there was great and worked together despite all of the group and service boundaries which had to be accommodated.
The aircraft carrier was a completely different situation. I am not at liberty to describe the exact situation, so let's equate it to a comparable situation where e-mail was sometimes very slow or completely unavailable for short periods of time. This is typical of today's switched network problems. The network was just fine with typically single-digit utilization per port, but was accused of being the source of the problem. I was onboard to find the "network" error and it didn't take long to disprove that theory.
What I suspected, but didn't have time to completely unravel, was that interactions between the servers was the cause of the problem. We now offer a product which would have done this for me, but at the time I was not able to elicit a satisfactory description of the interactions between the servers involved in login, authentication, and e-mail services. Without that basic dependency and relationship information I would have to perform protocol captures of all of the traffic to and from each server identified as troubleshooting progressed in order to create that interaction diagram. I remain convinced that one of the servers was doing extra duty with some other application and when it became burdened with that extra duty it was slow or unresponsive in supporting e-mail. Because of the multi-tier architecture which involved interaction between multiple servers and services before e-mail appeared on the client user interface it was highly likely that a validation or authentication step was not taking place in a timely manner. The e-mail servers themselves appeared to be well configured and lightly utilized.
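The interaction diagram described here can be sketched by aggregating captured conversations into a dependency map. Below is a minimal Python sketch of that idea; the server names, services, and flow records are hypothetical illustrations, not data from the actual incident:

```python
from collections import defaultdict

# Hypothetical flow records: (source, destination, service) tuples that
# protocol captures at each server might yield.
flows = [
    ("client", "mail-srv", "smtp"),
    ("mail-srv", "auth-srv", "ldap"),
    ("auth-srv", "dc-srv", "kerberos"),
    ("client", "mail-srv", "smtp"),
]

def build_dependency_map(flows):
    """Aggregate captured conversations into a server-interaction map."""
    deps = defaultdict(set)
    for src, dst, svc in flows:
        deps[src].add((dst, svc))
    return deps

# Print the dependency chain; a slow hop anywhere in it delays the client.
for server, targets in sorted(build_dependency_map(flows).items()):
    for dst, svc in sorted(targets):
        print(f"{server} -> {dst} [{svc}]")
```

Even a crude map like this shows why the e-mail servers can look healthy while the client experience is poor: the delay may live in any upstream dependency.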
There are four communities within the typical organization which must cooperate in order to isolate and solve this sort of problem: network support, server support, applications or database support, and management. In most situations the problem would be a natural reluctance of people to volunteer information which might reveal that the problem originates in their area of responsibility. It is very hard to troubleshoot if you cannot gather information and symptoms. It is even harder if nobody knows which servers or services are involved and this was the problem I observed. Up until I started asking questions, there was no reason for people in the different areas of responsibility to share detailed information. In fact, security considerations discouraged open sharing. Once we started asking it was hard to find all of the people who knew the answers to questions about server and application interactions. I left before some of the shore support people who held critical knowledge were located. I hope they isolated the cause because the network support people were once again being accused of the slowdown, even though their part was working fine.
As networks become more segmented, even as they become more integrated, this problem will continue to get worse. I was recently asking myself how I would troubleshoot an application interaction problem like this when the servers involved are running as virtual systems on a blade server. I would have no access to data passing over the virtual Ethernet between the virtual servers on the blade and no access to data passing between blade servers in the chassis. My view would be limited to traffic between the blade chassis and the network if I used a span or tap on the right uplink. The only other alternative I can think of would be to load monitoring software on each virtual server, and most people don't like loading things on their servers. To spice things up you could add spyware, virus, 'bot, or hacker activity into the equation.
Q4:Please share some stories from your work.
A: Since I am highlighting issues below Layer 3 as being important but ignored, here are a few.
Cause: After considerable "widening of the scope" of the problem it was determined that the UTP cable from the server farm was draped across the suspended ceiling. The distance to the roof was just the plenum space between the suspended ceiling and the actual roof - maybe one foot. The air conditioning system used the plenum space above this suspended ceiling and was automatically turned off at 4:30pm for this office (in Phoenix, Arizona in the summer). The heat from the roof would then increase the attenuation of the UTP cable until the link failed. When the roof cooled off in the evening the network would resume operation.
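The effect in this story can be roughly quantified. A figure commonly cited for Cat 5 UTP is that attenuation rises about 0.4% per degree Celsius above 20 °C; treat that coefficient, and the 22 dB baseline below, as illustrative assumptions rather than spec quotes:

```python
# Rough sketch of how UTP attenuation grows with temperature.
# The 0.4 %/°C derating above 20 °C is a commonly cited figure for
# Cat 5 cable (an assumption here, not a quoted standard value).

def attenuation_at_temp(atten_20c_db, temp_c, coeff=0.004):
    """Scale a 20 °C attenuation figure to a higher temperature."""
    if temp_c <= 20:
        return atten_20c_db
    return atten_20c_db * (1 + coeff * (temp_c - 20))

# Assume a long Cat 5 link measuring about 22 dB of attenuation at 20 °C.
for temp in (20, 40, 60):
    print(f"{temp} °C: {attenuation_at_temp(22.0, temp):.1f} dB")
```

A link engineered with little headroom can cross its loss budget on a hot afternoon in a plenum space, then recover in the evening, exactly the pattern described above.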
Cause: The line-of-sight microwave link between buildings was being interrupted by a construction crane. The problem was not immediately apparent because the microwave bridge equipment did not drop link on the network side when the microwave was interrupted. It just dropped traffic after the buffers were exceeded because the path was down. This was discovered only after troubleshooting everything else and finally mounting an optional video camera to the antenna assembly. When a network management application sent an alert indicating that the remote router was not reachable we finally observed the crane passing in the video.
Cause: After gradually replacing every bit of cable, every server, and all network infrastructure in the office, the problem persisted. A week into the problem I was expressing frustration with the office manager, who was getting fairly short-tempered, when she suddenly interrupted saying "look - it's going to happen again." She was watching one of the CRT monitors (thankfully they were not flat screens), which presented a "warped" screen image for just a tiny moment. I wish she had told me about that symptom sooner. The express elevator to the penthouse office was passing right behind the wall. The elevator's passage (a metal box in a metal shaft) created a huge electromagnetic field as it passed the accounting department. The accountants moved to a window office since it was not cost effective to screen that electromagnetic field out of the room.
Observed symptoms: Traffic was observed by the protocol analyzer, but not the transactions which were sought. In fact, very little traffic was seen despite the activity indicated by the server application and the switch console.
Cause: The hub was not just a shared media hub. It is almost impossible to purchase an exclusively shared media hub today; in fact, most "hubs" are now unmanaged switches. The protocol analyzer was an old but loved piece of luggable hardware owned by the customer I was working with. It linked at 10 Mbps. The 10/100 hub permitted the switch and the server to link at 100 Mbps. Between the 10 Mbps collision domain and the 100 Mbps collision domain in the "hub" was an OSI Layer 2 bridge. Same-speed connections were in the same collision domain. If the transaction was specifically addressed to the server, then the traffic was never forwarded across the bridge to the protocol analyzer. All the protocol analyzer saw was broadcast traffic, the first query for new conversations, and the next frame for traffic which had aged out of the bridge forwarding table in the hub. By forcing the switch port to 10 Mbps as well, we were able to see all of the traffic because it then passed through the same collision domain.
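The filtering behavior that hid the traffic follows directly from how a transparent (learning) bridge works. Here is a minimal sketch of the forwarding logic in the style of IEEE 802.1D; the port numbers and station names are illustrative only:

```python
class LearningBridge:
    """Minimal sketch of transparent-bridge forwarding (802.1D style)."""

    def __init__(self):
        self.table = {}  # station address -> port it was learned on

    def receive(self, frame_src, frame_dst, in_port, all_ports):
        """Return the list of ports the frame is forwarded out of."""
        self.table[frame_src] = in_port          # learn the source address
        if frame_dst == "broadcast" or frame_dst not in self.table:
            # Flood broadcasts and unknown unicasts to every other port.
            return [p for p in all_ports if p != in_port]
        out = self.table[frame_dst]
        return [] if out == in_port else [out]   # filter same-segment traffic

bridge = LearningBridge()
ports = [1, 2]  # port 1: 100 Mbps segment, port 2: 10 Mbps analyzer segment
bridge.receive("server", "client", 1, ports)     # server learned on port 1
# The reply also arrives on port 1: both stations share the 100 Mbps
# collision domain, so the frame is filtered and the analyzer sees nothing.
print(bridge.receive("client", "server", 1, ports))  # -> []
```

The analyzer on the 10 Mbps side only ever sees floods: broadcasts, unknown unicasts, and frames whose destination has aged out of the table.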
Q5: You participate in troubleshooting at Interop trade shows. Can you profile the most difficult problems in this environment and their solutions?
A: Due to the nature of a show like Interop, the biggest problem is in not saying anything politically incorrect. For example, exhibitors at these shows are often using the venue to introduce or launch new products. The product, such as new network infrastructure gear, may be making its debut outside the development lab for the first time. I can remember several instances where the design engineer was complaining that the Interop network was not working, while we were arguing that there wasn't any traffic coming out of their box, or that there was a problem with what did come out. How do you tell someone that their baby is ugly even as that person is entering the baby in a beauty contest? Furthermore, it is our job as part of the show staff to do whatever is possible to make the debut successful despite any little problems.
I love going to this show because I get to work with many of the products and new technologies which will begin appearing in our customers' networks over the next 6-12 months. It is like a sneak preview of what our customers will be calling us about. In fact, it is about the only useful training on what is "new" that I get each year.
I would say the biggest problem is in following two rules for troubleshooting:
When Fast Ethernet was still a proposed standard Barry Reinholt from the University of New Hampshire Interoperability Lab brought to the show most of the "working" Fast Ethernet switches which were in the process of being released to market by the various vendors. We used them to build one of the first multi-vendor Fast Ethernet production networks. Keep in mind that multi-vendor interoperability was a goal then, not a routine expectation. It almost worked great. We spent many, many hours troubleshooting until it was finally discovered that one of the fundamental operating rules of a bridging device was not being followed by one switch near the middle of the hierarchy. When the bridge forwarding table filled up it didn't flood traffic for addresses not in the table to all ports. It has been long enough now that I don't remember for sure, but I think it either dropped traffic for the new address which was not inserted into the table, or dropped traffic for the oldest address which it discarded from the table. Up until the forwarding table on one switch was full the entire deployment worked flawlessly. All of the symptoms pointed at a problem with one client, since everyone else worked fine. We didn't follow the two rules or we would have questioned the observed behavior more closely much sooner.
There is an instance of this which plagues small wireless deployments right now. When some low-cost APs' bridging tables fill up, they discard traffic for anyone not in the bridge forwarding table. If you reboot the AP, then a new list of wireless clients can occupy the table entries and operate just fine. If you are the eleventh client (using ten entries in the table as an example, as the table size appears to be somewhere near that number on one such AP), then everyone else can get to the Internet except you, despite your NIC driver showing good signal strength. Again, the symptoms pointed toward a misconfigured client.
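The difference between the standards-compliant behavior and the broken behavior in both stories comes down to one decision: what to do with a frame whose destination is not in the table. A small sketch (the ten-entry table size and port numbers are illustrative assumptions):

```python
def forward(dst, table, ports, in_port, compliant=True):
    """Decide where a frame goes when the forwarding table may be full.

    A standards-compliant bridge floods frames for unknown destinations
    to all other ports. The buggy devices described above instead
    silently drop them once the table is full.
    """
    if dst in table:
        return [table[dst]]                        # known: forward normally
    if compliant:
        return [p for p in ports if p != in_port]  # unknown: flood
    return []  # broken behavior: the "eleventh client" gets nothing

# Ten learned clients fill a hypothetical ten-entry table.
table = {f"client{i}": i for i in range(10)}
ports = list(range(12))
print(forward("client11", table, ports, in_port=11, compliant=True))
print(forward("client11", table, ports, in_port=11, compliant=False))
```

Because only the overflow client is affected, every observable symptom points at that one client's configuration rather than at the bridge.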
Q6: In your view, what are the most serious networking roadblocks for businesses?
A: Training. Training. And training.
Unfortunately, few businesses factor serious ongoing training into daily operations anymore. That is now the off-hours responsibility of the employee.
New-hires are often just out of a certification class and have little or no real experience. Furthermore, the training certification program is almost always designed around a particular vendor's offerings. This is great for the vendor, but not always great for the student. The part of the course which offers general technology information which they will be using when they get hired is often an after-thought addition to the course to "round it out". They teach that part as fast as possible, or skip it and tell the student to do some extra reading. The deep content in the course is related to the vendor product, which the student often won't be permitted to utilize until they have proven to their new employer that they are safe near the network. After all, would you give the password to the administrator account which controls your whole business to the new guy? I doubt it.
On the other end of the technology continuum are the people who got into networking in the 80s and had to learn all of the problems as they appeared. They have that knowledge buried in the back of their head, but never use it anymore. They are struggling to keep up with new product deployments on new technologies. They spend most of their time logged into a console or in front of the whiteboard.
The knowledge held by the new guy often relates to the newer or bleeding edge technologies because that is what was covered in class and the knowledge needed is lying unused in the senior staff's head. How do you transfer that knowledge in both directions?
Then there is the problem of interpreting the data presented by the network. I can't tell you how many times we have had customers call us with questions like:
Questions like these point to a fundamental lack of understanding of the principles of network operation. If the underlying technology were understood, then questions like this would not be asked.
My favorite is related to Ethernet Auto-Negotiation. It appears that almost everyone believes that if one link partner is negotiating, then you can do anything you want with the other end and it will figure it out. At the same time, they always configure everything to Full Duplex because auto-negotiation "didn't work." The correct answer is that a negotiating station is required to configure itself to Half Duplex if the link partner is not negotiating.
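The rule stated here can be captured in a few lines. This is only a sketch of the duplex outcome, not the full IEEE 802.3 arbitration (which also resolves speed and other capabilities):

```python
def duplex_outcome(side_a, side_b):
    """Resulting duplex per side, for settings 'auto' or 'forced-full'.

    Sketch of the rule in the text: an auto-negotiating port that
    detects a non-negotiating partner learns the speed via parallel
    detection but NOT the duplex, so it must fall back to half duplex.
    """
    if side_a == "auto" and side_b == "auto":
        return ("full", "full")  # both advertise; they agree on the best mode
    if side_a == "auto":         # B is forced: A cannot learn B's duplex
        return ("half", side_b.split("-")[1])
    if side_b == "auto":
        return (side_a.split("-")[1], "half")
    return (side_a.split("-")[1], side_b.split("-")[1])

# The classic mismatch: one end forced to full, the other left on auto.
print(duplex_outcome("auto", "forced-full"))  # -> ('half', 'full')
```

That mismatched pair mostly works at low load, then collapses under traffic with late collisions and CRC errors, which is why it so often survives casual testing.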
All of the supporting details are a lot to learn, but the basic principles of network operation and the operating rules of the common technologies are not. If the basics are not understood, then you will have many frustrating experiences.
Q7: What are the main issues that must be considered in a switched environment?
A: The most important issue is that you cannot successfully troubleshoot a switched environment unless you know what the switch is doing. You also shouldn't be designing the network unless you know that.
By walking up to the switch and looking at it you learn almost nothing. Unless you know how it was configured, and what OSI Layer(s) it is operating at you cannot easily interpret measurements and test results related to it. And you need a current map of the network architecture.
If you connect to a spare unused port and cannot get to the Internet, is it because:
You connected to your desired server, but things are really slow because:
Q8: What techniques do you use for troubleshooting a switch?
A: Troubleshooting switches has never been an easy task. It is made more difficult by the unavailability of the password. The people who do most of the day-to-day troubleshooting don't have the password and the people with the password troubleshoot "from the keyboard upward". This leaves a significant gap for helpdesk and first-in technicians to fill with few if any resources.
Although this is certainly not a complete list of the available options, this is what I came up with in a short time for methods which could be used to troubleshoot a simple "the network is slow" scenario involving a Layer 2 switch. This does not take into consideration the multitude of higher-Layer features available from today's switches, only a basic approach. Furthermore, each method below has serious pros and cons, and good reasons for using or not using it.
Method 1: Access the switch console
Method 2: Connect to a spare (unused) port
Method 3: Configure a mirror or span port
Method 4: Connect to a tagged or trunk port
Method 5: Insert a hub into the link
Method 6: Place the tester in series
Method 7: Place a Tap inline on a link
Method 8: Use SNMP-based network management
Method 9: Have the switch send sFlow, NetFlow or IPFIX
Method 10: Set up a syslog server
Method 11: Use the server (host) resources
Method 12: Use a combination of the above methods
[Editor's note: A link to a white paper with additional details about each method will be provided in a blog link in the IT Managers Connections blog (http://blogs.technet.com/cdnitmanagers/).]
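As an illustration of one of the lighter-weight methods above, Method 10 can be started with very little infrastructure. Below is a minimal sketch of a UDP syslog listener in Python; the port 5514 is an assumption for unprivileged testing (the standard syslog port 514 normally requires root), and the message parsing covers only the RFC 3164 "<PRI>" prefix:

```python
import socket

def parse_priority(message: bytes):
    """Split an RFC 3164 '<PRI>' prefix into (facility, severity, text)."""
    if message.startswith(b"<") and b">" in message:
        end = message.find(b">")
        pri = int(message[1:end])
        return pri // 8, pri % 8, message[end + 1:].decode(errors="replace")
    return None, None, message.decode(errors="replace")

def run_syslog_server(host="0.0.0.0", port=5514):
    """Listen for syslog datagrams and print them with decoded priority."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, addr = sock.recvfrom(4096)
        facility, severity, text = parse_priority(data)
        print(f"{addr[0]} fac={facility} sev={severity}: {text.strip()}")

# Example of the decoding (priority 134 = facility 16, severity 6):
print(parse_priority(b"<134>switch01: port 12 link down"))
```

Point the switch's syslog target at the listener's address, and link flaps, spanning-tree events, and errors arrive without needing the console password for every look.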
Q9: Which are your top recommended resources and why?
A: Training. If you don't know what your tools are telling you, how can you fix anything? This can be a formal class, or something as simple as downloading the available standards (most of them are available if you look) and reading them. Yes, they are dry, dull, and boring. They don't even always have the most elegant or the best solution. However, they are the Standard and are therefore "Right" by definition. If you know what the standard says then you know what should be happening on your network. The challenge is then for you to find out which thing isn't behaving according to the standard.
Here are a few places to look. Each of these links permits limited downloads of the actual standards for personal use. There is no excuse for not knowing.
This site provides easy access to most specific protocol, port, or vendor numbers you are likely to need.
I am continually appalled at how many experts spout inaccurate data which they obtained from the popular press without ever having opened a Standard. I have really angered a number of instructors in purchased industry training classes by challenging their facts with passages read straight from the Standards (which I happened to have with me…). How can I be expected to trust the new information they are teaching me if the parts I know are described inaccurately?
I recommend using the popular press and industry training to get a general idea of how something works, and then supplement and verify that information by reading the source standard(s). It is often hard to get the "big picture" from the Standards documents.
Closing Comment: Neal, we will continue to follow your significant contributions in the networking arena. We thank you for sharing your time, wisdom, and accumulated deep insights with us.
A: Thank you for the soapbox! I hope some of my ramblings are of use to your readers.