Interviews by Stephen Ibaraki, FCIPS, I.S.P., MVP, DF/NPA, CNP
Neal Allen, International Top-Ranking Network Service/Troubleshooting Authority; Level 3 Escalation Engineer, Fluke Networks
This week, Stephen Ibaraki, FCIPS, I.S.P. has an exclusive interview with Neal Allen, Fluke Networks.
Neal Allen has been working for Fluke since 1989, and has been in the Fluke Networks networking products group since 1992 when it formed. Special focus has been placed on technical marketing, product development, beta testing, and special projects. In 2002 the Technical Assistance Center was restructured, and in the capacity of a level 3 escalation engineer, Neal has dealt with many of the issues surrounding analysis of the higher OSI layers while still maintaining awareness of the lower layers and how they affect monitoring and troubleshooting.
Several of the many special projects include:
Opening Comment: Neal, you bring a lifetime of proven experience and substantial contributions to the networking field. We thank you for doing this interview with us and sharing your deep insights with our audience.
A: I am always happy to share my discoveries with others.
Q1: Can you profile your current role with Fluke?
A: In my current capacity in the Technical Assistance Center (TAC), I usually only see the more unusual or intractable problems related to our diagnostic and monitoring products. Although it is possible to reach me on the first call into the TAC for some products - we use a skill based queue system - it is much more likely that I will be working with the technician with whom you first speak. Internally we describe escalations as falling into two categories: hand-off escalations where I become the primary contact, and "drive-by" escalations where one of the other TAC technicians walks over to my desk and asks questions about an open issue. I get a lot of drive-by issues.
I also have many secondary tasks which I call "hobbies". My hobbies include everything from working with engineering on product planning and beta testing to trade show support, sales support on key accounts, training development, and mentoring/tutoring other TAC staff.
Q2: Going back in history, what lessons do you want to share from your work at the Olympic Games?
A: That was a lot of fun, and provides an excellent example of what I have noticed as a significant and disturbing trend in the 40-50 new-hire interviews I have conducted in the last year or so. Almost everyone involved in network support seems to have either forgotten all of the basics, or never really knew them.
At the Olympics I was first tolerated as a necessary evil. Our corporate participation in the Games came fairly close to the actual event and all of the network planning had been in process for well over a year before we got involved. There was a core of high level network architects and systems analysts who designed and implemented the Games network. When I arrived on-site as part of the Fluke Networks delegation we were butting into the existing organization. At first I had to get fairly pushy to be permitted to accompany the other support staff on trouble calls. That changed fairly fast though, as I was able to provide some value added services: I knew how the deployed technology actually worked, and when OSI Layers 1 - 3 were the source of reported problems I was able to not only help isolate the problem quickly, but I could also interpret the symptoms and often trace them back to the root cause. In short order I was sought out by some of the analysts when they were called out of the NOC for troubleshooting the venues.
An example of one problem was at a fairly remote venue where track and field competitions were being held. The network at the venue was experiencing considerable slowness and occasionally dropped connections. The first stop for the network analysts was to go to the site with a notebook PC and log into the router console. While they were poking around the configurations I took a walk and examined the physical network installation. As I walked along one of the tracks, I kicked the network cables off the top of about 20 1" generator fed power cables (to make sure they were a minimum of three feet from the power cable electromagnetic fields). My walk around the field took about 15 minutes. When I returned to the on-site NOC they reported that the network problems had gone away, and they had no idea why. They were very concerned that the respite was temporary. I spent about 30 minutes explaining why the problem was actually solved. During that time the explanation progressed from a high level on down to the inner workings of the MAC Layer protocol as the questions became more and more specific as to why I would interpret a particular error the way I did. The analysts had been focused on higher Layer issues for so long they had forgotten how the MAC Layer protocol worked, so the errors did not point them toward the source of the problem.
This is typical of what I have seen in recent interviews. Almost everyone I talk to starts "at the keyboard" and troubleshoots from the router console upwards in the OSI Model. Layers 1 and 2, as well as part of Layer 3 are treated like your average electrical outlet. "I plugged it in, of course power is there. What could possible go wrong with the power?" Nobody seems to be able to offer a working description of how the technologies actually work "below the keyboard" (at OSI Layer 3 and down) and that creates frustration and inefficiency in troubleshooting. Often the root problem was never exposed or understood at all.
I largely attribute this situation to the ubiquitous presence of switches. In a switched network any cable fault, marginal NIC or switch port, or environmental influence will almost always limit its effects to a single connection. Thus it is not thought of as a network problem at all; it is an end-user PC problem. Add to that the reliability increases that have been made over the years and you have a situation where router and VLAN configurations occupy most of the time and effort of those people who have the router password. All of the other problems tend to be handled through the Help Desk. The Help Desk doesn't know how the network operates and doesn't have access to network failure or configuration data, and the people with access to the network don't care about end-user problems. And we wonder why there is finger-pointing.
Q3: Do you have useful tips to add from your work at the Pentagon after the 9/11 attack and your troubleshooting efforts on the aircraft carrier?
A: The Pentagon was pretty straight-forward. Pieces of the network were missing and had to be replaced. Everyone there was great and worked together despite all of the group and service boundaries which had to be accommodated.
The aircraft carrier was a completely different situation. I am not at liberty to describe the exact situation, so let's equate it to a comparable situation where e-mail was sometimes very slow or completely unavailable for short periods of time. This is typical of today's switched network problems. The network was just fine with typically single-digit utilization per port, but was accused of being the source of the problem. I was onboard to find the "network" error and it didn't take long to disprove that theory.
What I suspected, but didn't have time to completely unravel, was that interactions between the servers was the cause of the problem. We now offer a product which would have done this for me, but at the time I was not able to elicit a satisfactory description of the interactions between the servers involved in login, authentication, and e-mail services. Without that basic dependency and relationship information I would have to perform protocol captures of all of the traffic to and from each server identified as troubleshooting progressed in order to create that interaction diagram. I remain convinced that one of the servers was doing extra duty with some other application and when it became burdened with that extra duty it was slow or unresponsive in supporting e-mail. Because of the multi-tier architecture which involved interaction between multiple servers and services before e-mail appeared on the client user interface it was highly likely that a validation or authentication step was not taking place in a timely manner. The e-mail servers themselves appeared to be well configured and lightly utilized.
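The interaction diagram described here can be sketched by aggregating captured conversations into a dependency map. Below is a minimal Python sketch of that idea; the server names, services, and flow records are hypothetical illustrations, not data from the actual incident:

```python
from collections import defaultdict

# Hypothetical flow records: (source, destination, service) tuples that
# protocol captures at each server might yield.
flows = [
    ("client", "mail-srv", "smtp"),
    ("mail-srv", "auth-srv", "ldap"),
    ("auth-srv", "dc-srv", "kerberos"),
    ("client", "mail-srv", "smtp"),
]

def build_dependency_map(flows):
    """Aggregate captured conversations into a server-interaction map."""
    deps = defaultdict(set)
    for src, dst, svc in flows:
        deps[src].add((dst, svc))
    return deps

# Print the dependency chain; a slow hop anywhere in it delays the client.
for server, targets in sorted(build_dependency_map(flows).items()):
    for dst, svc in sorted(targets):
        print(f"{server} -> {dst} [{svc}]")
```

Even a crude map like this shows why the e-mail servers can look healthy while the client experience is poor: the delay may live in any upstream dependency.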
There are four communities within the typical organization which must cooperate in order to isolate and solve this sort of problem: network support, server support, applications or database support, and management. In most situations the problem would be a natural reluctance of people to volunteer information which might reveal that the problem originates in their area of responsibility. It is very hard to troubleshoot if you cannot gather information and symptoms. It is even harder if nobody knows which servers or services are involved and this was the problem I observed. Up until I started asking questions, there was no reason for people in the different areas of responsibility to share detailed information. In fact, security considerations discouraged open sharing. Once we started asking it was hard to find all of the people who knew the answers to questions about server and application interactions. I left before some of the shore support people who held critical knowledge were located. I hope they isolated the cause because the network support people were once again being accused of the slowdown, even though their part was working fine.
As networks become more segmented, even as they become more integrated, this problem will continue to get worse. I was recently asking myself how I would troubleshoot an application interaction problem like this when the servers involved are running as virtual systems on a blade server. I would have no access to data passing over the virtual Ethernet between the virtual servers on the blade and no access to data passing between blade servers in the chassis. My view would be limited to traffic between the blade chassis and the network if I used a span or tap on the right uplink. The only other alternative I can think of would be to load monitoring software on each virtual server, and most people don't like loading things on their servers. To spice things up you could add spyware, virus, 'bot, or hacker activity into the equation.
Q4:Please share some stories from your work.
A: Since I am highlighting issues below Layer 3 as being important but ignored, here are a few.
Cause: After considerable "widening of the scope" of the problem it was determined that the UTP cable from the server farm was draped across the suspended ceiling. The distance to the roof was just the plenum space between the suspended ceiling and the actual roof - maybe one foot. The air conditioning system used the plenum space above this suspended ceiling and was automatically turned off at 4:30pm for this office (in Phoenix, Arizona in the summer). The heat from the roof would then increase the attenuation of the UTP cable until the link failed. When the roof cooled off in the evening the network would resume operation.
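The effect in this story can be roughly quantified. A figure commonly cited for Cat 5 UTP is that attenuation rises about 0.4% per degree Celsius above 20 °C; treat that coefficient, and the 22 dB baseline below, as illustrative assumptions rather than spec quotes:

```python
# Rough sketch of how UTP attenuation grows with temperature.
# The 0.4 %/°C derating above 20 °C is a commonly cited figure for
# Cat 5 cable (an assumption here, not a quoted standard value).

def attenuation_at_temp(atten_20c_db, temp_c, coeff=0.004):
    """Scale a 20 °C attenuation figure to a higher temperature."""
    if temp_c <= 20:
        return atten_20c_db
    return atten_20c_db * (1 + coeff * (temp_c - 20))

# Assume a long Cat 5 link measuring about 22 dB of attenuation at 20 °C.
for temp in (20, 40, 60):
    print(f"{temp} °C: {attenuation_at_temp(22.0, temp):.1f} dB")
```

A link engineered with little headroom can cross its loss budget on a hot afternoon in a plenum space, then recover in the evening, exactly the pattern described above.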
Cause: The line-of-sight microwave link between buildings was being interrupted by a construction crane. The problem was not immediately apparent because the microwave bridge equipment did not drop link on the network side when the microwave was interrupted. It just dropped traffic after the buffers were exceeded because the path was down. This was discovered only after troubleshooting everything else and finally mounting an optional video camera to the antenna assembly. When a network management application sent an alert indicating that the remote router was not reachable we finally observed the crane passing in the video.
Cause: After gradually replacing every bit of cable, every server, and all network infrastructure in the office, the problem persisted. A week into the problem I was expressing frustration with the office manager, who was getting fairly short-tempered, when she suddenly interrupted saying "look - it's going to happen again." She was watching one of the CRT monitors (thankfully they were not flat screens), which presented a "warped" screen image for just a tiny moment. I wish she had told me about that symptom sooner. The express elevator to the penthouse office was passing right behind the wall. The elevator's passage (a metal box in a metal shaft) created a huge electromagnetic field as it passed the accounting department. The accountants moved to a window office since it was not cost effective to screen that electromagnetic field out of the room.
Observed symptoms: Traffic was observed by the protocol analyzer, but not the transactions which were sought. In fact, very little traffic was seen despite the activity indicated by the server application and the switch console.
Cause: The hub was not just a shared media hub. It is almost impossible to purchase an exclusively shared media hub today; in fact, most "hubs" are now unmanaged switches. The protocol analyzer was an old but loved piece of luggable hardware owned by the customer I was working with. It linked at 10 Mbps. The 10/100 hub permitted the switch and the server to link at 100 Mbps. Between the 10 Mbps collision domain and the 100 Mbps collision domain in the "hub" was an OSI Layer 2 bridge. Same-speed connections were in the same collision domain. If the transaction was specifically addressed to the server, then the traffic was never forwarded across the bridge to the protocol analyzer. All the protocol analyzer saw was broadcast traffic, the first query for new conversations, and the next frame for traffic which had aged out of the bridge forwarding table in the hub. By forcing the switch port to 10 Mbps as well, we were able to see all of the traffic because it then passed through the same collision domain.
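The filtering behavior that hid the traffic follows directly from how a transparent (learning) bridge works. Here is a minimal sketch of the forwarding logic in the style of IEEE 802.1D; the port numbers and station names are illustrative only:

```python
class LearningBridge:
    """Minimal sketch of transparent-bridge forwarding (802.1D style)."""

    def __init__(self):
        self.table = {}  # station address -> port it was learned on

    def receive(self, frame_src, frame_dst, in_port, all_ports):
        """Return the list of ports the frame is forwarded out of."""
        self.table[frame_src] = in_port          # learn the source address
        if frame_dst == "broadcast" or frame_dst not in self.table:
            # Flood broadcasts and unknown unicasts to every other port.
            return [p for p in all_ports if p != in_port]
        out = self.table[frame_dst]
        return [] if out == in_port else [out]   # filter same-segment traffic

bridge = LearningBridge()
ports = [1, 2]  # port 1: 100 Mbps segment, port 2: 10 Mbps analyzer segment
bridge.receive("server", "client", 1, ports)     # server learned on port 1
# The reply also arrives on port 1: both stations share the 100 Mbps
# collision domain, so the frame is filtered and the analyzer sees nothing.
print(bridge.receive("client", "server", 1, ports))  # -> []
```

The analyzer on the 10 Mbps side only ever sees floods: broadcasts, unknown unicasts, and frames whose destination has aged out of the table.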
Q5: You participate in troubleshooting at Interop trade shows. Can you profile the most difficult problems in this environment and their solutions?
A: Due to the nature of a show like Interop, the biggest problem is in not saying anything politically incorrect. For example, exhibitors at these shows are often using the venue to introduce or launch new products. The product, such as new network infrastructure gear, may be making its debut outside the development lab for the first time. I can remember several instances where the design engineer was complaining that the Interop network was not working, while we were arguing that there wasn't any traffic coming out of their box, or that there was a problem with what did come out. How do you tell someone that their baby is ugly even as that person is entering the baby in a beauty contest? Furthermore, it is our job as part of the show staff to do whatever is possible to make the debut successful despite any little problems.
I love going to this show because I get to work with many of the products and new technologies which will begin appearing in our customers' networks over the next 6-12 months. It is like a sneak preview of what our customers will be calling us about. In fact, it is about the only useful training on what is "new" that I get each year.
I would say the biggest problem is in following two rules for troubleshooting:
When Fast Ethernet was still a proposed standard Barry Reinholt from the University of New Hampshire Interoperability Lab brought to the show most of the "working" Fast Ethernet switches which were in the process of being released to market by the various vendors. We used them to build one of the first multi-vendor Fast Ethernet production networks. Keep in mind that multi-vendor interoperability was a goal then, not a routine expectation. It almost worked great. We spent many, many hours troubleshooting until it was finally discovered that one of the fundamental operating rules of a bridging device was not being followed by one switch near the middle of the hierarchy. When the bridge forwarding table filled up it didn't flood traffic for addresses not in the table to all ports. It has been long enough now that I don't remember for sure, but I think it either dropped traffic for the new address which was not inserted into the table, or dropped traffic for the oldest address which it discarded from the table. Up until the forwarding table on one switch was full the entire deployment worked flawlessly. All of the symptoms pointed at a problem with one client, since everyone else worked fine. We didn't follow the two rules or we would have questioned the observed behavior more closely much sooner.
There is an instance of this which plagues small wireless deployments right now. When some low-cost APs' bridging tables fill up, they discard traffic for anyone not in the bridge forwarding table. If you reboot the AP, then a new list of wireless clients can occupy the table entries and operate just fine. If you are the eleventh client (using ten entries in the table as an example, as the table size appears to be somewhere near that number on one such AP), then everyone else can get to the Internet except you, despite your NIC driver showing good signal strength. Again, the symptoms pointed toward a misconfigured client.
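The difference between the standards-compliant behavior and the broken behavior in both stories comes down to one decision: what to do with a frame whose destination is not in the table. A small sketch (the ten-entry table size and port numbers are illustrative assumptions):

```python
def forward(dst, table, ports, in_port, compliant=True):
    """Decide where a frame goes when the forwarding table may be full.

    A standards-compliant bridge floods frames for unknown destinations
    to all other ports. The buggy devices described above instead
    silently drop them once the table is full.
    """
    if dst in table:
        return [table[dst]]                        # known: forward normally
    if compliant:
        return [p for p in ports if p != in_port]  # unknown: flood
    return []  # broken behavior: the "eleventh client" gets nothing

# Ten learned clients fill a hypothetical ten-entry table.
table = {f"client{i}": i for i in range(10)}
ports = list(range(12))
print(forward("client11", table, ports, in_port=11, compliant=True))
print(forward("client11", table, ports, in_port=11, compliant=False))
```

Because only the overflow client is affected, every observable symptom points at that one client's configuration rather than at the bridge.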
Q6: In your view, what are the most serious networking roadblocks for businesses?
A: Training. Training. And training.
Unfortunately, few businesses factor serious ongoing training into daily operations anymore. That is now the off-hours responsibility of the employee.
New-hires are often just out of a certification class and have little or no real experience. Furthermore, the training certification program is almost always designed around a particular vendor's offerings. This is great for the vendor, but not always great for the student. The part of the course which offers general technology information which they will be using when they get hired is often an after-thought addition to the course to "round it out". They teach that part as fast as possible, or skip it and tell the student to do some extra reading. The deep content in the course is related to the vendor product, which the student often won't be permitted to utilize until they have proven to their new employer that they are safe near the network. After all, would you give the password to the administrator account which controls your whole business to the new guy? I doubt it.
On the other end of the technology continuum are the people who got into networking in the 80s and had to learn all of the problems as they appeared. They have that knowledge buried in the back of their head, but never use it anymore. They are struggling to keep up with new product deployments on new technologies. They spend most of their time logged into a console or in front of the whiteboard.
The knowledge held by the new guy often relates to the newer or bleeding edge technologies because that is what was covered in class and the knowledge needed is lying unused in the senior staff's head. How do you transfer that knowledge in both directions?
Then there is the problem of interpreting the data presented by the network. I can't tell you how many times we have had customers call us with questions like:
Questions like these point to a fundamental lack of understanding of the principles of network operation. If the underlying technology were understood, then questions like this would not be asked.
My favorite is related to Ethernet Auto-Negotiation. It appears that almost everyone believes that if one link partner is negotiating, then you can do anything you want with the other end and it will figure it out. At the same time, they always configure everything to Full Duplex because auto-negotiation "didn't work." The correct answer is that a negotiating station is required to configure itself to Half Duplex if the link partner is not negotiating.
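The rule stated here can be captured in a few lines. This is only a sketch of the duplex outcome, not the full IEEE 802.3 arbitration (which also resolves speed and other capabilities):

```python
def duplex_outcome(side_a, side_b):
    """Resulting duplex per side, for settings 'auto' or 'forced-full'.

    Sketch of the rule in the text: an auto-negotiating port that
    detects a non-negotiating partner learns the speed via parallel
    detection but NOT the duplex, so it must fall back to half duplex.
    """
    if side_a == "auto" and side_b == "auto":
        return ("full", "full")  # both advertise; they agree on the best mode
    if side_a == "auto":         # B is forced: A cannot learn B's duplex
        return ("half", side_b.split("-")[1])
    if side_b == "auto":
        return (side_a.split("-")[1], "half")
    return (side_a.split("-")[1], side_b.split("-")[1])

# The classic mismatch: one end forced to full, the other left on auto.
print(duplex_outcome("auto", "forced-full"))  # -> ('half', 'full')
```

That mismatched pair mostly works at low load, then collapses under traffic with late collisions and CRC errors, which is why it so often survives casual testing.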
All of the supporting details are a lot to learn, but the basic principles of network operation and the operating rules of the common technologies are not. If the basics are not understood, then you will have many frustrating experiences.
Q7: What are the main issues that must be considered in a switched environment?
A: The most important issue is that you cannot successfully troubleshoot a switched environment unless you know what the switch is doing. You also shouldn't be designing the network unless you know that.
By walking up to the switch and looking at it you learn almost nothing. Unless you know how it was configured, and what OSI Layer(s) it is operating at you cannot easily interpret measurements and test results related to it. And you need a current map of the network architecture.
If you connect to a spare unused port and cannot get to the Internet, is it because:
You connected to your desired server, but things are really slow because:
Q8: What techniques do you use for troubleshooting a switch?
A: Troubleshooting switches has never been an easy task. It is made more difficult by the unavailability of the password. The people who do most of the day-to-day troubleshooting don't have the password and the people with the password troubleshoot "from the keyboard upward". This leaves a significant gap for helpdesk and first-in technicians to fill with few if any resources.
Although this is certainly not a complete list of the available options, this is what I came up with in a short time for methods which could be used to troubleshoot a simple "the network is slow" scenario involving a Layer 2 switch. This does not take into consideration the multitude of higher-Layer features available from today's switches, only a basic approach. Furthermore, each method below has serious pros and cons, and good reasons for using or not using it.
Method 1: Access the switch console
Method 2: Connect to a spare (unused) port
Method 3: Configure a mirror or span port
Method 4: Connect to a tagged or trunk port
Method 5: Insert a hub into the link
Method 6: Place the tester in series
Method 7: Place a Tap inline on a link
Method 8: Use SNMP-based network management
Method 9: Have the switch send sFlow, NetFlow or IPFIX
Method 10: Set up a syslog server
Method 11: Use the server (host) resources
Method 12: Use a combination of the above methods
[Editor's note: A link to a white paper with additional details about each method will be provided in a blog link in the IT Managers Connections blog (http://blogs.technet.com/cdnitmanagers/).]
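As an illustration of one of the lighter-weight methods above, Method 10 can be started with very little infrastructure. Below is a minimal sketch of a UDP syslog listener in Python; the port 5514 is an assumption for unprivileged testing (the standard syslog port 514 normally requires root), and the message parsing covers only the RFC 3164 "<PRI>" prefix:

```python
import socket

def parse_priority(message: bytes):
    """Split an RFC 3164 '<PRI>' prefix into (facility, severity, text)."""
    if message.startswith(b"<") and b">" in message:
        end = message.find(b">")
        pri = int(message[1:end])
        return pri // 8, pri % 8, message[end + 1:].decode(errors="replace")
    return None, None, message.decode(errors="replace")

def run_syslog_server(host="0.0.0.0", port=5514):
    """Listen for syslog datagrams and print them with decoded priority."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, addr = sock.recvfrom(4096)
        facility, severity, text = parse_priority(data)
        print(f"{addr[0]} fac={facility} sev={severity}: {text.strip()}")

# Example of the decoding (priority 134 = facility 16, severity 6):
print(parse_priority(b"<134>switch01: port 12 link down"))
```

Point the switch's syslog target at the listener's address, and link flaps, spanning-tree events, and errors arrive without needing the console password for every look.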
Q9: Which are your top recommended resources and why?
A: Training. If you don't know what your tools are telling you, how can you fix anything? This can be a formal class, or something as simple as downloading the available standards (most of them are available if you look) and reading them. Yes, they are dry, dull, and boring. They don't even always have the most elegant or the best solution. However, they are the Standard and are therefore "Right" by definition. If you know what the standard says then you know what should be happening on your network. The challenge is then for you to find out which thing isn't behaving according to the standard.
Here are a few places to look. Each of these links permits limited downloads of the actual standards for personal use. There is no excuse for not knowing.
This site provides easy access to most specific protocol, port, or vendor numbers you are likely to need.
I am continually appalled at how many experts spout inaccurate data which they obtained from the popular press without ever having opened a Standard. I have really angered a number of instructors in purchased industry training classes by challenging their facts with passages read straight from the Standards (which I happened to have with me…). How can I be expected to trust the new information they are teaching me if the parts I know are described inaccurately?
I recommend using the popular press and industry training to get a general idea of how something works, and then supplement and verify that information by reading the source standard(s). It is often hard to get the "big picture" from the Standards documents.
Closing Comment: Neal, we will continue to follow your significant contributions in the networking arena. We thank you for sharing your time, wisdom, and accumulated deep insights with us.
A: Thank you for the soapbox! I hope some of my ramblings are of use to your readers.