Werner Heeren of Fluke Networks gets to the root of your network performance issues.
The process for tackling network performance problems
To get to the root cause of network performance issues, network engineers typically follow a four-step troubleshooting process. The tools available to assist problem solving fall into two categories: Network Management Systems (NMS) and packet capture and analysis tools. The NMS primarily plays a role at the monitor/alert phase, monitoring the company’s routers and servers and checking whether they are working and responding as expected.
However, some NMS are so complex to set up that they are only able to manage down to Layer 3 devices, so switches are not monitored at Layer 2. Polling data is aggregated over many minutes and so is smoothed out, hiding the impact of usage spikes. Additionally, because the NMS is centrally located, measurements made with the intention of understanding end user response times are inaccurate because the test is using a different part of the network to reach the device under investigation.
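The smoothing effect of long polling intervals can be illustrated with a small sketch. The figures below are invented for illustration only: a link that idles at 10% utilisation and is saturated for 15 seconds within a five-minute polling window.

```python
from statistics import mean

# Hypothetical per-second link utilisation (%) across one 5-minute polling window:
# the link idles at 10% apart from a single 15-second burst at 100%.
samples = [10.0] * 300
for second in range(120, 135):
    samples[second] = 100.0

# What a 5-minute poll would report: one averaged figure for the whole window.
five_minute_average = mean(samples)

# What per-second data would reveal.
peak = max(samples)
seconds_saturated = sum(1 for s in samples if s >= 95.0)

print(f"5-minute average: {five_minute_average:.1f}%")
print(f"peak: {peak:.1f}%")
print(f"seconds at or above 95%: {seconds_saturated}")
```

In this made-up case the averaged figure looks healthy, while users would have experienced 15 seconds of a saturated link.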
As a network engineer progresses through the troubleshooting process, the usefulness of the NMS decreases and it fails to provide the detailed information required to investigate performance problems fully.
A recent Fluke Networks survey of approx. 3,000 network professionals found that 82% of respondents ranked network and application performance problems as a concern or critical issue, with 52% stating that an NMS has insufficient capabilities to get to the root cause of the problem most or all of the time. 51% of respondents said that they needed to leave their desk some or most of the time to troubleshoot the problem.
To obtain more detailed information, the engineer has to turn to freeware or commercial packet capture and analysis tools. These have a limited role in the alert stage as they only view a single point in the network, but come into their own at the root cause analysis stage.
Packet analysis tools are complex and require skilled, experienced engineers. They are also time consuming to use, because a capture can produce too much data: millions of packets to wade through, often displayed through different user interfaces. This makes the troubleshooting process much more difficult.
Where problems can hide in the network
The gap between these tools – an NMS without comprehensive information and complex packet capture tools – increases mean time to repair (MTTR). Nagging, intermittent problems can ‘hide’ in the network, reducing both productivity and the credibility of the IT department.
To investigate and resolve performance issues quickly, the engineer needs end-to-end visibility across the network: a dedicated solution for automated network and application analysis that fills the gap between traditional NMS and packet capture.
This needs to address:
– Unmanaged equipment, which may have been purchased because it is less expensive but will cost more to troubleshoot when problems occur, as there is no visibility of the health of each network segment and utilisation levels cannot be monitored. In contrast, with a managed switch a network engineer can go to any switch port and see what the errors look like, view the utilisation and see who is connected to that port.
– Undocumented networks, a continual problem given that frequent changes on a network make any documentation out-of-date shortly after completion. Physically trying to trace the path would take a long time, but without accurate documentation the engineer does not know which packets are flowing where. What is required is a means of discovering the real-time path through the network.
– Too much data, when the problem may lie in just a few packets. Problem-solving would be much faster with an automated method of sifting through captured packets to find the bad ones – an application centric analysis that takes a top down approach.
– Problems in the past, which only come to the engineer’s attention hours after they have occurred. What’s needed is a means of going back in time by capturing and analysing large amounts of granular data over an extended period of time, say 24 hours, to catch intermittent issues.
– New technology that isn’t monitored, such as 10Gb Ethernet or 802.11n Wi-Fi. Many organisations have not invested in instrumentation for such technologies because they believe the substantial increase in capacity will overcome any problems.
– Wireless devices, where the engineer needs a way to identify and monitor Wi-Fi devices, including BYOD, and to identify Wi-Fi and non-Wi-Fi interference from Bluetooth devices, cordless telephones, microwaves etc. using spectrum analysis.
– Problems that reside outside the network, so that the engineer can identify them and hand the performance issue and supporting evidence to other IT teams or external service providers, with sufficient information to enable further investigation and a rapid solution.
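The "too much data" point in the list above describes a top-down, application-centric way of sifting captured packets. A minimal sketch of that idea follows, assuming the packets have already been decoded into simple records. The record layout, application names and addresses are hypothetical and not taken from any particular tool.

```python
from collections import defaultdict

# Hypothetical decoded packet records: (application, client address, is_retransmission).
packets = [
    ("web", "10.0.0.5", False), ("web", "10.0.0.5", False),
    ("web", "10.0.0.6", False), ("web", "10.0.0.6", False),
    ("erp", "10.0.0.7", True), ("erp", "10.0.0.7", False), ("erp", "10.0.0.7", True),
    ("erp", "10.0.0.8", False), ("erp", "10.0.0.8", False), ("erp", "10.0.0.8", False),
    ("mail", "10.0.0.5", False), ("mail", "10.0.0.5", True),
    ("mail", "10.0.0.6", False), ("mail", "10.0.0.6", False),
]


def retransmission_rate_by(records, key_index):
    """Return the fraction of retransmitted packets for each value of one field."""
    totals = defaultdict(int)
    retransmissions = defaultdict(int)
    for record in records:
        key = record[key_index]
        totals[key] += 1
        if record[2]:
            retransmissions[key] += 1
    return {key: retransmissions[key] / totals[key] for key in totals}


# Step 1: start at the top, ranking applications by retransmission rate.
by_application = retransmission_rate_by(packets, 0)
worst_application = max(by_application, key=by_application.get)

# Step 2: drill down into the worst application, ranking its clients.
application_packets = [p for p in packets if p[0] == worst_application]
by_client = retransmission_rate_by(application_packets, 1)
worst_client = max(by_client, key=by_client.get)

# Step 3: only now look at individual packets, and only the suspect ones.
suspects = [p for p in application_packets if p[1] == worst_client and p[2]]

print(worst_application, worst_client, len(suspects))
```

The design choice is to narrow the search at each level, so the engineer inspects a handful of suspect packets rather than the whole capture. Retransmission rate is just one possible indicator; response time or error counts could be ranked in the same way.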
A new approach to problem solving
What is needed is a holistic network and application performance solution that captures all the data in the network and provides intelligent analysis to enable engineers to isolate root cause more quickly, or identify if the real problem lies outside the network. It needs to collect, aggregate, correlate and mediate all information, including flow, SNMP data and information gathered from other devices, with granularity down to one millisecond.
Data should be displayed through a single user configurable dashboard, so that guided workflows can be applied to isolate the root cause of the issue quickly. By taking away the need to make assumptions and enabling the user to follow a logical process until the issue is identified and resolved, MTTR is reduced and the network engineer becomes more effective.
The first requirement when addressing and resolving network problems is a system that provides a timely alert that a problem has occurred. The worst case scenario is to find out through a call from a user, in which case the engineer is already on the back foot.
With many network management tools, alerts have to be manually configured for each network by setting the system to ping or discover all the devices in each broadcast domain. With an always-on network and application performance solution, however, automated discovery and guided workflows make it quick and easy to see immediately which devices are connected. This dramatically reduces the time required for set-up and monitoring.
The network engineer now needs to investigate the scope of the problem. In order to facilitate rapid and accurate investigation, the solution needs to be able to collect all pertinent data (e.g. SNMP, flows, packets and end-user response time) and store it for future analysis. A network and application performance solution also provides a real-time method of discovering the path from the client to the service or application, significantly reducing the amount of time required; the path between the two devices can then be found and monitored for any problems across internal and external networks and the devices in the path. Results are displayed in a graphical format to facilitate understanding and rapid root cause analysis.
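Path discovery can be pictured as a search over a map of which devices connect to which. The sketch below assumes such a map has already been gathered (in practice from sources such as switch forwarding tables and routing tables, which is outside this example) and uses a breadth-first search to find the path with the fewest hops. The device names and topology are hypothetical.

```python
from collections import deque


def find_path(links, source, destination):
    """Breadth-first search: return the fewest-hop path as a list, or None."""
    previous = {source: None}  # also serves as the set of visited devices
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            # Walk back through the recorded predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in links.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical topology: each device maps to its directly connected neighbours.
links = {
    "client-pc": ["access-sw-1"],
    "access-sw-1": ["client-pc", "dist-sw-1"],
    "access-sw-2": ["dist-sw-1"],
    "dist-sw-1": ["access-sw-1", "access-sw-2", "core-rtr"],
    "core-rtr": ["dist-sw-1", "dc-sw"],
    "dc-sw": ["core-rtr", "app-server"],
    "app-server": ["dc-sw"],
}

path = find_path(links, "client-pc", "app-server")
print(" -> ".join(path))
```

Once the path is known, each device and port on it becomes a candidate for closer monitoring, which is the step the following paragraphs describe.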
At this stage the problem has been isolated to a single network segment, switch, router, server or application and the path, devices and ports in the path have been identified. Now the path needs to be analysed, requiring traffic statistics for each link to determine whether the issue is due to a faulty device, link media, noise or interference, or traffic overload.
One of the great advantages of SNMP (Simple Network Management Protocol) is its ability to help isolate fault domains. Using SNMP to query each connection point along the way will determine if a traffic bottleneck is the source of the slowdown.
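As an illustration of how SNMP data can point to a bottleneck, interface utilisation is commonly derived from two readings of an octet counter (such as `ifInOctets`) and the interface speed (`ifSpeed`). The sketch below assumes the counter values have already been collected; it performs no SNMP queries itself, and the hop names and readings are invented. It also assumes the counter wraps at most once between readings.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two readings of a wrapping counter (at most one wrap)."""
    return (current - previous) % (2 ** bits)


def utilisation_percent(previous_octets, current_octets, interval_s, speed_bps, bits=32):
    """Average utilisation over the interval, as a percentage of link speed."""
    octets = counter_delta(previous_octets, current_octets, bits)
    return octets * 8 * 100 / (interval_s * speed_bps)


# Hypothetical readings for each hop on the path:
# (name, previous octets, current octets, seconds between readings, speed in bit/s)
hops = [
    ("access-sw-1 port 12", 1_000_000, 13_500_000, 10, 1_000_000_000),
    ("wan-rtr serial 0", 4_290_000_000, 6_532_704, 10, 10_000_000),  # counter wrapped
    ("core-sw uplink", 0, 250_000_000, 10, 10_000_000_000),
]

results = {
    name: utilisation_percent(prev, curr, interval, speed)
    for name, prev, curr, interval, speed in hops
}
bottleneck = max(results, key=results.get)

for name, percent in results.items():
    print(f"{name}: {percent:.1f}%")
print(f"most heavily loaded hop: {bottleneck}")
```

The modulo arithmetic in `counter_delta` handles the case where a 32-bit counter rolls over between polls, which is common on fast links. On such links a 64-bit counter would normally be read instead and `bits=64` passed.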
This is straightforward if the devices in the path are managed and the engineer has the passwords or community strings to interrogate the devices. Otherwise he or she has to connect a tool in each link without disrupting the network to view the packets and traffic statistics. This can be extremely time consuming if there are a lot of links over a large geographic area, and may require multiple tools in different locations.
An automated network infrastructure health check using a network and application performance tool makes it possible to monitor all of the SNMP supported devices, looking at application flows for those showing packet loss or high utilisation by querying the SNMP MIBs on the routers and reporting back at regular intervals. Whether there are tens or hundreds of switches in the network, the process is simple and quick.
Some problems will only be visible by being at the point where the problem has arisen. This requires a portable device with the right testing capabilities and the right interface to connect at the problem point, whether that is in front of a client or a 10G link in a datacentre. With many people working remotely, having a tool which gives this visibility is vital – and this will only increase in importance with the growth of BYOD.
A portable tool can also be shipped to a remote site to see what is happening with unmanaged equipment in the network without the need for an accompanying engineer. Ideally it should be able to perform path analysis, measure application infrastructure health and application flows and analyse WLAN performance, as well as reviewing roaming and retry capability and investigating any interference from outside devices.
A network and application performance solution provides the visibility engineers need to document and audit the health of their corporate network. It enables them to spot poor performance and identify where the paths of applications or servers are running slowly, so that the slowest and most critical paths can be addressed. The information obtained can be used to prioritise projects such as server upgrades and make the business case for approval.
Werner Heeren is the Regional Sales Director for High Growth Markets (Middle East and Africa) at Fluke Networks. Werner has been with Fluke Networks since 1999. He started in 1997 with Fluke Industrial’s Benelux sales team as a Product Manager and in pre-sales support.
After joining Fluke Networks in 1999 and receiving several awards in the following years, such as “Eye of the Eagle”, “Top performer to quota”, “Summit Club member” and “Recognition award for highest grow rate”, Werner took on the challenge of leading Fluke Networks successfully into the High Growth Markets of the Middle East and Africa. Werner is of Belgian nationality, is based in Dubai and holds an engineering degree.