1. Symptoms
Today’s patient is a large paging company that has just upgraded its high-speed paging equipment, added information services, and significantly expanded and reconfigured its computer network. Commissioning went smoothly at first, but the good times did not last: serious problems appeared just one day after the upgraded system officially went live. The technical center manager, Mr. Yan, reported the following symptoms. At first, operators noticed sporadic delays when entering user data at the paging consoles; the screen took longer and longer to respond to keystrokes, growing from about one second to more than ten seconds. Network service then quickly became very slow. The delay fluctuated between 10 seconds and 1 minute, and operators sometimes waited more than a minute for the screen to update while entering data. During peak business hours, pages could not be processed fast enough to keep up with demand, and callers queued badly.
Equipment management staff checked the hubs and switches: the indicator lights were blinking constantly, seemingly faster than before, which suggested heavy network traffic. Software monitoring showed CPU utilization of 93% on the main server and above 85% on the computers at the five workstations. This occurred on April 26th, and they suspected a virus might be at work. They scanned with three different antivirus programs, but the problem persisted. Because the paging center’s equipment room had no hardware tools for network maintenance, the engineering contractor was at a loss, and the company urgently sought help from the Network Hospital.
2. Diagnostic Process
Thirty minutes later, we arrived on site. As Mr. Yan had described, the constantly blinking indicator lights pointed to heavy network traffic. The network runs on an NT platform with IP as the working protocol. We connected an F683 network tester to an arbitrary port on the network and obtained the following results: average network utilization was between 57% and 83%, which is clearly too high; the collision rate was 4.9% to 5.3%; broadcasts ranged from 42% to 74%; errors were 2% to 3%; and normal traffic fluctuated between only 0.7% and 8.1%. Evidently, most of the network bandwidth was being consumed by illegitimate frames, chiefly high-volume broadcast frames, followed by error frames.
To identify the source of the broadcast and error frames, we first started the tester’s error detection and statistics function. After 2 seconds it reported oversized frames, incomplete frames, FCS error frames, and a few short frames. The “Error Statistic” button showed the sources of these errors, and all of them pointed to a server named “Cindy.” To find the source of the excessive broadcasts, we then pressed the “Top Sender” button, which also identified “Cindy” as the sender of the excessive broadcast frames. “Cindy” was also sending a small amount of normal IP frames (about 0.8%).
When “Cindy” was disconnected from the network, the problems on all the workstations disappeared. To determine whether the fault lay in the network card itself or in its driver, we reinstalled the network card driver on “Cindy” and brought the machine back online; the problem recurred. This indicated that the network card itself was the likely culprit. After the network card was replaced, the network returned to normal.
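
The “Error Statistic” and “Top Sender” views used above amount to tallying frames per source host and per frame category. The following is a minimal Python sketch of that idea, not a description of how the F683 works internally. The record format, the `top_sender` helper, the host names other than “Cindy,” and all frame counts are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical capture summary: (source host, frame kind).
# Only the name "Cindy" comes from the case; the other hosts and all
# counts are invented for illustration, not measured values.
frames = (
    [("Cindy", "broadcast")] * 60
    + [("Cindy", "fcs_error")] * 3
    + [("Cindy", "ip")] * 1
    + [("ws1", "ip")] * 4
    + [("ws2", "ip")] * 3
    + [("ws2", "broadcast")] * 1
)

# Error categories named in the diagnosis (labels are assumptions).
ERROR_KINDS = {"oversize", "incomplete", "fcs_error", "short"}


def top_sender(records, kinds):
    """Return (host, count) for the host that sent the most frames of the given kinds."""
    counts = Counter(src for src, kind in records if kind in kinds)
    return counts.most_common(1)[0] if counts else None


top_broadcaster = top_sender(frames, {"broadcast"})
top_error_source = top_sender(frames, ERROR_KINDS)
print("Top broadcast sender:", top_broadcaster)
print("Top error source:", top_error_source)
```

When both tallies point to the same host, as they did here, that host is the natural first candidate to disconnect and observe.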
3. Conclusion
Average network utilization is a key factor in perceived network speed. Ethernet can tolerate instantaneous utilization above 90%, which makes it well suited to bursty traffic. As long as average utilization stays below 40%, users generally do not perceive the network as slow. In this case, a faulty network card in the “Cindy” server was sending, besides a little normal IP traffic (about 0.8%), about 2% to 3% error frames and, most importantly, excessive broadcast frames (42% to 74%), which caused screen updates to lag by anywhere from 10 seconds to 1 minute. The excessive broadcast frames had the greatest impact on the network. Broadcast frames are a means of periodic and ad hoc communication among network devices, but excessive broadcasts consume bandwidth needlessly.
A malfunctioning network card can show various symptoms. One common type is the “quiet” type: the card sends no data at all, and the machine cannot access the network. Another common type is the “mad” type: in addition to normal data, the card sends a large volume of illegitimate frames and error frames. This case was of the second kind, dominated by excessive broadcast frames. Broadcast frames pass through bridges and switches within a network segment, so the network card of every device on the segment receives them and repeatedly interrupts its host CPU; this is why workstation CPU utilization rose above 85%. As a result, even local applications on these machines slowed down noticeably. Interestingly, many users discover that such a problem is network-related only when they disconnect their machines from the network, because the symptoms are first attributed to the workstations and are often mistaken for a virus outbreak.
Many network administrators and maintenance personnel end up living through a story like this one. First, they scan for viruses with several antivirus programs, to no avail. Then they format all the workstations and reinstall the operating system and applications, but because the fault lies with the server, the problem remains. Finally, they format every machine, including the server, and reinstall the system platform and applications. If the cause is an incorrectly installed server network card driver (for example, a mismatched driver version that works, but not smoothly), the story may end there, with the correct driver reinstalled. If the cause is a “mad” network card, however, the story can drag on for a long time, because “mad” patients do not obey the rules of the network: they send large amounts of illegitimate frames that consume bandwidth and affect every member of the network. Unfortunately, “mad” patients account for a far from negligible share of network faults!
4. Diagnostic Recommendations
“Network health testing” and “network benchmark testing” are both essential for real-time and long-term monitoring of network traffic. They help maintenance personnel understand how network applications and traffic change over time, and enable them to detect and resolve network problems promptly. A “Network Maintenance Plan” should include health testing as a daily task that monitors real-time metrics such as traffic/utilization, collisions, broadcasts, and errors. To simplify the routine, testing can instead be done at a chosen busy period each day, since many network problems do not show up, or are not obvious, when traffic is light. This way, network anomalies can be detected right away. A more reliable approach is to run certification tests on the network. Besides certifying the cabling system, the working network should also be certified, so that existing faults and potential performance problems are found and eliminated before the network goes into normal operation, and network performance is optimized as far as possible.
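
The daily health test recommended above can be thought of as comparing each measured metric against a limit. The sketch below illustrates that comparison in Python under stated assumptions. Only the 40% utilization figure comes from this article; the limits for collisions, broadcasts, and errors are placeholder assumptions that a site would replace with values from its own benchmark tests. The `health_check` function and the metric names are hypothetical.

```python
# Daily health-check sketch. All values are percentages.
LIMITS = {
    "utilization": 40.0,  # from this article: below 40% average is not perceived as slow
    "collisions": 5.0,    # assumed placeholder, not from the article
    "broadcasts": 10.0,   # assumed placeholder, not from the article
    "errors": 1.0,        # assumed placeholder, not from the article
}


def health_check(sample, limits=LIMITS):
    """Return the sorted names of metrics whose measured value exceeds its limit.

    Metrics missing from the sample are treated as 0 and never flagged.
    """
    return sorted(
        name for name, limit in limits.items() if sample.get(name, 0.0) > limit
    )


# Upper end of each range reported in this case.
incident = {"utilization": 83.0, "collisions": 5.3, "broadcasts": 74.0, "errors": 3.0}
print(health_check(incident))
```

Run against the readings from this case, every metric is flagged, which is the kind of result that should prompt a search for the top senders before anyone reaches for antivirus software.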
5. Afterword
The next day, we ran a simplified network certification test on the paging network, in which the server’s tolerance for traffic shocks measured 100%. Had it not been for the fault described above, the network’s overall performance would have been rated quite good.