1. Symptoms
Today’s “patient” is rather unusual: a telecommunications management center that urgently asked our network hospital for help with a pressing problem. We set out at once and, while en route, kept talking with the center’s director to understand the “condition.” The management center sits in a regional hub and oversees two county-level cities and seven counties. Two months ago, its network management system raised an alarm indicating abnormal conditions in the network of one of the county-level cities. A month ago, a provincial inspection team noticed that this city was missing from the network topology diagram at the management center. The center explained it away by saying that construction work happened to be under way in that city, temporarily disabling the network management function and removing the city’s icon from the network management station’s display. The urgent problem now is that the inspection team returns tomorrow for a second-round assessment in which the network management system is a key item, and the construction excuse will not work a second time. The alarm on the network topology diagram is still there, and the underlying problem has never been properly resolved. The county-level city’s staff have kept reporting that everything is fine and the network is fast, but they have not actively pursued the fault. Recently the issue, which at first appeared only intermittently, has become frequent and persistent.
Given this situation, we decided not to go to the regional center but to head straight for the county-level city’s network management center: judging from the network management indications, the problem most likely lay within that city’s network, and its center was also closer to where we were.
2. Diagnostic Process
We arrived at our destination half an hour later and started the examination immediately. According to the regional network management center, the router in this subnet was reporting an elevated error rate, so we tested the subnet directly. It is a multi-segment network interconnected through switches, consisting of 8 10BaseT and 18 100BaseT Ethernet segments. We attached a network tester for automatic monitoring, which showed an average error rate of 3% on the router and an effective rate of 7% (the WAN connections run over E1 links). The error traffic seen by the switch pointed overwhelmingly to the segment on port 3# of the first slot, while the other segments looked normal. Segment 3# was a 100BaseT Ethernet segment with 97 workstations; the DNS servers, IP servers, and other critical service servers were also attached to it. Testing port 3# showed an error rate of about 25%.

We then moved the F683 “Network Multimeter” (network tester) onto segment 3#. The results showed two error types: frame check sequence (FCS) errors and unclassified errors. FCS errors accounted for 27% of traffic and unclassified errors for 11%, while normal packet traffic was only 3%. The 27% FCS figure roughly matched the 25% the switch had reported, whereas the 11% of unclassified errors could not be recognized by the switch or the router and had to be localized separately. Disconnecting the router reduced the error figures only slightly, indicating that the fault was indeed inside this subnet and had little to do with the WAN links. Since the whole subnet was interconnected through hubs (8 x 16-port), we went on to examine all the error localization data reported by the F683. The instrument showed that all 97 workstations and 5 servers were generating FCS frame check errors, in varying amounts.
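The FCS errors the tester reports are frames whose trailing CRC-32 checksum no longer matches the frame contents, which is exactly what happens when interference flips bits on the wire; a receiving station drops such frames and the sender has to retransmit. As a rough illustration only (not the F683’s internal algorithm, and with byte-order details simplified), a minimal Python sketch of the check looks like this:

```python
import struct
import zlib

def append_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 trailer (an Ethernet-style FCS, simplified)."""
    fcs = zlib.crc32(frame) & 0xFFFFFFFF
    return frame + struct.pack("<I", fcs)

def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC-32 over the body and compare it with the trailer."""
    body, trailer = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return (zlib.crc32(body) & 0xFFFFFFFF) == struct.unpack("<I", trailer)[0]

good = append_fcs(b"\x00" * 60)                         # dummy 60-byte payload
bad = good[:10] + bytes([good[10] ^ 0x01]) + good[11:]  # one bit flipped "in transit"
print(fcs_ok(good), fcs_ok(bad))                        # -> True False
```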
Since every workstation was producing FCS frame check errors, it was highly improbable that all of their network cards had failed at once. The more likely causes were cabling problems (wiring errors throughout, or counterfeit cable) or introduced interference: signal, grounding, power, or radiated interference (which would also account for the unclassified errors). The network manager objected that the cabling system had passed certification testing to the ISO 11801 standard at acceptance, a test the network management center had carried out itself, so the cabling could not be at fault.
To localize the fault quickly, we used the common “binary search” method to isolate segments: we cut power to half of the hubs, and the fault persisted; we then cut power to half of the remaining hubs (a quarter of the total), and the fault disappeared. After restoring power, we unplugged the workstation cable connectors on that quarter of the hubs (two hubs) one by one. When we unplugged the cable connector of the workstation on port 7# of hub 6, all error indications on the network multimeter disappeared. The network staff insisted the fault could not be that workstation’s network card: in preparation for the inspection, every card had already been swapped pairwise with adjacent workstations and moved among three neighbouring machines (the center had no other network test tools and had to rely on this widely used, if crude, method). We tested the workstation’s network card with the network tester, and its port showed normal physical parameters and operating protocols. This suggested that the fault lay elsewhere inside the workstation and consisted mainly of interference-type errors (falling into the unclassified error category). Disconnecting the cable at the network card end of the workstation eliminated the fault, which ruled out noise picked up along the cable itself. On closer examination, a faint but distinct burning odor could be detected near the workstation (though nothing was yet smoking), and up close a distinct sizzling sound came from the workstation’s power switch. Testing the workstation’s connection to the server, we observed large numbers of retransmissions and invalid frames. After swapping in a spare power supply, the fault was resolved.
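The “binary search” isolation we used, powering off half of the suspect hubs at a time and keeping whichever half still reproduces the symptoms, can be written down generically. The sketch below is only an illustration of the procedure under assumed names: fault_present() is a hypothetical probe meaning “do the tester’s error indications persist while only these devices are powered on?”, and the hub labels are made up.

```python
from typing import Callable, Sequence

def isolate_fault(devices: Sequence[str],
                  fault_present: Callable[[Sequence[str]], bool]) -> str:
    """Repeatedly power down half of the remaining suspects and keep
    whichever half still shows the fault, until one device is left."""
    suspects = list(devices)
    while len(suspects) > 1:
        half, other = suspects[:len(suspects) // 2], suspects[len(suspects) // 2:]
        # `other` is powered down; if the fault persists, the culprit is in `half`.
        suspects = half if fault_present(half) else other
    return suspects[0]

# Simulated run with 8 hubs and the culprit on hub 6 (hypothetical labels).
hubs = [f"hub-{i}" for i in range(1, 9)]
print(isolate_fault(hubs, lambda active: "hub-6" in active))  # -> hub-6
```

In the field, the last step was done port by port instead: unplugging workstation connectors on the two remaining hubs until the error indications vanished at port 7# of hub 6.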
3. Conclusion
The cause of the fault was quite simple: electromagnetic interference generated by the faulty power switch of a single workstation. The interference signal entered the network through the network card’s output port and occupied a large share of the bandwidth. It corrupted other workstations’ data packets, which showed up as a high rate of FCS frame check errors (distributed roughly in proportion to each workstation’s actual traffic), and it also disturbed the servers and routers, producing the frequent alarms on the network management station at the regional center. Because the real offered load was only about 3%, and total utilization including the interference was still only around 41%, close to the roughly 40% level at which users generally start to notice congestion, users on the city’s subnet felt no obvious slowdown even though their packets were being damaged and retransmitted.
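For reference, the utilization figures quoted in this diagnosis add up as follows (a trivial restatement of the numbers above, not new measurements):

```python
fcs_errors   = 0.27   # traffic share of frames damaged by the interference
unclassified = 0.11   # raw interference no station could decode
normal       = 0.03   # effective (useful) traffic

total = fcs_errors + unclassified + normal
print(f"total utilization ~ {total:.0%}")   # -> ~41%
print(f"useful share      ~ {normal:.0%}")  # -> ~3%
```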
4. Diagnostic Recommendations
Network management systems can typically detect only about 30% to 40% of network faults (depending on the managed devices’ own management capabilities and on the system’s ability to analyze and record abnormal traffic). Once an alarm is raised, further steps are usually needed to pinpoint the fault’s location and nature quickly. Several factors explain why this fault could not be located and fixed promptly. First, the county-level network had no network maintenance tools at all; staff relied solely on their own experience and on some software downloaded from the internet to watch the network, which is the direct reason the fault went unresolved for so long. Equipping network management and operations staff with tools appropriate to the scale and level of the networks they maintain remains a thorny problem for network planners, planning units, and maintenance personnel alike. Second, the fault itself was simple, but weaknesses in the maintenance system prevented close cooperation and coordination during fault-finding, which dragged the problem out. In fact, there are many mature approaches and practices for managing and maintaining networks comprehensively, efficiently, quickly, and at low cost. We suggest that network management and operations staff, busy as they are with rapid network roll-outs, tracking new network technologies, and learning new equipment, set aside some of their energy to study the theory, methods, and established practice of network maintenance, so as to achieve better results with less effort. Complete network documentation, regular testing, network benchmarking, performance monitoring, physical-layer testing, protocol monitoring, traffic analysis, and the like have always been simple but effective ways of keeping serious incidents out of large networks.
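As one concrete example of the routine performance monitoring and benchmarking mentioned above, the sketch below compares per-port error shares against a recorded baseline and flags ports that drift. All names and numbers are made up for illustration; in practice the counter snapshots would come from whatever the site already has available (SNMP polls, switch CLI output, or a tester’s export).

```python
from typing import Dict, Tuple

Counters = Dict[str, Tuple[int, int]]   # port -> (total frames, errored frames)

def error_rates(prev: Counters, curr: Counters) -> Dict[str, float]:
    """Per-port error share computed from two cumulative counter snapshots."""
    rates = {}
    for port, (frames, errors) in curr.items():
        p_frames, p_errors = prev.get(port, (0, 0))
        delta = max(frames - p_frames, 1)          # avoid division by zero
        rates[port] = (errors - p_errors) / delta
    return rates

# Hypothetical snapshots taken an hour apart, echoing the 27% seen on port 3#.
baseline = {"slot1/port3": 0.01}                   # expected error share
prev = {"slot1/port3": (1_000_000, 500)}
curr = {"slot1/port3": (2_000_000, 270_000)}

for port, rate in error_rates(prev, curr).items():
    if rate > baseline.get(port, 0.0) + 0.02:      # alert margin: 2 points
        print(f"ALERT {port}: error share {rate:.0%} vs baseline {baseline[port]:.0%}")
```

Run regularly and recorded, even a simple report like this would likely have flagged segment 3#’s drift long before the inspection.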
Did you know that in firefighting the most important work is not putting out fires but preventing them? Network maintenance is much the same, and just as important; the comparison holds all the way through.
5. Afterword
Afterwards, the regional center ran comprehensive certification tests on all of its subnets and uncovered a number of hidden weaknesses that would normally go unnoticed. The network should now be in its best health yet. We have recently been invited to give the network an overall assessment, and we hope it achieves a breakthrough (on a 10-point scale, the highest score awarded so far is 5).