1. Symptoms
Today’s “patient” is the billing center of a mobile phone company. According to the center’s network manager, in order to ease the difficulty mobile subscribers had in paying their phone bills, the center invested heavily three months ago in adjusting and upgrading the original billing network. The links between the center and the four banks entrusted to collect phone fees were upgraded from standard 64 Kbps DDN lines to 2.048 Mbps E1 lines, and the billing center’s own network was upgraded from 10 Mbps Ethernet to a primarily switched 100 Mbps Ethernet.

Before the upgrade, the entrusted banks had often reported mysterious network interruptions, which generally recovered quickly and did not cause serious disruption. After the upgrade, although network speed improved markedly, the 120 subordinate service points that handle payments for mobile users frequently saw messages such as “network remote fault, unable to provide data” or “unstable data transmission, please check the network” on their screens while processing payments. Service would then be suspended temporarily, causing considerable user dissatisfaction. At other times service continued but data processing slowed noticeably, with the worst cases taking 5 to 6 minutes per transaction (normally under 10 seconds), slower even than the network’s performance before the upgrade. The problem occurred once or twice a week, with each episode lasting one to two hours.

Because the real cause of the pre-upgrade interruptions had never been determined, the network management staff had hoped, when planning this upgrade, that replacing the equipment would eliminate these lingering problems once and for all. Unfortunately their luck was poor: not only did the old problems remain unresolved, but more serious new ones appeared. As a result, they turned to the “network hospital” for diagnosis.
2. Diagnosis Process
Since the bank network and the telecommunications billing network are not in the same location, it was hard to decide which to investigate first. Initial analysis of the symptoms suggested that the problem could lie in the bank network, in the mobile company’s billing center network, or in the links connecting them. Most of the billing center’s network and routing equipment had been replaced during the upgrade, yet the problems persisted and grew worse, which suggested that the new equipment itself was not the root cause. The investigation therefore had to cover both the bank network and the billing network. Inquiries at the bank’s service points showed that whenever mobile billing ran into “trouble”, the bank’s other business continued without incident, and users on the billing center’s own network reported no anomalies either. This made the bank network a less likely culprit than the routing equipment in the mobile billing network and its connection to the bank network, so the decision was made to examine the billing network and its connecting links in the mobile company’s data center first.

The first test was performed while the network was behaving normally, and every indicator looked healthy. With the F683 network tester connected to the billing network’s switching router and monitoring its performance, router utilization was 1% (roughly 20 Kbps of business traffic on the E1 link) and error statistics were 0%, fully consistent with the data shown by the network management system. Moving the F683 to monitor in parallel on the billing server’s link gave the same results, indicating that the network was operating normally at the time. Conversations with the LAN users and maintenance staff at the billing site revealed that they had never noticed any problem on their LAN; although they knew about the mobile users’ frequent complaints, nothing seemed wrong in the billing LAN and the billing server appeared to run normally. Observing the network management system while the problem was actually occurring likewise revealed nothing wrong with the routers, switches, or billing servers.

Next, the OneTouch Network Assistant, a network fault diagnostic tool, was used to generate simulated user traffic against the bank’s routers and the bank’s business transfer server (tested at the bank), and against the routers connecting the billing network to the bank network, the switches along the path, and the billing server (tested at the billing center). Each run lasted 2 minutes at 80% sustained traffic and was monitored with the F683 network tester. The results were essentially consistent: 80% utilization with no errors, except for a 2% collision rate observed on the billing server’s link. ICMP ping times were all under 3 ms, and ICMP monitoring showed no congestion, destination-unreachable, redirection, or parameter-error messages. These results indicated that the network channels were in good shape.
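As a rough illustration of the arithmetic behind these readings, the sketch below (Python; the function names, thresholds, and sample values are illustrative assumptions, not output from the F683 or OneTouch) converts a utilization percentage on an E1 link into Kbps and checks ICMP round-trip times against the 3 ms figure mentioned above.

```python
# Minimal sketch of the link-utilization arithmetic and the RTT check described above.
# Names, thresholds, and sample values are illustrative assumptions.

E1_LINE_RATE_KBPS = 2048  # E1 nominal rate, 2.048 Mbps


def utilization_to_kbps(utilization_pct, line_rate_kbps=E1_LINE_RATE_KBPS):
    """Convert a utilization percentage into absolute traffic on the link."""
    return line_rate_kbps * utilization_pct / 100.0


def rtt_ok(rtt_samples_ms, limit_ms=3.0):
    """True when every ICMP round-trip time stays under the limit (3 ms in the tests above)."""
    return all(rtt < limit_ms for rtt in rtt_samples_ms)


print(f"{utilization_to_kbps(1.0):.1f} Kbps")  # 1% of an E1 is roughly 20 Kbps of billing traffic
print(rtt_ok([1.2, 0.9, 2.4]))                 # True: the channel itself looks healthy
```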
At this point two testing approaches could be used to continue the diagnosis: passive monitoring, in which monitoring equipment such as network testers, traffic analyzers, and the network management system is left running until the fault recurs; or active testing, in which all relevant network equipment and terminals are brought up with their business applications (or the operations are simulated) and the network’s behavior is monitored to localize the fault. To speed up localization, the network managers of the billing network and the bank network agreed to use the second method, even though it demanded considerable manpower and resources: every piece of network equipment and every terminal involved would be tested while business operations were simulated.

The second test was conducted after the business day ended. Five minutes after all the equipment was started, the expected fault symptoms appeared. On the network management system, traffic on the routers connecting the billing network to the bank network rose to 3%, switch traffic doubled, and traffic to the billing server dropped by 70%, yet no anomaly was reported. Roving monitoring of all the links and devices along the billing path with the F683 network tester showed router and switch figures matching the network management system, but on the billing server’s link total traffic was 68%, valid data only 7%, and error data 61% (ghost frames, FCS errors, and frame fragments). The problem clearly lay on the link between the billing server and the switch. Service was temporarily halted and the cable plug was pulled from the billing server’s network card for a cable test: only the 1-2 and 3-6 wire pairs were connected, while the 4-5 and 7-8 pairs were not. The network staff explained that apart from newly added cabling, most of the cabling system had not changed during the upgrade, with only a few links adjusted. Further investigation revealed that the 4-5 and 7-8 pairs ran to another, backup server, which was used every two weeks for manual inspection, backup, and reporting of critical data to the local authorities. When the backup server was put into operation, the fault symptoms reappeared; when the backup server was temporarily given a new, separate link, the fault disappeared completely. Testing the replaced cable showed that its near-end crosstalk (NEXT) was out of specification, failing with a -2 dB margin, with power-sum NEXT (PSNEXT) at a -8 dB margin.
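The readings above suggest a simple rule of thumb: when error frames make up most of the traffic a tester sees on a link, suspect the physical layer before the server or the switch. The minimal sketch below (Python; the structure and the 10% threshold are illustrative assumptions, not part of any tester’s software) encodes that check using the figures measured on the billing server’s link.

```python
# Minimal sketch: flag a link for physical-layer investigation when error traffic
# dominates the total observed traffic. Threshold and field names are assumptions.

from dataclasses import dataclass


@dataclass
class LinkStats:
    utilization_pct: float   # total traffic observed on the link
    valid_pct: float         # well-formed frames
    error_pct: float         # ghost frames, FCS errors, frame fragments


def suspect_physical_layer(stats, error_share_limit=0.10):
    """Flag the link when error traffic dominates what the tester sees."""
    if stats.utilization_pct == 0:
        return False
    return (stats.error_pct / stats.utilization_pct) > error_share_limit


billing_server_link = LinkStats(utilization_pct=68.0, valid_pct=7.0, error_pct=61.0)
print(suspect_physical_layer(billing_server_link))  # True: check the cable before the server
```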
3. Conclusion
Network cables typically contain four twisted pairs (eight conductors). Standard 10Base-T and 100Base-TX networks use only the 1-2 and 3-6 pairs; the 4-5 and 7-8 pairs carry no Ethernet signal. In 10Base-T networks it was once common to use the 4-5 or 7-8 pairs for telephone service or to connect a second computer, but in 100Base-TX Ethernet the higher signalling frequency and data rate make this practice unacceptable. Before the billing network was upgraded, some locations used a single cable to connect two computers. Those cables were left unchanged during the upgrade, and because they happened to be close to the newly added switch, the backup server was connected over them. Although the backup server was rarely used, merely being connected introduced a small amount of interference into the billing server’s link, which accounted for the 2% collision rate observed there. When the backup server actually ran, the cable’s out-of-specification near-end crosstalk (NEXT and PSNEXT) caused severe interference and disrupted data transmission, so the same packets had to be transmitted and processed repeatedly. Measured traffic on the link suddenly rose to 68%, while valid traffic remained at only about 7% (6.98%); the rest was error and retransmission traffic. Because the servers were attached through inexpensive workgroup switches, the network management system could not see the severe problem on that switch port. The occasional interruptions before the upgrade were likewise caused by crosstalk in the shared cables; since the network then ran at 10Base-T speeds, the effect was relatively minor and appeared only briefly and sporadically.
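The pair-usage rule at the heart of this conclusion can be stated compactly in code. The sketch below (Python; the function and constants are illustrative, not a real cabling or networking API) expresses the idea that a cable shared with another device on pairs 4-5 and 7-8 may limp along at 10 Mbps but is not acceptable at 100 Mbps.

```python
# Minimal sketch of the pair-usage rule: 10Base-T / 100Base-TX signal only on pairs
# 1-2 and 3-6, so "sharing" pairs 4-5 and 7-8 is tolerable at best at 10 Mbps and
# unacceptable at 100 Mbps. Names are illustrative assumptions.

SIGNAL_PAIRS = {("1", "2"), ("3", "6")}  # pairs Ethernet actually transmits on
SPARE_PAIRS = {("4", "5"), ("7", "8")}   # unused by 10/100Base-T(X) signalling


def shared_cable_acceptable(other_device_pairs, speed_mbps):
    """Sharing the spare pairs was tolerated (though never recommended) at 10 Mbps,
    but crosstalk makes it unacceptable at 100 Mbps."""
    if other_device_pairs & SIGNAL_PAIRS:
        return False  # collides outright with the Ethernet signalling pairs
    return speed_mbps <= 10


# The billing server's cable also carried the backup server on pairs 4-5 and 7-8:
print(shared_cable_acceptable(SPARE_PAIRS, 10))   # True  (marginal, but it limped along)
print(shared_cable_acceptable(SPARE_PAIRS, 100))  # False (the fault described in this case)
```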
4. Diagnostic Recommendations
10Base-T Ethernet networks often contain non-standard cabling practices and many non-compliant cabling links. Because 10Base-T runs at low speed, these quality problems stay hidden, and they are exposed only when the network is upgraded to 100Base-TX Ethernet. The fact that such problems never showed up in 10Base-T cabling also gave integrators, contractors, and users the false impression that a cable is fine as long as it is physically connected, leading them to underestimate how much cabling product quality and installation workmanship affect network performance. Network designers are advised to start from standardized design plans, and contractors and users should specify standardized installation practices and field certification testing in their network construction contracts to guarantee the initial quality of the structured cabling system. As described in the “Network Testing and Maintenance Guidelines,” it is generally recommended to test the cabling system periodically, once a year (or every six months if necessary), to verify its performance and to catch damage caused by layout changes, changes in the number of users, or manual re-patching. In addition, keeping accurate and complete records of network operations and fault conditions greatly aids diagnosis; if the “patient” knows its own network and business processes well, a great deal of overtime by a great many people can be avoided when a fault does occur.
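As one possible way to keep the kind of records recommended above, the following minimal sketch (Python; the field names and the 12-month interval are assumptions for illustration) logs a cabling certification result and computes when the next periodic test is due.

```python
# Minimal sketch of a cabling-test record and periodic retest schedule.
# Field names and the retest interval are illustrative assumptions.

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CablingTestRecord:
    link_id: str      # outlet or patch-panel identifier
    test_date: date
    passed: bool
    notes: str = ""


def next_test_due(last, interval_months=12):
    """Annual retest by default; use 6 months for frequently re-patched links."""
    return last.test_date + timedelta(days=interval_months * 30)


record = CablingTestRecord(
    link_id="billing-server-to-switch",
    test_date=date.today(),
    passed=False,
    notes="NEXT margin -2 dB, PSNEXT -8 dB; shared spare pairs removed",
)
print(next_test_due(record))
```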
5. Afterword
A week later, when we called to check on the “patient,” we learned that they had completely replaced the shared cabling with separate, compliant cables. The billing network was working very well, and there were no more complaints from mobile users about difficulties in making payments.