Wednesday, February 19, 2003 22:31 // ETZ J97, ETH, Zurich, Switzerland
Today around 9am our main Solaris server started acting up. Its performance got patchy. We eventually found that it was suffering from excessive TCP retransmits of up to 1000%. This means that for each packet it sends out on the net, it has to retransmit about 10 times before it gets through. This is an extremely high value, or so Virtual Adrian tells us.
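To put that number in perspective, a retransmit percentage can be read as retransmitted segments relative to original data segments. The minimal Python sketch below assumes that definition; the function name and the two counters are hypothetical, and the exact formula the monitoring tool uses is not something I have checked.

```python
def retransmit_percent(retransmitted_segments: int, data_segments: int) -> float:
    """Return retransmitted segments as a percentage of original data segments.

    Assumption: both values are counter deltas taken over the same interval.
    """
    if data_segments <= 0:
        raise ValueError("data_segments must be positive")
    return 100.0 * retransmitted_segments / data_segments


# Ten retransmissions for every original segment gives 1000%.
rate = retransmit_percent(retransmitted_segments=1000, data_segments=100)
print(f"{rate:.0f}%")
```

On a healthy network this figure would normally sit at a few percent at most, which is why a four-digit value stands out so badly.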
We started searching frantically for the cause of the problem, as performance on the server, and even more so on its clients, was suffering badly. After about an hour of web hunting and traffic dumping, we gave in to the pressure from the street and rebooted the beast, hoping that some kernel internals had been thrown out of whack and that all would be well after a reboot. And indeed it was, at least for a few minutes. Then the server started misbehaving again, driving its TCP retransmits through the roof. As rebooting did not help, we went back to tcpdumping and etherealing. I did learn a lot about pcap filter syntax ...
'tcp[tcpflags] & (tcp-rst) != 0 && tcp[tcpflags] & (tcp-ack) = 0'
but nothing about the reason for the retransmits. Fortunately, at this stage, the retransmit rate was not always at 1000% so work was possible for our users.
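For the curious, the capture filter quoted above selects TCP segments that have the RST flag set and the ACK flag cleared. The same bit test can be sketched in Python; the function name is hypothetical, while the flag values are the standard bit positions in the TCP header flags byte.

```python
TCP_RST = 0x04  # reset flag bit in the TCP flags byte
TCP_ACK = 0x10  # acknowledgement flag bit in the TCP flags byte


def is_rst_without_ack(flags: int) -> bool:
    """Return True when RST is set and ACK is clear, as in the capture filter."""
    return (flags & TCP_RST) != 0 and (flags & TCP_ACK) == 0


print(is_rst_without_ack(0x04))  # bare RST
print(is_rst_without_ack(0x14))  # RST together with ACK
```

A bare RST passes the test, while an RST that also carries an ACK does not.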
Then, in the early afternoon, Manuel found that the root disk of the server was causing SCSI timeouts. As if we didn't have enough on our hands already. SCSI timeouts make the machine stop and wait for several seconds at a time. Together with the server, most people using its resources were experiencing the same freezing problem on their workstations.
What a day. I have been writing emails to our users about what was happening all day long, but things were really starting to look bad. Our wonderful reputation for high quality service and superb uptime was going down the drain. It seemed, though, that most users were not blaming us at this stage, probably because I kept them up to date with what was happening.
Around that time David found that the latest Solaris kernel patch contained a fix for a TCP stack issue which might be related to the retransmits we were still suffering from. He started installing this patch so that we could activate it when we rebooted. A reboot was going to be necessary anyway, as I was preparing to replace the root disk with a fresh device.
Then, suddenly, just minutes before the reboot, the server went back to normal: the retransmits were gone and performance was good again, with no traces left.
So here I am, another day older and not much smarter about what was causing today's network problems. I can imagine that there is a bug in the Solaris TCP stack which can be triggered by a rogue packet, and that this would cause the symptoms we experienced today. But I suspect that, once the real reason is known, it will be way less spectacular.
Content © by Tobias Oetiker