I was using Lincoln Stein’s fine perl script torture.pl to stress test some J2EE architecture. We were having some initial stability problems and the code would break under the weight of torture. I ran the perl script while I monitored the J2EE server, a process that allowed us to identify the point of failure: an unclosed database connection. We patched the code and life was grand. With the fix in place, I stressed the server again with torture. It performed well under pressure and, most importantly, the code never broke and the server never failed.
But a funny thing happened on the way to user acceptance testing. The code broke again. How could that be? I had stressed the code with far more simulated users than the number of acceptance testers. Why, then, did the code break? Did I miss something? In an attempt to solve the problem, I was back in the lab with torture, stressing the Java code.
But in the lab, the server wouldn’t break. I steadily increased the number of simulated users in a futile attempt to break the server. The only thing I broke was the Linux workstation running torture.pl, not the HP-UX server hosting the J2EE content. Perl’s overhead was exhausting the Linux workstation’s resources, and I didn’t have access to the computing resources necessary to run a high number of simulated users in perl. I needed something leaner.
I was working on a project in C which made HTTP requests. Over the course of a weekend I was able to morph that code so that it would execute many simultaneous HTTP requests. I changed the output to match Lincoln Stein’s reporting so that it would be familiar to my team. Back in the lab on Monday morning, I was able to hit the server with three times as many simulated users. But nothing happened. The code still refused to break.
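The approach, in outline, was one thread per simulated user, each opening its own connection and issuing its own request. What follows is only a minimal sketch of that idea using POSIX threads and plain sockets; the user count, the hard-coded host, and the function names are my own illustrative assumptions, not the original code.

    /* A minimal sketch: many simultaneous HTTP GETs with POSIX threads.
     * The host and names here are hypothetical, not the original code.
     * Compile with: cc -o sim sim.c -lpthread */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <netdb.h>
    #include <sys/socket.h>

    #define USERS 50  /* number of simulated users */

    static const char *host = "www.example.com";  /* hypothetical target */

    static void *simulated_user(void *arg)
    {
      struct addrinfo hints, *res;
      char request[256], buf[4096];
      int sock;
      ssize_t n;
      long bytes = 0;

      memset(&hints, 0, sizeof(hints));
      hints.ai_family   = AF_UNSPEC;
      hints.ai_socktype = SOCK_STREAM;
      if (getaddrinfo(host, "80", &hints, &res) != 0)
        return NULL;

      sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
      if (sock >= 0 && connect(sock, res->ai_addr, res->ai_addrlen) == 0) {
        snprintf(request, sizeof(request),
                 "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", host);
        write(sock, request, strlen(request));
        while ((n = read(sock, buf, sizeof(buf))) > 0)
          bytes += n;                      /* drain the response */
        printf("user %ld: %ld bytes\n", (long)arg, bytes);
      }
      if (sock >= 0) close(sock);
      freeaddrinfo(res);
      return NULL;
    }

    int main(void)
    {
      pthread_t threads[USERS];
      long i;

      for (i = 0; i < USERS; i++)   /* one thread per simulated user */
        pthread_create(&threads[i], NULL, simulated_user, (void *)i);
      for (i = 0; i < USERS; i++)
        pthread_join(threads[i], NULL);
      return 0;
    }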
My new code proved one thing: I was pursuing the problem from the wrong angle. The number of simulated users exceeded the number of acceptance testers by many orders of magnitude, yet the acceptance testers were able to break the server and the simulated users were not. Torture was pounding the server one URL at a time, while the acceptance testers were roaming through the content at will. That discrepancy, I concluded, explained how the code passed one test and failed the other. I had to recreate the acceptance testing in a lab environment so that I could pinpoint the cause of failure.
I cobbled together a perl script which harvested server links to a file. I rewrote the C code so that it would read the URLs into memory and request them randomly. Success! I was able to break the server; I could recreate the failure.
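In spirit, that change looked something like this sketch: read a harvested URL file into memory, then have each hit pick an entry at random, the way a tester roams a site at will. The file name, limits, and names here are hypothetical; a real implementation would hand each pick to a request loop like the one sketched above.

    /* A minimal sketch of random URL selection, not the original code:
     * load a harvested URL file into memory, then pick entries at
     * random instead of hammering a single URL.
     * The file name urls.txt is hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define MAX_URLS 10000

    static char *urls[MAX_URLS];
    static int   nurls = 0;

    /* read one URL per line into memory */
    static void load_urls(const char *file)
    {
      char line[1024];
      FILE *fp = fopen(file, "r");
      if (fp == NULL) { perror(file); exit(1); }
      while (nurls < MAX_URLS && fgets(line, sizeof(line), fp)) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip newline */
        if (line[0] != '\0')
          urls[nurls++] = strdup(line);
      }
      fclose(fp);
    }

    /* pick a URL at random, like a tester roaming the site */
    static const char *random_url(void)
    {
      return urls[rand() % nurls];
    }

    int main(void)
    {
      int i;
      srand((unsigned)time(NULL));
      load_urls("urls.txt");
      for (i = 0; i < 10; i++)        /* each hit lands somewhere new */
        printf("GET %s\n", random_url());
      return 0;
    }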
Continued testing led me to the conclusion that our code was not breaking; rather, the failure was in the server itself. With the evidence we collected, we were able to demonstrate the problem to the vendor, which ultimately resulted in a vendor-issued patch. With that in place, the site was stable.
I added a cool name and GNU autotools support and published siege under the GPL because I thought it would be useful to others.