I'm curious if anyone else has had a similar issue and/or if I'm just being an idiot (my money is on the latter)...

We've been using JMeter to stress test a new app on a new, dedicated server. After a lot of work setting up the script and learning JMeter, we determined our test PC (the one generating load) could handle 600 threads per JMeter instance. It could also comfortably handle 3 instances (4 in a pinch, but that pushed it). We had a similar machine that could mirror this performance, so in total we could push up to 6 x 600 concurrent users when testing our app. Testing went (relatively) well, and we determined the app/server in its current state was melting down at around 3000 users. These tests were all done with the server and the load-generating PC on the same network.

Fast forward a few weeks. We move the server to our data center. Nothing has changed in the server or test environment except that the server is now outside our network. The first test was a single 600-thread instance of JMeter running our script...it bombed. Spectacularly. I dialed it back to 300 and it worked just fine. I'll gloss over a lot of details here, but basically we tried two separate JMeter instances, each pushing 300 threads, from two separate locations. It worked great (expected throughput numbers, no errors, etc.). Then, on a whim, I tried the same thing (2 x 300 threads) on the original load-generating PC...and it worked! So now my question:
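For reference, the working two-instance run was roughly the equivalent of this (the test plan name and result file names here are placeholders, not our actual files; our plan reads its thread count from a property so we can override it per run):

```shell
# Launch two non-GUI JMeter instances of 300 threads each in parallel.
# "testplan.jmx" is a placeholder; inside it, the thread group uses
# ${__P(threads,600)} so that -Jthreads=300 overrides the default of 600.
jmeter -n -t testplan.jmx -Jthreads=300 -l run1.jtl &
jmeter -n -t testplan.jmx -Jthreads=300 -l run2.jtl &
wait
```

That's the whole difference between the passing and failing cases: same machine, same plan, same total of 600 threads, just split across two processes.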

Why would one instance running 600 threads fail, but two instances running 300 threads each work? As far as the server, the network, and the network card of the load-gen PC are concerned, the two tests are basically identical. I've seen the 600-thread test work many, many times (most of those times with 3-4 instances in parallel) on the load-gen PC. I have to assume some JMeter setting is having an issue with excessive external incoming traffic...but I can't find it.

Any help is appreciated. One more oddity I almost forgot: when the 600-thread test fails, every failure takes almost exactly 21 seconds, and the failure buildup is not gradual. One second things are working OK, then suddenly the max wait time on all my requests shoots up to ~21 seconds and the errors pile up.

- Jay