We have been deliberating and debating this and thought we would involve more minds on it.
Consider this scenario:
I run a performance test with an SLA of 120 transactions in one hour. To achieve this throughput I need an arrival rate of 1 transaction every 30 seconds, and I have set up my LoadRunner scenario to do exactly that.
Version 1 of the software that I benchmarked gave me this throughput with a CPU utilization of ~80% and a response time of 20 seconds. So effectively I waited 10 seconds between transactions to achieve this throughput.
Version 2 of the same software had some performance features built into it, so when I ran the same scenario I noticed that the CPU utilization had come down to ~40% and the response time had dropped to 10 seconds. So I now effectively waited 20 seconds between transactions.
In both cases my arrival rate has been constant at 1 transaction every 30 seconds. The only thing that has changed is the service time, due to the performance improvements, and since the service time has changed, the CPU utilization has changed as well.
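To make the arithmetic concrete, here is a small Python sketch. It uses the utilization law from queueing theory (U = X * D, where X is throughput and D is CPU demand per transaction); treating the reported CPU percentages as run-long averages for the server under test is an assumption on my part, not something stated above.

    # Scenario arithmetic via the utilization law U = X * D.
    # Assumption: the CPU percentages are run-long averages for the
    # server under test, aggregated across all of its cores.

    sla_tx_per_hour = 120
    pacing = 3600 / sla_tx_per_hour       # 30 s between arrivals
    throughput = 1 / pacing               # X = 1/30 transactions per second

    for version, (cpu_util, resp_time) in {"V1": (0.80, 20.0),
                                           "V2": (0.40, 10.0)}.items():
        think_time = pacing - resp_time     # wait inserted between transactions
        cpu_demand = cpu_util / throughput  # D = U / X, CPU-seconds per tx
        print(f"{version}: think time {think_time:.0f} s, "
              f"CPU demand ~{cpu_demand:.0f} CPU-seconds/transaction")

    # V1: think time 10 s, CPU demand ~24 CPU-seconds/transaction
    # V2: think time 20 s, CPU demand ~12 CPU-seconds/transaction
    # (Demand larger than the response time just means the work is
    # spread across more than one core.)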
Now the questions:
1. In the above example, is it right to compare and trend these two runs even when the utilization is quite different? How confident can we be in saying that the response time has improved by a factor of x when the utilization is not the same, as above?
2. Is it valid if I change the arrival rate so that it achieves 240 transactions per hour and keeps the utilization at 80%? This may give me a different response time, but my throughput is completely whacked, so how meaningful is that comparison?
This totally depends on what your test requirements are.
I assume by CPU utilisation you are talking about the server hosting the app?
1) Is the business requirement for V2 of the app to perform at least as well as, or better than, V1? If so, this has been proven. In fact you have demonstrated that server utilisation is considerably lower even though the same amount of work is being done (from a business-process perspective).
2) Do you need to establish the maximum transactional throughput of the system (and quantify the additional capacity gained by moving from V1 to V2 of the application)? In that case a test such as your 2) is a valid test (a rough sketch follows below).
You should also understand what the maximum allowable infrastructure utilisation is (this can vary depending on the organisation in question and on what method of failover/clustering is employed, etc.).
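To put rough numbers on that capacity question: if the utilization law holds and the per-transaction CPU demand stays where the original figures put it (~24 CPU-seconds for V1, ~12 for V2; an inference, not a measurement), you can project utilization at a higher arrival rate. A hedged sketch:

    # Project CPU utilization at a higher arrival rate via U = X * D.
    # Assumption: CPU demand per transaction stays constant as load
    # grows, which only holds while the CPU remains the bottleneck.

    def projected_util(tx_per_hour, cpu_demand_s):
        throughput = tx_per_hour / 3600.0   # X in transactions/second
        return throughput * cpu_demand_s    # U = X * D

    print(projected_util(240, 12.0))  # V2 at 240 tx/hour -> 0.8 (80% CPU)
    print(projected_util(240, 24.0))  # V1 at 240 tx/hour -> 1.6 (saturated)

On those assumptions, V2 would deliver the 240 transactions per hour from question 2 at roughly the same 80% utilization that V1 needed for 120, i.e. double the capacity.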
1) It is valid to compare these two tests. You have held the arrival rate constant, which is correct. Due to the performance gains the response time has improved, causing the hit-rate (number of requests per unit time) to decrease. This should lower your average CPU utilization, in addition to whatever performance tuning was done (a sketch after this reply puts rough numbers on it).
2) You wouldn't increase the arrival rate unless you were trying to test with a 'constant' hit-rate and/or to test the response time under higher utilization (or maximum throughput, as Perez stated).
Is there a chance that the application will be performing more than one transaction simultaneously (such as a website)? If so, you may want to introduce some variation into the arrival rate to test the concurrency scenario (at a minimum); human interaction is not linear. Just a thought...
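As a rough illustration of point 1), Little's law (N = X * R) gives the average number of transactions in flight at the fixed arrival rate; the numbers below are an extrapolation from the figures in the original post, not measurements:

    # Little's law: N = X * R, the average number of transactions in flight.
    arrival_rate = 1 / 30.0   # held constant at 1 transaction per 30 s

    for version, resp_time in [("V1", 20.0), ("V2", 10.0)]:
        in_flight = arrival_rate * resp_time
        print(f"{version}: ~{in_flight:.2f} transactions in flight on average")

    # V1: ~0.67, V2: ~0.33 -- the faster version keeps less work queued
    # in the system at any instant, consistent with the lower CPU average.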
A problem is a difference between what is perceived and what is desired, that we want to reduce (Dewey 1933).
It's a perfectly valid comparison. V2 takes half the elapsed time and half the CPU time to perform the same amount of work as V1. This implies that nearly all of the response time is sensitive to CPU speed and that, as think time goes to 0, throughput for V2 would be close to 2x that of V1 (assuming the CPU remains the primary bottleneck).
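A minimal sketch of that claim, using the interactive response time law for a closed system, X = N / (R + Z), with N virtual users, response time R, and think time Z (my framing; it assumes response times hold near their measured values as Z shrinks, which is only plausible while the CPU stays the bottleneck):

    # Interactive response time law for a closed system: X = N / (R + Z).
    # Assumption: R stays near its measured value as think time Z -> 0,
    # which only holds while the CPU remains the primary bottleneck.

    def throughput(n_users, resp_time, think_time):
        return n_users / (resp_time + think_time)

    for z in (20.0, 5.0, 0.0):            # shrink think time toward zero
        ratio = throughput(1, 10.0, z) / throughput(1, 20.0, z)
        print(f"Z = {z:>4} s: V2/V1 throughput ratio = {ratio:.2f}")

    # Z = 20 s: 1.33, Z = 5 s: 1.67, Z = 0 s: 2.00 -- V2 approaches 2x V1.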