Hey everyone, I'm a long-time lurker, first-time poster. The performance environment my company provides is far from optimal. Certain scenarios yield consistent results when I run them, while others tend to produce some fairly large outliers. I've been doing everything I can over the last two years to get the environment completely rebuilt, but that's all beside the point.
I can run the exact same script with the exact same data multiple times in my isolated environment and still see wide variance in transaction response times, standard deviations, and 90th percentiles. Over the last week I've been looking at reporting methods that could characterize system performance by weeding out some of the outlying data.
What I've come across over the last few days is the mathematical approach of using the interquartile range (IQR) to calculate boundaries for outlying data. Basically, anything higher than Q3 plus 1.5 times the IQR is considered an outlier, and anything lower than Q1 minus 1.5 times the IQR is also an outlier. The site named these two values the "min valid" and "max valid" and created a box plot using the min valid, Q1, median, Q3, max valid, and the actual min and max values, which gave a great overall picture of the characteristics of the numbers. I ran through this exercise on one of my tests, and while it looks like the best representation of my results so far, I was curious how realistic it is.
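For anyone who wants to try this, here's a minimal sketch of the IQR fence calculation described above. The sample data is made up for illustration, not from my actual test results:

```python
import numpy as np

# Hypothetical response times in seconds (not real test data)
samples = np.array([0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 9.5])

q1, q3 = np.percentile(samples, [25, 75])
iqr = q3 - q1
min_valid = q1 - 1.5 * iqr   # lower fence: anything below is an outlier
max_valid = q3 + 1.5 * iqr   # upper fence: anything above is an outlier

outliers = samples[(samples < min_valid) | (samples > max_valid)]
pct = 100.0 * len(outliers) / len(samples)
print(f"fences: [{min_valid:.2f}, {max_valid:.2f}]")
print(f"outliers: {outliers} ({pct:.1f}% of samples)")
```

With real data you'd just load your transaction timings into the array and report the percentage flagged, which is how I got my ~8.7% figure.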
How accurate is it to identify outlying data using the IQR calculation? I can't find much information on this method; what I have found simply states it as fact, with no supporting rationale or prerequisites. When I ran this exercise on my test data, it flagged ~8.7% of my 32,407 samples as outliers. Maybe it's just me, but that seemed high. Do you think this method should not be used for reporting performance results, or is it a valid method that is simply highlighting our environmental issues?
Another method to visualize outliers is a simple scatter graph, see Scott Barber's article here.
I guess I'm curious - isn't using the 90th or 95th percentile enough? Or are the outliers so bad that you want to remove them from the average calculations, or what?
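To make that question concrete, here's a quick sketch contrasting the two approaches: reporting percentiles directly versus reporting a mean with IQR outliers removed. The numbers are hypothetical:

```python
import numpy as np

# Hypothetical response times with one bad outlier
times = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 12.0])

# Option 1: just report high percentiles; no data is discarded
p90, p95 = np.percentile(times, [90, 95])

# Option 2: strip IQR outliers, then average what's left
q1, q3 = np.percentile(times, [25, 75])
iqr = q3 - q1
kept = times[(times >= q1 - 1.5 * iqr) & (times <= q3 + 1.5 * iqr)]

print(f"raw mean: {times.mean():.2f}")
print(f"mean without IQR outliers: {kept.mean():.2f}")
print(f"p90: {p90:.2f}, p95: {p95:.2f}")
```

The single 12-second sample drags the raw mean well above the typical response time, which is exactly the situation where people reach for either trimming or percentile reporting.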
I'm actually interested in discussing this, but I'd like to understand more about why you need to analyze the outliers to this extent. I'm assuming that analysis has been done to indicate that these outliers are a result of the test environment and not indicative of some underlying issue in the AUT.
A problem is a difference between what is perceived and what is desired, that we want to reduce (Dewey 1933)