 
Member
confidence intervals My load group has been asked how they model confidence in their results based on how they interpret confidence intervals, and as far as I know we don't. Anyone got any ideas? Can you give me a brief rundown? Thanks.
Senior Member
Re: confidence intervals Actually, not too long ago I wrote a small paper on this very concept. In most performance testing you have to be careful of the moving average, which is something all the tools that I know of rely on to some degree. The whole point of a confidence interval is really to get around the moving average and to account for the inherent variability in data in distributed performance modeling.
Consider two Web sites where both have an average ten-second download time (as computed by a moving average based either on scenario duration or transaction processing). Does this mean both sites are performing equally? Consider that for one Web site, all download times are between eight and twelve seconds. For the other, half the download times might be three seconds and the other half seventeen seconds.
A good performance model will rely on the calculation of a confidence interval to estimate this type of variability. When comparing two data sets with approximately equal numbers of data points (or equal numbers of transactions, equal numbers of sessions, etc.; whatever floats your boat), a narrow confidence interval will generally indicate consistency in performance while a wider confidence interval reflects greater variability in the data and indicates the potential for less consistency in performance. This does not necessarily mean performance is bad; just that it is more variable. And, as we know in performance for the Web, variability can easily turn into a problem, particularly with load spikes. For instance, the second Web site that I just mentioned would have a wider confidence interval, which means it might bear looking into why that is.
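To make that concrete, here is a minimal sketch (my own illustration, not taken from any particular tool) that computes a normal-approximation 95% confidence interval for two made-up data sets with the same mean but very different spread:

```python
import statistics

def confidence_interval(samples, z=1.96):
    """Approximate 95% confidence interval for the mean
    (normal approximation; reasonable for larger sample counts)."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # standard error
    return mean - z * sem, mean + z * sem

# Site A: every download lands between 8 and 12 seconds.
site_a = [8, 9, 10, 11, 12] * 20
# Site B: half the downloads take ~3 s, half take ~17 s.
site_b = [3] * 50 + [17] * 50

print(statistics.fmean(site_a), confidence_interval(site_a))  # narrow interval
print(statistics.fmean(site_b), confidence_interval(site_b))  # much wider interval
```

Both sites print the same average (10 s), but site B's interval is several times wider, which is exactly the variability the average alone hides.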
In the "You Did Not Ask But I Will Tell You Anyway" category, a general confidence interval also tells you how confident you can be that the true mean lies within the interval over the duration of execution.
So, basically, a confidence interval is a way of stating how confident one is that a system is performing adequately. (Kind of makes you wonder why I did not just say that in the first place, huh?) 
Member
Re: confidence intervals Okay, I guess I understand the basics of what it means. I'm not sure what you mean about the true mean and all that, but you're right that the tools use a moving average. But I'm still not sure how to "do" confidence intervals. You know what I mean?
Senior Member
Re: confidence intervals The true mean existing within the interval means that your measurement is not skewed because of unaccounted-for outliers in the execution duration (think of DoubleClick servers being down, for example, when you use their ads on your pages). But do not get sidetracked by that. I really just brought that up for any statistics purists out there who would no doubt have taken me to task had I not mentioned it. Anyway, "doing" confidence intervals is just a matter of measuring the values you have. I am attaching an Excel spreadsheet (zipped up) I threw together on this for the paper that I mentioned. You can see how it works by looking at the cells that have formulas. I also put a few comments here and there.
I will answer your other question, but I will give my blurb answer right now so you can decide to stop reading after this paragraph if you want to. The idea of "doing" confidence intervals is mainly to help mitigate the intrinsic error of measurement. The error of measurement will be a combination of intrinsic bias and data variance. When you minimize the error of measurement you get the most accurate measures of the ratio of true response times (overall) to average response times. (Stop reading now if that basically gave you what you wanted. The rest is expository based on what I just said.)
Please keep in mind that a confidence interval is meant to convey confidence, but it can be overused and actually be detrimental, i.e., providing confidence when, in reality, that is the last thing you should be feeling. Intervals such as these are subject to weighted averages. This is not quite the same thing as a moving average (tending more towards a specific population bias), but for the use of performance tools it might as well be. (I can guarantee that it is for Mercury's LoadRunner, Segue's SilkPerformer, and Rational's LoadTest. I am also fairly certain that this applies to RSW's eLoad as well.) The reason I mention this is because of something I mention often in this forum with regard to Web-based performance testing, and that is heavy-tailing. A heavy-tail distribution is a statistical term but, in this case, it means a relatively small number of very high download times skews a mean calculation, and this is usually attributable to differing file sizes served up by the Web system. (This is very similar to statistical outliers skewing a time-series analysis although, again, it is not, strictly speaking, identical.) Heavy tails make it hard to give confidence intervals for Web systems. This is as opposed to traditional client/server systems, which do not display such a distribution as a general rule.
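As a quick illustration of that skew (the numbers are made up, just to show the effect), a handful of very slow downloads drags the mean far away from the typical response time while the median barely moves:

```python
import statistics

# Hypothetical download times in seconds: most are fast, but a couple
# of very large responses (big files, slow routes) form a heavy tail.
times = [1.2, 1.4, 1.1, 1.3, 1.2, 1.5, 1.3, 42.0, 55.0]

print(statistics.mean(times))    # pulled way up by the two outliers
print(statistics.median(times))  # barely affected: 1.3
```

Report only the mean here and you would tell management the site averages nearly twelve seconds, when the typical user actually sees about one and a half.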
Going back to the tools for a moment (and you did not mention which, if any, you are using), you have to consider the measurement agent in use. This will have a more-or-less unique distribution of measurement data. This roughly corresponds with a built-in bias of the observer; in this case the agent is the observer. This is very much like quantum theory, where the observer affects the observed to greater and lesser degrees depending on the nature of the observing (monitoring) being done. And keep in mind that a measurement agent just refers to any process that simulates, monitors, or in some way analyzes performance data that is derived over a given foliated set (time duration).
You can get around this, somewhat, by the use of a weighted mean of medians. (Remember that the tools you use will essentially rely on a weighted mean or moving average, so what you are doing here is really just upping the ante a little bit.) You may be thinking, "What the heck does that mean?" Well, this calculated measure represents the average response time, and then a resampling technique is used on that data to calculate the confidence interval. To calculate the weighted mean of medians, a median is calculated for all of the data from each individual agent for a given measurement period, which is the sampling time. If there are, say, ten agents, the result is ten medians. What the statistics tell us is that each median eliminates the effect of a heavy tail since no individual measurement can unfairly skew the data. The weighted mean of these ten values is taken and presented as the overall average. The weighting for each measurement agent also helps to eliminate bias, mainly because each one is handled separately and has its own unique distribution of data. Quite obviously this is not a perfect way to do it but, since the idea is to make it so that all of the data cannot be pulled into one big data set (as if they all had the same distribution), what this does is force a calculation of a combined 95% confidence interval by combining the different measurement observation sets of results. In quantum theory, this would be the equivalent of removing the role of the observer by making him/her part of the system being observed. Of course, for those up on their quantum theory, all this does is move the problem one step up the ladder. The same thing applies in performance analysis as well but, at the very least, it does give more confidence than just accepting averages that might have little meaning.
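Here is a rough sketch of that calculation as I would code it (my own reading of the procedure; weighting each agent's median by its sample count, and using a basic bootstrap as the resampling step, are both assumptions on my part):

```python
import random
import statistics

def weighted_mean_of_medians(agent_samples):
    """One median per agent, weighted by that agent's sample count.
    Per-agent medians blunt the effect of heavy-tailed outliers."""
    medians = [statistics.median(s) for s in agent_samples]
    weights = [len(s) for s in agent_samples]
    return sum(m * w for m, w in zip(medians, weights)) / sum(weights)

def bootstrap_ci(agent_samples, n_boot=2000, seed=1):
    """Resample each agent's data with replacement and recompute the
    statistic to get an empirical 95% interval (a simple bootstrap
    standing in for the 'resampling technique' described above)."""
    rng = random.Random(seed)
    estimates = sorted(
        weighted_mean_of_medians(
            [rng.choices(s, k=len(s)) for s in agent_samples]
        )
        for _ in range(n_boot)
    )
    return estimates[int(n_boot * 0.025)], estimates[int(n_boot * 0.975) - 1]

# Three hypothetical agents; the third one caught a heavy tail (60 s).
agents = [
    [1.1, 1.2, 1.0, 1.3, 1.2],
    [1.4, 1.3, 1.5, 1.2, 1.4],
    [1.2, 1.1, 60.0, 1.3, 1.0],
]
print(weighted_mean_of_medians(agents))  # stays near the typical ~1.2-1.4 s
print(bootstrap_ci(agents))
```

Note that the single 60-second outlier does not move the point estimate much, because it can never dominate its own agent's median.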
If you read this far, please tell me all that made sense. 
Member
Re: confidence intervals I think you've got way too much time on your hands! (Kidding.) It sounds like the key idea is bias. This is what I've been worried about before but just didn't know how to word it to my management. Every time I look at results and report on them I'm never one hundred percent certain. And I'm still not clear on the exact "hows." (Please don't tell me I've gotta study quantum theory! I have enough trouble with basic arithmetic, thank you very much.)
Senior Member
Re: confidence intervals Okay, see if this helps, sans quantum theory.
Remember that when dealing with Web systems it is not always the easiest thing in the world to calculate a confidence interval. I said this before. Again, this is because Web traffic is very dynamic, very random, and many times bursty in nature (hence burst traffic patterns), all of which is just another way of saying that it is not normally distributed.
The real key to putting confidence intervals to use is to account for the fact that measurement agents will have a unique distribution of data. I said this before as well. So if we consider two groups of staged users in a performance testing tool doing the exact same actions, it is very possible for those two groups to be, statistically, very different in the distribution of their response times. Another thing to remember is that confidence intervals are only valid when calculated over time intervals in which the probability distribution is fairly constant. For this reason you can foliate an observation period any way you like but, as an example, it is possible to do two one-hour foliations and get confidence intervals for both of those hours, say one at 9:00 AM and one at 11:00 AM. This will be very different from performing a continuous two-hour test and calculating an interval for those two hours, even though the end result is basically two hours of testing in both cases. Put another way, forming a confidence interval over a two-day period is a lot different than calculating a specific confidence interval for two specific days. This may seem like an obvious point but it is often overlooked, most often during sustained stress testing when you are looking not only at rough scalability but also at when the system just bombs out on you.
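To see why the foliation matters, here is a toy sketch (all numbers invented): two one-hour windows each give a tight, meaningful interval, while pooling them into one two-hour interval produces a range that describes neither hour:

```python
import statistics

def ci95(samples):
    """Normal-approximation 95% interval for the mean (a sketch)."""
    m = statistics.fmean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return m - 1.96 * sem, m + 1.96 * sem

# Hypothetical response times: 9 AM is quiet, 11 AM is under load.
hour_9am = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.9, 1.1]
hour_11am = [4.8, 5.2, 5.0, 5.1, 4.9, 5.3, 4.7, 5.0]

print(ci95(hour_9am))              # tight interval around ~1 s
print(ci95(hour_11am))             # tight interval around ~5 s
print(ci95(hour_9am + hour_11am))  # wide, and covers neither hour's mean
```

The pooled interval sits around three seconds, a response time that essentially no user ever saw in either hour, which is exactly the trap with computing one interval over a period where the distribution shifted.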
A common thing people want to do is compare sites. Either they want to compare their own sites with each other (like in a company that has more than one site) or compare theirs to their competition (say uBid comparing themselves to eBay, or Barnes&Noble comparing themselves to Amazon). The key, in this case, is to create confidence intervals for each site separately. If the two intervals do not overlap at all, then you have a statistical basis for comparing one site to another. If there is overlap, then you really do not have a valid statistical basis for making a claim one way or the other except by doing actual reporting on the individual sites, and I doubt your competitors will appreciate you running your performance scenarios on their servers. The key to this, of course, is the sampling intervals and methods: they should be equivalent. Not only that, but the observation periods must have occurred at the same times. (Not on the same day or during the same hour, necessarily, but during the same interval of time, the foliation. Ideally, however, it should be at the same time during the same period, but burst traffic for Web systems shows this does not matter as much. In traditional client/server it would.)
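The overlap check itself is trivial; here is a small sketch (the interval endpoints are hypothetical):

```python
def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) confidence intervals overlap at all."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

site_a = (9.7, 10.3)   # hypothetical 95% CI for site A (seconds)
site_b = (8.6, 11.4)   # hypothetical 95% CI for site B (seconds)

# Overlap: no statistical basis for claiming one site beats the other.
print(intervals_overlap(site_a, site_b))          # True
print(intervals_overlap((1.0, 2.0), (3.0, 4.0)))  # False: a real difference
```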
Sampling is important, so to elaborate a little more: it is pretty much a truism that you need at least six to eight samples as the minimum sample size to get anywhere near a 95% confidence interval. Samples of what? Well, in standard parlance you would refer to data points, but I like to call them performance nodes. A performance node is a particular service or resource of the site that could, potentially, cause a bottleneck because it relies on distributed communication via some protocol. Essentially, however, to get 95% confidence for the median, we have to be 95% confident that the median lies between the highest and lowest observations, i.e., in the interval, which I mentioned in my first post on this. If not, we cannot give a valid upper or lower confidence bound for the median.
Moving along to justification for the above: fewer than six samples (performance nodes) cannot give a 95% confidence interval. To see why, consider that we have only three performance nodes. All three will be below the median 1/8th of the time, and all three will be above the median 1/8th of the time. Thus, the median is between the highest and lowest observations only 75% of the time. I am not going to go through rigorous proofs of this since I trust that the conclusion is fairly obvious, but I do want to show a basic list that highlights the number of performance nodes required to provide a given confidence interval:
3 = 75% confidence interval
4 = 87.5% confidence interval
5 = 93.75% confidence interval
6 = 96.8% confidence interval
7 = 98.4% confidence interval
8 = 99.2% confidence interval
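Those numbers come straight from the probability that all n samples land on the same side of the median; here is a one-liner (my own sketch) that reproduces the list above:

```python
# Each independent sample has a 1/2 chance of falling below the
# population median, so P(all n below) + P(all n above) = 2 * (1/2)**n.
# The complement is the chance the median lies between min and max.
def median_confidence(n):
    return 1 - 2 * 0.5 ** n

for n in range(3, 9):
    print(n, f"{median_confidence(n):.3%}")
```

Running it gives 75% at n=3 up through roughly 99.2% at n=8, matching the list (the list rounds 96.875% down to 96.8%).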
Something to keep in mind: a 95% confidence interval is mathematically possible with six performance nodes, but that does not equate with saying it is assured. So I would not go to your management waving it in their collective faces, claiming the authority of an omniscient performance analyst. A lot of this depends on the number of measurement agents and the arrival patterns being monitored/simulated. Having just one simulated user is a lot different than having ten. And having ten staged (concurrent) simulated users is different than having ten simultaneous simulated users. Also keep in mind that transaction pathing (what is actually being done) coupled with arrival rates will determine burst traffic, something that is very easy to miss with confidence intervals.
Another thing to keep in mind is the error rate. Failed transactions and requests should not be counted in any averages, and they should not be used to determine a confidence interval. However, the error rate does give one part of the picture of performance, so it should definitely not be discarded and, if nothing else, it serves as a good risk assessment based on the confidence interval that was calculated.
Junior Member
Re: confidence intervals Ah, no offense, but I think you are laughably wrong on a lot of this. Any tasks that form a trans. are a sequence of visits by the trans. to servers. The avg. # of visits per server is the visit ratio for the server. That is how you derive confidence. You have an output rate and an input rate.
Senior Member
Re: confidence intervals Laughably wrong? Ahh, geez. Not again! Why can I not just be wrong instead of laughably wrong for once? Anyway, while I admit the possibility, I am not sure where you think I am incorrect in what I stated. I reread my posts just to be sure, and I did find a few things I would probably have said a little differently, but I am not aware of any major errors of fact (yet). My confusion stems more from understanding how your statement is meant to show I was incorrect.
What you are basically describing is the forced-flow law. However, you state "the avg. # of visits per server" when I think what you meant was the average number of visits per transaction to a given server. That is how you work out the visit ratio. So if you consider a server index s, then V_s would indicate the visit ratio for that particular server. You are hinting at arrivals, and in one of my posts I specifically mentioned: "A lot of this depends on the number of measurement agents and the arrival patterns being monitored/simulated."
You also state you have "an output rate and input rate." True, if by input rate you are referring to arrivals, but the forced flow that you appear to be talking about requires two outputs, not an input per se, even if part of that is the basis. You have the server's output rate and the system's output rate (meaning the server, network, etc.). It is this, combined with the visit ratios, that determines conforming output rates over a given number of servers.
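For reference, the forced-flow law as I understand it fits in a tiny sketch (the throughput and visit-ratio numbers are hypothetical):

```python
# Forced-flow law: a server's throughput X_s equals the system
# throughput X times that server's visit ratio V_s (the average number
# of visits per transaction to that server): X_s = V_s * X.
def server_throughput(system_throughput, visit_ratio):
    return visit_ratio * system_throughput

# Hypothetical: the system completes 10 transactions/s and each
# transaction visits the database server 3 times on average.
print(server_throughput(10.0, 3.0))  # 30.0 requests/s at the DB server
```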
Actually, having typed that out, I think I see where you feel I am incorrect. You feel, and correct me if I am wrong, that the patterns of arrivals based on the output trafficking are not being accounted for in what I talked about with confidence intervals. After all, how confident can I be in the results if I do not account for variable output relative to arrivals? Fair enough, but consider that I did state: "Also keep in mind that transaction pathing (what is actually being done) coupled with arrival rates will determine burst traffic, something that is very easy to miss with confidence intervals." The transaction pathing will be the outputs. This goes back to my correction to your statement. It is not the average number of visits per server, but the average number of visits per transaction to the server. So I did state that there was some risk with Web systems that did not exist with traditional client/server (or, at least, did not exist to the same degree).
Junior Member
Re: confidence intervals In regard to your comments on comparing two sites' performance to one another, and not having a 'competitive' baseline to compare your results against: there is one company (www.webhancer.com) that is a third party that reports on what customers experience at any site on the Internet. They have reports on Amazon and B&N, and they are grouped as industry verticals, so if you want to find out the average TCP connect time for booksellers, they could easily tell you.
They have about 50 different reports they can generate, which can be further broken down by connection speed, time of day, etc.
If you want more info, email me. casey@cybereffects.com 
Senior Member
Re: confidence intervals Keynote, Optimal, Holistix, etc. all do similar things, and I agree they can be helpful in some situations. The problem is that many times they do not tell you how they derive their results, so you cannot feed them into your own performance models, if you use them. And, of course, the whole point of confidence intervals (and Web traffic in general) is that the averages do not, strictly speaking, matter or, put less bluntly, can be misleading.
Burst traffic is the number one overall killer for Web sites (witness recent problems with Amazon or problems farther back with eBay, PlanetSourceCode, or CrankyCritic or even going farther back with SETI@Home).
Heavy-tailed distributions are another problem, one that mainly affects sites like CNN.com, CinemaNow, NetFlix, etc. This is something that many services do not always account for or, if they do, they do not tell you exactly how they account for it.
I agree that it is handy to see the competition (at a snapshot in time, of course), but it is not very helpful overall, at least to me, in showing what your site is doing or will do. You also have to know the configuration that is hosting the Web site, which some services do not always tell you.