This is such a pain in the neck, if you ask me. When I did some work with VoIP, it basically involved making sure all the interface links in the system could handle the expected traffic. That meant building a performance model of the individual components (switches, gateways, routers, and so on) as well as a scaled performance model for the network as a whole. We had tools that could do reasonably good VoIP decoding, and our focus, naturally, was on latency and loss measurements.
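Not our commercial decoders, obviously, but if anyone wants to see what the core of those latency/loss measurements looks like, here is a minimal Python sketch: it parses plain RTP-over-UDP headers and computes packet loss plus RFC 3550 interarrival jitter. The packet feed and the 8 kHz clock are assumptions (a G.711-style stream), not anything from our actual toolchain.

```python
import struct

def parse_rtp_header(payload: bytes):
    """Parse the fixed 12-byte RTP header (RFC 3550) from a UDP payload.

    Returns (sequence number, timestamp, SSRC). Assumes no header
    extension and ignores any CSRC list for simplicity.
    """
    if len(payload) < 12:
        raise ValueError("too short for an RTP header")
    b0, _b1, seq, ts, ssrc = struct.unpack("!BBHII", payload[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    return seq, ts, ssrc

def stream_stats(packets, clock_rate=8000):
    """Compute packet loss and RFC 3550 interarrival jitter for one stream.

    `packets` is an iterable of (arrival_time_seconds, rtp_payload_bytes)
    in arrival order; `clock_rate` is the codec's RTP clock (8000 Hz for
    G.711). Sequence wraparound is ignored, which is fine for short runs.
    """
    jitter = 0.0          # running estimate, in RTP timestamp units
    prev = None           # (arrival_time, rtp_timestamp) of last packet
    first_seq = highest_seq = None
    received = 0
    for arrival, payload in packets:
        seq, ts, _ssrc = parse_rtp_header(payload)
        received += 1
        if first_seq is None:
            first_seq = seq
        highest_seq = seq if highest_seq is None else max(highest_seq, seq)
        if prev is not None:
            # Transit-time difference in timestamp units (RFC 3550, A.8).
            d = (arrival - prev[0]) * clock_rate - (ts - prev[1])
            jitter += (abs(d) - jitter) / 16.0
        prev = (arrival, ts)
    expected = (highest_seq - first_seq + 1) if received else 0
    return {
        "received": received,
        "lost": max(expected - received, 0),
        "jitter_ms": jitter / clock_rate * 1000.0,
    }

if __name__ == "__main__":
    # Synthetic 20 ms G.711-style stream with one dropped packet (seq 3).
    pkts = []
    for seq in (0, 1, 2, 4, 5):
        hdr = struct.pack("!BBHII", 0x80, 0, seq, seq * 160, 0x1234)
        pkts.append((seq * 0.020, hdr + b"\x00" * 160))
    print(stream_stats(pkts))   # reports 5 received, 1 lost, 0.0 ms jitter
```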
We basically had test cases for traffic priority/precedence specific to the VoIP implementation, audio quality levels, RTP packets, variable-bitrate codecs, signaling integrity, transmitted reference samples, buffering, and then particular elements related to the software interface we were using. The one thing we did not do was any of the "estimated voice quality scoring" that a lot of vendors advertise now. The biggest problems (or bottlenecks) we had were in our network segments; if you have a latency issue there, the quality of the voice data suffers proportionally.
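For the traffic priority/precedence cases in particular, a lot of the checking reduces to verifying that DSCP markings survive each hop. Here is a rough sketch, assuming raw IPv4 captures and the common convention of EF (DSCP 46) for voice bearer traffic; how you actually capture the packets is up to you.

```python
import struct

EF = 46  # Expedited Forwarding, the usual DSCP class for voice bearer packets

def dscp_of(ipv4_packet: bytes) -> int:
    """Extract the DSCP value: the top 6 bits of the second IPv4 header
    byte (the old TOS field)."""
    if len(ipv4_packet) < 20 or ipv4_packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    return ipv4_packet[1] >> 2

def check_priority(packets):
    """Count how many captured voice packets kept their EF marking."""
    marked = sum(1 for p in packets if dscp_of(p) == EF)
    return marked, len(packets) - marked

if __name__ == "__main__":
    def fake_ipv4(tos):
        # Minimal 20-byte IPv4 header: version/IHL, TOS, length, id,
        # flags/fragment, TTL, protocol (17 = UDP), checksum, src, dst.
        return struct.pack("!BBHHHBBH4s4s", 0x45, tos, 20, 0, 0,
                           64, 17, 0, b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02")

    pkts = [fake_ipv4(EF << 2), fake_ipv4(EF << 2), fake_ipv4(0)]
    print(check_priority(pkts))  # (2, 1): one packet lost its marking
```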
I agree with this being a pain. Jeff did more detailed work than I did; mostly what I did was a multi-user end-user simulation using a "traditional" load generation tool. It really would have been easier and cheaper just to get a bunch of users on the application manually. Much as I generally frown on that technique, sometimes it really is a viable option for load generation.
With things like VoIP and streaming media, the major bottlenecks are usually very obvious if you are monitoring the right things and applying even a moderate, manual load. At least they have been in my experience.
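For what it's worth, a "moderate, manual load" does not even need a tool; a few lines of scripting will generate bearer-shaped traffic while you watch the interface counters. A sketch, assuming G.711-style 20 ms packetization toward a hypothetical destination; it exercises the path but it is not a real call (no signaling, no SIP/H.323).

```python
import socket
import struct
import time

def send_test_stream(dst, seconds=10.0, streams=1):
    """Send a steady G.711-shaped stream at `dst`: one 172-byte packet
    (12-byte RTP header plus 160-byte payload) every 20 ms per stream,
    roughly 80 kbps each before IP/UDP overhead.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\xd5" * 160          # A-law "digital silence" filler
    interval = 0.020
    start = time.monotonic()
    seq = 0
    while time.monotonic() - start < seconds:
        for ssrc in range(streams):
            # 0x80 = RTP version 2; payload type 8 = PCMA (RFC 3551).
            hdr = struct.pack("!BBHII", 0x80, 8, seq & 0xFFFF,
                              (seq * 160) & 0xFFFFFFFF, ssrc)
            sock.sendto(hdr + payload, dst)
        seq += 1
        # Pace against the start time so timer drift doesn't accumulate.
        time.sleep(max(0.0, start + seq * interval - time.monotonic()))

# Example: 25 simulated call legs for 30 seconds.
# send_test_stream(("192.0.2.10", 40000), seconds=30.0, streams=25)
```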