Spirent has always been on the cutting edge of video and audio testing. When the hot technology was Microsoft’s MMS, we added that. When RTSP rose, we implemented it with a myriad of player emulations (Microsoft, Apple, Real, BitBand …). When RTMP (Flash) video was the next best thing, we added that too, and so on and so forth.
When presenting the solutions to the customers, we always make it clear that in video/voice testing, you want to look not only at how many streams the system you test can handle, but also the quality of the streams. If your customers are plagued with bad video quality, they will not use your Video on Demand service any more. If they can’t understand what the person on the other side of the IP phone is saying, they will switch to a different VoIP provider.
This is why Avalanche (as well as Spirent Test Center and some of our other products) always implemented Quality of Experience (QoE) metrics. There are network layer metrics: the Media Delivery Index (MDI) -related stats; and the “human-level” metrics: the Mean Opinion Score (MOS). These are pretty much industry standard metrics that are totally relevant when testing RTSP and SIP.
But now we support Adaptive Bitrate (ABR) and... we don’t provide MOS or MDI. And people are surprised, with reason. I was surprised too, at the beginning, until a discussion on our internal mailing list got me to think more about it. Let’s explore the reasons why we didn’t implement MOS and MDI for ABR, but let’s first recap what MOS means in the context of load testing.
What is MOS, again?
MOS is a score on a scale of 1 to 5 that reflects the quality of a video as a human would evaluate it. A score of 5 is “perfect” (and not achievable by design). A score of 1 is “please make it stop, my eyes are bleeding.” A typical good score is somewhere between 4.1 and 4.5.
As soon as a video is encoded, you can use tools to calculate its MOS. A good (usually one you have to pay for) encoder will give you a high score. A bad (that you sometimes also pay for) encoder will give you a bad score. I will not get into the details of how the score is calculated, but in short, it depends on your test methodology. Some people will compare the source and retrieved files (PEVQ). If you use R-Factor, you’ll use codec and bitrate information, and so on. There are other ways to calculate video quality, even when sticking to MOS.
When a video is streamed, the MOS on the receiving end cannot be higher than the source video. At best, in perfect network conditions, the MOS scores will be equal between the source and retrieved media. This is what you look for when looking at MOS during load tests: you’re looking at the evolution of the MOS (it shouldn’t get lower), not only its absolute value. If the source video MOS score is 2, it’s pretty bad, but if it’s still 2 when it reaches the client, your network is not degrading the quality: your network is good.
What makes MOS go down then? Typically, it’s packet loss. RTSP and Multicast video streaming typically use RTP/UDP for the data stream (RTSP, which is TCP-based, is only used for control). If you’re reading this blog, you know that UDP is an unreliable transport protocol – there’s no re-transmit feature, among other things. (People have been trying to work around that using RTCP, but it adds overhead, which is why RTP wasn’t based on TCP in the first place I think, so it’s not an ideal solution).
Why is it irrelevant then?
As we have just seen, in a live network, a MOS score will decrease due to bad network conditions because the underlying transport protocol (UDP) is unreliable. But Adaptive Bitrate is based on HTTP, which itself is based on TCP! And we know that TCP is a reliable protocol. Packets can still be dropped on the wire, but from the player’s point of view there will be no lost data: TCP’s retransmit mechanisms will kick in to make sure you get that lost packet.
This means your clients’ video quality score will always be the same as the source, because ABR relies on TCP to make sure there’s no lost data. Therefore, measuring this is irrelevant.
But retransmits bring other problems. First, there is the overhead. Not much can be done about that. ABR is a technology that will favor quality over verbosity.
Then it takes time to re-transmit packets. You’ve got an extra round-trip to make when you re-transmit. On a fairly lossy network, the re-transmissions will multiply and slow down delivery. How will this manifest (pun intended) for the users? They will not have enough data (fragments) to keep playing the video without interruption. The time they spend waiting for the player to refill its buffer is known as Buffering Wait Time. You don’t want that.
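To make the link between slow fragment delivery and Buffering Wait Time a bit more concrete, here is a minimal sketch of a playout buffer. It is not how Avalanche or any real player computes its stats. It assumes a hypothetical client that fetches fragments one after the other, starts playing as soon as the first fragment has arrived, and stalls whenever the next fragment shows up after the buffered video has run out. The function name `playback_stalls` and the sample fetch times are made up for illustration.

```python
def playback_stalls(fetch_times_s, fragment_s):
    """Return (underruns, total_wait_s) for fragments fetched one by one.

    fetch_times_s: how long each fragment took to download, in seconds.
    fragment_s:    how many seconds of video each fragment contains.
    Assumption: the startup delay (first fragment) is not counted as a stall.
    """
    clock = 0.0            # wall-clock time since the first request
    buffered_until = 0.0   # wall-clock time at which buffered video runs out
    underruns = 0
    total_wait = 0.0

    for index, fetch_time in enumerate(fetch_times_s):
        clock += fetch_time  # this fragment has now fully arrived
        if index == 0:
            # Playback starts when the first fragment is available.
            buffered_until = clock + fragment_s
        elif clock > buffered_until:
            # The buffer ran dry before this fragment arrived: a stall.
            underruns += 1
            total_wait += clock - buffered_until
            buffered_until = clock + fragment_s
        else:
            # The fragment arrived in time and extends the buffer.
            buffered_until += fragment_s

    return underruns, total_wait


if __name__ == "__main__":
    # Clean network: every 2-second fragment downloads in 1 second.
    print(playback_stalls([1.0, 1.0, 1.0], fragment_s=2.0))
    # Lossy network: retransmissions stretch the second download to 3 seconds.
    print(playback_stalls([1.0, 3.0, 1.0], fragment_s=2.0))
```

In the second call the slow fragment arrives one second after the buffer has emptied, so the viewer stares at a spinner for that second even though not a single byte of video was lost.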
When this threatens to happen, the ABR client will tend to downshift to a lower bitrate. This is what makes this technology brilliant. As the name implies, it will adapt to the network conditions. This is what you want to look at. As we’ve seen, the video quality is a given. What is not a given, and a very good metric to look at, is the total number of downshifts. Or the total number of buffer underruns. Or the average Buffering Wait Time. And guess what, Avalanche measures all that!
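Here is a rough sketch of what a downshift looks like from the client side, and how you could count them. The selection rule is an assumption for illustration only: pick the highest bitrate in the manifest that fits under the throughput measured for the previous fragment. Real players (and Avalanche’s bitrate shift algorithms) use their own heuristics, and the names `pick_bitrate` and `count_shifts` are hypothetical.

```python
def pick_bitrate(ladder_kbps, throughput_kbps):
    """Pick the highest bitrate that fits under the measured throughput.

    Falls back to the lowest bitrate when nothing fits.
    This is a deliberately naive rule, not any real player's algorithm.
    """
    fitting = [rate for rate in sorted(ladder_kbps) if rate <= throughput_kbps]
    return fitting[-1] if fitting else min(ladder_kbps)


def count_shifts(bitrates_kbps):
    """Return (downshifts, upshifts) for a sequence of per-fragment bitrates."""
    pairs = list(zip(bitrates_kbps, bitrates_kbps[1:]))
    downshifts = sum(1 for previous, current in pairs if current < previous)
    upshifts = sum(1 for previous, current in pairs if current > previous)
    return downshifts, upshifts


if __name__ == "__main__":
    ladder = [500, 1000]                          # bitrates offered, in kbps
    measured = [1200, 1100, 700, 400, 1500]       # made-up throughput samples
    chosen = [pick_bitrate(ladder, sample) for sample in measured]
    print(chosen)
    print(count_shifts(chosen))
```

With these made-up samples the client drops from 1000 to 500 kbps once when the throughput dips, then climbs back up once the network recovers: one downshift, one upshift.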
One Metric To Rule Them All
People like to have one metric to simplify the results analysis, and they are right. While this metric cannot be as precise as looking in detail at all the stats, it’s important to have it.
In Avalanche we call it the Adaptivity Score. We look at the total bandwidth used by the users, and compare it to the potential maximum bandwidth (that’s the maximum available bitrate multiplied by the number of users). We then normalize it to a scale of 100.
Let’s look at an example. Say we have 10 users connecting to an ABR server serving streams at bitrates of 1 Mbps and 500 Kbps. That’s a maximum potential bandwidth of 10 Mbps. If all 10 users are on the 1 Mbps stream, the score will be 100:
(current bitrate / max bitrate) * 100
((10×1 Mbps) / 10 Mbps) * 100 = 100
Now let’s pretend that half of the users go to the 500 Kbps stream.
(((5×0.5 Mbps) + (5×1 Mbps)) / 10 Mbps) * 100 = 75
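If you want to play with the numbers yourself, the calculation is easy to reproduce. The snippet below is only a sketch of the formula as described in this post, not Avalanche’s actual code. The function name `adaptivity_score` and the input shape (a mapping from each available bitrate to the number of users currently on it) are my own assumptions for illustration.

```python
def adaptivity_score(users_per_bitrate_mbps):
    """Score from 0 to 100: bandwidth in use vs. the potential maximum.

    users_per_bitrate_mbps maps every bitrate the server offers (in Mbps)
    to the number of users currently streaming it. Bitrates nobody is
    using must still be listed (with 0 users) so the maximum is known.
    """
    total_users = sum(users_per_bitrate_mbps.values())
    if total_users == 0:
        raise ValueError("no users are streaming")

    # Potential maximum: everybody on the highest available bitrate.
    max_bandwidth = max(users_per_bitrate_mbps) * total_users
    # Bandwidth actually in use across all bitrate buckets.
    used_bandwidth = sum(
        rate * users for rate, users in users_per_bitrate_mbps.items()
    )
    return used_bandwidth / max_bandwidth * 100


if __name__ == "__main__":
    # All 10 users on the 1 Mbps stream.
    print(adaptivity_score({1.0: 10, 0.5: 0}))
    # Half of the users have moved to the 500 Kbps stream.
    print(adaptivity_score({1.0: 5, 0.5: 5}))
```

The two calls reproduce the two cases above: 100 when everybody is on the top bitrate, 75 when half of the users have downshifted.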
And since we do this calculation at every result sampling interval, you can analyze this after the test has been executed and get a nice graph.
In the example below I used Avalanche to emulate both the client and server of the Apple HLS implementation of ABR. I have a bunch of bitrates in the Manifest, and enabled a bitrate shift algorithm. The users are configured to start at the minimum possible bitrate and work their way up. The video lasts 5 minutes (to allow enough time to shift all the way up).
The first graph shows the Adaptivity Score. The second graph shows which bitrate “buckets” the users are streaming from. We can see that as the users go to higher bitrate channels the score goes higher.
And just for fun, here’s a screenshot of the throughput in that test. That’s almost 30 Gbps on a single device emulating both the clients and servers 🙂
If there is one thing to take from this article, it’s that in HTTP Adaptive Bitrate we know that thanks to TCP, all of the video data will reach the clients. There will be no lost data. We know that the quality of the video as viewed by the clients will be equal to the source. But the cost of this is that you might have an increased buffering time as packets are potentially re-transmitted.
The second part is that if a Service Provider wants to make sure its customers have the best possible experience, it needs to make sure these clients can smoothly stream from the highest available bitrate. That’s your “quality of experience” measurement in ABR: how close to the maximum available bandwidth your clients can reach.