Justin Hartman • January 01, 2019
It happens about once a week: someone starts their very first stream with a camera pointed at themselves. Watching it on their computer or tablet, they’re surprised to discover that they’re seeing themselves from about 30 seconds ago, and they call us to ask why their stream is so delayed.
Many streaming newcomers are used to tools like Skype or FaceTime that allow them to collaborate with others in real-time and feel a lot like talking on the phone or in person. So why is streaming different?
In this post, we’ll answer this question and provide a deeper understanding of live video streaming and how it works.
First and foremost, it’s important to distinguish web “streaming” from “conferencing” tools.
Tools like BoxCast, Livestream, Ustream, and YouTube Live fall primarily into the former camp, while tools like FaceTime, Skype, and Zoom fall into the latter.
The primary difference in the design of these tools is whether the content is primarily meant to be a “broadcast” — a small number of presenters to a potentially large number of viewers in a one-way fashion — or a two-way collaboration among a limited number of participants.
Although this distinction may seem trivial, it becomes very important when the number of participants or viewers scales to a large number.
Keeping the delay between participants low enough for collaboration requires tightly coupled computing services, and tightly coupled services do not scale to large numbers of participants.
For the remainder of this post, we’ll focus on streaming services, which are meant to be a means of broadcasting content to an arbitrarily large, globally distributed audience.
Below are a few streaming terms that you might not be familiar with. We’ve defined them so you can reference them as you go on:
If you assume that your viewers are not sitting in the seats at your live event, then latency may actually not be that important. Whether two seconds or two minutes, if a viewer is not present in person, then they’ll be blissfully unaware that there is any latency at all.
However, sometimes latency is an issue.
For example, live attendees may be tweeting updates, or you may be providing live score and stat updates for a sporting event. If your latency is too long, viewers may read about something before they see and hear it happen, which is not ideal. So, we should try to keep the latency as low as possible.
If you are choosing a streaming technology and latency is a potential concern for you, the main decision you must make is a tradeoff between scalability and low latency.
If low latency is more important, you should choose a technology that provides it, at the cost of potentially inhibiting future viewership growth.
If broad viewership is more important, you should choose a technology that supports broad scalability at the expense of higher latency. We’ll offer a technical breakdown below.
Let’s look at how a typical live streaming system works and examine how latency is introduced at each step:
Whether you’re using a single camera or a sophisticated video mixing system, taking a live image and turning it into digital signals takes some time. At minimum, it takes the duration of a single captured video frame (1/30th of a second at a 30fps frame rate).
More advanced systems such as video mixers will introduce additional latency for decoding, processing, re-encoding, and re-transmitting. Your video capture and processing requirements will determine this value.
Minimum: about 33 milliseconds
Maximum: hundreds of milliseconds
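The 33-millisecond minimum above is simply the duration of one frame at 30fps. As a minimal sketch of that arithmetic (the function name `frame_duration_ms` is ours, for illustration only), here is how the floor changes with frame rate:

```python
def frame_duration_ms(fps: float) -> float:
    """Return the duration of a single video frame in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Capture latency can never be lower than one frame duration.
for fps in (24, 30, 60):
    print(f"{fps} fps -> {frame_duration_ms(fps):.1f} ms per frame")
# 24 fps -> 41.7 ms per frame
# 30 fps -> 33.3 ms per frame
# 60 fps -> 16.7 ms per frame
```

A higher frame rate lowers this floor, but any mixing or processing equipment in the chain still adds its own delay on top.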
When encoding in software (on a PC or Mac) or using a hardware encoder (BoxCaster, Teradek, etc.), it takes time to convert the "raw" image signal into a compressed format suitable for transmission across the Internet. This latency can range from extremely low (thousandths of a second) to values closer to the duration of a video frame. Changing encoding parameters can lower this value at the expense of encoded video quality.
Minimum: about 1 millisecond
Maximum: about 40-50 milliseconds
The encoded video takes time to transmit over the Internet to a VDS. This latency is affected by the encoded media bitrate (lower bitrate usually means lower latency), the latency and bandwidth of the internet connection, and the proximity (over the Internet) to the VDS.
Minimum: about 5-10 milliseconds
Maximum: hundreds of milliseconds
Since the internet is a massively connected series of digital communication routes, the encoded video data may take one of many different routes to the VDS, and this route may change over time. Because these routes take different amounts of time to traverse (and the data may be queued anywhere along the route), it may arrive at the VDS out of order. A special software component called a jitter buffer re-orders the arriving data so that it can be properly decoded.
When configuring the jitter buffer, one must choose a maximum time boundary inside of which data can be reordered. This time boundary determines the latency of the jitter buffer. Lowering the latency increases the risk of losing “late” data, while choosing a higher latency ensures that more “late” data is recovered.
Minimum: typically no less than 100 milliseconds
Maximum: several seconds
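To make the tradeoff concrete, here is a deliberately simplified jitter buffer sketch. It is an illustration under our own assumptions, not how any particular VDS implements it: the class `JitterBuffer`, the helper `simulate`, and the sample arrival times are all hypothetical. In-order packets pass straight through, and when a packet is missing the buffer waits up to `max_delay_ms` for it before giving up. Real jitter buffers often also delay all playout by a fixed amount, which this sketch omits.

```python
import heapq
from typing import List, Tuple


class JitterBuffer:
    """Re-orders packets by sequence number within a bounded waiting window."""

    def __init__(self, max_delay_ms: int, first_seq: int = 0) -> None:
        self.max_delay_ms = max_delay_ms
        self.next_seq = first_seq          # next sequence number we expect to release
        self.dropped_late = 0              # packets that arrived after we gave up on them
        self._heap: List[Tuple[int, int]] = []  # (seq, arrival_ms), lowest seq first

    def push(self, seq: int, arrival_ms: int) -> None:
        if seq < self.next_seq:
            # We already moved past this packet: it is "late" and is lost.
            self.dropped_late += 1
            return
        heapq.heappush(self._heap, (seq, arrival_ms))

    def pop_ready(self, now_ms: int) -> List[int]:
        """Release packets that are in order, or that have waited out the window."""
        released = []
        while self._heap:
            seq, arrival_ms = self._heap[0]
            if seq < self.next_seq:        # duplicate of something already released
                heapq.heappop(self._heap)
                continue
            in_order = seq == self.next_seq
            waited_out = now_ms - arrival_ms >= self.max_delay_ms
            if not (in_order or waited_out):
                break                      # keep waiting for the missing packet
            heapq.heappop(self._heap)
            released.append(seq)
            self.next_seq = seq + 1
        return released


def simulate(arrivals: List[Tuple[int, int]], max_delay_ms: int) -> Tuple[List[int], int]:
    """Feed (seq, arrival_ms) pairs through a buffer; return (released, dropped_late)."""
    buf = JitterBuffer(max_delay_ms)
    released: List[int] = []
    for seq, now_ms in arrivals:
        released.extend(buf.pop_ready(now_ms))  # expire waits that ended before this arrival
        buf.push(seq, now_ms)
        released.extend(buf.pop_ready(now_ms))
    return released, buf.dropped_late


# Packet 1 arrives slightly out of order; packet 3 arrives very late.
ARRIVALS = [(0, 0), (2, 10), (1, 40), (4, 60), (3, 200)]
print(simulate(ARRIVALS, max_delay_ms=100))  # ([0, 1, 2, 4], 1)
print(simulate(ARRIVALS, max_delay_ms=200))  # ([0, 1, 2, 3, 4], 0)
```

With a 100-millisecond window, packet 3 is given up on and lost. With a 200-millisecond window it is recovered, at the cost of holding packet 4 back for longer.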
Your viewers will be watching from many kinds of devices (PCs, Macs, tablets, phones, TVs, and set-top boxes) over many types of networks (LAN/Wi-Fi, 4G LTE, 3G, etc.). To provide a quality viewing experience across a range of devices, a good streaming provider should offer adaptive bitrate (ABR) streaming.
There are two general ways to accomplish this: either the encoder streams multiple quality levels to the VDS (which are directly relayed to viewers), or the encoder sends a single high-quality stream to the VDS, which then transcodes and transrates it to multiple levels. Typically, the transcoding and transrating take about as long as a "segment" of encoded video (more about segments later), but they can be faster at smaller resolutions and lower bitrates.
Minimum: about 1 second
Maximum: about 10 seconds
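Once multiple quality levels exist, each viewer's player picks the one that fits its connection. The sketch below shows one simple way such a choice could be made. The rendition names, bitrates, and the 80% headroom factor are illustrative assumptions of ours, not values used by any specific provider or player.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rendition:
    name: str
    bitrate_kbps: int


# Hypothetical quality levels produced by transcoding and transrating.
LADDER = (
    Rendition("1080p", 5000),
    Rendition("720p", 2800),
    Rendition("480p", 1400),
    Rendition("240p", 400),
)


def pick_rendition(
    ladder: Sequence[Rendition], bandwidth_kbps: float, headroom: float = 0.8
) -> Rendition:
    """Pick the highest-bitrate rendition that fits within a share of the bandwidth."""
    budget = bandwidth_kbps * headroom
    affordable = [r for r in ladder if r.bitrate_kbps <= budget]
    if affordable:
        return max(affordable, key=lambda r: r.bitrate_kbps)
    # Nothing fits: fall back to the lowest quality rather than stalling.
    return min(ladder, key=lambda r: r.bitrate_kbps)


for bandwidth in (10000, 4000, 2000, 300):
    choice = pick_rendition(LADDER, bandwidth)
    print(f"{bandwidth} kbps available -> {choice.name}")
```

A viewer on a fast wired connection gets the top level, while a viewer on a weak mobile connection drops to the lowest one instead of buffering.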
There are two categories of protocols for viewing live video content: non-HTTP-based and HTTP-based. The two differ on their latency and their scalability. Understanding these differences is integral to choosing a streaming solution.
Non-HTTP-based protocols (such as RTSP and RTMP) use a combination of TCP and UDP communications to send media to viewers. They can potentially be very low-latency (as low as the network latency from the VDS to the viewer); however, their support for adaptive streaming is spotty at best. Furthermore, scaling these protocols to large numbers of viewers becomes very difficult and expensive.
HTTP-based protocols (such as HLS, HDS, MSS, and MPEG-DASH) are designed to take advantage of standard web servers and content distribution networks, which scale to many (thousands to millions of) simultaneous users. They also have built-in support for adaptive playback and broader native support on mobile devices.
The way these HTTP-based protocols work is by breaking up the continuous media stream into "segments" that are typically 2-10 seconds long. These segments can then be served to viewers by a standard web server or content distribution network.
HTTP-based protocols are generally better suited to most live streaming scenarios due to better feature support and scalability. However, the disadvantage of these protocols is that the latency is at least as long as the segment length, and can be as bad as 3-4 times the segment length (for example, iOS devices buffer 3-4 segments before even beginning to play the video).
Minimum (for non-HTTP-based protocols): about 5-10 milliseconds
Minimum (for HTTP-based protocols): about 2 seconds
Maximum (for HTTP-based protocols): about 30-40 seconds
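The HTTP-based figures above follow directly from the segment length and the number of segments a player buffers before it starts playing. Here is a rough sketch of that relationship. The function name is ours, and the model ignores network and server time.

```python
def segment_latency_s(segment_s: float, buffered_segments: int = 1) -> float:
    """Latency added by segmented delivery: segment length times segments buffered."""
    if segment_s <= 0 or buffered_segments < 1:
        raise ValueError("segment_s must be positive and buffered_segments at least 1")
    return segment_s * buffered_segments


# Segment lengths of 2-10 seconds, with players buffering 1, 3, or 4 segments.
for segment_s in (2, 6, 10):
    row = ", ".join(
        f"{n} buffered: {segment_latency_s(segment_s, n):.0f} s" for n in (1, 3, 4)
    )
    print(f"{segment_s} s segments -> {row}")
```

A 2-second segment played after a single segment gives the 2-second minimum, and 10-second segments with 3-4 segments buffered give the 30-40 second maximum.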
Whether viewing on a phone, a computer, or a TV, it takes time to decompress the media data and render it on the screen. In the best case, this can be as low as a single frame duration (1/30th of a second at 30fps), but typical values are 2-5 times the duration of a video frame. This latency is determined by the capabilities of the viewing device.
Minimum: about 33 milliseconds
Maximum: hundreds of milliseconds
A streaming solution that uses non-HTTP-based protocols can achieve a lower latency. Per our estimates above, latency will likely be in the range of about 1.2–17 seconds; realistically, it will typically be about 5–10 seconds. However, this solution will not scale well beyond about 50–100 simultaneous viewers.
A streaming solution that uses HTTP-based adaptive bitrate mechanisms will have a higher latency range: about 3.2–56 seconds. Realistically, it will typically be in the 15–45 second range. Since this approach uses HTTP-based mechanisms that can leverage off-the-shelf CDNs, it can theoretically support a very large number of simultaneous viewers without difficulty.
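The low ends of these ranges can be reproduced by adding up the per-step minimums from earlier in this post. The sketch below does exactly that, taking the lower figure wherever a minimum was given as a range. The stage names are ours. We do not total the maximums here, because several of them were given only as "hundreds of milliseconds" or "several seconds".

```python
# Minimum latency of each step, in milliseconds, from the estimates in this post.
COMMON_STAGES_MS = {
    "capture": 33,
    "encode": 1,
    "transmit_to_vds": 5,
    "jitter_buffer": 100,
    "transcode_and_transrate": 1000,
    "decode_and_display": 33,
}

# The delivery step depends on the protocol family.
DELIVERY_MIN_MS = {"non_http": 5, "http": 2000}


def minimum_latency_ms(protocol: str) -> int:
    """Sum the minimum latency of every step for the given protocol family."""
    return sum(COMMON_STAGES_MS.values()) + DELIVERY_MIN_MS[protocol]


for protocol in DELIVERY_MIN_MS:
    total = minimum_latency_ms(protocol)
    print(f"{protocol}: {total} ms (about {total / 1000:.1f} s)")
# non_http: 1177 ms (about 1.2 s)
# http: 3172 ms (about 3.2 s)
```

Notice that transcoding and segmented delivery dominate both totals, which is why capture and encoder tuning alone rarely makes a visible difference.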
Some attributes of your total latency may be within your control. Your encoder settings, the jitter buffer, the transcoding and transrating profiles, and segment duration may be configurable. Keep in mind, however, that while a lower latency may sound desirable, it’s important to test these settings with great caution, as each choice may bring about other negative consequences.
At BoxCast, we take great pains to automate as many of these choices as possible to maximize the stream quality and ensure a delightful viewer experience.
As we just established, when it comes to broadcasting something live, there's never been a way to see what your broadcast will look like before you share it publicly with your viewers.
We’re changing that with BoxCast Preview.
Preview is a BoxCast feature that lets you see exactly what will be broadcast to the world. But instead of the normal 30-60 second delay it takes to prepare your video to be streamed, Preview shows you what your broadcast will look like to your viewers with only a few seconds of latency.