It happens about once a week: a new customer starts their very first stream with a camera pointed at themselves. While watching it on their computer or tablet, they’re surprised to discover that they’re watching themselves from about 30 seconds ago, and they call us to ask why their stream is so delayed.

Many streaming newcomers are used to tools like Skype or FaceTime that allow them to collaborate with others in real-time and feel a lot like talking on the phone or in person. So why is streaming different?

In this post, we’ll answer this question and provide a deeper understanding of live video streaming and how it works.

Streaming vs. Conferencing

First and foremost, it’s important to distinguish web “streaming” from “conferencing” tools.

Tools like BoxCast, Livestream, Ustream, and YouTube Live fall primarily into the former camp, while tools like FaceTime, Skype, and Zoom fall into the latter.

The primary difference in the design of these tools is whether the content is primarily meant to be a “broadcast” — a small number of presenters to a potentially large number of viewers in a one-way fashion — or a two-way collaboration among a limited number of participants.

Although this distinction may seem trivial, it becomes very important when the number of participants or viewers scales to a large number.

Keeping the delay between participants low enough for collaboration requires tightly-coupled computing services – and tightly-coupled services do not scale to large numbers of participants.

For the remainder of this post, we’ll focus on streaming services, which are meant to be a means of broadcasting content to an arbitrarily large, globally distributed audience.

A Few Terms You Should Know

Below are a few streaming terms that you might not be familiar with. We’ve defined them so you can reference them as you go on:

  • Latency: the more accurate term for “delay”; the amount of time between something that happens in the “real world” and the display of that event on the viewer’s screen.
  • Video Distribution Service (VDS): though a VDS can take many forms, it is essentially responsible for taking one or more incoming streams of video and audio (from a broadcaster) and presenting them to viewers. This includes what is commonly referred to as a Content Delivery Network.
  • Content Delivery Network (CDN): a means of efficiently distributing content around the globe.
  • Transcoding: the process of decoding an incoming media stream, changing one or more of its parameters (e.g. codec, video size, sampling rate, or encoder capabilities), and re-encoding it with the new parameter settings.
  • Transrating: a similar process to transcoding, whereby the media stream’s compressed bitrate is changed, typically to a lower value.
  • Adaptive Bitrate Streaming (ABR): ensures that viewers on many kinds of devices with different capabilities and varying internet access can smoothly play a media stream.

Does Latency Always Matter?

If you assume that your viewers are not sitting in the seats at your live event, then latency may actually not be that important. Whether two seconds or two minutes, if a viewer is not present in person, then they’ll be blissfully unaware that there is any latency at all.

However, sometimes latency is an issue.

For example, live attendees may be tweeting updates, or you may be providing live score and stat updates for a sporting event. If your latency is too long, viewers may read about something before they see and hear it happen, which is not ideal. So, we should try to keep the latency as low as possible.

If you are choosing a streaming technology and latency is a potential concern for you, the main decision you must make is a tradeoff between scalability and low latency.

If low latency is more important, you should choose a technology that provides it, at the cost of potentially inhibiting future viewership growth.

If broad viewership is more important, you should choose a technology that supports broad scalability at the expense of higher latency. We’ll offer a technical breakdown below.

What Causes Latency?

Let’s look at how a typical live streaming system works and examine how latency is introduced at each step:


Image Capture

Whether you’re using a single camera or a sophisticated video mixing system, taking a live image and turning it into digital signals takes some time. It will take at least the duration of a single captured video frame (1/30th of a second, or about 33 milliseconds, at a 30fps frame rate).

More advanced systems such as video mixers will introduce additional latency for decoding, processing, re-encoding, and re-transmitting. Your video capture and processing requirements will determine this value.

Minimum: about 33 milliseconds

Maximum: hundreds of milliseconds


Encoding

When encoding in software (on a PC or Mac) or using a hardware encoder (BoxCaster, Teradek, etc.), it takes time to convert the "raw" image signal into a compressed format suitable for transmission across the Internet. This latency can range from extremely low (thousandths of a second) to values closer to the duration of a video frame. Changing encoding parameters can lower this value at the expense of encoded video quality.

Minimum: about 1 millisecond

Maximum: about 40-50 milliseconds



Transmission to VDS

The encoded video takes time to transmit over the Internet to a VDS. This latency is affected by the encoded media bitrate (lower bitrate usually means lower latency), the latency and bandwidth of the internet connection, and the proximity (over the Internet) to the VDS.

Minimum: about 5-10 milliseconds

Maximum: hundreds of milliseconds

Jitter Buffer

Since the internet is a massively connected series of digital communication routes, the encoded video data may take one of many different routes to the VDS, and this route may change over time. Because these routes take different amounts of time to traverse (and the data may be queued anywhere along the route), it may arrive at the VDS out of order. A special software component called a jitter buffer re-orders the arriving data so that it can be properly decoded.

When configuring the jitter buffer, one must choose a maximum time boundary inside of which data can be reordered. This time boundary determines the latency of the jitter buffer. As the latency is lowered, the risk of losing “late” data increases, while choosing a higher latency allows more "late" data to be recovered.

Minimum: typically no less than 100 milliseconds

Maximum: several seconds
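To make the reordering idea concrete, here is a minimal sketch of a jitter buffer in Python. It is only an illustration, not how any particular VDS implements it: the class name, the sequence-number scheme, and the millisecond timestamps are all assumptions made for this example, and it assumes each packet carries a unique sequence number. Packets that arrive in order are released immediately, and a gap is skipped only once the packet waiting behind it has been held for the configured time boundary.

```python
import heapq


class JitterBuffer:
    """Illustrative jitter buffer: reorders packets by sequence number.

    max_delay_ms is the time boundary described above: the longest a packet
    is held while waiting for an earlier packet that has not arrived yet.
    """

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self._heap = []        # entries are (sequence, arrival_ms, payload)
        self._next_seq = 0     # next sequence number we expect to release
        self.dropped = 0       # packets that arrived too late to be used

    def push(self, seq, payload, now_ms):
        """Store an arriving packet, or count it as lost if it is too late."""
        if seq < self._next_seq:
            # We already gave up waiting for this packet and moved on.
            self.dropped += 1
            return
        heapq.heappush(self._heap, (seq, now_ms, payload))

    def pop_ready(self, now_ms):
        """Return payloads that can be released, in sequence order."""
        ready = []
        while self._heap:
            seq, arrival_ms, payload = self._heap[0]
            in_order = seq == self._next_seq
            waited_out = now_ms - arrival_ms >= self.max_delay_ms
            if not (in_order or waited_out):
                break  # still waiting for an earlier packet
            heapq.heappop(self._heap)
            ready.append(payload)
            self._next_seq = seq + 1  # skips any gap we stopped waiting for
        return ready


# Packets 0, 2, 1 arrive out of order but are released in order.
buf = JitterBuffer(max_delay_ms=100)
buf.push(0, "a", now_ms=0)
buf.push(2, "c", now_ms=10)
buf.push(1, "b", now_ms=20)
in_order = buf.pop_ready(now_ms=20)   # ["a", "b", "c"]
```

With a smaller max_delay_ms the buffer gives up on missing packets sooner, which lowers latency but increases the count in dropped. That is the tradeoff described above.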

Transcoding and Transrating

Your viewers will be watching from many kinds of devices (PCs, Macs, tablets, phones, TVs, and set-top boxes) over many types of networks (LAN/wifi, 4G LTE, 3G, etc.). In order to provide a quality viewing experience across a range of devices, a good streaming provider should provide ABR.

There are two general ways to accomplish this: either the encoder streams multiple quality levels to the VDS (which are directly relayed to viewers), or the encoder sends a single high-quality stream to the VDS, which then transcodes and transrates it to multiple levels. Typically, the transcoding and transrating takes about as long as a "segment" of encoded video (more about segments later), but it can be faster at smaller resolutions and lower bitrates.

Minimum: about 1 second

Maximum: about 10 seconds
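Once several quality levels exist, a player has to choose among them. The sketch below shows one simple way that choice could be made. The rendition labels, the bitrates, and the 80% headroom factor are hypothetical values picked for illustration, not BoxCast's actual profiles or any player's real algorithm.

```python
# Hypothetical rendition ladder: (label, video bitrate in kilobits per second).
RENDITIONS = [("1080p", 5000), ("720p", 2800), ("480p", 1200), ("240p", 400)]


def pick_rendition(measured_kbps, headroom=0.8):
    """Pick the highest-bitrate rendition that fits the viewer's bandwidth.

    Only a fraction (headroom) of the measured bandwidth is budgeted, so a
    small dip in throughput does not immediately stall playback.
    """
    budget = measured_kbps * headroom
    for label, kbps in sorted(RENDITIONS, key=lambda r: r[1], reverse=True):
        if kbps <= budget:
            return label
    # Nothing fits: fall back to the lowest rendition rather than not playing.
    return min(RENDITIONS, key=lambda r: r[1])[0]


for bandwidth in (10000, 4000, 2000, 300):
    print(bandwidth, "kbps ->", pick_rendition(bandwidth))
```

Real players re-evaluate this choice continuously as network conditions change. The sketch only shows a single decision.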

Transmission to Viewers

There are two categories of protocols for viewing live video content: non-HTTP-based and HTTP-based. The two differ on their latency and their scalability. Understanding these differences is integral to choosing a streaming solution.

Non-HTTP-based protocols (such as RTSP and RTMP) use a combination of TCP and UDP communications to send media to viewers. They can potentially be very low-latency (as low as the network latency from the VDS to the viewer); however, their support for adaptive streaming is spotty at best. Furthermore, scaling these protocols to large numbers of viewers becomes very difficult and expensive.

HTTP-based protocols (such as HLS, HDS, MSS, and MPEG-DASH) are designed to take advantage of standard web servers and content delivery networks, which scale to many (thousands to millions of) simultaneous users. They also have built-in support for adaptive playback, and have broader native support on mobile devices.


These HTTP-based protocols work by breaking up the continuous media stream into "segments" that are typically 2-10 seconds long. These segments can then be served to viewers by a standard web server or content delivery network.

HTTP-based protocols are generally better suited to most live streaming scenarios due to better feature support and scalability. However, the disadvantage of these protocols is that the latency is at least as long as the segment length, and can be as bad as 3-4 times the segment length (for example, iOS devices buffer 3-4 segments before even beginning to play the video).

Minimum (for non-HTTP-based protocols): about 5-10 milliseconds

Minimum (for HTTP-based protocols): about 2 seconds

Maximum (for HTTP-based protocols): about 30-40 seconds
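The HTTP-based figures above follow directly from the segment length. As a rough sketch (the function name and the default of 4 buffered segments are our own assumptions for illustration), the bounds can be computed like this:

```python
def segment_latency_bounds(segment_seconds, max_buffered_segments=4):
    """Return (best_case, worst_case) delivery latency in seconds.

    Best case: a segment must be fully written before it can be served,
    so latency is at least one segment long.
    Worst case: the player buffers several segments before starting playback.
    """
    best_case = segment_seconds
    worst_case = segment_seconds * max_buffered_segments
    return best_case, worst_case


for seg in (2, 6, 10):
    low, high = segment_latency_bounds(seg)
    print(f"{seg}s segments: {low}-{high}s")
```

With 2-second segments the best case is 2 seconds, and with 10-second segments a player that buffers 3 to 4 of them ends up 30 to 40 seconds behind. Both match the minimum and maximum listed above.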

Decoding and Display

Whether viewing on a phone, a computer, or a TV, it takes time to decompress the media data and render it on the screen. In the best case, this can be as low as a single frame duration (1/30th of a second at 30fps), but typical values are 2-5 times the duration of a video frame. This latency is determined by the capabilities of the viewing device.

Minimum: about 33 milliseconds

Maximum: hundreds of milliseconds

Putting It Together

A streaming solution that uses non-HTTP-based protocols can achieve a lower latency. Per our estimates above, latency will likely be in the range of about 1.2–17 seconds; realistically, it will typically be about 5–10 seconds. However, this solution will not scale well beyond about 50–100 simultaneous viewers.

A streaming solution that uses HTTP-based adaptive bitrate mechanisms will have a higher latency range: about 3.2–56 seconds. Realistically, it will typically be in the 15–45 second range. Since this approach uses HTTP-based mechanisms that can leverage off-the-shelf CDNs, it can theoretically support a very large number of simultaneous viewers without difficulty.
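The low ends of both ranges can be reproduced by adding up the per-stage minimums listed earlier in this post. The short sketch below does that arithmetic. The stage labels and variable names are our own, and where the post gives a range (such as 5-10 milliseconds) the low end is used.

```python
# Minimum latency per stage in milliseconds, from the estimates in this post.
COMMON_STAGES_MS = {
    "image capture": 33,
    "encoding": 1,
    "transmission to VDS": 5,
    "jitter buffer": 100,
    "transcoding and transrating": 1000,
    "decoding and display": 33,
}

# Minimum latency of the final delivery step, which depends on the protocol.
VIEWER_DELIVERY_MS = {"non-HTTP": 5, "HTTP": 2000}


def minimum_latency_seconds(protocol):
    """Sum the stage minimums for the given delivery protocol, in seconds."""
    total_ms = sum(COMMON_STAGES_MS.values()) + VIEWER_DELIVERY_MS[protocol]
    return total_ms / 1000


print(minimum_latency_seconds("non-HTTP"))  # 1.177, about 1.2 seconds
print(minimum_latency_seconds("HTTP"))      # 3.172, about 3.2 seconds
```

The maximums are harder to pin down because several stages are only bounded as "hundreds of milliseconds" or "several seconds", so this sketch does not attempt to reproduce them.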

What are my next steps?

Some attributes of your total latency may be within your control. Your encoder settings, the jitter buffer, the transcoding and transrating profiles, and segment duration may be configurable. Keep in mind, however, that while a lower latency may sound desirable, it’s important to test these settings with great caution, as each choice may bring about other negative consequences.

At BoxCast, we take great pains to automate as many of these choices as possible to maximize the stream quality and ensure a delightful viewer experience.

Recommended Reading:

When it comes to broadcasting something live, there has never been a way to see what your broadcast will look like before you share it publicly with your viewers.

We’re changing that with BoxCast Preview.

Preview is a BoxCast feature that lets you see exactly what will be broadcast to the world. But instead of the normal 30-60 second delay it takes to prepare your video to be streamed, Preview shows you what your broadcast will look like to your viewers with only a few seconds of latency.

Read more in Why You'll Love BoxCast's Low-Latency Preview.

Published by Justin Hartman on December 28, 2015 in Streaming, Tech Updates
