So You Want to Be a Video Communications App Developer

2021.05.30

No? Maybe you do and just don’t realize it yet. This post will introduce the concepts and explain the essential technologies and platforms like Zoom’s new SDK.

This space is on the cusp of a boom that can be compared to the web in the late 1990s and mobile apps in the early 2010s. New platforms and APIs have become available that make it literally hundreds of times easier to build scalable video communications applications. Yet the opportunity at hand feels somewhat difficult to express, and one reason is the awkward nomenclature.

Compare with those two previous booms:

“Web?” Just three letters (and significantly less awkward than “information superhighway”—the first attempt at branding in this space!)

“Apps?” A snappy four letters (and also benefited from a billion-dollar ad campaign by Apple in 2010.)

But “video communications”…? So many letters that a lazier writer might not even bother counting them.

I’m using video communications as an umbrella term for many products and features with realtime social aspirations. These include:

Video chat, video conferencing
Group calls, audio rooms
Co-watching, watch parties, chat between stream viewers
Live customer engagement
Virtual events, webinars, town halls
Live streaming with guest participants
Remote presence, telework, telehealth

Many of the hottest products of 2020–21 belong here. You’ve heard of them: Zoom, Peloton, Discord, Clubhouse — all definitely video communications. (Clubhouse is a good example of the difficulty in naming this space. It’s audio-only… But from a platform perspective, that’s simply a special case of video communications.)

Zoom is an interesting case because it’s a generic video communications platform. This makes it both enormously popular and poorly optimized for any particular use case. Everybody is on Zoom: lawyers, kindergarten teachers, venture capitalists, grandparents, yoga clubs. Its user interface is the lowest common denominator by necessity. Many of its user segments would be better served by a vertical-specific app, were someone to develop those apps. (For an investor’s perspective on this opportunity, check out the highly interesting “Verticalization of Zoom” by JJ Oslund.)

Please indulge me with a little walk down memory lane back to a simpler time, circa 1994 — the brief period when people were already getting connected via computers but Netscape was still a glimmer in Marc Andreessen’s eye, AOL ruled Americans’ online behavior, and email was the leading Internet application.

Companies eager to promote their Internet presence would publish an email address, not a URL (because the web ecosystem didn’t exist yet). Email was the delivery platform for embryonic social applications like mailing lists, but also a paradigmatic model for corporate collaboration (e.g. Lotus Notes, Microsoft Outlook). Email thus was the lowest common denominator for text-based communications, just as Zoom is for video today.

The web enabled the “unbundling of email”. It progressed in steps that are clearly identifiable in retrospect:

Static content. Want to find out what’s on the lunch menu this week? It’s easier when the restaurant has a website. No more calling or emailing them! This was the web in 1995: great for distributing content that doesn’t update constantly and isn’t personalized to a particular user, but not very alive.
Forms. Planning on buying something? Want to ask a question? Instead of emailing them, it could save everyone time if they had a form on their website for structured requests.
Apps. From those contact forms, it’s a short conceptual leap to make the server answer requests automatically using a server-side application that generates a new web page. This was “web apps 1.0”, and it replaced email-based automatic content services which didn’t have the benefit of a structured UI.
Text chat. In 2001–2005, web apps became dynamic thanks to new browser APIs. Instead of rendering a whole new page for each server request, you could fetch data in small pieces and update parts of the web page as needed: Web 2.0! This unbridled network access also enabled realtime text chat, which then became a core part of business web design thanks to easy-to-use platforms like Intercom. This was the last piece of email to be unbundled: Intercom and others largely took over direct customer contact, providing tools better adjusted to each business vertical than simply having customer representatives reading and sending email.

Now, video in 2021 obviously isn’t email in 1994. But the common thread is the rising power of web front-ends backed by increasingly powerful application platforms. Web-based text chat delivered the last step of unbundling email for businesses, allowing them to interact directly with customers in a format that improves both user-facing interfaces and internal workflows.

It’s the same with video. Previously you’d have to take a Zoom call regardless of whether you’re talking to grandparents, selling a product, or giving a presentation that might determine the future of your company. It’s now becoming possible to add business-specific application structure to those communications. Smoother workflows, higher-quality social interactions—it will be hard to go back to the lowest common denominator after you’ve experienced custom-built real-time services on the web.

Zoom Video Communications Inc. certainly isn’t oblivious to this development. The company recently (March 2021) announced its Video SDK that lets you build applications on top of the same platform that powers Zoom meetings. However, despite its impressive consumer reach and user volume, Zoom is entering a market already populated by some experienced competitors.

This post will present Zoom’s API along with three of those competitors and their SDKs. As part of the review, I’ll build small web applications on each platform to get a better idea of their strengths and weaknesses. But first a brief overview of the technology that enables these platforms, so we’ll be able to understand their differences…

Technology underpinnings of live Internet video

Transmitting realtime video on the Internet isn’t trivial. It’s extremely bandwidth-intensive compared to other applications and also extremely sensitive to delivery timing fluctuations, which is a challenging combination on the Internet’s packet network.

Jon Dahl, CEO of video encoding platform Mux, has an excellent blog piece called “Why Video Is Awesome” where he writes:

“Honestly — it’s a minor miracle that video ever plays at all.”

Now take that miracle and multiply it by the number of simultaneous participants in a video call! It’s a major miracle that most video communications apps work as well as they do. (Despite all the complaining, the answer to those eternal questions “can you hear me?” and “can you see my screen?” is most often “yes!”)

An unfortunate feature of this technology space is that it’s extremely acronym-heavy, and most of the acronyms are random jumbles of the same handful of letters. We’ve got RTP, RTMP, TCP, RTCP, RTSP, RTMPS, SFU, STUN, WebRTC (no plain RTC though, to keep you on your toes!)… It’s a trip.

For this discussion you need to recognize just three: RTMP, RTP and WebRTC. The rest of the acronyms we might mention briefly, but we mostly get to handwave them away as “auxiliary stuff that you can learn about if you go deep enough down the video rabbit hole”.

RTMP

Honestly, it’s not even worth expanding these acronyms. They’re all named the same: “real-time-something-not-descriptive-nor-useful”. Just remember that this first one under discussion is The One With The “M”.

RTMP is the protocol you’d use to stream video from a computer to a service like Twitch, Facebook Live, or YouTube Live. This protocol is quite old: it was originally created at Macromedia (later acquired by Adobe) for Flash Player when it was the only game in town for web video. Yet it remains widely used because so many systems understand it. (Add an “S” at the end and you get RTMPS, the encrypted version of the protocol, just like HTTPS is the secure version of HTTP.)

RTMP today is mostly for one-way communications that are further broadcast on another network. Think of it like you’re a TV studio communicating with a broadcast tower. Any video mixing is done on your computer. You then use RTMP to send a single master signal to a media server (typically hosted at Twitch or Facebook or wherever), and their platform converts it into something that your viewers can consume. You don’t need to think about that part though, except to understand that it will introduce latency — a fancy word for delay. Depending on the video distribution platform and the device that your viewers are using to watch the stream, this latency can be anything between a few seconds and 30 seconds.

We now see that there are many definitions of “real-time” in this field! Just because the protocol’s name is real-time-something-or-other, it doesn’t necessarily mean you can easily have realtime interactions with your audience. Even many seconds of latency might not be a problem but you must plan your show’s interactions accordingly. Summa summarum: you use RTMP when your live video communication works like television. You send a signal, some big and expensive media server magic distributes it to potentially millions of viewers, and it’s necessarily delayed by several seconds or more.

RTP

What a difference one less “M” makes. We saw RTMP is used for single video streams at fairly high bandwidth and where latency isn’t a prime concern—like television.

RTP on the other hand was designed for latency-sensitive applications that send and receive multiple streams at fairly low bandwidth—more like telephone. And indeed, RTP is a foundational protocol in “Voice over IP” or VoIP, which is how most voice calls are transmitted nowadays. Bits being bits, RTP isn’t limited to voice, but can carry video just as well.

(RTP has a sister protocol called RTCP which is used for quality control and other metadata, and a related protocol called RTSP which you’re not likely to encounter today unless you’re dealing with embedded video equipment like surveillance/CCTV cameras. We don’t need to know anything more about that, fortunately.)

To support the latency requirements of telephony and video calls, RTP makes a different tradeoff than RTMP: it’s ok to lose some bits occasionally if that helps deliver most bits faster. Because of this property, RTP alone isn’t really sufficient. It’s a fine protocol for transmitting those video and audio bits that need to get to their destination ASAP… But something else is needed to cover an application’s remaining data needs; for example, the details of how participants connect to each other and possibly a central server.

WebRTC

Enter WebRTC. It’s not a single protocol, but more like a shared framework for video communications applications that work in web browsers. It offers APIs and mandates the underlying protocols (like RTP), as well as the codecs used to compress the actual audio/video data. Thanks to WebRTC, Firefox users can talk to Chrome users, and both can talk to servers that understand the relevant parts of WebRTC.

The first case, Firefox talking to Chrome, is an example of a peer-to-peer connection. It’s a hippie democracy. Why let The Man’s big server decide who you talk to? Connect to each other directly! (In practice you do end up needing servers to make an initial connection between browsers, as most people these days don’t have their own public IP address… The server technology that lets a browser discover the address at which peers can reach it is called STUN. There’s also TURN, a more heavyweight fallback that relays the data itself when a direct connection can’t be made.)

Peer-to-peer runs into trouble when your hippie teleconferencing commune grows to more than a handful of people. Every participant needs to receive everybody else’s video while also sending a copy of their own camera stream to each of those other participants. A regular home Internet connection typically doesn’t have a lot of upload bandwidth, so it may struggle with sending video even to a fairly small number of peers.

As the peer-to-peer utopia crumbles and you crawl back to The Man, he tells you that he’s got a solution called SFU. (No, not STFU, he genuinely wants to help you.) SFU stands for Selective Forwarding Unit. It is a server that is connected to all the participants in a WebRTC call and forwards streams from other participants. This way, your precious upload bandwidth is only spent talking to the SFU. The SFU can also save your download bandwidth by forwarding only the streams you currently need, or lower-bitrate versions of them, instead of everything at full quality.
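
To put rough numbers on the upload problem, here is a minimal Python sketch comparing a peer-to-peer mesh with an SFU. The function names and the 1.5 Mbps per-stream bitrate are illustrative assumptions, not measurements from any platform.

```python
STREAM_MBPS = 1.5  # assumed bitrate of one camera stream (hypothetical figure)


def mesh_upload_streams(participants: int) -> int:
    """In a full mesh, each participant sends one copy to every other peer."""
    return max(participants - 1, 0)


def sfu_upload_streams(participants: int) -> int:
    """With an SFU, each participant uploads a single copy to the server."""
    return 1


for n in (2, 4, 8):
    mesh = mesh_upload_streams(n) * STREAM_MBPS
    sfu = sfu_upload_streams(n) * STREAM_MBPS
    print(f"{n} participants: mesh upload {mesh} Mbps, SFU upload {sfu} Mbps")
```

With eight people on the call, the mesh asks every participant to upload seven copies of their camera stream, while the SFU setup keeps it at one.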

Since the SFU has access to all the participant streams, it can also do other interesting things like recording and broadcasting the call. It doesn’t have to treat participants equally either: some might be passive viewers who just receive a composited video of the active participants, like watching a live stream but for a private audience.

To extend the audience, the SFU can also become a bridge into the “television-like” RTMP world we discussed previously. So, we’ve got WebRTC-using participants on their browsers and mobile devices transmitting RTP; the SFU receives those RTP streams and renders them into a single video stream; this it sends using RTMP to another service (such as Twitch or Facebook Live) that will broadcast it to viewers. This is necessary for use cases such as large-audience online classes and talkshow-style audience participation in a live stream.

This concludes our tech dive. Hopefully it gave you a reasonable idea of the available technical building blocks. Remember that your application or solution probably doesn’t need to use everything. The technical features you need from the platform will be quite different if your app is e.g. just facilitating calls between two people, vs. streaming an online class that has complex participant roles. Because platforms bill by the minute, it’s important to understand enough of the technical choices that you don’t accidentally run up unsustainable bills or spend money on options that you don’t use. Next let’s look at the companies that promise to make this acronym soup easier to work with.

The platforms

Zoom’s brand new SDK (introduced March 2021) was mentioned earlier. Since Zoom is the thousand-pound gorilla of consumer mindshare, we’ll definitely want to take a look at its developer offering. It will be pitted against three already established video communications platforms, presented here in alphabetical order.

Agora

Founded in 2013, Agora is dual-headquartered in Santa Clara, California and Shanghai, China.

The company’s product is their “Real-Time Engagement Platform” whose feature set is pretty much exactly what we’ve discussed in this article. It also includes text chat and analytics. You can see Agora’s feature list on the right (screenshot from their website).

So, Agora is very much a “pure play” API company in this space — they don’t have a legacy enterprise business, for example.

Agora has been popular with startups in both China and Silicon Valley. Although Agora doesn’t advertise this relationship directly, the aforementioned Clubhouse is known to use Agora’s platform for their audio chat rooms.

Daily.co

Daily is the upstart among these companies, but it actually has five years of history already.

After participating in the Y Combinator start-up accelerator in 2016, Daily initially experimented with a hardware video conferencing device for enterprises, then pivoted into a pure API provider.

Y Combinator has a tradition of demanding relentless customer focus from their startups, and this pedigree shows in Daily’s emphasis on bringing developer simplicity to WebRTC.

Where the APIs in this space typically tend to be “enterprisey” with Java-style class hierarchies and object compositions and lots of technical choices along the way, Daily flips the script: you can start with a dead simple prebuilt UI that is a drop-in into any web app and offers full access to Daily’s platform, even on the free plan.

Twilio

Twilio brands itself a “cloud communications platform as a service”, or CPaaS for (not-very-)short.

Founded in 2008, Twilio’s original offering was an API for automating traditional phone calls and SMS text messages. Today Twilio is a large publicly listed corporation and has expanded into email services, chatbots, Internet of Things, customer data management, and many other niches of enterprise communications.

Thus, unlike the other companies in this comparison, what Twilio calls “programmable video” is really a minor part of their business. This may seem like a handicap, but this kind of situation can sometimes be advantageous to the customer if the upstart product benefits from long-term investment from its wealthy corporate parent. In the case of Twilio, the company certainly has a history of API innovation and well-executed acquisitions. We’ll explore their video API in more detail and see where its strengths lie.

Zoom

Ah, the company whose name practically became a common noun for “video meeting” during the Covid-19 pandemic.

Zoom’s secret sauce is in their proprietary client applications and protocols. Even the web-based Zoom client behaves more like a native application, as it loads a large bundle of compiled WebAssembly code. This underlying philosophy is different from the competitors who rely more on standard WebRTC and lightweight SDKs. Because of complexities in opening up this “secret sauce” to outsiders, Zoom took their time to offer a real video conferencing SDK. As of March 2021 it’s finally out.

It introduces the concept of “Zoom sessions”. Unlike familiar Zoom meetings, you can’t join sessions from the Zoom app or a Zoom link; instead sessions can only be accessed through the third party application which created them. (This is of course how the other platforms presented here work as well, but this kind of programmability is brand new to Zoom.)

The excluded

One important competitor left out here is Amazon’s Chime SDK. I’ve wanted to primarily focus on the needs of startups and individual developers, and the Chime SDK is geared towards serving enterprises that already have a substantial investment in the Amazon Web Services (AWS) ecosystem. I don’t feel I would do it justice in this context.

Another omission is the Vonage Video API, formerly known as OpenTok by TokBox. Although the product has a long history, it has seemed stagnant for many years. I had to make a choice on which APIs to explore in detail, and decided to focus on newer entrants.

Finally, there’s always the option of not using a platform and instead rolling your own. How hard can it be, you may think? Just install some open source packages on AWS and write some front-end code… Remember what Jon Dahl of Mux said? “It’s a minor miracle that video ever plays at all.”

You probably don’t want to be in the miracle business if you can avoid it. Maybe — maybe — if you just need one-to-one video calls, building it yourself using WebRTC is a workable option. But you shouldn’t consider doing your own mobile client development and SFUs and media server deployments and all that, unless you’re absolutely convinced it is a core part of your business.

Pricing comparison

Reason prevails, and we find ourselves determined to leverage a well-maintained commercial platform rather than cobbling something together from various semi-maintained open source widgets and risk being in the miracle business. That means we must pay, and big product dreams could translate into big bills (although at that point, it’s usually a positive problem to have—especially in the present startup funding environment.)

Some of the vendors in this space have an annoying propensity to never disclose any prices upfront if they can possibly avoid it. Instead, they really want you to schedule a call with a salesperson. If you’re managing a corporate budget, I’m sure it can be illuminating to have a little chat on How To Spend It — but I assume most of my readers aren’t in that position, and would prefer to do their own math on what they can afford.

I’ve chosen Agora, Daily, Twilio and Zoom because their pricing is reasonably easy to discover and understand. Let’s invert the order and start with Zoom.

Zoom’s products and pricing

To even test the Zoom Video SDK, you need to create a brand new Zoom user account and enter a valid credit card. The sign-up page (shown below) tries to get you to sign onto the $1,000 / year plan. When you sign up for pay-as-you-go, Zoom still sends you an initial $0 invoice. Everything about this flow is designed for corporations, not individuals.

There’s no billing cap on the pay-as-you-go plan, so your credit card could theoretically be hit with unlimited charges if you really manage to mess up your scaling plans. Fortunately 10,000 minutes / month are included for free, which should get you pretty far when doing development.

Beyond that, the price is $0.0035 per “meeting session minute”, which means the total amount of time spent on a call, summed over all participants. If you have five participants in a 30-minute video call session, it counts as 5 x 30 = 150 minutes and will cost you about half a dollar (more precisely 52.5 cents).
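
The arithmetic can be sketched in a few lines of Python. The function names are made up for illustration; the rate is the pay-as-you-go list price quoted in this post, and the free monthly allowance is deliberately ignored.

```python
from decimal import Decimal

ZOOM_PAYG_RATE = Decimal("0.0035")  # USD per session minute, as quoted in this post


def session_minutes(participants: int, call_minutes: int) -> int:
    """Every participant's connected time counts toward the total."""
    return participants * call_minutes


def session_cost(participants: int, call_minutes: int, rate: Decimal) -> Decimal:
    """List-price cost in USD, ignoring any free monthly allowance."""
    return session_minutes(participants, call_minutes) * rate


minutes = session_minutes(5, 30)
cost = session_cost(5, 30, ZOOM_PAYG_RATE)
print(f"{minutes} session minutes cost ${cost}")
```

Decimal is used instead of float so that tiny per-minute rates don't pick up binary rounding noise when multiplied.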

If you’re serious about your Zoom SDK plans, at high enough volume you can save money with the other plan that costs $1K/year upfront but gives you 30K free minutes / month and a slightly cheaper $0.003/min price after that.
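
Whether the upfront plan actually pays off depends on your volume. Here is a rough Python sketch of the comparison, using only the list prices quoted in this post. It assumes the same usage every month and that unused free minutes don't roll over; both are simplifications of mine, and the function names are hypothetical.

```python
from decimal import Decimal


def yearly_cost(monthly_minutes: int, yearly_fee: Decimal,
                free_per_month: int, rate: Decimal) -> Decimal:
    """Yearly cost in USD, assuming identical usage in all twelve months."""
    billable = max(monthly_minutes - free_per_month, 0)
    return yearly_fee + 12 * billable * rate


def pay_as_you_go(monthly_minutes: int) -> Decimal:
    # No upfront fee, 10K free minutes / month, then $0.0035 per minute
    return yearly_cost(monthly_minutes, Decimal("0"), 10_000, Decimal("0.0035"))


def prepaid(monthly_minutes: int) -> Decimal:
    # $1,000 / year upfront, 30K free minutes / month, then $0.003 per minute
    return yearly_cost(monthly_minutes, Decimal("1000"), 30_000, Decimal("0.003"))


for usage in (10_000, 30_000, 50_000, 100_000):
    print(f"{usage} min/month: pay-as-you-go ${pay_as_you_go(usage)}, "
          f"prepaid ${prepaid(usage)}")
```

Under these assumptions the two plans cross over at around 57,000 session minutes per month. Below that, pay-as-you-go is the cheaper choice.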

All in all, Zoom’s pricing is commendably simple to understand, although the setup isn’t very friendly to non-corporates.

Twilio’s products and pricing

Twilio’s enterprise roots are showing in their Video API pricing. The website makes you choose upfront between three products, and at every corner lurks a button asking if you’d like to talk to a Twilio salesperson instead. Let’s try to figure it out without the hard-sell pitch.

The three products are in fact packaging various WebRTC setups. These will hopefully sound familiar from the “tech dive” section earlier:

Twilio Video WebRTC Go. This is a free sampler of their SDK, but it really won’t get you very far because it only supports 1:1 WebRTC calls (i.e. peer-to-peer between two persons). If you just want to see what the SDK is like, this is the way to go, but it won’t give you much insight into the full product.
Twilio Video P2P. This product’s tagline is “Build peer-to-peer video applications with unlimited TURN relay” — in other words, it’s the next step up in WebRTC capabilities compared to the free tier. We get multiple participants, and Twilio’s server will help relaying the streams as needed (that’s what TURN does)… But fundamentally this is the hippie commune model of everyone connecting to everyone, which can be taxing for upload bandwidth. The price is $0.0015 per participant minute. So, a 30-minute call with five participants would come to 22.5 cents.
Twilio Video Groups. Remember SFU from the WebRTC discussion earlier? Well, this product is basically an SFU server. Up to 50 participants can join a session. The price is $0.004 / participant minute. A 30-minute call with the maximum of fifty participants would thus cost $6. If you want a recording of the call in a single video stream, it costs a bit extra (60 cents per hour). Or, if you want to record all the participant streams as separate video files, it doubles the price.
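
The two paid tiers can be compared with a small Python sketch. The rates and the 50-participant cap are the figures quoted in the list above; the dictionary keys and function name are hypothetical, and recording fees are left out.

```python
from decimal import Decimal

# USD per participant minute, list prices as quoted in this post
TWILIO_RATES = {
    "p2p": Decimal("0.0015"),
    "groups": Decimal("0.004"),
}
GROUPS_MAX_PARTICIPANTS = 50


def call_cost(product: str, participants: int, call_minutes: int) -> Decimal:
    """List-price cost of one call, excluding recording and other extras."""
    if product == "groups" and participants > GROUPS_MAX_PARTICIPANTS:
        raise ValueError("Groups rooms are capped at 50 participants")
    return participants * call_minutes * TWILIO_RATES[product]


print(call_cost("p2p", 5, 30))      # five people, 30 minutes, peer-to-peer
print(call_cost("groups", 50, 30))  # a full Groups room for 30 minutes
```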

We can see that Twilio is applying an enterprise-style pricing logic to WebRTC: the tiers are very distinct, and it’s up to you to carefully decide what capabilities you need and how much you’re willing to pay.

Signing up for the free tier is fairly easy, but you need to provide a credit card to be able to test the P2P and Groups products. This process goes through an “Upgrade Account” form that assumes you’re a corporation. There’s no mention of free minutes, so you’ll be billed for any time you spend with the P2P / Groups products.

Daily’s products and pricing

Daily.co is essentially the “anti-Twilio” in their pricing.

Where Twilio delineates their plans and pricing options rigidly by technical features and requires a credit card on file before they’ll let you test drive the P2P or Groups products, Daily has a free “no-questions-asked” plan that lets you try everything they offer. This makes life easy, especially if you’re early in your product planning and don’t know yet whether you might be using P2P or SFU or whatever acronym jazz.

You can sign up for Daily’s free plan without pretending to be a corporation or giving them your credit card, so there’s no possibility of surprise bills. The free plan offers a maximum of 2,000 participant minutes / month, computed with the same formula we saw with Zoom and Twilio. You can create rooms with up to 200 participants. (If you actually did that on the free plan, it would drain your 2,000 free minutes in just one ten-minute session! But the ability to test these large-scale rooms without risk of fees is certainly valuable.)

Perhaps taking a page from the Y Combinator playbook, Daily’s price plans are named in startup growth terms rather than by WebRTC features or other technical arcana. There’s a “Launch” tier that lets you move beyond the 2K minutes / month limit of the free plan. It costs $9 / month plus $0.004 per additional participant minute, the exact same price we saw with Twilio Groups.

Beyond Launch, there’s a “Scale” plan at $199 / month which gives you 10K minutes / month before the $0.004 price kicks in. This plan also includes options for cloud recording and HIPAA compliance. (That’s something you need for telehealth applications in the US. It seems that Twilio makes you buy an enterprise plan to get HIPAA and Daily doesn’t, so if this is your niche, it may be a point in Daily’s favor.)

Daily also offers a separate audio-only rate that’s a 75% discount from video. (The actual price is $0.00099 / participant minute, which is just a silly number to look at. Easier to think of it as a discount.) So, if you built your audio-only Clubhouse competitor on Daily, a room with 20 participants chatting away in a 60-minute session would cost you about $1.19 (that’s 1,200 participant minutes). This matches Agora, as we’ll see next.
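
To see where the “75% discount” and the room price come from, here is a quick Python check. It uses only the two per-minute rates quoted in this post; the names are mine.

```python
from decimal import Decimal

VIDEO_RATE = Decimal("0.004")    # USD per participant minute, video
AUDIO_RATE = Decimal("0.00099")  # USD per participant minute, audio only


def room_cost(participants: int, minutes: int, rate: Decimal) -> Decimal:
    """List-price cost of one room session at the given per-minute rate."""
    return participants * minutes * rate


audio = room_cost(20, 60, AUDIO_RATE)
video = room_cost(20, 60, VIDEO_RATE)
discount = 1 - AUDIO_RATE / VIDEO_RATE  # fraction saved versus video

print(f"audio ${audio}, video ${video}, discount {discount * 100}%")
```

The same room with cameras on would cost roughly four times as much, which is worth keeping in mind if video is optional in your product.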

Agora’s products and pricing

We already briefly saw Agora’s product list. Let’s take another look…

Clearly Agora has more options than the other platforms we’ve looked at. Voice call and video call are familiar WebRTC territory, and essentially similar to what we’ve seen with the others.

“Interactive Live Streaming” in Premium and Standard flavors is an interesting pair of products because it reaches into the RTMP territory that was discussed in the tech dive. Agora’s docs for these products do a good job of explaining the difference between Premium and Standard (latency, mostly) as well as presenting potential use cases. (I love that they specifically mention “online games such as Murder Mystery and Werewolf Killing”! There is a clear shortage of werewolf-slaughter video communications products.)

Cloud recording is supported by everyone, but Agora also offers on-premise recording, which seems like more of an enterprise option.

Overall, Agora has the functionality to build almost anything you might want.

Understanding their limits and pricing is generally trickier than with others we’ve seen. For example, how many participants can you include in a video chat room? The answer is not a straightforward number like “50” or “200”, but instead this:

Up to one million users in a channel. Agora recommends limiting the number of users sending streams concurrently to 17 at most.

Followed by these footnotes:

If the number of users sending streams concurrently exceeds the recommended value, each user in the channel can only see or hear a random group of users who are sending streams. For example, if 18 hosts are sending streams concurrently in a live streaming channel, each user cannot see or hear a random one of the 18 hosts.

Agora does not provide APIs for limiting the number of users sending streams concurrently, and you can implement the limitation in your application layer.

One million participants sounds incredible! But the small print explains that if you go beyond 17 senders, Agora’s platform will simply drop streams at random. Control for this limit needs to be in your application layer — in other words, you’re on the hook for enforcing this somehow in your code.
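
What might that enforcement look like? Below is a minimal Python sketch of an application-layer gate that your own backend could consult before letting a client start publishing. Everything here is hypothetical: the class and method names are mine, and the code doesn't call any Agora API. It only shows the bookkeeping.

```python
MAX_SENDERS = 17  # the recommended ceiling from the Agora docs quoted above


class SenderSlots:
    """Tracks which users may publish; everyone else stays receive-only."""

    def __init__(self, limit: int = MAX_SENDERS) -> None:
        self.limit = limit
        self.senders = set()

    def request_to_send(self, user_id: str) -> bool:
        """Grant a sending slot if one is free; return False otherwise."""
        if user_id in self.senders:
            return True  # already holds a slot
        if len(self.senders) >= self.limit:
            return False  # channel is full, keep this user as a viewer
        self.senders.add(user_id)
        return True

    def stop_sending(self, user_id: str) -> None:
        """Free the slot when a user stops publishing or leaves."""
        self.senders.discard(user_id)


slots = SenderSlots()
results = [slots.request_to_send(f"user-{i}") for i in range(18)]
print(results.count(True), "granted,", results.count(False), "refused")
```

In a real app this state would need to live somewhere shared (your server or a database) rather than in a single process, and the client would only start sending video after getting a yes.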

This seems to be an example of Agora’s philosophy. You get access to the raw power of their platform and product range, but you may need to do some more work to harness it properly.

This philosophy also manifests in Agora’s pricing. The baseline is simple enough: you get 10,000 free “service minutes” / month, and the definition of service minute is the same as we’ve seen before (time spent connected to the service by each active participant).

For audio only, Agora offers the exact same pricing as Daily: $0.00099 / service minute. If Clubhouse were paying list price on Agora, a room with 20 participants chatting away in a 60-minute session would result in about $1.19 of charges. (Seems very likely that a customer of Clubhouse’s size and stature is getting a substantial volume discount, but of course we can only look at list prices here.)

For video, the picture is more complicated. The pricing page explains: “Agora adds up the resolution of all the video streams a user subscribes to at the same time to determine the user’s aggregate video resolution. […] For example, if a user subscribes to two 960 × 720 video streams at the same time, the aggregate resolution is 960 × 720 + 960 × 720 = 1,382,400. The user is charged for Full HD video service.”

There are four service tiers: HD, Full HD, 2K and 2K+. The lowest “HD” tier costs essentially the same as Daily: $0.0039 / participant minute. The price then rises steeply, and a 2K+ stream costs nearly ten times as much! What this means is that your app must be very conscious of video resolutions transmitted to participants. For example, 13 streams at 640x480 each would be enough data to put your app in the 2K+ bucket. If you sent those same 13 streams downscaled to 400x300 instead, it would fall under “Full HD” and would cost 75% less.
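
Here is a small Python sketch of how an app could estimate its tier before subscribing to streams. The aggregation rule comes from the pricing text quoted above. The tier boundaries in the code are my assumption, based on the standard pixel counts those tier names usually imply, so check Agora's pricing page before relying on them.

```python
from typing import List, Tuple

Stream = Tuple[int, int]  # (width, height) of one subscribed video stream

# Upper bounds of each tier in aggregate pixels. ASSUMED values derived from
# 1280x720, 1920x1080 and 2560x1440; not taken from this post.
TIERS = [
    ("HD", 1280 * 720),
    ("Full HD", 1920 * 1080),
    ("2K", 2560 * 1440),
]


def aggregate_resolution(streams: List[Stream]) -> int:
    """Sum width x height over every stream a user subscribes to."""
    return sum(width * height for width, height in streams)


def video_tier(streams: List[Stream]) -> str:
    """Return the first tier whose upper bound covers the aggregate."""
    total = aggregate_resolution(streams)
    for name, upper_bound in TIERS:
        if total <= upper_bound:
            return name
    return "2K+"


print(video_tier([(960, 720)] * 2))   # the example from Agora's pricing page
print(video_tier([(640, 480)] * 13))  # thirteen VGA streams
print(video_tier([(400, 300)] * 13))  # the same streams, downscaled
```

A check like this could run on the client whenever the layout changes, so the app can request smaller renditions before it crosses into a pricier bucket.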

Complicated? Yes, a bit — but on the other hand Agora at least lets you design your application at this level. The other vendors aren’t so forthcoming about the limitations and actual bandwidth of their solutions.

[Update 12 Oct 2021: My original intent was to write Part 2 of this post immediately. Unfortunately the hands-on API exploration described below remains unwritten. Still hoping to get around to it…]

One of the reasons for writing this post is to get first-hand experience of those implicit limitations that vendors might not be eager to discuss. I’m going to write a simple video chat application four times, using each vendor’s web API, and then spend a bit of time trying to bump into the limitations. What happens if you join a video room with many participants in Agora vs. Daily vs. Twilio vs. Zoom? Can we use these demo apps to get some measurements that could help us understand the quality tradeoffs on each platform?

So, tune in… some day… for another thrilling beginning as I download the SDKs and build those “Hello world” browser chat applications.

To get notified about the next part, why not follow me on Twitter @pauliooj? You can also sign up for RSS notifications with the link in the bottom-right corner (copy it to your favorite RSS reader). Thanks for reading!

APP.RODEO