News from Industry

Third time’s a charm: WebRTC Insights, 3 years in

bloggeek - Mon, 11/20/2023 - 12:30

Let’s look at what we’ve achieved with WebRTC Insights in the past three years and where we are headed with it.

Along with Philipp Hancke, I’ve been running multiple projects. WebRTC Insights is one of the main ones.

Three years ago, we decided to start a service – WebRTC Insights – where we send out an email every two weeks about everything and anything that WebRTC developers need to be aware of. This includes bug reports, upcoming features, Chrome experiments, security issues and market trends.

All of this with the intent of empowering you and letting you focus on what is really important – your application. We take care of giving you the information you need quicker and in a form that is already processed.

Three years into this initiative, it is still going strong. We onboarded a new client recently, and this is what he had to share with us after just the first week:

“[The Insights] Newsletter has been great and very helpful. Wish we had subscribed 2 years ago.”

Sean MacIsaac, Founder and EVP, Engineering @ Roam

Why is WebRTC Insights so useful for our clients?

It boils down to two main things:

  1. Time
  2. Focus

We reduce the time it takes for engineers and product people to figure out the issues they face and the trends in the market. Instead of them searching the internet to sift through hints or trying to catch threads of information on things they care about, we give it to them directly – usually a few days before their clients (or management) complain about it.

On top of that, we increase their focus on what’s important to them. Going back to past issues to find problems, searching for issues, looking at security problems, knowing about the experiments Google is running, or just being aware of the areas where Google is investing its efforts – all of these become really simple to do.

In the past few weeks we’ve been getting complaints from clients about audio issues on Mac (usually acoustic echo problems in Chrome). These were already hinted at in one of our previous issues, and the full details appeared in more recent ones. In parallel, we’ve been able to sniff around for root causes almost in real time – enabling our clients to zero in on the problem and find a suitable workaround.

If I weren’t so modest, I would say that for those who are serious about WebRTC, we are a force multiplier in their WebRTC expertise.

WebRTC Insights by the numbers

Since this is the third year, you can also check out our past “year in review” posts:

This is what we’ve done in these 3 years:

26 Insights issued this year with 329 issues & bugs, 136 PSAs, 15 security vulnerabilities, 230 market insights all totaling 231 pages. That’s quite a few useful insights to digest and act upon.

We have covered over a thousand issues and written more than 650 pages.

WebRTC is still ever-changing – both in its codebase and in how it gets used by the market.

Activity on libWebRTC has cooled down yet again in the last year, dropping below 200 commits a month consistently:

This is more visible by looking at the last four years:

On one hand, WebRTC is very mature now; on the other hand, it seems to us that there is still a lot of work to be done and bugs to be fixed. External contributions were up. What is concerning is that the “big drop” in May happened three months after Google announced a round of layoffs, but we have not seen many departures of long-time contributors.

Let’s dive into the categories, along with a few new initiatives we’ve taken this year as part of our WebRTC Insights service.


Bugs

The number of reported external bugs has dropped considerably, as did the number of issues tracking new work and initiatives. This correlates with the decreased commit activity.

The areas for bugs have also shifted: we have seen a lot more issues related to hardware acceleration (since Google is eyeing it now to further reduce CPU usage in Google Meet). Operating systems are becoming a bigger issue as well; for example, macOS Sonoma caused quite a few audio issues and enabled overlaid emoji reactions by default (a bad choice, with consequences described here) as part of a bigger push to move features like background blur to the OS layer. And of course, every autumn brings a new Safari on iOS release, which means a ton of regressions…

A good example of how Philipp himself uses Insights to identify which change caused a regression was the lack of H.264 fallback on Android, which rolled out in Chrome 115 in August. We had commented on the original change at the end of May:

That said, we did not think of Android which remains complicated when it comes to H.264 support. Thankfully this rollout was guarded by a feature flag so the regression could be mitigated by the WebRTC team in less than two days.

PSAs & resources worth reading

In addition to the public service announcements made by Googlers (and Philipp) when changing the C++ API or network behavior, we continue to track Chromium-related “Intents” (a useful indicator of what is about to ship) and relevant W3C/IETF discussions in this section. We also moved the more in-depth technical comments on relevant blog posts here from the “Market” section, which made the overall decline in activity less visible.

Experiments in WebRTC

Chrome’s field trials for WebRTC are a good indicator of which large changes are rolling out – those that either carry some risk of subtle breakage or need A/B experimentation. Sometimes these trials explain behavior that reproduces on some machines but not on others. We track the information from the chrome://version page over time, which gives us a pretty good picture of what is going on:

We have gotten a bit better and now track rollout percentages. We have not seen regressions from these rollouts in the last year which is good news.
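As an illustration of the kind of tracking described above, here is a small sketch that diffs two snapshots of field-trial state. The trial names and the `--force-fieldtrials` style syntax (`Trial/Group/` pairs) are assumptions made for the example – the actual chrome://version output varies by platform and release.

```javascript
// Sketch: diff two snapshots of Chrome field-trial state to spot rollouts.
// Input format assumed to follow the --force-fieldtrials syntax
// ("Trial1/Group1/Trial2/Group2/"); trial names below are hypothetical.

function parseFieldTrials(spec) {
  const parts = spec.split('/').filter(Boolean);
  const trials = new Map();
  for (let i = 0; i + 1 < parts.length; i += 2) {
    trials.set(parts[i], parts[i + 1]); // trial name -> active group
  }
  return trials;
}

function diffTrials(before, after) {
  const changes = [];
  for (const [trial, group] of after) {
    const prev = before.get(trial);
    if (prev === undefined) changes.push(`new: ${trial} (${group})`);
    else if (prev !== group) changes.push(`changed: ${trial} ${prev} -> ${group}`);
  }
  for (const trial of before.keys()) {
    if (!after.has(trial)) changes.push(`removed: ${trial}`);
  }
  return changes;
}

const lastIssue = parseFieldTrials('WebRTC-Audio-Foo/Enabled/');
const thisIssue = parseFieldTrials('WebRTC-Audio-Foo/Disabled/WebRTC-Video-Bar/Enabled/');
console.log(diffTrials(lastIssue, thisIssue));
// → [ 'changed: WebRTC-Audio-Foo Enabled -> Disabled', 'new: WebRTC-Video-Bar (Enabled)' ]
```

Running a diff like this over consecutive snapshots flags newly started, changed, and completed rollouts.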

WebRTC security alerts

This year we continued keeping track of WebRTC-related CVEs in Chrome (15 new ones in the past year). For each one, we determine whether it only affects Chromium or whether it affects native WebRTC and needs to be cherry-picked into your own fork of libwebrtc if you use it that way.

In recent months we’ve seen a trend of looking more closely at the codec implementations to find security threats there. Our expectation is that this will continue in the coming year as well – expect more CVEs around this area.

A personal highlight was Google’s Natalie Silvanovich following up on a silly SDP munging thing Philipp did with CVE-2023-4076, which affected WebRTC munging in Chrome (but not native applications):

If only anyone had told us that using SDP in the API, let alone having Javascript manipulate it in the input, is a bad idea…
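For readers unfamiliar with SDP munging, here is a minimal sketch of what it looks like in practice: a plain string transform applied to the offer between createOffer() and setLocalDescription(). The trimmed SDP snippet and the removeCodec() helper are hypothetical illustrations, not the code involved in the CVE – but they show why hand-editing SDP is so fragile.

```javascript
// Sketch of SDP munging: editing the session description string before
// handing it back to the browser. The offer below is a trimmed, hypothetical
// example; real offers are far longer.

function removeCodec(sdp, payloadType) {
  return sdp
    .split('\r\n')
    // drop the a=rtpmap/a=fmtp/a=rtcp-fb lines for that payload type...
    .filter((line) => !line.match(new RegExp(`^a=(rtpmap|fmtp|rtcp-fb):${payloadType}\\s`)))
    // ...and remove it from the m= line's payload type list
    .map((line) =>
      line.startsWith('m=')
        ? line.split(' ').filter((tok, i) => i < 3 || tok !== String(payloadType)).join(' ')
        : line
    )
    .join('\r\n');
}

const offer = [
  'm=video 9 UDP/TLS/RTP/SAVPF 96 98',
  'a=rtpmap:96 VP8/90000',
  'a=rtpmap:98 H264/90000',
  'a=fmtp:98 profile-level-id=42e01f',
].join('\r\n');

// the munged offer no longer lists payload type 98
console.log(removeCodec(offer, 98));
```

In a real application the munged string would then be passed to setLocalDescription() – which is precisely the input path the browser has to treat as untrusted.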

WebRTC market guidance

What are the leaders in video conferencing doing? What is Google doing with Meet, which directly affects WebRTC’s implementation? Are they all headed in the same direction? Do they invest in different technologies and domains?

How about CPaaS vendors? How are they trying to differentiate from each other?

Other vendors who use WebRTC or delve into the communication space – where do they innovate?

Here’s a quick example we noticed when Twilio worked on migrating their media servers to different IPs and ports:

This ability to look at the best practices of vendors – how they handled such challenges or introduced new features – is an eye opener. These are the things we cover in our market guidance. The intent here is to get you out of the echo chamber that is your own company, and see the bigger world. We do that in small doses, so that it won’t defocus you. But we do it so you can take into account the trends and changes that are shaping our industry.

The interesting thing is that as WebRTC moves more and more into a kind of “maintenance mode” with its browser releases, the variance and number of interesting newsworthy items we see in the market as a whole are growing. This is likely why our market insights section has seen rapid growth this year.

Insights automation

We’ve grown nicely in our client base, and up until recently, we sent the emails… manually.

It became a time-consuming activity, to say the least, and one that was also prone to errors. So we finally automated it.

The WebRTC Insights issue emails are now automated. They include the specific issue along with the latest collection of security issues. This has made life considerably simpler on our end.

Join the WebRTC experts

We are now headed into our fourth year of WebRTC Insights.

Our number of subscribers is growing. If you’ve gotten this far, then the only question left to ask is: if WebRTC interests you so much, why aren’t you already subscribed to WebRTC Insights?

You can read more about the available plans for WebRTC Insights and if you have any questions – just contact Tsahi.

Oh – and you shouldn’t take only our word for how great WebRTC Insights is – just see what Google’s own Serge Lachapelle has to say about it:

Still not sure? Want to sample an issue? Just reach out to me.


Qotom Q20321G9 fanless PC

TXLAB - Tue, 11/07/2023 - 00:04

As PCengines announced the end of sales of their famous APU platform, it’s time to look for alternative devices that can be used as firewalls, network probes, or VPN appliances.

I recently bought a Qotom Q20321G9 mini-PC from AliExpress. The model is similar to the Q20331G9 described on the Qotom website, the differences being a slower CPU and fewer SFP+ interfaces:

Model   Q20321G9                          Q20331G9
CPU     Intel Atom C3558R                 Intel Atom C3758R
TDP     17W                               26W
NICs    2x SFP+, 2x SFP, 5x 2.5Gbit LAN   4x SFP+, 5x 2.5Gbit LAN

Compared to the APU platform, this Qotom box is huge: 62mm high versus the APU enclosure’s 30mm, 217mm wide, and much heavier because of the massive heatsink. But it has much more to offer.

Two M.2 NVMe sockets allow a redundant storage setup out of the box. It also supports ECC RAM (although the model I received came with a non-ECC DIMM), so it can serve as a reliable hardware platform if you need long-term service. There is also an M.2 socket for an LTE modem, two antenna mounting holes, and a nano-SIM card slot.

A minor downside is that even at idle, with all CPU cores running at 800MHz, the device gets quite warm. The onboard sensors show CPU core temperatures of around +42C to +44C, and the enclosure is rather hot to the touch.

I also ran a CPU stress test with the enclosure covered by a towel for about half an hour; the CPU temperature exceeded 60C, but the device kept functioning well.

A minor inconvenience is that the power button is too easy to press accidentally while moving the device around during testing. But the button cap is easy to remove, and the power switch can then be pressed with a pen when needed.

The SFP and SFP+ interfaces were recognized by Debian 12 out of the box.

The device arrived with a preinstalled Windows 10. The BIOS allows redirecting the console to the COM port, which is provided as an RJ-45 socket, with the same pinout as Cisco routers.

The NIC numbering is a bit non-intuitive, and the marking on the enclosure does not help much. Here are the interfaces as they’re seen by Debian, if you look at the device’s interface panel:

eno1 (SFP+)   eno3 (SFP)   enp7s0 (LAN)   enp6s0 (LAN)   enp8s0 (LAN)
eno2 (SFP+)   eno4 (SFP)   enp5s0 (LAN)   enp4s0 (LAN)

Some diagnostics output below:

root@qotom01:~# lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  4
On-line CPU(s) list:     0-3
Vendor ID:               GenuineIntel
BIOS Vendor ID:          Intel(R) Corporation
Model name:              Intel(R) Atom(TM) CPU C3558R @ 2.40GHz
BIOS Model name:         Intel(R) Atom(TM) CPU C3558R @ 2.40GHz CPU @ 2.4GHz
BIOS CPU family:         178
CPU family:              6
Model:                   95
Thread(s) per core:      1
Core(s) per socket:      4
Socket(s):               1
Stepping:                1
CPU(s) scaling MHz:      52%
CPU max MHz:             2400.0000
CPU min MHz:             800.0000
BogoMIPS:                4800.00
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts md_clear arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   96 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    8 MiB (4 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

root@qotom01:~# lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 05e3:0608 Genesys Logic, Inc. Hub
Bus 001 Device 002: ID 046d:c31c Logitech, Inc. Keyboard K120
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

root@qotom01:~# lspci
00:00.0 Host bridge: Intel Corporation Atom Processor C3000 Series System Agent (rev 11)
00:04.0 Host bridge: Intel Corporation Atom Processor C3000 Series Error Registers (rev 11)
00:05.0 Generic system peripheral [0807]: Intel Corporation Atom Processor C3000 Series Root Complex Event Collector (rev 11)
00:06.0 PCI bridge: Intel Corporation Atom Processor C3000 Series Integrated QAT Root Port (rev 11)
00:09.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #0 (rev 11)
00:0a.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #1 (rev 11)
00:0b.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #2 (rev 11)
00:0c.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #3 (rev 11)
00:0e.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #4 (rev 11)
00:0f.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #5 (rev 11)
00:10.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #6 (rev 11)
00:11.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #7 (rev 11)
00:12.0 System peripheral: Intel Corporation Atom Processor C3000 Series SMBus Contoller - Host (rev 11)
00:13.0 SATA controller: Intel Corporation Atom Processor C3000 Series SATA Controller 0 (rev 11)
00:14.0 SATA controller: Intel Corporation Atom Processor C3000 Series SATA Controller 1 (rev 11)
00:15.0 USB controller: Intel Corporation Atom Processor C3000 Series USB 3.0 xHCI Controller (rev 11)
00:16.0 PCI bridge: Intel Corporation Atom Processor C3000 Series Integrated LAN Root Port #0 (rev 11)
00:17.0 PCI bridge: Intel Corporation Atom Processor C3000 Series Integrated LAN Root Port #1 (rev 11)
00:18.0 Communication controller: Intel Corporation Atom Processor C3000 Series ME HECI 1 (rev 11)
00:1a.0 Serial controller: Intel Corporation Atom Processor C3000 Series HSUART Controller (rev 11)
00:1f.0 ISA bridge: Intel Corporation Atom Processor C3000 Series LPC or eSPI (rev 11)
00:1f.2 Memory controller: Intel Corporation Atom Processor C3000 Series Power Management Controller (rev 11)
00:1f.4 SMBus: Intel Corporation Atom Processor C3000 Series SMBus controller (rev 11)
00:1f.5 Serial bus controller: Intel Corporation Atom Processor C3000 Series SPI Controller (rev 11)
01:00.0 Co-processor: Intel Corporation Atom Processor C3000 Series QuickAssist Technology (rev 11)
02:00.0 Non-Volatile memory controller: Phison Electronics Corporation PS5013 E13 NVMe Controller (rev 01)
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
05:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
06:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
07:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
08:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
09:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0a:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0b:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553 10 GbE SFP+ (rev 11)
0b:00.1 Ethernet controller: Intel Corporation Ethernet Connection X553 10 GbE SFP+ (rev 11)
0c:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553 Backplane (rev 11)
0c:00.1 Ethernet controller: Intel Corporation Ethernet Connection X553 Backplane (rev 11)

root@qotom01:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp4s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:76 brd ff:ff:ff:ff:ff:ff
3: enp5s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:77 brd ff:ff:ff:ff:ff:ff
4: enp6s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:78 brd ff:ff:ff:ff:ff:ff
5: enp7s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:79 brd ff:ff:ff:ff:ff:ff
6: enp8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:7a brd ff:ff:ff:ff:ff:ff
7: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:7b brd ff:ff:ff:ff:ff:ff
    altname enp11s0f0
8: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:7c brd ff:ff:ff:ff:ff:ff
    altname enp11s0f1
9: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:7d brd ff:ff:ff:ff:ff:ff
    altname enp12s0f0
10: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 20:7c:14:f2:9c:7e brd ff:ff:ff:ff:ff:ff
    altname enp12s0f1

Zooming in on remote education and WebRTC

bloggeek - Mon, 11/06/2023 - 12:30

An overview of remote education and WebRTC. The market niches, challenges and solutions.

Whenever a video meetings company starts looking at verticals for the purpose of targeted marketing, one of the verticals that is always there is education. We’ve seen this during the pandemic – as the world went into quarantine mode, schools started figuring out how to teach kids remotely.

The remote education market is not just schools doing remote video calls. It is a lot more varied. I’d like to explore that market in this article.

How big can remote education really get?

There are around 2 billion children in the world. Over 80% of them attend schools.

Some 235 million higher education students are out there as well around the globe.

During the pandemic, a lot of them were online, taking classes remotely. For multiple hours each day.

The slide above is from Kranky Geek 2020. In this session, Google talked about their work on WebRTC in Chrome.

Here they shared the increase in video minutes during the initial quarantines. The huge spike starts around the August/September timeframe, when schools start.

Remote education is here to stay. Not at its pandemic peak of 10-100x usage, but definitely bigger than in the past. There are many places where remote education can fit – and not only in emergencies such as the pandemic.

Me? Remote education?

Like everyone else, my kids went through remote education during the pandemic. Here, the Ministry of Education went all-in with Zoom for schools (along with Google Classroom and Microsoft Office – go figure). Since then, our kids have had private tutors on and off, some of them teaching remotely. And now, with a war raging between Gaza and Israel, depending on where you live you might be studying from home or physically in school.

I’ve had my share of consulting with education organizations across the globe – some focusing on schools, others on universities, and some on private tutoring. It was always fascinating to see how distinctly different these markets are from each other, and how remote education takes different shapes and sizes based on the country.

And then there are my own online courses, with their associated office hours and AMAs.

The role of WebRTC in remote education

WebRTC plays an important role in the education market. Besides offering video communications, it also makes it possible to mesh the communication experience directly into the LMS (Learning Management System) or the SIS (School Information System), offering a seamless, tailored experience for both the teacher and the learners – one that enables educators to implement various pedagogies.

Remember here that WebRTC is a synchronous technology – live, real-time voice and video communications. A large chunk of the education market is leaning heavily on asynchronous learning (recorded videos, texts to read, etc). These are not covered in this article.

Here are some market niches and use cases where you will find WebRTC in remote education.

Group lessons

The simplest one to explain is probably group lessons. The classic one would be the pandemic use case, where during quarantine, schools went all virtual – classes were conducted online.

Remote group lessons aren’t limited to schools either – they are done in universities, private group tutoring, etc.

Main challenges here include:

Moderation tools for the teachers. Ones that are simple to use while conducting the lesson itself

Collaboration tools to make the lessons more engaging. Maintaining engagement in online group lessons is the biggest challenge at the moment, especially for younger learners

Authentication and authorization of users. Lots of anecdotal stories around this one throughout the pandemic

One thing that is raised time and again with group lessons, especially in schools, is the need (and inability) to get the students to keep their cameras on. This is a huge obstacle to effective learning, and something that needs to be taken into account.

Another important thing that needs to be fleshed out early on is who the client is – the teacher or the students. Whoever the system is geared towards will set the tone for how the solution gets designed and implemented.

One-to-one tutoring

These are mainly one-on-one lessons conducted remotely.

Outside of the domain of classic education, a lot of classes are actually conducted in such a way. Here are a few anecdotal stories from recent years that I’ve learned about:

A dear friend who is learning to play the piano. Remotely. She travels a lot between the US and Israel, and takes her lessons from everywhere through her iPad

Another friend, taking 1:1 drawing lessons

Online chess lessons for kids in our community

My son’s friend, learning C++ on Unreal engine, taking 1:1 lessons

My son, a few years ago, when he was 10 or so, learning to build online games using nocode game engines from an 18 year-old who lived two cities away

My wife took online dance lessons to specialize in Salsa from a renowned instructor abroad

Besides the collaborative, engagement level and nature of such lessons, it is important to note that they aren’t suitable for everyone. Some teachers are more natural in these, and some students can learn effectively in such a manner while others struggle (I have both examples at home).

An interesting use case here that I’ve seen is math and English (!) tutors from India and China teaching remote kids in the UK and the US. Why? Simply because they are cheaper than using local teachers. Then there was the opposite – rich Chinese families getting one-to-one English tutoring for their kids from US teachers. Go figure.

One-to-one tutoring comes in a lot of different shapes and sizes.

MOOCs (Massive Open Online Courses)

MOOCs were all the rage 10 years ago. Their market is still consistently growing.

MOOCs are simply large online courses that are open for people around the globe. Some of them are collaborative, while others are mainly lecturer driven. Some allow for asynchronous learning while others are more synchronous in their nature. Both the asynchronous and synchronous learning modes in MOOCs offer self-paced learning (at least to some degree).

WebRTC finds its way into MOOCs for their synchronous part, when that requires live video sessions – either between lecturers and students or between student groups in the more collaborative courses.


Proctoring

Proctoring isn’t about learning, but about taking exams. Remote proctoring enables taking exams from the comfort of one’s home or office without going to a classroom.

With proctoring, the user is required to turn on their camera and microphone, as well as share their screen, while taking the exam. The proctoring application takes care of checking that other tabs aren’t being opened and that nothing fishy is taking place (as much as possible). WebRTC is used to gather all that realtime audio and video data and record it. If needed, these recordings can be accessed by human proctors later on.

It should be noted that proctoring comes with a lot of requirements around preventing cheating on the exam. This includes things like monitoring the applications used during the exam, maintaining focus on the exam page, etc. To achieve this, most proctoring solutions end up as PC applications (usually built with Electron) which the student needs to install on their machine in order to take the exam. The innards of the proctoring application will still end up using WebRTC in a web application – simply for the speed of development and the use of the WebRTC ecosystem.


Coaching

While similar to classic education, coaching is slightly different. In essence, these are 1:1 sessions or small group sessions where issues and challenges in certain areas get fleshed out. In group lessons and 1:1 tutoring, a lot of the focus is on collaboration features; here, in many cases, the focus is more on the video of the participants and the need to bring them together.

Another interesting aspect of coaching is the platform it gets attached to – either directly or indirectly. Coaching often comes bundled as a larger course/training offering, mixed with in-person meetings, reading/presented materials and the coaching sessions themselves.

Coaching platforms usually lack the LMS and SIS systems as well. Instead, they are geared towards flexible use and, at times, an integrated payment system.


Webinars

Webinars are lessons conducted over the internet, mostly used by businesses to assist in marketing and sales efforts. The level of interactivity of the webinar determines whether and how much WebRTC is needed.

In the past, webinars were usually conducted via specialized downloadable applications, where the content was mostly slide decks and the voice of the speakers. The interaction with the audience was done via text messages and organized Q&A. Over time, these solutions became richer and more sophisticated, adding video communications as well as the ability of the audience to “join the podium” if and when needed.

Using WebRTC here enabled getting rid of the application download requirement and increased the level of interactivity quite considerably.

The intersection of education and healthcare

Education and healthcare are bound together. I’ve shown that a bit in my WebRTC in telehealth article, looking at it from the remote training of healthcare topics perspectives. I want to take a different angle on the same topic here. I’ll do that by showcasing two interesting use cases I’ve been privy to a few years back.

#1 – Dance lessons in cancer

I heard this one from a dancer who had cancer and healed. Women with cancer have it hard. Chemo is brutal – it saps their energy and causes hair loss. This means women don’t want to go outside that much. Here, being able to bring them remotely to a dance lesson can be a real benefit, especially if they love(d) dancing. They won’t go physically – not wanting to meet people outside and the stares that come with it – along with the energy it takes. But they will be willing to dance – maybe.

Remote dance lessons for this niche are beneficial. Not from an educational standpoint but more from a mental health one.

#2 – Video in class for students in hospitals

Another vendor I worked with briefly was assisting school kids who had to be treated in hospital or just stay home for prolonged periods of time (think weeks or months at a time). Their solution was to bring a video conferencing system and rig it in the physical classroom of the kid as well as where he is located, be it home or a hospital bed.

This way, the kid could join the classes as well as stay connected to other classmates during recesses. The main purpose here isn’t really the teaching part, but rather to make sure the student stays in contact with peers in his age group and not be secluded during that period of time.

Is this a use case in education? In healthcare? I can’t really say…

ERT (Emergency Remote Teaching)

The pandemic showed us that remote education is challenging but might be necessary. We were all quarantined for long periods of time, with schools across the globe going remote.

Here in Israel, when clashes with Gaza or Hezbollah in Lebanon flare, schools shift to remote learning. It isn’t frictionless or smooth, but it is the solution we have to try and continue educating kids here.

The most crucial aspect of ERT is that teachers are forced to change their teaching setting with no preparation. In Israel, at least, the pandemic didn’t prepare teachers for the current war – it feels like the education system here learned nothing from the pandemic with regard to remote teaching.

Top down decisions; sometimes

Education is interesting. Especially the institutional ones of schools.

In some countries, decisions are made top down while in others, there’s more autonomy kept at the school level or the district level.

Here are a few things I learned from asking on LinkedIn what tools were used for virtual classes during the pandemic across the globe:

  • Israel. Where I live. Was mostly Zoom during the pandemic
    • There was also a bit of Google Meet and some BigBlueButton, due to its integration with MASHOV (an SIS in Israel)
    • The government struck a deal in education for Google Classroom country-wide
    • There’s also Office available for free for all students
    • And Zoom was the decision for virtual classes
    • This year, it all changed to Google Meet, presumably due to security concerns, but more likely this was due to pricing (Zoom renewal cost money while you get Google Meet and Microsoft Teams for free with Google Classroom and Office respectively)
    • Zoom hurried up with a statement that it is secure and now available for free for the education system in Israel
    • As the saying goes – it’s all about the money
  • Bulgaria used Jitsi Meet (through the Shkolo platform); later replaced by Microsoft Teams. Both with government provided accounts
  • Colombia. Most public schools and the university system relied heavily on Microsoft Teams. Private schools and universities were about an even split between Zoom and Microsoft Teams
  • Austria was mainly Microsoft Teams
  • Russia – Zoom
  • The United States was mostly Zoom. It wasn’t mandated, but just how things ended up in most places
  • UK. A private school in London opted for Microsoft Teams. Public schools were left to figure out their own solution
  • Argentina. Zoom, though I am not sure if everywhere and if the decision was top down or bottom up
  • India. Primarily Microsoft Teams and occasionally Zoom. Mainly because Microsoft Teams had better and stronger channel partners in India, being able to offer better deals
  • France. Started with Zoom and Jitsi Meet in schools. Now, the government has built a large scale BigBlueButton infrastructure for virtual classrooms

This is by no means complete or accurate, but it shows a few important aspects of education:

In some countries, decisions on the tools to use are taken top down, while in others, each district or school is left to autonomously make a decision.

Like in many industries, but probably more so, appearances matter. Losing Israel was bad publicity for Zoom. They had to fix that quickly by renewing the service for free. BTW – the damage is already done; my kids are now using Google Meet at school and there likely isn’t a way back.

Live, online and in-person

Education is mixed. It isn’t all virtual and isn’t all in person.

My own WebRTC Courses are online, but not live. The lessons are pre-recorded. I offer monthly AMA meetings as part of them which are online and live.

I took a CPO course last year. It included in person meetings (3 full days), weekly live sessions as well as pre-recorded information.

My kids are now learning some days remote and some days in-person in the school.

Some countries had recorded/broadcast lessons alongside virtual live classes during the pandemic, creating from them a full set of learning materials that students can use moving forward.

The LMS (Learning Management System) used needs to take all these into account, enabling different learning strategies and different content types. Your own service needs to be able to figure out what works best.


Hybrid learning

The term Hybrid Learning refers to any form that incorporates online and offline learning. This is slightly different from how we define hybrid meetings.

  • As an example, in Israel at the moment, in the current “war setup”, students go to school physically a few days a week and learn the rest of the time asynchronously or synchronously from home.
  • Another example of hybrid learning is when students work with laptops in the traditional classroom.

Allowing a student to remotely join a class taking place in person is a real challenge, but one that needs to be dealt with as well. This isn’t any different from hybrid meetings in enterprises in terms of the basic need. The difference is likely in size and complexity.

Most classes aren’t geared for this – from the placement of the cameras in the class, to the way the lessons are conducted, to the way teachers need to split their attention between in-person and remote students.

In most places, going hybrid in education is an intentional decision that can be made only for select use cases and in a limited number and types of institutions.


Moderation

Who is allowed to join a virtual lesson? Should the teacher approve each student joining? How do you know who is online? Who is actively listening? Should anyone be automatically allowed to speak up? Share their screen? Is there a way to check if the student goes “off the reservation”, doing other things in other browser tabs or on his phone in parallel?

All these are hard questions with no good answers.

Moderation in education must take place – especially for group lessons. This has two purposes:

  1. Maintain a semblance of order
  2. Let the teacher focus on teaching

Oftentimes, moderation tools deal with a semblance of order but less with the focus of the teacher or teaching.

The decision in Israel for example to go for Google Meet makes total sense simply because authentication and identity is managed by Google Classroom already. Classroom is acting as the LMS as well, or at least the hub for students and teachers. Having a tighter integration means some of the moderation requirements can more easily be met.

It isn’t only about what can be moderated, but how and with what level of friction


Assessments

How do assessments take place in online learning?

In the traditional classroom, teachers physically saw the students and could easily gauge their level of attentiveness. To that, home assignments and tests were added.

Once going online, technology can come to assist the teachers and students, adding a layer of information to the assessment process. Dashboards can be built to make this data accessible.

Where does WebRTC fit in here? The same way it does in online meetings, where we see today a growing focus on incorporating transcriptions, meeting summaries and action items automatically. Similar LLM/generative AI technologies can be used to glean insights out of online lessons.

In many ways, this isn’t done yet. Probably because we’re still struggling with engagement (see below).

Collaboration and whiteboarding

How is collaboration done in education? Do we need the classic blackboard/whiteboard for teaching? How does that get translated to the digital, remote scenario?

Are we looking here for something as powerful and flexible as a Miro board or something simpler and less feature rich?

Is teaching math or physics similar to teaching languages or literature when it comes to collaboration and whiteboard?

How about Kahoot or similar polling/quiz capabilities? Do we make them engaging or boring as hell?

A lot of thought and energy needs to be diverted towards these types of questions, in trying to figure out what works best to increase engagement and improve the learning experience (and by extension, the learning itself).

The challenge of engagement

How do you define engagement in online synchronous lessons?

Is students opening their cameras considered engagement?

Maybe students can be engaged with their cameras turned off?

Getting students to open up their cameras, having them choose to do so and keep the cameras on is a big issue in schools and in higher education.

In my son’s school, they are now shifting towards requiring students to open their cameras… but allowing them to point that camera at the ceiling.

Once you have cameras on, how does a teacher gauge the level of engagement of a student? How does he find the time to look at 20+ students (36 in Israeli classes) to understand whether they are engaged, while trying to present his screen and teach something out of his slide deck?

“Feeling the crowd” to understand whether a topic needs further explanation or the teacher can move on to new topics is harder to achieve online than in person.

The challenge of engagement (part 2)

How do you get students engaged?

What type of collaboration solution do you need?

Which experiences should be baked into the solution?

My son decided to take up Russian. His friend speaks Russian with his parents, so he decided he wants to understand when they talk to each other (go figure). He independently decided to install Duolingo on his phone and has been taking their lessons for almost a year now.

He can now read Russian and knows quite a few words.

A good friend of mine is learning German using Duolingo. We did a roadtrip in the US in February. I had to hear him learn in our long hours on the road. It was an interesting experience to see it from the side, trying to figure out how this magic happens.

Engagement and “gamification” are a main part of how Duolingo works and how it gets students back into their app over and over again.

We haven’t quite cracked the formula of how to do this well in live virtual classes. There must be a way to get there, and when we find it, we will see great dividends from it.

Asymmetry in remote education

There are teachers and there are students. Who is the system designed to cater to?

A simple question. Answering with “both” is likely going to be wrong most of the time.

I had a meeting at a large and prominent university in Europe a few years back. They wanted to build a video conferencing system for lectures. Have the professor in front of a large digital board showing tens of students joining remotely. Call it extremely expensive and unique. That was before the pandemic, so unrelated to it.

The question I had was who this system is for. Is it to sell students on a great remote experience, or is it for the professor to feel important? I have my own answer here…

You need to decide who the service you are developing is really there to cater to – the teacher and his needs, assuming students will simply join because they have little choice; or the students, focusing on enticing them to join, collaborate and interact.

Doing both at the same time is a real challenge, and one that most vendors aren’t prepared to take yet.

Figure out who your main user is. The teacher or the students. Or maybe the parents?

Training the educators

Someone needs to teach the teachers how to use the service. This is a real problem, especially when going mainstream.

When the pandemic started and Zoom was selected here in Israel, a lot of videos surfaced explaining how to use Zoom in the context of teaching. Last month, when Google Meet became the official solution, the same started occurring for Google Meet here in Israel.

The differences between these two services may seem minor, but they are big for teachers who aren’t technically savvy.

Some private tutors, for example, shy away from remote lessons. Their reason is the inability to focus on the student during the lesson. Multiply that by the 20-40 students in a single lesson, many of them acting like prisoners trying to break out and figuring out ways to game the system called a virtual lesson, and you see the need for teachers who know their way around the service inside and out.

Onboarding and familiarizing teachers to the platform is just as important as the actual service, sometimes even more

A matter of costs

This one might just be an opinion of mine.

Remote education is a huge market. During the pandemic, it encompassed almost all the world’s students. And yet, the amount of money available to spend per minute is quite low.

In many cases, the deals are large (in front of a state or a country). Sometimes, they are smallish, in front of a single school. There’s money in these institutions, but in many cases, that money is spent elsewhere.

When going after the education market, it is vital to understand the buying habits and budget of the would-be purchaser beforehand.

Solutions in the education market need to be cost effective and efficient from a WebRTC infrastructure point of view

Where can I help, if at all?

Online WebRTC courses, to skill up engineers on this technology

Consulting, mostly around architecture decisions and technology stack selection

Testing and monitoring WebRTC systems, via my role as Senior Director at Cyara (and the co-founder of testRTC)

The post Zooming in on remote education and WebRTC appeared first on

WebRTC in telehealth: More than just HIPAA compliance

bloggeek - Mon, 10/23/2023 - 13:00

When it comes to WebRTC in telehealth, there are quite a few use cases and a lot of things to consider besides HIPAA compliance.

A thing that comes up in each and every discussion related to telehealth & WebRTC is the value of the call in telehealth. We’ve seen video meetings and calls go down to zero in their cost/value for the user. Especially during the pandemic. So whenever we find a nice market where there is high value for a call, it is heartening. Healthcare is such a place where we can easily explain why calls are important.

But what exactly does WebRTC in telehealth mean? It isn’t just a patient calling a doctor. There is a lot more to it than that. Let’s dive in together to see what we can find.

My own experience with Telehealth

As a user

Me and my son, waiting in a hospital while he had some blood samples taken during COVID

Like many others, my first real bump with telehealth took place during the COVID quarantines.

My son was sick with high fever for over a week, and the doctors didn’t help any.

My wife was worried, needing more comfort by knowing someone was looking at him. Really looking at him.

So we used a kind of private service that a hospital in our vicinity was offering:

  • You subscribe and pay a hefty price
  • They send over a kit
  • You install an app and take measurements multiple times a day (useless ones, but stay with me)
  • They send over a radiologist to do an X-ray scan (need something to show they can)
  • Then you get to talk to a doctor once a day. Over a video call. From the same app

What can I say? It worked as advertised.

As a consultant and a product manager

We have quite a few healthcare clients using our various WebRTC services at testRTC.

Other than that:

  • Took part in an RFP of the Ministry of Health in Israel, assisting the vendor who approached me to win the contract
  • I assisted vendors during the pandemic to troubleshoot their architecture and scale their service rapidly

Add to that conversations with vendors, along with a review of this article by a few people who work on telehealth products, whose comments I’ve integrated as well.

Does that make me an expert in telehealth? No.

But I can fill in the WebRTC angle of telehealth, which is a rather big one.

Finding WebRTC in Telehealth

Telehealth for me is about the digital transformation of healthcare services.

It can start small, with things such as scheduling and viewing lab test results. And then it can grow towards virtualizing the actual patient-doctor interaction. Or any other interaction within the healthcare space between one or more people (emphasis on one here – not two).

I’ve listed here the main use cases that came to mind thinking of it in recent days.

Patients and doctors

The most obvious use case is the patient and doctor scenario.

In this, the doctor visitation itself is remote and virtual.

This can be useful in many situations:

  • When the patient can’t get to the doctor’s office
  • During the pandemic:
    • When healthcare providers didn’t want patients physically in the office
    • When the number of available doctors was dwindling due to quarantines, while quarantined doctors could still be useful remotely
  • If you don’t want to waste a patient’s time in coming over and waiting
  • When it is truly urgent (an emergency)

For many of these situations, this is the setup that takes place:

  1. Doctor – sitting in front of a PC or laptop. In a designated office or hospital (=managed network), or at home (=unmanaged network)
  2. Patient – connecting from a smartphone or tablet, via a direct link or an installed application

More on that – later.

In general – here’s where you’ll see such solution types deployed:

Hospitals and large healthcare organizations

Clinics hosting multiple doctors

Private clinic of a single doctor

Insurance companies

Also remember that the word doctor is a broad definition of the caretakers involved. These can be nurses, doctors, dietitians and other practitioners offering the treatment/session to the patient remotely.

The other thing to remember is that this is also asymmetric in scarcity: there are a lot more patients than there are caregivers.

Group therapy and counseling

Then there’s group therapy.

One where one or more psychologists lead a larger group of patients. The same also applies to dietitians and speech therapists, as well as to groups of smokers, cancer patients and others.

Here again, the idea and intent is that the patients and the therapists can join remotely to a virtual meeting and conduct that meeting.

The main benefit? Not needing to drive and travel for the meeting and being able to conduct it from anywhere.

Notable here is the fact that this can be enhanced or taken to a slightly different perspective – this can encompass the allied health domain, where AA (Alcoholic Anonymous) groups for example fit in.

Nurse stations

The nurse station is slightly different from the doctor-patient in my mind.

Here, the patient is situated physically next to the nurse, so the call/meeting isn’t virtual or remote but rather in person. The “twist” is that another caregiver or external authority can join the session remotely if and when needed. Say a doctor with a specialization that might not be available where the patient is located – this can be viewed as a way to democratize access to specialty care.

Envision a nurse moving inside a hospital ward. She has a mobile station moving around with her that can be used to conduct video meetings with doctors. It can also be used for other purposes such as adding a live translator into that interaction with the patient or the patient’s custodian.

The lack of specialized provider access in remote areas can be extremely critical, and here again, virtual meetings can assist. Taking this further, a nurse station of sorts can be placed inside an ambulance providing immediate care – even for cases of strokes or cardiac arrests.


Outpatient clinics

Outpatient clinics belong to hospitals. These are designed for people who do not require a hospital bed or an overnight stay. Sometimes this can be for minor surgeries; mostly for diagnostics, treatments or follow-ups to hospital admissions.

These clinics are part of the overall treatment that patients get from the hospital or for things that are hard to obtain elsewhere due to scarcity of machinery and/or experience.

Some of the diagnostics done in an outpatient clinic can be done remotely. This reduces wait and travel times for patients, and also allows doctors to join remotely instead of being physically inside the clinic.

While similar to the patients and doctors use case, there are differences. The main one is the organization behind it, the logistics and the network. Hospital networks are usually a lot more complex and more restrictive of WebRTC traffic, bringing with them a different set of headaches.

Taking care of the elderly

As the human population is aging in general and people live longer, we’re also getting to a point where elderly care is different from other areas of healthcare. Another aspect of it is the breakdown of the family unit into smaller pieces where elderly people move to assisted living, nursing homes and hospices.

Here, the telehealth solutions seen include also things like:

  • The ability to easily communicate with family members and friends remotely to keep connected
  • Remotely monitoring and taking care of the elderly via solutions that remind us of a surveillance use case
  • Providing access to doctors remotely, especially for the less common health issues

Remote patient monitoring is another new field. Due to the scarcity of nurses, many hospitals are moving towards virtual monitoring of critical patients in hospitals and medical facilities that require 24×7 observation.

Operating rooms

The operating room is at the heart of hospital care. It is where surgeons, anesthetists, nurses and other practitioners work together on a patient in an aseptic environment.

An obvious requirement here might be to have an expert join remotely to observe, instruct or consult during surgery. That expert can be someone who isn’t in the vicinity of the hospital, helping bridge the gap of knowledge and expertise between central hospitals in large cities and rural ones.

It can also be used to have an expert who is situated in the hospital join in – entering an operating room requires the caregiver to scrub before entering. This process takes several minutes. By having the expert join remotely from another room at the hospital, we can have him jump from one surgery to another faster. Think of the supervisor of multiple surgery rooms at a hospital or a specialist. Saving scrubbing times can increase efficiency.

Then there is the option of having external observers follow surgeries without being in the operating room itself. They can be silent or vocal participants – joining as trainees, for example, as part of their learning process to become surgeons.

As we advance in this area, we see AR and VR technologies enter the space, either to assist the doctor locally in the surgery or have the external experts join remotely.


Training

Learning in operating rooms is just part of training in the healthcare domain.

Training can take different shapes and sizes here, and in a way, it is also part of the education market.

Here are some of the examples I’ve seen:

  • Remote training/education for various healthcare roles
  • First aid training for civilians
  • Medical equipment training

Machinery remoting

Healthcare is a domain that has lots and lots and lots of devices and machinery. From simple thermometers to CT scanners and surgical robots.

What we are seeing in many areas is the remoting of these devices and machines. Having the patient being diagnosed or treated use a device (or have a device used on him), while having the technician, specialist, nurse or doctor operate or access the data of the device remotely.

This has many different reasons – from letting patients stay at home, to getting specialists from remote areas, to increasing the efficiency of the caregivers (reducing their travel time between visitations).

Here are a few examples:

Stethoscopes, Thermometers, Ophthalmoscopes, Otoscopes, etc. These devices can be made smart – having the patient use them on his own and have their measurements sent to remote nurses or doctors

X-ray, CT, MRI – different type of scans that can be done in one place and have the operator or the person deciphering the results located elsewhere

Surgical robots, that can be observed or operated remotely

Robots roaming hospitals, taking care of menial tasks such as sanitizing equipment and rooms

There is an ongoing increase in adding smarts into devices and the healthcare space is part of that trend. When caregivers need to interact with these devices or access their measurements in real time, this can be done using WebRTC technology.
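One common way to move device measurements in real time with WebRTC is a data channel alongside (or instead of) the media streams. The sketch below is a hypothetical illustration – `encodeMeasurement`, the channel label and the device shape are all made up for this example; only `createDataChannel` and `send` are standard WebRTC API.

```javascript
// Hypothetical sketch: streaming smart-device measurements to a remote
// caregiver over a WebRTC data channel. encodeMeasurement() is a plain
// function so the serialization logic can run and be tested outside a browser.
function encodeMeasurement(deviceId, kind, value, unit) {
  return JSON.stringify({ deviceId, kind, value, unit, ts: Date.now() });
}

// Browser-side wiring (signaling and peer connection setup are out of scope):
function attachDeviceChannel(peerConnection, device) {
  const channel = peerConnection.createDataChannel('measurements');
  channel.onopen = () => {
    // e.g. a smart thermometer emitting readings on the patient's side
    device.onReading = (value) =>
      channel.send(encodeMeasurement(device.id, 'temperature', value, 'celsius'));
  };
  return channel;
}
```

Data channels are encrypted like the rest of WebRTC traffic, which matters here given the regulatory discussion below on patient data.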

Simultaneous translation and/or scribes

Doctors are a scarce resource. As such, a critical part is having their time better utilized.

There are two telehealth solutions that are aiming to get that done in a similar fashion but totally different focus:

Translation – patients speaking a different language than that of a caregiver need a better way to communicate. Hospitals and clinics cannot always have a translator in hand available. In such cases, having a translator join remotely can be a good solution.

The purpose? Increase accessibility of doctors to patients who don’t speak the doctor’s language.

Scribes – doctors need to keep everything documented. The patient digital record (PDR) is an important part of treatment over time. The writing part takes time and is done in parallel to diagnosing the patient. It is quite common today to have a doctor sit in front of you, typing away on his PC without even looking at the patient (being on the receiving end of that treatment more than once, it does sometimes feel somewhat surreal). Remote scribes can alleviate that by taking part in the doctor visitation, taking care of filling in the PDR. A different approach making headway here is AI-based transcription and the automatic creation of the medical record entries – this alleviates the need for a human scribe.

The purpose? Increase efficiencies and enable doctors to treat more patients.

At the boundary between education and healthcare

Then there is the education part adjacent to healthcare. Think of children who are treated for long periods of time where they either need to stay in the hospital or at home for treatment and rest. How do you make sure they don’t lose too much of the curriculum during that time? That they stay connected with their friends in class?

There are solutions for that, in the form of providing a PC at school and a tablet or laptop to the kid to remotely join such sessions.

This is probably more suitable for the education market, but I just wanted to add it here for completeness.

A game of numbers

Telehealth is a relatively small WebRTC market.

If you take all physicians in the world, and try to figure out how many there are per the size of the population, you will get averages of 1:500 at most (see Wikipedia as a source for example).

Not all physicians practice telehealth, and of those who do, many do it seldom. The numbers here aren’t big when it comes to minutes or visitations conducted.

Compared to the number of minutes conducted every day on Facebook Messenger, the total telehealth minutes worldwide are minuscule.

The difference here though, is the importance and willingness to pay for each such minute.

When trying to do market sizing or value – be sure to remember this –

Total number of doctors, minutes and visits isn’t that large worldwide

Telehealth minutes are more valuable than social media minutes

WebRTC telehealth and HIPAA compliance

Whenever telehealth is discussed, HIPAA compliance is thrown out in the air. At its heart, HIPAA compliance is about security and privacy of patients and their information, all wrapped up in a nice certification package:

  • Vendors wanting to sell telehealth services to hospitals need to be HIPAA compliant – at least in the US
  • In the EU, there’s GDPR, with different interpretation per EU country
  • Then there are other countries outside of the US and the EU with their own regulations
  • All in all, the requirements here are quite similar

Most countries have separate regulations for patient privacy, which are generally stricter than those for personal privacy. While there’s more to it than what I’ll share here, it usually boils down to encryption and all the management that goes around it.

WebRTC is encrypted, so all that is left is for the application to not ruin it… which isn’t always simple.

Sometimes, you will find vendors touting E2EE (End-to-End Encryption), which in most WebRTC jargon means the use of media servers that can’t access the media. Oftentimes, these vendors actually mean the use of P2P (Peer-to-Peer), where no media server is used at all.

Oh, and if you are using a third party video conferencing solution (say… a CPaaS vendor), then you will need to obtain a BAA (Business Associate Agreement) from that vendor, indicating that it complies with HIPAA. You will then need to certify your own application on top of it.

Network and firewall restrictions

Hospitals and clinics usually end up with very restrictive internet networks. This stems from the need to maintain patient confidentiality and privacy. The increase in ransomware attacks on businesses and healthcare organizations is a source of worry as well.

In such a climate, adding WebRTC telehealth solutions requires opening more IP addresses and ports on the organizations’ firewalls.

A big challenge for vendors is getting their WebRTC applications to work in certain healthcare organizations, usually because their services get blocked or throttled by deep packet inspection.

Vendors who can make this process smoother and simpler for customers will win the day.
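One common mitigation, sketched below with placeholder server URLs and credentials (stun.example.com, turn.example.com and the credentials are all hypothetical), is to offer TURN over TLS on port 443, which looks like ordinary HTTPS traffic to restrictive firewalls, and to force relay-only ICE when UDP is blocked entirely:

```javascript
// Sketch: build an RTCConfiguration that favors TURN over TLS on port 443.
// Server URLs and credentials are placeholders - substitute your own.
function buildRtcConfig({ relayOnly = false } = {}) {
  return {
    iceServers: [
      { urls: "stun:stun.example.com:3478" },
      {
        // turns: over TCP/443 resembles HTTPS traffic to deep packet inspection
        urls: "turns:turn.example.com:443?transport=tcp",
        username: "user",
        credential: "secret",
      },
    ],
    // When UDP is blocked entirely, force all media through the TURN relay
    iceTransportPolicy: relayOnly ? "relay" : "all",
  };
}

// Usage in the browser: new RTCPeerConnection(buildRtcConfig({ relayOnly: true }))
```

The relay-only mode trades some latency for connectivity, so it makes sense as a fallback after a normal ICE attempt fails rather than as the default.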

Quality of media

Not being able to see video well in a social interaction is acceptable.

Having a doctor not being able to see the mole on your skin is a totally different thing.

Quality of media can be critical in certain use cases of telehealth. Here, it might be a matter of resolution and sharpness of the image, but it can also be related to the latency of the session. Remote procedures conducted via WebRTC for telehealth might be a bit more sensitive to latency than your common meeting scenario.

Depending on the use case, you have to prioritize resolution vs frame rate. A still patient needs higher resolution, while surgery or any motion-specific activity requires a higher frame rate. The ability to switch between these two priorities is also a consideration.
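A minimal sketch of such switching (the specific resolutions and frame rates below are illustrative, not recommendations) could map the session’s priority to getUserMedia constraints, plus a contentHint to apply to the resulting video track:

```javascript
// Sketch: pick capture constraints based on what the telehealth session needs.
// "detail" favors resolution (e.g. examining skin); anything else favors
// frame rate (e.g. observing motion). All values are illustrative.
function videoConstraintsFor(priority) {
  if (priority === "detail") {
    return {
      video: { width: { ideal: 3840 }, height: { ideal: 2160 }, frameRate: { ideal: 15 } },
      contentHint: "detail", // set on the track; biases the encoder to spatial detail
    };
  }
  return {
    video: { width: { ideal: 1280 }, height: { ideal: 720 }, frameRate: { ideal: 60 } },
    contentHint: "motion", // biases the encoder to temporal smoothness
  };
}

// In the browser (not runnable here):
// const { video, contentHint } = videoConstraintsFor("detail");
// const stream = await navigator.mediaDevices.getUserMedia({ video });
// stream.getVideoTracks()[0].contentHint = contentHint;
```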

At times, 4K resolutions, specific color spaces or audio restrictions may be needed, especially when dealing with analysis of sensor data from medical devices. These may require a bit more work to integrate properly with WebRTC.

Asymmetric nature of users and devices

One tidbit about telehealth is that sessions are almost always asymmetric in nature, and the majority of them end up as 2-way conversations.

By asymmetric I mean that the users have different devices:

  • Doctors and caregivers will almost always be on devices that are known in advance – their location, their makeup, etc.
    • Most likely, they will be accessing them from a laptop or a PC
    • They use the same application again and again. This means that they will learn to workaround issues they bump into
    • Often on a restricted device with older browser versions and/or low CPU power. Though not always and not everywhere
    • Sometimes, though less and less these days, old equipment used by doctors in their office means the introduction of interop requirements
  • Patients will almost always join from a mobile device – a tablet or a smartphone
    • Many will do so via a URL they receive over SMS, joining from a mobile browser
    • Browser use on mobile isn’t as stable, especially on iOS Safari. Device handling is trickier with the need to handle phone calls and assistants (Siri) interacting with the same microphone
    • Others will end up on a native application built for this specific purpose
    • Being unassuming consumers, they try to join from everywhere, including elevators and moving cars
    • They are also not going to use the application much and won’t want to waste time mucking around figuring out things or troubleshooting them. This means telehealth apps need to relentlessly focus on UX and usability for the patient side

This asymmetric nature affects how telehealth applications need to be designed and built, taking special care around permissions, privacy and the unique user experience of the various users.

Medical devices, sensors and telemetry

Modern healthcare has the widest variety of devices and sensors of any industry (leaving out the defense industry). These devices are now being digitized and modernized. Part of this modernization is adding communication channels to them, and even more recently, enabling their use to be virtualized and operated remotely, either partially or fully.

Medical devices sometimes generate images. Other times an audio stream. Or a video feed. Or other sensory data and information. WebRTC enables sending such data in real time, or the telehealth application can send this data out of band, via Websockets or HTTP messages.
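If sensor data rides inside the session over a WebRTC data channel, one practical detail is message size. A hedged sketch, assuming the commonly cited 16 KiB interoperability limit for data channel messages, would chunk larger payloads before sending:

```javascript
// Sketch: split a large sensor payload into chunks that are safe to send over
// a WebRTC data channel. 16 KiB per message is a widely interoperable limit.
const CHUNK_SIZE = 16 * 1024;

function chunkPayload(bytes, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray creates a view, so no copying of the underlying buffer
    chunks.push(bytes.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

// In the browser you would then send each chunk on an RTCDataChannel:
// chunkPayload(payload).forEach((c) => dataChannel.send(c));
```

The receiving side needs a matching reassembly step, typically with a small header marking payload boundaries, which is omitted here.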

It can be as simple as taking a measurement of a patient remotely, while the patient holds the medical device and the nurse or doctor observes, with the results sent over inside the application.

That can progress to passively overseeing a procedure and commenting on it in a video session. Think of a doctor or a nurse consulting remotely with a specialist while giving a treatment or performing a surgical procedure.

And it can go to the extreme of performing the procedure remotely. A radiologist operating the CT machine remotely, for example.

How these get connected and where WebRTC fits exactly is a tricky challenge. There’s latency to deal with, connectivity to physical devices (oftentimes without the ability to replace them) and regulatory issues. This space has quite a few obstacles, which are also great barriers to entry and moats against competitors for those who invest the effort here.

SaaS, CPaaS & open source: Build vs Buy

Telehealth comes in different shapes and sizes.

Many of the CPaaS vendors have gone ahead and made themselves easy to use for telehealth, mainly by supporting HIPAA compliance requirements.

I’ve seen various telehealth solutions built on CPaaS while others build their own service from scratch using open source components. There is no single approach here that I can suggest, as each has its own advantages and challenges.

One of the biggest challenges in adopting CPaaS for telehealth is upholding the patient’s privacy. Functions of the CPaaS platform require it to know certain elements of PHI (Protected Health Information), especially if call recordings are implemented. At times, a telehealth platform may expose a patient name or other information to the CPaaS implementation. These invite additional security risks and may violate patient privacy laws. A BAA here helps, but may not be enough, since most patient privacy laws require exposing to an external entity (in this case, the CPaaS vendor) only the bare minimum of PHI that is needed.
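One defensive pattern, sketched below with hypothetical field names (sessionId, roomToken, role and patientName are all made up for illustration), is an allow-list filter that strips everything except opaque session identifiers before any metadata reaches the CPaaS vendor:

```javascript
// Sketch: pass only the bare minimum to the CPaaS vendor. Field names are
// hypothetical - the point is that identifying patient data never crosses
// the boundary to the third-party platform, only opaque tokens do.
const ALLOWED_FIELDS = new Set(["sessionId", "roomToken", "role"]);

function minimizeForCpaas(sessionInfo) {
  const safe = {};
  for (const [key, value] of Object.entries(sessionInfo)) {
    if (ALLOWED_FIELDS.has(key)) safe[key] = value;
  }
  return safe;
}
```

An allow-list (rather than a deny-list) fails safe: any new PHI field added to the session record is excluded by default instead of leaking until someone remembers to block it.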

Here, vendors should look at their core competencies and the actual requirements they have from their WebRTC infrastructure. As always, my suggestion is to go with CPaaS unless there is a real reason not to.

Where can I help, if at all?

Online WebRTC courses, to skill up engineers on this technology

Consulting, mostly around architecture decisions and technology stack selection

Testing and monitoring WebRTC systems, via my role as Senior Director at Cyara (and the co-founder of testRTC)

The post WebRTC in telehealth: More than just HIPAA compliance appeared first on

No. I am not ok

bloggeek - Mon, 10/09/2023 - 07:15

I’ve been meaning to write about a different topic about WebRTC, but somehow, this was more important.

There’s a war going on here where I live between Israel and Hamas. Or Israel and Gaza. Or Israel and the Palestinians. Or Israel and Iran’s proxies. Or Israel and Muslim extremists.

Or all of the above if we’re frank with ourselves.

We haven’t invited this war or wanted it, but it is what we need to face and deal with.

Others are explaining the situation better than I can on social media sites and in English.

To those of you who reached out to me asking if I am ok, if me and my family are safe, I answered that we’re ok’ish mostly.

Well… I am not ok.

  • At least 700 were brutally murdered
    • Many of them civilians
    • Many of them babies, children, women and the elderly
    • Some of them are muslims (usually through rockets)
    • Some of them foreigners here in Israel – working, living or just visiting
  • Over 260 were butchered in an outdoor party. Many of them teenagers and young adults
  • The number of murdered is likely to rise above 1,000
    • At the size of Israel, this is bigger than a 9/11 or Pearl Harbor event
    • It is a huge milestone and likely a turning point
  • Over 5,000 rockets have been fired on Israeli cities (might be more – might be less – who’s counting anymore?)
  • There are more than 100 kidnapped Israelis in Gaza now. Taken from their homes in Israel. Again – babies, children, women and elderly among them
  • Israeli parents and families still don’t know where their loved ones are
    • Are they wounded somewhere?
    • Are they dead?
    • Were they kidnapped and taken into Gaza?
    • Are they being abused? Raped? Decapitated?
    • Some find out from social media
      • A story about people seeing their family members on live videos
      • An elderly woman whose family found out she was murdered because the murderer decided to take a photo of her and publish it on her Facebook account
    • My Facebook is filled with photos of missing family members. Mostly kids and young adults
  • This is all for the world to see right there on social media if one cares to look at war crimes and atrocities committed by Hamas while the Gazans, Palestinians and other extremist Muslims across the globe cheer and gloat (again – directly on social media – just go and watch)
  • These aren’t human beings. These are monsters

I. Am. Not. Ok.

  • Yes. Physically, I am fine. We live at the center of Israel in relative safety at the moment
  • Everything is relative in life
  • We came back from a two week vacation in the US a day before the war started
  • Yesterday, I went to the supermarket to buy supplies – we’re short on everything
  • In the elevator I met a neighbor coming out. We greeted each other with “hi”. He noted that we don’t say “good morning” anymore. I agreed. We left without the so common “have a great day” greeting
  • The supermarket was big and full of people for a Sunday morning
  • It was also totally quiet. If you know Israelis, you know we’re a loud bunch. None of it took place there
  • Everyone looked shell shocked and subdued on the outside. Looking more closely, you could see purpose. A parent telling his 20-year old child he wants to be called to the war – saying that while he is old, he wants to participate and help in any way he can
  • A person near the cash registers, asking people to donate food and stuff to take to the soldiers
  • And me? I consider myself sharp minded and grounded. I couldn’t find my shopping cart each time I went hunting for things to buy. Over and over again. I even came back and almost took a different cart to the astonishment of the pregnant lady and her husband standing next to it. Where was my mind wandering? Each and every time

No. I am not ok.

  • We have two kids. Teenagers
  • My son was on overdrive on the first day of the war. Hyperactive
    • Probably in an attempt to process things
    • That subsided by the end of the day, and now he is silent and subdued
    • Buries himself in his video games and his drawings
  • My daughter, ever the silent type, stayed silent
    • She went to sleep on that first day, telling me that one of her best friend’s brothers was likely injured and his parents are rushing to the hospital
    • She woke up the next morning reading his name on a website as one of the first people announced dead. Murdered. Only 19 years old
    • Before we could tell her the news as we heard it through the parents
    • She spent the rest of yesterday going back and forth with the rest of her friends to that friend’s home. She will likely do that the rest of the week
    • At the age of 16, she is now experiencing grief. Seeing it in the face. Seeing parents bury their murdered child
    • What can I do with such reality?
  • And me and my wife? We trudge along, each with our own way of dealing with it
  • Thinking if and when to do what
    • Is it the right time to shower or should we wait? Sirens and all
    • Should we take our kids to this activity or that, or just cancel it for now
    • And if our kids need to go somewhere, should we go along with them, for the good that will do, or not
    • Is it enough to just close the door to the Mamad, or do we need to add an element that won’t let murdering Palestinians open it from the outside while we’re inside?
    • Mundane daily thoughts and decisions we need to make here
  • It is hard to sleep at night
    • Not sure if it has anything to do with jet lag coming back from the US and the 10 hour difference
    • Or is it just the weird situation we’re in
    • Probably that second option

I am not ok.

  • We had sirens here. 5 of them so far I think. Not really counting
  • Each time, this means running to our Mamad. Every house and apartment in Israel has such a thing if it was built in the last 20+ years
    • This is a room that is built differently than the rest of the house
    • It has concrete walls and ceiling
    • A bomb shelter door and window made of heavy iron
    • Complete with the ability to seal it up against chemical weapons if needed
  • This room is also my home office. If you’ve seen any of my videos or met me virtually, then you’ve seen this room
  • The window there is now closed. There’s no point in opening it up until this is all over
  • Once, I had to run in from a neighbor’s apartment, where we discussed matters related to the building. A decision I had to make – should I go stay in their Mamad or run home to be with my family so they worry a wee bit less
  • We had a rocket fall a few hundred meters from our place. On the road. No one was wounded. We heard it really well

I am not ok.

  • There’s an iron dome battery somewhere close. A few kilometers away I assume
  • When it fires rockets we feel it and then we hear it
  • It might be followed with a siren or not, depending on where the likely missiles are about to hit
  • Then you hear the intercepts or the falling missiles. They sound different

I am not ok.

  • We live next to a hospital. It is located some 2 kilometers from our place
  • In the last two days, I’ve seen my share of military helicopters coming in and out, moving severely wounded people around as they spread them across hospitals in Israel

I am not ok.

  • Hamas and the Palestinians are busy killing as many Jews as they can indiscriminately
  • Our government and legal system are bickering over the legality of stopping supplying electricity to Gaza. We give them life while they give us death
  • What stupid world are we living in?

Physically? I am fine.

The rest? Not so much

If you know me or have been to this site before, then you know a bit about Israelis already.

We are here to create and innovate. To bring good to the world and to improve things.

In the 10+ years I’ve been running this blog, I shared my thoughts and helped my industry as much as I could. Many times, not asking for anything in return. It is what I do.

Two years ago, my Israeli co-founders and I sold testRTC. Ever since, I’ve been asking myself what I should do next.

One of my dreams recently has been to start teaching. Kids. Older ones. Show them the world of technology and entrepreneurship and what is possible. Be a mentor. Raise the next generation of creativity and innovation of Israelis.

I believe Israelis are a net positive to the world.

I act like this every day. I teach my kids in that way. I see that the floundering and ill equipped education system we have here in Israel does the same. There is no hatred in our teachings or in the way we raise our kids.

Palestinians. Hamas. Extremist Muslims.

How can they slaughter kids in cold blood? Murder whole families? Kill without discrimination whole communities? Then go and show it to the world on social media. And then praise it and celebrate on the streets.

This is inhumane.

In many ways, I see them as a net negative to the world.

I just can’t see it otherwise at the moment.

People who ask me what they can do to help – nothing. And everything.

Our dysfunctional government will find a way to help, and until then, the civilians here and the soldiers will figure it out. We always do. We don’t have a choice.

I don’t really need anything from you. We’re Israelis. We’ll survive. We have done so ever since the holocaust and we know we can only depend on ourselves. So thanks for asking, but I don’t need a thing at the moment.

  • The solidarity flags and colors lighting places across the globe? That’s useless. Sorry
  • You’ll switch gears over there saying we shouldn’t kill Palestinians soon enough
  • All the while having your governments (at least some of them) continue to fund the Palestinians in one way or another, just ending up fueling their war against us


What can you do?

Understand that there aren’t really two sides to this story.

This conflict isn’t symmetrical in any way. It is between people who want to live and people who want to kill and ruin.

If you don’t believe me, then just go on social media and see what the Palestinians are doing. How they parade dead Israeli soldiers, small kids and elderly on the streets of Gaza for all their people to see and enjoy. This is the 21st century.

So no. I am not ok.

We will prevail. And in the meantime, I will be working. Different than usual, but still working. Still making my small and modest contribution to the world. Trying to touch and better those I interact with.

The post No. I am not ok appeared first on

Fitting WebRTC in the brave new world of webcams, security, surveillance and visual intelligence

bloggeek - Tue, 09/26/2023 - 12:30

WebRTC has its place in surveillance and security applications. It isn’t core to these industries, but it is critical in many deployments.

Surveillance has become near and dear to my heart. I had a few vendors consult with me in the past. There are a few using testRTC. And then there’s the personal level. The system we have in our apartment building.

This got me thinking quite a lot about WebRTC in surveillance tech lately.

Why my interest in surveillance cameras (and WebRTC)?

I live in an apartment building here in Israel:

  • 23 floors
  • 91 apartments
  • 2 main entrances (and another side one)
  • 3 elevators
  • 3 levels of underground parking

And yes. We have a surveillance camera system. Like all of the other apartment buildings in my neighborhood:

The view from my apartment on a nice day

A year ago, I was in charge of the vendor selection and upgrade process of our cameras. We switched from an analog system to a hybrid analog/IP one.

This month, we’re looking into upgrading an elevator camera to an IP one, as well as adding WiFi to our underground parking. While chatting with one of the vendors we reached out to, he was fascinated with my work on WebRTC and the potential of using it for application-less viewing of cameras.

I’ve had my share of meetings and dealings with vendors building different types of surveillance and security solutions, from private security solutions to large scale, enterprise visual intelligence ones. Obviously, the subject of these interactions was WebRTC.

I am not an expert in surveillance, so take the market overview with a grain of salt

That said, I do know my way with WebRTC and where it fits nicely

Here are some of the things I learned over the years

Security and surveillance use cases in WebRTC

I’ll start with the obvious – cameras, security and surveillance have multiple use cases. Some of them can be seen as classic to this domain while others are slightly newer or a specialized niche. Each of these use cases is a world unto its own, with its own requirements from WebRTC and the types of solutions emerging in it.

Small scale / cheap multiple surveillance cameras

This is where I’d frame my own experience of our apartment building. A system that requires 32 or fewer video cameras, spread across the location, connected to a DVR (Digital Video Recorder) or an NVR (Network Video Recorder).

In essence, you go install the cameras in sensitive locations, wire them up (with an analog cable, IP or even wireless) to the media server that is located onsite as well. That media server is a DVR if it is a closed loop system or an NVR if you’re living in modern times. I’ll just refer to these two as xDR from here on.

Once there, you hook them up to a local monitor that nobody ever looks at, as well as let the owner connect remotely from his PC or mobile phone.

Is WebRTC needed here? Not really.

Surveillance cameras today use RTP (and sometimes also RTSP). These are the new ones. Old ones are pure analog. They connect to that xVR media server, which handles them quite well today. It did so also before WebRTC came to our lives. The user then accesses the system to play the videos remotely using a dedicated application, which again, existed before WebRTC.

Since there’s no specific requirement to access this through a web browser, the use of WebRTC here is questionable.

You might say WebRTC would make things easier, but hey – if it ain’t broken, don’t fix it

These solutions are purchased from local vendors that install such systems. The buyer will usually reach out to an installer that will pick and choose the cameras and the surveillance system for the buyer. The buyer cares less about the technology and more about the local vendor’s ability to install and maintain the system when needed.

Enterprise / large scale surveillance

Large scale surveillance systems for enterprises are more of the same as the small scale ones, but with a few main differences:

  1. There are more cameras
  2. There are also more sensors which we want to control and manage, likely using the same system. Think doors and managing employee entrance using keycards for example. While this is about surveillance and security, it is also about building automation
  3. This can go from a small scale building up to smart cities with lots of cameras. Anywhere in-between, what I bunch here is most likely multiple different markets with slightly different requirements
  4. We are likely to have a NOC, where security guards look at screens. Just like in the movies…

The two things that are making headways in this industry?

  • Using AI to reduce the number of people needed to look at surveillance monitors. This is done by adding vision smarts into cameras and media servers (local or in the cloud), so that events and alerts can be filtered better
  • To some extent, there’s also a requirement to use WebRTC in the NOC to be able to view in real time camera feeds without installing anything

Like the small scale solutions, here too the buyer will look for local installers. These will be the local integrators who bring the systems and install them. At times, the decision of brand will come from the buyer, though this is less likely. It is important to remember that a considerable part of the cost goes towards the setup and installation and not necessarily to the cost of the equipment itself.

Personal/home surveillance

This one is the residential one. It is a B2C space where the buyer is a person buying a camera for his own home security. The decision is made mostly on price or brand.

Here you’ll find also solutions that make use of old smartphones and tablets as cameras, or something like the one we purchased a few years back when our kids were younger:

A digital peephole camera

It gave them the ability to see who was outside our door back when they were shorter.

Here too, the market is going into multiple directions:

  • Home automation, connecting more sensors and devices in the home, some of them have cameras in them
  • Surveillance and security, where today it seems at least here in Israel, that fingerprint door locks are all the rage

Where does WebRTC play here? It might make development smoother for these companies, but so far this doesn’t seem to be the case.

One thing that goes through all use cases above, is the existence of another solution – the video doorbell. Taken into buildings, this becomes an intercom system, which again – can make use of WebRTC. And why? Because it needs bidirectional support for audio at the very least, making WebRTC a suitable alternative.

Personal security

A totally different niche is the one of personal security.

This manifests itself as apps (and services) people can use to increase their security while going about in their daily tasks. Some of these apps connect you to friends and family while others to personal security agents. The WebRTC requirement here is the same for all cases – be able to conduct voice and video calls in real time.

Taken more broadly from the personal level, the same can be implemented in campuses, cities, events, etc.

Unique (?) challenges for WebRTC with camera hardware

There are some unique challenges for WebRTC when it comes to the surveillance space, and that’s mostly a matter of hardware.

  • Costs
    • Hardware costs money. Not just the devices themselves, but their installation. This also means that hardware costs need to be kept low in most of these systems, which means less processing power available on the cameras themselves or the xDR devices
    • To drive costs down, CPUs won’t be as performant as the ones found in smartphones or PCs for example, and they would almost always rely heavily on hardware video encoding
  • Maintenance
    • Many of these hardware systems come without subscription services. This means any firmware upgrades might or might not be available. It also means that such upgrades are sometimes clunky to get done on the devices, especially when they need to be handled remotely
    • There’s a lot of physical maintenance as well involved. Cleaning lenses of cameras for example
  • Technology leaps
    • You purchase a system. It has cameras and a xDR. Time passes. A couple of years. You decide you need more cameras, replace an existing one, whatever
    • Improvements have taken place since. The system you have might not even be able to deal with the new cameras available today, and purchasing old ones might not be possible or economical anymore
    • We had this when the system in our residential building broke. The DVR had a hard drive malfunction – it didn’t record anything anymore
      • It was impossible to replace, and buying a new old system wouldn’t be the right approach
      • Some of the cameras lost quality due to their analog coax cables (I was told this is an issue), and the prediction was we’d lose more of these cables in the coming 2-5 years anyway
      • So we had to shift the whole system to an IP based one. A technology leap
      • While I don’t foresee a move away from IP, I am sure many of these systems will change in the coming years in ways that would leave some of the old hardware unusable
  • Hybrid
    • There are hybrid alternatives in this space. We ended up getting one for our building
    • Due to the technology leaps, you end up with multiple types of sensors and cameras, from different generations and technologies
    • The system that cobbles it all together (the xDR in our case) can be one that manages them all
    • Most installers won’t recommend it. It is mostly a necessary evil. Likely because it reduces the revenue of the installer and adds to the complexity of the installation and the system

Most of these issues won’t plague a software solution. But here, we end up in the real world simply because someone needs to go and install the physical cameras.

When figuring out the hardware platform to use, it is important to think of future trends and technology improvements that affect your implementation

In the case of surveillance, there’s WebRTC, future video codecs (AV1) and machine learning in the vision domain to think about. Probably also programmable photography that is bringing innovations to smartphones for a few years now

Ingress, egress and the concept of real time

Where to place WebRTC in the solution?

Since I write a lot about WebRTC, and this article is mostly about WebRTC in surveillance markets, it is THE biggest question to answer here.

There are two different places, and both are suitable, but not necessarily together in the same system.

Surveillance needs real time. Sometimes.


In our own residential building, I seldom care about the live feed from the cameras. When I do, it is to check if the front door to the building is open or not, or if there’s some area that got dirty (usually dog pee). Then most of the time is spent rewinding to figure out who caused the problem. Nothing here is real time in nature or requires sub-second latency.

Elsewhere, real time might be critical on the viewer side (egress), which brings with it the question of whether WebRTC fits here well.


Web cameras that directly stream out WebRTC to the world (or the xDR). Is that a benefit? What’s the value of it versus the existing camera technologies used?

I am not quite for or against this, as I am not really sure here. I’d say that a benefit here can be in the fact that it makes the whole technology stack simpler if you end up using WebRTC end-to-end instead of needing to switch protocols from the camera to the viewer. Just remember here that rewind and playback will likely require something other than WebRTC.

The main advantage of WebRTC here might be the removal of the need to transcode and translate across protocols and codecs. It makes xDR software simpler to write and reduces a lot of their CPU requirements, making the systems lighter and cheaper (the xDR – not the camera itself).
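As a rough sketch of that saving (the codec lists below are illustrative, not exhaustive), a gateway only needs to transcode when the camera’s codec isn’t one browsers already decode, and can otherwise just repackage the packets:

```javascript
// Sketch: a gateway decision - if the camera already produces a codec that
// WebRTC endpoints decode (browsers mandate H.264 and VP8 support), the xDR
// can repackage RTP packets instead of transcoding, saving most of its CPU
// budget. Codec name lists here are illustrative.
const BROWSER_CODECS = new Set(["H264", "VP8", "VP9", "AV1"]);

function needsTranscode(cameraCodec) {
  return !BROWSER_CODECS.has(cameraCodec.toUpperCase());
}
```

This is also why camera codec choice at purchase time quietly decides how expensive the media server side of a WebRTC deployment will be years later.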

One more thing to think of is cameras that also require bidirectional audio. Because a security guard wants to announce or warn perpetrators, or because this is a video doorbell. There, WebRTC fits nicely, though again – not mandatory (I’d still try using it there more than elsewhere).

  Going to introduce WebRTC to a surveillance system? Great. Check first where exactly within the whole architecture WebRTC fits and ask yourself why

Mobile or desktop?

Another important aspect of a surveillance system is where people go to watch the videos.

When we installed our own system, we were told that the mobile app is better than the PC app. Both were native applications, but somehow for the consumers, it meant using the smartphone. It sucks. But yes – it sucks more on the desktop. Which is crazy, considering that what you’re trying to do is watch the output of 4K cameras in order to identify people.

Then again, who is your customer?

If this is a large enterprise, where there’s going to be a fancy video wall of video feeds with a bored security guard looking at it, then should this be a native application, or would it be preferable to use a web application for it, with the help of WebRTC? It seems that much of the industry on the client side is looking for lightweight solutions that require fewer software installations, favoring browsers and… WebRTC.

And if you’re already doing WebRTC for one egress destination, you can use it for all others – browser and app based.

One more thing to consider – it is easier today to develop a web application than a native PC application. Cheaper and faster. Which means that supporting WebRTC might be the right decision to make if the desktop is your primary viewing device.

See if there’s a strong need for a zero-install or desktop viewing. This might well lead you towards WebRTC on the egress side

The age of Artificial Intelligence in surveillance tech

The biggest driver in this industry is machine learning and artificial intelligence. And not necessarily the Generative AI kind, but rather the kind that deals with object classification.

The challenge with surveillance is watching the damn cameras. You need eyeballs on screens. The good old motion detection removes a lot of noise (or more accurately, static), but it leaves much to be desired.

Here’s one of the elevators in my building, along with the video you get most hours of the day – empty. The bar at the bottom with the blue stripes marks when there’s actual movement.

Using machine learning, it will be easier to search for dogs, people, colors, items and other tidbits to figure out times of interest in the thousands of hours of boring videos, as well as act as “Google search” on recorded video feeds.

Doing all that in the cloud is possible, but expensive and tedious – how do you ship all the video, decode it, process it again, etc.

Doing it on the edge, on the device itself (the camera or the xDR), is preferable, but requires new hardware – another technology leap and refresh.

WebRTC isn’t core for surveillance but it is critical

This is something to remember.

WebRTC isn’t core to surveillance. You don’t really need it to get surveillance cameras working, installed or connected to their xDR media servers. You don’t even need it to view videos – either “live” or as playback.

But, and that’s a big one – in some cases, having WebRTC is critical. Because your customer may want to use web browsers and install nothing. They may want bidirectional media. There might be a need for video feeds at sub-second latencies.

For these, WebRTC might not be a core competency, but it is critical to the successful delivery and deployment of your product. This translates into needing that skill set on your team, or being able to outsource it to someone who has it.

Where can I help, if at all?

  • Online WebRTC courses, to skill up engineers on this technology
  • Consulting, mostly around architecture decisions and technology stack selection
  • Testing and monitoring WebRTC systems, via my role as Senior Director at Cyara (and as the co-founder of testRTC)

The post Fitting WebRTC in the brave new world of webcams, security, surveillance and visual intelligence appeared first on

Solving CPaaS vendor lock-in (as a customer and as a CPaaS vendor)

bloggeek - Tue, 09/12/2023 - 12:30

How to think and plan for CPaaS vendor lock-in when it comes to your WebRTC application implementation.

How can/should CPaaS vendors compete on winning customers? More than that, how can/should CPaaS vendors poach customers from other CPaaS vendors?

What prompted this article is the various techniques CPaaS vendors use and what they mean to customers – how should customers react to these techniques. I’ll focus on the Video API part of CPaaS – or to be more specific, the part that deals with WebRTC implementation.

What is CPaaS vendor lock-in?

For me, CPaaS (or Communication Platform as a Service) is a service that lets companies build their own communication experiences in a flexible manner. Usually this is done via APIs and requires developers, but recently also via lowcode/nocode interactions (such as embedding an iframe).

A CPaaS vendor ends up defining its own API interface, which its customers use to create these communication experiences.

That API interface is proprietary. There is no standard specification for how CPaaS APIs need to look or behave. This means that if you used such an API, and you want to switch to another CPaaS vendor – you’re going to need to do all that integration work all over again.

Think of it like switching from an Android phone to an iPhone or vice versa:

  • There’s a new interface you need to learn
    • It might be similar, since it is practically used for doing the same things
    • But it is also a bit “off”. The things you expect to be in one place are in another place
    • Settings are done differently
    • And the way you deal with the phone’s assistant (or Siri) is different as well
  • You need to install all of your apps from scratch
    • Find them in the app store, download them, install them
    • Set them up by logging in
    • Some of them you need to purchase separately all over again
    • Others you won’t find… and you’ll need to look for alternative apps instead – or decide not to use that functionality any longer
  • The behavior will be different
    • The background color of the apps
    • The way you switch between screens is different
    • The swipe “language” is also slightly different

In a way, you want the same experience (only better), but there’s going to be a learning curve and an adaptation curve where you familiarize yourself with the new CPaaS vendor and “make yourself at home”.

The vendor lock-in part is how much effort and risk you will need to invest and overcome in order to switch from one vendor to another – to call that other vendor your new home.

Vendor lock-in has 3 aspects to it in CPaaS:

  1. Difference in the API interface. That’s a purely technical one. Low risk usually, with varying degree of effort
  2. Behavioral differences. This has higher risk, with unknown effort involved. While both CPaaS vendors do the “same” thing, they do it differently, and that difference is hidden in how they behave. Your own application may rely on behavior that isn’t part of the documented official interface, and you will find out about it only once you test the migrated application on the new CPaaS vendor’s platform – or later, when things break in production
  3. Integration differences. There are things outside the official interface you might have integrated with such as logs collection, understanding and handling error codes and edge cases, ETL processes, security mechanisms, etc. These things are the ones developers usually won’t account for when estimating the effort in the beginning and will likely be caught late in the migration process itself

Vendor lock-in is scary. Not because of the technical effort involved but because of the risks from the unknowns. The more years and the more interfaces, scenarios and code you have running on a CPaaS vendor, the higher the lock-in and risk of migration you are at.

The innovation in WebRTC that CPaaS is “killing”

Before WebRTC, we had other standards. RTP and RTCP came a lot before WebRTC.

We had RTMP, RTSP, SIP and H.323.

The main theme of all these standard specifications was that their focus has always been on standardizing what goes over the network. They didn’t care or fret about the interface for the developer. The idea was to enable using the standard on whatever hardware, operating system and programming language. Just read the spec and implement it any way you like.

WebRTC changed all that (ignoring Flash here). We now have a specification where the API interface for the developer of a web application is also predefined.

WebRTC specifies what goes on the network, but also the JavaScript API in web browsers.

Here’s how I like explaining it in my slides:

One of the main advantages of WebRTC is that a developer who uses WebRTC in one project for one company can relatively easily switch to implement a different WebRTC project for another company. (that’s not really correct, but bear with me a little here)

We could now think of WebRTC just like other technologies – someone proficient in WebRTC is “comparable” to someone who has worked with Node.js or SQL or other technologies. Whereas working with SIP or H.323 begs the question of which framework or implementation was used – learning a new one has its own learning curve.

Enter CPaaS…

And now the WebRTC API interface is no longer relevant. The CPaaS vendor’s SDK has its own interface indicating how things get done. And these may or may not bear any resemblance to the WebRTC API. Moreover – it might even try very hard to hide the WebRTC stack implementation from the developer.

This piece of innovation, where a developer using WebRTC can jump into new code of another project quickly is gone now. Because the interfaces of different CPaaS vendors aren’t standardized and don’t adhere to the standard WebRTC API interface (and they shouldn’t be – it isn’t because they are mean – it is because they offer a higher level of abstraction with more complex and complete functionality).

Not having the same interface across CPaaS vendors is one of the reasons we’ve started down this rabbit hole of exploring what CPaaS vendor lock-in is exactly.

CPaaS vendor poaching techniques and how to react to them

Every so often, you see one or more CPaaS vendors trying to grab a bit more market share in this space. Sometimes, it is about enticing customers who want to start using a CPaaS vendor. Other times it is focused on trying to poach customers from other CPaaS vendors.

When looking at the latter, here are the CPaaS vendor poaching techniques I’ve seen, how effective they are, and what you as a target company should think about them.

#1 – Feature list comparisons

The easiest technique to implement (and to review) is the feature list comparison.

In it, a CPaaS vendor would simply generate and share a comparison table of how its feature set is preferable over the popular alternatives.

For a company looking to switch, this would be a great place to start. You can skim through the feature list and see exactly what’s there in the platform you are currently using and the one you are thinking of switching to.

When looking at such a list, remember and ask yourself the following questions:

  • Is this list up to date? Oftentimes, these pages are created with big fanfare when a “poaching” or comparison project is initiated by the marketing department of a CPaaS vendor. But once done, it is seldom updated to reflect the latest versions (especially the latest version of the competitor). So take the comparison with a grain of salt. It is likely to be somewhat incorrect
  • Check what your experience is with the vendor you are using versus how it is reflected in the comparison table. Does the table describe things as you see them?
  • The features that look better “on paper” in this table for the vendor you plan on switching to – do you need them? Are they critical for you today or in the near future? Or are they just nice to have?
  • The “greens” on the vendor making the comparison – are they on par with the other vendor or just a less comprehensive implementation of it? (for example, support for group calls – both vendors may support it, but one can get you to X users with open mics in a group call while the other can do 10X users)

I’ve had my fair share of reading, writing and responding to comparison tables. A long time ago (pre-WebRTC), we received inputs that our competitor could do almost 10 times the number of concurrent calls we were able to do, with much higher throughput. Obviously, we created a task force to deal with it. The conclusion was simple – the competitor didn’t measure network time at all – just CPU time on the machine. We weren’t measuring the same thing, and their choice of metric meant they always looked better.

Your role in this? To read between the lines and understand what wasn’t written. Always remember that this isn’t an objective comparison – it is highly skewed towards the author of it (otherwise, he wouldn’t be publishing it)

#2 – Performance comparisons

Here the intent of the CPaaS vendor is to show that his platform is superior in its performance. It can offer better quality, at lower bitrates and CPU use for larger groups.

If a vendor does it on his own, then potential customers will immediately view the results as suspect. This is why most of them use third party objective vendors to do these performance comparisons for them (at a cost).

We’ve done this at testRTC a couple of times – some publicly shared (for this one, I’ve placed my own reputation and testRTC’s reputation on the frontline, insisting not to name the other vendors) and others privately done. It is a fun project since it requires working towards a goal of figuring out how different CPaaS vendors behave in different scenarios.

Zoom did this as well, comparing itself to other CPaaS vendors. Agora answered in kind with a series of posts comparing themselves back to Zoom (where Zoom didn’t look as shiny).

Just remember a few things when reading such comparisons:

  • They were commissioned. They wouldn’t be published and shared if they weren’t showing what the CPaaS vendor wanted them to show
  • For me, it is more interesting to see how the setup of the performance tests was done and what was left out or missed in the comparison to begin with
    • The types of machines and browsers selected
    • Scenarios picked
    • Reference applications used for each vendor
    • How measurements are done
    • Which metrics are selected for the comparison
  • Who the vendor was looking to compare himself to
  • The CPaaS vendor usually helps and tweaks his own platform to fit the scenarios selected, while the competing vendors have no say in which of their applications or samples are used and if or how they are optimized for the scenario (hint: they aren’t)

In the end, the fact that a CPaaS vendor performs better than another in a scenario you don’t need tells you nothing. Make sure to give more weight to the results of actual scenarios relevant to you, and be sure you understand what is really being compared

#3 – Guides, how-to’s and success stories

How do you get a customer to migrate from a different CPaaS vendor to your own? You write a migration document about it. A guide. Or a how-to. Or you get a testimonial or a success story from a customer willing to share publicly that they migrated – and how much better life is for them now.

These are mainly targeted at raising the confidence level for those who are contemplating switching, signaling them that the process isn’t risky and that others have taken this path successfully already.

As someone thinking of moving from one vendor to another, I’d seriously consider reaching out to the CPaaS vendor and ask the hard questions:

  • How easy the migration really is
  • What challenges one should expect
  • Are there any common issues that migrating customers have bumped into?
  • How many such customers do they have?
  • Can they connect you with one of the customers who migrated, for a quick direct conversation?

Anecdotes and recipes are nice. What you are after is having more data points.

Read these guides and success stories. Try reading between the lines in them. Check if you have any open questions and then ask these questions directly. Gather as much information as you can to get a clearer picture

#4 – Reference applications

I wasn’t sure if this fits for migrating customers because it is a bit broader in nature. But here we are

In many cases, CPaaS vendors have reference applications available. Usually hosted on github. Just pull the code, compile, host and run it. You get an app that is “almost” ready for deployment.

You see how easy that was? Think how easy it is going to be to migrate to us with this great reference.

Remember a few things here:

  • Your workflow is likely different enough from the reference app that there’s work to be done here
  • In most cases, if you’ve built your application already on another vendor, using a reference app of another CPaaS vendor is close to impossible
  • Reference apps are just references. They usually don’t cover many of the edge cases that need handling

From my point of view, reference apps are nice to get a taste of what’s possible and how the API of a CPaaS vendor gets used. But that’s about it. They are unlikely to be useful during the migration process itself

#5 – Shims and adaptors

They say imitation is the highest form of flattery. If that is true, then shims and adapters would fit well here.

In CPaaS, the most common one was supporting TwiML (that’s Twilio’s XML “language” for actions on telephony events). There’s also the idea/intent of having the whole API interface of another CPaaS vendor (or parts of it) supported directly by the poacher. The purpose of which is to make it easy to switch over.

Clearing things up a bit:

  • CPaaS vendor A has an API interface
  • CPaaS vendor B has a different API interface
  • To make it easier to switch from vendor A to vendor B, vendor B decides to create a piece of software that translates calls of A’s API interface to B’s API interface. This is usually called a shim or an adaptor

The result? If you’re using vendor A, theoretically you can take the shim created by vendor B and magically, without any investment, migrate to vendor B. Problem solved

While this looks great on paper, I am afraid it has little chance of holding up in the real world. Here’s why:

  1. The shim created is usually partial. Especially if vendor A offers a very rich interface (most vendors will, especially in the domain of video APIs and WebRTC)
  2. Like reference applications, these shims don’t take good care of edge cases. Why? Because they aren’t used by many customers: fewer customers = less investment
  3. WebRTC is rather new, and CPaaS vendors have much to add, so every time vendor A updates his CPaaS and adds APIs to the interface – vendor B needs to invest in updating the shim. But is that even done once a shim is created? Or is it, again, placed on the back burner due to the previous rule: fewer customers = less investment
  4. Behavior. The same API interface doesn’t necessarily mean the vendors’ platforms behave the same on the network. These differences are hard to catch… and might be even harder to resolve
  5. Using a shim is nice, but if you want to use specific features available in vendor B’s interface – can you even do that if you’re doing everything via the shim? And is that the correct way to do things moving forward for you?

The thing is that using a shim still means a ton of testing – and headaches that are hard to overcome.

If I had to switch between vendors, I’d ignore such shims altogether. For me they’re more of a trap than anything else.

Someone suggesting you use their shim for switching over to their CPaaS? Ignore them and just analyze what needs to be done as if there’s no shim available. You’ll thank me later

Build vs Buy – my first preference is ALWAYS buy (=CPaaS)

We’ve seen 5 different techniques CPaaS vendors use to try and poach customers from one another. For the most part, they are of the “buyer beware” type. And yet, we do need to migrate from time to time from one CPaaS vendor to another. Market dynamics might force us to do so, or simply the need to switch to a better platform or offering.

Does that mean it would be best to go it alone and build your own platform instead of using a third party CPaaS vendor?


Vendor lock-in isn’t necessarily a bad thing. My first preference is always to adopt a CPaaS vendor. And if not to adopt one, then to articulate very clearly why the decision to build is made.

What should you do when you start using a CPaaS vendor to make the transition to another vendor (or to your own platform) smoother in the distant future? Here are a few things to consider.

  1. Limit the calls to the vendor’s API interface
    • If you can make all of them from a single source file, then great
    • Even if not, it is fine – but try not to call the vendor’s APIs and use their objects directly all over the place
    • Having it all nicely compartmentalized will reduce the amount of changes needed during a migration
  2. Consider building an abstraction layer
    • While I hate this one, it appeals to some
    • Create your own abstraction of the communications capabilities you need
    • Make that abstraction a “standardized” internal interface you follow
    • Implement the integration with the vendor as a class/object of that interface
    • This enables you to implement the next vendor (or your own platform) as yet another class/object of the same interface at some point in the future
    • Risky, as this will probably require architectural and design changes once that time comes – but it might still be the decision that gets your company moving forward
  3. Don’t use undocumented APIs and behaviors
    • These will be harder to figure out in the future
    • Making them harder to modify during a migration
  4. Assume there’s no simple solution
    • No silver bullet or magic solution here
    • Which means that time invested in catering for future multiple vendors or seamless migration paths is time wasted
    • Try to make the decisions here ones that don’t take more resources or time today due to some unknown future need – you are more likely to make a mistake in these decisions than to get them right
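If you do go down the abstraction-layer route (point 2 above), a minimal sketch could look like the following. All class and method names here are made up for illustration – they are not any vendor’s real SDK:

```javascript
// The app's own "standardized" internal interface for calling.
class CallProvider {
  join(roomId) { throw new Error("not implemented"); }
  leave() { throw new Error("not implemented"); }
}

// One adapter per vendor: only this class would import and call the vendor's SDK.
class VendorACallProvider extends CallProvider {
  join(roomId) {
    // ...the actual vendor SDK call would go here...
    return { provider: "vendor-a", roomId };
  }
  leave() { return true; }
}

// Application code only ever sees CallProvider, so switching vendors (or moving
// to your own platform) means writing one new subclass, not touching the app.
function startCall(provider, roomId) {
  return provider.join(roomId);
}
```

The trade-off mentioned above still applies: the abstraction only pays off if the next vendor’s behavior roughly fits the interface you invented, which is exactly the part you can’t know in advance.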

The post Solving CPaaS vendor lock-in (as a customer and as a CPaaS vendor) appeared first on

WebRTC cracks the WHIP on OBS

webrtchacks - Tue, 08/22/2023 - 14:28

Open Broadcast Studio or OBS is an extremely popular open-source program used for streaming to broadcast platforms and for local recording. WebRTC is the open-source real time video communications stack built into every modern browser and used by billions for their regular video communications needs. Somehow these two have not formally intersected – that is […]

The post WebRTC cracks the WHIP on OBS appeared first on webrtcHacks.

WebRTC conferences – to mix or to route audio

bloggeek - Mon, 08/21/2023 - 12:30

How do you choose the right architecture for a WebRTC audio conferencing service?

Last month, Lorenzo Miniero published an update post on work he is doing on Janus to improve its AudioBridge plugin. It touched a point that I failed to write about for a long time (if at all), so I wanted to share my thoughts and views on it as well.

I’ll start with a quick explanation – Lorenzo is adding to Janus a lot of layers and flexibility that is needed by developers who are taking the route of mixing audio in WebRTC conferences. What I want to discuss here is when to use audio mixing and when not to use it. And as everything else, there usually isn’t a clear cut decision here.

What’s mixing and what’s routing in WebRTC?

Group calls in WebRTC can take different shapes and sizes. For the most part, there are 3 dominant architectures for WebRTC multiparty calling: mesh, mixing and routing.

I’ll be focusing on mixing and routing here since they scale well to 100’s or more users.

Let’s start with the basics.

Assume there’s a conversation between 5 people. Each of these people can speak his mind and the others can hear him speaking. If all of these people are remote with each other and we now need to model it in WebRTC, we might think of it as something like this illustration:

This is known as a mesh network. Its biggest disadvantage for us (though there are others) is the messiness of it all – the number of connections between participants grows quadratically with the number of users. The fact that we need to send out the same audio stream to each participant individually is another huge disadvantage. Usually, we assume (and for good reason) that the network available to us is limited.
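To put numbers on that growth, here is a quick back-of-the-envelope sketch (function names are mine, not from the post):

```javascript
// Number of peer-to-peer links in a mesh call with n participants:
// every pair needs its own connection, so it grows quadratically.
function meshConnections(n) {
  return (n * (n - 1)) / 2;
}

// Each participant also uploads a separate copy of their audio to every peer.
function meshUplinksPerUser(n) {
  return n - 1;
}
```

For the 5-person conversation above that is 10 links, with every participant uploading 4 copies of the same audio – whereas with a central media server each participant sends a single stream.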

The immediate obvious solution is to get a central media server to mix all audio inputs, reducing all network traffic and processing from the users:

This media server is usually called an MCU (or a conferencing bridge). Users here “feel” as if they are in a session with only a single entity/user and the MCU is in charge of all the headaches on behalf of the users.

This mixer approach can be a wee bit expensive for the service provider and at times, not the most flexible of approaches. Which is why the SFU routed model was introduced, though mostly for video meetings. Here, we try to enjoy both worlds – we have the SFU route the media around, to try and keep bitrates and network use at reasonable levels while trying to reduce our hosting and media processing costs as service providers:

The SFU has become commonplace and the winning architecture model for video meetings almost everywhere. Voice only meetings though, have been somewhere in-between. Probably due to the existence and use of audio bridges a lot before WebRTC came to our lives.

This begs the question then, which architecture should we be using for our audio in group calls? Should we mix it in our media servers or just route it around like we do with video?

Before I go ahead to try and answer this question, there’s one more thing I’d like to go through, and that’s the set of media processing tools available to us today for audio in WebRTC.

Audio processing tools available for us in WebRTC

Encoding and decoding audio is the baseline thing. But other than that, there are quite a few media processing and network related algorithms that can assist applications in getting to the desired scale and quality of audio they need.

Before I list them, here are a few thoughts that came to mind when I collected them all:

  • This list is dynamic. It changes a bit every year or so, as new techniques are introduced
  • You can’t really use them all, all the time, for all use cases. You need to pick and choose the ones that are relevant to your use case, your users and the specific context you’re in
  • We now have a machine learning based tool as well. We will have more of these in a year or two for sure
  • It was a lot easier to compile this list now that we’ve finished recording and publishing all the lessons for the Higher-level WebRTC protocols course – we’ve covered most of these tools there in great detail
Audio level

There is an RTP header extension for audio level. It allows a WebRTC client to indicate the volume of the audio inside the encoded packet being sent.

The receiver can then use that information without decoding the packet at all.

What can one do with it?

  • Decide if you need to decode the packet at all – or just discard it if there’s no or little voice activity, or if the audio level is too low (no one’s going to hear what’s in there anyway)
  • Replace it with DTX (see below), or not forward the packet in a Last-N architecture (see below)
  • Not mix its content with other audio channels (it doesn’t hold enough information to be useful to anyone)
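For illustration, the one-byte form of this header extension (RFC 6464) packs a voice-activity flag and the level into a single byte. A sketch of decoding it, with my own function names:

```javascript
// Decode the RFC 6464 audio-level byte: bit 7 is the (optional) voice-activity
// flag, bits 0-6 carry the level in -dBov (0 = loudest, 127 = silence).
function parseAudioLevel(byte) {
  return {
    voiceActivity: (byte & 0x80) !== 0,
    dBov: -(byte & 0x7f),
  };
}

// A media server could use this to skip packets without decoding the audio:
function worthProcessing(byte, thresholdDbov = -60) {
  return parseAudioLevel(byte).dBov > thresholdDbov;
}
```

The threshold here is an arbitrary example – where to draw the “too quiet to bother” line is a product decision.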


Discontinuous transmission

If there’s nothing really to send – the person isn’t speaking but the microphone is open – then send “silence”, but with fewer packets over the network.

That’s what DTX is about, and it is great.

In larger meetings, most people will listen rather than speak over one another. So most audio streams will just be “silence” or muted. If they aren’t muted, then sending DTX instead of actual audio reduces the traffic generated. This can be a boon to SFUs, which end up processing fewer packets.

An SFU media server can also decide to “replace” actual audio it receives from users (because it has a low audio level, or because of Last-N decisions the SFU is making) with DTX data when routing media around.
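In browsers, one common way to request Opus DTX is munging the SDP before calling setLocalDescription – adding `usedtx=1` to the Opus fmtp line (a parameter defined in RFC 7587). A sketch, using naive string manipulation rather than a real SDP parser:

```javascript
// Enable Opus DTX by adding "usedtx=1" to the opus fmtp line of an SDP blob.
// Illustration only - not production-grade SDP handling.
function enableOpusDtx(sdp) {
  const match = sdp.match(/a=rtpmap:(\d+) opus\/48000\/2/);
  if (!match) return sdp; // no opus in this SDP
  const pt = match[1]; // opus payload type, dynamically assigned
  return sdp.replace(`a=fmtp:${pt} `, `a=fmtp:${pt} usedtx=1;`);
}
```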


Packet Loss Concealment

Packets are going to be lost, but there is still content that needs to be played back to the user.

You can decide to play silence, a repeat of the last heard packet, lower its volume a bit, etc.

This can be done either on the server side (especially in the case of an MCU mixer) or on the client side – where such algorithms are implemented in the browser already. SFUs can mostly ignore this one, since they don’t decode and process the actual media anyway.

At times, these can be done using machine learning, like Google’s proprietary WaveNetEQ, which tries to estimate and predict what was in the missing packet based on past packets received.

Packet loss concealment isn’t great at all times, but it is a necessary evil.
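As a toy illustration of the “repeat the last packet, lower its volume a bit” strategy mentioned above (nothing like a real concealment engine, just the idea):

```javascript
// Conceal a lost audio frame by repeating the last good one, fading it a bit
// more for every consecutive loss so a burst of losses decays toward silence
// instead of producing an audible buzz.
function concealLostFrame(lastGoodSamples, consecutiveLosses, fadePerLoss = 0.5) {
  const gain = Math.pow(fadePerLoss, consecutiveLosses);
  return lastGoodSamples.map((sample) => sample * gain);
}
```

The fade factor of 0.5 per lost frame is an assumption for the example; real implementations shape the decay far more carefully.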


RTX & NACK

Theoretically, you could use retransmissions for lost packets.

WebRTC does that mostly for video packets, but this can also find a home for audio.

It is/was a rather neglected area because PLC and Opus inband FEC techniques worked nicely.

For the time being, you’re likely to skip this tool, but it is one I’d keep an eye on if I were highly interested in audio quality advancements.


FEC and RED

Forward Error Correction is about sending redundant data that can be used to reconstruct lost packets. Redundancy (RED) coding is what we usually do for audio, which means duplicating encoded frames.

Audio bandwidth requirements are low, so duplicating frames doesn’t end up taxing much of our network, especially in a video call.

This approach enables us at a “low cost” to gain higher resiliency to packet losses.

This can be employed by the client sender, or even from the server side, beefing up what it received – both as an SFU or an MCU.

Check out Philipp Hancke’s talk at Kranky Geek about advances in audio codecs

Then there’s the nuances and headaches of when to duplicate and how much, but that’s for another article.
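The duplication idea can be sketched like this – a conceptual model only, since the real RED wire format is defined in RFC 2198:

```javascript
// Pack each encoded frame together with a copy of the previous one, so any
// single lost packet can be recovered from its successor.
function packWithRedundancy(frames) {
  return frames.map((frame, i) => ({
    primary: frame,
    redundant: i > 0 ? frames[i - 1] : null,
  }));
}

// Recover a lost frame from the redundant copy carried by the next packet.
function recoverFrame(packets, lostIndex) {
  const next = packets[lostIndex + 1];
  return next ? next.redundant : null;
}
```

This doubles the audio payload, which – as the post notes – is cheap relative to video, and buys resilience to any single packet loss (two consecutive losses still lose a frame).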


Last-N

A known technicality in WebRTC’s implementation is that it only mixes the 3 loudest incoming audio channels before playing back the audio.

Why 3? Because 2 wasn’t enough and 4 seemed unnecessary, is my guess. Also, the more sources you mix, the higher the noise levels are going to be, especially without good noise suppression (more on that below)

Well… Google just decided to remove that restriction. Based on the announcement, that’s because the audio decoding takes place in any case, so there isn’t much of a performance optimization in not mixing them all.

So now, you can decide if you want to mix everything (which you just couldn’t do before), or to mix or route only the few loudest (or most important) audio streams if that’s what you’re after. This reduces CPU and network load (depending on which architecture you are using).

Google Meet, for example, employs a Last-3 technique, sending only up to the 3 loudest audio streams to users in a meeting.

Oh, and if you want to dig deeper into the reasoning, there’s a nice Jitsi paper from 2016 explaining Last N.
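A sketch of what such a loudest-N selection could look like on a server, using the levels reported in the audio-level header extension (the structure of the stream objects is my own invention):

```javascript
// Pick the ids of the N loudest unmuted streams. Levels are in dBov as
// reported by the audio-level header extension: closer to 0 means louder.
function selectLoudest(streams, n = 3) {
  return [...streams]
    .filter((s) => !s.muted)
    .sort((a, b) => b.dBov - a.dBov) // e.g. -20 dBov sorts before -70 dBov
    .slice(0, n)
    .map((s) => s.id);
}
```

Real implementations also add hysteresis so streams don’t flap in and out of the selected set on every packet, but that is beyond this sketch.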

Noise suppression: RNNoise and other machine learning algorithms

Noise suppression is all the rage these days.

RNNoise is a veteran among the ML-based noise suppression algorithms that is quite popular these days.

Janus, for example, has added it to its AudioBridge plugin, implementing optional RNNoise logic to handle channel-based noise suppression in its MCU mixer for each incoming stream.

Google added this in its Google Meet cloud – its SFU implementation passes the audio to dedicated servers that handle the noise suppression – likely by decoding, applying noise suppression, and re-encoding the audio.

Many vendors today are introducing proprietary noise suppression to their solutions on the client side. These include Krisp, Dolby, Daily, Jitsi, Twilio and Agora – some via partnerships and others via self development.

Mixing keeps the headaches away from the browser

Why use an MCU for mixing your audio call? Because it takes all the implementation headaches and details away from the browser.

To understand some of what it entails on the server though, I’d refer you again to read Lorenzo’s post.

The great thing about this is that for the most part, adding more users means throwing more cloud hardware on the problem to solve it. At least up to a degree this can work well without thinking of scaling out, decentralization and other big words.

It is also how this was conducted for many years now.

Here are the tools I’d aim to use for an audio MCU:

  • Audio level – use it. Decoding fewer streams will get higher performance density for the server. Use this with Last-N logic
  • DTX – use it, both when decoding and while encoding
  • PLC – use it, on each incoming audio stream separately
  • RTX & NACK – too early to do this today
  • FEC and RED – today, for an MCU, this would be rare to see as a supported feature. Consider on outgoing audio streams, as well as enabling for incoming streams from devices
  • Last-N – Last-3 is a good default, unless you have a specific user experience in mind (see the examples below)
  • Noise suppression – on incoming channels, those that passed Last-N filtering, to clean them up before mixing the incoming streams together

Things to note with an audio MCU: it needs to generate quite a few different outgoing streams. For 10 participants with 4 active speakers (a Last-4 configuration), it would look something like this:

We have 5 separate mixers at play here:

  • 1 mixing all 4 active speakers
  • 4 mixing only 3 out of the 4 each time – we don’t want to send the person speaking their own audio mixed into the stream
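To make the mixer math concrete, here is a minimal sketch (the function name and the plain-list PCM representation are mine, not from any specific MCU) of producing these per-speaker mixes:

```python
def mcu_mixes(active_speakers, frames):
    """Given the Last-N active speakers and one decoded PCM frame per
    speaker (frames[speaker_id] is a list of 16-bit samples), return
    every mix the MCU must encode: one full mix for passive listeners,
    plus one "minus-self" mix per active speaker, so nobody hears
    their own audio played back to them."""
    def mix(ids):
        length = len(next(iter(frames.values())))
        out = [0] * length
        for sid in ids:
            for i, sample in enumerate(frames[sid]):
                out[i] += sample
        # naive summing, clipped to the 16-bit sample range
        return [max(-32768, min(32767, s)) for s in out]

    mixes = {"listeners": mix(active_speakers)}
    for sid in active_speakers:
        mixes[sid] = mix([other for other in active_speakers if other != sid])
    return mixes
```

For Last-4, this yields exactly the 5 mixers listed above: one full mix for the passive listeners and four minus-self mixes for the active speakers.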
Routing gets you better flexibility

Why do we use an SFU for audio conferences? Because we use it for video already… or because we believe this is the modern way of doing things these days.

When it comes to routing audio, the thing to remember is that we have a delicate balance between the SFU and the participants, each playing a part here to get a better experience at the end of the day.

Here are the tools I’d use for an audio SFU:

  • Audio level: a must-have, especially since we really, really want to be able to conduct Last-N logic and not send each user the audio channels of all other participants
  • DTX: we can use this to detect silence here as well (and remove silent streams from the Last-N logic). On the sending side, the SFU can decide to DTX the channels in Last-N that are silent or at a low volume to save a bit of extra bandwidth (a minor optimization)
  • PLC: not needed. We route the audio packets and let the participants fix any losses that take place
  • RTX & NACK: too early to do this today
  • FEC and RED: can be added on both the receiver and sender side of the SFU to improve audio quality. Adding logic to dynamically decide when and how much redundancy to use based on network conditions is also an advantage here
  • Last-N: Last-3 is a good default. Probably best to cap this at Last-5, since the decision here means more CPU use on the participants’ side
  • Noise suppression: not needed. This can be done on the participants’ side

In many ways, an audio SFU is simpler to implement than an audio MCU, but tweaking it just right to gain all the benefits and optimizations from the client implementation is the tricky part.
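A hypothetical sketch of what that Last-N selection could look like inside an SFU, ranking participants by the audio level header extension (RFC 6464, where 0 dBov is loudest and 127 effectively means silence):

```python
def select_last_n(levels, n=3):
    """Pick the N loudest speakers from RFC 6464 audio-level values
    (0 = loudest, 127 = silence/muted). The SFU forwards only these
    streams and drops the rest - no decoding required."""
    # Participants at level 127 are silent (or DTX'd) and never selected
    speaking = {sid: lvl for sid, lvl in levels.items() if lvl < 127}
    ranked = sorted(speaking, key=lambda sid: speaking[sid])
    return set(ranked[:n])
```

The key property here is that the SFU never touches the encoded audio itself – the header extension is readable without decrypting or decoding the Opus payload, which is what keeps routing so much cheaper than mixing.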

Where the rubber hits the road – let’s talk use cases

As with everything else I deal with, which approach to use depends on the circumstances. One of the main deciding criteria in this case is going to be the use case you are dealing with and the scenario you are solving this for.

Here are a few that came to mind.

Gateway to the old world

The first one is borderline “obvious”.

Before WebRTC, no one really did an audio conference using an SFU architecture. And if they did, it was unique, proprietary and special. The world revolved, and still revolves, around MCUs and mixing audio bridges.

If your service needs to connect to legacy telephony services, existing deployments of VoIP services running over SIP (or god forbid H.323), connect to a large XMPP network – whatever it may be – that “other” world is going to be running as an MCU. Each device is likely capable of handling only one incoming audio stream.

So connecting a few users from your service (whether it uses an SFU or an MCU) to the legacy service means you will need to mix these users first.

Video meetings with mixed audio

There are services that decide to use an SFU to route video streams and an MCU for the audio streams.

Sometimes, it is because the main service started as an audio service (so an audio bridge was/is at the heart of the service already) and video was bolted on the platform. Sometimes it is because gatewaying to the old world is central to the service and its mindset.

Other times, it is due to an effort to reduce the number of audio streams being sent around, or to reduce the technical requirements of audio only participants.

Whatever the reason, this is something you might bump into.

The big downside of such an approach is the loss of lip synchronization. There is no practical way you can synchronize a single audio stream that represents mixed content of multiple video streams. In fact, no lip synchronization with any of the video streams takes place…

Usually, the excuse I hear is that the latency difference isn’t noticeable and no one complained. Which begs the question – why do we bother with lip synchronization mechanisms at all then? (we do because it does matter and is noticeable – especially when the network is slightly bumpier than usual)

Experience the crowd

Think of a soccer game. 50,000 people in a stadium. Roaring when there’s a goal or a miss.

With only Last-3 audio streams mixed, remote viewers wouldn’t hear anything interesting when this takes place.

The same applies to a virtual online concert.

Part of the experience you are trying to convey is the crowds and the noises and voices they generate.

If we’re all busy reducing noise levels, suppressing it, picking and choosing the 2-3 voices in the crowd to mix, then we just degrade the experience.

Crowds matter in some scenarios. And preserving their experience cannot be done by routing audio streams around. Especially not once we’re talking about hundreds of active participants or more.

This case necessitates the use of MCU audio bridging. And likely a distributed approach the moment the numbers of users climb higher.

Metaverse and spatial audio

The metaverse is coming. Or will be. Maybe. Now that Apple Vision Pro is upon us. But even before that, we’ve seen some metaverse use cases.

One thing that comes to mind here is the immersion part of it, which leads to spatial audio. The intent of hearing multiple sounds coming from different directions – based on where the speaker is.

This means several things:

  1. For each user, the angle and distance (=volume level) of each other person speaking is going to be different
  2. That Last-3 strategy doesn’t work anymore. If directionality and volume levels can be distinguished individually, then more sources might need to be “mixed” here
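As a toy illustration of the first point (a flat 2D world with simple inverse-distance falloff and stereo panning; a real spatial audio engine would use HRTFs instead), here is how per-listener parameters could be derived:

```python
import math

def spatial_params(listener, sources):
    """For each audio source, derive a per-listener gain from distance
    and a stereo pan from the angle between the listener's facing
    direction and the source position. Positions are (x, y) tuples;
    every listener gets a different result for the same sources."""
    lx, ly = listener["pos"]
    out = {}
    for sid, (sx, sy) in sources.items():
        dist = math.hypot(sx - lx, sy - ly)
        gain = 1.0 / max(1.0, dist)        # quieter with distance
        angle = math.atan2(sy - ly, sx - lx) - listener["facing"]
        pan = math.sin(angle)              # -1 and +1 are the two hard sides
        out[sid] = {"gain": round(gain, 3), "pan": round(pan, 3)}
    return out
```

Since the output differs per listener, doing this in an MCU means one mix per participant, while doing it in an SFU model pushes the per-source gain/pan computation to each client.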

Do you do that on the client side by way of an SFU implementation, or would it be preferable to do this in an MCU implementation?

And what about trying to run concerts in the metaverse? How do you give the notion of the crowds on the audio side?

These are questions that definitely don’t have a single answer.

In all likelihood, in some metaverse cases, the SFU model will be the best architectural approach while in others an MCU would work better.

Recording it all

Not exactly a use case in its own right, but rather a feature that is needed a lot.

When we need to record a session, how do we go about doing that?

Today, at least 99% of the time, that would be done by mixing all audio and video sources and creating a single stream that can be played as a “regular” mp4 file (or similar).

Recording as a single stream means using an MCU-like solution. Sometimes by implementing it in a headless browser (as if this is a silent participant in the session) and other times by way of dedicated media servers. The result is similar – mixing the multiple incoming streams into a single outgoing one that goes directly to storage.

The downside of this, besides spending energy on mixing something that people might never watch (a decision point for which architecture to pick, for example), is that you get to view and hear only a single viewpoint of a single user – the mixed recording is already “opinionated” about the viewpoint it took.

We can theoretically “record” the streams separately and then play them back separately, but that’s not that simple to achieve, and for the most part, it isn’t commonplace.

A kind of compromise we see today with professional recording and podcast services is to record both a mixed track and separate per-participant audio tracks. This lets post-production pick whichever fits the mixing needs, albeit manually.

Which will it be? MCU or SFU for your next audio meeting?

We start with this, and we will end with this.

It depends.

You need to understand your requirements and from there see if the solution you need will be based on an MCU, an SFU or both. And if you need help with figuring that out, that’s what my WebRTC courses are for – check them out.

The post WebRTC conferences – to mix or to route audio appeared first on

10 Years of webrtcHacks – merch and stats

webrtchacks - Mon, 07/24/2023 - 22:11

webrtcHacks celebrates our 10th birthday today 🎂. To commemorate this day, I’ll cover 2 topics here: Our new merch store Some stats and trends looking back on 10 years of posts We have the Merch In the early days of webrtcHacks, co-founder Reid Stidolph ordered a bunch of stickers which proved to be extremely popular. […]

The post 10 Years of webrtcHacks – merch and stats appeared first on webrtcHacks.

WebCodecs, WebTransport, and the Future of WebRTC

webrtchacks - Tue, 07/18/2023 - 14:30

Explore the future of Real-Time Communications with WebrtcHacks as we delve into the use of WebCodecs and WebTransport as alternatives to WebRTC's RTCPeerConnection. This comprehensive blog post features interviews with industry experts, a review of potential WebCodecs+WebTransport architecture, and a discussion on real-time media processing challenges. We also examine performance measurements, hardware encoder issues, and the practicality of these new technologies.

The post WebCodecs, WebTransport, and the Future of WebRTC appeared first on webrtcHacks.

New: Higher-Level WebRTC Protocols course

bloggeek - Mon, 07/17/2023 - 12:30

A new Higher-level WebRTC protocols course and discounts, available for a limited period of time.

Over a year ago, Philipp Hancke came to me with the idea of creating a new set of courses. Ones that will dig deeper into the heart of the protocols used in WebRTC. This being a huge undertaking, we decided to split it into several courses, and focus on the first one – Low-level WebRTC protocols.

We received positive feedback about it, so we ended up working on our second course in this series – Higher-level WebRTC protocols.

Why the need for additional WebRTC courses?

There is always something more to learn.

The initial courses at WebRTC Course were focused on giving an understanding of the different components of WebRTC itself and on getting developers to be able to design and then implement their application.

What was missing in all that was a closer look at the protocols themselves: at what goes on in the network, and being able to understand what goes over the wire. Which is why we started the protocols courses.

Where the Low-level WebRTC protocols course looks directly at what goes over the network with WebRTC, our newer Higher-level WebRTC protocols course takes it up one level:

This time, we’re looking at the protocols that make use of RTP and RTCP to make the job of real time communications manageable.

If you don’t know exactly what header extensions are, and how they work (and why), or the types of bandwidth estimation algorithms that WebRTC uses – and again – how and why – then this course is for you.

If you know RTP and RTCP really well, because you’ve worked in the video conferencing industry, or have done SIP for years – then this course is definitely for you.

Just understanding the types of RTP header extensions that WebRTC ends up using, many of them proprietary, is going to be quite a surprise for you.

Our WebRTC Protocols courses

Got a use case where you need to render remote machines using WebRTC? Such use cases sit at the cutting edge of WebRTC – or more accurately, at a slightly skewed angle versus what the general population (Google included) does with WebRTC.

Taking upon yourself such a use case means you’ll need to rely more heavily on your own expertise and understanding of WebRTC.

There are now 2 available protocols courses for you:

  1. Low-level WebRTC protocols
  2. Higher-level WebRTC protocols (half-complete. Call it a work in progress)

And there are 2 different ways to purchase them:

  1. Each one separately – low and high
  2. As part of the bigger ALL INCLUDED WebRTC Developer bundle (the Higher-level course was just added to it)

You should probably hurry though…

  • There’s a 40% discount on the Higher-level WebRTC protocols course. This early-bird discount will be available until the end of this month ($180 instead of $300)
  • There’s also a 20% discount on all courses and ebooks. Call it a summer sale – this one is available using discount code SUMMER

Check out my WebRTC courses

The post New: Higher-Level WebRTC Protocols course appeared first on

Cloud gaming, virtual desktops and WebRTC

bloggeek - Mon, 07/03/2023 - 13:30

WebRTC is an important technology for cloud gaming and virtual desktop type use cases. Here are the reasons and the challenges associated with it.

Google launched and shut down Stadia. A cloud gaming platform. It used WebRTC (yay), but it didn’t quite fit into Google’s future it seems.

That said, it does shed light on a use case that I’ve been “neglecting” in my writing here, though it was and is definitely top of mind in discussions with vendors and developers.

What I want to put in writing this time is cloud gaming as a concept, and then alongside it, all virtual desktops and cloud rendering use cases.

Let’s dig in

The rise and (predictable?) fall of Google Stadia

Google Stadia started life as Project Stream inside Google.

Technically, it made perfect sense. But at least in hindsight, the business plan wasn’t really there. Google is far removed from gaming, game developers and gamers.

On the technical side, the intent was to run high end games on cloud machines that would render the game and then have someone play the game “remotely”. The user gets a live video rendering of the game and sends back console signals. This meant games could be as complex as they need be and get their compute power from cloud servers, while keeping the user’s device at the same spec no matter the game.

Source: Google

I’ve added the WebRTC text on the diagram from Google – WebRTC was called upon so that the player could use a modern browser to play the game. No installation needed. This can work nicely even on iOS devices, where Apple is adamant about their part of the revenue sharing on anything that goes through the app store.

Stadia wanted to solve quite a few technological challenges:

  • Running high end console games on cloud machines
  • Remotely serving these games in real time
  • Playing the game inside a browser (or an equivalent)

And likely quite a few other challenges as well (scaling this whole thing and figuring out how to obtain and keep so many GPUs for example).

Technically, Stadia was a success. Businesswise… well… it shut down a little over 3 years after its launch – so not so much.

What Stadia did though, was show that this is most definitely possible.

WebRTC, Cloud gaming and the challenges of real time

To get cloud gaming right, Google had to do a few things with WebRTC. Things they hadn’t really needed when the main use of WebRTC at Google was Google Meet: lowering the latency, dealing with a larger color space and aiming for 4K resolution at 60 fps. What they got virtually for “free” with WebRTC was its data channel – the means to send game controller signals quickly from the player to the gaming machine in the cloud.

Let’s see what it meant to add the other three things:

4K resolution at 60 fps

Google aimed for high end games, which meant higher resolutions and frame rates.

WebRTC is/was great for video conferencing resolutions. VGA, 720p and even 1080p. 4K was another jump up that scale. It requires more CPU and more bandwidth.

Luckily, for cloud gaming, the browser only needs to decode the video and not encode it. Which meant the real issue, besides making sure the browser can actually decode 4K resolutions efficiently, was to conduct efficient bandwidth estimation.

As an algorithm, bandwidth estimation is finely tuned and optimized for given scenarios. 4K cloud gaming being a new scenario meant that the bitrates needed weren’t 2Mbps or even 4Mbps, but rather in the range of 10-35Mbps.

The built-in bandwidth estimator in WebRTC can’t handle this… but the one Google built for the Stadia servers can. On the technical side, this was made possible by Google relying on sender-side bandwidth estimation techniques using transport-cc.
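Just to give a feel for the flavor of such an estimator – the structure and all constants below are illustrative only, not WebRTC’s actual GCC (Google Congestion Control) implementation:

```python
def adjust_bitrate(current_bps, delay_gradient, loss_rate,
                   min_bps=300_000, max_bps=35_000_000):
    """Toy bitrate adaptation in the spirit of sender-side congestion
    control: back off hard on packet loss, ease off when queuing delay
    grows, probe upward otherwise. Thresholds and factors here are
    made up for illustration."""
    if loss_rate > 0.10:                  # heavy loss: cut aggressively
        target = current_bps * (1 - 0.5 * loss_rate)
    elif delay_gradient > 0:              # queues building up: gentle decrease
        target = current_bps * 0.95
    else:                                 # all clear: multiplicative probe
        target = current_bps * 1.08
    return round(max(min_bps, min(max_bps, target)))
```

The cloud gaming twist is mostly in the clamp values: a conferencing estimator tops out at a few Mbps, while a Stadia-class one has to comfortably probe into the tens of Mbps.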

Lower latency: playout delay

Remember this diagram?

It can be found in my article titled With media delivery, you can optimize for quality or latency. Not both.

WebRTC is designed and built for lower latency, but in the sub-second latency, how would you sort the latency requirements of these 3 activities?

  1. Nailing a SpaceX rocket landing
  2. Playing a first-person shooter game (as old as I am, that means Doom or Quake for me)
  3. Having an online meeting with a potential customer

WebRTC’s main focus over the years has been online meetings. This means having 100 milliseconds or 200 milliseconds delay would be just fine.

With an online game? 100 milliseconds is the difference between winning and losing.

So Google tried to reduce latency even further with WebRTC by adding a concept of Playout Delay. The intent here is to let WebRTC know that the application and use case prefers playing out the media earlier and sacrificing even further in quality, versus waiting a bit for the benefit of maybe getting better quality.
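As a toy illustration of that trade-off (a hypothetical helper, not the actual playout-delay header extension API), consider how an application might pick its playout target from recent jitter measurements:

```python
def playout_target_ms(recent_jitter_ms, min_delay_ms=0, max_delay_ms=200,
                      percentile=0.95):
    """Pick a playout delay target from recent inter-arrival jitter
    samples. A meeting app sits near a high percentile so almost every
    packet arrives in time (quality); a cloud-gaming app pins the
    min/max bounds low and accepts concealment artifacts (latency)."""
    if not recent_jitter_ms:
        return min_delay_ms
    ordered = sorted(recent_jitter_ms)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return max(min_delay_ms, min(max_delay_ms, ordered[idx]))
```

A conferencing configuration might call this with the defaults, while a gaming one would pass something like `max_delay_ms=50`, telling the receiver to play out early even if some frames suffer for it.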

Larger color space

Video conferencing and talking heads don’t need much. If you recall, with video compression what we’re after is to lose as much as we can from the original video signal and then compress. The idea here is that whatever the eye won’t notice – we can make do without.

Apparently, for talking heads we can lose more of the “color” and still be happy versus doing something similar for an online game.

To make a point: if you watched Game of Thrones at home, then you may remember the botched last season, with some episodes that ended up being too dark for television. That was due to compression done by service providers…

So far this is my favorite screenshot from #BattleForWinterfell #GameofThrones

— Lady Emily (@GreatCheshire) April 29, 2019

While different from the color space issue here, it goes to show that how you treat color in video encoding matters. And it differs from one scenario to another.

When it comes to games, a different treatment of color space was needed. Specifically, moving from SDR to HDR, adding an RTP header extension in the process to express that additional information.

Oh, and if you want to learn more about these changes (especially resolution and color space), then make sure to watch this Kranky Geek session by YouTube about the changes they had to make to support Stadia:

What’s in cloud gaming anyway?

Here’s the thing. Google Stadia is one end of the spectrum in gaming and in cloud gaming.

Throughout the years, I’ve seen quite a few other reasons and market targets for cloud gaming.

Types of cloud games

Here are the ones that come off the top of my head:

  • High end gaming. That’s the Google Stadia use case. Play a high end game anywhere you want on any kind of device. This reduces the reliance and need to upgrade your gaming hardware all the time
    • You’ll find NVIDIA, Amazon Luna and Microsoft xCloud focused in this domain
    • How popular/profitable this is is still questionable
  • Console gaming. PlayStation, Xbox, Switch. Whatever. Picking a game and playing without waiting to download and install is great. It also allows reducing/removing the hard drive from these devices (or shrinking them in size)
  • Mobile games. You can now sample mobile apps and games before downloading them, running them in the cloud. Other things here? You could play games of other users using their account and the levels they reached instead of slaving your way there
  • Retro/emulated games. There’s a large and growing body of games that can’t be played on today’s machines because their original hardware is too old. These can be emulated, and now also played remotely as cloud games. How about playing a PlayStation 2 game? Or an old and classic SEGA arcade game? I’m thinking Golden Axe
Improved gameplay

Why not even play these games with others remotely?

My son recently had a sit down with 4 other friends, all playing a TMNT game together on Xbox. It was great having them all over, but you could do it remotely as well. If the game doesn’t offer remote players, pushing it to the cloud gets you that feature simply because all users immediately become remote players.

At this stage, you can even add a voice conference or a video call to the game between the players. Just to give them the level of collaboration they can get out of playing the likes of Fortnite. Granted, this requires more than just game rendering in the cloud, but it is possible and I do see it happen with some of the vendors in this space.

Beyond cloud gaming – virtual desktop, remote desktop and cloud rendering

Lower latencies. Bigger color space. Higher resolutions. Rendering in the cloud and consuming remotely.

All these aren’t specific to cloud gaming. They can easily be extended to virtual desktop and remote desktop scenarios.

You have a machine in the cloud – big or small or even a cluster. That “machine” handles computations and ends up rendering the result to a virtual display. You then grab that display and send it to a remote user.

One use case can just be a remote desktop a-la VNC. Here we’re actually trying to get connected from one machine to another, usually in a private and secure peer-to-peer fashion, which is different from what I am aiming for here.

Another use case, less talked about, is doing things like Photoshop operations in the cloud. For the poor sad people like me who don’t have the latest Mac Pro with the shiny M2 Ultra chip, I might just want to “rent” the compute power online for my image or video editing jobs.

I might want to open a rendered 3D view of a sports car I’d like to buy, directly from the browser, having the ability to move my view around the car.

Or it might just be a simple VDI scenario, where the company (usually a large financial institute, but not only) would like the employees to work on Chromebook machines but have nothing installed or stored in them – all consumed by accessing the actual machine and data in their own corporate data center or secure cloud environment.

A good friend of mine asked me what PC to buy for himself. He needed it for work. He is a lawyer. My answer was the lowest end machine you can find would do the job. That saved him quite a lot of money I am guessing, and he wouldn’t even notice the difference for what he needs it for.

But what if he needs a bit more juice and power every once in a while? Can renting that in the cloud make a difference?

What about the need to use specialized software that is hard to install and configure? Or that requires a lot of collaboration on large amounts of data that need to be shared across the collaborators?

Taking the notion and capabilities of cloud gaming and applying them to non-gaming use cases can help us with multiple other requirements:

  1. CPU and memory requirements that can’t be met with a local machine easily
  2. The need to maintain privacy and corporate data in work from home environments
  3. Zero install environment, lowering maintenance costs

Do these have to happen with WebRTC? No

Can they happen with WebRTC? Yes

Would changing from proprietary VDI environments to open standard WebRTC in browsers improve things? Probably

Why use WebRTC in cloud gaming

Why even use WebRTC for cloud gaming or more general cloud rendering then?

With cloud gaming, we’re fine doing it from inside a dedicated app. So WebRTC isn’t really necessary. Or is it?

In one of our recent WebRTC Insights issues we’ve highlighted that Amazon Luna is dropping the dedicated apps in favor of the web (=WebRTC). From that article:

“We saw customers were spending significantly more time playing games on Luna using their web browsers than on native PC and Mac apps. When we see customers love something, we double down. We optimized the web browser experience with the full features and capabilities offered in Luna’s native desktop apps so customers now have the same exact Luna experience when using Luna on their web browsers.”

Browsers are still a popular enough alternative for many users. Are these your users too?

If you need or want web browser access for a cloud gaming / cloud rendering application, then WebRTC is the way to go. It is a slightly different opinion than the one I had with the future of live streaming, where I stated the opposite:

“The reason WebRTC is used at the moment is because it was the only game in town. Soon that will change with the adoption of solutions based on WebTransport+WebCodecs+WebAssembly where an alternative to WebRTC for live streaming in browsers will introduce itself.”

Why the difference? It is all about the latency we are willing to accommodate:

Your mileage may vary when it comes to the specific latency you’re aiming for, but in general – live streaming can live with slightly higher latency than our online meetings. So something other than WebRTC can cater for that better – we can fine tune and tweak it more.

Cloud gaming needs even lower latency, and WebRTC can accommodate that. Using something else that is unproven (and currently suffers a bit from performance and latency issues) is the wrong approach. At least today.

Enter our WebRTC Protocols courses

Got a use case where you need to render remote machines using WebRTC? Such use cases sit at the cutting edge of WebRTC – or more accurately, at a slightly skewed angle versus what the general population (Google included) does with WebRTC.

Taking upon yourself such a use case means you’ll need to rely more heavily on your own expertise and understanding of WebRTC.

Over a year ago I launched with Philipp Hancke the Low-level WebRTC Protocols course. We’re now recording our next course – Higher-level WebRTC Protocols. 

If you are interested in learning more about this, be sure to join our waiting list for once we launch the course

Join the course waiting list

Oh, and I’d like to thank Midjourney for releasing version 5.2 – awesome images

The post Cloud gaming, virtual desktops and WebRTC appeared first on

Apple Vision, VR/AR, the metaverse and what it means to the web and WebRTC

bloggeek - Mon, 06/19/2023 - 13:30

The Apple Vision pro is a new VR/AR headset. Here are my thoughts on if and how it will affect the metaverse and WebRTC.

There were quite a few interesting announcements and advances made in recent months that got me thinking about this whole area of the metaverse, augmented reality and virtual reality. All of which culminated with Apple’s unveiling last week of the Apple Vision Pro. For me, the prism from which I analyze things is the one of communication technologies, and predominantly WebRTC.

A quick disclaimer: I have no clue about what the future holds here or how it affects WebRTC. The whole purpose of this article is for me to try and sort my own thoughts by putting them “down on paper”.

Let’s get started then

The Apple Vision Pro

Apple just announced its Vision Pro VR/AR headset. If you’re reading this blog, then you know about this already, so there isn’t much to say about it.

For me? This is the first time that I had this nagging feeling for a few seconds that I just might want to go and purchase an Apple product.

Most articles I’ve read were raving about this – especially the ones who got a few minutes to play with it at Apple’s headquarters.

AR/VR headsets thus far have been taking one of the two approaches:

  1. AR headsets were more akin to “glasses” with a heads-up display, where the augmentation took place by displaying additional information on top of reality. Think Google Glass
  2. VR headsets, where you wear a whole new world on top of your head, looking at a video screen that replaces the real world altogether

Apple took the middle ground – their headset is a VR headset, since it replaces what you see with two high resolution displays – one for each eye. But it acts as an AR headset – because it uses external cameras on the headset to project the world onto these displays.

The end result? Expensive, but probably with better utility than any other alternative, especially once you couple it with Apple’s software.

Video calling, FaceTime, televisions and AR

Almost at the sidelines of all the talks and discussions around Apple Vision Pro and the new Mac machines, there have been a few announcements around things that interest me the most – video calling.

FaceTime and Apple TV

One of the challenges of video calling has been to put it on the television. This used to be called a lean back experience for video calling, in a world predominantly focused on lean forward when it comes to video calling. I remember working on such proof of concepts and product demos with customers ~15 years ago or more.

These never caught on.

The main reason was somewhere between the cost of the hardware, maintaining privacy with a living room camera, and microphone positioning/noise.

By tethering the iPhone to the television, the cost of hardware along with maintaining privacy gets solved. The microphones are now a lot better than they used to be – mostly due to better software.

Apple, being Apple, can offer a unique experience because they own and control the hardware – both of the phone and the set-top box. Something that is hard for other vendors to pull off.

There’s a nice concept video on the Apple press release for this, which reminded me of this Facebook (now Meta) Portal presentation from Kranky Geek:

Can Android devices pull off the same thing, connected to Chromecast-enabled devices maybe? Or is that too much to ask?

Do television and/or set-top box vendors put an effort into a similar solution? Should they be worried in any way?

Where could/should WebRTC play a role in such solutions, if at all?

FaceTime and Apple Vision Pro

How do you manage video calls with a clunky AR/VR headset plastered on your face?

First off, there’s no external camera “watching you”, unless you add one. And then there’s the nagging thing of… well… the headset:

Apple has this “figured out” by way of generating a realistic avatar of you in a meeting. What is interesting to note here is that in the Apple Vision Pro announcement video itself, Apple made three important omissions:

  1. They don’t show how the other people in the meeting see the person with the Vision headset on
  2. There’s only a single person with a Vision headset on, and we have his worldview, so again, we can’t see how others with a Vision headset look in such a call
  3. How do you maintain eye contact, or even know where the user is looking at? (a problem today’s video calling solutions have as well)

What do the people at the meeting see of her? Do they see her looking at them, or the side of her head? Do they see the context of her real-life surroundings or a virtual background?

I couldn’t find any person who played with the Apple Vision Pro headset and reported using FaceTime, so I am assuming this one is still a work in progress. It will be really interesting to see what they come up with once this is released to market, and how real-life use looks and feels.

Lifelike video meetings: Just like being there

Then there’s telepresence. This amorphous thing which for me translates into: “expensive video conferencing meeting rooms no one can purchase unless they are too rich”.

Or if I am a wee bit less sarcastic – it is where we strive to with video conferencing – what would be the ultimate experience of “just like being there” done remotely if we had the best technology money can buy today.

Google Project Starline is the current poster child of this telepresence technology.

The current iteration of telepresence strives to provide 3D lifelike experience (with eye contact obviously). To do so while maintaining hardware costs down and fitting more environments and hardware devices, it will rely on AI – like everything else these days.

The result as I understand it?

  • Background removal/replacement
  • Understanding depth, to be able to generate a 3D representation of a speaker on demand and fit it to what the viewer needs, as opposed to what the cameras directly capture

Now look at what FaceTime on an Apple Vision Pro really means:

Generate a hyper realistic avatar representation of the person – this sounds really similar to removing the background and using cameras to generate a 3D representation of the speaker (just with a bit more work and a bit less accuracy).

Both Vision Pro and Starline strive for lifelike experiences between remote people. Starline goes for a meeting room experience, capturing the essence of the real world. Vision Pro goes after a mix between augmented and virtual reality here – can’t really say this is augmented, but can’t say this is virtual either.

A telepresence system may end up selling a million units a year (a gross exaggeration on my part as to the size of the market, if you take the most optimistic outcome), whereas a headset will end up selling in the tens of millions or more once it is successful (and this is probably a realistic estimate).

What both of these ends of the same continuum of a video meeting experience do is they add the notion of 3D, which in video is referred to as volumetric video (we need to use big fancy words to show off our smarts).

And yes, that does lead me to the next topic I’d like to cover – volumetric video encoding.

Volumetric video coding

We have the metaverse now. Virtual reality. Augmented reality. The works.

How do we communicate on top of it? What does a video look like now?

The obvious answer today would be “it’s a 3D video”. And now we need to be able to compress it and send it over the network – just like any other 2D video.

The Alliance for Open Media, which has been behind the publication and promotion of the AV1 video codec, just published a call for proposals related to volumetric video compression. From the proposal, I want to focus on the following tidbits:

  • The Alliance’s Volumetric Visual Media (VVM) Working Group formed in February 2022 – this is rather new
  • It is led by Co-Chairs Khaled Mammou, Principal Engineer at Apple, and Shan Liu, Distinguished Scientist and General Manager at Tencent – Apple… me thinking Vision Pro
  • The purpose is the “development of new tools for the compression of volumetric visual media” – better compression tools for 3D video

This being promoted now, in the same week Apple Vision Pro comes out, might be a coincidence. Or it might not.

The founding members include all the relevant vendors interested in AR/VR that you’d assume:

  • Apple – obviously
  • Cisco – WebEx and telepresence
  • Google – think Project Starline
  • Intel & NVIDIA – selling CPUs and GPUs to all of us
  • Meta – and their metaverse
  • Microsoft – with Teams, Hololens and metaverse aspirations

The rest also have vested interest in the metaverse, so this all boils down to this:

AR/VR requires new video coding techniques to enable better and more efficient communications in 3D (among other things)

Apple Vision Pro isn’t alone in this, but likely the one taking the first bold steps

The big question for me is this – will Apple go off with its own volumetric video codecs here, touting how open they are (think FaceTime open), or will they embrace the Alliance for Open Media work that they themselves are co-chairing?

And if they do go for the open standard here, will they also make it available for other developers to use? Me thinking… WebRTC

Is the metaverse web based?

Before tackling the notion of WebRTC into the metaverse, there’s one more prerequisite – that’s the web itself.

Would we be accessing the metaverse via a web browser, or a similar construct?

For an open metaverse, this would be something we’d like to have – the ability to have our own identity(ies) in the metaverse go with us wherever we go – between Facebook, to Roblox, through Fortnite or whatever other “domain” we go to.

Last week also got us this sideline announcement from Matrix: Introducing Third Room TP2: The Creator Update

Matrix, an open source and open standard for decentralized communications, has been working on Third Room, which for me is a kind of metaverse infrastructure for the web. Like everything related to the metaverse, this is mostly a work in progress.

I’d love the metaverse itself to be web based and open, but it seems most vendors would rather have it limited to their own walled gardens (Apple and Meta certainly would love it that way; so would many others). I definitely see how open standards might end up being used in the metaverse (like the work the Alliance for Open Media is doing), but the vendors who adopt these open standards will end up deciding how open to make their implementations – and whether the web will be the place to do it all or not.

Where would one fit WebRTC in the metaverse, AR and VR?

Maybe. Maybe not.

The unbundling of WebRTC makes it an option here, while at the same time taking us farther away from having WebRTC as part of the future metaverse.

Not having the web means no real reliance on WebRTC.

Having the tooling in WebRTC to assist developers in the metaverse means there’s incentive to use and adopt it even without the web browser angle of it.

WebRTC will need at some point to deal with some new technical requirements to properly support metaverse use cases:

  • Volumetric video coding
  • Improved spatial audio capabilities
  • A higher number of audio streams that can be mixed (the maximum is 3 today)

We’re still far away from that target, and there will be a lot of other technologies that will need to be crammed in alongside WebRTC itself to make this whole thing happen.

Apple’s new Vision Pro might accelerate that trajectory of WebRTC – or it might just do the opposite – solidify the world of the metaverse inside native apps.

I want to finish this off with this short piece by Jason Fried: The visions of the future

It looks at AR/VR and generative AI, and how they are two exact opposites in many ways.

Recently I also covered ChatGPT and WebRTC – you might want to take a look at that while at it.

The post Apple Vision, VR/AR, the metaverse and what it means to the web and WebRTC appeared first on

Livestream this Friday: WebCodecs, WebTransport, and the Future of WebRTC

webrtchacks - Tue, 06/06/2023 - 14:10

Here at webrtcHacks we are always exploring what’s next in the world of Real Time Communications. One area we have touched on a few times is the use of WebCodecs with WebTransport as an alternative to WebRTC’s RTCPeerConnection. There have been several recent experiments by Bernard Aboba – WebRTC & WebTransport Co-Chair and webrtcHacks regular, […]

The post Livestream this Friday: WebCodecs, WebTransport, and the Future of WebRTC appeared first on webrtcHacks.

Is WebRTC really free? The costs of running a WebRTC application

bloggeek - Mon, 06/05/2023 - 12:30

Is WebRTC really free? It is open source, and widely used because of it. But it isn’t free when it comes to running and hosting your own WebRTC applications.

If you are new to WebRTC, then start here – What is WebRTC?

Time to answer this nagging question:

Is WebRTC really free?

One of the reasons that WebRTC is the most widely used developer technology for real time communications in the world is that it is open source. It helps a lot that it comes embedded and available in all modern browsers. That means that anyone can use WebRTC for any purpose they see fit, without paying any upfront licensing fee or later on royalties. This has enabled thousands of companies to develop and launch their own applications.

But does that mean every web application built with WebRTC is free? No. WebRTC may well be free, but whatever is bolted on top of it might not be. And then there are still costs involved with getting a web application online and dealing with traffic costs.

For that reason, in this article, I’ll be touching on why WebRTC really is free, and what you have to factor in for it if you want to get your own WebRTC application.

Yes. WebRTC itself is completely free

Since I am sure you didn’t really go read that other article – I’ll suggest it here again: What is WebRTC?

The TL;DR version of it?

The WebRTC software library is open sourced under a permissive open source license. That means its source code is available to everyone AND that individuals and companies can modify and use it anywhere they wish without needing to contribute back their changes. It makes it easier for commercial software to be developed with it (even when no changes or improvements are made to the base WebRTC library – just because of how corporate lawyers are).

You see? WebRTC really is free.

Google “owns” and maintains the main WebRTC library implementation. Everyone benefits from this. That said, they aren’t doing this purely out of the goodness of their heart – they have their own uses for WebRTC they focus on.

However, there are costs involved with running a WebRTC application

While you don’t have to pay anything for WebRTC itself, there’s the application you develop, publish and then maintain. There are costs that come into play here – and considerable ones. These costs can vary depending on your requirements. 

I’d like to split the costs here into 3 components:

  1. The cost of developing a WebRTC application
  2. How expensive it is to optimize a WebRTC implementation
  3. Hosting and maintenance costs of a WebRTC application
1. The cost of developing a WebRTC application

The first thing you can put as a cost is to build the WebRTC application itself.

Here, as in all other areas, there’s more demand than supply when it comes to skilled WebRTC engineers. So much so that I had to write an article about hiring WebRTC developers – and I still send this link multiple times a month when asked about this.

Here too, you should split the cost into two parts:

  1. How much does it cost to develop your application?
  2. The WebRTC part of the application – how much investment do you need to put on it?

Since everything done in WebRTC requires skilled engineers (that are scarce when it comes to WebRTC expertise), you can safely assume it is going to be a wee bit more expensive than you estimate it to be.

2. How expensive it is to optimize a WebRTC implementation

I know what you’re going to say. Your WebRTC application is going to be awesome. Glorious. Superb. It is going to be so good that it will wipe the floor with the existing solutions such as Zoom, Google Meet and Microsoft Teams.

That kind of a mentality is healthy in an entrepreneur, but a dose of reality is necessary here:

  • You can’t out-do Google in quality with WebRTC
    • At least not if you’re going to go head to head with them
    • Remember that they’re the ones who maintain WebRTC and implement it inside Chrome
    • And if you have the skillset to actually deliver on this one, it means you don’t need to read this article…
  • These vendors have large teams
    • Larger than what you are going to put out there
    • Almost definitely larger than what you’re going to budget for in the next 3 years
    • In man-years they are going to out-class you on pure media quality
    • Especially now, when improving media quality is where our industry’s focus is
    • These vendors also need to deal with how Google runs WebRTC in the browser

This brings me to the need to optimize what you’re doing on an ongoing basis.

Ever since the pandemic, we’ve seen a growing effort in the leading vendors in this space to improve and optimize quality. This manifests itself in the research they publish as well as features they bring to the market. Here are a few examples:

  • Larger meeting sizes
  • Lower CPU use
  • Newer audio and video codecs
  • Introduction of AI algorithms to the media pipeline

You should plan for ongoing optimization of your own as well. Your customers are going to expect you to keep up with the industry. The notion of “good enough” works well here, but the bar of what is “good enough” is rising all the time.

Such optimizations are also needed not only to improve quality, but also to reduce costs.

Factor these costs in…

3. Hosting and maintenance costs of a WebRTC application

I had a meeting the other day. A founder of a startup who had to use WebRTC because customers needed something live and interactive. That component wasn’t at the core of his application, but not having it meant lost deals and revenue. It was a mandatory capability needed for a specific feature.

He complained about WebRTC being expensive to operate. Mainly because of bandwidth costs.

We can split WebRTC maintenance costs into two categories: cloud costs and “keeping the lights on” costs.

Cloud costs

That startup founder was focused on cloud costs.

When we look at the infrastructure costs of web applications, there’s the usual CPU, memory, storage and network. We might be paying these directly, or indirectly via other managed and serverless services.

With WebRTC, the network component is what hurts the most. Especially for video applications. You can reduce these costs by going to 2nd tier IaaS vendors or by hosting in “no-name” local data centers, but if you are like most vendors, you’re likely to end up on Amazon, Microsoft or Google cloud. And there, bandwidth costs for outgoing traffic are high.

WebRTC is peer to peer, but:

  • Not all sessions can go peer to peer. Some must be relayed via TURN servers
  • Large group calls in most cases will mean going through the cloud with your bandwidth to WebRTC media servers
  • All commercial WebRTC services I know have server components that gobble up bandwidth

And the more successful you become – the more bandwidth you’ll consume – and the higher your cloud costs are going to be.

You will need to factor this in when developing your application, especially deciding when to start optimizing for costs and bandwidth use.
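To make this concrete, here’s a back-of-the-envelope sketch of media server egress costs. The $0.09/GB price and the traffic numbers are assumptions for illustration only – real cloud egress pricing is tiered and varies by provider and region:

```python
# Back-of-the-envelope media server egress cost estimate.
# The $0.09/GB price is an assumption for illustration; real cloud
# egress pricing is tiered and varies by provider and region.

def monthly_egress_cost_usd(participants, bitrate_kbps, hours_per_month,
                            price_per_gb=0.09):
    """Cost of a server sending `bitrate_kbps` down to each participant."""
    bits = participants * bitrate_kbps * 1000 * hours_per_month * 3600
    gigabytes = bits / 8 / 1e9
    return gigabytes * price_per_gb

# e.g. 10,000 participant-hours per month of 1Mbps video
cost = monthly_egress_cost_usd(participants=1, bitrate_kbps=1000,
                               hours_per_month=10_000)
print(round(cost, 2))  # ~405.0
```

At these assumed numbers, 10,000 participant-hours of 1Mbps video is roughly 4,500GB of egress – around $405 a month, before CPU, storage or TURN relay traffic even enter the picture.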

Keeping the lights on costs

Then there are the “keeping the lights on” costs.

WebRTC changes all the time. Things get deprecated and removed. Features change behavior over time. New features are added. You continually need to test that your application does not break in the upcoming Chrome release. Who is going to take care of all that in your WebRTC application?

You will also need to understand the way your WebRTC application is used. Are users happy? Are there areas you need to invest in with further optimization? Observability (=monitoring) is key here.

Keeping the lights on has its own set of costs associated with it.

Build vs buy a WebRTC infrastructure

Buying your WebRTC infrastructure by using managed services like CPaaS vendors is expensive. But then again, building your own (along with optimizing and maintaining it) is also expensive.

Roughly speaking, this is the kind of a decision table you’ll see in front of you:

Build
  • Pros: customized to your specific need; ownership of the solution and the ability to modify it with changing requirements; better control over costs
  • Cons: time consuming, with a longer time to market; high initial development costs; ongoing maintenance costs; finding/sourcing skilled WebRTC developers

Buy
  • Pros: short time to market; low initial cost; less of a need for a highly skilled team of WebRTC experts
  • Cons: cost at scale can be an issue; harder to differentiate on the media layer

There’s also a middle ground, where you can source/buy certain pieces and build others. Here are a few examples/suggestions:

  • Consider paying for a managed TURN service while building your own WebRTC application
  • Signaling can be outsourced using the likes of PubNub, Pusher and Ably
  • You can get your testing and monitoring needs from testRTC (a company I co-founded)

You can also start with a CPaaS vendor and once you scale and grow, invest the time and money needed to build your own infrastructure – once you’ve proven your application and got to product-market-fit.

So, how free is WebRTC, really?

Part of WebRTC’s claim to fame is its nature as an open source and thus free software for building interactive web applications. While the technology itself is indeed free of charge and offers numerous freedoms, there are still costs associated with running a WebRTC application.

When we had to launch our own video conferencing service some 25 years ago, it took an investment of several million dollars and an engineering team working for a couple of years – only to get to an implementation of a media engine.

WebRTC gives that to you for “free”. And it is also kind enough to be pre-integrated in all modern browsers.

What Google did with WebRTC was to reduce the barrier of entry to real time communication drastically.

Creating a WebRTC application isn’t free – not really. But it does come with a lot of alternatives that bring with them freedom and flexibility.

The post Is WebRTC really free? The costs of running a WebRTC application appeared first on

WebRTC media resilience: the role FEC, RED, PLC, RTX and other acronyms play

bloggeek - Mon, 05/22/2023 - 13:00

How WebRTC media resilience works – what FEC, RED, PLC, RTX are and why they are needed to improve media quality in real-time communications.

Networks are finicky in nature, and media codecs even more so.

With networks, not everything sent is received on the other end, which means we have one more thing to deal with and care about when it comes to handling WebRTC media. Luckily for us, there are quite a few built-in tools that are available to us. But which one should we use at each point and what benefits do they bring?

This is what I’ll be focusing on in this article.

Networks are lossy

Communication networks are lossy in nature. This means that if you send a packet through a network – there’s no guarantee of that packet reaching the other side. There’s also no guarantee that packets arrive in the order you’ve sent them, or in a timely fashion – but that’s for another article.

This is why almost everything you do over the internet has this nice retransmission mechanism tucked away somewhere deep inside as an assumption. That retransmission mechanism is part of how TCP works – and for that matter, almost every other transport protocol implemented inside browsers.

The assumption here is that if something is lost, you simply send it again and you’re done. It may take a wee bit longer for the receiver to receive it, but it will get there. And if it doesn’t, we can simply announce that connection as severed and closed.

We call and measure that “something is lost” aspect of networks packet loss.

Stripping away that automatic assumption that networks are reliable and everything you send over them is received on the other side is the first important step in understanding WebRTC but also in understanding real-time transport protocols and their underlying concepts.

Media codecs are lossy (and sensitive)

Media codecs are also lossy, but in a different way. When an audio codec or a video codec needs to encode (=compress) the raw input from a microphone or a camera, it strips out the data it deems unnecessary. What gets stripped away are levels of perceived quality of the original media.

I remember many years ago, sitting at the dorms in the university and talking about albums and CDs. One of the roommates there was an audiophile. He always explained how vinyl albums have better audio quality than CDs and how MP3 just ruins audio quality. Me? I never heard the difference.

Perceived quality might be different between different people. The better the codec implementation, the more people will not notice degraded quality.

Back to codecs.

Most media codecs are lossy in nature. There are a few lossless ones, but these are rarely used for real time communications and not used in WebRTC at all. The reason we use lossy codecs is to have better compression rates:

Taking 1080p (Full HD) video at 30 frames per second will result in roughly 1.5Gbps of data. Without compressing it – it just won’t work. We’re trying to squeeze a lot of raw data over networks, and as always, we need to balance our needs with the resources available to us.
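That 1.5Gbps figure is easy to reproduce – 24 bits per pixel (8 bits each for R, G and B), at Full HD resolution and 30 frames per second:

```python
# Reproducing the ~1.5Gbps figure for uncompressed 1080p30 video:
# 24 bits per pixel (8 bits each for R, G and B).
width, height, fps, bits_per_pixel = 1920, 1080, 30, 24
raw_bps = width * height * fps * bits_per_pixel
print(raw_bps / 1e9)  # ~1.49 Gbps
```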

To compress more, we need:

  • To reduce what we care about to its bare minimum (the lossy aspect of the codecs we use)
  • More CPU and memory to perform the compression
  • Make every bit we end up with matter

That last one is where media codecs become really sensitive.

If every bit matters, then losing a bit matters. And if losing a bit matters, then losing a whole packet matters even more.

Since networks are bound to lose packets, we’re going to need to deal with missing media packets, with our system (in the decoder or elsewhere) needing to fill that gap somehow.

More on lossy codecs

More on the future of audio codecs (lossy and lossless ones)

Types of WebRTC media correction

Media packets are lost. Our media decoders – or WebRTC system as a whole – needs to deal with this fact. This is done using different media correction mechanisms. Here’s a quick illustration of the available choices in front of us:

Each such media correction technique has its advantages and challenges. Let’s review them so we can understand them better.

PLC: Packet Loss Concealment

Every WebRTC implementation needs a packet loss concealment strategy. Why? Because at some point, in some cases, you won’t have the packets you need to play NOW. And since WebRTC is all about real-time, there’s no waiting with NOW for too long.

What does packet loss concealment mean? It means that if we lost one or more packets, we need to somehow overcome that problem and continue to run to the best of our ability.

Before we dive a bit deeper, it is important to state: not losing packets is always better than needing to conceal lost packets. More on that – later.

This is done differently between audio and video:

Audio PLC

For the most part, audio packets are decoded frame-by-frame and usually also packet-by-packet. If one is lost, we can try various ways to solve that. These are the most common approaches:

Illustration taken from Philipp Hancke’s presentation at Kranky Geek.

Video PLC

Packet loss on video streams has its own headaches and challenges.

In video, most frames are dependent on previous ones, creating chains of dependencies:

I-frames or keyframes (whatever they are called depending on the video codec used) break these dependency chains, and then one can use techniques like temporal scalability to reduce the dependencies for some of the frames that follow.

When you lose a packet, the question isn’t only what to do with the current video frame and how to display it, but rather what is going to happen to future frames depending on the frame with the lost packet.
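Here is a small sketch of that dependency problem. The frame ids and the dependency map are made up for illustration; the point is that a frame is only decodable if it arrived AND everything it references is decodable too:

```python
# A sketch of why losing one video frame hurts future frames: each frame
# lists the frame it depends on (None for a keyframe). A frame is only
# decodable if it arrived AND everything it depends on is decodable.

def decodable_frames(deps, lost):
    """deps: {frame_id: parent_id or None}; lost: set of lost frame ids."""
    ok = {}
    for frame in sorted(deps):  # assume ids are in decode order
        parent = deps[frame]
        ok[frame] = frame not in lost and (parent is None or ok.get(parent, False))
    return {f for f, good in ok.items() if good}

# Frames 0..4 in a simple chain: 0 is a keyframe, each frame references
# the previous one. Losing frame 2 also makes frames 3 and 4 undecodable.
chain = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}
print(decodable_frames(chain, lost={2}))  # {0, 1}
```

With temporal scalability – say `{0: None, 1: 0, 2: 0, 3: 2}` – losing frame 1 costs only frame 1, which is exactly why reducing frame dependencies softens the blow of packet loss.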

In the past, the focus was on displaying every bit that got decoded, which ended up with video played back with smears as well as greens and pinks.

Check it for yourself, with our most recent WebRTC fiddle around frame loss.

Today, we mostly don’t display frames until we have a clean enough bitstream, opting to freeze the video a bit or skip video frames rather than show something that isn’t accurate enough. With the advances in machine learning, this may change in the future.

PLC is great, but there’s a lot to be done to get back the lost packets as opposed to trying to make do with what we have. Next, we will see the additional techniques available to us.

RTX: Retransmissions

Here’s a simple mechanism (used everywhere) to deal with packet loss – retransmission.

In whatever protocol you use, make sure to either acknowledge receiving what is sent to you or NACKing (sending a negative acknowledgement) when not receiving what you should have received. This way, the sender can retransmit whatever was lost and you will have it readily available.

This works well if there’s enough time for another round trip of data until you must play it back. Or when the data can help you out in future decoding (think the dependency across frames in video codecs). It is why retransmissions don’t always work that well in WebRTC media correction – we’re dealing with real time and low latency.

Another variation of this in video streams is asking for a new I-frame. This way, the receiver can signal the sender to “reset” the video stream and start encoding it from scratch, which essentially means a request to break the dependency between the old frames and the new ones that should be sent after the packet loss.
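A rough way to think about when RTX helps: a NACK costs about one round trip, so the recovered packet is only useful if it can still arrive before its playout deadline (roughly, the jitter buffer depth). A simplified sketch, with made-up numbers:

```python
# Sketch: is a retransmission worth requesting? A NACK costs roughly one
# round trip; if the recovered packet would arrive after its playout
# deadline (roughly the jitter buffer depth), asking for it is pointless.

def rtx_useful(rtt_ms, jitter_buffer_ms, loss_detection_delay_ms=10):
    """True if a NACKed packet can arrive before it must be played out."""
    return loss_detection_delay_ms + rtt_ms < jitter_buffer_ms

print(rtx_useful(rtt_ms=40, jitter_buffer_ms=100))   # True: RTX helps
print(rtx_useful(rtt_ms=250, jitter_buffer_ms=100))  # False: too late
```

This is why retransmission suits video (where a recovered packet may still help decode future frames) better than it suits audio played out tens of milliseconds after arrival.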

RED: REDundancy Encoding

Retransmission means we overcome packet losses after the fact. But what if we could solve things without retransmissions? We can do that by sending the same packet more than once and be done with it.

Double or triple the bitstream by flooding it with the same information to add more robustness to the whole thing.

RED is exactly that. It concatenates older audio frames into fresh packets that are being sent, effectively doubling or tripling the packet size.

If a packet gets lost, the new frame it was meant to deliver will be found in one of the following packets that should be received.

Yes, it eats up our bandwidth budget, but in a video call where we send 1Mbps of video data or more, more than doubling the audio bitrate from 40kbps to 90kbps might be a sacrifice worth making for cleaner audio.
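A toy model of the RED idea – which frames each packet carries, ignoring the actual RFC 2198 wire format:

```python
# Toy sketch of the RED idea: each outgoing packet carries the fresh
# audio frame plus the previous N frames, so a single packet loss does
# not lose any frame for good. Real RED (RFC 2198) uses binary headers;
# this only models which frames each packet carries.

def red_packets(frames, redundancy=2):
    """Each packet carries the current frame plus up to `redundancy` older ones."""
    packets = []
    for i in range(len(frames)):
        start = max(0, i - redundancy)
        packets.append(frames[start:i + 1])
    return packets

packets = red_packets(["f0", "f1", "f2", "f3"], redundancy=2)
# Losing the packet for f1 is fine: f1 also rides in the f2 and f3 packets.
received = [p for i, p in enumerate(packets) if i != 1]
recovered = {frame for packet in received for frame in packet}
print(sorted(recovered))  # ['f0', 'f1', 'f2', 'f3']
```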

FEC: Forward Error Correction

Redundancy encoding requires an additional 100% or more of bitrate. We can do better using other means, usually referred to as Forward Error Correction.

Mind you, redundancy encoding is just another type of forward error correction mechanism.

With FEC, we are going to add more packets that can be used to restore other packets that are lost. The most common approach for FEC is to take multiple packets, XOR them and send the XORed result as an additional data packet.

If one of the packets is lost, we can use the XORed packet to recreate the lost one.

There are other correction algorithms that are a wee bit more complex mathematically (google Reed-Solomon if you’re interested), but the one used in WebRTC for this purpose is XOR.
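The XOR mechanism itself can be sketched in a few lines (assuming, for simplicity, that all packets in a group have the same length – real implementations deal with padding):

```python
# The XOR trick described above: XOR a group of packets into one parity
# packet. If exactly one packet from the group is lost, XORing the
# parity packet with the surviving packets reconstructs the lost one.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    return reduce(xor_bytes, packets)

def recover(survivors, parity):
    """Recover the single missing packet of the group."""
    return reduce(xor_bytes, survivors, parity)

group = [b"pkt1", b"pkt2", b"pkt3"]
parity = make_parity(group)
lost = recover([b"pkt1", b"pkt3"], parity)
print(lost)  # b'pkt2'
```

Note that one parity packet only recovers a single loss per group; lose two packets covered by the same parity packet and both are gone – part of why FEC is tuned to the packet loss you expect on the network.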

FEC is still an expensive thing since it increases the bitrate considerably. Which is why it is used only sparingly:

  • When you know there’s going to be packet losses on the network
  • To protect only important video frames that many other frames are going to be dependent on
Making sense of WebRTC media correction


How is each one signaled over the network? When would it make sense to use it? How does WebRTC implement it in the browser and what exactly can you expect out of it?

All that is mostly arcane knowledge – something that is passed from one generation of WebRTC developers to the next, it seems.

Lucky for you, Philipp Hancke and myself are working on a new course – Higher Level WebRTC Protocols. In it, we are covering these specific topics as well as quite a few others in a level of detail that isn’t found anywhere else out there.

Most of the material is already written down. We just need to prettify it a bit and record it.

If you are interested in learning more about this, be sure to join our waiting list so you’ll know once we launch the course

Join the course waiting list

The post WebRTC media resilience: the role FEC, RED, PLC, RTX and other acronyms play appeared first on

ChatGPT meets WebRTC: What Generative AI means to Real Time Communications

bloggeek - Mon, 05/08/2023 - 13:00

ChatGPT is changing computing and as an extension how we interact with machines. Here’s how it is going to affect WebRTC.

ChatGPT became the service with the highest growth rate of any internet application, reaching 100 million active users within the first two months of its existence. A few are using it daily. Others are experimenting with it. Many have heard about it. All of us will be affected by it in one way or another.

I’ve been trying to figure out what exactly a “ChatGPT WebRTC” duo means – or in other words – what ChatGPT means for those of us working with and on WebRTC.

Here are my thoughts so far.

Crash course on ChatGPT

Let’s start with a quick look at what ChatGPT really is (in layman’s terms, with a lot of hand waving, and probably more than a few mistakes along the way).

BI, AI and Generative AI

I’ll start with a few slides I cobbled up for a presentation I did for a group of friends who wanted to understand this.

ChatGPT is a product/service that makes use of machine learning. Machine learning is something that has been marketed a lot as AI – Artificial Intelligence. If you look at how this field has evolved, it goes something like this:

We started with simple statistics – take a few numbers, sum them up, divide by their count and you get an average. Complicate that a bit with a weighted average. Add a bit more statistics on top, collect more data points and cobble up a nice BI (Business Intelligence) system.

At some point, we started looking at deep learning:

Here, we train a model by using a lot of data points, to the point where the model can infer things about new data given to it. Things like “do you see a dog in this picture?” or “what is the text being said in this audio recording?”.

Here, a lot of 3 letter acronyms are used like HMM, ANN, CNN, RNN, GNN…

What deep learning did in the past decade or two was enable machines to describe things – be able to identify objects in images and videos, convert speech to text, etc.

This made machines the ultimate classifiers, improving the way we search and catalog things.

And then came a new field of solutions in the form of Generative AI. Here, machine learning is used to generate new data, as opposed to classifying existing data:

Here what we’re doing is creating a random input vector and pushing it into a generator model. The generator model creates a sample for us – something that *should* be the type of thing we want created (say, a picture of a dog). That generated sample is then passed to the “traditional” inference model, which checks if this is indeed what we wanted to generate. If it isn’t, we iteratively fine-tune until we get a result that is “real”.

This is time consuming and resource intensive – but it works rather well for many use cases (like some of the images on this site’s articles that are now generated with the help of Midjourney).
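As a toy caricature of that generate-and-check loop (nothing like a real trained model – the “generator” and “checker” here are just stand-in functions for illustration):

```python
# A toy caricature of the generate-and-check loop described above: a
# "generator" proposes samples from random inputs and a "checker" scores
# them; we keep iterating until the checker is satisfied. Real generative
# models use trained neural networks, not this.
import random

random.seed(42)
TARGET = 0.8  # stand-in for "looks like a real dog" to the checker

def generator(z):
    return z  # a real generator maps random noise to an image/text sample

def checker(sample):
    return 1 - abs(sample - TARGET)  # higher score = more "realistic"

best = generator(random.random())
for _ in range(1000):
    candidate = generator(random.random())
    if checker(candidate) > checker(best):
        best = candidate  # keep the most "realistic" sample so far
```

The expensive part in real systems is that each iteration involves a full model evaluation – which is why generation is so time and resource intensive.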


  • We started with averages and statistics
  • Moved to “deep learning”, which is just hard for us to explain how the algorithms got to the results they did (it isn’t based on simple rules any longer)
  • And we then got to a point where AI generates new data
The stellar rise of ChatGPT

The thing is that all of this I just explained wouldn’t be interesting without ChatGPT – a service that came into our lives only recently, becoming the hottest thing out there:

The Most Important Chart In 100 Years #AI #GPT #ChatGPT #technology @JohnNosta

— Kyle Hailey (@kylelf_) February 16, 2023

ChatGPT is based on LLMs – Large Language Models. No other service grew as fast as ChatGPT, which is why every business in the world is now trying to figure out if and how ChatGPT will fit into its world and services.

Why ChatGPT and WebRTC are like oil and water

So this begs the question: what can you do with ChatGPT and WebRTC?

Problem is, ChatGPT and WebRTC are like oil and water – they don’t mix that well.

ChatGPT generates data whereas WebRTC enables people to communicate with each other. The “generation” part in WebRTC is taken care of by the humans that interact mostly with each other on it.

On one hand, this makes ChatGPT kind of useless for WebRTC – or at least not that obvious to use with it.

But on the other hand, if someone succeeds in cracking this one properly – they will have an innovative and unique thing on their hands.

What have people done with ChatGPT and WebRTC so far?

It is interesting to see what people and companies have done with ChatGPT and WebRTC in the last couple of months. Here are a few things that I’ve noticed:

In LiveKit’s and Twilio’s examples, the concept is to use the audio from the humans in the conversation as prompts for ChatGPT: convert it with Speech to Text, feed it to ChatGPT, then convert ChatGPT’s response back with Text to Speech and pass it to the humans in the conversation.
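As a rough illustration of that pattern, here’s a minimal Python sketch of the pipeline. The three stage functions are hypothetical stubs of my own – in a real application they would call an actual STT service, the ChatGPT API and a TTS service respectively:

```python
# Sketch of the audio -> STT -> LLM -> TTS -> audio loop described above.
# All three stages are stubs for illustration only.

def speech_to_text(audio_frames: bytes) -> str:
    return "what is webrtc"          # stub: pretend we transcribed the audio

def ask_llm(prompt: str) -> str:
    return f"You asked: {prompt}"    # stub: pretend this is the model's reply

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")      # stub: pretend this is synthesized audio

def handle_audio(audio_frames: bytes) -> bytes:
    """Audio in -> transcript -> LLM reply -> audio out."""
    transcript = speech_to_text(audio_frames)
    reply = ask_llm(transcript)
    return text_to_speech(reply)

print(handle_audio(b"\x00\x01"))
```

The interesting engineering is in the glue: each stage adds latency, so real implementations stream partial transcripts and partial TTS audio rather than waiting for full turns.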

Broadening the scope: Generative AI

ChatGPT is one of many generative AI services. Its focus is on text. Other generative AI solutions deal with images or sound or video or practically any other data that needs to be generated.

I have been using MidJourney for the past several months to help me with the creation of many images in this blog.

Today it seems that in any field where new data or information needs to be created, a generative AI algorithm can be a good place to investigate. And in marketing-speak: “AI” is overused, so a new overhyped term was needed to signal innovation and cutting edge – which is why the word “generative” was bolted onto AI.

Fitting Generative AI to the world of RTC

How does one go about connecting generative AI technologies with communications then? The answer to this question isn’t an obvious or simple one. From what I’ve seen, there are 3 main areas where you can make use of generative AI with WebRTC (or just RTC):

  1. Conversations and bots
  2. Media compression
  3. Media processing

Here’s what it means

Conversations and bots

In this area, we either have a conversation with a bot or have a bot “eavesdrop” on a conversation.

The LiveKit and Twilio examples earlier are about striking a conversation with a bot – much like how you’d use ChatGPT’s prompts.

A bot eavesdropping to a conversation can offer assistance throughout a meeting or after the meeting –

  • It can try to capture the essence of a session, turning it into a summary
  • Help with note taking and writing down action items
  • Figure out additional resources to share during the conversation – such as knowledge base items that reflect what a customer is complaining about to a call center agent

As I stated above, this has little to do with WebRTC itself – it takes place elsewhere in the pipeline; and to me, this is mostly an application capability.

Media compression

An interesting domain where AI is starting to be investigated and used is media compression. I’ve written about Lyra, Google’s AI-enabled speech codec, in the past. Lyra makes assumptions about how human speech sounds and behaves in order to send less data over the network (effectively compressing it), letting the receiving end figure out and fill in the gaps using machine learning. Can this approach be seen as a case of generative AI? Maybe.

Would it make sense to investigate such approaches, where the speakers are known, to better compress their audio and even video?

How about the whole super resolution angle, where you send video at WVGA or 720p resolution and have the decoder scale it up to 1080p or 4K, losing little in the process? We’re generating data out of thin air, though probably not in the “classic” sense of generative AI.

I’d also argue that if you know the initial raw content was generated using generative AI, there might be a better way in which the data can be compressed and sent at lower bitrates. Is that something worth pursuing or investigating? I don’t know.

Media processing

Similar to how we can have AI based codecs such as Lyra, we can also use AI algorithms to improve quality – better packet loss concealment that learns the speech patterns in real time and then mimics them when there’s packet loss. This is what Google is doing with their WaveNetEQ, something I mentioned in my WebRTC unbundling article from 2020.

Here again, the main question is how much of this is generative AI versus simply AI – and does that even matter?

Is the future of WebRTC generative (AI)?

ChatGPT and other generative AI services are growing and evolving rapidly. While WebRTC isn’t directly linked to this trend, it certainly is affected by it:

  • Applications will need to figure out how (and why) to incorporate generative AI with WebRTC as part of what they offer
  • Algorithms and codecs in WebRTC are evolving with the assistance of AI (generative or otherwise)

Like any other person and business out there, you too should see if and how generative AI affects your own plans.

The post ChatGPT meets WebRTC: What Generative AI means to Real Time Communications appeared first on

RTC@Scale 2023 – an event summary

bloggeek - Mon, 05/01/2023 - 13:00

RTC@Scale is Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary for RTC@Scale 2023 so you can pick and choose the relevant ones for you.

WebRTC Insights is a subscription service I have been running with Philipp Hancke for the past two years. The purpose of it is to make it easier for developers to get a grip of WebRTC and all of the changes happening in the code and browsers – to keep you up to date so you can focus on what you need to do best – build awesome applications.

We got into a kind of a flow:

  • Once every two weeks we finalize and publish a newsletter issue
  • Once a month we record a video summarizing libwebrtc release notes (older ones can be found on this YouTube playlist)

Oh – and we’re covering important events somewhat separately. Last month, a week after Meta’s RTC@Scale event took place, Philipp sat down and wrote a lengthy summary of the key takeaways from all the sessions, which we distributed to our WebRTC Insights subscribers.

As a community service (and a kind of promotion for WebRTC Insights), we are now opening it up to everyone in this article.

Why this issue?

Meta ran their rtc@scale event again. Last year was a blast and we were looking forward to this one. The technical content was pretty good again. As last year, our focus for this summary is what we learned or what it means for folks developing with WebRTC. Once again, the majority of speakers were from Meta. At times they crossed the line of “is this generally useful” to the realm of “Meta specific” but most of the talks still provide value.

Compared to last year there were almost no “work with me” pitches (with one exception).

It is surprising how often Meta says “WebRTC” or “Google” (oh and Amazon as well).

Writing up these notes took a considerable amount of time (again) but we learned a ton and will keep referencing these talks in the future so it was totally worth it (again). You can find the list of speakers and topics on the conference website, the seven hours of raw video here (which includes the speaker introductions) or you just scroll down below for our summary.

SESSION 1 Rish Tandon / Meta – Meta RTC State of the Union

Duration: 13:50

Watch if you

  • watch nothing else and don’t want to dive into specific areas right away. It contains a ton of insights, product features and motivation for their technical decisions

Key insights:

  • Every conference needs a keynote!
  • 300 million daily calls on Messenger alone is huge
    • The Instagram numbers on top of that remain unclear. Huge but not big enough to brag about?
    • Meta seems to have fared well and has kept their usage numbers up after the end of the pandemic, despite the general downward/flat trend we see for WebRTC in the browser
    • With 2022 being their largest-ever year in call volume, this suggests they are eating someone else’s market share (Google Duo possibly?)
  • Traditionally RTC at Meta was mobile-first with 1-1 being the dominant use-case. This is changing with Whatsapp supporting 32 users (because FaceTime does? Larger calls are in the making), an improved desktop application experience with a paginated 5×5 grid. Avatars are not dead yet btw
  • Meta is building their unified “MetaRTC” stack on top of WebRTC and openly talks about it. But it is a very small piece in the architecture diagram. Whatsapp remains a separate stack. RSYS is their cross-platform library for all the things on top of the core functionality provided by libWebRTC
  • The paginated 4×4 grid demo is impressive
    • Pagination is a hard problem to solve since you need to change a lot of video stream subscriptions at the same time which, with simulcast, means a lot of large keyframes (thankfully only small resolution ones for this grid size)
    • You can see this as the video becomes visible from left to right at 7:19
    • Getting this right is tough, imagine how annoying it would be if the videos showed up in a random order…
  • End-to-end encryption is a key principle for Meta
    • This rules out an MCU as part of the architecture
    • Meta is clearly betting on simulcast (with temporal layers), selective forwarding and dominant speaker identification for audio (with “Last-N” as described by Jitsi in 2015)
  • Big reliability improvements by defining a metric “%BAD” and then improving that
    • The components of that metric shown at 9:00 are interesting
    • In particular “last min quality reg” which probably measures if there was a quality issue that caused the user to hang up:
  • For mobile apps a grid layout that scales nicely with the number of participants is key to the experience. One of the interesting points made is that the Web version actually uses WASM instead of the browser’s native video elements
  • The “Metaverse” is only mentioned as part of the outlook. It drives screen sharing experiences which need to work with a tight latency budget of 80ms similar to game streaming
Sriram Srinivasan / Meta – Real-time audio at Meta Scale

Duration: 19:30

Watch if you are

  • An engineer working on audio. Audio reliability remains one of the most challenging problems with very direct impact to the user experience

Key insights:

  • Audio in RTC has evolved over the years:
    • We moved from wired-network audio-only calls to large multi party calls on mobile devices
    • Our quality expectations (when dogfooding) have become much higher in the last two decades
    • The Metaverse introduces new requirements which will keep us busy for the next one
  • Great examples of the key problems in audio reliability starting at 2:30
    • Participants can’t hear audio
    • Participants hear echo
    • Background noise
    • Voice breakup (due to packet loss)
    • Excessive latency (leading to talking over each other)
  • The overview slide at 4:20 shows we have been working on the essentials in WebRTC for a decade, with Opus thankfully enabling the high-end quality
    • This is hard because of the diversity in devices and acoustic conditions (as well as lighting for video). This is why we still have vendors shipping their own devices (Meta discontinued their Portal device though)
    • Humans have very little tolerance for audio distortions
  • The basic audio processing pipeline diagram appears at 5:50 and gets walked through until 11:00
    • Acknowledges that the pipeline is built on libWebRTC and then says it was a good starting point back in the day. The opinion at Google seems to be that the libWebRTC device management is very rudimentary and one should adopt the Chrome implementations. This is something where Google was doing better with Duo than Messenger was. They are not going to give that away for free to their nemesis
    • While there have been advances in AEC recently due to deep neural networks, this is a challenge on mobile devices. The solution is a “progressive enhancement” which enables more powerful features on high-end devices. On the web platform it is hard to decide this upfront as we can’t measure a lot due to fingerprinting concerns. You heard the term “progressive RTC application” or PRA here on WebRTC Insights first (but it is terrible, isn’t it?)
    • For noise suppression it is important to let the users decide. If you want to show your cute baby to a friend then filtering out the cries is not appropriate. Baseline should be filtering stationary noise (fan, air condition)
    • Auto gain control is important since the audio level gets taken into account by SFUs to identify the dominant speaker
    • Low-bitrate encoding is important in the market with the largest growth and terrible networks and low-end devices: India. We have seen this before from Google Duo
  • Audio device management (capture and rendering) starts at 11:00 and is platform-dependent
    • This is hard since it cannot be tested at scale but is device specific. So we need at least the right telemetry to identify which devices have issues and how often
    • End-call feedback which gets more specific for poor calls with a number of buckets. This is likely correlated with telemetry and the “last minute quality regression” metric
    • While all of this is great it is something Meta is keeping to themselves. After all, if Google made them spend the money why would they not make Google spend the money to compete?
    • This goes to show how players other than Google are also to blame for the current state of WebRTC (see Google isn’t your outsourcing vendor)
  • Break-down of “no-audio” into more specific cases at 13:00
    • The approach is to define, measure, fix which drives the error rate down
    • This is where WebRTC in the browser has disadvantages since we rarely get the details of the errors exposed to Javascript hence we need to rely on Google to identify and fix those problems
    • Speaking when muted and showing a notification is a common and effective UX fix
    • Good war stories, including the obligatory Bluetooth issues and interaction between phone calls and microphone access
  • Outlook at 17:40 about the Metaverse
    • Our tolerance for audio issues in a video call is higher because we have gotten used to the problems
    • Techniques like speaker detection don’t work in this setting anymore
Niklas Enbom / Meta – AV1 for Meta RTC

Duration: 18:00

Watch if you are

  • An engineer working on video, the “system integrators” perspective makes this highly valuable and applicable with lots of data and measurements
  • A product owner interested in how much money AV1 could save you

Key insights:

  • Human perception is often the best tool to measure video quality during development
  • AV1 is adopted by the streaming industry (including Meta who wrote a great blog post). Now is the time to work on RTC adaptation which lags behind:
  • AV1 is the next step after H.264 (a 20 year old codec) for most deployments (except Google who went after VP9 with quite some success)
  • Measurements starting at 4:20
    • The “BD-Rate” describes the bitrate difference between OpenH264 and libaom implementations, showing a 30-40% lower bitrate for the same quality – or a considerably higher quality for the same bitrate (but that is harder to express in the diagram as the Y-axis is in decibels)
    • 20% of Meta’s video calls end up with less than 200kbps (globally which includes India). AV1 can deliver a lot more quality in that area
    • The second diagram at 5:20 is about screen sharing which is becoming a more important use-case for Meta. Quality gains are even more important in this area which deals with high-resolution content and the bitrate difference for the same quality is up to 80%. AV1 screen content coding tools help address the special-ness of this use-case too
    • A high resolution screen sharing (4k-ish) diagram is at 6:00 and shows an even more massive difference, followed by a great visual demo. Sadly, the libaom source code shown is blurry in both examples (as a result of H.264 encoding), but you can see a difference
  • Starting at 6:30 we are getting into the advantages of AV1 for integrators or SFU developers:
    • Reference Picture Resampling removes the need for keyframes when switching resolution. This is important when switching the resolution down due to bandwidth estimates dropping – receiving a large key frame is not desirable at all in that situation. Measuring the amount of key frames due to resolution changes is a good metric to track (in the SFU) – quoted as 1.5 per minute
    • AV1 offers temporal, spatial and quality SVC
    • Meta currently uses Simulcast (with H.264) and requires (another good metric to track) 4 keyframes per minute (presumably that means when switching up)
  • Starting at 8:00 Niklas Enbom talks about the AV1 challenges they encountered:
    • AV1 can also provide significant cost savings (the exact split between cost savings and quality improvements is what you will end up fighting about internally)
    • Meta approached AV1 by doing an “offline evaluation” first, looking at what they could gain theoretically and then proceeding with a limited roll-out on desktop platforms which validated the evaluation results
    • Rolling this out to the diverse user base is a big challenge of its own, even if the results are fantastic
    • libAOM increases the binary size by 1MB which is a problem because users hate large apps (and yet, AV1 would save a lot more even on the first call) which becomes a political fight (we never heard about that from Google including it in libWebRTC and Chrome). It gets dynamically downloaded for that reason which also allows deciding whether it is really needed on this device (on low-end devices you don’t need to bother with AV1)
    • At 11:40 “Talk time” is the key metric for Meta/Messenger and AV1 means at least 3x CPU usage (5x if you go for the best settings). This creates a goal conflict between battery (which lowers the metric) and increased quality (which increases it). More CPU does not mean more power usage however, the slide at 13:00 talks about measuring that and shows results with a single-digit percentage increase in power usage. This can be reduced further with some tweaks and using AV1 for low-bitrate scenarios and using it only when the battery level is high enough. WebRTC is getting support (in the API) for doing this without needing to resort to SDP manipulation, this is a good example of the use-case (which is being debated in the spec pull request)
    • At 15:30 we get into a discussion about bitrate control, i.e. how quickly and well will the encoder produce what you asked for as shown in the slide:
  • Blue is the target bitrate, purple the actual bitrate – and the actual stays higher than the target for quite a while! Getting rate control on par with their custom H.264 one was a lot of work (due to Meta’s H.264 rate controller being quite tuned) and will hopefully result in an upstream contribution to libaom! The “laddering” of resolution and framerate depending on the target bitrate is an area that needs improvements as well; we have seen Google just ship some improvements in Chrome 113. The “field quality metrics” (i.e. results of getStats such as qpSum/framesEncoded) are codec-specific so cannot be used to compare between codecs, which is an unsolved problem
  • At 17:00 we get into the description of the current state and the outlook:
    • This is being rolled out currently. Mobile support will come later and probably take the whole next year
    • VR and game streaming are obvious use-cases with more control over devices and encoders
    • VVC (the next version of HEVC) and AV2 are on the horizon, but only for the streaming industry and RTC lags behind by several years usually.
    • H.264 (called “the G.711 of video” during Q&A) is not going to go away anytime soon so one needs to invest in dealing with multiple codecs
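As a side note on the BD-Rate metric mentioned above, here’s a hedged Python sketch of how such a number can be computed. This is a simplified piecewise-linear variant (Bjøntegaard’s original method fits polynomials to the rate-distortion curves), and the sample curves below are made up for illustration, not Meta’s measurements:

```python
import math

def interp(x, xs, ys):
    # Piecewise-linear interpolation; xs must be ascending.
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside range")

def bd_rate(anchor, test, samples=100):
    """Average bitrate difference (%) of `test` vs `anchor` at equal quality.
    Each curve is a list of (bitrate_kbps, psnr_db) points."""
    def prep(curve):
        pts = sorted(curve, key=lambda p: p[1])          # ascending PSNR
        return [p for _, p in pts], [math.log(r) for r, _ in pts]
    a_q, a_r = prep(anchor)
    t_q, t_r = prep(test)
    lo, hi = max(a_q[0], t_q[0]), min(a_q[-1], t_q[-1])  # overlapping quality range
    diffs = []
    for i in range(samples + 1):
        q = lo + (hi - lo) * i / samples
        # Log-bitrate gap between the two codecs at the same quality q.
        diffs.append(interp(q, t_q, t_r) - interp(q, a_q, a_r))
    avg = sum(diffs) / len(diffs)
    return (math.exp(avg) - 1) * 100    # negative = test codec needs less bitrate

# Illustrative rate-distortion curves: the "AV1" one needs 40% less bitrate.
h264 = [(200, 30.0), (400, 33.0), (800, 36.0), (1600, 39.0)]
av1  = [(120, 30.0), (240, 33.0), (480, 36.0), (960, 39.0)]
print(round(bd_rate(h264, av1), 1))
```

Averaging in the log-bitrate domain is what makes the result a percentage that is stable across the quality range rather than dominated by the high-bitrate points.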
Jonathan Christensen / SMPL – Keeping it Simple

Duration: 18:40

Watch if you are

  • A product manager who wants to understand the history of the industry, what products need to be successful and where we might be going
  • Interested in how you can spend two decades in the RTC space without getting bored

Key insights:

  • Great overview of the history of use-cases and how certain innovations were successfully implemented by products that shaped the industry by hitting the sweet spot of “uncommon utility” and “global usability”
  • At 3:00: When ICQ shipped in 1996 it popularized the concept of “presence”, i.e. showing a roster of people who are online in a centralized service.
  • At 4:40: next came MSN Messenger which did what ICQ did but got bundled with Windows which meant massive distribution. It also introduced free voice calling between users on the network in 2002. Without solving the NAT traversal issue which meant 85% of calls failed. Yep, that means not using a STUN server in WebRTC terms (but nowadays you would go 99.9% failure rate)
  • At 7:00: While MSN was arguing who was going to pay the cost (of STUN? Not even TURN yet!) Skype showed up in 2003 and provided the same utility of “free voice calls between users” but they solved NAT traversal using P2P so had a 95% call success rate. They monetized it by charging 0.02$ per minute for phone calls and became a verb by being “Internet Telephony that just works”
  • The advent of the first iPhone in 2007 led to the first mobile VoIP application, Skype for the iPhone which became the cash cow for Skype. The peer-to-peer model did however not work great there as it killed the battery quickly
  • At 9:30: WhatsApp entered the scene in 2009. It provided less utility than Skype (no voice or video calls, just text messaging) and yet introduced the important concept of leveraging the address book on the phone and using the phone number as an identifier which was truly novel back then!
  • When Whatsapp later added voice (not using WebRTC) they took over being “Internet Telephony that just works”
  • At 11:40: Zoom… which became a verb during the 2020 pandemic. The utility it provided was a friction free model
    • We disagree here: downloading the Zoom client has always been something WebRTC avoided, and “going to a website” had the same frictionless-ness we saw with the early WebRTC applications, including the ones we have forgotten about
    • What it really brought was a freemium business model to video calling that was easy to use freely and not just for a trial period
  • At 12:40: These slides ask you to think about what uncommon utility is provided by your product or project (hint: WebRTC commoditized RTC) and whether normal people will understand it (as the pandemic has shown, normal people did not understand video conferencing). What follows is a bit of a sales pitch for SMPL which is still great to listen to, small teams of RTC industry veterans would not work on boring stuff
  • At 15:00: Outlook into what is next followed by predictions. Spatial audio is believed to be one of the things but we heard that a lot over the last decade (or two if you have been around for long enough; Google is shipping  this feature to some Pixel phones, getting the name wrong), as is lossless codecs for screen sharing and Virtual Reality
  • We can easily agree with the prediction that “users will continue to win” (in WebRTC we do this every time a Googler improves libWebRTC), but whether there will be “new stars” in RTC remains to be seen
First Q&A

Duration: 25:00

Watch if:

  • You found the talks this relates to interesting and want more details

Key points:

  • We probably need something like simulcast, but for audio
  • H.264 is becoming the G.711 of video. Some advice on what metrics to measure for video (freezes, resolution, framerate and qp are available through getStats) 
  • Multi-codec simulcast is an interesting use-case
  • The notion that “RTC is good enough” is indeed not great. WebRTC suffers in particular from it
SESSION 2 Sandhya Rao / Microsoft – Top considerations for integrating RTC with Android appliances

Duration: 21:30

Watch if you are

  • A product manager in the RTC space, even if not interested in Android

Key points:

  • This is mostly a shopping list of some of the sections you’ll have in a requirements document. Make sure to check if there’s anything here you’d like to add to your own set of requirements
  • Devices will be running Android OS more often than not. If we had to plot how we got there: Proprietary RTOS → Vxworks → Embedded Linux → Android
  • Some form factors discussed:
    • Hardware deskphone
    • Companion voice assistant
    • Canvas/large tablet personal device
    • always-on ambient screen
    • shared device for cross collaboration (=room system with touchscreen)
  • Things to think through: user experience, hardware/OS, maintenance+support
  • User experience
    • How does the device give a better experience than a desktop or a mobile device?
    • What are the top workloads for this device? focus only on them (make it top 3-5 at most)
  • Hardware & OS
    • Chipset selection is important. You’ll fall into the quality vs cost problem
    • Decide where you want to cut corners and where not to compromise
    • Understand which features take up which resources (memory, CPU, GPU)
    • What’s the lifecycle/lifetime of this device? (5+ years)
  • Maintenance & support
    • Environment of where the device is placed
    • Can you remotely access the device to troubleshoot?
    • Security & authentication aspects
    • Ongoing monitoring
Yun Zhang & Bin Liu / Meta – Scaling for large group calls

Duration: 19:00

Watch if you are

  • A developer dealing with group calling and SFUs, covers both audio and video. Some of it describes the very specific problems Meta ran into scaling the group size but interesting nonetheless

Key points:

  • The audio part of the talk starts at 2:00 with a retrospective slide on how audio was done at Meta for “small group calls”. For these it is sufficient to rely on audio being relatively little traffic compared to video, DTX reducing the amount of packets greatly, as well as lots of people being muted. As conference sizes grow larger this does not scale – even forwarding the DTX silence indicator packets every 400ms can lead to a significant number of packets. To solve this, two ideas are used: “top-N” and “audio capping”
  • The first describes forwarding the “top-N” active audio streams. This is described in detail in the Jitsi paper from 2015. The slideshow uses the same mechanism with audio levels as RTP header extension (the use of that extension was confirmed in the Q&A; the algorithm itself can be tweaked on the server). The dominant speaker decision also affects Bandwidth allocation for video:
  • The second idea is “audio capping” which does not forward audio from anyone but the last couple of dominant speakers. Google Meet does this by rewriting audio to three synchronization sources which avoids some of the PLC/CNG issues described on one of the slides. An interesting point here is at 7:50 where it says “Rewrite the Sequence number in the RTP header, inject custom header to inform dropping start/end”. Google Meet uses the contributing source here and one might use the RTP marker bit to signal the beginning or end of a “talk spurt” as described in RFC 3551
  • The results from applying these techniques are shown at 8:40 – 38% reduction of traffic in a 20 person call, 63% for a 50 person call and less congestion from server to the client
  • The video part of the talk starts at 9:30 establishing some of the terminology used. While “MWS” or “multiway server” is specific to Meta we think the term “BWA” or “bandwidth allocation” to describe how the estimated bandwidth gets distributed among streams sent from server to client is something we should talk more about:
    • Capping the uplink is not part of BWA (IMO) but if nobody wants to receive 720p video from you, then you should not bother encoding or sending it and we need ways for the server to signal this to the client
  • The slide at 10:20 shows where this is coming from: Meta’s transition from “small group calls” to large ones. This is a bit more involved than saying “we support 50 users now”. Given this, the mention of “lowest common denominator” makes us wonder whether simulcast was even used by the small calls, because it solves this problem
  • Video oscillation, i.e. how and when to switch between layers which needs to be done “intelligently”
  • Similarly, bandwidth allocation needs to do something smarter than splitting the bandwidth budget equally. Also there are bandwidth situations where you can not send video and need to degrade to sending only one and eventually none at all. Servers should avoid congesting the downstream link just as clients do BWE to avoid congesting the upstream
  • The slide at 13:00 shows the solution to this problem. Simulcast with temporal layers and “video pause”:
    • Simulcast with temporal layers provides (number of spatial layers) * (number of temporal layers) video layers with different bitrates that the server can pick from according to the bandwidth allocation
    • “Video pause” is a component of what Jitsi called “Video pausing” in the  “Last-N” paper
  • It is a bit unclear what module the “PE-BWA” replaces but taking into account use-cases like grid-view, pinned-user or thumbnail makes a lot of sense
  • Likewise, “Stream Subscription Manager” and “Video Forward Manager” are only meaningful inside Meta since we cannot use it. Maximizing for a “stable” experience rather than spending the whole budget makes sense. So do the techniques to control the downstream bandwidth used, picking the right spatial layer, dropping temporal layers and finally dropping “uninteresting” streams
  • At 18:10 we get into the results for the video improvements:
    • 51% less video quality oscillation (which suggests the previous strategy was pretty bad) and 20% less freezes
    • 34% overall video quality improvement, 62% improvement for the dominant speaker (in use-cases where it is being used; this may include allocating more bandwidth to the most recent dominant speakers)
  • At 18:30 comes the outlook:
    • Dynamic video layers structure sounds like informing the server about the displayed resolution on the client and letting it make smart decisions based on that
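The “top-N” and “audio capping” ideas described above can be sketched roughly like this. Participant names, levels and thresholds are all illustrative; in a real SFU the levels would come from the ssrc-audio-level RTP header extension and the selection would be smoothed over time:

```python
# Toy sketch of server-side audio stream selection:
# "top-N" ranks unmuted participants by loudness, "audio capping" then
# limits how many audio streams are forwarded regardless of group size.

def select_forwarded(audio_levels, top_n=5, cap=3):
    """audio_levels: dict of participant -> audio level (higher = louder).
    Returns the participants whose audio gets forwarded to receivers."""
    # Rank participants by loudness and keep the top N...
    ranked = sorted(audio_levels, key=audio_levels.get, reverse=True)[:top_n]
    # ...then apply the cap: never forward more than `cap` audio streams.
    return ranked[:cap]

levels = {"alice": 0.9, "bob": 0.2, "carol": 0.7, "dan": 0.4, "erin": 0.1}
print(select_forwarded(levels))   # loudest speakers first
```

The traffic reductions quoted in the talk (38% for 20 participants, 63% for 50) fall out naturally: the forwarded set stays constant while the group grows.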
Saish Gersappa & Nitin Khandelwa / Whatsapp – Relay Infrastructure

Duration: 15:50

Watch if you are

  • A developer dealing with group calling and SFUs. Being Whatsapp this is a bit more distant from WebRTC (as well as the rest of the “unified” Meta stack?) but still has a lot of great points

Key points:

  • After an introduction of Whatsapp principles (and a number… billion of hours per week) for the first three minutes the basic “relay server” is described which is a media server that is involved for the whole duration of the call (i.e. there is no peer-to-peer offload)
  • The conversation needs to feel natural and network latency and packet loss create problems in this area. This gets addressed by using a global overlay network and routing via those servers. The relay servers are not run in the “core” data centers but at the “points of presence” (thousands) that are closer to the user. This is a very common strategy we have always recommended but the number and geographic distribution of the Meta PoPs makes this impressive. To reach the PoPs the traffic must cross the “public internet” where packet loss happens
  • At 5:30 this gets discussed. The preventive way to avoid packet loss is to do bandwidth estimation and avoid congesting the network. Caching media packets on the server and resending them from there is a very common method as well, typically called a NACK cache. It does not sound like FEC/RED is being used or at least not mentioned.
  • At 6:30 we go into device resource usage. An SFU with dominant speaker identification is used to reduce the amount of audio and video streams as well as limit the number of packets that need to be encrypted and decrypted. All of this costs CPU which means battery life and you don’t want to drain the battery
  • For determining the dominant speaker the server is using the “audio volume” on the client. Which means the ssrc-audio-level based variant of the original dominant speaker identification paper done by the Jitsi team.
  • Next at 8:40 comes a description of how simulcast (with two streams) is used to avoid reducing the call quality to the lowest common denominator. We wonder if this also uses temporal scalability, Messenger does but Whatsapp still seems to use their own stack
  • Reliability is the topic of the section starting at 10:40 with a particular focus on reliability in cases of maintenance. The Whatsapp SFU seems to be highly clustered with many independent nodes (which limits the blast radius); from the Q&A later it does not sound like it is a cascading SFU. Moving calls between nodes in a seamless way is pretty tricky, for WebRTC one would need to both get and set the SRTP state including rollover count (which is not possible in libSRTP as far as we know). There are two types of state that need to be taken into account:
    • Critical information like “who is in the call”
    • Temporary information like the current bandwidth estimate which constantly changes and is easy to recover
  • At 12:40 we have a description of handling extreme load spikes… like calling all your family and friends and wishing them a happy new year (thankfully this is spread over 24 hours!). Servers can throttle things like the bandwidth estimate in such cases in order to limit the load (this can be done e.g. when reaching certain CPU thresholds). Prioritizing ongoing calls and not accepting new calls is common practice, prioritizing 1:1 calls over multi party calls is acceptable for Whatsapp as a product but would not be acceptable for an enterprise product where meetings are the default mode of operation
  • Describing dominant speaker identification and simulcast as “novel approaches” is… not quite novel
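The server-side NACK cache mentioned above is simple enough to sketch. This is our own toy model, not code from the talk (the class and method names are ours): the relay keeps recently forwarded packets keyed by sequence number and serves retransmissions itself, saving a round trip to the original sender.

```javascript
// Minimal sketch of a server-side NACK cache: the relay keeps recently
// forwarded RTP packets keyed by sequence number so it can resend them
// itself instead of asking the original sender to retransmit.
class NackCache {
  constructor(maxSize = 512) {
    this.maxSize = maxSize;   // bound memory: keep only recent packets
    this.packets = new Map(); // seq -> payload
  }

  // Called for every packet the relay forwards downstream.
  store(seq, payload) {
    this.packets.set(seq, payload);
    if (this.packets.size > this.maxSize) {
      // Maps iterate in insertion order, so the first key is the oldest.
      const oldest = this.packets.keys().next().value;
      this.packets.delete(oldest);
    }
  }

  // Called when a receiver NACKs a sequence number. Returns the cached
  // payload to resend, or null if it already aged out of the cache.
  resend(seq) {
    return this.packets.has(seq) ? this.packets.get(seq) : null;
  }
}
```

A real SFU would also rate-limit retransmissions and bound the cache by time rather than just packet count, but the core idea is this map of recent packets.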
Second Q&A

Duration: 28:00

Watch if:

  • You found the talks this relates to interesting and want more details

Key points:

  • There were a lot more questions and it felt more dynamic than the first Q&A
  • Maximizing video experience for a stable and smooth experience (e.g. less layer switches) often works better than chasing the highest bitrate!
  • Good questions and answers on audio levels, speaker detection and how BWE works and is used by the server
  • It does sound like WhatsApp still refuses to do DTLS-SRTP…
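For readers curious how the audio-level based speaker detection discussed in the Q&A works in practice, here is a minimal sketch. It assumes the ssrc-audio-level convention (0 = loudest, 127 = silence) and simply smooths the reported levels; real implementations, like the Jitsi one referenced above, add hysteresis and windowed history, which we omit here.

```javascript
// Simplified dominant speaker selection from RTP audio levels.
// The ssrc-audio-level header extension reports level as 0..127 (-dBov),
// where 0 is loudest. We smooth each speaker's level and pick the loudest;
// real implementations add hysteresis to avoid rapid switching.
class DominantSpeaker {
  constructor(alpha = 0.3) {
    this.alpha = alpha;        // exponential smoothing factor
    this.smoothed = new Map(); // ssrc -> smoothed loudness (higher = louder)
  }

  // level: 0 (loudest) .. 127 (silence), as carried in the header extension.
  report(ssrc, level) {
    const loudness = 127 - level; // invert so higher means louder
    const prev = this.smoothed.get(ssrc) ?? 0;
    this.smoothed.set(ssrc, this.alpha * loudness + (1 - this.alpha) * prev);
  }

  dominant() {
    let best = null;
    let bestScore = -1;
    for (const [ssrc, score] of this.smoothed) {
      if (score > bestScore) { best = ssrc; bestScore = score; }
    }
    return best;
  }
}
```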
SESSION 3 Vinay Mathew / Dolby – Building a flexible, scalable infrastructure to test at scale

Duration: 22:00

Watch if you are

  • A software developer or QA engineer working on RTC products

Key points:

  • Dolby’s platform today has the following requirements/limits:
    • Today: 50 participants in a group call; 100k viewers
    • Target: 1M viewers; up to 25 concurrent live streams; live performance streaming
  • Scale requires better testing strategies
  • For scale testing, Dolby split the functionality into 6 different areas:
    • Authentication and signaling establishment – how many can be handled per second (rate), geo and across geo
    • Call signaling performance – maximum number of join conference requests that can be handled per second
    • Media distribution performance – how does the backend handle the different media loads, looking at media metrics on client and server side
    • Load distribution validation – how does the backend scale up and down under different load sizes and changes
    • Scenario based mixing performance – focus on recording and streaming to huge audiences (a specific component of their platform)
    • Metrics collection from both server and client side – holistic collection of metrics and use a baseline for performance metrics out of it
  • Each component has its own set of metrics and rates that are measured and optimized separately
  • Use a mix of testRTC and in-house tools/scripts on AWS EC2 (Python based, using aiortc; locust for jobs distribution)
  • Homegrown tools mean they usually overprovision EC2 instances for their tests – something they want to address moving forward
  • Dolby decided not to use testRTC for scale testing. Partly due to cost issues and the need to support native clients
  • The new scale testing architecture for Dolby:
  • Mix of static and on demand EC2 instances, based on size of the test
  • Decided on a YAML based syntax to define the scenario
  • Scenarios are kept simple, and the scripting language used is proprietary, serving as a domain specific language
  • This looks like the minimal applicable architecture for stress testing WebRTC applications. If you keep your requirements of testing limited, then this approach can work really well
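Dolby’s scenario language is proprietary, so purely as an illustration, here is what a tiny DSL interpreter for load-test scenarios might look like. Every verb and the client interface are made up for this sketch:

```javascript
// Toy sketch of a scenario DSL for load testing: a small vocabulary of
// actions that each virtual client executes in order. Dolby's actual
// language is proprietary; all keywords here are invented for illustration.
function runScenario(script, client) {
  const log = [];
  for (const line of script.trim().split('\n')) {
    const [verb, ...args] = line.trim().split(/\s+/);
    switch (verb) {
      case 'join':    client.join(args[0]); log.push(`joined ${args[0]}`); break;
      case 'publish': client.publish(args[0]); log.push(`publishing ${args[0]}`); break;
      case 'wait':    log.push(`waited ${args[0]}s`); break; // simulated, no real sleep
      case 'leave':   client.leave(); log.push('left'); break;
      default: throw new Error(`unknown verb: ${verb}`);
    }
  }
  return log;
}
```

The appeal of this approach is that thousands of virtual clients can share one short, reviewable script, and the same scenario can drive different client implementations.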
Jay Kuo / USC – blind quality assessment of real-time visual communications

Duration: 18:15

Watch if you

  • Want to learn how to develop ways to measure video quality in an RTC scenario

Key points:

  • We struggled a bit with this one as it is a bit “academic” (with an academic sales pitch even!) and not directly applicable. However, this is a very hard problem that needs this kind of research first
  • Video quality assessment typically requires both the sender side representation of the video and the receiver side. Not requiring the sender side video is called “blind quality assessment” and is what we need for applying it to RTC or conversational video
  • Ideally we want a number from such a method (called BVQA around 4:00) that we can include in the getStats output. The challenge here is doing this with low latency and efficiently in particular for Meta’s requirements to run on mobile phones
  • We do wonder how background blur affects this kind of measurement. Is the video codec simply bad for those areas or is this intentional…
Wurzel Parsons-Keir / Meta – Beware the bot armies! Protecting RTC user experiences using simulations

Duration: 25:00

Watch if you are

  • Interested in a better way to test than asking all your coworkers to join you for a test (we all have done that many times)
  • A developer that wants to test and validate changes that might affect media quality (such as bandwidth estimation)
  • Want to learn how to simulate a ten minute call in just one minute
  • Finally want to hear a good recruiting pitch for the team (the only one this year)
  • Yes, Philipp really liked this one. Wurzel’s trick of making his name more appealing to Germans works so please bear with him.

Key takeaways:

  • This is a long talk but totally worth the time
  • At 3:00 some great arguments for investing in developer experience and simulation, mainly by shifting the cost left from “test in production” and providing faster feedback cycles. It also enables building and evaluating complex features like “Grid View and Pagination” (which we saw during the keynote) much faster
  • After laying out the goals we jump to the problem at 6:00. Experiments in the field take time and pose a great risk. Having a way to test a change in a variety of scenarios, conditions and configuration (but how representative are the ones you have?) shortens the feedback cycle and reduces the risk
  • At 7:30 we get a good overview of what gets tested in the system and how. libWebRTC is just a small block here (but a complex one), followed by the introduction of “Newton”, which is the framework Meta developed for deterministic and faster than real-time testing. A lot of events in WebRTC are driven by periodic events, such as a camera delivering frames at 30 frames per second, RTCP being sent at certain intervals, networks having certain bits-per-second constraints and so on
  • At 9:20 we start with a “normal RTC test”, two clients and a simulated network. You want to introduce random variations for realism but make those reproducible. The common approach for that is to seed the random generator, log the seed and allow feeding it in as a parameter to reproduce
  • The solution to the problem of clocks is sadly not to send a probe into the event horizon of a black hole and have physics deal with making it look faster on the outside. Instead, a simulated clock and task queue are used. Those are again very libWebRTC specific terms: libWebRTC provides a “fake clock” which is mainly used for unit tests. Newton extends this to end-to-end tests; the secret sauce here is how to tell the simulated network (assuming it is an external component and not one simulated by libWebRTC too) about those clocks as well
  • After that (around 10:30) it is a matter of providing a great developer experience for this by providing scripts to run thousands of calls, aggregate the results and group the logging for these. This allows judging both how a change affects averages as well as identifying cases (or rather seed values!) where it degraded the experience
  • At 12:00 we get into the second big testing system built which is called “Callagen” (such pun!) which is basically a large scale bot infrastructure that operates in real-time on real networks. The system sounds similar to what Tsahi built with testRTC in many ways as well as what Dolby talked about. Being Meta they need to deal with physical phones in hardware labs. One of the advantages of this is that it captures both sender input video as well as receiver output video, enabling traditional non-blind quality comparisons
  • Developer experience is key here, you want to build a system that developers actually use. A screenshot is shown at 14:40. We wonder what the “event types” are. As suggested by the Dolby talk there is a limited set of “words” in a “domain specific language” (DSL) to describe the actions and events. Agreeing on those would even make cross-service comparisons more realistic (as we have seen in the case of Zoom-vs-Agora this sometimes evolves into a mud fight) and might lead to agreeing on a set of commonly accepted baseline requirements for how a media engine should react to network conditions
  • The section starting at 16:00 is about how this applies to… doing RTC testing @scale at Meta. It extends the approach we have seen in the slides before and again reminds us of the Dolby mention of a DSL. As shown around 17:15 the “interfaces” for that are appium scripts for native apps or python-puppeteer ones for web clients (we are glad web clients are tested by Meta despite being a niche for them!)
  • At 17:40 comes the challenge of ensuring test configurations are representative. This is a tough problem and requires putting numbers on all your features so you can track changes. And some changes only affect the ratios in ways that… don’t show up until your product gets used by hundreds of millions of users in a production scenario. Newton reduces the risk here by at least validating with a statistically relevant number of randomized tests, which increases organizational confidence. Over time it also creates a feedback loop of how realistic the scenarios you test are. Compared to Google, Meta is in a pretty good position here as they only need to deal with a single organization doing product changes which might affect metrics rather than “everyone” using WebRTC in Chrome
  • Some example use-cases are given at 19:15 that this kind of work enables. Migrating strategies between “small calls” and “large calls” is tricky as some metrics will change. Getting insight into which ones and whether those changes are acceptable (while retaining the metrics for “small ones”) is crucial for migrations
    • Even solving the seemingly simple “can someone join me on a call” problem provides a ton of value to developers
    • The value of enabling changes to complex issues such as anything related to codecs cannot be overstated
  • Callagen running a lot of simulations on appium also has the unexpected side-effect of exposing deadlocks earlier which is a clear win in terms of shifting the cost of such a bug “left” and providing a reproduction and validation of fixes
  • Source-code bisect, presented at 21:00, is the native libWebRTC equivalent of Chromium’s bisect tooling. Instead of writing a jsfiddle, one writes a “sim plan”. And it works “at scale”, allowing one to observe effects like a 2% decrease in some metric. libWebRTC has similar performance monitoring capabilities to identify perf regressions, which run in Google’s infrastructure, but sadly Google does not talk much about that
  • A summary is provided at 23:00 and there is indeed a ton to be learned from this talk. Testing is important and crucial for driving changes in complex systems such as WebRTC. Having proof that this kind of testing provides value makes it easier to argue for it and it can even identify corner cases
  • At 24:00 there is a “how to do it yourself” slide which we very much appreciate from a “what can WE learn from this” perspective. While some of it seems generally applicable to testing any system, thinking about the RTC angle is useful and the talk gave some great examples. Start small, take baby steps. They will pay off in the long run (and for “just” a year of effort the progress seems remarkable)
  • There is a special guest joining at the end!
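To make the “simulated clock and task queue” and “seeded randomness” ideas above concrete, here is a minimal sketch – not Newton’s actual code, and all names are ours. A virtual clock runs scheduled tasks as fast as the CPU allows, so a ten-minute call’s periodic events finish in milliseconds, and a seeded PRNG (mulberry32) lets a failing run be replayed exactly from its logged seed.

```javascript
// Sketch of deterministic, faster-than-real-time scheduling: tasks are
// queued against a virtual clock and executed by advancing that clock,
// so "ten minutes" of periodic events run as fast as the CPU allows.
class SimClock {
  constructor() { this.now = 0; this.tasks = []; }

  schedule(at, fn) { this.tasks.push({ at, fn }); }

  // Advance virtual time, running every task due at or before `target`.
  advanceTo(target) {
    while (true) {
      this.tasks.sort((a, b) => a.at - b.at);
      const next = this.tasks[0];
      if (!next || next.at > target) break;
      this.tasks.shift();
      this.now = next.at;
      next.fn(); // a task may schedule follow-up tasks (periodic events)
    }
    this.now = target;
  }
}

// Seeded PRNG (mulberry32) so randomized variation is reproducible:
// log the seed, and any failing run can be replayed exactly.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}
```

For example, a “camera” task that reschedules itself every 1000/30 virtual milliseconds delivers exactly 30 frames per simulated second, regardless of wall-clock time.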
Sid Rao / Amazon – Using Machine learning to enrich and analyze real-time communications

Duration: 17:45

Watch if you are

  • A developer Interested in audio quality
  • A product manager that wants to see a competitor’s demo

Key points:

  • This talk is a bit sales-y for the Chime SDK but totally worth it. As a trigger warning, “SIP” gets mentioned. This covers three (and a half) use cases:
    • Packet loss concealment which improves the Opus codec considerably
    • Deriving insights (and value) from sessions, with a focus on 1:1 use-cases such as contact centers or sales calls
    • Identifying multiple speakers from the same microphone (which is not a full-blown use-case but still very interesting)
    • Speech super resolution
  • Packet loss concealment starts at 3:40, describing how Opus as a codec is tackling the improvements that deep neural networks can offer. Much of it is also described in the Amazon blog post and we describe our take on it in WebRTC Insights #63. This is close to home for Philipp obviously:
    • RFC 2198 provides audio redundancy for WebRTC. It was a hell of a fight to get that capability back in WebRTC but it was clear this had some drawbacks. While it can improve quality significantly, it cannot address bigger problems such as burst loss effectively
    • Sending redundant data only when there is voice activity is a great idea. However, libWebRTC has a weird connection between VAD and the RTP marker bit and fixing this caused a very nasty regression for Amazon Chime (in contact centers?) which was only noticed once this hit Chrome Stable. This remains unsolved, as well as easy access to the VAD information in APIs such as Insertable Streams that can be used for encoding RED using Javascript
    • It is not clear how sending redundant audio as part of the same UDP packet makes the WiFi congestion problem worse, as claimed at 4:50 (audio NACK, in contrast, would resend packets)
    • The actual presentation of DRED starts at 5:20 and has a great demo. What the demo does not show is how little bitrate is used compared to just sending x10 the amount of data – which is the true magic of DRED. Whether it is worth it remains to be seen. Applying it to the browser may be hard due to the lack of APIs (we still lack an API to control FEC bandwidth or percentage) but if the browser can decode DRED sent by a server (from Amazon) thanks to the magic tricks in the wire format, that would be a great win already (for Amazon, but maybe for others as well, so we approve)
  • Deriving insights starts at 9:15 and is great at motivating why 1:1 calls, while considered boring by developers, are still very relevant to users:
    • Call centers are a bit special though since they deal with “frequently asked questions” and provide guidance on those. Leveraging AI to automate some of this is the next step in customer support after “playbooks” with predefined responses
    • Transcribing the incoming audio to identify the topic and the actual question does make the call center agent more productive (or reduces the value of a highly skilled customer support agent) with clear metrics such as average call handle time while improving customer satisfaction which is a win-win situation for both sides (and Amazon Chime enabling this value)
  • Identifying multiple speakers from the same microphone (also known as diarisation) starts at 10:55:
    • The problem that is being solved here is using a single microphone (but why limit to that?) to identify different persons in the same room speaking when transcribing. Mapping that to a particular person’s “profile” (identified from the meeting roster) is a bit creepy though. And yet this is going to be important to solve the problem of transcription after the push to return to the office (in particular for Amazon who doubled down on this). The demo itself is impressive but the looks folks give each other…
    • The diversity of non-native speakers is another subtle but powerful demo. Overlapping speakers are certainly a problem but people are less likely to do this unintentionally while being in the same room
    • We are however unconvinced that using a voice fingerprint is useful in a contact center context (would you like your voice fingerprint being taken here), in particular since the caller’s phone number and a lookup based on that has provided enough context for the last two decades
  • Voice uplift (we prefer “speech super resolution”) starts at 14:35. It takes the principle of “super resolution” commonly applied to video (see this KrankyGeek talk) and applies it to… G.711 calls:
    • With the advent of WebRTC and the high quality provided by Opus, we got used to that level of quality, which means we perceive the worse quality of a G.711 narrowband phone call much more – causing fatigue when listening to those calls. While this may not be relevant to WebRTC developers, it is quite relevant to call center agents (whose ears are, on the other hand, not accustomed to the level of quality Opus provides)
    • G.711 reduces the audio bandwidth by narrowing the signal frequency range to [300Hz, 3.4 kHz]. This is a physical process and as such not reversible. However, deep neural networks have listened to enough calls to reconstruct the original signal with sufficient fidelity
    • This feature is a differentiator in the contact center space, where most calls still originate from PSTN offering G.711 narrowband call quality. Expanding this to wideband for contact center agents may bring big benefits to the agent’s comfort and by extension to the customer experience
  • The summary starts at 16:00. If you prefer just the summary so far, listen to it anyway:
    • DRED is available for integration “into the WebRTC” platform. We will see whether that is going to happen faster than the re-integration of RED which took more than a year
Third Q&A

Duration: 19:45

Watch if

  • You found the talks this relates to interesting and want more details

Key points:

  • A lot of questions about open sourcing the stuff that gets talked about
  • Great questions about Opus/DRED, video quality assessment, getting representative network data for Newton (and how it relates to the WebRTC FakeClock)
  • The problem with DRED is that you don’t have just a single model but different models depending on the platform. And you can’t ship all of them in the browser binary…
SESSION 4 Ishan Khot & Hani Atassi / Meta – RSYS cross-platform real-time system library

Duration: 18:15

Watch if you are

  • A software architect that has worked with libWebRTC as part of a larger system

Key points:

  • This is a bit of an internal talk: since we can’t download and use rsys, it is hard to relate to unless you have done your own integration of libWebRTC into a larger system
  • rsys is Meta’s RTC extension of their msys messaging library. It came out of Messenger and the need to abstract the existing codebase and make it more usable for other products. This creates an internal conflict between “we care only about our main use-case” and “we want to support more products” (and we know what Google’s priorities are in WebRTC/Chrome for this…). For example, Messenger made some assumptions about video streams and did not consider screen sharing to be a core feature (as we saw in the keynote, that has changed)
  • You can see the overall architecture at 8:00
  • (lib)WebRTC is just one of many blocks in the diagram; the other two interesting ones are “camera” and “audio”, which relate to the device management modules from the second talk. Loading libWebRTC is done at runtime to reduce the binary size of the app store download
  • The slides that follow are a good description of what you need besides “raw WebRTC” like signaling and call state machines
  • The slides starting at 12:20 focus on how testing is done as well as debuggability and monitoring
  • The four-minute outlook which starts at 14:00 makes an odd point about 50 participants in a call being a challenge
Raman Valia & Shreyas Basarge / Meta – Bringing RTC to the Metaverse

Duration: 22:00

Watch if you are

  • Interested in the Metaverse and what challenges it brings for RTC
  • A product manager that wants to understand how it is different from communication products
  • An engineer that is interested in how RTC concepts like Simulcast are applicable to a more generic “world state” (or game servers as we think of them)

Key points:

  • The Metaverse is not dead yet but we still think it is called Fortnite
  • The distinction between communicating (in a video call) and “being present” is useful as the Metaverse tries to solve the latter and is “always on”
  • Around 5:00 delivering media over process boundaries is actually something where WebRTC can provide a better solution than IPC (but one needs to disable encryption for that use-case)
  • Embodiment is the topic that starts at 7:00. One of the tricky things about the Metaverse is that due to headsets you cannot capture a person’s face or landmarks on it since they are obscured by AR/VR devices
  • The distinction between different “levels” of Avatars, stylized, photorealistic and volumetric at 8:30 is interesting but even getting to the second stage is going to be tough
  • Sharing the world state, discussed at 15:00, is an adjacent problem. It does require systems similar to RTC in the sense that we have media-server-like servers (you might call them game servers) and then need techniques similar to simulcast. Also we have “data channels” with different priorities. And (later on) even “floor control”
  • For the outlook around 20:30 a large concert is mentioned as a use-case. Which has happened in Fortnite since 2019
Fourth Q&A

Duration: 14:10

Watch if

  • You found the talks this relates to interesting and want more details

Key points:

  • rsys design assumptions which led to the current architecture and how its performance gets evaluated. And how they managed to keep the organization aligned on the goals for the migration
  • Never-ending calls in the Metaverse and privacy expectations which are different in a 1:1 call and a virtual concert
Closing remarks

We tried capturing as much as possible, which made this a wee bit long. The purpose, though, is to make it easier for you to decide which sessions to focus on, and even which parts of each session.

Oh – and did we mention you should check out (and subscribe to) our WebRTC Insights service?

The post RTC@Scale 2023 – an event summary appeared first on BlogGeek.me.

What exactly is a WebRTC media server?

bloggeek - Mon, 04/24/2023 - 13:00

A WebRTC media server is an optional component in a WebRTC application. That said, in most common use cases, you will need one.

There are different types of WebRTC servers. One of them is the WebRTC media server. When will you need one, and what exactly does it do? Read on.

Oh – and if you’re looking to dig deeper into WebRTC media servers, make sure to check the end of this article for an announcement of our latest WebRTC course

Servers in WebRTC

There are quite a few moving parts in a WebRTC application. There’s the client device side, where you’ll have the web browsers with WebRTC support and maybe other types of clients like mobile applications that have WebRTC implementations in them.

And then there are the server side components and there are quite a few of them. The illustration above shows the 4 types of WebRTC servers you are likely to need:

  • Application servers where the application logic resides. Unrelated directly to WebRTC, but there nonetheless
  • Signaling servers used to orchestrate and control how users get connected to one another, passing WebRTC signaling across the devices (WebRTC has no signaling protocol of its own)
  • TURN (and STUN) servers that are needed to get media routed through firewalls and NATs. Not all the time, but frequently enough to make them important
  • WebRTC media servers processing and routing WebRTC media packets in your infrastructure when needed

The illustration below shows how all of these WebRTC servers connect to the client devices and what types of data flows through them:

What is interesting, is that the only real piece of WebRTC infrastructure component that can be seen as optional is the WebRTC media server. That said, in most real-world use-cases you will need media servers.

The role of a WebRTC media server

At its conception, WebRTC was meant to be “between” browsers. Only recently did the good people at the W3C see fit to change it to something that can also work in servers. We’ve known that to be the case all along

What does a WebRTC media server do exactly? It processes and routes media packets through the backend infrastructure – either in the cloud or on premise.

Let’s say you are building a group calling service and you want 10 people to be able to join in and talk to each other. For simplicity’s sake, assume we want to get 1Mbps of encoded video from each participant and show the other 9 participants on the screen of each of the users:

How would we go about building such an application without a WebRTC media server?

To do that, we will need to develop a mesh architecture:

We’d have the clients send out 1Mbps of their own media to all the other participants who wish to display them on their screen. This amounts to 9*1Mbps = 9Mbps of upstream data that each participant will be sending out. Each client receives streams from all 9 other participants, getting us to 9Mbps of downstream data.

This might not seem like much, but it is. Especially when sent over UDP in real time, when we need to encode and encrypt each stream separately for each user, and when we need to run bandwidth estimation across the network. Even if we reduce the requirement from 1Mbps to a lower bitrate, this is still a hard problem to deal with and solve.

It becomes devilishly hard (impossible?) when we crank up the number to, say, 50 or 100 participants. Not to mention the numbers we see today of 1,000 or more participants in sessions (either active participants or passive viewers).
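The arithmetic above can be captured in a small helper, useful for sizing exercises. This is back-of-the-envelope only: it ignores audio, RTCP overhead and simulcast, and the function name is ours.

```javascript
// Back-of-the-envelope bandwidth math for an n-way call at `bitrate`
// Mbps per video stream, comparing mesh (no media server) with an SFU.
function perClientBandwidth(n, bitrate, topology) {
  const others = n - 1;
  if (topology === 'mesh') {
    // Send a copy to every peer, receive one stream from each.
    return { up: others * bitrate, down: others * bitrate };
  }
  if (topology === 'sfu') {
    // Send one stream to the server, which fans it out; still receive
    // one stream per remote participant (ignoring simulcast savings).
    return { up: bitrate, down: others * bitrate };
  }
  throw new Error(`unknown topology: ${topology}`);
}
```

For the 10-person, 1Mbps example above this gives 9Mbps up / 9Mbps down per client in a mesh, versus 1Mbps up / 9Mbps down with an SFU – the upstream saving is what makes group calls practical on consumer connections.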

Enter the WebRTC media server

This is where a WebRTC media server comes in. We will add it here to be able to do the following tasks for us:

  • Reduce the stress on the upstream connection of clients
    • Now clients will send out fewer media streams to the server
    • The server will be distributing the media it receives to other clients
  • Handle bandwidth estimation
    • Each client takes care of bandwidth estimation in front of the server
    • The server takes care of the whole “operation”, understanding the available bandwidth and constraints of all clients

Here’s what’s really going on and what we use these media servers for:

WebRTC media servers bridge the gaps in the architecture that we can’t solve with clients alone

How is a WebRTC media server different from TURN servers

Before we continue and dive in to the different types of media servers, there’s something that must be said and discussed:

WebRTC media server != TURN server

I’ve seen people try to use the TURN server to do what media servers do. Usually that would be things like recording the data stream.

This doesn’t work.

TURN servers route media through firewalls and NAT devices. They aren’t privy to the data being sent through them. WebRTC privacy is maintained by having data encrypted end to end when passing via TURN servers – the TURN servers don’t know the encryption key so can’t do anything with the media.

WebRTC media servers are implementations of WebRTC clients in a server component. From an architectural point of view, the “session” terminates in the WebRTC media server:

A WebRTC media server is privy to all data passing through it, and acts as a WebRTC client in front of each of the WebRTC devices it works with. It is also why it isn’t so well defined in WebRTC but at the same time so versatile.

Types of WebRTC media servers

This versatility of WebRTC media servers means that there are different types of such servers. Each one works under different architectural assumptions and concepts. Let’s review them quickly here.

Routing media using an SFU

The most common and popular WebRTC media server is the SFU.

An SFU routes media between the devices, doing as little as possible when it comes to the media processing part itself.

The concept of an SFU is that it offloads much of the decision making of layout and display to the clients themselves, giving them more flexibility than any other alternative. At the same time, it takes care of bandwidth management and routing logic to best fit the capabilities of the devices it works with.

To do all that, it uses technologies such as bandwidth estimation, simulcast, SVC and many others (things like DTX, cascading and RED).

At the beginning, SFUs were introduced and used for group calls. Later on, they started to appear as live streaming and broadcast components.
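The per-receiver decision an SFU makes with simulcast can be sketched in a few lines: the sender uploads several encodings of the same video, and the server forwards the highest layer that fits each receiver’s estimated bandwidth. The layer bitrates below are illustrative values we picked, not a standard.

```javascript
// Sketch of per-receiver simulcast layer selection in an SFU.
// Layers are sorted ascending by bitrate; values are illustrative only.
const LAYERS = [
  { rid: 'low',  kbps: 150 },
  { rid: 'mid',  kbps: 500 },
  { rid: 'high', kbps: 1500 },
];

function pickLayer(availableKbps, layers = LAYERS) {
  let chosen = null;
  for (const layer of layers) {
    if (layer.kbps <= availableKbps) chosen = layer; // keep highest that fits
  }
  return chosen; // null if even the lowest layer does not fit
}
```

Real SFUs combine this with the receiver’s requested resolution, temporal layers (when available) and hysteresis to avoid flapping between layers.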

Mixing media with an MCU

Probably the oldest media server solution is the MCU.

The MCU was introduced years before WebRTC, when networks were limited. Telephony systems had/have voice conferencing bridges built around the concept of MCUs. Video conferencing systems required the use of media servers simply because video compression required specialized hardware and later too much CPU from client devices.

In telephony and audio, you’ll see this referred to as mixers or audio bridges and not MCUs. That said, they still are one and the same technically.

What an MCU does is receive and mix the media streams coming from the various participants, sending a single mixed stream towards each client. For clients, an MCU looks like a call between 2 participants – it is the only entity the client really interacts with directly. This means there’s a single audio and a single video stream coming into and going out of the client – regardless of the number of participants and how/when they join and leave the session.

MCUs were less used in WebRTC from the get go. Part of it was the simple economies of scale – MCUs are expensive to operate, requiring a lot of CPU power (encoding and decoding media is expensive). It is cheaper to offer the same or similar services using SFUs. There are vendors who still rely on MCUs in WebRTC for group calling, though in most cases, you will find MCUs providing the recording mechanism only – where what they end up doing is taking all inputs and mixing them into a single stream to place in storage.
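The “mixing” an MCU audio bridge performs boils down to summing PCM samples, minus the target participant’s own audio (so nobody hears an echo of themselves). A naive sketch – real mixers also resample, align timing and usually mix only the loudest few streams:

```javascript
// Core of an MCU audio bridge: sum 16-bit PCM samples of every
// participant except the one we are sending to, clamping to the
// valid sample range instead of letting the integers wrap.
function mixForParticipant(streams, excludeIndex) {
  const length = streams[0].length; // assume equal-length, aligned frames
  const mixed = new Int16Array(length);
  for (let s = 0; s < streams.length; s++) {
    if (s === excludeIndex) continue; // skip the receiver's own audio
    for (let i = 0; i < length; i++) {
      const sum = mixed[i] + streams[s][i];
      mixed[i] = Math.max(-32768, Math.min(32767, sum)); // clamp, don't wrap
    }
  }
  return mixed;
}
```

Doing this decode-mix-re-encode cycle for every participant, for every frame, is exactly the CPU cost that makes MCUs expensive to operate compared to SFUs.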

Bridging across standards using a gateway

Another type of media server that is used in WebRTC is a gateway.

In some cases, content – rendered, live or otherwise – needs to be shared in a WebRTC session – or a WebRTC session needs to be shared on another type of a protocol/medium. To do so, a gateway can be used to bridge between the protocols.

The two main cases where these happen are probably:

  1. Connecting surveillance cameras that don’t inherently support WebRTC to a WebRTC application
  2. Streaming a WebRTC session into a social network (think Twitch, YouTube Live, …)
The hybrid media server

One more example is a kind of a hybrid media server. One that might do routing and processing together. A group calling service that also records the call into a single stream for example. Such solutions are becoming more and more popular and are usually deployed as multiple media servers of different types (unlike the illustration above), each catering for a different part of the service. Splitting them up makes it easier to develop, maintain and scale them based on the workload needed by each media server type.

Cloud rendering

This might not be a WebRTC media server per se, but for me this falls within the same category.

Sometimes, what we want is to render content in the cloud and share it live with a user on a browser. This is true for things like cloud gaming or cloud application delivery (Photoshop in the cloud for hourly consumption). In such a case, this is more like a peer-to-peer WebRTC session taking place between a user on a browser and a cloud server that renders the content.

I see it as a media server because many aspects of developing and scaling the cloud rendering components are more akin to how you’d think about WebRTC media servers than about browser or native clients.

A quick exercise: What WebRTC media servers are used by Google Meet?

Let’s look at an example service – Google Meet. Why Google Meet? Well, because it is so versatile today and because if you want to trace capabilities in WebRTC, the best approach is to keep close tabs on what Google Meet is doing.

What WebRTC media servers does Google Meet use? Based on the functionality it offers, we can glean the types that make up this service:

  • Supports large group meetings – this is where SFU servers are used by Google Meet to host and orchestrate the meeting. Each user has a different layout during the same session and can flexibly control what they view
  • Recording meetings – Google Meet recordings show a single participant/screen share and mix all audio streams. For the audio this means using an MCU server, and for the video this is more akin to a switching SFU server (always picking a single video stream out of those available, and not aiming for a “what you see is what you get” kind of recording)
  • Connect to YouTube Live – here, they connect Google Meet and YouTube Live using an RTMP gateway in real-time, instead of storing the media in a file as is done when recording
  • Dialing in from regular telephones – this one requires a gateway server bridging to the telephone network as well as an MCU to mix the audio into the meeting
  • Cloud based noise suppression – Google decided to implement noise suppression in Google Meet using servers. This requires an SFU/bridging gateway to connect to servers that process the media in such a way
  • Cloud based background removal – for low performance devices, Google Meet also runs background removal in the server, and like noise suppression, this requires an SFU/bridging gateway

A classic meeting service in WebRTC may well require more than a single type of WebRTC media server, likely deployed in hybrid mode across different hardware configurations.
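The feature-to-server mapping above can be summarized as data. The pairings follow the list; the dictionary structure and names are my own illustration, not Google’s architecture documentation:

```python
# Illustrative mapping of Google Meet features to the WebRTC media
# server types discussed in this post.
MEET_FEATURES = {
    "large group meetings": {"SFU"},
    "recording": {"MCU (audio mix)", "switching SFU (video)"},
    "YouTube Live streaming": {"RTMP gateway"},
    "telephone dial-in": {"PSTN gateway", "MCU (audio mix)"},
    "cloud noise suppression": {"SFU/bridging gateway"},
    "cloud background removal": {"SFU/bridging gateway"},
}

# Distinct server types needed to build the whole service:
all_types = set().union(*MEET_FEATURES.values())
print(sorted(all_types))
```

Even in this toy summary, a single “meeting service” already spans SFU, MCU and gateway roles – which is the point of the exercise.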

When will you need a WebRTC media server?

As we’ve seen earlier, the answer to this is simple – when doing things with WebRTC clients only isn’t possible and we need something to bridge this gap.

We may lack:

  • Bandwidth on the client side, so we alleviate that by adding WebRTC media servers
  • CPU, memory or processing power, so we delegate the work to the cloud
  • The ability to run certain machine learning algorithms on the client, where running them in cloud services may make more sense (due to CPU, memory, availability of training data, speed, certain AI chips, …)
  • A bridge between WebRTC and components that don’t use WebRTC, such as telephony systems, surveillance cameras, social media streaming services, etc
  • The media itself on our servers – so we record the sessions (we can also do this without a WebRTC server, but there will be a media server in the cloud there nonetheless)

What I usually do when analyzing the needs of a WebRTC application is to find these gaps and determine if a WebRTC media server is needed (it usually is). I do so by thinking of the solution as a P2P one, without media servers. And then based on the requirements and the gaps found, I’ll be adding certain WebRTC media server elements into the infrastructure needed for my WebRTC application.
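This gap analysis can be sketched as a naive rule-of-thumb function. The requirement names and the rules are my own illustration of the thought process, not an exhaustive decision procedure:

```python
def needed_media_servers(requirements: set) -> set:
    """Map application requirements to WebRTC media server types.

    Start from the assumption of a pure P2P solution, then add a
    server type for each gap the requirements expose.
    """
    servers = set()
    if "group_calls" in requirements:
        servers.add("SFU")       # client bandwidth/CPU won't scale in a mesh
    if {"recording", "mixed_audio"} & requirements:
        servers.add("MCU")       # a single mixed stream must exist somewhere
    if {"pstn", "rtmp", "rtsp_camera"} & requirements:
        servers.add("gateway")   # bridging to non-WebRTC protocols
    return servers

print(needed_media_servers({"group_calls", "recording", "pstn"}))
```

A 1:1 video call with no recording comes back empty – the P2P case where no media server is needed at all.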

E2EE and WebRTC media servers

We’ve seen a growing interest in privacy in recent years. The internet has shifted to encryption-first connections, and WebRTC offers only encrypted media. This shift towards privacy started as privacy from malicious actors on the public internet, but has since extended to privacy from the service provider itself.

Running a group meetings service through a service provider that cannot itself access the meeting’s content is becoming more commonplace.

This capability is known as E2EE – End to End Encryption.

When introducing WebRTC media servers into the mix, this means that while they are still part of the session and terminate WebRTC peer connections (=terminating encrypted SRTP streams) on their own, they shouldn’t have access to the media itself.

This can be achieved only in the SFU type of WebRTC media servers by the use of insertable streams. With it, the application logic can exchange private encryption keys between the users and have a second encryption layer that passes transparently through the SFU – enabling it to do its job of packet routing without the ability to understand the media content itself.
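The idea behind that second encryption layer can be illustrated with a toy sketch. The XOR “cipher” below stands in for real frame encryption and is emphatically not cryptography you should use; the point is the shape of the pipeline – the SFU can still read the routing metadata, but the payload passing through it is opaque:

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy keystream 'cipher' - a stand-in for real frame encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Key exchanged between the endpoints by the application, never given to the SFU.
e2ee_key = secrets.token_bytes(16)

def sender_transform(payload: bytes) -> bytes:
    # Runs on the sending client, on top of WebRTC's own SRTP encryption.
    return xor_bytes(payload, e2ee_key)

def sfu_route(header: dict, payload: bytes):
    # The SFU reads metadata to do its packet routing job...
    assert "seq" in header and "ssrc" in header
    # ...but forwards the payload untouched, unable to decode the media.
    return header, payload

def receiver_transform(payload: bytes) -> bytes:
    return xor_bytes(payload, e2ee_key)

frame = b"encoded video frame"
_, routed = sfu_route({"seq": 1, "ssrc": 42}, sender_transform(frame))
assert receiver_transform(routed) == frame
```

In the browser, the transform functions would be attached to the peer connection via the insertable streams API; here they are plain functions to keep the sketch self-contained.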

WebRTC media servers and open source

Another important aspect to understand about WebRTC media servers is that most of those using media servers in WebRTC do so using open source frameworks for media servers.

I’ve written at length about WebRTC open source projects – you will find details there about the market state and about open source WebRTC media servers.

What is important to note is that, more often than not, projects that don’t use managed services for their WebRTC media servers pick an open source WebRTC media server to work with rather than developing their own from scratch. This isn’t always the case, but it is quite common.

Video APIs, CPaaS and WebRTC media servers

WebRTC Video API and CPaaS is another area I cover quite extensively.

Vendors who decide to use CPaaS for their WebRTC application will mainly do so in one of two situations:

  1. They need to bridge audio calls to PSTN to connect them to regular telephony
  2. There’s a need for a WebRTC media server (usually an SFU) in their solution

Both cases require media servers…

This leads to the following important conclusion: there’s no such thing as a CPaaS vendor doing WebRTC that isn’t offering a managed WebRTC media server as part of its solution – and if there is, then I’ll question its usefulness for most potential customers.

Taking a deep dive into WebRTC protocols

Last year, I released the Low-level WebRTC protocols course along with Philipp Hancke.

The Low-level WebRTC protocols course has been a huge success, which is why we’re starting to work on our next course in this series: Higher level WebRTC protocols

Before we go about understanding WebRTC media servers, it is important to understand the inner workings of the network protocols that WebRTC employs. Our low-level protocols course covers the first part of the underlying protocols. This second course looks at the higher level protocols – the parts that deal a bit more with network realities: challenges brought on by packet losses as well as other network characteristics.

Things we cover here include retransmissions, forward error correction, codec packetization and a myriad of media processing algorithms.
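As a taste of the forward error correction part: the simplest scheme sends an XOR parity packet over a group of packets, letting the receiver recover any single lost packet without a retransmission round trip. This is a toy sketch of the principle (equal-length packets assumed), not WebRTC’s actual FEC payload formats:

```python
def xor_parity(packets):
    """Compute an XOR parity packet over a group of equal-length packets.

    The same function also recovers a missing packet: XOR the surviving
    packets together with the parity and the lost one falls out.
    """
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

packets = [b"pkt1", b"pkt2", b"pkt3"]
parity = xor_parity(packets)

# Suppose packets[1] is lost in transit - recover it from the rest:
recovered = xor_parity([packets[0], packets[2], parity])
assert recovered == packets[1]
```

The cost is one extra packet of bandwidth per group; the benefit is recovery latency independent of the round-trip time, which is why real-time media leans on FEC where streaming protocols would simply retransmit.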

Want to be the first to know when we open our early bird enrollment?

Join the waiting list

The post What exactly is a WebRTC media server? appeared first on

