WebRTC has its place in surveillance and security applications. It isn’t core to these industries, but it is critical in many deployments.
Surveillance has become near and dear to my heart. I had a few vendors consult with me in the past. There are a few using testRTC. And then there’s the personal level. The system we have in our apartment building.
This got me to think quite a lot about WebRTC in surveillance tech lately.
I live in an apartment building here in Israel:
2 main entrances (and another side one)
3 levels of underground parking
And yes. We have a surveillance camera system. Like all of the other apartment buildings in my neighborhood.
[Image: The view from my apartment on a nice day]
A year ago, I was in charge of the vendor selection and upgrade process of our cameras. We switched from an analog system to a hybrid analog/IP one.
This month, we’re looking into upgrading an elevator camera to an IP one, as well as adding WiFi to our underground parking. While chatting with one of the vendors we’re reaching out to, I found he was fascinated with my work on WebRTC and the potential of using it for application-less viewing of cameras.
I’ve had my share of meetings and dealings with vendors building different types of surveillance and security solutions, from private security solutions to large scale, enterprise visual intelligence ones. Obviously, these interactions revolved around WebRTC.
I am not an expert in surveillance, so take the market overview with a grain of salt.
That said, I do know my way with WebRTC and where it fits nicely.
Here are some of the things I learned over the years.
Security and surveillance use cases in WebRTC
I’ll start with the obvious – cameras, security and surveillance have multiple use cases. Some of them can be seen as classic to this domain while others are slightly newer or a specialized niche. Each of these use cases is a world unto its own with its requirements from WebRTC and the types of solutions emerging in it.
Small scale / cheap multiple surveillance cameras
This is where I’d frame my own experience of our apartment building. A system that requires 32 or fewer video cameras, spread across the location, connected to a DVR (Digital Video Recorder) or an NVR (Network Video Recorder).
In essence, you go install the cameras in sensitive locations, wire them up (with an analog cable, IP or even wireless) to the media server that is located onsite as well. That media server is a DVR if it is a closed loop system or an NVR if you’re living in modern times. I’ll just refer to these two as xVR from here on.
Once there, you hook’em up to a local monitor that nobody goes and looks at, as well as let the owner connect remotely from his PC or mobile phone.
Is WebRTC needed here? Not really.
Surveillance cameras today use RTP (and sometimes also RTSP). These are the new ones. Old ones are pure analog. They connect to that xVR media server, which handles them quite well today. It did so also before WebRTC came to our lives. The user then accesses the system to play the videos remotely using a dedicated application, which again, existed before WebRTC.
Since there’s no specific requirement to access this through a web browser, the use of WebRTC here is questionable.
You might say WebRTC would make things easier, but hey – if it ain’t broken, don’t fix it.
These solutions are purchased from local vendors that install such systems. The buyer will usually reach out to an installer that will pick and choose the cameras and the surveillance system for the buyer. The buyer cares less about the technology and more about the local vendor’s ability to install and maintain the system when needed.
Enterprise / large scale surveillance
Large scale surveillance systems for enterprises are more of the same as the small scale ones, but with a few main differences:
The two things that are making headway in this industry?
Like the small scale solutions, here too the buyer will look for local installers. These will be the local integrators who bring the systems and install them. At times, the decision of brand will come from the buyer, though this is less likely. It is important to remember that a considerable part of the cost goes towards the setup and installation and not necessarily to the cost of the equipment itself.
Personal/home surveillance
This one is the residential one. It is a B2C space where the buyer is a person buying a camera for his own home security. The decision is made on price or brand mostly.
Here you’ll also find solutions that make use of old smartphones and tablets as cameras, or something like the one we purchased a few years back when our kids were younger:
[Image: A digital peephole camera]
It gave them the ability to see who was outside our door when they were shorter.
Here too, the market is going into multiple directions:
Where does WebRTC play here? It might make things smoother to develop for the companies, but this doesn’t seem to be the case.
One thing that goes through all use cases above is the existence of another solution – the video doorbell. Taken into buildings, this becomes an intercom system, which again – can make use of WebRTC. And why? Because it needs bidirectional support for audio at the very least, making WebRTC a suitable alternative.
Personal security
A totally different niche is the one of personal security.
This manifests itself as apps (and services) people can use to increase their security while going about in their daily tasks. Some of these apps connect you to friends and family while others to personal security agents. The WebRTC requirement here is the same for all cases – be able to conduct voice and video calls in real time.
Taken more broadly from the personal level, the same can be implemented in campuses, cities, events, etc.
Unique (?) challenges for WebRTC with camera hardware
There are some unique challenges for WebRTC when it comes to the surveillance space, and that’s mostly a matter of hardware.
Most of these issues won’t plague a software solution. But here, we end up in the real world simply because someone needs to go and install the physical cameras.
When figuring out the hardware platform to use, it is important to think of future trends and technology improvements that affect your implementation.
In the case of surveillance, there’s WebRTC, future video codecs (AV1) and machine learning in the vision domain to think about. Probably also programmable photography that has been bringing innovations to smartphones for a few years now.
Ingress, egress and the concept of real time
Where to place WebRTC in the solution?
Since I write a lot about WebRTC, and this article is mostly about WebRTC in surveillance markets, it is THE biggest question to answer here.
There are two different places, and both are suitable, but not necessarily together in the same system.
Surveillance needs real time. Sometimes.
In our own residential building, I seldom care about the live feed from the cameras. It is to check if the front door to the building is open or not, or if there’s some area that got dirty (usually dog pee). Then most of the time is spent rewinding to figure out who caused the problem. Nothing here is considered real time in nature or requires sub second latency.
Elsewhere, real time might be critical on the viewer side (egress), which brings with it the question of whether WebRTC fits here well.
Web cameras that directly stream out WebRTC to the world (or the xVR). Is that a benefit? What’s the value of it versus the existing camera technologies used?
I am not quite for or against this, as I am not really sure here. I’d say that a benefit here can be in the fact that it makes the whole technology stack simpler if you end up using WebRTC end-to-end instead of needing to switch protocols from the camera to the viewer. Just remember here that rewind and playback will likely require something other than WebRTC.
The main advantage of WebRTC here might be the removal of the need to transcode and translate across protocols and codecs. It makes xVR software simpler to write and reduces a lot of its CPU requirements, making the systems lighter and cheaper (the xVR – not the camera itself).
One more thing to think of is cameras that also require bidirectional audio. Because a security guard wants to announce or warn perpetrators, or because this is a video doorbell. There, WebRTC fits nicely, though again – not mandatory (I’d still try using it there more than elsewhere).
Going to introduce WebRTC to a surveillance system? Great. Check first where exactly within the whole architecture WebRTC fits and ask yourself why.
Mobile or desktop?
Another important aspect of a surveillance system is where people go to watch the videos.
When we installed our own system, we were told that the mobile app is better than the PC app. In both, these were applications. But somehow for the consumers, it meant using the smartphone. It sucks. But yes – it sucks more on the desktop. Which is crazy, considering that what you’re trying to do is watch output coming from 4K cameras in order to identify people.
Then again, who is your customer?
If this is a large enterprise, where there’s going to be a fancy video wall of video feeds with a bored security guard looking at it, then should this be an application or would it be preferable to use a web application for it, with the help of WebRTC? It seems that much of the industry on the client side is looking for lightweight solutions that require less software installations, favoring browsers and… WebRTC.
And if you’re already doing WebRTC for one egress destination, you can use it for all others – browser and app based.
One more thing to consider – it is easier today to develop a web application than it is a native PC application. Cheaper and faster. Which means that supporting WebRTC if the desktop is your primary viewing device might be the right decision to make.
See if there’s a strong need for zero-install or desktop viewing. This might well lead you towards WebRTC on the egress side.
The age of Artificial Intelligence in surveillance tech
The biggest driver in this industry is machine learning and artificial intelligence. And not necessarily the Generative AI kind, but rather the kind that deals with object classification.
The challenge with surveillance is watching the damn cameras. You need eyeballs on screens. The good old motion detection removes a lot of noise (or more accurately, static), but it leaves much to be desired.
One of the elevators in my building, along with the video you get most hours of the day – empty. The bar at the bottom with the blue stripes marks when there’s actual movement.
Using machine learning, it will be easier to search for dogs, people, colors, items and other tidbits to figure out times of interest in the thousands of hours of boring videos, as well as act as “Google search” on recorded video feeds.
Doing all that in the cloud is possible, but expensive and tedious – how do you ship all the video, decode it, process it again, etc.
Doing it on the edge, on the device itself (the camera or the xVR) is preferable, but requires new hardware, which means another technology leap and refresh.
WebRTC isn’t core for surveillance but it is critical
This is something to remember.
WebRTC isn’t core to surveillance. You don’t really need it to get surveillance cameras working, installed or connected to their xVR media servers. You don’t even need it to view videos – either “live” or as playback.
But, and that’s a big one – in some cases, having WebRTC is critical. Because your customer may want to be able to use web browsers and install nothing. He may want to be able to get bidirectional media. There might be a need to get video feeds that are at sub second latencies.
For these, WebRTC might not be a core competency, but it is critical to the successful delivery and deployment of your product. This translates into a need to have that skill set in your team or be able to outsource it to someone with that skill set.
Where can I help, if at all?
Online WebRTC courses, to skill up engineers on this technology
The post Fitting WebRTC in the brave new world of webcams, security, surveillance and visual intelligence appeared first on BlogGeek.me.
How to think and plan for CPaaS vendor lock-in when it comes to your WebRTC application implementation.
How can/should CPaaS vendors compete on winning customers? More than that, how can/should CPaaS vendors poach customers from other CPaaS vendors?
What prompted this article is the various techniques CPaaS vendors use and what they mean to customers – how should customers react to these techniques. I’ll focus on the Video API part of CPaaS – or to be more specific, the part that deals with WebRTC implementation.
For me CPaaS (or Communication Platform as a Service) is a service that lets companies build their own communication experiences in a flexible manner. Usually done via APIs and requires developers, but recently, also via lowcode/nocode interactions (such as embedding an iframe).
A CPaaS vendor ends up defining its own interface of APIs which its customers use to create these communication experiences.
That API interface is proprietary. There is no standard specification for how CPaaS APIs need to look or behave. This means that if you used such an API, and you want to switch to another CPaaS vendor – you’re going to need to do all that integration work all over again.
Think of it like switching from an Android phone to an iPhone or vice versa:
In a way, you want the same experience (only better), but there’s going to be a learning curve and an adaptation curve where you familiarize yourself with the new CPaaS vendor and “make yourself at home”.
The vendor lock-in part is how much effort and risk will you need to invest and overcome in order to switch from one vendor to another – to call that other vendor your new home.
Vendor lock-in has 3 aspects to it in CPaaS:
Vendor lock-in is scary. Not because of the technical effort involved but because of the risks from the unknowns. The more years and the more interfaces, scenarios and code you have running on a CPaaS vendor, the higher the lock-in and the migration risk you face.
The innovation in WebRTC that CPaaS is “killing”
Before WebRTC, we had other standards. RTP and RTCP came a lot before WebRTC.
We had RTMP, RTSP, SIP and H.323.
The main theme of all these standard specifications was that their focus has always been on standardizing what goes on over the network. They didn’t care or fret about the interface for the developer. The idea behind this was to enable using the standard on whatever hardware, operating system and programming language. Just read the spec and implement it any way you like.
WebRTC changed all that (ignoring Flash here). We now have a specification where the API interface for the developer of a web application is also predefined.
Here’s how I like explaining it in my slides:
One of the main advantages of WebRTC is that a developer who uses WebRTC in one project for one company can relatively easily switch to implement a different WebRTC project for another company. (that’s not really correct, but bear with me a little here)
We now could think of WebRTC just like other technologies – someone proficient in WebRTC is “comparable” to someone who worked with Node.js or SQL or other technologies. Whereas working with SIP or H.323 begs the question – which framework or implementation was used – learning a new one has its own learning curve.
And now the WebRTC API interface is no longer relevant. The CPaaS vendor’s SDK has its own interface indicating how things get done. And these may or may not bear any resemblance to the WebRTC API. Moreover – it might even try very hard to hide the WebRTC stack implementation from the developer.
This piece of innovation, where a developer using WebRTC can jump into new code of another project quickly is gone now. Because the interfaces of different CPaaS vendors aren’t standardized and don’t adhere to the standard WebRTC API interface (and they shouldn’t be – it isn’t because they are mean – it is because they offer a higher level of abstraction with more complex and complete functionality).
Not having the same interface across CPaaS vendors is one of the reasons we’ve started down this rabbit hole of exploring what CPaaS vendor lock-in is exactly.
CPaaS vendor poaching techniques and how to react to them
Every so often, you see one or more CPaaS vendors trying to grab a bit more market share in this space. Sometimes, it is about enticing customers who want to start using a CPaaS vendor. Other times it is focused on trying to poach customers from other CPaaS vendors.
When looking at the latter, here are the CPaaS vendor poaching techniques I’ve seen, how effective they are, and what you as a target company should think about them.
#1 – Feature list comparisons
The easiest technique to implement (and to review) is the feature list comparison.
In it, a CPaaS vendor would simply generate and share a comparison table of how its feature set is preferable over the popular alternatives.
For a company looking to switch, this would be a great place to start. You can skim through the feature list and see exactly what’s there in the platform you are currently using and the one you are thinking of switching to.
When looking at such a list, remember and ask yourself the following questions:
I’ve had my fair share of reading, writing and responding to comparison tables. A long time ago (pre-WebRTC), we received inputs that our competitor can do almost 10 times the number of concurrent calls we are able to do with much higher throughput. Obviously, we created a task force to deal with it. The conclusion was simple – the competitor didn’t measure the network time at all – just CPU time in the machine. We weren’t measuring the same thing and his choice of metric meant he always looked better.
Your role in this? To read between the lines and understand what wasn’t written. Always remember that this isn’t an objective comparison – it is highly skewed towards the author of it (otherwise, he wouldn’t be publishing it).
#2 – Performance comparisons
Here the intent of the CPaaS vendor is to show that his platform is superior in its performance. It can offer better quality, at lower bitrates and CPU use for larger groups.
If a vendor does it on his own, then potential customers will immediately view the results as suspect. This is why most of them use third party objective vendors to do these performance comparisons for them (at a cost).
We’ve done this at testRTC a couple of times – some publicly shared (for this one, I’ve placed my own reputation and testRTC’s reputation on the frontline, insisting not to name the other vendors) and others privately done. It is a fun project since it requires working towards a goal of figuring out how different CPaaS vendors behave in different scenarios.
Zoom did this as well, comparing itself to other CPaaS vendors. Agora answered in kind with a series of posts comparing themselves back to Zoom (where Zoom didn’t look as shiny).
Just remember a few things when reading such comparisons:
In the end, the fact that a CPaaS vendor performs better than another in a scenario you don’t need says nothing for you. Make sure to give more weight to the results of actual scenarios relevant to you, and be sure you understand what is really being compared.
#3 – Guides, how-to’s and success stories
How do you ease the migration of a customer from a different CPaaS vendor to your own? You write a migration document about it. A guide. Or a how-to. Or you get a testimonial or a success story from a customer willing to share publicly that he migrated and how life is so much better for him now.
These are mainly targeted at raising the confidence level for those who are contemplating switching, signaling them that the process isn’t risky and that others have taken this path successfully already.
As someone thinking of moving from one vendor to another, I’d seriously consider reaching out to the CPaaS vendor and asking the hard questions:
Anecdotes and recipes are nice. What you are after is having more data points.
Read these guides and success stories. Try reading between the lines in them. Check if you have any open questions and then ask these questions directly. Gather as much information as you can to get a clearer picture.
#4 – Reference applications
I wasn’t sure if this fits for migrating customers because it is a bit broader in nature. But here we are.
In many cases, CPaaS vendors have reference applications available. Usually hosted on github. Just pull the code, compile, host and run it. You get an app that is “almost” ready for deployment.
You see how easy that was? Think how easy it is going to be to migrate to us with this great reference.
Remember a few things here:
From my point of view, reference apps are nice to get a taste of what’s possible and how the API of a CPaaS vendor gets used. But that’s about it. They are unlikely to be useful during the migration process itself.
#5 – Shims and adaptors
They say imitation is the highest form of flattery. If that is true, then shims and adapters would fit well here.
In CPaaS, the most common one was supporting TwiML (that’s Twilio’s XML “language” for actions on telephony events). There’s also the idea/intent of having the whole API interface of another CPaaS vendor (or parts of it) supported directly by the poacher. The purpose of which is to make it easy to switch over.
Clearing things up a bit:
The result? If you’re using vendor A, theoretically, you can take the shim created by vendor B and magically, without any investment, migrate to vendor B. Problem solved.
While this looks great on paper, I am afraid it has little chance of holding up in the real world. Here’s why:
The thing is that using a shim still means a ton of testing and headaches, and of the kind that is hard to overcome.
If I had to switch between vendors, I’d ignore such shims altogether. For me they’re more of a trap than anything else.
Someone suggesting you use their shim for switching over to their CPaaS? Ignore them and just analyze what needs to be done as if there’s no shim available. You’ll thank me later.
Build vs Buy – my first preference is ALWAYS buy (=CPaaS)
We’ve seen 5 different techniques CPaaS vendors use to try and poach customers from one another. For the most part, they are of the type of “buyers beware”. And yet, we do need to migrate from time to time from one CPaaS vendor to another. Market dynamics might force us to do so or just the need to switch to a better platform or offering.
Does that mean it would be best to go it alone and build your own platform instead of using a third party CPaaS vendor?
Vendor lock-in isn’t necessarily a bad thing. My first preference is always to adopt a CPaaS vendor. And if not to adopt one, then to articulate very clearly why the decision to build is made.
What should you do when you start using a CPaaS vendor to make the transition to another vendor (or to your own platform) smoother in the distant future? Here are a few things to consider.
The post Solving CPaaS vendor lock-in (as a customer and as a CPaaS vendor) appeared first on BlogGeek.me.
Open Broadcaster Software, or OBS, is an extremely popular open-source program used for streaming to broadcast platforms and for local recording. WebRTC is the open-source real time video communications stack built into every modern browser and used by billions for their regular video communications needs. Somehow these two have not formally intersected – that is […]
How do you choose the right architecture for a WebRTC audio conferencing service?
Last month, Lorenzo Miniero published an update post on work he is doing on Janus to improve its AudioBridge plugin. It touched a point that I failed to write about for a long time (if at all), so I wanted to share my thoughts and views on it as well.
I’ll start with a quick explanation – Lorenzo is adding to Janus a lot of layers and flexibility that is needed by developers who are taking the route of mixing audio in WebRTC conferences. What I want to discuss here is when to use audio mixing and when not to use it. And as everything else, there usually isn’t a clear cut decision here.
Group calls in WebRTC can take different shapes and sizes. For the most part, there are 3 dominant architectures for WebRTC multiparty calling: mesh, mixing and routing.
I’ll be focusing on mixing and routing here since they scale well to hundreds of users or more.
Let’s start with the basics.
Assume there’s a conversation between 5 people. Each of these people can speak his mind and the others can hear him speaking. If all of these people are remote from each other and we now need to model it in WebRTC, we might think of it as something like this illustration:
This is known as a mesh network. Its biggest disadvantage for us (though there are others) is the messiness of it all – the number of connections between participants grows quadratically with the number of users (n users need n×(n−1)/2 connections). The fact that we need to send out the same audio stream to all participants individually is another huge disadvantage. Usually, we assume (and for good reasons) that the network available to us is limited.
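To put some numbers on that messiness, here is a small sketch that counts connections in a mesh and compares them with a layout where every participant holds a single connection to one central server. The function names are mine, made up for the illustration.

```python
def mesh_connections(n: int) -> int:
    """Peer connections in a full mesh: every pair of users is linked once."""
    return n * (n - 1) // 2


def mesh_uplinks_per_user(n: int) -> int:
    """Copies of its own audio each user must send out in a mesh."""
    return n - 1


def central_server_connections(n: int) -> int:
    """Connections when each user talks only to one central media server."""
    return n


for users in (5, 10, 50, 100):
    print(
        f"{users:>3} users: mesh={mesh_connections(users):>4} connections, "
        f"{mesh_uplinks_per_user(users):>2} uplinks per user; "
        f"central server={central_server_connections(users):>3} connections"
    )
```

For the 5 people in the example that is already 10 connections, with each person sending the same audio 4 times.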
The immediate obvious solution is to get a central media server to mix all audio inputs, reducing all network traffic and processing from the users:
This media server is usually called an MCU (or a conferencing bridge). Users here “feel” as if they are in a session with only a single entity/user and the MCU is in charge of all the headaches on behalf of the users.
This mixer approach can be a wee bit expensive for the service provider and at times, not the most flexible of approaches. Which is why the SFU routed model was introduced, though mostly for video meetings. Here, we try to enjoy both worlds – we have the SFU route the media around, to try and keep bitrates and network use at reasonable levels while trying to reduce our hosting and media processing costs as service providers:
The SFU has become commonplace and the winning architecture model for video meetings almost everywhere. Voice only meetings though, have been somewhere in-between. Probably due to the existence and use of audio bridges a lot before WebRTC came to our lives.
This begs the question then, which architecture should we be using for our audio in group calls? Should we mix it in our media servers or just route it around like we do with video?
Before I go ahead to try and answer this question, there’s one more thing I’d like to go through, and that’s the set of media processing tools available to us today for audio in WebRTC.
Audio processing tools available for us in WebRTC
Encoding and decoding audio is the baseline thing. But other than that, there are quite a few media processing and network related algorithms that can assist applications in getting to the desired scale and quality of audio they need.
Before I list them, here are a few thoughts that came to mind when I collected them all:
Audio level
There is an RTP header extension for audio level. This allows a WebRTC client to indicate the volume of the audio found inside the encoded packet being sent.
The receiver can then use that information without decoding the packet at all.
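Here is a minimal sketch of reading that value on the receiving side. It assumes the client-to-mixer audio level extension of RFC 6464, where the single data byte holds a voice activity flag in the top bit and the level in the lower 7 bits, expressed in -dBov (0 is loudest, 127 is silence). Locating the extension inside the RTP packet is left out, and the `-50` threshold and the function names are my own picks for the illustration, not recommended values.

```python
from typing import NamedTuple


class AudioLevel(NamedTuple):
    voice_activity: bool  # the "V" flag set by the sender
    dbov: int             # 0 (loudest) down to -127 (silence)


def parse_audio_level(ext_byte: int) -> AudioLevel:
    """Decode the one-byte payload of the RFC 6464 audio level extension."""
    if not 0 <= ext_byte <= 0xFF:
        raise ValueError("expected a single byte")
    voice = bool(ext_byte & 0x80)   # top bit: voice activity flag
    level = -(ext_byte & 0x7F)      # lower 7 bits: level in -dBov
    return AudioLevel(voice, level)


def worth_decoding(ext_byte: int, threshold_dbov: int = -50) -> bool:
    """Hypothetical policy: keep packets flagged as voice and loud enough."""
    level = parse_audio_level(ext_byte)
    return level.voice_activity and level.dbov >= threshold_dbov


print(parse_audio_level(0x8A))
print(worth_decoding(0x8A), worth_decoding(0x7F))
```

Note that the packet payload itself is never touched here, which is the whole point.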
What can one do with it?
Decide if you need to decode the packet at all or just discard it if there’s no or little voice activity or if the audio level is too low (no one’s going to hear what’s in there anyway).
You can replace it with DTX (see below) or not forward the packet in a Last-N architecture (see below).
Not mix its content with other audio channels (it doesn’t hold enough information to be useful to anyone).
DTX
If there’s nothing really to send – the person isn’t speaking but the microphone is open – then send “silence” but with fewer packets over the network.
That’s what DTX is about, and it is great.
In larger meetings, most people will listen and not speak over one another. So most audio streams will just be “silence” or muted. If they aren’t muted, then sending DTX instead of actual audio reduces the traffic generated. This can be a boon to SFUs, which end up processing fewer packets.
An SFU media server can also decide to “replace” actual audio it receives from users (because they have a low audio level in them or because of Last-N decisions it is making) with DTX data when routing media around.
PLC
Packets are going to be lost, but there would be content that still needs to be played back to the user.
You can decide to play silence, a repeat of the last heard packet, lower its volume a bit, etc.
This can be done either on the server side (especially in the case of an MCU mixer) or on the client side – where such algorithms are implemented in the browser already. SFUs can ignore this one, mostly since they don’t decode and process the actual media anyway.
At times, these can be done using machine learning, like Google’s proprietary WaveNetEq, which tries to estimate and predict what was in the missing packet based on past packets received.
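The simple strategies mentioned above (play silence, repeat the last packet, lower its volume) can be sketched in a few lines. This is a toy illustration and not what browsers actually ship: frames are plain lists of PCM samples, a lost frame is marked as `None`, and the `attenuation` factor is an arbitrary value I picked.

```python
from typing import List, Optional

Frame = List[int]


def conceal(frames: List[Optional[Frame]], frame_size: int,
            attenuation: float = 0.5) -> List[Frame]:
    """Fill lost frames by repeating the last good one at a fading volume."""
    output: List[Frame] = []
    last_good: Optional[Frame] = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            # Frame arrived: play it as is and reset the fade.
            last_good, gain = frame, 1.0
            output.append(list(frame))
        elif last_good is None:
            # Nothing heard yet, so the only option is silence.
            output.append([0] * frame_size)
        else:
            # Lost frame: repeat the last good one, quieter on each loss.
            gain *= attenuation
            output.append([int(sample * gain) for sample in last_good])
    return output


received = [[100, -100], None, None, [40, 40]]
print(conceal(received, frame_size=2))
```

Fading out on consecutive losses avoids the robotic sound of the same frame looping at full volume.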
Packet loss concealment isn’t great at all times, but it is a necessary evil.
RTX & NACK
Theoretically, you could use retransmissions for lost packets.
WebRTC does that mostly for video packets, but this can also find a home for audio.
It is/was a rather neglected area because PLC and Opus inband FEC techniques worked nicely.
For the time being, you’re likely to skip this tool, but it is one I’d keep an eye on if I were highly interested in audio quality advancements.
FEC and RED
Audio bandwidth requirements are low, so duplicating frames doesn’t end up taxing much of our network, especially in a video call.
This approach enables us at a “low cost” to gain higher resiliency to packet losses.
This can be employed by the client sender, or even from the server side, beefing up what it received – both as an SFU or an MCU.
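To show why duplication buys resiliency, here is a simplified sketch of the redundancy idea: each packet carries the current frame plus copies of the previous ones, so a lost packet can be rebuilt from the one that follows it. This is not the RED wire format, just tuples in memory, and `red_packetize`, `red_receive` and `distance` are names invented for the illustration.

```python
from typing import Dict, List, Optional, Tuple

# (sequence number, current frame, [(older sequence number, older frame), ...])
Packet = Tuple[int, bytes, List[Tuple[int, bytes]]]


def red_packetize(frames: List[bytes], distance: int = 1) -> List[Packet]:
    """Attach copies of the previous `distance` frames to every packet."""
    packets: List[Packet] = []
    for seq, frame in enumerate(frames):
        start = max(0, seq - distance)
        redundant = [(i, frames[i]) for i in range(start, seq)]
        packets.append((seq, frame, redundant))
    return packets


def red_receive(packets: List[Packet], total: int) -> List[Optional[bytes]]:
    """Rebuild the frame sequence, using redundant copies to fill holes."""
    recovered: Dict[int, bytes] = {}
    for seq, frame, redundant in packets:
        recovered[seq] = frame
        for old_seq, old_frame in redundant:
            recovered.setdefault(old_seq, old_frame)
    return [recovered.get(i) for i in range(total)]


frames = [b"f0", b"f1", b"f2", b"f3"]
sent = red_packetize(frames)
arrived = [packet for packet in sent if packet[0] != 1]  # packet 1 is lost
print(red_receive(arrived, total=len(frames)))
```

With a distance of 1, a single lost packet is fully recovered while two losses in a row still leave a gap, which is where the "how much to duplicate" question starts.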
Check Philipp Hancke’s talk at Kranky Geek about Advanced in Audio Codecs.
Then there’s the nuances and headaches of when to duplicate and how much, but that’s for another article.
Last-N
A known technicality in WebRTC’s implementation is that it only mixes the 3 loudest incoming audio channels before playing back the audio.
Why 3? Because 2 wasn’t enough and 4 seemed unnecessary is my guess. Also, the more sources you mix, the higher the noise levels are going to be, especially without good noise suppression (more on that below).
Well… Google just decided to remove that restriction. Based on the announcement, that’s because the audio decoding takes place in any case, so there’s not much of a performance optimization not to mix them all.
So now, you can decide if you want to mix everything (which you just couldn’t before) or if you want to mix or route only the few loudest (or most important) audio streams if that’s what you’re after. This reduces CPU and network load (depending on which architecture you are using).
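The selection itself is simple. A minimal sketch, assuming the server already knows the latest audio level of each participant in dBov (0 is loudest, -127 is silence); the `floor_dbov` cutoff and the participant names are made up for the example.

```python
from typing import Dict, List


def last_n(levels_dbov: Dict[str, int], n: int = 3,
           floor_dbov: int = -60) -> List[str]:
    """Pick the n loudest participants, skipping the ones that are too quiet."""
    audible = [(level, who) for who, level in levels_dbov.items()
               if level > floor_dbov]
    # Loudest first; ties are broken by name to keep the result stable.
    audible.sort(key=lambda item: (-item[0], item[1]))
    return [who for _, who in audible[:n]]


levels = {"ana": -12, "ben": -70, "chen": -25, "dor": -18, "eli": -40}
print(last_n(levels))        # the 3 loudest
print(last_n(levels, n=5))   # fewer than 5 come back, "ben" is below the floor
```

A real implementation would also smooth levels over time so the selection doesn’t flip on every packet, which this sketch ignores.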
Google Meet, for example, employs a Last-3 technique, only sending up to the 3 loudest audio streams to users in a meeting.
Oh, and if you want to dig deeper into the reasoning, there’s a nice Jitsi paper from 2016 explaining Last N.
Noise suppression: RNNoise and other machine learning algorithms
RNNoise is a veteran among the ML-based noise suppression algorithms that is quite popular these days.
Janus for example, have added it to their AudioBridge and implemented optional RNNoise logic to handle channel-based noise suppression in their MCU mixer for each incoming stream.
Google added this in their Google Meet cloud – their SFU implementation passes the audio to dedicated servers that process this noise suppression – likely by decoding, noise suppression and encoding back the audio.
Many vendors today are introducing proprietary noise suppression to their solutions on the client side. These include Krisp, Dolby, Daily, Jitsi, Twilio and Agora – some via partnerships and others via self development.
Mixing keeps the headaches away from the browser
Why use an MCU for mixing your audio call? Because it takes all the implementation headaches and details away from the browser.
To understand some of what it entails on the server though, I’d refer you again to read Lorenzo’s post.
The great thing about this is that for the most part, adding more users means throwing more cloud hardware at the problem to solve it. At least up to a degree this can work well without thinking of scaling out, decentralization and other big words.
It is also how this has been done for many years now.
Here are the tools I’d aim to use for an audio MCU:
| Tool | Use? | Reasoning |
| --- | --- | --- |
| Audio level | | Decoding fewer streams will get higher performance density for the server. Use this with Last-N logic |
| DTX | | Both when decoding and while encoding |
| PLC | | On each incoming audio stream separately |
| RTX & NACK | | Too early to do this today |
| FEC and RED | | Today, for an MCU, this would be rare to see as a supported feature. Consider it on outgoing audio streams, and enable it for incoming streams from devices |
| Last-N | | Last-3 is a good default unless you have a specific user experience in mind (see below examples) |
| Noise suppression | | On incoming channels, those that passed Last-N filtering, to clean them up before mixing the incoming streams together |
One thing to note with an audio MCU is that it needs to generate quite a few different outgoing streams. For 10 participants with 4 speakers (at a Last-4 configuration), it would be something like this:
We have 5 separate mixers at play here:
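One way to arrive at that count is a mix-minus scheme, which is my assumption here rather than something spelled out above: each active speaker gets a mix of the other speakers (so they don’t hear themselves), and everyone else shares one mix of all the speakers. A minimal sketch, with made-up participant names and a hypothetical `plan_mixes` helper:

```python
from typing import Dict, FrozenSet, List


def plan_mixes(participants: List[str], speakers: List[str]) -> Dict[FrozenSet[str], List[str]]:
    """Map each distinct mix (the set of sources in it) to the listeners receiving it."""
    speaker_set = frozenset(speakers)
    mixes: Dict[FrozenSet[str], List[str]] = {}
    for listener in participants:
        # A speaker hears every active speaker except themselves;
        # a non-speaker hears all active speakers.
        sources = speaker_set - {listener}
        mixes.setdefault(sources, []).append(listener)
    return mixes


participants = [f"p{i}" for i in range(1, 11)]  # 10 participants
speakers = participants[:4]                     # Last-4: p1..p4 are active
mixes = plan_mixes(participants, speakers)
print(len(mixes))                               # number of distinct mixers needed
for sources, listeners in mixes.items():
    print(sorted(sources), "->", listeners)
```

Under this assumption, 4 speakers need 4 personal mixes plus 1 shared mix for the 6 listeners, which lines up with the 5 mixers mentioned above.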
Why do we use an SFU for audio conferences? Because we use it for video already… or because we believe this is the modern way of doing things these days.
When it comes to routing audio, the thing to remember is that we have a delicate balance between the SFU and the participants, each playing a part here to get a better experience at the end of the day.
Here are the tools I’d use for an audio SFU:
| Tool | Use? | Reasoning |
| --- | --- | --- |
| Audio level | | We must have this implemented and enabled, especially since we really really really want to be able to conduct Last-N logic and not send each user all audio channels from all other participants |
| DTX | | We can use this to detect silence as well here (and remove silent streams from Last-N logic). On the sending side, the SFU can decide to DTX the channels in Last-N that are silent or at a low volume to save a bit of extra bandwidth (a minor optimization) |
| PLC | | Not needed. We route the audio packets and let the participants fix any losses that take place |
| RTX & NACK | | Too early to do this today |
| FEC and RED | | This can be added on the receiver and sender side in the SFU to improve audio quality. Adding logic to dynamically decide when and how much redundancy to use based on network conditions is also an advantage here |
| Last-N | | Last-3 is a good default. Probably best to keep this at Last-5 at most, since the decision here means more CPU use on the participants’ side |
| Noise suppression | | Not needed. This can be done on the participants’ side |
In many ways, an audio SFU is simpler to implement than an audio MCU, but tweaking it just right to gain all the benefits and optimizations from the client implementation is the tricky part.
Where the rubber hits the road – let’s talk use cases
As with everything else I deal with, which approach to use depends on the circumstances. One of the main deciding criteria in this case is going to be the use case you are dealing with and the scenario you are solving this for.
Here are a few that came to mind.
Gateway to the old world
The first one is borderline “obvious”.
Before WebRTC, no one really did an audio conference using an SFU architecture. And if they did, it was unique, proprietary and special. The world revolved and still revolves around MCU and mixing audio bridges.
If your service needs to connect to legacy telephony services, existing deployments of VoIP services running over SIP (or god forbid H.323), connect to a large XMPP network – whatever it may be – that “other” world is going to be running as an MCU. Each device is likely capable of handling only one incoming audio stream.
So connecting a few users from your service (no matter if you are using an SFU or an MCU) to the legacy service means mixing their audio first.
Video meetings with mixed audio
There are services that decide to use an SFU to route video streams and an MCU for the audio streams.
Sometimes, it is because the main service started as an audio service (so an audio bridge was/is at the heart of the service already) and video was bolted on the platform. Sometimes it is because gatewaying to the old world is central to the service and its mindset.
Other times, it is due to an effort to reduce the number of audio streams being sent around, or to reduce the technical requirements of audio only participants.
Whatever the reason, this is something you might bump into.
The big downside of such an approach is the loss of lip synchronization. There is no practical way you can synchronize a single audio stream that represents mixed content of multiple video streams. In fact, no lip synchronization with any of the video streams takes place…
Usually, the excuse I’ll be hearing is that the latency difference isn’t noticeable and no one complained. Which begs the question – why do we bother with lip synchronization mechanisms at all then? (we do because it does matter and is noticeable – especially when the network is slightly bumpier than usual)
Experience the crowd
Think of a soccer game. 50,000 people in a stadium. Roaring when there’s a goal or a miss.
With Last-3 audio streams mixed, you wouldn’t be hearing anything interesting when this takes place “remotely” for the viewers.
The same applies to a virtual online concert.
Part of the experience you are trying to convey is the crowds and the noises and voices they generate.
If we’re all busy reducing noise levels, suppressing it, picking and choosing the 2-3 voices in the crowd to mix, then we just degrade the experience.
Crowds matter in some scenarios. And preserving that experience properly cannot be done by routing audio streams around. Especially not when we’re starting to talk about hundreds or more active participants.
This case necessitates the use of MCU audio bridging. And likely a distributed approach the moment the number of users climbs higher.
Metaverse and spatial audio
The metaverse is coming. Or will be. Maybe. Now that Apple Vision Pro is upon us. But even before that, we’ve seen some metaverse use cases.
One thing that comes to mind here is the immersion part of it, which leads to spatial audio. The intent of hearing multiple sounds coming from different directions – based on where the speaker is.
This means several things:
Do you do that on the client side by way of an SFU implementation, or would it be preferable to do this in an MCU implementation?
And what about trying to run concerts in the metaverse? How do you give the notion of the crowds on the audio side?
These are questions that definitely don’t have a single answer.
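Whichever side ends up doing it, the basic operation is the same: for every listener, each source gets a gain per output channel based on where that source sits relative to the listener. Here is a deliberately simplified sketch using constant-power stereo panning. The function name `pan_gains` and the azimuth convention are my own assumptions, and real spatial audio typically relies on HRTF-based rendering rather than plain panning.

```python
import math
from typing import Tuple


def pan_gains(azimuth_deg: float) -> Tuple[float, float]:
    """Constant-power (left, right) gains for a source at the given azimuth.

    Assumed convention: -90 is fully left, 0 is straight ahead, +90 is fully right.
    Values outside that range are clamped.
    """
    azimuth = max(-90.0, min(90.0, azimuth_deg))
    theta = (azimuth + 90.0) / 180.0 * (math.pi / 2)  # map to 0..pi/2
    return math.cos(theta), math.sin(theta)


for angle in (-90, -30, 0, 45, 90):
    left, right = pan_gains(angle)
    print(f"{angle:>4} deg -> left={left:.3f} right={right:.3f}")
```

In an MCU this would run once per listener and per source before mixing. In an SFU model the same math would run on each client over the streams it receives.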
In all likelihood, in some metaverse cases, the SFU model will be the best architectural approach while in others an MCU would work better.
Recording it all
Not exactly a use case in its own right, but rather a feature that is needed a lot.
When we need to record a session, how do we go about doing that?
Today, at least 99% of the time, that would be by mixing all audio and video sources and creating a single stream that can be played as a “regular” mp4 file (or similar).
Recording as a single stream means using an MCU-like solution. Sometimes by implementing it in a headless browser (as if this is a silent participant in the session) and other times by way of dedicated media servers. The result is similar – mixing the multiple incoming streams into a single outgoing one that goes directly to storage.
The downside of this, besides needing to spend energy on mixing something that people might never see (which is itself a decision point when picking an architecture), is that you get to view and hear only a single viewpoint of a single user – since the mixed recording is already “opinionated” based on what viewpoint it took.
We can theoretically “record” the streams separately and then play them back separately, but that’s not that simple to achieve, and for the most part, it isn’t commonplace.
A kind of compromise we see today with professional recording and podcast services is to record both mixed and separate audio streams. This allows post production to use either, based on the mixing needs, but done manually.
Which will it be? MCU or SFU for your next audio meeting?
We started with this, and we will end with this.
You need to understand your requirements and from there see if the solution you need will be based on an MCU, an SFU or both. And if you need help with figuring that out, that’s what my WebRTC courses are for – check them out.
webrtcHacks celebrates our 10th birthday today 🎂. To commemorate this day, I’ll cover 2 topics here: Our new merch store Some stats and trends looking back on 10 years of posts We have the Merch In the early days of webrtcHacks, co-founder Reid Stidolph ordered a bunch of stickers which proved to be extremely popular. […]
Explore the future of Real-Time Communications with WebrtcHacks as we delve into the use of WebCodecs and WebTransport as alternatives to WebRTC's RTCPeerConnection. This comprehensive blog post features interviews with industry experts, a review of potential WebCodecs+WebTransport architecture, and a discussion on real-time media processing challenges. We also examine performance measurements, hardware encoder issues, and the practicality of these new technologies.
A new Higher-level WebRTC protocols course and discounts, available for a limited period of time.
Over a year ago, Philipp Hancke came to me with the idea of creating a new set of courses. Ones that will dig deeper into the heart of the protocols used in WebRTC. This being a huge undertaking, we decided to split it into several courses, and focus on the first one – Low-level WebRTC protocols.
We received positive feedback about it, so we ended up working on our second course in this series – Higher-level WebRTC protocols.
Why the need for additional WebRTC courses?
There is always something more to learn.
The initial courses at WebRTC Course were focused on giving an understanding of the different components of WebRTC itself and on getting developers to be able to design and then implement their application.
What was missing in all that was a closer look at the protocols themselves. Of looking at what goes on in the network, and being able to understand what goes over the wire. Which is why we started off with the protocols courses.
Where the Low-level WebRTC protocols course looks directly at what goes onto the network with WebRTC, our newer Higher-level WebRTC protocols course takes it up one level:
This time, we’re looking at the protocols that make use of RTP and RTCP to make the job of real time communications manageable.
If you don’t know exactly what header extensions are, and how they work (and why), or the types of bandwidth estimation algorithms that WebRTC uses – and again – how and why – then this course is for you.
If you know RTP and RTCP really well, because you’ve worked in the video conferencing industry, or have done SIP for years – then this course is definitely for you.
Just understanding the types of RTP header extensions that WebRTC ends up using, many of them proprietary, is going to be quite a surprise for you.
Our WebRTC Protocols courses
Got a use case where you need to render remote machines using WebRTC? These require sitting at the cutting edge of WebRTC, or more accurately at a slightly skewed angle versus what the general population does with WebRTC (including Google).
Taking upon yourself such a use case means you’ll need to rely more heavily on your own expertise and understanding of WebRTC.
There are now 2 available protocols courses for you:
And there are 2 different ways to purchase them:
You should probably hurry though…
Check out my WebRTC courses
WebRTC is an important technology for cloud gaming and virtual desktop type use cases. Here are the reasons and the challenges associated with it.
Google launched and shut down Stadia. A cloud gaming platform. It used WebRTC (yay), but it didn’t quite fit into Google’s future it seems.
That said, it does shed a light on a use case that I’ve been “neglecting” in my writing here, though it was and is definitely top of mind in discussions with vendors and developers.
What I want to put in writing this time is cloud gaming as a concept, and then alongside it, all virtual desktops and cloud rendering use cases.
Let’s dig in.
Google Stadia started life as Project Stream inside Google.
Technically, it made perfect sense. But at least in hindsight, the business plan wasn’t really there. Google is far removed from gaming, game developers and gamers.
On the technical side, the intent was to run high end games on cloud machines that would render the game and then have someone play the game “remotely”. The user gets a live video rendering of the game and sends back console signals. This meant games could be as complex as they need be and get their compute power from cloud servers, while keeping the user’s device at the same spec no matter the game.
Source: Google
I’ve added the WebRTC text on the diagram from Google – WebRTC was called upon so that the player could use a modern browser to play the game. No installation needed. This can work nicely even on iOS devices, where Apple is adamant about their part of the revenue sharing on anything that goes through the app store.
Stadia wanted to solve quite a few technological challenges:
And likely quite a few other challenges as well (scaling this whole thing and figuring out how to obtain and keep so many GPUs for example).
Technically, Stadia was a success. Businesswise… well… it shut down a little over 3 years after its launch – so not so much.
What Stadia did though, was show that this is most definitely possible.
WebRTC, Cloud gaming and the challenges of real time
To get cloud gaming right, Google had to do a few things with WebRTC. Things it hadn’t really needed much when the main use of WebRTC at Google was Google Meet. These were lowering the latency, dealing with a larger color space and aiming for 4K resolution at 60 fps. What they got virtually for “free” with WebRTC was its data channel – the means to send game controller signals quickly from the player to the gaming machine in the cloud.
Let’s see what it meant to add the other three things:
4K resolution at 60 fps
Google aimed for high end games, which meant higher resolutions and frame rates.
WebRTC is/was great for video conferencing resolutions. VGA, 720p and even 1080p. 4K was another jump up that scale. It requires more CPU and more bandwidth.
Luckily, for cloud gaming, the browser only needs to decode the video and not encode it. Which meant the real issue, besides making sure the browser can actually decode 4K resolutions efficiently, was to conduct efficient bandwidth estimation.
As an algorithm, bandwidth estimation is finely tuned and optimized for given scenarios. With 4K and cloud gaming being a new scenario, the bitrates needed weren’t 2 Mbps or even 4 Mbps but rather in the range of 10-35 Mbps.
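For a rough feel of the numbers, here is a back-of-the-envelope calculation of what uncompressed 4K at 60 fps would take, and how much the encoder has to squeeze it to land in that range. It assumes 8-bit 4:2:0 video (12 bits per pixel on average); HDR content would use more bits per pixel, so treat this as a lower bound sketch.

```python
def raw_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 12) -> int:
    """Uncompressed video bitrate in bits per second.

    The default of 12 bits per pixel assumes 8-bit 4:2:0 chroma subsampling.
    """
    return width * height * fps * bits_per_pixel


raw = raw_bitrate_bps(3840, 2160, 60)
print(f"raw 4K60: {raw / 1e9:.2f} Gbps")
for target_mbps in (10, 35):
    ratio = raw / (target_mbps * 1_000_000)
    print(f"at {target_mbps} Mbps the encoder compresses roughly {ratio:.0f}:1")
```

The point is simply that the bandwidth estimator now has to track a much larger and more variable range than it does for a video call.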
The built-in bandwidth estimator in WebRTC can’t handle this… but the one Google built for the Stadia servers can. On the technical side, this was made possible by Google relying on sender-side bandwidth estimation techniques using transport-cc.
Lower latency: playout delay
Remember this diagram?
It can be found in my article titled With media delivery, you can optimize for quality or latency. Not both.
WebRTC is designed and built for lower latency, but in the sub-second latency, how would you sort the latency requirements of these 3 activities?
WebRTC’s main focus over the years has been online meetings. This means having 100 milliseconds or 200 milliseconds delay would be just fine.
With an online game? 100 milliseconds is the difference between winning and losing.
So Google tried to reduce latency even further with WebRTC by adding a concept of Playout Delay. The intent here is to let WebRTC know that the application and use case prefers playing out the media earlier and sacrificing even further in quality, versus waiting a bit for the benefit of maybe getting better quality.
Larger color space
Video conferencing and talking heads don’t need much. If you recall, with video compression what we’re after is to lose as much as we can out of the original video signal and then compress. The idea here is that whatever the eye won’t notice – we can make do without.
Apparently, for talking heads we can lose more of the “color” and still be happy versus doing something similar for an online game.
To make a point, if you’ve watched Game of Thrones at home, then you may remember the botch in the last season, where some of the episodes ended up being too dark for television. That was due to compression done by service providers… (April 29, 2019)
While different from the color space issue here, it goes to show that how you treat color in video encoding matters. And it differs from one scenario to another.
When it comes to games, a different treatment of color space was needed. Specifically, moving from SDR to HDR, adding an RTP header extension in the process to express that additional information.
Oh, and if you want to learn more about these changes (especially resolution and color space), then make sure to watch this Kranky Geek session by YouTube about the changes they had to make to support Stadia:
What’s in cloud gaming anyway?
Here’s the thing. Google Stadia is one end of the spectrum in gaming and in cloud gaming.
Throughout the years, I’ve seen quite a few other reasons and market targets for cloud gaming.
Types of cloud games
Here are the ones that come off the top of my head:
Why not even play these games with others remotely?
My son recently had a sit down with 4 other friends, all playing a TMNT game together on Xbox. It was great having them all over, but you could do it remotely as well. If the game doesn’t offer remote players, by pushing it to the cloud you can get that feature simply because all users immediately become remote players.
At this stage, you can even add a voice conference or a video call to the game between the players. Just to give them the level of collaboration they can get out of playing the likes of Fortnite. Granted, this requires more than just game rendering in the cloud, but it is possible and I do see it happen with some of the vendors in this space.
Beyond cloud gaming – virtual desktop, remote desktop and cloud rendering
Lower latencies. Bigger color space. Higher resolutions. Rendering in the cloud and consuming remotely.
All these aren’t specific to cloud gaming. They can easily be extended to virtual desktop and remote desktop scenarios.
You have a machine in the cloud – big or small or even a cluster. That “machine” handles computations and ends up rendering the result to a virtual display. You then grab that display and send it to a remote user.
One use case can just be a remote desktop a-la VNC. Here we’re actually trying to get connected from one machine to another, usually in a private and secure peer-to-peer fashion, which is different from what I am aiming for here.
Another, less talked about use case is doing things like Photoshop operations in the cloud. Poor sad people like me, who don’t have the latest Mac Pro with the shiny M2 Ultra chip, might just want to “rent” the compute power online for image or video editing jobs.
I might want to open a rendered 3D view of a sports car I’d like to buy, directly from the browser, having the ability to move my view around the car.
Or it might just be a simple VDI scenario, where the company (usually a large financial institution, but not only) would like the employees to work on Chromebook machines but have nothing installed or stored on them – all consumed by accessing the actual machine and data in their own corporate data center or secure cloud environment.
A good friend of mine asked me what PC to buy for himself. He needed it for work. He is a lawyer. My answer was the lowest end machine you can find would do the job. That saved him quite a lot of money I am guessing, and he wouldn’t even notice the difference for what he needs it for.
But what if he needs a bit more juice and power every once in a while? Can renting that in the cloud make a difference?
What about the need to use specialized software that is hard to install and configure? Or that requires a lot of collaboration on large amounts of data that need to be shared across the collaborators?
Taking the notion and capabilities of cloud gaming and applying them to non-gaming use cases can help us with multiple other requirements:
Do these have to happen with WebRTC? No
Can they happen with WebRTC? Yes
Would changing from proprietary VDI environments to open standard WebRTC in browsers improve things? Probably
Why use WebRTC in cloud gaming
Why even use WebRTC for cloud gaming or more general cloud rendering then?
With cloud gaming, we’re fine doing it from inside a dedicated app. So WebRTC isn’t really necessary. Or is it?
In one of our recent WebRTC Insights issues we’ve highlighted that Amazon Luna is dropping the dedicated apps in favor of the web (=WebRTC). From that article:
“We saw customers were spending significantly more time playing games on Luna using their web browsers than on native PC and Mac apps. When we see customers love something, we double down. We optimized the web browser experience with the full features and capabilities offered in Luna’s native desktop apps so customers now have the same exact Luna experience when using Luna on their web browsers.”
Browsers are still a popular enough alternative for many users. Are these your users too?
If you need or want web browser access for a cloud gaming / cloud rendering application, then WebRTC is the way to go. It is a slightly different opinion than the one I had with the future of live streaming, where I stated the opposite:
“The reason WebRTC is used at the moment is because it was the only game in town. Soon that will change with the adoption of solutions based on WebTransport+WebCodecs+WebAssembly where an alternative to WebRTC for live streaming in browsers will introduce itself.”
Why the difference? It is all about the latency we are willing to accommodate:
Your mileage may vary when it comes to the specific latency you’re aiming for, but in general – live streaming can live with slightly higher latency than our online meetings. So something other than WebRTC can cater for that better – we can fine tune and tweak it more.
Cloud gaming needs even lower latency than WebRTC. And WebRTC can accommodate that. Using something else that is as yet unproven (and suffers a bit from performance and latency issues at the moment) is the wrong approach. At least today.
Enter our WebRTC Protocols courses
Got a use case where you need to render remote machines using WebRTC? These require sitting at the cutting edge of WebRTC, or more accurately at a slightly skewed angle versus what the general population does with WebRTC (including Google).
Taking upon yourself such a use case means you’ll need to rely more heavily on your own expertise and understanding of WebRTC.
Over a year ago I launched with Philipp Hancke the Low-level WebRTC Protocols course. We’re now recording our next course – Higher-level WebRTC Protocols.
If you are interested in learning more about this, be sure to join our waiting list for when we launch the course.
Join the course waiting list
Oh, and I’d like to thank Midjourney for releasing version 5.2 – awesome images
The Apple Vision Pro is a new VR/AR headset. Here are my thoughts on if and how it will affect the metaverse and WebRTC.
There were quite a few interesting announcements and advances made in recent months that got me thinking about this whole area of the metaverse, augmented reality and virtual reality. All of which culminated with Apple’s unveiling last week of the Apple Vision Pro. For me, the prism from which I analyze things is the one of communication technologies, and predominantly WebRTC.
A quick disclaimer: I have no clue about what the future holds here or how it affects WebRTC. The whole purpose of this article is for me to try and sort my own thoughts by putting them “down on paper”.
Let’s get started then.
Apple just announced its Vision Pro VR/AR headset. If you’re reading this blog, then you know about this already, so there isn’t much to say about it.
For me? This is the first time that I had this nagging feeling for a few seconds that I just might want to go and purchase an Apple product.
Most articles I’ve read were raving about this – especially the ones who got a few minutes to play with it at Apple’s headquarters.
AR/VR headsets thus far have been taking one of the two approaches:
Apple took the middle ground – their headset is a VR headset since it replaces what you see with two high resolution displays – one for each eye. But it acts as an AR headset – because it uses external cameras on the headset to project the world on these displays.
The end result? Expensive, but probably with better utility than any other alternative, especially once you couple it with Apple’s software.
Video calling, FaceTime, televisions and AR
Almost at the sidelines of all the talks and discussions around Apple Vision Pro and the new Mac machines, there have been a few announcements around things that interest me the most – video calling.
FaceTime and Apple TV
One of the challenges of video calling has been to put it on the television. This used to be called a lean back experience for video calling, in a world predominantly focused on lean forward when it comes to video calling. I remember working on such proof of concepts and product demos with customers ~15 years ago or more.
These never caught on.
The main reason was somewhere between the cost of the hardware, maintaining privacy with a living room camera and microphone positioning/noise.
By tethering the iPhone to the television, the cost of hardware along with maintaining privacy gets solved. The microphones are now a lot better than they used to be – mostly due to better software.
Apple, being Apple, can offer a unique experience because they own and control the hardware – both of the phone and the set-top box. Something that is hard for other vendors to pull off.
There’s a nice concept video on the Apple press release for this, which reminded me of this Facebook (now Meta) Portal presentation from Kranky Geek:
Can Android devices pull the same thing, connected to Chromecast enabled devices maybe? Or is that too much to ask?
Do television and/or set-top box vendors put an effort into a similar solution? Should they be worried in any way?
Where could/should WebRTC play a role in such solutions, if at all?
FaceTime and Apple Vision Pro
How do you manage video calls with a clunky AR/VR headset plastered on your face?
First off, there’s no external camera “watching you”, unless you add one. And then there’s the nagging thing of… well… the headset:
Apple has this “figured out” by way of generating a realistic avatar of you in a meeting. What is interesting to note here, is that in the Apple Vision Pro announcement video itself, Apple made three important omissions:
What do the people at the meeting see of her? Do they see her looking at them, or the side of her head? Do they see the context of her real-life surroundings or a virtual background?
I couldn’t find any person who played with the Apple Vision Pro headset and reported using FaceTime, so I am assuming this one is still a work in progress. It will be really interesting to see what they come up with once this is released to market, and what real life use looks and feels like.
Lifelike video meetings: Just like being there
Then there’s telepresence. This amorphous thing which for me translates into: “expensive video conferencing meeting rooms no one can purchase unless they are too rich”.
Or if I am a wee bit less sarcastic – it is what we strive for with video conferencing – what would be the ultimate experience of “just like being there” done remotely if we had the best technology money can buy today.
Google Project Starline is the current poster child of this telepresence technology.
The current iteration of telepresence strives to provide a 3D lifelike experience (with eye contact obviously). To do so while keeping hardware costs down and fitting more environments and hardware devices, it will rely on AI – like everything else these days.
The result as I understand it?
Now look at what FaceTime on an Apple Vision Pro really means:
Generate a hyper realistic avatar representation of the person – this sounds really similar to removing the background and using cameras to generate a 3D representation of the speaker (just with a bit more work and a bit less accuracy).
Both Vision Pro and Starline strive for lifelike experiences between remote people. Starline goes for a meeting room experience, capturing the essence of the real world. Vision Pro goes after a mix between augmented and virtual reality here – can’t really say this is augmented, but can’t say this is virtual either.
A telepresence system may end up selling a million units a year (a gross exaggeration on my part as to the size of the market, if you take the most optimistic outcome), whereas a headset will end up selling in the tens of millions or more once it is successful (and this is probably a realistic estimate).
What both of these ends of the same continuum of a video meeting experience do is they add the notion of 3D, which in video is referred to as volumetric video (we need to use big fancy words to show off our smarts).
And yes, that does lead me to the next topic I’d like to cover – volumetric video encoding.
Volumetric video coding
We have the metaverse now. Virtual reality. Augmented reality. The works.
How do we communicate on top of it? What does a video look like now?
The obvious answer today would be “it’s a 3D video”. And now we need to be able to compress it and send it over the network – just like any other 2D video.
The Alliance for Open Media, who has been behind the publication and promotion of the AV1 video codec, just published a call for proposals related to volumetric video compression. From the proposal, I want to focus on the following tidbits:
This being promoted now, in the same week Apple Vision Pro comes out, might be a coincidence. Or it might not.
The founding members include all the relevant vendors interested in AR/VR that you’d assume:
The rest also have vested interest in the metaverse, so this all boils down to this:
AR/VR requires new video coding techniques to enable better and more efficient communications in 3D (among other things)
Apple Vision Pro isn’t alone in this, but likely the one taking the first bold steps
The big question for me is this – will Apple go off with its own volumetric video codecs here, touting how open they are (think FaceTime open) or will they embrace the Alliance for Open Media work that they themselves are co-chairing?
And if they do go for the open standard here, will they also make it available for other developers to use? Me thinking… WebRTC
Is the metaverse web based?
Before tackling the notion of WebRTC in the metaverse, there’s one more prerequisite – that’s the web itself.
Would we be accessing the metaverse via a web browser, or a similar construct?
For an open metaverse, this would be something we’d like to have – the ability to have our own identity(ies) in the metaverse go with us wherever we go – between Facebook, to Roblox, through Fortnite or whatever other “domain” we go to.
Last week also got us this sideline announcement from Matrix: Introducing Third Room TP2: The Creator Update
Matrix, an open source and open standard for decentralized communications, has been working on Third Room, which for me is a kind of a metaverse infrastructure for the web. Like everything related to the metaverse, this is mostly a work in progress.
I’d love the metaverse itself to be web based and open, but it seems most vendors would rather have it limited to their own closed gardens (Apple and Meta certainly would love it that way. So would many others). I definitely see how open standards might end up being used in the metaverse (like the work the Alliance for Open Media is doing), but the vendors who will adopt these open standards will end up deciding how open to make their implementations – and whether the web will be the place to do it all or not.
Where would one fit WebRTC in the metaverse, AR and VR?
Maybe. Maybe not.
The unbundling of WebRTC makes it an option, while at the same time taking us farther away from having WebRTC as part of the future metaverse.
Not having the web means no real reliance on WebRTC.
Having the tooling in WebRTC to assist developers in the metaverse means there’s incentive to use and adopt it even without the web browser angle of it.
WebRTC will need at some point to deal with some new technical requirements to properly support metaverse use cases:
We’re still far away from that target, and there will be a lot of other technologies that will need to be crammed in alongside WebRTC itself to make this whole thing happen.
Apple’s new Vision Pro might accelerate that trajectory of WebRTC – or it might just do the opposite – solidify the world of the metaverse inside native apps.
I want to finish this off with this short piece by Jason Fried: The visions of the future
It looks at AR/VR and generative AI, and how they are two exact opposites in many ways.
Recently I also covered ChatGPT and WebRTC – you might want to take a look at that while at it.
The post Apple Vision, VR/AR, the metaverse and what it means to the web and WebRTC appeared first on BlogGeek.me.
Here at webrtcHacks we are always exploring what’s next in the world of Real Time Communications. One area we have touched on a few times is the use of WebCodecs with WebTransport as an alternative to WebRTC’s RTCPeerConnection. There have been several recent experiments by Bernard Aboba – WebRTC & WebTransport Co-Chair and webrtcHacks regular, […]
The post Livestream this Friday: WebCodecs, WebTransport, and the Future of WebRTC appeared first on webrtcHacks.
Is WebRTC really free? It is open source and widely used due to it. But it isn’t free when it comes to running and hosting your own WebRTC applications.
If you are new to WebRTC, then start here – What is WebRTC?
Time to answer this nagging question:
Is WebRTC really free?
One of the reasons that WebRTC is the most widely used developer technology for real time communications in the world is that it is open source. It helps a lot that it comes embedded and available in all modern browsers. That means that anyone can use WebRTC for any purpose they see fit, without paying any upfront licensing fee or later on royalties. This has enabled thousands of companies to develop and launch their own applications.
But does that mean every web application built with WebRTC is free? No. WebRTC may well be free, but whatever is bolted on top of it might not be. And then there are still costs involved with getting a web application online and dealing with traffic costs.
For that reason, in this article, I’ll be touching on why WebRTC really is free, and what you have to factor in for it if you want to get your own WebRTC application.
Since I am sure you didn’t really go read that other article – I’ll suggest it here again: What is WebRTC?
The TL;DR version of it?
The WebRTC software library is open sourced under a permissive open source license. That means its source code is available to everyone AND that individuals and companies can modify and use it anywhere they wish without needing to contribute back their changes. It makes it easier for commercial software to be developed with it (even when no changes or improvements are made to the base WebRTC library – just because of how corporate lawyers are).
You see? WebRTC really is free.
Google “owns” and maintains the main WebRTC library implementation. Everyone benefits from this. That said, they aren’t doing this only out of the goodness of their heart – they have their own uses for WebRTC they focus on.
However, there are costs involved with running a WebRTC application
While you don’t have to pay anything for WebRTC itself, there’s the application you develop, publish and then maintain. There are costs that come into play here – and considerable ones. These costs can vary depending on your requirements.
I’d like to split the costs here into 3 components:
The first thing you can put as a cost is to build the WebRTC application itself.
Here, as in all other areas, there’s more demand than supply when it comes to skilled WebRTC engineers. So much so that I had to write an article about hiring WebRTC developers – and I still send this link multiple times a month when asked about this.
Here too, you should split the cost into two parts:
Since everything done in WebRTC requires skilled engineers (that are scarce when it comes to WebRTC expertise), you can safely assume it is going to be a wee bit more expensive than you estimate it to be.
2. How expensive it is to optimize a WebRTC implementation
I know what you’re going to say. Your WebRTC application is going to be awesome. Glorious. Superb. It is going to be so good that it will wipe the floor with the existing solutions such as Zoom, Google Meet and Microsoft Teams.
That kind of a mentality is healthy in an entrepreneur, but a dose of reality is necessary here:
This brings me to the need to optimize what you’re doing on an ongoing basis.
Ever since the pandemic, we’ve seen a growing effort in the leading vendors in this space to improve and optimize quality. This manifests itself in the research they publish as well as features they bring to the market. Here are a few examples:
You should plan for ongoing optimization of your own as well. Your customers are going to expect you to keep up with the industry. The notion of “good enough” works well here, but the bar of what is “good enough” is rising all the time.
Such optimizations are needed not only to improve quality, but also to reduce costs.
Factor these costs in…
3. Hosting and maintenance costs of a WebRTC application
I had a meeting the other day. A founder of a startup who had to use WebRTC because customers needed something live and interactive. That component wasn’t at the core of his application, but not having it meant lost deals and revenue. It was a mandatory capability needed for a specific feature.
He complained about WebRTC being expensive to operate. Mainly because of bandwidth costs.
We can split WebRTC maintenance costs here into two categories: cloud costs and “keeping the lights on” costs.
Cloud costs
That startup founder was focused on cloud costs.
When we look at the infrastructure costs of web applications, there’s the usual CPU, memory, storage and network. We might be paying these directly, or indirectly via other managed and serverless services.
With WebRTC, the network component is the biggest hurt. Especially for video applications. You can reduce these costs by going to 2nd tier IaaS vendors or by hosting in “no-name” local data centers, but if you are like most vendors, you’re likely to end up on Amazon, Microsoft or Google cloud. And there, bandwidth costs for outgoing traffic are high.
WebRTC is peer to peer, but:
And the more successful you become – the more bandwidth you’ll consume – and the higher your cloud costs are going to be.
You will need to factor this in when developing your application, especially when deciding when to start optimizing for costs and bandwidth use.
Keeping the lights on costs
Then there’s the “keeping the lights on” costs.
WebRTC changes all the time. Things get deprecated and removed. Features change behavior over time. New features are added. You continually need to test that your application does not break in the upcoming Chrome release. Who is going to take care of all that in your WebRTC application?
You will also need to understand the way your WebRTC application is used. Are users happy? Are there areas you need to invest in with further optimization? Observability (=monitoring) is key here.
Keeping the lights on has its own set of costs associated with it.
Build vs buy a WebRTC infrastructure
Buying your WebRTC infrastructure by using managed services like CPaaS vendors is expensive. But then again, building your own (along with optimizing and maintaining it) is also expensive.
Roughly speaking, this is the kind of a decision table you’ll see in front of you:
Build | Buy
Pros: Customized to your specific need
There’s also a middle ground, where you can source/buy certain pieces and build others. Here are a few examples/suggestions:
You can also start with a CPaaS vendor and once you scale and grow, invest the time and money needed to build your own infrastructure – once you’ve proven your application and got to product-market-fit.
So, how free is WebRTC, really?
Part of WebRTC’s claim to fame is its nature as an open source and thus free software for building interactive web applications. While the technology itself is indeed free of charge and offers numerous freedoms, there are still costs associated with running a WebRTC application.
When we had to launch our own video conferencing service some 25 years ago, we had to put an investment of several millions of dollars along with an engineering team for a period of a couple of years. Only to get to the implementation of a media engine.
WebRTC gives that to you for “free”. And it is also kind enough to be pre-integrated in all modern browsers.
What Google did with WebRTC was to reduce the barrier of entry to real time communication drastically.
Creating a WebRTC application isn’t free – not really. But it does come with a lot of alternatives that bring with them freedom and flexibility.
The post Is WebRTC really free? The costs of running a WebRTC application appeared first on BlogGeek.me.
How WebRTC media resilience works – what FEC, RED, PLC, RTX are and why they are needed to improve media quality in real-time communications.
Networks are finicky in nature, and media codecs even more so.
With networks, not everything sent is received on the other end, which means we have one more thing to deal with and care about when it comes to handling WebRTC media. Luckily for us, there are quite a few built-in tools that are available to us. But which one should we use at each point and what benefits do they bring?
This is what I’ll be focusing on in this article.
Communication networks are lossy in nature. This means that if you send a packet through a network – there’s no guarantee of that packet reaching the other side. There’s also no guarantee that packets arrive in the order you’ve sent them or in a timely fashion, but that’s for another article.
This is why almost everything you do over the internet has this nice retransmission mechanism tucked away somewhere deep inside as an assumption. That retransmission mechanism is part of how TCP works – and for that matter, almost every other transport protocol implemented inside browsers.
The assumption here is that if something is lost, you simply send it again and you’re done. It may take a wee bit longer for the receiver to receive it, but it will get there. And if it doesn’t, we can simply announce that connection as severed and closed.
We call and measure that “something is lost” aspect of networks as packet loss.
Stripping away that automatic assumption that networks are reliable and everything you send over them is received on the other side is the first important step in understanding WebRTC but also in understanding real-time transport protocols and their underlying concepts.
Media codecs are lossy (and sensitive)
Media codecs are also lossy but in a different way. When an audio codec or a video codec needs to encode (=compress) the raw input from a microphone or a camera, it strips out the parts of the data it deems unnecessary. What gets stripped out are levels of perceived quality of the original media.
I remember many years ago, sitting at the dorms in the university and talking about albums and CDs. One of the roommates there was an audiophile. He always explained how vinyl albums have better audio quality than CDs and how MP3 just ruins audio quality. Me? I never heard the difference.
Perceived quality might be different between different people. The better the codec implementation, the more people will not notice degraded quality.
Back to codecs.
Most media codecs are lossy in nature. There are a few lossless ones, but these are rarely used for real time communications and not used in WebRTC at all. The reason we use lossy codecs is to have better compression rates:
Taking 1080p (Full HD) video at 30 frames per second will result in roughly 1.5Gbps of data. Without compressing it – it just won’t work. We’re trying to squeeze a lot of raw data over networks, and as always, we need to balance our needs with the resources available to us.
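Here's the back-of-the-envelope math behind that number, as a small Python sketch. It assumes 24 bits per pixel (8-bit RGB, no chroma subsampling), which is my assumption and not something stated above; the function name is made up for illustration.

```python
def raw_video_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 24) -> int:
    """Uncompressed video bitrate in bits per second."""
    return width * height * fps * bits_per_pixel


# 1080p at 30 frames per second, assuming 24 bits per pixel (8-bit RGB).
rgb = raw_video_bitrate_bps(1920, 1080, 30)
print(f"{rgb / 1e9:.2f} Gbps")  # 1.49 Gbps

# Assumption: 4:2:0 chroma subsampling averages out to 12 bits per pixel.
yuv420 = raw_video_bitrate_bps(1920, 1080, 30, bits_per_pixel=12)
print(f"{yuv420 / 1e9:.2f} Gbps")  # 0.75 Gbps
```

Either way you slice it, this is orders of magnitude above the 1Mbps or so that a typical video call gets to work with.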
To compress more, we need:
That last one is where media codecs become really sensitive.
If every bit matters, then losing a bit matters. And if losing a bit matters, then losing a whole packet matters even more.
Since networks are bound to lose packets, we’re going to need to deal with media packets missing and our system (in the decoder or elsewhere) needing to fill that gap somehow.
More on lossy codecs
Media packets are lost. Our media decoders – or WebRTC system as a whole – need to deal with this fact. This is done using different media correction mechanisms. Here’s a quick illustration of the available choices in front of us:
Each such media correction technique has its advantages and challenges. Let’s review them so we can understand them better.
PLC: Packet Loss Concealment
Every WebRTC implementation needs a packet loss concealment strategy. Why? Because at some point, in some cases, you won’t have the packets you need to play NOW. And since WebRTC is all about real-time, there’s no waiting with NOW for too long.
What does packet loss concealment mean? It means that if we lost one or more packets, we need to somehow overcome that problem and continue to run to the best of our ability.
Before we dive a bit deeper, it is important to state: not losing packets is always better than needing to conceal lost packets. More on that – later.
This is done differently between audio and video:
Audio PLC
For the most part, audio packets are decoded frame-by-frame and usually also packet-by-packet. If one is lost, we can try various ways to solve that. These are the most common approaches:
Packet loss on video streams has its own headaches and challenges.
In video, most of the frames are dependent on previous ones, creating chains of dependencies:
I-frames or keyframes (whatever they are called depending on the video codec used) break these dependency chains, and then one can use techniques like temporal scalability to reduce the dependencies for some of the frames that follow.
When you lose a packet, the question isn’t only what to do with the current video frame and how to display it, but rather what is going to happen to future frames depending on the frame with the lost packet.
In the past, the focus was on displaying every bit that got decoded, which ended up with video played back with smears as well as greens and pinks. Check it for yourself, with our most recent WebRTC fiddle around frame loss.
Today, we mostly don’t display frames until we have a clean enough bitstream, opting to freeze the video a bit or skip video frames rather than show something that isn’t accurate enough. With the advances in machine learning, this may change in the future.
PLC is great, but there’s a lot to be done to get back the lost packets as opposed to trying to make do with what we have. Next, we will see the additional techniques available to us.
RTX: Retransmissions
Here’s a simple mechanism (used everywhere) to deal with packet loss – retransmission.
In whatever protocol you use, make sure to either acknowledge receiving what is sent to you or NACK (send a negative acknowledgement) when you don’t receive what you should have received. This way, the sender can retransmit whatever was lost and you will have it readily available.
This works well if there’s enough time for another round trip of data until you must play it back. Or when the data can help you out in future decoding (think the dependency across frames in video codecs). It is why retransmissions don’t always work that well in WebRTC media correction – we’re dealing with real time and low latency.
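To make that trade-off a bit more concrete, here is a minimal Python sketch of the receiver side: spot the gaps in the sequence numbers, then decide if asking for a retransmission can still help. The function names and the simple "round trip must fit before playout" rule are my own simplifications, not how any specific WebRTC implementation does it. It also ignores sequence number wraparound.

```python
def find_gaps(received_seqs: list[int]) -> list[int]:
    """Sequence numbers missing between the lowest and highest one received."""
    if not received_seqs:
        return []
    seen = set(received_seqs)
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


def worth_nacking(rtt_ms: float, ms_until_playout: float) -> bool:
    """A retransmission only helps if it can arrive before the packet is due."""
    return rtt_ms < ms_until_playout


missing = find_gaps([10, 11, 13, 16])
print(missing)  # [12, 14, 15]

# Hypothetical numbers: the lost packets are due for playout in 100ms.
print(worth_nacking(rtt_ms=40, ms_until_playout=100))   # True
print(worth_nacking(rtt_ms=150, ms_until_playout=100))  # False
```

On a short round trip, the NACK is worth sending. On a long one, the retransmitted packet arrives too late to be played, though for video it may still be useful for decoding the frames that depend on it.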
Another variation of this in video streams is asking for a new I-frame. This way, the receiver can signal the sender to “reset” the video stream and start encoding it from scratch, which essentially means a request to break the dependency between the old frames and the new ones that should be sent after the packet loss.
RED: REDundancy Encoding
Retransmission means we overcome packet losses after the fact. But what if we could solve things without retransmissions? We can do that by sending the same packet more than once and be done with it.
Double or triple the bitstream by flooding it with the same information to add more robustness to the whole thing.
RED is exactly that. It concatenates older audio frames into fresh packets that are being sent, effectively doubling or tripling the packet size.
If a packet gets lost, the new frame it was meant to deliver will be found in one of the following packets that should be received.
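A toy version of this idea in Python is below. It is only a sketch of the concept: the packet layout (a dict holding frames by index) is something I made up for illustration and has nothing to do with the actual RED wire format.

```python
def red_pack(frames: list[bytes], redundancy: int = 1) -> list[dict]:
    """Packet i carries frame i plus up to `redundancy` older frames."""
    packets = []
    for i in range(len(frames)):
        start = max(0, i - redundancy)
        packets.append({"seq": i, "frames": {j: frames[j] for j in range(start, i + 1)}})
    return packets


def red_receive(packets: list[dict]) -> dict[int, bytes]:
    """Collect every frame found in the packets that did arrive."""
    recovered: dict[int, bytes] = {}
    for packet in packets:
        for index, frame in packet["frames"].items():
            recovered.setdefault(index, frame)
    return recovered


frames = [b"f0", b"f1", b"f2", b"f3"]
sent = red_pack(frames, redundancy=1)

# Simulate losing packet 1 on the network.
arrived = [p for p in sent if p["seq"] != 1]
recovered = red_receive(arrived)
print(sorted(recovered))  # [0, 1, 2, 3]
```

Frame 1 is recovered from packet 2, which carried it as the redundant copy. Lose two packets in a row with `redundancy=1` and one frame is gone for good, which is where going to `redundancy=2` (and a bigger packet) comes in.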
Yes, it eats up our bandwidth budget, but in a video call where we send 1Mbps of video data or more, more than doubling the audio bitrate from 40kbps to 90kbps might be a sacrifice worth making for cleaner audio.
FEC: Forward Error Correction
Redundancy encoding requires an additional 100% or more of bitrate. We can do better using other means, usually referred to as Forward Error Correction.
Mind you, redundancy encoding is just another type of forward error correction mechanism.
With FEC, we are going to add more packets that can be used to restore other packets that are lost. The most common approach for FEC is by taking multiple packets, XORing them and sending the XORed result as an additional packet of data.
If one of the packets is lost, we can use the XORed packet to recreate the lost one.
There are other correction algorithms that are a wee bit more complex mathematically (search for Reed-Solomon if you’re interested), but the one used in WebRTC for this purpose is XOR.
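The XOR trick is easy to try out yourself. Below is a minimal Python sketch, assuming one parity packet protects a small group of packets and at most one of them is lost. The helper names are mine, and the zero padding is a simplification: real FEC schemes carry the original packet lengths so the padding can be stripped again.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one."""
    size = max(len(a), len(b))
    a = a.ljust(size, b"\x00")
    b = b.ljust(size, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: list[bytes]) -> bytes:
    """The extra FEC packet: all protected packets XORed together."""
    parity = b""
    for packet in packets:
        parity = xor_bytes(parity, packet)
    return parity


def recover_missing(received: list[bytes], parity: bytes) -> bytes:
    """XOR the parity with everything that arrived to rebuild the one lost packet."""
    missing = parity
    for packet in received:
        missing = xor_bytes(missing, packet)
    return missing


packets = [b"aaaa", b"bbbb", b"cccc"]
parity = make_parity(packets)

# Simulate losing the second packet.
rebuilt = recover_missing([packets[0], packets[2]], parity)
print(rebuilt)  # b'bbbb'
```

Here we sent 4 packets to protect 3, so roughly 33% overhead instead of the 100% or more of redundancy encoding. The catch is that a single XOR packet can only restore one lost packet out of the group it protects.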
FEC is still an expensive thing since it increases the bitrate considerably. Which is why it is used only sparingly:
PLC, RTX, FEC, RED, …
How is each one signaled over the network? When would it make sense to use it? How does WebRTC implement it in the browser and what exactly can you expect out of it?
All that is mostly arcane knowledge. Something that is passed from one generation of WebRTC developers to another it seems.
Lucky for you, Philipp Hancke and I are working on a new course – Higher Level WebRTC Protocols. In it, we are covering these specific topics as well as quite a few others in a level of detail that isn’t found anywhere else out there.
Most of the material is already written down. We just need to prettify it a bit and record it.
If you are interested in learning more about this, be sure to join our waiting list for when we launch the course.
The post WebRTC media resilience: the role FEC, RED, PLC, RTX and other acronyms play appeared first on BlogGeek.me.
ChatGPT is changing computing and as an extension how we interact with machines. Here’s how it is going to affect WebRTC.
ChatGPT became the service with the highest growth rate of any internet application, reaching 100 million active users within the first two months of its existence. A few are using it daily. Others are experimenting with it. Many have heard about it. All of us will be affected by it in one way or another.
I’ve been trying to figure out what exactly a “ChatGPT WebRTC” duo means – or in other words – what ChatGPT means for those of us working with and on WebRTC.
Here are my thoughts so far.
Let’s start with a quick look at what ChatGPT really is (in layman terms, with a lot of hand waving, and probably more than a few mistakes along the way).
BI, AI and Generative AI
I’ll start with a few slides I cobbled up for a presentation I did for a group of friends who wanted to understand this.
ChatGPT is a product/service that makes use of machine learning. Machine learning is something that has been marketed a lot as AI – Artificial Intelligence. If you look at how this field has evolved, it would be something like the below:
We started with simple statistics – take a few numbers, sum them up, divide by their count and you get an average. You complicate that a bit with weighted average. Add a bit more statistics on top of it, collect more data points and cobble up a nice BI (Business Intelligence) system.
At some point, we started looking at deep learning:
Here, we train a model by using a lot of data points, to a point that the model can infer things about new data given to it. Things like “do you see a dog in this picture?” or “what is the text being said in this audio recording?”.
Here, a lot of 3 letter acronyms are used like HMM, ANN, CNN, RNN, GNN…
What deep learning did in the past decade or two was enable machines to describe things – be able to identify objects in images and videos, convert speech to text, etc.
It made it the ultimate classifier, improving the way we search and catalog things.
And then came a new field of solutions in the form of Generative AI. Here, machine learning is used to generate new data, as opposed to classifying existing data:
Here what we’re doing is creating a random input vector, pushing it into a generator model. The generator model creates a sample for us – something that *should* result in the type of thing we want created (say a picture of a dog). That sample that was generated is then passed to the “traditional” inference model that checks if this is indeed what we wanted to generate. If it isn’t, we iteratively try to fine tune it until we get to a result that is “real”.
This is time consuming and resource intensive – but it works rather well for many use cases (like some of the images on this site’s articles that are now generated with the help of Midjourney).
The thing is that everything I just explained wouldn’t be interesting without ChatGPT – a service that came into our lives only recently, becoming the hottest thing out there:
February 16, 2023
ChatGPT is based on LLMs – Large Language Models – and it is fast becoming the hottest thing around. No other service grew as fast as ChatGPT, which is why every business in the world now is trying to figure out if and how ChatGPT will fit into their world and services.
Why ChatGPT and WebRTC are like oil and water
So it begged the question: what can you do with ChatGPT and WebRTC?
Problem is, ChatGPT and WebRTC are like oil and water – they don’t mix that well.
ChatGPT generates data whereas WebRTC enables people to communicate with each other. The “generation” part in WebRTC is taken care of by the humans that interact mostly with each other on it.
On one hand, this makes ChatGPT kinda useless for WebRTC – or at least not that obvious to use for it.
But on the other hand, if someone manages to crack this one properly, they will have an innovative and unique thing.
What have people done with ChatGPT and WebRTC so far?
It is interesting to see what people and companies have done with ChatGPT and WebRTC in the last couple of months. Here are a few things that I’ve noticed:
In LiveKit’s and Twilio’s examples, the concept is to use the audio from the humans as part of the prompts for ChatGPT: convert it using Speech to Text, send it to ChatGPT, then convert the ChatGPT response using Text to Speech and pass it back to the humans in the conversation.
Broadening the scope: Generative AI
ChatGPT is one of many generative AI services. Its focus is on text. Other generative AI solutions deal with images or sound or video or practically any other data that needs to be generated.
I have been using Midjourney for the past several months to help me with the creation of many images in this blog.
Today it seems that in any field where new data or information needs to be created, a generative AI algorithm can be a good place to investigate. And in marketing-speak – AI is overused and a new overhyped term was needed to explain what innovation and cutting edge is – so the word “generative” was added to AI for that purpose.
Fitting Generative AI to the world of RTC
How does one go about connecting generative AI technologies with communications then? The answer to this question isn’t an obvious or simple one. From what I’ve seen, there are 3 main areas where you can make use of generative AI with WebRTC (or just RTC):
Here’s what it means
Conversations and bots
In this area, we either have a conversation with a bot or have a bot “eavesdrop” on a conversation.
The LiveKit and Twilio examples earlier are about striking a conversation with a bot – much like how you’d use ChatGPT’s prompts.
A bot eavesdropping on a conversation can offer assistance throughout a meeting or after the meeting –
As I stated above, this has little to do with WebRTC itself – it takes place elsewhere in the pipeline; and to me, this is mostly an application capability.
Media compression
An interesting domain where AI is starting to be investigated and used is media compression. I’ve written about Lyra, Google’s AI enabled speech codec in the past. Lyra makes assumptions on how human speech sounds and behaves in order to send less data over the network (effectively compressing it) and letting the receiving end figure out and fill out the gaps using machine learning. Can this approach be seen as a case of generative AI? Maybe
Would investigating such approaches, where the speakers are known, to better compress their audio and even video make sense?
How about the whole super resolution angle? Where you send video at resolutions of WVGA or 720p and then have the decoder scale them up to 1080p or 4K, losing little in the process. We’re generating data out of thin air, though probably not in the “classic” sense of generative AI.
I’d also argue that if you know the initial raw content was generated using generative AI, there might be a better way in which the data can be compressed and sent at lower bitrates. Is that something worth pursuing or investigating? I don’t know.
Media processing
Similar to how we can have AI based codecs such as Lyra, we can also use AI algorithms to improve quality – better packet loss concealment that learns the speech patterns in real time and then mimics them when there’s packet loss. This is what Google is doing with their WaveNetEQ, something I mentioned in my WebRTC unbundling article from 2020.
Here again, the main question is how much of this is generative AI versus simply AI – and does that even matter?
Is the future of WebRTC generative (AI)?
ChatGPT and other generative AI services are growing and evolving rapidly. While WebRTC isn’t directly linked to this trend, it certainly is affected by it:
Like any other person and business out there, you too should see if and how generative AI affects your own plans.
The post ChatGPT meets WebRTC: What Generative AI means to Real Time Communications appeared first on BlogGeek.me.
RTC@Scale is Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary for RTC@Scale 2023 so you can pick and choose the relevant ones for you.
WebRTC Insights is a subscription service I have been running with Philipp Hancke for the past two years. The purpose of it is to make it easier for developers to get a grip of WebRTC and all of the changes happening in the code and browsers – to keep you up to date so you can focus on what you need to do best – build awesome applications.
We got into a kind of a flow:
Oh – and we’re covering important events somewhat separately. Last month, a week after Meta’s RTC@Scale event took place, Philipp sat down and wrote a lengthy summary of the key takeaways from all the sessions, which we distributed to our WebRTC Insights subscribers.
As a community service (and a kind of a promotion for WebRTC Insights), we are now opening it up to everyone in this article.
Meta ran their rtc@scale event again. Last year was a blast and we were looking forward to this one. The technical content was pretty good again. As last year, our focus for this summary is what we learned or what it means for folks developing with WebRTC. Once again, the majority of speakers were from Meta. At times they crossed the line of “is this generally useful” to the realm of “Meta specific” but most of the talks still provide value.
Compared to last year there were almost no “work with me” pitches (with one exception).
It is surprising how often Meta says “WebRTC” or “Google” (oh and Amazon as well).
Writing up these notes took a considerable amount of time (again) but we learned a ton and will keep referencing these talks in the future so it was totally worth it (again). You can find the list of speakers and topics on the conference website, the seven hours of raw video here (which includes the speaker introductions) or you just scroll down below for our summary.
SESSION 1 Rish Tandon / Meta – Meta RTC State of the Union
We tried capturing as much as possible, which made this a wee bit long. The purpose though is to make it easier for you to decide in which sessions to focus, and even in which parts of each session.
Oh – and did we mention you should check out (and subscribe) to our WebRTC Insights service?
WebRTC media server is an optional component in a WebRTC application. That said, in most common use cases, you will need one.
There are different types of WebRTC servers. One of them is the WebRTC media server. When will you need one and what exactly does it do? Read on.
Oh – and if you’re looking to dig deeper into WebRTC media servers, make sure to check the end of this article for an announcement of our latest WebRTC course.
There are quite a few moving parts in a WebRTC application. There’s the client device side, where you’ll have the web browsers with WebRTC support and maybe other types of clients like mobile applications that have WebRTC implementations in them.
And then there are the server side components and there are quite a few of them. The illustration above shows the 4 types of WebRTC servers you are likely to need:
The illustration below shows how all of these WebRTC servers connect to the client devices and what types of data flows through them:
What is interesting is that the only WebRTC infrastructure component that can be seen as optional is the WebRTC media server. That said, in most real-world use-cases you will need media servers.
The role of a WebRTC media server
At its conception, WebRTC was meant to be “between” browsers. Only recently did the good people at the W3C see fit to change it to something that can also work outside of browsers. We’ve known that to be the case all along.
What does a WebRTC media server do exactly? It processes and routes media packets through the backend infrastructure – either in the cloud or on premise.
Let’s say you are building a group calling service and you want 10 people to be able to join in and talk to each other. For simplicity’s sake, assume we want to get 1Mbps of encoded video from each participant and show the other 9 participants on the screen of each of the users:
How would we go about building such an application without a WebRTC media server?
To do that, we will need to develop a mesh architecture:
We’d have the clients send out 1Mbps of their own media to all the other participants who wish to display them on their screen. This amounts to 9*1Mbps = 9Mbps of upstream data that each participant will be sending out. Each client receives streams from all 9 other participants, getting us to 9Mbps of downstream data.
This might not seem like much, but it is. Especially when sent over UDP in real time, and when we need to encode and encrypt each stream separately for each user, and to determine bandwidth estimation across the network. Even if we reduce the requirement from 1Mbps to a lower bitrate, this is still a hard problem to deal with and solve.
It becomes devilishly hard (impossible?) when we crank up the number to, say, 50 or 100 participants. Not to mention the numbers we see today of 1,000 or more participants in sessions (either active participants or passive viewers).
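To put numbers on the mesh scenario above, here is a minimal sketch of the per-client arithmetic. The `meshLoad` helper and its field names are made up for illustration. It assumes every participant sends the same bitrate to every other participant and, as the text says, encodes each outgoing stream separately.

```typescript
// Per-client load in a full mesh: every participant sends a separate copy of
// its stream to each of the others and receives one stream from each of them.
interface MeshLoad {
  participants: number;
  upstreamMbps: number;     // per client
  downstreamMbps: number;   // per client
  encodesPerClient: number; // one encode per remote peer (assumption from the text)
  totalStreams: number;     // one-directional media streams in the whole session
}

function meshLoad(participants: number, bitrateMbps: number): MeshLoad {
  const peers = Math.max(participants - 1, 0);
  return {
    participants,
    upstreamMbps: peers * bitrateMbps,
    downstreamMbps: peers * bitrateMbps,
    encodesPerClient: peers,
    totalStreams: participants * peers,
  };
}

for (const n of [10, 50, 100]) {
  const load = meshLoad(n, 1);
  console.log(
    `${n} participants: ${load.upstreamMbps} Mbps up and ` +
      `${load.downstreamMbps} Mbps down per client, ` +
      `${load.totalStreams} streams in total`
  );
}
```

With 10 participants at 1Mbps this reproduces the 9Mbps up and 9Mbps down from the example. At 100 participants it is 99Mbps each way per client and 9,900 streams across the session, which is why mesh stops being practical long before that.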
Enter the WebRTC media server
This is where a WebRTC media server comes in. We will add it here to be able to do the following tasks for us:
Here’s what’s really going on and what we use these media servers for:
WebRTC media servers bridge the gaps in the architecture that we can’t solve with clients alone.
How is a WebRTC media server different from TURN servers
Before we continue and dive into the different types of media servers, there’s something that must be said and discussed:
WebRTC media server != TURN server
I’ve seen people try to use the TURN server to do what media servers do. Usually that would be things like recording the data stream.
This doesn’t work.
TURN servers route media through firewalls and NAT devices. They aren’t privy to the data being sent through them. WebRTC privacy is maintained by having data encrypted end to end when passing via TURN servers – the TURN servers don’t know the encryption key so can’t do anything with the media.
WebRTC media servers are implementations of WebRTC clients in a server component. From an architectural point of view, the “session” terminates in the WebRTC media server:
A WebRTC media server is privy to all data passing through it, and acts as a WebRTC client in front of each of the WebRTC devices it works with. It is also why it isn’t so well defined in WebRTC but at the same time so versatile.
Types of WebRTC media servers
This versatility of WebRTC media servers means that there are different types of such servers. Each one works under different architectural assumptions and concepts. Let’s review them quickly here.
Routing media using an SFU
The most common and popular WebRTC media server is the SFU.
An SFU routes media between the devices, doing as little as possible when it comes to the media processing part itself.
The concept of an SFU is that it offloads much of the decision making of layout and display to the clients themselves, giving them more flexibility than any other alternative. At the same time, it takes care of bandwidth management and routing logic to best fit the capabilities of the devices it works with.
At the beginning, SFUs were introduced and used for group calls. Later on, they started to appear as live streaming and broadcast components.
Mixing media with an MCU
Probably the oldest media server solution is the MCU.
The MCU was introduced years before WebRTC, when networks were limited. Telephony systems had/have voice conferencing bridges built around the concept of MCUs. Video conferencing systems required the use of media servers simply because video compression required specialized hardware and later too much CPU from client devices.
In telephony and audio, you’ll see this referred to as mixers or audio bridges and not MCUs. That said, they still are one and the same technically.
What an MCU does is receive and mix the media streams coming from the various participants, sending a single stream of media towards each client. For clients, an MCU looks like a call between 2 participants – it is the only entity the client really interacts with directly. This means there’s a single audio and a single video stream coming into and going out of the client – regardless of the number of participants and how/when they join and leave the session.
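The difference between the topologies discussed so far shows up in how many video streams each client handles. The sketch below is a simplification: the `videoStreamsPerClient` name is hypothetical, and it ignores simulcast layers and any SFU logic that forwards only a subset of participants.

```typescript
type Topology = "mesh" | "sfu" | "mcu";

interface ClientStreams {
  sent: number;     // video streams the client encodes and uploads
  received: number; // video streams the client downloads and decodes
}

// Video streams handled by a single client in a session of `participants`.
function videoStreamsPerClient(topology: Topology, participants: number): ClientStreams {
  const others = Math.max(participants - 1, 0);
  switch (topology) {
    case "mesh":
      // A separate copy goes to every other participant.
      return { sent: others, received: others };
    case "sfu":
      // One upload; the SFU forwards it and routes the other streams back.
      return { sent: 1, received: others };
    case "mcu":
      // One upload, one mixed stream back, regardless of session size.
      return { sent: 1, received: 1 };
  }
}

for (const topology of ["mesh", "sfu", "mcu"] as const) {
  const s = videoStreamsPerClient(topology, 10);
  console.log(`${topology}: sends ${s.sent}, receives ${s.received}`);
}
```

The MCU row stays at one in and one out however large the session gets. That is the “looks like a call between 2 participants” property, paid for with decoding, mixing and re-encoding on the server.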
MCUs were less used in WebRTC from the get go. Part of it was the simple economies of scale – MCUs are expensive to operate, requiring a lot of CPU power (encoding and decoding media is expensive). It is cheaper to offer the same or similar services using SFUs. There are vendors who still rely on MCUs in WebRTC for group calling, though in most cases, you will find MCUs providing the recording mechanism only – where what they end up doing is taking all inputs and mixing them into a single stream to place in storage.
Bridging across standards using a gateway
Another type of media server that is used in WebRTC is a gateway.
In some cases, content – rendered, live or otherwise – needs to be shared in a WebRTC session – or a WebRTC session needs to be shared on another type of a protocol/medium. To do so, a gateway can be used to bridge between the protocols.
The two main cases where these happen are probably:
One more example is a kind of a hybrid media server. One that might do routing and processing together. A group calling service that also records the call into a single stream for example. Such solutions are becoming more and more popular and are usually deployed as multiple media servers of different types (unlike the illustration above), each catering for a different part of the service. Splitting them up makes it easier to develop, maintain and scale them based on the workload needed by each media server type.
Cloud rendering
This might not be a WebRTC media server per se, but for me this falls within the same category.
Sometimes, what we want is to render content in the cloud and share it live with a user on a browser. This is true for things like cloud gaming or cloud application delivery (Photoshop in the cloud for hourly consumption). In such a case, this is more like a peer-to-peer WebRTC session taking place between a user on a browser and a cloud server that renders the content.
I see it as a media server because many of the aspects of development and scaling of the cloud rendering components are more akin to how you’d think about WebRTC media servers than they are about browser or native clients.
A quick exercise: What WebRTC media servers are used by Google Meet?
Let’s look at an example service – Google Meet. Why Google Meet? Well, because it is so versatile today and because if you want to trace capabilities in WebRTC, the best approach is to keep close tabs with what Google Meet is doing.
What WebRTC media servers does Google Meet use? Based on the functionality it offers, we can glean out the types that make up this service:
A classic meeting service in WebRTC may well require more than a single type of WebRTC media server, likely deployed in hybrid mode across different hardware configurations.
When will you need a WebRTC media server?
As we’ve seen earlier, the answer to this is simple – when doing things with WebRTC clients only isn’t possible and we need something to bridge this gap.
We may lack:
What I usually do when analyzing the needs of a WebRTC application is to find these gaps and determine if a WebRTC media server is needed (it usually is). I do so by thinking of the solution as a P2P one, without media servers. And then based on the requirements and the gaps found, I’ll be adding certain WebRTC media server elements into the infrastructure needed for my WebRTC application.
E2EE and WebRTC media servers
We’ve seen a growing interest in recent years in privacy. The internet has shifted to encryption first connections and WebRTC offers encrypted only media. This shift towards privacy started as privacy from other malicious actors on the public internet but has since shifted also towards privacy from the service provider itself.
Running a group meetings service through a service provider that cannot itself access the meeting’s content is becoming more commonplace.
This capability is known as E2EE – End to End Encryption.
When introducing WebRTC media servers into the mix, it means that while they are still a part of the session and are terminating WebRTC peer connections (=terminating encrypted SRTP streams) on their own, they shouldn’t have access to the media itself.
This can be achieved only in the SFU type of WebRTC media servers, by the use of insertable streams. With them, the application logic can exchange private encryption keys between the users and add a second encryption layer that passes transparently through the SFU – enabling it to do its job of packet routing without the ability to understand the media content itself.
WebRTC media servers and open source
Another important aspect to understand about WebRTC media servers is that most of those using media servers in WebRTC do so using open source frameworks for media servers.
I’ve written at length about WebRTC open source projects – there are details there about the state of the market and about open source WebRTC media servers.
What is important to note is that projects that don’t use managed services for their WebRTC media servers usually pick open source WebRTC media servers to work with rather than develop their own from scratch. This isn’t always the case, but it is quite common.
Video APIs, CPaaS and WebRTC media servers
WebRTC Video API and CPaaS is another area I cover quite extensively.
Vendors who decide to use a CPaaS vendor for their WebRTC application will mainly do it in one of two situations:
Both cases require media servers…
This leads to the following important conclusion: there’s no such thing as a CPaaS vendor doing WebRTC that isn’t offering a managed WebRTC media server as part of its solution – and if there is, then I’ll question its usefulness for most potential customers.
Taking a deep dive into WebRTC protocols
Last year, I released the Low-level WebRTC protocols course along with Philipp Hancke.
The Low-level WebRTC protocols course has been a huge success, which is why we’re starting to work on our next course in this series: Higher level WebRTC protocols
Before we go about understanding WebRTC media servers, it is important to understand the inner workings of the network protocols that WebRTC employs. Our low-level protocols course covers the first part of the underlying protocols. This second course looks at the higher level protocols – the parts that deal a bit more with network realities – challenges brought to us by packet losses as well as other network characteristics.
Things we cover here include retransmissions, forward error correction, codecs packetization and a myriad of media processing algorithms.
Want to be the first to know when we open our early bird enrollment? Join the waiting list.
WHIP and WHEP are specifications to get WebRTC into live streaming. But is this really what is needed moving forward?
In recent months, there has been growing adoption in the implementation of these protocols (the adoption of actual use isn’t something I am privy to, so I can’t attest either way). This progress is a positive one, but I can’t ignore the feeling that this is only a temporary solution.
WHIP stands for WebRTC-HTTP Ingestion Protocol. WHEP stands for WebRTC-HTTP Egress Protocol. They are both relatively new IETF drafts that define a signaling protocol for WebRTC.
WebRTC explicitly decided NOT to have any signaling protocol so that developers will be able to pick and choose any existing signaling protocol of their choice – be it SIP, XMPP or any other alternative. For the media streaming industry, this wasn’t a good thing – they needed a well known protocol with ready-made implementations. Which led to WHIP and WHEP.
To understand how they fit into a solution, we can use the diagram below:
In a live streaming use case, we have one or more broadcasters who “Ingest” their media to a media server. That’s where WHIP comes in. The viewers on the other side, get their media streams on the egress side of the media servers infrastructure.
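As a rough sketch of what the ingest side looks like on the wire: a WHIP client publishes by sending its SDP offer in an HTTP POST with an `application/sdp` body. The server replies with the SDP answer and a `Location` header for the session, which the client later removes with an HTTP DELETE. The helper names, the `HttpRequestSketch` type and the example URL below are hypothetical. The code only builds request descriptions and does no networking.

```typescript
// A plain description of an HTTP request, so the shape can be inspected
// without a browser or a real WHIP endpoint.
interface HttpRequestSketch {
  method: "POST" | "DELETE";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Publish: POST the SDP offer to the WHIP endpoint.
function buildWhipPublish(endpointUrl: string, offerSdp: string, bearerToken?: string): HttpRequestSketch {
  const headers: Record<string, string> = { "Content-Type": "application/sdp" };
  if (bearerToken) {
    headers["Authorization"] = `Bearer ${bearerToken}`;
  }
  return { method: "POST", url: endpointUrl, headers, body: offerSdp };
}

// Stop: DELETE the session URL that the server returned in its Location header.
function buildWhipStop(sessionUrl: string, bearerToken?: string): HttpRequestSketch {
  const headers: Record<string, string> = {};
  if (bearerToken) {
    headers["Authorization"] = `Bearer ${bearerToken}`;
  }
  return { method: "DELETE", url: sessionUrl, headers };
}

// Hypothetical endpoint and a truncated placeholder offer, for illustration only.
const publish = buildWhipPublish("https://media.example.com/whip/live", "v=0\r\n", "secret-token");
console.log(publish.method, publish.url, publish.headers);
```

In a browser, the offer would come from `RTCPeerConnection.createOffer()`, the request would go out through `fetch`, and the response body would be applied with `setRemoteDescription`. That is all the signaling there is, which is the appeal for an industry that wants interchangeable encoders and media servers.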
For a technical overview of WHIP & WHEP, check out this Kranky Geek session by Sergio Garcia Murillo from Dolby:
In video conferencing, WebRTC transformed the market and how it thought of meetings and interoperability by practically killing the notion of interoperability across vendors on the protocol level, shifting it to the application level and letting users install their own apps on devices or just load web pages on demand.
The streaming industry is different – it relies on 3 components, which can easily come from 3 different vendors:
When a broadcaster implements his application, he picks and chooses the media servers and media players. Sometimes he will also pick the ingestion part, but not always. And none of the vendors in each of these 3 categories can really enforce the use of his own components for the others.
This posed a real issue for WebRTC – it has no signaling protocol – this is left for the implementers, but how do you develop such a solution that works across vendors without a suitable signaling protocol?
The answer for that was WHIP and WHEP –
These are really simple protocols built around the notion of a single HTTP request – in an attempt to get the streaming industry to use them and not shy away from the complexities hidden in WebRTC.
Strengths
Here’s what’s working well for WHIP and WHEP:
There’s the challenging side of things as well:
This last weakness – WebRTC – leads me to the next issue at hand.
Streaming, latency and WebRTC
Streaming comes in different shapes and sizes.
The scenario might have different broadcasters:viewers count – 1:1, 1:many, few:1, few:many – each has its own requirements and nuances as to what I’d prefer using on the sending side, receiving end and on the media server itself.
What really changes everything here is latency. How much latency are we willing to accept?
The lower the latency we want the more challenging the implementation is. The closer to live/real time we wish to get, the more sacrifices we will need to make in terms of quality. I’ve written about the need to choose either quality or latency.
WebRTC is razor focused on real time and live. So much so that it can’t really handle something that has latency in it. It can – but it will sacrifice too much for it at a high complexity cost – something you don’t really want or need.
What does that mean exactly?
This is when a few tough questions need to be asked – what exactly does your streaming service need?
If you need things to be conducted in sub-second latency only, then WebRTC is probably the way to go. But if you have in your use case other latencies as well, then think twice before choosing WebRTC as your go-to solution.
A hybrid WebRTC approach to “live” streaming
An important aspect that needs to be mentioned here is that in many cases, WebRTC is used in a hybrid model in media streaming.
Oftentimes, we want to ingest media using WebRTC and view the media elsewhere using other protocols – usually because we don’t care as much about latency or because we already have the viewing component solved and deployed – here WebRTC ingest is added to an existing service.
Adding the WHIP protocol here, and ingesting WebRTC media to the streaming service means we can acquire the media from a web browser without installing anything. Real time is nice, but not always needed. Browser ingest though is mostly about reducing friction and enabling web applications.
The 3 horsemen: WebTransport, WebCodecs and WebAssembly
That last suggestion would have looked different just two years ago, when for real time the only game in town for browsers was WebRTC. Today though, it isn’t the case.
In 2020 I pointed to the unbundling of WebRTC. The trend in which WebRTC is being split into its core components so that developers will be able to use each one independently, and in a way, build their own solution that is similar to WebRTC but isn’t WebRTC. These components are:
Theoretically, using these 3 components one can build a real time communication solution, which is exactly what Zoom is trying to do inside web browsers.
In the past several months I’ve seen more and more companies adopting these interfaces. It started with vendors using WebAssembly for background blurring and replacement. Moved on to companies toying around with WebTransport and/or WebCodecs for streaming and recently a lot of vendors are doing noise suppression with WebAssembly.
Here’s what Intel showcased during Kranky Geek 2021:
This trend is only going to grow.
How does this relate to streaming?
Good that you asked!
These 3 enable us to implement our own live streaming solution, not based on WebRTC, that can achieve sub-second latency in web browsers. It is also flexible enough for us to add mechanisms and tools that can handle higher latencies as needed, where at higher latencies we improve the quality of the media.
Strengths
Here’s what I like about this approach:
It isn’t all shiny though:
I don’t know.
WHIP and WHEP are here. They are gaining traction and have vendors behind them pushing them.
On the other hand, they don’t solve the whole problem – only the live aspect of streaming.
The reason WebRTC is used at the moment is because it was the only game in town. Soon that will change with the adoption of solutions based on WebTransport+WebCodecs+WebAssembly where an alternative to WebRTC for live streaming in browsers will introduce itself.
Can this replace WebRTC? For media streaming – yes.
Is this the way the industry will go? This is yet to be seen, but definitely something to track.
The post WHIP & WHEP: Is WebRTC the future of live streaming? appeared first on BlogGeek.me.
The post Web 上的视频帧处理 – WebAssembly、WebGPU、WebGL、WebCodecs、WebNN 和 WebTransport appeared first on webrtcHacks.
The post Video Frame Processing on the Web – WebAssembly, WebGPU, WebGL, WebCodecs, WebNN, and WebTransport appeared first on webrtcHacks.
Understanding how WebRTC is governed in reality will enable you to make better decisions in your development strategy.
Whether you are correct or not is something we can argue about. What we can’t argue about is that a company maintaining an open source library doesn’t owe you anything.
Free is worth exactly what you pay for it. 0⃣
And there lies the whole issue – if you aren’t paying for WebRTC, then what gives you the right to complain? (By the way, this is different from the other side of it: could Google do a better job of maintaining WebRTC for everyone at the same or lower effort, while increasing external contributions to it?)
Too. Many. Times. People. Complain. About. Google.
I do that as well.
If you are complaining, at least know that you’re complaining about something that is reasonable…
One of the more recent cases comes from Twilio (or more accurately a customer of theirs):
There was a minor change in Google’s implementation of WebRTC. For some reason, they decided to be less lenient with how they parse iceServers in peer connections to be more “spec compliant”.
Yes. It is nitpicking.
Yes. It is a useless change.
Yes. They could have decided not to do it.
But they did. And in a weird way, it makes sense to do so.
And there’s a process in place already for dealing with that – Canary and Beta versions of Chrome that vendors (like Twilio) can use to catch and handle these things beforehand. Or they can… well… subscribe to WebRTC Insights.
Twilio had to fix their code (and they did by the way), and yet there are those who blame Google here for making changes in Chrome. Changes that one can say are needed.
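One way to get ahead of this kind of strictness is to lint the `iceServers` list yourself before handing it to a peer connection. The sketch below is not a reproduction of Chrome’s parser, nor of what Twilio changed. It is a generic check based on the STUN and TURN URI grammars (RFC 7064 and RFC 7065), where only `turn:`/`turns:` URLs may carry a `?transport=` query. The `findSuspectIceUrls` helper is hypothetical, and it deliberately ignores transport extensions beyond `udp` and `tcp`.

```typescript
// Mirrors the shape of an RTCIceServer entry without needing browser typings.
interface IceServerLike {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// stun:/stuns: URLs are host plus optional port, with no query string.
const STUN_URL = /^stuns?:[^?\s]+$/i;
// turn:/turns: URLs may add "?transport=udp" or "?transport=tcp".
const TURN_URL = /^turns?:[^?\s]+(\?transport=(udp|tcp))?$/i;

// Returns every URL that a spec-strict parser is likely to reject.
function findSuspectIceUrls(servers: IceServerLike[]): string[] {
  const suspects: string[] = [];
  for (const server of servers) {
    const urls = Array.isArray(server.urls) ? server.urls : [server.urls];
    for (const url of urls) {
      if (!STUN_URL.test(url) && !TURN_URL.test(url)) {
        suspects.push(url);
      }
    }
  }
  return suspects;
}

// Hypothetical server names, for illustration only.
const suspects = findSuspectIceUrls([
  { urls: "stun:stun.example.org:3478?transport=udp" },
  { urls: ["turn:turn.example.org:3478?transport=udp", "turns:turn.example.org:5349?transport=tcp"] },
]);
console.log(suspects);
```

Running a check like this in CI or against the Canary and Beta channels costs little, and it turns a production outage into a failed test.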
I’d add a few more thoughts here before I continue to dive into this topic properly:
WebRTC is an open standard governed by the W3C and an open source library which confusingly is also named “webrtc”. I prefer to call it libwebrtc.
The WebRTC open standard is somewhat split in “ownership” between the W3C and the IETF. The W3C is in charge of the API surface we use in the browser for WebRTC and the IETF is in charge of the network protocol itself – what gets sent over the network.
WebRTC as an open source library is… well… it depends. Google develops and maintains libwebrtc – that’s the source code that goes into Chrome. And Edge. And Firefox. And Safari. Yes – all of them. And then there are other alternative libraries you can use.
The thing is this – you can’t really use a different WebRTC implementation in the browser, because browsers come with libwebrtc “built-in”. And in many cases, if you don’t need a browser, you may still want to use libwebrtc just to be as close as possible to the browser implementation.
Does that mean that Google owns the WebRTC implementation? To some degree it does – while there are alternatives, none of them are truly usable for many of the use cases.
That said, anyone can fork the Google WebRTC implementation and create his own project – open source or otherwise – and continue from there. Apple could do it. So could Microsoft and Mozilla. And yet they all decided to stick with libwebrtc as is.
Why is that?
I can think of two main reasons:
So in a way, Google owns WebRTC without really owning it. At least as long as Chrome is the undisputed and dominant form in which we consume the internet (are you reading this on a Chrome browser?)
I usually place a global market share graph at this stage. This time, I’ll share this website’s visitors distribution:
A few words about libwebrtc
libwebrtc is maintained by Google for Google. It is open sourced and you can use it. You can even contribute back, which isn’t a simple process.
By Google for Google means that prioritization of features, testing and bug fixes is done based on Google’s needs. These needs include Google Meet, a few other Google services and the need to support and maintain the larger ecosystem.
Who sets the tone here? What decides if your bug is more important to deal with than Google Meet or another vendor’s problems?
Put yourself in the shoes of the Google product manager for WebRTC and you’ll know the answer – it would be Google Meet first. The others later.
This also sets the tone as to the build system and code structure of libwebrtc. It is highly geared towards its use inside Chrome. Less elsewhere. And this in turn means that adopting it as a library inside your own application means dealing with code that isn’t meant to be a classic general purpose SDK – you’ll need to figure your way through it (and with a bit less documentation than you’d like).
Vendors in the WebRTC ecosystem
There are now hundreds if not thousands of vendors using WebRTC in the ecosystem. They do it directly or indirectly via CPaaS vendors and other tooling and solutions. You can find many of them in my WebRTC Developer Tools Landscape. Most of them view WebRTC as free. Not only that, it seems like many treat WebRTC as a human right – it needs to be there for them, it must be perfect, and if there’s something “wrong” with it, then humanity has the obligation to fix it for them.
So… WebRTC is free. But what does that mean exactly? What is the SLA associated with it? What can you expect of it and come back to complain if it isn’t met?
Here are a few additional interesting questions, if WebRTC is cardinal and strategic to your application:
To be clear – there are no right or wrong answers here – just make sure you position your expectations based on your answers as well.
Putting your money where your mouth is
Philipp Hancke has been doing WebRTC for a long time and is renowned for his bug reports. He even got Google to fix quite a few of them. Some bugs stayed open for years however, like this bug about TURN relay servers being used sometimes in cases where using STUN will be just fine. A bug here has an impact on the percentage of calls that get relayed via TURN servers which has a negative impact on call quality (at times) but also increases the cost to run those.
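If you want to know how often your own calls end up relayed, the standard statistics are enough to find out. The sketch below follows the selected candidate pair from the `transport` stats entry to its local candidate and checks for `candidateType === "relay"`. The `StatsRecord` shape and the helper name are assumptions for illustration. It only looks at the local side, and it relies on the browser exposing `selectedCandidatePairId`.

```typescript
// Minimal subset of the stats fields this check needs; real reports carry more.
interface StatsRecord {
  id: string;
  type: string;
  selectedCandidatePairId?: string; // on "transport" entries
  localCandidateId?: string;        // on "candidate-pair" entries
  candidateType?: string;           // on "local-candidate" entries
}

// true = relayed via TURN, false = host/srflx/prflx, undefined = cannot tell.
function selectedPairUsesRelay(stats: StatsRecord[]): boolean | undefined {
  const byId = new Map<string, StatsRecord>();
  for (const entry of stats) {
    byId.set(entry.id, entry);
  }
  const transport = stats.find(
    (entry) => entry.type === "transport" && entry.selectedCandidatePairId !== undefined
  );
  if (!transport || transport.selectedCandidatePairId === undefined) return undefined;
  const pair = byId.get(transport.selectedCandidatePairId);
  if (!pair || pair.localCandidateId === undefined) return undefined;
  const local = byId.get(pair.localCandidateId);
  if (!local) return undefined;
  return local.candidateType === "relay";
}

// In a browser (not executed here), the input would come from something like:
//   const report = await pc.getStats();
//   selectedPairUsesRelay([...report.values()] as StatsRecord[]);
const sample: StatsRecord[] = [
  { id: "T1", type: "transport", selectedCandidatePairId: "CP1" },
  { id: "CP1", type: "candidate-pair", localCandidateId: "L1" },
  { id: "L1", type: "local-candidate", candidateType: "relay" },
];
console.log(selectedPairUsesRelay(sample));
```

Aggregating this flag across sessions gives the relayed-call percentage the bug affects, which in turn maps to TURN bandwidth costs.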
This bug has been open since 2016. Quite a few Googlers took a look but without finding anything that stood out. The crucial hint of what goes wrong came in 2021 in another bug report. In the end, Philipp had to acquire the skills necessary to fix the bug (which will hopefully happen before the end of 2023).
This takes time and time is not cheap – especially that of engineers. Microsoft as his employer apparently decided it was important enough for him to spend time on fixing this and other issues.
Please Google add a feature for me!
HEVC encoding and decoding in WebRTC seems to be a topic some folks get excited about. It would be great to know why.
There is a bug report about it in the WebRTC issue tracker which gets fairly frequent updates. And yet… Google does nothing! How can that be?
One would say that’s because it is out of the requirements of what Google needs for Google. There are other contributing factors as well here:
There’s this modern concept of zero trust in cloud computing these days.
Here’s my suggestion to you wrt WebRTC and your stance:
Don’t expect – and you won’t be disappointed.
But more importantly – understand how this game is played:
And yes – we’re here to help – you can use WebRTC Insights to get ahead of these issues in many ways.
The post With WebRTC, don’t expect Google to be your personal outsourcing vendor appeared first on BlogGeek.me.
WebRTC used to be about capturing some media and sending it from Point A to Point B. Machine Learning has changed this. Now it is common to use ML to analyze and manipulate media in real time for things like virtual backgrounds, augmented reality, noise suppression, intelligent cropping, and much more. To better accommodate this […]
The post Real-Time Video Processing with WebCodecs and Streams: Processing Pipelines (Part 1) appeared first on webrtcHacks.