Explore the world of video codecs and their significance in WebRTC. Understand the advantages and trade-offs of switching between different codec generations.
Technology grinds forward with endless improvements. I remember when I first came to video conferencing, over 20 years ago, the video codecs used were H.261, H.263 and H.263+ with all of its glorious variants. H.264 was starting to be discussed and deployed here and there.
Today? H.264 and VP8 are everywhere. We bump into VP9 in WebRTC applications and we talk about AV1.
What does it mean exactly to move from one video codec generation to another? What do we gain? What do we lose? This is what I want to cover in this article.
Don’t have time for my ramblings? This short video should have you mostly covered:
👉 I started recording these videos a few months back. If you like them, then don’t forget to like them 😉
The TL;DR:
A codec is a piece of software that compresses and decompresses data. A video codec consists of an encoder which compresses a raw video input and a decoder which decompresses the compressed bitstream of a video back to something that can be displayed.
👉 We are dealing here with lossy codecs: codecs that don’t preserve all of the data, but instead discard information, trying to retain as much of the original as possible while storing as little data as possible.
The way video codecs are defined is by their decoder:
Given a bitstream generated by a video encoder, the video codec specification indicates how to decompress that bitstream back into a viewable format.
What does that mean?
Video codecs require a lot of CPU and memory to operate. This means that in many cases, our preference would be to offload their job from the CPU to hardware acceleration. Most modern devices today have media acceleration components in the form of GPUs or other chipset components that are capable of bearing the brunt of this work. It is why mobile devices can shoot high quality videos with their internal camera for example.
Since video codecs are dictated by the specification of their decoder, defining and implementing hardware acceleration for video decoders is a lot easier than doing the same thing for video encoders. That’s because the decoders are deterministic.
For the video encoder, you need to start asking questions –
This leads us to the fact that in many cases and scenarios, hardware acceleration of video codecs isn’t suitable for WebRTC at all – they are added to devices so people can watch YouTube videos of cats or create their own TikTok videos. Both of these activities are asynchronous ones – we don’t care how long the process of encoding and decoding takes (we do, but not in the range of milliseconds of latency).
Up until a few years ago, most hardware acceleration out there didn’t work well for WebRTC and video conferencing applications. This started to change with the Covid pandemic, which caused a shift in priorities. Remote work and remote collaboration scenarios climbed the priorities list for device manufacturers and their hardware acceleration components.
Where does that leave us?
The end result? Another headache to deal with… and we didn’t even start to talk about codec generations.
New video codec generation = newer, more sophisticated tools
I mentioned the tools that are the basis of a video codec. The decoder knows how to read a bitstream based on these tools. The encoder picks and chooses which tools to use when.
When moving to a newer codec generation what usually happens is that the tools we had are getting more flexible and sophisticated, introducing new features and capabilities. And new tools are also added.
More tools and features mean the encoder now has more decisions to make when it compresses. This usually means the encoder needs to use more memory and CPU to get the job done if what we’re aiming for is better compression.
Switching from one video codec generation to another means we need the devices to be able to carry that additional resource load…
A few hard facts about video codecs
Here are a few things to remember when dealing with video codecs:
It is time to start looking at WebRTC and its video codecs. We will begin with the MTI video codecs – the Mandatory To Implement ones. This was a big debate back in the day: the standardization organizations couldn’t decide whether VP8 or H.264 should be the MTI codec.
To make a long story short – a decision was made that both are MTI.
What does this mean exactly?
These video codecs are rather comparable for their “price/performance”. There are differences though.
👉 If you’re contemplating which one to use, I’ve got a short free video course to guide you through this decision making process: H.264 or VP8 – What Shall it be?
The emergence of VP9 and rejection of HEVC
The descendants of VP8 and H.264 are VP9 and HEVC.
H.264 is a royalty bearing codec and so is HEVC. VP8 and VP9 are both royalty free codecs.
HEVC being newer and considerably more expensive made things harder for it to be adopted for something like WebRTC. That’s because WebRTC requires a large ecosystem of vendors and agreements around how things are done. With a video codec, not knowing who needs to pay the royalties stifles its adoption.
And here, should the ones paying be the chipset vendor? Device manufacturer? The browser vendor? The application developer? No easy answer, so no decision.
This is why HEVC ended up being left out of WebRTC for the time being.
VP9 was an easy decision in comparison.
Today, you can find VP9 in applications such as Google Meet and Jitsi Meet among many others who decided to go for this video codec generation and not stay in the VP8/H.264 generation.
The big promise of VP9 was its SVC support
Our brave new world of AV1
AV1 is our next gen of video codecs. The promise of a better world. Peace upon the earth. Well… no.
Just a fork in the road that puts the focus on a future that is mostly royalty free for video codecs (maybe).
What do we get from AV1 as a new video codec generation compared to VP9? Mainly what we got from VP9 compared to VP8: better quality for the same bitrate, at the price of more CPU and memory.
Where VP9 brought us the promise of SVC, AV1 is bringing with it the promise of better screen sharing of text. Why? Because its compression tools are better equipped for text, something that was/is lacking in previous video codecs.
AV1 has most of the industry behind it. Somehow, at a magical moment in the past, they got together and came to the conclusion that a royalty free video codec would benefit everyone, creating the Alliance for Open Media and with it the AV1 specification. This gave the codec the push it needed to become the most dominant video coding technology of our near future.
For WebRTC, it marks the third video codec generation that we can now use:
Here’s an update of what Meta is doing with AV1 on mobile from their RTC@Scale event earlier this year.
This is a start. And a good one. You see experiments taking place as well as first steps towards productizing it (think Google Meet and Jitsi Meet here among others) in the following areas:
First things first. If you’re going to use a video codec of a newer generation than what you currently have, then this is what you’ll need to decide:
Do you keep the same bitrate you had in the past, effectively increasing the media quality of the session? Or do you lower the bitrate from where it was, reducing your bandwidth requirements?
Obviously, you can also pick anything in between the two, reducing the bitrate used a bit and increasing the quality a bit.
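To make that trade-off a bit more concrete, here is a minimal Python sketch. The function name, the `codec_savings` fraction and the `quality_bias` knob are all assumptions made up for illustration; real savings depend on the codec, the encoder implementation and the content being encoded.

```python
def target_bitrate_kbps(current_kbps: float,
                        codec_savings: float,
                        quality_bias: float) -> float:
    """Pick a bitrate for the newer codec generation.

    codec_savings: hypothetical fraction of bitrate the newer codec saves at
                   equal quality (0.3 means 30% fewer bits for the same quality).
    quality_bias:  1.0 keeps the old bitrate (all gains go to quality),
                   0.0 takes the full bitrate reduction (quality stays the same),
                   anything in between splits the gain.
    """
    if not 0.0 <= codec_savings < 1.0:
        raise ValueError("codec_savings must be in [0, 1)")
    if not 0.0 <= quality_bias <= 1.0:
        raise ValueError("quality_bias must be in [0, 1]")
    return current_kbps * (1.0 - codec_savings * (1.0 - quality_bias))


# Illustrative numbers only: a 1,000 kbps stream and an assumed 30% saving.
same_bitrate = target_bitrate_kbps(1000, 0.3, quality_bias=1.0)
lower_bitrate = target_bitrate_kbps(1000, 0.3, quality_bias=0.0)
in_between = target_bitrate_kbps(1000, 0.3, quality_bias=0.5)
```

The single `quality_bias` parameter is just one way to express "anything in between"; a real service would likely tune this per resolution and per network condition.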
Starting to use another video codec though isn’t only about bitrate and quality. It is about understanding its tooling and availability as well:
There’s a lot more to be said about video codecs and how they get used in WebRTC.
For more, you can always enroll in my WebRTC courses.
The post WebRTC video codec generations: Moving from VP8 and H.264 to VP9 and AV1 appeared first on BlogGeek.me.
WebRTC’s peer connection includes a getStats method that provides a variety of low-level statistics. Basic apps don’t really need to worry about these stats but many more advanced WebRTC apps use getStats for passive monitoring and even to make active changes. Extracting meaning from the getStats data is not all that straightforward. Luckily return author […]
The post Power-up getStats for Client Monitoring appeared first on webrtcHacks.
Lip synchronization is a solved problem in WebRTC. That’s at least the case in the naive 1:1 sessions. The challenges start to mount once you hit multiparty architectures or when audio and video get generated/rendered separately.
Let’s dive into the world of lip synchronization, understand how it is implemented in WebRTC and in which use cases we need to deal with the headaches it brings with it.
Discover the fascinating world of lip synchronization technology and its impact on WebRTC applications.
When you watch a movie or any video clip for that matter on your device – be it a PC display, tablet, smartphone or television – the audio and video that gets played back at you gets lip synced. There’s no “combination” of audio and video. These are two separate data sets / files / streams that are associated with one another in a synchronized fashion.
When you play out an mp4 file for example, it is actually a container file of multiple media streams. Each decoded and played out independently, synchronized again by timing the playout.
This was a decision made long ago that enables more flexibility in encoding technologies – you can use different codecs for the audio and the video of the content, based on your needs and the type of content you have. It also makes more sense since the codecs and technologies for compressing audio and video are quite different from one another.
The RTP/RTCP solution to lip synchronization
When we’re dealing with WebRTC, we’re using SRTP as the protocol to send our media. SRTP is just the secure variant of RTP which is what I want to focus on here.
RTP is used to send media over the internet. RTCP acts as the control protocol for RTP and is tightly coupled with it.
The solution used for lip synchronization of RTP and RTCP was to rely on timestamps. To make sure we’re all confused though, the smart folks who conjured this solution up, decided to go with different types of timestamps and frequencies (it likely made them feel smart, though there’s probably a real reason I am not aware of that at least made sense at some point in the past).
We’re going to dive together into the charming world of RTP and NTP timestamps and see how together, we can lip sync audio and video in WebRTC.
RTP timestamp
RTP timestamp is like using “position: relative;” in CSS. We cannot use it to discern the absolute time a packet was sent (and we do not know the receiver’s clock in relation to ours).
What we can do with it, is discern the time that has passed between one RTP timestamp to another.
The slide above is from my Low-level WebRTC protocols course in the RTP lesson. Whenever we send a packet of media over the internet in WebRTC, the RTP header for that packet (be it audio or video) has a timestamp field. This field has 32 bits of data in it (which means it can’t be absolute in a meaningful way – not enough bits).
WebRTC picks the starting point for the RTP timestamps randomly, and from there it increases the value based on the frequency of the codec. Why the frequency of the codec and not something saner like “milliseconds” or similar? Because.
For audio, we increment the RTP timestamp by 48,000 every second for the Opus voice codec. For video, we increment it by 90,000 every second.
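Here is a minimal Python sketch of what "relative" means in practice: turning the difference between two RTP timestamps of the same stream into elapsed seconds. The function and constant names are mine, and the sketch assumes the two timestamps are less than half of the 32-bit range apart, so a single wraparound can be untangled.

```python
RTP_TS_MODULUS = 1 << 32  # the RTP timestamp field is 32 bits wide

OPUS_CLOCK_RATE = 48_000   # ticks per second for Opus audio
VIDEO_CLOCK_RATE = 90_000  # ticks per second for video


def rtp_elapsed_seconds(first_ts: int, second_ts: int, clock_rate: int) -> float:
    """Seconds that passed between two RTP timestamps of the same stream."""
    # Modulo arithmetic absorbs a wraparound of the 32-bit counter.
    ticks = (second_ts - first_ts) % RTP_TS_MODULUS
    # A difference in the upper half of the range means second_ts is older.
    if ticks >= RTP_TS_MODULUS // 2:
        ticks -= RTP_TS_MODULUS
    return ticks / clock_rate


# A 20 ms Opus frame advances the timestamp by 960 ticks.
print(rtp_elapsed_seconds(1_000, 1_960, OPUS_CLOCK_RATE))  # 0.02

# One second of video that straddles the 32-bit wraparound.
print(rtp_elapsed_seconds(RTP_TS_MODULUS - 45_000, 45_000, VIDEO_CLOCK_RATE))  # 1.0
```

Note that the same number of ticks means different durations for audio and video, which is exactly why the two streams cannot be compared by RTP timestamp alone.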
The headache we’re left dealing with here?
We said RTP timestamp is relative? Then NTP timestamp is like using “position: absolute;” in CSS. It gives us the wallclock time. It is 64 bits of data, which means we don’t want to send it as often over the network.
Oh, and it covers 1900-2036 after which it wraps around (expect a few minor bugs a decade from now because of this). This is slightly different from the more common Unix 1970 startpoint timestamp.
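For the curious, a small Python sketch of converting between the 64-bit NTP format (32 bits of seconds since 1900, 32 bits of fraction) and Unix time. The function names are mine, and the decoder assumes the current NTP era, meaning it does not handle the 2036 wraparound.

```python
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET_SECONDS = 2_208_988_800


def ntp_to_unix_seconds(ntp_timestamp: int) -> float:
    """Convert a 64-bit NTP timestamp to Unix seconds (assumes NTP era 0)."""
    seconds = ntp_timestamp >> 32            # upper 32 bits: whole seconds
    fraction = ntp_timestamp & 0xFFFFFFFF    # lower 32 bits: 1/2**32 of a second
    return (seconds - NTP_UNIX_OFFSET_SECONDS) + fraction / 2**32


def unix_to_ntp(unix_seconds: float) -> int:
    """Convert Unix seconds to a 64-bit NTP timestamp."""
    whole = int(unix_seconds)
    fraction = int((unix_seconds - whole) * 2**32)
    # Masking to 32 bits is where the 2036 wraparound would happen.
    return (((whole + NTP_UNIX_OFFSET_SECONDS) & 0xFFFFFFFF) << 32) | fraction


print(ntp_to_unix_seconds(NTP_UNIX_OFFSET_SECONDS << 32))  # 0.0
```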
The slide above is from my Higher-level WebRTC protocols course in the Inside RTCP lesson.
You can see that when an RTCP SR block is sent over the network (let’s assume once every 5 seconds), then we get to learn about the NTP timestamp of the sender, as well as the RTP timestamp associated with it.
In a way, we can “sync” any given RTP timestamp we bump into with the NTP/RTP timestamp pair we receive for that stream in an RTCP SR.
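The mapping described above can be sketched in a few lines of Python. This is an illustration under assumptions, not how any specific WebRTC stack implements it: the `SenderReport` class and both function names are hypothetical, and the NTP time is assumed to have already been converted to seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenderReport:
    """The NTP/RTP pair from the latest RTCP SR of one stream (hypothetical type)."""
    ntp_seconds: float   # sender wallclock at the time of the report
    rtp_timestamp: int   # RTP timestamp that corresponds to that wallclock
    clock_rate: int      # 48_000 for Opus, 90_000 for video


def rtp_to_sender_time(rtp_timestamp: int, sr: SenderReport) -> float:
    """Place an RTP timestamp on the sender's wallclock using the SR pair."""
    ticks = (rtp_timestamp - sr.rtp_timestamp) % (1 << 32)
    if ticks >= 1 << 31:          # packet is older than the report
        ticks -= 1 << 32
    return sr.ntp_seconds + ticks / sr.clock_rate


def av_offset_seconds(audio_rtp: int, audio_sr: SenderReport,
                      video_rtp: int, video_sr: SenderReport) -> float:
    """Positive result: the video frame was captured after the audio frame."""
    return (rtp_to_sender_time(video_rtp, video_sr)
            - rtp_to_sender_time(audio_rtp, audio_sr))


audio_sr = SenderReport(ntp_seconds=100.0, rtp_timestamp=48_000, clock_rate=48_000)
video_sr = SenderReport(ntp_seconds=100.0, rtp_timestamp=180_000, clock_rate=90_000)

# Both land one second after the reports, so they should be played together.
print(av_offset_seconds(96_000, audio_sr, 270_000, video_sr))  # 0.0
```

Once both streams sit on the same wallclock, the receiver can delay whichever one is ahead so that the two play out together.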
What are we going to use this for?
Let’s sum this part up:
Easy peasy. Until it isn’t.
👉 RTP, RTCP and other protocols are covered in our WebRTC Protocols courses. If you want to dig deeper into WebRTC or just to upskill yourself, check out webrtccourse.com
When lip synchronization breaks in WebRTC
RTP/RTCP gives us the mechanism to maintain lip synchronization. And WebRTC already makes use of it. So why and how can WebRTC lose lip synchronization?
There are three main reasons for this to happen:
I’d like to tackle that from the perspective of the use cases. There are a few that are more prone than others to lip synchronization issues in WebRTC.
Group video conferences
In group video conferencing there are no lip synchronization issues. At least not if you design and develop it properly and make sure that you either use the SFU model or the MCU model.
Some implementations decide to decouple voice and video streams, handling them separately and in different architectural solutions:
The diagram above shows what that means. Take a voice conferencing vendor that decided to add video capabilities:
In such cases, I often hear the explanation of “this is quite synchronized. It only loses sync when the network is poor”. Well… when the network is poor is when users complain. And adding this to their list of complaints won’t help. Especially if you want to be on par with the competition.
💡 What to do in this case? Go all in for SFU or all in for MCU – at least when it comes to the avoidance of splitting the audio and video processing paths.
Cloud rendering
The other big architectural headache for lip synchronization is cloud rendering. This is when the actual audio and/or video gets rendered and not acquired from a camera/microphone on some browser or mobile device.
In cloud gaming, for example, a game gets played, processed and rendered on a server in the cloud. Since this isn’t done in the web browser, the WebRTC stack used there needs to be aware of the exact timing of the audio and video frames – prior to them being encoded. This information should then be translated to the NTP+RTP timestamps that WebRTC needs. Not too hard, but just another headache to deal with.
For many cases of cloud gaming, we might even prioritize latency over lip synchronization, playing whatever we have when we get it as much as possible over having audio (or video) wait up for the other media type. That’s because in cloud games, a few milliseconds can be the difference between winning and game over.
When we’re dealing with our brave new world of conversational AI, now powered by LLM and WebRTC, then the video will usually follow the rendering of the audio, and might be done on a totally different machine. At the very least, it will occur using a different set of processes and algorithms.
💡 Here, it is critical to understand and figure out how to handle the NTP and RTP timestamps to get proper lip synchronization.
Latency and peripherals (and their effect on lip synchronization)
Something I learned a bit later in my life when dealing with video conferencing is that the devices you use (the peripherals) have their own built in latency.
The sad thing here is that there’s NOTHING you can do about it. Remember that this is the user’s display or headset we’re talking about – you can’t tell them to buy something else.
On top of this, you have software device drivers that do noise reduction on the audio or add silly hats on the video (or replace the video altogether). These take their own sweet time to process the data and to add their own inherent latency into the whole media pipeline.
Device drivers on the operating system level should account for this lag, and this needs to be factored into your lip synchronization logic. Otherwise, you are bound to get issues here.
Got lip synchronization issues in your WebRTC application?
Lip synchronization is one of these nasty things that can negatively impact the perception of media quality in WebRTC applications. Solving it requires reviewing the architecture, sniffing the network, and playing around with the code to figure out the root cause prior to doing any actual fixing.
I’ve assisted a few clients in this area over the years, trying together to figure out what went wrong and working out suitable solutions around this.
The post Lip synchronization and WebRTC applications appeared first on BlogGeek.me.
Explore the concept of WebRTC latency and its impact on real-time communication. Discover techniques to minimize latency and optimize your application.
WebRTC is about real time. Real time means low latency, low delay, low round trip – whatever metric you want to relate to (they are all roughly the same).
Time to look closer at latency and how you can reduce it in your WebRTC application.
Let’s do this one short and sweet:
Latency sometimes gets confused with round trip time. Let’s put things in order quickly here so we can move on:
Need more?
👉 I’ve written a longer form post on Cyara’s blog – What is Round-trip Time and How Does it Relate to Network Latency
👉 Round trip time (RTT) is one of the 7 top WebRTC video quality metrics
Latency isn’t good for your WebRTC health
When it comes to WebRTC and communications in general, latency isn’t a good thing. The higher it is, the worse off you are in terms of media quality and user experience.
That’s because interactivity requires low latency – it needs the ability to respond quickly to what is being communicated.
Here are a few “truths” to be aware of:
👉 One of the main things you should monitor and strive to lower is latency. This is usually done by looking at the round trip time metrics (which is what we can measure better than latency).
What are we measuring when it comes to latency?
When you say “latency” – what do you mean exactly?
Latency starts with defining what part of the session we are measuring.
And within that definition, there might be multiple pieces of processing in the pipeline that we’d want to measure individually. Usually we’d want to do that to decide where to focus our energies in optimizing and reducing the latency.
Here are two recent posts that talk about latency in the WebRTC-LLM space:
👉 You can decide to improve latency of the same use case, and take very different routes in how you end up doing that.
Different use cases deal with latency differently
Latency is tricky. There are certain physical limits we can’t overcome – the most notable one used as an example is the speed of light: trying to pass a message from one side of the globe to the other will take considerable milliseconds no matter what you do, even not accounting for the need to process the data along its route.
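To put a number on that physical limit, here is a back-of-the-envelope Python sketch. The 20,000 km figure is an approximation of half the Earth’s circumference, and the two-thirds factor for light travelling in optical fiber is a commonly used rule of thumb that I am assuming here; real routes are longer and add processing time on top.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458   # in vacuum
FIBER_VELOCITY_FACTOR = 2 / 3           # assumption: rough speed of light in fiber vs vacuum
HALF_EARTH_CIRCUMFERENCE_KM = 20_000    # approximate distance to the other side of the globe


def min_one_way_latency_ms(distance_km: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on one-way latency imposed by the speed of light."""
    return distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor) * 1000


vacuum_ms = min_one_way_latency_ms(HALF_EARTH_CIRCUMFERENCE_KM)
fiber_ms = min_one_way_latency_ms(HALF_EARTH_CIRCUMFERENCE_KM, FIBER_VELOCITY_FACTOR)

print(round(vacuum_ms, 1))  # 66.7
print(round(fiber_ms, 1))   # 100.1
```

Double those numbers for a round trip, and that is before a single router, server or codec has touched the data.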
Each use case or scenario has different ways to deal with these latencies. From defining what a low enough value is, through where in the processing pipeline to focus on optimizations, to the techniques to use to optimize latency.
Here are a few industries/markets where we can see how these requirements vary.
👉 Interested in the Programmable Video market, where vendors take care of your latency and use case? Check out my latest report: Video APIs: market status
Conferencing
Video conferencing has a set of unique challenges:
💡 Latency in conferencing? Below 200 milliseconds you’re doing great. 400 or 500 milliseconds is too high, but can be lived with if one must (though grudgingly).
Streaming
Streaming is more lenient than video conferencing. We’re used to seconds of latency for streaming. You click on Netflix to start a movie and it can take a goodly couple of seconds at times. Nagging? Yes. Something to cancel the service for? No.
That said, we are moving towards live streaming, where we need more interactivity. From auctions, to sports and betting, to webinars and other use cases. Here are a few of the challenges seen here:
💡 For live streaming? 500 milliseconds is great. 1-2 seconds is good, depending on the scenario.
Gaming
Gaming has a multitude of scenarios where WebRTC is used. What I want to focus on here is the one of having the game rendered by a cloud server and played “remotely” on a device.
The games here vary as well (which is critical). These can be casual games, board games (turn by turn), retro games, high end games, first person shooters, …
Often, these games have a high level of interaction that needs to be real time. Online gamers would pick an ISP, equipment and configuration that lowers their latency for games – just in order to get a bit more reaction time to improve their performance and score in the game. And this has nothing to do with rendering the whole game in the cloud – just about passing game state (which is smaller). Here’s an example of an article by CenturyLink for gamers on latency on their network. Lots of similar articles out there.
Cloud gaming, where the game gets rendered on the server in full and the video frames are sent via WebRTC over the network? That requires low latency to be able to play properly.
💡 In cloud gaming 50-60 milliseconds latency will be tolerable. Above that? Game over. Oh, and if you play against someone with 30 milliseconds? You’re still dead at 50 milliseconds. The lower the better at any number of milliseconds
Conversational AI
Conversational AI is a hot topic these days. Voice bots, LLM, Generative AI. Lots of exciting new technologies. I’ve covered LLM and WebRTC recently, so I’ll skip the topic here.
Suffice to say – conversational AI requires the same latencies as conferencing, but brings with it a slew of new challenges by the added processing time needed in the media pipeline of the voice bot itself – the machine that needs to listen and then generate responses.
I know it isn’t a fair comparison to latencies in conferencing, because there we don’t add in the human participant’s response time, or even the time it takes them to understand what is being sent their way. Still, at the moment, the response time of most voice bots is too slow for high levels of interaction.
💡 In conversational AI, the industry is striving to reach sub 500 milliseconds latencies. Being able to get to 200-300 milliseconds will be a dream come true.
Reducing latency in WebRTC
Different use cases have different latency requirements. They also have different architecture and infrastructure. This leads to the simple truth that there’s no single way to reduce latency in WebRTC. What needs to be done depends on the use case and on the way you’ve architected your application.
If you split the media processing pipeline in WebRTC to its coarse components, it makes it a bit easier to understand where latency occurs and from there, to decide where to focus your attention to optimize it.
Browsers and latency reduction
When handling WebRTC in browsers there’s not much you can do on the browser side to reduce latency. That’s because the browser controls and owns the whole media processing stack for you.
There are still areas where you can and should take a look at what you’re doing. Here are a few questions you should ask yourself:
The most important thing in the browser is going to be the collection of latency related measurements. Ones you can use later on for monitoring and optimizing it. These would be rtt, jitter and packet loss that we mentioned earlier.
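Once the measurements are collected and shipped to your backend, summarizing them is the easy part. Below is a minimal Python sketch, assuming the samples are round trip times in seconds, such as the `currentRoundTripTime` values that `getStats` reports on the selected ICE candidate pair. The function name and the choice of median, 95th percentile and maximum are mine.

```python
import statistics
from typing import Dict, Sequence


def summarize_rtt(samples_seconds: Sequence[float]) -> Dict[str, float]:
    """Summarize round trip time samples (in seconds) as milliseconds."""
    if not samples_seconds:
        raise ValueError("need at least one RTT sample")
    ordered_ms = sorted(sample * 1000 for sample in samples_seconds)
    count = len(ordered_ms)
    # Nearest-rank 95th percentile, computed with integer math: ceil(0.95 * n) - 1.
    p95_index = (95 * count + 99) // 100 - 1
    return {
        "median_ms": statistics.median(ordered_ms),
        "p95_ms": ordered_ms[p95_index],
        "max_ms": ordered_ms[-1],
    }


# Illustrative samples only, not real measurements.
summary = summarize_rtt([0.040, 0.060, 0.050])
```

Looking at the tail (p95, max) and not just the median matters here, since it is the users with the worst latency who end up complaining.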
Mobile and latency reduction
Mobile applications, desktop applications, embedded applications. Any device side application that doesn’t run in a browser is something you have more control over.
This means there’s more room for you to optimize latency. That said, it usually requires specialized expertise and more resources than many would be willing to invest.
Places to look at here?
When taking this route, also remember that most optimizations here are going to be device and operating system specific. This means you’ll have your hands full with platforms to optimize.
Infrastructure latency reduction
This is the network latency that most of the rtt metric in WebRTC statistics comes from.
Where your infrastructure is versus the users has a huge impact on the latency.
The example I almost always use? Two users in France connected via a media or TURN server in the US.
Figuring out where your users are, what ISPs they are using, where to place your own servers, through which carriers to connect them to the users, how to connect your servers to one another when needed – all these are things you can optimize.
For starters, look at where you host your TURN servers and media servers. Compare that to where your users are coming from. Make sure the two are aligned. Also make sure the servers allocated for users are the ones closest to them (closest in terms of latency – not necessarily geography).
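As a tiny Python sketch of "closest in terms of latency": given round trip times a client measured against each candidate server, pick the lowest one. The function name, region names and numbers are hypothetical, and how you probe each server is left out on purpose.

```python
from typing import Mapping


def pick_lowest_latency_server(rtt_ms_by_server: Mapping[str, float]) -> str:
    """Return the server with the lowest measured round trip time."""
    if not rtt_ms_by_server:
        raise ValueError("no servers were measured")
    return min(rtt_ms_by_server, key=rtt_ms_by_server.__getitem__)


# Hypothetical measurements from a user in France.
measured = {"us-east": 95.0, "eu-west": 18.0, "eu-central": 24.0}
chosen = pick_lowest_latency_server(measured)  # "eu-west"
```

Measuring instead of guessing from a map is the point: routing and peering between carriers can make a geographically farther server the faster one.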
See if you need to deploy your infrastructure in more locations.
Rinse and repeat – as your service grows – you may need to change focus and infrastructure locations.
Other areas of improvement here are using Anycast or network acceleration that is offered by most large IaaS vendors today (at higher network delivery prices).
Media server processing and latencies
Then there are the media servers themselves. Most services need them.
Media servers are mainly the SFUs and MCUs that take care of routing and mixing media. There are also gateways of many shapes and sizes.
These media servers process media and have their own internal media processing pipelines. As with other pipelines, they have inherent latencies built into them.
Reducing that latency will reduce the end to end latency of the application.
The brave (new) world of generative AI, conversational AI and… LLMs
Remember where we started here? Me discussing latency because WebRTC-LLM use cases had to focus on reducing latency in their own pipeline.
This got the industry looking at latency again, trying to figure out how and where you can shave a few more milliseconds along the way.
Frankly? This needs to be done throughout the pipeline – from the device, through the infrastructure and the media servers and definitely within the STT-LLM-TTS pipeline itself. This is going to be an ongoing effort for the coming year or two I believe.
Know your latency in WebRTC
We can’t optimize what we don’t measure.
The first step here is going to be measurements.
Here are some suggestions:
Did I mention that testRTC has some of the tools you’ll need to set up these environments? 😉 And if you need assistance with this process, you know where to find me.
The post Reducing latency in WebRTC appeared first on BlogGeek.me.
Learn about WebRTC LLM and its applications. Discover how this technology can improve real-time communication using conversational AI.
Talk about an SEO-rich title… anyways. When Philipp suggests something to write about I usually take note and write about it. So it is time for a teardown of last month’s demo by OpenAI – what place WebRTC takes there, how it affects the programmable video market of Video APIs.
I’ve been dragged into this discussion before. In my monthly recorded conversation with Arin Sime, we talked about LLMs and WebRTC:
Time to break down the OpenAI demo that was shared last month and what role WebRTC and its ecosystem plays in it.
Just to be on the same page, watch the demo below – it is short and to the point:
(for the full announcement demos video check out this link. You really should watch it all)
There were several interfaces shown (and not shown) in these demos:
Besides the interface used, there were 3 important aspects mentioned, explained and shown:
Let’s see why this is different from what we’ve seen so far, and what is needed to build such things.
Text be like…
ChatGPT started off as text prompting.
You write something in the prompt, and ChatGPT obligingly answers.
It does so with a nice “animation”, spewing the words out a few at a time. Is that due to how it works, or does it slow down the animation versus how it works? Who knows?
This gives a nice feel of a conversation – as if it is processing and thinking about what to answer, making up the sentences as it goes along (which to some extent it does).
This quaint prompting approach works well for text. A bit less for voice.
And now that ChatGPT added voice, things are getting trickier.
“Traditional” voice bots are like turn based games
Before all the LLM craze and ChatGPT, we had voice bots. The acronyms at the time were NLP and NLU (Natural Language Processing and Natural Language Understanding). The result was like a board game where each side has its turn – the customer and the machine.
The customer asks something. The bot replies. The customer says something more. Oh – now’s the bot’s turn to figure out what was said and respond.
In a way, it felt/feels like navigating the IVR menus via voice commands that are a bit more natural.
The turn by turn nature means there was always enough time.
You could wait until you heard silence from the user (known as endpointing). Then start your speech to text process. Then run the understanding piece to figure out intents. Then decide what to reply and turn it into text and from there to speech, preferably with punctuation, and then ship it back.
The pieces in red can easily be broken down into more logic blocks (and they usually are). For the purpose of discussing the real time nature of it all, I’ve “simplified” it into the basic STT-NLU-TTS
To build bots, we focused on each task one at a time. Trying to make that task work in the best way possible, and then move the output of that task to the next one in the pipeline.
If that takes a second or two – great!
But it isn’t what we want or need anymore. Turn based conversations are arduous and tiring.
Realtime LLMs are like… real-time games
Here are the 4 things from the announcement itself that struck a chord with me when GPT-4o was introduced:
Then there was the fact that the person in the demo cuts GPT-4o short in mid-sentence and actually gets a response back without waiting until the end.
There’s more flexibility here as well. Less to learn about what needs to be said to “strike” specific intents.
Moving from turn based voice bots to real-time voice bots is no easy feat. It is also what’s in our future if we wish these bots to become commonplace.
Real life and conversational bots
The demo was quite compelling. In a way, jaw dropping.
There were a few things there that were either emphasized or skimmed through quickly. They show off capabilities that, if they arrive in the product once it launches, are going to make a huge difference in the industry.
Here are the ones that resonated with me
There are quite a few topics that still need to be addressed. OpenAI and ChatGPT have made huge strides and this is another big step. But it is far from the last one.
We will know more on how this plays out in real life once we get people using it and writing about their own experiences – outside of a controlled demo at a launch event.
Working on the WebRTC and LLM infrastructure
In our domain of communication platforms and infrastructure, there are a few notable vendors that are actively working on fusing WebRTC with LLMs. This definitely isn’t an exhaustive list. It includes:
They are taking slightly different approaches, which makes it all the more interesting.
Before we start, let’s take the diagram from above of voicebots and rename the NLU piece into LLM, following marketing hype as it is today:
The main difference now is that LLM is like pure black magic: We throw corpuses of text into it, the more the merrier. We then sprinkle a bit of our own knowledge base and domain expertise. And voila! We expect it to work flawlessly.
Why? Because OpenAI makes it seem so easy to do…
Programmable Video and Video APIs doing LLM
In our domain of programmable video, what we see are vendors trying to figure out the connectors that make up the WebRTC-LLM pipeline and doing that at as low latency as possible.
Agora
Agora just published a nice post about the impact of latency on conversational AI.
The post covers two areas:
In a way, they focus on the WebRTC-realm of the problem, ignoring (or at least not saying anything about) the AI/LLM-realm of the problem.
It should be said that this piece is important and critical in WebRTC no matter if you are using LLMs or just doing a plain meeting between mere humans.
Daily
Daily takes its own approach to LLMs, the same way it does in other areas. They offer a kind of a Prebuilt solution. They bring in partners and integrations and optimize them for low latency.
In a recent post they discuss the creation of the fastest voice bot.
For Daily, WebRTC is the choice to go for since it is already real time in nature. Sprinkle on top of it some of the Daily infrastructure (for low latency). And add the new components that are not part of a typical WebRTC infrastructure. In this case, packing Deepgram’s STT and TTS along with Meta’s Llama 3.
The concept here is to place STT-LLM-TTS blocks together in the same container so that the message passing between them doesn’t happen over a network or an external API. This reduces latencies further.
Go read it. They also have a nice table with the latency consumers along the whole pipeline in a more detailed breakdown than my diagrams here.
LiveKit
In January this year, LiveKit introduced the LiveKit Agents. Components used to build conversational AI applications. They haven’t spoken since about this on their blog, or about latency.
That said, it is known that OpenAI is using LiveKit for their conversational AI. So whatever worries OpenAI has about latencies are likely known to LiveKit…
LiveKit has been lucky to score such a high profile customer in this domain, giving it credibility in this space that is hard to achieve otherwise.
Twilio’s approach to LLMs
Twilio took a different route when it comes to LLM.
Ever since its acquisition of Segment, Twilio has been pivoting or diversifying. From communications and real time into personalization and storage. I’ve written about it somewhat when Twilio announced sunsetting Programmable Video.
This makes the announcement a few months back quite reasonable: Twilio AI Assistant
This solution, in developer preview, focuses on fusing the Segment data on a customer with the communication channel of Twilio’s CPaaS. There’s little here in the form of latency or real time conversations. That seems to be secondary for Twilio at the moment, but is also something they are likely now exploring as well due to OpenAI’s announcement of GPT-4o.
For Twilio? Memory and personalization are what is important about the LLM piece. And this is likely highly important to their customer base. How other vendors without access to something like Segment are going to deal with it is yet to be seen.
Fixie anyone?
When you give Philipp Hancke an article to review, he has good tips. This time it meant I couldn’t make this one complete without talking about fixie.ai. For a company that raised $17M they don’t have much of a website.
Fixie is important because of 3 things:
Fixie is working on Ultravox, an open source platform that is meant to offer a speech-to-speech model. No more need for STT and TTS components. Or breaking these into smaller pieces yet.
From the website, it seems that their focus at the moment is modeling speech directly into the LLM, avoiding the need to go through speech to text. The reasoning behind this approach is twofold:
The second part of it, of converting the result of the LLM back into speech, is not there yet.
Why is that interesting?
There are a lot more topics to cover around WebRTC and LLM. Rob Pickering looks at scaling these solutions for example. Or how do you deal with punctuations, pauses and other phenomena of human conversations.
With every step we make along this route, we find a few more challenges we need to crack and solve. We’re not there yet, but we definitely stumbled upon a route that seems really promising.
The post OpenAI, LLMs, WebRTC, voice bots and Programmable Video appeared first on BlogGeek.me.
Get your copy of my ebook on the top 7 video quality metrics and KPIs in WebRTC (below).
I’ve been dealing with VoIP ever since I finished my first degree in computer science. That was… a very long time ago.
WebRTC? Been at it since the start. I co-founded testRTC, dealing with testing and monitoring WebRTC applications. Did consulting. Wrote a lot about it.
For the last two years I’ve been meaning to write a short ebook explaining video quality metrics in WebRTC. And I finally did that 😎
The challenges of measuring video quality
Ever since we started testRTC, customers came to us asking for a quality score to fit their video application. But where do you even begin?
Deciding what’s good or bad is a personal decision that needs to be made by each and every company for its applications. Sometimes, differently per scenario used.
Where do we even start then?
Packet loss and latency aren’t enough
If I had to choose two main characteristics of media quality in real time communications, they would be packet loss and latency.
Packet loss tells you how bad the network conditions are (at least most of the time this is what it is meant to do). Your goal would be to reduce packet loss as much as possible (don’t expect to fully eradicate it).
Latency indicates how far the users are from your infrastructure or from each other. Shrinking this improves quality.
But that’s not enough. There’s more to it than these two metrics to be able to get a better picture of your application’s media quality – especially when dealing with video streams.
Know your top 7 video quality metrics in WebRTC
Which is why I invite you to download and review the top 7 video quality metrics in WebRTC – my new ebook which lists the most important KPIs when it comes to understanding video quality in WebRTC. There you will find an explanation of these metrics, along with my suggestions on what to do about them in order to improve your application’s video quality.
And yes – the ebook is free to download and read – once you jot down your name and email, it will be sent to you directly.
The post Video quality metrics you should track in WebRTC applications appeared first on BlogGeek.me.
Discover the hidden dangers of packet loss and its impact on your WebRTC application. Find out how to optimize your network performance and minimize packet loss.
If there’s one thing that can give you better media quality in WebRTC it is going to be the reduction (or elimination?) of packet loss. Nothing else will be as effective as this.
What I want to do here is to explain packet loss, why it is inevitable, and the many ways we have at our disposal to increase the resilience and quality of our media in WebRTC in the face of packet losses.
There are many reasons for packet losses to occur on modern networks and with WebRTC. To count a few of these:
We think of the internet as a reliable network. You direct a browser to a web page. And magically the page loads. If it doesn’t, then the network or server is down. End of story. That’s because packet losses there are handled by retransmitting what is lost. The cost? You wait a wee bit longer for your page to load.
With WebRTC we are dealing with real time communications. So if something gets lost there is little time to fix that.
👉 Packet losses are a huge headache for WebRTC applications
What to do to overcome packet losses?
Packet loss is an inevitability when it comes to WebRTC and VoIP in general. You can’t really avoid it. The question then becomes what can we do about this?
There are four different approaches here that can be combined for a better user experience:
From here on, let’s review each one of these four approaches.
Have less packet losses
This is the most important solution.
Because I don’t want you to miss this, I’ll write this again:
This is the most important solution.
If there is less packet loss, there is going to be less headache to deal with when trying to “fix” this situation. So reducing packet loss should be your primary objective. Since you can’t fully eradicate packet loss, we will still need to use other techniques. But it starts with reducing the amount of packet losses.
Location of infrastructure elements in WebRTC
Where you place your media servers and TURN servers and how you route traffic for your WebRTC service will have a huge impact on packet loss.
Best practice today is having the first server that WebRTC media hits as close to the user as possible. The understanding behind that is that this reduces the number of hops and network infrastructure components that the media packets need to traverse over the open internet. Once on your server, you have a lot more control over how that data gets processed and forwarded between the servers.
Having a single data center in the US cater for all your traffic is great, assuming your users are from that region. Once users start joining from across the pond (say, France or India), you will start seeing higher latencies and with them higher levels of packet loss.
A few things here:
Where to start?
👉 Know the latency (RTT) of your users. Monitor it. Strive towards improving it
👉 Check if there are locations and users that are routed across regions. Beef up your infrastructure in the relevant regions based on this data
👉 Since we want to reduce packet loss, you should also monitor… packet loss
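To make the monitoring part a bit more concrete, here is a minimal Python sketch that turns two consecutive stats snapshots into a packet loss percentage for that interval. The packetsReceived and packetsLost counters are the cumulative ones reported for inbound RTP streams. The InboundSnapshot class and the interval_loss_percent() function are names I made up for illustration, and collecting the snapshots is left out.

```python
from dataclasses import dataclass


@dataclass
class InboundSnapshot:
    """Cumulative counters for one inbound RTP stream (hypothetical container)."""
    packets_received: int
    packets_lost: int


def interval_loss_percent(prev: InboundSnapshot, curr: InboundSnapshot) -> float:
    """Packet loss percentage between two snapshots of the same stream."""
    received = curr.packets_received - prev.packets_received
    # The cumulative loss counter can move backwards when late or duplicate
    # packets show up, so clamp the delta at zero.
    lost = max(0, curr.packets_lost - prev.packets_lost)
    expected = received + lost
    if expected <= 0:
        return 0.0
    return 100.0 * lost / expected


# 950 packets arrived and 50 went missing during the interval.
loss = interval_loss_percent(InboundSnapshot(1000, 10), InboundSnapshot(1950, 60))
```

Looking at loss per interval rather than the cumulative total is what lets you spot a spike in the middle of a long call.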
Better bandwidth estimation
I should have called this better bandwidth management, but for SEO reasons, kept it bandwidth estimation 😉
Here’s the thing:
Sending more than the network can handle, more than the sender can send or more than the receiver can receive leads to packet loss and packet drops.
Fixing that boils down to bandwidth management – you don’t want to send too little since media quality will be lower than what you can achieve. And you don’t want to send too much since… well… packet loss.
Your service needs to be able to estimate bandwidth. That needs to happen on both the uplink and the downlink for each user.
The challenge is that available bandwidth is dynamic in nature. At each point in time, we need to estimate it. If we overshoot – packets are going to be delayed or lost. If we undershoot, we are going to reduce media quality below what we can achieve.
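The overshoot and undershoot trade-off can be sketched as a toy loss-based estimator. This is not what browsers actually run (their algorithms also look at delay and are far more involved). The thresholds, the step sizes and the next_estimate_kbps() name are all assumptions picked for illustration.

```python
def next_estimate_kbps(current_kbps: float, loss_fraction: float,
                       min_kbps: float = 100.0, max_kbps: float = 5000.0) -> float:
    """One update step of a toy loss-based bandwidth estimator.

    All constants below are illustrative assumptions, not tuned values.
    """
    if loss_fraction > 0.10:
        # We are overshooting: back off in proportion to the loss we see.
        candidate = current_kbps * (1.0 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Almost no loss: we may be undershooting, so probe a bit higher.
        candidate = current_kbps * 1.05
    else:
        # In between: hold the current estimate.
        candidate = current_kbps
    return max(min_kbps, min(max_kbps, candidate))
```

Calling this once per feedback interval gives the familiar sawtooth: the estimate creeps up while the network is clean and drops when loss appears.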
Web browser implementations of WebRTC have their own bandwidth management algorithms and they are rather good. Media servers have different implementations and their quality varies.
For media servers, we also need to remember that we aren’t dealing only with bandwidth estimation but rather with bandwidth management. Once we approximately know the available bandwidth, we need to decide which of the streams to send over the connection and at which bitrates; doing that while seeing the bigger picture of the session (hence bandwidth management and not estimation).
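Here is a rough sketch of what that management decision might look like on a media server, assuming simulcast-like layers per stream. The allocate_layers() function, the stream names and the bitrates are hypothetical. Real servers weigh a lot more (active speaker, screen share, layout sizes), so treat this as one possible greedy policy and nothing more.

```python
from typing import Dict, List, Tuple


def allocate_layers(available_kbps: float,
                    streams: List[Tuple[str, List[float]]]) -> Dict[str, float]:
    """Pick one bitrate per stream without exceeding the estimated downlink.

    streams is ordered by priority; each entry holds the layer bitrates in
    ascending order. A chosen bitrate of 0 means the stream is paused.
    """
    chosen: Dict[str, float] = {}
    budget = available_kbps
    # Pass 1: try to give every stream its lowest layer, in priority order.
    for name, layers in streams:
        if layers and layers[0] <= budget:
            chosen[name] = layers[0]
            budget -= layers[0]
        else:
            chosen[name] = 0
    # Pass 2: spend what is left on upgrades, again in priority order.
    for name, layers in streams:
        if chosen[name] == 0:
            continue
        for rate in layers[1:]:
            extra = rate - chosen[name]
            if extra > budget:
                break
            budget -= extra
            chosen[name] = rate
    return chosen


plan = allocate_layers(1000, [("speaker", [150, 500, 1500]),
                              ("alice", [150, 500]),
                              ("bob", [150, 500])])
# plan == {"speaker": 500, "alice": 150, "bob": 150}
```

The point of the two passes is the bigger picture: everyone gets something before anyone gets the best quality.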
Conceal packet losses (PLC)
Packet loss concealment is what we do after the fact. We lost packets, but we need to play out something for the user. What should we do to conceal the problem of packet loss?
This may seem like the last thing to deal with, but it is the first we need to tackle. There are two reasons why:
Audio and video are different, which is why from here on, we will distinguish between the two in the techniques we are going to use.
Audio and packet loss concealment
With audio, a loss of an audio packet almost always translates immediately to a loss of one or more audio frames (and we usually have 50 audio frames per second).
“Skipping” them doesn’t work so well, as it leads to robotic audio when there’s packet loss.
Other naive approaches here include things like playing back the last frame received – either as is or with a reduction in its volume.
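The "play back the last frame at a lower volume" idea fits in a few lines. This is a minimal sketch that works on already-decoded frames represented as lists of samples, with None marking a lost frame. The conceal() function, its parameters and the 0.5 attenuation factor are assumptions for illustration; it is not how any specific WebRTC stack implements concealment.

```python
from typing import List, Optional


def conceal(frames: List[Optional[List[float]]], frame_size: int,
            attenuation: float = 0.5) -> List[List[float]]:
    """Replace lost frames (None) with a fading copy of the last good frame."""
    output: List[List[float]] = []
    last_good: Optional[List[float]] = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = frame
            gain = 1.0  # a fresh frame resets the fade
            output.append(list(frame))
        elif last_good is not None:
            gain *= attenuation  # each consecutive loss gets quieter
            output.append([sample * gain for sample in last_good])
        else:
            output.append([0.0] * frame_size)  # nothing received yet: silence
    return output


played = conceal([[1.0, -1.0], None, None, [0.5, 0.5]], frame_size=2)
# played == [[1.0, -1.0], [0.5, -0.5], [0.25, -0.25], [0.5, 0.5]]
```

Fading out instead of repeating at full volume keeps a long burst of losses from turning into a loud, buzzing loop.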
More sophisticated approaches try to estimate what should have been received by way of machine learning (or what we love calling it these days – generative AI). Google has such a capability inhouse (though not inside the open source implementation of WebRTC that they have). If you are interested in learning more about this, you can check out Google’s explanation of WaveNetEQ.
A few things to remember here:
👉 For the most part, this isn’t something in your control, unless you own/compile your WebRTC stack on the device side
👉 Knowing how browsers behave here enables you to be slightly smarter with the other techniques you are going to use (by deciding when to use them and how aggressively)
👉 In your own native application? You can improve on things, but you need to know what you’re doing and you need to have a compelling reason to take this route
Video and packet loss concealment 👉 frame dropping
Video is trickier with packet losses:
One lost packet translates into a lost frame, which can easily cause loss of the whole video sequence:
Packet loss concealment in video means dropping a frame, and oftentimes freezing the video until the next keyframe arrives.
What can the receiver do in case of such a loss? If it believes it won’t recover quickly (which is most commonly the case), it can send out a FIR or PLI message over RTCP to the sender. These messages indicate to the sender that there’s a loss that needs to be addressed, where the usual solution is to reset the encoder and send a new keyframe.
In the past, systems used to try and overcome packet losses by continuing to decode without the missing packets. The end result was smearing artifacts on the video until a new keyframe arrived. Today, best practice is to freeze the video until a keyframe arrives (which is what all browser implementations do).
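The freeze-until-keyframe behavior is essentially a tiny state machine on the receiver. Here is a sketch of it, where the FreezeOnLossReceiver class is a made-up name, frames are reduced to two flags, and sending the actual PLI over RTCP is represented by a counter.

```python
class FreezeOnLossReceiver:
    """Toy receiver: freeze on an incomplete frame, resume on the next keyframe."""

    def __init__(self) -> None:
        self.waiting_for_keyframe = True  # nothing can be decoded before a keyframe
        self.keyframe_requests = 0        # stands in for PLI messages sent

    def on_frame(self, is_keyframe: bool, complete: bool) -> str:
        if not complete:
            if not self.waiting_for_keyframe:
                # First loss since the last good keyframe: ask once, then wait.
                self.keyframe_requests += 1
                self.waiting_for_keyframe = True
            return "freeze"
        if is_keyframe:
            self.waiting_for_keyframe = False
            return "render"
        # A complete delta frame is only useful if its references are intact.
        return "freeze" if self.waiting_for_keyframe else "render"


receiver = FreezeOnLossReceiver()
sequence = [(True, True), (False, True), (False, False),
            (False, True), (True, True), (False, True)]
results = [receiver.on_frame(key, ok) for key, ok in sequence]
# results == ["render", "render", "freeze", "freeze", "render", "render"]
```

Note how the delta frame that arrives intact right after the loss is still frozen: it depends on a frame we never decoded.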
A few things to remember here:
👉 You have more control here than in audio. That’s because a lost packet means you will receive FIR or PLI message on the other end. If that’s your media server receiving these messages, you can decide how to respond
👉 Sending a keyframe means investing more on bitrate for that frame. If there’s congestion over the network, then this will just put more burden. Most media servers would avoid sending too many of these in larger group meetings
👉 There are video coding techniques that reduce the dependencies between frames. These include temporal scalability and SVC
Retransmitting lost packets (RTX)
If a packet is missing, then the first solution we can go for is to retransmit it.
The receiver knows what packets it is missing. Once the sender knows about the missing packets (via NACK messages), it can resend them as RTX packets.
Retransmission is the most economic solution in terms of network resources. It is the least wasteful solution. It is also the hardest to make use of. That’s because it ends up looking something like this:
In order to retransmit, we need to:
This takes time. A long time.
The question then becomes: is it going to be too late to retransmit them?
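That "too late" check boils down to simple arithmetic: a request and a retransmission cost roughly one round trip, and the packet is only useful if it lands before its playout deadline. A minimal sketch, where the retransmission_can_arrive() name and the 10 ms safety margin are assumptions of mine:

```python
def retransmission_can_arrive(now_ms: float, playout_deadline_ms: float,
                              rtt_ms: float, margin_ms: float = 10.0) -> bool:
    """True if a packet requested now could still arrive before it is needed.

    The NACK travels to the sender and the RTX packet travels back, which is
    about one round trip. margin_ms is an assumed allowance for processing.
    """
    return now_ms + rtt_ms + margin_ms <= playout_deadline_ms


# Packet is due for playout in 100 ms.
worth_asking_near = retransmission_can_arrive(0.0, 100.0, rtt_ms=50.0)   # True
worth_asking_far = retransmission_can_arrive(0.0, 100.0, rtt_ms=120.0)   # False
```

This is also why bringing your servers closer to the users pays off twice: lower RTT makes retransmissions usable in more cases.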
Video and RTX
Video can make real use of retransmissions (and it does in WebRTC).
With video compression, we have a kind of hierarchy of frames. Some frames are more important than others:
The above illustration, for example, shows how keyframes and temporal scalability build dependency chains. Key denotes the keyframe while L0 has higher usability than L1 frames (L1 frames are dependent on L0 frames and nothing depends on them).
When we have such a dependency tree of frames, we can do some interesting things with resiliency. One of them is deciding if it is worthwhile to ask for a retransmission:
Audio compression doesn’t enjoy the same dependency tree that video compression does. Which is why libwebrtc doesn’t have code to deal with audio RTX.
Would having RTX for audio be useful? It can be. Audio packets usually wait for video packets to arrive for lip synchronization purposes. If we can use that wait time to retransmit, then we can improve upon audio quality. Google likely deemed this not important enough.
Correct packet losses in advance (FEC)
We could ask for a retransmission after the fact, but what about making sure there’s no need? This is what FEC (Forward Error Correction) is all about.
Think of it this way – if we had one shot at what we want to send and it was super important – would it make sense to send 100 copies of it, knowing that the chances that one of these copies would reach its destination is high?
FEC is about sending more packets that can be used to reconstruct or replace lost packets.
There are different FEC schemes that can be used, with the main 3 of them being:
WebRTC supports duplication and XOR out of the box.
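The XOR idea is easy to show in isolation: send one extra parity packet per group, and any single lost packet in that group can be rebuilt from the rest. The sketch below assumes equal-length payloads; the function names are mine, and a real scheme also has to deal with headers and packets of different lengths, which this doesn't.

```python
from typing import List


def xor_parity(packets: List[bytes]) -> bytes:
    """XOR all packets together, byte by byte, into one parity packet."""
    parity = bytearray(max(len(p) for p in packets))
    for packet in packets:
        for i, value in enumerate(packet):
            parity[i] ^= value
    return bytes(parity)


def recover_missing(survivors: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing packet of a group from the others plus parity."""
    # XOR-ing the parity with every surviving packet cancels them out,
    # leaving only the bytes of the packet that was lost.
    return xor_parity(list(survivors) + [parity])


group = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(group)
rebuilt = recover_missing([group[0], group[2]], parity)
# rebuilt == b"efgh"
```

One parity packet per three media packets is already a 33% bitrate overhead, and it only covers a single loss per group.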
The biggest hurdle of FEC is its use of bitrate – it is quite network hungry in that regard.
Audio FEC
Audio FEC comes in two different manners:
In-band FEC is implemented as part of the Opus codec library. It is ok’ish at best – nothing to write home about.
Then there’s RED – Redundancy Encoding – where each audio packet holds more than a single audio frame. And the ones it holds are just slightly older frames, so that if a packet is lost, we get it in another packet.
RED is implemented in libwebrtc. Support is limited to 1 level of redundancy for RED (meaning recovering up to one sequential lost packet). You can use WebRTC’s Insertable Streams mechanism to generate RED packets at higher redundancy or dynamic redundancy in the browser though.
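To see why one level of redundancy recovers a single lost packet but not two in a row, here is a toy model of RED. Packets are plain dictionaries instead of real RTP payloads, and the build_red_packets() and recover_frames() names are hypothetical.

```python
from typing import Dict, List, Optional


def build_red_packets(frames: List[str]) -> List[Dict]:
    """Each packet carries its own frame plus a copy of the previous one."""
    packets = []
    previous: Optional[str] = None
    for seq, frame in enumerate(frames):
        packets.append({"seq": seq, "primary": frame, "redundant": previous})
        previous = frame
    return packets


def recover_frames(received: List[Dict], total: int) -> List[Optional[str]]:
    """Fill in frames from primary payloads first, then from redundant copies."""
    frames: List[Optional[str]] = [None] * total
    for packet in received:
        frames[packet["seq"]] = packet["primary"]
    for packet in received:
        prev = packet["seq"] - 1
        if prev >= 0 and frames[prev] is None and packet["redundant"] is not None:
            frames[prev] = packet["redundant"]
    return frames


packets = build_red_packets(["f0", "f1", "f2", "f3", "f4"])
single_loss = recover_frames([p for p in packets if p["seq"] != 1], total=5)
double_loss = recover_frames([p for p in packets if p["seq"] not in (2, 3)], total=5)
# single_loss == ["f0", "f1", "f2", "f3", "f4"]
# double_loss == ["f0", "f1", None, "f3", "f4"]
```

With two packets lost back to back, the later frame comes back from the next packet but the earlier one is gone, which is the case higher redundancy levels are meant to cover.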
In the above, Philipp Hancke explains RED (along with other resiliency features for audio in WebRTC).
Video FEC
FEC for video is considered wasteful. If we need to increase bitrate by 20% or more to introduce robustness using FEC, then it comes at a cost of video quality that we could increase by using higher video bitrate.
For the most part, WebRTC ignores FEC for video, which is a shame. When using temporal scalability or SVC, the same way that we can decide to retransmit only important packets, we can also decide to add FEC protection only to the more important frames.
Wrapping it all up
Dealing with packet loss in WebRTC isn’t a simple task. It gets more complex over time, as more techniques and optimizations are bolted on to the implementation. What I want to do here is to list the various tools at our disposal to deal with packet losses. When and how we decide to use them would determine the resulting robustness and media quality of the implementation.
Here’s a quick table to sum things up a bit:
| | PLC | RTX | FEC |
|---|---|---|---|
| Focus | What to playback to the user | When to ask for missing packets | When to send duplicated packets |
| Advantages | None. You must have this logic implemented | Low network footprint | Low latency overhead |
| Challenges | Audio may sound robotic; Video will freeze | Increases latency. Might not be usable due to it | High network footprint. Can be quite wasteful |
| Audio | Duplicate last frames or reduce volume; Use Gen AI to estimate what was lost | Not commonly used for audio in WebRTC | FlexFEC used by WebRTC; Can use RED if you want to |
| Video | Skip video frames; Ask for a fresh keyframe to reset the video stream | Can be optimized to retransmit packets of important frames only | Not commonly used for video in WebRTC |

Oh – and make sure you first put an effort to reduce the amount of packet losses before starting to deal with how to overcome packet losses that occur…
Learn more about WebRTC (and everything about it)
Packet loss is one of the topics you need to deal with when writing WebRTC applications. There are many aspects affecting media quality – packet loss is but one of them. This time, we looked into the tools available in WebRTC for dealing with packet losses.
To learn more about media processing and everything else related to WebRTC, check out these services:
And if what you want is to test, monitor, optimize and improve the performance of your WebRTC application, then I’d suggest checking out testRTC.
The post Fixing packet loss in WebRTC appeared first on BlogGeek.me.
Getting HEVC and WebRTC to work together is tricky and time consuming. Let’s see what the advantages are and if this is worth your time or not.
Does HEVC & WebRTC make a perfect match, or a match at all???
WebRTC is open source, open standard, royalty free, …
HEVC is royalty bearing, made by committee, expensive
And yet… we do see areas where WebRTC and HEVC mix rather well. Here’s what I want to cover this time:
Digging here in my blog, you can find articles discussing the WebRTC codec wars dating as early as 2012.
Prior to WebRTC, most useful audio and video codecs were royalty bearing. Companies issued patents related to media compression and then got the techniques covered by their patents integrated into codec standards, usually, under the umbrella of a standardization organization.
The logic was simple: companies and research institutes need to make a profit out of their effort, otherwise, there would be no high quality codecs. That was before the internet as we know it…
Once websites such as YouTube appeared, and UGC (User Generated Content) became a thing, this started to shift:
The new business models broke in one way or another the notion of royalty bearing codecs. Or at least tried to break. There were solutions of sorts – smartphones had hardware encoders prepaid for, decoder licenses required no payments, etc.
But that didn’t fit something symmetric like WebRTC.
When WebRTC was introduced, the codec wars began – which codecs should be supported in WebRTC?
The early days leaned towards royalty free codecs – VP8 for video and Opus for voice. At some point, we ended up with H.264 as well…
How H.264 wiggled its way into WebRTC
H.264 is royalty bearing, but it still found its way into WebRTC. That was due in large part to Cisco – they decided to contribute their encoder implementation of H.264 and pay the royalties on it (they likely already paid up to the cap needed anyways). That opened the way for a weird technical solution to be concocted to make room for H.264 and allow it in WebRTC:
Why? Because lawyers. Or something.
It worked for browsers. But not on mobile, where the solution was to use the hardware encoder on the device, that doesn’t always exist and doesn’t always work as advertised. And it left a gaping headache for native developers that wanted to use H.264. But who cared? Those who wanted to make a decision for WebRTC and move on – got it.
That made certain that at some point in the future, the H.264 royalty bearing crowd would come back asking for more. They’d be asking for HEVC.
HEVC, patents and big 💰
HEVC is a patent minefield, or at least was – I admit I haven’t been following up on this too closely for a few years now.
Here are two slides I have in my architecture course:
There are a gazillion patents related to HEVC (not that many, but 5 figures). They are owned by a lot of companies and get aggregated by multiple patent pools. Some of them are said to be trickling into VP9 and AV1, though for the time being, most of the market and vendors ignore that.
These patents make including HEVC in applications a pain – you need to figure out where to get the implementation of HEVC and who pays for its patents. With regard to WebRTC:
Oh, and there’s no “easy” cap to reach as there is/were with H.264 when it was included in WebRTC and paid for by Cisco.
HEVC is expensive, with a lot of vendors waiting to be paid for their efforts.
HEVC hardware
Software codecs and royalty payments are tricky. Why? Because it opens up the can of worms above, about who is paying. Hardware codecs are different in nature – the one paying for them is either the hardware acceleration vendor or the device manufacturer.
This means that hardware acceleration of codecs has two huge benefits – not only one:
This is likely why Apple decided to go all in with HEVC from iPhone 8 and on – it gave them an edge that Android phones couldn’t easily solve:
This gap for Android devices was a nice barrier for many years that kept Apple devices ahead. Apple could “easily” pay the HEVC royalties while Android vendors try to figure out how to get this done.
Today?
We have Intel and Apple hardware supporting HEVC. Other chipset vendors as well. Some Android devices. Not all of them. And many just do decoding but not encoding.
For the most part, the HEVC hardware support on devices is a swiss cheese with more holes than cheese in it. Which is why many focus on HEVC support in Apple devices only today (if at all).
Advantages of HEVC in WebRTC
When it comes to video codecs, there are different generations of codecs. In the context of WebRTC, this is what it looks like:
There are two axes to look at in the illustration above
If we move from the VP8 and H.264 to the next generation of VP9 and HEVC, we’re improving on the media quality for the same bitrate. The challenge though is the complexity and performance associated with it.
To deal with the increased compute, a common solution is to use hardware acceleration. This doesn’t exist that much for VP9 but is more prevalent in HEVC. That’s especially true since ALL Apple devices have HEVC support in them – at least when using WebRTC in Safari.
The other reason for using HEVC is media processing outside of WebRTC. Streaming and broadcasting services have traditionally been using royalty bearing video codecs. They are slowly moving now from H.264 to HEVC. This shift means that a lot of media sources are going to offer either H.264 or HEVC as their video codec – VP8 or VP9 will be a lot less common. This being the case, vendors would rather use HEVC than go for VP9 and deal with transcoding – their other alternative is to stick with H.264.
So, why use HEVC?
HEVC requires royalty payments in a minefield of organizations and companies.
Apple already committed itself fully to HEVC, but Google and the rest of the WebRTC industry haven’t.
Google will be supporting HEVC in Chrome for WebRTC only as a decoder and only if there’s hardware accelerator available – no software implementation. Google’s “official” stance on the matter can be found in the Chrome issues tracker.
So if you are going to support HEVC, this is where you’ll find it:
Then there is AV1. A video codec years in the making. Royalty free. With a new non-profit industry consortium behind it, with all the who’s who:
The specification is ready. The software implementation already exists inside libwebrtc. Hardware acceleration is on its way. And compression results are better than HEVC. What’s not to like here?
This makes the challenge extra hard these days –
Should you invest and adopt HEVC, or start investing and adopting AV1 instead?
Adopt VP9? Wait for AV1?
Where can you fit HEVC and WebRTC?
Let’s see where there is room today to use HEVC. From here, you can figure out if it is worth the effort for your use case.
The Apple opportunity of WebRTC and HEVC
Why invest now in HEVC? Probably because HEVC is available on Apple devices. Mainly the iPhone. Likely for very specific and narrow use cases.
For a use case that needs to work there, there might be some reasoning behind using HEVC. It would work best there today with the hardware acceleration that Apple pampered us with for HEVC. It will be really hard or even impossible to achieve similar video quality in any other way on an iPhone today.
Doing this brings with it differentiation and uniqueness to your solution.
Deciding if this is worth it is a totally different story.
Intel (and other) HEVC hardware
Intel has worked on adding HEVC hardware acceleration to its chipsets. And while at it, they are pushing towards having HEVC implemented in WebRTC on Chrome itself. The reason behind this is a big unknown, or at least something that isn’t explained that much.
If I had to take a stab at it here, it would be the desire of Intel to work closely with Apple. Not sure why, it isn’t as if Intel chipsets are interesting for Apple anymore – they have been using their own chips for their devices for a few years now.
This might be due to some grandiose strategy, or just because a fiefdom (or a business unit or a team) within Intel needs to find things to do, and HEVC is both interesting and can be said to be important. And it is important, but is it important for WebRTC on Intel chipsets? That’s an open question.
Should you invest in HEVC for WebRTC?
No. Yes. Maybe. It depends.
When I told Philipp Hancke I am going to write about this topic, he said be sure to write that “it is a bit late to invest in HEVC in 2024”.
I think this is more nuanced than this.
It starts with the question how much energy and resources do you have and can you spend them on both HEVC and AV1. If you can’t then you need to choose only one of them or none of them.
Investing in HEVC means figuring out how the end result will differentiate your service enough or give it an advantage with certain types of users that would make your service irresistible (or usable).
For the most part, a lot of the WebRTC applications are going to ignore and skip HEVC support. This means there might be an opportunity to shine here by supporting it. Or it might be wasted effort. Depending how you look at these things.
Learn more about WebRTC (and everything about it)
Which codecs are available, which ones to use, how is that going to affect other parts of your application, how should you architect your solutions, can you keep up with the changes coming to WebRTC?
These and many other questions are being asked on a daily basis around the world by people who deal with WebRTC. I get these questions in many of my own meetings with people.
If you need assistance with answering them, then you may want to check out these services that I offer:
The post WebRTC & HEVC – how can you get these two to work together appeared first on BlogGeek.me.
GStreamer is one of the oldest and most established libraries for handling media. As a core media handling element in Linux and WebKit that was launched near the turn of the century, it is not surprising that many early WebRTC projects use various pieces of it. Today, GStreamer has expanded options for helping developers plumb […]
The post WebRTC Plumbing with GStreamer appeared first on webrtcHacks.
From time to time, WebRTC is going to discard media packets. Monitoring such behavior and understanding the reasons is important to optimize media quality.
WebRTC does things in real time. That means that if something takes its sweet time to occur, it will be too late to process it. This boils down to the fact that from time to time, WebRTC will discard media packets, which isn’t a good thing. Why is that going to happen? There are quite a few reasons for it, which is what this article is all about.
I just started a new initiative with Philipp Hancke. We’re publishing an answer to a WebRTC related question once a week (give or take), trying to keep it all below the 2 minutes mark.
We are going to cover topics ranging from media processing, through signaling to NAT traversal. Dealing with client side or server side issues. Or anything else that comes to mind.
👉 Want to be the first to know? Subscribe to the YouTube channel
👉 Got a question you need answered? Let us know
Discarded media packets in WebRTC
Media packets and frames can be, and are, discarded by WebRTC in real life calls. There are even getstats metrics that allow you to track these:
The screenshot above was taken from the RTCInboundRtpStreamStats dictionary of getstats. I marked most of the important metrics we’re interested in for discarding media data.
packetsDiscarded – this field indicates the number of packets that the jitter buffer decided to discard and ignore because they arrived too early or too late. It relates to audio packets.
framesXXX fields are dealing with video only and look at full frames which can span multiple packets. They get discarded because of a multitude of reasons which we will be dealing with later in this article. For the time being – just know where to find this.
The diagram below is a screenshot taken in testRTC of a real session of a client. Here you can see a spike of 200 packetsDiscarded less than a minute into the call. We’ve recently added in testRTC insights that hunt for such cases (as well as for video frame drops), alerting about these scenarios so that the user doesn’t have to drill down and search for them too much – they now appear front and center to the user.
WebRTC = Real-Time. Timing is everything
WebRTC stands for Web Real Time Communication. The Real Time part of it is critical. It means that things need to happen in… real time… and if they don’t, then the opportunity has already passed. This leads to the eventuality that at times, media packets will need to be discarded simply because they aren’t useful anymore – the opportunity to use them has already passed.
For all that logic to happen, WebRTC uses a protocol called RTP. This protocol is in charge of sending and receiving real time media packets over the network. For that to occur, each RTP packet has two critical fields in its header:
The illustration above is taken from our course Low level WebRTC protocols. In it, you can see these two fields:
The sequence number is just a running counter which can easily be used to order the packets on the receiving end based on the value of the counter. This takes care of any reordering, duplication and packet losses that can occur over modern networks.
The timestamp is used to understand when the media packet was originally generated. It is used when we need to playback this packet. Multiple packets can have the same timestamp for example, when the frame we want to send gets split across packets – something that occurs frequently with video frames.
These two, sequence number and timestamp, are used to deal with the various characteristics of the network. Usually, we deal with the following problems (I am not going to explain them here): jitter, latency, packet loss and reordering.
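As a rough illustration of how these two header fields get used, the sketch below orders packets by sequence number and then groups them into frames by timestamp. The sequence number is a 16-bit counter, so it wraps around from 65535 back to 0, and the comparison has to take that into account. The `RtpHeaderLike` type and the function names are made up for this example, and the ordering assumes the packets at hand span far fewer than 32768 sequence numbers.

```typescript
// Hypothetical minimal view of the two RTP header fields discussed above.
interface RtpHeaderLike {
  sequenceNumber: number; // 16-bit running counter, wraps at 65536
  timestamp: number;      // media clock value, shared by all packets of a frame
}

// True when sequence number a is newer than b, allowing for 16-bit wraparound.
function isNewerSeq(a: number, b: number): boolean {
  const diff = (a - b) & 0xffff;
  return diff !== 0 && diff < 0x8000;
}

// Removes duplicates and restores sending order.
function orderPackets<T extends RtpHeaderLike>(packets: T[]): T[] {
  const seen = new Set<number>();
  const unique = packets.filter((p) => {
    if (seen.has(p.sequenceNumber)) return false;
    seen.add(p.sequenceNumber);
    return true;
  });
  return unique.sort((x, y) =>
    isNewerSeq(x.sequenceNumber, y.sequenceNumber) ? 1 : -1
  );
}

// Groups consecutive packets that share a timestamp into one frame.
function groupIntoFrames<T extends RtpHeaderLike>(ordered: T[]): T[][] {
  const frames: T[][] = [];
  for (const p of ordered) {
    const last = frames[frames.length - 1];
    if (last !== undefined && last[0]?.timestamp === p.timestamp) {
      last.push(p);
    } else {
      frames.push([p]);
    }
  }
  return frames;
}
```

A gap in the ordered sequence numbers is how the receiver notices packet loss; the grouping step is what lets it reason about full video frames instead of single packets.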
All of this goodness, and more is handled in WebRTC by what is called a jitter buffer. Here’s a short explainer of how a jitter buffer works:
WebRTC discarding incoming audio packets
The above video is our first WebRTC Q&A video. We started off with this because it popped up in discuss-webrtc. The question has since been deleted for some reason, but it was a good one.
Latency
The main reason for discarded audio packets is receiving them too late.
When audio packets are received by WebRTC, it pushes them into its jitter buffer. There, these packets get sorted in their sending order by looking at the sequence number of these packets. When to play them out is then dependent on the timestamp indicated in the packet.
Assuming we already played a newer packet to the user, we will be discarding packets that have a lower (and older) sequence number since their time has already passed.
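Here is a toy model of that rule, a sketch only and nowhere near what the real jitter buffer in WebRTC does. The `ToyJitterBuffer` class is made up for this illustration and it ignores sequence number wraparound, timestamps and adaptive delay. All it shows is the "too late" decision: once a newer packet has been played, an older one arriving afterwards is counted as discarded.

```typescript
// Hypothetical toy jitter buffer: sorts by sequence number and discards latecomers.
class ToyJitterBuffer {
  packetsDiscarded = 0;
  private lastPlayedSeq: number | null = null;
  private pending: number[] = [];

  // Returns false when the packet arrives after a newer one was already played.
  insert(seq: number): boolean {
    if (this.lastPlayedSeq !== null && seq <= this.lastPlayedSeq) {
      this.packetsDiscarded += 1;
      return false;
    }
    this.pending.push(seq);
    return true;
  }

  // Plays out the oldest pending packet, or returns null if nothing is waiting.
  playNext(): number | null {
    if (this.pending.length === 0) return null;
    this.pending.sort((a, b) => a - b);
    const next = this.pending.shift() as number;
    this.lastPlayedSeq = next;
    return next;
  }
}
```

If packets 1, 2 and 4 arrive and get played, and packet 3 only shows up afterwards, the buffer rejects it and packetsDiscarded goes up by one.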
Lipsync
Audio and video packets get played out together. This is due to a lip synchronization mechanism that WebRTC has, where it tries to match timestamps of audio and video streams to make sure there’s lip synchronization.
Here, if the video advanced too much, then you may need to drop some audio packets instead of playing them out in sync with the video (simply because you can’t sync the two anymore).
Bugs
Here’s another reason why audio packets might end up being discarded by the receiver – bugs in the sender’s implementation…
When the sender doesn’t use the correct timestamp in the packets, or does other “bad” things with the header fields of the RTP packets, you can get to a point when packets get discarded.
👉 Our focus here was on the timestamp because for some arcane reasons, figuring out the timestamp values and their progression in audio (and video) is never a simple task. Audio and video use different frequency clocks when calculating timestamps, done with values that make little sense to those who aren’t dealing with the innards and logic of audio and video encoders. This may easily lead to miscalculations and bugs in timestamp setting
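To give a feel for the numbers involved, here is a small worked example of how far the RTP timestamp advances from one packet or frame to the next. It assumes the usual clock rates: 48 kHz for Opus audio and 90 kHz for video. The `rtpTimestampStep` helper is made up for this illustration.

```typescript
// How many RTP timestamp ticks elapse for a given media duration.
// clockRateHz is the RTP clock of the payload, not the wall clock.
function rtpTimestampStep(clockRateHz: number, durationMs: number): number {
  return Math.round((clockRateHz * durationMs) / 1000);
}

// Opus audio: 48 kHz RTP clock, 20 ms of audio per packet.
const opusStep = rtpTimestampStep(48000, 20); // 960 ticks per packet

// Video: 90 kHz RTP clock, 30 frames per second.
const videoStep = rtpTimestampStep(90000, 1000 / 30); // 3000 ticks per frame
```

A sender that advances the audio timestamp by 20 (milliseconds) instead of 960 (ticks) produces packets that look to the receiver as if they were all generated at nearly the same moment, which is exactly the kind of bug that leads to discards.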
WebRTC discarding outgoing audio packets
This doesn’t really happen. Or at least WebRTC ignores this option altogether.
How do we know that? Besides looking at the code, we can look at the fields that we have in getstats for this. While we have discarded frames for incoming and outgoing video and discarded incoming audio packets, we don’t have anything of this kind for outgoing audio packets.
These packets are too small and “insignificant” to cause any dropping of them on the sender side. That’s at least the logic…
WebRTC discarding incoming video frames
Before we go into the reasons, let’s understand how video packets are handled in the media processing pipeline of WebRTC. This is partial at best, and specifically focused on what I am trying to convey here:
The above diagram shows the process that video packets go through once they are received, along with the metrics that get updated due to this processing:
👉 The exact places where these metrics might be updated are a wee bit more nuanced. Consider the above just me flailing my hands in the air as an explanation.
This also hints that with video, there are multiple places where things can get dropped and discarded along the pipeline.
The above is another screenshot from testRTC. This time, indicating framesDropped. You can see how throughout the session, quite a few frames got dropped by WebRTC.
Let’s find the potential reasons for such dropped frames.
Latency, lip sync & bugs
Just like incoming audio packets, we can get dropped packets and video frames for much the same reasons.
Latency and lip synchronization may cause the jitter buffer to discard video packets.
And bugs on the sender side can easily cause WebRTC to drop incoming packets here as well.
That said, with video, we have to look at a slightly bigger picture – that of a frame instead of that of a singular packet.
Not all packets of a frame are available
Assume you have a packet dropped. And that packet is part of a frame that is sent over a series of 7 packets. We had 1 packet drop that caused a frame drop, which in turn, caused another 6 packets to be useless to us since we can’t really decode them without the missing packet (we can to some extent, but we usually don’t these days).
Dependency on older frames
With video, unless we’re decoding a keyframe, the frame we need to decode requires a previous frame to be decoded. There are dependencies here since for the most part, we only encode and compress the differences across frames and not the full frame (that would be a keyframe).
What happens then if a frame we need for decoding a fresh frame we just received isn’t available? Here, all packets were received for this new frame, but the frame (and all its packets) will still get dropped. This will be reported in framesDropped.
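The two rules above (a frame needs all of its packets, and a non-keyframe needs the frame it depends on) can be sketched as a small decision function. The `ReceivedFrame` shape and the `classifyFrames` helper are made up for this illustration, and each frame is assumed to depend on a single earlier frame at most, which is a simplification of real dependency structures.

```typescript
// Hypothetical description of a received video frame.
interface ReceivedFrame {
  id: number;
  isKeyframe: boolean;
  dependsOn: number | null; // id of the frame this one was encoded against
  packetsExpected: number;
  packetsReceived: number;
}

// Walks frames in order and decides which ones can be decoded.
function classifyFrames(frames: ReceivedFrame[]): { decoded: number[]; dropped: number[] } {
  const decoded: number[] = [];
  const dropped: number[] = [];
  const decodedIds = new Set<number>();
  for (const f of frames) {
    const complete = f.packetsReceived >= f.packetsExpected;
    const referenceOk =
      f.isKeyframe || (f.dependsOn !== null && decodedIds.has(f.dependsOn));
    if (complete && referenceOk) {
      decodedIds.add(f.id);
      decoded.push(f.id);
    } else {
      dropped.push(f.id); // incomplete, or its reference frame never got decoded
    }
  }
  return { decoded, dropped };
}
```

With a 7-packet frame that is missing one packet, that frame is dropped, and so is every later frame that depends on it, even when all of their own packets arrived. Things recover only at the next keyframe.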
Not enough CPU
We might not have enough CPU available to decode video. Video is CPU intensive, and if WebRTC understands that it won’t have time to decode the frame, it will simply drop it before decoding it.
It might also decode the frame but then, due to CPU issues, miss the time for playout, causing framesRendered not to increment.
WebRTC discarding outgoing video frames
With outgoing media, there is a different dictionary we need to look at in getstats – RTCOutboundRtpStreamStats:
Here, the relevant fields are framesSent and framesEncoded. We should strive to have these two equal to each other.
We know that WebRTC decided to discard frames here if framesEncoded is higher than framesSent. If this happens, then it is bad on a few levels:
On the RTCIceCandidatePairStats dictionary, there’s also the packetsDiscardedOnSend metric, which hints at when and why we would lose and discard packets and frames on the sender side:
Total number of packets for this candidate pair that have been discarded due to socket errors, i.e. a socket error occurred when handing the packets to the socket. This might happen due to various reasons, including full buffer or no available memory.
If you’re dropping video frames on the sender side (framesEncoded > framesSent), then in all likelihood the network buffer on the device is full, causing a send failure. Here you should check the resources available on the device – especially memory and CPU – or just understand the network traffic you are dealing with.
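A minimal sketch of that check, comparing two snapshots of the outbound video stats taken some time apart. The `OutboundVideoSnapshot` shape and the helper name are made up for this illustration; only framesEncoded and framesSent are fields from the stats. Frames that were encoded but not yet handed to the network at the moment of the snapshot will show up as a small gap too, so look for a gap that keeps growing.

```typescript
// Hypothetical snapshot of the two outbound-rtp counters discussed above.
interface OutboundVideoSnapshot {
  framesEncoded: number;
  framesSent: number;
}

// Frames encoded but not sent during the interval between two snapshots.
function framesDroppedBeforeSend(
  prev: OutboundVideoSnapshot,
  curr: OutboundVideoSnapshot
): number {
  const encoded = curr.framesEncoded - prev.framesEncoded;
  const sent = curr.framesSent - prev.framesSent;
  return Math.max(0, encoded - sent);
}
```

When this number climbs, packetsDiscardedOnSend on the active candidate pair is the next place to look.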
Maintaining media quality in WebRTC
Media quality in WebRTC is a lot more than just dealing with bitrates or deciding what to do about packet losses. There are many aspects affecting media quality and they all do it dynamically throughout the session and in parallel to each other.
This time, we looked into why WebRTC discards media packets during calls. We’ve seen that there are many reasons for it.
To learn more about media processing and everything else related to WebRTC, check out these services:
The post Reasons for WebRTC to discard media packets appeared first on BlogGeek.me.
What exactly is simulcast, how is it used in WebRTC and why is it a critical component in any SFU media server.
WebRTC simulcast is one of these things that is commonly used by WebRTC applications that have SFU media servers. If your media server doesn’t use simulcast – make sure to ask why and to understand the answer. And if it does, then you should know what it means exactly. Which is why we’re here now.
In this article, I want to explain what WebRTC simulcast is, when and how it is used AND some new advancements coming to simulcast.
Before we begin, we need to understand the concept of bitrate. In a WebRTC video session, the first thing to look at and understand is the bitrate used. Video encoding requires sending a lot of data over the network, and WebRTC tries to match the bitrate it sends to the available bandwidth of the network.
See how I switched between talking about sending data to bitrate to bandwidth? For me, sending data is what we are trying to do. Bitrate is the actual (or target) amount of data we’re aiming for, and bandwidth is what is available for us on the network (assume that bandwidth should always be the same or preferably even higher than the bitrate).
When it comes to audio, we’re mostly working with bitrates that are static and known in advance. They are also low compared to video bitrates, so we just don’t care as much. Which leaves us with video streams.
For video streams:
This means that what we want to do is use as little bitrate as possible to get the highest possible quality. We’re trying to reach for the stars first by deciding our desired bitrate, and then we start lowering due to the constraints of the real world. Here are a few reasons for this:
👉 If you want to learn more about this topic, then read this article on WebRTC video quality
SFU media servers and group video sessions
For video group sessions in WebRTC, we use SFU media servers. Not always, but most of the time. Why? Because SFUs route media – this ends up costing us less compared to MCUs and in many ways makes things more flexible for us on the viewer’s end.
The challenge though is that SFUs harbor a wee bit more complex logic and smarts than the alternatives and they also delegate a lot of the work to the clients themselves. A good SFU is one that has tight integration and optimization methods with the clients using it. And remember here that the implementation of the browser (Chrome) is optimized for Google Meet’s needs.
Simulcast was “invented” for SFUs. Let’s take a quick example to show what we mean here.
We have 4 people on a call. All connected to an SFU. Each participant is sending his video to the SFU, and the SFU routes that video to the other 3 participants in the call:
If everyone has a decent network, then we’re all happy. But what if D has poor network conditions on his downlink? Here are some assumptions for our scenario:
If we want everyone to be displayed at the same quality on D’s screen, we need to give each one of them ~330Kbps. That’s instead of 2Mbps. So… do we just reduce the sending bitrate of everyone down to 330Kbps to accommodate for user D? Or do we drop him out of the call altogether?
Notice how we can still send 2Mbps from D to the rest of the participants? That’s just the nature and dynamics of the network and capabilities we have in this example.
Here’s where simulcast comes in…
We’re going to engineer the solution so that each participant is going to create 3 separate bitstreams of their video data: 1150kbps, 600kbps and 250kbps, totalling 2Mbps. The exact numbers are less important than the concept itself, so please go with the flow here.
* Being lazy, I’ve denoted simulcast lines as dotted lines, indicating Simulcast instead of using a better notation like 1150/600/250.
Now that we do that, A, B and C get 1150Kbps video from everyone else and D receives the lower 250Kbps bitstreams (it can’t handle 1150kbps or 600kbps even for only one of the users without dropping one of the other video streams it is receiving altogether). Now each one is getting the most he can handle (or at the very least, closer to that than just lowering everyone down).
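The selection the SFU makes here can be sketched as a tiny function. This is only an illustration: the `pickLayerKbps` name is made up, the downlink is split equally between the incoming videos, and D’s downlink is assumed to be about 1Mbps (which is what three streams at ~330Kbps each add up to).

```typescript
// Picks the highest simulcast layer that fits an equal share of the downlink.
// Returns null when even the lowest layer does not fit.
function pickLayerKbps(
  downlinkKbps: number,
  senders: number,
  layersKbps: number[]
): number | null {
  if (senders <= 0) return null;
  const share = downlinkKbps / senders; // equal split, a simplifying assumption
  const fitting = layersKbps.filter((layer) => layer <= share);
  return fitting.length > 0 ? Math.max(...fitting) : null;
}

const layers = [1150, 600, 250];

// D: assumed ~1Mbps downlink, receiving 3 videos -> ~333Kbps each -> 250Kbps layer.
const layerForD = pickLayerKbps(1000, 3, layers);

// A: assumed 4Mbps downlink, receiving 3 videos -> ~1333Kbps each -> 1150Kbps layer.
const layerForA = pickLayerKbps(4000, 3, layers);
```

Note that for D, upgrading even one stream to 600Kbps would need 600 + 250 + 250 = 1100Kbps, which is more than the assumed 1Mbps. A real SFU would not split equally; it would weigh who is speaking, how large each video is shown, and so on.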
Media quality: LCD or BAB
I am going to use names that don’t necessarily exist. I am making them up here to explain the nature of simulcast a bit better.
What we’ve seen in the example above is how we move from LCD (Least Common Denominator) to BAB (Best Available Bandwidth).
We started with a naive implementation where the same video bitrate is being sent to everyone. So if there’s a hiccup somewhere along the session, everyone is going to be affected. When D had network issues, everyone had to lower their bitrate from 2Mbps down to 330Kbps… that’s quite a hit to media quality across the board for them all.
That’s our LCD – we’re going to need to accommodate the bitrate to the lowest common denominator of the available bandwidth we have across our meeting participants. And that sucks. Bigtime…
But then we went for BAB – we’re going to try and work with the best available bandwidth that each user is capable of receiving.
How did we do that? By asking the participants (nicely) to generate more than a single bitstream. Each bitstream has a different bitrate here, which gives the SFU the flexibility it needs to decide which bitrate to send to which user.
We use simulcast (or SVC, though not in this article) because there’s no equality in digital communications. Participants have different devices, they connect with different networks and they even see and focus on different things during the same meeting. Simulcast enables us to give different participants a different view of the meeting with varying degrees of quality based on the capabilities of each participant at any given moment AND based on each participant’s preference/desire.
How much flexibility and how high media quality we can reach is determined by the tools and optimizations we end up employing in our implementation. No two implementations of SFU with simulcast are exactly alike.
Client side = Simulcast; Server side = Adaptive bitrate
Simulcast as a concept and solution is about a client generating multiple streams so that a media server can use whichever of the streams it needs to send to other participants.
Video streaming had a similar(?) solution known as ABR – Adaptive Bitrate.
Here, the client sends a single media stream to the server and the server is the one that generates any number of streams in different bitrates as it sees fit. This makes sense when there are many viewers (thousands or more) and it can be useful to invest in server resources (these cost money to the vendor providing the service) for the given scenario.
Some use ABR as a term to simply say that the bitrate is variable in nature and adapts to the network. I use it to refer to server side adaptation, where there are multiple video streams generated (in advance or in realtime) and the server simply chooses the best to use per viewer.
For large scale live streaming broadcasts, you can start seeing solutions that incorporate ABR as a technology to transcode the stream to broadcast on the server and generate multiple bitrates with it. This can and is done sometimes in parallel to using simulcast from the client as well.
The way for me to compartmentalize and remember this? Simulcast is multiple bitrates generated by the client. ABR is multiple bitrates generated by the server.
👉 You can learn more about ABR vs simulcast or just about simulcast
Advantages and weaknesses of using simulcast in WebRTC
Simulcast is great, but it isn’t a catchall solution.
What simulcast does as a concept is to offload some of the work from the media server. Offloading here means that for the client it comes at an increase in CPU use and outgoing bandwidth required.
WebRTC simulcast advantages
Here are some great things that simulcast brings with it:
It isn’t all good though. There are weaknesses to the use of WebRTC simulcast:
There are usually two to three layers/streams when it comes to WebRTC simulcast. Each with a different bitrate, and from there, also with different resolutions, frame rates and quality. I am focusing on bitrate because for me, that’s the leading factor – everything else gets derived from it.
Which bitrates are we going to support and which ones get sent to whom are the most important questions for any SFU implementation that uses simulcast.
WebRTC by itself can’t make such decisions. It has its own default bitrates for simulcast, but this is only what they are – defaults. I wouldn’t recommend developers to use these without understanding their implications (they’re likely not useful for the use case you have at hand).
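To show where such bitrates get set on the client, here is a minimal sketch that builds the encodings list for a 3 layer simulcast. The `EncodingLike` type and the `buildSimulcastEncodings` helper are made up for this illustration; rid, maxBitrate and scaleResolutionDownBy are the real encoding parameters in the WebRTC API. The rid names, the low-to-high ordering and the 4/2/1 downscale factors are just a common convention assumed here, not something your use case must follow.

```typescript
// Hypothetical local type mirroring the encoding parameters used below.
interface EncodingLike {
  rid: string;                   // identifies the simulcast layer
  maxBitrate: number;            // bits per second
  scaleResolutionDownBy: number; // 1 = full resolution, 2 = half width and height
}

// Builds low/medium/high encodings from bitrates given in kbps.
function buildSimulcastEncodings(
  lowKbps: number,
  midKbps: number,
  highKbps: number
): EncodingLike[] {
  return [
    { rid: "q", maxBitrate: lowKbps * 1000, scaleResolutionDownBy: 4 },
    { rid: "h", maxBitrate: midKbps * 1000, scaleResolutionDownBy: 2 },
    { rid: "f", maxBitrate: highKbps * 1000, scaleResolutionDownBy: 1 },
  ];
}

// The numbers from the earlier example: 250, 600 and 1150 kbps.
const exampleEncodings = buildSimulcastEncodings(250, 600, 1150);
```

In a browser, a list like this is what you would pass as sendEncodings when calling addTransceiver() with your video track.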
The decision which bitrates to support in simulcast to begin with should take into consideration the possible display layouts of the videos on the viewers’ end. By knowing at what resolutions the videos get displayed we can try to better estimate the desired bitrates to use while using simulcast. Factor into it things like number of videos in the layout (so that you take total bitrates and available bandwidth into consideration), importance of videos on the display (lower priority streams can manage with lower frame rates and resolutions), etc.
Here’s the thing though:
The end result is that the application in charge of it all needs to orchestrate the clients and the media servers in order to optimize the session for higher media quality, taking into consideration all the information. It also means that your application needs to somehow share this out-of-band information with the application session logic so decisions can be made. And this part is proprietary – it isn’t something that we have written as a standard or even a best practice.
Keyframes and switching costs in simulcast
With all this goodness, there’s an Achilles heel. One that stems from the way Google implemented simulcast in Chrome, but also from the realities of such a solution.
Here’s the thing: Whenever a viewer switches from one simulcast layer to another, there’s a change in the video stream that gets decoded. That change can only occur with a fresh keyframe on the layer that is being switched to, so that the video decoder will be able to decode the stream properly.
When there’s a need to generate a keyframe in simulcast, Chrome will automatically generate a keyframe across all simulcast layers. This isn’t a good thing, but it is what it is.
This also means that SFU media servers need to be conscious about this and not have viewers switch between the different layers all the time, limiting switches to the minimum necessary to maintain high video quality.
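One way an SFU might handle this is sketched below. It is a made-up illustration, not how any specific media server works: the `LayerSwitcher` class keeps forwarding the current layer until a keyframe shows up on the target layer, and it refuses to start a new switch until a minimum hold time has passed since the last one.

```typescript
// Hypothetical per-viewer switching logic for one incoming simulcast video.
class LayerSwitcher {
  private current: number;
  private target: number | null = null;
  private lastSwitchMs = -Infinity;
  private minHoldMs: number;

  constructor(initialLayer: number, minHoldMs: number) {
    this.current = initialLayer;
    this.minHoldMs = minHoldMs;
  }

  // Returns true when the caller should ask the sender for a keyframe.
  requestSwitch(layer: number, nowMs: number): boolean {
    if (layer === this.current) {
      this.target = null;
      return false;
    }
    if (nowMs - this.lastSwitchMs < this.minHoldMs) return false; // too soon
    this.target = layer;
    return true;
  }

  // Returns true when this frame should be forwarded to the viewer.
  onFrame(layer: number, isKeyframe: boolean, nowMs: number): boolean {
    if (this.target !== null && layer === this.target && isKeyframe) {
      this.current = this.target; // the switch happens exactly on a keyframe
      this.target = null;
      this.lastSwitchMs = nowMs;
      return true;
    }
    return layer === this.current;
  }

  get currentLayer(): number {
    return this.current;
  }
}
```

The hold time is the knob that trades responsiveness for fewer keyframes; since every keyframe request ends up costing bitrate on all layers, a value that is too low hurts everyone on the call.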
Temporal scalability improves WebRTC simulcast
Using temporal scalability alongside simulcast in WebRTC gives us another level of flexibility.
In temporal scalability, the frames of a video stream are encoded in such a way that their dependency chain enables us to decode some of the frames and not others – something that is usually impossible in video compression. Chrome’s WebRTC implementation supports temporal scalability in VP8 with 2 such “layers”, so if you’re sending 30 frames per second, the SFU media server can decide to send either 30 or 15 FPS to participants (the 15 frames per second stream is roughly 60% of the bitrate of the 30 frames per second one).
Think of it like multiplying your simulcast streams without an additional cost:
And yes, like everything else, this depends on the codec you use, the browser and the fact that some layers might not have enough frames per second to begin with (for example, the lower layer might only produce 10 or 15 frames per second and then temporal scalability might be useless).
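A small sketch of the idea, assuming the simplest possible pattern for 2 temporal layers, where frames alternate between the base layer (T0) and the enhancement layer (T1). Real encoders may use other patterns, and the function names here are made up for this illustration.

```typescript
// With 2 temporal layers in an alternating pattern, even frames are T0 and odd frames are T1.
// T0 frames depend only on earlier T0 frames, so T1 frames can be dropped safely.
function temporalId(frameIndex: number): 0 | 1 {
  return frameIndex % 2 === 0 ? 0 : 1;
}

// How many of totalFrames get forwarded when the SFU caps the temporal layer.
function framesForwarded(totalFrames: number, maxTemporalId: 0 | 1): number {
  let count = 0;
  for (let i = 0; i < totalFrames; i++) {
    if (temporalId(i) <= maxTemporalId) count += 1;
  }
  return count;
}

// Number of distinct quality options the SFU can choose from per sender.
function switchableOptions(simulcastLayers: number, temporalLayers: number): number {
  return simulcastLayers * temporalLayers;
}

const fullRate = framesForwarded(30, 1); // all 30 frames of one second
const halfRate = framesForwarded(30, 0); // only the 15 base layer frames
const options = switchableOptions(3, 2); // 3 simulcast layers x 2 temporal layers = 6
```

Dropping T1 frames needs no keyframe at all, which makes it a much cheaper switch than moving between simulcast layers.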
When using simulcast, the level and variety of tools you use will enable you to increase the media quality you offer your users.
Decisions of highest layer bitrate in WebRTC simulcast
Simulcast in WebRTC gives us another level of flexibility. One that Daily explains nicely in their post where they title their solution as adaptive bitrate.
Let’s assume we’re going for the classic 3 media streams in our WebRTC simulcast solution:
Remember our example from before? Our smallest bitrate (250kbps) and medium sized bitrate (600kbps) are “static” in nature. The video encoder in our browser is going to generate these in such a way each and every time (assuming the CPU allows and bandwidth estimation is higher than the summation of these two).
That highest bitrate there isn’t really static. At least not by default. It will use as much bitrate as it needs, taking into consideration the CPU consumption and bandwidth estimation. Left to its own devices, this highest bitrate layer is going to be greedy in its resource consumption. It can also get below the medium sized bitrate if there’s not enough CPU or bandwidth available, which beats the point of this being the highest layer. This all leads us to what we need to do…
Like everything else that WebRTC does in the browser though, it needs to be managed and taken into account by the SFU media server. In this case, deciding what that highest layer bitrate should be at any given point in time.
Here are some questions to ask yourself when making that decision in your SFU:
These questions don’t have a single simple answer. The answer to these will vary based on the strategy you wish to employ, the use case you have, the video layouts you support, the level of your engineers, the media server you start with, …
At the end of the day, your answers are just a set of heuristics, and being able to compare one to another is going to be a challenging task. Make sure you get this right (or right enough) for your needs.
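As an example of what one such heuristic could look like, here is a deliberately simple sketch. Everything in it is an assumption made for illustration: the function name, the idea of capping the top layer, and the rule of turning the top layer off when it would end up at or below the medium one.

```typescript
// Decides the bitrate budget for the highest simulcast layer, in kbps.
// Returns null when the top layer is not worth sending at all.
function topLayerBudgetKbps(
  estimateKbps: number,      // current uplink bandwidth estimate
  lowerLayersKbps: number[], // the "static" lower layers, e.g. [250, 600]
  capKbps: number            // the most we are willing to spend on the top layer
): number | null {
  const lowerTotal = lowerLayersKbps.reduce((sum, kbps) => sum + kbps, 0);
  const highestLower = Math.max(0, ...lowerLayersKbps);
  const remaining = estimateKbps - lowerTotal;
  // A top layer at or below the medium layer beats the point of having it.
  if (remaining <= highestLower) return null;
  return Math.min(remaining, capKbps);
}

const plenty = topLayerBudgetKbps(2000, [250, 600], 1150);     // capped at 1150
const squeezed = topLayerBudgetKbps(1600, [250, 600], 1150);   // 750 left for the top layer
const notWorthIt = topLayerBudgetKbps(1300, [250, 600], 1150); // 450 left, so skip the layer
```

On the client, a result like this would typically be applied by changing maxBitrate (or active) on the relevant encoding of the video sender. How often to recompute it, and who decides (client or SFU), is part of the strategy you need to pick.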
WebRTC and multi-codec simulcast
This is something that we’re just starting to see now.
Up until recently, as a developer, you chose a codec, used simulcast on it and that’s about it. The available alternatives were mostly VP8 and H.264. These days? With the introduction of the AV1 video codec, a new idea started cropping up:
So the above diagram was thought out in a way. Instead of using the same video codec in a simulcast session for WebRTC, why not use multiple codecs? Have AV1 used on the lowest bitrate and then another codec, say VP8 or VP9 on the higher bitrates?
This way, the machine’s CPU is capable of encoding the data, and the resulting media quality of the lowest bitrate in there is now higher than it would have been if we used a single codec for simulcast.
At the time of writing, this hasn’t been implemented in a workable fashion just yet.
In a way, this is our future for the coming years, until AV1 will become popular enough and its use made possible by commonplace hardware acceleration or better CPUs on the devices.
A word about SVC… and where to learn more
There are alternatives to using WebRTC simulcast:
SVC stands for Scalable Video Coding. At its heart, it is quite similar to simulcast, just done on the codec level. The video encoder itself generates a bitstream that can be peeled like an onion into multiple bitrates. This gives a solution that is less wasteful than simulcast in bitrate and CPU. The downside here is an increase in complexity and a lack of hardware encoders and decoders that know how to handle SVC.
There are video meeting solutions out there that use SVC. They can usually also use WebRTC simulcast – simply because SVC gets added later as an additional tool for further optimization and flexibility.
To learn more about simulcast, SVC and everything related to WebRTC, check out these services:
The post WebRTC simulcast – what is it and how is it used appeared first on BlogGeek.me.
Maximizing stream quality on an imperfect network in real-time is a delicate balancing act. If you send too much information then you will cause congestion and packet loss. If you send too little then your video quality (or audio) will look like garbage. But how much can you send? One of the techniques used to find […]
The post Probing WebRTC Bandwidth Probing – why and how in gcc appeared first on webrtcHacks.
Is it time to change the governance of WebRTC in order to keep it growing and flourishing?
WebRTC started life in 2011 or 2012. Depending when you start counting.
That’s around 13 years now. Time to put things on the table – we might need a change in governance. A different way of thinking about WebRTC.
I published the above on LinkedIn last month.
It was a culmination of thoughts I’ve been having for the past several years.
You can pinpoint the first time I made that distinction in 2020 while coining the term WebRTC unbundling.
The notion was that WebRTC is being broken down into smaller pieces and developers are given more leeway and control over what WebRTC does (=a good thing). The result of all this is the ability to differentiate further, but also that the baseline of what WebRTC is gets farther behind what good media quality means.
There’s the popular open source implementation for WebRTC known as libwebrtc. It is maintained and governed by Google. When Google can enact its strategy by implementing their technologies and IP outside and around libwebrtc instead of inside libwebrtc – why wouldn’t they?
Google runs a business. They have commercial objectives. Differentiating from competitors who use libwebrtc to outwin Google would be a poor decision to make. Giving competitors using proprietary technology the source code of libwebrtc to copy from and improve upon without contributing back isn’t a smart move either.
This means cutting edge technologies and research is now done mostly outside of libwebrtc (and WebRTC) as much as possible. And the unbundling of WebRTC that started some 4 years ago is now starting to show.
Before we dive into the details
Something I always explain to people new to WebRTC is that WebRTC isn’t a single thing. When someone refers to it, he either thinks of WebRTC as a standard or WebRTC as an open source project:
The above is one of the first slides I’ve ever created about WebRTC.
WebRTC is an open standard. It is being specified by the IETF and W3C. The IETF deals with the network side while the W3C is all about the browser interface (JavaScript APIs).
WebRTC is also viewed as an open source project. That’s actually libwebrtc… the most common and popular implementation of WebRTC which has been created and is maintained by Google.
So remember – when people say WebRTC they can refer to it as either a standard or a package or both at the same time.
What we will do in this article from here on, is jump between these two definitions and see where we are with them today. We will start with the libwebrtc open source library.
The power and importance of libwebrtc
Here’s what I shared in my RTC@Scale 2024 session:
In WebRTC, libwebrtc is the most important library. There are others, but this is by far the most important. Why?
The end result is that… well… It is the most important WebRTC library out there.
–
Before libwebrtc, what we had was lame open source libraries that implemented media engines. All good options were commercial ones. In fact, libwebrtc (and WebRTC) started with Google acquiring a company called GIPS who had a great implementation of a commercial media engine that they licensed to companies. I know because the company I worked at licensed it, and the moment they got acquired, we got a flood of requests and questions about finding an alternative.
What WebRTC did was make media engines a commodity of sorts. A new era where high quality media can be had from open source. This also meant that the commercial media engine market died at the same time.
This new development of pushing innovations and improvements in the media engine pipeline outside of libwebrtc is what is going to take that advantage from open source and libwebrtc away.
More on that, a bit later. But next, why don’t we look at the standardization of WebRTC?
WebRTC standardization efforts
The standardization of WebRTC was split between two different organizations: the W3C and the IETF. They were always semi-aligned.
The IETF was in charge of what goes on in the network. What a WebRTC session looks like on the wire. For WebRTC, it uses stuff that we all considered quite modern in 2012 – light years ago in tech-time. The IETF Working Group working on WebRTC, RTCWEB, concluded its work and closed down.
The W3C was/is in charge of the API layer in the browser. The JavaScript interface, mostly revolving around the RTCPeerConnection. And yes, they are trying to wrap this one up and call it a day.
In many ways, what brought WebRTC to what it is today is the W3C – the part focused on the interface in the browser that developers use. That is because the browser is our window to the internet (and in many ways to the world as well). And this window includes the ability to use WebRTC through the APIs specified by the W3C.
The catch here is that the standardization done by the W3C for WebRTC is carried out almost solely by the browser vendors themselves. There aren’t any (or not enough) web developers sitting at the table. The ones who need and end up using the WebRTC APIs have no real voice in the WebRTC spec itself. The cooks in the kitchen are far removed from the restaurant diners who need to enjoy their dish.
And meanwhile, the cooks have different opinions and directions as well:
So what do we end up with?
Google, trying to add things it needs to the WebRTC specification to solve their product needs
Other browser vendors, trying to delay Google a bit
And developers who aren’t part of the game at all and are happy with the leftovers from what Google needs.
Vendors differentiating outside of (lib)WebRTC
The whole WebRTC ecosystem is enjoying the work of Google in libWebRTC. They do so in various ways:
The first alternative is the most interesting one here.
When vendors do that, they usually end up forking the original codebase and modifying bits and pieces of it to fit their own needs. These might be minor bug fixes for edge cases or they may be full blown optimizations (like what Meta has done with their new MLow codec and Beryl echo cancellation algorithm – there were other areas as well. You’ll find them in the RTC@Scale event summary).
Video API vendors are no different. They usually take libWebRTC and compile it as part of their own mobile SDKs. Again, with likely changes in the code. They also get to see and work with a multitude of customers, each with its own unique requirements. In a way, they see a LOT of the market. Having these insights and understanding is great. Passing it to the libWebRTC team can be even better. These Video API vendors can be a great aggregator of customer insights…
Then there’s the fact that not many end up contributing back what they’ve done to libWebRTC. And even that comes with a whole set of reasons why:
If you ask me, (1) is just bad manners – you get something for free from another vendor you might even be competing directly with. The least you can do is to share and contribute back, so that you have a level playing field at that low level of the stack.
Looking at (2) means someone needs to sit and talk to the legal team at your company. On one hand, you make use of open source and on the other you’re not giving back anything. I am not even sure if that reduces your exposure in any way. I am not a lawyer, but I do see the problem in this free lunch approach of the industry.
That third one is a big issue. And partly due to the fault of Google. They don’t make it easy enough to contribute back to the codebase. I can easily understand the reasoning – with billions of Chrome installations, having a no-name developer with a weird github alias from *somewhere* in the globe trying to push a piece of arcane/mundane code into libWebRTC that ends up in Chrome is darn dangerous. But the current situation seems almost insufferable.
I just don’t know who’s to blame here – companies who are just too lazy to contribute back and jump through the hoops required to get there, or Google, for adding more blockers and hoops along their way.
Is standardization moving to the next shiny thing(s)?
There are two separate routes in web browsers that are setting themselves up to displace WebRTC: WebTransport + WebCodecs + WebAssembly & MoQ (Media over QUIC).
WebTransport + WebCodecs + WebAssembly
This trio is the unbundling of WebRTC. Taking it and breaking it into the smaller components that a web application cannot really implement on its own inside a web browser – these are WebTransport and WebCodecs. And adding the glue to them so that developers can cobble together the missing pieces however they feel like it – that’s the WebAssembly piece.
Vendors are already using WebAssembly to intervene with the WebRTC media processing pipeline to differentiate and improve on the user experience in various ways (noise suppression and background replacement being the main examples).
The next step is to skip WebRTC altogether:
Don’t believe me? Zoom is doing almost that. They are using the WebRTC data channel as transport, and use WebCodecs and WebAssembly for the rest of it. Switching to WebTransport will likely happen for Zoom once it is ubiquitous across browsers (and offers solid performance compared to the data channel in WebRTC).
A new shiny toy for developers? Definitely.
Where will we see it first? In live streaming. I’ve written about it when discussing WHIP and WHEP, calling it the 3 horsemen.
MoQ (Media over QUIC)
The next big thing is likely to be MoQ.
WebTransport makes use of QUIC as its own transport. Around 5 years ago, I thought that QUIC could be a really good solution to replace WebRTC’s transport altogether. And it now has an official name – MoQ.
MoQ is about doing to RTP what WebTransport does to HTTP.
WebTransport takes QUIC and uses it as a modernized transport for web browsers, replacing HTTP and WebSocket.
MoQ takes QUIC and uses it as modernized media streaming for web browsers, replacing HLS and DASH.
There’s an overview for MoQ on the IETF website. Here’s the best part of it, directly from this post:
It includes a single protocol for sending and receiving high-quality media (including audio, video, and timed metadata, such as closed captions and cue points) in a way that provides ultra low latency for the end user.
If that sounds like WebRTC to you, then you’re almost correct. It is why many are going to see it (and use it) as a WebRTC alternative once it gets standardized and implemented by web browsers.
The main differences?
While this is targeted at live streaming services, this can easily trickle into video conferencing.
Just like WebRTC was designed and built for video conferencing, but later adopted by live streaming services – the opposite can and is likely to happen: MoQ is being designed and built first and foremost for live streaming and it will be adopted and used by video conferencing services as well.
–
Would Google be interested in WebRTC enough? Maybe it would venture to use WebTransport + WebCodecs + WebAssembly instead. Or just go for MoQ and consolidate its protocols across services (think YouTube + Google Meet). What would happen to WebRTC if that would take place?
Who contributes to libwebrtc?
Here’s what I showed at RTC@Scale:
Let’s unpack this a bit.
The bars show the number of commits on a yearly basis. We see the numbers dwindling and winding down just as the use of WebRTC skyrockets (the red line) due to the pandemic. 2024 is likely to be even lower in terms of commits.
The greenish colored bars are Google’s contributions to libwebrtc. The blue? All the rest of the industry who make money using WebRTC – not all of them mind you – just those that contribute back (there are many others who never contribute back). Google has effectively been sponsoring them, which cannot be making Google happy.
Why is that?
Why do so few contributions from outside of Google end up in libwebrtc?
I guess there are two reasons here:
Many developers the world over enjoy the fruits of libwebrtc, but most aren’t willing to contribute back. This is true for both individual engineers as well as companies. Google even gave up on being frustrated with this and resorts to solving their own issues these days. They probably have a very good understanding of the overall usage in Chrome where Google Meet remains the dominant user.
On the one hand, Google isn’t making this easy. On the other hand, companies are too lazy, or too protective of their own forked libwebrtc code, to ever contribute it back.
Can we save libwebrtc & WebRTC?
It is time to rethink WebRTC’s future.
For libwebrtc, we might need some other form of governance. Have more of the bigger vendors pitch in with the engineering effort itself. Meta, Microsoft and a few others who rely heavily on libwebrtc need to step up to that responsibility (the W3C Working Group is not where this kind of discussion happens) while Google needs to let go a bit. I have no clue how things are done in the world of Linux and I am sure libwebrtc isn’t big enough or important enough for that. But I do believe that something can be done here. At the end of the day it will require taking some of the maintenance cost off Google.
Just like Chrome has third party libraries such as libopus and dav1d (AV1 decoder) embedded into Chrome as part of libwebrtc, there is no real reason why libwebrtc itself can’t end up in the same way.
For WebRTC standardization, it is time to ask – is it finished, or are there more things needed?
Do we want to progress and modernize it further or are we happy with it as is?
Should we “migrate” it towards MoQ or a similar approach?
In the W3C, do we need to get more people involved? The web developers themselves maybe? They need to be listened to and made part of the process.
–
Will the above happen? Likely not.
The post Does WebRTC need a change in governance? appeared first on BlogGeek.me.
RTC@Scale is Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary for RTC@Scale 2024 so you can pick and choose the relevant ones for you.
WebRTC Insights is a subscription service I have been running with Philipp Hancke for the past three years. The purpose of it is to make it easier for developers to get a grip of WebRTC and all of the changes happening in the code and browsers – to keep you up to date so you can focus on what you need to do best – build awesome applications.
We got into a kind of a flow:
Oh – and we’re covering important events somewhat separately. Last month, a week after Meta’s RTC@Scale event took place, Philipp sat down and wrote a lengthy summary of the key takeaways from all the sessions, which we distributed to our WebRTC Insights subscribers.
As a community service (and a kind of a promotion for WebRTC Insights), we are now opening it up to everyone in this article 😎
Meta ran their rtc@scale event for the third time. Here’s what we published last year and in 2022. This year was “slightly” different for us:
While you can say we’re both biased on this one, we will still be offering an event summary here for you. And we will be doing it as objectively as we can.
Our focus for this summary is what we learned or what it means for folks developing with WebRTC. Once again, the majority of speakers were from Meta. At times they crossed the line of “is this generally useful” to the realm of “Meta specific” but most of the talks provide value.
Writing up these notes takes a considerable amount of time, but is worth it (we know – we’ve done this before). You can find the list of speakers and topics on the conference website, the playlist of the videos can be found here (there’s also a 6+ hours long session there that includes all the Q&As). You can also just scroll down below for our summary.
Our top picks:
We find these most applicable to how you deal with WebRTC in general, even outside of Meta.
General thoughts (TL;DR)
(4 minutes)
Watch if you: need a second opinion on what sessions to watch
Key insights:
(13 minutes)
Watch if you: are a product person
Key insights:
(20 minutes)
Watch if you: are an engineer working on audio and enjoyed last year’s session
Key Insights:
(17 minutes)
Blog post: we hope there will be one!
Watch if you: are an engineer working on audio
Key Insights:
(19 minutes)
Watch if you: are looking for architecture insights also applicable to WebRTC
Key Insights:
https://www.youtube.com/live/dv-iEozS9H4?feature=shared&t=5821 (25 minutes)
Watch if: you found any of the sessions this covers interesting
Key Insights:
(22 minutes)
Blog post: https://engineering.fb.com/2024/03/20/video-engineering/mobile-rtc-video-av1-hd/
Watch if: you are thinking of adopting AV1 or trying to improve video quality
Key Insights:
(19 minutes)
Watch if: you are interested in a deep dive on AV1 and video encoding in general
Key Insights:
(16 minutes)
Watch if: you are working in the 360-degree video domain
Key points:
(19 minutes)
Watch if:
Key points:
https://www.youtube.com/watch?v=dv-iEozS9H4&t=13260s (23 minutes)
Watch if: you found any of the sessions this covers interesting
Key Insights:
(24 minutes)
Watch if: you like to hear Tsahi speaking. He does some juggling too!
Key Insights:
(20 minutes)
Watch if: you deploy a WebRTC-based system in production
Key points:
(21 minutes)
Watch if you: like open source
Key takeaways:
(20 minutes)
Watch if: you are interested in BWE and machine-learning
Key takeaways:
https://www.youtube.com/live/dv-iEozS9H4?feature=shared&t=21000 (24 minutes)
Watch if: you found any of the sessions this covers interesting
Key Insights:
As in previous years, we tried capturing as much as possible, which made this a wee bit long. The purpose though is to make it easier for you to decide which sessions to focus on, and even which parts of each session. And of course for us, so we can look things up and reference them in future blog posts or courses!
The post RTC@Scale 2024 – an event summary appeared first on BlogGeek.me.
We covered End-to-end encryption (E2EE) before, first back in 2020 when Zoom’s claims to do E2EE were demystified (not just by us; they later got fined $85m for this), followed by the quite exciting beta implementation of E2EE in Jitsi using Chromium’s Insertable Streams API. A bit later we had Matrix explain how their approach […]
The post End-to-End Encryption in WebRTC… 4 Years Later appeared first on webrtcHacks.
Need WebRTC recording in your application? Check out the various requirements and architectural decisions you’ll have to make when implementing it.
A critical part of many WebRTC applications is the ability to record the session. This might be a requirement for an optional feature or it might be the main focus of your application.
Whatever the reasons, WebRTC recording comes in different shapes and sizes, with quite a few alternatives on how to get it done these days.
What I want to do this time is to review a few of the aspects related to WebRTC recording, making sure that when it is your time to implement, you’ll be able to make better choices in your own detailed requirements and design.
One of the fundamental things you will need to consider is where you plan for the WebRTC recording to take place – on the device or on the server. You can either record the media on the device and then (optionally?) upload it to a server. Or you can upload the media to a server (live in a WebRTC session) and conduct the recording operation itself on the server.
Recording locally uses the MediaRecorder API while uploading uses HTTPS or WebSocket. Recording on the server uses WebRTC peer connection and then whatever media server you use for containerizing the media itself on the server.
Here’s how I’d compare these two alternatives to one another:
| | Record-and-upload | Upload-and-record |
|---|---|---|
| Technology | MediaRecorder API + HTTPS | WebRTC peer connection |
| Client-side | Some complexity in implementation, and the fact that browsers differ in the formats they support | No changes to client side |
| Server-side | Simple file server | Complexity in recording function |
| Main advantages | | |

When would I record-and-upload?
I would go for client-side recording using MediaRecorder in the following scenarios:
When would I upload-and-record?
Here’s when I’d use classic WebRTC architectures of upload-and-record:
How about both?
There’s also the option of doing both at the same time – record-and-upload in parallel to upload-and-record. Confused?
Here’s where you will see this taking place:
If you are recording more than a single media source, let’s say a group of people speaking to each other, then you will have this dilemma to solve:
Will you be using WebRTC recording to get a single mixed stream out of the interaction or multiple streams – one per source or participant?
Assuming you are using an SFU as your media server AND going with the upload-and-record method, then what you have in your hands are separate media streams, one per source. What you need then is a kind of an MCU if you plan on recording as a single stream…
For each source you could couple their audio and video into a single media file (say .webm or .mp4), but should you instead mix all of the audio and video sources together into a single stream?
Using such a mixer means spending a lot of CPU and other resources for this process. The illustration below (from my Advanced WebRTC Architecture course) shows how that gets done for two users – you can deduce from there for more media sources:
The red blocks are the ones eating up your CPU budget. Decoding, mixing and encoding are expensive operations, especially when an SFU is designed and implemented to avoid exactly such tasks.
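To make the difference tangible, here is a rough sketch in Python that only counts the expensive media operations each approach needs per media type. The names `RecordingWork` and `recording_work` are made up for this illustration, and counting operations is a simplification: the real CPU cost depends on codec, resolution and frame rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingWork:
    """Count of expensive media operations needed for one recording."""
    decodes: int
    mixes: int
    encodes: int


def recording_work(sources: int, mixed: bool) -> RecordingWork:
    """Estimate the media operations for recording `sources` participants.

    Multi-stream recording stores each received stream as-is, so nothing
    is decoded or encoded. Mixed recording decodes every source, mixes
    them once and re-encodes a single output stream.
    """
    if sources < 1:
        raise ValueError("at least one source is required")
    if not mixed:
        return RecordingWork(decodes=0, mixes=0, encodes=0)
    return RecordingWork(decodes=sources, mixes=1, encodes=1)
```

The decode count grows with every participant you add, which is exactly the work an SFU normally avoids.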
Here’s how these two alternatives compare to each other:
| | Multiple streams | Mixed stream |
|---|---|---|
| Operation | Save into a media file | Decode, mix and re-encode |
| Resources | Minimal | High on CPU and memory |
| Playback | Customized, or each individual stream separately | Simple |
| Main advantages | | |

When would I use multi stream recording?
Multi stream can be viewed as a step towards mixed stream recording or as a destination of its own. Here’s when I’d pick it:
When would I decide on mixed stream recording?
Mixed recording would be my go-to solution almost always. Usually because of these reasons:
What about mixed stream client side recording?
One thing that I’ve seen once or twice is an attempt to use a device browser to mix the streams for recording purposes. This might be doable, but quality is going to be degraded for both the actual user in the live session as well as in the recorded session.
I’d refrain from taking this route…
Switching or compositing
If you are aiming for a single stream recording, then the next dilemma you need to solve is the one between switching and compositing. Switching is the poor man’s choice, while compositing offers a richer “experience”.
What do I mean by that?
Audio is easy. You always need to mix the sources together. There isn’t much of a choice here.
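As a minimal sketch of what "mixing the sources together" means, here is a Python function that sums already-decoded 16-bit PCM samples and clips the result. It assumes all sources share the same sample rate and are time aligned, which a real mixer has to take care of first. The function name `mix_audio` is hypothetical.

```python
from typing import List, Sequence

# Valid range of a signed 16-bit PCM sample.
INT16_MIN, INT16_MAX = -32768, 32767


def mix_audio(sources: Sequence[Sequence[int]]) -> List[int]:
    """Mix decoded PCM sources by summing samples and clipping to 16 bits."""
    if not sources:
        return []
    length = max(len(source) for source in sources)
    mixed = []
    for i in range(length):
        # Sources that already ended simply contribute silence.
        total = sum(source[i] for source in sources if i < len(source))
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

Hard clipping is the simplest choice here. Production mixers usually apply gain control or limiting instead, so loud moments do not distort.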
For video though, the question is mostly what kind of a vantage point do you want to give that future viewer of yours. Switching means we’re going to show one person at a time – the one shouting the loudest. Compositing means we’re going to mix the video streams into a composite layout that shows some or all of the participants in the session.
Google Meet, for example, uses the switching method in its recordings, with a simple composite layout when screen sharing takes place (showing the presenter and his screen side by side, likely because it wasn’t too hard on the mixing CPU).
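Here is a small Python sketch of the switching decision itself: pick the loudest participant, but hold the current one for a minimum time so the recording doesn't flip back and forth on every cough. The class name, the audio level dictionary and the 2 second hold are all assumptions for illustration, not how any specific product does it.

```python
from typing import Dict, Optional


class ActiveSpeakerSwitcher:
    """Select which single video to record, based on who is loudest."""

    def __init__(self, min_hold_seconds: float = 2.0) -> None:
        self.min_hold_seconds = min_hold_seconds
        self.current: Optional[str] = None
        self._switched_at = 0.0

    def update(self, now: float, audio_levels: Dict[str, float]) -> Optional[str]:
        """Feed the latest audio levels and get the participant to show."""
        if not audio_levels:
            return self.current
        loudest = max(audio_levels, key=audio_levels.get)
        if self.current is None or self.current not in audio_levels:
            # Nobody selected yet, or the selected participant has left.
            self.current, self._switched_at = loudest, now
        elif loudest != self.current and now - self._switched_at >= self.min_hold_seconds:
            # Only switch once the current speaker was shown long enough.
            self.current, self._switched_at = loudest, now
        return self.current
```

Calling `update()` once per audio level report (for example a few times a second) gives you the participant whose video should go into the recording at that moment.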
In a way, switching enables us to “get around” the complexity of single stream creation from multiple video sources:
| | Switching | Compositing |
|---|---|---|
| Audio | Mix all audio sources | Mix all audio sources |
| Video | Select single video at a time, based on active speaker detection | Pick and combine multiple video streams together |
| Resources | Moderate | High CPU and memory needs |
| Main advantages | Cost effective | More flexible in layouts and understanding of participants and what they visually did during the meeting |

When would I pick switching?
When the focus is the audio and not the video.
Let’s face it – most meetings are boring anyway. We’re more interested in what is being said in them, and even that can be an exaggeration (one of the reasons why AI is used for creation of meeting summaries and action items in some cases).
The only crux of the matter here is that implementing switching might take slightly longer than compositing. In order to optimize for machine time in the recording process, we need to first invest in more development time. Bear that in mind.
When would compositing be my choice?
The moment the video experience is important. Webinars. Live events. Video podcasts.
Media that you plan or want to apply post-production editing to.
Or simply when the implementation is there and easier to get done.
I must say that in many cases that I’ve been involved with, switching could have been selected. Compositing was picked just because it was thought of as the better/more complete solution. Which begs the question – how can Google Meet get away with switching in 2024? (the answer is simple – compositing isn’t needed in a lot of use cases).
Rigid layouts or flexible layouts
Assuming you decided on compositing the multiple video streams into a single stream in your WebRTC recording, it is now time to decide on the layout to use.
You can go for a single rigid layout used for all (say tiles or presenter mode). You can go for a few layouts, with the ability to switch from one to the other based on context or some external “intervention”. You can also go for something way more flexible. I guess it all depends on the context of what you’re trying to achieve:
| | Single | Rigid | Flexible |
|---|---|---|---|
| Concept | A single layout to rule them all | Have 2, 3 or 7 specific layouts to choose from | Allow virtually any layout your users may wish to use |
| Main advantages | | | |

Here’s a good example of how this is done in StreamYard:
StreamYard gives 8 predefined different layouts a host can dynamically choose from, along with the ability to edit a layout or add new ones (the buttons at the bottom right corner of the screen).
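To give a feel for what a single rigid layout boils down to in a compositor, here is a Python sketch that computes an evenly sized tile grid for a given number of participants. The function name `tile_layout` and the "as square as possible" rule are my own assumptions for illustration. StreamYard and others will have their own layout rules.

```python
import math
from typing import List, Tuple

# A tile is (x, y, width, height) in pixels on the output canvas.
Tile = Tuple[int, int, int, int]


def tile_layout(count: int, width: int, height: int) -> List[Tile]:
    """Place `count` equally sized tiles on a canvas, row by row."""
    if count < 1:
        return []
    # Smallest number of columns whose square can hold all the tiles.
    cols = math.isqrt(count - 1) + 1
    rows = math.ceil(count / cols)
    tile_w = width // cols
    tile_h = height // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(count)
    ]
```

With 3 participants on a 1280x720 canvas this yields a 2x2 grid with one empty slot. A flexible layout engine would replace this function with positions supplied by the user.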
When to aim for rigid layouts?
Here’s when I’ll go with rigid layouts:
Here, make sure to figure out which layouts are best to use and how to automatically make the decision for the users (it might be that you record whatever layout the host is using, or that you decide based on the current state of the meeting – with screen sharing, without, number of participants, etc).
When would flexibility be in my menu?
Flexibility will be what I’ll aim for if:
You decided to go for a composite video stream for your WebRTC recording? Great! Now how do you achieve that exactly?
For the most part, I’ve seen vendors pick up one of two approaches here – either build their own proprietary/custom transcoding pipeline – or use a headless browser as their compositor:
| | Transcoding pipeline | Browser engine |
|---|---|---|
| Underlying technology | Usually ffmpeg or gstreamer | Chrome (and ffmpeg) |
| Concept | Stitch the pipeline on your own from scratch | Add a headless browser in the cloud as a user to the meeting and capture the screen of that browser |
| Resources | High | High, with higher memory requirements (due to Chrome) |
| Main advantages | | |

Here I won’t be giving an opinion about which one to use as I am not sure there’s an easy guideline. To make sure I am not leaving you half satisfied here, I am sharing a session Daily did at Kranky Geek in 2022, talking about their native transcoding pipeline:
Since that’s the alternative they took, look at it critically, trying to figure out what their challenges were, so you can create your own comparison table and make a decision on which path to take.
Live or “offline”
Last but not least, decide if the recording process takes place online or post mortem – live or “offline”.
This is relevant when what you are trying to do is to have a composite single media stream out of the session being recorded. With WebRTC recording, you can decide to start off by just saving the media received by your SFU with a bit of metadata around it, and only later handle the actual compositing:
| | Live | “offline” |
|---|---|---|
| Concept | Handle recording on demand, as it is taking place. Usually, adding 0-5 seconds of delay | Use job queues to handle the recording process itself, making the recorded media file available for playback minutes or hours after the session ended |
| Main advantages | | |

When to go live?
The simple answer here is when you need it:
When to use “offline”?
Going “offline” has its set of advantages:
How about both?
Here are some suggestions of combinations of these approaches that might work well:
This has been long. Sorry about that.
Designing your WebRTC recording architecture isn’t simple once you dive into the details. Take the time to think of these requirements and understand the implications of the architecture decisions you make.
Oh, and did I mention there’s a set of courses for WebRTC developers available? Just go check them out at https://webrtccourse.com
The post WebRTC recording challenges and solutions appeared first on BlogGeek.me.
I am working on a personal Chrome Extension project where I need a way to convert a video file – like your standard mp4 – into a media stream, all within the browser. Adding a file as a src to a Video Element is easy enough. How hard could it be to convert a video […]
The post All the ways to send a video file over WebRTC appeared first on webrtcHacks.
Some science fiction books I carry in my heart and mind wherever I go for quite a few years now. Consider it a condensed book review.
I am a sucker for science fiction books. About 15 years ago, when I had a blog on RADVISION’s website, I even wrote a post about how writers envisioned video conferencing in science fiction books. Alas, that post has died, along with the RADVISION blogs, years ago.
Last week I sat down in the car with my daughter, ending up talking about books. It dawned on me that there are several that have stuck with me throughout the years and resonated. Books that keep me thinking even today.
This time, I decided to share them here. Unrelated to WebRTC, video, CPaaS or communication technologies. Just something I wanted to share 🤷‍♂️
And yes. All links are affiliated – my Kindle needs a few new good science fiction books 😉
They’re brought here in no specific order (alphabetically…)
Blood Music / Greg Bear
Greg Bear has many great books. Blood Music is definitely one of them (I had to decide whether to suggest this one or Darwin’s Radio – ending up with this one).
What I like about this one is how it combines miniaturization with biology. I know nothing about biology and what I do know about technology and miniaturization is by using computers.
This was a compelling read and a really interesting one of what happens at the extreme ends of connecting the dots between these two things.
It also resonated with my own philosophical thoughts about the difference in depiction and scale between the makings of atoms to the whole universe. To understand this specific sentence, reading Blood Music by Greg Bear is likely needed.
Daemon / Daniel Suarez
LLMs, chatbots, AI. This book has it all.
One of my previous managers suggested I read that, and he was spot on. It takes the angle of how the gaming industry and its NPCs (Non Player Characters) can make a difference if they are “let loose” in the world.
It takes the technologies we have today (or rather a few years ago) and tries to prophesize where we will be with them. Definitely a few misses in where we are headed, but a lot to think about.
Especially when the time comes to decide who works for whom – the machine for us or us for the machine.
Go read Daemon by Daniel Suarez
Ender’s Game / Orson Scott Card
This is the second or third science fiction book I read in English and it got me onto the path of reading in English a lot. A roommate at the university gave it to me to read and said “it is about a small kid that saves the world”.
Besides the science fiction part of the book, how it covers bullying and the way to win in wars is interesting. I like how Orson outlines the story.
A few years after reading it, Orson Scott Card came to Israel for an event. I went there with a colleague from work for the book signing event, standing two hours in line for one minute with Orson. He gave me his full attention and was surprised at the book I brought to sign (Enchantment – it isn’t in this list since it is fantasy and not science fiction).
Anyway, Orson Scott Card is always a good read and Ender’s Game is a great starting point.
Expendable / James Alan Gardner
This is one enjoyable read. It took me into this riveting series of books by James Alan Gardner.
To put it short, explorers are expendable. They are dropped into new worlds to explore, and the reason they were selected is because they are deformed in one way or another but smart. So instead of fixing their external deformity (or ugliness), they are used as explorers. Why? Because if they looked good – they wouldn’t be expendable. Their death might matter to someone.
The rest of the series revolves around nanotech and AI. Or magic. Or something in between.
This is a lot less about ruminating about the books afterwards and more about enjoying the read – go read Expendable by James Alan Gardner.
Old Man’s War / John Scalzi
John Scalzi is another master storyteller (at least for me). Old Man’s War marks the beginning of a great series of many books (and not the only ones I love from John Scalzi).
Old Man’s War places humanity in a universe full of alien life – most of it warring in nature (or at least that’s the initial premise of it all). To build an army, the solution is to take the elderly and have them undergo a physical change, essentially taking them a bit apart from the rest of humanity and turning them into soldiers.
Since Earth is kept a wee bit back in its technology, they’ve seen most of what there is in life already and are old. So getting a younger body is all that is needed to recruit them for the cause.
The older I get (age 40 was especially rough – it is when I started breaking at the seams or so it seems), the more I think about this series of books – and how I wish (or don’t wish) to be young again.
This series, as well as many of his other books, is a joy to read – Old Man’s War by John Scalzi
Ready Player One / Ernest Cline
Skip the movie. Read the book.
This has the word metaverse all over it. If you read Snow Crash by Neal Stephenson then you’ll want to read this one. And if you haven’t then just go read them both 🤷‍♂️
Besides the part of metaverse, large corp and all that stuff we’re here to ponder, what really sets this book apart is the treasure trove of nostalgia that it is. If you are 40 years or older, know what a Commodore 64 is, played Pac Man on a handheld device before there was such a thing as a PC, then you’ll find your youth inside this book. For me, this was a true joy to read.
Oh, and I just started reading Ready Player Two (noticed that when I went searching for the books I loved for this article).
Go read Ready Player One by Ernest Cline.
The Peace War / Vernor Vinge
If you know Vernor Vinge as a scifi writer then you don’t need me for this one. If you read scifi and haven’t read a Vernor Vinge book then you should. In such a case, The Peace War is a great place to start.
This one is about technology and fighting wars with the resources you have. Where one side rules all, the other goes and miniaturizes stuff.
This, as well as many of his other books just float in my head and come out from time to time (especially books like A Fire Upon The Deep or Rainbows End, both from the point of view of communication technologies and artificial intelligence).
Anyways, just go read The Peace War by Vernor Vinge. Or any other book by Vernor Vinge for that matter…
The Speed of Dark / Elizabeth Moon
This book touched me in many ways. It isn’t exactly science fiction – it is mostly about the effect of improvements in healthcare on the moral decisions we need to make.
In this case, it is about the last autistic people in the world, after autism is all but eradicated, and what it means for an autistic adult to decide to “heal”. Would that be a good thing for him? A bad one? Will he stay the same person?
And all of that written from the point of view of the autistic person.
I truly loved this one and walked around with the baggage it left in me afterwards. Highly recommended – The Speed of Dark / Elizabeth Moon.
Winter World / A.G. Riddle
I read this one last winter… and it got me into the mood of winter and kept me there. All dark and cold. This book (and the series) is so well written. You can just feel the cold and the darkness as you read it.
The story is about our earth, dealing with climate change – one where the sun just gets blotted out of the sky until it is no longer visible. At least that’s the first book. It is about choices – technological and human ones. And about our will to survive.
I’ll just leave it at that and say that this winter here is cold as well. And it got me thinking about this book series again.
Go read Winter World by A.G. Riddle.
Wool / Hugh Howey
No. I haven’t seen it on Apple TV. I read the book and then all 3 books in this series. And then the rest of the Silo stories available. It is that riveting.
This is less about technology (at least the first book) and more about the human condition and how technology affects it. Like many of the other books in this article that I am recommending, this series is also dystopian in nature. It isn’t that I like my books bleak – it is just that the bleak ones stick with me longer and cause me to think about my day to day a lot more.
Anyways, go read Wool by Hugh Howey.
Your turn
Got any books you think I should be reading? Science fiction and fantasy would be great:
Now I need to get back to Ready Player Two 😉
I’ll be back to the usual communication technology articles next time.
The post Science fiction books that resonated with me appeared first on BlogGeek.me.
Answering some common FAQ questions about WebRTC that seem to be top of mind on Google search.
A few days ago, I searched something on Google, and somehow bumped into a page full of questions Google found relevant or common. These weren’t exactly relevant to my search term (not directly), but they were there. And they were beginner questions about WebRTC.
It dawned on me that I’ve probably mentioned some of these things in passing (or a wee bit more) in the past, but placing them all neatly together in one place made sense. So here we are. And here’s the WebRTC FAQ for beginners.
WebRTC is neither TCP nor UDP. At the same time WebRTC is both TCP and UDP.
Confused?
Let’s put things in order.
With WebRTC there’s signaling and media.
Signaling is considered to be out of scope and left to the application. Most applications will use HTTPS or a secure WebSocket as transport for signaling. HTTPS runs over TCP… sort of… since HTTP/3 runs over QUIC, which uses UDP. But mostly, you can think of signaling in WebRTC as TCP and the skies won’t fall (what we want for signaling is reliability and message ordering, and TCP based protocols give us that).
Media in WebRTC wants to use UDP. It strives to use UDP as much as possible, but that’s not always available to it, so it then falls back towards using TCP. But you can consider this as a last resort (we don’t want to be in that predicament).
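The "UDP first, TCP as a last resort" rule can be written down as a tiny Python sketch. This is only a conceptual illustration with made-up names (`TRANSPORT_PREFERENCE`, `pick_media_transport`). In real life this decision is made for you by ICE inside the browser, not by application code like this.

```python
from typing import Iterable, Optional

# Preference order for media, from most to least desirable.
TRANSPORT_PREFERENCE = ("udp", "tcp")


def pick_media_transport(reachable: Iterable[str]) -> Optional[str]:
    """Return the preferred transport among those that actually work."""
    available = {transport.lower() for transport in reachable}
    for transport in TRANSPORT_PREFERENCE:
        if transport in available:
            return transport
    # Nothing is reachable, so no media session can be established.
    return None
```

If UDP is blocked by a firewall, the function returns "tcp", which matches the fallback described above.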
Read more about WebRTC transport:
Yes. You wouldn’t be reading my blog otherwise.
It isn’t that there aren’t any challengers. It is that WebRTC is still the most popular and common solution for real time communications in web browsers.
WebTransport + WebCodecs + WebAssembly might someday replace WebRTC. But we’re not there yet.
Read more about WebRTC’s success and future:
Free. Err. Paid. Free? Paid? Both? None?
Let’s sort things out here.
WebRTC is an open standard with a popular open source implementation maintained by Google and used by all major browser vendors.
Accessing the APIs and using them is free.
But creating most of the meaningful applications is going to require some sort of payment. That can be to a CPaaS vendor to host the WebRTC infrastructure; or to an IaaS vendor (think AWS) to host the servers and the bandwidth use (especially with TURN and media servers).
So yes. WebRTC is free, but expect to pay for it, in particular if you need help. Google will not help you…
WebRTC is used for implementing realtime voice and video communications over the internet using web browsers. But it definitely isn’t limited to that.
I’ve seen use cases dealing with recording, live streaming, broadcasting, cloud gaming, remote teleoperation (that’s driving a car… remotely), peer assisted delivery, file transfer, … the list is endless.
WebRTC enables browsers to have (and give) access to your microphone, camera, display and IP address. This is what every voice or video meeting application you install requires in order to work properly as well.
Is that a security risk? That’s up to you to decide as a user.
Giving such power to the browser reduces the friction for users but also for nefarious third parties who want to exploit these capabilities, so some will see this as an increase in security risk.
For developers it simply means that they need to know and understand what they are doing and how they implement their applications with this technology in order to mitigate any potential risk. It is worth noting that WebRTC and web browsers from their side do the most they can to reduce such security risks and even encourage developers to write secure applications.
Does Netflix use WebRTC?
No.
Netflix might be using WebRTC somewhere, but for its main video streaming service Netflix doesn’t use WebRTC.
Why? Because WebRTC is designed and fine tuned for real time communications. As such, it sacrifices quality for improved latency.
Netflix is the exact opposite. It strives to deliver the best quality and is willing to sacrifice a bit of latency while at it – you wouldn’t mind waiting a few seconds for your movie to start in order to have crisp and pristine video. On the other hand, you’d be pissed if your online video conversation had a latency of 5 seconds and felt as if the other person was sitting on the moon.
Yes.
Everything can be hacked.
Browsers are trying to do their best to reduce that risk for WebRTC (and other technologies they implement), but it is an arms race…
Does WebRTC expose your IP?
This is a tricky question. The answer is yes and no.
Let’s start by understanding which IP address…
Your device usually has two IP addresses: a local one and a public one.
Each application on your device, including the browser, has access to the local IP address.
Each web server you connect to on the internet sees your public IP address.
When negotiating a WebRTC session, WebRTC uses a mechanism called ICE which discovers your public IP address and shares your local and public IP addresses with the peer it connects with.
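To make that a bit more tangible, here is a rough sketch of what each ICE candidate type gives away. It assumes candidate strings in the usual `candidate:` format; `sharedAddress` and the sample addresses are hypothetical and only here for illustration.

```typescript
// The candidate types defined by ICE.
type CandidateKind = "host" | "srflx" | "prflx" | "relay";
const CANDIDATE_KINDS: readonly CandidateKind[] = ["host", "srflx", "prflx", "relay"];

// What each type of candidate tells the remote peer about you.
const REVEALS: Record<CandidateKind, string> = {
  host: "local address of the device",
  srflx: "public address as seen by the STUN server",
  prflx: "address as seen by the remote peer",
  relay: "address allocated on the TURN server",
};

interface SharedAddress {
  address: string;
  kind: CandidateKind;
  reveals: string;
}

// Hypothetical helper: pulls the address and type out of one candidate string.
function sharedAddress(candidate: string): SharedAddress | null {
  const fields = candidate.trim().split(/\s+/);
  const address = fields[4]; // fifth field is the connection address
  const typIndex = fields.indexOf("typ");
  if (address === undefined || typIndex < 0) return null;
  const rawKind = fields[typIndex + 1];
  const kind = CANDIDATE_KINDS.find((k) => k === rawKind);
  if (kind === undefined) return null;
  return { address, kind, reveals: REVEALS[kind] };
}

// Illustrative samples. Note that some browsers replace the host address
// with an mDNS name (something ending in .local) instead of the raw local IP.
const hostSample = "candidate:1 1 udp 2122260223 192.168.1.20 54321 typ host";
const srflxSample =
  "candidate:2 1 udp 1686052607 203.0.113.7 54321 typ srflx raddr 192.168.1.20 rport 54321";

console.log(sharedAddress(hostSample)?.reveals); // local address of the device
console.log(sharedAddress(srflxSample)?.reveals); // public address as seen by the STUN server
```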
A few quick clarifications here:
A cheesecake is definitely better than WebRTC. A chocolate cheesecake is doubly so.
In all seriousness though, I have no clue.
It depends. Which is a cop out answer but the only one here.
The question should be more specific. It should include what it is you are trying to build, who the target audience is and what medium you want to use for it.
For live streaming, WebRTC might not be the best fit. Especially if you can live with a 2-second delay (in that case, LL-HLS and LL-DASH would be better solutions for example).
For video conferencing… well… I’d start by selecting WebRTC by default. And then try to poke holes in my decision and select something else – proprietary – since there is nothing else…
Apples to oranges.
I’d use both. In the same application. Seriously.
WebSocket for signaling and WebRTC for media.
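Since signaling is left to the application, the messages you push over that WebSocket are yours to define. Below is a minimal sketch of one possible shape. The `SignalingMessage` type and the `encodeSignal` / `decodeSignal` helpers are assumptions of mine, not anything WebRTC specifies.

```typescript
// Hypothetical message shapes; WebRTC does not define a signaling protocol.
type SignalingMessage =
  | { kind: "offer"; sdp: string }
  | { kind: "answer"; sdp: string }
  | { kind: "candidate"; candidate: string };

// Serialize a message to the text frame you would send over the WebSocket.
function encodeSignal(message: SignalingMessage): string {
  return JSON.stringify(message);
}

// Parse and validate an incoming text frame; returns null for anything unexpected.
function decodeSignal(text: string): SignalingMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { kind, sdp, candidate } = data as Record<string, unknown>;
  if (kind === "offer" && typeof sdp === "string") return { kind, sdp };
  if (kind === "answer" && typeof sdp === "string") return { kind, sdp };
  if (kind === "candidate" && typeof candidate === "string") return { kind, candidate };
  return null;
}

// In a browser this would look roughly like:
//   socket.send(encodeSignal({ kind: "offer", sdp: offer.sdp ?? "" }));
//   socket.onmessage = (event) => handle(decodeSignal(String(event.data)));
const frame = encodeSignal({ kind: "offer", sdp: "v=0" });
console.log(frame); // {"kind":"offer","sdp":"v=0"}
```

The media itself never touches this channel. Once the offer, answer and candidates have been exchanged, WebRTC takes over.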
There are two places where you can think of WebRTC and WebSocket as alternatives:
Did I already say apples to oranges?
To be frank – Google is Google. Not sure what the question is here.
Google and WebRTC have an interesting relationship.
It all started when Google acquired GIPS, a company that licensed media engines. A bit afterward, WebRTC was announced in the standardization organizations and Google made the GIPS media engine into an open source implementation, integrating it into Chrome and placing APIs on top of it – these APIs were the WebRTC API specifications (or close enough at the time).
That was over 10 years ago. Since then, WebRTC has evolved and so has Google’s implementation of it.
Google uses WebRTC internally for Google Meet and for other products and projects it has.
The actual WebRTC project is open source. Maintained by Google. And most of the contributions to it are Google’s.
Yes. WebRTC needs a server. In fact, it needs multiple servers.
For starters, you need to download the application logic from somewhere, and you need a way to signal who you want to have a conversation with. This is done with a signaling server.
Then, when connecting the WebRTC session, there are times when you won’t have a direct route for the media. In such cases, you are going to need a TURN server. TURN servers also act as STUN servers but STUN servers are not the same as signaling servers.
And, you may want to go fancy – run a group meeting, record stuff. Such capabilities almost always mean you are adding a media server into the mix.
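On the client side, the STUN and TURN servers end up in the configuration you hand to the peer connection. Here is a small sketch of how that might be put together. The hostname, username and credential are placeholders, and `buildPeerConfig` is a made-up helper. The local interfaces mirror the shape of the browser's `RTCIceServer` dictionary so the sketch runs outside a browser too.

```typescript
// Mirrors the shape of the browser's RTCIceServer dictionary.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServer[];
}

// Hypothetical helper: one host acting as both STUN and TURN server (3478 is the default port).
function buildPeerConfig(turnHost: string, username: string, credential: string): PeerConfig {
  return {
    iceServers: [
      // STUN: used to discover the public address, needs no credentials.
      { urls: `stun:${turnHost}:3478` },
      // TURN: relays media when there is no direct route, over UDP first and TCP as fallback.
      {
        urls: [
          `turn:${turnHost}:3478?transport=udp`,
          `turn:${turnHost}:3478?transport=tcp`,
        ],
        username,
        credential,
      },
    ],
  };
}

// Placeholder values for illustration only.
const peerConfig = buildPeerConfig("turn.example.com", "demo-user", "demo-secret");
console.log(JSON.stringify(peerConfig.iceServers[0])); // {"urls":"stun:turn.example.com:3478"}

// In a browser: const pc = new RTCPeerConnection(peerConfig);
```

The signaling server and any media server do not appear here at all. Those are reached by your own application code, not through this configuration.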
Does WebRTC require Internet?
Yes.
Everything today requires the Internet. Even you being able to read this FAQ requires the Internet.
WebRTC can run in local networks or private networks without connecting to the public Internet. But it still needs an IP network to work.
Does WebRTC use SSL?
Yes.
Let’s start with definitions first: For me SSL and TLS are one and the same.
HTTPS and WSS (Secure HTTP and Secure WebSocket) both run on top of TLS so they are also → SSL.
Web browsers practically force application developers to use HTTPS for the pages that host these services, which means all signaling used with WebRTC will be done via HTTPS or WSS.
The media uses SRTP (Secure RTP), which doesn’t use TLS (because it isn’t running over TCP). Its encryption keys are negotiated using DTLS, the datagram variant of TLS. That said, when sessions need to be relayed via TURN servers, they might end up being relayed over TURN/TLS.
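Whether the relay leg ends up on TURN/TLS depends on the TURN URLs the application configures. The sketch below maps a TURN URL to the transport it asks for, based on the `turn:` and `turns:` URI schemes. `turnTransport` is a hypothetical helper and the hostnames are placeholders.

```typescript
type TurnLabel = "TURN/UDP" | "TURN/TCP" | "TURN/TLS";

// Hypothetical helper: maps a TURN server URL to the transport the relay leg would use.
//   turn:  defaults to UDP, or TCP with ?transport=tcp
//   turns: means TLS over TCP
function turnTransport(url: string): TurnLabel | null {
  const match = /^(turns?):[^?\s]+(?:\?transport=(udp|tcp))?$/i.exec(url.trim());
  if (match === null) return null;
  const secure = (match[1] ?? "").toLowerCase() === "turns";
  const requested = (match[2] ?? "").toLowerCase();
  if (secure) {
    // turns: combined with transport=udp is left out of this sketch.
    return requested === "udp" ? null : "TURN/TLS";
  }
  return requested === "tcp" ? "TURN/TCP" : "TURN/UDP";
}

console.log(turnTransport("turn:turn.example.com:3478")); // TURN/UDP
console.log(turnTransport("turn:turn.example.com:3478?transport=tcp")); // TURN/TCP
console.log(turnTransport("turns:turn.example.com:5349?transport=tcp")); // TURN/TLS
```

Either way, the media inside stays SRTP. The TLS here only wraps the hop between the client and the TURN server.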
Couldn’t find the answer?
I can invite you to follow and read my blog – it has a lot of resources about WebRTC.
My suggestion? Start here: What is WebRTC?
If you are looking to skill up with WebRTC, I also have WebRTC courses for you.
The post An FAQ for WebRTC beginners appeared first on BlogGeek.me.
Here are the WebRTC trends and predictions you should expect in 2024. They are a continuation of what we’ve seen in 2023 with a few variations.
Time to look at what we’ve accomplished in 2023 and think what’s ahead of us in 2024 when it comes to WebRTC.
When we look ahead, there are several notable things that glare at us immediately:
Last year, I became CPO at Spearline. This year, Spearline got acquired by Cyara and I am now Senior Director of Product Management there. I am still delving into WebRTC and CPaaS. Still consulting a bit here and there on these subjects when it makes sense.
If you are interested, you can read last year’s WebRTC predictions for 2023.
Let’s get started here…
This year, I took the liberty of also sharing my predictions in a video form. It holds the essence of my WebRTC predictions for 2024, in a short form.
Read on below to get into the details.
The era of differentiation in WebRTC
We are well into the era of differentiation:
I made this slide sometime in 2020, modifying it a bit to fit the pandemic.
It is as relevant today as it was last year:
The answer to how we compete varies on a yearly basis. Now, it obviously revolves around generative AI and LLMs. That’s the easy answer. The truth is a lot more complicated and nuanced. It requires understanding where investments are currently made – both at Google and in the ecosystem around WebRTC and its use.
What does WebRTC use look like?
Last year I predicted usage would be 3 times higher than pre-pandemic. That meant lowering the use at the beginning of 2023 from 4 times to 3 times pre-pandemic. The end result? We stayed at around 4 times pre-pandemic usage.
From here, it can only go up, though slowly and linearly, and likely only after 2024:
I am not going to touch the topic of open source here. I’ve done that in my article two weeks ago writing about the top WebRTC open source media servers on GitHub.
XaaS requires a few words of explanation, and I am likely to cover them in the coming months in further detail in a separate article.
For me, XaaS is IaaS, CPaaS and SaaS. In all cases, it is a matter of looking at them through the prism of WebRTC APIs (CPaaS).
CPaaS
The landscape is changing in the CPaaS domain. A few years back, the leading vendors for WebRTC APIs were Vonage, Twilio and Agora. Probably in this order.
Here’s what I had to say in last year’s predictions article:
The perceived leaders in WebRTC CPaaS are still Twilio, Vonage and Agora. I have a feeling that by the end of 2023 this will change.
Little did I know this would be spot on…
Twilio just announced in December that it is exiting the video business altogether. They still have and use WebRTC for their voice capabilities, mainly with a focus on call centers. But other than that? They just became irrelevant to many developers.
Most vendors are now likely to want to compare themselves to Vonage and Amazon Chime SDK. Agora probably as well.
From a perspective of innovation or specific market niches, other vendors come to mind as solid alternatives here. Companies such as Daily and Dolby for example (there are others – sorry for not mentioning everyone). Or LiveKit with its open source alternative.
Notables?
That change at Twilio places more strain on developers who need to choose who to use, with the added new risk of the level of commitment they see in the CPaaS vendor they choose. When someone like Twilio throws you under the bus, what can you expect from other vendors?
SaaS
SaaS vendors are moving towards CPaaS, assuming for some unknown reason that there’s money to be had from developers.
There are a few that are taking this route.
The problem that I see here is the fact that Twilio decided this isn’t interesting enough. While they have the APIs, they don’t invest in them any further, meaning it isn’t a big enough market for Twilio. In such an atmosphere, how would it be big enough for SaaS vendors, and how will they see the explosion in use of their infrastructure that they likely haven’t seen in SaaS?
Some of them may yet succeed, but the path here isn’t an obvious or a simple one.
IaaS
Amazon, Microsoft, Google… and… Cloudflare.
Let’s see where that takes us.
Amazon is investing in Chime SDK. Especially when it comes to audio quality and capabilities. In many ways, Amazon is shifting the attention of developers from CPaaS to their Chime SDK as a solid alternative. This is a trend that should be watched by CPaaS vendors and developers alike.
Microsoft seems content with their current offering of Azure Communication Services. There were no new or interesting announcements around it in 2023, which begs the question – is it important enough for Microsoft and a viable solution for developers?
Google announced APIs for Google Meet. Ones that integrate with it, but not ones that use its infrastructure for me to build my own video experiences. So no luck there for a CPaaS play. Time will tell if this changes. It is unlikely to happen in 2024.
Cloudflare entered the market with much fanfare. I covered them in 2023’s predictions. Since then, there have been no material announcements. Is that good? Bad? I just don’t know.
How did I do with my 2023 WebRTC predictions?
I spent quite a lot of time on my predictions in 2023. Let’s see how well I did.
#1 – libWebRTC (and the future of WebRTC)
I’ve made the prediction that Google’s WebRTC library will focus on house cleaning, optimizing and polishing collaboration. It did all that this year. We see this on an ongoing basis in our WebRTC Insights service.
What was interesting to note is a slight shift towards requirements coming from outside of Google Meet. There’s work being done to include H.265 support in libWebRTC, wherever H.265 is available in a hardware implementation form (i.e., someone is already paying the patent royalties bill).
Is that because Google was benevolent and nice? Is it because they wanted to show they aren’t a monopoly in Chrome? Is it because of some other deal with Intel (the ones pushing H.265 into WebRTC)? Or is it simply because they might end up using it in Google Meet in all-Apple devices meetings? Time will tell.
#2 – Machine learning and media processing
I assumed that WebAssembly would continue to be used with WebRTC for media processing in things like background replacement, noise suppression and proprietary codec implementations.
It was.
Some of it was done in WebAssembly and at the browser level. A lot of it was relegated to the cloud or kept in native applications. What I found interesting is that some vendors chose to announce and release such solutions across all platforms and not start from native and move towards the web later.
Most interesting (and obvious) change here? A lot of this use is now being remarketed as generative AI – doesn’t matter if it is generative or not.
#3 – Voice before video (Lyra first, AV1 later)
I thought Lyra (=new voice codec) would find its way to applications faster than AV1 (=new video codec). Or at least new voice codecs…
The results are… inconclusive.
Webex did come out with a new Webex AI audio codec, with little explanation about it.
AV1 is starting to make real noises of almost-maturity, with Apple supporting AV1 hardware acceleration (for decoding only at the moment) and Google fiddling around with AV1 in Google Meet.
We didn’t hear much this year about Google’s Lyra or Microsoft’s Satin codecs. Just this new announcement of the new Webex AI codec. So I am not sure if voice happened before video or not.
#4 – Observability
Yes. There is more interest in observability. I know that by looking at our numbers in testRTC. There is no specific market or industry where it happens more. What I can say is that many contact centers are starting to take note. Probably due to their increased reliance on WebRTC and the fact that many contact center agents are working from home now.
#5 – M&As and shutdowns
We had a few interesting shutdowns and M&As. The most notable ones?
A lot of WebRTC engineers found themselves a new home. Either because their startups shut down, their company downsized or they saw no future where they were.
Good talent is there to be had if you look hard enough.
WebRTC predictions for 2024
Enough about 2023. That’s old news. Let’s see what’s going to happen with WebRTC in 2024.
#1 – libWebRTC (and the future of WebRTC)
I’ll start with the most important piece of our technology puzzle – libWebRTC, maintained by Google.
This year will be a continuation of last year. Mostly maintenance releases, with a few minor improvements. The places where we will see the most focus by Google in libWebRTC:
By the end of 2024, we will find ourselves similar to where we are at the beginning of it:
WebAssembly is where we see innovation and differentiation in WebRTC. 2024 will be no different.
It will be incorporated in the “same old places” of media processing.
What we will also see is a lot more machine learning on the server side, and a lot of it will be leaning towards generative AI and LLM technologies. This isn’t really a prediction, but just stating the obvious here. Coming from someone who uses Midjourney for the imagery in many of his recent articles, that shouldn’t come as a surprise to you.
#3 – The year of Lyra and AV1
Time to take a huge risk.
I mentioned this in the libWebRTC prediction, but it deserves a section of its own as well.
Each year I say AV1 is years away. I think it is still going to take time until it becomes commonplace. That said, I believe this year we will see AV1 in one or more commercial WebRTC services, including Google Meet. It will be used judiciously and in very specific use cases and scenarios – call this testing the water.
On the audio side, we will see an AI audio codec being used in production in web browsers. Likely from Google. I believe Lyra will find its way into Google Meet. How exactly is where I am uncertain.
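Using AV1 "judiciously" in practice usually comes down to codec preference ordering: put AV1 first where you want it and keep the other codecs as fallback. Below is a minimal sketch of such a reordering. `preferCodec` is a hypothetical helper and the codec list is illustrative. In a browser the list would come from `RTCRtpReceiver.getCapabilities("video")` and the result would be passed to `RTCRtpTransceiver.setCodecPreferences()`.

```typescript
// Only the field this sketch needs; real codec capability entries carry more.
interface CodecLike {
  mimeType: string;
}

// Hypothetical helper: moves the preferred codec to the front, keeping the
// relative order of everything else so the fallbacks stay as they were.
function preferCodec<T extends CodecLike>(codecs: readonly T[], preferredMimeType: string): T[] {
  const wanted = preferredMimeType.toLowerCase();
  const preferred = codecs.filter((c) => c.mimeType.toLowerCase() === wanted);
  const others = codecs.filter((c) => c.mimeType.toLowerCase() !== wanted);
  return [...preferred, ...others];
}

// Illustrative capability list, not taken from any specific browser.
const available: CodecLike[] = [
  { mimeType: "video/VP8" },
  { mimeType: "video/AV1" },
  { mimeType: "video/H264" },
];

const ordered = preferCodec(available, "video/AV1").map((c) => c.mimeType);
console.log(ordered.join(", ")); // video/AV1, video/VP8, video/H264
```

If AV1 isn't in the list to begin with, the order simply stays as it was, which is the behavior you want on devices that can't handle it.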
#4 – WebTransport as a real alternative
WebTransport started life somewhere in 2020. We’re now at the beginning of 2024.
It still isn’t available in all browsers – Safari is still missing support for it. It is available elsewhere, but far from being commonly used or in the mainstream’s mindset.
This year we’ve seen a few more experiments and proofs of concept with WebTransport that incorporate low latency media delivery. Mostly in the domain of streaming. There are reasons for that. I’ve written about that when discussing WHIP and WHEP.
Here’s what I think is going to happen: in 2024, we will see the first production ready low latency streaming solution that makes use of WebTransport instead of WebRTC or other technologies. This will be for one-way large scale broadcast use cases, where 1-2 seconds of latency are fine.
There will be those that will use WebTransport for bidirectional media delivery, similar to what Zoom is doing in web browsers, though that will stay the exception to the rule and more of an experimentation.
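With WebTransport missing from some browsers, anyone trying it out will need feature detection and a fallback. A tiny sketch of that decision follows. `pickStreamingTransport` is a made-up helper and the only thing it relies on is that supporting browsers expose a `WebTransport` constructor on the global scope.

```typescript
type MediaTransport = "webtransport" | "webrtc";

// Hypothetical helper: picks a delivery path based on what the runtime exposes.
// The scope parameter exists so the decision can be exercised outside a browser.
function pickStreamingTransport(scope: object = globalThis): MediaTransport {
  return "WebTransport" in scope ? "webtransport" : "webrtc";
}

// Simulated environments, for illustration only.
const withSupport = { WebTransport: class {} };
const withoutSupport = {};

console.log(pickStreamingTransport(withSupport)); // webtransport
console.log(pickStreamingTransport(withoutSupport)); // webrtc
```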
#5 – M&As and shutdowns
This was easy in 2023 and will remain easy in 2024.
The recession is here. It is likely to stay throughout 2024, with no real end in sight. At least not yet.
More vendors relying on WebRTC will shut down. Small startups will run out of steam. Large vendors may decide to exit this market and focus on other avenues where they conduct business.
Shutting down may mean getting acqui-hired, or acquired for peanuts. It might also mean selling chunks of the business to another company.
Vendors who stick to this market are likely to slow down their efforts throughout the year in an attempt to survive and weather this ongoing storm.
2024, here we come
Lots to do in 2024, but with limited resources:
All that while trying to satiate users and customers with new features and releases.
The post My WebRTC predictions for 2024 appeared first on BlogGeek.me.