Potential Key Frame Corruption in SimulcastConsumer Due to Packet Reordering

ivgotcrazy · June 20, 2025, 10:57am

Description:

There appears to be a potential race condition in the SimulcastConsumer’s spatial layer switching logic that could lead to the corruption of a key frame if its constituent RTP packets arrive out of order. This can result in a failed NACK recovery, forcing a more disruptive PLI/FIR recovery cycle.

Problem Analysis:

The core of the issue lies in how SimulcastConsumer handles the transition to a new spatial layer. The logic seems to make a simplifying assumption: that the first packet of a key frame from the new target spatial layer to arrive is also the one with the lowest sequence number.

Consider this specific scenario:

The SimulcastConsumer is currently forwarding spatialLayer = 0. The application requests a switch to a higher layer, so targetSpatialLayer becomes 1.
The Producer sends a key frame for spatialLayer = 1. This key frame is large and is fragmented into multiple RTP packets (e.g., with sequence numbers 1000, 1001, 1002).
Due to network conditions, the packets arrive at the SimulcastConsumer in a reordered sequence: 1001, 1000, 1002.

SimulcastConsumer Processes This:

Packet 1001 Arrives First: It’s a key frame on the targetSpatialLayer. Packet 1001 is processed and forwarded.
Packet 1000 Arrives Second: It is identified as an “old” packet from before the spatial layer switch and is dropped by the following check:

if (!shouldSwitchCurrentSpatialLayer && this->checkingForOldPacketsInSpatialLayer)
{
	// If this is a packet previous to the spatial layer switch, ignore the packet.
	if (SeqManager<uint16_t>::IsSeqLowerThan(
	      packet->GetSequenceNumber(), this->snReferenceSpatialLayer))
	{
#ifdef MS_RTC_LOGGER_RTP
		packet->logger.Dropped(
		  RtcLogger::RtpPacket::DropReason::PACKET_PREVIOUS_TO_SPATIAL_LAYER_SWITCH);
#endif

		this->rtpSeqManager->Drop(packet->GetSequenceNumber());

		return;
	}
	else if (SeqManager<uint16_t>::IsSeqHigherThan(
	           packet->GetSequenceNumber(), this->snReferenceSpatialLayer + MaxSequenceNumberGap))
	{
		this->checkingForOldPacketsInSpatialLayer = false;
	}
}
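
To make the drop decision concrete, here is a minimal sketch of a wraparound-aware 16-bit sequence comparison. `isSeqLowerThan` and `wouldDropAsOld` are hypothetical helpers written for this post, not mediasoup's `SeqManager`, and the sketch assumes that `snReferenceSpatialLayer` holds the sequence number of the first packet forwarded after the switch (1001 in the scenario above).

```cpp
#include <cstdint>

// Hypothetical helper (not mediasoup's SeqManager): returns true if `lhs`
// comes before `rhs` on the 16-bit sequence number circle.
constexpr bool isSeqLowerThan(uint16_t lhs, uint16_t rhs)
{
	// Forward distance from lhs to rhs, modulo 2^16.
	const uint16_t distance = static_cast<uint16_t>(rhs - lhs);

	return distance != 0 && distance < 0x8000;
}

// Hypothetical mirror of the check quoted above: a packet that is older
// than the reference sequence number is treated as previous to the switch.
constexpr bool wouldDropAsOld(uint16_t packetSeq, uint16_t snReference)
{
	return isSeqLowerThan(packetSeq, snReference);
}
```

Under that assumption, `wouldDropAsOld(1000, 1001)` is true while `wouldDropAsOld(1002, 1001)` is false, which matches the behavior described for packets 1000 and 1002.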

Consequences:

The SimulcastConsumer forwards an incomplete key frame to the receiver.
The receiver’s decoder fails, leading to visual artifacts (freezing, green screen, etc.).
The receiver’s Jitter Buffer detects the missing packet and sends a NACK request.
The NACK request fails because the dropped packet was never added to the RtpStreamSend history buffer in mediasoup.
After the NACK mechanism times out, the only remaining recovery option for the client is to send a PLI or FIR, forcing a full key frame request from the producer. This introduces significant delay and a noticeable disruption for the end-user.

My analysis is based solely on code inspection. It is entirely possible that I have overlooked another mechanism or a different part of the codebase that already mitigates this issue. If that is the case, I would be very grateful for a pointer to the relevant code or an explanation of the intended behavior.

Thank you for your consideration.

ibc · June 20, 2025, 3:39pm

AFAIK only the first RTP packet of the key frame will be detected as a key frame, and not the subsequent ones, so the issue you mention should never take place. However it’s an interesting topic and now that we are implementing Dependency Descriptor I don’t know if all RTP packets containing chunks of the same key frame may have the “key frame” flag set to 1 in the DD extension.

Could you please open a ticket in mediasoup GitHub and also mention the Dependency Descriptor thing I say above? I am out for some days and I don’t want to forget about this.

ibc · June 20, 2025, 5:29pm

More info:

Assuming that a key frame in AV1 of 3600 bytes takes 3 RTP packets to carry it:

Only one of the RTP packets — typically the first one of the frame — will carry the full Dependency Descriptor with is_key_frame = 1.

According to the AV1 RTP Payload spec and the Dependency Descriptor draft:

The Dependency Descriptor describes a full frame, not just a fragment.
Usually, only the first RTP packet of the frame (the one marked as the start of the frame, with Z=1) includes:
- The full frame dependency structure
- Fields like is_key_frame, frame_number, frame_dependencies, etc.
Subsequent RTP packets of the same frame:
- May omit the DD entirely
- Or may carry a reduced DD, which often doesn’t include the is_key_frame field again.

The is_key_frame (or is_switch_frame) field won’t be set to 1 in all RTP packets of the frame. It will only appear in the packet that includes the full Dependency Descriptor — typically the first.

jmillan · June 20, 2025, 5:51pm

As Iñaki said, the upcoming new version will contain Dependency Descriptor for H264 and it will behave as such: just the first packet of the key frame contains all the dependency structure and will be considered by mediasoup as a key frame.

ibc · June 20, 2025, 6:44pm

So then we are good and there is no issue here.

BronzedBroth · June 20, 2025, 8:14pm

A single keyframe? What if that packet is unusable due to packet loss?

ibc · June 20, 2025, 8:38pm

Then the Producer side of mediasoup will request retransmission of that packet using NACK as usual.

ibc · June 20, 2025, 8:46pm

@jmillan imagine this scenario.

1. VP8 keyframe takes two RTP packets with seq numbers 1001 and 1002.
2. Producer device sends both but 1002 arrives at mediasoup first.
3. mediasoup sends NACK requesting retransmission of 1001 but anyway it will arrive later.
4. Packet 1002 is not a keyframe so the SimulcastConsumer that was waiting for a keyframe of that ssrc will ignore it and WON'T store it in the retransmission buffer of the consumer.
5. Later packet 1001 arrives and the SimulcastConsumer allows it to pass and it switches to that ssrc stream, so the consumer device receives that packet 1001 which is the first half of the keyframe.
6. Later the producer sends packet 1003 and the consumer device receives it.
7. Then the consumer device sends NACK requesting packet 1002.
8. But 1002 is not in the retransmission buffer of the SimulcastConsumer so mediasoup will not retransmit it.
9. Result is that 1002 will never be sent to the consumer device and hence, after some terrible seconds the consumer device will send PLI requesting a new keyframe.

Is this correct? Or is step 8 wrong? If this assumption is correct there is room for improvement.
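
The scenario can be replayed with a toy model. This is only a sketch under simplifying assumptions: `ToyPacket`, `ToyConsumer` and `runReorderedScenario` are hypothetical names invented here, and the model only captures the "ignore everything until a key frame arrives" rule, not the real SimulcastConsumer logic.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical packet model: isKeyFrame is true only for the first
// packet of a key frame.
struct ToyPacket
{
	uint16_t seq;
	bool isKeyFrame;
};

// Hypothetical consumer that waits for a key frame before switching.
class ToyConsumer
{
public:
	void onPacket(const ToyPacket& packet)
	{
		if (this->waitingForKeyFrame && !packet.isKeyFrame)
		{
			// Ignored: neither forwarded nor stored for retransmission.
			return;
		}

		this->waitingForKeyFrame = false;
		this->history.insert(packet.seq);
		this->forwarded.push_back(packet.seq);
	}

	// Models whether a NACK for `seq` could be answered.
	bool canRetransmit(uint16_t seq) const
	{
		return this->history.count(seq) != 0;
	}

	std::vector<uint16_t> forwarded;

private:
	bool waitingForKeyFrame{ true };
	std::set<uint16_t> history;
};

// Replays the arrival order 1002, 1001 (key frame start), 1003.
inline ToyConsumer runReorderedScenario()
{
	ToyConsumer consumer;

	consumer.onPacket({ 1002, false });
	consumer.onPacket({ 1001, true });
	consumer.onPacket({ 1003, false });

	return consumer;
}
```

In this model the consumer device receives 1001 and 1003, and `canRetransmit(1002)` is false, so the later NACK for 1002 cannot be answered.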

ivgotcrazy · June 21, 2025, 12:40am

Hi @jmillan,

I’m still concerned about the following scenario:

A consumer is waiting to switch up to a new spatial layer and needs a key frame from that layer.
A key frame is sent, split into multiple packets (100,101,102).
Due to network reordering, the second packet of the key frame (101) arrives before the first packet (100).
Based on the current logic, my understanding is that only the first packet (100) is identified as a key frame. When 101 arrives first, the consumer will see it’s for the correct target layer but not a key frame, and will therefore drop it.

Later, when 100 arrives, the layer switch will proceed, but 101 has already been permanently lost. This results in a corrupted key frame being sent to the receiver, and since the packet was dropped early, it cannot be recovered via NACK.

Is my analysis of this situation correct?

Thanks again for your time.

jmillan · June 23, 2025, 7:34am

There is an assumption here that if a key frame is sent within 3 RTP packets, all three packets are considered key frames. It should not be that way.

Can you please provide a .pcap with such packets?

Also, as said, in the next release, with Dependency Descriptor (which will be applied to H264) we’re sure that only the first packet in the key frame will be considered a key frame.

For VP8 we’ll need such a .pcap file in order to confirm the issue and, if so, apply the corresponding fix. The S bit in the VP8 payload descriptor indicates the beginning of a partition and is set only on the first packet, so the problem should not exist in the first place.
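
As an illustration of why only the first packet of a VP8 key frame is detected as such, here is a sketch that follows the payload descriptor layout in RFC 7741. `isVp8KeyFrameStart` is a hypothetical function written for this thread, not mediasoup's VP8 parser, and it assumes the pointer refers to the RTP payload (after the RTP header and any extensions).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch based on RFC 7741: returns true only for a packet
// that starts the first partition of a key frame.
inline bool isVp8KeyFrameStart(const uint8_t* payload, size_t len)
{
	if (payload == nullptr || len < 1)
	{
		return false;
	}

	size_t offset = 0;
	const uint8_t first = payload[offset++];
	const bool hasExtension = (first & 0x80) != 0;     // X bit
	const bool startOfPartition = (first & 0x10) != 0; // S bit
	const uint8_t partitionId = first & 0x07;          // PID

	if (hasExtension)
	{
		if (offset >= len)
		{
			return false;
		}

		const uint8_t ext = payload[offset++];

		if (ext & 0x80) // I bit: PictureID present.
		{
			if (offset >= len)
			{
				return false;
			}

			// M bit set means a two byte PictureID.
			offset += (payload[offset] & 0x80) ? 2 : 1;
		}
		if (ext & 0x40) // L bit: TL0PICIDX byte present.
		{
			offset += 1;
		}
		if (ext & 0x30) // T or K bit: TID/Y/KEYIDX byte present.
		{
			offset += 1;
		}
	}

	if (offset >= len)
	{
		return false;
	}

	// In the VP8 payload header the P bit (least significant bit of the
	// first byte) is 0 for key frames.
	return startOfPartition && partitionId == 0 && (payload[offset] & 0x01) == 0;
}
```

A continuation packet of the same key frame has S=0 in its descriptor, so this check returns false for it even though it carries key frame data.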

ibc · June 23, 2025, 8:18am

@jmillan, in comments above we are not saying that the 3 packets containing a key frame are detected as keyframe. Please re-read the scenario I described above which is the same one the author of this topic described later. There is indeed a potential issue there if 1002 arrives before 1001.

jmillan · June 23, 2025, 8:27am

I misunderstood your comments @ibc because this is a different problem than the one the user indicated (multiple RTP packets considered key frame).

Yes, in your specific scenario 1002 will be dropped, and not stored for retransmission and hence will never be sent. There is definitely room for enhancement here.

ivgotcrazy · June 24, 2025, 3:28am

Thank you for your reply. My expression was incorrect; I did not mean to say that every packet would be considered a key frame (only the first RTP packet of a key frame carries key frame information). Rather, I meant that out-of-order delivery may cause packets belonging to a key frame to be discarded prematurely, and they cannot be recovered through NACK from the client.

jmillan · June 24, 2025, 7:40am

I’ve created the corresponding issue on GitHub: “Potential unrecoverable RTP packet upon un-order reception” (Issue #1547, versatica/mediasoup).

ibc · June 26, 2025, 12:40pm

The issue is being fixed in this pull request by ibc: “`Consumer` classes: Add target layer retransmission buffer to avoid PLIs/FIRs when RTP packets containing a key frame arrive out of order” (Pull Request #1550, versatica/mediasoup).
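
For readers wondering what a "target layer retransmission buffer" could look like, here is a rough sketch of the general idea. It is not the code in the pull request: `ToyTargetLayerBuffer`, its methods and its eviction policy are all assumptions made for illustration. The idea is to remember target layer packets that arrive while the consumer is still waiting for a key frame, and then, once the key frame start arrives, hand over the ones that come after it so they can be made available for NACK.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical sketch; names and policy are not taken from mediasoup.
class ToyTargetLayerBuffer
{
public:
	explicit ToyTargetLayerBuffer(size_t capacity) : capacity(capacity)
	{
	}

	// Remember a target layer packet seen while waiting for a key frame.
	void store(uint16_t seq)
	{
		if (this->pending.size() >= this->capacity)
		{
			// Evict the packet that arrived first.
			this->pending.pop_front();
		}

		this->pending.push_back(seq);
	}

	// Called when the key frame start arrives: returns buffered packets
	// that follow it (wraparound aware), in sequence order, and empties
	// the buffer.
	std::vector<uint16_t> takePacketsAfter(uint16_t keyFrameSeq)
	{
		std::vector<uint16_t> result;

		for (const uint16_t seq : this->pending)
		{
			const uint16_t distance = static_cast<uint16_t>(seq - keyFrameSeq);

			if (distance != 0 && distance < 0x8000)
			{
				result.push_back(seq);
			}
		}

		this->pending.clear();

		std::sort(
		  result.begin(),
		  result.end(),
		  [keyFrameSeq](uint16_t a, uint16_t b)
		  {
			  return static_cast<uint16_t>(a - keyFrameSeq) <
			         static_cast<uint16_t>(b - keyFrameSeq);
		  });

		return result;
	}

private:
	size_t capacity;
	std::deque<uint16_t> pending;
};
```

In the 1002-before-1001 scenario, storing 1002 and then calling `takePacketsAfter(1001)` yields 1002, which a consumer could then add to its retransmission history or forward right after 1001.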
