I have a chat (using datachannels) and VoIP application, and I’m noticing memory usage from mediasoup increasing over time. I see this behaviour with the default allocator and also with jemalloc. Some of my applications are getting OOM-killed with 8 GB allocated.
Using the default allocator and heaptrack, I ran one server for 45 min, and then marked it offline so that all users would disconnect. This is the flamegraph view from heaptrack of the remaining allocations after all users had disconnected:
Using jemalloc’s profiling, I created a dump of the memory in use on a server that had been running for almost a day and was at 90% memory usage. I similarly marked the server offline so that all active users would disconnect and this is the flamegraph of the remaining allocations:
These show there is a lot of memory in use that was allocated from onAlloc inside of uv_read, but I’m not sure how to get more information.
Does anyone have any tips on how to further debug this?
I’m not familiar with how libuv is used in mediasoup. Is it used only for SCTP, or also for RTP? We do allow some large messages over datachannels, so I’m wondering if this could be the result of some very large buffers that don’t get cleaned up.
It seems to be the Rust version, and you must be creating a lot of Workers. Each Worker allocates only one buffer of this kind, about 4 MB in size. It can leak only if the Worker’s thread somehow exits incorrectly (without calling the destructors) or does not exit at all.
Well, the graph shows a call stack. Is the “176.4MB leaked in total” at the bottom related to this call stack? How many times did this call happen? If 1 Worker was created, then one worker thread was created, mediasoup_worker_run was called once in this thread, and the read buffer should have been allocated only once: 4 MB, not 176 MB.
The “176.4MB leaked in total” is the amount leaked at the current reading. 164.4MB of that was allocated in onAlloc. These flamegraphs don’t show the number of times a call happened, only the amount of memory allocated at the time of viewing. The width of each bar is the % of the total (176.4MB in that particular one).
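As a rough sanity check on those numbers, here is a small Python sketch of the arithmetic. It only uses the figures quoted in this thread; the 4 MB buffer size is the approximate value mentioned above, and treating all of the onAlloc memory as buffers of that size is an assumption, not something the flamegraph confirms.

```python
# Figures quoted in this thread (in MB).
TOTAL_LEAKED_MB = 176.4   # "leaked in total" shown at the bottom of the flamegraph
ON_ALLOC_MB = 164.4       # portion attributed to onAlloc
READ_BUFFER_MB = 4.0      # assumption: approximate size of one Worker read buffer

# Width of the onAlloc bar as a fraction of the whole flamegraph.
share = ON_ALLOC_MB / TOTAL_LEAKED_MB

# How many ~4 MB buffers the onAlloc memory would amount to,
# if it consisted only of buffers of that size.
buffers = ON_ALLOC_MB / READ_BUFFER_MB

print(f"onAlloc share of remaining allocations: {share:.1%}")
print(f"equivalent number of ~4 MB buffers: {buffers:.1f}")
```

So the onAlloc bar covers roughly 93% of what remains, which is far more than a single buffer of about 4 MB would explain.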
I’m quite confident there’s only one worker, because if we had multiple we’d start running into this bug again.