building an always-on (ruby) production profiler
For the past few years, I’ve been working at Datadog on a new open-source Ruby profiler. This profiler is shipped as part of the datadog Ruby gem.
Why spend all this time and effort on building a new profiler? The key detail is that we want (and need) something that is built to be always-on in production. That translates to having really low overhead in several dimensions, including low cpu usage, low memory usage and low impact on the application’s latency, while also being something that can run for a long time unattended without impacting the application.
Doing this involves a different set of trade-offs than most profilers. For instance, the datadog profiler is able to profile cpu, wall-time, allocations, heap, the Global VM Lock, and Garbage Collection all at the same time, while still staying within the same low overhead. It also has workarounds for a number of bugs in Ruby and 3rd-party gems, to make sure it never impacts any application it is added to.
I’ve had to spend quite a lot of time writing C and Rust to make the Datadog Ruby profiler happen. It even spawned a few off-shoots I’ve talked about in this blog in the past, such as the gvl-tracing gem to investigate activity around Ruby’s Global VM Lock and the backtracie gem to read stack traces with more detail and lower overhead.
Last year, at the RubyKaigi 2024 conference, I had the chance to talk about how Datadog’s Ruby profiler works. It took me almost a year to write this blog post, but here’s the video and slide deck for the talk.
Slide deck:
Ruby performance is seeing a big renaissance: YJIT is getting better and better, JRuby and TruffleRuby continue to provide really strong alternatives, there are a number of new profilers for Ruby beyond the Datadog profiler (Vernier, pf2), and there’s ongoing work to upgrade Ruby’s garbage collector.
And… I am totally here for it!