The common theme in many time-travel movies is to go back in time to find out what went wrong and fix it. Developers also have that desire to go back in time and find why the code broke and fix it. But, often, that crucial step where everything went wrong happened long ago, and the information is no longer available.
The rr project lets programmers examine the entire life of a C or C++ program run, and replay code execution to see what action in the past caused "things to go horribly wrong." rr
is packaged with Extra Packages for Enterprise Linux (EPEL) in Red Hat Enterprise Linux (RHEL), and with Fedora 31, 32, 33, and 34.
rr
records trace information about the execution of an application. This information allows you to repeatedly replay a particular recording of a failure and examine it in the GNU Debugger (GDB) to better investigate the cause. In addition to replaying the trace, rr
lets you run the program in reverse, in essence allowing you "rewind the tape" to see what happened earlier in the execution of the program.
The techniques that rr
provides for recording the reproducer for further examination can be a useful addition to traditional core dumps and backtraces, which give a snapshot of an issue at a particular moment. The rr
recording can provide a way for developers to further investigate intermittent problems where only some application runs fail.
Let's see how to set up rr
and use it in an example to better illustrate its utility.
Requirements and setup
Because rr
uses a number of newer Linux kernel features and specific processor performance monitoring hardware, the environments it runs in are limited. The newer Fedora kernels have the needed support. However, to correctly track and recreate when asynchronous events happen in a thread, rr
uses performance monitoring hardware event counters that provide deterministic counts of when a thread is interrupted by an asynchronous event. Currently, rr
supports these counts only for Intel processors using the Westmere or newer microarchitectures.
Installing rr
on Fedora is a simple task, requiring a single RPM. If you are on RHEL, you will need to enable EPEL. Once you have access to the appropriate repositories, you can install rr
with the following command:
$ sudo dnf -y install rr
Due to the possibility that hardware performance monitoring counters might leak privileged information, many kernels default to limiting their monitoring capabilities. You will need to run the following command to allow rr
access to the performance monitoring counters:
$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
If you want to make that setting for the performance monitoring hardware permanent, you can also add the following line to /etc/sysctl.conf
:
kernel.perf_event_paranoid = 1
A simple debugging example
The following is a toy program that computes 2 times 0, 1, 2, and 3 in an array and prints the information:
#include <stdio.h> #define SIZE 4 void zero(char *a, int size) { while (size>0) a[size--] = 0; } void initialize(char *a, int size) { zero(a, size); } void multiply(char *a, int size, int mult) { int i; for (i=0; i<size; i++) a[i] = i * mult; } void pr_array(char *a, int size) { int i; for (i=0; i<size; i++) printf("f(%d)=%d\n", i, a[i]); } int main(int argc, char **argv) { char a[SIZE]; int mult = 2; initialize(a, SIZE); multiply(a, SIZE, mult); pr_array(a, SIZE); return 0; }
Compile it with the usual command, as follows, and include the debug option -g
to allow later debugging:
$ gcc -g multiply.c -o multiply
When you run multiply
, the results may be very surprising. All the results are zero. The main function is passing in 2 to the multiply
function. The loop in that function is very straightforward. How could this possibly go wrong?
$ ./multiply f(0)=0 f(1)=0 f(2)=0 f(3)=0
Recording the issue
You can investigate what is going wrong with rr
. Record a run of the multiply
program that demonstrates the issue with the following command. If the problem is intermittent, you could make multiple runs until the issue is seen:
$ rr record ./multiply rr: Saving execution to trace directory `/home/wcohen/.local/share/rr/multiply-0`. f(0)=0 f(1)=0 f(2)=0 f(3)=0
Replaying and investigating the issue
To start debugging the issue, use the rr replay
command, which puts you into a GDB session:
$ rr replay
You are at the start of execution:
(rr) where #0 0x00007f7f56b1f110 in _start () from /lib64/ld-linux-x86-64.so.2 #1 0x0000000000000001 in ?? () #2 0x00007ffe0bea6889 in ?? () #3 0x0000000000000000 in ?? ()
You can continue the program from the start and see that it has the same results as before:
(rr) c Continuing. f(0)=0 f(1)=0 f(2)=0 f(3)=0 Program received signal SIGKILL, Killed. 0x0000000070000002 in ?? ()
You can set a breakpoint at return 0
in the main
function and work backward from there:
(rr) break multiply.c:37 Breakpoint 1 at 0x401258: file multiply.c, line 37. (rr) c Continuing. f(0)=0 f(1)=0 f(2)=0 f(3)=0 Breakpoint 1, main (argc=1, argv=0x7ffe0bea5c58) at multiply.c:37 37 return 0;
First, we check the values in the array a
. Maybe the pr_array
function printed wrong values. But that isn't the problem, because according to GDB, the values are all 0:
(rr) print a[0] $5 = 0 '\000' (rr) print a[1] $6 = 0 '\000' (rr) print a[2] $7 = 0 '\000' (rr) print a[3] $8 = 0 '\000'
Maybe pr_array
corrupted the values. Let's set a breakpoint on the entry to pr_array
and go backward in execution with a reverse-continue
command, to see what the state was before the pr_array
function executed. Looks like pr_array
is printing the correct values:
(rr) break pr_array Breakpoint 2 at 0x4011cc: file multiply.c, line 25. (rr) reverse-continue Continuing. Breakpoint 2, pr_array (a=0x7ffe0bea5b58 "", size=4) at multiply.c:25 25 for (i=0; i<size; i++) (rr) print a[0] $9 = 0 '\000' (rr) print a[1] $10 = 0 '\000' (rr) print a[2] $11 = 0 '\000' (rr) print a[3] $12 = 0 '\000'
Maybe something is going wrong in the multiply
function. Let's set a breakpoint on it and reverse-continue
to it:
rr) break multiply Breakpoint 3 at 0x401184: file multiply.c, line 18. (rr) reverse-continue Continuing. Breakpoint 3, multiply (a=0x7ffe0bea5b58 "", size=4, mult=0) at multiply.c:18 18 for (i=0; i0) (rr) c Continuing.
Wait!
What happened to the mult
argument? It should be 2. Zero times anything is zero. That explains the results. However, how did the main
function's local variable mult
end up being zero? It is initialized in main
and only passed to the compute function. Let's set a hardware watchpoint on mult
and continue the reverse execution of the program:
(rr) up #1 0x0000000000401247 in main (argc=1, argv=0x7ffe0bea5c58) at multiply.c:35 35 multiply(a, SIZE, mult); (rr) watch -l mult Hardware watchpoint 4: -location mult (rr) reverse-continue Continuing. Hardware watchpoint 4: -location mult Old value = 0 New value = 2 zero (a=0x7ffe0bea5b58 "", size=3) at multiply.c:7 7 a[size--] = 0;
Ah, now it's becoming clear how things went wrong. The zero
function wrote past the end of the a
array and overwrote the mult
variable even though it wasn't passed in. The size--
statement should be --size
. We can see the call to zero
hidden in the initialize
function:
(rr) where #0 zero (a=0x7ffe0bea5b58 "", size=3) at multiply.c:7 #1 0x0000000000401173 in initialize (a=0x7ffe0bea5b58 "", size=4) at multiply.c:12 #2 0x0000000000401233 in main (argc=1, argv=0x7ffe0bea5c58) at multiply.c:34
If we want, we can use regular GDB continues and play it forward and go through those breakpoints we set earlier again:
(rr) c Continuing. Hardware watchpoint 4: -location mult Old value = 2 New value = 0 zero (a=0x7ffe0bea5b58 "", size=3) at multiply.c:6 6 while (size>0) (rr) c Continuing. Breakpoint 3, multiply (a=0x7ffe0bea5b58 "", size=4, mult=0) at multiply.c:18 18 for (i=0; i<size; i++) (rr) c Continuing. Breakpoint 2, pr_array (a=0x7ffe0bea5b58 "", size=4) at multiply.c:25 25 for (i=0; i<size; i++)
Fixing the issue
Now that we've identified the problem, we can fix the zero
function in the "new and improved" multiply2.c
program, and things work as expected:
$ gcc -g multiply2.c -o multiply2 $ ./multiply2 f(0)=0 f(1)=2 f(2)=4 f(3)=6
Limitations of rr
Although rr
is a useful addition to the tools that help developers find issues in programs, it does have limitations:
- All the threads that
rr
records run on a single core, so multithreaded applications will be slower. rr
syscall monitoring is not complete, so some syscalls might slip through the cracks and not be recorded in the trace.rr
usesptrace
to monitor apps and will not work well with apps that also useptrace
.rr
does not monitor processes outside the children of what it is recording, and misses any communication through shared memory to an outside process.rr
operates only on very specific processor microarchitectures.
Conclusion
The ability of rr
to go backward in the execution of a program to investigate the earlier events that lead to a problem is a useful addition to the developer's set of tools. The example in this article illustrates how you might use rr
to track down problems. The example was just a toy, but rr
has been used to track down problems in much much more substantial applications such as JavaScriptCore, the Apache httpd server, and Ruby on Rails.
See the following resources for additional information:
Last updated: October 14, 2022