Software developers commonly do regression testing, comparing test results before and after a change. This is a continuous integration strategy that I employ when doing releases of GDB for Red Hat Enterprise Linux (RHEL).
What happens, though, when the results of testing are non-deterministic due to racy (thread) tests or "slow" or high-load machines that cause tests to time out? This often requires human intervention and analysis of the test results.
Preparing a release of GDB typically involves identifying patches to backport to fix specific code deficiencies identified in bugs reported by users. Once these backports are completed, we effectively have a "patch" (a quasi-release candidate, if you will), that we can test to verify that no regressions will be introduced into the release.
For the GDB team, this "release candidate" is tested using an internal command-line tool called gdb-beaker, which, as the name suggests, uses an internal Beaker instance to reserve computers of every supported RHEL architecture and run regression tests. The beaker task associated with this tool will install a fresh copy of the operating system, check out the desired Git repository, install build dependencies, and build and test GDB before and after applying the patch to test.
A Bunsen database is then created, and all test results are imported. This is where the fun begins.
What is Bunsen and how does it help?
From the project's homepage: "Bunsen is a test result storage and analysis toolkit that collects test result and build log files in a variety of formats (e.g., DejaGnu, Autoconf config.log, glibc), stores them in a ludicrously compact de-duplicated Git repo, parses and indexes the contents, and provides a toolkit for analyzing and browsing the indexed test results."
Before Bunsen, the process of analyzing test results was very manual: inspect diffs of unpatched versus patched results and determine if a regression has occurred. We run regression testing against several DejaGnu "target boards," meaning that we actually run testing at least three times for every supported architecture, and more on x86_64 because we support both 64- and 32-bit executables on that architecture. In total, a single release requires the analysis of 15 test runs. Factor in GCC Toolset releases on both RHEL 8 and RHEL 9, and this amounts to a whopping 60 test runs to analyze!
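The arithmetic behind those numbers can be spelled out. The following is only a back-of-the-envelope sketch: the architecture list is taken from the tgt function shown later in this article, the count of four release streams is my reading of "RHEL 8, RHEL 9, and GCC Toolset on each," and the variable names are purely illustrative.

```shell
# Back-of-the-envelope count of test runs to analyze (illustrative names).
boards=3        # unix, native-gdbserver, native-extended-gdbserver
arches=4        # aarch64, ppc64le, s390x, x86_64
extra_32bit=3   # the same three boards again with -m32, x86_64 only
streams=4       # assumed: RHEL 8, RHEL 9, and GCC Toolset on each

runs_per_release=$(( boards * arches + extra_32bit ))
total_runs=$(( runs_per_release * streams ))

echo "runs per release: $runs_per_release"
echo "total runs:       $total_runs"
```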
Staring at this many test diffs is exceptionally error-prone, but with Bunsen, there is a more accurate and much faster way to analyze this many results. The rest of this article will explain the approach I've implemented internally at Red Hat to shave many person-days off this process.
Overview
Bunsen stores test data in a Git repository, tagging each data set with a user-generated name. Therefore, the first order of business is to name our test result runs uniformly.
There are a number of considerations for GDB. We have different target boards, different "bitness" (32- vs 64-bit), and "unpatched" versus "patched" test runs to name.
I have chosen to name all Bunsen imports with (un)?patched-TARGET_BOARD(.-mBITNESS)?, where TARGET_BOARD is the DejaGnu board name (see below), and BITNESS, if needed, is either "32" or "64". For example, the Git tag that contains the results for the patched, 64-bit native-gdbserver test run would be patched-native-gdbserver.-m64.
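To make the naming scheme concrete, here is a minimal sketch of a helper that splits such a tag back into its parts. The name parse_btag is hypothetical and not part of Bunsen or of my own scripts; it only assumes the pattern described above.

```shell
# Hypothetical helper: split a Bunsen import tag of the form
# (un)?patched-TARGET_BOARD(.-mBITNESS)? into "VERSION BOARD BITNESS".
# BITNESS is printed as "-" when the tag has no .-m32/.-m64 suffix.
parse_btag() {
    tag=$1
    case "$tag" in
        unpatched-*) ver="unpatched" ;;
        patched-*)   ver="patched" ;;
        *) >&2 echo "error: '$tag' does not start with (un)patched-"
           return 1 ;;
    esac
    # Strip the "VERSION-" prefix, then peel off an optional bitness suffix.
    rest=${tag#"$ver"-}
    case "$rest" in
        *.-m32) bits=32; board=${rest%.-m32} ;;
        *.-m64) bits=64; board=${rest%.-m64} ;;
        *)      bits="-"; board=$rest ;;
    esac
    if [ -z "$board" ]; then
        >&2 echo "error: '$tag' has no target board"
        return 1
    fi
    echo "$ver $board $bits"
}
```

For example, parse_btag patched-native-gdbserver.-m64 prints the version, board, and bitness as three space-separated fields.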
GDB's test suite is written in DejaGnu, and the GDB sources define a number of "target boards" meant to facilitate testing various hardware/software configurations. Table 1 summarizes the configurations that are tested for RHEL GDB releases. I refer to these throughout this article via a "target abbreviation," which I will use later to save my fingers from having to repeatedly type so much.
| Abbreviation | DejaGnu --target_board/bitness | Supported architectures |
|---|---|---|
| U | unix/-m64 | All |
| G | native-gdbserver/-m64 | All |
| E | native-extended-gdbserver/-m64 | All |
| u | unix/-m32 | x86_64 only |
| g | native-gdbserver/-m32 | x86_64 only |
| e | native-extended-gdbserver/-m32 | x86_64 only |
Table 1: GDB DejaGnu target boards and target abbreviations.
When the Bunsen database is populated by automated testing, it will create imports for each of these target boards both before ("unpatched") and after ("patched") patching the sources with the proposed changes for the release.
A note on implementation
For all of the following implementations, it is assumed that the environment has two variables defined that point at the Bunsen instance's Git repository (BUNSEN_REPO, e.g., $HOME/bunsendb) and the Bunsen database (BUNSEN_DB, e.g., $BUNSEN_REPO/bunsen.sqlite3).
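As a small illustration, the sketch below sets both variables to the example paths mentioned above and checks that they are non-empty. The function name bunsen_env_ok is hypothetical; it is a convenience I am assuming here, not something Bunsen provides.

```shell
# Hypothetical sanity check: make sure both variables are set before
# calling any of the functions in this article.
bunsen_env_ok() {
    status=0
    if [ -z "${BUNSEN_REPO:-}" ]; then
        >&2 echo "error: BUNSEN_REPO is not set"
        status=1
    fi
    if [ -z "${BUNSEN_DB:-}" ]; then
        >&2 echo "error: BUNSEN_DB is not set"
        status=1
    fi
    return $status
}

# Example settings, using the example paths from the text.
export BUNSEN_REPO="$HOME/bunsendb"
export BUNSEN_DB="$BUNSEN_REPO/bunsen.sqlite3"
bunsen_env_ok && echo "Bunsen environment looks sane"
```

The check only verifies that the variables are set; it does not confirm that the repository or database actually exists.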
Note: The Bunsen Git repository is the Git repository created by importing results into Bunsen. It is not the actual source repository of the Bunsen project.
The scripts
The following scripts/shell functions are almost verbatim copies of what I use to analyze test results for RHEL GDB. All of the examples given in the rest of this article have been run on Red Hat Enterprise Linux 9 on x86_64, and while they work for me, they are neither "perfect" nor guaranteed to work for you. While all of the scripts in this article are actually shell functions, I will use the terms "script(s)" and "function(s)" interchangeably.
For discussion purposes, I have intentionally introduced regressions into two test files, gdb.base/default.exp and gdb.cp/var-tag.exp, and I have limited the output of the scripts to these results. In later examples, we will discover what these regressions are using the methods described herein.
tgt: Target abbreviation expansion
This simple function eases the pain of typing native-extended-gdbserver/-m64 over and over. While the Bunsen database stores the "long form" of these board configurations, I don't like to type them all the time. The following shell function allows me to simply type tgt U. You can pass the output of this function directly to DejaGnu via the RUNTESTFLAGS variable, as indicated in the function's comments.
# Function to convert one-letter target abbreviation to full target boards
# strings.
#
# This function returns strings that can be passed to
# RUNTESTFLAGS="--target_board $(tgt LTR)". LTR is either 'u', 'g', 'e',
# or any of those capitalized. When testing 64- and 32-bit variations,
# lower case will represent 32-bit boards and capitals will represent
# 64-bit boards. If we only support one bitness, then both lower and
# capital versions will be the same.
#
# For example, `u' on x86_64 will return "unix/-m32", but on aarch64, 'u'
# and `U' will just be "unix".
function tgt() {
if [ $# -ne 1 ]; then
>&2 echo "error: tgt requires an argument"
>&2 echo "tgt: usage: tgt [UuGgEe]"
return 1
fi
case $(uname -m) in
aarch64)
case "$1" in
U|u) echo "unix";;
G|g) echo "native-gdbserver";;
E|e) echo "native-extended-gdbserver" ;;
*)
>&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
return 1
;;
esac
;;
ppc64le)
# For historical reasons (to follow gdb.spec), we explicitly
# use "-m64".
case "$1" in
U|u) echo "unix/-m64" ;;
G|g) echo "native-gdbserver/-m64" ;;
E|e) echo "native-extended-gdbserver/-m64" ;;
*)
>&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
return 1
;;
esac
;;
s390x)
# For historical reasons (to follow gdb.spec), we explicitly
# use "-m64".
case "$1" in
U|u) echo "unix/-m64" ;;
G|g) echo "native-gdbserver/-m64" ;;
E|e) echo "native-extended-gdbserver/-m64" ;;
*)
>&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
return 1
;;
esac
;;
x86_64)
case "$1" in
U) echo "unix/-m64";;
u) echo "unix/-m32";;
G) echo "native-gdbserver/-m64";;
g) echo "native-gdbserver/-m32";;
E) echo "native-extended-gdbserver/-m64";;
e) echo "native-extended-gdbserver/-m32";;
*)
>&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
return 1
;;
esac
;;
*)
>&2 echo "unknown architecture: $(uname -m)"
return 1
;;
esac
return 0
}
The following examples demonstrate how you can use this function, including how to use the tgt function to tell DejaGnu to use the specific target board for testing.
$ tgt g
native-gdbserver/-m32
$ make check RUNTESTFLAGS="--target_board $(tgt U)"
[snip]
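Because tgt consults uname -m, its output can only be checked for the machine it runs on. As a sketch, the variant below takes the architecture as an explicit first argument so that every mapping can be exercised from one host. The name tgt_for_arch is hypothetical; the mappings simply mirror those in tgt above.

```shell
# Hypothetical variant of tgt that takes the architecture as its first
# argument instead of calling uname -m. Mappings mirror tgt above.
tgt_for_arch() {
    if [ $# -ne 2 ]; then
        >&2 echo "usage: tgt_for_arch ARCH [UuGgEe]"
        return 1
    fi
    # Map the letter (either case) to the DejaGnu board name.
    case "$2" in
        U|u) board="unix" ;;
        G|g) board="native-gdbserver" ;;
        E|e) board="native-extended-gdbserver" ;;
        *) >&2 echo "unknown letter '$2': must be U, u, G, g, E, or e"
           return 1 ;;
    esac
    # Append the bitness flag where the architecture calls for one.
    case "$1" in
        aarch64) echo "$board" ;;
        ppc64le|s390x) echo "$board/-m64" ;;
        x86_64)
            case "$2" in
                [UGE]) echo "$board/-m64" ;;
                *)     echo "$board/-m32" ;;
            esac ;;
        *) >&2 echo "unknown architecture: $1"
           return 1 ;;
    esac
}
```

With this variant, tgt_for_arch "$(uname -m)" g should behave like tgt g on the architectures listed above.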
btgt: Return a Git tag snippet for a target abbreviation
Because Bunsen uses Git for underlying storage, we are necessarily limited to naming our imports with valid Git tag characters. This means translating any invalid characters (here, the "/" in the board string) to ".". This is what btgt does.
# Similar to tgt above, but outputs something suitable for use in
# bunsen git tags, e.g., "unix.-m32".
function btgt()
{
if [ $# -ne 1 ]; then
>&2 echo "error: btgt requires an argument"
>&2 echo "btgt: usage: btgt [UuGgEe]"
return 1
fi
# Propagate a failure from tgt instead of printing an empty string.
r=$(tgt "$1") || return 1
echo "$r" | sed 's@/@.@g'
return 0
}
The following examples show how btgt converts the target board abbreviations g (native-gdbserver/-m32) and U (unix/-m64) into something suitable for use with Git. We will use this function's output later to construct the full Git tag used when importing results into our Bunsen instance.
$ btgt g
native-gdbserver.-m32
$ btgt U
unix.-m64
btag: Create the Git tag used by Bunsen imports
Using the above two scripts, we can define a simple shell function to output the complete name of any test result run (a.k.a. Bunsen import) with minimal typing. [See a theme? :-)]
Note that, as with all of the functions defined in this article, any argument that designates a "patched" test run versus an "unpatched" one can be abbreviated to a prefix of those words; i.e., u will expand to unpatched and pa will expand to patched.
# Output a suitable tag for the test run's bunsen import.
# These look like "(un)?patched-TARGET_BOARD.-mBITNESS"
function btag()
{
if [ $# -ne 2 ]; then
>&2 echo "error: btag requires two arguments"
>&2 echo "btag: usage: btag [UuGgEe] (un)?patched"
return 1
fi
target=$1
case "$2" in
u*) ver="unpatched" ;;
p*) ver="patched" ;;
*)
>&2 echo "error: unknown version, '$2'. Must be 'unpatched' or 'patched'"
return 1
;;
esac
echo "$ver-$(btgt $target)"
return 0
}
The following output lists the Git tags that we will use later to import the results for the target board configurations native-gdbserver/-m32 and unix/-m64:
$ btag g u
unpatched-native-gdbserver.-m32
$ btag U p
patched-unix.-m64
btags: List all imported Bunsen test runs
Bunsen creates a Git tag for each import. Therefore, a simple shell alias suffices to query the Bunsen Git repository for any stored results. This is usually the first thing I do to verify that testing completed successfully.
alias btags="git --git-dir $BUNSEN_REPO/.git tag -l"
In the following example, the output of btags shows that our Bunsen instance is loaded with patched and unpatched results for all the target boards and configurations tested on x86_64 RHEL.
$ btags
patched-native-extended-gdbserver.-m32
patched-native-extended-gdbserver.-m64
patched-native-gdbserver.-m32
patched-native-gdbserver.-m64
patched-unix.-m32
patched-unix.-m64
unpatched-native-extended-gdbserver.-m32
unpatched-native-extended-gdbserver.-m64
unpatched-native-gdbserver.-m32
unpatched-native-gdbserver.-m64
unpatched-unix.-m32
unpatched-unix.-m64
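Checking twelve tags by eye is easy to get wrong, so this verification step could also be scripted. The following is a minimal sketch under the assumption that the x86_64 tag set listed above is the complete expected set; the function name missing_tags is hypothetical.

```shell
# Hypothetical check: read tag names on standard input (for example, the
# output of btags) and report any of the 12 tags expected on x86_64 that
# are missing. Returns non-zero if at least one tag is missing.
missing_tags() {
    have=$(cat)
    status=0
    for ver in unpatched patched; do
        for board in unix native-gdbserver native-extended-gdbserver; do
            for bits in 32 64; do
                want="$ver-$board.-m$bits"
                # -x matches whole lines, -F treats the tag literally.
                if ! printf '%s\n' "$have" | grep -qxF -- "$want"; then
                    echo "missing: $want"
                    status=1
                fi
            done
        done
    done
    return $status
}
```

It is meant to be used as btags | missing_tags, which prints nothing and returns zero when every expected import is present.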
bsum: Summarize test results
Finally, we get to the interesting scripts!
bsum outputs the testing summary from a given test run that has been imported into Bunsen. Its output mimics DejaGnu's summary table. To output results that exactly imitate DejaGnu's output, use the --verbose option (passed to bsum as -v). This will explicitly list all non-passing tests.
The function also optionally takes the name of a test file (as a glob expression) to limit the results to specific test file(s), as shown in the example that follows:
# Summarize test results for a given bunsen commit.
function bsum () {
if [ $# -lt 2 ]; then
>&2 echo "error: bsum requires at least two arguments"
>&2 echo "bsum: usage: bsum [-v] TARGET_ABBREV (un)?patched [EXP_GLOB]"
return 1
fi
# Get verbose output option.
verbose=""
if [ "$1" = "-v" ]; then
verbose="-v"
shift
fi
# Get target abbreviation and version.
target=$1 # any valid target abbreviation (UuGgEe)
ver=$2 # "unpatched" or "patched"
# Get optional expfile glob.
glob=""
if [ $# -eq 3 ]; then
glob="--expfile-glob $3"
fi
# Run sum script.
tag=$(btag $target $ver)
r-dejagnu-summary --git ${BUNSEN_REPO} --db ${BUNSEN_DB} $verbose \
$glob $tag
return $?
}
The following examples demonstrate bsum. The first outputs the summary for the U target board (unix/-m64) in the "unpatched" test suite run. The second outputs the summary for just GDB's C++-specific tests (in the gdb.cp folder) for the same test run.
$ bsum U u
# of expected passes      111673
# of unexpected failures  10
# of expected failures    83
# of known failures       112
# of untested testcases   30
# of unsupported tests    559
$ bsum U u 'gdb.cp/*.exp'
# of expected passes      8585
# of expected failures    14
# of known failures       25
# of untested testcases   3
Note that the C++-specific summary did not output a field for "unexpected failures" because there were no such failures. This is consistent with how DejaGnu outputs results. (I often compare manual test run summaries with the data stored in Bunsen.)
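Because absent fields are simply not printed, scripted comparisons need to treat a missing line as zero. The sketch below shows one way to pull a single count out of a summary; sum_field is a hypothetical helper, and it assumes only that each summary line starts with "# of FIELD" and ends with the count, as in the output above.

```shell
# Hypothetical helper: read a DejaGnu-style summary on standard input and
# print the count for one field, or 0 when the field is absent.
sum_field() {
    awk -v field="# of $1" '
        # Match only lines that begin with the requested field name.
        index($0, field) == 1 { print $NF; found = 1; exit }
        END { if (!found) print 0 }
    '
}
```

For instance, bsum U u 'gdb.cp/*.exp' | sum_field "unexpected failures" would print 0 for the C++ summary shown above.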
breg: Compute regressions
The breg function will query Bunsen for any regressions discovered during testing. Its inputs include the target board abbreviation and an optional test file to which to limit results. Notice that unlike bsum, this function does not accept a glob expression.
# Convenience function to find (possible) regressions between two testruns.
# Optionally takes a test file name (expfile).
function breg () {
if [ $# -lt 1 ]; then
>&2 echo "error: breg requires at least one argument"
>&2 echo "breg: usage: breg TARGET_ABBREV [EXPFILE]"
return 1
fi
# Get target abbreviation.
expfile=""
target=$1 # any valid target abbreviation (UuGgEe)
# Get optional expfile.
if [ $# -eq 2 ]; then
expfile="--dgexpfile $2"
fi
# Run diff script. Add "--regressions"?
r-diff-testruns --git ${BUNSEN_REPO} --db ${BUNSEN_DB} \
$(btag $target u) $(btag $target p) $expfile
return $?
}
In the following example, breg lists the regressions discovered between the "unpatched" and "patched" test suite runs of the U target. For the sake of brevity, I have limited the output to the regressions I previously introduced to gdb.base/default.exp and gdb.cp/var-tag.exp:
$ breg U
2023-08-30 13:24:45,299:r-diff-testruns:INFO:opened git repo /root/bunsendb
unpatched-unix.-m64 patched-unix.-m64
dejagnu diffs
gdb.base/default.exp gdb.base/default.exp: call PASS gdb.log:109690 gdb.sum:9435 FAIL gdb.log:109893 gdb.sum:9435
gdb.base/default.exp gdb.base/default.exp: inspect PASS gdb.log:111372 gdb.sum:9529 FAIL gdb.log:111575 gdb.sum:9529
gdb.base/default.exp gdb.base/default.exp: print PASS gdb.log:111453 gdb.sum:9551 FAIL gdb.log:111656 gdb.sum:9551
gdb.base/default.exp gdb.base/default.exp: print "p" abbreviation PASS gdb.log:111450 gdb.sum:9552 FAIL gdb.log:111653 gdb.sum:9552
gdb.base/default.exp gdb.base/default.exp: ptype PASS gdb.log:111459 gdb.sum:9554 FAIL gdb.log:111662 gdb.sum:9554
gdb.base/default.exp gdb.base/default.exp: whatis PASS gdb.log:112837 gdb.sum:9675 FAIL gdb.log:113040 gdb.sum:9675
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c++: ptype C PASS gdb.log:424504 gdb.sum:62659 FAIL gdb.log:425341 gdb.sum:62685
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c: ptype C PASS gdb.log:424626 gdb.sum:62689 FAIL gdb.log:425463 gdb.sum:62715
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c++: ptype C PASS gdb.log:425013 gdb.sum:62723 FAIL gdb.log:425850 gdb.sum:62750
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c: ptype C PASS gdb.log:425135 gdb.sum:62753 FAIL gdb.log:425972 gdb.sum:62780
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c++: ptype C PASS gdb.log:424761 gdb.sum:62783 FAIL gdb.log:425598 gdb.sum:62810
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c: ptype C PASS gdb.log:424883 gdb.sum:62813 FAIL gdb.log:425720 gdb.sum:62840
The following example limits the list to those in the gdb.cp/var-tag.exp test file:
$ breg U gdb.cp/var-tag.exp
2023-08-30 13:26:30,565:r-diff-testruns:INFO:opened git repo /root/bunsendb
unpatched-unix.-m64 patched-unix.-m64
dejagnu diffs
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c++: ptype C PASS gdb.log:424504 gdb.sum:62659 FAIL gdb.log:425341 gdb.sum:62685
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c: ptype C PASS gdb.log:424626 gdb.sum:62689 FAIL gdb.log:425463 gdb.sum:62715
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c++: ptype C PASS gdb.log:425013 gdb.sum:62723 FAIL gdb.log:425850 gdb.sum:62750
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c: ptype C PASS gdb.log:425135 gdb.sum:62753 FAIL gdb.log:425972 gdb.sum:62780
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c++: ptype C PASS gdb.log:424761 gdb.sum:62783 FAIL gdb.log:425598 gdb.sum:62810
gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c: ptype C PASS gdb.log:424883 gdb.sum:62813 FAIL gdb.log:425720 gdb.sum:62840
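Since breg handles one target board at a time, covering every configuration means repeating the command for each abbreviation. A small loop can do that. The wrapper below is a sketch only: for_each_target is a hypothetical name, and the list of abbreviations assumes x86_64, where all six configurations are distinct.

```shell
# Hypothetical wrapper: run a command once per target abbreviation,
# appending the abbreviation as the final argument. The list assumes
# x86_64, where all six configurations are distinct.
for_each_target() {
    status=0
    for t in U G E u g e; do
        echo "=== target $t ==="
        # Remember any failure, but keep going through the other targets.
        "$@" "$t" || status=1
    done
    return $status
}
```

For example, for_each_target breg would run breg U, breg G, and so on, printing a banner before each run.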
Now that Bunsen can tell us what test(s) might have regressed, the next step is to ascertain why a test might have failed. This normally means comparing the test output in the two log files, but with Bunsen, there is a much easier way.
bdiff: Output a diff between two tests
This script outputs a diff of the log output for a given test on a given target board. This is where Bunsen can save a great deal of time: there is no need to manually search the log files for a given test and inspect the results. With Bunsen, it's all automagic.
The function requires two inputs: the target board (abbreviation) and the test name.
# Convenience function to diff two test results between any two commits.
function bdiff()
{
if [ $# -ne 2 ]; then
>&2 echo "error: bdiff requires two arguments"
>&2 echo "bdiff: usage: bdiff TARGET_ABBREV 'TEST'"
return 1
fi
target=$1
r-dejagnu-diff-logs --template=diff --git ${BUNSEN_REPO} \
--db ${BUNSEN_DB} $(btag $target unpatched) $(btag $target patched) \
"$2"
return 0
}
Now let's use bdiff to quickly figure out what happened to cause my (intentionally introduced) regressions.
One of the regressing tests identified in the breg example output is gdb.base/default.exp: print. bdiff can show us how the output of this test changed between the "unpatched" and "patched" test runs:
$ bdiff U 'gdb.base/default.exp: print'
logfile: <class 'dict'>
*********************
*** unpatched-unix.-m64: gdb.base/default.exp: print
--- patched-unix.-m64: gdb.base/default.exp: print
***************
*** 1,3 ****
  print
! The history is empty.
! (gdb) PASS: gdb.base/default.exp: print
--- 1,3 ----
  print
! The history is really empty.
! (gdb) FAIL: gdb.base/default.exp: print
The diff clearly shows how I artificially regressed the test by adding the word "really" to the string The history is empty.
Is that also what caused the subsequent tests in the file to regress? Asking for the Bunsen diff of the next test (gdb.base/default.exp: print "p" abbreviation) also clearly shows this to be the case:
$ bdiff U 'gdb.base/default.exp: print "p" abbreviation'
logfile: <class 'dict'>
*********************
*** unpatched-unix.-m64: gdb.base/default.exp: print "p" abbreviation
--- patched-unix.-m64: gdb.base/default.exp: print "p" abbreviation
***************
*** 1,3 ****
  p
! The history is empty.
! (gdb) PASS: gdb.base/default.exp: print "p" abbreviation
--- 1,3 ----
  print
! The history is really empty.
! (gdb) FAIL: gdb.base/default.exp: print "p" abbreviation
As for the subsequent failing tests, you can repeat this process to verify that the sudden appearance of the word "really" is the cause for all of these failures.
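The bdiff output uses the context-diff layout, with changed lines marked by "!". For readers unfamiliar with that layout, the sketch below reproduces the same shape with plain diff -c on two hand-written files. It does not involve Bunsen at all; the file names and contents are stand-ins for the unpatched and patched log excerpts.

```shell
# Sketch: reproduce the shape of the bdiff output with plain diff -c.
# The two files stand in for the unpatched and patched log excerpts.
tmpdir=$(mktemp -d)
printf '%s\n' 'print' 'The history is empty.' \
    '(gdb) PASS: gdb.base/default.exp: print' > "$tmpdir/unpatched.log"
printf '%s\n' 'print' 'The history is really empty.' \
    '(gdb) FAIL: gdb.base/default.exp: print' > "$tmpdir/patched.log"

# diff exits with status 1 when the files differ, which is expected here.
diff_out=$(diff -c "$tmpdir/unpatched.log" "$tmpdir/patched.log") || true
printf '%s\n' "$diff_out"
rm -rf "$tmpdir"
```

The unchanged first line appears as context, while the two lines that differ are flagged with "!" in both halves of the hunk.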
Turning our attention to the other introduced regression in gdb.cp/var-tag.exp:
$ bdiff U 'gdb.cp/var-tag.exp: in C::f: c++: ptype C'
logfile: <class 'dict'>
*********************
*** unpatched-unix.-m64: gdb.cp/var-tag.exp: in C::f: c++: ptype C
--- patched-unix.-m64: gdb.cp/var-tag.exp: in C::f: c++: ptype C
***************
*** 1,12 ****
  ptype C
! type = class C {
    public:
!     C::C1 C1;
!     C::E1 E1;
!     C::U1 U1;
  
!     C(void);
      void global(void) const;
      int f(void) const;
  }
! (gdb) PASS: gdb.cp/var-tag.exp: in C::f: c++: ptype C
--- 1,12 ----
  ptype C
! type = class CX {
    public:
!     CX::C1 C1;
!     CX::E1 E1;
!     CX::U1 U1;
  
!     CX(void);
      void global(void) const;
      int f(void) const;
  }
! (gdb) FAIL: gdb.cp/var-tag.exp: in C::f: c++: ptype C
From the above output, it is obvious that I renamed a class in the source file from C to CX, causing the observed regression. Using bdiff on the other failing tests would show that this is the root cause of the regressions.
Conclusion
In this article, I have attempted to demonstrate how I use Bunsen to facilitate the analysis of automated regression testing for GDB releases on Red Hat Enterprise Linux. While there is still more automation and analysis that would yield further benefits, the process implemented above has already saved me countless hours of tedious and unrewarding work.
If you are also doing manual regression testing analyses, I hope this article will inspire you to adopt Bunsen and save yourself and your organization time and money!
Further reading
- Automating the testing process for SystemTap, Part 2: Test result analysis with Bunsen by Serhei Makarov
- Detecting nondeterministic test cases with Bunsen by Serhei Makarov
- Store and analyze your test-suite logs with this open source tool by Frank Eigler
- Uncover interesting test cases with AI/ML and Bunsen by Frank Eigler
Special thanks
I would like to thank Serhei Makarov, Frank Eigler, and the entire Bunsen community for all the work they've done over the past few years. You have saved my sanity!