Quickly analyze regression testing results with Bunsen

Software developers commonly perform regression testing, comparing test results from before and after a change. It is a common continuous integration strategy, and one that I employ when preparing releases of GDB for Red Hat Enterprise Linux (RHEL).

What happens, though, when the results of testing are non-deterministic due to racy (thread) tests or "slow" or high-load machines that cause tests to time out? This often requires human intervention and analysis of the test results.

Preparing a release of GDB typically involves identifying patches to backport to fix specific code deficiencies identified in bugs reported by users. Once these backports are completed, we effectively have a "patch" (a quasi-release candidate, if you will), that we can test to verify that no regressions will be introduced into the release.

For the GDB team, this "release candidate" is tested using an internal command-line tool called gdb-beaker, which, as the name suggests, uses an internal Beaker instance to reserve computers of every supported RHEL architecture and run regression tests. The beaker task associated with this tool will install a fresh copy of the operating system, check out the desired Git repository, install build dependencies, and build and test GDB before and after applying the patch to test.

A Bunsen database is then created, and all test results are imported. This is where the fun begins.

What is Bunsen and how does it help?

From the project's homepage: "Bunsen is a test result storage and analysis toolkit that collects test result and build log files in a variety of formats (e.g., DejaGnu, Autoconf config.log, glibc), stores them in a ludicrously compact de-duplicated Git repo, parses and indexes the contents, and provides a toolkit for analyzing and browsing the indexed test results."

Before Bunsen, analyzing test results was a very manual process: inspect diffs of unpatched versus patched results and determine whether a regression has occurred. We run regression testing against several DejaGnu "target boards," meaning that we run the test suite at least three times for every supported architecture, and more on x86_64 because we support both 64- and 32-bit executables there. In total, a single release requires the analysis of 15 test runs. Factor in GCC Toolset releases on both RHEL 8 and RHEL 9, and this amounts to a whopping 60 test runs to analyze!

Staring at this many test diffs is exceptionally error-prone, but with Bunsen, there is a more accurate and much faster way to analyze this many results. The rest of this article will explain the approach I've implemented internally at Red Hat to shave many person-days off this process.

Overview

Bunsen stores test data in a Git repository, tagging each data set with a user-generated name. Therefore, the first order of business is to name our test result runs uniformly.

There are a number of considerations for GDB. We have different target boards, different "bitness" (32- vs 64-bit), and "unpatched" versus "patched" test runs to name.

I have chosen to name all Bunsen imports with (un)?patched-TARGET_BOARD(.-mBITNESS)?, where TARGET_BOARD is the DejaGnu board name (see below), and BITNESS, if needed, is either "32" or "64". For example, the Git tag that contains the results for the patched, 64-bit native-gdbserver test run would be patched-native-gdbserver.-m64.
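
As a minimal sketch of this naming scheme, the following helper checks a string against the pattern. The name is_bunsen_tag is hypothetical (it is not part of Bunsen or of my scripts), and the regular expression assumes that board names contain only lowercase letters and hyphens, which holds for the three boards used in this article.

```shell
# Hypothetical helper: succeed if $1 looks like
# (un)?patched-TARGET_BOARD(.-mBITNESS)?
# Assumption: board names contain only lowercase letters and hyphens.
is_bunsen_tag() {
    printf '%s\n' "$1" |
        grep -Eq '^(un)?patched-[a-z]+(-[a-z]+)*(\.-m(32|64))?$'
}
```

For example, is_bunsen_tag patched-native-gdbserver.-m64 succeeds, while is_bunsen_tag patched-unix/-m64 fails because the "/" has not been translated.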

GDB's test suite is written in DejaGnu, and the GDB sources define a number of "target boards" meant to facilitate testing various hardware/software configurations. Table 1 summarizes the configurations that are tested for RHEL GDB releases. I refer to these throughout this article via a "target abbreviation," which I will use later to save my fingers from having to repeatedly type so much.

Abbreviation	DejaGnu --target_board/bitness	Supported architectures
U	unix/-m64	All
G	native-gdbserver/-m64	All
E	native-extended-gdbserver/-m64	All
u	unix/-m32	x86_64 only
g	native-gdbserver/-m32	x86_64 only
e	native-extended-gdbserver/-m32	x86_64 only

Table 1: GDB DejaGnu target boards and target abbreviations.

When the Bunsen database is populated by automated testing, it will create imports for each of these target boards both before ("unpatched") and after ("patched") patching the sources with the proposed changes for the release.

A note on implementation

For all of the following implementations, it is assumed that the environment has two variables defined that point at the Bunsen instance's Git repository (BUNSEN_REPO, e.g., $HOME/bunsendb) and the Bunsen database (BUNSEN_DB, e.g., $BUNSEN_REPO/bunsen.sqlite3).
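
One way to set this up is sketched below. The default values are simply the examples given above, and bunsen_env_ok is a hypothetical sanity check that I am adding for illustration, not something Bunsen provides.

```shell
# Example values from the text; adjust them to your own setup.
export BUNSEN_REPO="${BUNSEN_REPO:-$HOME/bunsendb}"
export BUNSEN_DB="${BUNSEN_DB:-$BUNSEN_REPO/bunsen.sqlite3}"

# Hypothetical helper: complain if either location is missing.
bunsen_env_ok() {
    if [ ! -d "$BUNSEN_REPO" ]; then
        >&2 echo "error: BUNSEN_REPO ($BUNSEN_REPO) is not a directory"
        return 1
    fi
    if [ ! -f "$BUNSEN_DB" ]; then
        >&2 echo "error: BUNSEN_DB ($BUNSEN_DB) is not a file"
        return 1
    fi
    return 0
}
```

Running bunsen_env_ok before any of the later functions gives a clearer error than a failed Bunsen query would.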

Note: The Bunsen Git repository is the Git repository created by importing results into Bunsen. It is not the actual source repository of the Bunsen project.

The scripts

The following scripts/shell functions are almost verbatim copies of what I use to analyze test results for RHEL GDB. All of the examples given in the rest of this article have been run on Red Hat Enterprise Linux 9 on x86_64, and while they work for me, they are neither "perfect" nor guaranteed to work for you. While all of the scripts in this article are actually shell functions, I will use the terms "script(s)" and "function(s)" interchangeably.

For discussion purposes, I have intentionally introduced regressions into two test files, gdb.base/default.exp and gdb.cp/var-tag.exp, and I have limited the output of the scripts to these results. In later examples, we will discover what these regressions are using the methods described herein.

tgt: Target abbreviation expansion

This simple function eases the pain of typing native-extended-gdbserver/-m64 over and over. While the Bunsen database stores the "long form" of these board configurations, I don't like to type them all the time. The following shell function allows me to simply type tgt U. You can pass the output of this function directly to DejaGnu via the RUNTESTFLAGS variable, as indicated in the function's comments.

# Function to convert a one-letter target abbreviation to a full target board
# strings.
#
# This function returns strings that can be passed to
# RUNTESTFLAGS="--target_board $(tgt LTR)". LTR is either 'u', 'g', 'e',
# or any of those capitalized. When testing 64- and 32-bit variations,
# lower case will represent 32-bit boards and capitals will represent
# 64-bit boards. If we only support one bitness, then both lower and
# capital versions will be the same.
#
# For example, `u' on x86_64 will return "unix/-m32", but on aarch64, 'u'
#  and `U' will just be "unix".

function tgt() {
    if [ $# -ne 1 ]; then
        >&2 echo "error: tgt requires an argument"
        >&2 echo "tgt: usage: tgt [UuGgEe]"
        return 1
    fi

    case $(uname -m) in
    aarch64)
        case "$1" in
        U|u) echo "unix";;
        G|g) echo "native-gdbserver";;
        E|e) echo "native-extended-gdbserver" ;;
        *)
            >&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
            return 1
            ;;
        esac
        ;;
    ppc64le)
        # For historical reasons (to follow gdb.spec), we explicitly
        # use "-m64".
        case "$1" in
        U|u) echo "unix/-m64" ;;
        G|g) echo "native-gdbserver/-m64" ;;
        E|e) echo "native-extended-gdbserver/-m64" ;;
        *)
            >&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
            return 1
            ;;
        esac
        ;;
    s390x)
        # For historical reasons (to follow gdb.spec), we explicitly
        # use "-m64".
        case "$1" in
        U|u) echo "unix/-m64" ;;
        G|g) echo "native-gdbserver/-m64" ;;
        E|e) echo "native-extended-gdbserver/-m64" ;;
        *)
            >&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
            return 1
            ;;
        esac
        ;;
    x86_64)
        case "$1" in
        U) echo "unix/-m64";;
        u) echo "unix/-m32";;
        G) echo "native-gdbserver/-m64";;
        g) echo "native-gdbserver/-m32";;
        E) echo "native-extended-gdbserver/-m64";;
        e) echo "native-extended-gdbserver/-m32";;
        *)
            >&2 echo "unknown letter '$1': must be U, u, G, g, E, or e"
            return 1
            ;;
        esac
        ;;
    *)
        >&2 echo "unknown architecture: $(uname -m)"
        return 1
        ;;
    esac

    return 0
}

The following examples demonstrate how you can use this function, including how to use the tgt function to tell DejaGnu to use the specific target board for testing.

$ tgt g
native-gdbserver/-m32
$ make check RUNTESTFLAGS="--target_board $(tgt U)"
[snip]

btgt: Return a Git tag snippet for a target abbreviation

Because Bunsen uses Git for underlying storage, we are necessarily limited to naming our imports with valid Git tag characters. This means translating any invalid characters to a period ("."), which for these board names means replacing the "/". This is what btgt does.

# Similar to tgt above, but outputs something suitable for use in
# bunsen git tags, e.g., "unix.-m32".
function btgt()
{
    if [ $# -ne 1 ]; then
          >&2 echo "error: btgt requires an argument"
          >&2 echo "btgt: usage: btgt [UuGgEe]"
          return 1
    fi

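    # Bail out if tgt rejects the abbreviation rather than echoing an
    # empty string.
    r=$(tgt "$1") || return 1
    echo "$r" | sed 's@/@.@g'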
    return 0
}

The following examples show how btgt converts the target board abbreviations g (native-gdbserver/-m32) and U (unix/-m64) into something suitable for use with Git. We will use this function's output later to construct the full Git tag used when importing results into our Bunsen instance.

$ btgt g
native-gdbserver.-m32
$ btgt U
unix.-m64

btag: Create the Git tag used by Bunsen imports

Using the above two scripts, we can define a simple shell function to output the complete name of any test result run (a.k.a. Bunsen import) with minimal typing. [See a theme? :-)]

Note that, as with every function in this article that distinguishes a "patched" test run from an "unpatched" one, the argument can be abbreviated to any leading part of those words; i.e., u will expand to unpatched and pa will expand to patched.

# Output a suitable tag for the test run's bunsen import.
# These look like "(un)?patched-TARGET_BOARD.-mBITNESS"
function btag()
{
    if [ $# -ne 2 ]; then
        >&2 echo "error: btag requires two arguments"
        >&2 echo "btag: usage: btag [UuGgEe] (un)?patched"
        return 1
    fi

    target=$1
    case "$2" in
        u*) ver="unpatched" ;;
        p*) ver="patched" ;;
        *)
            >&2 echo "error: unknown version, '$2'. Must be 'unpatched' or 'patched'"
            return 1
            ;;
    esac
    echo "$ver-$(btgt $target)"
    return 0
}

The following output lists the Git tags that we will use later to import the results for the target board configurations native-gdbserver/-m32 and unix/-m64:

$ btag g u
unpatched-native-gdbserver.-m32
$ btag U p
patched-unix.-m64

btags: List all imported Bunsen test runs

Bunsen creates a Git tag for each import. Therefore, a simple shell alias suffices to query the Bunsen Git repository for any stored results. This is usually the first thing I do to verify that testing completed successfully.

alias btags="git --git-dir $BUNSEN_REPO/.git tag -l"

In the following example, the output of btags shows that our Bunsen instance is loaded with patched and unpatched results for all the target boards and configurations tested on x86_64 RHEL.

$ btags
patched-native-extended-gdbserver.-m32
patched-native-extended-gdbserver.-m64
patched-native-gdbserver.-m32
patched-native-gdbserver.-m64
patched-unix.-m32
patched-unix.-m64
unpatched-native-extended-gdbserver.-m32
unpatched-native-extended-gdbserver.-m64
unpatched-native-gdbserver.-m32
unpatched-native-gdbserver.-m64
unpatched-unix.-m32
unpatched-unix.-m64
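
Since this listing is my first check that testing completed, the comparison against the expected set of tags can also be scripted. The following is only a sketch: missing_tags is a hypothetical helper, and it expects the Git-tag-safe board names (such as the output of btgt) as arguments and the tag list on standard input.

```shell
# Hypothetical helper: read tag names on stdin (e.g., from btags) and print
# every expected tag that is absent. Arguments are Git-tag-safe board names
# such as "unix.-m64". Returns nonzero if any tag is missing.
missing_tags() {
    have=$(cat)
    status=0
    for board in "$@"; do
        for ver in unpatched patched; do
            # -F: fixed string, -x: whole line, -q: quiet.
            if ! printf '%s\n' "$have" | grep -Fqx -- "$ver-$board"; then
                echo "$ver-$board"
                status=1
            fi
        done
    done
    return $status
}
```

A typical invocation would be btags | missing_tags "$(btgt U)" "$(btgt u)", which prints nothing when both runs of both boards have been imported.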

bsum: Summarize test results

Finally, we get to the interesting scripts!

bsum outputs the testing summary from a given test run that has been imported into Bunsen. Its output mimics DejaGnu's summary table. To output results that exactly imitate DejaGnu's output, use the -v option. This will explicitly list all non-passing tests.

The function also optionally takes the name of a test file (as a glob expression) to limit the results to specific test file(s), as shown in the example that follows:

# Summarize test results for a given bunsen commit.
function bsum () {
    if [ $# -lt 2 ]; then
        >&2 echo "error: bsum requires at least two arguments"
        >&2 echo "bsum: usage: bsum [-v] TARGET_ABBREV (un)?patched [EXP_GLOB]"
        return 1
    fi

    # Get verbose output option.
    verbose=""
    if [ "$1" = "-v" ]; then
        verbose="-v"
        shift
    fi

    # Get target abbreviation and version.
    target=$1 # any valid target abbreviation (UuGgEe)
    ver=$2 # "unpatched" or "patched"

    # Get optional expfile glob.
    glob=""
    if [ $# -eq 3 ]; then
        glob="--expfile-glob $3"
    fi

    # Run sum script.
    tag=$(btag $target $ver)
    r-dejagnu-summary --git ${BUNSEN_REPO} --db ${BUNSEN_DB} $verbose \
                      $glob $tag
    return $?
}

In the following examples, the first command outputs the summary results for the entire test suite run of the U target board (unix/-m64) in the "unpatched" test suite run. The second outputs the summary for just GDB's C++-specific tests (in the gdb.cp folder) for the same test run.

$ bsum U u
# of expected passes        111673
# of unexpected failures    10
# of expected failures      83
# of known failures         112
# of untested testcases     30
# of unsupported tests      559
$ bsum U u 'gdb.cp/*.exp'
# of expected passes        8585
# of expected failures      14
# of known failures         25
# of untested testcases     3

Note that the C++-specific summary did not output a field for "unexpected failures" because there were no such failures. This is consistent with how DejaGnu outputs results. (I often compare manual test run summaries with the data stored in Bunsen.)

breg: Compute regressions

The breg function will query Bunsen for any regressions discovered during testing. Its inputs include the target board abbreviation and an optional test file to which to limit results. Notice that unlike bsum, this function does not accept a glob expression.

# Convenience function to find (possible) regressions between two testruns.
# Optionally takes a test file name (expfile).
function breg () {
    if [ $# -lt 1 ]; then
        >&2 echo "error: breg requires at least one argument"
        >&2 echo "breg: usage: breg TARGET_ABBREV [EXPFILE]"
        return 1
    fi

    # Get target abbreviation.
    expfile=""
    target=$1 # any valid target abbreviation (UuGgEe)

    # Get optional expfile.
    if [ $# -eq 2 ]; then
        expfile="--dgexpfile $2"
    fi

    # Run diff script. Add "--regressions"?
    r-diff-testruns --git ${BUNSEN_REPO} --db ${BUNSEN_DB} \
                    $(btag $target u) $(btag $target p) $expfile
    return $?
}

In the following example, breg lists the regressions discovered between running the "unpatched" and "patched" test suite runs of the U target. For the sake of brevity, I have limited the output to the regressions I previously introduced to gdb.base/default.exp and gdb.cp/var-tag.exp:

$ breg U
2023-08-30 13:24:45,299:r-diff-testruns:INFO:opened git repo /root/bunsendb

    unpatched-unix.-m64
       patched-unix.-m64

dejagnu diffs

gdb.base/default.exp gdb.base/default.exp: call
    PASS    gdb.log:109690 gdb.sum:9435
       FAIL    gdb.log:109893 gdb.sum:9435

gdb.base/default.exp gdb.base/default.exp: inspect
    PASS    gdb.log:111372 gdb.sum:9529
       FAIL    gdb.log:111575 gdb.sum:9529

gdb.base/default.exp gdb.base/default.exp: print
    PASS    gdb.log:111453 gdb.sum:9551
       FAIL    gdb.log:111656 gdb.sum:9551

gdb.base/default.exp gdb.base/default.exp: print "p" abbreviation
    PASS    gdb.log:111450 gdb.sum:9552
       FAIL    gdb.log:111653 gdb.sum:9552

gdb.base/default.exp gdb.base/default.exp: ptype
    PASS    gdb.log:111459 gdb.sum:9554
       FAIL    gdb.log:111662 gdb.sum:9554

gdb.base/default.exp gdb.base/default.exp: whatis
    PASS    gdb.log:112837 gdb.sum:9675
       FAIL    gdb.log:113040 gdb.sum:9675

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c++: ptype C
    PASS    gdb.log:424504 gdb.sum:62659
       FAIL    gdb.log:425341 gdb.sum:62685

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c: ptype C
    PASS    gdb.log:424626 gdb.sum:62689
       FAIL    gdb.log:425463 gdb.sum:62715

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c++: ptype C
    PASS    gdb.log:425013 gdb.sum:62723
       FAIL    gdb.log:425850 gdb.sum:62750

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c: ptype C
    PASS    gdb.log:425135 gdb.sum:62753
       FAIL    gdb.log:425972 gdb.sum:62780

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c++: ptype C
    PASS    gdb.log:424761 gdb.sum:62783
       FAIL    gdb.log:425598 gdb.sum:62810

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c: ptype C
    PASS    gdb.log:424883 gdb.sum:62813
       FAIL    gdb.log:425720 gdb.sum:62840

The following example limits the list to those in the gdb.cp/var-tag.exp test file:

$ breg U gdb.cp/var-tag.exp
2023-08-30 13:26:30,565:r-diff-testruns:INFO:opened git repo /root/bunsendb

    unpatched-unix.-m64
       patched-unix.-m64

dejagnu diffs

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c++: ptype C
    PASS    gdb.log:424504 gdb.sum:62659
       FAIL    gdb.log:425341 gdb.sum:62685

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: before start: c: ptype C
    PASS    gdb.log:424626 gdb.sum:62689
       FAIL    gdb.log:425463 gdb.sum:62715

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c++: ptype C
    PASS    gdb.log:425013 gdb.sum:62723
       FAIL    gdb.log:425850 gdb.sum:62750

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: C::f: c: ptype C
    PASS    gdb.log:425135 gdb.sum:62753
       FAIL    gdb.log:425972 gdb.sum:62780

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c++: ptype C
    PASS    gdb.log:424761 gdb.sum:62783
       FAIL    gdb.log:425598 gdb.sum:62810

gdb.cp/var-tag.exp gdb.cp/var-tag.exp: main: c: ptype C
    PASS    gdb.log:424883 gdb.sum:62813
       FAIL    gdb.log:425720 gdb.sum:62840
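
With many regressions, a per-file tally is a useful first look. The sketch below is a hypothetical helper, count_regressions, that assumes nothing beyond the three-line layout visible in the output above: a header line starting with the test file name, then the outcome from the first run, then the outcome from the second.

```shell
# Hypothetical helper: read r-diff-testruns output on stdin and print
# "COUNT EXPFILE" for tests that went from PASS to FAIL.
# Assumption: each entry is a header line beginning with the expfile name,
# followed by two indented lines giving the old and new outcomes.
count_regressions() {
    awk '
        /^[^ \t]/ && $1 ~ /\.exp$/ { expfile = $1; n = 0; next }
        expfile != "" && /^[ \t]+[A-Z]+[ \t]/ {
            outcome[++n] = $1
            if (n == 2) {
                if (outcome[1] == "PASS" && outcome[2] == "FAIL")
                    count[expfile]++
                expfile = ""
            }
        }
        END { for (f in count) print count[f], f }
    ' | sort -k2
}
```

Piping the first example through it (breg U | count_regressions) would report six PASS-to-FAIL transitions for each of the two test files shown.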

Now that Bunsen can tell us what test(s) might have regressed, the next step is to ascertain why a test might have failed. This normally means comparing the test output in the two log files, but with Bunsen, there is a much easier way.

bdiff: Output a diff between two tests

This script outputs a diff of the log output for a given test for a target board. This is where Bunsen can save ridiculous amounts of time by sparing you from manually searching log files for a given test and inspecting the results. With Bunsen, it's all automagic.

The function requires two inputs: the target board (abbreviation) and the test name.

# Convenience function to diff two test results between any two commits.
function bdiff()
{
    if [ $# -ne 2 ]; then
        >&2 echo "error: bdiff requires two arguments"
        >&2 echo "bdiff: usage: bdiff TARGET_ABBREV 'TEST'"
        return 1
    fi

    target=$1
    r-dejagnu-diff-logs --template=diff --git ${BUNSEN_REPO}  \
      --db ${BUNSEN_DB} $(btag $target unpatched) $(btag $target patched) \
      "$2"
    return 0
}

Now let's use bdiff to quickly figure out what happened to cause my (intentionally introduced) regressions.

One of the regressing tests identified in the breg example output is gdb.base/default.exp: print. bdiff can show us how the output of this test changed between the "unpatched" and "patched" test runs:

$ bdiff U 'gdb.base/default.exp: print'
logfile: <class 'dict'> *********************
*** unpatched-unix.-m64: gdb.base/default.exp: print
--- patched-unix.-m64: gdb.base/default.exp: print
***************
*** 1,3 ****
  print
! The history is empty.
! (gdb) PASS: gdb.base/default.exp: print
--- 1,3 ----
  print
! The history is really empty.
! (gdb) FAIL: gdb.base/default.exp: print

The diff clearly shows how I artificially regressed the test by adding the word "really" to the string The history is empty.

Is that also what caused the subsequent tests in the file to regress? Asking for the Bunsen diff of the next test (gdb.base/default.exp: print "p" abbreviation) also clearly shows this to be the case:

$ bdiff U 'gdb.base/default.exp: print "p" abbreviation'

logfile: <class 'dict'> *********************
*** unpatched-unix.-m64: gdb.base/default.exp: print "p" abbreviation
--- patched-unix.-m64: gdb.base/default.exp: print "p" abbreviation
***************
*** 1,3 ****
  p
! The history is empty.
! (gdb) PASS: gdb.base/default.exp: print "p" abbreviation
--- 1,3 ----
  print
! The history is really empty.
! (gdb) FAIL: gdb.base/default.exp: print "p" abbreviation

As for the subsequent failing tests, you can repeat this process to verify that the sudden appearance of the word "really" is the cause for all of these failures.

Turning our attention to the other introduced regression in gdb.cp/var-tag.exp:

$ bdiff U 'gdb.cp/var-tag.exp: in C::f: c++: ptype C'
logfile: <class 'dict'> *********************
*** unpatched-unix.-m64: gdb.cp/var-tag.exp: in C::f: c++: ptype C
--- patched-unix.-m64: gdb.cp/var-tag.exp: in C::f: c++: ptype C
***************
*** 1,12 ****
  ptype C
! type = class C {
public:
!  C::C1 C1;
!  C::E1 E1;
!  C::U1 U1;

!  C(void);
   void global(void) const;
   int f(void) const;
  }
! (gdb) PASS: gdb.cp/var-tag.exp: in C::f: c++: ptype C
--- 1,12 ----
  ptype C
! type = class CX {
public:
!  CX::C1 C1;
!  CX::E1 E1;
!  CX::U1 U1;

!  CX(void);
   void global(void) const;
   int f(void) const;
  }
! (gdb) FAIL: gdb.cp/var-tag.exp: in C::f: c++: ptype C

From the above output, it is obvious that I renamed a class in the source file from C to CX, causing the observed regression. Using bdiff on the other failing tests would show that this is the root cause of the regressions.
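
Repeating bdiff by hand for every failing test can itself be scripted. The following sketch is a hypothetical helper, each_diffed_test, that reads breg-style output on standard input and runs a command of your choosing once per listed test, with the test name appended as the final argument. It assumes each header line has the form "EXPFILE TESTNAME", as in the breg examples above.

```shell
# Hypothetical helper: for every test listed in r-diff-testruns output on
# stdin, run the given command with the test name as its last argument.
# Assumption: header lines look like "EXPFILE TESTNAME".
each_diffed_test() {
    awk '/^[^ \t]/ && $1 ~ /\.exp$/ { sub(/^[^ \t]+[ \t]+/, ""); print }' |
    while IFS= read -r test_name; do
        # Redirect stdin so the command cannot swallow the remaining names.
        "$@" "$test_name" < /dev/null
    done
}
```

For instance, breg U gdb.base/default.exp | each_diffed_test bdiff U would print one diff per test listed for that file.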

Conclusion

In this article, I have attempted to demonstrate how I use Bunsen to facilitate the analysis of automated regression testing for GDB releases on Red Hat Enterprise Linux. While there is still more automation and analysis that would yield further benefits, the process implemented above has already saved me countless hours of tedious and unrewarding work.

If you are also doing manual regression testing analyses, I hope this article will inspire you to adopt Bunsen and save yourself and your organization time and money!
