Fast Kernel Builds

A number of months ago I did an “Ask Me Anything” interview on the r/linux subreddit on Reddit. As part of that, a discussion of the hardware I use came up, and someone said something like, “I know someone who can get you a new machine. Get that person a new machine!”

Fast forward a few months, and a “beefy” AMD Threadripper 3970X shows up on my doorstep thanks to the amazing work of Wendell Wilson at Level One Techs.

Ever since I started doing Linux kernel development, the hardware I use has been a mix of things donated to me for development (workstations from Intel and IBM, laptops from Dell), machines my employer has bought for me (various laptops over the years), and machines I’ve bought on my own because I “needed” them (workstations built from scratch, Apple Mac Minis, laptops from Apple and Dell and ASUS and Panasonic). I know I am extremely lucky to be in this position, and anything that has been donated to me was donated only to ensure that the hardware works well on Linux. “Will code for hardware” was an early mantra of many kernel developers, myself included, and hardware companies are usually willing to donate machines and peripherals to ensure kernel support.

This new AMD machine is just another in a long line of good workstations that help me read email really well. Oops, I mean, “do kernel builds really fast”…

For full details on the system, see this forum description, this video that Wendell did of the build, and this video of us talking about it before it was sent out. We need to do a follow-up now that I’ve had it for a few months and have gotten used to it.

Benchmark tools

Below I post the results of some benchmarks that I have done to try to show the speed of different systems. I used Fio version fio-3.23-28-g7064, kcbench version v0.9.0 (from git), and perf version 5.7.g3d77e6a8804a. These are all great tools for real-world tests of I/O systems (fio), kernel build tests (kcbench), and “what is my system doing at the moment” queries (perf). I recommend trying them all out yourself if you haven’t done so already.
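
If you haven’t used kcbench before, getting numbers like the ones below is basically a one-liner; a rough sketch (not the exact invocation behind these results):

# kcbench downloads a kernel source tree into ~/.cache/kcbench and
# times repeated "make vmlinux" builds at several -j values
$ kcbench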

Fast Builds

I’ve been using a laptop as my primary development system for a number of years now, due to travel and moving around a bit, and because it was just “good enough” at the time. I do some local builds and testing, but I have a “build machine” in a data center somewhere that I do all of my normal stable kernel builds on, as it is much, much faster than any laptop. It is set up to do kernel builds directly off of a RAM disk, ensuring that I/O isn’t an issue. Given that it has 128GB of RAM, carving out a 40GB ramdisk for kernel builds (room for 4-5 trees at once) has worked really well, with builds of a full kernel tree finishing in a few minutes.
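
If you want to replicate the RAM disk setup, a minimal sketch looks something like this (the mount point and size here are assumptions, not the actual build box configuration):

# carve a 40GB tmpfs out of RAM and build there, so storage never
# becomes the bottleneck; the tree is gone after a reboot, which is
# fine for build testing
$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=40G tmpfs /mnt/ramdisk
$ cp -a ~/linux /mnt/ramdisk/
$ cd /mnt/ramdisk/linux && make -j"$(nproc)" vmlinux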

Here’s the output of kcbench on my data center build box which is running Fedora 32:

Processor:           Intel Core Processor (Broadwell) [40 CPUs]
Cpufreq; Memory:     Unknown; 120757 MiB
Linux running:       5.8.7-200.fc32.x86_64 [x86_64]
Compiler:            gcc (GCC) 10.2.1 20200723 (Red Hat 10.2.1-1)
Linux compiled:      5.7.0 [/home/gregkh/.cache/kcbench/linux-5.7]
Config; Environment: defconfig; CCACHE_DISABLE="1"
Build command:       make vmlinux
Filling caches:      This might take a while... Done
Run 1 (-j 40):       81.92 seconds / 43.95 kernels/hour [P:3033%]
Run 2 (-j 40):       83.38 seconds / 43.18 kernels/hour [P:2980%]
Run 3 (-j 46):       82.11 seconds / 43.84 kernels/hour [P:3064%]
Run 4 (-j 46):       81.43 seconds / 44.21 kernels/hour [P:3098%]

Contrast that with my current laptop:

Processor:           Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz [8 CPUs]
Cpufreq; Memory:     powersave [intel_pstate]; 15678 MiB
Linux running:       5.8.8-arch1-1 [x86_64]
Compiler:            gcc (GCC) 10.2.0
Linux compiled:      5.7.0 [/home/gregkh/.cache/kcbench/linux-5.7]
Config; Environment: defconfig; CCACHE_DISABLE="1"
Build command:       make vmlinux
Filling caches:      This might take a while... Done
Run 1 (-j 8):        392.69 seconds / 9.17 kernels/hour [P:768%]
Run 2 (-j 8):        393.37 seconds / 9.15 kernels/hour [P:768%]
Run 3 (-j 10):       394.14 seconds / 9.13 kernels/hour [P:767%]
Run 4 (-j 10):       392.94 seconds / 9.16 kernels/hour [P:769%]
Run 5 (-j 4):        441.86 seconds / 8.15 kernels/hour [P:392%]
Run 6 (-j 4):        440.31 seconds / 8.18 kernels/hour [P:392%]
Run 7 (-j 6):        413.48 seconds / 8.71 kernels/hour [P:586%]
Run 8 (-j 6):        412.95 seconds / 8.72 kernels/hour [P:587%]

Then the new workstation:

Processor:           AMD Ryzen Threadripper 3970X 32-Core Processor [64 CPUs]
Cpufreq; Memory:     schedutil [acpi-cpufreq]; 257693 MiB
Linux running:       5.8.8-arch1-1 [x86_64]
Compiler:            gcc (GCC) 10.2.0
Linux compiled:      5.7.0 [/home/gregkh/.cache/kcbench/linux-5.7/]
Config; Environment: defconfig; CCACHE_DISABLE="1"
Build command:       make vmlinux
Filling caches:      This might take a while... Done
Run 1 (-j 64):       37.15 seconds / 96.90 kernels/hour [P:4223%]
Run 2 (-j 64):       37.14 seconds / 96.93 kernels/hour [P:4223%]
Run 3 (-j 71):       37.16 seconds / 96.88 kernels/hour [P:4240%]
Run 4 (-j 71):       37.12 seconds / 96.98 kernels/hour [P:4251%]
Run 5 (-j 32):       43.12 seconds / 83.49 kernels/hour [P:2470%]
Run 6 (-j 32):       43.81 seconds / 82.17 kernels/hour [P:2435%]
Run 7 (-j 38):       41.57 seconds / 86.60 kernels/hour [P:2850%]
Run 8 (-j 38):       42.53 seconds / 84.65 kernels/hour [P:2787%]

Having a local machine that builds kernels faster than my external build box has been a liberating experience. I can do many more local tests before sending things off to the build systems for “final test builds” there.

Here’s a picture of my local box doing kernel builds, and the remote machine doing builds at the same time, both running bpytop to monitor what is happening (htop doesn’t work well for huge numbers of CPUs). It’s not really all that useful, but it is fun eye-candy:

Lots of cpus

SSD vs. NVME

As shipped to me, the machine booted from a RAID array of NVME disks. Outside of laptops, I’ve not used NVME disks, only SSDs. Given that I didn’t really “trust” the Linux install that was on them, I wiped the disks, installed a trusty SATA SSD, and got Linux up and running well on it.
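
As an aside, those NVME drives are assembled into a single Linux software RAID device (the md0 that shows up in the fio results below). If you want to see how such an array is laid out, this is all it takes (shown purely as an illustration, not output from this machine):

# list any Linux software RAID (md) arrays and their member drives
$ cat /proc/mdstat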

After that was all up and running well (btw, I use Arch Linux), I looked into the NVME array to see if it would really help my normal workflow out or not.
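
For reference, pointing fio’s stock SSD job file at a given filesystem looks roughly like this (the mount point is an assumption, not the real path on this box):

# copy the stock job file, set its directory= line to the filesystem
# under test, and run it
$ cp examples/ssd-test.fio nvme-test.fio
$ $EDITOR nvme-test.fio        # set: directory=/mnt/nvme
$ fio nvme-test.fio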

Firing up fio, here are the summary numbers of the different disk systems using the default “examples/ssd-test.fio” test settings:

SSD:

Run status group 0 (all jobs):
   READ: bw=219MiB/s (230MB/s), 219MiB/s-219MiB/s (230MB/s-230MB/s), io=10.0GiB (10.7GB), run=46672-46672msec

Run status group 1 (all jobs):
   READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), io=6855MiB (7188MB), run=60001-60001msec

Run status group 2 (all jobs):
  WRITE: bw=177MiB/s (186MB/s), 177MiB/s-177MiB/s (186MB/s-186MB/s), io=10.0GiB (10.7GB), run=57865-57865msec

Run status group 3 (all jobs):
  WRITE: bw=175MiB/s (183MB/s), 175MiB/s-175MiB/s (183MB/s-183MB/s), io=10.0GiB (10.7GB), run=58539-58539msec

Disk stats (read/write):
  sda: ios=4375716/5243124, merge=548/5271, ticks=404842/436889, in_queue=843866, util=99.73%

NVME:

Run status group 0 (all jobs):
   READ: bw=810MiB/s (850MB/s), 810MiB/s-810MiB/s (850MB/s-850MB/s), io=10.0GiB (10.7GB), run=12636-12636msec

Run status group 1 (all jobs):
   READ: bw=177MiB/s (186MB/s), 177MiB/s-177MiB/s (186MB/s-186MB/s), io=10.0GiB (10.7GB), run=57875-57875msec

Run status group 2 (all jobs):
  WRITE: bw=558MiB/s (585MB/s), 558MiB/s-558MiB/s (585MB/s-585MB/s), io=10.0GiB (10.7GB), run=18355-18355msec

Run status group 3 (all jobs):
  WRITE: bw=553MiB/s (580MB/s), 553MiB/s-553MiB/s (580MB/s-580MB/s), io=10.0GiB (10.7GB), run=18516-18516msec

Disk stats (read/write):
    md0: ios=5242880/5237386, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=1310720/1310738, aggrmerge=0/23, aggrticks=63986/25048, aggrin_queue=89116, aggrutil=97.67%
  nvme3n1: ios=1310720/1310729, merge=0/0, ticks=63622/25626, in_queue=89332, util=97.63%
  nvme0n1: ios=1310720/1310762, merge=0/92, ticks=63245/25529, in_queue=88858, util=97.67%
  nvme1n1: ios=1310720/1310735, merge=0/3, ticks=64009/24018, in_queue=88114, util=97.58%
  nvme2n1: ios=1310720/1310729, merge=0/0, ticks=65070/25022, in_queue=90162, util=97.49%

Full logs of both tests can be found here for the SSD, and here for the NVME array.

Basically, the NVME array is anywhere from about 1.5 times to almost 4 times faster than the SSD, depending on the specific read/write test, and it is faster for everything overall.

But does such fast storage actually matter for my normal workload of kernel builds? A kernel build is fairly I/O intensive, but only up to a point: if the storage system can keep the CPUs “full” of new data to compile, and writes do not stall, then the build ends up limited by CPU power rather than by storage.
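
One quick way to sanity-check that while a build is running (a generic technique, not something from the runs below):

# sample system-wide stats every 5 seconds during a build; if the "wa"
# (I/O wait) column stays near zero while "us"/"sy" (user/system CPU
# time) are pegged, the build is CPU-bound, not storage-bound
$ vmstat 5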

So, is an SSD “fast” enough on a huge AMD Threadripper system?

In short, yes. Here’s the output of kcbench running on the NVME array:

Processor:           AMD Ryzen Threadripper 3970X 32-Core Processor [64 CPUs]
Cpufreq; Memory:     schedutil [acpi-cpufreq]; 257693 MiB
Linux running:       5.8.8-arch1-1 [x86_64]
Compiler:            gcc (GCC) 10.2.0
Linux compiled:      5.7.0 [/home/gregkh/.cache/kcbench/linux-5.7/]
Config; Environment: defconfig; CCACHE_DISABLE="1"
Build command:       make vmlinux
Filling caches:      This might take a while... Done
Run 1 (-j 64):       36.97 seconds / 97.38 kernels/hour [P:4238%]
Run 2 (-j 64):       37.18 seconds / 96.83 kernels/hour [P:4220%]
Run 3 (-j 71):       37.14 seconds / 96.93 kernels/hour [P:4248%]
Run 4 (-j 71):       37.22 seconds / 96.72 kernels/hour [P:4241%]
Run 5 (-j 32):       44.77 seconds / 80.41 kernels/hour [P:2381%]
Run 6 (-j 32):       42.93 seconds / 83.86 kernels/hour [P:2485%]
Run 7 (-j 38):       42.41 seconds / 84.89 kernels/hour [P:2797%]
Run 8 (-j 38):       42.68 seconds / 84.35 kernels/hour [P:2787%]

Almost exactly the same number of kernels built per hour as on the SSD.

So for a kernel developer, right now, an SSD is “good enough”, right?

It’s not just builds

While kernel builds are the most time-consuming thing that I do on my systems, the other “heavy” thing that I do is lots of git commands on the Linux kernel tree. git is really fast, but it is limited by the speed of the storage medium for lots of different operations (clones, switching branches, and the like).

After I switched to running my kernel trees off of the NVME storage, it “felt” like git was going faster, so I came up with some totally artificial benchmarks to try to see if this was really true or not.

One common thing is cloning a whole kernel tree from a local version into a new directory to do different things with it. Git is great in that you can keep the “metadata” in one place and only check out the source files in the new location, but dealing with 70 thousand files is not “free”.

$ cat clone_test.sh
#!/bin/bash
git clone -s ../work/torvalds/ test
sync
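
The -s here is git clone --shared: instead of copying the object database, the new tree records a pointer back to the original one, so only the working files have to be written out. You can see that pointer after the clone (shown purely as an illustration):

# the shared clone references the original object store rather than
# duplicating it
$ cat test/.git/objects/info/alternates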

And, to make sure the data isn’t just coming out of the kernel cache, be sure to flush all caches first.

SSD output:

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
$ perf stat ./clone_test.sh
Cloning into 'test'...
done.
Updating files: 100% (70006/70006), done.

 Performance counter stats for './clone_test.sh':

          4,971.83 msec task-clock:u              #    0.536 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            92,713      page-faults:u             #    0.019 M/sec
    14,623,046,712      cycles:u                  #    2.941 GHz                      (83.18%)
       720,522,572      stalled-cycles-frontend:u #    4.93% frontend cycles idle     (83.40%)
     3,179,466,779      stalled-cycles-backend:u  #   21.74% backend cycles idle      (83.06%)
    21,254,471,305      instructions:u            #    1.45  insn per cycle
                                                  #    0.15  stalled cycles per insn  (83.47%)
     2,842,560,124      branches:u                #  571.734 M/sec                    (83.21%)
       257,505,571      branch-misses:u           #    9.06% of all branches          (83.68%)

       9.270460632 seconds time elapsed

       3.505774000 seconds user
       1.435931000 seconds sys

NVME disk:

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
~/linux/tmp $ perf stat ./clone_test.sh
Cloning into 'test'...
done.
Updating files: 100% (70006/70006), done.

 Performance counter stats for './clone_test.sh':

          5,183.64 msec task-clock:u              #    0.833 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            87,409      page-faults:u             #    0.017 M/sec
    14,660,739,004      cycles:u                  #    2.828 GHz                      (83.46%)
       712,429,063      stalled-cycles-frontend:u #    4.86% frontend cycles idle     (83.40%)
     3,262,636,019      stalled-cycles-backend:u  #   22.25% backend cycles idle      (83.09%)
    21,241,797,894      instructions:u            #    1.45  insn per cycle
                                                  #    0.15  stalled cycles per insn  (83.50%)
     2,839,260,818      branches:u                #  547.735 M/sec                    (83.30%)
       258,942,077      branch-misses:u           #    9.12% of all branches          (83.25%)

       6.219492326 seconds time elapsed

       3.336154000 seconds user
       1.593855000 seconds sys

So a “clone” is about 3 seconds faster (9.3 vs. 6.2 seconds elapsed); nothing earth shattering, but noticeable.

But clones are rare. What’s more common is switching between branches, which checks out only the subset of files that differ between them, and it takes a lot of logic to figure out exactly which files need to change.

Here’s the test script:

$ cat branch_switch_test.sh
#!/bin/bash
cd test
git checkout -b old_kernel v4.4
sync
git checkout -b new_kernel v5.8
sync

And the results on the different disks:

SSD:

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
$ perf stat ./branch_switch_test.sh
Updating files: 100% (79044/79044), done.
Switched to a new branch 'old_kernel'
Updating files: 100% (77961/77961), done.
Switched to a new branch 'new_kernel'

 Performance counter stats for './branch_switch_test.sh':

         10,500.82 msec task-clock:u              #    0.613 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           195,900      page-faults:u             #    0.019 M/sec
    27,773,264,048      cycles:u                  #    2.645 GHz                      (83.35%)
     1,386,882,131      stalled-cycles-frontend:u #    4.99% frontend cycles idle     (83.54%)
     6,448,903,713      stalled-cycles-backend:u  #   23.22% backend cycles idle      (83.22%)
    39,512,908,361      instructions:u            #    1.42  insn per cycle
                                                  #    0.16  stalled cycles per insn  (83.15%)
     5,316,543,747      branches:u                #  506.298 M/sec                    (83.55%)
       472,900,788      branch-misses:u           #    8.89% of all branches          (83.18%)

      17.143453331 seconds time elapsed

       6.589942000 seconds user
       3.849337000 seconds sys

NVME:

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
~/linux/tmp $ perf stat ./branch_switch_test.sh
Updating files: 100% (79044/79044), done.
Switched to a new branch 'old_kernel'
Updating files: 100% (77961/77961), done.
Switched to a new branch 'new_kernel'

 Performance counter stats for './branch_switch_test.sh':

         10,945.41 msec task-clock:u              #    0.921 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           197,776      page-faults:u             #    0.018 M/sec
    28,194,940,134      cycles:u                  #    2.576 GHz                      (83.37%)
     1,380,829,465      stalled-cycles-frontend:u #    4.90% frontend cycles idle     (83.14%)
     6,657,826,665      stalled-cycles-backend:u  #   23.61% backend cycles idle      (83.37%)
    41,291,161,076      instructions:u            #    1.46  insn per cycle
                                                  #    0.16  stalled cycles per insn  (83.00%)
     5,353,402,476      branches:u                #  489.100 M/sec                    (83.25%)
       469,257,145      branch-misses:u           #    8.77% of all branches          (83.87%)

      11.885845725 seconds time elapsed

       6.741741000 seconds user
       4.141722000 seconds sys

Just over 5 seconds faster (17.1 vs. 11.9 seconds elapsed) on the NVME disk array.

Now 5 seconds doesn’t sound like much, but I’ll take it…

Conclusion

If you haven’t looked into new hardware in a while, or are stuck doing kernel development on a laptop, please seriously consider upgrading. The power available in a small desktop tower these days (and who is traveling anymore that needs a laptop?) is well worth it if at all possible.

Again, many thanks to Level1Techs for the hardware, it’s been put to very good use.