linux kernel monkey log

First off, I am a Novell employee, but I do not speak for the company at all. The following is my personal opinion, though it is one held by a number of different people at different companies...

I've been hearing from a lot of different companies and users about how the current "enterprise" Linux distros manage their kernels and the problems they are having with them. As with all products that bundle up an ever-moving target such as an open source project, there are trade-offs, and for the past five years or so the two big Linux distros, SUSE and Red Hat, have been trying to walk the line between stability and new features, and doing so quite well.

But now that we have been living with these models for a few years, a number of problems with the way things currently work have come to light. This document describes how the current kernels are managed in these two distros, the problems companies and users are having with them, and three proposed solutions.

Current Model of kernels

An "enterprise" Linux kernel as packaged by Red Hat and SUSE starts as a kernel.org kernel taken from a specific point in time. It is then tested, tuned, stabilized, packaged up, and supported for what is usually a long period of time (5-9 years, depending on the product). After the product is first released, a series of "service packs" or "maintenance updates" (from now on called "major updates") are released, about one every 8-18 months depending on the vendor and how long the product has been on the market.

During the time between these major updates, the distro works to ensure that any bugfix or security update that happens will not break any of the current kernel API or internally visible ABI. It does this so that third-party kernel modules will not need to change or be "recertified" because of these updates.

Currently both distros provide a way for third-party module creators to register with them and be notified when an internal ABI change is going to happen, so that they can respin their pre-built modules. They also provide ways for third-party modules to be easily created and built against these kernels, to ease the build issues that can arise when attempting to build and package against a distro kernel.

When a major update is released, it usually consists of the original kernel version for the product, with a wide range of new features, drivers, and other fixes backported from the currently released kernel.org kernel to the originally released kernel. This enables new hardware to be supported and new features that are requested by the distro's users and partners to be rolled into the product. This release almost always has ABI and API changes due to the new features and drivers.

When this update is released, all third-party modules must be rebuilt and sometimes reworked because of the new features and backports that were applied to the kernel. Hopefully all third-party module vendors work with the distro to be aware of the proposed changes, but sometimes there is a lag before the new modules are available for customers to use.

An example of how this works can be seen in the latest Novell SLES10 Service Pack 1 release. Originally the SLES10 kernel was based on the 2.6.16 kernel release with a number of bugfixes added to it. At the time of the Service Pack 1 release, it was still based on the 2.6.16 kernel version, but the SCSI core, libata core, and all SATA drivers were backported from the 2.6.20 kernel.org kernel release to be included in this 2.6.16 based kernel package. This changed the ABI in a number of ways, and the vendor of any external SCSI or storage driver needed to be aware of those changes when producing an updated version of their driver for the Service Pack 1 release.
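
To make the idea of an ABI change between two kernel builds concrete, here is a minimal sketch in Python that compares exported-symbol checksums. It assumes each build's exported symbols are available as "checksum symbol module" lines, similar in spirit to the kernel's Module.symvers file. The symbol names, checksum values, and helper functions below are all made up for illustration and are not taken from any real SLES10 kernel or distro tool.

```python
def parse_symvers(text):
    """Parse 'checksum symbol module' lines into a {symbol: checksum} dict."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        checksum, symbol = fields[0], fields[1]
        table[symbol] = checksum
    return table


def abi_diff(old, new):
    """Return (changed, removed) symbol lists when going from old to new."""
    changed = sorted(s for s in old if s in new and old[s] != new[s])
    removed = sorted(s for s in old if s not in new)
    return changed, removed


def affected_symbols(used_symbols, changed, removed):
    """Return the symbols a module uses that no longer match the new kernel."""
    broken = set(changed) | set(removed)
    return sorted(set(used_symbols) & broken)


# Hypothetical export tables for an initial release and a service pack.
INITIAL_RELEASE = """\
0x11111111 scsi_example_submit vmlinux
0x22222222 ata_example_probe vmlinux
0x33333333 pci_example_enable vmlinux
"""

SERVICE_PACK = """\
0xaaaaaaaa scsi_example_submit vmlinux
0x33333333 pci_example_enable vmlinux
"""

changed, removed = abi_diff(parse_symvers(INITIAL_RELEASE),
                            parse_symvers(SERVICE_PACK))

# A hypothetical storage driver that uses two of the exported symbols.
needs_rebuild = affected_symbols(
    ["scsi_example_submit", "pci_example_enable"], changed, removed)
```

In this made-up data, a driver that only used pci_example_enable would be untouched by the update, while anything calling into the changed SCSI symbol would need to be rebuilt and possibly reworked.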

Problems caused by this model

Here are some of the problems that I have heard about from customers and partners who have been working with this kind of release model for a number of years:

- Partners have to get their changes, features, and new drivers into the upstream kernel.org kernel in order for the distros to be willing to accept them. After that happens, they must then backport the feature/driver to the older vendor-specific kernel, test it to verify everything works properly, and then ask the vendor to accept the patch for their next service pack, by the deadline imposed by the vendor. This causes a lot of extra work for the partners, who have to track at least two vendor kernels as well as the upstream kernel.org tree.
- If the partner misses the release date for the next service pack, their hardware will not be supported within the whole product until 12-18 months later, when the next service pack is released.
- Partners hate working with externally available drivers, delivered through a driver disk or some other process. Reasons cited for this dislike range from user confusion about how to get access to these drivers, to security issues, to issues surrounding drivers for boot devices or network devices when doing network installs.
- Due to the changing API between each service pack, third-party vendors need to create a different module build for every major product release (RHEL3, RHEL4, etc.) as well as for every maintenance update. Because some customers do not upgrade to the latest maintenance update for various reasons, vendors are forced to support their driver on all maintenance releases, which adds up to a very large number of combinations.
- It imposes the old Unix slow release cycle onto Linux, cutting off one of the main reasons people switch to Linux in the first place.
- For machines that must work with new hardware all the time (laptops and some desktops), the 12-18 month cycle before adding new device support makes them pretty much impossible to use at times (i.e. people want you to support the latest toy they just bought from the store). This makes things like "enterprise" kernels that are directed toward desktops quite uncomfortable to use after even a single year has passed.
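
The number of module builds a third-party vendor ends up supporting grows multiplicatively with products, maintenance releases, and architectures. A rough sketch in Python of that arithmetic follows. The product names come from the text above, but the kernel counts per product and the architecture list are made-up placeholders, not actual release data.

```python
from itertools import product

# Hypothetical inputs: product -> number of kernels to support
# (the initial release plus each maintenance update still in use).
KERNELS_PER_PRODUCT = {"RHEL3": 4, "RHEL4": 3, "SLES10": 2}
ARCHITECTURES = ["i386", "x86_64"]  # assumed for illustration


def build_matrix(kernels_per_product, architectures):
    """List every (product, update index, architecture) build a vendor must ship."""
    builds = []
    for name, kernel_count in kernels_per_product.items():
        for update, arch in product(range(kernel_count), architectures):
            builds.append((name, update, arch))
    return builds


matrix = build_matrix(KERNELS_PER_PRODUCT, ARCHITECTURES)
print(len(matrix))  # 18 builds for a single driver with these placeholder numbers
```

Every additional maintenance update or architecture adds another row to this matrix for each driver the vendor ships.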

Potential solutions and pros and cons of them

Keep doing the same as before. This model has evolved over the years to the current state based on input from partners, users, developers and the companies.
- Pro: People know the model and are used to it.
- Con: See the above listed reasons why partners and third-party vendors dislike it.
Change the major update to only include kernel bugfixes and security fixes. No new features or drivers will be added to the kernel, helping to ensure that the ABI does not change. If an API has to change due to a more intrusive bugfix, this might still happen at this time, but the change is limited to a specific area, involving a small number of symbols and structures, instead of the large, sweeping changes that currently happen.
- Pro: Partners will not have to worry about the release cycle of the distro kernels, and only focus on getting their drivers working with a set kernel version, not multiple versions over the lifetime of the product.
- Con: Many new features cannot be added in this manner, as they touch core portions of the kernel and cannot be updated with external modules (ACPI and the large PCI quirk table for misbehaving hardware are two such examples).
On every major update, the kernel is updated to the latest kernel.org release, much like the consumer products are (Fedora, openSUSE, Ubuntu, Mandriva, etc.) This will ensure that any upstream update for drivers and new features will be automatically included.
- Pro: All of the latest kernel drivers and features will be automatically supported and included by the distro, enabling the Partners to focus on upstream kernel.org development and not worry about backporting things to older kernel versions. All bugfixes and security updates that the vendor has not included in their minor updates are also pulled in at this time (and there are a lot of them.)
- Con: Partners whose code is not present in kernel.org releases for whatever reason (do not want it, incompatible licenses, etc.) will have to do a bit more work in tracking the new releases, although this should be only slightly more than the development and testing that they currently do.

So, what to do? Currently this discussion is happening in the major distros and their partners. Let me know what you think the model should be for enterprise Linux distros in the future.

Do you like one of the currently proposed models above, or do you have some other model you feel would work better?

Thanks to the many different people who read earlier versions of this document and helped to form the ideas here. This includes the whole Novell/SuSE kernel team who, while they did not all agree with the ideas here, helped out immensely. A number of people at different companies also helped out but probably do not wish to be named here. I want to thank them anyway.

posted Tue, 19 Jun 2007 in [/linux]