I love Linux. Which is why, whenever there’s a new distro release and it’s less than optimal (read, horrible), a unicorn dies somewhere. And since unicorns are pretty much mythical, it tells you how bad the situation is. On a more serious note, I’ve started my autumn crop of distro testing, and the results are rather discouraging. Worse than just bad results, we get inconsistent results. This is possibly even worse than having a product that works badly. The wild emotional seesaw of love-hate, hope-despair plays havoc with users and their loyalty.
Looking back to similar tests in previous years, it’s as if nothing has changed. We’re spinning. Literally. Distro releases happen in a sort of intellectual vacuum, isolated from one another, with little to no cross-cooperation or cohesion. This got me thinking. Are there any mechanisms that could help strengthen partnership among different distro teams, so that our desktops look and behave with more quality and consistency?
Where are we now?
Remember kerneloops.org? I first mentioned this site in my Linux Kernel Crash Book back in 2011. It’s a sort of central repository that lists kernel crashes across multiple distro editions and kernel versions, acting as a unified database for problems related to the core of the operating system. To the best of my knowledge, it is the only such resource in the wider Linux world that crosses the boundaries of particular releases. There’s nothing else truly “cross-platform” in this regard.
And it’s getting worse.
The Linux desktop seems to be regressing. The latest Ubuntu is an example of time travel back to times before the enthusiastic attempt by Canonical to make the Linux desktop big. In the past, distributions had various mechanisms in place to report kernel crashes (application crash tools are still there), and upon completed installations, some of these distributions would also send a hardware report. It wasn’t perfect, but it was something. Unfortunately, we see less and less of this every day. Even kerneloops is on the blink at the moment. For a while now, the site has been undergoing maintenance, and the UI portion is not available for use. Not a good sign.
Here’s my proposal
Ranting may help keep the spirits up, but it won’t fix things. Hence, I have a few practical suggestions that may give us solid, stable desktops that do not regress in between releases. This is not a call for action, a political speech, or anything of that sort. Just an honest attempt to see more quality and consistency in the Linux desktop.
Wholly owned stack
Blameshifting is one of the most common evasion tactics in the corporate world, and it also affects Linux desktop distributions and associated software. The basic concept is: when faced with the possibility that some software code you own may be at fault (there’s a bug that needs investigating and fixing), try to blame someone else for the failure.
The best example from day-to-day life is when you call your ISP to complain about network speed. Even if you’ve done your due diligence and tested everything, they will still tell you to reboot your computer or replace your router or something. True, for a large number of people, this may be the case, but the attitude is prevalent, and then, when there are genuine issues on the vendor side, they get ignored, sidetracked or lobbed over the fence.
Linux distributions are not wholly owned stacks – they are a combination of some unique parts and lots of components taken from other places (let’s call it upstream). Ideally, the distribution team would own everything and be accountable for what happens from the app to the kernel and back. Alas, this is not the case, and so, if you do report a problem, it will often be relegated to the “owner” of the tool, which could be a library developer halfway across the world, or a part-time student volunteer with no desire to meddle with or change their code. If you’re lucky, you’ll land on a component that belongs to a professional institution, including the likes of Novell, Red Hat and Canonical, in which case you can expect a resolution. Even so, there’s no guarantee.
Looking at the enterprise world, we can see that the full-stack model is obeyed and followed religiously, with companies offering full support (for what they own) for many years, sometimes even more than a decade. The stack isn’t big, but whatever is inside it gets proper ownership and accountability. This works beautifully, and it’s one of the reasons why Linux has found a happy home in many a multi-billion-dollar organization.
The Linux desktop does not have this luxury – and the problems users face and report rely mostly on chance for being resolved. The blameshifting syndrome (it’s a syndrome because it’s a human emotional condition) also takes form in the burden of proof (innocent until proven guilty) resting with the user. In other words, not only do users have to trouble themselves with reporting issues, often a tedious manual process, they are challenged by product owners with no direct accountability or affiliation to actually prove that the code is faulty in some way. Users end up running tests, collecting logs and spending a lot of time trying to act as an intermediary between bad user experience and software development. This is not a healthy or efficient process in any way.
The start to unraveling this problem rests in creating fully owned software stacks. This brings a whole new range of other issues with it. For instance, if more than one distribution decides to ship a particular application in its bundle, who gets to be the owner? Say Firefox. It is the default browser in most Linux distributions, but none of them actually owns the code. So if there’s a browser bug, what happens then?
The wrong answer is a non-deterministic one. The right one – if we follow the wholly owned stack model – is that distro owners are responsible for issues with the bundled software, even if they did not write the code themselves. Hence, Firefox (or Mozilla) only comes into play after the distro team takes ownership of any possible issues, and handles them through back channels with the relevant parties. The important thing is, from the user perspective, there’s only one address, and that’s the distribution team.
In the long run, this will help distribution teams weed out software that gets no active development, and even more importantly, no active maintenance, and narrow down the choice to products that continuously improve through their life cycle. Of course, there’s the philosophical question of SLA, support time frames, expected results, and more, but that’s not something that can be easily solved. Not the topic here.
Today, error reporting on the desktop takes two forms: 1) for problems that are deterministic, usually application crashes that generate a crash core and/or trigger a signal that lets the system know something has gone wrong, there’s a semi-automated process of sending the report back to the vendor; 2) for issues that are not perceived as problems by the system (like visual bugs, odd behavior, slowness), users need to manually report these to whoever they feel is the right address (the vendor, the app developer, both, or someone else entirely), and then go through the blameshifting tribulation and an error collection exercise that is entirely manual.
The problem with problem reporting is that the process is not reliable. Users are NOT reliable. There’s an almost infinite variance among users in terms of their skills, needs, usage patterns, abilities, patience, problem solving, and dozens of other variables that determine what will eventually be reported. In a wholly owned stack, you cannot rely on random goodwill to fix issues and/or produce consistent results, because the errors themselves will come with an inconsistent flair. The human factor must be removed from the equation.
Automated and intelligent reporting tools
I will now try to elaborate on several mechanisms that I believe should be present in every operating system, and should be active at all times. Yes, this touches on concepts like telemetry, privacy and tracking, which is a sensitive spot, but it is possible to do this in a completely anonymous way, with zero personal information, and still achieve the desired results.
Kernel crash collection
This is a first must. Kdump should be automatically configured on every system – without asking the user to set up or change anything. Since this is nerdy territory, there should be no expectations that users understand the importance of the process. And as I’ve outlined in my 2012 Linux Journal article and 2014 LinuxCon presentation on this topic, it is possible to report kernel problems in a very simple and fast way. Effectively, it takes a single string (exception RIP + offset) to generate a unique kernel crash identifier and send it to the vendor. It contains zero personally identifiable information, and it’s a great starting point to allow initial investigation.
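To make the single-string idea less abstract, here is a minimal Python sketch of how such an identifier could be derived from the RIP line of an oops. The function names, the regular expression and the choice to mix the kernel version into the hash are assumptions made for illustration; this is not how kdump or any existing reporting tool does it.

```python
import hashlib
import re

# Matches the "symbol+0xoffset/0xsize" part of a RIP line in a kernel oops.
# The lazy prefix skips the segment selector and any bracketed addresses.
RIP_PATTERN = re.compile(r"RIP:.*?([A-Za-z_][\w.]*)\+(0x[0-9a-f]+)/0x[0-9a-f]+")


def crash_signature(oops_text):
    """Return 'symbol+0xoffset' from the first RIP line, or None if absent."""
    match = RIP_PATTERN.search(oops_text)
    if match is None:
        return None
    return f"{match.group(1)}+{match.group(2)}"


def crash_id(oops_text, kernel_version):
    """Hash the signature and kernel version (an assumption) into a short ID."""
    signature = crash_signature(oops_text)
    if signature is None:
        return None
    digest = hashlib.sha256(f"{kernel_version}:{signature}".encode("utf-8"))
    return digest.hexdigest()[:16]
```

Only the symbol name, the offset and the kernel version go into the identifier, so nothing from user memory or the file system ends up in the report.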
Now, let’s get innovative. We expand the error reporting to include some additional basic pointers, and then, if we do this in a central manner, we can start deriving some intelligence from the data, figuring out types of hardware versus crash reports, kernel versions versus crash reports, and many other dimensions. With a smart backend to continuously monitor and alert on the global Linux kernel behavior, we can then focus on fixing critical issues, working with hardware vendors, and more.
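As a rough sketch of what "deriving intelligence" could mean on the backend, the snippet below counts anonymous crash reports along one dimension and ranks the most frequent combinations. The report fields (hardware, kernel, crash_id) and the function names are hypothetical; a real backend would need a proper schema and storage.

```python
from collections import Counter


def crashes_by(reports, dimension):
    """Count crash reports per value of one dimension, e.g. 'kernel'."""
    return Counter(report[dimension] for report in reports)


def worst_combinations(reports, limit=3):
    """Return the most frequent (hardware, kernel, crash_id) combinations."""
    combos = Counter(
        (report["hardware"], report["kernel"], report["crash_id"])
        for report in reports
    )
    return combos.most_common(limit)
```

Even something this simple would show whether a given crash clusters around one hardware model or one kernel version, which is the kind of question no single distro tracker can answer today.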
Application crash reporting is the second must, and it’s already well covered. But even so, today, users get an alert from application reporting tools. Gnome, Plasma, you name it. The issue is, it’s a hassle and a distraction. Then, users need to manually approve these reports. Sometimes, distros also need to download symbols, and there might be another step or three in the process. Unnecessary.
It is possible to do all these steps in the background – and perform error reporting in an elegant way that does not interfere with what the user is doing. Since privacy is important, sending complete cores is out of the question as they may contain memory pages with sensitive data, but the trace itself is already more than what you’d get from a lazy user choosing not to do anything with the problem.
Let me give you an example. On my LG RD510 laptop, it is impossible to resume from suspend. This has to do with the AHCI mode for ATA drives, which get link resets, and effectively, you must cold-boot the system to get back to a working desktop. In this case, automated tools won’t really work because the underlying disk is no longer recognized, and it is impossible to write logs. Also, the condition does not trigger kexec and start a new kernel, as happens in the case of a kernel crash – although this is a viable possibility.
However, what the system COULD do is track the state of its actions. Determinism is the key. Validation of input and output. The golden rules of software. If an operation starts, it should end. And if it does not end, it means we have a problem. Technically, something like this could be described in the following way:
- The laptop suspend action is activated in some way
- The system writes a file to the disk (atomically) – could be an empty file named SUSPEND
- On resume, the system checks for the presence of the file, deletes it and writes a new one called RESUME_OK
Let’s go back to my laptop and its resume issues. When the system boots, it can run a series of basic health checks to see whether the previous session had completed gracefully (reset versus reboot, for example). One of the checks may be the presence of REBOOT, SHUTDOWN or SUSPEND placeholders to figure out how the last session ended. The presence of a SUSPEND file in a freshly booted session might indicate that the system did not resume correctly. This may not tell us what had happened, but we would at least know that the resume was not successful.
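The marker-file steps and the boot-time check could be sketched like this. It is a minimal illustration under stated assumptions: the function names and the state directory argument are hypothetical, Python 3.8 or later is assumed, and a real implementation would hook into the power management service rather than be called by hand.

```python
import os
from pathlib import Path

SUSPEND = "SUSPEND"
RESUME_OK = "RESUME_OK"


def _create_marker(state_dir, name):
    """Create an empty marker file and flush it, and its directory, to disk."""
    fd = os.open(Path(state_dir) / name, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # Flush the directory entry too, so the marker survives a hard power-off.
    dir_fd = os.open(state_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def on_suspend(state_dir):
    """Record that a suspend has started."""
    (Path(state_dir) / RESUME_OK).unlink(missing_ok=True)
    _create_marker(state_dir, SUSPEND)


def on_resume(state_dir):
    """Record that the matching resume completed."""
    (Path(state_dir) / SUSPEND).unlink(missing_ok=True)
    _create_marker(state_dir, RESUME_OK)


def failed_resume_at_boot(state_dir):
    """True if a SUSPEND marker survived into a freshly booted session."""
    return (Path(state_dir) / SUSPEND).exists()
```

The boot-time health check only needs to call failed_resume_at_boot() once, before anything else touches the markers.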
Enter data intelligence – collect and monitor suspend & resume information over several sessions. Likewise, we could monitor network connections, reboots, or whatever else we fancy, and then check if the system has had any interruption in its normal activity. That information (a simple bitmask of operational states) can then be reported to the vendor, and again, similar to kernel crashes, we can compare it to anonymous information like hardware, kernel versions, desktop environment versions, drivers, libraries, and more.
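The "simple bitmask of operational states" might look something like the sketch below. The flag names and the helper are assumptions for illustration; the point is only that a whole session can be summarized in one small integer with no personal data in it.

```python
from enum import IntFlag


class SessionState(IntFlag):
    """Hypothetical per-session health flags; one bit per completed action."""
    BOOT_OK = 1
    SUSPEND_OK = 2
    RESUME_OK = 4
    NETWORK_OK = 8
    SHUTDOWN_OK = 16


def missing_states(expected, observed):
    """Return the flags that were expected but never observed."""
    return SessionState(int(expected) & ~int(observed))
```

For example, a session that suspended but never resumed would report SUSPEND_OK without RESUME_OK, and missing_states() would single out the resume as the failed step.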
If privacy is really crucial, then the intelligence module (some service) could run locally, use a local and encrypted database, and run its checks without any connection to the mothership, and then potentially alert the user (or whoever) that the particular hardware type or kernel version or anything might not be compatible, or that certain actions cause problems.
The metrics could also include all the kernel modules and the associated hardware, temperature and fan sensors, CPU behavior, all the drivers, everything. It would help generate a much clearer picture of what the desktop does. Again, this can be done without any commercial angle – no personal tracking to generate custom ads, no global big brother databases of user behavior patterns. Just pure technical data.
The second side of the coin is actually NOT doing any actions on platforms that are known to be incompatible. Going back to the resume issue, if the distro “knows” that it cannot reliably suspend and resume on a particular hardware type (and whatever other combo of components causes it), then the system should prevent the suspend action until the issue is resolved.
I would actually prefer for distributions to refuse to install on “buggy” hardware rather than complete and then end up with a wonky session that cannot be reliably used. Looking back at my tests in the past decade, I had Realtek Wireless issues on my Lenovo G50 laptop, Broadcom Wireless issues on my HP laptop, Intel Wireless issues on Lenovo T400, suspend issues on the LG RD510, occasional Intel graphics problems on several machines, and the list goes on. Some of these could be resolved with tweaks – essentially modprobe fixes, which begs the question: why are these not offered/included with distributions, and if these smart tools were to exist, could such fixes be offered? But some could not. In that case, the system installer might actually give the user a choice – do they wish to proceed with an installation, knowing they could have problems that do not have known workarounds or permanent solutions?
And here, the live session can immensely help determine what gives. Most of the problems happen both in the live and installed systems, so a smart data module could compile the necessary list of known bugs and use them to determine whether the installation ought to be performed, or prompt the user. This is better than bad user experience.
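Here is one way the installer logic could be sketched: known bugs with a workaround get the fix offered automatically, and anything without one results in a prompt. The device identifiers, the shape of the known-bugs table and the verdict strings are all hypothetical.

```python
def install_decision(detected_hardware, known_bugs):
    """Return (verdict, workarounds) for hardware detected in the live session.

    The verdict is "proceed" when every matching bug has a workaround (or
    nothing matches), and "ask_user" when at least one bug has none.
    """
    workarounds = []
    unresolved = False
    for device in detected_hardware:
        bug = known_bugs.get(device)
        if bug is None:
            continue  # No known issue for this device.
        if bug.get("workaround"):
            workarounds.append(bug["workaround"])
        else:
            unresolved = True
    return ("ask_user" if unresolved else "proceed"), workarounds
```

The same lookup could also gate individual actions, like suspend, on an installed system.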
Bugzilla? WHICH Bugzilla?
Now let’s consider the option of bugs and issues being reported automatically, with this or that amount of anonymous data. Where should it go? We already know that the issues ought to be channeled to distribution owners, but then if a problem affects a component that’s present in other distributions, other teams may not see it.
I believe that individual project bugzillas – whatever they may be, a forum, a GitHub page, etc. – can remain around, but they must source their data from a central database. Imagine if all issues were uploaded to one system. We could then run intelligence reports that cross distro boundaries. We could focus on ALL Plasma or Gnome issues, not specifically Kubuntu or Antergos or Solus. And if there’s an issue in gstreamer or libmtpdevice, we could see how it maps across multiple distros or desktop environments. It is very difficult to achieve such a level of situational awareness with existing tools. The fragmentation is not just in the end product. It’s everywhere.
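A toy version of such a central database, using SQLite from the Python standard library, shows how little is needed to ask cross-distro questions. The table layout and function names are assumptions, and the data in any real system would of course come from the automated reporting tools rather than manual inserts.

```python
import sqlite3


def open_central_db():
    """Create an in-memory stand-in for the shared issue database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues (component TEXT, distro TEXT, desktop TEXT, summary TEXT)"
    )
    return conn


def report_issue(conn, component, distro, desktop, summary):
    """Store one issue, tagged with where it was seen."""
    conn.execute(
        "INSERT INTO issues VALUES (?, ?, ?, ?)",
        (component, distro, desktop, summary),
    )


def distros_affected(conn, component):
    """List every distro that has reported an issue in the given component."""
    rows = conn.execute(
        "SELECT DISTINCT distro FROM issues WHERE component = ? ORDER BY distro",
        (component,),
    )
    return [row[0] for row in rows]
```

A project lead would filter on the component they own, while anyone else could filter on the distro or the desktop environment instead.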
Project leads could then focus on fixing what they own – any which database filter will allow selecting the right data – but a central mechanism, similar to kerneloops.org, could help understand better where problems lie. And then, we might actually see TRUE COLLABORATION in the community. More people to look at issues and offer suggestions. An extra pair of eyes to validate a fix. More brain diversity. More. Period.
This could potentially reduce rivalry among teams, and minimize dependency on individual software engineers leading projects. If people leave, there would be less chance of projects going dead. People might actually develop sympathy or understanding for their colleagues and their work, and in the long run, the very fact there’s consistency in how problems are reported and seen could lead to consistency in how they are resolved. It would be chaos at first, but in the long run, it would lead to more order and quality.
Beyond 2000: Every problem must have a trail
Crossing the distribution boundaries brings in a whole new range of possibilities. If there was a way to work on problems in a unified fashion, perhaps there’s a way to work on making sure problems never happen. From a purely technical perspective, a resolved bug must never resurface. Likewise, there should be a way to test the resolution.
Every issue and its closure should become an entry in a validation suite, used to determine whether a distribution release passes QA and should be sent out into the open. I do know that distributions test their software, and usually follow the Alpha-Beta-RC model, but again, there’s a huge amount of human interaction and inconsistency in the process – which leads to inconsistency in the user experience. There’s also a significant duplication of effort, because essentially identical components are re-tested across many distros because teams work separately on their products.
But the same way there’s POSIX, there could be UI-POSIX, which goes beyond API and touches on every aspect of the desktop experience. It should be reflected in methods and procedures, and it should also mandate how software ought to be tested, released and maintained. There’s a huge element of boredom in this – maintaining code is not sexy or exciting, but it is the bread and butter of good, high-quality products. Fixing bugs is far more important than writing a new version of a faulty program. The current model does not align well with this, as most of the user-space code is written by maverick individuals. But we have the kernel as an example of immense commercial success – with 80% of contributions made by corporations and enterprises.
We can start small – a validation suite that tests ALL distributions in the same manner, regardless of their desktop environment and app stack. The suite includes deterministic tests and covers all and every bug ever seen and found and resolved in the past, linking to one database. The next step is aligning validation procedures, based on the obvious discrepancies that the validation suite will discover.
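A bare-bones sketch of such a suite: every resolved bug registers one deterministic check, and the same checks run against any distribution. The bug IDs, the check functions and the dictionary that stands in for a system under test are all hypothetical.

```python
REGRESSION_CHECKS = {}


def regression_check(bug_id):
    """Register a check function under the ID of the bug it guards against."""
    def register(func):
        REGRESSION_CHECKS[bug_id] = func
        return func
    return register


@regression_check("BUG-0001")
def resume_completes(system):
    return system.get("resume_ok", False)


@regression_check("BUG-0002")
def wifi_connects(system):
    return system.get("wifi_connects", False)


def failing_checks(system):
    """Return the sorted IDs of resolved bugs that have resurfaced."""
    return sorted(
        bug_id for bug_id, check in REGRESSION_CHECKS.items() if not check(system)
    )


def release_passes_qa(system):
    """A release passes only if no previously resolved bug is back."""
    return not failing_checks(system)
```

Because the checks know nothing about the desktop environment or app stack, the same registry can be run unchanged against every distribution, and the list of failing IDs points straight back to entries in the shared database.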
In the end, we could still have color and variety, but products should be developed to the same high standards. Think automotive industry. You get tons of car manufacturers and models, but they still adhere to many strict standards that define what a car should do, and in the end, you essentially get a consistent driving experience – I don’t mean that subjectively – the cars all behave the same. Linux distributions could potentially achieve the same result, in that they all give the user the same essential experience, even if details may vary. But there are many fundamental aspects that cannot just be arbitrarily chosen or ignored the way we have it today. Take a look at any five distro reviews I’ve done in the past couple of months and you will see what I mean. Heck, just compare Ubuntu Aardvark to Kubuntu Aardvark!
Once we get to this common language, the next step is to maintain and evolve it. The presence of automated system health tools will help ensure that distributions continue to offer a consistent experience after their release. It’s not just the GA release that matters. There will be new code introduced with every update, and that must also work perfectly and WITHOUT regressions.
My Open Linux idea is not the first nor the last benevolent thought to have sparked in the community in the past two decades. I know there have been attempts to bridge some gaps, but there has not been a comprehensive approach to the whole fragmentation mess that is plaguing the desktop world. I truly believe that the Linux desktop can become better, but it requires a unified baseline. At the moment, the foundations are weak. And the proof is in the pudding. The Linux desktop remains a marginal concept with barely any market share, low quality, and constant regressions that keep it from becoming a serious contender against the established players.
The notion of community is mostly a bittersweet dream. Linux has succeeded by having strict hierarchy and a strong commercial model. Look at Android. There’s another example. The desktop as we know it lacks these fundamentals. Perhaps we don’t want it to become a commercial entity, and that’s fine, but it does have to become a professional product. And to be a professional, you must act like one. It starts by having a first-class understanding of the system and being fully in control of its failings. We don’t have that at the moment. Linux distros may be open-source, but they are not truly open. Hopefully, this article could be a beginning of that change. Food for thought.
Cover image: Courtesy of nasa.gov, in public domain.