
Sign up to save your podcasts
Or
Looking at Lumina Desktop 2.0, 2 months of KPTI development in SmartOS, OpenBSD email service, an interview with Ryan Zezeski, NomadBSD released, and John Carmack's programming retreat with OpenBSD.
A few weeks ago I sat down with Lead Developer Ken Moore of the TrueOS Project to get answers to some of the most frequently asked questions about Lumina Desktop from the open source community. Here is what he said on Lumina Desktop 2.0. Do you have a question for Ken and the rest of the team over at the TrueOS Project? Make sure to read the interview and comment below. We are glad to answer your questions!
Ken: Lumina Desktop 2.0 is a significant overhaul compared to Lumina 1.x. Almost every single subsystem of the desktop has been streamlined, resulting in a nearly-total conversion in many important areas.
With Lumina Desktop 2.0 we will finally achieve our long-term goal of turning Lumina into a complete, end-to-end management system for the graphical session and removing all the current runtime dependencies from Lumina 1.x (Fluxbox, xscreensaver, compton/xcompmgr). The functionality from those utilities is now provided by Lumina Desktop itself.
Going along with the session management changes, we have compressed the entire desktop into a single, multi-threaded binary. This means that if any rogue script or tool starts trying to muck about with the memory used by the desktop (probably even more relevant now than when we started working on this), the entire desktop session will close/crash rather than allowing targeted application crashes to bypass the session security mechanisms. By the same token, this also prevents man-in-the-middle type of attacks because the desktop does not use any sort of external messaging system to communicate (looking at you dbus). This also gives a large performance boost to Lumina Desktop
The entire system for how a users settings get saved and loaded has been completely redone, making it a layered settings system which allows the default settings (Lumina) to get transparently replaced by system settings (OS/Distributor/SysAdmin) which can get replaced by individual user settings. This results in the actual changes in the user setting files to be kept to a minimum and allows for a smooth transition between updates to the OS or Desktop. This also provides the ability to restrict a users desktop session (based on a system config file) to the default system settings and read-only user sessions for certain business applications.
The entire graphical interface has been written in QML in order to fully-utilize hardware-based GPU acceleration with OpenGL while the backend logic and management systems are still written entirely in C++. This results in blazing fast performance on the backend systems (myriad multi-threaded C++ objects) as well as a smooth and responsive graphical interface with all the bells and whistles (drag and drop, compositing, shading, etc).
While I have never tried out Lumina in a MAC jail, I do not see anything on that page which should stop it from running in one right now. Lumina is already designed to be run as an unpriviledged user and is very smart about probing the system to find out what is/not available before showing anything to the user. The only thing that comes to mind is that you might need to open up some other system devices so that X11 itself can draw to the display (graphical environment setup is a bit different than CLI environment).
If I recall correctly, one of the very first versions of Lumina (pre-1.0) would occasionally flicker. If that is still happening, you might want to verify that you are using the proper video driver for your hardware and/or enable the compositor within the Lumina settings.
This was a common question about 4(?) years ago with the first release of the Lumina desktop and it basically boiled down to long-term support and reliability of the underlying toolkit. Some of the things we had to consider were: cross-platform/cross-architecture support, dependency reliability and support framework (Qt5 > EFL), and runtime requirements and dependency tracking (Qt5 is lighter than the EFL). That plus the fact that the EFL specifically states that it is linux-focused and the BSDs are just an afterthought (especially at the time we were doing the evaluation).
Like many people who work on system security, I had read Anders Fogh's post about a "Negative Result" in speculative execution research in July of 2017. At the time I thought it was an interesting writeup and I remember being glad that researchers were looking into this area. I sent the post to Bryan and asked him about his thoughts on it at the time, to which he replied saying that "it would be shocking if they left a way to directly leak out memory in the speculative execution". None of us seriously thought that there would be low-hanging fruit down that research path, but we also felt it was important that there was someone doing work in the area who was committed to public disclosure.
At first, after reading the blog post on Monday, we thought (or hoped) that the bug might "just" be a KASLR bypass and wouldn't require a lot of urgency. We tried to reach out to Intel at work to get more information but were met with silence. (We wouldn't hear back from them until after the disclosure was already made public.) The speculation on Tuesday intensified, until finally on Wednesday morning I arrived at the office to find links to late Tuesday night tweets revealing exploits that allowed arbitrary kernel memory reads.
Wednesday was not a happy day. Intel finally responded to our emails -- after they had already initiated public disclosure. We all spent a lot of time reading. An arbitrary kernel memory read (an info leak) is not that uncommon as far as bugs go, but for the most part they tend to be fairly easy to fix. The thing that makes the Meltdown and Spectre bugs particularly notable is that in order to mitigate them, a large amount of change is required in very deep low-level parts of the kernel. The kind of deep parts of the kernel where there are 20-year old errata workarounds that were single-line changes that you have to be very careful to not accidentally undo; the kind of parts where, as they say, mortals fear to tread.
On Friday we saw the patches Matthew Dillon put together for DragonFlyBSD for the first time. These were the first patches for KPTI that were very straightforward to read and understand, and applied to a BSD-derived kernel that was similar to those I'm accustomed to working on.
To mitigate Meltdown (and partially one of the Spectre variants), you have to make sure that speculative execution cannot reach any sensitive data from a user context. This basically means that the pages the kernel uses for anything potentially sensitive have to be unmapped when we are running user code. Traditionally, CPUs that were built to run a multi-user, UNIX-like OS did this by default (SPARC is an example of such a CPU which has completely separate address spaces for the kernel and userland). However, x86 descends from a single-address-space microcontroller that has grown up avoiding backwards-incompatible changes, and has never really introduced a clean notion of multiple address spaces (segmentation is the closest feature really, and it was thrown out for 64-bit AMD64). Instead, operating systems for x86 have generally wound up (at least in the post-AMD64 era) with flat address space models where the kernel text and data is always present in the page table no matter whether you're in user or kernel mode. The kernel mappings simply have the "supervisor" bit set on them so that user code can't directly access them.
The mitigation is basically to stop doing this: to stop mapping the kernel text, data and other memory into the page table while we're running in userland. Unfortunately, the x86 design does not make this easy. In order to be able to take interrupts or traps, the CPU has to have a number of structures mapped in the current page table at all times. There is also no ability to tell an x86 CPU that you want it to switch page tables when an interrupt occurs. So, the code that we jump to when we take an interrupt, as well as space for a stack to push context onto have to be available in both page tables. And finally, of course, we need to be able to figure out somehow what the other page table we should switch to is when we enter the kernel.
When we looked at the patches for Linux (and also the DragonFlyBSD patches at the time) on Friday and started asking questions, it became pretty evident that the initial work done by both was done under time constraints. Both had left the full kernel text mapped in both page tables, and the Linux trampoline design seemed over-complex. I started talking over some ideas with Robert Mustacchi about ways to fix these and who we should talk to, and reached out to some of my old workmates from the University of Queensland who were involved with OpenBSD. It seemed to me that the OpenBSD developers would care about these issues even more than we did, and would want to work out how to do the mitigation right.
I ended up sending an email to Philip Guenther on Friday afternoon, and on Saturday morning I drove an hour or so to meet up with him for coffee to talk page tables and interrupt trampolines. We wound up spending a good 6 hours at the coffee shop, and I came back with several pages of notes and a half-decent idea of the shape of the work to come.
One detail we missed that day was the interaction of per-CPU structures with per-process page tables. Much of the interrupt trampoline work is most easily done by using per-CPU structures in memory (and you definitely want a per-CPU stack!). If you combine that with per-process page tables, however, you have a problem: if you leave all the per-CPU areas mapped in all the processes, you will leak information (via Meltdown) about the state of one process to a different one when taking interrupts. In particular, you will leak things like %rip, which ruins all the work being done with PIE and ASLR pretty quickly. So, there are two options: you can either allocate the per-CPU structures per-process (so you end up with $NCPUS * $NPROCS of them); or you can make the page tables per-CPU.
OpenBSD, like Linux and the other implementations so far, decided to go down the road of per-CPU per-process pages to solve this issue. For illumos, we took the other route.
In illumos, it turned out that we already had per-CPU page tables. Robert and I re-discovered this on the Sunday of that week. We use them for 32-bit processes due to having full P>V PAE support in our kernel (which is, as it turns out, relatively uncommon amongst open-source OS). The logic to deal with creating and managing them and updating them was all already written, and after reading the code we concluded we could basically make a few small changes and re-use all of it. So we did.
By the end of that second week, we had a prototype that could get to userland. But, when working on this kind of kernel change we have a rule of thumb we use: after the first 70% of the patch is done and we can boot again, now it's time for the second 70%. In fact it turned out to be more like the second 200% for us -- a tedious long tail of bugs to solve that ended up necessitating some changes in the design as well.
At first we borrowed the method that Matt Dillon used for DragonFlyBSD, by putting the temporary "stack" space and state data for the interrupt trampolines into an extra page tacked onto the end of *%gs (in illumos the structure that lives there is the cpu_t).
If you read the existing logic in interrupt handlers for dealing with %gs though, you will quickly notice that the corner cases start to build up. There are a bunch of situations where the kernel temporarily alters %gs, and some of the ways to mess it up have security consequences that end up being worse than the bug we're trying to fix. As it turns out, there are no less than 3 different ways that ISRs use to try to get to having the right cpu_t in %gs on illumos, as it turns out, and they are all subtly different. Trying to tell which you should use when requires a bunch of test logic that in turn requires branches and changes to the CPU state, which is difficult to do in a trampoline where you're trying to avoid altering that state as much as possible until you've got the real stack online to push things into.
I kept in touch with Philip Guenther and Mike Larkin from the OpenBSD project throughout the weeks that followed. In one of the discussions we had, we talked about the NMI/MCE handlers and the fact that their handling currently on OpenBSD neglected some nasty corner-cases around interrupting an existing trap handler. A big part of the solution to those issues was to use a feature called IST, which allows you to unconditionally change stacks when you take an interrupt.
Traditionally, x86 only changes the stack pointer (%rsp on AMD64) while taking an interrupt when there is a privilege level change. If you take an interrupt while already in the kernel, the CPU does not change the stack pointer, and simply pushes the interrupt stack frame onto the stack you're already using. IST makes the change of stack pointer unconditional. If used unwisely, this is a bad idea: if you stay on that stack and turn interrupts back on, you could take another interrupt and clobber the frame you're already in. However, in it I saw a possible way to simplify the KPTI trampoline logic and avoid having to deal with %gs.
A few weeks into the project, John Levon joined us at work. He had previously worked on a bunch of Xen-related stuff as well as other parts of the kernel very close to where we were, so he quickly got up to speed with the KPTI work as well. He and I drafted out a "crazy idea" on the whiteboard one afternoon where we would use IST for all interrupts on the system, and put the "stack" they used in the KPTI page on the end of the cpu_t. Then, they could easily use stack-relative addresses to get the page table to change to, then pivot their stack to the real kernel stack memory, and throw away (almost) all the conditional logic. A few days later, we had convinced each other that this was the way to go.
Two of the most annoying x86 issues we had to work around were related to the SYSENTER instruction. This instruction is used to make "fast" system calls in 32-bit userland. It has a couple of unfortunate properties: firstly, it doesn't save or restore RFLAGS, so the kernel code has to take care of this (and be very careful not to clobber any of it before saving or after restoring it). Secondly, if you execute SYSENTER with the TF ("trap"/single-step flag) set by a debugger, the resulting debug trap's frame points at kernel code instead of the user code where it actually happened. The first one requires some careful gymnastics on the entry and return trampolines specifically for SYSENTER, while the second is a nasty case that is incidentally made easier by using IST. With IST, we can simply make the debug trap trampoline check for whether we took the trap in another trampoline's code, and reset %cr3 and the destination stack. This works for single-stepping into any of the handlers, not just the one for SYSENTER.
To make debugging easier, we decided that traps like the debug/single-step trap (as well as faults like page faults, #GP, etc.) would push their interrupt frame in a different part of the KPTI state page to normal interrupts. We applied this change to all the traps that can interrupt another trampoline (based on the instructions we used). These "paranoid" traps also set a flag in the KPTI struct to mark it busy (and jump to the double-fault handler if it is), to work around some bugs where double-faults are not correctly generated.
It's been a long and busy two months, with lots of time spent building, testing, and validating the code. We've run it on as many kinds of machines as we could get our hands on, to try to make sure we catch issues. The time we've spent on this has been validated several times in the process by finding bugs that could have been nasty in production.
One great example: our patches on Westmere-EP Xeons were causing busy machines to throw a lot of L0 I-cache parity errors. This seemed very mysterious at first, and it took us a few times seeing it to believe that it was actually our fault. This was actually caused by the accidental activation of a CPU errata for Westmere (B52, "Memory Aliasing of Code Pages May Cause Unpredictable System Behaviour") -- it turned out we had made a typo and put the "cacheable" flag into a variable named flags instead of attrs where it belonged when setting up the page tables. This was causing performance degradation on other machines, but on Westmere it causes cache parity errors as well. This is a great example of the surprising consequences that small mistakes in this kind of code can end up having. In the end, I'm glad that that erratum existed, otherwise it may have been a long time before we caught that bug.
As of this week, Mike and Philip have committed the OpenBSD patches for KPTI to their repository, and the patches for illumos are out for review. It's a nice kind of symmetry that the two projects who started on the work together after the public disclosure at the same time are both almost ready to ship at the same time at the other end. I'm feeling hopeful, and looking forward to further future collaborations like this with our cousins, the BSDs.
By design, email message headers need to be public, for exchanges to happen. The body of the message can be encrypted by the user, if desired. Moreover, there is no way to prevent the host from having access to the virtual machine. Therefore, full disk encryption (at rest) may not be necessary.
Given our low memory requirements, and the single-purpose concept of email service, Roundcube or other web-based IMAP email clients should be on a different VPS.
Antivirus software users (usually) have the service running on their devices. ClamAV can easily be incorporated into this configuration, if affected by the types of malware it protects against, but will require around 1GB additional RAM (or another VPS).
Every email message is important, if properly delivered, for Bayes classification. At least 200 ham and 200 spam messages are required to learn what one considers junk. By default (change to use case), a rspamd score above 50% will send the message to Spam/. Moving messages in and out of Spam/ changes this score. After 95%, the message is flagged as "seen" and can be safely ignored.
Spamd is effective at greylisting and stopping high volume spam, if it becomes a problem. It will be an option when IPv6 is supported, along with bgp-spamd.
System mail is delivered to an alias mapped to a virtual user served by the service. This way, messages are guaranteed to be delivered via encrypted connection. It is not possible for real users to alias, nor mail an external mail address with the default configuration. e.g. [email protected] is wheel, with an alias mapped to (virtual) [email protected], and user (puffy) can be different for each.
After a several year gap, I finally took another week-long programming retreat, where I could work in hermit mode, away from the normal press of work. My wife has been generously offering it to me the last few years, but Im generally bad at taking vacations from work.
Recently, Theo de Raadt (deraadt@) described a new type of mitigation he has been working on together with Stefan Kempf (stefan@):
This is now available in snapshots, and people are finding the first problems in the ports tree already. So far, few issues have been uncovered, but as Theo points out, more testing is necessary:
Fairly good results.
So, everybody, install the latest snapshot and try all your favorite ports. This is the time to report issues you find, so there is a good chance this additional security feature is present in 6.3 (and works with third party software from packages).
4.9
8989 ratings
Looking at Lumina Desktop 2.0, 2 months of KPTI development in SmartOS, OpenBSD email service, an interview with Ryan Zezeski, NomadBSD released, and John Carmack's programming retreat with OpenBSD.
A few weeks ago I sat down with Lead Developer Ken Moore of the TrueOS Project to get answers to some of the most frequently asked questions about Lumina Desktop from the open source community. Here is what he said on Lumina Desktop 2.0. Do you have a question for Ken and the rest of the team over at the TrueOS Project? Make sure to read the interview and comment below. We are glad to answer your questions!
Ken: Lumina Desktop 2.0 is a significant overhaul compared to Lumina 1.x. Almost every single subsystem of the desktop has been streamlined, resulting in a nearly-total conversion in many important areas.
With Lumina Desktop 2.0 we will finally achieve our long-term goal of turning Lumina into a complete, end-to-end management system for the graphical session and removing all the current runtime dependencies from Lumina 1.x (Fluxbox, xscreensaver, compton/xcompmgr). The functionality from those utilities is now provided by Lumina Desktop itself.
Going along with the session management changes, we have compressed the entire desktop into a single, multi-threaded binary. This means that if any rogue script or tool starts trying to muck about with the memory used by the desktop (probably even more relevant now than when we started working on this), the entire desktop session will close/crash rather than allowing targeted application crashes to bypass the session security mechanisms. By the same token, this also prevents man-in-the-middle type of attacks because the desktop does not use any sort of external messaging system to communicate (looking at you dbus). This also gives a large performance boost to Lumina Desktop
The entire system for how a users settings get saved and loaded has been completely redone, making it a layered settings system which allows the default settings (Lumina) to get transparently replaced by system settings (OS/Distributor/SysAdmin) which can get replaced by individual user settings. This results in the actual changes in the user setting files to be kept to a minimum and allows for a smooth transition between updates to the OS or Desktop. This also provides the ability to restrict a users desktop session (based on a system config file) to the default system settings and read-only user sessions for certain business applications.
The entire graphical interface has been written in QML in order to fully-utilize hardware-based GPU acceleration with OpenGL while the backend logic and management systems are still written entirely in C++. This results in blazing fast performance on the backend systems (myriad multi-threaded C++ objects) as well as a smooth and responsive graphical interface with all the bells and whistles (drag and drop, compositing, shading, etc).
While I have never tried out Lumina in a MAC jail, I do not see anything on that page which should stop it from running in one right now. Lumina is already designed to be run as an unpriviledged user and is very smart about probing the system to find out what is/not available before showing anything to the user. The only thing that comes to mind is that you might need to open up some other system devices so that X11 itself can draw to the display (graphical environment setup is a bit different than CLI environment).
If I recall correctly, one of the very first versions of Lumina (pre-1.0) would occasionally flicker. If that is still happening, you might want to verify that you are using the proper video driver for your hardware and/or enable the compositor within the Lumina settings.
This was a common question about 4(?) years ago with the first release of the Lumina desktop and it basically boiled down to long-term support and reliability of the underlying toolkit. Some of the things we had to consider were: cross-platform/cross-architecture support, dependency reliability and support framework (Qt5 > EFL), and runtime requirements and dependency tracking (Qt5 is lighter than the EFL). That plus the fact that the EFL specifically states that it is linux-focused and the BSDs are just an afterthought (especially at the time we were doing the evaluation).
Like many people who work on system security, I had read Anders Fogh's post about a "Negative Result" in speculative execution research in July of 2017. At the time I thought it was an interesting writeup and I remember being glad that researchers were looking into this area. I sent the post to Bryan and asked him about his thoughts on it at the time, to which he replied saying that "it would be shocking if they left a way to directly leak out memory in the speculative execution". None of us seriously thought that there would be low-hanging fruit down that research path, but we also felt it was important that there was someone doing work in the area who was committed to public disclosure.
At first, after reading the blog post on Monday, we thought (or hoped) that the bug might "just" be a KASLR bypass and wouldn't require a lot of urgency. We tried to reach out to Intel at work to get more information but were met with silence. (We wouldn't hear back from them until after the disclosure was already made public.) The speculation on Tuesday intensified, until finally on Wednesday morning I arrived at the office to find links to late Tuesday night tweets revealing exploits that allowed arbitrary kernel memory reads.
Wednesday was not a happy day. Intel finally responded to our emails -- after they had already initiated public disclosure. We all spent a lot of time reading. An arbitrary kernel memory read (an info leak) is not that uncommon as far as bugs go, but for the most part they tend to be fairly easy to fix. The thing that makes the Meltdown and Spectre bugs particularly notable is that in order to mitigate them, a large amount of change is required in very deep low-level parts of the kernel. The kind of deep parts of the kernel where there are 20-year old errata workarounds that were single-line changes that you have to be very careful to not accidentally undo; the kind of parts where, as they say, mortals fear to tread.
On Friday we saw the patches Matthew Dillon put together for DragonFlyBSD for the first time. These were the first patches for KPTI that were very straightforward to read and understand, and applied to a BSD-derived kernel that was similar to those I'm accustomed to working on.
To mitigate Meltdown (and partially one of the Spectre variants), you have to make sure that speculative execution cannot reach any sensitive data from a user context. This basically means that the pages the kernel uses for anything potentially sensitive have to be unmapped when we are running user code. Traditionally, CPUs that were built to run a multi-user, UNIX-like OS did this by default (SPARC is an example of such a CPU which has completely separate address spaces for the kernel and userland). However, x86 descends from a single-address-space microcontroller that has grown up avoiding backwards-incompatible changes, and has never really introduced a clean notion of multiple address spaces (segmentation is the closest feature really, and it was thrown out for 64-bit AMD64). Instead, operating systems for x86 have generally wound up (at least in the post-AMD64 era) with flat address space models where the kernel text and data is always present in the page table no matter whether you're in user or kernel mode. The kernel mappings simply have the "supervisor" bit set on them so that user code can't directly access them.
The mitigation is basically to stop doing this: to stop mapping the kernel text, data and other memory into the page table while we're running in userland. Unfortunately, the x86 design does not make this easy. In order to be able to take interrupts or traps, the CPU has to have a number of structures mapped in the current page table at all times. There is also no ability to tell an x86 CPU that you want it to switch page tables when an interrupt occurs. So, the code that we jump to when we take an interrupt, as well as space for a stack to push context onto have to be available in both page tables. And finally, of course, we need to be able to figure out somehow what the other page table we should switch to is when we enter the kernel.
When we looked at the patches for Linux (and also the DragonFlyBSD patches at the time) on Friday and started asking questions, it became pretty evident that the initial work done by both was done under time constraints. Both had left the full kernel text mapped in both page tables, and the Linux trampoline design seemed over-complex. I started talking over some ideas with Robert Mustacchi about ways to fix these and who we should talk to, and reached out to some of my old workmates from the University of Queensland who were involved with OpenBSD. It seemed to me that the OpenBSD developers would care about these issues even more than we did, and would want to work out how to do the mitigation right.
I ended up sending an email to Philip Guenther on Friday afternoon, and on Saturday morning I drove an hour or so to meet up with him for coffee to talk page tables and interrupt trampolines. We wound up spending a good 6 hours at the coffee shop, and I came back with several pages of notes and a half-decent idea of the shape of the work to come.
One detail we missed that day was the interaction of per-CPU structures with per-process page tables. Much of the interrupt trampoline work is most easily done by using per-CPU structures in memory (and you definitely want a per-CPU stack!). If you combine that with per-process page tables, however, you have a problem: if you leave all the per-CPU areas mapped in all the processes, you will leak information (via Meltdown) about the state of one process to a different one when taking interrupts. In particular, you will leak things like %rip, which ruins all the work being done with PIE and ASLR pretty quickly. So, there are two options: you can either allocate the per-CPU structures per-process (so you end up with $NCPUS * $NPROCS of them); or you can make the page tables per-CPU.
OpenBSD, like Linux and the other implementations so far, decided to go down the road of per-CPU per-process pages to solve this issue. For illumos, we took the other route.
In illumos, it turned out that we already had per-CPU page tables. Robert and I re-discovered this on the Sunday of that week. We use them for 32-bit processes due to having full P>V PAE support in our kernel (which is, as it turns out, relatively uncommon amongst open-source OS). The logic to deal with creating and managing them and updating them was all already written, and after reading the code we concluded we could basically make a few small changes and re-use all of it. So we did.
By the end of that second week, we had a prototype that could get to userland. But, when working on this kind of kernel change we have a rule of thumb we use: after the first 70% of the patch is done and we can boot again, now it's time for the second 70%. In fact it turned out to be more like the second 200% for us -- a tedious long tail of bugs to solve that ended up necessitating some changes in the design as well.
At first we borrowed the method that Matt Dillon used for DragonFlyBSD, by putting the temporary "stack" space and state data for the interrupt trampolines into an extra page tacked onto the end of *%gs (in illumos the structure that lives there is the cpu_t).
If you read the existing logic in interrupt handlers for dealing with %gs though, you will quickly notice that the corner cases start to build up. There are a bunch of situations where the kernel temporarily alters %gs, and some of the ways to mess it up have security consequences that end up being worse than the bug we're trying to fix. As it turns out, there are no less than 3 different ways that ISRs use to try to get to having the right cpu_t in %gs on illumos, as it turns out, and they are all subtly different. Trying to tell which you should use when requires a bunch of test logic that in turn requires branches and changes to the CPU state, which is difficult to do in a trampoline where you're trying to avoid altering that state as much as possible until you've got the real stack online to push things into.
I kept in touch with Philip Guenther and Mike Larkin from the OpenBSD project throughout the weeks that followed. In one of the discussions we had, we talked about the NMI/MCE handlers and the fact that their handling currently on OpenBSD neglected some nasty corner-cases around interrupting an existing trap handler. A big part of the solution to those issues was to use a feature called IST, which allows you to unconditionally change stacks when you take an interrupt.
Traditionally, x86 only changes the stack pointer (%rsp on AMD64) while taking an interrupt when there is a privilege level change. If you take an interrupt while already in the kernel, the CPU does not change the stack pointer, and simply pushes the interrupt stack frame onto the stack you're already using. IST makes the change of stack pointer unconditional. If used unwisely, this is a bad idea: if you stay on that stack and turn interrupts back on, you could take another interrupt and clobber the frame you're already in. However, in it I saw a possible way to simplify the KPTI trampoline logic and avoid having to deal with %gs.
A few weeks into the project, John Levon joined us at work. He had previously worked on a bunch of Xen-related stuff as well as other parts of the kernel very close to where we were, so he quickly got up to speed with the KPTI work as well. He and I drafted out a "crazy idea" on the whiteboard one afternoon where we would use IST for all interrupts on the system, and put the "stack" they used in the KPTI page on the end of the cpu_t. Then, they could easily use stack-relative addresses to get the page table to change to, then pivot their stack to the real kernel stack memory, and throw away (almost) all the conditional logic. A few days later, we had convinced each other that this was the way to go.
Two of the most annoying x86 issues we had to work around were related to the SYSENTER instruction. This instruction is used to make "fast" system calls in 32-bit userland. It has a couple of unfortunate properties: firstly, it doesn't save or restore RFLAGS, so the kernel code has to take care of this (and be very careful not to clobber any of it before saving or after restoring it). Secondly, if you execute SYSENTER with the TF ("trap"/single-step flag) set by a debugger, the resulting debug trap's frame points at kernel code instead of the user code where it actually happened. The first one requires some careful gymnastics on the entry and return trampolines specifically for SYSENTER, while the second is a nasty case that is incidentally made easier by using IST. With IST, we can simply make the debug trap trampoline check for whether we took the trap in another trampoline's code, and reset %cr3 and the destination stack. This works for single-stepping into any of the handlers, not just the one for SYSENTER.
To make debugging easier, we decided that traps like the debug/single-step trap (as well as faults like page faults, #GP, etc.) would push their interrupt frame in a different part of the KPTI state page to normal interrupts. We applied this change to all the traps that can interrupt another trampoline (based on the instructions we used). These "paranoid" traps also set a flag in the KPTI struct to mark it busy (and jump to the double-fault handler if it is), to work around some bugs where double-faults are not correctly generated.
It's been a long and busy two months, with lots of time spent building, testing, and validating the code. We've run it on as many kinds of machines as we could get our hands on, to try to make sure we catch issues. The time we've spent on this has been validated several times in the process by finding bugs that could have been nasty in production.
One great example: our patches on Westmere-EP Xeons were causing busy machines to throw a lot of L0 I-cache parity errors. This seemed very mysterious at first, and it took us a few times seeing it to believe that it was actually our fault. This was actually caused by the accidental activation of a CPU errata for Westmere (B52, "Memory Aliasing of Code Pages May Cause Unpredictable System Behaviour") -- it turned out we had made a typo and put the "cacheable" flag into a variable named flags instead of attrs where it belonged when setting up the page tables. This was causing performance degradation on other machines, but on Westmere it causes cache parity errors as well. This is a great example of the surprising consequences that small mistakes in this kind of code can end up having. In the end, I'm glad that that erratum existed, otherwise it may have been a long time before we caught that bug.
As of this week, Mike and Philip have committed the OpenBSD patches for KPTI to their repository, and the patches for illumos are out for review. It's a nice kind of symmetry that the two projects who started on the work together after the public disclosure at the same time are both almost ready to ship at the same time at the other end. I'm feeling hopeful, and looking forward to further future collaborations like this with our cousins, the BSDs.
By design, email message headers need to be public, for exchanges to happen. The body of the message can be encrypted by the user, if desired. Moreover, there is no way to prevent the host from having access to the virtual machine. Therefore, full disk encryption (at rest) may not be necessary.
Given our low memory requirements, and the single-purpose concept of email service, Roundcube or other web-based IMAP email clients should be on a different VPS.
Antivirus software users (usually) have the service running on their devices. ClamAV can easily be incorporated into this configuration, if affected by the types of malware it protects against, but will require around 1GB additional RAM (or another VPS).
Every email message is important, if properly delivered, for Bayes classification. At least 200 ham and 200 spam messages are required to learn what one considers junk. By default (change to use case), a rspamd score above 50% will send the message to Spam/. Moving messages in and out of Spam/ changes this score. After 95%, the message is flagged as "seen" and can be safely ignored.
Spamd is effective at greylisting and stopping high volume spam, if it becomes a problem. It will be an option when IPv6 is supported, along with bgp-spamd.
System mail is delivered to an alias mapped to a virtual user served by the service. This way, messages are guaranteed to be delivered via encrypted connection. It is not possible for real users to alias, nor mail an external mail address with the default configuration. e.g. [email protected] is wheel, with an alias mapped to (virtual) [email protected], and user (puffy) can be different for each.
After a several year gap, I finally took another week-long programming retreat, where I could work in hermit mode, away from the normal press of work. My wife has been generously offering it to me the last few years, but Im generally bad at taking vacations from work.
Recently, Theo de Raadt (deraadt@) described a new type of mitigation he has been working on together with Stefan Kempf (stefan@):
This is now available in snapshots, and people are finding the first problems in the ports tree already. So far, few issues have been uncovered, but as Theo points out, more testing is necessary:
Fairly good results.
So, everybody, install the latest snapshot and try all your favorite ports. This is the time to report issues you find, so there is a good chance this additional security feature is present in 6.3 (and works with third party software from packages).
1,971 Listeners
272 Listeners
283 Listeners
265 Listeners
215 Listeners
154 Listeners
65 Listeners
189 Listeners
181 Listeners
44 Listeners
21 Listeners
135 Listeners
92 Listeners
29 Listeners
47 Listeners