Spectre and Meltdown
You might have heard of Meltdown or Spectre in mainstream media since early Jan 2018. What are they and what’s it all about?
Spectre and Meltdown are the brand names for a related set of vulnerabilities that go to the heart of how modern CPUs work. (Yes, vulnerabilities have brand names, logos and their own website these days!) The two brand names cover three related problems, tracked under three CVEs (Common Vulnerabilities and Exposures numbers) – three “variants” of the same class of vulnerability.
So how does it all fit together? The full detail is a level below where we want to go here, but let’s cover some high level technical stuff that will help make sense of it all.
- This is one of the biggest events to surface in IT and information security in many years. It will affect us for months, if not years, to come.
- This class of vulnerability is creating a lot of work for vendors, IT teams, security experts and end-users alike.
- The risks themselves are either patchable (Meltdown) or, so far, highly theoretical (Spectre). There are, as yet, no known exploits in the wild that threaten fully patched systems or their data.
- But in many cases the cure is worse than the disease (slower systems, less stable systems, data corruption/loss), particularly if you factor in the realistic risks involved.
- You need to be on top of system patching and learn about new releases as soon as they land.
- Be prepared for cloud-hosting IaaS services to disrupt you further as they patch host systems.
- But take a measured approach with systems under your control. Factor the relative risks (and mitigation costs) of different solutions, and expect this to take quite some time to settle down.
Break the tech down for me
The three variants are related because they all depend upon something called “speculative execution”. Speculative execution is something that is fundamental to how fast our computers and devices run.
If you know approximately what a “GHz” is you might have noticed that headline CPU speeds stopped growing a few years back. We got more “cores” in our CPUs, and newer ones were definitely faster than older ones. But one of the main ways that CPUs have got faster is by making them do useful work more of the time. When you have a machine that can do maybe 3.8 billion things per second, even in a modern PC with tons of heavy software on it, the CPU gets stuck or stalled quite often. This is mainly down to how fast CPUs are compared to the speed of main memory. Sometimes the CPU has to wait for things to arrive from memory, which is at least 100x slower than the CPU. And sometimes it just has spare capacity for other reasons.
Things get faster when you try to use those gaps and spare capacity to do things now that you probably have to do in future anyway. This re-ordering of steps, and filling in the gaps, involves “speculative execution” – the concept of executing some things now that will probably be needed later. If they are in fact required later then great – we run faster. If they aren’t – well, not much time was wasted, and on average we will be right more than we are wrong. So CPUs are faster than they used to be because we have started using otherwise wasted time. This is great, and you can imagine the competitive and creative pressure that gives rise to such complex improvements for the same clock speed (GHz).
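That average payoff can be sketched with a toy cost model. All the cycle counts below are invented for illustration (they are not real CPU numbers), but the shape of the argument is the same: if the guess is right most of the time, doing the work during the stall wins on average.

```python
# Toy cost model of speculative execution.
# STALL and WORK are invented cycle counts, purely for illustration.
STALL, WORK = 100, 80            # pretend: memory wait vs. follow-up work
outcomes = [True] * 9 + [False]  # assume the guess is right 90% of the time

def cost(speculate, needed):
    if speculate:
        # The follow-up work runs during the stall, so it costs nothing
        # extra whether or not it turns out to be needed; a wrong guess
        # is simply discarded.
        return STALL
    # Without speculation the work only starts after the stall,
    # and only when it is actually needed.
    return STALL + (WORK if needed else 0)

avg_plain = sum(cost(False, n) for n in outcomes) / len(outcomes)  # 172.0
avg_spec = sum(cost(True, n) for n in outcomes) / len(outcomes)    # 100.0
```

Under these made-up numbers, speculating saves about 40% of the time on average even though one guess in ten is wasted.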
So where did it go wrong? Well, to answer that we have to mention memory cache.
Remember how the “main” memory is really slow (even though you bought the newest, fastest stuff you could get your hands on)? Because of that, the CPU caches things it fetches from main memory. Unless you are a technical reader, you may not have noticed things like “L1 cache”, “L2 cache” and “L3 cache” written alongside your CPU specs. Those are layers of super-fast memory located on the CPU itself. The CPU uses them to work around the fact that main memory is so slow: it keeps copies of the things it is currently using in those levels of on-chip cache so it can get them back really quickly when it needs them. Once in a while it needs something that isn’t in the cache, and it has to pause what it is doing and wait for that information to arrive from main memory.
Remember that the CPU could do about 100 other things in the time it spends waiting. So it’s quite bad news when that happens, and that slow-down shows up on the radar of anyone who’s watching closely.
One more point to note is that CPUs and operating systems expose incredibly accurate timing counters to user applications. You can count at close to clock speed – down to roughly 1/3.8-billionth of a second on our example CPU. So you can work out whether a particular operation happened quickly or slowly, and therefore whether some piece of information was sitting in nice fast CPU cache or in really slow main memory.
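To see how easy these measurements are to make from ordinary user code, here is a small Python sketch using `time.perf_counter_ns`, the nanosecond-resolution counter Python exposes (real attacks use even finer-grained CPU counters, but the idea is the same):

```python
import time

# Time a trivially cheap operation against a visibly expensive one.
t0 = time.perf_counter_ns()
fast = 1 + 1                    # effectively instant
t1 = time.perf_counter_ns()
slow = sum(range(1_000_000))    # millions of additions
t2 = time.perf_counter_ns()

fast_ns = t1 - t0
slow_ns = t2 - t1
# The timer easily separates the two: slow_ns is vastly larger,
# just as a cache miss is easily distinguished from a cache hit.
```

No special privileges are needed, which is exactly what makes timing side channels so widely exploitable.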
What that means
So we have speculative execution, we have CPU cache, and we have really accurate timers built into almost every system. It turns out that these are the three main ingredients to this class of vulnerability.
In spite of all the headline protections built into modern operating systems and CPUs, you can go around the side of those. You can learn what data some other, protected, private process has, simply by leveraging what has been put into the cache by that other process. You can infer what they are using by getting the CPU to put it into the cache, then very craftily using what is in there. You can’t use it directly (the traditional protections prevent that), but you can learn what’s in there. And that is what is remarkable about Meltdown and Spectre.
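Here is a deliberately simplified Python simulation of that idea – a toy “cache” with invented hit/miss costs, a victim that touches one cache line based on a secret, and an attacker that recovers the secret purely from timings. This is a FLUSH+RELOAD-style pattern in miniature, not real exploit code:

```python
# Toy model of a cache timing side channel. HIT and MISS are
# invented "cycle" costs for illustration only.
HIT, MISS = 1, 100

class SimCache:
    """A pretend cache: accessing an address tells you whether it was cached."""
    def __init__(self):
        self.lines = set()

    def access(self, addr):
        if addr in self.lines:
            return HIT          # fast: it was already cached
        self.lines.add(addr)
        return MISS             # slow: it had to come from "main memory"

    def flush(self):
        self.lines.clear()

def victim_touch(cache, secret):
    # The victim's (speculative) load leaves a footprint: it pulls
    # probe line number `secret` into the cache.
    cache.access(("probe", secret))

def attacker_recover(cache):
    # The attacker times every probe line; the one fast access
    # reveals which line the victim touched.
    timings = {i: cache.access(("probe", i)) for i in range(256)}
    return min(timings, key=timings.get)

cache = SimCache()
cache.flush()                       # FLUSH: start from a known-empty state
victim_touch(cache, secret=42)      # victim leaks via the cache footprint
recovered = attacker_recover(cache) # RELOAD: time everything → 42
```

The attacker never reads the secret directly; it is inferred entirely from which access was fast. That indirection is the whole trick.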
It is remarkable, but it isn’t really news. People have been raising concerns about the potential for these types of problems for a long time. There was a huge headwind of “that will never happen – that’s just a theoretical risk – and besides, we don’t want slower CPUs, so shush”. It just took some smart (and impressively young) people at Google Project Zero, working in parallel with a number of academic researchers, to put proof-of-concept (PoC) exploit code together and bring this to the very top of the industry’s priority lists.
An analogy for how this works
Think of it like this: Someone asks you to send an email. You are the only one who can send it because you have access to some private information it requires. You discover, during writing the mail, that you have to ask that colleague for some information. They take a few minutes to get you that information. You have nothing else urgently demanding your attention, so while you wait you decide to type the rest of the mail, leaving a gap for the requested information. If you sit on your hands until you get the information you are, ultimately, “wasting cycles”. It seems pretty likely that you will indeed need to send this email once the information arrives, so it probably won’t be a waste of time to continue typing.
But the colleague then comes back and gives you some information that means you now no longer need to send the email. Oh no! You wasted a few minutes typing something you didn’t end up needing. Oh well – no harm. Perhaps you could have used that time for some other productive work on this occasion, but on average it is more likely that you will end up sending the email. Hence you were right to continue writing it while you wait for the missing information to arrive on your desk. You are doing “speculative execution”. If you are smart, you know that on average this will pay off for you as you can get onto the next task sooner.
The confidential file
Now, let’s go one step further. Let’s say there’s a confidential paper file that you need in order to finish this email. So, during the time you are waiting for the colleague to get you the other information, you go and get the confidential file from a secure filing cabinet and put it on your desk. It’s marked as confidential and you work in a dedicated office with a lockable door (you must be the big cheese then). So it’s safe for you to handle the file.
But now imagine that the colleague you were waiting on for the other information wasn’t your typical honest employee. Imagine, instead, that they were a very crafty spy – a sleeper who has so far appeared normal and trustworthy. Perhaps it was actually their idea that you send this email in the first place – an email that would require you to ask them for information while you were typing it. And just maybe would also require you to go and get a confidential file.
Well, if the spy were crafty enough, over a long enough period of time (and lots of requested emails), they might be able to work out which confidential files you were going to get and what is (or isn’t) in some of those files. For example, if you already had a particular file on your desk, you wouldn’t need to go to the secure filing cabinet to get that file. So the spy could work out, from how long it took you to write an email, that the private information for this particular email is the same as the private information for the previous email they asked you to write. Well – the spy just “side-channel” attacked you! They used something indirect and observable that leaks information, but isn’t as direct as actually looking at the contents of the file itself (which they can’t do).
What the spy can learn
Spies can learn something about what sorts of things are in which files just by seeing how long things take. Anyone can see how long it takes you to write an email because you always get up for coffee after each one. Or maybe they just ask you and you tell them how long it took, down to the nanosecond. Believe it or not, CPUs and operating systems will tell you such things!
Now this isn’t a perfect analogy for all three variants of the Meltdown and Spectre vulnerabilities. But it does give some insight into what speculative execution and side-channel attacks are. You can see how very old-world, traditional spy techniques and mindset have led super-smart researchers down similar lines of enquiry with modern CPU designs. Same thinking; global implications.
How big is this?
There’s no denying this is a big deal. It reaches across almost the entire industry, through all classes of hardware, devices and operating systems used for much of this decade. And it almost certainly affects the device you are reading this on right now. Much of the most worrying stuff still hasn’t been adequately patched at the time of writing and we look set to enjoy the fall-out for far longer than some of the higher-profile security headlines of recent years.
What do we need to do?
You need to stay abreast of the information on these vulnerabilities and ensure you’re equipped to make good, informed decisions about actions and mitigations. This is an ongoing issue with ongoing attempted work-arounds. And more fundamental solutions will emerge only after a much longer period of time (e.g. later CPU versions). Advice and patches are also changing by the week. So it’s important to have access to someone who specialises in these issues and can provide up-to-date information that is understandable to your business as well as the IT team.
With the right information, you can make tactical decisions around what to do when, and in what order. But the bottom line is that you will need to patch everything, perhaps multiple times. And you should prepare for potential further (hopefully small) disruptions to any cloud services you use (including IaaS like AWS, Azure and Google Compute Engine).
Is it over yet?
No, unfortunately it isn’t. As at the start of Feb 2018 (we’re only a month into the public knowledge of this!) it is still very much ongoing. Meltdown has been mitigated by operating system patches, many of which result in a noticeable performance hit. Most businesses need to have completed, or be well into, important patching of all systems including servers, workstations and mobile devices. Meltdown exploits look imminent and if they take hold they will be able to liberate your most secret secrets.
Any cloud services you use should have been patched for Meltdown also, but for IaaS (e.g. AWS, Azure, Google Compute Engine) that only applies to what sits beneath your server instances. It doesn’t apply to the operating systems you manage on that IaaS, so you definitely need to patch those. Once you have patched them you will start to find out whether you have significant performance hits on any of your systems. Only the most recent CPU and operating system combinations can use a particular CPU facility (PCID) that avoids much of the slow-down, and the actual slow-down will depend on the type of load placed on the server or device. Systems that become significantly slower will require some thought around hardware replacement and/or OS upgrades. That will all take time to work through. So although Meltdown does have effective mitigations, they are imperfect, and the work certainly isn’t over yet.
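As a hedged sketch of how you might check for that facility on a Linux system, the snippet below parses a `/proc/cpuinfo`-style flags line for the `pcid` and `invpcid` feature names. The sample text is made up for illustration; on a real Linux box you would read the actual `/proc/cpuinfo` file instead.

```python
# Sketch: look for PCID-related feature flags in /proc/cpuinfo-style text.
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# Invented sample for illustration; real output lists far more flags.
sample = "flags\t\t: fpu vme pse pcid sse4_2 invpcid"

flags = cpu_flags(sample)
has_pcid = {"pcid", "invpcid"} <= flags  # True for this made-up sample
```

On a real system you would pass `open("/proc/cpuinfo").read()` instead of the sample string; whether the OS actually uses PCID still depends on the kernel version, as noted below.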
Spectre mitigations, on the other hand, have had significant false starts and really haven’t been properly dealt with yet. Intel released, but then actively withdrew, a Variant-2 microcode patch last week. This was on account of it causing “higher than expected reboots” (a term that is unclear to pretty much everyone outside of Intel), as well as potential data loss or corruption. Microsoft had to release a “patch for the patch that included the microcode patch”; that latest patch was a voluntary, manual download not pushed to all systems. Google has an innovative approach called Retpoline that doesn’t require CPU microcode patching, but it does require recompiling application binaries. That pushes the problem to many, many app vendors rather than just OS vendors, and there doesn’t seem to have been much uptake or attention on it as yet – though that might change depending on how the OS and CPU patching story plays out.
So this is still very much an evolving landscape, and sadly it is likely to be “haunting” us for months and most likely years. Hence the name Spectre.
The extra-techy detail
These are the three variants:
| Name | Variant | CVE | Technique |
|----------|---|---------------|--------------------------|
| Spectre | 1 | CVE-2017-5753 | Bounds Check Bypass |
| Spectre | 2 | CVE-2017-5715 | Branch Target Injection |
| Meltdown | 3 | CVE-2017-5754 | Rogue Data Cache Load |
Although they are all related, there are differences between the Spectre and Meltdown variants, so let’s summarise those.

Spectre (Variants 1 and 2)
- There are no known exploits in the wild. There is just proof-of-concept (PoC) code (which will certainly help exploits to happen in due course).
- Centres on branch-prediction, which is a special case of speculative execution.
- Doesn’t rely on a particular CPU architecture other than having branch-prediction.
- Therefore affects Intel, AMD and ARM processors.
- Can bypass virtual machine host-guest and guest-guest isolation.
- Doesn’t have an easy operating system patch solution (requires either CPU microcode changes or recompilation of OS/app binaries à la Google’s Retpoline).
- At the time of writing, Intel’s attempts to patch CPU microcode have focused on Variant-2 and have caused more problems than they solve.
- Summary: really wide-ranging impact and highly concerning, requiring CPU microcode patching (albeit via OS patches), but not being exploited yet.
Meltdown (Variant 3)

- At the time of writing there is both working PoC code and hundreds of emergent malware attempts to adapt that PoC code. However there are no known active exploits in the wild.
- Breaks the isolation between user space and OS/kernel space. Therefore allows a user-space app to access memory of other apps, and in the kernel.
- Relies on Intel’s out-of-order execution, and affects all such Intel chips, as well as some ARM chips.
- Can’t bypass full-virtualisation host-guest isolation. But does threaten para-virtualised and containerised host-guest and guest-guest isolation.
- Only requires operating system patches to resolve (using KPTI on Linux, similar cache-flushing on Windows, or PCID where applicable).
- But those patches often result in measurable slow-downs in performance (except on CPU/OS combos that support PCID/INVPCID – Haswell or later CPUs, Linux kernel 4.14 or later, Windows 10 Fall Creators Update or later).
- Summary: more narrow impact (still wide by any standards), patched via OS updates which slow down many systems, and likely to be exploited imminently.
This is really about as big as it gets, in terms of breadth and depth of industry impact. We’re going to be dealing with aspects of this for some time to come – be it patching our own servers and workstations, being disrupted by cloud service provider patching, taking more frequent updates to our smartphones and devices, or deciding whether to replace hardware with new-generation chipsets sooner than we otherwise would.
Ultimately, we stand on the shoulders of many, many giants. We are at the mercy of so many layers and types of complexity in going about our business in 2018 and things like this are going to happen. It falls into the categories of human error, speed above security, unintended consequences, and perhaps a few others relating generally to the limits of our abilities to fully predict and control. Those things are unlikely to change (unless and until we train machines to do a better job than we do in certain areas). So we’d better get used to dealing with pervasive security and stability-related disruptions and challenges.
The key questions are:
- Are you equipped with the right information and expertise?
- Are you positioned to make informed decisions about risk-cost trade offs and mitigation strategies?
- Do you know who can help with that?
Phew! We can help with that.