kernel_optimize_test/Documentation
Daniel Sneddon 509c2c9fe7 x86/speculation: Add RSB VM Exit protections
commit 2b1299322016731d56807aa49254a5ea3080b6b3 upstream.

tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.

== Background ==

Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.

To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced.  eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.

== Problem ==

Here's a simplification of how guests are run on Linux' KVM:

void run_kvm_guest(void)
{
	// Prepare to run guest
	VMRESUME();
	// Clean up after guest runs
}

The execution flow for that would look something like this to the
processor:

1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()

Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:

* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.

* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".

IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.

However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.

Balanced CALL/RET instruction pairs such as in step #5 are not affected.

== Solution ==

The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.

However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.

Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.

The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.

In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.

There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.

  [ bp: Massage, incorporate review comments from Andy Cooper. ]

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-11 13:06:47 +02:00
..
ABI iio: adc: vf610: fix conversion mode sysfs node name 2022-06-29 08:59:50 +02:00
accounting psi: Fix uaf issue when psi trigger is destroyed while being polled 2022-02-05 12:37:55 +01:00
admin-guide x86/speculation: Add RSB VM Exit protections 2022-08-11 13:06:47 +02:00
arm ARM: 9012/1: move device tree mapping out of linear region 2021-05-19 10:13:18 +02:00
arm64 arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs 2022-05-25 09:18:01 +02:00
block block-5.10-2020-10-24 2020-10-24 12:46:42 -07:00
bpf bpf: Migrate from patchwork.ozlabs.org to patchwork.kernel.org. 2020-10-11 22:05:47 +02:00
cdrom
core-api Revert "swiotlb: fix info leak with DMA_FROM_DEVICE" 2022-05-25 09:17:55 +02:00
cpu-freq
crypto crypto: af_alg - add extra parameters for DRBG interface 2020-09-25 17:48:52 +10:00
dev-tools linux-kselftest-kunit-fixes-5.10-rc5 2020-11-18 11:57:55 -08:00
devicetree dt-bindings: dma: allwinner,sun50i-a64-dma: Fix min/max typo 2022-07-12 16:32:23 +02:00
doc-guide docs: kerneldoc.py: add support for kerneldoc -nosymbol 2020-10-15 07:49:38 +02:00
driver-api Documentation: fix firewire.rst ABI file path error 2022-01-27 10:54:29 +01:00
fault-injection lkdtm: replace SCSI_DISPATCH_CMD with SCSI_QUEUE_RQ 2021-09-15 09:50:42 +02:00
fb drm fixes (round two) for 5.10-rc1 2020-10-23 13:56:34 -07:00
features s390 updates for the 5.10 merge window 2020-10-16 12:36:38 -07:00
filesystems ext4, doc: fix incorrect h_reserved size 2022-04-27 13:53:57 +02:00
firmware_class
firmware-guide Documentation: ACPI: Fix data node reference documentation 2022-01-27 10:54:29 +01:00
fpga
gpu Revert "fbcon: Disable accelerated scrolling" 2022-02-08 18:30:40 +01:00
hid
hwmon hwmon: (lm90) Add basic support for TI TMP461 2021-12-29 12:25:59 +01:00
i2c Documentation: i2c: add testunit docs to index 2020-10-05 22:57:45 +02:00
ia64 docs/ia64: Drop obsolete Xen documentation 2020-08-31 16:16:03 -06:00
ide
iio Documentation: iio: fix a typo 2020-09-09 11:41:20 -06:00
infiniband
input
isdn
kbuild Documentation/Kbuild: Remove references to gcc-plugin.sh 2021-12-14 11:32:46 +01:00
kernel-hacking
leds docs: leds: index.rst: add a missing file 2020-11-02 13:45:37 +01:00
litmus-tests
livepatch
locking Documentation/locking/locktypes: Update migrate_disable() bits. 2021-12-14 11:32:42 +01:00
m68k
maintainer Documentation/maintainer: rehome sign-off process 2020-09-03 15:39:24 -06:00
mhi
mips dt: Remove booting-without-of.rst 2020-10-13 13:33:16 -05:00
misc-devices Documentation: remove mic/index from misc-devices/index.rst 2020-11-04 11:38:32 +01:00
netlabel
networking Documentation: fix sctp_wmem in ip-sysctl.rst 2022-08-03 12:00:47 +02:00
nios2
nvdimm
openrisc
parisc
PCI Documentation: better locations for sysfs-pci, sysfs-tagging 2020-10-09 09:33:23 -06:00
pcmcia
power PCI/PM: Rename pci_dev.d3_delay to d3hot_delay 2020-09-29 14:21:50 -05:00
powerpc powerpc/64s/syscall: Use pt_regs.trap to distinguish syscall ABI difference between sc and scv syscalls 2021-05-26 12:06:53 +02:00
process docs: submitting-patches: Fix crossref to 'The canonical patch format' 2022-06-06 08:42:44 +02:00
RCU Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu 2020-10-09 08:21:56 +02:00
riscv
s390
scheduler docs: scheduler: fix the directory name on two files 2020-09-10 10:45:45 -06:00
scsi scsi: libsas: Introduce a _gfp() variant of event notifiers 2021-03-25 09:04:11 +01:00
security watch_queue: Drop references to /dev/watch_queue 2021-03-04 11:37:59 +01:00
sh dt: Remove booting-without-of.rst 2020-10-13 13:33:16 -05:00
sound ALSA: hda/realtek: Add alc256-samsung-headphone fixup 2022-04-08 14:40:36 +02:00
sparc
sphinx tweewide: Fix most Shebang lines 2021-05-22 11:40:55 +02:00
sphinx-static
spi
staging
target tweewide: Fix most Shebang lines 2021-05-22 11:40:55 +02:00
timers
trace tracing: Add ustring operation to filtering string pointers 2022-03-08 19:09:31 +01:00
translations net: switch to the kernel.org patchwork instance 2020-11-11 17:12:00 -08:00
usb
userspace-api media: hevc: Fix dependent slice segment flags 2021-07-14 16:55:51 +02:00
virt KVM: X86: MMU: Use the correct inherited permissions to get shadow page 2021-06-16 12:01:40 +02:00
vm arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2022-05-15 20:00:09 +02:00
w1 docs: w1: w1_therm: Fix broken xref, mistakes, clarify text 2020-10-08 09:47:15 +02:00
watchdog
x86 x86/CPU/AMD: Save AMD NodeId as cpu_die_id 2020-12-30 11:54:29 +01:00
xtensa xtensa: fix TLBTEMP area placement 2020-11-16 02:13:15 -08:00
.gitignore
asm-annotations.rst x86/entry: Emit a symbol for register restoring thunk 2021-02-03 23:28:40 +01:00
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0 2022-06-09 10:21:28 +02:00
COPYING-logo
docutils.conf
dontdiff kbuild: generate Module.symvers only when vmlinux exists 2021-05-19 10:12:59 +02:00
index.rst
Kconfig docs: Kconfig/Makefile: add a check for broken ABI files 2020-10-30 13:08:07 +01:00
logo.gif
Makefile A small number of fixes, plus a build tweak to respect the desire for 2020-11-03 09:57:30 -08:00
memory-barriers.txt docs/memory-barriers.txt: Fix references for DMA*.txt files 2020-08-31 16:14:44 -06:00
SubmittingPatches
watch_queue.rst docs: watch_queue: fix some warnings 2020-09-10 10:48:56 -06:00