Pavel Machek reports a warning about W+X pages found in the "Persisent"
kmap area. After grepping for it (using the correct spelling), and not
finding it, I noticed how the debug printk was just misspelled. Fix it.
The actual mapping bug that Pavel reported is still open. It's
apparently a separate issue from the known EFI page tables, looks like
it's related to the HIGHMEM mappings.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MPX decodes instructions in order to tell which bounds register
was violated. Part of this decoding involves looking at the "REX
prefix" which is a special instrucion prefix used to retrofit
support for new registers in to old instructions.
The X86_REX_*() macros are defined to return actual bit values:
#define X86_REX_R(rex) ((rex) & 4)
*not* boolean values. However, the MPX code was checking for
them like they were booleans. This might have led to us
mis-decoding the "REX prefix" and giving false information out to
userspace about bounds violations. X86_REX_B() actually is bit 1,
so this is really only broken for the X86_REX_X() case.
Fix the conditionals up to tolerate the non-boolean values.
Fixes: fcc7ffd679 "x86, mpx: Decode MPX instruction to get bound violation information"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20151201003113.D800C1E0@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Thomas Gleixner:
"This update contains:
- MPX updates for handling 32bit processes
- A fix for a long standing bug in 32bit signal frame handling
related to FPU/XSAVE state
- Handle get_xsave_addr() correctly in KVM
- Fix SMAP check under paravirtualization
- Add a comment to the static function trace entry to avoid further
confusion about the difference to dynamic tracing"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Fix SMAP check in PVOPS environments
x86/ftrace: Add comment on static function tracing
x86/fpu: Fix get_xsave_addr() behavior under virtualization
x86/fpu: Fix 32-bit signal frame handling
x86/mpx: Fix 32-bit address space calculation
x86/mpx: Do proper get_user() when running 32-bit binaries on 64-bit kernels
Pull x86 fixes from Thomas Gleixner:
"A couple of fixes and updates related to x86:
- Fix the W+X check regression on XEN
- The real fix for the low identity map trainwreck
- Probe legacy PIC early instead of unconditionally allocating legacy
irqs
- Add cpu verification to long mode entry
- Adjust the cache topology to AMD Fam17H systems
- Let Merrifield use the TSC across S3"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Call verify_cpu() after having entered long mode too
x86/setup: Fix low identity map for >= 2GB kernel range
x86/mm: Skip the hypervisor range when walking PGD
x86/AMD: Fix last level cache topology for AMD Fam17h systems
x86/irq: Probe for PIC presence before allocating descs for legacy IRQs
x86/cpu/intel: Enable X86_FEATURE_NONSTOP_TSC_S3 for Merrifield
I received a bug report that running 32-bit MPX binaries on
64-bit kernels was broken. I traced it down to this little code
snippet. We were switching our "number of bounds directory
entries" calculation correctly. But, we didn't switch the other
side of the calculation: the virtual space size.
This meant that we were calculating an absurd size for
bd_entry_virt_space() on 32-bit because we used the 64-bit
virt_space.
This was _also_ broken for 32-bit kernels running on 64-bit
hardware since boot_cpu_data.x86_virt_bits=48 even when running
in 32-bit mode.
Correct that and properly handle all 3 possible cases:
1. 32-bit binary on 64-bit kernel
2. 64-bit binary on 64-bit kernel
3. 32-bit binary on 32-bit kernel
This manifested in having bounds tables not properly unmapped.
It "leaked" memory but had no functional impact otherwise.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151111181934.FA7FAC34@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When you call get_user(foo, bar), you effectively do a
copy_from_user(&foo, bar, sizeof(*bar));
Note that the sizeof() is implicit.
When we reach out to userspace to try to zap an entire "bounds
table" we need to go read a "bounds directory entry" in order to
locate the table's address. The size of a "directory entry"
depends on the binary being run and is always the size of a
pointer.
But, when we have a 64-bit kernel and a 32-bit application, the
directory entry is still only 32-bits long, but we fetch it with
a 64-bit pointer which makes get_user() does a 64-bit fetch.
Reading 4 extra bytes isn't harmful, unless we are at the end of
and run off the table. It might also cause the zero page to get
faulted in unnecessarily even if you are not at the end.
Fix it up by doing a special 32-bit get_user() via a cast when
we have 32-bit userspace.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151111181931.3ACF6822@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1/ Add support for the ACPI 6.0 NFIT hot add mechanism to process
updates of the NFIT at runtime.
2/ Teach the coredump implementation how to filter out DAX mappings.
3/ Introduce NUMA hints for allocations made by the pmem driver, and as
a side effect all devm allocations now hint their NUMA node by
default.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWQX2sAAoJEB7SkWpmfYgCWsEQAK7w/xM9zClVY/DDlFJxFtYq
DZJ4faPj+E3FMTiJIEDzjtRgQvOFE+wmJtntYsCqKH/QZmpnyk9jeT/CbJzEEL2k
WsAk+qHGLcVUlSb36blwN1RFzYqC+IDYThewJqUvxDbOwL1AbiibbX7gplzZHLhW
+rj3ScVlSNOPRDgGGpkAeLNNsttuKvsGo7nB/JZopm0tV6g14rSK09wQbVhv6S6T
Lu7VGYqnJlkteL9YlzRiROf9hW2ZFCMGJz1YZydPTy3aX3hGTBX4w2qvmsPwBIKP
kW/gCNisVJGk1cZCk4joSJ8i/b3x3fE0zdZ5waivJ5jDvYbUUfyk0KtJkfw207Rl
14yWitUC6aeVuCeOqXHgsjRi+1QVN9Pg7i49xgGiUN1igQiUYRTgQPWZxDv6Zo/s
USrLFQBaRd+hJw+dl7A47lJ3mUF96tPCoQb4LCQ7DVsg5U4J2TvqXLH9Gek/CCZ4
QsMkZDTQlZw4+JEDlzBgg/L7xVty8DadplTADMdjaRhFU3y8zKNJ85Ileokt7KVt
IsBT4+S5HeZLvinZY95932DwAmFp1DtsyENd1BUXL06ddyvlQrFJ6NQaXji4fuDc
EVQmMoTAqDujZFupMAux9vkUBDFj/hmaVD5F7j3+MWP87OCritw/IZn+2LgTaKoX
EmttaYrDr2jJwIaGyw+H
=a2/L
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"Outside of the new ACPI-NFIT hot-add support this pull request is more
notable for what it does not contain, than what it does. There were a
handful of development topics this cycle, dax get_user_pages, dax
fsync, and raw block dax, that need more more iteration and will wait
for 4.5.
The patches to make devm and the pmem driver NUMA aware have been in
-next for several weeks. The hot-add support has not, but is
contained to the NFIT driver and is passing unit tests. The coredump
support is straightforward and was looked over by Jeff. All of it has
received a 0day build success notification across 107 configs.
Summary:
- Add support for the ACPI 6.0 NFIT hot add mechanism to process
updates of the NFIT at runtime.
- Teach the coredump implementation how to filter out DAX mappings.
- Introduce NUMA hints for allocations made by the pmem driver, and
as a side effect all devm allocations now hint their NUMA node by
default"
* tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
coredump: add DAX filtering for FDPIC ELF coredumps
coredump: add DAX filtering for ELF coredumps
acpi: nfit: Add support for hot-add
nfit: in acpi_nfit_init, break on a 0-length table
pmem, memremap: convert to numa aware allocations
devm_memremap_pages: use numa_mem_id
devm: make allocations numa aware by default
devm_memremap: convert to return ERR_PTR
devm_memunmap: use devres_release()
pmem: kill memremap_pmem()
x86, mm: quiet arch_add_memory()
Removal started in commit 5bbeed12bd ("sparc32: drop unused
kmap_atomic_to_page"). Let's do it across the whole tree.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The range between 0xffff800000000000 and 0xffff87ffffffffff is reserved
for hypervisor and therefore we should not try to follow PGD's indexes
corresponding to those addresses.
While this has always been a problem, with the new W+X warning
mechanism ptdump_walk_pgd_level_core() can now be called during boot,
causing a PV Xen guest to crash.
[ tglx: Replaced the macro with a readable inline ]
Fixes: e1a58320a3 "x86/mm: Warn on W^X mappings"
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xen.org
Link: http://lkml.kernel.org/r/1446749795-27764-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We decided to use KASAN as the short name of the tool and
KernelAddressSanitizer as the full one. Update log messages according to
that.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 mm changes from Ingo Molnar:
"The main changes are: continued PAT work by Toshi Kani, plus a new
boot time warning about insecure RWX kernel mappings, by Stephen
Smalley.
The new CONFIG_DEBUG_WX=y warning is marked default-y if
CONFIG_DEBUG_RODATA=y is already eanbled, as a special exception, as
these bugs are hard to notice and this check already found several
live bugs"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Warn on W^X mappings
x86/mm: Fix no-change case in try_preserve_large_page()
x86/mm: Fix __split_large_page() to handle large PAT bit
x86/mm: Fix try_preserve_large_page() to handle large PAT bit
x86/mm: Fix gup_huge_p?d() to handle large PAT bit
x86/mm: Fix slow_virt_to_phys() to handle large PAT bit
x86/mm: Fix page table dump to show PAT bit
x86/asm: Add pud_pgprot() and pmd_pgprot()
x86/asm: Fix pud/pmd interfaces to handle large PAT bit
x86/asm: Add pud/pmd mask interfaces to handle large PAT bit
x86/asm: Move PUD_PAGE macros to page_types.h
x86/vdso32: Define PGTABLE_LEVELS to 32bit VDSO
Pull x86 fpu changes from Ingo Molnar:
"There are two main areas of changes:
- Rework of the extended FPU state code to robustify the kernel's
usage of cpuid provided xstate sizes - and related changes (Dave
Hansen)"
- math emulation enhancements: new modern FPU instructions support,
with testcases, plus cleanups (Denys Vlasnko)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/fpu: Fixup uninitialized feature_name warning
x86/fpu/math-emu: Add support for FISTTP instructions
x86/fpu/math-emu, selftests: Add test for FISTTP instructions
x86/fpu/math-emu: Add support for FCMOVcc insns
x86/fpu/math-emu: Add support for F[U]COMI[P] insns
x86/fpu/math-emu: Remove define layer for undocumented opcodes
x86/fpu/math-emu, selftests: Add tests for FCMOV and FCOMI insns
x86/fpu/math-emu: Remove !NO_UNDOC_CODE
x86/fpu: Check CPU-provided sizes against struct declarations
x86/fpu: Check to ensure increasing-offset xstate offsets
x86/fpu: Correct and check XSAVE xstate size calculations
x86/fpu: Add C structures for AVX-512 state components
x86/fpu: Rework YMM definition
x86/fpu/mpx: Rework MPX 'xstate' types
x86/fpu: Add xfeature_enabled() helper instead of test_bit()
x86/fpu: Remove 'xfeature_nr'
x86/fpu: Rework XSTATE_* macros to remove magic '2'
x86/fpu: Rename XFEATURES_NR_MAX
x86/fpu: Rename XSAVE macros
x86/fpu: Remove partial LWP support definitions
...
Pull RAS changes from Ingo Molnar:
"The main system reliability related changes were from x86, but also
some generic RAS changes:
- AMD MCE error injection subsystem enhancements. (Aravind
Gopalakrishnan)
- Fix MCE and CPU hotplug interaction bug. (Ashok Raj)
- kcrash bootup robustness fix. (Baoquan He)
- kcrash cleanups. (Borislav Petkov)
- x86 microcode driver rework: simplify it by unmodularizing it and
other cleanups. (Borislav Petkov)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
x86/mce: Add a Scalable MCA vendor flags bit
MAINTAINERS: Unify the microcode driver section
x86/microcode/intel: Move #ifdef DEBUG inside the function
x86/microcode/amd: Remove maintainers from comments
x86/microcode: Remove modularization leftovers
x86/microcode: Merge the early microcode loader
x86/microcode: Unmodularize the microcode driver
x86/mce: Fix thermal throttling reporting after kexec
kexec/crash: Say which char is the unrecognized
x86/setup/crash: Check memblock_reserve() retval
x86/setup/crash: Cleanup some more
x86/setup/crash: Remove alignment variable
x86/setup: Cleanup crashkernel reservation functions
x86/amd_nb, EDAC: Rename amd_get_node_id()
x86/setup: Do not reserve crashkernel high memory if low reservation failed
x86/microcode/amd: Do not overwrite final patch levels
x86/microcode/amd: Extract current patch level read to a function
x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
...
__pa() in the x86 pageattr code. Since these virtual addreses are
not part of the direct mapping or kernel text mapping, passing them
to __pa() will trigger a BUG_ON() when CONFIG_DEBUG_VIRTUAL is
enabled - Sai Praneeth Prakhya
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWLLEiAAoJEC84WcCNIz1VWL4P/3i6TI8IksbMkjLrUPS80g9Q
QgYw2U1r750JrVkYGMQ6VyE1nZtSI7sbXRAjhYGLOQvs6dnCcsmAcT8zPVNSNVs4
hxRW2+hXe8LfMqddpGhSN7/s1Ssg7weEynUvQ58F+qRI4e5Lbk2zz4pqO6IQQVv2
Q4jIXosy7VlDxbx7Nn0wUov6cHNTeLQhGLqUiDJEw32qUnWA6NhYGrQof98Pu+gi
vcigb62ebicvduv6TRm5zPYANYz8AJqt7cywL7DkjT/vSKF+k/l6W5KWfOoVPd8N
99z9uTSJa1j+LuRYkPr8+YSPW9OtXD55QPkGYAFo/OWTt7j/QBgbGj0zJKWayrdK
3JefcQlH5wau5adaAjJwCm9qbapRS8De1yCMEz05gW8g5eZfqo32dnMg7UAWbWh5
fT6dAWffSxbYBLqPKV18nMgWTopwXh4IGddhB6y811SImUXhOvA7jwjJHYei8dKf
3eBN5rQ4owo0GfIiBGpKOt4ET2klgSF8xnnlYmbA7QeMRPKySNcXowsJ9wPimfZS
MGhmpz+tv/030ROuBqVjrvkGyFdxKgPbtH2hw6Ka7t88YqlyUyUIiu7pxZM2LY9g
l+6tLRWmExF993Tk3S9vCSfYEZwDHU2aYsObscgy1YEdydX4858uTSH91sVkoyBR
dtv8PmonJp/z3aw7XZKU
=xdZV
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi
Pull EFI fix from Matt Fleming:
- Fix a kernel panic by not passing EFI virtual mapping addresses to
__pa() in the x86 pageattr code. Since these virtual addreses are
not part of the direct mapping or kernel text mapping, passing them
to __pa() will trigger a BUG_ON() when CONFIG_DEBUG_VIRTUAL is
enabled. (Sai Praneeth Prakhya)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_DEBUG_VIRTUAL is enabled, all accesses to __pa(address) are
monitored to see whether address falls in direct mapping or kernel text
mapping (see Documentation/x86/x86_64/mm.txt for details), if it does
not, the kernel panics. During 1:1 mapping of EFI runtime services we access
virtual addresses which are == physical addresses, thus the 1:1 mapping
and these addresses do not fall in either of the above two regions and
hence when passed as arguments to __pa() kernel panics as reported by
Dave Hansen here https://lkml.kernel.org/r/5462999A.7090706@intel.com.
So, before calling __pa() virtual addresses should be validated which
results in skipping call to split_page_count() and that should be fine
because it is used to keep track of everything *but* 1:1 mappings.
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Glenn P Williamson <glenn.p.williamson@intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Merge the early loader functionality into the driver proper. The
diff is huge but logically, it is simply moving code from the
_early.c files into the main driver.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Switch to pr_debug() so that dynamic-debug can disable these messages by
default. This gets noisy in the presence of devm_memremap_pages().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Warn on any residual W+X mappings after setting NX
if DEBUG_WX is enabled. Introduce a separate
X86_PTDUMP_CORE config that enables the code for
dumping the page tables without enabling the debugfs
interface, so that DEBUG_WX can be enabled without
exposing the debugfs interface. Switch EFI_PGT_DUMP
to using X86_PTDUMP_CORE so that it also does not require
enabling the debugfs interface.
On success it prints this to the kernel log:
x86/mm: Checked W+X mappings: passed, no W+X pages found.
On failure it prints a warning and a count of the failed pages:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1 at arch/x86/mm/dump_pagetables.c:226 note_page+0x610/0x7b0()
x86/mm: Found insecure W+X mapping at address ffffffff81755000/__stop___ex_table+0xfa8/0xabfa8
[...]
Call Trace:
[<ffffffff81380a5f>] dump_stack+0x44/0x55
[<ffffffff8109d3f2>] warn_slowpath_common+0x82/0xc0
[<ffffffff8109d48c>] warn_slowpath_fmt+0x5c/0x80
[<ffffffff8106cfc9>] ? note_page+0x5c9/0x7b0
[<ffffffff8106d010>] note_page+0x610/0x7b0
[<ffffffff8106d409>] ptdump_walk_pgd_level_core+0x259/0x3c0
[<ffffffff8106d5a7>] ptdump_walk_pgd_level_checkwx+0x17/0x20
[<ffffffff81063905>] mark_rodata_ro+0xf5/0x100
[<ffffffff817415a0>] ? rest_init+0x80/0x80
[<ffffffff817415bd>] kernel_init+0x1d/0xe0
[<ffffffff8174cd1f>] ret_from_fork+0x3f/0x70
[<ffffffff817415a0>] ? rest_init+0x80/0x80
---[ end trace a1f23a1e42a2ac76 ]---
x86/mm: Checked W+X mappings: FAILED, 171 W+X pages found.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1444064120-11450-1-git-send-email-sds@tycho.nsa.gov
[ Improved the Kconfig help text and made the new option default-y
if CONFIG_DEBUG_RODATA=y, because it already found buggy mappings,
so we really want people to have this on by default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unused space between the end of __ex_table and the start of
rodata can be left W+x in the kernel page tables. Extend the
setting of the NX bit to cover this gap by starting from
text_end rather than rodata_start.
Before:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff81600000 6M ro PSE GLB x pmd
0xffffffff81600000-0xffffffff81754000 1360K ro GLB x pte
0xffffffff81754000-0xffffffff81800000 688K RW GLB x pte
0xffffffff81800000-0xffffffff81a00000 2M ro PSE GLB NX pmd
0xffffffff81a00000-0xffffffff81b3b000 1260K ro GLB NX pte
0xffffffff81b3b000-0xffffffff82000000 4884K RW GLB NX pte
0xffffffff82000000-0xffffffff82200000 2M RW PSE GLB NX pmd
0xffffffff82200000-0xffffffffa0000000 478M pmd
After:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff81600000 6M ro PSE GLB x pmd
0xffffffff81600000-0xffffffff81754000 1360K ro GLB x pte
0xffffffff81754000-0xffffffff81800000 688K RW GLB NX pte
0xffffffff81800000-0xffffffff81a00000 2M ro PSE GLB NX pmd
0xffffffff81a00000-0xffffffff81b3b000 1260K ro GLB NX pte
0xffffffff81b3b000-0xffffffff82000000 4884K RW GLB NX pte
0xffffffff82000000-0xffffffff82200000 2M RW PSE GLB NX pmd
0xffffffff82200000-0xffffffffa0000000 478M pmd
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443704662-3138-1-git-send-email-sds@tycho.nsa.gov
Signed-off-by: Ingo Molnar <mingo@kernel.org>
try_preserve_large_page() checks if new_prot is the same as
old_prot. If so, it simply sets do_split to 0, and returns
with no-operation. However, old_prot is set as a 4KB pgprot
value while new_prot is a large page pgprot value.
Now that old_prot is initially set from p?d_pgprot() as a
large page pgprot value, fix it by not overwriting old_prot
with a 4KB pgprot value.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Robert Elliot <elliott@hpe.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1442514264-12475-12-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
__split_large_page() is called from __change_page_attr() to change
the mapping attribute by splitting a given large page into smaller
pages. This function uses pte_pfn() and pte_pgprot() for PUD/PMD,
which do not handle the large PAT bit properly.
Fix __split_large_page() by using the corresponding pud/pmd pfn/
pgprot interfaces.
Also remove '#ifdef CONFIG_X86_64', which is not necessary.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Robert Elliot <elliott@hpe.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1442514264-12475-11-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
try_preserve_large_page() is called from __change_page_attr() to
change the mapping attribute of a given large page. This function
uses pte_pfn() and pte_pgprot() for PUD/PMD, which do not handle
the large PAT bit properly.
Fix try_preserve_large_page() by using the corresponding pud/pmd
prot/pfn interfaces.
Also remove '#ifdef CONFIG_X86_64', which is not necessary.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Robert Elliot <elliott@hpe.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1442514264-12475-10-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
gup_huge_pud() and gup_huge_pmd() cast *pud and *pmd to *pte,
and use pte_xxx() interfaces to obtain the flags and PFN.
However, the pte_xxx() interface does not handle the large
PAT bit properly for PUD/PMD.
Fix gup_huge_pud() and gup_huge_pmd() to use pud_xxx() and
pmd_xxx() interfaces according to their type.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Robert Elliot <elliott@hpe.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1442514264-12475-9-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
slow_virt_to_phys() calls lookup_address() to obtain *pte and
its level. It then calls pte_pfn() to obtain a physical address
for any level. However, this physical address is not correct
when the large PAT bit is set because pte_pfn() does not mask
the large PAT bit properly for PUD/PMD.
Fix slow_virt_to_phys() to use pud_pfn() and pmd_pfn() for 1GB
and 2MB mapping levels.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Robert Elliot <elliott@hpe.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1442514264-12475-8-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
/sys/kernel/debug/kernel_page_tables does not show the PAT bit for
PUD/PMD mappings. This is because walk_pud_level(), walk_pmd_level()
and note_page() mask the flags with PTE_FLAGS_MASK, which does not
cover their PAT bit, _PAGE_PAT_LARGE.
Fix it by replacing the use of PTE_FLAGS_MASK with p?d_flags(),
which masks the flags properly.
Also change to show the PAT bit as "PAT" to be consistent with
other bits.
Reported-by: Robert Elliott <elliott@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Robert Elliot <elliott@hpe.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1442514264-12475-7-git-send-email-toshi.kani@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Ingo Molnar:
- misc fixes all around the map
- block non-root vm86(old) if mmap_min_addr != 0
- two small debuggability improvements
- removal of obsolete paravirt op
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform: Fix Geode LX timekeeping in the generic x86 build
x86/apic: Serialize LVTT and TSC_DEADLINE writes
x86/ioapic: Force affinity setting in setup_ioapic_dest()
x86/paravirt: Remove the unused pv_time_ops::get_tsc_khz method
x86/ldt: Fix small LDT allocation for Xen
x86/vm86: Fix the misleading CONFIG_VM86 Kconfig help text
x86/cpu: Print family/model/stepping in hex
x86/vm86: Block non-root vm86(old) if mmap_min_addr != 0
x86/alternatives: Make optimize_nops() interrupt safe and synced
x86/mm/srat: Print non-volatile flag in SRAT
x86/cpufeatures: Enable cpuid for Intel SHA extensions
MPX includes two separate "extended state components". There is
no real need to have an 'mpx_struct' because we never really
manage the states together.
We also separate out the actual data in 'mpx_bndcsr_state' from
the padding. We will shortly be checking the state sizes
against our structures and need them to match. For consistency,
we also ensure to prefix these types with 'mpx_'.
Lastly, we add some comments to mirror some of the descriptions
in the Intel documents (SDM) of the various state components.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233129.384B73EB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two concepts that have some confusing naming:
1. Extended State Component numbers (currently called
XFEATURE_BIT_*)
2. Extended State Component masks (currently called XSTATE_*)
The numbers are (currently) from 0-9. State component 3 is the
bounds registers for MPX, for instance.
But when we want to enable "state component 3", we go set a bit
in XCR0. The bit we set is 1<<3. We can check to see if a
state component feature is enabled by looking at its bit.
The current 'xfeature_bit's are at best xfeature bit _numbers_.
Calling them bits is at best inconsistent with ending the enum
list with 'XFEATURES_NR_MAX'.
This patch renames the enum to be 'xfeature'. These also
happen to be what the Intel documentation calls a "state
component".
We also want to differentiate these from the "XSTATE_*" macros.
The "XSTATE_*" macros are a mask, and we rename them to match.
These macros are reasonably widely used so this patch is a
wee bit big, but this really is just a rename.
The only non-mechanical part of this is the
s/XSTATE_EXTEND_MASK/XFEATURE_MASK_EXTEND/
We need a better name for it, but that's another patch.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233126.38653250@viggo.jf.intel.com
[ Ported to v4.3-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
wrapper on top of do_mmap(). Perhaps we should update the callers of
do_mmap_pgoff() and kill it later.
This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not
play with vm internals.
After this change mmap_region() has a single user outside of mmap.c,
arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to
change arch/tile/ and unexport mmap_region().
[kirill@shutemov.name: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...
When parsing SRAT, all memory ranges are added into numa_meminfo. In
numa_init(), before entering numa_cleanup_meminfo(), all possible memory
ranges are in numa_meminfo. And numa_cleanup_meminfo() removes all
ranges over max_pfn or empty.
But, this only works if the nodes are continuous. Let's have a look at
the following example:
We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
On boot, only node 0,1,2,3 exist.
And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 60000000]
2. on node 0: [100000000, 20000000000]
3. on node 1: [20000000000, 40000000000]
4. on node 4: [40000000000, 60000000000]
5. on node 5: [60000000000, 80000000000]
6. on node 2: [80000000000, a0000000000]
7. on node 3: [a0000000000, a0800000000]
8. on node 6: [c0000000000, a0800000000]
9. on node 7: [e0000000000, a0800000000]
And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the
end address is over max_pfn, which is a0800000000. But 4 and 5 are not
removed because their end addresses are less then max_pfn. But in fact,
node 4 and 5 don't exist.
In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
Since memory ranges in node 4 and 5 are in numa_meminfo, in
numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
If you run lscpu, it will show:
NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220
In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block
contains all available memory at boot time, if they overlap, it means the
ranges exist. If not, then remove them from numa_meminfo.
After this patch, lscpu will show:
NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map. This facility is used by the pmem driver to
enable pfn_to_page() operations on the page frames returned by DAX
('direct_access' in 'struct block_device_operations'). For now, the
'memmap' allocation for these "device" pages comes from "System
RAM". Support for allocating the memmap from device memory will
arrive in a later kernel.
2/ Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3. Completion of
the conversion is targeted for v4.4.
3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
5/ Miscellaneous updates and fixes to libnvdimm including support
for issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
slsw6DkrWT60CRE42nbK
=o57/
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
second.
There already is a window between when a page is unmapped and when it is
TLB flushed. This series increases the window so multiple pages can be
flushed using a single IPI. This should be safe or the kernel is hosed
already.
Patch 1 simply made the rest of the series easier to write as ftrace
could identify all the senders of TLB flush IPIS.
Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
to flush the entire TLB.
Patch 3 tracks when there potentially are writable TLB entries that
need to be batched differently
Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes
The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.
This patch (of 4):
It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the
source of IPIs being transmitted for TLB flushes on x86.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the addition of NVDIMM support, a question came up as to
whether NVDIMM ranges should be in the SRAT with this bit set.
I think the consensus was no because the ranges are in the NFIT
with proximity domain information there.
ACPI is not clear on the meaning of this bit in the SRAT.
If someone is setting it, we might want to ask them what they
expect to happen with it.
Right now this bit is only printed if all the ACPI debug
information is turned on.
Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150901194154.GA4939@ljkz400
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 mm updates from Ingo Molnar:
"The dominant change in this cycle was the continued work to isolate
kernel drivers from MTRR legacies: this tree gets rid of all kernel
internal driver interfaces to MTRRs (mostly by rewriting it to proper
PAT interfaces), the only access left is the /proc/mtrr ABI.
This work was done by Luis R Rodriguez.
There's also some related PCI interface additions for which I've
Cc:-ed Bjorn"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm/mtrr: Remove kernel internal MTRR interfaces: unexport mtrr_add() and mtrr_del()
s390/io: Add pci_iomap_wc() and pci_iomap_wc_range()
drivers/dma/iop-adma: Use dma_alloc_writecombine() kernel-style
drivers/video/fbdev/vt8623fb: Use arch_phys_wc_add() and pci_iomap_wc()
drivers/video/fbdev/s3fb: Use arch_phys_wc_add() and pci_iomap_wc()
drivers/video/fbdev/arkfb.c: Use arch_phys_wc_add() and pci_iomap_wc()
PCI: Add pci_iomap_wc() variants
drivers/video/fbdev/gxt4500: Use pci_ioremap_wc_bar() to map framebuffer
drivers/video/fbdev/kyrofb: Use arch_phys_wc_add() and pci_ioremap_wc_bar()
drivers/video/fbdev/i740fb: Use arch_phys_wc_add() and pci_ioremap_wc_bar()
PCI: Add pci_ioremap_wc_bar()
x86/mm: Make kernel/check.c explicitly non-modular
x86/mm/pat: Make mm/pageattr[-test].c explicitly non-modular
x86/mm/pat: Add comments to cachemode translation tables
arch/*/io.h: Add ioremap_uc() to all architectures
drivers/video/fbdev/atyfb: Use arch_phys_wc_add() and ioremap_wc()
drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC
drivers/video/fbdev/atyfb: Clarify ioremap() base and length used
drivers/video/fbdev/atyfb: Carve out framebuffer length fudging into a helper
x86/mm, asm-generic: Add IOMMU ioremap_uc() variant default
...
Pull x86 asm changes from Ingo Molnar:
"The biggest changes in this cycle were:
- Revamp, simplify (and in some cases fix) Time Stamp Counter (TSC)
primitives. (Andy Lutomirski)
- Add new, comprehensible entry and exit handlers written in C.
(Andy Lutomirski)
- vm86 mode cleanups and fixes. (Brian Gerst)
- 32-bit compat code cleanups. (Brian Gerst)
The amount of simplification in low level assembly code is already
palpable:
arch/x86/entry/entry_32.S | 130 +----
arch/x86/entry/entry_64.S | 197 ++-----
but more simplifications are planned.
There's also the usual laudry mix of low level changes - see the
changelog for details"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (83 commits)
x86/asm: Drop repeated macro of X86_EFLAGS_AC definition
x86/asm/msr: Make wrmsrl() a function
x86/asm/delay: Introduce an MWAITX-based delay with a configurable timer
x86/asm: Add MONITORX/MWAITX instruction support
x86/traps: Weaken context tracking entry assertions
x86/asm/tsc: Add rdtscll() merge helper
selftests/x86: Add syscall_nt selftest
selftests/x86: Disable sigreturn_64
x86/vdso: Emit a GNU hash
x86/entry: Remove do_notify_resume(), syscall_trace_leave(), and their TIF masks
x86/entry/32: Migrate to C exit path
x86/entry/32: Remove 32-bit syscall audit optimizations
x86/vm86: Rename vm86->v86flags and v86mask
x86/vm86: Rename vm86->vm86_info to user_vm86
x86/vm86: Clean up vm86.h includes
x86/vm86: Move the vm86 IRQ definitions to vm86.h
x86/vm86: Use the normal pt_regs area for vm86
x86/vm86: Eliminate 'struct kernel_vm86_struct'
x86/vm86: Move fields from 'struct kernel_vm86_struct' to 'struct vm86'
x86/vm86: Move vm86 fields out of 'thread_struct'
...
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.
Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The file pageattr.c is obj-y and it includes pageattr-test.c
based on CPA_DEBUG (a bool), meaning that no code here is
currently being built as a module by anyone.
Lets remove the couple traces of modularity so that when reading
the code there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the
non-modular case, the init ordering remains unchanged with this
commit.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1440459295-21814-3-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce generic kasan_populate_zero_shadow(shadow_start,
shadow_end). This function maps kasan_zero_page to the
[shadow_start, shadow_end] addresses.
This replaces x86_64 specific populate_zero_shadow() and will
be used for ARM64 in follow on patches.
The main changes from original version are:
* Use p?d_populate*() instead of set_p?d()
* Use memblock allocator directly instead of vmemmap_alloc_block()
* __pa() instead of __pa_nodebug(). __pa() causes troubles
iff we use it before kasan_early_init(). kasan_populate_zero_shadow()
will be used later, so we ok with __pa() here.
Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Klimov <klimov.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Keitel <dkeitel@codeaurora.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yury <yury.norov@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1439444244-26057-3-git-send-email-ryabinin.a.a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add comments to the cachemode translation tables to clarify that
the default values are set as minimal supported mode, which are
necessary to handle WC and WT fallback to UC- when they are not
enabled.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1437588371-28223-1-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
vm86.h was being implicitly included in alot of places via
processor.h, which in turn got it from math_emu.h. Break that
chain and explicitly include vm86.h in all files that need it.
Also remove unused vm86 field from math_emu_info.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438148483-11932-7-git-send-email-brgerst@gmail.com
[ Fixed build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allocate a separate structure for the vm86 fields.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438148483-11932-2-git-send-email-brgerst@gmail.com
[ Build fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Toshi explains:
"No, the default values need to be set to the fallback types,
i.e. minimal supported mode. For WC and WT, UC is the fallback type.
When PAT is disabled, pat_init() does update the tables below to
enable WT per the default BIOS setup. However, when PAT is enabled,
but CPU has PAT -errata, WT falls back to UC per the default values."
Revert: ca1fec58bc 'x86/mm/pat: Adjust default caching mode translation tables'
Requested-by: Toshi Kani <toshi.kani@hp.com>
Cc: Jan Beulich <jbeulich@suse.de>
Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
__ioremap_caller() calls region_is_ram() to walk through the
iomem_resource table to check if a target range is in RAM, which was
added to improve the lookup performance over page_is_ram() (commit
906e36c5c7 "x86: use optimized ioresource lookup in ioremap
function"). page_is_ram() was no longer used when this change was
added, though.
__ioremap_caller() then calls walk_system_ram_range(), which had
replaced page_is_ram() to improve the lookup performance (commit
c81c8a1eee "x86, ioremap: Speed up check for RAM pages").
Since both checks walk through the same iomem_resource table for
the same purpose, there is no need to call both functions.
Aside of that walk_system_ram_range() is the only useful check at the
moment because region_is_ram() always returns -1 due to an
implementation bug. That bug in region_is_ram() cannot be fixed
without breaking existing ioremap callers, which rely on the subtle
difference of walk_system_ram_range() versus non page aligned ranges.
Once these offending callers are fixed we can use region_is_ram() and
remove walk_system_ram_range().
[ tglx: Massaged changelog ]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-3-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
__ioremap_check_ram() has a WARN_ONCE() which is emitted when the
given pfn range is not RAM. The warning is bogus in two aspects:
- it never triggers since walk_system_ram_range() only calls
__ioremap_check_ram() for RAM ranges.
- the warning message is wrong as it says: "ioremap on RAM' after it
established that the pfn range is not RAM.
Move the WARN_ONCE() to __ioremap_caller(), and update the message to
include the address range so we get an actual warning when something
tries to ioremap system RAM.
[ tglx: Massaged changelog ]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-2-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make WT really mean WT (rather than UC).
I can't see why commit 9cd25aac1f ("x86/mm/pat: Emulate PAT when
it is disabled") didn't make this to match its changes to
pat_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/55ACC3660200007800092E62@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MPX setups private anonymous mapping, but uses vma->vm_ops too.
This can confuse core VM, as it relies on vm->vm_ops to
distinguish file VMAs from anonymous.
As result we will get SIGBUS, because handle_pte_fault() thinks
it's file VMA without vm_ops->fault and it doesn't know how to
handle the situation properly.
Let's fix that by not setting ->vm_ops.
We don't really need ->vm_ops here: MPX VMA can be detected with
VM_MPX flag. And vma_merge() will not merge MPX VMA with non-MPX
VMA, because ->vm_flags won't match.
The only thing left is name of VMA. I'm not sure if it's part of
ABI, or we can just drop it. The patch keep it by providing
arch_vma_name() on x86.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org> # Fixes: 6b7339f4 (mm: avoid setting up anonymous pages into file mapping)
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@sr71.net
Link: http://lkml.kernel.org/r/20150720212958.305CC3E9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>