Commit Graph

70 Commits

Author SHA1 Message Date
Linus Torvalds
315227f6da DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on
   any device. This enables the use of DAX in the presence of these
   errors by making all sector-aligned zeroing go through the driver.
 - The driver (already) has the ability to clear errors on writes that
   are sent through the block layer using 'DSMs' defined in ACPI 6.1.
 
 Other misc changes:
 
 - When mounting DAX filesystems, check to make sure the partition
   is page aligned. This is a requirement for DAX, and previously, we
   allowed such unaligned mounts to succeed, but subsequent reads/writes
   would fail.
 
 - Misc/cleanup fixes from Jan that remove unused code from DAX related to
   zeroing, writeback, and some size checks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
 GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
 lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
 aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
 vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
 9R6JdrAj0Sc+FBa+cGzH
 =v6Of
 -----END PGP SIGNATURE-----

Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
 "DAX error handling for 4.7

   - Until now, dax has been disabled if media errors were found on any
     device.  This enables the use of DAX in the presence of these
     errors by making all sector-aligned zeroing go through the driver.

   - The driver (already) has the ability to clear errors on writes that
     are sent through the block layer using 'DSMs' defined in ACPI 6.1.

  Other misc changes:

   - When mounting DAX filesystems, check to make sure the partition is
     page aligned.  This is a requirement for DAX, and previously, we
     allowed such unaligned mounts to succeed, but subsequent
     reads/writes would fail.

   - Misc/cleanup fixes from Jan that remove unused code from DAX
     related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 19:34:26 -07:00
Dan Williams
c5ed926864 libnvdimm, dax: autodetect support
For autodetecting a previously established dax configuration we need the
info block to indicate block-device vs device-dax mode, and we need to
have the default namespace probe hand-off the configuration to the
dax_pmem driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20 22:02:57 -07:00
Dan Williams
0a70bd4305 dax: enable dax in the presence of known media errors (badblocks)
1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
   requested when errors present.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-18 12:16:56 -06:00
Dan Williams
5a92289f41 libnvdimm, pmem: kill ->pmem_queue and ->pmem_disk
The devm conversion obviates the need to continue to remember the queue
and disk locally in the driver.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:24 -07:00
Dan Williams
ac515c084b libnvdimm, pmem, pfn: move pfn setup to the core
Now that pmem internals have been disentangled from pfn setup, that code
can move to the core.  This is in preparation for adding another user of
the pfn-device capabilities.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams
200c79da82 libnvdimm, pmem, pfn: make pmem_rw_bytes generic and refactor pfn setup
In preparation for providing an alternative (to block device) access
mechanism to persistent memory, convert pmem_rw_bytes() to
nsio_rw_bytes().  This allows ->rw_bytes() functionality without
requiring a 'struct pmem_device' to be instantiated.

In other words, when ->rw_bytes() is in use i/o is driven through
'struct nd_namespace_io', otherwise it is driven through 'struct
pmem_device' and the block layer.  This consolidates the disjoint calls
to devm_exit_badblocks() and devm_memunmap() into a common
devm_nsio_disable() and cleans up the init path to use a unified
pmem_attach_disk() implementation.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams
947df02d25 libnvdimm, pmem: clean up resource print / request
The leading '0x' in front of %pa is redundant, also we can just use %pR
to simplify the print statement.  The request parameters can be directly
taken from the resource as well.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams
030b99e39c libnvdimm, pmem: use devm_add_action to release bdev resources
Register a callback to clean up the request_queue and put the gendisk at
driver disable time.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams
bd842b8ca7 libnvdimm, pmem: use ->queuedata for driver private data
Save a pointer chase by storing the driver private data in the
request_queue rather than the gendisk.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 12:26:23 -07:00
Dan Williams
e32bc729a3 libnvdimm, btt, convert nd_btt_probe() to devm
Pass the device performing the probe so we can use a devm allocation for
the btt superblock.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 10:59:54 -07:00
Dan Williams
bd032943b5 libnvdimm, pfn, convert nd_pfn_probe() to devm
Pass the device performing the probe so we can use a devm allocation for
the pfn superblock.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 10:59:54 -07:00
Dan Williams
298f2bc5db libnvdimm, pmem: kill pmem->ndns
We can derive the common namespace from other information.  We also do
not need to cache it because all the usages are in slow paths.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-22 10:59:54 -07:00
Dan Williams
0a370d261c libnvdimm, pmem: clarify the write+clear_poison+write flow
The ACPI specification does not specify the state of data after a clear
poison operation.  Potential future libnvdimm bus implementations for
other architectures also might not specify or disagree on the state of
data after clear poison.  Clarify why we write twice.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
2016-04-15 14:59:41 -06:00
Linus Torvalds
239467e852 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Three fixes, the first two are tagged for -stable:

   - The ndctl utility/library gained expanded unit tests illuminating a
     long standing bug in the libnvdimm SMART data retrieval
     implementation.

     It has been broken since its initial implementation, now fixed.

   - Another one line fix for the detection of stale info blocks.

     Without this change userspace can get into a situation where it is
     unable to reconfigure a namespace.

   - Fix the badblock initialization path in the presence of the new (in
     v4.6-rc1) section alignment workarounds.

     Without this change badblocks will be reported at the wrong offset.

  These have received a build success report from the kbuild robot and
  have appeared in -next with no reported issues"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment
  libnvdimm, pfn: fix uuid validation
  libnvdimm: fix smart data retrieval
2016-04-09 14:05:45 -07:00
Dan Williams
a390180291 libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment
When section alignment padding is in effect we need to shift / truncate
the range that is queried for poison by the 'start_pad' or 'end_trunc'
reservations.

It's easiest if we just pass in an adjusted resource range rather than
deriving it from the passed in namespace.  With the resource range
resolution pushed out to the caller we can also push the
namespace-to-region lookup to the caller and drop the implicit pmem-type
assumption about the passed in namespace object.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-04-07 20:02:06 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Dan Williams
fc0c202813 x86, pmem: use memcpy_mcsafe() for memcpy_from_pmem()
Update the definition of memcpy_from_pmem() to return 0 or a negative
error code.  Implement x86/arch_memcpy_from_pmem() with memcpy_mcsafe().

Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-28 17:19:31 -07:00
Dan Williams
489011652a Merge branch 'for-4.6/pfn' into libnvdimm-for-next 2016-03-09 17:15:43 -08:00
Dan Williams
59e6473980 libnvdimm, pmem: clear poison on write
If a write is directed at a known bad block perform the following:

1/ write the data

2/ send a clear poison command

3/ invalidate the poison out of the cache hierarchy

Cc: <x86@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 15:15:32 -08:00
Dan Williams
b5ebc8ec69 libnvdimm, pmem: fix kmap_atomic() leak in error path
When we enounter a bad block we need to kunmap_atomic() before
returning.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 15:12:41 -08:00
NeilBrown
55155291b3 pmem: don't allocate unused major device number
When alloc_disk(0) or alloc_disk-node(0, XX) is used, the ->major
number is completely ignored:  all devices are allocated with a
major of BLOCK_EXT_MAJOR.

So there is no point allocating pmem_major.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:31:34 -08:00
Dan Williams
45f68802f2 libnvdimm, pmem: fix ia64 build, use PHYS_PFN
drivers/nvdimm/pmem.c: In function 'nvdimm_namespace_attach_pfn':
   drivers/nvdimm/pmem.c:367:3: error: implicit declaration of function
   	'__phys_to_pfn' [-Werror=implicit-function-declaration]
   .base_pfn = __phys_to_pfn(nsio->res.start),

ia64 does not provide __phys_to_pfn(), just use the PHYS_PFN() alias.

Cc: Guenter Roeck <linux@roeck-us.net>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-06 08:04:12 -08:00
Dan Williams
cfe30b8720 libnvdimm, pmem: adjust for section collisions with 'System RAM'
On a platform where 'Persistent Memory' and 'System RAM' are mixed
within a given sparsemem section, trim the namespace and notify about the
sub-optimal alignment.

Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:25:45 -08:00
Dan Williams
d9cbe09d39 libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
The altmap for a section-misaligned namespace needs to arrange for the
base_pfn to be section-aligned.  As a result the 'reserve' region (pfns
from base that do not have a struct page) must be increased.  Otherwise
we trip the altmap validation check in __add_pages:

	if (altmap->base_pfn != phys_start_pfn
			|| vmem_altmap_offset(altmap) > nr_pages) {
		pr_warn_once("memory add fail, invalid altmap\n");
		return -EINVAL;
	}

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:25:44 -08:00
Dan Williams
719994660c libnvdimm: async notification support
In preparation for asynchronous address range scrub support add an
ability for the pmem driver to dynamically consume address range scrub
results.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-05 12:24:06 -08:00
Arnd Bergmann
c45442055d nvdimm: use 'u64' for pfn flags
A recent bugfix changed pfn_t to always be 64-bit wide, but did not
change the code in pmem.c, which is now broken on 32-bit architectures
as reported by gcc:

In file included from ../drivers/nvdimm/pmem.c:28:0:
drivers/nvdimm/pmem.c: In function 'pmem_alloc':
include/linux/pfn_t.h:15:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))

This changes the intermediate pfn_flags in struct pmem_device to
be 64 bit wide as well, so they can store the flags correctly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: db78c22230 ("mm: fix pfn_t vs highmem")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-02-23 17:17:20 -08:00
Dan Williams
5c2c2587b1 mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
get_dev_page() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use.  Unlike get_page() it may fail if the device is, or
is in the process of being, disabled.  While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.

devm_memremap_pages() now requires a reference counter to be specified
at init time.  For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".

ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list.  That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert.  This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
468ded03c0 libnvdimm, pmem: move request_queue allocation earlier in probe
Before the dynamically allocated struct pages from devm_memremap_pages()
can be put to use outside the driver, we need a mechanism to track
whether they are still in use at teardown.  Towards that goal reorder
the initialization sequence to allow the 'q_usage_counter' from the
request_queue to be used by the devm_memremap_pages() implementation (in
subsequent patches).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
d2c0f041e1 libnvdimm, pfn, pmem: allocate memmap array in persistent memory
Use the new vmem_altmap capability to enable the pmem driver to arrange
for a struct page memmap to be established in persistent memory.

[linux@roeck-us.net: mn10300: declare __pfn_to_phys() to fix build error]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
4b94ffdc41 x86, mm: introduce vmem_altmap to augment vmemmap_populate()
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
9476df7d80 mm: introduce find_dev_pagemap()
There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:

1/ We want to augment vmemmap_populate() (called via arch_add_memory())
   to allocate memmap storage from pre-allocated pages reserved by the
   device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
   rather than page allocator pages.  This is in support of
   devm_memremap_pages() mappings where the memmap is too large to fit in
   main memory (i.e. large persistent memory devices).

2/ Taking a reference against the mapping when inserting device pages
   into the address_space radix of a given inode.  This facilitates
   unmap_mapping_range() and truncate_inode_pages() operations when the
   driver is tearing down the mapping.

3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
   reference against the mapping so that the driver teardown path can
   revoke and drain usage of device pages.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
34c0fd540e mm, dax, pmem: introduce pfn_t
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dan Williams
8b63b6bfc1 Merge branch 'for-4.5/block-dax' into for-4.5/libnvdimm 2016-01-10 07:53:55 -08:00
Dan Williams
710d69cc99 libnvdimm, pmem: nvdimm_read_bytes() badblocks support
Support badblock checking in all the pmem read paths that do not go
through the block layer.  This protects info block reads (btt or pfn) as
well as data reads to a pmem namespace via a btt instance.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 22:42:31 -08:00
Dan Williams
57f7f317ab pmem, dax: disable dax in the presence of bad blocks
Longer term teach dax to punch "error" holes in mapping requests and
deliver SIGBUS to applications that consume a bad pmem page.  For now,
simply disable the dax performance optimization in the presence of known
errors.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 22:42:31 -08:00
Dan Williams
e10624f8c0 pmem: fail io-requests to known bad blocks
Check the sectors specified in a read bio to see if they hit a known bad
block, and return an error code pmem_do_bvec().

Note that the ->rw_page() is not in a position to return errors.  For
now, copy the same layering violation present in zram_rw_page() to avoid
crashes of the form:

 kernel BUG at mm/filemap.c:822!
 [..]
 Call Trace:
  [<ffffffff811c540e>] page_endio+0x1e/0x60
  [<ffffffff81290d29>] mpage_end_io+0x39/0x60
  [<ffffffff8141c4ef>] bio_endio+0x3f/0x60
  [<ffffffffa005c491>] pmem_make_request+0x111/0x230 [nd_pmem]

...i.e. unlock a page that was already unlocked via pmem_rw_page() =>
page_endio().

Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Dan Williams
b95f5f4391 libnvdimm: convert to statically allocated badblocks
If a device will ever have badblocks it should always have a badblocks
instance available.  So, similar to md, embed a badblocks instance in
pmem_device.  This reduces pointer chasing in the i/o fast path, and
simplifies the init path.

Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:04 -08:00
Dan Williams
ad9a8bde2c libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h
nd-core.h is private to the libnvdimm core internals and should not be
used by drivers.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:03 -08:00
Vishal Verma
0caeef63e6 libnvdimm: Add a poison list and export badblocks
During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.

When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09 08:39:03 -08:00
Dan Williams
3fa9626865 libnvdimm, pfn: fix nd_pfn_validate() return value handling
The -ENODEV case indicates that the info-block needs to established.
All other return codes cause nd_pfn_init() to abort.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-24 12:20:19 -08:00
Dan Williams
a34d5e8a6a libnvdimm, pfn: add parent uuid validation
Track and check the uuid of the namespace hosting a pfn instance.  This
forces the pfn info block to be invalidated if the namespace is
re-configured with a different uuid.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-13 11:40:20 -08:00
Dan Williams
315c562536 libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE
When setting aside capacity for struct page it must be aligned to the
largest mapping size that is to be made available via DAX.  Make the
alignment configurable to enable support for 1GiB page-size mappings.

The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().

Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-12 15:04:26 -08:00
Dan Williams
9f1e8cee77 libnvdimm, pfn: kill ND_PFN_ALIGN
The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-12-10 15:14:20 -08:00
Dan Williams
589e75d157 libnvdimm, pmem: fix size trim in pmem_direct_access()
This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-12 09:55:23 -08:00
Linus Torvalds
3419b45039 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
Pull block IO poll support from Jens Axboe:
 "Various groups have been doing experimentation around IO polling for
  (really) fast devices.  The code has been reviewed and has been
  sitting on the side for a few releases, but this is now good enough
  for coordinated benchmarking and further experimentation.

  Currently O_DIRECT sync read/write are supported.  A framework is in
  the works that allows scalable stats tracking so we can auto-tune
  this.  And we'll add libaio support as well soon.  Fow now, it's an
  opt-in feature for test purposes"

* 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
  direct-io: be sure to assign dio->bio_bdev for both paths
  directio: add block polling support
  NVMe: add blk polling support
  block: add block polling support
  blk-mq: return tag/queue combo in the make_request_fn handlers
  block: change ->make_request_fn() and users to return a queue cookie
2015-11-10 17:23:49 -08:00
Jens Axboe
dece16353e block: change ->make_request_fn() and users to return a queue cookie
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
2015-11-07 10:40:46 -07:00
Dan Williams
538ea4aa44 pmem, memremap: convert to numa aware allocations
Given that pmem ranges come with numa-locality hints, arrange for the
resulting driver objects to be obtained from node-local memory.

Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-10-09 17:00:33 -04:00
Dan Williams
b36f47617f devm_memremap: convert to return ERR_PTR
Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-10-09 17:00:33 -04:00
Dan Williams
a639315d6c pmem: kill memremap_pmem()
Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem",  the mapping type is implied.  Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-10-09 17:00:32 -04:00
Ross Zwisler
ba8fe0f85e pmem: add proper fencing to pmem_rw_page()
pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable.  This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:

commit 61031952f4 ("arch, x86: pmem api for ensuring durability of
	persistent memory updates")

...the pmem_rw_page() path was missed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-09-17 11:49:28 -04:00