When we go and allocate a bio for IO, we actually do two allocations.
One for the bio itself, and one for the bi_io_vec that holds the
actual pages we are interested in.
This feature inlines a definable amount of io vecs inside the bio
itself, so we eliminate the bio_vec array allocation for IO's up
to a certain size. It defaults to 4 vecs, which is typically 16k
of IO.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Instead of having a global bio slab cache, add a reference to one
in each bio_set that is created. This allows for personalized slabs
in each bio_set, so that they can have bios of different sizes.
This means we can personalize the bios we return. File systems may
want to embed the bio inside another structure, to avoid allocation
more items (and stuffing them in ->bi_private) after the get a bio.
Or we may want to embed a number of bio_vecs directly at the end
of a bio, to avoid doing two allocations to return a bio. This is now
possible.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We only very rarely need the mempool backing, so it makes sense to
get rid of all but one of the mempool in a bio_set. So keep the
largest bio_vec count mempool so we can always honor the largest
allocation, and "upgrade" callers that fail.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Empty barrier required special handling in __elv_next_request() to
complete it without letting the low level driver see it.
With previous changes, barrier code is now flexible enough to skip the
BAR step using the same barrier sequence selection mechanism. Drop
the special handling and mask off q->ordered from start_ordered().
Remove blk_empty_barrier() test which now has no user.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Barrier completion had the following assumptions.
* start_ordered() couldn't finish the whole sequence properly. If all
actions are to be skipped, q->ordseq is set correctly but the actual
completion was never triggered thus hanging the barrier request.
* Drain completion in elv_complete_request() assumed that there's
always at least one request in the queue when drain completes.
Both assumptions are true but these assumptions need to be removed to
improve empty barrier implementation. This patch makes the following
changes.
* Make start_ordered() use blk_ordered_complete_seq() to mark skipped
steps complete and notify __elv_next_request() that it should fetch
the next request if the whole barrier has completed inside
start_ordered().
* Make drain completion path in elv_complete_request() check whether
the queue is empty. Empty queue also indicates drain completion.
* While at it, convert 0/1 return from blk_do_ordered() to false/true.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In all barrier sequences, the barrier write itself was always assumed
to be issued and thus didn't have corresponding control flag. This
patch adds QUEUE_ORDERED_DO_BAR and unify action mask handling in
start_ordered() such that any barrier action can be skipped.
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Separate out ordering type (drain,) and action masks (preflush,
postflush, fua) from visible ordering mode selectors
(QUEUE_ORDERED_*). Ordering types are now named QUEUE_ORDERED_BY_*
while action masks are named QUEUE_ORDERED_DO_*.
This change is necessary to add QUEUE_ORDERED_DO_BAR and make it
optional to improve empty barrier implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove 8 bytes of padding from struct bio which also removes 16 bytes from
struct bio_pair to make it 248 bytes. bio_pair then fits into one fewer
cache lines & into a smaller slab.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After many improvements on kblockd_flush_work, it is now identical to
cancel_work_sync, so a direct call to cancel_work_sync is suggested.
The only difference is that cancel_work_sync is a GPL symbol,
so no non-GPL modules anymore.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Allow the scsi request REQ_QUIET flag to be propagated to the buffer
file system layer. The basic ideas is to pass the flag from the scsi
request to the bio (block IO) and then to the buffer layer. The buffer
layer can then suppress needless printks.
This patch declutters the kernel log by removed the 40-50 (per lun)
buffer io error messages seen during a boot in my multipath setup . It
is a good chance any real errors will be missed in the "noise" it the
logs without this patch.
During boot I see blocks of messages like
"
__ratelimit: 211 callbacks suppressed
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242847
Buffer I/O error on device sdm, logical block 1
Buffer I/O error on device sdm, logical block 5242878
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242879
Buffer I/O error on device sdm, logical block 5242872
"
in my logs.
My disk environment is multipath fiber channel using the SCSI_DH_RDAC
code and multipathd. This topology includes an "active" and "ghost"
path for each lun. IO's to the "ghost" path will never complete and the
SCSI layer, via the scsi device handler rdac code, quick returns the IOs
to theses paths and sets the REQ_QUIET scsi flag to suppress the scsi
layer messages.
I am wanting to extend the QUIET behavior to include the buffer file
system layer to deal with these errors as well. I have been running this
patch for a while now on several boxes without issue. A few runs of
bonnie++ show no noticeable difference in performance in my setup.
Thanks for John Stultz for the quiet_error finalization.
Submitted-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As is the case with SSD devices, we do not want to idle in AS/CFQ when
the block device is a paravirt front-end driver. This patch adds a flag
(QUEUE_FLAG_VIRT) which should be used by front-end drivers such as
virtio_blk and xen-blkfront to indicate a paravirtualized device.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (144 commits)
powerpc/44x: Support 16K/64K base page sizes on 44x
powerpc: Force memory size to be a multiple of PAGE_SIZE
powerpc/32: Wire up the trampoline code for kdump
powerpc/32: Add the ability for a classic ppc kernel to be loaded at 32M
powerpc/32: Allow __ioremap on RAM addresses for kdump kernel
powerpc/32: Setup OF properties for kdump
powerpc/32/kdump: Implement crash_setup_regs() using ppc_save_regs()
powerpc: Prepare xmon_save_regs for use with kdump
powerpc: Remove default kexec/crash_kernel ops assignments
powerpc: Make default kexec/crash_kernel ops implicit
powerpc: Setup OF properties for ppc32 kexec
powerpc/pseries: Fix cpu hotplug
powerpc: Fix KVM build on ppc440
powerpc/cell: add QPACE as a separate Cell platform
powerpc/cell: fix build breakage with CONFIG_SPUFS disabled
powerpc/mpc5200: fix error paths in PSC UART probe function
powerpc/mpc5200: add rts/cts handling in PSC UART driver
powerpc/mpc5200: Make PSC UART driver update serial errors counters
powerpc/mpc5200: Remove obsolete code from mpc5200 MDIO driver
powerpc/mpc5200: Add MDMA/UDMA support to MPC5200 ATA driver
...
Fix trivial conflict in drivers/char/Makefile as per Paul's directions
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: use the new byteorder headers
fbcon: Protect free_irq() by MACH_IS_ATARI check
fbcon: remove broken mac vbl handler
m68k: fix trigraph ignored warning in setox.S
macfb annotations and compiler warning fix
m68k: mac baboon interrupt enable/disable
m68k: machw.h cleanup
m68k: Mac via cleanup and commentry
m68k: Reinstate mac rtc
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
net: Allow dependancies of FDDI & Tokenring to be modular.
igb: Fix build warning when DCA is disabled.
net: Fix warning fallout from recent NAPI interface changes.
gro: Fix potential use after free
sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
sfc: When disabling the NIC, close the device rather than unregistering it
sfc: SFT9001: Add cable diagnostics
sfc: Add support for multiple PHY self-tests
sfc: Merge top-level functions for self-tests
sfc: Clean up PHY mode management in loopback self-test
sfc: Fix unreliable link detection in some loopback modes
sfc: Generate unique names for per-NIC workqueues
802.3ad: use standard ethhdr instead of ad_header
802.3ad: generalize out mac address initializer
802.3ad: initialize ports LACPDU from const initializer
802.3ad: remove typedef around ad_system
802.3ad: turn ports is_individual into a bool
802.3ad: turn ports is_enabled into a bool
802.3ad: make ntt bool
ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
...
Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
to the conversion to %pI (in this networking merge) and the addition of
doing IPv6 addresses (from the earlier merge of CIFS).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (26 commits)
IB/mlx4: Set ownership bit correctly when copying CQEs during CQ resize
RDMA/nes: Remove tx_free_list
RDMA/cma: Add IPv6 support
RDMA/addr: Add support for translating IPv6 addresses
mlx4_core: Delete incorrect comment
mlx4_core: Add support for multiple completion event vectors
IB/iser: Avoid recv buffer exhaustion caused by unexpected PDUs
IB/ehca: Remove redundant test of vpage
IB/ehca: Replace modulus operations in flush error completion path
IB/ipath: Add locking for interrupt use of ipath_pd contexts vs free
IB/ipath: Fix spi_pioindex value
IB/ipath: Only do 1X workaround on rev1 chips
IB/ipath: Don't count IB symbol and link errors unless link is UP
IB/ipath: Check return value of dma_map_single()
IB/ipath: Fix PSN of send WQEs after an RDMA read resend
RDMA/nes: Cleanup warnings
RDMA/nes: Add loopback check to make_cm_node()
RDMA/nes: Check cqp_avail_reqs is empty after locking the list
RDMA/nes: Fix TCP compliance test failures
RDMA/nes: Forward packets for a new connection with stale APBVT entry
...
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (246 commits)
x86: traps.c replace #if CONFIG_X86_32 with #ifdef CONFIG_X86_32
x86: PAT: fix address types in track_pfn_vma_new()
x86: prioritize the FPU traps for the error code
x86: PAT: pfnmap documentation update changes
x86: PAT: move track untrack pfnmap stubs to asm-generic
x86: PAT: remove follow_pfnmap_pte in favor of follow_phys
x86: PAT: modify follow_phys to return phys_addr prot and return value
x86: PAT: clarify is_linear_pfn_mapping() interface
x86: ia32_signal: remove unnecessary declaration
x86: common.c boot_cpu_stack and boot_exception_stacks should be static
x86: fix intel x86_64 llc_shared_map/cpu_llc_id anomolies
x86: fix warning in arch/x86/kernel/microcode_amd.c
x86: ia32.h: remove unused struct sigfram32 and rt_sigframe32
x86: asm-offset_64: use rt_sigframe_ia32
x86: sigframe.h: include headers for dependency
x86: traps.c declare functions before they get used
x86: PAT: update documentation to cover pgprot and remap_pfn related changes - v3
x86: PAT: add pgprot_writecombine() interface for drivers - v3
x86: PAT: change pgprot_noncached to uc_minus instead of strong uc - v3
x86: PAT: implement track/untrack of pfnmap regions for x86 - v3
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (105 commits)
SELinux: don't check permissions for kernel mounts
security: pass mount flags to security_sb_kern_mount()
SELinux: correctly detect proc filesystems of the form "proc/foo"
Audit: Log TIOCSTI
user namespaces: document CFS behavior
user namespaces: require cap_set{ug}id for CLONE_NEWUSER
user namespaces: let user_ns be cloned with fairsched
CRED: fix sparse warnings
User namespaces: use the current_user_ns() macro
User namespaces: set of cleanups (v2)
nfsctl: add headers for credentials
coda: fix creds reference
capabilities: define get_vfs_caps_from_disk when file caps are not enabled
CRED: Allow kernel services to override LSM settings for task actions
CRED: Add a kernel_service object class to SELinux
CRED: Differentiate objective and effective subjective credentials on a task
CRED: Documentation
CRED: Use creds in file structs
CRED: Prettify commoncap.c
CRED: Make execve() take advantage of copy-on-write credentials
...
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (132 commits)
sh: oprofile: Fix up the module build.
sh: add UIO support for JPU on SH7722.
serial: sh-sci: Fix up port pinmux for SH7366.
sh: mach-rsk: Use uImage generation by default for rsk7201/7203.
sh: mach-sh03: Fix up pata_platform build breakage.
sh: enable deferred io LCDC on Migo-R
video: sh_mobile_lcdcfb deferred io support
video: deferred io with physically contiguous memory
video: deferred io cleanup
video: fix deferred io fsync()
sh: add LCDC interrupt configuration to AP325 and Migo-R
sh_mobile_lcdc: use FB_SYS helpers instead of FB_CFB
sh: split coherent pages
sh: dma: Kill off ISA DMA wrapper.
sh: Conditionalize the code dumper on CONFIG_DUMP_CODE.
sh: Kill off the unused SH_ALPHANUMERIC debug option.
sh: Enable skipping of bss on debug platforms for sh32 also.
doc: Update sh cpufreq documentation.
sh: mrshpc_setup_windows() needs to be inline.
serial: sh-sci: sci_poll_get_char() is only used by CONFIG_CONSOLE_POLL.
...
Remove some more cruft from machw.h and drop the #include where it isn't
needed.
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
See commit 1045b03e07 ("netlink: fix
overrun in attribute iteration") for a detailed explanation of why
this patch is necessary.
In short, nlmsg_next() can make "remaining" go negative, and the
remaining >= sizeof(...) comparison will promote "remaining" to an
unsigned type, which means that the expression will evaluate to
true for negative numbers, even though it was not intended.
I put "theoretical" in the title because I have no evidence that
this can actually happen, but I suspect that a crafted netlink
packet can trigger some badness.
Note that the last test, which seemingly has the exact same
problem (also true for nla_ok()), is perfectly OK, since we
already know that remaining is positive.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement socket option SCTP_GET_ASSOC_NUMBER of the latest ietf socket
extensions API draft.
8.2.5. Get the Current Number of Associations (SCTP_GET_ASSOC_NUMBER)
This option gets the current number of associations that are attached
to a one-to-many style socket. The option value is an uint32_t.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For CONFIG_SPARSEMEM_VMEMMAP=y on s390 I get warnings like
init/main.c: In function 'start_kernel':
init/main.c:641: warning: format '%08lx' expects type 'long unsigned int', but argument 2 has type 'int'
The warning can be suppressed with a cast to unsigned long in the
CONFIG_SPARSEMEM_VMEMMAP=y version of __page_to_pfn.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide a locking free version of iucv_message_receive and iucv_message_send
that do not call local_bh_enable in a spin_lock_(bh|irqsave)() context.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Impact: extend the wakeup tracepoint with the info whether the wakeup was real
Add the information needed to distinguish 'real' wakeups from 'false'
wakeups.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The tables used by the various AES algorithms are currently
computed at run-time. This has created an init ordering problem
because some AES algorithms may be registered before the tables
have been initialised.
This patch gets around this whole thing by precomputing the tables.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The bnx2x driver actually uses the crc32c_le name so this patch
restores the crc32c_le symbol through a macro.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch swaps the role of libcrc32c and crc32c. Previously
the implementation was in libcrc32c and crc32c was a wrapper.
Now the code is in crc32c and libcrc32c just calls the crypto
layer.
The reason for the change is to tap into the algorithm selection
capability of the crypto API so that optimised implementations
such as the one utilising Intel's CRC32C instruction can be
used where available.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch allows shash algorithms to be used through the old hash
interface. This is a transitional measure so we can convert the
underlying algorithms to shash before converting the users across.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It is often useful to save the partial state of a hash function
so that it can be used as a base for two or more computations.
The most prominent example is HMAC where all hashes start from
a base determined by the key. Having an import/export interface
means that we only have to compute that base once rather than
for each message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch allows shash algorithms to be used through the ahash
interface. This is required before we can convert digest algorithms
over to shash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The shash interface replaces the current synchronous hash interface.
It improves over hash in two ways. Firstly shash is reentrant,
meaning that the same tfm may be used by two threads simultaneously
as all hashing state is stored in a local descriptor.
The other enhancement is that shash no longer takes scatter list
entries. This is because shash is specifically designed for
synchronous algorithms and as such scatter lists are unnecessary.
All existing hash users will be converted to shash once the
algorithms have been completely converted.
There is also a new finup function that combines update with final.
This will be extended to ahash once the algorithm conversion is
done.
This is also the first time that an algorithm type has their own
registration function. Existing algorithm types will be converted
to this way in due course.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch reintroduces a completely revamped crypto_alloc_tfm.
The biggest change is that we now take two crypto_type objects
when allocating a tfm, a frontend and a backend. In fact this
simply formalises what we've been doing behind the API's back.
For example, as it stands crypto_alloc_ahash may use an
actual ahash algorithm or a crypto_hash algorithm. Putting
this in the API allows us to do this much more cleanly.
The existing types will be converted across gradually.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>