Commit Graph

88 Commits

Author SHA1 Message Date
Sage Weil
03c677e1d1 ceph: reset msgr backoff during open, not after successful handshake
Reset the backoff delay when we reopen the connection, so that the delays
for any initial connection problems are reasonable.  We were resetting only
after a successful handshake, which was of limited utility.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-20 15:14:15 -08:00
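
The backoff behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the messenger code itself: `struct conn`, `conn_open`, `conn_fault`, and both delay constants are hypothetical names and values chosen for the example.

```c
/* Hypothetical limits; the real messenger defines its own intervals. */
#define BASE_DELAY_MS 500UL
#define MAX_DELAY_MS  (5UL * 60UL * 1000UL)

struct conn {
	unsigned long delay_ms;	/* current backoff; 0 means "no backoff yet" */
};

/* (Re)opening the connection forgets any earlier backoff. */
static void conn_open(struct conn *c)
{
	c->delay_ms = 0;
}

/* On each fault, start at the base delay and double up to the cap. */
static unsigned long conn_fault(struct conn *c)
{
	if (c->delay_ms == 0) {
		c->delay_ms = BASE_DELAY_MS;
	} else if (c->delay_ms < MAX_DELAY_MS) {
		c->delay_ms *= 2;
		if (c->delay_ms > MAX_DELAY_MS)
			c->delay_ms = MAX_DELAY_MS;
	}
	return c->delay_ms;
}
```

Because the reset happens in `conn_open` rather than after a completed handshake, a connection that is reopened starts again from the base delay even if it never managed to handshake before.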
Sage Weil
0dc2570fab ceph: reset requested max_size after mds reconnect
The max_size increase request to the MDS can get lost during an MDS
restart and reconnect.  Reset our requested value after the MDS recovers,
so that any blocked writes will re-request a larger max_size upon waking.

Also, explicitly wake session caps after the reconnect.  Normally the cap
renewal catches this, but not in the cases where the caps didn't go stale
in the first place, which would leave writers waiting on max_size asleep.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-20 14:24:52 -08:00
Yehuda Sadeh
dc14657c9c ceph: mount fails immediately on error
Signed-off-by: Yehuda Sadeh <yehuda@newdream.net>
2009-11-20 14:24:46 -08:00
Sage Weil
94045e115e ceph: decode updated mdsmap format
The mds map now uses the global_id as the 'key' (instead of the addr,
which was a poor choice).

This is a protocol change.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-20 14:24:33 -08:00
Sage Weil
0743304d87 ceph: fix debugfs entry, simplify fsid checks
We may first learn our fsid from any of the mon, osd, or mds maps
(whichever the monitor sends first).  Consolidate checks in a single
helper.  Initialize the client debugfs entry then, since we need the
fsid (and global_id) for the directory name.

Also remove dead mount code.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-20 14:24:27 -08:00
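
A consolidated fsid check of the kind described above might look like the sketch below. All names here (`struct fsid`, `struct client`, `check_fsid`, the `debugfs_ready` flag) are hypothetical stand-ins; the flag merely marks the point where the debugfs entry would be created.

```c
#include <stdbool.h>
#include <string.h>

struct fsid {
	unsigned char b[16];
};

struct client {
	bool have_fsid;
	struct fsid fsid;
	bool debugfs_ready;	/* stands in for creating the debugfs entry */
};

/*
 * Called for every incoming map (mon, osd, or mds).  The first map to
 * arrive sets the fsid; later maps must match it.  Returns 0 on success
 * and -1 on a mismatch.
 */
static int check_fsid(struct client *cl, const struct fsid *fsid)
{
	if (cl->have_fsid)
		return memcmp(&cl->fsid, fsid, sizeof(*fsid)) ? -1 : 0;

	cl->fsid = *fsid;
	cl->have_fsid = true;
	cl->debugfs_ready = true;	/* fsid is known, so the name can be built */
	return 0;
}
```

Routing all three map handlers through one helper means the "first map wins" logic and the debugfs setup live in a single place.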
Sage Weil
cfea1cf42b ceph: small cleanup in hash function
Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-20 14:24:26 -08:00
Sage Weil
b9bfb93ce2 ceph: move mempool creation to ceph_create_client
Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 16:20:08 -08:00
Sage Weil
4e7a5dcd1b ceph: negotiate authentication protocol; implement AUTH_NONE protocol
When we open a monitor session, we send an initial AUTH message listing
the auth protocols we support, our entity name, and (possibly) a previously
assigned global_id.  The monitor chooses a protocol and responds with an
initial message.

Initially implement AUTH_NONE, a dummy protocol that provides no security,
but works within the new framework.  It generates 'authorizers' that are
used when connecting to (mds, osd) services that simply state our entity
name and global_id.

This is a wire protocol change.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 16:19:57 -08:00
Sage Weil
5f44f14260 ceph: handle errors during osd client init
Unwind initialization if we get ENOMEM during client initialization.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 15:02:36 -08:00
Sage Weil
71ececdaca ceph: remove unnecessary ceph_con_shutdown
We require that ceph_con_close be called before we drop the connection,
so this is unneeded.  Just BUG if con->sock != NULL.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 11:29:45 -08:00
Sage Weil
42ce56e50d ceph: remove bad calls to ceph_con_shutdown
We want to ceph_con_close when we're done with the connection, before
the ref count reaches 0.  Once it does, do not call ceph_con_shutdown,
as that takes the con mutex and may sleep, and besides that is
unnecessary.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-18 11:29:42 -08:00
Sage Weil
11ea8eda06 ceph: fix page invalidation deadlock
We occasionally want to make a best-effort attempt to invalidate cache
pages without fear of blocking.  If this fails, we fall back to an async
invalidate in another thread.

Use invalidate_mapping_pages instead of invalidate_inode_pages2, as that
will skip locked pages, and not deadlock.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-12 15:57:05 -08:00
Sage Weil
039934b895 ceph: build cleanly without CONFIG_DEBUG_FS
Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-12 15:56:51 -08:00
Sage Weil
fef320ff88 ceph: pr_info when mds reconnect completes
This helps the user know what's going on during the (involved) reconnect
process.  They already see when the mds fails and reconnect starts.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-11 15:50:31 -08:00
Sage Weil
b377ff13b3 ceph: initialize i_size/i_rbytes on snapdir
Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-11 15:50:28 -08:00
Sage Weil
09b8a7d2af ceph: exclude snapdir from readdir results
It was hidden from sync readdir, but not the cached dcache version.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-11 15:50:25 -08:00
Sage Weil
cdac830313 ceph: remove recon_gen logic
We don't get an explicit affirmative confirmation that our caps reconnect,
nor do we necessarily want to pay that cost.  So, take all this code out
for now.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-10 16:03:53 -08:00
Sage Weil
eed0ef2caf ceph: separate banner and connect during handshake into distinct stages
We need to make sure we only swab the address during the banner once.  So
break process_banner out of process_connect, and clean up the surrounding
code so that these are distinct phases of the handshake.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-10 14:34:48 -08:00
Sage Weil
685f9a5d14 ceph: do not confuse stale and dead (unreconnected) caps
We were using the cap_gen to track both stale caps (caps that timed out
due to temporarily losing touch with the mds) and dead caps that did not
reconnect after an MDS failure.  Introduce a recon_gen counter to track
reconnections to restarted MDSs and kill dead caps based on that instead.

Rename gen to cap_gen while we're at it to make it more clear which is
which.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-09 12:06:07 -08:00
Sage Weil
fb690390e3 ceph: make CRUSH hash function a bucket property
Make the integer hash function a property of the bucket it is used on.  This
allows us to gracefully add support for new hash functions without starting
from scratch.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-07 20:18:22 -08:00
Sage Weil
1654dd0cf5 ceph: make object hash a pg_pool property
The object will be hashed to a placement seed (ps) based on the pg_pool's
hash function.  This allows new hashes to be introduced into an existing
object store, or selection of a hash appropriate to the objects that
will be stored in a particular pool.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-06 21:55:25 -08:00
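
One way to picture "hash as a pool property" is a dispatch on a per-pool hash id, as in the sketch below. The ids, `struct pool`, and `object_to_ps` are hypothetical, and the two hash functions are stand-ins (a simple multiplicative hash and FNV-1a) chosen only to keep the example short; they are not the hashes the client actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hash ids stored in the pool description. */
enum {
	POOL_HASH_SIMPLE = 1,
	POOL_HASH_FNV1A  = 2,
};

struct pool {
	uint8_t object_hash;	/* which hash maps object names to a seed */
};

/* Stand-in weak hash: h = h * 31 + c. */
static uint32_t hash_simple(const char *s, size_t len)
{
	uint32_t h = 0;
	size_t i;

	for (i = 0; i < len; i++)
		h = h * 31 + (unsigned char)s[i];
	return h;
}

/* Stand-in stronger hash: 32-bit FNV-1a. */
static uint32_t hash_fnv1a(const char *s, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)s[i];
		h *= 16777619u;
	}
	return h;
}

/* Compute the placement seed with the pool's hash; -1 if the id is unknown. */
static int object_to_ps(const struct pool *p, const char *name, size_t len,
			uint32_t *ps)
{
	switch (p->object_hash) {
	case POOL_HASH_SIMPLE:
		*ps = hash_simple(name, len);
		return 0;
	case POOL_HASH_FNV1A:
		*ps = hash_fnv1a(name, len);
		return 0;
	default:
		return -1;
	}
}
```

Since the id travels with the pool, an old pool keeps its existing object placement while a new pool can select a different hash.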
Sage Weil
cfbbcd24a6 ceph: use strong hash function for mapping objects to pgs
We were using the (weak) dcache hash function, but it was leaving lower
bits consecutive for consecutive (inode) objects.  We really want to make
the object to pg mapping random and uniform, so use a proper hash function
here.

This is Robert Jenkins' public domain hash function (with some minor
cleanup):
	http://burtleburtle.net/bob/hash/evahash.html

This is a protocol revision.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-06 16:44:05 -08:00
Sage Weil
c6cf726316 ceph: make CRUSH hash functions non-inline
These are way too big to be inline.  I missed crush/* when doing the inline
audit for akpm's review.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-06 16:39:26 -08:00
Sage Weil
1bdb70e590 ceph: clean up 'osd%d down' console msg
No ceph prefix.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-06 13:57:49 -08:00
Sage Weil
f28bcfbe66 ceph: convert port endianness
The port is informational only, but we should make it correct.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-04 16:36:12 -08:00
Sage Weil
6a18be16f7 ceph: fix sparse endian warning
Use the __le macro, even though for -1 it doesn't matter.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-04 16:36:12 -08:00
Sage Weil
51042122d4 ceph: fix endian conversions for ceph_pg
The endian conversions don't quite work with the old union ceph_pg.  Just
make it a regular struct, and make each field __le.  This is simpler and it
has the added bonus of actually working.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-04 16:36:07 -08:00
Sage Weil
63f2d21195 ceph: use fixed endian encoding for ceph_entity_addr
We exchange struct ceph_entity_addr over the wire and store it on disk.
The sockaddr_storage.ss_family field, however, is host endianness.  So,
fix ss_family endianness to big endian when sending/receiving over the
wire.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-03 15:17:56 -08:00
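
The idea of pinning the address family to big endian on the wire can be sketched as below. This is an illustration only: `encode_family` and `decode_family` are hypothetical helpers that handle just the 16-bit family field, using the standard `htons`/`ntohs` conversions.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Store a host-order address family as two big-endian bytes. */
static void encode_family(unsigned char *wire, uint16_t family)
{
	uint16_t be = htons(family);

	memcpy(wire, &be, sizeof(be));
}

/* Read two big-endian bytes back into a host-order address family. */
static uint16_t decode_family(const unsigned char *wire)
{
	uint16_t be;

	memcpy(&be, wire, sizeof(be));
	return ntohs(be);
}
```

With this convention the encoded bytes are identical regardless of the sender's byte order, so hosts of different endianness agree on the family.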
Sage Weil
859e7b1493 ceph: init/destroy bdi in client create/destroy helpers
This keeps bdi setup/teardown in line with client life cycle.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-02 09:32:47 -08:00
Sage Weil
33aa96e743 crush: always return a value from crush_bucket_choose
Even when we encounter a corrupt bucket.  We still BUG().  This fixes
the warning

fs/ceph/crush/mapper.c: In function 'crush_choose':
fs/ceph/crush/mapper.c:352: warning: control may reach end of non-void function
'crush_bucket_choose' being inlined

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-01 17:53:24 -08:00
Sage Weil
63ff78b25c ceph: fix uninitialized err variable
Fixes warning
fs/ceph/xattr.c: In function '__build_xattrs':
fs/ceph/xattr.c:353: warning: 'err' may be used uninitialized in this function

Signed-off-by: Sage Weil <sage@newdream.net>
2009-11-01 17:51:15 -08:00
Noah Watkins
ff1d1f7179 ceph: fix intra stripe unit length calculation
Commit 645a102581 fixes calculation of object
offset for layouts with multiple stripes per object. This updates the
calculation of the length written to take into account multiple stripes per
object.

Signed-off-by: Noah Watkins <noah@noahdesu.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-30 13:56:14 -07:00
Sage Weil
645a102581 ceph: fix object striping calculation for non-default striping schemes
We were incorrectly calculating the object offset.  If we have multiple
stripe units per object, we need to shift to the start of the current
su in addition to the offset within the su.

Also rename bno to ono (object number) to avoid some variable naming
confusion.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28 17:45:41 -07:00
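
As a rough worked example of the striping arithmetic discussed in this commit and the follow-up length fix, the sketch below maps a file offset to an object number, an offset within that object, and a length limited to the current stripe unit. `struct layout`, `struct extent`, and `map_extent` are hypothetical names, and the sketch assumes object_size is a multiple of stripe_unit.

```c
#include <stdint.h>

struct layout {
	uint32_t stripe_unit;	/* su: bytes per stripe unit */
	uint32_t stripe_count;	/* sc: objects striped across */
	uint32_t object_size;	/* assumed to be a multiple of stripe_unit */
};

struct extent {
	uint64_t ono;	/* object number */
	uint64_t oxoff;	/* offset within the object */
	uint64_t oxlen;	/* bytes that fit in the current stripe unit */
};

static struct extent map_extent(const struct layout *l, uint64_t off,
				uint64_t len)
{
	uint32_t su = l->stripe_unit;
	uint32_t sc = l->stripe_count;
	uint32_t su_per_object = l->object_size / su;
	uint64_t bl = off / su;			/* stripe unit index in file */
	uint64_t stripeno = bl / sc;		/* which stripe */
	uint64_t stripepos = bl % sc;		/* which object in the set */
	uint64_t objsetno = stripeno / su_per_object;
	uint64_t su_offset = off % su;		/* offset within the su */
	struct extent e;

	e.ono = objsetno * sc + stripepos;
	/* shift to the start of the current su, then add the offset in it */
	e.oxoff = su_offset + (stripeno % su_per_object) * su;
	/* never run past the end of the current stripe unit */
	e.oxlen = len < su - su_offset ? len : su - su_offset;
	return e;
}
```

With su = 4, sc = 2, and object_size = 8, file offset 9 falls in the third stripe unit, which is the second unit stored in object 0, so the object offset is 4 + 1 = 5 and at most 3 bytes remain in that unit.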
Sage Weil
5600f5ebd3 ceph: correct comment to match striping calculation
The object extent offset is the file offset _modulo_ the stripe unit.
The code was correct, the comment was wrong.

Reported-by: Noah Watkins <jayhawk@soe.ucsc.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28 17:45:37 -07:00
Noah Watkins
35e054a66e ceph: remove redundant use of le32_to_cpu
Use the stripe unit size already calculated and saved on the stack to
avoid a redundant call to le32_to_cpu.

Signed-off-by: Noah Watkins <noah@noahdesu.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28 17:44:37 -07:00
Noah Watkins
fbbccec9c6 ceph: replace list_entry with container_of
Where list_entry was used outside of list.h lists, purely for its
container_of functionality, use container_of directly instead.

Signed-off-by: Noah Watkins <noah@noahdesu.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-28 17:44:22 -07:00
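
For readers unfamiliar with the idiom, the sketch below shows what container_of does when the embedded member is not a list head. The macro here is a simplified userspace version without the kernel's type checking, and `struct node_stub`, `struct request`, and `node_to_request` are hypothetical.

```c
#include <stddef.h>

/* Simplified userspace container_of: member pointer -> enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct node_stub {		/* hypothetical embedded, non-list member */
	int color;
};

struct request {
	int tid;
	struct node_stub node;
};

/* Recover the request that embeds the given node. */
static struct request *node_to_request(struct node_stub *n)
{
	return container_of(n, struct request, node);
}
```

Using container_of directly states the intent (member to enclosing struct) without implying that the member is part of a linked list.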
Sage Weil
6b8051855d ceph: allocate and parse mount args before client instance
This simplifies much of the error handling during mount.  It also means
that we have the mount args before client creation, and we can initialize
based on those options.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-27 11:57:03 -07:00
Sage Weil
e53c2fe075 ceph: fix, clean up string mount arg parsing
Clearly demarcate int and string argument options, and do not try to convert
string arguments to ints.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-27 11:17:25 -07:00
Sage Weil
6ca874e92d ceph: silence uninitialized variable warning
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-26 22:07:59 -07:00
Sage Weil
7b813c4602 ceph: reduce parse_mount_args stack usage
Since we've increased the max mon count, we shouldn't put the addr array
on the parse_mount_args stack.  Put it on the heap instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-26 22:07:53 -07:00
Sage Weil
ecb19c4649 ceph: remove small mon addr limit; use CEPH_MAX_MON where appropriate
Get rid of separate max mon limit; use the system limit instead.  This
allows mounts when there are lots of mon addrs provided by mount.ceph (as
with a host with lots of A/AAAA records).

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-22 10:53:17 -07:00
Sage Weil
232d4b0131 ceph: move directory size logic to ceph_getattr
We can't fill i_size with rbytes at the fill_file_size stage without
adding additional checks for directories.  Notably, we want st_blocks
to remain 0 on directories so that 'du' still works.

Fill in i_blocks, i_size specially in ceph_getattr instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-21 11:24:36 -07:00
Sage Weil
bb097ffaf8 ceph: v0.17 of client
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-19 16:17:31 -07:00
Sage Weil
ee7fdfaff7 ceph: include preferred osd in placement seed
Mix the preferred osd (if any) into the placement seed that is fed into
the CRUSH object placement calculation.  This prevents all the placement
pgs from peering with the same osds.

Rev the osd client protocol with this change.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-19 11:42:41 -07:00
Sage Weil
8fa9765576 ceph: enable readahead
Initialize bdi->ra_pages to enable readahead.  Use a 512KB default.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-16 14:44:43 -07:00
Sage Weil
76e3b390d4 ceph: move dirty caps code around
Cleanup only.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-15 18:14:44 -07:00
Sage Weil
8f3bc053c6 ceph: warn on allocation from msgpool with larger front_len
Pass the front_len we need when pulling a message off a msgpool,
and WARN if it is greater than the pool's size.  Then try to
allocate a new message (to continue without failing).

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-15 18:14:43 -07:00
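
The warn-and-fall-back behaviour described above could be sketched like this. It is an assumption-laden illustration: `struct msg`, `struct msgpool`, `msg_new`, and `msgpool_get` are hypothetical, the pool is a fixed four-slot array, and the warning is a plain stderr message rather than a kernel WARN.

```c
#include <stdio.h>
#include <stdlib.h>

struct msg {
	size_t front_len;
	int from_pool;		/* 1 if preallocated by the pool */
};

struct msgpool {
	size_t front_len;	/* front size of each preallocated message */
	struct msg *slots[4];
	int avail;		/* number of messages left in slots[] */
};

/* Allocate a fresh message outside the pool (may return NULL). */
static struct msg *msg_new(size_t front_len)
{
	struct msg *m = calloc(1, sizeof(*m));

	if (m)
		m->front_len = front_len;
	return m;
}

/* Get a message with room for front_len bytes of front payload. */
static struct msg *msgpool_get(struct msgpool *pool, size_t front_len)
{
	if (front_len > pool->front_len) {
		fprintf(stderr, "msgpool_get: front_len %zu > pool size %zu\n",
			front_len, pool->front_len);
		/* pooled messages are too small; try a fresh allocation */
		return msg_new(front_len);
	}
	if (pool->avail > 0)
		return pool->slots[--pool->avail];
	return msg_new(pool->front_len);
}
```

An oversized request leaves the pool untouched, so the warning flags a mis-sized pool without consuming a reserved message that would be too small anyway.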
Sage Weil
07bd10fb98 ceph: correct subscribe_ack msgpool payload size
Define a struct for the SUBSCRIBE_ACK, and use that to size
the msgpool.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-15 18:14:42 -07:00
Sage Weil
afcdaea3f2 ceph: flush dirty caps via the cap_dirty list
Previously we were flushing dirty caps by passing an extra flag
when traversing the delayed caps list.  Besides being a bit ugly,
that can also miss caps that are dirty but didn't result in a
cap requeue: notably, mark_caps_dirty().

Move the flushing into a separate helper, and traverse the
cap_dirty list.

This also brings i_dirty_item in line with i_dirty_caps: we are
on the list IFF caps != 0.  We carry an inode ref IFF
dirty_caps|flushing_caps != 0.

Lose the unused return value from __ceph_mark_caps_dirty().

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-15 18:14:35 -07:00
Sage Weil
cdc35f9627 ceph: move generic flushing code into helper
Both callers of __mark_caps_flushing() do the same work; move it
into the helper.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-14 14:43:56 -07:00