diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt new file mode 100644 index 000000000000..f50d0c673c57 --- /dev/null +++ b/Documentation/networking/fib_trie.txt @@ -0,0 +1,145 @@ + LC-trie implementation notes. + +Node types +---------- +leaf + An end node with data. This has a copy of the relevant key, along + with 'hlist' with routing table entries sorted by prefix length. + See struct leaf and struct leaf_info. + +trie node or tnode + An internal node, holding an array of child (leaf or tnode) pointers, + indexed through a subset of the key. See Level Compression. + +A few concepts explained +------------------------ +Bits (tnode) + The number of bits in the key segment used for indexing into the + child array - the "child index". See Level Compression. + +Pos (tnode) + The position (in the key) of the key segment used for indexing into + the child array. See Path Compression. + +Path Compression / skipped bits + Any given tnode is linked to from the child array of its parent, using + a segment of the key specified by the parent's "pos" and "bits" + In certain cases, this tnode's own "pos" will not be immediately + adjacent to the parent (pos+bits), but there will be some bits + in the key skipped over because they represent a single path with no + deviations. These "skipped bits" constitute Path Compression. + Note that the search algorithm will simply skip over these bits when + searching, making it necessary to save the keys in the leaves to + verify that they actually do match the key we are searching for. + +Level Compression / child arrays + the trie is kept level balanced moving, under certain conditions, the + children of a full child (see "full_children") up one level, so that + instead of a pure binary tree, each internal node ("tnode") may + contain an arbitrarily large array of links to several children. + Conversely, a tnode with a mostly empty child array (see empty_children) + may be "halved", having some of its children moved downwards one level, + in order to avoid ever-increasing child arrays. + +empty_children + the number of positions in the child array of a given tnode that are + NULL. + +full_children + the number of children of a given tnode that aren't path compressed. + (in other words, they aren't NULL or leaves and their "pos" is equal + to this tnode's "pos"+"bits"). + + (The word "full" here is used more in the sense of "complete" than + as the opposite of "empty", which might be a tad confusing.) + +Comments +--------- + +We have tried to keep the structure of the code as close to fib_hash as +possible to allow verification and help up reviewing. + +fib_find_node() + A good start for understanding this code. This function implements a + straightforward trie lookup. + +fib_insert_node() + Inserts a new leaf node in the trie. This is bit more complicated than + fib_find_node(). Inserting a new node means we might have to run the + level compression algorithm on part of the trie. + +trie_leaf_remove() + Looks up a key, deletes it and runs the level compression algorithm. + +trie_rebalance() + The key function for the dynamic trie after any change in the trie + it is run to optimize and reorganize. Tt will walk the trie upwards + towards the root from a given tnode, doing a resize() at each step + to implement level compression. + +resize() + Analyzes a tnode and optimizes the child array size by either inflating + or shrinking it repeatedly until it fullfills the criteria for optimal + level compression. This part follows the original paper pretty closely + and there may be some room for experimentation here. + +inflate() + Doubles the size of the child array within a tnode. Used by resize(). + +halve() + Halves the size of the child array within a tnode - the inverse of + inflate(). Used by resize(); + +fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() + The route manipulation functions. Should conform pretty closely to the + corresponding functions in fib_hash. + +fn_trie_flush() + This walks the full trie (using nextleaf()) and searches for empty + leaves which have to be removed. + +fn_trie_dump() + Dumps the routing table ordered by prefix length. This is somewhat + slower than the corresponding fib_hash function, as we have to walk the + entire trie for each prefix length. In comparison, fib_hash is organized + as one "zone"/hash per prefix length. + +Locking +------- + +fib_lock is used for an RW-lock in the same way that this is done in fib_hash. +However, the functions are somewhat separated for other possible locking +scenarios. It might conceivably be possible to run trie_rebalance via RCU +to avoid read_lock in the fn_trie_lookup() function. + +Main lookup mechanism +--------------------- +fn_trie_lookup() is the main lookup function. + +The lookup is in its simplest form just like fib_find_node(). We descend the +trie, key segment by key segment, until we find a leaf. check_leaf() does +the fib_semantic_match in the leaf's sorted prefix hlist. + +If we find a match, we are done. + +If we don't find a match, we enter prefix matching mode. The prefix length, +starting out at the same as the key length, is reduced one step at a time, +and we backtrack upwards through the trie trying to find a longest matching +prefix. The goal is always to reach a leaf and get a positive result from the +fib_semantic_match mechanism. + +Inside each tnode, the search for longest matching prefix consists of searching +through the child array, chopping off (zeroing) the least significant "1" of +the child index until we find a match or the child index consists of nothing but +zeros. + +At this point we backtrack (t->stats.backtrack++) up the trie, continuing to +chop off part of the key in order to find the longest matching prefix. + +At this point we will repeatedly descend subtries to look for a match, and there +are some optimizations available that can provide us with "shortcuts" to avoid +descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. + +To alleviate any doubts about the correctness of the route selection process, +a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which +gives userland access to fib_lookup().