bird/doc/threads/04_memory_management.md

# BIRD Journey to Threads. Chapter 4: Memory and other resource management.

BIRD is mostly a large specialized database engine, storing mega/gigabytes of
Internet routing data in memory. To keep accounts of every byte of allocated data,
BIRD has its own resource management system which must be adapted to the
multithreaded environment. The resource system has not changed much, yet it
deserves a short chapter.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of 20th century. We're doing a significant amount of
BIRD's internal structure changes to make it run in multiple threads in parallel.

## Resources

Inside BIRD, (almost) every piece of allocated memory is a resource. To achieve this,
every such memory block includes a generic `struct resource` header. The node
is enlisted inside a linked list of a *resource pool* (see below), the class
pointer defines basic operations done on resources.

```
typedef struct resource {
  node n;				/* Inside resource pool */
  struct resclass *class;		/* Resource class */
} resource;

struct resclass {
  char *name;				/* Resource class name */
  unsigned size;			/* Standard size of single resource */
  void (*free)(resource *);		/* Freeing function */
  void (*dump)(resource *);		/* Dump to debug output */
  resource *(*lookup)(resource *, unsigned long);	/* Look up address (only for debugging) */
  struct resmem (*memsize)(resource *);	/* Return size of memory used by the resource, may be NULL */
};

void *ralloc(pool *, struct resclass *);
```

Resource cycle begins with an allocation of a resource. To do that, you should call `ralloc()`,
passing the parent pool and the appropriate resource class as arguments. BIRD
allocates a memory block of size given by the given class member `size`.
Beginning of the block is reserved for `struct resource` itself and initialized
by the given arguments. Therefore, you may sometimes see an idiom where a structure
has a first member `struct resource r;`, indicating that this item should be
allocated as a resource.

The counterpart is resource freeing. This may be implicit (by resource pool
freeing) or explicit (by `rfree()`). In both cases, the `free()` function of
the appropriate class is called to cleanup the resource before final freeing.

To account for `dump` and `memsize` calls, there are CLI commands `dump
resources` and `show memory`, using these to dump resources or show memory
usage as perceived by BIRD.

The last, `lookup`, is quite an obsolete way to identify a specific pointer
from a debug interface. You may call `rlookup(pointer)` and BIRD should dump
that resource to the debug output. This mechanism is probably incomplete as no
developer uses it actively for debugging.

Resources can be also moved between pools by `rmove` when needed.

## Resource pools

The first internal resource class is a recursive resource – a resource pool. In
the singlethreaded version, this is just a simple structure:

```
struct pool {
  resource r;
  list inside;
  struct birdloop *loop;  /* In multithreaded version only */
  const char *name;
};
```

Resource pools are used for grouping resources together. There are pools everywhere
and it is a common idiom inside BIRD to just `rfree` the appropriate pool when
e.g. a protocol or table is going down. Everything left there is cleaned up.

There are anyway several classes which must be freed with care. In the
singlethreaded version, the *slab* allocator (see below) must be empty before
it may be freed and this is kept to the multithreaded version while other
restrictions have been added.

There is also a global pool, `root_pool`, containing every single resource BIRD
knows about, either directly or via another resource pool.

### Thread safety in resource pools

In the multithreaded version, every resource pool is bound to a specific IO
loop and therefore includes an IO loop pointer. This is important for allocations
as the resource list inside the pool is thread-unsafe. All pool operations
therefore require the IO loop to be entered to do anything with them, if possible.
(In case of `rfree`, the pool data structure is not accessed at all so no
assert is possible. We're currently relying on the caller to ensure proper locking.
In future, this may change.)

Each IO loop also has its base resource pool for its allocations. All pools
inside the IO loop pool must belong to the same loop or to a loop with a
subordinate lock (see the previous chapter for lock ordering). If there is a
need for multiple IO loops to access one shared data structure, it must be
locked by another lock and allocated in such a way that is independent on these
accessor loops.

The pool structure should follow the locking order. Any pool should belong to
either the same loop as its parent or its loop lock should be after its parent
loop lock in the locking order. This is not enforced explicitly, yet it is
virtually impossible to write some working code violating this recommendation.

### Resource pools in the wilderness

Root pool contains (among others):

* route attributes and sources
* routing tables
* protocols
* interfaces
* configuration data

Each table has its IO loop and uses the loop base pool for allocations.
The same holds for protocols. Each protocol has its pool; it is either its IO
loop base pool or an ordinary pool bound to main loop.

## Memory allocators

BIRD stores data in memory blocks allocated by several allocators. There are 3
of them: simple memory blocks, linear pools and slabs.

### Simple memory block

When just a chunk of memory is needed, `mb_alloc()` or `mb_allocz()` is used
to get it. The first with `malloc()` semantics, the other is also zeroed.
There is also `mb_realloc()` available, `mb_free()` to explicitly free such a
memory and `mb_move()` to move that memory to another pool.

Simple memory blocks consume a fixed amount of overhead memory (32 bytes on
systems with 64-bit pointers) so they are suitable mostly for big chunks,
taking advantage of the default *stdlib* allocator which is used by this
allocation strategy. There are anyway some parts of BIRD (in all versions)
where this allocator is used for little blocks. This will be fixed some day.

### Linear pools

Sometimes, memory is allocated temporarily. When the data may just sit on
stack, we put it there. Anyway, many tasks need more structured execution where
stack allocation is incovenient or even impossible (e.g. when callbacks from
parsers are involved). For such a case, a *linpool* is the best choice.

This data structure allocates memory blocks of requested size with negligible
overhead in functions `lp_alloc()` (uninitialized) or `lp_allocz()` (zeroed).
There is anyway no `realloc` and no `free` call; to have a larger chunk, you
need to allocate another block. All this memory is freed at once by `lp_flush()`
when it is no longer needed.

You may see linpools in parsers (BGP, Linux netlink, config) or in filters.

In the multithreaded version, linpools have received an update, allocating
memory pages directly by `mmap()` instead of calling `malloc()`. More on memory
pages below.

### Slabs

To allocate lots of same-sized objects, a [slab allocator](https://en.wikipedia.org/wiki/Slab_allocation)
is an ideal choice. In versions until 2.0.8, our slab allocator used blocks
allocated by `malloc()`, every object included a *slab head* pointer and free objects
were linked into a single-linked list. This led to memory inefficiency and to
contra-intuitive behavior where a use-after-free bug could do lots of damage
before finally crashing.

Versions from 2.0.9, and also all the multithreaded versions, are coming with
slabs using directly allocated memory pages and usage bitmaps instead of
single-linking the free objects. This approach however relies on the fact that
pointers returned by `mmap()` are always divisible by page size. Freeing of a
slab object involves zeroing (mostly) 13 least significant bits of its pointer
to get the page pointer where the slab head resides.

This update helps with memory consumption by about 5% compared to previous
versions; exact numbers depend on the usage pattern.

## Raw memory pages

Until 2.0.8 (incl.), BIRD allocated all memory by `malloc()`. This method is
suitable for lots of use cases, yet when gigabytes of memory should be
allocated by little pieces, BIRD uses its internal allocators to keep track
about everything. This brings some ineffectivity as stdlib allocator has its
own overhead and doesn't allocate aligned memory unless asked for.

Slabs and linear pools are backed by blocks of memory of kilobyte sizes. As a
typical memory page size is 4 kB, it is a logical step to drop stdlib
allocation from these allocators and to use `mmap()` directly. This however has
some drawbacks, most notably the need of a syscall for every memory mapping and
unmapping. For allocations, this is not much a case and the syscall time is typically 
negligible compared to computation time. When freeing memory, this is much
worse as BIRD sometimes frees gigabytes of data in a blink of eye.

To minimize the needed number of syscalls, there is a per-thread page cache,
keeping pages for future use:

* When a new page is requested, first the page cache is tried.
* When a page is freed, the per-thread page cache keeps it without telling the kernel.
* When the number of pages in any per-thread page cache leaves a pre-defined range,
  a cleanup routine is scheduled to free excessive pages or request more in advance.

This method gives the multithreaded BIRD not only faster memory management than
ever before but also almost immediate shutdown times as the cleanup routine is
not scheduled on shutdown at all.

## Other resources

Some objects are not only a piece of memory; notable items are sockets, owning
the underlying mechanism of I/O, and *object locks*, owning *the right to use a
specific I/O*. This ensures that collisions on e.g. TCP port numbers and
addresses are resolved in a predictable way.

All these resources should be used with the same locking principles as the
memory blocks. There aren't many checks inside BIRD code to ensure that yet,
nevertheless violating this recommendation may lead to multiple-access issues.

*It's still a long road to the version 2.1. This series of texts should document
what is needed to be changed, why we do it and how. The
[previous chapter](TODO)
showed the locking system and how the parallel execution is done.
The next chapter will cover a bit more detailed explanation about route sources
and route attributes and how lockless data structures are employed there. Stay tuned!*