Merge tag '3.0-alpha0' into HEAD

This commit is contained in commit d975827f5f.

NEWS  (16 additions)
@@ -84,6 +84,22 @@ Version 2.0.9 (2022-02-09)
    filtering.


Version 3.0-alpha0 (2022-02-07)
  o Removal of fixed protocol-specific route attributes
  o Asynchronous route export
  o Explicit table import / export hooks
  o Partially lockless route attribute cache
  o Thread-safe resource management
  o Thread-safe interface notifications
  o Thread-safe protocol API
  o Adoption of BFD IO loop for general use
  o Parallel Pipe protocol
  o Parallel RPKI protocol
  o Parallel BGP protocol
  o Lots of refactoring
  o Bugfixes and improvements as they came along


Version 2.0.8 (2021-03-18)
  o Automatic channel reloads based on RPKI changes
  o Multiple static routes with the same network
@@ -1,7 +1,7 @@
 <!doctype birddoc system>

 <!--
-BIRD 2.0 documentation
+BIRD 3.0 documentation

 This documentation can have 4 forms: sgml (this is master copy), html, ASCII
 text and dvi/postscript (generated from sgml using sgmltools). You should always
@@ -20,7 +20,7 @@ configuration - something in config which is not keyword.

 <book>

-<title>BIRD 2.0 User's Guide
+<title>BIRD 3.0 User's Guide
 <author>
 Ondrej Filip <it/<feela@network.cz>/,
 Martin Mares <it/<mj@ucw.cz>/,
doc/threads/.gitignore  (vendored, new file)
@@ -0,0 +1,2 @@
*.html
*.pdf
doc/threads/00_filter_structure.png  (new binary file, 300 KiB; not shown)
doc/threads/00_the_name_of_the_game.md  (new file, 114 lines)
@@ -0,0 +1,114 @@
# BIRD Journey to Threads. Chapter 0: The Reason Why.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. Its concept of multiple routing
tables with pipes between them, as well as a procedural filtering language,
has been unique for a long time and is still one of the main reasons why people
use BIRD for big loads of routing data.

## IPv4 / IPv6 duality: Solved

The original design of BIRD also has some drawbacks. One of these was the idea
of two separate daemons – one BIRD for IPv4 and another BIRD for IPv6, built from the same
codebase, cleverly using `#ifdef IPV6` constructions to implement the
common parts of BIRD algorithms and data structures only once.
If IPv6 adoption had gone forward as people expected at that time,
it would have worked; after finishing the worldwide transition to IPv6, people could
just stop building BIRD for IPv4 and drop the `#ifdef`-ed code.

History went the other way, however. BIRD developers therefore decided to *integrate*
these two versions into one daemon capable of handling any address family, allowing
not only IPv6 but virtually anything. This rework brought quite a lot of
backward-incompatible changes, therefore we decided to release it as version 2.0.
This work was mostly finished in 2018 and as of March 2021, we have already
switched the 1.6.x branch to bugfix-only mode.

## BIRD is single-threaded now

The second drawback is the single-threaded design. Looking back to 1998, this was
a good idea. A common PC had one single core and BIRD was targeting exactly
this segment. As the years went by, manufacturers launched multicore x86 chips
(AMD Opteron in 2004, Intel Pentium D in 2005). This ultimately led to a world
where, as of March 2021, there is virtually no new PC sold with a single-core CPU.

Together with these changes, the speed of a single core has not been growing as fast
as the Internet is growing. BIRD is still capable of handling the full BGP table
(868k IPv4 routes in March 2021) with one core, yet when BIRD starts, it may take
long minutes to converge.

## Intermezzo: Filters

In 2018, we took some data we had from large internet exchanges and simulated
a cold start of BIRD as a route server. We used `linux-perf` to find the most time-critical
parts of BIRD and it pointed very clearly to the filtering code. It also showed that the
IPv4 version of BIRD v1.6.x is substantially faster than the *integrated* version, while
the IPv6 version was about as fast as the *integrated* one.

Here we should show a little bit more about how the filters really work. Let's use
an example of a simple filter:

```
filter foo {
  if net ~ [10.0.0.0/8+] then reject;
  preference = 2 * preference - 41;
  accept;
}
```

This filter gets translated to an infix internal structure.

![Example of filter internal representation](00_filter_structure.png)

When executing, the filter interpreter just walks the filter internal structure recursively in the
right order, executes the instructions, collects their results and finishes by
either rejecting or accepting the route.

## Filter rework

Further analysis of the filter code revealed an absurd-looking result. The
most executed parts of the interpreter function were the `push` CPU
instructions at its very beginning and the `pop` CPU instructions at its very
end. This came from the fact that the interpreter function was quite long, yet
most of the filter instructions used an extremely short path: doing all the
stack manipulation at the beginning, branching by the filter instruction type,
executing just several CPU instructions, popping everything from the
stack back and returning.

After some thought on how to minimize the stack manipulation when all you need
is to take two numbers and multiply them, we decided to preprocess the filter
internal structure into another structure which is much easier to execute. The
interpreter now uses a data stack and behaves generally like a
postfix-ordered language. We also considered Lua, which turned out to require quite
a lot of glue code while achieving about the same performance.

After these changes, we managed to reduce the filter execution time by 10–40%,
depending on how complex the filter is.
Still, even this reduction is far too little when there is one CPU core
running for several minutes while the others are sleeping.
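
To illustrate the idea of a postfix, data-stack based interpreter, here is a minimal
self-contained sketch. It is not BIRD's actual filter machinery (the real instruction
set and value types live in `filter/` and are much richer); the instruction names and
the `run()` helper are made up for this example.

```
#include <stdio.h>

/* A made-up, minimal postfix instruction set. */
enum op { OP_CONST, OP_MUL, OP_SUB };

struct insn {
  enum op op;
  long arg;     /* only used by OP_CONST */
};

/* Execute a postfix program on a small data stack. */
static long run(const struct insn *prog, int len)
{
  long stack[64];
  int sp = 0;

  for (int i = 0; i < len; i++)
    switch (prog[i].op)
    {
      case OP_CONST: stack[sp++] = prog[i].arg; break;
      case OP_MUL: sp--; stack[sp-1] *= stack[sp]; break;
      case OP_SUB: sp--; stack[sp-1] -= stack[sp]; break;
    }

  return stack[0];
}

int main(void)
{
  /* preference = 2 * preference - 41, with the old preference being 100 */
  const struct insn prog[] = {
    { OP_CONST, 2 }, { OP_CONST, 100 }, { OP_MUL, 0 },
    { OP_CONST, 41 }, { OP_SUB, 0 },
  };

  printf("%ld\n", run(prog, sizeof(prog) / sizeof(prog[0])));   /* prints 159 */
  return 0;
}
```

The point is that each instruction touches only the data stack; there is no deep recursion
and no per-instruction saving and restoring of CPU registers on the call stack.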

## We need more threads

As a side effect of the rework, the new filter interpreter is also completely
thread-safe. It seemed to be the way to go – running the filters in parallel
while keeping everything else single-threaded. The main problem of this
solution is the too fine granularity of the parallel jobs. We would spend lots of
time on synchronization overhead.

Parallelizing only the filter execution was also too one-sided, useful only for
configurations with complex filters. In other cases, the major problem is best
route recalculation, OSPF recalculation or kernel synchronization.
It also turned out to be quite dirty from the code cleanliness point of view.

Therefore we chose to make BIRD completely multithreaded. We designed a way
to gradually enable parallel computation and the best usage of all available CPU
cores. We have three goals:

* We want to keep current functionality. Parallel computation should never drop
  a useful feature.
* We want to do little steps. No big reworks, even though even the smallest
  possible step will need quite a lot of refactoring beforehand.
* We want to be backwards compatible as much as possible.

*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. In the next chapter, we're
going to describe the structures for routes and their attributes. Stay tuned!*

doc/threads/01_the_route_and_its_attributes.md  (new file, 159 lines)
@@ -0,0 +1,159 @@
# BIRD Journey to Threads. Chapter 1: The Route and its Attributes

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're doing a significant amount of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel. This chapter covers the necessary changes to the data structures
which store all the routing data.

*If you want to see the changes in code, look (basically) into the
`route-storage-updates` branch. Not all of them are implemented yet;
most of them, however, are pretty much finished as of the end of March 2021.*

## How routes are stored

A BIRD routing table is just a hierarchical noSQL database. On the top level, the
routes are keyed by their destination, called *net*. Due to historic reasons,
the *net* is not only an *IPv4 prefix*, *IPv6 prefix*, *IPv4 VPN prefix* etc.,
but also an *MPLS label*, *ROA information* or *BGP Flowspec record*. As there may
be several routes for each *net*, an obligatory part of the key is the *src*, aka.
*route source*. The route source is a tuple of the originating protocol
instance and a 32-bit unsigned integer. If a protocol wants to withdraw a route,
it is enough and necessary to have the *net* and *src* to identify which route
is to be withdrawn.

The route itself consists of (basically) a list of key-value records, with
value types ranging from a 16-bit unsigned integer for preference to a complex
BGP path structure. The keys are pre-defined by protocols (e.g. BGP path or
OSPF metrics), or by BIRD core itself (preference, route gateway).
Finally, the user can declare their own attribute keys using the keyword
`attribute` in the config.
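
As a mental model of this keying, here is a deliberately simplified sketch; the `sim_*`
names are invented for this illustration and do not match BIRD's real `net_addr` and
`rte_src` types.

```
/* A destination: in reality a prefix, MPLS label, ROA entry, flowspec record, ... */
struct sim_net {
  unsigned type;             /* which kind of destination this is */
  unsigned char key[40];     /* opaque key data */
};

/* A route source: the originating protocol instance plus a 32-bit identifier. */
struct sim_src {
  const void *proto;         /* originating protocol instance */
  unsigned id;               /* 32-bit unsigned integer chosen by that protocol */
};

/* One route is identified by the pair (net, src); a withdrawal needs nothing
 * more than this same pair to say which route should disappear. */
struct sim_route_key {
  struct sim_net net;
  struct sim_src src;
};
```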

## Attribute list implementation

Currently, there are three layers of route attributes. We call them *route*
(*rte*), *attributes* (*rta*) and *extended attributes* (*ea*, *eattr*).

The first layer, *rte*, contains the *net* pointer, several fixed-size route
attributes (mostly preference and protocol-specific metrics), flags, lastmod
time and a pointer to *rta*.

The second layer, *rta*, contains the *src* (a pointer to a singleton instance),
a route gateway, several other fixed-size route attributes and a pointer to the
*ea* list.

The third layer, the *ea* list, is a variable-length list of key-value attributes,
containing all the remaining route attributes.

The distribution of the route attributes between the attribute layers is somewhat
arbitrary. Mostly, the first and second layers hold attributes that
were thought to be accessed frequently (e.g. in best route selection) and
filled in in most routes, while the third layer is for infrequently used
and/or infrequently accessed route attributes.
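
The layering described above can be pictured with a stripped-down sketch; the real
definitions live in `nest/route.h` and contain many more fields, so take the structure
and field names below only as an illustration of the three layers, not as the actual API.

```
/* Layer 3: one extended attribute and a variable-length list of them. */
struct sketch_eattr {
  unsigned id;                /* numerical attribute key */
  unsigned flags;
  union { unsigned data; const void *ptr; } u;
};

struct sketch_ea_list {
  unsigned count;
  struct sketch_eattr attrs[];     /* normalized: sorted by id */
};

/* Layer 2: frequently used, mostly fixed-size attributes + pointer to layer 3. */
struct sketch_rta {
  const void *src;                 /* route source (singleton) */
  unsigned gw;                     /* route gateway (simplified to one word) */
  unsigned use_count;              /* how many rte's point here */
  struct sketch_ea_list *ea;       /* the rest of the attributes */
};

/* Layer 1: the route itself, keyed by net, pointing to cached layer 2. */
struct sketch_rte {
  const void *net;                 /* destination */
  unsigned preference;
  unsigned flags;
  long lastmod;
  struct sketch_rta *attrs;        /* shared, deduplicated attribute block */
};
```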

## Attribute list deduplication

When protocols originate routes, there are commonly many routes with the
same attribute list. BIRD could ignore this fact; however, if you have several
tables connected with pipes, it is more memory-efficient to store the same
attribute lists only once.

Therefore, the two lower layers (*rta* and *ea*) are hashed and stored in a
BIRD-global database. Routes (*rte*) contain a pointer to the *rta* in this
database, maintaining a use-count of each *rta*. Attributes (*rta*) contain
a pointer to a normalized (sorted by numerical key ID) *ea* list.
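
The deduplication is essentially the usual hash-cache idiom: hash the attribute block,
look for an identical cached entry, bump its use count, otherwise insert a copy. A
self-contained miniature of that idiom (with a plain byte blob standing in for the
*rta*/*ea* pair, and invented names) looks like this:

```
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A deliberately tiny "attribute block": a byte blob stands in for rta+ea. */
struct blob {
  unsigned use_count;
  struct blob *next_hash;
  size_t len;
  unsigned char data[32];
};

#define BUCKETS 64
static struct blob *bucket[BUCKETS];

static unsigned hash(const unsigned char *d, size_t len)
{
  unsigned h = 5381;
  for (size_t i = 0; i < len; i++)
    h = h * 33 + d[i];
  return h % BUCKETS;
}

/* Return a cached blob equal to (data, len), creating it if needed;
 * the returned blob has its use count incremented. */
static struct blob *blob_lookup(const unsigned char *data, size_t len)
{
  assert(len <= sizeof(((struct blob *)0)->data));
  unsigned h = hash(data, len);

  for (struct blob *b = bucket[h]; b; b = b->next_hash)
    if (b->len == len && !memcmp(b->data, data, len))
      return b->use_count++, b;

  struct blob *b = calloc(1, sizeof(*b));
  b->use_count = 1;
  b->len = len;
  memcpy(b->data, data, len);
  b->next_hash = bucket[h];
  bucket[h] = b;
  return b;
}

/* Dropping a reference frees the entry once nobody uses it anymore. */
static void blob_free(struct blob *b)
{
  if (--b->use_count)
    return;

  unsigned h = hash(b->data, b->len);
  for (struct blob **p = &bucket[h]; *p; p = &(*p)->next_hash)
    if (*p == b) { *p = b->next_hash; break; }

  free(b);
}
```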

## Attribute list rework

The first thing to change is the distribution of route attributes between the
attribute list layers. We decided to make the first layer (*rte*) hold only the key
and other per-record internal technical information. Therefore we move *src* to
*rte* and preference to *rta* (besides other things). *This is already done.*

We also found out that the nexthop (gateway), originally one single IP address
and an interface, has evolved into a complex attribute with several sub-attributes,
covering not only multipath routing but also MPLS stacks and other per-route
attributes. This has led to an overly complex data structure holding the nexthop set.

We finally decided to squash *rta* and *ea* into one type of data structure,
allowing for completely dynamic route attribute lists. This is also supported
by the addition of other *net* types (BGP FlowSpec or ROA) where lots of the fields make
no sense at all, yet we still want to use the same data structures and implementation
as we don't like duplicating code. *Multithreading doesn't depend on this change,
but it is going to happen soon anyway.*

## Route storage

The process of route import from a protocol into a table can be divided into several phases:

1. (In protocol code.) Create the route itself (typically from
   protocol-internal data) and choose the right channel to use.
2. (In protocol code.) Create the *rta* and *ea* and obtain an appropriate
   hashed pointer. Allocate the *rte* structure and fill it in.
3. (Optionally.) Store the route to the *import table*.
4. Run filters. If rejected, free everything.
5. Check whether this is a real change (it may be idempotent). If not, free everything and do nothing more.
6. Run the best route selection algorithm.
7. Execute exports if needed.

We found out that the *rte* structure allocation is done too early. BIRD uses
global optimized allocators for fixed-size blocks (which *rte* is) to reduce
its memory footprint, therefore the allocation of the *rte* structure would be a
synchronization point in a multithreaded environment.

The common code is also much more complicated when we have to track whether the
current *rte* has to be freed or not. This is more of a problem in export than in
import, as the export filter can also change the route (and therefore allocate
another *rte*). The changed route must therefore be freed after use. All the
route-changing code must also track whether this route is writable or
read-only.

We therefore introduce a variant of *rte* called *rte_storage*. Both of these
hold the same layer-1 route information (destination, author, cached
attribute pointer, flags etc.), yet *rte* is always local and *rte_storage*
is intended to be put in global data structures.

This change allows us to remove lots of the code which only tracks whether any
*rte* is to be freed, as *rte*'s are almost always allocated on-stack, naturally
limiting their lifetime. If not on-stack, it's the responsibility of the owner
to free the *rte* after the import is done.

This change also removes the need for *rte* allocation in protocol code, and
*rta* can also be safely allocated on-stack. As a result, protocols can simply
allocate all the data on the stack, call the update routine, and the common code in
BIRD's *nest* does all the storage for them.
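
A hypothetical protocol-side import under this scheme could then look roughly like the
sketch below. The function and type names (`proto_announce`, `nest_rte_update`,
`sketch_rta`, `sketch_rte`) are invented here to show the shape of the calling
convention, not the actual BIRD API.

```
/* Minimal stand-ins; see the layering sketch in the previous chapter. */
struct sketch_rta { const void *src; unsigned gw; };
struct sketch_rte { const void *net; struct sketch_rta *attrs; };

struct channel;                                                  /* opaque protocol-table link */
void nest_rte_update(struct channel *c, struct sketch_rte *e);   /* hypothetical nest entry point */

/* Protocol code: everything lives on the stack of this function. */
static void proto_announce(struct channel *c, const void *net, const void *src)
{
  struct sketch_rta a = { .src = src, .gw = 0 };       /* layer-2 attributes, on-stack */
  struct sketch_rte e = { .net = net, .attrs = &a };   /* layer-1 route, on-stack */

  /* The nest copies whatever it needs into an rte_storage (and a cached rta)
   * before returning, so the protocol allocates and frees nothing here. */
  nest_rte_update(c, &e);
}
```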

Allocating *rta* on-stack is, however, not required. BGP and OSPF use this to
import several routes with the same attribute list. In BGP, this is due to the
format of BGP update messages, containing first the attributes and then the
destinations (BGP NLRI's). In OSPF, in addition to *rta* deduplication, it is
also presumed that no import filter (or at most some trivial changes) is applied,
as OSPF would typically not work well when filtered.

*This change is already done.*

## Route cleanup and table maintenance

In some cases, the route update is not originated by protocol/channel code.
When a channel shuts down, all routes originated by that channel are simply
cleaned up. Also, routes with recursive nexthops may get changed without an import,
simply by a change of the underlying IGP route.

This is currently done by an `rt_event` (see `nest/rt-table.c` for the source code)
which is to be converted to a parallel thread, running when nobody is importing any
route. *This change is freshly done in branch `guernsey`.*

## Parallel protocol execution

The long-term goal of these reworks is to allow for completely independent
execution of all the protocols. Typically, there is no direct interaction
between protocols; everything is done through BIRD's *nest*. Protocols should
therefore run in parallel in the future and wait/lock only when something needs
to be done externally.

We also aim for a clean and documented protocol API.

*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. In the next chapter, we're
going to describe how the route is exported from the table to protocols and how this
process is changing. Stay tuned!*

doc/threads/02_asynchronous_export.md  (new file, 463 lines)
@@ -0,0 +1,463 @@
# BIRD Journey to Threads. Chapter 2: Asynchronous route export

Route export is a core algorithm of BIRD. This chapter covers how we are making
this procedure multithreaded. Desired outcomes are mostly lower latency of
route import, flap dampening and also faster route processing in large
configurations with lots of exports from one table.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're doing a significant amount of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel.

## How routes are propagated through BIRD

In the [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/), you could learn how the route import works. We should
now extend that process with the route export.

1. (In protocol code.) Create the route itself and propagate it through the
   right channel by calling `rte_update`.
2. The channel runs its import filter.
3. New best route is selected.
4. For each channel:
   1. The channel runs its preexport hook and export filter.
   2. (Optionally.) The channel merges the nexthops to create an ECMP route.
   3. The channel calls the protocol's `rt_notify` hook.
5. After all exports are finished, the `rte_update` call finally returns and
   the source protocol may do anything else.

Let's imagine that all the protocols are running in parallel. There are two
protocols with a route prepared to import. One of those wins the table lock,
does the import, and then the export touches the other protocol, which must
either:

* store the route export until it finishes its own imports, or
* have independent import and export parts.

Both of these options are infeasible for common use. Implementing them would
make protocols much more complicated, with lots of new code to test and release
at once, and also quite a lot of corner cases. The risk of deadlocks is also worth
mentioning.

## Asynchronous route export

We decided to make it easier for protocols and decouple the import and export
this way:

1. The import is done.
2. Best route is selected.
3. Resulting changes are stored.

Then, after the importing protocol returns, the exports are processed for each
exporting channel in parallel: some protocols
may process the export directly after it is stored, other protocols wait
until they finish another job.

This eliminates the risk of deadlocks and all protocols' `rt_notify` hooks can
rely on their independence. There is only one question: how to store the changes?

## Route export modes

To find a good data structure for route export storage, we shall first know the
readers. The exporters may request different modes of route export.

### Export everything

This is the simplest route export mode. The exporter wants to know about all
the routes as they're changing. We therefore simply store the old route until
the change is fully exported and then we free the old stored route.

To manage this, we can simply queue the changes one after another and postpone
the old route cleanup until all channels have exported the change. The queue member
would look like this:

```
struct {
  struct rte_storage *new;
  struct rte_storage *old;
};
```

### Export best

This is another simple route export mode. We check whether the best route has
changed; if not, no export happens. Otherwise, the export is propagated as the
old best route changing to the new best route.

To manage this, we could use the queue from the previous point by adding new
best and old best pointers. It is guaranteed that both the old best and new
best pointers are always valid at the time of export, as all the changes to them
must be stored in later changes which have not yet been exported by this
channel and are therefore not freed yet.

```
struct {
  struct rte_storage *new;
  struct rte_storage *new_best;
  struct rte_storage *old;
  struct rte_storage *old_best;
};
```

Anyway, we're now getting to the complicated export modes, where this simple
structure is no longer enough.

### Export merged

Here we're getting into some kind of problems. The exporting channel requests not
only the best route but also all routes that are good enough to be considered
ECMP-eligible (we call these routes *mergable*). The export is then just one
route with the nexthops merged. Export filters are executed before
merging and if the best route is rejected, nothing is exported at all.

To achieve this, we have to re-evaluate export filters any time the best route
or any mergable route changes. Until now, the export could just do what it wanted,
as there was only one thread working. To change this, we need to access the
whole route list and process it.

### Export first accepted

In this mode, the channel runs export filters on a sorted list of routes, best first.
If the best route gets rejected, it asks for the next one until it finds an
acceptable route or exhausts the list. This export mode requires a sorted table.
BIRD users may know this export mode as `secondary` in BGP.

For now, BIRD stores two bits per route for each channel. The *export bit* is set
if the route has really been exported to that channel. The *reject bit* is set
if the route was rejected by the export filter.

When processing a route change in first-accepted mode, the algorithm first checks the
export bit of the old route. If this bit is set, the old route is the one currently
exported, so we have to find the right route to export instead. Therefore the sorted route
list is walked best to worst to find a new route to export, using the reject
bit to evaluate only routes which weren't rejected in previous runs of this
algorithm.

If the old route's export bit is not set, the algorithm walks the sorted route list best
to worst, checking the position of the new route with respect to the exported route.
If the new route is worse, nothing happens; otherwise the new route is sent to
the filters and finally exported if it passes.

### Export by feed

To resolve the problems arising from the previous two export modes (merged and first accepted),
we introduce a way to process a whole route list without locking the table
while export filters are running. To achieve this, we follow this algorithm:

1. The exporting channel sees a pending export.
2. *The table is locked.*
3. All routes (pointers) for the given destination are dumped to a local array.
4. Also, the first and last pending exports for the given destination are stored.
5. *The table is unlocked.*
6. The channel processes the local array of route pointers.
7. All pending exports between the first and last stored (incl.) are marked as processed to allow for cleanup.

After unlocking the table, the pointed-to routes are implicitly guarded by the
sole fact that there is still a pending export referring to them which has not yet been
processed by all channels, and the cleanup routine frees only resources belonging to
already-processed exports.

The pending export range must be stored together with the feed. While
processing export filters for the feed, another export may come in. We
must process that export once again, as the feed is now outdated; therefore we
must mark only those exports that were pending for this destination when the
feed was being stored. We also can't mark them before actually processing them,
as they would get freed in between.
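
Put together, a sketch of the locked snapshot phase might look like this; the helper
functions and the fixed-size array are invented for illustration (the real feed handles
arbitrary route counts and uses the table's own iterators).

```
struct rtable;
struct network;
struct rte_storage;
struct rt_pending_export;

/* Hypothetical helpers standing in for the real table accessors. */
void tab_lock(struct rtable *t);
void tab_unlock(struct rtable *t);
struct rte_storage *net_first_route(struct network *n);
struct rte_storage *route_next(struct rte_storage *r);
struct rt_pending_export *net_first_pending_export(struct network *n);
struct rt_pending_export *net_last_pending_export(struct network *n);

#define MAX_FEED 256

struct feed {
  struct rte_storage *routes[MAX_FEED];     /* local snapshot of the route pointers */
  int nroutes;
  struct rt_pending_export *first, *last;   /* pending-export range noted while locked */
};

/* Steps 2-5 of the list above: snapshot everything under the table lock. */
static void feed_net(struct rtable *tab, struct network *net, struct feed *f)
{
  tab_lock(tab);

  f->nroutes = 0;
  for (struct rte_storage *r = net_first_route(net); r && f->nroutes < MAX_FEED; r = route_next(r))
    f->routes[f->nroutes++] = r;

  f->first = net_first_pending_export(net);
  f->last = net_last_pending_export(net);

  tab_unlock(tab);

  /* Steps 6-7 then run without the table lock: export filters are evaluated over
   * f->routes, and afterwards every pending export from f->first to f->last is
   * marked as processed so the cleanup routine may free it. */
}
```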

## Pending export data structure

As the two complicated export modes use the export-by-feed algorithm, the
pending export data structure may be quite minimalistic.

```
struct rt_pending_export {
  struct rt_pending_export * _Atomic next;   /* Next export for the same destination */
  struct rte_storage *new;                   /* New route */
  struct rte_storage *new_best;              /* New best route in unsorted table */
  struct rte_storage *old;                   /* Old route */
  struct rte_storage *old_best;              /* Old best route in unsorted table */
  _Atomic u64 seq;                           /* Sequential ID (table-local) of the pending export */
};
```

To allow for squashing outdated pending exports (e.g. for flap dampening
purposes), there is a `next` pointer to the next export for the same
destination. This is also needed for the export-by-feed algorithm to traverse
the list of pending exports.

We should also add several items into `struct channel`.

```
struct coroutine *export_coro;                    /* Exporter and feeder coroutine */
struct bsem *export_sem;                          /* Exporter and feeder semaphore */
struct rt_pending_export * _Atomic last_export;   /* Last export processed */
struct bmap export_seen_map;                      /* Keeps track which exports were already processed */
u64 flush_seq;                                    /* Table export seq when the channel announced flushing */
```

To run the exports in parallel, `export_coro` is run and `export_sem` is
used for signalling new exports to it. The exporter coroutine also marks all
seen sequential IDs in its `export_seen_map` to make it possible to skip over
them if seen again. The exporter coroutine is started when export is requested
and stopped when export is stopped.

There is also a table cleaner routine
(see the [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/))
which must also clean up the pending exports after all the channels are finished with them.
To signal that, there is `last_export` working as a release point: the channel
guarantees that it doesn't touch the pointed-to pending export (or any older one), nor any data
from it.

The last tricky point here is channel flushing. When any channel stops, all its
routes are automatically freed and withdrawals are exported if appropriate.
Until now, the routes could be flushed synchronously; now, however, the flush has
several phases, stored in the `flush_active` channel variable:

1. Flush started.
2. Withdrawals for all the channel's routes are issued.
   Here the channel stores the `seq` of the last current pending export to `flush_seq`.
3. When the table's cleanup routine cleans up the withdrawal with `flush_seq`,
   the channel may safely stop and free its structures as all `sender` pointers in routes are now gone.

Finally, some additional information has to be stored in tables:

```
_Atomic byte export_used;                          /* Export journal cleanup scheduled */
struct rt_pending_export * _Atomic first_export;   /* First export to announce */
byte export_scheduled;                             /* Export is scheduled */
list pending_exports;                              /* List of packed struct rt_pending_export */
struct fib export_fib;                             /* Auxiliary fib for storing pending exports */
u64 next_export_seq;                               /* The next export will have this ID */
```

The exports are:

1. Assigned the `next_export_seq` sequential ID, incrementing this item by one.
2. Put into `pending_exports` and `export_fib` for both sequential and by-destination access.
3. Signalled by setting `export_scheduled` and `first_export`.

After processing several exports, `export_used` is set and the route table maintenance
coroutine is woken up to possibly do a cleanup.

The `struct rt_pending_export` seems to be best allocated by requesting a whole
memory page, containing a common list node, a simple header and all the
structures packed in the rest of the page. This may save a significant amount of memory.
In case of congestion, there will be lots of exports and every spare kilobyte
counts. If BIRD is almost idle, the optimization has no effect on the overall performance.
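
To make the memory layout concrete, here is a sketch of what such page-packed allocation
could look like; the type names, the fixed 4096-byte page size and the `alloc_export`
helper are assumptions made for this example only (the real structure carries more header
data and proper locking).

```
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Simplified stand-in for the pending export structure described above. */
struct pending_export {
  struct pending_export *next;
  void *new_route, *new_best, *old_route, *old_best;
  uint64_t seq;
};

/* One page of pending exports: a list pointer and a small header,
 * then as many exports as fit into the rest of the page. */
struct export_page {
  struct export_page *next_page;
  unsigned used;
  struct pending_export exports[];
};

#define EXPORTS_PER_PAGE \
  ((PAGE_SIZE - sizeof(struct export_page)) / sizeof(struct pending_export))

/* Allocate a whole page at a time and hand out exports from it one by one. */
static struct pending_export *alloc_export(struct export_page **head)
{
  struct export_page *p = *head;

  if (!p || p->used >= EXPORTS_PER_PAGE)
  {
    p = calloc(1, PAGE_SIZE);     /* ideally an aligned, page-sized allocation */
    p->next_page = *head;
    *head = p;
  }

  return &p->exports[p->used++];
}
```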

## Export algorithm

As we have explained at the beginning, the current export algorithm is
synchronous and table-driven. The table walks the channel list and propagates the update.
The new export algorithm is channel-driven. The table just indicates that it
has something new in the export queue and the channel decides what to do with that and when.

### Pushing an export

When a table has something to export, it enqueues an instance of
`struct rt_pending_export` together with updating the `last` pointer (and
possibly also `first`) for this destination's pending exports.

Then it pings its maintenance coroutine (`rt_event`) to notify the exporting
channels about a new route. Before the maintenance coroutine acquires the table
lock, the importing protocol may e.g. prepare the next route in between.
The maintenance coroutine, when it wakes up, walks the list of channels and
wakes their export coroutines.

These two levels of asynchronicity are there for efficiency reasons:

1. In case of low table load, the export is announced just after the import happens.
2. In case of table congestion, the export notification has to lock the table just as all
   route importers do, so deferring it effectively reduces the number of channel list traversals.

### Processing an export

After these two pings, the channel finally knows that there is an export pending,
and runs the following loop (a sketch of it follows the list):

1. The channel waits for a semaphore. This semaphore is posted by the table
   maintenance coroutine.
2. The channel checks whether there is a `last_export` stored.
   1. If yes, it proceeds with the next one.
   2. Otherwise it takes `first_export` from the table. This special
      pointer is atomic and can be accessed without locking and also without clashing
      with the export cleanup routine.
3. The channel checks its `export_seen_map` to see whether this export has
   already been processed. If so, it goes back to 1. to get the next export. No
   action is needed with this one.
4. As the export is now clearly new, the export chain (singly-linked list) is
   scanned for the current first and last export. This is done by following the
   `next` pointer in the exports.
5. If all-routes mode is used, the exports are processed one-by-one. In future
   versions, we may employ some simple flap-dampening by checking the pending
   export list for the same route src. *No table locking happens.*
6. If best-only mode is employed, just the first and last exports are
   considered to find the old and new best routes. The in-between exports are ignored. *No table locking happens.*
7. If export-by-feed is used, the current state of the routes in the table is fetched and processed
   as described above in the "Export by feed" section.
8. All processed exports are marked as seen.
9. The channel stores the first processed export to `last_export` and returns
   to the beginning to wait for the next exports. The later exports are then skipped by
   step 3 when the export coroutine gets to them.
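
The control flow of that loop, stripped of all details, could be sketched like this.
All helper functions and types here (`sem_wait_export`, `next_export`, `already_seen`
and so on) are hypothetical stand-ins, not the actual BIRD symbols.

```
struct channel;
struct rt_pending_export;

/* Hypothetical helpers standing in for the real machinery. */
void sem_wait_export(struct channel *c);                                     /* step 1 */
struct rt_pending_export *next_export(struct channel *c);                    /* step 2 */
int already_seen(struct channel *c, struct rt_pending_export *e);            /* step 3 */
void process_exports_from(struct channel *c, struct rt_pending_export *e);   /* steps 4-8 */
void store_last_export(struct channel *c, struct rt_pending_export *e);      /* step 9 */
int export_stopped(struct channel *c);

static void export_coroutine(struct channel *c)
{
  while (!export_stopped(c))
  {
    sem_wait_export(c);                      /* 1: wait for a ping */

    struct rt_pending_export *e = next_export(c);   /* 2: last_export->next, or the table's first_export */
    if (!e)
      continue;

    if (already_seen(c, e))                  /* 3: skip exports already handled, e.g. during a feed */
      continue;

    process_exports_from(c, e);              /* 4-8: per-mode processing, marking exports as seen */
    store_last_export(c, e);                 /* 9: release point used by the cleanup routine */
  }
}
```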

## The full life-cycle of routes

Until now, we have always been assuming that the channels *just exist*. In real life,
any channel may go up or down and we must handle it, flushing the routes
appropriately and freeing all the memory just in time to avoid both
use-after-free and memory leaks. BIRD is written in C, which has no garbage
collector or similar modern features, so memory management is a thing.

### Protocols and channels as viewed from a route

BIRD consists effectively of protocols and tables. **Protocols** are the active parts,
kind-of subprocesses manipulating routes and other data. **Tables** are passive,
serving as a database of routes. To connect a protocol to a table, a
**channel** is created.

Every route has its `sender` storing the channel which has put the route into
the current table. Therefore we know which routes to flush when a channel goes down.

Every route also has its `src`, a route source allocated by the protocol which
originated it first. This is kept when a route is passed through a *pipe*. The
route source is always bound to a protocol; it is possible that a protocol
announces routes via several channels using the same src.

Both `src` and `sender` must point to active protocols and channels, as inactive
protocols and channels may be deleted at any time.

### Protocol and channel lifecycle

In the beginning, all channels and protocols are down. Until they fully start,
no route from them is allowed into any table. When the protocol and channel are up,
they may originate and receive routes freely. However, the transitions are worth mentioning.

### Channel startup and feed

When protocols and channels start, they need to get the current state of the
appropriate table. Therefore, after a protocol and channel start, the
export-feed coroutine is also initiated.

Tables can contain millions of routes. It may lead to long import latency if a channel
fed itself in one step. The table structure is (at least for now) too
complicated to be implemented as lockless, thus even read access needs locking.
To mitigate this, the feeds are split to allow for regular route propagation
with a reasonable latency.

When the exports were synchronous, we simply didn't care and just announced the
exports to the channels from the time they started feeding. When making exports
asynchronous, it is crucial to avoid (hopefully) all the possible race conditions
which could arise from a simultaneous feed and export. As the feeder routines had
to be rewritten, it is a good opportunity to make this precise.

Therefore, when a channel goes up, it also starts exports:

1. Start the feed-export coroutine.
2. *Lock the table.*
3. Store the last export in the queue.
4. Read a limited number of routes to local memory together with their pending exports.
5. If there are some routes to process:
   1. *Unlock the table.*
   2. Process the loaded routes.
   3. Set the appropriate pending exports as seen.
   4. *Lock the table.*
   5. Go to 4. to continue feeding.
6. If there was a last export stored, load the next one to be processed. Otherwise take the table's `first_export`.
7. *Unlock the table.*
8. Run the exporter loop.

*Note: There are some nuances not mentioned here about how to do things in the right
order to avoid missing some events while changing state. For specifics, look
into the code in `nest/rt-table.c` in branch `alderney`.*

When the feeder loop finishes, it continues smoothly to process all the exports
that have been queued while the feed was running. Step 5.3 ensures that already
seen exports are skipped, steps 3 and 6 ensure that no export is missed.

### Channel flush

Protocols and channels need to stop for a handful of reasons. All of these
cases follow the same routine.

1. (Maybe.) The protocol requests to go down or restart.
2. The channel requests to go down or restart.
3. The channel requests to stop export.
4. In the feed-export coroutine:
   1. At a designated cancellation point, check cancellation.
   2. Clean up local data.
   3. *Lock main BIRD context.*
   4. If shutdown is requested, switch the channel to the *flushing* state and request table maintenance.
   5. *Stop the coroutine and unlock main BIRD context.*
5. In the table maintenance coroutine:
   1. Walk across all channels and check them for the *flushing* state, setting `flush_active` to 1.
   2. Walk across the table (split to allow for low latency updates) and
      generate a withdrawal for each route sent by the flushing channels.
   3. When the whole table is traversed, the flushing channels' `flush_active` is set to 2 and
      `flush_seq` is set to the current last export seq.
   4. Wait until all the withdrawals are processed by checking the `flush_seq`.
   5. Mark the flushing channels as *down* and eventually proceed to the protocol shutdown or restart.

There is also a separate routine that handles bulk cleanup of `src`'s which
contain a pointer to the originating protocol. This routine may get reworked in
the future; for now it is good enough.

### Route export cleanup

Last but not least is the export cleanup routine. Until now, the withdrawn
routes were exported synchronously and freed directly after the import was
done. This is not possible anymore. The export is stored and the import returns
to let the importing protocol continue its work. We therefore need a routine to
clean up the withdrawn routes and also the processed exports.

First of all, this routine refuses to clean up when any export is feeding or
shutting down. In the future, cleanup while feeding should be possible; for
now, however, we aren't sure about possible race conditions.

Anyway, when all the exports are in a steady state, the routine works as follows
(a sketch of it follows the list):

1. Walk the active exports and find the minimum (oldest export) among their `last_export` values.
2. If there is nothing to clear between the table's actual oldest export and the channels' oldest export, do nothing.
3. Find the table's new `first_export` and set it. Now there is nobody pointing to the old exports.
4. Free the withdrawn routes.
5. Free the old exports, removing them also from the first-last list of exports for the same destination.
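
In pseudo-C, the core of that routine might look like the following; `seq_of`,
`channel_last_export` and the other helpers are invented names used only to show the flow.

```
struct rtable;
struct channel;
struct rt_pending_export;

/* Hypothetical helpers. */
unsigned long long seq_of(struct rt_pending_export *e);
struct rt_pending_export *channel_last_export(struct channel *c);
struct channel *first_active_channel(struct rtable *t);
struct channel *next_active_channel(struct channel *c);
struct rt_pending_export *table_first_export(struct rtable *t);
struct rt_pending_export *export_next_seq(struct rt_pending_export *e);
void free_export(struct rtable *t, struct rt_pending_export *e);

static void export_cleanup(struct rtable *t)
{
  /* 1: find the oldest export still needed by any channel. */
  unsigned long long min_seq = ~0ULL;
  for (struct channel *c = first_active_channel(t); c; c = next_active_channel(c))
  {
    struct rt_pending_export *le = channel_last_export(c);
    if (!le)
      return;               /* some channel has processed nothing yet: keep everything */
    if (seq_of(le) < min_seq)
      min_seq = seq_of(le);
  }

  /* 2-5: everything up to that point has been processed by everybody,
   * so move first_export forward and free the old exports (and with them
   * the withdrawn routes they reference). */
  struct rt_pending_export *e = table_first_export(t);
  while (e && seq_of(e) <= min_seq)
  {
    struct rt_pending_export *next = export_next_seq(e);
    free_export(t, e);
    e = next;
  }
}
```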

## Results of these changes

This is the first major step to move forward. Using just this version may
still be as slow as the single-threaded version, at least if your export filters are trivial.
Anyway, the main purpose of this step is not an immediate speedup. It is more
of a base for the next steps:

* Unlocking of pipes should enable parallel execution of all the filters on
  pipes, limited solely by the principle *one thread for every direction of
  pipe*.
* Conversion of CLI's `show route` to the new feed-export coroutines should
  enable faster table queries. Moreover, this approach will allow for
  better splitting of model and view in the CLI, with a good opportunity to
  implement more output formats, e.g. JSON.
* Unlocking of kernel route synchronization should fix latency issues induced
  by long-lasting kernel queries.
* Partial unlocking of BGP packet processing should allow for parallel
  execution in almost all phases of BGP route propagation.
* Partial unlocking of OSPF route recalculation should raise the useful
  maximums of topology size.

The development is now being done mostly in the branch `alderney`. If you ask
why such strange branch names as `jersey`, `guernsey` and `alderney`, here is
a kind-of reason. Yes, these branches could be named `mq-async-export`,
`mq-async-export-new`, `mq-async-export-new-new`, `mq-another-async-export` and
so on. That's so ugly, isn't it? Let's be creative. *Jersey* is an island where a
same-named knit was first produced – and knits are made of *threads*. Then, you
just look at a map and find nearby islands.

Also, why so many branches? The development process is quite messy. BIRD's code
heavily depends on the single-threaded approach. This is (in this case)
exceptionally good for performance, as long as you have one thread only. On the
other hand, lots of these assumptions are not documented, so in many cases one
desired change yields a chain of other unforeseen changes which must precede it.
This brings lots of backtracking, branch rebasing and other Git magic. There is
always a can of worms somewhere in the code.

*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. The
[previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/)
showed the necessary changes in route storage. In the next chapter, we're going
to describe how the coroutines are implemented and what kind of locking system
we are employing to prevent deadlocks. Stay tuned!*

doc/threads/03_coroutines.md  (new file, 235 lines)
@@ -0,0 +1,235 @@
# BIRD Journey to Threads. Chapter 3: Parallel execution and message passing.

Parallel execution in BIRD uses an underlying mechanism of dedicated IO loops
and hierarchical locks. The original event scheduling module has been converted
to do message passing in a multithreaded environment. These mechanisms are
crucial for understanding what happens inside BIRD and how its internal API changes.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're doing a significant amount of
changes to BIRD's internal structure to make it run in multiple threads in parallel.

## Locking and deadlock prevention

Most of BIRD's data structures and algorithms are thread-unsafe and not even
reentrant. Checking and possibly updating all of these would take an
unreasonable amount of time, thus the multithreaded version uses standard mutexes
to lock all the parts which have not been checked and updated yet.

The authors of the original BIRD concepts wisely chose a highly modular structure
which allows us to create a hierarchy for locks. The main chokepoint was between
protocols and tables and it has been removed by implementing asynchronous exports
as described in the [previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/).

Locks in BIRD (called domains, as they always lock some defined part of BIRD)
are partially ordered. Every *domain* has its *type* and all threads are
strictly required to lock the domains in the order of their respective types.
The full order is defined in `lib/locking.h`. It's forbidden to lock more than
one domain of a type (these domains are incomparable) and recursive locking is
forbidden as well.

The locking hierarchy is (roughly; as of February 2022) like this; a small sketch
of how such ordered locking can be checked follows the list:

1. The BIRD Lock (for everything not yet checked and/or updated)
2. Protocols (as of February 2022, it is BFD, RPKI, Pipe and BGP)
3. Routing tables
4. Global route attribute cache
5. Message passing
6. Internals and memory management
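
The sketch below shows the general idea of enforcing such a partial order at runtime;
the enum values, the per-thread bitmask and the `domain_lock` wrapper are invented for
this illustration and are not BIRD's actual `lib/locking.h` interface.

```
#include <assert.h>
#include <pthread.h>

/* Lock levels, ordered top to bottom; a smaller number must be taken earlier. */
enum lock_type {
  LT_THE_BIRD = 0,
  LT_PROTO,
  LT_RTABLE,
  LT_ATTR_CACHE,
  LT_EVENT,
  LT_RESOURCE,
  LT_MAX,
};

struct domain {
  enum lock_type type;
  pthread_mutex_t mutex;
};

/* Which lock types this thread currently holds. */
static _Thread_local unsigned locks_held;

static void domain_lock(struct domain *d)
{
  /* Allowed only if every lock already held has a strictly smaller type:
   * no same-type locking (those domains are incomparable), no out-of-order
   * locking, and hence no recursive locking either. */
  assert(!(locks_held >> d->type));

  pthread_mutex_lock(&d->mutex);
  locks_held |= 1u << d->type;
}

static void domain_unlock(struct domain *d)
{
  locks_held &= ~(1u << d->type);
  pthread_mutex_unlock(&d->mutex);
}
```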

There are heavy checks to ensure proper locking and to help debug any
problem when any code violates the hierarchy rules. This impedes performance,
depending on how much that domain is contended, and in some cases I have already
implemented lockless (or partially lockless) data structures to overcome this.

You may ask why these heavy checks are employed even in production builds.
The risks arising from dropping some locking checks include:

* deadlocks; these are deadly in BIRD anyway, so it should just fail with a meaningful message, or
* data corruption; it either kills BIRD anyway, or it results in a slow and vicious death,
  leaving undebuggable corefiles behind.

To be honest, I believe in principles like *"every nontrivial software has at least one bug"*
and I also don't trust my future self or anybody else to always write bugless code when
it comes to proper locking. I also believe that if a lock becomes a bottleneck,
then we should think about what is locked inside and how to optimize that,
possibly implementing a lockless or waitless data structure instead of dropping
thorough consistency checks, especially in a multithreaded environment.

### Choosing the right locking order

When considering the locking order of protocols and route tables, the answer
was quite easy. We had to make either import or export asynchronous (or both).
The major reasons for asynchronous export have been stated in the previous chapter,
therefore it makes little sense to allow entering protocol context from table code.

As I write further in this text, even accessing table context from protocol
code leads to contention on table locks, yet for now, it is good enough and the
lock order features routing tables after protocols to make the multithreading
goal easier to achieve.

The major lock level is still The BIRD Lock, containing not only the
not-yet-converted protocols (like Babel, OSPF or RIP) but also the processing of CLI
commands and reconfiguration. This involves an awful lot of direct access into
other contexts which would be unnecessarily complicated to implement by message
passing. Therefore, this lock is simply *"the director"*, sitting on the top
with its own category.

The lower lock levels under routing tables are mostly for shared global data
structures accessed from everywhere. We'll address some of these later.

## IO Loop

There has been a protocol, BFD, running in its own thread since 2013. This
separation has a good reason; it needs low latency and the main BIRD loop just
walks round-robin around all the available sockets and one round-trip may take
a long time (even more than a minute with large configurations). BFD had its
own IO loop implementation and simple message passing routines. This code could
be easily updated for general use, so I did it.

To understand the internal principles, we should say that in the `master`
branch, there is one big loop centered around a `poll()` call, dispatching and
executing everything as needed. In the `sark` branch, there are multiple loops
of this kind. BIRD has several means of getting something dispatched from a
loop.

1. Requesting to read from a **socket** makes the main loop call your hook when there is some data received.
   The same happens when a socket refuses to write data. Then the data is buffered and you are called when
   the buffer is free to continue writing. There is also a third callback, an error hook, for obvious reasons.

2. Requesting to be called back after a given amount of time. This is called a **timer**.
   As is common with all timers, they aren't precise and the callback may be
   delayed significantly. This was also the reason to have the BFD loop separate
   since the very beginning, yet now the abundance of threads may lead to
   problems with BFD latency in large-scale configurations. We haven't tested
   this yet.

3. Requesting to be called back from a clean context when possible. This is
   useful to run anything non-reentrant which might mess with the caller's
   data, e.g. when a protocol decides to shut down due to some inconsistency
   in received data. This is called an **event**.

4. Requesting to do some work when possible. These are also events; the only
   difference is where this event is enqueued: in the main loop, there is a
   special *work queue* with an execution limit, allowing sockets and timers to be
   handled with a reasonable latency while still doing all the work needed.
   Other loops don't have designated work queues (we may add them later).

All of these – sockets, timers and events – are tightly bound to some domain.
Sockets typically belong to a protocol, timers and events to a protocol or table.
With the modular structure of BIRD, the easy and convenient approach to multithreading
is to get more IO loops, each bound to a specific domain, running their events, timers and
socket hooks in their own threads.

## Message passing and loop entering

To request some work in another module, the standard way is to pass a message.
For this purpose, events have been modified to be sent to a given loop without
locking that loop's domain. In fact, every event queue has its own lock with a
low priority, allowing messages to be passed from almost any part of BIRD, and also
an assigned loop which executes the events enqueued. When a message is passed
to a queue executed by another loop, that target loop must be woken up, so we
must know which loop to wake up to avoid unnecessary delays. Then the target
loop opens its mailbox and processes the task in its context.
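
The underlying mechanism is the classic "locked queue plus wakeup pipe" pattern. A minimal,
self-contained sketch of it (not BIRD's actual event API; all names here are invented)
could look like this:

```
#include <pthread.h>
#include <unistd.h>

struct msg {
  struct msg *next;
  void (*hook)(void *data);
  void *data;
};

struct mailbox {
  pthread_mutex_t lock;      /* low-priority lock guarding only the queue itself */
  struct msg *head, *tail;
  int wakeup_fd;             /* write end of the target loop's wakeup pipe */
};

/* Called from any thread: enqueue the message and wake the target loop up. */
void mailbox_send(struct mailbox *mb, struct msg *m)
{
  m->next = NULL;

  pthread_mutex_lock(&mb->lock);
  if (mb->tail)
    mb->tail->next = m;
  else
    mb->head = m;
  mb->tail = m;
  pthread_mutex_unlock(&mb->lock);

  write(mb->wakeup_fd, "", 1);   /* one byte into the pipe wakes the target poll() */
}

/* Called by the owning loop after poll() reports the wakeup pipe readable. */
void mailbox_run(struct mailbox *mb)
{
  pthread_mutex_lock(&mb->lock);
  struct msg *m = mb->head;
  mb->head = mb->tail = NULL;
  pthread_mutex_unlock(&mb->lock);

  while (m)
  {
    struct msg *next = m->next;
    m->hook(m->data);            /* executed in the target loop's context */
    m = next;
  }
}
```

This matches the description above: the sender only takes the queue's own small lock, never
the target loop's domain, and the wakeup byte is exactly what gets reduced to a minimum for
the route export case described below.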
|
||||||
|
|
||||||
|
The other way is a direct access of another domain. This approach blocks the
|
||||||
|
appropriate loop from doing anything and we call it *entering a birdloop* to
|
||||||
|
remember that the task must be fast and *leave the birdloop* as soon as possible.
|
||||||
|
Route import is done via direct access from protocols to tables; in large
|
||||||
|
setups with fast filters, this is a major point of contention (after filters
|
||||||
|
have been parallelized) and will be addressed in future optimization efforts.
|
||||||
|
Reconfiguration and interface updates also use direct access; more on that later.
|
||||||
|
In general, this approach should be avoided unless there are good reasons to use it.
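
In code, the direct-access pattern looks roughly like the fragment below; the work done inside is only a stand-in for whatever needs the other domain:

```
/* Sketch of entering another loop's domain directly. Keep the critical
 * section short; the target loop cannot run while we are inside. */
birdloop_enter(table_loop);           /* blocks the table's loop */
do_the_fast_work(table);              /* e.g. insert one route */
birdloop_leave(table_loop);           /* let the table's loop run again */
```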
|
||||||
|
|
||||||
|
Even though direct access is bad, sending lots of messages may be even worse.
|
||||||
|
Imagine one thousand post(wo)men, coming one by one every minute, ringing your
|
||||||
|
doorbell and delivering one letter each to you. Horrible! Asynchronous message
|
||||||
|
passing works exactly this way. After queuing the message, the source sends a
|
||||||
|
byte to a pipe to wake up the target loop to process the task. We could also
|
||||||
|
periodically poll for messages instead of waking up the targets, yet it would
|
||||||
|
add quite a lot of latency which we also don't like.
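
The wakeup itself is the classic self-pipe trick; a condensed sketch, assuming every loop owns a wakeup pipe whose read end is among its polled descriptors:

```
#include <unistd.h>

struct birdloop { int wakeup_fds[2]; /* ... */ };  /* assumption for illustration */

static void loop_wakeup(struct birdloop *loop)
{
  char c = 0;
  (void) write(loop->wakeup_fds[1], &c, 1);   /* one byte is enough */
}

static void loop_drain_wakeups(struct birdloop *loop)
{
  char buf[64];
  /* After poll() returns, drain the pipe and go process the mailbox. */
  while (read(loop->wakeup_fds[0], buf, sizeof(buf)) == sizeof(buf))
    ;
}
```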
|
||||||
|
|
||||||
|
Messages in BIRD typically don't come in problematic amounts, and the
|
||||||
|
overhead is negligible compared to the overall CPU consumption. With one notable
|
||||||
|
exception: route import/export.
|
||||||
|
|
||||||
|
### Route export message passing
|
||||||
|
|
||||||
|
If we had to send a ping for every route we import to every exporting channel,
|
||||||
|
we'd spend more time pinging than doing anything else. Been there, seen
|
||||||
|
those unbelievable 80%-like figures in Perf output. Never more.
|
||||||
|
|
||||||
|
Route update is quite a complicated process. BIRD must handle large-scale
|
||||||
|
configurations with lots of importers and exporters. Therefore, a
|
||||||
|
triple-indirect delayed route announcement is employed:
|
||||||
|
|
||||||
|
1. First, when a channel imports a route by entering a loop, it sends an event
|
||||||
|
to its own loop (no ping needed in such case). This operation is idempotent,
|
||||||
|
thus for several routes in a row, only one event is enqueued. This reduces
|
||||||
|
several route import announcements (even hundreds in case of massive BGP
|
||||||
|
withdrawals) to one single event.
|
||||||
|
2. When the channel is done importing (or at least takes a coffee break and
|
||||||
|
checks its mailbox), the scheduled event in its own loop is run, sending
|
||||||
|
another event to the table's loop, saying basically *"Hey, table, I've just
|
||||||
|
imported something."*. This event is also idempotent and further reduces
|
||||||
|
route import announcements from multiple sources to one single event.
|
||||||
|
3. The table's announcement event is then executed from its loop, enqueuing export
|
||||||
|
events for all connected channels, finally initiating route exports. As we
|
||||||
|
already know, imports are done by direct access, therefore if protocols keep
|
||||||
|
importing, export announcements are slowed down.
|
||||||
|
4. The actual data on what has been updated is stored in a table journal. This
|
||||||
|
peculiar technique is used only for informing the exporting channels that
|
||||||
|
*"there is something to do"*.
|
||||||
|
|
||||||
|
This may seem overly complicated, yet it should work and it seems to work. In
|
||||||
|
case of low load, all these notifications just come through smoothly. In case
|
||||||
|
of high load, it's common that multiple updates come for the same destination.
|
||||||
|
Delaying the exports allows for the updates to settle down and export just the
|
||||||
|
final result, reducing CPU load and export traffic.
|
||||||
|
|
||||||
|
## Cork
|
||||||
|
|
||||||
|
Route propagation is involved in yet another problem which has to be addressed.
|
||||||
|
In the old versions with synchronous route propagation, all the buffering
|
||||||
|
happened after exporting routes to BGP. When a packet arrived, all the work was
|
||||||
|
done in the BGP receive hook – parsing, importing into a table, running all the
|
||||||
|
filters and possibly sending to the peers. No more routes were read until the previous
|
||||||
|
one was done. This self-regulating mechanism doesn't work any more.
|
||||||
|
|
||||||
|
Route table import now returns immediately after inserting the route into a
|
||||||
|
table, creating a buffer there. These buffers have to be processed by other protocols'
|
||||||
|
export events. In large-scale configurations, one route import has to be
|
||||||
|
processed by hundreds, even thousands of exports. Unlimited imports are a major
|
||||||
|
cause of buffer bloating. This is even worse in configurations with pipes,
|
||||||
|
as these multiply the exports by propagating them all the way down to other
|
||||||
|
tables, eventually eating about twice the amount of memory compared to the single-threaded version.
|
||||||
|
|
||||||
|
There is therefore a cork to make this stop. Every table keeps track of how many
|
||||||
|
exports it has pending, and when adding a new export to the queue, it may request
|
||||||
|
a cork, saying simply "please stop the flow for a while". When the export buffer
|
||||||
|
size drops low enough, the table uncorks.
|
||||||
|
|
||||||
|
On the other side, there are events and sockets with a cork assigned. When
|
||||||
|
trying to enqueue an event and the cork is applied, the event is instead put
|
||||||
|
into the cork's queue and released only when the cork is released. In case of
|
||||||
|
sockets, when read is indicated or when `poll` arguments are recalculated,
|
||||||
|
the corked socket is simply not checked for received packets, effectively
|
||||||
|
keeping them in the TCP queue and slowing down the flow until the cork is released.
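
A condensed sketch of the cork logic; the names, types and thresholds below are made up for illustration:

```
#define CORK_HIGH 4096   /* request the cork above this many pending exports */
#define CORK_LOW   512   /* release it when we drop back under this */

struct cork { int active; /* + a queue of held-back events */ };
struct table_state { unsigned pending_exports; struct cork cork; };

static void export_enqueued(struct table_state *t)
{
  if (++t->pending_exports > CORK_HIGH)
    t->cork.active = 1;        /* "please stop the flow for a while" */
}

static void export_processed(struct table_state *t)
{
  if (--t->pending_exports < CORK_LOW)
    t->cork.active = 0;        /* uncork: release held events, poll corked sockets again */
}

/* Consumers honor the cork: a corked event is parked in the cork's own
 * queue, a corked socket is simply left out of the poll() read set. */
```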
|
||||||
|
|
||||||
|
The cork implementation is quite crude, rough and fragile. It may get some
|
||||||
|
rework while stabilizing the multi-threaded version of BIRD or we may even
|
||||||
|
completely drop it for some better mechanism. One of these candidates is this
|
||||||
|
kind of API:
|
||||||
|
|
||||||
|
* (table to protocol) please do not import
|
||||||
|
* (table to protocol) you may resume imports
|
||||||
|
* (protocol to table) not processing any exports
|
||||||
|
* (protocol to table) resuming export processing
|
||||||
|
|
||||||
|
Anyway, the cork works as intended in most cases, at least for now.
|
||||||
|
|
||||||
|
*It's a long road to version 2.1. This series of texts should document what
|
||||||
|
is changing, why we do it and how. The
|
||||||
|
[previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
|
||||||
|
shows how the route export had to change to allow parallel execution. In the next chapter, some memory management
|
||||||
|
details are to be explained together with the reasons why memory management matters. Stay tuned!*
|
153
doc/threads/03b_performance.md
Normal file
153
doc/threads/03b_performance.md
Normal file
@ -0,0 +1,153 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 3½: Route server performance
|
||||||
|
|
||||||
|
All the work on multithreading shall be justified by performance improvements.
|
||||||
|
This chapter tries to compare times reached by version 3.0-alpha0 and 2.0.8,
|
||||||
|
showing some data and thinking about them.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
||||||
|
implemented at the end of the 20th century. We're making significant changes to
|
||||||
|
BIRD's internal structure to make it run in multiple threads in parallel.
|
||||||
|
|
||||||
|
## Testing setup
|
||||||
|
|
||||||
|
There are two machines in one rack. One of these simulates the peers of
|
||||||
|
a route server, the other runs BIRD in a route server configuration. First, the
|
||||||
|
peers are launched, then the route server is started and one of the peers
|
||||||
|
measures the convergence time until routes are fully propagated. Other peers
|
||||||
|
drop all incoming routes.
|
||||||
|
|
||||||
|
There are four configurations. *Single* where all BGPs are directly
|
||||||
|
connected to the main table, *Multi* where every BGP has its own table and
|
||||||
|
filters are done on pipes between them, and finally *Imex* and *Mulimex* which are
|
||||||
|
effectively *Single* and *Multi* where all BGPs have also their auxiliary
|
||||||
|
import and export tables enabled.
|
||||||
|
|
||||||
|
All of these use the same short dummy filter for route import to provide a
|
||||||
|
consistent load. This filter includes no meaningful logic; it's just some dummy
|
||||||
|
data to keep the CPU busy with no memory contention. Real filters also do not suffer from
|
||||||
|
memory contention, with the exception of ROA checks. Optimization of ROA is a
|
||||||
|
task for another day.
|
||||||
|
|
||||||
|
There is also other stuff in BIRD waiting for performance assessment. As the
|
||||||
|
(by far) most demanding setup of BIRD is a route server in an IXP, we chose to
|
||||||
|
optimize and measure BGP and filters first.
|
||||||
|
|
||||||
|
Hardware used for testing is Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 8
|
||||||
|
physical cores, two hyperthreads on each. Memory is 32 GB RAM.
|
||||||
|
|
||||||
|
## Test parameters and statistics
|
||||||
|
|
||||||
|
A BIRD setup may scale on two major axes: the number of peers and the number of routes /
|
||||||
|
destinations. *(There are more axes, e.g.: complexity of filters, routes /
|
||||||
|
destinations ratio, topology size in IGP)*
|
||||||
|
|
||||||
|
Scaling the test on route count is easy, just by adding more routes to the
|
||||||
|
testing peers. Currently, the largest test data I feed BIRD with is about 2M
|
||||||
|
routes for around 800K destinations, due to memory limitations. The routes /
|
||||||
|
destinations ratio is around 2.5 in this testing setup, trying to get close to
|
||||||
|
real-world routing servers.[^1]
|
||||||
|
|
||||||
|
[^1]: BIRD can handle much more in real life; the actual software limit is currently
|
||||||
|
a 32-bit unsigned route counter in the table structure. Hardware capabilities
|
||||||
|
are already there and checking how BIRD handles more than 4G routes is
|
||||||
|
certainly going to be a real thing soon.
|
||||||
|
|
||||||
|
Scaling the test on peer count is easy, until you get to higher numbers. When I
|
||||||
|
was setting up the test, I configured one Linux network namespace for each peer,
|
||||||
|
connecting them by virtual links to a bridge and by a GRE tunnel to the other
|
||||||
|
machine. This works well for 10 peers but setting up and removing 1000 network
|
||||||
|
namespaces takes more than 15 minutes in total. (Note to myself: try this with
|
||||||
|
a newer Linux kernel than 4.9.)
|
||||||
|
|
||||||
|
Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
|
||||||
|
With 1000 peers, version 3.0-alpha0 does more than 600 Mbps traffic in peak
|
||||||
|
which is just about the bandwidth of the whole setup. I'm planning to design a
|
||||||
|
better test setup with fewer chokepoints in the future.
|
||||||
|
|
||||||
|
## Hypothesis
|
||||||
|
|
||||||
|
There are two versions subjected to the test. One of these is `2.0.8` as an
|
||||||
|
initial testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP
|
||||||
|
is implemented there.
|
||||||
|
|
||||||
|
The major problem of large-scale BIRD setups is convergence time on startup. We
|
||||||
|
assume that a multithreaded version should reduce the overall convergence time,
|
||||||
|
at most by a factor equal to the number of cores involved. Here we have 16
|
||||||
|
hyperthreads; in theory we could reduce the times up to 16-fold, yet this is
|
||||||
|
almost impossible as a non-negligible amount of time is spent in bottleneck
|
||||||
|
code like best route selection or some cleanup routines. These parts have become
|
||||||
|
bottlenecks only because the other parts now run in parallel.
|
||||||
|
|
||||||
|
## Data
|
||||||
|
|
||||||
|
Four charts are included here, one for each setup. All axes have a
|
||||||
|
logarithmic scale. The X axis shows the total route count in the
|
||||||
|
tested BIRD; different color shades belong to different versions and peer
|
||||||
|
counts. Time is plotted on the Y axis.
|
||||||
|
|
||||||
|
Raw data is available in Git, as well as the chart generator. Strange results
|
||||||
|
caused by testbed bugs are already omitted.
|
||||||
|
|
||||||
|
There is also a line drawn at the 2-second mark. Convergence is checked by
|
||||||
|
periodically requesting `birdc show route count` on one of the peers and BGP
|
||||||
|
peers also have a 1-second connect delay time (the default is 5 seconds). All
|
||||||
|
measured times shorter than 2 seconds are highly unreliable.
|
||||||
|
|
||||||
|
![Plotted data for Single](03b_stats_2d_single.png)
|
||||||
|
[Plotted data for Single in PDF](03b_stats_2d_single.pdf)
|
||||||
|
|
||||||
|
The single-table setup has times reduced to about 1/8 when comparing 3.0-alpha0 to
|
||||||
|
2.0.8. The speedup for the 10-peer setup is slightly worse than expected and there is
|
||||||
|
still some room for improvement, yet an 8-fold speedup on 8 physical cores and 16
|
||||||
|
hyperthreads is good for me now.
|
||||||
|
|
||||||
|
The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
|
||||||
|
configuration converges after almost two hours, with the speed of
|
||||||
|
route processing steadily decreasing until only several routes per second are
|
||||||
|
done. Version 3.0-alpha0 is memory-bloating for some non-obvious reason and
|
||||||
|
couldn't fit into 32G RAM. There is definitely some work ahead to stabilize
|
||||||
|
BIRD behavior with extreme setups.
|
||||||
|
|
||||||
|
![Plotted data for Multi](03b_stats_2d_multi.png)
|
||||||
|
[Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)
|
||||||
|
|
||||||
|
The multi-table setup got the same speedup as the single-table setup, no big
|
||||||
|
surprise. The largest cases were not tested at all as they don't fit well into 32G
|
||||||
|
RAM even with 2.0.8.
|
||||||
|
|
||||||
|
![Plotted data for Imex](03b_stats_2d_imex.png)
|
||||||
|
[Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)
|
||||||
|
|
||||||
|
![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
|
||||||
|
[Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)
|
||||||
|
|
||||||
|
Setups with import / export tables are also sped up by a factor
|
||||||
|
of about 6-8. Data on the largest setups (2M routes) show some strangely
|
||||||
|
inefficient behaviour. Considering that both single-table and multi-table
|
||||||
|
setups yield similar performance data, there is probably some unwanted
|
||||||
|
inefficiency in the auxiliary table code.
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
|
||||||
|
some speedup in every testcase and code stability is enough to handle typical
|
||||||
|
use cases. Some test scenarios ran out of available memory and there is
|
||||||
|
definitely a lot of work to stabilize this, yet for now it makes no sense to
|
||||||
|
postpone this alpha version any more.
|
||||||
|
|
||||||
|
We don't recommend upgrading a production machine to this version
|
||||||
|
yet; anyway, if you have a test setup, running version 3.0-alpha0 there and
|
||||||
|
reporting bugs is very welcome.
|
||||||
|
|
||||||
|
Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its number of
|
||||||
|
threads. It will spawn at least one thread per every BGP, RPKI and Pipe
|
||||||
|
protocol, one thread per every routing table (including auxiliary tables) and
|
||||||
|
possibly several more. It's up to the machine administrator to set up a limit on
|
||||||
|
CPU core usage by BIRD. When running with many threads and protocols, you may
|
||||||
|
also need to raise the file descriptor limit: BIRD uses 2 file descriptors per
|
||||||
|
thread for internal messaging.
|
||||||
|
|
||||||
|
*It's a long road to version 3. By releasing this alpha version, we'd like
|
||||||
|
to encourage every user to try this preview. If you want to know more about
|
||||||
|
what is being done and why, you may also check the full
|
||||||
|
[blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*
|
BIN
doc/threads/03b_stats_2d_imex.pdf
Normal file
BIN
doc/threads/03b_stats_2d_imex.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_imex.png
Normal file
BIN
doc/threads/03b_stats_2d_imex.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 160 KiB |
BIN
doc/threads/03b_stats_2d_mulimex.pdf
Normal file
BIN
doc/threads/03b_stats_2d_mulimex.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_mulimex.png
Normal file
BIN
doc/threads/03b_stats_2d_mulimex.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 149 KiB |
BIN
doc/threads/03b_stats_2d_multi.pdf
Normal file
BIN
doc/threads/03b_stats_2d_multi.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_multi.png
Normal file
BIN
doc/threads/03b_stats_2d_multi.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 147 KiB |
BIN
doc/threads/03b_stats_2d_single.pdf
Normal file
BIN
doc/threads/03b_stats_2d_single.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_single.png
Normal file
BIN
doc/threads/03b_stats_2d_single.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 161 KiB |
223
doc/threads/04_memory_management.md
Normal file
223
doc/threads/04_memory_management.md
Normal file
@ -0,0 +1,223 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 4: Memory and other resource management.
|
||||||
|
|
||||||
|
BIRD is mostly a large specialized database engine, storing mega/gigabytes of
|
||||||
|
Internet routing data in memory. To keep account of every byte of allocated data,
|
||||||
|
BIRD has its own resource management system which must be adapted to the
|
||||||
|
multithreaded environment. The resource system has not changed much, yet it
|
||||||
|
deserves a short chapter.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
||||||
|
implemented at the end of the 20th century. We're making significant changes to
|
||||||
|
BIRD's internal structure to make it run in multiple threads in parallel.
|
||||||
|
|
||||||
|
## Resources
|
||||||
|
|
||||||
|
Inside BIRD, (almost) every piece of allocated memory is a resource. To achieve this,
|
||||||
|
every such memory block includes a generic `struct resource` header. The node
|
||||||
|
is linked into the list of a *resource pool* (see below); the class
|
||||||
|
pointer defines basic operations done on resources.
|
||||||
|
|
||||||
|
```
|
||||||
|
typedef struct resource {
|
||||||
|
node n; /* Inside resource pool */
|
||||||
|
struct resclass *class; /* Resource class */
|
||||||
|
} resource;
|
||||||
|
|
||||||
|
struct resclass {
|
||||||
|
char *name; /* Resource class name */
|
||||||
|
unsigned size; /* Standard size of single resource */
|
||||||
|
void (*free)(resource *); /* Freeing function */
|
||||||
|
void (*dump)(resource *); /* Dump to debug output */
|
||||||
|
resource *(*lookup)(resource *, unsigned long); /* Look up address (only for debugging) */
|
||||||
|
struct resmem (*memsize)(resource *); /* Return size of memory used by the resource, may be NULL */
|
||||||
|
};
|
||||||
|
|
||||||
|
void *ralloc(pool *, struct resclass *);
|
||||||
|
```
|
||||||
|
|
||||||
|
The resource cycle begins with allocating a resource. To do that, you should call `ralloc()`,
|
||||||
|
passing the parent pool and the appropriate resource class as arguments. BIRD
|
||||||
|
allocates a memory block of the size given by the class member `size`.
|
||||||
|
The beginning of the block is reserved for `struct resource` itself and initialized
|
||||||
|
by the given arguments. Therefore, you may sometimes see an idiom where a structure
|
||||||
|
has a first member `struct resource r;`, indicating that this item should be
|
||||||
|
allocated as a resource.
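
Putting the declarations above together, defining and allocating a custom resource looks roughly like this; the `my_obj` type and its hooks are made up for illustration:

```
struct my_obj {
  resource r;                   /* must come first so ralloc() can fill it in */
  int fd;                       /* whatever the object owns */
};

static void my_obj_free(resource *r)
{
  struct my_obj *o = (struct my_obj *) r;
  (void) o;                     /* here we would release what the object owns, e.g. close(o->fd) */
}

static struct resclass my_obj_class = {
  .name = "My object",
  .size = sizeof(struct my_obj),
  .free = my_obj_free,          /* dump, lookup and memsize may stay unset */
};

struct my_obj *my_obj_new(pool *parent)
{
  struct my_obj *o = ralloc(parent, &my_obj_class);
  o->fd = -1;                   /* initialize the payload explicitly */
  return o;
}
```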
|
||||||
|
|
||||||
|
The counterpart is resource freeing. This may be implicit (by resource pool
|
||||||
|
freeing) or explicit (by `rfree()`). In both cases, the `free()` function of
|
||||||
|
the appropriate class is called to clean up the resource before final freeing.
|
||||||
|
|
||||||
|
The `dump` and `memsize` hooks are used by the CLI commands `dump
|
||||||
|
resources` and `show memory`, which dump resources or show memory
|
||||||
|
usage as perceived by BIRD.
|
||||||
|
|
||||||
|
The last, `lookup`, is quite an obsolete way to identify a specific pointer
|
||||||
|
from a debug interface. You may call `rlookup(pointer)` and BIRD should dump
|
||||||
|
that resource to the debug output. This mechanism is probably incomplete as no
|
||||||
|
developer uses it actively for debugging.
|
||||||
|
|
||||||
|
Resources can also be moved between pools by `rmove()` when needed.
|
||||||
|
|
||||||
|
## Resource pools
|
||||||
|
|
||||||
|
The first internal resource class is a recursive resource – a resource pool. In
|
||||||
|
the singlethreaded version, this is just a simple structure:
|
||||||
|
|
||||||
|
```
|
||||||
|
struct pool {
|
||||||
|
resource r;
|
||||||
|
list inside;
|
||||||
|
struct birdloop *loop; /* In multithreaded version only */
|
||||||
|
const char *name;
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
Resource pools are used for grouping resources together. There are pools everywhere
|
||||||
|
and it is a common idiom inside BIRD to just `rfree` the appropriate pool when
|
||||||
|
e.g. a protocol or table is going down. Everything left there is cleaned up.
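
A typical lifecycle then looks like the sketch below; the `rp_new()` pool constructor is assumed here (its exact signature differs between versions) and `mb_allocz()` is described in the allocators section further down:

```
struct my_state { pool *pool; /* ... */ };    /* made-up protocol state */

static struct my_state *my_proto_start(pool *parent)
{
  /* Create a subpool for everything this protocol instance allocates. */
  pool *p = rp_new(parent, "my protocol instance");

  struct my_state *s = mb_allocz(p, sizeof(*s));
  s->pool = p;
  return s;
}

static void my_proto_shutdown(struct my_state *s)
{
  rfree(s->pool);              /* drops the state and everything else in the pool */
}
```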
|
||||||
|
|
||||||
|
There are anyway several classes which must be freed with care. In the
|
||||||
|
singlethreaded version, the *slab* allocator (see below) must be empty before
|
||||||
|
it may be freed; this requirement is kept in the multithreaded version while other
|
||||||
|
restrictions have been added.
|
||||||
|
|
||||||
|
There is also a global pool, `root_pool`, containing every single resource BIRD
|
||||||
|
knows about, either directly or via another resource pool.
|
||||||
|
|
||||||
|
### Thread safety in resource pools
|
||||||
|
|
||||||
|
In the multithreaded version, every resource pool is bound to a specific IO
|
||||||
|
loop and therefore includes an IO loop pointer. This is important for allocations
|
||||||
|
as the resource list inside the pool is thread-unsafe. All pool operations
|
||||||
|
therefore require the IO loop to be entered to do anything with them, if possible.
|
||||||
|
(In case of `rfree`, the pool data structure is not accessed at all so no
|
||||||
|
assert is possible. We're currently relying on the caller to ensure proper locking.
|
||||||
|
In future, this may change.)
|
||||||
|
|
||||||
|
Each IO loop also has its base resource pool for its allocations. All pools
|
||||||
|
inside the IO loop pool must belong to the same loop or to a loop with a
|
||||||
|
subordinate lock (see the previous chapter for lock ordering). If there is a
|
||||||
|
need for multiple IO loops to access one shared data structure, it must be
|
||||||
|
locked by another lock and allocated in a way that is independent of these
|
||||||
|
accessor loops.
|
||||||
|
|
||||||
|
The pool structure should follow the locking order. Any pool should belong to
|
||||||
|
either the same loop as its parent or its loop lock should be after its parent
|
||||||
|
loop lock in the locking order. This is not enforced explicitly, yet it is
|
||||||
|
virtually impossible to write working code that violates this recommendation.
|
||||||
|
|
||||||
|
### Resource pools in the wilderness
|
||||||
|
|
||||||
|
Root pool contains (among others):
|
||||||
|
|
||||||
|
* route attributes and sources
|
||||||
|
* routing tables
|
||||||
|
* protocols
|
||||||
|
* interfaces
|
||||||
|
* configuration data
|
||||||
|
|
||||||
|
Each table has its IO loop and uses the loop base pool for allocations.
|
||||||
|
The same holds for protocols. Each protocol has its pool; it is either its IO
|
||||||
|
loop base pool or an ordinary pool bound to the main loop.
|
||||||
|
|
||||||
|
## Memory allocators
|
||||||
|
|
||||||
|
BIRD stores data in memory blocks allocated by several allocators. There are 3
|
||||||
|
of them: simple memory blocks, linear pools and slabs.
|
||||||
|
|
||||||
|
### Simple memory block
|
||||||
|
|
||||||
|
When just a chunk of memory is needed, `mb_alloc()` or `mb_allocz()` is used
|
||||||
|
to get it. The first has `malloc()` semantics, the other also zeroes the memory.
|
||||||
|
There is also `mb_realloc()` available, `mb_free()` to explicitly free such
|
||||||
|
memory and `mb_move()` to move that memory to another pool.
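
A short usage sketch of the calls named above; the `cfg_buf` structure and the sizes are made up for illustration:

```
struct cfg_buf { unsigned len; char data[]; };

static void example(pool *p, pool *other_pool, unsigned extra)
{
  struct cfg_buf *cb = mb_allocz(p, sizeof(struct cfg_buf));   /* zeroed chunk in pool p */

  cb = mb_realloc(cb, sizeof(struct cfg_buf) + extra);         /* grow it later */

  mb_move(cb, other_pool);     /* hand it over to another pool ... */
  mb_free(cb);                 /* ... or free it explicitly before the pool goes away */
}
```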
|
||||||
|
|
||||||
|
Simple memory blocks consume a fixed amount of overhead memory (32 bytes on
|
||||||
|
systems with 64-bit pointers) so they are suitable mostly for big chunks,
|
||||||
|
taking advantage of the default *stdlib* allocator which is used by this
|
||||||
|
allocation strategy. There are anyway some parts of BIRD (in all versions)
|
||||||
|
where this allocator is used for little blocks. This will be fixed some day.
|
||||||
|
|
||||||
|
### Linear pools
|
||||||
|
|
||||||
|
Sometimes, memory is allocated temporarily. When the data may just sit on
|
||||||
|
stack, we put it there. Anyway, many tasks need more structured execution where
|
||||||
|
stack allocation is inconvenient or even impossible (e.g. when callbacks from
|
||||||
|
parsers are involved). For such a case, a *linpool* is the best choice.
|
||||||
|
|
||||||
|
This data structure allocates memory blocks of requested size with negligible
|
||||||
|
overhead in functions `lp_alloc()` (uninitialized) or `lp_allocz()` (zeroed).
|
||||||
|
There is anyway no `realloc` and no `free` call; to have a larger chunk, you
|
||||||
|
need to allocate another block. All this memory is freed at once by `lp_flush()`
|
||||||
|
when it is no longer needed.
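
A sketch of typical linpool usage; the `lp_new()` constructor arguments differ between versions and the `token` structure is made up:

```
struct token { int kind; /* ... */ };        /* made-up temporary structure */

static void parse_all(pool *p)
{
  linpool *lp = lp_new(p);                   /* temporary allocations live here */

  for (int i = 0; i < 1000; i++) {
    struct token *t = lp_allocz(lp, sizeof(struct token));
    t->kind = i;                             /* ... build temporary data ... */
  }

  lp_flush(lp);                              /* drop everything at once */
}
```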
|
||||||
|
|
||||||
|
You may see linpools in parsers (BGP, Linux netlink, config) or in filters.
|
||||||
|
|
||||||
|
In the multithreaded version, linpools have received an update, allocating
|
||||||
|
memory pages directly by `mmap()` instead of calling `malloc()`. More on memory
|
||||||
|
pages below.
|
||||||
|
|
||||||
|
### Slabs
|
||||||
|
|
||||||
|
To allocate lots of same-sized objects, a [slab allocator](https://en.wikipedia.org/wiki/Slab_allocation)
|
||||||
|
is an ideal choice. In versions until 2.0.8, our slab allocator used blocks
|
||||||
|
allocated by `malloc()`; every object included a *slab head* pointer and free objects
|
||||||
|
were linked into a singly-linked list. This led to memory inefficiency and to
|
||||||
|
counter-intuitive behavior where a use-after-free bug could do lots of damage
|
||||||
|
before finally crashing.
|
||||||
|
|
||||||
|
Versions from 2.0.9, and also all the multithreaded versions, are coming with
|
||||||
|
slabs using directly allocated memory pages and usage bitmaps instead of
|
||||||
|
single-linking the free objects. This approach however relies on the fact that
|
||||||
|
pointers returned by `mmap()` are always divisible by the page size. Freeing a
|
||||||
|
slab object involves zeroing (mostly) the 13 least significant bits of its pointer
|
||||||
|
to get the page pointer where the slab head resides.
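
The pointer arithmetic boils down to a single mask; a sketch, assuming an 8 kB (2^13 bytes) slab block allocated by `mmap()`:

```
#include <stdint.h>

#define SLAB_BLOCK_SIZE 8192u    /* assumption matching the 13 bits above */

static inline void *slab_head_of(void *obj)
{
  /* Zeroing the low bits yields the start of the mmap-aligned block
   * where the slab head resides. */
  return (void *)((uintptr_t) obj & ~(uintptr_t)(SLAB_BLOCK_SIZE - 1));
}
```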
|
||||||
|
|
||||||
|
This update reduces memory consumption by about 5% compared to previous
|
||||||
|
versions; exact numbers depend on the usage pattern.
|
||||||
|
|
||||||
|
## Raw memory pages
|
||||||
|
|
||||||
|
Until 2.0.8 (incl.), BIRD allocated all memory by `malloc()`. This method is
|
||||||
|
suitable for lots of use cases, yet when gigabytes of memory should be
|
||||||
|
allocated in little pieces, BIRD uses its internal allocators to keep track
|
||||||
|
of everything. This brings some inefficiency as the stdlib allocator has its
|
||||||
|
own overhead and doesn't allocate aligned memory unless asked for.
|
||||||
|
|
||||||
|
Slabs and linear pools are backed by blocks of memory of kilobyte sizes. As a
|
||||||
|
typical memory page size is 4 kB, it is a logical step to drop stdlib
|
||||||
|
allocation from these allocators and to use `mmap()` directly. This however has
|
||||||
|
some drawbacks, most notably the need for a syscall for every memory mapping and
|
||||||
|
unmapping. For allocations, this is not much of a problem as the syscall time is typically
|
||||||
|
negligible compared to computation time. When freeing memory, this is much
|
||||||
|
worse as BIRD sometimes frees gigabytes of data in the blink of an eye.
|
||||||
|
|
||||||
|
To minimize the needed number of syscalls, there is a per-thread page cache,
|
||||||
|
keeping pages for future use:
|
||||||
|
|
||||||
|
* When a new page is requested, first the page cache is tried.
|
||||||
|
* When a page is freed, the per-thread page cache keeps it without telling the kernel.
|
||||||
|
* When the number of pages in any per-thread page cache leaves a pre-defined range,
|
||||||
|
a cleanup routine is scheduled to free excessive pages or request more in advance.
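
In pseudo-C, the hot paths look roughly like this; the names and the threshold are illustrative only:

```
#include <sys/mman.h>

#define PAGE_SIZE       4096u
#define PAGE_CACHE_MAX   256u    /* made-up threshold */

struct free_page { struct free_page *next; };

static __thread struct free_page *page_cache;
static __thread unsigned page_cache_cnt;

void schedule_page_cleanup(void);  /* runs later, returns excess pages to the kernel */

void *page_alloc(void)
{
  if (page_cache) {                /* fast path: reuse a cached page */
    struct free_page *fp = page_cache;
    page_cache = fp->next;
    page_cache_cnt--;
    return fp;
  }
  return mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

void page_free(void *ptr)
{
  struct free_page *fp = ptr;      /* keep the page, don't tell the kernel */
  fp->next = page_cache;
  page_cache = fp;
  if (++page_cache_cnt > PAGE_CACHE_MAX)
    schedule_page_cleanup();       /* trim the cache asynchronously */
}
```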
|
||||||
|
|
||||||
|
This method gives the multithreaded BIRD not only faster memory management than
|
||||||
|
ever before but also almost immediate shutdown times as the cleanup routine is
|
||||||
|
not scheduled on shutdown at all.
|
||||||
|
|
||||||
|
## Other resources
|
||||||
|
|
||||||
|
Some objects are not only a piece of memory; notable items are sockets, owning
|
||||||
|
the underlying mechanism of I/O, and *object locks*, owning *the right to use a
|
||||||
|
specific I/O*. This ensures that collisions on e.g. TCP port numbers and
|
||||||
|
addresses are resolved in a predictable way.
|
||||||
|
|
||||||
|
All these resources should be used with the same locking principles as the
|
||||||
|
memory blocks. There aren't many checks inside BIRD code to ensure that yet,
|
||||||
|
nevertheless violating this recommendation may lead to multiple-access issues.
|
||||||
|
|
||||||
|
*It's still a long road to version 2.1. This series of texts should document
|
||||||
|
what needs to be changed, why we do it and how. The
|
||||||
|
[previous chapter](TODO)
|
||||||
|
showed the locking system and how the parallel execution is done.
|
||||||
|
The next chapter will cover a bit more detailed explanation about route sources
|
||||||
|
and route attributes and how lockless data structures are employed there. Stay tuned!*
|
29
doc/threads/Makefile
Normal file
29
doc/threads/Makefile
Normal file
@ -0,0 +1,29 @@
|
|||||||
|
SUFFICES := .pdf -wordpress.html
|
||||||
|
CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export 03_coroutines 03b_performance
|
||||||
|
|
||||||
|
all: $(foreach ch,$(CHAPTERS),$(addprefix $(ch),$(SUFFICES)))
|
||||||
|
|
||||||
|
00_the_name_of_the_game.pdf: 00_filter_structure.png
|
||||||
|
|
||||||
|
%.pdf: %.md
|
||||||
|
pandoc -f markdown -t latex -o $@ $<
|
||||||
|
|
||||||
|
%.html: %.md
|
||||||
|
pandoc -f markdown -t html5 -o $@ $<
|
||||||
|
|
||||||
|
%-wordpress.html: %.html Makefile
|
||||||
|
sed -r 's#</p>#\n#g; s#<p>##g; s#<(/?)code>#<\1tt>#g; s#<pre><tt>#<code>#g; s#</tt></pre>#</code>#g; s#</?figure>##g; s#<figcaption>#<p style="text-align: center">#; s#</figcaption>#</p>#; ' $< > $@
|
||||||
|
|
||||||
|
stats-%.csv: stats.csv stats-filter.pl
|
||||||
|
perl stats-filter.pl $< $* > $@
|
||||||
|
|
||||||
|
STATS_VARIANTS := multi imex mulimex single
|
||||||
|
stats-all: $(patsubst %,stats-%.csv,$(STATS_VARIANTS))
|
||||||
|
|
||||||
|
stats-2d-%.pdf: stats.csv stats-filter-2d.pl
|
||||||
|
perl stats-filter-2d.pl $< $* $@
|
||||||
|
|
||||||
|
stats-2d-%.png: stats-2d-%.pdf
|
||||||
|
gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=$@ -r300 $<
|
||||||
|
|
||||||
|
stats-all-2d: $(foreach suf,pdf png,$(patsubst %,stats-2d-%.$(suf),$(STATS_VARIANTS)))
|
41
doc/threads/stats-draw.gnuplot
Normal file
41
doc/threads/stats-draw.gnuplot
Normal file
@ -0,0 +1,41 @@
|
|||||||
|
set datafile columnheaders
|
||||||
|
set datafile separator ";"
|
||||||
|
#set isosample 15
|
||||||
|
set dgrid3d 8,8
|
||||||
|
set logscale
|
||||||
|
set view 80,15,1,1
|
||||||
|
set autoscale xy
|
||||||
|
#set pm3d
|
||||||
|
|
||||||
|
set term pdfcairo size 20cm,15cm
|
||||||
|
|
||||||
|
set xlabel "TOTAL ROUTES" offset 0,-1.5
|
||||||
|
set xrange [10000:320000]
|
||||||
|
set xtics offset 0,-0.5
|
||||||
|
set xtics (10000,15000,30000,50000,100000,150000,300000)
|
||||||
|
|
||||||
|
set ylabel "PEERS"
|
||||||
|
#set yrange [10:320]
|
||||||
|
#set ytics (10,15,30,50,100,150,300)
|
||||||
|
set yrange [10:320]
|
||||||
|
set ytics (10,15,30,50,100,150,300)
|
||||||
|
|
||||||
|
set zrange [1:2000]
|
||||||
|
set xyplane at 1
|
||||||
|
|
||||||
|
set border 895
|
||||||
|
|
||||||
|
#set grid ztics lt 20
|
||||||
|
|
||||||
|
set output ARG1 . "-" . ARG4 . ".pdf"
|
||||||
|
|
||||||
|
splot \
|
||||||
|
ARG1 . ".csv" \
|
||||||
|
using "TOTAL_ROUTES":"PEERS":ARG2."/".ARG4 \
|
||||||
|
with lines \
|
||||||
|
title ARG2."/".ARG4, \
|
||||||
|
"" \
|
||||||
|
using "TOTAL_ROUTES":"PEERS":ARG3."/".ARG4 \
|
||||||
|
with lines \
|
||||||
|
title ARG3."/".ARG4
|
||||||
|
|
156
doc/threads/stats-filter-2d.pl
Normal file
156
doc/threads/stats-filter-2d.pl
Normal file
@ -0,0 +1,156 @@
|
|||||||
|
#!/usr/bin/perl
|
||||||
|
|
||||||
|
use common::sense;
|
||||||
|
use Data::Dump;
|
||||||
|
use List::Util;
|
||||||
|
|
||||||
|
my @GROUP_BY = qw/VERSION PEERS TOTAL_ROUTES/;
|
||||||
|
my @VALUES = qw/TIMEDIF/;
|
||||||
|
|
||||||
|
my ($FILE, $TYPE, $OUTPUT) = @ARGV;
|
||||||
|
|
||||||
|
### Load data ###
|
||||||
|
my %data;
|
||||||
|
open F, "<", $FILE or die $!;
|
||||||
|
my @header = split /;/, <F>;
|
||||||
|
chomp @header;
|
||||||
|
|
||||||
|
my $line = undef;
|
||||||
|
while ($line = <F>)
|
||||||
|
{
|
||||||
|
chomp $line;
|
||||||
|
$line =~ s/;;(.*);;/;;\1;/;
|
||||||
|
$line =~ s/v2\.0\.8-1[89][^;]+/bgp/;
|
||||||
|
$line =~ s/v2\.0\.8-[^;]+/sark/ and next;
|
||||||
|
$line =~ s/master;/v2.0.8;/;
|
||||||
|
my %row;
|
||||||
|
@row{@header} = split /;/, $line;
|
||||||
|
push @{$data{join ";", @row{@GROUP_BY}}}, { %row } if $row{TYPE} eq $TYPE;
|
||||||
|
}
|
||||||
|
|
||||||
|
### Do statistics ###
|
||||||
|
sub avg {
|
||||||
|
return List::Util::sum(@_) / @_;
|
||||||
|
}
|
||||||
|
|
||||||
|
sub getinbetween {
|
||||||
|
my $index = shift;
|
||||||
|
my @list = @_;
|
||||||
|
|
||||||
|
return $list[int $index] if $index == int $index;
|
||||||
|
|
||||||
|
my $lower = $list[int $index];
|
||||||
|
my $upper = $list[1 + int $index];
|
||||||
|
|
||||||
|
my $frac = $index - int $index;
|
||||||
|
|
||||||
|
return ($lower * (1 - $frac) + $upper * $frac);
|
||||||
|
}
|
||||||
|
|
||||||
|
sub stats {
|
||||||
|
my $avg = shift;
|
||||||
|
return [0, 0, 0, 0, 0] if @_ <= 1;
|
||||||
|
|
||||||
|
# my $stdev = sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
|
||||||
|
|
||||||
|
my @sorted = sort { $a <=> $b } @_;
|
||||||
|
my $count = scalar @sorted;
|
||||||
|
|
||||||
|
return [
|
||||||
|
getinbetween(($count-1) * 0.25, @sorted),
|
||||||
|
$sorted[0],
|
||||||
|
$sorted[$count-1],
|
||||||
|
getinbetween(($count-1) * 0.75, @sorted),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
my %output;
|
||||||
|
my %vers;
|
||||||
|
my %peers;
|
||||||
|
my %stplot;
|
||||||
|
|
||||||
|
STATS:
|
||||||
|
foreach my $k (keys %data)
|
||||||
|
{
|
||||||
|
my %cols = map { my $vk = $_; $vk => [ map { $_->{$vk} } @{$data{$k}} ]; } @VALUES;
|
||||||
|
|
||||||
|
my %avg = map { $_ => avg(@{$cols{$_}})} @VALUES;
|
||||||
|
my %stloc = map { $_ => stats($avg{$_}, @{$cols{$_}})} @VALUES;
|
||||||
|
|
||||||
|
$vers{$data{$k}[0]{VERSION}}++;
|
||||||
|
$peers{$data{$k}[0]{PEERS}}++;
|
||||||
|
$output{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %avg };
|
||||||
|
$stplot{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %stloc };
|
||||||
|
}
|
||||||
|
|
||||||
|
#(3 == scalar %vers) and $vers{sark} and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
|
||||||
|
(2 == scalar %vers) and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
|
||||||
|
|
||||||
|
### Export the data ###
|
||||||
|
|
||||||
|
open PLOT, "|-", "gnuplot" or die $!;
|
||||||
|
|
||||||
|
say PLOT <<EOF;
|
||||||
|
set logscale
|
||||||
|
|
||||||
|
set term pdfcairo size 20cm,15cm
|
||||||
|
|
||||||
|
set xlabel "Total number of routes" offset 0,-1.5
|
||||||
|
set xrange [10000:3000000]
|
||||||
|
set xtics offset 0,-0.5
|
||||||
|
#set xtics (10000,15000,30000,50000,100000,150000,300000,500000,1000000)
|
||||||
|
|
||||||
|
set ylabel "Time to converge (s)"
|
||||||
|
set yrange [0.5:10800]
|
||||||
|
|
||||||
|
set grid
|
||||||
|
|
||||||
|
set key left top
|
||||||
|
|
||||||
|
set output "$OUTPUT"
|
||||||
|
EOF
|
||||||
|
|
||||||
|
my @colors = (
|
||||||
|
[ 1, 0.9, 0.3 ],
|
||||||
|
[ 0.7, 0, 0 ],
|
||||||
|
# [ 0.6, 1, 0.3 ],
|
||||||
|
# [ 0, 0.7, 0 ],
|
||||||
|
[ 0, 0.7, 1 ],
|
||||||
|
[ 0.3, 0.3, 1 ],
|
||||||
|
);
|
||||||
|
|
||||||
|
my $steps = (scalar %peers) - 1;
|
||||||
|
|
||||||
|
my @plot_data;
|
||||||
|
foreach my $v (sort keys %vers) {
|
||||||
|
my $color = shift @colors;
|
||||||
|
my $endcolor = shift @colors;
|
||||||
|
my $stepcolor = [ map +( ($endcolor->[$_] - $color->[$_]) / $steps ), (0, 1, 2) ];
|
||||||
|
|
||||||
|
foreach my $p (sort { int $a <=> int $b } keys %peers) {
|
||||||
|
my $vnodot = $v; $vnodot =~ s/\.//g;
|
||||||
|
say PLOT "\$data_${vnodot}_${p} << EOD";
|
||||||
|
foreach my $tr (sort { int $a <=> int $b } keys %{$output{$v}{$p}}) {
|
||||||
|
say PLOT "$tr $output{$v}{$p}{$tr}{TIMEDIF}";
|
||||||
|
}
|
||||||
|
say PLOT "EOD";
|
||||||
|
|
||||||
|
say PLOT "\$data_${vnodot}_${p}_stats << EOD";
|
||||||
|
foreach my $tr (sort { int $a <=> int $b } keys %{$output{$v}{$p}}) {
|
||||||
|
say PLOT join " ", ( $tr, @{$stplot{$v}{$p}{$tr}{TIMEDIF}} );
|
||||||
|
}
|
||||||
|
say PLOT "EOD";
|
||||||
|
|
||||||
|
my $colorstr = sprintf "linecolor rgbcolor \"#%02x%02x%02x\"", map +( int($color->[$_] * 255 + 0.5)), (0, 1, 2);
|
||||||
|
push @plot_data, "\$data_${vnodot}_${p} using 1:2 with lines $colorstr linewidth 2 title \"$v, $p peers\"";
|
||||||
|
push @plot_data, "\$data_${vnodot}_${p}_stats with candlesticks $colorstr linewidth 2 notitle \"\"";
|
||||||
|
$color = [ map +( $color->[$_] + $stepcolor->[$_] ), (0, 1, 2) ];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
push @plot_data, "2 with lines lt 1 dashtype 2 title \"Measurement instability\"";
|
||||||
|
|
||||||
|
say PLOT "plot ", join ", ", @plot_data;
|
||||||
|
close PLOT;
|
||||||
|
|
||||||
|
|
84
doc/threads/stats-filter.pl
Normal file
84
doc/threads/stats-filter.pl
Normal file
@ -0,0 +1,84 @@
|
|||||||
|
#!/usr/bin/perl
|
||||||
|
|
||||||
|
use common::sense;
|
||||||
|
use Data::Dump;
|
||||||
|
use List::Util;
|
||||||
|
|
||||||
|
my @GROUP_BY = qw/VERSION PEERS TOTAL_ROUTES/;
|
||||||
|
my @VALUES = qw/RSS SZ VSZ TIMEDIF/;
|
||||||
|
|
||||||
|
my ($FILE, $TYPE) = @ARGV;
|
||||||
|
|
||||||
|
### Load data ###
|
||||||
|
my %data;
|
||||||
|
open F, "<", $FILE or die $!;
|
||||||
|
my @header = split /;/, <F>;
|
||||||
|
chomp @header;
|
||||||
|
|
||||||
|
my $line = undef;
|
||||||
|
while ($line = <F>)
|
||||||
|
{
|
||||||
|
chomp $line;
|
||||||
|
my %row;
|
||||||
|
@row{@header} = split /;/, $line;
|
||||||
|
push @{$data{join ";", @row{@GROUP_BY}}}, { %row } if $row{TYPE} eq $TYPE;
|
||||||
|
}
|
||||||
|
|
||||||
|
### Do statistics ###
|
||||||
|
sub avg {
|
||||||
|
return List::Util::sum(@_) / @_;
|
||||||
|
}
|
||||||
|
|
||||||
|
sub stdev {
|
||||||
|
my $avg = shift;
|
||||||
|
return 0 if @_ <= 1;
|
||||||
|
return sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
|
||||||
|
}
|
||||||
|
|
||||||
|
my %output;
|
||||||
|
my %vers;
|
||||||
|
|
||||||
|
STATS:
|
||||||
|
foreach my $k (keys %data)
|
||||||
|
{
|
||||||
|
my %cols = map { my $vk = $_; $vk => [ map { $_->{$vk} } @{$data{$k}} ]; } @VALUES;
|
||||||
|
|
||||||
|
my %avg = map { $_ => avg(@{$cols{$_}})} @VALUES;
|
||||||
|
my %stdev = map { $_ => stdev($avg{$_}, @{$cols{$_}})} @VALUES;
|
||||||
|
|
||||||
|
foreach my $v (@VALUES) {
|
||||||
|
next if $stdev{$v} / $avg{$v} < 0.035;
|
||||||
|
|
||||||
|
for (my $i=0; $i<@{$cols{$v}}; $i++)
|
||||||
|
{
|
||||||
|
my $dif = $cols{$v}[$i] - $avg{$v};
|
||||||
|
next if $dif < $stdev{$v} * 2 and $dif > $stdev{$v} * (-2);
|
||||||
|
=cut
|
||||||
|
printf "Removing an outlier for %s/%s: avg=%f, stdev=%f, variance=%.1f%%, val=%f, valratio=%.1f%%\n",
|
||||||
|
$k, $v, $avg{$v}, $stdev{$v}, (100 * $stdev{$v} / $avg{$v}), $cols{$v}[$i], (100 * $dif / $stdev{$v});
|
||||||
|
=cut
|
||||||
|
splice @{$data{$k}}, $i, 1, ();
|
||||||
|
redo STATS;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
$vers{$data{$k}[0]{VERSION}}++;
|
||||||
|
$output{"$data{$k}[0]{PEERS};$data{$k}[0]{TOTAL_ROUTES}"}{$data{$k}[0]{VERSION}} = { %avg };
|
||||||
|
}
|
||||||
|
|
||||||
|
### Export the data ###
|
||||||
|
|
||||||
|
say "PEERS;TOTAL_ROUTES;" . join ";", ( map { my $vk = $_; map { "$_/$vk" } keys %vers; } @VALUES );
|
||||||
|
|
||||||
|
sub keysort {
|
||||||
|
my ($pa, $ta) = split /;/, $_[0];
|
||||||
|
my ($pb, $tb) = split /;/, $_[1];
|
||||||
|
|
||||||
|
return (int $ta) <=> (int $tb) if $pa eq $pb;
|
||||||
|
return (int $pa) <=> (int $pb);
|
||||||
|
}
|
||||||
|
|
||||||
|
foreach my $k (sort { keysort($a, $b); } keys %output)
|
||||||
|
{
|
||||||
|
say "$k;" . join ";", ( map { my $vk = $_; map { $output{$k}{$_}{$vk}; } keys %vers; } @VALUES );
|
||||||
|
}
|
1964
doc/threads/stats-longfilters.csv
Normal file
1964
doc/threads/stats-longfilters.csv
Normal file
File diff suppressed because it is too large
Load Diff
2086
doc/threads/stats.csv
Normal file
2086
doc/threads/stats.csv
Normal file
File diff suppressed because it is too large
Load Diff
@ -1,6 +1,6 @@
|
|||||||
Summary: BIRD Internet Routing Daemon
|
Summary: BIRD Internet Routing Daemon
|
||||||
Name: bird
|
Name: bird
|
||||||
Version: 2.0.12
|
Version: 3.0-alpha0
|
||||||
Release: 1
|
Release: 1
|
||||||
Copyright: GPL
|
Copyright: GPL
|
||||||
Group: Networking/Daemons
|
Group: Networking/Daemons
|
||||||
|