From 765c940094e8e070212ef66b7a77e86829b0cf60 Mon Sep 17 00:00:00 2001
From: Maria Matejka <mq@ucw.cz>
Date: Thu, 3 Feb 2022 22:42:26 +0100
Subject: [PATCH] Thread documentation: Rewritten chapter 3 on loops and locks
 once again

---
 doc/threads/03_coroutines.md | 214 ++++++++++++++++-------------------
 1 file changed, 98 insertions(+), 116 deletions(-)

diff --git a/doc/threads/03_coroutines.md b/doc/threads/03_coroutines.md
index fccc3bbf..d49314eb 100644
--- a/doc/threads/03_coroutines.md
+++ b/doc/threads/03_coroutines.md
@@ -25,16 +25,17 @@ Locks in BIRD (called domains, as they always lock some defined part of BIRD)
 are partially ordered. Every *domain* has its *type* and all threads are
 strictly required to lock the domains in the order of their respective types.
 The full order is defined in `lib/locking.h`. It's forbidden to lock more than
-one domain of a type (these domains are uncomparable) and recursive locking as well.
+one domain of a type (these domains are incomparable) and recursive locking is
+forbidden as well.
 
-The locking hiearchy is (as of December 2021) like this:
+The locking hierarchy is (roughly; as of February 2022) like this:
 
 1. The BIRD Lock (for everything not yet checked and/or updated)
-2. Protocols (as of December 2021, it is BFD, RPKI and Pipe)
+2. Protocols (as of February 2022, these are BFD, RPKI, Pipe and BGP)
 3. Routing tables
 4. Global route attribute cache
 5. Message passing
-6. Internals
+6. Internals and memory management
 
 There are heavy checks to ensure proper locking and to help debugging any
 problem when any code violates the hierarchy rules. This impedes performance
@@ -48,11 +49,33 @@ Risks arising from dropping some locking checks include:
 * data corruption; it either kills BIRD anyway, or it results into a slow and vicious death,
   leaving undebuggable corefiles behind.
 
-To be honest, I believe in principles like "there is also one more bug somewhere"
-and I just don't trust my future self or anybody else to write bugless code when
-it comes to proper locking. I believe that if a lock becomes a bottle-neck,
-then we should think about what is locked inside and how to optimize that, instead
-of dropping thorough consistency checks.
+To be honest, I believe in principles like *"every nontrivial piece of software has at least one bug"*
+and I also don't trust my future self or anybody else to always write bug-free code when
+it comes to proper locking. I also believe that if a lock becomes a bottleneck,
+then we should think about what is locked inside and how to optimize that,
+possibly implementing a lockless or wait-free data structure instead of dropping
+thorough consistency checks, especially in a multithreaded environment.
+
+### Choosing the right locking order
+
+When considering the locking order of protocols and route tables, the answer
+was quite easy. We had to make either import or export asynchronous (or both).
+Major reasons for asynchronous export have been stated in the previous chapter,
+therefore it makes little sense to allow entering protocol context from table code.
+
+As discussed later in this text, even accessing table context from protocol
+code leads to contention on table locks, yet for now, it is good enough and the
+lock order features routing tables after protocols to make the multithreading
+goal easier to achieve.
+
+The major lock level is still The BIRD Lock, covering not only the
+not-yet-converted protocols (like Babel, OSPF or RIP) but also CLI command
+processing and reconfiguration. This involves an awful lot of direct access into
+other contexts which would be unnecessarily complicated to implement by message
+passing. Therefore, this lock is simply *"the director"*, sitting on the top.
+
+The lower lock levels are mostly for shared global data structures accessed
+from everywhere. We'll address some of these later.
 
 ## IO Loop
 
@@ -64,15 +87,13 @@ routines. This code could be easily updated for general use so I did it.
 
 To understand the internal principles, we should say that in the `master`
 branch, there is a big loop centered around a `poll()` call, dispatching and
-executing everything as needed. There are several means how to get something dispatched from the main loop.
+executing everything as needed. There are several ways to get something dispatched from a loop.
 
 1. Requesting to read from a socket makes the main loop call your hook when there is some data received.
    The same happens when a socket refuses to write data. Then the data is buffered and you are called when
    the buffer is free. There is also a third callback, an error hook, for obvious reasons.
 
-2. Requesting to be called back after a given amount of time. The callback may
-   be delayed by any amount of time, anyway when it exceeds 5 seconds (default,
-   configurable) at least the user gets a warning. This is called *timer*.
+2. Requesting to be called back after a given amount of time. This is called a *timer*.
 
 3. Requesting to be called back when possible. This is useful to run anything
    not reentrant which might mess with the caller's data, e.g. when a protocol
@@ -97,118 +118,82 @@ locking that loop's domain. In fact, every event queue has its own lock with a
 low priority, allowing to pass messages from almost any part of BIRD, and also
 an assigned loop which executes the events enqueued. When a message is passed
 to a queue executed by another loop, that target loop must be woken up so we
-must know what loop to wake up to avoid unnecessary delays.
+must know what loop to wake up to avoid unnecessary delays. Then the target
+loop opens its mailbox and processes the task in its context.
 
-The other way is faster but not always possible. When the target loop domain
-may be locked from the original loop domain, we may simply *enter the target loop*,
-do the work and then *leave the loop*. Route import uses this approach to
-directly update the best route in the target table. In the other direction,
-loop entering is not possible and events must be used to pass messages.
+The other way is direct access to another domain. This approach blocks the
+affected loop from doing anything; we call it *entering a birdloop* as a
+reminder that the task must be fast and must *leave the birdloop* as soon as possible.
+Route import is done via direct access from protocols to tables; in large
+setups with fast filters, this is a major point of contention (after filters
+have been parallelized) and will be addressed in future optimization efforts.
+Reconfiguration and interface updates also use direct access; more on that later.
+In general, this approach should be avoided unless there are good reasons to use it.
 
-Asynchronous message passing is expensive. It involves sending a byte to a pipe
-to wakeup a loop from `poll` to execute the message. If we had to send a ping
-for every route we import to every channel to export it, we'd spend more time
-pinging than computing the best route. The route update routines therefore
-employ a double-indirect delayed route announcement:
+Even though direct access is bad, sending lots of messages may be even worse.
+Imagine one thousand post(wo)men, coming one by one every minute, ringing your
+doorbell and delivering one letter each to you. Horrible! Asynchronous message
+passing works exactly this way. After queuing the message, the source sends a
+byte to a pipe to wake up the target loop to process the task. We could also
+periodically poll for messages instead of waking up the targets, yet it would
+add quite a lot of latency which we also don't like.
 
-1. When a channel imports a route by entering a loop, it sends an event to its
-   own loop (no ping needed in such case). This operation is idempotent, thus
-   for several routes, only one event is enqueued.
-2. After all packet parsing is done, the channel import announcement event is
-   executed, sending another event to the table's loop. There may have been
-   multiple imports in the same time but the exports have to get a ping just once.
-3. The table's announcement event is executed from its loop, enqueuing export
-   events for all connected channels, finally initiating route exports.
+Messages in BIRD typically aren't numerous enough for this to matter and the
+overhead is negligible compared to the overall CPU consumption, with one notable
+exception: route import/export.
 
-This may seem overly complicated, yet it also allows the incoming changes to
-settle down before exports are finished, reducing also cache invalidation
-between importing and exporting threads.
+### Route export message passing
 
-## Choosing the right locking order
+If we had to send a ping for every route we import to every exporting channel,
+we'd spend more time pinging than doing anything else. Been there, seen
+those unbelievable 80%-like figures in Perf output. Never again.
 
-When considering the locking order of protocols and route tables, the answer was quite easy.
-If route tables could enter protocol loops, they would have to either directly
-execute protocol code, one export after another, or send whole routes by messages.
-Setting this other way around (protocol entering route tables), protocols do
-everything on their time, minimizing table time. Tables are contention points.
+Route update is quite a complicated process. BIRD must handle large-scale
+configurations with lots of importers and exporters. Therefore, a
+triple-indirect delayed route announcement is employed:
 
-The third major lock level is The BIRD Lock, containing virtually everything
-else. It is also established that BFD is after The BIRD Lock, as BFD is
-low-latency and can't wait until The BIRD gets unlocked. Thus it would be
-convenient to have all protocols on the same level, getting The BIRD Lock on top.
+1. First, when a channel imports a route by entering a loop, it sends an event
+   to its own loop (no ping needed in such case). This operation is idempotent,
+   thus for several routes in a row, only one event is enqueued. This reduces
+   several route imports (even hundreds in case of massive BGP withdrawals) to
+   one single event.
+2. When the channel is done importing (or at least takes a coffee break and
+   checks its mailbox), the scheduled event in its own loop is run, sending
+   another event to the table's loop, saying basically *"Hey, table, I've just
+   imported something."* This event is also idempotent and further reduces
+   route imports from multiple sources to one single event.
+3. The table's announcement event is then executed from its loop, enqueuing export
+   events for all connected channels, finally initiating route exports. As we
+   already know, imports are done by direct access, therefore if protocols keep
+   importing, export announcements must wait.
 
-The BIRD Lock also runs CLI, reconfiguration and other high-level tasks,
-requiring access to everything. Having The BIRD Lock anywhere else, these
-high-level tasks, scattered all around BIRD source code, would have to be split
-out to some super-loop.
-
-## Route tables
-
-BFD could be split out thanks to its special nature. There are no BFD routes,
-therefore no route tables are accessed. To split out any other protocol, we
-need the protocol to be able to directly access routing tables. Therefore
-route tables have to be split out first, to make space for protocols to go
-between tables and The BIRD main loop.
-
-Route tables are primarily data structures, yet they have their maintenance
-routines. Their purpose is (among others) to cleanup export buffers, update
-recursive routes and delete obsolete routes. This all may take lots of time
-occasionally so it makes sense to have a dedicated thread for these.
-
-In previous versions, I had a special type of event loop based on semaphores,
-contrary to the loop originating in BFD, based on `poll`. This was
-unnecessarily complicated, thus I rewrote that finally to use the universal IO
-loop, just with no sockets at all.
-
-There are some drawbacks of this, notably the number of filedescriptors BIRD
-now uses. The user should also check the maximum limit on threads per process.
-
-This change also means that imports and exports are started and stopped
-asynchronously. Stopping an import needs to wait until all its routes are gone.
-This induced some changes in the protocol state machine.
-
-## Protocols
-
-After tables were running in their own loops, the simplest protocol to split
-out was Pipe. There are still no sockets, just events. This also means that
-every single filter assigned to a pipe is run in its own thread, not blocking
-others. (To be precise, both directions of a pipe share the same thread.)
-
-When RPKI is in use, we want it to load the ROAs as soon as possible. Its table
-is independent and the protocol itself is so simple that it could be put into
-its own thread easily.
-
-Other protocols are pending (Kernel) or in progress (BGP).
-
-I tried to make the conversion also as easy as possible, implementing most of
-the code in the generic functions in `nest/proto.c`. There are some
-synchronization points in the protocol state machine; we can't simply delete
-all protocol data when there is another thread running. Together with the
-asynchronous import/export stopping, it is quite messy and it might need some
-future cleanup. Anyway, moving a protocol to its own thread should be now as simple
-as setting its locking level in its `config.Y` file and stopping all timers
-before shutting down.
-(See commits `4f3fa1623f66acd24c227cf0cc5a4af2f5133b6c`
-and `3fd1f46184aa74d8ab7ed65c9ab6954f7e49d309`.)
+This may seem overly complicated, yet it should work and it seems to work. In
+case of low load, all these notifications just come through smoothly. In case
+of high load, it's common that multiple updates come for the same destination.
+Delaying the exports allows for the updates to settle down and export just the
+final result, reducing CPU load and export traffic.
 
 ## Cork
 
+Route propagation is involved in yet another problem which has to be addressed.
 In the old versions with synchronous route propagation, all the buffering
 happened after exporting routes to BGP. When a packet arrived, all the work was
 done in BGP receive hook – parsing, importing into a table, running all the
 filters and possibly sending to the peers. No more routes until the previous
-was done. This doesn't work any more.
+was done. This self-regulating mechanism doesn't work any more.
 
 Route table import now returns immediately after inserting the route into a
 table, creating a buffer there. These buffers have to be processed by other protocols'
-export events, typically queued in the *global work queue* to be limited for lower latency.
-There is therefore no inherent limit for table export buffers which may lead
-(and occasionally leads) to memory bloating. This is even worse in configurations with pipes,
-as these multiply the exports by propagating them all the way down to other tables.
+export events. In large-scale configurations, one route import has to be
+processed by hundreds, even thousands of exports. Unlimited imports are a major
+cause of buffer bloating. This is even worse in configurations with pipes,
+as these multiply the exports by propagating them all the way down to other
+tables, eventually eating about twice as much memory as the single-threaded version.
 
 There is therefore a cork to make this stop. Every table is checking how many
-exports it has pending, and when adding a new route, it may apply a cork,
-saying simply "please stop the flow for a while". When the exports are then processed, it uncorks.
+exports it has pending, and when adding a new export to the queue, it may apply
+a cork, saying simply "please stop the flow for a while". When the exports are
+then processed, it uncorks.
 
 On the other side, there may be events and sockets with a cork assigned. When
 trying to enqueue an event and the cork is applied, the event is instead put
@@ -217,23 +202,20 @@ sockets, when `poll` arguments are recalculated, the corked socket is not
 checked for received packets, effectively keeping them in the TCP queue and
 slowing down the flow.
 
-Both events and sockets have some delay before they get to the cork. This is
-intentional; the purpose of cork is to slow down and allow for exports.
-
-The cork implementation is probably due to some future changes after BGP gets
-split out of the main loop, depending on how it is going to perform. I suppose
-that the best way should be to implement a proper table API to allow for
-explicit backpressure on both sides:
+The cork implementation is quite crude and fragile. It may get some rework
+while we stabilize the multi-threaded version of BIRD, or we may even drop it
+completely for some better mechanism. One candidate is this
+kind of API:
 
 * (table to protocol) please do not import
 * (table to protocol) you may resume imports
 * (protocol to table) not processing any exports
 * (protocol to table) resuming export processing
 
-Anyway, for now it is good enough as it is.
+Anyway, the cork works as intended in most cases, at least for now.
 
-*It's still a long road to the version 2.1. This series of texts should document
-what is needed to be changed, why we do it and how. The
+*It's a long road to version 2.1. This series of texts should document what
+is changing, why we do it and how. The
 [previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
-showed how the route export had to change to allow parallel execution. In the next chapter, some memory management
+shows how the route export had to change to allow parallel execution. In the next chapter, some memory management
 details are to be explained together with the reasons why memory management matters. Stay tuned!*