2021-12-08 20:31:12 +01:00
|
|
|
|
# BIRD Journey to Threads. Chapter 3: Parallel execution and message passing.
|
|
|
|
|
|
|
|
|
|
Parallel execution in BIRD uses an underlying mechanism of dedicated IO loops
|
|
|
|
|
and hierarchical locks. The original event scheduling module has been converted
|
|
|
|
|
to do message passing in multithreaded environment. These mechanisms are
|
2022-02-04 15:18:06 +01:00
|
|
|
|
crucial for understanding what happens inside BIRD and how its internal API changes.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
|
|
|
|
implemented at the end of 20th century. We're doing a significant amount of
|
|
|
|
|
BIRD's internal structure changes to make it run in multiple threads in parallel.
|
|
|
|
|
|
|
|
|
|
## Locking and deadlock prevention
|
|
|
|
|
|
|
|
|
|
Most of BIRD data structures and algorithms are thread-unsafe and not even
|
|
|
|
|
reentrant. Checking and possibly updating all of these would take an
|
|
|
|
|
unreasonable amount of time, thus the multithreaded version uses standard mutexes
|
|
|
|
|
to lock all the parts which have not been checked and updated yet.
|
|
|
|
|
|
|
|
|
|
The authors of original BIRD concepts wisely chose a highly modular structure
|
|
|
|
|
which allows to create a hierarchy for locks. The main chokepoint was between
|
2022-02-04 15:18:06 +01:00
|
|
|
|
protocols and tables and it has been removed by implementing asynchronous exports
|
2021-12-08 20:31:12 +01:00
|
|
|
|
as described in the [previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/).
|
|
|
|
|
|
|
|
|
|
Locks in BIRD (called domains, as they always lock some defined part of BIRD)
|
|
|
|
|
are partially ordered. Every *domain* has its *type* and all threads are
|
|
|
|
|
strictly required to lock the domains in the order of their respective types.
|
|
|
|
|
The full order is defined in `lib/locking.h`. It's forbidden to lock more than
|
2022-02-03 22:42:26 +01:00
|
|
|
|
one domain of a type (these domains are uncomparable) and recursive locking is
|
|
|
|
|
forbidden as well.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-03 22:42:26 +01:00
|
|
|
|
The locking hiearchy is (roughly; as of February 2022) like this:
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
1. The BIRD Lock (for everything not yet checked and/or updated)
|
2022-02-03 22:42:26 +01:00
|
|
|
|
2. Protocols (as of February 2022, it is BFD, RPKI, Pipe and BGP)
|
2021-12-08 20:31:12 +01:00
|
|
|
|
3. Routing tables
|
|
|
|
|
4. Global route attribute cache
|
|
|
|
|
5. Message passing
|
2022-02-03 22:42:26 +01:00
|
|
|
|
6. Internals and memory management
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
There are heavy checks to ensure proper locking and to help debugging any
|
|
|
|
|
problem when any code violates the hierarchy rules. This impedes performance
|
|
|
|
|
depending on how much that domain is contended and in some cases I have already
|
|
|
|
|
implemented lockless (or partially lockless) data structures to overcome this.
|
|
|
|
|
|
|
|
|
|
You may ask, why are these heavy checks then employed in production builds?
|
|
|
|
|
Risks arising from dropping some locking checks include:
|
|
|
|
|
|
|
|
|
|
* deadlocks; these are deadly in BIRD anyway so it should just fail with a meaningful message, or
|
|
|
|
|
* data corruption; it either kills BIRD anyway, or it results into a slow and vicious death,
|
|
|
|
|
leaving undebuggable corefiles behind.
|
|
|
|
|
|
2022-02-03 22:42:26 +01:00
|
|
|
|
To be honest, I believe in principles like *"every nontrivial software has at least one bug"*
|
|
|
|
|
and I also don't trust my future self or anybody else to always write bugless code when
|
|
|
|
|
it comes to proper locking. I also believe that if a lock becomes a bottle-neck,
|
|
|
|
|
then we should think about what is locked inside and how to optimize that,
|
|
|
|
|
possibly implementing a lockless or waitless data structure instead of dropping
|
|
|
|
|
thorough consistency checks, especially in a multithreaded environment.
|
|
|
|
|
|
|
|
|
|
### Choosing the right locking order
|
|
|
|
|
|
|
|
|
|
When considering the locking order of protocols and route tables, the answer
|
|
|
|
|
was quite easy. We had to make either import or export asynchronous (or both).
|
|
|
|
|
Major reasons for asynchronous export have been stated in the previous chapter,
|
|
|
|
|
therefore it makes little sense to allow entering protocol context from table code.
|
|
|
|
|
|
|
|
|
|
As I write further in this text, even accessing table context from protocol
|
|
|
|
|
code leads to contention on table locks, yet for now, it is good enough and the
|
|
|
|
|
lock order features routing tables after protocols to make the multithreading
|
|
|
|
|
goal easier to achieve.
|
|
|
|
|
|
|
|
|
|
The major lock level is still The BIRD Lock, containing not only the
|
|
|
|
|
not-yet-converted protocols (like Babel, OSPF or RIP) but also processing CLI
|
|
|
|
|
commands and reconfiguration. This involves an awful lot of direct access into
|
|
|
|
|
other contexts which would be unnecessarily complicated to implement by message
|
2022-02-04 15:18:06 +01:00
|
|
|
|
passing. Therefore, this lock is simply *"the director"*, sitting on the top
|
|
|
|
|
with its own category.
|
2022-02-03 22:42:26 +01:00
|
|
|
|
|
2022-02-04 15:18:06 +01:00
|
|
|
|
The lower lock levels under routing tables are mostly for shared global data
|
|
|
|
|
structures accessed from everywhere. We'll address some of these later.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
## IO Loop
|
|
|
|
|
|
|
|
|
|
There has been a protocol, BFD, running in its own thread since 2013. This
|
|
|
|
|
separation has a good reason; it needs low latency and the main BIRD loop just
|
2022-02-04 15:18:06 +01:00
|
|
|
|
walks round-robin around all the available sockets and one round-trip may take
|
|
|
|
|
a long time (even more than a minute with large configurations). BFD had its
|
|
|
|
|
own IO loop implementation and simple message passing routines. This code could
|
|
|
|
|
be easily updated for general use so I did it.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
To understand the internal principles, we should say that in the `master`
|
|
|
|
|
branch, there is a big loop centered around a `poll()` call, dispatching and
|
2022-02-04 15:18:06 +01:00
|
|
|
|
executing everything as needed. In the `sark` branch, there are multiple loops
|
|
|
|
|
of this kind. BIRD has several means how to get something dispatched from a
|
|
|
|
|
loop.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-04 15:18:06 +01:00
|
|
|
|
1. Requesting to read from a **socket** makes the main loop call your hook when there is some data received.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
The same happens when a socket refuses to write data. Then the data is buffered and you are called when
|
2022-02-04 15:18:06 +01:00
|
|
|
|
the buffer is free to continue writing. There is also a third callback, an error hook, for obvious reasons.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-04 15:18:06 +01:00
|
|
|
|
2. Requesting to be called back after a given amount of time. This is called **timer**.
|
|
|
|
|
As is common with all timers, they aren't precise and the callback may be
|
|
|
|
|
delayed significantly. This was also the reason to have BFD loop separate
|
|
|
|
|
since the very beginning, yet now the abundance of threads may lead to
|
|
|
|
|
problems with BFD latency in large-scale configurations. We haven't tested
|
|
|
|
|
this yet.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-04 15:18:06 +01:00
|
|
|
|
3. Requesting to be called back from a clean context when possible. This is
|
|
|
|
|
useful to run anything not reentrant which might mess with the caller's
|
|
|
|
|
data, e.g. when a protocol decides to shutdown due to some inconsistency
|
|
|
|
|
in received data. This is called **event**.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
4. Requesting to do some work when possible. These are also events, there is only
|
|
|
|
|
a difference where this event is enqueued; in the main loop, there is a
|
|
|
|
|
special *work queue* with an execution limit, allowing sockets and timers to be
|
|
|
|
|
handled with a reasonable latency while still doing all the work needed.
|
2022-02-04 15:18:06 +01:00
|
|
|
|
Other loops don't have designated work queues (we may add them later).
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
All these, sockets, timers and events, are tightly bound to some domain.
|
|
|
|
|
Sockets typically belong to a protocol, timers and events to a protocol or table.
|
|
|
|
|
With the modular structure of BIRD, the easy and convenient approach to multithreading
|
2022-02-04 15:18:06 +01:00
|
|
|
|
is to get more IO loops, each bound to a specific domain, running their events, timers and
|
2021-12-08 20:31:12 +01:00
|
|
|
|
socket hooks in their threads.
|
|
|
|
|
|
|
|
|
|
## Message passing and loop entering
|
|
|
|
|
|
|
|
|
|
To request some work in another module, the standard way is to pass a message.
|
|
|
|
|
For this purpose, events have been modified to be sent to a given loop without
|
|
|
|
|
locking that loop's domain. In fact, every event queue has its own lock with a
|
|
|
|
|
low priority, allowing to pass messages from almost any part of BIRD, and also
|
|
|
|
|
an assigned loop which executes the events enqueued. When a message is passed
|
|
|
|
|
to a queue executed by another loop, that target loop must be woken up so we
|
2022-02-03 22:42:26 +01:00
|
|
|
|
must know what loop to wake up to avoid unnecessary delays. Then the target
|
|
|
|
|
loop opens its mailbox and processes the task in its context.
|
|
|
|
|
|
|
|
|
|
The other way is a direct access of another domain. This approach blocks the
|
|
|
|
|
appropriate loop from doing anything and we call it *entering a birdloop* to
|
|
|
|
|
remember that the task must be fast and *leave the birdloop* as soon as possible.
|
|
|
|
|
Route import is done via direct access from protocols to tables; in large
|
|
|
|
|
setups with fast filters, this is a major point of contention (after filters
|
|
|
|
|
have been parallelized) and will be addressed in future optimization efforts.
|
|
|
|
|
Reconfiguration and interface updates also use direct access; more on that later.
|
|
|
|
|
In general, this approach should be avoided unless there are good reasons to use it.
|
|
|
|
|
|
|
|
|
|
Even though direct access is bad, sending lots of messages may be even worse.
|
|
|
|
|
Imagine one thousand post(wo)men, coming one by one every minute, ringing your
|
|
|
|
|
doorbell and delivering one letter each to you. Horrible! Asynchronous message
|
|
|
|
|
passing works exactly this way. After queuing the message, the source sends a
|
|
|
|
|
byte to a pipe to wakeup the target loop to process the task. We could also
|
|
|
|
|
periodically poll for messages instead of waking up the targets, yet it would
|
|
|
|
|
add quite a lot of latency which we also don't like.
|
|
|
|
|
|
|
|
|
|
Messages in BIRD don't typically suffer from the problem of amount and the
|
|
|
|
|
overhead is negligible compared to the overall CPU consumption. With one notable
|
|
|
|
|
exception: route import/export.
|
|
|
|
|
|
|
|
|
|
### Route export message passing
|
|
|
|
|
|
|
|
|
|
If we had to send a ping for every route we import to every exporting channel,
|
|
|
|
|
we'd spend more time pinging than doing anything else. Been there, seen
|
|
|
|
|
those unbelievable 80%-like figures in Perf output. Never more.
|
|
|
|
|
|
|
|
|
|
Route update is quite a complicated process. BIRD must handle large-scale
|
|
|
|
|
configurations with lots of importers and exporters. Therefore, a
|
|
|
|
|
triple-indirect delayed route announcement is employed:
|
|
|
|
|
|
|
|
|
|
1. First, when a channel imports a route by entering a loop, it sends an event
|
|
|
|
|
to its own loop (no ping needed in such case). This operation is idempotent,
|
|
|
|
|
thus for several routes in a row, only one event is enqueued. This reduces
|
2022-02-04 15:18:06 +01:00
|
|
|
|
several route import announcements (even hundreds in case of massive BGP
|
|
|
|
|
withdrawals) to one single event.
|
2022-02-03 22:42:26 +01:00
|
|
|
|
2. When the channel is done importing (or at least takes a coffee break and
|
|
|
|
|
checks its mailbox), the scheduled event in its own loop is run, sending
|
|
|
|
|
another event to the table's loop, saying basically *"Hey, table, I've just
|
|
|
|
|
imported something."*. This event is also idempotent and further reduces
|
2022-02-04 15:18:06 +01:00
|
|
|
|
route import announcements from multiple sources to one single event.
|
2022-02-03 22:42:26 +01:00
|
|
|
|
3. The table's announcement event is then executed from its loop, enqueuing export
|
|
|
|
|
events for all connected channels, finally initiating route exports. As we
|
|
|
|
|
already know, imports are done by direct access, therefore if protocols keep
|
2022-02-04 15:18:06 +01:00
|
|
|
|
importing, export announcements are slowed down.
|
|
|
|
|
4. The actual data on what has been updated is stored in a table journal. This
|
|
|
|
|
peculiar technique is used only for informing the exporting channels that
|
|
|
|
|
*"there is something to do"*.
|
2022-02-03 22:42:26 +01:00
|
|
|
|
|
|
|
|
|
This may seem overly complicated, yet it should work and it seems to work. In
|
|
|
|
|
case of low load, all these notifications just come through smoothly. In case
|
|
|
|
|
of high load, it's common that multiple updates come for the same destination.
|
|
|
|
|
Delaying the exports allows for the updates to settle down and export just the
|
|
|
|
|
final result, reducing CPU load and export traffic.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
## Cork
|
|
|
|
|
|
2022-02-03 22:42:26 +01:00
|
|
|
|
Route propagation is involved in yet another problem which has to be addressed.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
In the old versions with synchronous route propagation, all the buffering
|
|
|
|
|
happened after exporting routes to BGP. When a packet arrived, all the work was
|
|
|
|
|
done in BGP receive hook – parsing, importing into a table, running all the
|
|
|
|
|
filters and possibly sending to the peers. No more routes until the previous
|
2022-02-03 22:42:26 +01:00
|
|
|
|
was done. This self-regulating mechanism doesn't work any more.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
Route table import now returns immediately after inserting the route into a
|
|
|
|
|
table, creating a buffer there. These buffers have to be processed by other protocols'
|
2022-02-03 22:42:26 +01:00
|
|
|
|
export events. In large-scale configurations, one route import has to be
|
|
|
|
|
processed by hundreds, even thousands of exports. Unlimited imports are a major
|
|
|
|
|
cause of buffer bloating. This is even worse in configurations with pipes,
|
|
|
|
|
as these multiply the exports by propagating them all the way down to other
|
|
|
|
|
tables, eventually eating about twice the amount of memory than the single-threaded version.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
There is therefore a cork to make this stop. Every table is checking how many
|
2022-02-04 15:18:06 +01:00
|
|
|
|
exports it has pending, and when adding a new export to the queue, it may request
|
|
|
|
|
a cork, saying simply "please stop the flow for a while". When the export buffer
|
|
|
|
|
size is reduced low enough, the table uncorks.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-04 15:18:06 +01:00
|
|
|
|
On the other side, there are events and sockets with a cork assigned. When
|
2021-12-08 20:31:12 +01:00
|
|
|
|
trying to enqueue an event and the cork is applied, the event is instead put
|
|
|
|
|
into the cork's queue and released only when the cork is released. In case of
|
2022-02-04 15:18:06 +01:00
|
|
|
|
sockets, when read is indicated or when `poll` arguments are recalculated,
|
|
|
|
|
the corked socket is simply not checked for received packets, effectively
|
|
|
|
|
keeping them in the TCP queue and slowing down the flow until cork is released.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-03 22:42:26 +01:00
|
|
|
|
The cork implementation is quite crude and rough and fragile. It may get some
|
|
|
|
|
rework while stabilizing the multi-threaded version of BIRD or we may even
|
|
|
|
|
completely drop it for some better mechanism. One of these candidates is this
|
|
|
|
|
kind of API:
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
|
|
|
|
* (table to protocol) please do not import
|
|
|
|
|
* (table to protocol) you may resume imports
|
|
|
|
|
* (protocol to table) not processing any exports
|
|
|
|
|
* (protocol to table) resuming export processing
|
|
|
|
|
|
2022-02-03 22:42:26 +01:00
|
|
|
|
Anyway, cork works as intended in most cases at least for now.
|
2021-12-08 20:31:12 +01:00
|
|
|
|
|
2022-02-03 22:42:26 +01:00
|
|
|
|
*It's a long road to the version 2.1. This series of texts should document what
|
|
|
|
|
is changing, why we do it and how. The
|
2021-12-08 20:31:12 +01:00
|
|
|
|
[previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
|
2022-02-03 22:42:26 +01:00
|
|
|
|
shows how the route export had to change to allow parallel execution. In the next chapter, some memory management
|
2021-12-22 15:35:29 +01:00
|
|
|
|
details are to be explained together with the reasons why memory management matters. Stay tuned!*
|