mirror of https://gitlab.nic.cz/labs/bird.git (synced 2025-01-24 09:51:54 +00:00)
Thread documentation: Rewritten chapter 3 on loops and locks once again
parent 216bcb7c68
commit 765c940094
@@ -25,16 +25,17 @@ Locks in BIRD (called domains, as they always lock some defined part of BIRD)
|
|||||||
are partially ordered. Every *domain* has its *type* and all threads are
|
are partially ordered. Every *domain* has its *type* and all threads are
|
||||||
strictly required to lock the domains in the order of their respective types.
|
strictly required to lock the domains in the order of their respective types.
|
||||||
The full order is defined in `lib/locking.h`. It's forbidden to lock more than
|
The full order is defined in `lib/locking.h`. It's forbidden to lock more than
|
||||||
one domain of a type (these domains are uncomparable) and recursive locking as well.
|
one domain of a type (these domains are uncomparable) and recursive locking is
|
||||||
|
forbidden as well.
|
||||||
|
|
||||||
The locking hiearchy is (as of December 2021) like this:
|
The locking hiearchy is (roughly; as of February 2022) like this:
|
||||||
|
|
||||||
1. The BIRD Lock (for everything not yet checked and/or updated)
|
1. The BIRD Lock (for everything not yet checked and/or updated)
|
||||||
2. Protocols (as of December 2021, it is BFD, RPKI and Pipe)
|
2. Protocols (as of February 2022, it is BFD, RPKI, Pipe and BGP)
|
||||||
3. Routing tables
|
3. Routing tables
|
||||||
4. Global route attribute cache
|
4. Global route attribute cache
|
||||||
5. Message passing
|
5. Message passing
|
||||||
6. Internals
|
6. Internals and memory management
|
||||||
|
|
||||||
There are heavy checks to ensure proper locking and to help debugging any
|
There are heavy checks to ensure proper locking and to help debugging any
|
||||||
problem when any code violates the hierarchy rules. This impedes performance
|
problem when any code violates the hierarchy rules. This impedes performance
|
||||||
@@ -48,11 +49,33 @@ Risks arising from dropping some locking checks include:
|
|||||||
* data corruption; it either kills BIRD anyway, or it results into a slow and vicious death,
|
* data corruption; it either kills BIRD anyway, or it results into a slow and vicious death,
|
||||||
leaving undebuggable corefiles behind.
|
leaving undebuggable corefiles behind.
|
||||||
|
|
||||||
To be honest, I believe in principles like "there is also one more bug somewhere"
|
To be honest, I believe in principles like *"every nontrivial software has at least one bug"*
|
||||||
and I just don't trust my future self or anybody else to write bugless code when
|
and I also don't trust my future self or anybody else to always write bugless code when
|
||||||
it comes to proper locking. I believe that if a lock becomes a bottle-neck,
|
it comes to proper locking. I also believe that if a lock becomes a bottle-neck,
|
||||||
then we should think about what is locked inside and how to optimize that, instead
|
then we should think about what is locked inside and how to optimize that,
|
||||||
of dropping thorough consistency checks.
|
possibly implementing a lockless or waitless data structure instead of dropping
|
||||||
|
thorough consistency checks, especially in a multithreaded environment.
|
||||||
|
|
||||||
|
### Choosing the right locking order
|
||||||
|
|
||||||
|
When considering the locking order of protocols and route tables, the answer
|
||||||
|
was quite easy. We had to make either import or export asynchronous (or both).
|
||||||
|
Major reasons for asynchronous export have been stated in the previous chapter,
|
||||||
|
therefore it makes little sense to allow entering protocol context from table code.
|
||||||
|
|
||||||
|
As I write further in this text, even accessing table context from protocol
|
||||||
|
code leads to contention on table locks, yet for now, it is good enough and the
|
||||||
|
lock order features routing tables after protocols to make the multithreading
|
||||||
|
goal easier to achieve.
|
||||||
|
|
||||||
|
The major lock level is still The BIRD Lock, containing not only the
|
||||||
|
not-yet-converted protocols (like Babel, OSPF or RIP) but also processing CLI
|
||||||
|
commands and reconfiguration. This involves an awful lot of direct access into
|
||||||
|
other contexts which would be unnecessarily complicated to implement by message
|
||||||
|
passing. Therefore, this lock is simply *"the director"*, sitting on the top.
|
||||||
|
|
||||||
|
The lower lock levels are mostly for shared global data structures accessed
|
||||||
|
from everywhere. We'll address some of these later.
|
||||||
|
|
||||||
## IO Loop
|
## IO Loop
|
||||||
|
|
||||||
@@ -64,15 +87,13 @@ routines. This code could be easily updated for general use so I did it.
|
|||||||
|
|
||||||
To understand the internal principles, we should say that in the `master`
|
To understand the internal principles, we should say that in the `master`
|
||||||
branch, there is a big loop centered around a `poll()` call, dispatching and
|
branch, there is a big loop centered around a `poll()` call, dispatching and
|
||||||
executing everything as needed. There are several means how to get something dispatched from the main loop.
|
executing everything as needed. There are several means how to get something dispatched from a loop.
|
||||||
|
|
||||||
1. Requesting to read from a socket makes the main loop call your hook when there is some data received.
|
1. Requesting to read from a socket makes the main loop call your hook when there is some data received.
|
||||||
The same happens when a socket refuses to write data. Then the data is buffered and you are called when
|
The same happens when a socket refuses to write data. Then the data is buffered and you are called when
|
||||||
the buffer is free. There is also a third callback, an error hook, for obvious reasons.
|
the buffer is free. There is also a third callback, an error hook, for obvious reasons.
|
||||||
|
|
||||||
2. Requesting to be called back after a given amount of time. The callback may
|
2. Requesting to be called back after a given amount of time. This is called *timer*.
|
||||||
be delayed by any amount of time, anyway when it exceeds 5 seconds (default,
|
|
||||||
configurable) at least the user gets a warning. This is called *timer*.
|
|
||||||
|
|
||||||
3. Requesting to be called back when possible. This is useful to run anything
|
3. Requesting to be called back when possible. This is useful to run anything
|
||||||
not reentrant which might mess with the caller's data, e.g. when a protocol
|
not reentrant which might mess with the caller's data, e.g. when a protocol
|
||||||
@@ -97,118 +118,82 @@ locking that loop's domain. In fact, every event queue has its own lock with a
|
|||||||
low priority, allowing to pass messages from almost any part of BIRD, and also
|
low priority, allowing to pass messages from almost any part of BIRD, and also
|
||||||
an assigned loop which executes the events enqueued. When a message is passed
|
an assigned loop which executes the events enqueued. When a message is passed
|
||||||
to a queue executed by another loop, that target loop must be woken up so we
|
to a queue executed by another loop, that target loop must be woken up so we
|
||||||
must know what loop to wake up to avoid unnecessary delays.
|
must know what loop to wake up to avoid unnecessary delays. Then the target
|
||||||
|
loop opens its mailbox and processes the task in its context.
|
||||||
|
|
||||||
The other way is faster but not always possible. When the target loop domain
|
The other way is a direct access of another domain. This approach blocks the
|
||||||
may be locked from the original loop domain, we may simply *enter the target loop*,
|
appropriate loop from doing anything and we call it *entering a birdloop* to
|
||||||
do the work and then *leave the loop*. Route import uses this approach to
|
remember that the task must be fast and *leave the birdloop* as soon as possible.
|
||||||
directly update the best route in the target table. In the other direction,
|
Route import is done via direct access from protocols to tables; in large
|
||||||
loop entering is not possible and events must be used to pass messages.
|
setups with fast filters, this is a major point of contention (after filters
|
||||||
|
have been parallelized) and will be addressed in future optimization efforts.
|
||||||
|
Reconfiguration and interface updates also use direct access; more on that later.
|
||||||
|
In general, this approach should be avoided unless there are good reasons to use it.
|
||||||
|
|
||||||
Asynchronous message passing is expensive. It involves sending a byte to a pipe
|
Even though direct access is bad, sending lots of messages may be even worse.
|
||||||
to wakeup a loop from `poll` to execute the message. If we had to send a ping
|
Imagine one thousand post(wo)men, coming one by one every minute, ringing your
|
||||||
for every route we import to every channel to export it, we'd spend more time
|
doorbell and delivering one letter each to you. Horrible! Asynchronous message
|
||||||
pinging than computing the best route. The route update routines therefore
|
passing works exactly this way. After queuing the message, the source sends a
|
||||||
employ a double-indirect delayed route announcement:
|
byte to a pipe to wakeup the target loop to process the task. We could also
|
||||||
|
periodically poll for messages instead of waking up the targets, yet it would
|
||||||
|
add quite a lot of latency which we also don't like.
|
||||||
|
|
||||||
1. When a channel imports a route by entering a loop, it sends an event to its
|
Messages in BIRD don't typically suffer from the problem of amount and the
|
||||||
own loop (no ping needed in such case). This operation is idempotent, thus
|
overhead is negligible compared to the overall CPU consumption. With one notable
|
||||||
for several routes, only one event is enqueued.
|
exception: route import/export.
|
||||||
2. After all packet parsing is done, the channel import announcement event is
|
|
||||||
executed, sending another event to the table's loop. There may have been
|
|
||||||
multiple imports in the same time but the exports have to get a ping just once.
|
|
||||||
3. The table's announcement event is executed from its loop, enqueuing export
|
|
||||||
events for all connected channels, finally initiating route exports.
|
|
||||||
|
|
||||||
This may seem overly complicated, yet it also allows the incoming changes to
|
### Route export message passing
|
||||||
settle down before exports are finished, reducing also cache invalidation
|
|
||||||
between importing and exporting threads.
|
|
||||||
|
|
||||||
## Choosing the right locking order
|
If we had to send a ping for every route we import to every exporting channel,
|
||||||
|
we'd spend more time pinging than doing anything else. Been there, seen
|
||||||
|
those unbelievable 80%-like figures in Perf output. Never more.
|
||||||
|
|
||||||
When considering the locking order of protocols and route tables, the answer was quite easy.
|
Route update is quite a complicated process. BIRD must handle large-scale
|
||||||
If route tables could enter protocol loops, they would have to either directly
|
configurations with lots of importers and exporters. Therefore, a
|
||||||
execute protocol code, one export after another, or send whole routes by messages.
|
triple-indirect delayed route announcement is employed:
|
||||||
Setting this other way around (protocol entering route tables), protocols do
|
|
||||||
everything on their time, minimizing table time. Tables are contention points.
|
|
||||||
|
|
||||||
The third major lock level is The BIRD Lock, containing virtually everything
|
1. First, when a channel imports a route by entering a loop, it sends an event
|
||||||
else. It is also established that BFD is after The BIRD Lock, as BFD is
|
to its own loop (no ping needed in such case). This operation is idempotent,
|
||||||
low-latency and can't wait until The BIRD gets unlocked. Thus it would be
|
thus for several routes in a row, only one event is enqueued. This reduces
|
||||||
convenient to have all protocols on the same level, getting The BIRD Lock on top.
|
several route imports (even hundreds in case of massive BGP withdrawals) to
|
||||||
|
one single event.
|
||||||
|
2. When the channel is done importing (or at least takes a coffee break and
|
||||||
|
checks its mailbox), the scheduled event in its own loop is run, sending
|
||||||
|
another event to the table's loop, saying basically *"Hey, table, I've just
|
||||||
|
imported something."*. This event is also idempotent and further reduces
|
||||||
|
route imports from multiple sources to one single event.
|
||||||
|
3. The table's announcement event is then executed from its loop, enqueuing export
|
||||||
|
events for all connected channels, finally initiating route exports. As we
|
||||||
|
already know, imports are done by direct access, therefore if protocols keep
|
||||||
|
importing, export announcements must wait.
|
||||||
|
|
||||||
The BIRD Lock also runs CLI, reconfiguration and other high-level tasks,
|
This may seem overly complicated, yet it should work and it seems to work. In
|
||||||
requiring access to everything. Having The BIRD Lock anywhere else, these
|
case of low load, all these notifications just come through smoothly. In case
|
||||||
high-level tasks, scattered all around BIRD source code, would have to be split
|
of high load, it's common that multiple updates come for the same destination.
|
||||||
out to some super-loop.
|
Delaying the exports allows for the updates to settle down and export just the
|
||||||
|
final result, reducing CPU load and export traffic.
|
||||||
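The three announcement steps above rely on one property: sending an event that is already enqueued does nothing. A minimal sketch of that coalescing follows, in Python rather than BIRD's C. `Event`, `Loop`, `send`, `run` and `simulate` are hypothetical names for illustration only, and `pings` merely counts the wakeups that would otherwise go through the pipe.

```python
from collections import deque


class Event:
    def __init__(self, hook):
        self.hook = hook
        self.queued = False  # True while waiting in some loop's queue


class Loop:
    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.pings = 0  # wakeups that would be sent through the pipe

    def send(self, event, from_loop=None):
        if event.queued:
            return  # idempotent: already scheduled, nothing to do
        event.queued = True
        self.queue.append(event)
        if from_loop is not self:
            self.pings += 1  # only a foreign loop needs to wake us up

    def run(self):
        while self.queue:
            event = self.queue.popleft()
            event.queued = False
            event.hook()


def simulate(n_routes, n_exporters):
    proto_loop = Loop("importer")
    table_loop = Loop("table")
    exporters = [Loop(f"exporter{i}") for i in range(n_exporters)]
    exported = []

    # Step 3: the table event enqueues one export event per exporter.
    export_events = [
        (loop, Event(lambda i=i: exported.append(i)))
        for i, loop in enumerate(exporters)
    ]

    def table_announce():
        for loop, event in export_events:
            loop.send(event, from_loop=table_loop)

    table_event = Event(table_announce)

    # Step 2: the channel event notifies the table's loop.
    channel_event = Event(
        lambda: table_loop.send(table_event, from_loop=proto_loop)
    )

    # Step 1: every import (done by direct access, not modelled here)
    # schedules the channel event in the importer's own loop.
    for _ in range(n_routes):
        proto_loop.send(channel_event, from_loop=proto_loop)

    proto_loop.run()
    table_loop.run()
    for loop in exporters:
        loop.run()
    return proto_loop.pings, table_loop.pings, [e.pings for e in exporters], exported
```

In this model, importing a hundred routes in a row costs no ping in the importer's own loop, one ping for the table and one per exporter, instead of one ping per route per exporter.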
## Route tables
|
|
||||||
|
|
||||||
BFD could be split out thanks to its special nature. There are no BFD routes,
|
|
||||||
therefore no route tables are accessed. To split out any other protocol, we
|
|
||||||
need the protocol to be able to directly access routing tables. Therefore
|
|
||||||
route tables have to be split out first, to make space for protocols to go
|
|
||||||
between tables and The BIRD main loop.
|
|
||||||
|
|
||||||
Route tables are primarily data structures, yet they have their maintenance
|
|
||||||
routines. Their purpose is (among others) to cleanup export buffers, update
|
|
||||||
recursive routes and delete obsolete routes. This all may take lots of time
|
|
||||||
occasionally so it makes sense to have a dedicated thread for these.
|
|
||||||
|
|
||||||
In previous versions, I had a special type of event loop based on semaphores,
|
|
||||||
contrary to the loop originating in BFD, based on `poll`. This was
|
|
||||||
unnecessarily complicated, thus I rewrote that finally to use the universal IO
|
|
||||||
loop, just with no sockets at all.
|
|
||||||
|
|
||||||
There are some drawbacks of this, notably the number of filedescriptors BIRD
|
|
||||||
now uses. The user should also check the maximum limit on threads per process.
|
|
||||||
|
|
||||||
This change also means that imports and exports are started and stopped
|
|
||||||
asynchronously. Stopping an import needs to wait until all its routes are gone.
|
|
||||||
This induced some changes in the protocol state machine.
|
|
||||||
|
|
||||||
## Protocols
|
|
||||||
|
|
||||||
After tables were running in their own loops, the simplest protocol to split
|
|
||||||
out was Pipe. There are still no sockets, just events. This also means that
|
|
||||||
every single filter assigned to a pipe is run in its own thread, not blocking
|
|
||||||
others. (To be precise, both directions of a pipe share the same thread.)
|
|
||||||
|
|
||||||
When RPKI is in use, we want it to load the ROAs as soon as possible. Its table
|
|
||||||
is independent and the protocol itself is so simple that it could be put into
|
|
||||||
its own thread easily.
|
|
||||||
|
|
||||||
Other protocols are pending (Kernel) or in progress (BGP).
|
|
||||||
|
|
||||||
I tried to make the conversion also as easy as possible, implementing most of
|
|
||||||
the code in the generic functions in `nest/proto.c`. There are some
|
|
||||||
synchronization points in the protocol state machine; we can't simply delete
|
|
||||||
all protocol data when there is another thread running. Together with the
|
|
||||||
asynchronous import/export stopping, it is quite messy and it might need some
|
|
||||||
future cleanup. Anyway, moving a protocol to its own thread should be now as simple
|
|
||||||
as setting its locking level in its `config.Y` file and stopping all timers
|
|
||||||
before shutting down.
|
|
||||||
(See commits `4f3fa1623f66acd24c227cf0cc5a4af2f5133b6c`
|
|
||||||
and `3fd1f46184aa74d8ab7ed65c9ab6954f7e49d309`.)
|
|
||||||
|
|
||||||
## Cork
|
## Cork
|
||||||
|
|
||||||
|
Route propagation is involved in yet another problem which has to be addressed.
|
||||||
In the old versions with synchronous route propagation, all the buffering
|
In the old versions with synchronous route propagation, all the buffering
|
||||||
happened after exporting routes to BGP. When a packet arrived, all the work was
|
happened after exporting routes to BGP. When a packet arrived, all the work was
|
||||||
done in BGP receive hook – parsing, importing into a table, running all the
|
done in BGP receive hook – parsing, importing into a table, running all the
|
||||||
filters and possibly sending to the peers. No more routes until the previous
|
filters and possibly sending to the peers. No more routes until the previous
|
||||||
was done. This doesn't work any more.
|
was done. This self-regulating mechanism doesn't work any more.
|
||||||
|
|
||||||
Route table import now returns immediately after inserting the route into a
|
Route table import now returns immediately after inserting the route into a
|
||||||
table, creating a buffer there. These buffers have to be processed by other protocols'
|
table, creating a buffer there. These buffers have to be processed by other protocols'
|
||||||
export events, typically queued in the *global work queue* to be limited for lower latency.
|
export events. In large-scale configurations, one route import has to be
|
||||||
There is therefore no inherent limit for table export buffers which may lead
|
processed by hundreds, even thousands of exports. Unlimited imports are a major
|
||||||
(and occasionally leads) to memory bloating. This is even worse in configurations with pipes,
|
cause of buffer bloating. This is even worse in configurations with pipes,
|
||||||
as these multiply the exports by propagating them all the way down to other tables.
|
as these multiply the exports by propagating them all the way down to other
|
||||||
|
tables, eventually eating about twice the amount of memory than the single-threaded version.
|
||||||
|
|
||||||
There is therefore a cork to make this stop. Every table is checking how many
|
There is therefore a cork to make this stop. Every table is checking how many
|
||||||
exports it has pending, and when adding a new route, it may apply a cork,
|
exports it has pending, and when adding a new export to the queue, it may apply
|
||||||
saying simply "please stop the flow for a while". When the exports are then processed, it uncorks.
|
a cork, saying simply "please stop the flow for a while". When the exports are
|
||||||
|
then processed, it uncorks.
|
||||||
|
|
||||||
On the other side, there may be events and sockets with a cork assigned. When
|
On the other side, there may be events and sockets with a cork assigned. When
|
||||||
trying to enqueue an event and the cork is applied, the event is instead put
|
trying to enqueue an event and the cork is applied, the event is instead put
|
||||||
@@ -217,23 +202,20 @@ sockets, when `poll` arguments are recalculated, the corked socket is not
|
|||||||
checked for received packets, effectively keeping them in the TCP queue and
|
checked for received packets, effectively keeping them in the TCP queue and
|
||||||
slowing down the flow.
|
slowing down the flow.
|
||||||
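The cork can be pictured as a shared flag raised by tables and respected by sockets. The sketch below is a simplified Python model, not BIRD's code. `Cork`, `Table`, `Socket` and `sockets_to_poll` are hypothetical names, and both the export limit and the choice to uncork only once the queue is empty are assumptions made for the example. It covers only the socket side; corked events are left out.

```python
from collections import deque


class Cork:
    """Shared flag counting how many tables currently ask for a pause."""

    def __init__(self):
        self.count = 0

    def apply(self):
        self.count += 1

    def release(self):
        self.count -= 1

    @property
    def applied(self):
        return self.count > 0


class Table:
    def __init__(self, cork, limit):
        self.cork, self.limit = cork, limit  # limit is an assumed threshold
        self.pending = 0
        self.corked = False

    def add_export(self):
        self.pending += 1
        if self.pending >= self.limit and not self.corked:
            self.corked = True
            self.cork.apply()  # "please stop the flow for a while"

    def exports_processed(self, count):
        self.pending = max(0, self.pending - count)
        # Assumption: uncork only after the export queue has drained.
        if self.corked and self.pending == 0:
            self.corked = False
            self.cork.release()


class Socket:
    def __init__(self, name, cork=None):
        self.name, self.cork = name, cork
        self.rx = deque()  # stands in for the kernel TCP receive queue


def sockets_to_poll(sockets):
    # A corked socket is left out of the poll set, so its packets stay
    # queued and the sender is slowed down by TCP flow control.
    return [s for s in sockets if not (s.cork and s.cork.applied)]
```

A socket without a cork assigned is always polled, so in this model a control connection would keep working while route imports are paused.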
|
|
||||||
Both events and sockets have some delay before they get to the cork. This is
|
The cork implementation is quite crude and rough and fragile. It may get some
|
||||||
intentional; the purpose of cork is to slow down and allow for exports.
|
rework while stabilizing the multi-threaded version of BIRD or we may even
|
||||||
|
completely drop it for some better mechanism. One of these candidates is this
|
||||||
The cork implementation is probably due to some future changes after BGP gets
|
kind of API:
|
||||||
split out of the main loop, depending on how it is going to perform. I suppose
|
|
||||||
that the best way should be to implement a proper table API to allow for
|
|
||||||
explicit backpressure on both sides:
|
|
||||||
|
|
||||||
* (table to protocol) please do not import
|
* (table to protocol) please do not import
|
||||||
* (table to protocol) you may resume imports
|
* (table to protocol) you may resume imports
|
||||||
* (protocol to table) not processing any exports
|
* (protocol to table) not processing any exports
|
||||||
* (protocol to table) resuming export processing
|
* (protocol to table) resuming export processing
|
||||||
|
|
||||||
Anyway, for now it is good enough as it is.
|
Anyway, cork works as intended in most cases at least for now.
|
||||||
|
|
||||||
*It's still a long road to the version 2.1. This series of texts should document
|
*It's a long road to the version 2.1. This series of texts should document what
|
||||||
what is needed to be changed, why we do it and how. The
|
is changing, why we do it and how. The
|
||||||
[previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
|
[previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
|
||||||
showed how the route export had to change to allow parallel execution. In the next chapter, some memory management
|
shows how the route export had to change to allow parallel execution. In the next chapter, some memory management
|
||||||
details are to be explained together with the reasons why memory management matters. Stay tuned!*
|
details are to be explained together with the reasons why memory management matters. Stay tuned!*
|
||||||
|