Thread documentation: chapters 0, 1 and 2
parent 493d45d950, commit f459deee9f

doc/threads/00_filter_structure.png (new binary file, 300 KiB, not shown)

doc/threads/00_the_name_of_the_game.md (new file, 114 lines)

# BIRD Journey to Threads. Chapter 0: The Reason Why.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. Its concept of multiple routing
tables with pipes between them, as well as a procedural filtering language,
has been unique for a long time and is still one of the main reasons why
people use BIRD for big loads of routing data.

## IPv4 / IPv6 duality: Solved

The original design of BIRD also has some drawbacks. One of these was the idea
of two separate daemons – one BIRD for IPv4 and another BIRD for IPv6, built from the same
codebase, cleverly using `#ifdef IPV6` constructions to implement the
common parts of BIRD algorithms and data structures only once.
If IPv6 adoption had gone forward as people expected at that time,
this would have worked; after finishing the worldwide transition to IPv6, people could
just stop building BIRD for IPv4 and drop the `#ifdef`-ed code.

History went the other way, however. BIRD developers therefore decided to *integrate*
these two versions into one daemon capable of any address family, allowing
not only IPv6 but virtually anything. This rework brought quite a lot of
backward-incompatible changes, therefore we decided to release it as version 2.0.
This work was mostly finished in 2018 and as of March 2021, we have already
switched the 1.6.x branch to a bugfix-only mode.

## BIRD is single-threaded now

The second drawback is the single-threaded design. Looking back to 1998, this was
a good idea. A common PC had one single core and BIRD was targeting exactly
this segment. As the years went by, the manufacturers launched multicore x86 chips
(AMD Opteron in 2004, Intel Pentium D in 2005). This ultimately led to a world
where as of March 2021, there is virtually no new PC sold with a single-core CPU.

Alongside these changes, the speed of a single core has not been growing as fast
as the Internet is growing. BIRD is still capable of handling the full BGP table
(868k IPv4 routes in March 2021) with one core; however, when BIRD starts, it may take
long minutes to converge.

## Intermezzo: Filters

In 2018, we took some data we had from large internet exchanges and simulated
a cold start of BIRD as a route server. We used `linux-perf` to find the most time-critical
parts of BIRD and it pointed very clearly to the filtering code. It also showed that the
IPv4 version of BIRD v1.6.x is substantially faster than the *integrated* version, while
the IPv6 version was about as fast as the *integrated* one.

Let's look a little more closely at how the filters really work, using
an example of a simple filter:

```
filter foo {
  if net ~ [10.0.0.0/8+] then reject;
  preference = 2 * preference - 41;
  accept;
}
```

This filter gets translated to an infix internal structure.

![Example of filter internal representation](00_filter_structure.png)

When executing, the filter interpreter just walks the filter internal structure recursively in the
right order, executes the instructions, collects their results and finishes by
either rejecting or accepting the route.
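
As a rough illustration, a recursive walk over such an infix tree might look
like the following sketch (types and names are purely illustrative, not BIRD's
actual declarations):

```
/* Sketch of the old recursive evaluation over the infix structure. */
struct f_inst_sketch {
    int op;                             /* instruction type: '*', '-', constant, ... */
    long imm;                           /* immediate value for constants */
    struct f_inst_sketch *arg[2];       /* sub-expressions, NULL if unused */
};

static long interpret_sketch(const struct f_inst_sketch *i)
{
    /* every call sets up a large stack frame (the push/pop cost discussed below) */
    long a = i->arg[0] ? interpret_sketch(i->arg[0]) : i->imm;
    long b = i->arg[1] ? interpret_sketch(i->arg[1]) : 0;

    switch (i->op) {                    /* usually just a few CPU instructions */
    case '*': return a * b;
    case '-': return a - b;
    default:  return a;                 /* constants and other cases */
    }
}
```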

## Filter rework

Further analysis of the filter code revealed an absurd-looking result. The
most executed parts of the interpreter function were the `push` CPU
instructions at its very beginning and the `pop` CPU instructions at its very
end. This came from the fact that the interpreter function was quite long, yet
most of the filter instructions used an extremely short path: doing all the
stack manipulation at the beginning, branching by the filter instruction type,
executing just several CPU instructions, then popping everything back from the
stack and returning.

After some thought on how to minimize stack manipulation when all you need
is to take two numbers and multiply them, we decided to preprocess the filter
internal structure into another structure which is much easier to execute. The
interpreter now uses a data stack and behaves generally as a
postfix-ordered language. We also considered Lua, which turned out to require quite
a lot of glue code while achieving about the same performance.
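
A minimal sketch of what such a postfix-style evaluation looks like (flat
instruction array, explicit data stack; names and opcodes are illustrative
only):

```
/* Sketch of postfix evaluation over a preprocessed instruction array. */
struct fi_sketch { int op; long imm; };

static long run_filter_sketch(const struct fi_sketch *prog, int len)
{
    long stack[128];                    /* explicit data stack */
    int sp = 0;

    for (int pc = 0; pc < len; pc++)
        switch (prog[pc].op) {
        case 'c': stack[sp++] = prog[pc].imm; break;       /* push constant     */
        case '*': sp--; stack[sp-1] *= stack[sp]; break;   /* multiply top two  */
        case '-': sp--; stack[sp-1] -= stack[sp]; break;   /* subtract top two  */
        /* ... attribute access, prefix matching, accept/reject ... */
        }

    return stack[0];                    /* value of the whole expression */
}
```

Every instruction now runs as one iteration of a single flat loop, so there is
no per-instruction function call and therefore no repeated register push/pop.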

After these changes, we managed to reduce the filter execution time by 10–40%,
depending on how complex the filter is.
Even so, this reduction is too little when there is one CPU core
running for several minutes while the others are sleeping.

## We need more threads

As a side effect of the rework, the new filter interpreter is also completely
thread-safe. It seemed to be the way to go – running the filters in parallel
while keeping everything else single-threaded. The main problem of this
solution is the too fine granularity of parallel jobs. We would spend lots of
time on synchronization overhead.

Parallelizing only the filter execution was also too one-sided, useful only for
configurations with complex filters. In other cases, the major problem is best
route recalculation, OSPF recalculation or kernel synchronization.
It also turned out to be quite dirty from the code cleanliness point of view.

Therefore we chose to make BIRD completely multithreaded. We designed a way
to gradually enable parallel computation and the best usage of all available CPU
cores. Our goals are three:

* We want to keep current functionality. Parallel computation should never drop
  a useful feature.
* We want to take small steps. No big reworks, even though even the smallest
  possible step will need quite a lot of refactoring first.
* We want to be backwards compatible as much as possible.

*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. In the next chapter, we're
going to describe the structures for routes and their attributes. Stay tuned!*

doc/threads/01_the_route_and_its_attributes.md (new file, 159 lines)

# BIRD Journey to Threads. Chapter 1: The Route and its Attributes

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're making a significant number of
changes to BIRD's internal structure so that it can run in multiple
threads in parallel. This chapter covers the necessary changes to the data structures
which store all the routing data.

*If you want to see the changes in code, look (basically) into the
`route-storage-updates` branch. Not all of them are implemented yet, but
most of them are pretty much finished as of the end of March 2021.*

## How routes are stored

A BIRD routing table is just a hierarchical noSQL database. On the top level, the
routes are keyed by their destination, called *net*. Due to historic reasons,
the *net* is not only an *IPv4 prefix*, *IPv6 prefix*, *IPv4 VPN prefix* etc.,
but also an *MPLS label*, *ROA information* or *BGP Flowspec record*. As there may
be several routes for each *net*, an obligatory part of the key is the *src*, aka
*route source*. The route source is a tuple of the originating protocol
instance and a 32-bit unsigned integer. If a protocol wants to withdraw a route,
having the *net* and *src* is both necessary and sufficient to identify which route
is to be withdrawn.
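
As a rough sketch (simplified stand-in types only, not BIRD's real
declarations), the key described above can be pictured like this:

```
/* Sketch: the table key is the pair (net, src). */
struct proto;                        /* a protocol instance */

struct net_addr_sketch {
    int type;                        /* IPv4/IPv6 prefix, VPN prefix, MPLS, ROA, flowspec, ... */
    /* address data depending on the type */
};

struct rte_src_sketch {
    struct proto *proto;             /* originating protocol instance */
    unsigned int private_id;         /* 32-bit ID chosen by that protocol */
};

/* Knowing (net, src) is both necessary and sufficient to withdraw a route. */
```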

The route itself consists of (basically) a list of key-value records, with
value types ranging from a 16-bit unsigned integer for preference to a complex
BGP path structure. The keys are pre-defined by protocols (e.g. BGP path or
OSPF metrics), or by the BIRD core itself (preference, route gateway).
Finally, the user can declare their own attribute keys using the keyword
`attribute` in the config.

## Attribute list implementation

Currently, there are three layers of route attributes. We call them *route*
(*rte*), *attributes* (*rta*) and *extended attributes* (*ea*, *eattr*).

The first layer, *rte*, contains the *net* pointer, several fixed-size route
attributes (mostly preference and protocol-specific metrics), flags, lastmod
time and a pointer to *rta*.

The second layer, *rta*, contains the *src* (a pointer to a singleton instance),
a route gateway, several other fixed-size route attributes and a pointer to
the *ea* list.

The third layer, the *ea* list, is a variable-length list of key-value attributes,
containing all the remaining route attributes.

Distribution of the route attributes between the attribute layers is somewhat
arbitrary. Mostly, the first and second layers hold attributes that
were thought to be accessed frequently (e.g. in best route selection) and
filled in for most routes, while the third layer is for infrequently used
and/or infrequently accessed route attributes.
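
A condensed sketch of this three-layer layout (field names are illustrative,
not BIRD's exact declarations):

```
/* Sketch of the current three attribute layers. */
struct rte_src_sketch;               /* route source, see the sketch above */
struct network_sketch;               /* table node for one net */

struct eattr_sketch {                /* one extended attribute (key-value record) */
    unsigned int id;                 /* numerical key ID */
    /* typed value: integer, BGP path, ... */
};

struct ea_list_sketch {              /* layer 3: variable-length key-value list */
    struct ea_list_sketch *next;
    unsigned int count;
    struct eattr_sketch attrs[];
};

struct rta_sketch {                  /* layer 2: shared, hashed attribute block */
    struct rte_src_sketch *src;      /* singleton route source */
    /* gateway and other fixed-size attributes */
    struct ea_list_sketch *eattrs;   /* pointer to layer 3 */
};

struct rte_sketch {                  /* layer 1: the route record itself */
    struct network_sketch *net;      /* destination */
    struct rta_sketch *attrs;        /* pointer to layer 2 */
    /* preference, protocol-specific metrics, flags, lastmod time */
};
```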

## Attribute list deduplication

When protocols originate routes, there are commonly many routes with the
same attribute list. BIRD could ignore this fact; however, if you have several
tables connected with pipes, it is more memory-efficient to store the same
attribute lists only once.

Therefore, the two lower layers (*rta* and *ea*) are hashed and stored in a
BIRD-global database. Routes (*rte*) contain a pointer to the *rta* in this
database, and a use-count is maintained for each *rta*. Attributes (*rta*) contain
a pointer to a normalized (sorted by numerical key ID) *ea*.
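
The lookup into this global database follows the usual hash-and-compare
pattern with a use-count. A minimal sketch (names and layout are hypothetical;
only the idea matches):

```
/* Sketch of attribute deduplication: reuse a stored block if an equal one
 * exists, otherwise store the candidate. */
#define RTA_HASH_SIZE 1024

struct rta_cached_sketch {
    struct rta_cached_sketch *next_hash; /* hash chain inside the global database */
    unsigned int use_count;              /* how many rte's point at this block */
    /* src, gateway, normalized ea list, ... */
};

static struct rta_cached_sketch *rta_db[RTA_HASH_SIZE];

/* Hypothetical helpers: deep hash, deep compare, private copy. */
unsigned int rta_hash(const struct rta_cached_sketch *a);
int rta_same(const struct rta_cached_sketch *a, const struct rta_cached_sketch *b);
struct rta_cached_sketch *rta_copy(const struct rta_cached_sketch *a);

static struct rta_cached_sketch *rta_lookup_sketch(struct rta_cached_sketch *candidate)
{
    unsigned int h = rta_hash(candidate) % RTA_HASH_SIZE;

    for (struct rta_cached_sketch *r = rta_db[h]; r; r = r->next_hash)
        if (rta_same(r, candidate)) {
            r->use_count++;              /* one more rte references the stored copy */
            return r;
        }

    struct rta_cached_sketch *stored = rta_copy(candidate);
    stored->use_count = 1;
    stored->next_hash = rta_db[h];
    rta_db[h] = stored;
    return stored;
}
```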

## Attribute list rework

The first thing to change is the distribution of route attributes between the
attribute list layers. We decided to keep in the first layer (*rte*) only the key
and other per-record internal technical information. Therefore we move *src* to
*rte* and preference to *rta* (among other things). *This is already done.*

We also found out that the nexthop (gateway), originally one single IP address
and an interface, has evolved into a complex attribute with several sub-attributes;
not only considering multipath routing but also MPLS stacks and other per-route
attributes. This has led to an overly complex data structure holding the nexthop set.

We finally decided to squash *rta* and *ea* into one type of data structure,
allowing for completely dynamic route attribute lists. This is also supported
by the addition of other *net* types (BGP FlowSpec or ROA) where lots of the fields make
no sense at all, yet we still want to use the same data structures and implementation
as we don't like duplicating code. *Multithreading doesn't depend on this change,
but this change is going to happen soon anyway.*

## Route storage

The process of route import from a protocol into a table can be divided into several phases:

1. (In protocol code.) Create the route itself (typically from
   protocol-internal data) and choose the right channel to use.
2. (In protocol code.) Create the *rta* and *ea* and obtain an appropriate
   hashed pointer. Allocate the *rte* structure and fill it in.
3. (Optionally.) Store the route to the *import table*.
4. Run filters. If rejected, free everything.
5. Check whether this is a real change (it may be idempotent). If not, free everything and do nothing more.
6. Run the best route selection algorithm.
7. Execute exports if needed.

We found out that the *rte* structure allocation is done too early. BIRD uses
global optimized allocators for fixed-size blocks (which *rte* is) to reduce
its memory footprint, therefore the allocation of the *rte* structure would be a
synchronization point in a multithreaded environment.

The common code is also much more complicated when it has to track whether the
current *rte* has to be freed or not. This is more of a problem in export than in
import as the export filter can also change the route (and therefore allocate
another *rte*). The changed route must therefore be freed after use. All the
route-changing code must also track whether this route is writable or
read-only.

We therefore introduce a variant of *rte* called *rte_storage*. Both of these
hold the same layer-1 route information (destination, author, cached
attribute pointer, flags etc.), however *rte* is always local and *rte_storage*
is intended to be put in global data structures.

This change allows us to remove lots of the code which only tracks whether any
*rte* is to be freed, as *rte*'s are almost always allocated on-stack, naturally
limiting their lifetime. If not on-stack, it's the responsibility of the owner
to free the *rte* after the import is done.

This change also removes the need for *rte* allocation in protocol code, and
*rta* can also be safely allocated on-stack. As a result, protocols can simply
allocate all the data on stack, call the update routine and the common code in
BIRD's *nest* does all the storage for them.
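
A sketch of what a protocol's originate-and-announce code can now look like
(all names here, including `rte_update_sketch`, are placeholders standing in
for the real protocol/nest interfaces):

```
/* Minimal stand-in declarations so the sketch is self-contained. */
struct channel;
struct rte_src_sketch;
typedef struct net_addr_sketch net_addr_sketch;

struct rta_sketch { int dest; /* gateway, other fixed-size attributes ... */ };
struct rte_sketch { struct rte_src_sketch *src; struct rta_sketch *attrs; };

/* Placeholder for the nest's update entry point. */
void rte_update_sketch(struct channel *c, const net_addr_sketch *n, struct rte_sketch *e);

/* The protocol builds everything on-stack; the nest copies whatever has to
 * survive the call into rte_storage and the cached attribute database. */
static void announce_route_sketch(struct channel *c, const net_addr_sketch *n,
                                  struct rte_src_sketch *src)
{
    struct rta_sketch a = { 0 };            /* fill gateway and other attributes */
    struct rte_sketch e = { .src = src, .attrs = &a };

    rte_update_sketch(c, n, &e);            /* nothing allocated by the protocol survives */
}
```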

Allocating *rta* on-stack is, however, not required. BGP and OSPF use this to
import several routes with the same attribute list. In BGP, this is due to the
format of BGP update messages, containing first the attributes and then the
destinations (BGP NLRIs). In OSPF, in addition to *rta* deduplication, it is
also presumed that no import filter (or at most some trivial changes) is applied,
as OSPF would typically not work well when filtered.

*This change is already done.*

## Route cleanup and table maintenance

In some cases, the route update is not originated by protocol/channel code.
When a channel shuts down, all routes originated by that channel are simply
cleaned up. Also, routes with recursive nexthops may get changed without an import,
simply by a change of the underlying IGP route.

This is currently done by a `rt_event` (see `nest/rt-table.c` for the source code)
which is to be converted to a parallel thread, running when nobody is importing any
route. *This change is freshly done in the branch `guernsey`.*

## Parallel protocol execution

The long-term goal of these reworks is to allow for completely independent
execution of all the protocols. Typically, there is no direct interaction
between protocols; everything is done through BIRD's *nest*. Protocols should
therefore run in parallel in the future and wait/lock only when something needs
to be done externally.

We also aim for a clean and documented protocol API.

*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. In the next chapter, we're
going to describe how the route is exported from the table to protocols and how this
process is changing. Stay tuned!*

doc/threads/02_asynchronous_export.md (new file, 437 lines)

# BIRD Journey to Threads. Chapter 2: Asynchronous route export

Route export is a core algorithm of BIRD. This chapter covers how we are making
this procedure multithreaded. The desired outcomes are mostly lower latency of
route import, flap dampening and also faster route processing in large
configurations with lots of exports from one table.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're doing a significant amount of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel.

## How routes are propagated through BIRD

In the [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/), you could learn how the route import works. We should
now extend that process by the route export.

1. (In protocol code.) Create the route itself and propagate it through the
   right channel by calling `rte_update`.
2. The channel runs its import filter.
3. The new best route is selected.
4. For each channel:
   1. The channel runs its preexport hook and export filter.
   2. (Optionally.) The channel merges the nexthops to create an ECMP route.
   3. The channel calls the protocol's `rt_notify` hook.
5. After all exports are finished, the `rte_update` call finally returns and
   the source protocol may do anything else.

Let's imagine that all the protocols are running in parallel. There are two
protocols with a route prepared to import. One of those wins the table lock,
does the import and then the export touches the other protocol, which must
either:

* store the route export until it finishes its own imports, or
* have independent import and export parts.

Both of these options are infeasible for common use. Implementing them would
make protocols much more complicated, with lots of new code to test and release
at once, and also quite a lot of corner cases. The risk of deadlocks is also worth
mentioning.

## Asynchronous route export

We decided to make it easier for protocols and decouple the import and export
this way:

1. The import is done.
2. The best route is selected.
3. The resulting changes are stored.

Then, after the importing protocol returns, the exports are processed for each
exporting channel. In the future, this will be possible in parallel: some protocols
may process the export directly after it is stored, other protocols will wait
until they finish another job.

This eliminates the risk of deadlocks and all protocols' `rt_notify` hooks can
rely on their independence. There is only one question: how to store the changes?

## Route export modes

To find a good data structure for route export storage, we should first know the
readers. The exporters may request different modes of route export.

### Export everything

This is the simplest route export mode. The exporter wants to know about all
the routes as they're changing. We therefore simply store the old route until
the change is fully exported and then we free the old stored route.

To manage this, we can simply queue the changes one after another and postpone
the old route cleanup until all channels have exported the change. The queue member
would look like this:

```
struct {
  struct rte_storage *new;
  struct rte_storage *old;
};
```

### Export best

This is another simple route export mode. We check whether the best route has
changed; if not, no export happens. Otherwise, the export is propagated as the
old best route changing to the new best route.

To manage this, we could use the queue from the previous point by adding new
best and old best pointers. It is guaranteed that both the old best and new
best pointers are always valid at the time of export, as any change to them
must be stored in a later pending export which has not been exported yet by this
channel and therefore nothing has been freed yet.

```
struct {
  struct rte_storage *new;
  struct rte_storage *new_best;
  struct rte_storage *old;
  struct rte_storage *old_best;
};
```

However, we're getting to the complicated export modes where this simple
structure is simply not enough.

### Export merged

Here we're getting into problems. The exporting channel requests not
only the best route but also all routes that are good enough to be considered
ECMP-eligible (we call these routes *mergable*). The export is then just one
route with the nexthops merged. Export filters are executed before
merging and if the best route is rejected, nothing is exported at all.

To achieve this, we have to re-evaluate export filters any time the best route
or any mergable route changes. Until now, the export could just do what it wanted
as there was only one thread working. To change this, we need to access the
whole route list and process it.

### Export first accepted

In this mode, the channel runs export filters on a sorted list of routes, best first.
If the best route gets rejected, it asks for the next one until it finds an
acceptable route or exhausts the list. This export mode requires a sorted table.
BIRD users will know this export mode as `secondary` in BGP.

For now, BIRD stores two bits per route for each channel. The *export bit* is set
if the route has really been exported to that channel. The *reject bit* is set
if the route was rejected by the export filter.

When processing a route change for first accepted, the algorithm first checks the
export bit of the old route. If this bit is set, the old route is the one currently
exported, so we have to find the right route to export next. Therefore the sorted route
list is walked best to worst to find a new route to export, using the reject
bit to evaluate only routes which weren't rejected in previous runs of this
algorithm.

If the old route's export bit is not set, the algorithm walks the sorted route list best
to worst, checking the position of the new route with respect to the exported route.
If the new route is worse, nothing happens; otherwise the new route is sent to
the filters and finally exported if it passes.

### Export by feed

To resolve the problems arising from the previous two export modes (merged and first accepted),
we introduce a way to process a whole route list without locking the table
while export filters are running. To achieve this, we follow this algorithm:

1. The exporting channel sees a pending export.
2. *The table is locked.*
3. The channel counts all the routes for the given destination.
4. The channel stores pointers to those routes in a local array.
5. *The table is unlocked.*
6. The channel processes the local array of route pointers.
7. The pending export is marked as processed by this channel.

After unlocking the table, the pointed-to routes are implicitly guarded by the
sole fact that the pending export has not yet been processed by all channels
and the cleanup routine never frees any resource related to a pending export.
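
The same algorithm in a condensed sketch (every name here is a hypothetical
stand-in for the corresponding BIRD internals; the locking pattern is the
point):

```
/* Sketch of one export-by-feed pass. */
struct rtable_sketch; struct rte_storage; struct rt_pending_export;
struct channel_sketch { struct rtable_sketch *table; /* ... */ };

void rt_lock_table(struct rtable_sketch *);
void rt_unlock_table(struct rtable_sketch *);
unsigned net_route_count(struct rtable_sketch *, const struct rt_pending_export *);
void net_copy_route_pointers(struct rtable_sketch *, const struct rt_pending_export *,
                             struct rte_storage **feed, unsigned n);
void channel_process_feed(struct channel_sketch *, struct rte_storage **feed, unsigned n);
void rpe_mark_processed(struct channel_sketch *, struct rt_pending_export *);

static void export_by_feed_sketch(struct channel_sketch *c, struct rt_pending_export *rpe)
{
    rt_lock_table(c->table);                              /* step 2: lock           */
    unsigned n = net_route_count(c->table, rpe);          /* step 3: count routes   */
    struct rte_storage *feed[n];                          /* step 4: local pointers */
    net_copy_route_pointers(c->table, rpe, feed, n);
    rt_unlock_table(c->table);                            /* step 5: unlock         */

    channel_process_feed(c, feed, n);                     /* step 6: filters, rt_notify */
    rpe_mark_processed(c, rpe);                           /* step 7: mark as done   */
}
```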

## Pending export data structure

As the two complicated export modes use the export-by-feed algorithm, the
pending export data structure may be quite minimalistic.

```
struct rt_pending_export {
  struct rt_pending_export * _Atomic next;	/* Next export for the same destination */
  struct rte_storage *new;			/* New route */
  struct rte_storage *new_best;			/* New best route in unsorted table */
  struct rte_storage *old;			/* Old route */
  struct rte_storage *old_best;			/* Old best route in unsorted table */
  _Atomic u64 seq;				/* Sequential ID (table-local) of the pending export */
};
```

To allow for squashing outdated pending exports (e.g. for flap dampening
purposes), there is a `next` pointer to the next export for the same
destination. This is also needed for the export-by-feed algorithm: if there are
several exports for one net at once, all of them are processed by one
export-by-feed run and automatically marked as done.

We should also add several items into `struct channel`.

```
struct coroutine *export_coro;		/* Exporter and feeder coroutine */
struct bsem *export_sem;		/* Exporter and feeder semaphore */
struct rt_pending_export * _Atomic last_export;	/* Last export processed */
struct bmap export_seen_map;		/* Keeps track which exports were already processed */
u64 flush_seq;				/* Table export seq when the channel announced flushing */
```

To run the exports in parallel, `export_coro` is run and `export_sem` is
used for signalling new exports to it. The exporter coroutine also marks all
seen sequential IDs in its `export_seen_map` to make it possible to skip over
them if seen again.
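
The body of the exporter coroutine then boils down to a simple loop. In a
sketch (all helper names are hypothetical stand-ins; only the overall shape
follows the description above):

```
/* Sketch of the exporter coroutine loop. */
struct channel_sketch; struct rt_pending_export; struct bsem_sketch; struct bmap_sketch;

void bsem_wait(struct bsem_sketch *);                     /* block until pinged */
struct rt_pending_export *channel_next_export(struct channel_sketch *);
unsigned long rpe_seq(const struct rt_pending_export *);
int  seen_test(struct bmap_sketch *, unsigned long);
void seen_set(struct bmap_sketch *, unsigned long);
void channel_process_export(struct channel_sketch *, struct rt_pending_export *);
void channel_set_last_export(struct channel_sketch *, struct rt_pending_export *);

static void export_coro_sketch(struct channel_sketch *c, struct bsem_sketch *sem,
                               struct bmap_sketch *seen)
{
    for (;;) {
        bsem_wait(sem);                          /* wait for a ping from the table */

        struct rt_pending_export *rpe;
        while ((rpe = channel_next_export(c))) { /* last_export->next, or the table's first_export */
            if (!seen_test(seen, rpe_seq(rpe))) {
                channel_process_export(c, rpe);  /* run the configured export mode */
                seen_set(seen, rpe_seq(rpe));    /* remember this sequential ID */
            }
            channel_set_last_export(c, rpe);     /* release point for table cleanup */
        }
    }
}
```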

There is also a table cleaner routine
(see the [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/))
which must also clean up the pending exports after all the channels are finished with them.
To signal that, there is `last_export`, working as a release point: the channel
guarantees that it doesn't touch the pointed-to pending export, nor any data
from it.

The last tricky point is channel flushing. When any channel stops, all its
routes are automatically freed and withdrawals are exported if appropriate.
Until now, the routes could be flushed synchronously; now the flush has
several phases:

1. Flush started (here the channel stores the `seq` of the last current pending export).
2. All routes unlinked yet not freed, withdrawals pending.
3. Pending withdrawals processed and cleaned up. The channel may safely stop and free its structures.

Finally, some additional information has to be stored in tables:

```
_Atomic byte export_used;			/* Export journal cleanup scheduled */
struct rt_pending_export * _Atomic first_export;	/* First export to announce */
byte export_scheduled;				/* Export is scheduled */
list pending_exports;				/* List of packed struct rt_pending_export */
struct fib export_fib;				/* Auxiliary fib for storing pending exports */
u64 next_export_seq;				/* The next export will have this ID */
```

The exports are:

1. Assigned the `next_export_seq` sequential ID, incrementing this item by one.
2. Put into `pending_exports` and `export_fib` for both sequential and by-destination access.
3. Signalled by setting `export_scheduled`, and also `first_export` if it is not set yet.

After processing several exports, `export_used` is set and the route table maintenance
coroutine is woken up to possibly do cleanup.

The `struct rt_pending_export` seems to be best allocated by requesting a whole
memory page, containing a common list node, a simple header and all the
structures packed in the rest of the page. This may save a significant amount of memory.
In case of congestion, there will be lots of exports and every spare kilobyte
counts. If BIRD is almost idle, the optimization has no effect on the overall performance.

## Export algorithm

As we have explained at the beginning, the current export algorithm is
table-driven. The table walks the channel list and propagates the update.
The new export algorithm is channel-driven. The table just indicates that it
has something new in its export queue and the channel decides what to do with that and when.

### Pushing an export

When a table has something to export, it enqueues an appropriate instance of
`struct rt_pending_export`. Then it pings its maintenance coroutine
(`rt_event`) to notify the exporting channels about a new route. This coroutine
runs with only the table locked, so the protocol may e.g. prepare the next route in between.
The maintenance coroutine, when it wakes up, walks the list of channels and
wakes their export coroutines.

These two levels of asynchronicity are here for two reasons:

1. There may be lots of channels (hundreds of them) and we don't want to
   inefficiently iterate the list for each export.
2. The notification is going to wait until the route author finishes, hopefully
   importing more routes and therefore allowing more exports to be processed at once.

The table also stores the first and last exports for every destination. These
come in handy mostly for cleanup purposes and for channel feeding.
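
The push side, condensed into a sketch (names are hypothetical stand-ins;
the `pending_exports` and `export_fib` handling is hidden behind a helper):

```
#include <stdatomic.h>

/* Sketch of the table-side push: assign a sequence number, queue the export,
 * publish it and wake the maintenance coroutine. */
struct rt_pending_export_sketch {
    struct rt_pending_export_sketch * _Atomic next;
    unsigned long long seq;
    /* new, old, new_best, old_best */
};

struct rtable_sketch {
    unsigned long long next_export_seq;                       /* the next export gets this ID */
    struct rt_pending_export_sketch * _Atomic first_export;   /* first export to announce */
    /* pending_exports list, export_fib, export_scheduled, ... */
};

/* Hypothetical helpers. */
void rpe_enqueue(struct rtable_sketch *, struct rt_pending_export_sketch *);
void rt_ping_maintenance(struct rtable_sketch *);

static void rt_push_export_sketch(struct rtable_sketch *t,
                                  struct rt_pending_export_sketch *rpe)
{
    rpe->seq = t->next_export_seq++;          /* 1. sequential, table-local ID */
    rpe_enqueue(t, rpe);                      /* 2. sequential + by-destination queues */

    struct rt_pending_export_sketch *none = NULL;
    atomic_compare_exchange_strong(&t->first_export, &none, rpe);  /* 3. publish if none yet */

    rt_ping_maintenance(t);                   /* wake the maintenance coroutine (rt_event) */
}
```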

### Processing an export

After these two pings, the channel finally knows that there is an export pending.
The channel checks whether it has a `last_export` stored. If yes, it proceeds
with the next one, otherwise it takes `first_export` from the table (which is
atomic for this reason).

1. The channel checks its `export_seen_map` to see whether this export has
   already been processed. If so, it skips to the next export. No action is needed.
2. The export chain is scanned for the current first and last export. This is
   done by following the `next` pointer in the exports, as accessing the
   auxiliary export fib needs locking.
3. In the best-only and all-routes export modes, the changes are processed
   without further table locking.
4. If export-by-feed is used, the current state of the routes in the table is fetched.
   Then the table is unlocked and the changes are processed without further locking.
5. All processed exports are marked as seen.
6. The channel stores the `last_export` and returns to the beginning to wait for the next export.

## The full route life-cycle

Until now, we've always assumed that the channels *just exist*. In real life,
any channel may go up or down and we must handle it, flushing the routes
appropriately and freeing all the memory just in time to avoid both
use-after-free and memory leaks. It should be noted here explicitly that BIRD
is written in C, which has no garbage collector or similar modern features.

### Protocols and channels as viewed from a route

BIRD consists effectively of protocols and tables. **Protocols** are the active parts,
kind-of subprocesses manipulating routes and other data. **Tables** are passive,
serving as a database of routes. To connect a protocol to a table, a
**channel** is created.

Every route has its `sender` storing the channel which has put the route into
the current table. This comes in handy when the channel goes down, to know which
routes to flush.

Every route also has its `src`, a route source allocated by the protocol which
originated it first. This is kept when a route is passed through a *pipe*.

Both `src` and `sender` must point to active protocols and channels, as inactive
protocols and channels may be deleted at any time.

### Protocol and channel lifecycle

In the beginning, all channels and protocols are down. Until they fully start,
no route from them is allowed into any table. When the protocol and channel are up,
they may originate routes freely. However, the transitions are worth mentioning.

### Channel startup and feed

When protocols and channels start, they need to get the current state of the
appropriate table. Therefore, after a protocol and channel start, the
export-feed coroutine is also initiated.

Tables can contain millions of routes. It could lead to long import latency if a channel
fed itself in one step. The table structure is (at least for now) too
complicated to be implemented as lockless, thus even read access needs locking.
To mitigate this, the feeds are split to allow for regular route propagation
with a reasonable latency.

When the exports were synchronous, we simply didn't care and just announced the
exports to the channels from the time they started feeding. When making exports
asynchronous, it was crucial to avoid most of the possible race conditions
which could arise from a simultaneous feed and export. As the feeder routines had
to be rewritten, it was a good opportunity to make this precise.

Therefore, when a channel goes up, it also starts exports:

1. Start the feed-export coroutine.
2. *Lock the table.*
3. Store the last export in the queue.
4. Read a limited number of routes into local memory together with their pending exports.
5. If there are some routes to process:
   1. *Unlock the table.*
   2. Process the loaded routes.
   3. Set the appropriate pending exports as seen.
   4. *Lock the table.*
   5. Go to 4. to continue feeding.
6. If there was a last export stored, load the next one to be processed. Otherwise take the table's `first_export`.
7. *Unlock the table.*
8. Run the exporter loop.

*Note: There are some nuances not mentioned here about how to do things in the right
order to avoid missing some events while changing state. A precise description
would be too detailed for this text.*

When the feeder loop finishes, it continues smoothly to process all the exports
that have been queued while the feed was running. Step 5.3 ensures that already
seen exports are skipped, steps 3 and 6 ensure that no export is missed.

### Channel flush

Protocols and channels need to stop for a handful of reasons. All of these
cases follow the same routine.

1. (Maybe.) The protocol requests to go down or restart.
2. The channel requests to go down or restart.
3. The channel requests to stop the export.
4. In the feed-export coroutine:
   1. At a designated cancellation point, check for cancellation.
   2. Clean up local data.
   3. *Lock the main BIRD context.*
   4. If a shutdown is requested, switch the channel to the *flushing* state and request table maintenance.
5. In the table maintenance coroutine:
   1. Walk across all channels and check them for the *flushing* state.
   2. Walk across the table (split to allow for low-latency updates) and
      generate a withdrawal for each route sent by the flushing channels.
   3. Wait until all the withdrawals are processed.
   4. Mark the flushing channels as *down* and eventually proceed to the protocol shutdown or restart.

There is also a separate routine that handles bulk cleanup of `src`'s which
contain a pointer to the originating protocol. This routine may get reworked in
the future, yet for now it is good enough.

### Route export cleanup

Last but not least is the export cleanup routine. Until now, the withdrawn
routes were exported synchronously and freed directly after the import was
done. This is not possible anymore. The export is stored and the import returns
to let the importing protocol continue its work. We therefore need a routine to
clean up the withdrawn routes and also the processed exports.

First of all, this routine refuses to clean up when any export is feeding or
shutting down. This may change in the future; for now we aren't sure about
possible race conditions.

When all the exports are in a steady state, the routine works as follows:

1. Walk the active exports and find the minimum (oldest export) among their `last_export` values.
2. If there is nothing to clear between the actual oldest export and the channels' oldest export, do nothing.
3. Find the table's new `first_export` and set it. Now there is nobody pointing to the old exports.
4. Free the withdrawn routes.
5. Free the old exports, removing them also from the first-last list of exports for the same destination.
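
In a condensed sketch (hypothetical names; the exact boundary handling around
the channels' `last_export` values is left out):

```
#include <stdatomic.h>

/* Sketch of the cleanup walk: determine the oldest export any channel may
 * still reference, then drop everything strictly older. */
struct rt_pending_export_sketch;
struct channel_sketch {
    struct channel_sketch *next;
    struct rt_pending_export_sketch * _Atomic last_export;
};
struct rtable_sketch {
    struct channel_sketch *channels;
    struct rt_pending_export_sketch * _Atomic first_export;
};

/* Hypothetical helpers. */
int channel_busy(const struct channel_sketch *);      /* feeding or shutting down */
unsigned long long rpe_seq(const struct rt_pending_export_sketch *);
struct rt_pending_export_sketch *rpe_next_seq(const struct rt_pending_export_sketch *);
void rpe_free(struct rtable_sketch *, struct rt_pending_export_sketch *);  /* also frees withdrawn routes */

static void rt_export_cleanup_sketch(struct rtable_sketch *t)
{
    unsigned long long min_seq = ~0ULL;

    for (struct channel_sketch *c = t->channels; c; c = c->next) {
        if (channel_busy(c))
            return;                                   /* refuse to clean up for now */
        struct rt_pending_export_sketch *last = atomic_load(&c->last_export);
        if (!last)
            return;                                   /* this channel has processed nothing yet */
        if (rpe_seq(last) < min_seq)
            min_seq = rpe_seq(last);                  /* oldest still-referenced export */
    }

    struct rt_pending_export_sketch *rpe;
    while ((rpe = atomic_load(&t->first_export)) && rpe_seq(rpe) < min_seq) {
        atomic_store(&t->first_export, rpe_next_seq(rpe));   /* advance the release point */
        rpe_free(t, rpe);                             /* free old routes and the export */
    }
}
```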

## Results of these changes

This is the first major step forward. Using just this version may be
still as slow as the single-threaded version, at least if your export filters are trivial.
The main purpose of this step is not an immediate speedup, though. It is more
of a base for the next steps:

* Unlocking of pipes should enable parallel execution of all the filters on
  pipes, limited solely by the principle *one thread for every direction of a
  pipe*.
* Conversion of the CLI's `show route` to the new feed-export coroutines should
  enable faster table queries. Moreover, this approach will allow for
  better splitting of model and view in the CLI, with a good opportunity to
  implement more output formats, e.g. JSON.
* Unlocking of kernel route synchronization should fix latency issues induced
  by long-lasting kernel queries.
* Partial unlocking of BGP packet processing should finally allow for parallel
  execution in almost all phases of BGP route propagation.

The development is now being done mostly in the branch `alderney`. If you're asking
why such strange branch names as `jersey`, `guernsey` and `alderney`, here is
a kind-of reason. Yes, these branches could have been named `mq-async-export`,
`mq-async-export-new`, `mq-async-export-new-new`, `mq-another-async-export` and
so on. That's just ugly, isn't it?

Also, why so many branches? The development process is quite messy. BIRD's code
heavily depends on the single-threaded approach. This is exceptionally good for
performance, as long as you have one thread only. On the other hand, lots of
these assumptions are not documented, so in many cases one desired
change yields a chain of other unforeseen changes which must come first. This brings
lots of backtracking, branch rebasing and other Git magic. There is always a
can of worms somewhere in the code.

*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. The
[previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/)
showed the necessary changes in route storage. In the next chapter, we're going
to describe how the coroutines are implemented and what kind of locking system
we are employing to prevent deadlocks. Stay tuned!*

doc/threads/Makefile (new file, 15 lines)

```
SUFFICES := .pdf -wordpress.html
CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export

all: $(foreach ch,$(CHAPTERS),$(addprefix $(ch),$(SUFFICES)))

00_the_name_of_the_game.pdf: 00_filter_structure.png

%.pdf: %.md
	pandoc -f markdown -t latex -o $@ $<

%.html: %.md
	pandoc -f markdown -t html5 -o $@ $<

%-wordpress.html: %.html Makefile
	sed -r 's#</p>#\n#g; s#<p>##g; s#<(/?)code>#<\1tt>#g; s#<pre><tt>#<code>#g; s#</tt></pre>#</code>#g; s#</?figure>##g; s#<figcaption>#<p style="text-align: center">#; s#</figcaption>#</p>#; ' $< > $@
```