mirror of
https://gitlab.nic.cz/labs/bird.git
synced 2024-12-31 22:21:54 +00:00
Merge tag '3.0-alpha0' into HEAD
3.0-alpha0
commit
d975827f5f
16
NEWS
@ -84,6 +84,22 @@ Version 2.0.9 (2022-02-09)
|
|||||||
filtering.
|
filtering.
|
||||||
|
|
||||||
|
|
||||||
|
Version 3.0-alpha0 (2022-02-07)
|
||||||
|
o Removal of fixed protocol-specific route attributes
|
||||||
|
o Asynchronous route export
|
||||||
|
o Explicit table import / export hooks
|
||||||
|
o Partially lockless route attribute cache
|
||||||
|
o Thread-safe resource management
|
||||||
|
o Thread-safe interface notifications
|
||||||
|
o Thread-safe protocol API
|
||||||
|
o Adoption of BFD IO loop for general use
|
||||||
|
o Parallel Pipe protocol
|
||||||
|
o Parallel RPKI protocol
|
||||||
|
o Parallel BGP protocol
|
||||||
|
o Lots of refactoring
|
||||||
|
o Bugfixes and improvements as they came along
|
||||||
|
|
||||||
|
|
||||||
Version 2.0.8 (2021-03-18)
|
Version 2.0.8 (2021-03-18)
|
||||||
o Automatic channel reloads based on RPKI changes
|
o Automatic channel reloads based on RPKI changes
|
||||||
o Multiple static routes with the same network
|
o Multiple static routes with the same network
|
||||||
|
@ -1,7 +1,7 @@
|
|||||||
<!doctype birddoc system>
|
<!doctype birddoc system>
|
||||||
|
|
||||||
<!--
|
<!--
|
||||||
BIRD 2.0 documentation
|
BIRD 3.0 documentation
|
||||||
|
|
||||||
This documentation can have 4 forms: sgml (this is master copy), html, ASCII
|
This documentation can have 4 forms: sgml (this is master copy), html, ASCII
|
||||||
text and dvi/postscript (generated from sgml using sgmltools). You should always
|
text and dvi/postscript (generated from sgml using sgmltools). You should always
|
||||||
@ -20,7 +20,7 @@ configuration - something in config which is not keyword.
|
|||||||
|
|
||||||
<book>
|
<book>
|
||||||
|
|
||||||
<title>BIRD 2.0 User's Guide
|
<title>BIRD 3.0 User's Guide
|
||||||
<author>
|
<author>
|
||||||
Ondrej Filip <it/<feela@network.cz>/,
|
Ondrej Filip <it/<feela@network.cz>/,
|
||||||
Martin Mares <it/<mj@ucw.cz>/,
|
Martin Mares <it/<mj@ucw.cz>/,
|
||||||
|
2
doc/threads/.gitignore
vendored
Normal file
@ -0,0 +1,2 @@
|
|||||||
|
*.html
|
||||||
|
*.pdf
|
BIN
doc/threads/00_filter_structure.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 300 KiB |
114
doc/threads/00_the_name_of_the_game.md
Normal file
@ -0,0 +1,114 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 0: The Reason Why.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. Its concept of multiple routing
tables with pipes between them, as well as a procedural filtering language,
has been unique for a long time and is still one of the main reasons why people use
BIRD for big loads of routing data.
|
||||||
|
|
||||||
|
## IPv4 / IPv6 duality: Solved
|
||||||
|
|
||||||
|
The original design of BIRD also has some drawbacks. One of these was the idea
of two separate daemons: one BIRD for IPv4 and another BIRD for IPv6, built from the same
codebase, cleverly using `#ifdef IPV6` constructions to implement the
common parts of BIRD algorithms and data structures only once.
If IPv6 adoption had gone forward as people expected at that time,
it would have worked; after finishing the worldwide transition to IPv6, people could
have just stopped building BIRD for IPv4 and dropped the `#ifdef`-ed code.
|
||||||
|
|
||||||
|
History went the other way, however. BIRD developers therefore decided to *integrate*
these two versions into one daemon capable of any address family, allowing for
not only IPv6 but virtually anything. This rework brought quite a lot of
backward-incompatible changes, therefore we decided to release it as version 2.0.
This work was mostly finished in 2018 and as of March 2021, we have already
switched the 1.6.x branch to a bugfix-only mode.
|
||||||
|
|
||||||
|
## BIRD is single-threaded now
|
||||||
|
|
||||||
|
The second drawback is a single-threaded design. Looking back to 1998, this was
|
||||||
|
a good idea. A common PC had one single core and BIRD was targeting exactly
|
||||||
|
this segment. As the years went by, the manufacturers launched multicore x86 chips
|
||||||
|
(AMD Opteron in 2004, Intel Pentium D in 2005). This ultimately led to a world
|
||||||
|
where as of March 2021, there is virtually no new PC sold with a single-core CPU.
|
||||||
|
|
||||||
|
At the same time, the speed of a single core has not been growing as fast
as the Internet has. BIRD is still capable of handling the full BGP table
(868k IPv4 routes in March 2021) with one core; however, when BIRD starts, it may take
several minutes to converge.
|
||||||
|
|
||||||
|
## Intermezzo: Filters
|
||||||
|
|
||||||
|
In 2018, we took some data we had from large internet exchanges and simulated
a cold start of BIRD as a route server. We used `linux-perf` to find the most time-critical
parts of BIRD and it pointed very clearly to the filtering code. It also showed that the
IPv4 version of BIRD v1.6.x was substantially faster than the *integrated* version, while
the IPv6 version was about as fast as the *integrated* one.
|
||||||
|
|
||||||
|
Here we should explain a little more about how the filters really work. Let's use
an example of a simple filter:
|
||||||
|
|
||||||
|
```
filter foo {
  if net ~ [10.0.0.0/8+] then reject;
  preference = 2 * preference - 41;
  accept;
}
```
|
||||||
|
|
||||||
|
This filter gets translated to an infix internal structure.
|
||||||
|
|
||||||
|
![Example of filter internal representation](00_filter_structure.png)
|
||||||
|
|
||||||
|
When executing, the filter interpreter just walks the filter internal structure recursively in the
right order, executes the instructions, collects their results and finishes by
either rejecting or accepting the route.
|
||||||
|
|
||||||
|
## Filter rework
|
||||||
|
|
||||||
|
Further analysis of the filter code revealed an absurd-looking result. The
most executed parts of the interpreter function were the `push` CPU
instructions at its very beginning and the `pop` CPU instructions at its very
end. This came from the fact that the interpreter function was quite long, yet
most of the filter instructions used an extremely short path: the function did all the
stack manipulation at the beginning, branched by the filter instruction type,
executed just several CPU instructions, popped everything from the
stack back and returned.
|
||||||
|
|
||||||
|
After some thought about how to minimize stack manipulation when all you need
is to take two numbers and multiply them, we decided to preprocess the filter
internal structure to another structure which is much easier to execute. The
interpreter now uses a data stack and behaves generally as a
postfix-ordered language. We also considered Lua, which turned out to need quite
a lot of work implementing all the glue while achieving about the same performance.
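
To illustrate the idea of a postfix-ordered interpreter with a data stack, here is a minimal sketch that evaluates the `2 * preference - 41` expression from the example filter in a single loop without recursion. The instruction set, the names (`f_op`, `f_inst`, `f_run`, `f_example`) and the fixed stack depth are assumptions made for this illustration only; BIRD's actual filter instructions look different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction set; BIRD's real filter instructions differ. */
enum f_op { F_CONST, F_LOAD_PREF, F_MUL, F_SUB };

struct f_inst {
  enum f_op op;
  int value;                    /* Used by F_CONST only */
};

#define F_STACK_MAX 16

/* Execute a postfix-ordered program in one loop.
 * Operands are pushed to the data stack; operators pop two values and push one. */
static int
f_run(const struct f_inst *prog, size_t len, int preference)
{
  int stack[F_STACK_MAX];
  size_t sp = 0;

  for (size_t i = 0; i < len; i++)
    switch (prog[i].op)
    {
      case F_CONST:
        assert(sp < F_STACK_MAX);
        stack[sp++] = prog[i].value;
        break;
      case F_LOAD_PREF:
        assert(sp < F_STACK_MAX);
        stack[sp++] = preference;
        break;
      case F_MUL:
        assert(sp >= 2);
        stack[sp-2] = stack[sp-2] * stack[sp-1];
        sp--;
        break;
      case F_SUB:
        assert(sp >= 2);
        stack[sp-2] = stack[sp-2] - stack[sp-1];
        sp--;
        break;
    }

  assert(sp == 1);              /* A well-formed expression leaves one result */
  return stack[0];
}

/* Postfix form of: 2 * preference - 41 */
static const struct f_inst f_example[] = {
  { F_CONST, 2 }, { F_LOAD_PREF, 0 }, { F_MUL, 0 }, { F_CONST, 41 }, { F_SUB, 0 },
};
```

Calling `f_run(f_example, 5, 100)` walks the five instructions once and keeps all intermediate values on the local data stack, so no per-instruction function call (and no per-call register saving) is needed.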
|
||||||
|
|
||||||
|
After these changes, we managed to reduce the filter execution time by 10–40%,
depending on how complex the filter is.
Still, even this reduction is far too little when there is one CPU core
running for several minutes while others are sleeping.
|
||||||
|
|
||||||
|
## We need more threads
|
||||||
|
|
||||||
|
As a side effect of the rework, the new filter interpreter is also completely
thread-safe. It seemed to be the way to go: running the filters in parallel
while keeping everything else single-threaded. The main problem of this
solution is too fine a granularity of parallel jobs. We would spend lots of
time on synchronization overhead.
|
||||||
|
|
||||||
|
Parallel execution of filters alone was also too one-sided, useful only for
configurations with complex filters. In other cases, the major problem is best
route recalculation, OSPF recalculation or kernel synchronization.
It also turned out to be quite dirty from the point of view of code cleanliness.
|
||||||
|
|
||||||
|
Therefore we chose to make BIRD completely multithreaded. We designed a way
to gradually enable parallel computation and make the best use of all available CPU
cores. We have three goals:
|
||||||
|
|
||||||
|
* We want to keep current functionality. Parallel computation should never drop
  a useful feature.
* We want to take small steps. No big reworks, even though even the smallest
  possible step will need quite a lot of refactoring beforehand.
* We want to be backwards compatible as much as possible.
|
||||||
|
|
||||||
|
*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. In the next chapter, we're
going to describe the structures for routes and their attributes. Stay tuned!*
|
159
doc/threads/01_the_route_and_its_attributes.md
Normal file
@ -0,0 +1,159 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 1: The Route and its Attributes
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're making a significant number of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel. This chapter covers the necessary changes to the data structures
which store all the routing data.
|
||||||
|
|
||||||
|
*If you want to see the changes in code, look (basically) into the
`route-storage-updates` branch. Not all of them are implemented yet; however,
most of them are nearly finished as of the end of March 2021.*
|
||||||
|
|
||||||
|
## How routes are stored
|
||||||
|
|
||||||
|
A BIRD routing table is just a hierarchical noSQL database. On the top level, the
routes are keyed by their destination, called *net*. For historical reasons,
the *net* is not only *IPv4 prefix*, *IPv6 prefix*, *IPv4 VPN prefix* etc.,
but also *MPLS label*, *ROA information* or *BGP Flowspec record*. As there may
be several routes for each *net*, an obligatory part of the key is *src*, also known as
*route source*. The route source is a tuple of the originating protocol
instance and a 32-bit unsigned integer. If a protocol wants to withdraw a route,
the *net* and *src* are both necessary and sufficient to identify which route
is to be withdrawn.
|
||||||
|
|
||||||
|
The route itself consists of (basically) a list of key-value records, with
|
||||||
|
value types ranging from a 16-bit unsigned integer for preference to a complex
|
||||||
|
BGP path structure. The keys are pre-defined by protocols (e.g. BGP path or
|
||||||
|
OSPF metrics), or by BIRD core itself (preference, route gateway).
|
||||||
|
Finally, the user can declare their own attribute keys using the keyword
|
||||||
|
`attribute` in config.
|
||||||
|
|
||||||
|
## Attribute list implementation
|
||||||
|
|
||||||
|
Currently, there are three layers of route attributes. We call them *route*
|
||||||
|
(*rte*), *attributes* (*rta*) and *extended attributes* (*ea*, *eattr*).
|
||||||
|
|
||||||
|
The first layer, *rte*, contains the *net* pointer, several fixed-size route
|
||||||
|
attributes (mostly preference and protocol-specific metrics), flags, lastmod
|
||||||
|
time and a pointer to *rta*.
|
||||||
|
|
||||||
|
The second layer, *rta*, contains the *src* (a pointer to a singleton instance),
|
||||||
|
a route gateway, several other fixed-size route attributes and a pointer to
|
||||||
|
*ea* list.
|
||||||
|
|
||||||
|
The third layer, *ea* list, is a variable-length list of key-value attributes,
|
||||||
|
containing all the remaining route attributes.
|
||||||
|
|
||||||
|
Distribution of the route attributes between the attribute layers is somewhat
arbitrary. Mostly, in the first and second layer, there are attributes that
were thought to be accessed frequently (e.g. in best route selection) and
filled in for most routes, while the third layer is for infrequently used
and/or infrequently accessed route attributes.
|
||||||
|
|
||||||
|
## Attribute list deduplication
|
||||||
|
|
||||||
|
When protocols originate routes, there are commonly multiple routes with the
same attribute list. BIRD could ignore this fact, but if you have several
tables connected with pipes, it is more memory-efficient to store the same
attribute lists only once.
|
||||||
|
|
||||||
|
Therefore, the two lower layers (*rta* and *ea*) are hashed and stored in a
|
||||||
|
BIRD-global database. Routes (*rte*) contain a pointer to *rta* in this
|
||||||
|
database, maintaining a use-count of each *rta*. Attributes (*rta*) contain
|
||||||
|
a pointer to normalized (sorted by numerical key ID) *ea*.
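
The deduplication described above can be sketched as a small hash table of use-counted attribute lists. Everything here is a simplified illustration under stated assumptions: the `attr`, `attr_list`, `attr_lookup` and `attr_free` names, the fixed table size and the hash function are hypothetical and do not reproduce BIRD's actual *rta* / *ea* code. The sketch is single-threaded, like the design it describes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified attribute: a numeric key ID and a value. */
struct attr { uint32_t id; uint32_t value; };

struct attr_list {
  struct attr_list *next;       /* Hash chain */
  uint32_t hash;
  unsigned uc;                  /* Use count */
  unsigned count;
  struct attr attrs[];          /* Normalized: sorted by id */
};

#define CACHE_SIZE 64
static struct attr_list *cache[CACHE_SIZE];

/* FNV-style mixing over the ids and values (illustrative choice only). */
static uint32_t
attr_hash(const struct attr *a, unsigned count)
{
  uint32_t h = 2166136261u;
  for (unsigned i = 0; i < count; i++)
  {
    h = (h ^ a[i].id) * 16777619u;
    h = (h ^ a[i].value) * 16777619u;
  }
  return h;
}

/* Return the cached copy of a normalized list, creating it if not present. */
static struct attr_list *
attr_lookup(const struct attr *a, unsigned count)
{
  uint32_t h = attr_hash(a, count);
  struct attr_list **bucket = &cache[h % CACHE_SIZE];

  for (struct attr_list *l = *bucket; l; l = l->next)
    if (l->hash == h && l->count == count &&
        !memcmp(l->attrs, a, count * sizeof(*a)))
    {
      l->uc++;                  /* Same list already stored: share it */
      return l;
    }

  struct attr_list *l = malloc(sizeof(*l) + count * sizeof(*a));
  if (!l)
    abort();
  l->hash = h;
  l->uc = 1;
  l->count = count;
  memcpy(l->attrs, a, count * sizeof(*a));
  l->next = *bucket;
  *bucket = l;
  return l;
}

/* Drop one reference; unlink and free the list when the last user is gone. */
static void
attr_free(struct attr_list *l)
{
  if (--l->uc)
    return;

  struct attr_list **pp = &cache[l->hash % CACHE_SIZE];
  while (*pp != l)
    pp = &(*pp)->next;
  *pp = l->next;
  free(l);
}
```

Two routes carrying identical attribute lists then share one stored copy, and the copy disappears only after both routes have released it. Sorting by key ID before the lookup is what makes a plain byte comparison sufficient.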
|
||||||
|
|
||||||
|
## Attribute list rework
|
||||||
|
|
||||||
|
The first thing to change is the distribution of route attributes between
|
||||||
|
attribute list layers. We decided to make the first layer (*rte*) only the key
|
||||||
|
and other per-record internal technical information. Therefore we move *src* to
|
||||||
|
*rte* and preference to *rta* (beside other things). *This is already done.*
|
||||||
|
|
||||||
|
We also found out that the nexthop (gateway), originally one single IP address
and an interface, has evolved into a complex attribute with several sub-attributes;
not only considering multipath routing but also MPLS stacks and other per-route
attributes. This has led to an overly complex data structure holding the nexthop set.
|
||||||
|
|
||||||
|
We finally decided to squash *rta* and *ea* into one type of data structure,
allowing for completely dynamic route attribute lists. This is also motivated
by the addition of other *net* types (BGP FlowSpec or ROA) where lots of the fields make
no sense at all, yet we still want to use the same data structures and implementation
as we don't like duplicating code. *Multithreading doesn't depend on this change,
but this change is going to happen soon anyway.*
|
||||||
|
|
||||||
|
## Route storage
|
||||||
|
|
||||||
|
The process of route import from protocol into a table can be divided into several phases:
|
||||||
|
|
||||||
|
1. (In protocol code.) Create the route itself (typically from
|
||||||
|
protocol-internal data) and choose the right channel to use.
|
||||||
|
2. (In protocol code.) Create the *rta* and *ea* and obtain an appropriate
|
||||||
|
hashed pointer. Allocate the *rte* structure and fill it in.
|
||||||
|
3. (Optionally.) Store the route to the *import table*.
|
||||||
|
4. Run filters. If reject, free everything.
|
||||||
|
5. Check whether this is a real change (it may be idempotent). If not, free everything and do nothing more.
|
||||||
|
6. Run the best route selection algorithm.
|
||||||
|
7. Execute exports if needed.
|
||||||
|
|
||||||
|
We found out that the *rte* structure allocation is done too early. BIRD uses
global optimized allocators for fixed-size blocks (which *rte* is) to reduce
its memory footprint, therefore the allocation of the *rte* structure would be a
synchronization point in a multithreaded environment.
|
||||||
|
|
||||||
|
The common code is also much more complicated when we have to track whether the
current *rte* has to be freed or not. This is more of a problem in export than in
import as the export filter can also change the route (and therefore allocate
another *rte*). The changed route must therefore be freed after use. All the
route changing code must also track whether this route is writable or
read-only.
|
||||||
|
|
||||||
|
We therefore introduce a variant of *rte* called *rte_storage*. Both of these
hold the same layer-1 route information (destination, author, cached
attribute pointer, flags etc.), but *rte* is always local and *rte_storage*
is intended to be put in global data structures.
|
||||||
|
|
||||||
|
This change allows us to remove lots of the code which only tracks whether any
*rte* is to be freed, as *rte*s are almost always allocated on-stack, naturally
limiting their lifetime. If not on-stack, it's the responsibility of the owner
to free the *rte* after import is done.
|
||||||
|
|
||||||
|
This change also removes the need for *rte* allocation in protocol code and
|
||||||
|
also *rta* can be safely allocated on-stack. As a result, protocols can simply
|
||||||
|
allocate all the data on stack, call the update routine and the common code in
|
||||||
|
BIRD's *nest* does all the storage for them.
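
The split between a local *rte* and a stored *rte_storage* can be sketched as follows. The structures are heavily simplified and the function names (`table_update`, `proto_announce`) are hypothetical, chosen for this illustration; the sketch only shows the ownership pattern, not BIRD's real update routine.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified layer-1 route information. */
struct rte {
  int net;                      /* Destination (stand-in for a net pointer) */
  int src;                      /* Route source */
  const char *attrs;            /* Stand-in for the cached attribute pointer */
};

/* The stored variant: the same information plus table linkage. */
struct rte_storage {
  struct rte_storage *next;
  struct rte rte;
};

struct table {
  struct rte_storage *routes;
};

/* Common code: copy the caller's local route into table-owned storage. */
static struct rte_storage *
table_update(struct table *t, const struct rte *new)
{
  struct rte_storage *s = malloc(sizeof(*s));
  if (!s)
    abort();
  s->rte = *new;
  s->next = t->routes;
  t->routes = s;
  return s;
}

/* Protocol code: the route lives on the stack for the duration of the call. */
static void
proto_announce(struct table *t, int net, int src)
{
  struct rte e = { .net = net, .src = src, .attrs = "example" };
  table_update(t, &e);
  /* Nothing to free here: `e` ceases to exist on return. */
}
```

The protocol side never allocates or frees a route, so there is no "is this *rte* mine to free?" bookkeeping left in it; only the table owns heap-allocated route records.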
|
||||||
|
|
||||||
|
Allocating *rta* on-stack is however not required. BGP and OSPF use this to
|
||||||
|
import several routes with the same attribute list. In BGP, this is due to the
|
||||||
|
format of BGP update messages containing first the attributes and then the
|
||||||
|
destinations (BGP NLRI's). In OSPF, in addition to *rta* deduplication, it is
|
||||||
|
also presumed that no import filter (or at most some trivial changes) is applied
|
||||||
|
as OSPF would typically not work well when filtered.
|
||||||
|
|
||||||
|
*This change is already done.*
|
||||||
|
|
||||||
|
## Route cleanup and table maintenance
|
||||||
|
|
||||||
|
In some cases, the route update is not originated by a protocol/channel code.
|
||||||
|
When the channel shuts down, all routes originated by that channel are simply
|
||||||
|
cleaned up. Also routes with recursive routes may get changed without import,
|
||||||
|
simply by changing the IGP route.
|
||||||
|
|
||||||
|
This is currently done by a `rt_event` (see `nest/rt-table.c` for source code)
|
||||||
|
which is to be converted to a parallel thread, running when nobody imports any
|
||||||
|
route. *This change is freshly done in branch `guernsey`.*
|
||||||
|
|
||||||
|
## Parallel protocol execution
|
||||||
|
|
||||||
|
The long-term goal of these reworks is to allow for completely independent
execution of all the protocols. Typically, there is no direct interaction
between protocols; everything is done through BIRD's *nest*. Protocols should
therefore run in parallel in the future and wait/lock only when something needs
to be done externally.
|
||||||
|
|
||||||
|
We also aim for a clean and documented protocol API.
|
||||||
|
|
||||||
|
*It's still a long road to version 2.1. This series of texts should document
what needs to be changed, why we do it and how. In the next chapter, we're
going to describe how the route is exported from table to protocols and how this
process is changing. Stay tuned!*
|
463
doc/threads/02_asynchronous_export.md
Normal file
@ -0,0 +1,463 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 2: Asynchronous route export
|
||||||
|
|
||||||
|
Route export is a core algorithm of BIRD. This chapter covers how we are making
this procedure multithreaded. Desired outcomes are mostly lower latency of
route import, flap dampening and also faster route processing in large
configurations with lots of exports from one table.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're making a significant number of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel.
|
||||||
|
|
||||||
|
## How routes are propagated through BIRD
|
||||||
|
|
||||||
|
In the [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/), you could learn how the route import works. We should
|
||||||
|
now extend that process by the route export.
|
||||||
|
|
||||||
|
1. (In protocol code.) Create the route itself and propagate it through the
|
||||||
|
right channel by calling `rte_update`.
|
||||||
|
2. The channel runs its import filter.
|
||||||
|
3. New best route is selected.
|
||||||
|
4. For each channel:
|
||||||
|
1. The channel runs its preexport hook and export filter.
|
||||||
|
2. (Optionally.) The channel merges the nexthops to create an ECMP route.
|
||||||
|
3. The channel calls the protocol's `rt_notify` hook.
|
||||||
|
5. After all exports are finished, the `rte_update` call finally returns and
|
||||||
|
the source protocol may do anything else.
|
||||||
|
|
||||||
|
Let's imagine that all the protocols are running in parallel. There are two
|
||||||
|
protocols with a route prepared to import. One of those wins the table lock,
|
||||||
|
does the import and then the export touches the other protocol which must
|
||||||
|
either:
|
||||||
|
|
||||||
|
* store the route export until it finishes its own imports, or
|
||||||
|
* have independent import and export parts.
|
||||||
|
|
||||||
|
Both of these options are infeasible for common use. Implementing them would
make protocols much more complicated, with lots of new code to test and release
at once and also quite a lot of corner cases. The risk of deadlocks is also worth
mentioning.
|
||||||
|
|
||||||
|
## Asynchronous route export
|
||||||
|
|
||||||
|
We decided to make it easier for protocols and decouple the import and export
|
||||||
|
this way:
|
||||||
|
|
||||||
|
1. The import is done.
|
||||||
|
2. Best route is selected.
|
||||||
|
3. Resulting changes are stored.
|
||||||
|
|
||||||
|
Then, after the importing protocol returns, the exports are processed for each
|
||||||
|
exporting channel in parallel: Some protocols
|
||||||
|
may process the export directly after it is stored, other protocols wait
|
||||||
|
until they finish another job.
|
||||||
|
|
||||||
|
This eliminates the risk of deadlocks and all protocols' `rt_notify` hooks can
|
||||||
|
rely on their independence. There is only one question. How to store the changes?
|
||||||
|
|
||||||
|
## Route export modes
|
||||||
|
|
||||||
|
To find a good data structure for route export storage, we shall first know the
|
||||||
|
readers. The exporters may request different modes of route export.
|
||||||
|
|
||||||
|
### Export everything
|
||||||
|
|
||||||
|
This is the simplest route export mode. The exporter wants to know about all
the routes as they're changing. We therefore simply store the old route until
the change is fully exported and then we free the old stored route.
|
||||||
|
|
||||||
|
To manage this, we can simply queue the changes one after another and postpone
old route cleanup until all channels have exported the change. The queue member
would look like this:
|
||||||
|
|
||||||
|
```
struct {
  struct rte_storage *new;
  struct rte_storage *old;
};
```
|
||||||
|
|
||||||
|
### Export best
|
||||||
|
|
||||||
|
This is another simple route export mode. We check whether the best route has
|
||||||
|
changed; if not, no export happens. Otherwise, the export is propagated as the
|
||||||
|
old best route changing to the new best route.
|
||||||
|
|
||||||
|
To manage this, we could use the queue from the previous point by adding new
best and old best pointers. It is guaranteed that both the old best and new
best pointers are always valid at the time of export, as all the changes to them
must be stored in future changes which have not been exported yet by this
channel and therefore not freed yet.
|
||||||
|
|
||||||
|
```
struct {
  struct rte_storage *new;
  struct rte_storage *new_best;
  struct rte_storage *old;
  struct rte_storage *old_best;
};
```
|
||||||
|
|
||||||
|
Anyway, we're getting to the complicated export modes where this simple
|
||||||
|
structure is simply not enough.
|
||||||
|
|
||||||
|
### Export merged
|
||||||
|
|
||||||
|
Here we run into problems. The exporting channel requests not
only the best route but also all routes that are good enough to be considered
ECMP-eligible (we call these routes *mergable*). The export is then just one
route with just the nexthops merged. Export filters are executed before
merging and if the best route is rejected, nothing is exported at all.
|
||||||
|
|
||||||
|
To achieve this, we have to re-evaluate export filters any time the best route
|
||||||
|
or any mergable route changes. Until now, the export could just do what it wanted
|
||||||
|
as there was only one thread working. To change this, we need to access the
|
||||||
|
whole route list and process it.
|
||||||
|
|
||||||
|
### Export first accepted
|
||||||
|
|
||||||
|
In this mode, the channel runs export filters on a sorted list of routes, best first.
|
||||||
|
If the best route gets rejected, it asks for the next one until it finds an
|
||||||
|
acceptable route or exhausts the list. This export mode requires a sorted table.
|
||||||
|
BIRD users may know this export mode as `secondary` in BGP.
|
||||||
|
|
||||||
|
For now, BIRD stores two bits per route for each channel. The *export bit* is set
|
||||||
|
if the route has been really exported to that channel. The *reject bit* is set
|
||||||
|
if the route was rejected by the export filter.
|
||||||
|
|
||||||
|
When processing a route change for first accepted, the algorithm first checks the
export bit for the old route. If this bit is set, the old route is the one
exported so we have to find the right one to export. Therefore the sorted route
list is walked best to worst to find a new route to export, using the reject
bit to evaluate only routes which weren't rejected in previous runs of this
algorithm.
|
||||||
|
|
||||||
|
If the old route's export bit is not set, the algorithm walks the sorted route list best
to worst, checking the position of the new route with respect to the exported route.
If the new route is worse, nothing happens, otherwise the new route is sent to
filters and finally exported if it passes.
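
The core of the first case (the exported route has changed, so a replacement has to be found) can be sketched as a walk over the sorted list that consults the reject bit before running the filter. This is a simplified illustration: the `route` structure, `export_first_accepted` and `example_filter` are hypothetical names, the two bits are plain booleans for a single channel, and clearing the export bit of the old route is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct route {
  int metric;                   /* Lower is better; the list is sorted by this */
  bool rejected;                /* The per-channel reject bit */
  bool exported;                /* The per-channel export bit */
};

typedef bool (*export_filter)(const struct route *);

/* Walk the sorted list best to worst and return the first route the filter
 * accepts, or NULL if the list is exhausted. Routes already marked as
 * rejected are skipped without running the filter again. */
static struct route *
export_first_accepted(struct route *sorted, size_t n, export_filter filter)
{
  for (size_t i = 0; i < n; i++)
  {
    struct route *r = &sorted[i];

    if (r->rejected)
      continue;

    if (filter(r))
    {
      r->exported = true;
      return r;
    }

    r->rejected = true;
  }

  return NULL;
}

/* Example filter for demonstration: accepts metrics of at least 20
 * and counts how many times it was actually run. */
static int filter_calls;

static bool
example_filter(const struct route *r)
{
  filter_calls++;
  return r->metric >= 20;
}
```

On a second walk over the same list, the route rejected earlier is skipped by its reject bit, so the filter runs only for routes that have not been rejected before.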
|
||||||
|
|
||||||
|
### Export by feed
|
||||||
|
|
||||||
|
To resolve problems arising from the previous two export modes (merged and first accepted),
we introduce a way to process a whole route list without locking the table
while export filters are running. To achieve this, we follow this algorithm:
|
||||||
|
|
||||||
|
1. The exporting channel sees a pending export.
|
||||||
|
2. *The table is locked.*
|
||||||
|
3. All routes (pointers) for the given destination are dumped to a local array.
|
||||||
|
4. Also first and last pending exports for the given destination are stored.
|
||||||
|
5. *The table is unlocked.*
|
||||||
|
6. The channel processes the local array of route pointers.
|
||||||
|
7. All pending exports between the first and last stored (incl.) are marked as processed to allow for cleanup.
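
The seven steps above can be sketched with a mutex standing in for the table lock. This is an illustration under assumptions, not BIRD code: the `net`, `feed` and `channel` structures, the fixed-size arrays, the 64-bit `seen` mask (which limits the sketch to sequence numbers below 64) and the function names are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define FEED_MAX 8

struct route { int id; };

/* Hypothetical per-destination state kept in the table. */
struct net {
  pthread_mutex_t lock;         /* Stand-in for the table lock */
  struct route *routes[FEED_MAX];
  size_t route_count;
  uint64_t first_pending_seq;   /* Oldest pending export for this net */
  uint64_t last_pending_seq;    /* Newest pending export for this net */
};

/* The channel's local copy taken under the lock. */
struct feed {
  struct route *routes[FEED_MAX];
  size_t count;
  uint64_t first_seq, last_seq;
};

struct channel {
  uint64_t seen;                /* Bit i set = export with seq i processed */
};

/* Steps 2 to 5: copy route pointers and the pending range under the lock. */
static void
feed_snapshot(struct net *n, struct feed *f)
{
  pthread_mutex_lock(&n->lock);
  f->count = n->route_count;
  for (size_t i = 0; i < f->count; i++)
    f->routes[i] = n->routes[i];
  f->first_seq = n->first_pending_seq;
  f->last_seq = n->last_pending_seq;
  pthread_mutex_unlock(&n->lock);
}

/* Steps 6 and 7: process the local copy without holding the lock, then mark
 * exactly the exports that were pending when the snapshot was taken. */
static void
feed_process(struct channel *c, const struct feed *f,
             void (*export_hook)(struct route *))
{
  for (size_t i = 0; i < f->count; i++)
    export_hook(f->routes[i]);

  for (uint64_t s = f->first_seq; s <= f->last_seq; s++)
  {
    assert(s < 64);             /* Limitation of this sketch only */
    c->seen |= UINT64_C(1) << s;
  }
}

/* Example hook for demonstration: sums up the ids of the routes it was given. */
static int hook_sum;

static void
sum_hook(struct route *r)
{
  hook_sum += r->id;
}
```

The point of the design is that the potentially slow part (running export filters, here represented by the hook) happens outside the lock, while the lock is held only for a short pointer copy. An export that arrives after the snapshot is deliberately not marked, so it triggers another pass.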
|
||||||
|
|
||||||
|
After unlocking the table, the pointed-to routes are implicitly guarded by the
sole fact that their pending exports have not yet been processed by all channels,
and the cleanup routine frees resources only after they have been processed.
|
||||||
|
|
||||||
|
The pending export range must be stored together with the feed. While
processing export filters for the feed, another export may come in. We
must process the export once again as the feed is now outdated, therefore we
must mark only those exports that were pending for this destination when the
feed was being stored. We also can't mark them before actually processing them
as they would get freed in between.
|
||||||
|
|
||||||
|
## Pending export data structure
|
||||||
|
|
||||||
|
As the two complicated export modes use the export-by-feed algorithm, the
|
||||||
|
pending export data structure may be quite minimalistic.
|
||||||
|
|
||||||
|
```
struct rt_pending_export {
  struct rt_pending_export * _Atomic next;  /* Next export for the same destination */
  struct rte_storage *new;                  /* New route */
  struct rte_storage *new_best;             /* New best route in unsorted table */
  struct rte_storage *old;                  /* Old route */
  struct rte_storage *old_best;             /* Old best route in unsorted table */
  _Atomic u64 seq;                          /* Sequential ID (table-local) of the pending export */
};
```
|
||||||
|
|
||||||
|
To allow for squashing outdated pending exports (e.g. for flap dampening
|
||||||
|
purposes), there is a `next` pointer to the next export for the same
|
||||||
|
destination. This is also needed for the export-by-feed algorithm to traverse
|
||||||
|
the list of pending exports.
|
||||||
|
|
||||||
|
We should also add several items into `struct channel`.
|
||||||
|
|
||||||
|
```
struct coroutine *export_coro;                   /* Exporter and feeder coroutine */
struct bsem *export_sem;                         /* Exporter and feeder semaphore */
struct rt_pending_export * _Atomic last_export;  /* Last export processed */
struct bmap export_seen_map;                     /* Keeps track which exports were already processed */
u64 flush_seq;                                   /* Table export seq when the channel announced flushing */
```
|
||||||
|
|
||||||
|
To run the exports in parallel, `export_coro` is run and `export_sem` is
|
||||||
|
used for signalling new exports to it. The exporter coroutine also marks all
|
||||||
|
seen sequential IDs in its `export_seen_map` to make it possible to skip over
|
||||||
|
them if seen again. The exporter coroutine is started when export is requested
|
||||||
|
and stopped when export is stopped.
|
||||||
|
|
||||||
|
There is also a table cleaner routine
(see [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/))
which must also clean up the pending exports after all the channels are finished with them.
To signal that, there is `last_export` working as a release point: the channel
guarantees that it will no longer touch the pointed-to pending export (or any older), nor any data
from it.
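
The release-point idea can be sketched with C11 atomics. As a simplification, this sketch publishes the sequence number of the last processed export instead of a pointer to it; that choice, the `last_export_seq` field and the function names are assumptions of this illustration and not the actual BIRD implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical channel state: 0 means nothing has been released yet. */
struct channel {
  _Atomic uint64_t last_export_seq;
};

/* Channel side: promise not to touch this export or any older one again.
 * The release ordering makes all prior reads of the export data happen
 * before the cleaner can observe the new value. */
static void
channel_release(struct channel *c, uint64_t seq)
{
  atomic_store_explicit(&c->last_export_seq, seq, memory_order_release);
}

/* Cleaner side: an export may be freed once every channel has released it. */
static bool
export_can_be_freed(struct channel *chans, size_t n, uint64_t seq)
{
  for (size_t i = 0; i < n; i++)
    if (atomic_load_explicit(&chans[i].last_export_seq,
                             memory_order_acquire) < seq)
      return false;

  return true;
}
```

With two channels, an export is kept as long as the slower channel has not released it, regardless of how far ahead the faster one is.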
|
||||||
|
|
||||||
|
The last tricky point here is channel flushing. When any channel stops, all its
routes are automatically freed and withdrawals are exported if appropriate.
Until now, the routes could be flushed synchronously; now the flush has
several phases, stored in the `flush_active` channel variable:
|
||||||
|
|
||||||
|
1. Flush started.
2. Withdrawals for all the channel's routes are issued.
   (Here the channel stores the `seq` of the last current pending export to `flush_seq`.)
3. When the table's cleanup routine cleans up the withdrawal with `flush_seq`,
   the channel may safely stop and free its structures as all `sender` pointers in routes are now gone.
|
||||||
|
|
||||||
|
Finally, some additional information has to be stored in tables:
|
||||||
|
|
||||||
|
```
|
||||||
|
_Atomic byte export_used; /* Export journal cleanup scheduled */
|
||||||
|
struct rt_pending_export * _Atomic first_export; /* First export to announce */
|
||||||
|
byte export_scheduled; /* Export is scheduled */
|
||||||
|
list pending_exports; /* List of packed struct rt_pending_export */
|
||||||
|
struct fib export_fib; /* Auxiliary fib for storing pending exports */
|
||||||
|
u64 next_export_seq; /* The next export will have this ID */
|
||||||
|
```
|
||||||
|
|
||||||
|
The exports are:
|
||||||
|
1. Assigned the `next_export_seq` sequential ID, incrementing this item by one.
|
||||||
|
2. Put into `pending_exports` and `export_fib` for both sequential and by-destination access.
|
||||||
|
3. Signalled by setting `export_scheduled` and `first_export`.
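
The three steps above can be sketched as follows. This is a simplified model, not the code from `nest/rt-table.c`: the `pending_export` and `export_journal` types are hypothetical, the by-destination `export_fib` is left out, and the caller is assumed to hold the table lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model of one queued export. */
struct pending_export {
  struct pending_export *next;  /* sequential order */
  uint64_t seq;
};

/* Simplified, hypothetical model of the table-side fields. */
struct export_journal {
  struct pending_export *_Atomic first_export;
  struct pending_export *last;  /* tail of the sequential list */
  uint64_t next_export_seq;
  bool export_scheduled;
};

void journal_init(struct export_journal *j)
{
  atomic_init(&j->first_export, NULL);
  j->last = NULL;
  j->next_export_seq = 0;
  j->export_scheduled = false;
}

/* Assumed to be called with the table locked. */
void journal_push(struct export_journal *j, struct pending_export *e)
{
  e->seq = j->next_export_seq++;  /* step 1: assign the sequential ID */
  e->next = NULL;

  if (j->last)                    /* step 2: append to the sequential list */
    j->last->next = e;
  j->last = e;

  /* step 3: publish the head if there was none, then signal */
  struct pending_export *expected = NULL;
  atomic_compare_exchange_strong(&j->first_export, &expected, e);
  j->export_scheduled = true;
}
```

Readers that only need the head of the journal can load `first_export` atomically without taking the table lock, which is the reason for making that one pointer atomic.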
|
||||||
|
|
||||||
|
After processing several exports, `export_used` is set and route table maintenance
|
||||||
|
coroutine is woken up to possibly do cleanup.
|
||||||
|
|
||||||
|
The `struct rt_pending_export` seems to be best allocated by requesting a whole
memory page, containing a common list node, a simple header and all the
structures packed in the rest of the page. This may save a significant amount of memory.
In case of congestion, there will be lots of exports and every spare kilobyte
counts. If BIRD is almost idle, the optimization has no effect on the overall performance.
|
||||||
|
|
||||||
|
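
To get a feeling for the page packing, the capacity of one page can be computed like this. All types here are hypothetical stand-ins and the 4096-byte page size is an assumption; real code would ask the operating system.

```c
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096  /* assumption, not queried from the OS */

struct list_node {
  struct list_node *next, *prev;
};

/* Hypothetical payload standing in for struct rt_pending_export. */
struct export_item {
  void *next;
  void *new_route, *old_route;
  unsigned long long seq;
};

struct export_block {
  struct list_node n;          /* common list node linking whole pages */
  unsigned end;                /* simple header: number of used slots */
  struct export_item items[];  /* the rest of the page */
};

/* How many items fit into one page after the list node and the header. */
size_t export_block_capacity(void)
{
  return (PAGE_SIZE_ASSUMED - offsetof(struct export_block, items))
    / sizeof(struct export_item);
}
```

Compared to one allocation per export, the list node and the allocator overhead are paid once per page instead of once per item.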
## Export algorithm
|
||||||
|
|
||||||
|
As we have explained at the beginning, the current export algorithm is
|
||||||
|
synchronous and table-driven. The table walks the channel list and propagates the update.
|
||||||
|
The new export algorithm is channel-driven. The table just indicates that it
|
||||||
|
has something new in export queue and the channel decides what to do with that and when.
|
||||||
|
|
||||||
|
### Pushing an export
|
||||||
|
|
||||||
|
When a table has something to export, it enqueues an instance of
|
||||||
|
`struct rt_pending_export` together with updating the `last` pointer (and
|
||||||
|
possibly also `first`) for this destination's pending exports.
|
||||||
|
|
||||||
|
Then it pings its maintenance coroutine (`rt_event`) to notify the exporting
|
||||||
|
channels about a new route. Before the maintenance coroutine acquires the table
|
||||||
|
lock, the importing protocol may e.g. prepare the next route in the meantime.
|
||||||
|
The maintenance coroutine, when it wakes up, walks the list of channels and
|
||||||
|
wakes their export coroutines.
|
||||||
|
|
||||||
|
These two levels of asynchronicity are here for an efficiency reason.
|
||||||
|
|
||||||
|
1. In case of low table load, the export is announced just after the import happens.
|
||||||
|
2. In case of table congestion, the export notification has to take the table lock just like
all the route importers do, so one notification covers many imports, effectively reducing the number of channel list traversals.
|
||||||
|
|
||||||
|
### Processing an export
|
||||||
|
|
||||||
|
After these two pings, the channel finally knows that there is an export pending.
|
||||||
|
|
||||||
|
1. The channel waits for a semaphore. This semaphore is posted by the table
|
||||||
|
maintenance coroutine.
|
||||||
|
2. The channel checks whether there is a `last_export` stored.
|
||||||
|
1. If yes, it proceeds with the next one.
|
||||||
|
2. Otherwise it takes `first_export` from the table. This special
|
||||||
|
pointer is atomic and can be accessed without locking and also without clashing
|
||||||
|
with the export cleanup routine.
|
||||||
|
3. The channel checks its `export_seen_map` whether this export has been
|
||||||
|
already processed. If so, it goes back to 1. to get the next export. No
|
||||||
|
action is needed with this one.
|
||||||
|
4. As now the export is clearly new, the export chain (single-linked list) is
|
||||||
|
scanned for the current first and last export. This is done by following the
|
||||||
|
`next` pointer in the exports.
|
||||||
|
5. If all-routes mode is used, the exports are processed one-by-one. In future
|
||||||
|
versions, we may employ some simple flap-dampening by checking the pending
|
||||||
|
export list for the same route src. *No table locking happens.*
|
||||||
|
6. If best-only mode is employed, just the first and last exports are
|
||||||
|
considered to find the old and new best routes. The exports in between do nothing. *No table locking happens.*
|
||||||
|
7. If export-by-feed is used, the current state of routes in the table is fetched and processed
|
||||||
|
as described above in the "Export by feed" section.
|
||||||
|
8. All processed exports are marked as seen.
|
||||||
|
9. The channel stores the first processed export to `last_export` and returns
to the beginning to wait for the next exports. The later exports are then skipped by
step 3 when the export coroutine gets to them.
|
||||||
|
|
||||||
|
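
The best-only case of the export processing can be illustrated by collapsing one destination's export chain into a single change. This is a minimal sketch under stated assumptions: `chain_export`, `best_change` and `collapse_chain()` are hypothetical names, and the chain is assumed to be a non-empty single-linked list ordered from oldest to newest.

```c
#include <stddef.h>

struct route {
  int id;  /* placeholder; a real route carries a network and attributes */
};

/* Hypothetical per-destination chain of pending exports. */
struct chain_export {
  struct chain_export *next;
  const struct route *old_best;
  const struct route *new_best;
};

struct best_change {
  const struct route *old_best;
  const struct route *new_best;
};

/* Only the first export's old route and the last export's new route matter. */
struct best_change collapse_chain(const struct chain_export *first)
{
  struct best_change r = { first->old_best, first->new_best };

  for (const struct chain_export *e = first->next; e; e = e->next)
    r.new_best = e->new_best;

  return r;
}
```

The walk follows only the `next` pointers of already published exports, so it fits the remark that no table locking is needed in this mode.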
## The full life-cycle of routes
|
||||||
|
|
||||||
|
Until now, we have always assumed that the channels *just exist*. In real life,
|
||||||
|
any channel may go up or down and we must handle it, flushing the routes
|
||||||
|
appropriately and freeing all the memory just in time to avoid both
|
||||||
|
use-after-free and memory leaks. BIRD is written in C, which has no garbage
collector or similar modern features, so memory management is a real concern.
|
||||||
|
|
||||||
|
### Protocols and channels as viewed from a route
|
||||||
|
|
||||||
|
BIRD consists effectively of protocols and tables. **Protocols** are active parts,
|
||||||
|
kind-of subprocesses manipulating routes and other data. **Tables** are passive,
|
||||||
|
serving as a database of routes. To connect a protocol to a table, a
|
||||||
|
**channel** is created.
|
||||||
|
|
||||||
|
Every route has its `sender` storing the channel which has put the route into
|
||||||
|
the current table. Therefore we know which routes to flush when a channel goes down.
|
||||||
|
|
||||||
|
Every route also has its `src`, a route source allocated by the protocol which
|
||||||
|
originated it first. This is kept when a route is passed through a *pipe*. The
|
||||||
|
route source is always bound to protocol; it is possible that a protocol
|
||||||
|
announces routes via several channels using the same src.
|
||||||
|
|
||||||
|
Both `src` and `sender` must point to active protocols and channels as inactive
|
||||||
|
protocols and channels may be deleted any time.
|
||||||
|
|
||||||
|
### Protocol and channel lifecycle
|
||||||
|
|
||||||
|
In the beginning, all channels and protocols are down. Until they fully start,
|
||||||
|
no route from them is allowed to any table. When the protocol and channel are up,
|
||||||
|
they may originate and receive routes freely. However, the transitions are worth mentioning.
|
||||||
|
|
||||||
|
### Channel startup and feed
|
||||||
|
|
||||||
|
When protocols and channels start, they need to get the current state of the
|
||||||
|
appropriate table. Therefore, after a protocol and channel start, also the
|
||||||
|
export-feed coroutine is initiated.
|
||||||
|
|
||||||
|
Tables can contain millions of routes. It would lead to long import latency if a channel
fed itself in one step. The table structure is (at least for now) too
|
||||||
|
complicated to be implemented as lockless, thus even read access needs locking.
|
||||||
|
To mitigate this, the feeds are split to allow for regular route propagation
|
||||||
|
with a reasonable latency.
|
||||||
|
|
||||||
|
When the exports were synchronous, we simply didn't care and just announced the
|
||||||
|
exports to the channels from the time they started feeding. When making exports
|
||||||
|
asynchronous, it is crucial to avoid (hopefully) all the possible race conditions
|
||||||
|
which could arise from simultaneous feed and export. As the feeder routines had
|
||||||
|
to be rewritten, it is a good opportunity to make this precise.
|
||||||
|
|
||||||
|
Therefore, when a channel goes up, it also starts exports:
|
||||||
|
|
||||||
|
1. Start the feed-export coroutine.
|
||||||
|
2. *Lock the table.*
|
||||||
|
3. Store the last export in queue.
|
||||||
|
4. Read a limited number of routes to local memory together with their pending exports.
|
||||||
|
5. If there are some routes to process:
|
||||||
|
1. *Unlock the table.*
|
||||||
|
2. Process the loaded routes.
|
||||||
|
3. Set the appropriate pending exports as seen.
|
||||||
|
4. *Lock the table.*
|
||||||
|
5. Go to 4. to continue feeding.
|
||||||
|
6. If there was a last export stored, load the next one to be processed. Otherwise take the table's `first_export`.
|
||||||
|
7. *Unlock the table.*
|
||||||
|
8. Run the exporter loop.
|
||||||
|
|
||||||
|
*Note: Some nuances about doing things in the right
order, to avoid missing events while changing state, are not mentioned here. For specifics, look
into the code in `nest/rt-table.c` in branch `alderney`.*
|
||||||
|
|
||||||
|
When the feeder loop finishes, it continues smoothly to process all the exports
|
||||||
|
that have been queued while the feed was running. Step 5.3 ensures that already
|
||||||
|
seen exports are skipped, steps 3 and 6 ensure that no export is missed.
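
The lock/unlock rhythm of the feeder loop can be sketched as follows. It is a toy model with several assumptions: routes are plain integers, `toy_table` and its boolean "lock" are hypothetical stand-ins for the real table and its domain, `FEED_CHUNK` is an arbitrary limit, and the bookkeeping of the stored last export and of seen exports (steps 3, 5.3 and 6) is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define FEED_CHUNK 4  /* hypothetical limit of routes copied per lock hold */

struct toy_table {
  bool locked;          /* stands in for the real table lock */
  unsigned lock_count;  /* how many times the lock was taken */
  const int *routes;    /* routes are plain ints in this model */
  size_t count;
};

void table_lock(struct toy_table *t)   { t->locked = true; t->lock_count++; }
void table_unlock(struct toy_table *t) { t->locked = false; }

/* Example callback: sums the route values into *ctx. */
void sum_route(int route, void *ctx)
{
  *(long *) ctx += route;
}

/* Feed all routes to process(), never holding the lock while processing. */
size_t feed_table(struct toy_table *t, void (*process)(int route, void *ctx), void *ctx)
{
  int chunk[FEED_CHUNK];
  size_t pos = 0, fed = 0;

  table_lock(t);                /* step 2 */
  for (;;) {
    size_t n = 0;               /* step 4: copy a limited number of routes */
    while (n < FEED_CHUNK && pos < t->count)
      chunk[n++] = t->routes[pos++];

    if (n == 0)
      break;                    /* nothing left to feed */

    table_unlock(t);            /* step 5.1 */
    for (size_t i = 0; i < n; i++)
      process(chunk[i], ctx);   /* step 5.2 */
    fed += n;
    table_lock(t);              /* step 5.4, then back to step 4 */
  }
  table_unlock(t);              /* step 7 */

  return fed;
}
```

With ten routes and a chunk of four, the table lock is taken four times, each time only for the short copy, so importers get a chance to run between the chunks.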
|
||||||
|
|
||||||
|
### Channel flush
|
||||||
|
|
||||||
|
Protocols and channels need to stop for a handful of reasons. All of these
|
||||||
|
cases follow the same routine.
|
||||||
|
|
||||||
|
1. (Maybe.) The protocol requests to go down or restart.
|
||||||
|
2. The channel requests to go down or restart.
|
||||||
|
3. The channel requests to stop export.
|
||||||
|
4. In the feed-export coroutine:
|
||||||
|
1. At a designated cancellation point, check cancellation.
|
||||||
|
2. Clean up local data.
|
||||||
|
3. *Lock main BIRD context.*
|
||||||
|
4. If shutdown requested, switch the channel to *flushing* state and request table maintenance.
|
||||||
|
5. *Stop the coroutine and unlock main BIRD context.*
|
||||||
|
5. In the table maintenance coroutine:
|
||||||
|
1. Walk across all channels and check them for *flushing* state, setting `flush_active` to 1.
|
||||||
|
2. Walk across the table (split to allow for low latency updates) and
|
||||||
|
generate a withdrawal for each route sent by the flushing channels.
|
||||||
|
3. When all the table is traversed, the flushing channels' `flush_active` is set to 2 and
|
||||||
|
`flush_seq` is set to the current last export seq.
|
||||||
|
4. Wait until all the withdrawals are processed by checking the `flush_seq`.
5. Mark the flushing channels as *down* and eventually proceed to the protocol shutdown or restart.
|
||||||
|
|
||||||
|
There is also a separate routine that handles bulk cleanup of `src`'s which
|
||||||
|
contain a pointer to the originating protocol. This routine may get reworked in
|
||||||
|
future; for now it is good enough.
|
||||||
|
|
||||||
|
### Route export cleanup
|
||||||
|
|
||||||
|
Last but not least is the export cleanup routine. Until now, the withdrawn
|
||||||
|
routes were exported synchronously and freed directly after the import was
|
||||||
|
done. This is not possible anymore. The export is stored and the import returns
|
||||||
|
to let the importing protocol continue its work. We therefore need a routine to
|
||||||
|
clean up the withdrawn routes and also the processed exports.
|
||||||
|
|
||||||
|
First of all, this routine refuses to clean up when any export is feeding or
shutting down. In the future, cleanup while feeding should be possible, but for
now we aren't sure about possible race conditions.
|
||||||
|
|
||||||
|
Anyway, when all the exports are in a steady state, the routine works as follows:
|
||||||
|
|
||||||
|
1. Walk the active exports and find a minimum (oldest export) between their `last_export` values.
|
||||||
|
2. If there is nothing to clear between the actual oldest export and channels' oldest export, do nothing.
|
||||||
|
3. Find the table's new `first_export` and set it. Now there is nobody pointing to the old exports.
|
||||||
|
4. Free the withdrawn routes.
|
||||||
|
5. Free the old exports, removing them also from the first-last list of exports for the same destination.
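
The first two steps boil down to finding a minimum. The sketch below works on bare sequence numbers instead of pointers; the function names are hypothetical, and it assumes that a table with no exporting channel may release everything up to its newest export.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Oldest release point among the channels. Every channel guarantees that it
 * no longer touches its last_export or anything older.
 */
uint64_t oldest_release_point(const uint64_t *last_export_seq, size_t n_channels,
                              uint64_t newest_seq)
{
  uint64_t min = newest_seq;  /* assumption: no channels means all is releasable */

  for (size_t i = 0; i < n_channels; i++)
    if (last_export_seq[i] < min)
      min = last_export_seq[i];

  return min;
}

/* Number of exports in [first_seq, release_seq]; zero means nothing to clean. */
uint64_t exports_to_free(uint64_t first_seq, uint64_t release_seq)
{
  return (release_seq < first_seq) ? 0 : release_seq - first_seq + 1;
}
```

A single slow channel therefore holds back the cleanup for the whole table, which is one reason why the journal may grow under load.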
|
||||||
|
|
||||||
|
## Results of these changes
|
||||||
|
|
||||||
|
This is the first major step forward. Using just this version may
still be as slow as the single-threaded version, at least if your export filters are trivial.
|
||||||
|
Anyway, the main purpose of this step is not an immediate speedup. It is more
|
||||||
|
of a base for the next steps:
|
||||||
|
|
||||||
|
* Unlocking of pipes should enable parallel execution of all the filters on
|
||||||
|
pipes, limited solely by the principle *one thread for every direction of
|
||||||
|
pipe*.
|
||||||
|
* Conversion of CLI's `show route` to the new feed-export coroutines should
|
||||||
|
enable faster table queries. Moreover, this approach will allow for
|
||||||
|
better splitting of model and view in CLI with a good opportunity to
|
||||||
|
implement more output formats, e.g. JSON.
|
||||||
|
* Unlocking of kernel route synchronization should fix latency issues induced
|
||||||
|
by long-lasting kernel queries.
|
||||||
|
* Partial unlocking of BGP packet processing should allow for parallel
|
||||||
|
execution in almost all phases of BGP route propagation.
|
||||||
|
* Partial unlocking of OSPF route recalculation should raise the useful
|
||||||
|
maximums of topology size.
|
||||||
|
|
||||||
|
The development is now being done mostly in the branch `alderney`. If you ask
why there are such strange branch names as `jersey`, `guernsey` and `alderney`, here is
|
||||||
|
a kind-of reason. Yes, these branches could be named `mq-async-export`,
|
||||||
|
`mq-async-export-new`, `mq-async-export-new-new`, `mq-another-async-export` and
|
||||||
|
so on. That's so ugly, isn't it? Let's be creative. *Jersey* is an island where a
|
||||||
|
same-named knit was first produced – and knits are made of *threads*. Then, you
|
||||||
|
just look into a map and find nearby islands.
|
||||||
|
|
||||||
|
Also why so many branches? The development process is quite messy. BIRD's code
|
||||||
|
heavily depends on single-threaded approach. This is (in this case)
|
||||||
|
exceptionally good for performance, as long as you have one thread only. On the
|
||||||
|
other hand, lots of these assumptions are not documented so in many cases one
|
||||||
|
desired change yields a chain of other unforeseen changes which must precede.
|
||||||
|
This brings lots of backtracking, branch rebasing and other Git magic. There is
|
||||||
|
always a can of worms somewhere in the code.
|
||||||
|
|
||||||
|
*It's still a long road to the version 2.1. This series of texts should document
|
||||||
|
what is needed to be changed, why we do it and how. The
|
||||||
|
[previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/)
|
||||||
|
showed the necessary changes in route storage. In the next chapter, we're going
|
||||||
|
to describe how the coroutines are implemented and what kind of locking system
we are employing to prevent deadlocks. Stay tuned!*
|
235
doc/threads/03_coroutines.md
Normal file
@ -0,0 +1,235 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 3: Parallel execution and message passing.
|
||||||
|
|
||||||
|
Parallel execution in BIRD uses an underlying mechanism of dedicated IO loops
|
||||||
|
and hierarchical locks. The original event scheduling module has been converted
|
||||||
|
to do message passing in multithreaded environment. These mechanisms are
|
||||||
|
crucial for understanding what happens inside BIRD and how its internal API changes.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
||||||
|
implemented at the end of the 20th century. We're doing a significant amount of
|
||||||
|
BIRD's internal structure changes to make it run in multiple threads in parallel.
|
||||||
|
|
||||||
|
## Locking and deadlock prevention
|
||||||
|
|
||||||
|
Most of BIRD data structures and algorithms are thread-unsafe and not even
|
||||||
|
reentrant. Checking and possibly updating all of these would take an
|
||||||
|
unreasonable amount of time, thus the multithreaded version uses standard mutexes
|
||||||
|
to lock all the parts which have not been checked and updated yet.
|
||||||
|
|
||||||
|
The authors of original BIRD concepts wisely chose a highly modular structure
|
||||||
|
which allows us to create a hierarchy for locks. The main chokepoint was between
|
||||||
|
protocols and tables and it has been removed by implementing asynchronous exports
|
||||||
|
as described in the [previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/).
|
||||||
|
|
||||||
|
Locks in BIRD (called domains, as they always lock some defined part of BIRD)
|
||||||
|
are partially ordered. Every *domain* has its *type* and all threads are
|
||||||
|
strictly required to lock the domains in the order of their respective types.
|
||||||
|
The full order is defined in `lib/locking.h`. It's forbidden to lock more than
|
||||||
|
one domain of a type (these domains are incomparable) and recursive locking is
|
||||||
|
forbidden as well.
|
||||||
|
|
||||||
|
The locking hierarchy is (roughly; as of February 2022) like this:
|
||||||
|
|
||||||
|
1. The BIRD Lock (for everything not yet checked and/or updated)
|
||||||
|
2. Protocols (as of February 2022, it is BFD, RPKI, Pipe and BGP)
|
||||||
|
3. Routing tables
|
||||||
|
4. Global route attribute cache
|
||||||
|
5. Message passing
|
||||||
|
6. Internals and memory management
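
The ordering rule can be checked mechanically per thread. The following is a minimal sketch, not the code from `lib/locking.h`: the enum names are hypothetical labels for the six levels listed above, `lock_state` models one thread's record of held domains, and no real mutex is taken.

```c
#include <stdbool.h>

/* Hypothetical labels for the levels of the hierarchy, in locking order. */
enum domain_type {
  DOMAIN_THE_BIRD,
  DOMAIN_PROTO,
  DOMAIN_RTABLE,
  DOMAIN_ATTRS,
  DOMAIN_EVENT,
  DOMAIN_RESOURCE,
  DOMAIN_TYPE_COUNT
};

/* One thread's record of which domain types it currently holds. */
struct lock_state {
  bool held[DOMAIN_TYPE_COUNT];
};

/*
 * A domain of type t may be locked only if nothing of the same or a later
 * type is held. This covers the ordering, the one-domain-per-type rule and
 * the ban on recursive locking at once.
 */
bool lock_allowed(const struct lock_state *s, enum domain_type t)
{
  for (int i = t; i < DOMAIN_TYPE_COUNT; i++)
    if (s->held[i])
      return false;

  return true;
}

/* Returns false on a violation; real code would rather fail loudly. */
bool domain_lock(struct lock_state *s, enum domain_type t)
{
  if (!lock_allowed(s, t))
    return false;

  s->held[t] = true;
  return true;
}

void domain_unlock(struct lock_state *s, enum domain_type t)
{
  s->held[t] = false;
}
```

A thread holding a protocol and a routing table may still lock the message passing level, but any attempt to go back up, or to take a second table, is refused.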
|
||||||
|
|
||||||
|
There are heavy checks to ensure proper locking and to help debugging any
|
||||||
|
problem when any code violates the hierarchy rules. This impedes performance
|
||||||
|
depending on how much that domain is contended and in some cases I have already
|
||||||
|
implemented lockless (or partially lockless) data structures to overcome this.
|
||||||
|
|
||||||
|
You may ask, why are these heavy checks then employed in production builds?
|
||||||
|
Risks arising from dropping some locking checks include:
|
||||||
|
|
||||||
|
* deadlocks; these are deadly in BIRD anyway so it should just fail with a meaningful message, or
|
||||||
|
* data corruption; it either kills BIRD anyway, or it results in a slow and vicious death,
|
||||||
|
leaving undebuggable corefiles behind.
|
||||||
|
|
||||||
|
To be honest, I believe in principles like *"every nontrivial software has at least one bug"*
|
||||||
|
and I also don't trust my future self or anybody else to always write bugless code when
|
||||||
|
it comes to proper locking. I also believe that if a lock becomes a bottleneck,
then we should think about what is locked inside and how to optimize that,
possibly implementing a lockless or wait-free data structure instead of dropping
|
||||||
|
thorough consistency checks, especially in a multithreaded environment.
|
||||||
|
|
||||||
|
### Choosing the right locking order
|
||||||
|
|
||||||
|
When considering the locking order of protocols and route tables, the answer
|
||||||
|
was quite easy. We had to make either import or export asynchronous (or both).
|
||||||
|
Major reasons for asynchronous export have been stated in the previous chapter,
|
||||||
|
therefore it makes little sense to allow entering protocol context from table code.
|
||||||
|
|
||||||
|
As I write further in this text, even accessing table context from protocol
|
||||||
|
code leads to contention on table locks, yet for now, it is good enough and the
|
||||||
|
lock order features routing tables after protocols to make the multithreading
|
||||||
|
goal easier to achieve.
|
||||||
|
|
||||||
|
The major lock level is still The BIRD Lock, containing not only the
|
||||||
|
not-yet-converted protocols (like Babel, OSPF or RIP) but also processing CLI
|
||||||
|
commands and reconfiguration. This involves an awful lot of direct access into
|
||||||
|
other contexts which would be unnecessarily complicated to implement by message
|
||||||
|
passing. Therefore, this lock is simply *"the director"*, sitting on the top
|
||||||
|
with its own category.
|
||||||
|
|
||||||
|
The lower lock levels under routing tables are mostly for shared global data
|
||||||
|
structures accessed from everywhere. We'll address some of these later.
|
||||||
|
|
||||||
|
## IO Loop
|
||||||
|
|
||||||
|
There has been a protocol, BFD, running in its own thread since 2013. This
|
||||||
|
separation has a good reason; it needs low latency and the main BIRD loop just
|
||||||
|
walks round-robin around all the available sockets and one round-trip may take
|
||||||
|
a long time (even more than a minute with large configurations). BFD had its
|
||||||
|
own IO loop implementation and simple message passing routines. This code could
|
||||||
|
be easily updated for general use so I did it.
|
||||||
|
|
||||||
|
To understand the internal principles, we should say that in the `master`
|
||||||
|
branch, there is a big loop centered around a `poll()` call, dispatching and
|
||||||
|
executing everything as needed. In the `sark` branch, there are multiple loops
|
||||||
|
of this kind. BIRD has several ways to get something dispatched from a
|
||||||
|
loop.
|
||||||
|
|
||||||
|
1. Requesting to read from a **socket** makes the main loop call your hook when there is some data received.
|
||||||
|
The same happens when a socket refuses to write data. Then the data is buffered and you are called when
|
||||||
|
the buffer is free to continue writing. There is also a third callback, an error hook, for obvious reasons.
|
||||||
|
|
||||||
|
2. Requesting to be called back after a given amount of time. This is called a **timer**.
|
||||||
|
As is common with all timers, they aren't precise and the callback may be
|
||||||
|
delayed significantly. This was also the reason to have BFD loop separate
|
||||||
|
since the very beginning, yet now the abundance of threads may lead to
|
||||||
|
problems with BFD latency in large-scale configurations. We haven't tested
|
||||||
|
this yet.
|
||||||
|
|
||||||
|
3. Requesting to be called back from a clean context when possible. This is
|
||||||
|
useful to run anything not reentrant which might mess with the caller's
|
||||||
|
data, e.g. when a protocol decides to shut down due to some inconsistency
in received data. This is called an **event**.
|
||||||
|
|
||||||
|
4. Requesting to do some work when possible. These are also events; the only
difference is where the event is enqueued. In the main loop, there is a
|
||||||
|
special *work queue* with an execution limit, allowing sockets and timers to be
|
||||||
|
handled with a reasonable latency while still doing all the work needed.
|
||||||
|
Other loops don't have designated work queues (we may add them later).
|
||||||
|
|
||||||
|
All these, sockets, timers and events, are tightly bound to some domain.
|
||||||
|
Sockets typically belong to a protocol, timers and events to a protocol or table.
|
||||||
|
With the modular structure of BIRD, the easy and convenient approach to multithreading
|
||||||
|
is to get more IO loops, each bound to a specific domain, running their events, timers and
|
||||||
|
socket hooks in their threads.
|
||||||
|
|
||||||
|
## Message passing and loop entering
|
||||||
|
|
||||||
|
To request some work in another module, the standard way is to pass a message.
|
||||||
|
For this purpose, events have been modified to be sent to a given loop without
|
||||||
|
locking that loop's domain. In fact, every event queue has its own lock with a
|
||||||
|
low priority, allowing messages to be passed from almost any part of BIRD, and also
|
||||||
|
an assigned loop which executes the events enqueued. When a message is passed
|
||||||
|
to a queue executed by another loop, that target loop must be woken up so we
|
||||||
|
must know what loop to wake up to avoid unnecessary delays. Then the target
|
||||||
|
loop opens its mailbox and processes the task in its context.
|
||||||
|
|
||||||
|
The other way is a direct access of another domain. This approach blocks the
|
||||||
|
appropriate loop from doing anything and we call it *entering a birdloop* to
|
||||||
|
remember that the task must be fast and *leave the birdloop* as soon as possible.
|
||||||
|
Route import is done via direct access from protocols to tables; in large
|
||||||
|
setups with fast filters, this is a major point of contention (after filters
|
||||||
|
have been parallelized) and will be addressed in future optimization efforts.
|
||||||
|
Reconfiguration and interface updates also use direct access; more on that later.
|
||||||
|
In general, this approach should be avoided unless there are good reasons to use it.
|
||||||
|
|
||||||
|
Even though direct access is bad, sending lots of messages may be even worse.
|
||||||
|
Imagine one thousand post(wo)men, coming one by one every minute, ringing your
|
||||||
|
doorbell and delivering one letter each to you. Horrible! Asynchronous message
|
||||||
|
passing works exactly this way. After queuing the message, the source sends a
|
||||||
|
byte to a pipe to wake up the target loop to process the task. We could also
|
||||||
|
periodically poll for messages instead of waking up the targets, yet it would
|
||||||
|
add quite a lot of latency which we also don't like.
|
||||||
|
|
||||||
|
Messages in BIRD don't typically suffer from this problem of sheer volume, and the
overhead is negligible compared to the overall CPU consumption, with one notable
exception: route import/export.
|
||||||
|
|
||||||
|
### Route export message passing
|
||||||
|
|
||||||
|
If we had to send a ping for every route we import to every exporting channel,
|
||||||
|
we'd spend more time pinging than doing anything else. Been there, seen
|
||||||
|
those unbelievable 80%-like figures in Perf output. Never more.
|
||||||
|
|
||||||
|
Route update is quite a complicated process. BIRD must handle large-scale
|
||||||
|
configurations with lots of importers and exporters. Therefore, a
|
||||||
|
triple-indirect delayed route announcement is employed:
|
||||||
|
|
||||||
|
1. First, when a channel imports a route by entering a loop, it sends an event
|
||||||
|
to its own loop (no ping needed in such case). This operation is idempotent,
|
||||||
|
thus for several routes in a row, only one event is enqueued. This reduces
|
||||||
|
several route import announcements (even hundreds in case of massive BGP
|
||||||
|
withdrawals) to one single event.
|
||||||
|
2. When the channel is done importing (or at least takes a coffee break and
|
||||||
|
checks its mailbox), the scheduled event in its own loop is run, sending
|
||||||
|
another event to the table's loop, saying basically *"Hey, table, I've just
|
||||||
|
imported something."*. This event is also idempotent and further reduces
|
||||||
|
route import announcements from multiple sources to one single event.
|
||||||
|
3. The table's announcement event is then executed from its loop, enqueuing export
|
||||||
|
events for all connected channels, finally initiating route exports. As we
|
||||||
|
already know, imports are done by direct access, therefore if protocols keep
|
||||||
|
importing, export announcements are slowed down.
|
||||||
|
4. The actual data on what has been updated is stored in a table journal. This
|
||||||
|
peculiar technique is used only for informing the exporting channels that
|
||||||
|
*"there is something to do"*.
|
||||||
|
|
||||||
|
This may seem overly complicated, yet it should work and it seems to work. In
|
||||||
|
case of low load, all these notifications just come through smoothly. In case
|
||||||
|
of high load, it's common that multiple updates come for the same destination.
|
||||||
|
Delaying the exports allows for the updates to settle down and export just the
|
||||||
|
final result, reducing CPU load and export traffic.
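
The idempotence that makes this cheap can be shown on a toy event queue. The sketch is single-threaded on purpose and all names in it are hypothetical; the real event queues have their own lock and wake the target loop by writing to a pipe, which is modelled here only by the `pings` counter.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_event {
  bool pending;          /* already enqueued and not yet run */
  void (*hook)(void *);
  void *data;
  struct toy_event *next;
};

struct toy_queue {
  struct toy_event *head, **tail;
  unsigned pings;        /* how many times the target loop would be woken up */
};

void evq_init(struct toy_queue *q)
{
  q->head = NULL;
  q->tail = &q->head;
  q->pings = 0;
}

/* Example hook: counts its own invocations in *data. */
void count_hook(void *data)
{
  (*(unsigned *) data)++;
}

/* Idempotent send: an event which is already pending is not enqueued again. */
bool ev_send(struct toy_queue *q, struct toy_event *e)
{
  if (e->pending)
    return false;

  e->pending = true;
  e->next = NULL;
  *q->tail = e;
  q->tail = &e->next;
  q->pings++;
  return true;
}

/* Run everything enqueued so far; returns the number of hooks called. */
unsigned ev_run(struct toy_queue *q)
{
  unsigned n = 0;
  struct toy_event *e = q->head;

  q->head = NULL;
  q->tail = &q->head;

  while (e) {
    struct toy_event *next = e->next;
    e->pending = false;  /* cleared first, so the hook may send the event again */
    e->hook(e->data);
    n++;
    e = next;
  }

  return n;
}
```

Sending the same event a hundred times before the loop runs costs one enqueue and one ping, which is how a burst of imports collapses into a single announcement at each of the levels described above.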
|
||||||
|
|
||||||
|
## Cork
|
||||||
|
|
||||||
|
Route propagation is involved in yet another problem which has to be addressed.
|
||||||
|
In the old versions with synchronous route propagation, all the buffering
|
||||||
|
happened after exporting routes to BGP. When a packet arrived, all the work was
|
||||||
|
done in BGP receive hook – parsing, importing into a table, running all the
|
||||||
|
filters and possibly sending to the peers. No more routes until the previous
|
||||||
|
was done. This self-regulating mechanism doesn't work any more.
|
||||||
|
|
||||||
|
Route table import now returns immediately after inserting the route into a
|
||||||
|
table, creating a buffer there. These buffers have to be processed by other protocols'
|
||||||
|
export events. In large-scale configurations, one route import has to be
|
||||||
|
processed by hundreds, even thousands of exports. Unlimited imports are a major
|
||||||
|
cause of buffer bloating. This is even worse in configurations with pipes,
|
||||||
|
as these multiply the exports by propagating them all the way down to other
|
||||||
|
tables, eventually eating about twice as much memory as the single-threaded version.
|
||||||
|
|
||||||
|
There is therefore a cork to make this stop. Every table is checking how many
|
||||||
|
exports it has pending, and when adding a new export to the queue, it may request
|
||||||
|
a cork, saying simply "please stop the flow for a while". When the export buffer
|
||||||
|
size is reduced low enough, the table uncorks.
|
||||||
|
|
||||||
|
On the other side, there are events and sockets with a cork assigned. When
|
||||||
|
trying to enqueue an event and the cork is applied, the event is instead put
|
||||||
|
into the cork's queue and released only when the cork is released. In case of
|
||||||
|
sockets, when read is indicated or when `poll` arguments are recalculated,
|
||||||
|
the corked socket is simply not checked for received packets, effectively
|
||||||
|
keeping them in the TCP queue and slowing down the flow until cork is released.
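
The table side of the cork behaves like a switch with hysteresis. The sketch below is an assumption, not BIRD's implementation: the `toy_cork` type and the `high` and `low` thresholds are hypothetical, and the text only says that the cork is requested when too many exports are pending and released when the buffer is "reduced low enough".

```c
#include <stdbool.h>

struct toy_cork {
  unsigned pending;  /* exports waiting in the table's queue */
  unsigned high;     /* request the cork at this many pending exports */
  unsigned low;      /* release it when the queue shrinks to this size */
  bool active;
};

/* Called by the table when it adds an export to its queue. */
void cork_export_added(struct toy_cork *c)
{
  c->pending++;
  if (!c->active && c->pending >= c->high)
    c->active = true;
}

/* Called when the table cleans up an export processed by all channels. */
void cork_export_done(struct toy_cork *c)
{
  if (c->pending)
    c->pending--;
  if (c->active && c->pending <= c->low)
    c->active = false;
}

/* Importers (corked events and sockets) check this before doing more work. */
bool cork_may_import(const struct toy_cork *c)
{
  return !c->active;
}
```

Keeping the two thresholds apart avoids flapping: once corked, the importers stay paused until the exporters have drained a meaningful part of the queue.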
|
||||||
|
|
||||||
|
The cork implementation is quite crude and rough and fragile. It may get some
|
||||||
|
rework while stabilizing the multi-threaded version of BIRD or we may even
|
||||||
|
completely drop it for some better mechanism. One of these candidates is this
|
||||||
|
kind of API:
|
||||||
|
|
||||||
|
* (table to protocol) please do not import
|
||||||
|
* (table to protocol) you may resume imports
|
||||||
|
* (protocol to table) not processing any exports
|
||||||
|
* (protocol to table) resuming export processing
|
||||||
|
|
||||||
|
Anyway, cork works as intended in most cases at least for now.
|
||||||
|
|
||||||
|
*It's a long road to the version 2.1. This series of texts should document what
|
||||||
|
is changing, why we do it and how. The
|
||||||
|
[previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
|
||||||
|
shows how the route export had to change to allow parallel execution. In the next chapter, some memory management
|
||||||
|
details are to be explained together with the reasons why memory management matters. Stay tuned!*
|
153
doc/threads/03b_performance.md
Normal file
@ -0,0 +1,153 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 3½: Route server performance
|
||||||
|
|
||||||
|
All the work on multithreading shall be justified by performance improvements.
|
||||||
|
This chapter tries to compare times reached by version 3.0-alpha0 and 2.0.8,
|
||||||
|
showing some data and thinking about them.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
||||||
|
implemented at the end of the 20th century. We're doing a significant amount of
|
||||||
|
BIRD's internal structure changes to make it run in multiple threads in parallel.
|
||||||
|
|
||||||
|
## Testing setup
|
||||||
|
|
||||||
|
There are two machines in one rack. One of these simulates the peers of
|
||||||
|
a route server, the other runs BIRD in a route server configuration. First, the
|
||||||
|
peers are launched, then the route server is started and one of the peers
|
||||||
|
measures the convergence time until routes are fully propagated. Other peers
|
||||||
|
drop all incoming routes.
|
||||||
|
|
||||||
|
There are four configurations. *Single* where all BGPs are directly
|
||||||
|
connected to the main table, *Multi* where every BGP has its own table and
|
||||||
|
filters are done on pipes between them, and finally *Imex* and *Mulimex* which are
|
||||||
|
effectively *Single* and *Multi* where all BGPs also have their auxiliary
|
||||||
|
import and export tables enabled.
|
||||||
|
|
||||||
|
All of these use the same short dummy filter for route import to provide a
|
||||||
|
consistent load. This filter includes no meaningful logic; it's just some dummy
|
||||||
|
data to run the CPU with no memory contention. Real filters also do not suffer from
|
||||||
|
memory contention, with the exception of ROA checks. Optimization of ROA is a
|
||||||
|
task for another day.
|
||||||
|
|
||||||
|
There is also other stuff in BIRD waiting for performance assessment. As the
|
||||||
|
(by far) most demanding setup of BIRD is a route server in an IXP, we chose to
|
||||||
|
optimize and measure BGP and filters first.
|
||||||
|
|
||||||
|
Hardware used for testing is Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 8
|
||||||
|
physical cores, two hyperthreads on each. Memory is 32 GB RAM.
|
||||||
|
|
||||||
|
## Test parameters and statistics
|
||||||
|
|
||||||
|
A BIRD setup may scale on two major axes: the number of peers and the number of routes /
|
||||||
|
destinations. *(There are more axes, e.g.: complexity of filters, routes /
|
||||||
|
destinations ratio, topology size in IGP)*
|
||||||
|
|
||||||
|
Scaling the test on route count is easy, just by adding more routes to the
|
||||||
|
testing peers. Currently, the largest test data I feed BIRD with is about 2M
|
||||||
|
routes for around 800K destinations, due to memory limitations. The routes /
|
||||||
|
destinations ratio is around 2.5 in this testing setup, trying to get close to
|
||||||
|
real-world routing servers.[^1]
|
||||||
|
|
||||||
|
[^1]: BIRD can handle much more in real life, the actual software limit is currently
|
||||||
|
a 32-bit unsigned route counter in the table structure. Hardware capabilities
|
||||||
|
are already there and checking how BIRD handles more than 4G routes is
|
||||||
|
certainly going to be a real thing soon.
|
||||||
|
|
||||||
|
Scaling the test on peer count is easy, until you get to higher numbers. When I
|
||||||
|
was setting up the test, I configured one Linux network namespace for each peer,
|
||||||
|
connecting them by virtual links to a bridge and by a GRE tunnel to the other
|
||||||
|
machine. This works well for 10 peers but setting up and removing 1000 network
|
||||||
|
namespaces takes more than 15 minutes in total. (Note to myself: try this with
|
||||||
|
a newer Linux kernel than 4.9.)
|
||||||
|
|
||||||
|
Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
|
||||||
|
With 1000 peers, version 3.0-alpha0 does more than 600 Mbps traffic in peak
|
||||||
|
which is just about the bandwidth of the whole setup. I'm planning to design a
|
||||||
|
better test setup with fewer chokepoints in the future.
|
||||||
|
|
||||||
|
## Hypothesis
|
||||||
|
|
||||||
|
There are two versions subjected to the test. One of these is `2.0.8` as an
|
||||||
|
initial testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP
|
||||||
|
is implemented there.
|
||||||
|
|
||||||
|
The major problem of large-scale BIRD setups is convergence time on startup. We
|
||||||
|
assume that a multithreaded version should reduce the overall convergence time,
|
||||||
|
at most by a factor equal to the number of cores involved. Here we have 16
|
||||||
|
hyperthreads, so in theory we should reduce the times up to 16-fold, yet this is
|
||||||
|
almost impossible as a non-negligible amount of time is spent in bottleneck
|
||||||
|
code like best route selection or some cleanup routines. This has become a
|
||||||
|
bottleneck only because the other parts now run in parallel.
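
The limit imposed by such serial code is usually estimated by Amdahl's law. The helper below is only an illustration: the function name is made up and the parallel fraction `p` is a free parameter, not something measured on BIRD.

```c
#include <assert.h>

/* Amdahl's law: speedup on n workers when a fraction p of the
 * single-threaded run time can be parallelized (0 <= p <= 1). */
static double amdahl_speedup(double p, unsigned n)
{
  assert(p >= 0.0 && p <= 1.0 && n > 0);
  return 1.0 / ((1.0 - p) + p / n);
}
```

For example, `amdahl_speedup(0.95, 16)` is roughly 9.1, so even a 5% share of serial work rules out anything close to a 16-fold speedup on 16 hyperthreads.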
|
||||||
|
|
||||||
|
## Data
|
||||||
|
|
||||||
|
Four charts are included here, one for each setup. All axes have a
|
||||||
|
logarithmic scale. The route count on the X axis is the total route count in
|
||||||
|
tested BIRD, different color shades belong to different versions and peer
|
||||||
|
counts. Time is plotted on the Y axis.
|
||||||
|
|
||||||
|
Raw data is available in Git, as well as the chart generator. Strange results
|
||||||
|
caused by testbed bugs are already omitted.
|
||||||
|
|
||||||
|
There is also a line drawn at the 2-second mark. Convergence is checked by
|
||||||
|
periodically requesting `birdc show route count` on one of the peers and BGP
|
||||||
|
peers also have a 1-second connect delay time (default is 5 seconds). All
|
||||||
|
measured times shorter than 2 seconds are highly unreliable.
|
||||||
|
|
||||||
|
![Plotted data for Single](03b_stats_2d_single.png)
|
||||||
|
[Plotted data for Single in PDF](03b_stats_2d_single.pdf)
|
||||||
|
|
||||||
|
Single-table setup has times reduced to about 1/8 when comparing 3.0-alpha0 to
|
||||||
|
2.0.8. Speedup for 10-peer setup is slightly worse than expected and there is
|
||||||
|
still some room for improvement, yet 8-fold speedup on 8 physical cores and 16
|
||||||
|
hyperthreads is good for me now.
|
||||||
|
|
||||||
|
The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
|
||||||
|
configuration converges after almost two hours, with the speed of
|
||||||
|
route processing steadily decreasing until only several routes per second are
|
||||||
|
done. Version 3.0-alpha0 is memory-bloating for some non-obvious reason and
|
||||||
|
couldn't fit into 32G RAM. There is definitely some work ahead to stabilize
|
||||||
|
BIRD behavior with extreme setups.
|
||||||
|
|
||||||
|
![Plotted data for Multi](03b_stats_2d_multi.png)
|
||||||
|
[Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)
|
||||||
|
|
||||||
|
Multi-table setup got the same speedup as single-table setup, no big
|
||||||
|
surprise. Largest cases were not tested at all as they don't fit well into 32G
|
||||||
|
RAM even with 2.0.8.
|
||||||
|
|
||||||
|
![Plotted data for Imex](03b_stats_2d_imex.png)
|
||||||
|
[Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)
|
||||||
|
|
||||||
|
![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
|
||||||
|
[Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)
|
||||||
|
|
||||||
|
Setups with import / export tables are also sped up by a factor of
|
||||||
|
about 6-8. Data on largest setups (2M routes) are showing some strangely
|
||||||
|
inefficient behaviour. Considering that both single-table and multi-table
|
||||||
|
setups yield similar performance data, there is probably some unwanted
|
||||||
|
inefficiency in the auxiliary table code.
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
|
||||||
|
some speedup in every testcase and code stability is enough to handle typical
|
||||||
|
use cases. Some test scenarios went out of available memory and there is
|
||||||
|
definitely a lot of work to stabilize this, yet for now it makes no sense to
|
||||||
|
postpone this alpha version any more.
|
||||||
|
|
||||||
|
We don't recommend upgrading a production machine to this version
|
||||||
|
yet. However, if you have a test setup, getting version 3.0-alpha0 there and
|
||||||
|
reporting bugs is very welcome.
|
||||||
|
|
||||||
|
Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its number of
|
||||||
|
threads. It will spawn at least one thread per BGP, RPKI and Pipe
|
||||||
|
protocol, one thread per routing table (including auxiliary tables) and
|
||||||
|
possibly several more. It's up to the machine administrator to set up a limit on
|
||||||
|
CPU core usage by BIRD. When running with many threads and protocols, you may
|
||||||
|
also need to raise the file descriptor limit: BIRD uses 2 file descriptors per
|
||||||
|
thread for internal messaging.
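
An administrator would normally raise this limit from the shell or the service manager. Purely as an illustration of what the limit is, here is a sketch of the underlying POSIX calls. `ensure_fd_limit()` and `estimate_fds()` are hypothetical helpers, not part of BIRD, and the socket count and the slack of 64 descriptors are assumptions.

```c
#include <assert.h>
#include <sys/resource.h>

/* Make sure the soft RLIMIT_NOFILE is at least `wanted`.
 * Returns 0 on success, -1 if the hard limit is too low or a call fails. */
static int ensure_fd_limit(rlim_t wanted)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
    return -1;

  if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= wanted)
    return 0;                       /* already high enough */

  if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < wanted)
    return -1;                      /* raising the hard limit needs privileges */

  rl.rlim_cur = wanted;
  return setrlimit(RLIMIT_NOFILE, &rl);
}

/* Rough estimate: 2 descriptors per thread as stated in the text, plus
 * sockets and some slack (both of these terms are assumptions). */
static rlim_t estimate_fds(unsigned threads, unsigned sockets)
{
  assert(threads > 0);
  return 2 * (rlim_t) threads + sockets + 64;
}
```

With 1000 threads and no sockets counted, this estimate already asks for 2064 descriptors, which is above the common default soft limit of 1024.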
|
||||||
|
|
||||||
|
*It's a long road to the version 3. By releasing this alpha version, we'd like
|
||||||
|
to encourage every user to try this preview. If you want to know more about
|
||||||
|
what is being done and why, you may also check the full
|
||||||
|
[blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*
|
BIN
doc/threads/03b_stats_2d_imex.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_imex.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 160 KiB |
BIN
doc/threads/03b_stats_2d_mulimex.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_mulimex.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 149 KiB |
BIN
doc/threads/03b_stats_2d_multi.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_multi.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 147 KiB |
BIN
doc/threads/03b_stats_2d_single.pdf
Normal file
Binary file not shown.
BIN
doc/threads/03b_stats_2d_single.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 161 KiB |
223
doc/threads/04_memory_management.md
Normal file
@ -0,0 +1,223 @@
|
|||||||
|
# BIRD Journey to Threads. Chapter 4: Memory and other resource management.
|
||||||
|
|
||||||
|
BIRD is mostly a large specialized database engine, storing mega/gigabytes of
|
||||||
|
Internet routing data in memory. To keep account of every byte of allocated data,
|
||||||
|
BIRD has its own resource management system which must be adapted to the
|
||||||
|
multithreaded environment. The resource system has not changed much, yet it
|
||||||
|
deserves a short chapter.
|
||||||
|
|
||||||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
||||||
|
implemented at the end of the 20th century. We're doing a significant amount of
|
||||||
|
BIRD's internal structure changes to make it run in multiple threads in parallel.
|
||||||
|
|
||||||
|
## Resources
|
||||||
|
|
||||||
|
Inside BIRD, (almost) every piece of allocated memory is a resource. To achieve this,
|
||||||
|
every such memory block includes a generic `struct resource` header. The node
|
||||||
|
is linked into the list of a *resource pool* (see below); the class
|
||||||
|
pointer defines basic operations done on resources.
|
||||||
|
|
||||||
|
```
|
||||||
|
typedef struct resource {
|
||||||
|
node n; /* Inside resource pool */
|
||||||
|
struct resclass *class; /* Resource class */
|
||||||
|
} resource;
|
||||||
|
|
||||||
|
struct resclass {
|
||||||
|
char *name; /* Resource class name */
|
||||||
|
unsigned size; /* Standard size of single resource */
|
||||||
|
void (*free)(resource *); /* Freeing function */
|
||||||
|
void (*dump)(resource *); /* Dump to debug output */
|
||||||
|
resource *(*lookup)(resource *, unsigned long); /* Look up address (only for debugging) */
|
||||||
|
struct resmem (*memsize)(resource *); /* Return size of memory used by the resource, may be NULL */
|
||||||
|
};
|
||||||
|
|
||||||
|
void *ralloc(pool *, struct resclass *);
|
||||||
|
```
|
||||||
|
|
||||||
|
A resource's life cycle begins with its allocation. To do that, you should call `ralloc()`,
|
||||||
|
passing the parent pool and the appropriate resource class as arguments. BIRD
|
||||||
|
allocates a memory block of the size given by the class member `size`.
|
||||||
|
The beginning of the block is reserved for `struct resource` itself and initialized
|
||||||
|
by the given arguments. Therefore, you may sometimes see an idiom where a structure
|
||||||
|
has a first member `struct resource r;`, indicating that this item should be
|
||||||
|
allocated as a resource.
|
||||||
|
|
||||||
|
The counterpart is resource freeing. This may be implicit (by resource pool
|
||||||
|
freeing) or explicit (by `rfree()`). In both cases, the `free()` function of
|
||||||
|
the appropriate class is called to clean up the resource before final freeing.
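
To make the idiom more tangible, here is a stripped-down re-creation of it. It is a sketch under loud assumptions: everything prefixed `toy_` is invented for this example, there is no pool and no list node, and memory comes straight from `calloc()`. It only shows how the header as the first member and the class `free` hook work together.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy re-creation of the idiom; simplified and NOT BIRD's implementation. */
struct toy_resource;

struct toy_resclass {
  const char *name;
  size_t size;                            /* size of the whole object */
  void (*free)(struct toy_resource *);    /* class cleanup, may be NULL */
};

struct toy_resource {
  struct toy_resclass *class;
};

static void *toy_ralloc(struct toy_resclass *c)
{
  assert(c->size >= sizeof(struct toy_resource));
  struct toy_resource *r = calloc(1, c->size);
  if (r)
    r->class = c;                         /* header is set up by the allocator */
  return r;
}

static void toy_rfree(void *ptr)
{
  struct toy_resource *r = ptr;
  if (!r)
    return;
  if (r->class->free)
    r->class->free(r);                    /* class-specific cleanup first */
  free(r);
}

/* A user-defined resource: the header is the first member. */
struct toy_timer {
  struct toy_resource r;
  int armed;
};

static int toy_timers_stopped;

static void toy_timer_free(struct toy_resource *r)
{
  struct toy_timer *t = (struct toy_timer *) r;
  t->armed = 0;
  toy_timers_stopped++;
}

static struct toy_resclass toy_timer_class = {
  .name = "ToyTimer",
  .size = sizeof(struct toy_timer),
  .free = toy_timer_free,
};
```

Because the header sits at offset zero, a pointer to the object and a pointer to its header are interchangeable, which is what lets one generic free function handle every class.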
|
||||||
|
|
||||||
|
As for the `dump` and `memsize` calls, there are CLI commands `dump
|
||||||
|
resources` and `show memory`, using these to dump resources or show memory
|
||||||
|
usage as perceived by BIRD.
|
||||||
|
|
||||||
|
The last one, `lookup`, is a rather obsolete way to identify a specific pointer
|
||||||
|
from a debug interface. You may call `rlookup(pointer)` and BIRD should dump
|
||||||
|
that resource to the debug output. This mechanism is probably incomplete as no
|
||||||
|
developer uses it actively for debugging.
|
||||||
|
|
||||||
|
Resources can also be moved between pools by `rmove` when needed.
|
||||||
|
|
||||||
|
## Resource pools
|
||||||
|
|
||||||
|
The first internal resource class is a recursive resource – a resource pool. In
|
||||||
|
the singlethreaded version, this is just a simple structure:
|
||||||
|
|
||||||
|
```
|
||||||
|
struct pool {
|
||||||
|
resource r;
|
||||||
|
list inside;
|
||||||
|
struct birdloop *loop; /* In multithreaded version only */
|
||||||
|
const char *name;
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
Resource pools are used for grouping resources together. There are pools everywhere
|
||||||
|
and it is a common idiom inside BIRD to just `rfree` the appropriate pool when
|
||||||
|
e.g. a protocol or table is going down. Everything left there is cleaned up.
|
||||||
|
|
||||||
|
There are, however, several classes which must be freed with care. In the
|
||||||
|
singlethreaded version, the *slab* allocator (see below) must be empty before
|
||||||
|
it may be freed and this is kept in the multithreaded version while other
|
||||||
|
restrictions have been added.
|
||||||
|
|
||||||
|
There is also a global pool, `root_pool`, containing every single resource BIRD
|
||||||
|
knows about, either directly or via another resource pool.
|
||||||
|
|
||||||
|
### Thread safety in resource pools
|
||||||
|
|
||||||
|
In the multithreaded version, every resource pool is bound to a specific IO
|
||||||
|
loop and therefore includes an IO loop pointer. This is important for allocations
|
||||||
|
as the resource list inside the pool is thread-unsafe. All pool operations
|
||||||
|
therefore require the IO loop to be entered, and this is asserted where possible.
|
||||||
|
(In case of `rfree`, the pool data structure is not accessed at all so no
|
||||||
|
assert is possible. We're currently relying on the caller to ensure proper locking.
|
||||||
|
In future, this may change.)
|
||||||
|
|
||||||
|
Each IO loop also has its base resource pool for its allocations. All pools
|
||||||
|
inside the IO loop pool must belong to the same loop or to a loop with a
|
||||||
|
subordinate lock (see the previous chapter for lock ordering). If there is a
|
||||||
|
need for multiple IO loops to access one shared data structure, it must be
|
||||||
|
locked by another lock and allocated in such a way that is independent of these
|
||||||
|
accessor loops.
|
||||||
|
|
||||||
|
The pool structure should follow the locking order. Any pool should belong to
|
||||||
|
either the same loop as its parent or its loop lock should be after its parent
|
||||||
|
loop lock in the locking order. This is not enforced explicitly, yet it is
|
||||||
|
virtually impossible to write some working code violating this recommendation.
|
||||||
|
|
||||||
|
### Resource pools in the wilderness
|
||||||
|
|
||||||
|
Root pool contains (among others):
|
||||||
|
|
||||||
|
* route attributes and sources
|
||||||
|
* routing tables
|
||||||
|
* protocols
|
||||||
|
* interfaces
|
||||||
|
* configuration data
|
||||||
|
|
||||||
|
Each table has its IO loop and uses the loop base pool for allocations.
|
||||||
|
The same holds for protocols. Each protocol has its pool; it is either its IO
|
||||||
|
loop base pool or an ordinary pool bound to the main loop.
|
||||||
|
|
||||||
|
## Memory allocators
|
||||||
|
|
||||||
|
BIRD stores data in memory blocks allocated by several allocators. There are 3
|
||||||
|
of them: simple memory blocks, linear pools and slabs.
|
||||||
|
|
||||||
|
### Simple memory block
|
||||||
|
|
||||||
|
When just a chunk of memory is needed, `mb_alloc()` or `mb_allocz()` is used
|
||||||
|
to get it. The first has `malloc()` semantics; the other also zeroes the memory.
|
||||||
|
There is also `mb_realloc()` available, `mb_free()` to explicitly free such a
|
||||||
|
memory block and `mb_move()` to move it to another pool.
|
||||||
|
|
||||||
|
Simple memory blocks consume a fixed amount of overhead memory (32 bytes on
|
||||||
|
systems with 64-bit pointers) so they are suitable mostly for big chunks,
|
||||||
|
taking advantage of the default *stdlib* allocator which is used by this
|
||||||
|
allocation strategy. There are, however, some parts of BIRD (in all versions)
|
||||||
|
where this allocator is used for little blocks. This will be fixed some day.
|
||||||
|
|
||||||
|
### Linear pools
|
||||||
|
|
||||||
|
Sometimes, memory is allocated temporarily. When the data may just sit on
|
||||||
|
stack, we put it there. However, many tasks need more structured execution where
|
||||||
|
stack allocation is inconvenient or even impossible (e.g. when callbacks from
|
||||||
|
parsers are involved). For such a case, a *linpool* is the best choice.
|
||||||
|
|
||||||
|
This data structure allocates memory blocks of requested size with negligible
|
||||||
|
overhead in functions `lp_alloc()` (uninitialized) or `lp_allocz()` (zeroed).
|
||||||
|
There is, however, no `realloc` and no `free` call; to have a larger chunk, you
|
||||||
|
need to allocate another block. All this memory is freed at once by `lp_flush()`
|
||||||
|
when it is no longer needed.
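
The principle is easy to show on a toy version. The sketch below is not BIRD's linpool: the `toy_lp_*` names are invented, and the toy has one fixed chunk that never grows and that a flush keeps for reuse. It only demonstrates the bump allocation and the all-at-once flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy linear pool with one fixed chunk; hypothetical, not BIRD's linpool. */
struct toy_linpool {
  char *base;
  size_t size, used;
};

static int toy_lp_init(struct toy_linpool *lp, size_t size)
{
  lp->base = malloc(size);
  lp->size = size;
  lp->used = 0;
  return lp->base ? 0 : -1;
}

/* Bump allocation: align the offset, hand out the next bytes. */
static void *toy_lp_alloc(struct toy_linpool *lp, size_t len)
{
  assert(lp->base != NULL);
  size_t align = _Alignof(max_align_t);
  size_t start = (lp->used + align - 1) & ~(align - 1);
  if (start > lp->size || len > lp->size - start)
    return NULL;                 /* the toy does not grow */
  lp->used = start + len;
  return lp->base + start;
}

static void *toy_lp_allocz(struct toy_linpool *lp, size_t len)
{
  void *p = toy_lp_alloc(lp, len);
  return p ? memset(p, 0, len) : NULL;
}

/* Forget all allocations at once; the chunk itself is kept for reuse. */
static void toy_lp_flush(struct toy_linpool *lp)
{
  lp->used = 0;
}
```

Allocation is a few arithmetic operations and the flush is a single assignment, which is why this strategy suits short-lived parser and filter data.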
|
||||||
|
|
||||||
|
You may see linpools in parsers (BGP, Linux netlink, config) or in filters.
|
||||||
|
|
||||||
|
In the multithreaded version, linpools have received an update, allocating
|
||||||
|
memory pages directly by `mmap()` instead of calling `malloc()`. More on memory
|
||||||
|
pages below.
|
||||||
|
|
||||||
|
### Slabs
|
||||||
|
|
||||||
|
To allocate lots of same-sized objects, a [slab allocator](https://en.wikipedia.org/wiki/Slab_allocation)
|
||||||
|
is an ideal choice. In versions up to 2.0.8, our slab allocator used blocks
|
||||||
|
allocated by `malloc()`, every object included a *slab head* pointer and free objects
|
||||||
|
were linked into a singly linked list. This led to memory inefficiency and to
|
||||||
|
counter-intuitive behavior where a use-after-free bug could do lots of damage
|
||||||
|
before finally crashing.
|
||||||
|
|
||||||
|
Versions from 2.0.9, and also all the multithreaded versions, come with
|
||||||
|
slabs using directly allocated memory pages and usage bitmaps instead of
|
||||||
|
single-linking the free objects. This approach however relies on the fact that
|
||||||
|
pointers returned by `mmap()` are always divisible by page size. Freeing of a
|
||||||
|
slab object involves zeroing (mostly) the 13 least significant bits of its pointer
|
||||||
|
to get the page pointer where the slab head resides.
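
The pointer trick itself fits into one expression. The helper below is a hypothetical illustration, not a function from BIRD; it assumes that the page size is a power of two and that pages start at addresses divisible by the page size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: find the start of the page an object lives in.
 * `page_size` must be a power of two and pages must be page-aligned. */
static void *toy_page_of(const void *obj, uintptr_t page_size)
{
  assert(page_size && (page_size & (page_size - 1)) == 0);
  return (void *) ((uintptr_t) obj & ~(page_size - 1));
}
```

Zeroing 13 bits corresponds to a page size of 8192 bytes: an object at address `0x12345` then belongs to the page starting at `0x12000`.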
|
||||||
|
|
||||||
|
This update helps with memory consumption by about 5% compared to previous
|
||||||
|
versions; exact numbers depend on the usage pattern.
|
||||||
|
|
||||||
|
## Raw memory pages
|
||||||
|
|
||||||
|
Until 2.0.8 (incl.), BIRD allocated all memory by `malloc()`. This method is
|
||||||
|
suitable for lots of use cases, yet when gigabytes of memory should be
|
||||||
|
allocated by little pieces, BIRD uses its internal allocators to keep track
|
||||||
|
of everything. This brings some inefficiency as the stdlib allocator has its
|
||||||
|
own overhead and doesn't allocate aligned memory unless asked for.
|
||||||
|
|
||||||
|
Slabs and linear pools are backed by blocks of memory of kilobyte sizes. As a
|
||||||
|
typical memory page size is 4 kB, it is a logical step to drop stdlib
|
||||||
|
allocation from these allocators and to use `mmap()` directly. This however has
|
||||||
|
some drawbacks, most notably the need for a syscall for every memory mapping and
|
||||||
|
unmapping. For allocations, this is not much of an issue and the syscall time is typically
|
||||||
|
negligible compared to computation time. When freeing memory, this is much
|
||||||
|
worse as BIRD sometimes frees gigabytes of data in the blink of an eye.
|
||||||
|
|
||||||
|
To minimize the needed number of syscalls, there is a per-thread page cache,
|
||||||
|
keeping pages for future use:
|
||||||
|
|
||||||
|
* When a new page is requested, first the page cache is tried.
|
||||||
|
* When a page is freed, the per-thread page cache keeps it without telling the kernel.
|
||||||
|
* When the number of pages in any per-thread page cache leaves a pre-defined range,
|
||||||
|
a cleanup routine is scheduled to free excessive pages or request more in advance.
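
The first two rules can be sketched in a few lines. This is a toy written for this text, not BIRD's implementation: the `toy_*` names are made up, the cleanup routine from the third rule is left out, and pages are never returned to the kernel.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Toy per-thread page cache; a simplification, not BIRD's implementation. */
struct toy_free_page {
  struct toy_free_page *next;
};

static _Thread_local struct toy_free_page *toy_cache;
static _Thread_local unsigned toy_cached;

static void *toy_page_get(void)
{
  if (toy_cache) {                       /* try the cache first */
    struct toy_free_page *p = toy_cache;
    toy_cache = p->next;
    toy_cached--;
    return p;
  }

  /* Cache empty: ask the kernel for a fresh page. */
  void *p = mmap(NULL, (size_t) sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return (p == MAP_FAILED) ? NULL : p;
}

static void toy_page_put(void *page)
{
  assert(page != NULL);
  struct toy_free_page *p = page;        /* keep it, no munmap() here */
  p->next = toy_cache;
  toy_cache = p;
  toy_cached++;
}
```

The free list is threaded through the cached pages themselves, so the cache needs no extra memory, and being thread-local it needs no locking either.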
|
||||||
|
|
||||||
|
This method gives the multithreaded BIRD not only faster memory management than
|
||||||
|
ever before but also almost immediate shutdown times as the cleanup routine is
|
||||||
|
not scheduled on shutdown at all.
|
||||||
|
|
||||||
|
## Other resources
|
||||||
|
|
||||||
|
Some objects are not only a piece of memory; notable items are sockets, owning
|
||||||
|
the underlying mechanism of I/O, and *object locks*, owning *the right to use a
|
||||||
|
specific I/O*. This ensures that collisions on e.g. TCP port numbers and
|
||||||
|
addresses are resolved in a predictable way.
|
||||||
|
|
||||||
|
All these resources should be used with the same locking principles as the
|
||||||
|
memory blocks. There aren't many checks inside BIRD code to ensure that yet,
|
||||||
|
nevertheless violating this recommendation may lead to multiple-access issues.
|
||||||
|
|
||||||
|
*It's still a long road to the version 2.1. This series of texts should document
|
||||||
|
what needs to be changed, why we do it and how. The
|
||||||
|
[previous chapter](TODO)
|
||||||
|
showed the locking system and how the parallel execution is done.
|
||||||
|
The next chapter will give a more detailed explanation of route sources
|
||||||
|
and route attributes and how lockless data structures are employed there. Stay tuned!*
|
29
doc/threads/Makefile
Normal file
@ -0,0 +1,29 @@
|
|||||||
|
SUFFICES := .pdf -wordpress.html
|
||||||
|
CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export 03_coroutines 03b_performance
|
||||||
|
|
||||||
|
all: $(foreach ch,$(CHAPTERS),$(addprefix $(ch),$(SUFFICES)))
|
||||||
|
|
||||||
|
00_the_name_of_the_game.pdf: 00_filter_structure.png
|
||||||
|
|
||||||
|
%.pdf: %.md
|
||||||
|
pandoc -f markdown -t latex -o $@ $<
|
||||||
|
|
||||||
|
%.html: %.md
|
||||||
|
pandoc -f markdown -t html5 -o $@ $<
|
||||||
|
|
||||||
|
%-wordpress.html: %.html Makefile
|
||||||
|
sed -r 's#</p>#\n#g; s#<p>##g; s#<(/?)code>#<\1tt>#g; s#<pre><tt>#<code>#g; s#</tt></pre>#</code>#g; s#</?figure>##g; s#<figcaption>#<p style="text-align: center">#; s#</figcaption>#</p>#; ' $< > $@
|
||||||
|
|
||||||
|
stats-%.csv: stats.csv stats-filter.pl
|
||||||
|
perl stats-filter.pl $< $* > $@
|
||||||
|
|
||||||
|
STATS_VARIANTS := multi imex mulimex single
|
||||||
|
stats-all: $(patsubst %,stats-%.csv,$(STATS_VARIANTS))
|
||||||
|
|
||||||
|
stats-2d-%.pdf: stats.csv stats-filter-2d.pl
|
||||||
|
perl stats-filter-2d.pl $< $* $@
|
||||||
|
|
||||||
|
stats-2d-%.png: stats-2d-%.pdf
|
||||||
|
gs -dBATCH -dNOPAUSE -sDEVICE=pngalpha -sOutputFile=$@ -r300 $<
|
||||||
|
|
||||||
|
stats-all-2d: $(foreach suf,pdf png,$(patsubst %,stats-2d-%.$(suf),$(STATS_VARIANTS)))
|
41
doc/threads/stats-draw.gnuplot
Normal file
@ -0,0 +1,41 @@
|
|||||||
|
set datafile columnheaders
|
||||||
|
set datafile separator ";"
|
||||||
|
#set isosample 15
|
||||||
|
set dgrid3d 8,8
|
||||||
|
set logscale
|
||||||
|
set view 80,15,1,1
|
||||||
|
set autoscale xy
|
||||||
|
#set pm3d
|
||||||
|
|
||||||
|
set term pdfcairo size 20cm,15cm
|
||||||
|
|
||||||
|
set xlabel "TOTAL ROUTES" offset 0,-1.5
|
||||||
|
set xrange [10000:320000]
|
||||||
|
set xtics offset 0,-0.5
|
||||||
|
set xtics (10000,15000,30000,50000,100000,150000,300000)
|
||||||
|
|
||||||
|
set ylabel "PEERS"
|
||||||
|
#set yrange [10:320]
|
||||||
|
#set ytics (10,15,30,50,100,150,300)
|
||||||
|
set yrange [10:320]
|
||||||
|
set ytics (10,15,30,50,100,150,300)
|
||||||
|
|
||||||
|
set zrange [1:2000]
|
||||||
|
set xyplane at 1
|
||||||
|
|
||||||
|
set border 895
|
||||||
|
|
||||||
|
#set grid ztics lt 20
|
||||||
|
|
||||||
|
set output ARG1 . "-" . ARG4 . ".pdf"
|
||||||
|
|
||||||
|
splot \
|
||||||
|
ARG1 . ".csv" \
|
||||||
|
using "TOTAL_ROUTES":"PEERS":ARG2."/".ARG4 \
|
||||||
|
with lines \
|
||||||
|
title ARG2."/".ARG4, \
|
||||||
|
"" \
|
||||||
|
using "TOTAL_ROUTES":"PEERS":ARG3."/".ARG4 \
|
||||||
|
with lines \
|
||||||
|
title ARG3."/".ARG4
|
||||||
|
|
156
doc/threads/stats-filter-2d.pl
Normal file
@ -0,0 +1,156 @@
|
|||||||
|
#!/usr/bin/perl
|
||||||
|
|
||||||
|
use common::sense;
|
||||||
|
use Data::Dump;
|
||||||
|
use List::Util;
|
||||||
|
|
||||||
|
my @GROUP_BY = qw/VERSION PEERS TOTAL_ROUTES/;
|
||||||
|
my @VALUES = qw/TIMEDIF/;
|
||||||
|
|
||||||
|
my ($FILE, $TYPE, $OUTPUT) = @ARGV;
|
||||||
|
|
||||||
|
### Load data ###
|
||||||
|
my %data;
|
||||||
|
open F, "<", $FILE or die $!;
|
||||||
|
my @header = split /;/, <F>;
|
||||||
|
chomp @header;
|
||||||
|
|
||||||
|
my $line = undef;
|
||||||
|
while ($line = <F>)
|
||||||
|
{
|
||||||
|
chomp $line;
|
||||||
|
$line =~ s/;;(.*);;/;;\1;/;
|
||||||
|
$line =~ s/v2\.0\.8-1[89][^;]+/bgp/;
|
||||||
|
$line =~ s/v2\.0\.8-[^;]+/sark/ and next;
|
||||||
|
$line =~ s/master;/v2.0.8;/;
|
||||||
|
my %row;
|
||||||
|
@row{@header} = split /;/, $line;
|
||||||
|
push @{$data{join ";", @row{@GROUP_BY}}}, { %row } if $row{TYPE} eq $TYPE;
|
||||||
|
}
|
||||||
|
|
||||||
|
### Do statistics ###
|
||||||
|
sub avg {
|
||||||
|
return List::Util::sum(@_) / @_;
|
||||||
|
}
|
||||||
|
|
||||||
|
sub getinbetween {
|
||||||
|
my $index = shift;
|
||||||
|
my @list = @_;
|
||||||
|
|
||||||
|
return $list[int $index] if $index == int $index;
|
||||||
|
|
||||||
|
my $lower = $list[int $index];
|
||||||
|
my $upper = $list[1 + int $index];
|
||||||
|
|
||||||
|
my $frac = $index - int $index;
|
||||||
|
|
||||||
|
return ($lower * (1 - $frac) + $upper * $frac);
|
||||||
|
}
|
||||||
|
|
||||||
|
sub stats {
|
||||||
|
my $avg = shift;
|
||||||
|
return [0, 0, 0, 0, 0] if @_ <= 1;
|
||||||
|
|
||||||
|
# my $stdev = sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
|
||||||
|
|
||||||
|
my @sorted = sort { $a <=> $b } @_;
|
||||||
|
my $count = scalar @sorted;
|
||||||
|
|
||||||
|
return [
|
||||||
|
getinbetween(($count-1) * 0.25, @sorted),
|
||||||
|
$sorted[0],
|
||||||
|
$sorted[$count-1],
|
||||||
|
getinbetween(($count-1) * 0.75, @sorted),
|
||||||
|
];
|
||||||
|
}
|
||||||
|
|
||||||
|
my %output;
|
||||||
|
my %vers;
|
||||||
|
my %peers;
|
||||||
|
my %stplot;
|
||||||
|
|
||||||
|
STATS:
|
||||||
|
foreach my $k (keys %data)
|
||||||
|
{
|
||||||
|
my %cols = map { my $vk = $_; $vk => [ map { $_->{$vk} } @{$data{$k}} ]; } @VALUES;
|
||||||
|
|
||||||
|
my %avg = map { $_ => avg(@{$cols{$_}})} @VALUES;
|
||||||
|
my %stloc = map { $_ => stats($avg{$_}, @{$cols{$_}})} @VALUES;
|
||||||
|
|
||||||
|
$vers{$data{$k}[0]{VERSION}}++;
|
||||||
|
$peers{$data{$k}[0]{PEERS}}++;
|
||||||
|
$output{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %avg };
|
||||||
|
$stplot{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %stloc };
|
||||||
|
}
|
||||||
|
|
||||||
|
#(3 == scalar %vers) and $vers{sark} and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
|
||||||
|
(2 == scalar %vers) and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
|
||||||
|
|
||||||
|
### Export the data ###
|
||||||
|
|
||||||
|
open PLOT, "|-", "gnuplot" or die $!;
|
||||||
|
|
||||||
|
say PLOT <<EOF;
|
||||||
|
set logscale
|
||||||
|
|
||||||
|
set term pdfcairo size 20cm,15cm
|
||||||
|
|
||||||
|
set xlabel "Total number of routes" offset 0,-1.5
|
||||||
|
set xrange [10000:3000000]
|
||||||
|
set xtics offset 0,-0.5
|
||||||
|
#set xtics (10000,15000,30000,50000,100000,150000,300000,500000,1000000)
|
||||||
|
|
||||||
|
set ylabel "Time to converge (s)"
|
||||||
|
set yrange [0.5:10800]
|
||||||
|
|
||||||
|
set grid
|
||||||
|
|
||||||
|
set key left top
|
||||||
|
|
||||||
|
set output "$OUTPUT"
|
||||||
|
EOF
|
||||||
|
|
||||||
|
my @colors = (
|
||||||
|
[ 1, 0.9, 0.3 ],
|
||||||
|
[ 0.7, 0, 0 ],
|
||||||
|
# [ 0.6, 1, 0.3 ],
|
||||||
|
# [ 0, 0.7, 0 ],
|
||||||
|
[ 0, 0.7, 1 ],
|
||||||
|
[ 0.3, 0.3, 1 ],
|
||||||
|
);
|
||||||
|
|
||||||
|
my $steps = (scalar %peers) - 1;
|
||||||
|
|
||||||
|
my @plot_data;
|
||||||
|
foreach my $v (sort keys %vers) {
|
||||||
|
my $color = shift @colors;
|
||||||
|
my $endcolor = shift @colors;
|
||||||
|
my $stepcolor = [ map +( ($endcolor->[$_] - $color->[$_]) / $steps ), (0, 1, 2) ];
|
||||||
|
|
||||||
|
foreach my $p (sort { int $a <=> int $b } keys %peers) {
|
||||||
|
my $vnodot = $v; $vnodot =~ s/\.//g;
|
||||||
|
say PLOT "\$data_${vnodot}_${p} << EOD";
|
||||||
|
foreach my $tr (sort { int $a <=> int $b } keys %{$output{$v}{$p}}) {
|
||||||
|
say PLOT "$tr $output{$v}{$p}{$tr}{TIMEDIF}";
|
||||||
|
}
|
||||||
|
say PLOT "EOD";
|
||||||
|
|
||||||
|
say PLOT "\$data_${vnodot}_${p}_stats << EOD";
|
||||||
|
foreach my $tr (sort { int $a <=> int $b } keys %{$output{$v}{$p}}) {
|
||||||
|
say PLOT join " ", ( $tr, @{$stplot{$v}{$p}{$tr}{TIMEDIF}} );
|
||||||
|
}
|
||||||
|
say PLOT "EOD";
|
||||||
|
|
||||||
|
my $colorstr = sprintf "linecolor rgbcolor \"#%02x%02x%02x\"", map +( int($color->[$_] * 255 + 0.5)), (0, 1, 2);
|
||||||
|
push @plot_data, "\$data_${vnodot}_${p} using 1:2 with lines $colorstr linewidth 2 title \"$v, $p peers\"";
|
||||||
|
push @plot_data, "\$data_${vnodot}_${p}_stats with candlesticks $colorstr linewidth 2 notitle \"\"";
|
||||||
|
$color = [ map +( $color->[$_] + $stepcolor->[$_] ), (0, 1, 2) ];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
push @plot_data, "2 with lines lt 1 dashtype 2 title \"Measurement instability\"";
|
||||||
|
|
||||||
|
say PLOT "plot ", join ", ", @plot_data;
|
||||||
|
close PLOT;
|
||||||
|
|
||||||
|
|
84
doc/threads/stats-filter.pl
Normal file
@ -0,0 +1,84 @@
|
|||||||
|
#!/usr/bin/perl
|
||||||
|
|
||||||
|
use common::sense;
|
||||||
|
use Data::Dump;
|
||||||
|
use List::Util;
|
||||||
|
|
||||||
|
my @GROUP_BY = qw/VERSION PEERS TOTAL_ROUTES/;
|
||||||
|
my @VALUES = qw/RSS SZ VSZ TIMEDIF/;
|
||||||
|
|
||||||
|
my ($FILE, $TYPE) = @ARGV;
|
||||||
|
|
||||||
|
### Load data ###
|
||||||
|
my %data;
|
||||||
|
open F, "<", $FILE or die $!;
|
||||||
|
my @header = split /;/, <F>;
|
||||||
|
chomp @header;
|
||||||
|
|
||||||
|
my $line = undef;
|
||||||
|
while ($line = <F>)
|
||||||
|
{
|
||||||
|
chomp $line;
|
||||||
|
my %row;
|
||||||
|
@row{@header} = split /;/, $line;
|
||||||
|
push @{$data{join ";", @row{@GROUP_BY}}}, { %row } if $row{TYPE} eq $TYPE;
|
||||||
|
}
|
||||||
|
|
||||||
|
### Do statistics ###
|
||||||
|
sub avg {
|
||||||
|
return List::Util::sum(@_) / @_;
|
||||||
|
}
|
||||||
|
|
||||||
|
sub stdev {
|
||||||
|
my $avg = shift;
|
||||||
|
return 0 if @_ <= 1;
|
||||||
|
return sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
|
||||||
|
}
|
||||||
|
|
||||||
|
my %output;
|
||||||
|
my %vers;
|
||||||
|
|
||||||
|
STATS:
|
||||||
|
foreach my $k (keys %data)
|
||||||
|
{
|
||||||
|
my %cols = map { my $vk = $_; $vk => [ map { $_->{$vk} } @{$data{$k}} ]; } @VALUES;
|
||||||
|
|
||||||
|
my %avg = map { $_ => avg(@{$cols{$_}})} @VALUES;
|
||||||
|
my %stdev = map { $_ => stdev($avg{$_}, @{$cols{$_}})} @VALUES;
|
||||||
|
|
||||||
|
foreach my $v (@VALUES) {
|
||||||
|
next if $avg{$v} == 0 or $stdev{$v} / $avg{$v} < 0.035;  # skip stable columns; guard against a zero average
|
||||||
|
|
||||||
|
for (my $i=0; $i<@{$cols{$v}}; $i++)
|
||||||
|
{
|
||||||
|
my $dif = $cols{$v}[$i] - $avg{$v};
|
||||||
|
next if $dif < $stdev{$v} * 2 and $dif > $stdev{$v} * (-2);
|
||||||
|
=cut
|
||||||
|
printf "Removing an outlier for %s/%s: avg=%f, stdev=%f, variance=%.1f%%, val=%f, valratio=%.1f%%\n",
|
||||||
|
$k, $v, $avg{$v}, $stdev{$v}, (100 * $stdev{$v} / $avg{$v}), $cols{$v}[$i], (100 * $dif / $stdev{$v});
|
||||||
|
=cut
|
||||||
|
splice @{$data{$k}}, $i, 1, ();
|
||||||
|
redo STATS;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
$vers{$data{$k}[0]{VERSION}}++;
|
||||||
|
$output{"$data{$k}[0]{PEERS};$data{$k}[0]{TOTAL_ROUTES}"}{$data{$k}[0]{VERSION}} = { %avg };
|
||||||
|
}
|
||||||
|
|
||||||
|
### Export the data ###
|
||||||
|
|
||||||
|
say "PEERS;TOTAL_ROUTES;" . join ";", ( map { my $vk = $_; map { "$_/$vk" } keys %vers; } @VALUES );
|
||||||
|
|
||||||
|
sub keysort {
|
||||||
|
my ($pa, $ta) = split /;/, $_[0];
|
||||||
|
my ($pb, $tb) = split /;/, $_[1];
|
||||||
|
|
||||||
|
return (int $ta) <=> (int $tb) if $pa eq $pb;
|
||||||
|
return (int $pa) <=> (int $pb);
|
||||||
|
}
|
||||||
|
|
||||||
|
foreach my $k (sort { keysort($a, $b); } keys %output)
|
||||||
|
{
|
||||||
|
say "$k;" . join ";", ( map { my $vk = $_; map { $output{$k}{$_}{$vk}; } keys %vers; } @VALUES );
|
||||||
|
}
|
1964
doc/threads/stats-longfilters.csv
Normal file
File diff suppressed because it is too large
2086
doc/threads/stats.csv
Normal file
File diff suppressed because it is too large
@ -1,6 +1,6 @
 Summary: BIRD Internet Routing Daemon
 Name: bird
-Version: 2.0.12
+Version: 3.0-alpha0
 Release: 1
 Copyright: GPL
 Group: Networking/Daemons