mirror of
https://gitlab.nic.cz/labs/bird.git
synced 2025-01-03 15:41:54 +00:00
154 lines
7.7 KiB
Markdown
154 lines
7.7 KiB
Markdown
|
# BIRD Journey to Threads. Chapter 3½: Route server performance
|
||
|
|
||
|
All the work on multithreading shall be justified by performance improvements.
|
||
|
This chapter tries to compare times reached by version 3.0-alpha0 and 2.0.8,
|
||
|
showing some data and thinking about them.
|
||
|
|
||
|
BIRD is a fast, robust and memory-efficient routing daemon designed and
|
||
|
implemented at the end of 20th century. We're doing a significant amount of
|
||
|
BIRD's internal structure changes to make it run in multiple threads in parallel.
|
||
|
|
||
|
## Testing setup
|
||
|
|
||
|
There are two machines in one rack. One of these simulates the peers of
|
||
|
a route server, the other runs BIRD in a route server configuration. First, the
|
||
|
peers are launched, then the route server is started and one of the peers
|
||
|
measures the convergence time until routes are fully propagated. Other peers
|
||
|
drop all incoming routes.
|
||
|
|
||
|
There are four configurations. *Single* where all BGPs are directly
|
||
|
connected to the main table, *Multi* where every BGP has its own table and
|
||
|
filters are done on pipes between them, and finally *Imex* and *Mulimex* which are
|
||
|
effectively *Single* and *Multi* where all BGPs have also their auxiliary
|
||
|
import and export tables enabled.
|
||
|
|
||
|
All of these use the same short dummy filter for route import to provide a
|
||
|
consistent load. This filter includes no meaningful logic, it's just some dummy
|
||
|
data to run the CPU with no memory contention. Real filters also do not suffer from
|
||
|
memory contention, with an exception of ROA checks. Optimization of ROA is a
|
||
|
task for another day.
|
||
|
|
||
|
There is also other stuff in BIRD waiting for performance assessment. As the
|
||
|
(by far) most demanding setup of BIRD is route server in IXP, we chose to
|
||
|
optimize and measure BGP and filters first.
|
||
|
|
||
|
Hardware used for testing is Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 8
|
||
|
physical cores, two hyperthreads on each. Memory is 32 GB RAM.
|
||
|
|
||
|
## Test parameters and statistics
|
||
|
|
||
|
BIRD setup may scale on two major axes. Number of peers and number of routes /
|
||
|
destinations. *(There are more axes, e.g.: complexity of filters, routes /
|
||
|
destinations ratio, topology size in IGP)*
|
||
|
|
||
|
Scaling the test on route count is easy, just by adding more routes to the
|
||
|
testing peers. Currently, the largest test data I feed BIRD with is about 2M
|
||
|
routes for around 800K destinations, due to memory limitations. The routes /
|
||
|
destinations ratio is around 2.5 in this testing setup, trying to get close to
|
||
|
real-world routing servers.[^1]
|
||
|
|
||
|
[^1]: BIRD can handle much more in real life, the actual software limit is currently
|
||
|
a 32-bit unsigned route counter in the table structure. Hardware capabilities
|
||
|
are already there and checking how BIRD handles more than 4G routes is
|
||
|
certainly going to be a real thing soon.
|
||
|
|
||
|
Scaling the test on peer count is easy, until you get to higher numbers. When I
|
||
|
was setting up the test, I configured one Linux network namespace for each peer,
|
||
|
connecting them by virtual links to a bridge and by a GRE tunnel to the other
|
||
|
machine. This works well for 10 peers but setting up and removing 1000 network
|
||
|
namespaces takes more than 15 minutes in total. (Note to myself: try this with
|
||
|
a newer Linux kernel than 4.9.)
|
||
|
|
||
|
Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
|
||
|
With 1000 peers, version 3.0-alpha0 does more than 600 Mbps traffic in peak
|
||
|
which is just about the bandwidth of the whole setup. I'm planning to design a
|
||
|
better test setup with less chokepoints in future.
|
||
|
|
||
|
## Hypothesis
|
||
|
|
||
|
There are two versions subjected to the test. One of these is `2.0.8` as an
|
||
|
initial testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP
|
||
|
is implemented there.
|
||
|
|
||
|
The major problem of large-scale BIRD setups is convergence time on startup. We
|
||
|
assume that a multithreaded version should reduce the overall convergence time,
|
||
|
at most by a factor equal to number of cores involved. Here we have 16
|
||
|
hyperthreads, in theory we should reduce the times up to 16-fold, yet this is
|
||
|
almost impossible as a non-negligible amount of time is spent in bottleneck
|
||
|
code like best route selection or some cleanup routines. This has become a
|
||
|
bottleneck by making other parts parallel.
|
||
|
|
||
|
## Data
|
||
|
|
||
|
Four charts are included here, one for each setup. All axes have a
|
||
|
logarithmic scale. The route count on X scale is the total route count in
|
||
|
tested BIRD, different color shades belong to different versions and peer
|
||
|
counts. Time is plotted on Y scale.
|
||
|
|
||
|
Raw data is available in Git, as well as the chart generator. Strange results
|
||
|
caused by testbed bugs are already omitted.
|
||
|
|
||
|
There is also a line drawn on a 2-second mark. Convergence is checked by
|
||
|
periodically requesting `birdc show route count` on one of the peers and BGP
|
||
|
peers have also a 1-second connect delay time (default is 5 seconds). All
|
||
|
measured times shorter than 2 seconds are highly unreliable.
|
||
|
|
||
|
![Plotted data for Single](03b_stats_2d_single.png)
|
||
|
[Plotted data for Single in PDF](03b_stats_2d_single.pdf)
|
||
|
|
||
|
Single-table setup has times reduced to about 1/8 when comparing 3.0-alpha0 to
|
||
|
2.0.8. Speedup for 10-peer setup is slightly worse than expected and there is
|
||
|
still some room for improvement, yet 8-fold speedup on 8 physical cores and 16
|
||
|
hyperthreads is good for me now.
|
||
|
|
||
|
The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
|
||
|
configuration converges after almost two hours on 2.0.8, with the speed of
|
||
|
route processing steadily decreasing until only several routes per second are
|
||
|
done. Version 3.0-alpha0 is memory-bloating for some non-obvious reason and
|
||
|
couldn't fit into 32G RAM. There is definitely some work ahead to stabilize
|
||
|
BIRD behavior with extreme setups.
|
||
|
|
||
|
![Plotted data for Multi](03b_stats_2d_multi.png)
|
||
|
[Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)
|
||
|
|
||
|
Multi-table setup got the same speedup as single-table setup, no big
|
||
|
surprise. Largest cases were not tested at all as they don't fit well into 32G
|
||
|
RAM even with 2.0.8.
|
||
|
|
||
|
![Plotted data for Imex](03b_stats_2d_imex.png)
|
||
|
[Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)
|
||
|
|
||
|
![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
|
||
|
[Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)
|
||
|
|
||
|
Setups with import / export tables are also sped up by a factor
|
||
|
about 6-8. Data on largest setups (2M routes) are showing some strangely
|
||
|
ineffective behaviour. Considering that both single-table and multi-table
|
||
|
setups yield similar performance data, there is probably some unwanted
|
||
|
inefficiency in the auxiliary table code.
|
||
|
|
||
|
## Conclusion
|
||
|
|
||
|
BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
|
||
|
some speedup in every testcase and code stability is enough to handle typical
|
||
|
use cases. Some test scenarios went out of available memory and there is
|
||
|
definitely a lot of work to stabilize this, yet for now it makes no sense to
|
||
|
postpone this alpha version any more.
|
||
|
|
||
|
We don't recommend upgrading a production machine to this version
|
||
|
yet, anyway if you have a test setup, getting version 3.0-alpha0 there and
|
||
|
reporting bugs is much welcome.
|
||
|
|
||
|
Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its number of
|
||
|
threads. It will spawn at least one thread per every BGP, RPKI and Pipe
|
||
|
protocol, one thread per every routing table (including auxiliary tables) and
|
||
|
possibly several more. It's up to the machine administrator to setup a limit on
|
||
|
CPU core usage by BIRD. When running with many threads and protocols, you may
|
||
|
need also to raise the filedescriptor limit: BIRD uses 2 filedescriptors per
|
||
|
every thread for internal messaging.
|
||
|
|
||
|
*It's a long road to the version 3. By releasing this alpha version, we'd like
|
||
|
to encourage every user to try this preview. If you want to know more about
|
||
|
what is being done and why, you may also check the full
|
||
|
[blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*
|