
Blogpost about performance + data.

Maria Matejka, 2022-02-07 22:27:42 +01:00
commit a6fc31f153, parent 3f6462ad35
12 changed files with 2287 additions and 29 deletions

@@ -0,0 +1,153 @@
# BIRD Journey to Threads. Chapter 3½: Route server performance

All the work on multithreading shall be justified by performance improvements.
This chapter compares convergence times achieved by versions 3.0-alpha0 and
2.0.8, showing some data and thinking about them.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're making a significant amount
of changes to BIRD's internal structure to make it run in multiple threads in
parallel.
## Testing setup

There are two machines in one rack. One of them simulates the peers of
a route server, the other runs BIRD in a route server configuration. First, the
peers are launched, then the route server is started and one of the peers
measures the convergence time until routes are fully propagated. The other
peers drop all incoming routes.
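
For illustration, a peer on the simulating machine might look like the
following sketch in BIRD 2 configuration syntax. This is not the actual test
configuration; the protocol name, addresses and AS numbers are made up:

```
# Hypothetical peer-side session: feed test routes to the route server
# and drop everything it sends back.
protocol bgp to_rs {
	local 192.0.2.101 as 64601;
	neighbor 192.0.2.1 as 64512;
	ipv4 {
		import none;	# drop all incoming routes
		export all;	# announce the test routes, e.g. from a static protocol
	};
}
```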
There are four configurations: *Single*, where all BGPs are directly
connected to the main table; *Multi*, where every BGP has its own table and
filters run on the pipes between them; and finally *Imex* and *Mulimex*, which
are effectively *Single* and *Multi* with the auxiliary import and export
tables enabled on all BGPs.
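
A minimal sketch of how the *Single* and *Multi* layouts might be written in
BIRD 2 configuration syntax; the table, protocol and filter names, addresses
and AS numbers are again made up for this example:

```
# *Single*: the BGP imports directly into the main table through the filter.
protocol bgp peer1_single {
	local 192.0.2.1 as 64512;
	neighbor 192.0.2.101 as 64601;
	ipv4 {
		import filter dummy_filter;
		export all;
		# *Imex* / *Mulimex* additionally enable the auxiliary tables:
		# import table on;
		# export table on;
	};
}

# *Multi*: every BGP gets its own table, connected to the main table by a pipe.
ipv4 table peer1_tab;

protocol pipe peer1_pipe {
	table peer1_tab;
	peer table master4;
	import all;			# master4 -> peer1_tab
	export filter dummy_filter;	# peer1_tab -> master4, the filtering load
}

protocol bgp peer1_multi {
	local 192.0.2.1 as 64512;
	neighbor 192.0.2.101 as 64601;
	ipv4 { table peer1_tab; import all; export all; };
}
```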
All of these use the same short dummy filter for route import to provide a
consistent load. This filter contains no meaningful logic; it's just some dummy
operations to keep the CPU busy without any memory contention. Real filters
also don't suffer from memory contention, with the exception of ROA checks.
Optimizing ROA is a task for another day.
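
The actual filter isn't part of this commit; a hypothetical stand-in with the
same character could be just a few register-only integer operations:

```
# Made-up example of a short dummy import filter: pointless arithmetic
# to keep the CPU busy without touching any shared memory.
filter dummy_filter
{
	int x;
	x = 42;
	x = x * 1337;
	if x < 0 then reject;
	accept;
}
```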
There is also other stuff in BIRD waiting for performance assessment. As the
(by far) most demanding BIRD setup is a route server in an IXP, we chose to
optimize and measure BGP and filters first.
The hardware used for testing is an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
with 8 physical cores, two hyperthreads on each, and 32 GB RAM.
## Test parameters and statistics

A BIRD setup may scale on two major axes: number of peers and number of routes
/ destinations. *(There are more axes, e.g. complexity of filters, routes /
destinations ratio, topology size in IGP.)*
Scaling the test on route count is easy, just by adding more routes to the
testing peers. Currently, the largest test data I feed BIRD with is about 2M
routes for around 800K destinations, due to memory limitations. The routes /
destinations ratio is around 2.5 in this testing setup, trying to get close to
real-world route servers.[^1]
[^1]: BIRD can handle much more in real life; the actual software limit is
currently a 32-bit unsigned route counter in the table structure. Hardware
capabilities are already there and checking how BIRD handles more than 4G
routes is certainly going to be a real thing soon.
Scaling the test on peer count is also easy, until you get to higher numbers.
When I was setting up the test, I configured one Linux network namespace for
each peer, connecting them by virtual links to a bridge and by a GRE tunnel to
the other machine. This works well for 10 peers, but setting up and removing
1000 network namespaces takes more than 15 minutes in total. (Note to self:
try this with a newer Linux kernel than 4.9.)
Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
With 1000 peers, version 3.0-alpha0 generates more than 600 Mbps of traffic at
peak, which is just about the bandwidth of the whole setup. I'm planning to
design a better test setup with fewer chokepoints in the future.
## Hypothesis

Two versions are subjected to the test. One of them is `2.0.8` as an initial
testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP is
implemented there.
The major problem of large-scale BIRD setups is convergence time on startup.
We assume that a multithreaded version should reduce the overall convergence
time, at most by a factor equal to the number of cores involved. Here we have
16 hyperthreads, so in theory the times could drop up to 16-fold, yet this is
almost impossible as a non-negligible amount of time is spent in bottleneck
code like best route selection or some cleanup routines. These parts have
become bottlenecks precisely because the rest now runs in parallel.
## Data

Four charts are included here, one for each setup. All axes have a logarithmic
scale. The X axis shows the total route count in the tested BIRD, different
color shades belong to different versions and peer counts, and time to
converge is plotted on the Y axis.

Raw data is available in Git, as well as the chart generator. Strange results
caused by testbed bugs have already been omitted.
There is also a line drawn at the 2-second mark. Convergence is checked by
periodically requesting `birdc show route count` on one of the peers, and the
BGP peers also have a 1-second connect delay time (the default is 5 seconds).
All measured times shorter than 2 seconds are therefore highly unreliable.
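
The shortened connect delay is a regular BGP protocol option; assuming a peer
definition like the sketch above, it might be set like this:

```
protocol bgp to_rs {
	connect delay time 1;	# default is 5 seconds
	local 192.0.2.101 as 64601;
	neighbor 192.0.2.1 as 64512;
	ipv4 { import none; export all; };
}
```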
![Plotted data for Single](03b_stats_2d_single.png)
[Plotted data for Single in PDF](03b_stats_2d_single.pdf)
The single-table setup's times are reduced to about 1/8 when comparing
3.0-alpha0 to 2.0.8. The speedup for the 10-peer setup is slightly worse than
expected and there is still some room for improvement, yet an 8-fold speedup
on 8 physical cores and 16 hyperthreads is good enough for me now.
The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
configuration converges after almost two hours, with the speed of route
processing steadily decreasing until only several routes per second get
processed. Version 3.0-alpha0 bloats in memory for some non-obvious reason and
couldn't fit into 32 GB RAM. There is definitely some work ahead to stabilize
BIRD's behavior in extreme setups.
![Plotted data for Multi](03b_stats_2d_multi.png)
[Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)
The multi-table setup got the same speedup as the single-table setup, no big
surprise. The largest cases were not tested at all as they don't fit well into
32 GB RAM even with 2.0.8.
![Plotted data for Imex](03b_stats_2d_imex.png)
[Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)
![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
[Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)
Setups with import / export tables are also sped up, by a factor of about 6-8.
Data for the largest setups (2M routes) show some strangely inefficient
behaviour. Considering that both the single-table and multi-table setups yield
similar performance data, there is probably some unwanted inefficiency in the
auxiliary table code.
## Conclusion

BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
some speedup in every testcase and code stability is sufficient for typical
use cases. Some test scenarios ran out of available memory and there is
definitely a lot of work ahead to stabilize this, yet for now it makes no
sense to postpone this alpha version any longer.
We don't recommend upgrading a production machine to this version yet;
however, if you have a test setup, deploying version 3.0-alpha0 there and
reporting bugs is very welcome.
Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its
number of threads. It will spawn at least one thread for every BGP, RPKI and
Pipe protocol, one thread for every routing table (including auxiliary tables)
and possibly several more. It's up to the machine administrator to set up a
limit on CPU core usage by BIRD. When running with many threads and protocols,
you may also need to raise the file descriptor limit: BIRD uses 2 file
descriptors per thread for internal messaging.
*It's a long road to version 3. By releasing this alpha version, we'd like to
encourage every user to try this preview. If you want to know more about what
is being done and why, you may also check the full
[blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*

*(Binary files added: the four charts referenced above as PNG images, between 147 and 161 KiB each, plus the matching PDF files; contents not shown.)*

```diff
@@ -1,5 +1,5 @@
 SUFFICES := .pdf -wordpress.html
 
-CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export 03_coroutines
+CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export 03_coroutines 03b_performance
 
 all: $(foreach ch,$(CHAPTERS),$(addprefix $(ch),$(SUFFICES)))
```

```diff
@@ -21,7 +21,7 @@ while ($line = <F>)
 	chomp $line;
 	$line =~ s/;;(.*);;/;;\1;/;
 	$line =~ s/v2\.0\.8-1[89][^;]+/bgp/;
-	$line =~ s/v2\.0\.8-[^;]+/sark/;
+	$line =~ s/v2\.0\.8-[^;]+/sark/ and next;
 	$line =~ s/master;/v2.0.8;/;
 	my %row;
 	@row{@header} = split /;/, $line;
@@ -33,15 +33,41 @@ sub avg {
 	return List::Util::sum(@_) / @_;
 }
 
-sub stdev {
+sub getinbetween {
+	my $index = shift;
+	my @list = @_;
+
+	return $list[int $index] if $index == int $index;
+
+	my $lower = $list[int $index];
+	my $upper = $list[1 + int $index];
+	my $frac = $index - int $index;
+
+	return ($lower * (1 - $frac) + $upper * $frac);
+}
+
+sub stats {
 	my $avg = shift;
-	return 0 if @_ <= 1;
-	return sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
+	return [0, 0, 0, 0, 0] if @_ <= 1;
+
+#	my $stdev = sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
+
+	my @sorted = sort { $a <=> $b } @_;
+	my $count = scalar @sorted;
+
+	return [
+		getinbetween(($count-1) * 0.25, @sorted),
+		$sorted[0],
+		$sorted[$count-1],
+		getinbetween(($count-1) * 0.75, @sorted),
+	];
 }
 
 my %output;
 my %vers;
 my %peers;
+my %stplot;
 
 STATS:
 foreach my $k (keys %data)
@@ -49,30 +75,16 @@ foreach my $k (keys %data)
 	my %cols = map { my $vk = $_; $vk => [ map { $_->{$vk} } @{$data{$k}} ]; } @VALUES;
 	my %avg = map { $_ => avg(@{$cols{$_}})} @VALUES;
-	my %stdev = map { $_ => stdev($avg{$_}, @{$cols{$_}})} @VALUES;
-
-	foreach my $v (@VALUES) {
-		next if $stdev{$v} / $avg{$v} < 0.035;
-		for (my $i=0; $i<@{$cols{$v}}; $i++)
-		{
-			my $dif = $cols{$v}[$i] - $avg{$v};
-			next if $dif < $stdev{$v} * 2 and $dif > $stdev{$v} * (-2);
-=cut
-			printf "Removing an outlier for %s/%s: avg=%f, stdev=%f, variance=%.1f%%, val=%f, valratio=%.1f%%\n",
-				$k, $v, $avg{$v}, $stdev{$v}, (100 * $stdev{$v} / $avg{$v}), $cols{$v}[$i], (100 * $dif / $stdev{$v});
-=cut
-			splice @{$data{$k}}, $i, 1, ();
-			redo STATS;
-		}
-	}
+	my %stloc = map { $_ => stats($avg{$_}, @{$cols{$_}})} @VALUES;
 
 	$vers{$data{$k}[0]{VERSION}}++;
 	$peers{$data{$k}[0]{PEERS}}++;
 	$output{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %avg };
+	$stplot{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %stloc };
 }
 
-(3 == scalar %vers) and $vers{sark} and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
+#(3 == scalar %vers) and $vers{sark} and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
+(2 == scalar %vers) and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
 
 ### Export the data ###
@@ -84,9 +96,9 @@ set logscale
 set term pdfcairo size 20cm,15cm
 
 set xlabel "Total number of routes" offset 0,-1.5
-set xrange [10000:1500000]
+set xrange [10000:3000000]
 set xtics offset 0,-0.5
-set xtics (10000,15000,30000,50000,100000,150000,300000,500000,1000000)
+#set xtics (10000,15000,30000,50000,100000,150000,300000,500000,1000000)
 
 set ylabel "Time to converge (s)"
 set yrange [0.5:10800]
@@ -99,10 +111,10 @@ set output "$OUTPUT"
 EOF
 
 my @colors = (
-	[ 1, 0.3, 0.3 ],
-	[ 1, 0.7, 0 ],
-	[ 0.3, 1, 0 ],
-	[ 0, 1, 0.3 ],
+	[ 1, 0.9, 0.3 ],
+	[ 0.7, 0, 0 ],
+#	[ 0.6, 1, 0.3 ],
+#	[ 0, 0.7, 0 ],
 	[ 0, 0.7, 1 ],
 	[ 0.3, 0.3, 1 ],
 );
@@ -123,8 +135,15 @@ foreach my $v (sort keys %vers) {
 		}
 		say PLOT "EOD";
 
+		say PLOT "\$data_${vnodot}_${p}_stats << EOD";
+		foreach my $tr (sort { int $a <=> int $b } keys %{$output{$v}{$p}}) {
+			say PLOT join " ", ( $tr, @{$stplot{$v}{$p}{$tr}{TIMEDIF}} );
+		}
+		say PLOT "EOD";
+
 		my $colorstr = sprintf "linecolor rgbcolor \"#%02x%02x%02x\"", map +( int($color->[$_] * 255 + 0.5)), (0, 1, 2);
 		push @plot_data, "\$data_${vnodot}_${p} using 1:2 with lines $colorstr linewidth 2 title \"$v, $p peers\"";
+		push @plot_data, "\$data_${vnodot}_${p}_stats with candlesticks $colorstr linewidth 2 notitle \"\"";
 		$color = [ map +( $color->[$_] + $stepcolor->[$_] ), (0, 1, 2) ];
 	}
 }
```

doc/threads/stats.csv: new file, 2086 lines (diff suppressed because it is too large).