Blogpost about performance + data.

2024-11-18 09:08:42 +00:00 · 2022-02-07 22:27:42 +01:00 · 2022-02-07 22:27:42 +01:00 · a6fc31f153
commit a6fc31f153
parent 3f6462ad35
12 changed files with 2287 additions and 29 deletions
--- a/doc/threads/03b_performance.md
+++ b/doc/threads/03b_performance.md
@ -0,0 +1,153 @@
 # BIRD Journey to Threads. Chapter 3½: Route server performance
 All the work on multithreading must be justified by performance improvements.
 This chapter compares convergence times reached by versions 3.0-alpha0 and 2.0.8,
 showing some data and discussing them.
 BIRD is a fast, robust and memory-efficient routing daemon designed and
 implemented at the end of the 20th century. We're making significant changes to
 BIRD's internal structure to make it run in multiple threads in parallel.
 ## Testing setup
 There are two machines in one rack. One of these simulates the peers of
 a route server, the other runs BIRD in a route server configuration. First, the
 peers are launched, then the route server is started and one of the peers
 measures the convergence time until routes are fully propagated. Other peers
 drop all incoming routes.
 There are four configurations: *Single*, where all BGPs are directly
 connected to the main table; *Multi*, where every BGP has its own table and
 filters are done on pipes between them; and finally *Imex* and *Mulimex*, which are
 effectively *Single* and *Multi* where all BGPs also have their auxiliary
 import and export tables enabled.
 All of these use the same short dummy filter for route import to provide a
 consistent load. This filter includes no meaningful logic; it is just dummy
 work that keeps the CPU busy with no memory contention. Real filters do not suffer from
 memory contention either, with the exception of ROA checks. Optimization of ROA is a
 task for another day.
 There are also other parts of BIRD waiting for performance assessment. As a
 route server at an IXP is by far the most demanding BIRD setup, we chose to
 optimize and measure BGP and filters first.
 Hardware used for testing is Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 8
 physical cores, two hyperthreads on each. Memory is 32 GB RAM.
 ## Test parameters and statistics
 A BIRD setup may scale along two major axes: the number of peers and the number of routes /
 destinations. *(There are more axes, e.g. complexity of filters, routes /
 destinations ratio, topology size in IGP.)*
 Scaling the test on route count is easy, just by adding more routes to the
 testing peers. Currently, the largest test data I feed BIRD with is about 2M
 routes for around 800K destinations, due to memory limitations. The routes /
 destinations ratio is around 2.5 in this testing setup, trying to get close to
 real-world route servers.[^1]
 [^1]: BIRD can handle much more in real life, the actual software limit is currently
      a 32-bit unsigned route counter in the table structure. Hardware capabilities
      are already there and checking how BIRD handles more than 4G routes is
      certainly going to be a real thing soon.
 Scaling the test on peer count is easy, until you get to higher numbers. When I
 was setting up the test, I configured one Linux network namespace for each peer,
 connecting them by virtual links to a bridge and by a GRE tunnel to the other
 machine. This works well for 10 peers but setting up and removing 1000 network
 namespaces takes more than 15 minutes in total. (Note to myself: try this with
 a newer Linux kernel than 4.9.)
 Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
 With 1000 peers, version 3.0-alpha0 generates more than 600 Mbps of traffic at peak,
 which is just about the bandwidth of the whole setup. I'm planning to design a
 better test setup with fewer chokepoints in the future.
 ## Hypothesis
 There are two versions subjected to the test. One of these is `2.0.8` as an
 initial testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP
 is implemented there.
 The major problem of large-scale BIRD setups is convergence time on startup. We
 assume that a multithreaded version should reduce the overall convergence time,
 at most by a factor equal to the number of cores involved. Here we have 16
 hyperthreads, so in theory we should reduce the times up to 16-fold. Yet this is
 almost impossible, as a non-negligible amount of time is spent in bottleneck
 code like best route selection or some cleanup routines. These parts have become
 the bottleneck because the other parts now run in parallel.
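The upper bound discussed above is commonly expressed by Amdahl's law. The following is a minimal Python sketch, not part of BIRD or of the test scripts; the function name `amdahl_speedup` and the example parallel fractions are assumptions chosen only for illustration, not measured values.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative fractions only; these are not measurements of BIRD.
    for fraction in (1.0, 0.95, 0.90):
        bound = amdahl_speedup(fraction, workers=16)
        print(f"parallel fraction {fraction:.2f}: at most {bound:.1f}x on 16 threads")
```

With a fully parallel workload the bound is 16-fold; if 10 % of the time were spent in serial code, it would drop to 6.4-fold.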
 ## Data
 Four charts are included here, one for each setup. All axes have a
 logarithmic scale. The route count on the X axis is the total route count in
 the tested BIRD; different color shades belong to different versions and peer
 counts. Time is plotted on the Y axis.
 Raw data is available in Git, as well as the chart generator. Strange results
 caused by testbed bugs are already omitted.
 There is also a line drawn at the 2-second mark. Convergence is checked by
 periodically requesting `birdc show route count` on one of the peers, and BGP
 peers also have a 1-second connect delay time (default is 5 seconds). All
 measured times shorter than 2 seconds are therefore highly unreliable.
 ![Plotted data for Single](03b_stats_2d_single.png)
 [Plotted data for Single in PDF](03b_stats_2d_single.pdf)
 The single-table setup has times reduced to about 1/8 when comparing 3.0-alpha0 to
 2.0.8. The speedup for the 10-peer setup is slightly worse than expected and there is
 still some room for improvement, yet an 8-fold speedup on 8 physical cores and 16
 hyperthreads is good enough for me now.
 The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
 configuration converges after almost two hours, with the speed of
 route processing steadily decreasing until only several routes per second are
 processed. Version 3.0-alpha0 bloats its memory for some non-obvious reason and
 couldn't fit into 32G RAM. There is definitely some work ahead to stabilize
 BIRD behavior with extreme setups.
 ![Plotted data for Multi](03b_stats_2d_multi.png)
 [Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)
 The multi-table setup got the same speedup as the single-table setup, which is no big
 surprise. The largest cases were not tested at all as they don't fit well into 32G
 RAM even with 2.0.8.
 ![Plotted data for Imex](03b_stats_2d_imex.png)
 [Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)
 ![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
 [Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)
 Setups with import / export tables are also sped up by a factor
 of about 6-8. Data on the largest setups (2M routes) show some strangely
 inefficient behaviour. Considering that both single-table and multi-table
 setups yield similar performance data, there is probably some unwanted
 inefficiency in the auxiliary table code.
 ## Conclusion
 BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
 some speedup in every test case and the code is stable enough to handle typical
 use cases. Some test scenarios ran out of available memory and there is
 definitely a lot of work to do to stabilize this, yet for now it makes no sense to
 postpone this alpha version any longer.
 We don't recommend upgrading a production machine to this version
 yet. However, if you have a test setup, installing version 3.0-alpha0 there and
 reporting bugs is very welcome.
 Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its number of
 threads. It will spawn at least one thread for every BGP, RPKI and Pipe
 protocol, one thread for every routing table (including auxiliary tables) and
 possibly several more. It's up to the machine administrator to set up a limit on
 CPU core usage by BIRD. When running with many threads and protocols, you may
 also need to raise the file descriptor limit: BIRD uses 2 file descriptors per
 thread for internal messaging.
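The arithmetic in the notice above can be turned into a rough lower bound. This is a hedged Python sketch: the function names and the example counts are hypothetical, and the result ignores the "possibly several more" threads as well as any descriptors BIRD needs for other purposes such as network sockets.

```python
import resource  # Unix-only standard library module


def min_threads(bgp: int, rpki: int, pipes: int, tables: int) -> int:
    """Lower bound: one thread per BGP, RPKI and Pipe protocol plus one per table."""
    return bgp + rpki + pipes + tables


def min_messaging_fds(threads: int) -> int:
    """Two file descriptors per thread are used for internal messaging."""
    return 2 * threads


if __name__ == "__main__":
    # Hypothetical setup: 1000 BGP peers, no RPKI, no pipes, one main table
    # plus an auxiliary import and export table for every BGP.
    threads = min_threads(bgp=1000, rpki=0, pipes=0, tables=1 + 2 * 1000)
    fds = min_messaging_fds(threads)
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"at least {threads} threads and {fds} messaging descriptors")
    if soft_limit != resource.RLIM_INFINITY and fds > soft_limit:
        print(f"soft limit of {soft_limit} open files is too low")
```

For this hypothetical setup the bound is 3001 threads and 6002 descriptors, which already exceeds a soft limit of 1024 open files if one is in effect.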
 *It's a long road to version 3. By releasing this alpha version, we'd like
 to encourage every user to try this preview. If you want to know more about
 what is being done and why, you may also check the full
 [blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*
--- a/doc/threads/03b_stats_2d_imex.pdf
+++ b/doc/threads/03b_stats_2d_imex.pdf
--- a/doc/threads/03b_stats_2d_imex.png
+++ b/doc/threads/03b_stats_2d_imex.png
--- a/doc/threads/03b_stats_2d_mulimex.pdf
+++ b/doc/threads/03b_stats_2d_mulimex.pdf
--- a/doc/threads/03b_stats_2d_mulimex.png
+++ b/doc/threads/03b_stats_2d_mulimex.png
--- a/doc/threads/03b_stats_2d_multi.pdf
+++ b/doc/threads/03b_stats_2d_multi.pdf
--- a/doc/threads/03b_stats_2d_multi.png
+++ b/doc/threads/03b_stats_2d_multi.png
--- a/doc/threads/03b_stats_2d_single.pdf
+++ b/doc/threads/03b_stats_2d_single.pdf
--- a/doc/threads/03b_stats_2d_single.png
+++ b/doc/threads/03b_stats_2d_single.png
--- a/doc/threads/Makefile
+++ b/doc/threads/Makefile
@ -1,5 +1,5 @@
 SUFFICES := .pdf -wordpress.html
-CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export 03_coroutines
+CHAPTERS := 00_the_name_of_the_game 01_the_route_and_its_attributes 02_asynchronous_export 03_coroutines 03b_performance
 all: $(foreach ch,$(CHAPTERS),$(addprefix $(ch),$(SUFFICES)))
--- a/doc/threads/stats-filter-2d.pl
+++ b/doc/threads/stats-filter-2d.pl
@ -21,7 +21,7 @@ while ($line = <F>)
  chomp $line;
  $line =~ s/;;(.*);;/;;\1;/;
  $line =~ s/v2\.0\.8-1[89][^;]+/bgp/;
-  $line =~ s/v2\.0\.8-[^;]+/sark/;
+  $line =~ s/v2\.0\.8-[^;]+/sark/ and next;
  $line =~ s/master;/v2.0.8;/;
  my %row;
  @row{@header} = split /;/, $line;
@ -33,15 +33,41 @@ sub avg {
  return List::Util::sum(@_) / @_;
 }
-sub stdev {
+sub getinbetween {
  my $index = shift;
  my @list = @_;
  return $list[int $index] if $index == int $index;
  my $lower = $list[int $index];
  my $upper = $list[1 + int $index];
  my $frac = $index - int $index;
  return ($lower * (1 - $frac) + $upper * $frac);
 }
 sub stats {
  my $avg = shift;
-  return 0 if @_ <= 1;
+  return [0, 0, 0, 0, 0] if @_ <= 1;
-  return sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
+
  #  my $stdev = sqrt(List::Util::sum(map { ($avg - $_)**2 } @_) / (@_-1));
  my @sorted = sort { $a <=> $b } @_;
  my $count = scalar @sorted;
  return [
    getinbetween(($count-1) * 0.25, @sorted),
    $sorted[0],
    $sorted[$count-1],
    getinbetween(($count-1) * 0.75, @sorted),
  ];
 }
 my %output;
 my %vers;
 my %peers;
 my %stplot;
 STATS:
 foreach my $k (keys %data)
@ -49,30 +75,16 @@ foreach my $k (keys %data)
  my %cols = map { my $vk = $_; $vk => [ map { $_->{$vk} } @{$data{$k}} ]; } @VALUES;
  my %avg = map { $_ => avg(@{$cols{$_}})} @VALUES;
-  my %stdev = map { $_ => stdev($avg{$_}, @{$cols{$_}})} @VALUES;
+  my %stloc = map { $_ => stats($avg{$_}, @{$cols{$_}})} @VALUES;
  foreach my $v (@VALUES) {
    next if $stdev{$v} / $avg{$v} < 0.035;
    for (my $i=0; $i<@{$cols{$v}}; $i++)
    {
      my $dif = $cols{$v}[$i] - $avg{$v};
      next if $dif < $stdev{$v} * 2 and $dif > $stdev{$v} * (-2);
 =cut
      printf "Removing an outlier for %s/%s: avg=%f, stdev=%f, variance=%.1f%%, val=%f, valratio=%.1f%%\n",
 	$k, $v, $avg{$v}, $stdev{$v}, (100 * $stdev{$v} / $avg{$v}), $cols{$v}[$i], (100 * $dif / $stdev{$v});
 =cut
      splice @{$data{$k}}, $i, 1, ();
      redo STATS;
    }
  }
  $vers{$data{$k}[0]{VERSION}}++;
  $peers{$data{$k}[0]{PEERS}}++;
  $output{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %avg };
  $stplot{$data{$k}[0]{VERSION}}{$data{$k}[0]{PEERS}}{$data{$k}[0]{TOTAL_ROUTES}} = { %stloc };
 }
-(3 == scalar %vers) and $vers{sark} and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
+#(3 == scalar %vers) and $vers{sark} and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
 (2 == scalar %vers) and $vers{bgp} and $vers{"v2.0.8"} or die "vers size is " . (scalar %vers) . ", items ", join ", ", keys %vers;
 ### Export the data ###
@ -84,9 +96,9 @@ set logscale
 set term pdfcairo size 20cm,15cm
 set xlabel "Total number of routes" offset 0,-1.5
-set xrange [10000:1500000]
+set xrange [10000:3000000]
 set xtics offset 0,-0.5
-set xtics (10000,15000,30000,50000,100000,150000,300000,500000,1000000)
+#set xtics (10000,15000,30000,50000,100000,150000,300000,500000,1000000)
 set ylabel "Time to converge (s)"
 set yrange [0.5:10800]
@ -99,10 +111,10 @@ set output "$OUTPUT"
 EOF
 my @colors = (
-  [ 1, 0.3, 0.3 ],
+  [ 1, 0.9, 0.3 ],
-  [ 1, 0.7, 0 ],
+  [ 0.7, 0, 0 ],
-  [ 0.3, 1, 0 ],
+  #  [ 0.6, 1, 0.3 ],
-  [ 0, 1, 0.3 ],
+  #  [ 0, 0.7, 0 ],
  [ 0, 0.7, 1 ],
  [ 0.3, 0.3, 1 ],
 );
@ -123,8 +135,15 @@ foreach my $v (sort keys %vers) {
    }
    say PLOT "EOD";
    say PLOT "\$data_${vnodot}_${p}_stats << EOD";
    foreach my $tr (sort { int $a <=> int $b } keys %{$output{$v}{$p}}) {
      say PLOT join " ", ( $tr, @{$stplot{$v}{$p}{$tr}{TIMEDIF}} );
    }
    say PLOT "EOD";
    my $colorstr = sprintf "linecolor rgbcolor \"#%02x%02x%02x\"", map +( int($color->[$_] * 255 + 0.5)), (0, 1, 2);
    push @plot_data, "\$data_${vnodot}_${p} using 1:2 with lines $colorstr linewidth 2 title \"$v, $p peers\"";
    push @plot_data, "\$data_${vnodot}_${p}_stats with candlesticks $colorstr linewidth 2 notitle \"\"";
    $color = [ map +( $color->[$_] + $stepcolor->[$_] ), (0, 1, 2) ];
  }
 }
--- a/doc/threads/stats.csv
+++ b/doc/threads/stats.csv