For BGP LLGR purposes, there was an API allowing a protocol to directly
modify their stale routes in table before flushing them. This API was
called by the table prune routine which violates the future locking
requirements.
Instead of this, BGP now requests a special route export and reimports
these routes into the table, allowing for asynchronous execution without
locking the table on export.
Until now, we were marking routes as REF_STALE and REF_DISCARD to
cleanup old routes after route refresh. This needed a synchronous route
table walk at both beginning and the end of route refresh routine,
marking the routes by the flags.
We avoid these walks by using a stale counter. Every route contains:
u8 stale_cycle;
Every import hook contains:
u8 stale_set;
u8 stale_valid;
u8 stale_pruned;
u8 stale_pruning;
In base_state, stale_set == stale_valid == stale_pruned == stale_pruning
and all routes' stale_cycle also have the same value.
The route refresh looks like follows:
+ ----------- + --------- + ----------- + ------------- + ------------ +
| | stale_set | stale_valid | stale_pruning | stale_pruned |
| Base | x | x | x | x |
| Begin | x+1 | x | x | x |
... now routes are being inserted with stale_cycle == (x+1)
| End | x+1 | x+1 | x | x |
... now table pruning routine is scheduled
| Prune begin | x+1 | x+1 | x+1 | x |
... now routes with stale_cycle not between stale_set and stale_valid
are deleted
| Prune end | x+1 | x+1 | x+1 | x+1 |
+ ----------- + --------- + ----------- + ------------- + ------------ +
The pruning routine is asynchronous and may have high latency in
high-load environments. Therefore, multiple route refresh requests may
happen before the pruning routine starts, leading to this situation:
| Prune begin | x+k | x+k | x -> x+k | x |
... or even
| Prune begin | x+k+1 | x+k | x -> x+k | x |
... if the prune event starts while another route refresh is running.
In such a case, the pruning routine still deletes routes not fitting
between stale_set and and stale_valid, effectively pruning the remnants
of all unpruned route refreshes from before:
| Prune end | x+k | x+k | x+k | x+k |
In extremely rare cases, there may happen too many route refreshes
before any route prune routine finishes. If the difference between
stale_valid and stale_pruned becomes more than 128 when requesting for
another route refresh, the routine walks the table synchronously and
resets all the stale values to a base state, while logging a warning.
Until now, if export table was enabled, Nest was storing exactly the
route before rt_notify() was called on it. This was quite sloppy and
spooky and it also wasn't reflecting the changes BGP does before
sending. And as BGP is storing the routes to be sent anyway, we are
simply keeping the already-sent routes in there to better rule out
unneeded reexports.
Some of the route attributes (IGP metric, preference) make no sense in
BGP, therefore these will be probably replaced by something sensible.
Also the nexthop shown in the short output is the BGP nexthop.
Changes in internal API:
* Every route attribute must be defined as struct ea_class somewhere.
* Registration of route attributes known at startup must be done by
ea_register_init() from protocol build functions.
* Every attribute has now its symbol registered in a global symbol table
defined as SYM_ATTRIBUTE
* All attribute ID's are dynamically allocated.
* Attribute value custom formatting hook is defined in the ea_class.
* Attribute names are the same for display and filters, always prefixed
by protocol name.
Also added some unit testing code for filters with route attributes.
Implement flowspec validation procedure as described in RFC 8955 sec. 6
and RFC 9117. The Validation procedure enforces that only routers in the
forwarding path for a network can originate flowspec rules for that
network.
The patch adds new mechanism for tracking inter-table dependencies, which
is necessary as the flowspec validation depends on IP routes, and flowspec
rules must be revalidated when best IP routes change.
The validation procedure is disabled by default and requires that
relevant IP table uses trie, as it uses interval queries for subnets.
This commit prevents use-after-free of routes belonging to protocols
which have been already destroyed, delaying also all the protocols'
shutdown until all of their routes have been finally propagated through
all the pipes down to the appropriate exports.
The use-after-free was somehow hypothetic yet theoretically possible in
rare conditions, when one BGP protocol authors a lot of routes and the
user deletes that protocol by reconfiguring in the same time as next hop
update is requested, causing rte_better() to be called on a
not-yet-pruned network prefix while the owner protocol has been already
freed.
In parallel execution environments, this would happen an inter-thread
use-after-free, causing possible heisenbugs or other nasty problems.
There is a simple universal IO loop, taking care of events, timers and
sockets. Primarily, one instance of a protocol should use exactly one IO
loop to do all its work, as is now done in BFD.
Contrary to previous versions, the loop is now launched and cleaned by
the nest/proto.c code, allowing for a protocol to just request its own
loop by setting the loop's lock order in config higher than the_bird.
It is not supported nor checked if any protocol changed the requested
lock order in reconfigure. No protocol should do it at all.
* internal tables are now more standalone, having their own import and
export hooks
* route refresh/reload uses stale counter instead of stale flag,
allowing to drop walking the table at the beginning
* route modify (by BGP LLGR) is now done by a special refeed hook,
reimporting the modified routes directly without filters
Channels have now included rt_import_req and rt_export_req to hook into
the table instead of just one list node. This will (in future) allow for:
* channel import and export bound to different tables
* more efficient pipe code (dropping most of the channel code)
* conversion of 'show route' to a special kind of export
* temporary static routes from CLI
The import / export states are also updated to the new algorithms.
Routes from downed protocols stay in rtable (until next rtable prune
cycle ends) and may be even exported to another protocol. In BGP case,
source BGP protocol is examined, although dynamic parts (including
neighbor entries) are already freed. That may lead to crash under some
race conditions. Ensure that freed neighbor entry is not accessed to
avoid this issue.
BGP statistics code was preliminary and i wanted to replace it by
separate 'show X stats' command. The patch hides the preliminary
output in 'show protocols all' so it is not part of the released
version.
This is an implementation of draft-walton-bgp-hostname-capability-02.
It is implemented since quite some time for FRR and in datacenter, this
gives a nice output to avoid using IP addresses.
It is disabled by default. The hostname is retrieved from uname(2) and
can be overriden with "hostname" option. The domain name is never set
nor displayed.
Minor changes by committer.
The option is not implemented since transition to 2.0 and no plan to add it.
Also remove some deprecated RTS_* valus from documentation.
Thanks to Sébastien Parisot for notification.
BFD session options are configured per interface in BFD protocol. This
patch allows to specify them also per-request in protocols requesting
sessions (currently limited to BGP).
When dynamic BGP with remote range is configured, MD5SIG needs to use
newer socket option (TCP_MD5SIG_EXT) to specify remote addres range for
listening socket.
Thanks to Adam Kułagowski for the suggestion.
This is optional check described in RFC 4271. Although this can be also
done by filters, it is widely implemented option in BGP implementations.
Thanks to Eugene Bogomazov for the original patch.
Change of some options requires route refresh, but when import table is
active, channel reload is done from it instead of doing full route
refresh. So in this case we request it internally.
The patch implements optional internal export table to a channel and
hooks it to BGP so it can be used as Adj-RIB-Out. When enabled, all
exported (post-filtered) routes are stored there. An export table can be
examined using e.g. 'show route export table bgp1.ipv4'.
Several BGP channel options (including 'next hop self') could be
reconfigured without session reset, with just route refeed/refresh.
The patch improves reconfiguration code to do it that way.
When 'graceful down' command is entered, protocols are shut down
with regard to graceful restart. Namely Kernel protocol does
not remove routes and BGP protocol does not send notification,
just closes the connection.