For BGP LLGR purposes, there was an API allowing a protocol to directly
modify their stale routes in table before flushing them. This API was
called by the table prune routine which violates the future locking
requirements.
Instead of this, BGP now requests a special route export and reimports
these routes into the table, allowing for asynchronous execution without
locking the table on export.
Until now, we were marking routes as REF_STALE and REF_DISCARD to
cleanup old routes after route refresh. This needed a synchronous route
table walk at both beginning and the end of route refresh routine,
marking the routes by the flags.
We avoid these walks by using a stale counter. Every route contains:
u8 stale_cycle;
Every import hook contains:
u8 stale_set;
u8 stale_valid;
u8 stale_pruned;
u8 stale_pruning;
In base_state, stale_set == stale_valid == stale_pruned == stale_pruning
and all routes' stale_cycle also have the same value.
The route refresh looks like follows:
+ ----------- + --------- + ----------- + ------------- + ------------ +
| | stale_set | stale_valid | stale_pruning | stale_pruned |
| Base | x | x | x | x |
| Begin | x+1 | x | x | x |
... now routes are being inserted with stale_cycle == (x+1)
| End | x+1 | x+1 | x | x |
... now table pruning routine is scheduled
| Prune begin | x+1 | x+1 | x+1 | x |
... now routes with stale_cycle not between stale_set and stale_valid
are deleted
| Prune end | x+1 | x+1 | x+1 | x+1 |
+ ----------- + --------- + ----------- + ------------- + ------------ +
The pruning routine is asynchronous and may have high latency in
high-load environments. Therefore, multiple route refresh requests may
happen before the pruning routine starts, leading to this situation:
| Prune begin | x+k | x+k | x -> x+k | x |
... or even
| Prune begin | x+k+1 | x+k | x -> x+k | x |
... if the prune event starts while another route refresh is running.
In such a case, the pruning routine still deletes routes not fitting
between stale_set and and stale_valid, effectively pruning the remnants
of all unpruned route refreshes from before:
| Prune end | x+k | x+k | x+k | x+k |
In extremely rare cases, there may happen too many route refreshes
before any route prune routine finishes. If the difference between
stale_valid and stale_pruned becomes more than 128 when requesting for
another route refresh, the routine walks the table synchronously and
resets all the stale values to a base state, while logging a warning.
Until now, if export table was enabled, Nest was storing exactly the
route before rt_notify() was called on it. This was quite sloppy and
spooky and it also wasn't reflecting the changes BGP does before
sending. And as BGP is storing the routes to be sent anyway, we are
simply keeping the already-sent routes in there to better rule out
unneeded reexports.
Some of the route attributes (IGP metric, preference) make no sense in
BGP, therefore these will be probably replaced by something sensible.
Also the nexthop shown in the short output is the BGP nexthop.
Changes in internal API:
* Every route attribute must be defined as struct ea_class somewhere.
* Registration of route attributes known at startup must be done by
ea_register_init() from protocol build functions.
* Every attribute has now its symbol registered in a global symbol table
defined as SYM_ATTRIBUTE
* All attribute ID's are dynamically allocated.
* Attribute value custom formatting hook is defined in the ea_class.
* Attribute names are the same for display and filters, always prefixed
by protocol name.
Also added some unit testing code for filters with route attributes.
Implement flowspec validation procedure as described in RFC 8955 sec. 6
and RFC 9117. The Validation procedure enforces that only routers in the
forwarding path for a network can originate flowspec rules for that
network.
The patch adds new mechanism for tracking inter-table dependencies, which
is necessary as the flowspec validation depends on IP routes, and flowspec
rules must be revalidated when best IP routes change.
The validation procedure is disabled by default and requires that
relevant IP table uses trie, as it uses interval queries for subnets.
Channels have now included rt_import_req and rt_export_req to hook into
the table instead of just one list node. This will (in future) allow for:
* channel import and export bound to different tables
* more efficient pipe code (dropping most of the channel code)
* conversion of 'show route' to a special kind of export
* temporary static routes from CLI
The import / export states are also updated to the new algorithms.
Routes from downed protocols stay in rtable (until next rtable prune
cycle ends) and may be even exported to another protocol. In BGP case,
source BGP protocol is examined, although dynamic parts (including
neighbor entries) are already freed. That may lead to crash under some
race conditions. Ensure that freed neighbor entry is not accessed to
avoid this issue.
BGP statistics code was preliminary and i wanted to replace it by
separate 'show X stats' command. The patch hides the preliminary
output in 'show protocols all' so it is not part of the released
version.
This is an implementation of draft-walton-bgp-hostname-capability-02.
It is implemented since quite some time for FRR and in datacenter, this
gives a nice output to avoid using IP addresses.
It is disabled by default. The hostname is retrieved from uname(2) and
can be overriden with "hostname" option. The domain name is never set
nor displayed.
Minor changes by committer.
The option is not implemented since transition to 2.0 and no plan to add it.
Also remove some deprecated RTS_* valus from documentation.
Thanks to Sébastien Parisot for notification.
BFD session options are configured per interface in BFD protocol. This
patch allows to specify them also per-request in protocols requesting
sessions (currently limited to BGP).
When dynamic BGP with remote range is configured, MD5SIG needs to use
newer socket option (TCP_MD5SIG_EXT) to specify remote addres range for
listening socket.
Thanks to Adam Kułagowski for the suggestion.
This is optional check described in RFC 4271. Although this can be also
done by filters, it is widely implemented option in BGP implementations.
Thanks to Eugene Bogomazov for the original patch.
Change of some options requires route refresh, but when import table is
active, channel reload is done from it instead of doing full route
refresh. So in this case we request it internally.
The patch implements optional internal export table to a channel and
hooks it to BGP so it can be used as Adj-RIB-Out. When enabled, all
exported (post-filtered) routes are stored there. An export table can be
examined using e.g. 'show route export table bgp1.ipv4'.
Several BGP channel options (including 'next hop self') could be
reconfigured without session reset, with just route refeed/refresh.
The patch improves reconfiguration code to do it that way.
When 'graceful down' command is entered, protocols are shut down
with regard to graceful restart. Namely Kernel protocol does
not remove routes and BGP protocol does not send notification,
just closes the connection.
Useful for implementation of agents implementing the SNMP-BGP MIB, which
requires the local AS of a session to be specified.
Thanks to Jan-Philipp Litza for the patch.
Support for dynamically spawning BGP protocols for incoming connections.
Use 'neighbor range' to specify range of valid neighbor addresses, then
incoming connections from these addresses spawn new BGP instances.
When BGP connection is opened, it may happen that rx hook (with remote
OPEN) is called before tx hook (for local OPEN). Therefore, we need to do
internal changes (like setting local_caps) synchronously with OPENSENT
transition and we need to ensure that OPEN is sent before KEEPALIVE.
Allow to specify just 'internal' or 'external' for remote neighbor
instead of specific ASN. In the second case that means BGP peers with
any non-local ASNs are accepted.