Add BGP channel option 'next hop prefer global' that modifies BGP
recursive next hop resolution to use global next hop IPv6 address instead
of link-local next hop IPv6 address for immediate next hop of received
routes.
In principle, the channel list is a list of parent struct proto and can
contain general structures of type struct channel, That is useful e.g.
for adding MPLS channels to BGP.
- When next hop is reset to local IP, we should remove BGP label stack,
as it is related to original next hop
- BGP next hop or immediate next hop from one VRF should not be passed
to another VRF, as they are different IP namespaces
Had to fix route source locking inside BGP export table as we need to
keep the route sources properly allocated until even last BGP pending
update is sent out, therefore the export table printout is accurate.
There were more conflicts that I'd like to see, most notably in route
export. If a bisect identifies this commit with something related, it
may be simply true that this commit introduces that bug. Let's hope it
doesn't happen.
The invalid routes were filtered out before they could ever get
exported, yet some of the routines need them available, e.g. for
display or import reload.
Now the invalid routes are properly exported and dropped in channel
export routines instead.
For BGP LLGR purposes, there was an API allowing a protocol to directly
modify their stale routes in table before flushing them. This API was
called by the table prune routine which violates the future locking
requirements.
Instead of this, BGP now requests a special route export and reimports
these routes into the table, allowing for asynchronous execution without
locking the table on export.
Until now, we were marking routes as REF_STALE and REF_DISCARD to
cleanup old routes after route refresh. This needed a synchronous route
table walk at both beginning and the end of route refresh routine,
marking the routes by the flags.
We avoid these walks by using a stale counter. Every route contains:
u8 stale_cycle;
Every import hook contains:
u8 stale_set;
u8 stale_valid;
u8 stale_pruned;
u8 stale_pruning;
In base_state, stale_set == stale_valid == stale_pruned == stale_pruning
and all routes' stale_cycle also have the same value.
The route refresh looks like follows:
+ ----------- + --------- + ----------- + ------------- + ------------ +
| | stale_set | stale_valid | stale_pruning | stale_pruned |
| Base | x | x | x | x |
| Begin | x+1 | x | x | x |
... now routes are being inserted with stale_cycle == (x+1)
| End | x+1 | x+1 | x | x |
... now table pruning routine is scheduled
| Prune begin | x+1 | x+1 | x+1 | x |
... now routes with stale_cycle not between stale_set and stale_valid
are deleted
| Prune end | x+1 | x+1 | x+1 | x+1 |
+ ----------- + --------- + ----------- + ------------- + ------------ +
The pruning routine is asynchronous and may have high latency in
high-load environments. Therefore, multiple route refresh requests may
happen before the pruning routine starts, leading to this situation:
| Prune begin | x+k | x+k | x -> x+k | x |
... or even
| Prune begin | x+k+1 | x+k | x -> x+k | x |
... if the prune event starts while another route refresh is running.
In such a case, the pruning routine still deletes routes not fitting
between stale_set and and stale_valid, effectively pruning the remnants
of all unpruned route refreshes from before:
| Prune end | x+k | x+k | x+k | x+k |
In extremely rare cases, there may happen too many route refreshes
before any route prune routine finishes. If the difference between
stale_valid and stale_pruned becomes more than 128 when requesting for
another route refresh, the routine walks the table synchronously and
resets all the stale values to a base state, while logging a warning.
Implement BGP roles as described in RFC 9234. It is a mechanism for
route leak prevention and automatic route filtering based on common BGP
topology relationships. It defines role capability (controlled by 'local
role' option) and OTC route attribute, which is used for automatic route
filtering and leak detection.
Minor changes done by commiter.
Until now, if export table was enabled, Nest was storing exactly the
route before rt_notify() was called on it. This was quite sloppy and
spooky and it also wasn't reflecting the changes BGP does before
sending. And as BGP is storing the routes to be sent anyway, we are
simply keeping the already-sent routes in there to better rule out
unneeded reexports.
Some of the route attributes (IGP metric, preference) make no sense in
BGP, therefore these will be probably replaced by something sensible.
Also the nexthop shown in the short output is the BGP nexthop.
Passing protocol to preexport was in fact a historical relic from the
old times when channels weren't a thing. Refactoring that to match
current extensibility needs.
There were quite a lot of conflicts in flowspec validation code which
ultimately led to some code being a bit rewritten, not only adapted from
this or that branch, yet it is still in a limit of a merge.
Validation is called internally from route table at the same place where
nexthop resolution is done. Also accounting for rte->sender semantics
change (not a channel but the import hook instead).