Instead of propagating interface updates as they are loaded from kernel,
they are enqueued and all the notifications are called from a
protocol-specific event. This change allows to break the locking loop
between protocols and interfaces.
Anyway, this change is based on v2 branch to keep the changes between v2
and v3 smaller.
Instead of calling custom hooks from object locks, we use standard event
sending mechanism to inform protocols about object lock changes. This is
a backport from version 3 where these events are passed across threads.
This implementation of object locks doesn't use mutexes to lock the
whole data structure. In version 3, this data structure may get accessed
from multiple threads and must be protected by mutex.
Some of these new BGP role keywords use generic names that collides with
user-defined symbols. Allow them to be redefined. Also remove duplicit
keyword definition for 'prefer'.
There were some confusion about validity and usage of pflags, which
caused incorrect usage after some flags from (now removed) protocol-
specific area were moved to pflags.
We state that pflags:
- Are secondary data used by protocol-specific hooks
- Can be changed on an existing route (in contrast to copy-on-write
for primary data)
- Are irrelevant for propagation (not propagated when changed)
- Are specific to a routing table (not propagated by pipe)
The patch did these fixes:
- Do not compare pflags in rte_same(), as they may keep cached values
like BGP_REF_STALE, causing spurious propagation.
- Initialize pflags to zero in rte_get_temp(), avoid initialization in
protocol code, fixing at least two forgotten initializations (krt
and one case in babel).
- Improve documentation about pflags
The effective keepalive time now scales relative to the negotiated
hold time, to maintain proportion between the keepalive time and the
hold time. This avoids issues when both keepalive and hold times
were configured, the hold time was negotiated to a smaller value,
but the keepalive time stayed the same.
Add new options 'min hold time' and 'min keepalive time', which reject
session attempts with too small hold time.
Improve validation of config options an their documentation.
Thanks to Alexander Zubkov and Sergei Goriunov for suggestions.
Add BGP channel option 'next hop prefer global' that modifies BGP
recursive next hop resolution to use global next hop IPv6 address instead
of link-local next hop IPv6 address for immediate next hop of received
routes.
- When next hop is reset to local IP, we should remove BGP label stack,
as it is related to original next hop
- BGP next hop or immediate next hop from one VRF should not be passed
to another VRF, as they are different IP namespaces
In principle, the channel list is a list of parent struct proto and can
contain general structures of type struct channel, That is useful e.g.
for adding MPLS channels to BGP.
Implement BGP roles as described in RFC 9234. It is a mechanism for
route leak prevention and automatic route filtering based on common BGP
topology relationships. It defines role capability (controlled by 'local
role' option) and OTC route attribute, which is used for automatic route
filtering and leak detection.
Minor changes done by commiter.
Passing protocol to preexport was in fact a historical relic from the
old times when channels weren't a thing. Refactoring that to match
current extensibility needs.
The prefix hash table in BGP used the same hash function as the rtable.
When a batch of routes are exported during feed/flush to the BGP, they
all have similar hash values, so they are all crowded in a few slots in
the BGP prefix table (which is much smaller - around the size of the
batch - and uses higher bits from hash values), making it much slower due
to excessive collisions. Use a different hash function to avoid this.
Also, increase the batch size to fill 4k BGP packets and increase minimum
BGP bucket and prefix hash sizes to avoid back and forth resizing during
flushes.
This leads to order of magnitude faster flushes (on my test data).
It is too cryptic to flush tmp_linpool in these cases and we don't want
anybody in the future to break this code by adding an allocation
somewhere which should persist over that flush.
Saving and restoring linpool state is safer.
Implement flowspec validation procedure as described in RFC 8955 sec. 6
and RFC 9117. The Validation procedure enforces that only routers in the
forwarding path for a network can originate flowspec rules for that
network.
The patch adds new mechanism for tracking inter-table dependencies, which
is necessary as the flowspec validation depends on IP routes, and flowspec
rules must be revalidated when best IP routes change.
The validation procedure is disabled by default and requires that
relevant IP table uses trie, as it uses interval queries for subnets.
One of previous commits added error logging of invalid routes. This
also inadvertently caused error logging of route loops, which should
be ignored silently. Fix that.
Most error messages in attribute processing are in rx/decode step and
these use L_REMOTE log class. But there are few that are in tx/export
step and these should use L_ERR log class.
Use tx-specific macro (REJECT()) in tx/export code and rename field
err_withdraw to err_reject in struct bgp_export_state to ensure that
appropriate error reporting macros are called in proper contexts.
Routes from downed protocols stay in rtable (until next rtable prune
cycle ends) and may be even exported to another protocol. In BGP case,
source BGP protocol is examined, although dynamic parts (including
neighbor entries) are already freed. That may lead to crash under some
race conditions. Ensure that freed neighbor entry is not accessed to
avoid this issue.
The flag makes sense just in external representation. It is reset during
BGP export, but keeping it internally broke MRT dumps for short attributes
that used it anyways.
Thanks to Simon Marsh for the bugreport and the patch.
BGP statistics code was preliminary and i wanted to replace it by
separate 'show X stats' command. The patch hides the preliminary
output in 'show protocols all' so it is not part of the released
version.
This is an implementation of draft-walton-bgp-hostname-capability-02.
It is implemented since quite some time for FRR and in datacenter, this
gives a nice output to avoid using IP addresses.
It is disabled by default. The hostname is retrieved from uname(2) and
can be overriden with "hostname" option. The domain name is never set
nor displayed.
Minor changes by committer.
Add fake MP_REACH_NLRI attribute with BGP next hop when encoding MRT
table dumps for IPv6 routes. That is necessary to encode next hop as
NEXT_HOP attribute is not used for MP-BGP.
Thanks to Santiago Aggio for the bugreport.
The option is not implemented since transition to 2.0 and no plan to add it.
Also remove some deprecated RTS_* valus from documentation.
Thanks to Sébastien Parisot for notification.