Instead of propagating interface updates as they are loaded from kernel,
they are enqueued and all the notifications are called from a
protocol-specific event. This change allows to break the locking loop
between protocols and interfaces.
Anyway, this change is based on v2 branch to keep the changes between v2
and v3 smaller.
When creating a new babel_source object we initialise the seqno to 0. The
caller will update the source object with the right metric and seqno value,
for both newly created and old source objects. However if we initialise the
source object seqno to 0 that may actually turn out to be a valid (higher)
seqno than the one in the routing table, because of seqno wrapping. In this
case the source metric will not be set properly, which breaks feasibility
tracking for subsequent updates.
To fix this, add a new initial_seqno argument to babel_get_source() which
is used when allocating a new object, and set that to the seqno value of
the update we're sending.
Thanks to Juliusz Chroboczek for the bugreport.
Juliusz noticed there were a couple of places we were doing straight
inequality comparisons of seqnos in Babel. This is wrong because seqnos can
wrap: so we need to use the modulo-64k comparison function for these cases
as well.
Introduce a strict-inequality version of the modulo-comparison for this
purpose.
Instead of calling custom hooks from object locks, we use standard event
sending mechanism to inform protocols about object lock changes. This is
a backport from version 3 where these events are passed across threads.
This implementation of object locks doesn't use mutexes to lock the
whole data structure. In version 3, this data structure may get accessed
from multiple threads and must be protected by mutex.
Instead of calling custom hooks from object locks, we use standard event
sending mechanism to inform protocols about object lock changes. As
event sending is lockless, the unlocking protocol simply enqueues the
appropriate event to the given loop when the locking is done.
For active sessions, ignore received packets with zero local id and
mismatched remote id. That forces a session timeout instead of an
immediate session restart. It makes BFD sessions more resilient to
packet spoofing.
Thanks to André Grüneberg for the suggestion.
Protocols receive if_notify() announcements that are filtered according
to their VRF setting, but during reconfiguration, they access iface_list
directly and forgot to check VRF setting here, which leads to all
interfaces be addedd.
Fix this issue for Babel, OSPF, RAdv and RIP protocols.
Thanks to Marcel Menzel for the bugreport.
Some of these new BGP role keywords use generic names that collides with
user-defined symbols. Allow them to be redefined. Also remove duplicit
keyword definition for 'prefer'.
During backporting attribute changes from 3.0-branch, some internal
attributes (RIP iface and Babel seqno) leaked to 'show route all' output.
Allow protocols to hide specific attributes with GA_HIDDEN value.
Thanks to Nigel Kukard for the bugreport.
There were some confusion about validity and usage of pflags, which
caused incorrect usage after some flags from (now removed) protocol-
specific area were moved to pflags.
We state that pflags:
- Are secondary data used by protocol-specific hooks
- Can be changed on an existing route (in contrast to copy-on-write
for primary data)
- Are irrelevant for propagation (not propagated when changed)
- Are specific to a routing table (not propagated by pipe)
The patch did these fixes:
- Do not compare pflags in rte_same(), as they may keep cached values
like BGP_REF_STALE, causing spurious propagation.
- Initialize pflags to zero in rte_get_temp(), avoid initialization in
protocol code, fixing at least two forgotten initializations (krt
and one case in babel).
- Improve documentation about pflags
The seqno request retransmission handling was tracking the destination
that a forwarded request was being sent to and always retransmitting to
that same destination. This is unnecessary because we only need to
retransmit requests we originate ourselves, not those we forward on
behalf of others; in fact retransmitting on behalf of others can lead to
exponential multiplication of requests, which would be bad.
So rework the seqno request tracking so that instead of storing the
destination of a request, we just track whether it was a request that we
forwarded on behalf of another node, or if it was a request we originated
ourselves. Forwarded requests are not retransmitted, they are only used
for duplicate suppression, and for triggering an update when satisfied.
If we end up originating a request that we previously forwarded, we
"upgrade" the old request and restart the retransmit counter.
One complication with this is that requests sent in response to unfeasible
updates (section 3.8.2.2 of the RFC) have to be sent as unicast to a
particular peer. However, we don't really need to retransmit those as
there's no starvation when sending such a request; so we just change
such requests to be one-off unicast requests that are not subject to
retransmission or duplicate suppression. This is the same behaviour as
babeld has for such requests.
Minor changes from committer.
The effective keepalive time now scales relative to the negotiated
hold time, to maintain proportion between the keepalive time and the
hold time. This avoids issues when both keepalive and hold times
were configured, the hold time was negotiated to a smaller value,
but the keepalive time stayed the same.
Add new options 'min hold time' and 'min keepalive time', which reject
session attempts with too small hold time.
Improve validation of config options an their documentation.
Thanks to Alexander Zubkov and Sergei Goriunov for suggestions.
Add BGP channel option 'next hop prefer global' that modifies BGP
recursive next hop resolution to use global next hop IPv6 address instead
of link-local next hop IPv6 address for immediate next hop of received
routes.
In principle, the channel list is a list of parent struct proto and can
contain general structures of type struct channel, That is useful e.g.
for adding MPLS channels to BGP.
- When next hop is reset to local IP, we should remove BGP label stack,
as it is related to original next hop
- BGP next hop or immediate next hop from one VRF should not be passed
to another VRF, as they are different IP namespaces
Had to fix route source locking inside BGP export table as we need to
keep the route sources properly allocated until even last BGP pending
update is sent out, therefore the export table printout is accurate.
There were more conflicts that I'd like to see, most notably in route
export. If a bisect identifies this commit with something related, it
may be simply true that this commit introduces that bug. Let's hope it
doesn't happen.
The invalid routes were filtered out before they could ever get
exported, yet some of the routines need them available, e.g. for
display or import reload.
Now the invalid routes are properly exported and dropped in channel
export routines instead.
For BGP LLGR purposes, there was an API allowing a protocol to directly
modify their stale routes in table before flushing them. This API was
called by the table prune routine which violates the future locking
requirements.
Instead of this, BGP now requests a special route export and reimports
these routes into the table, allowing for asynchronous execution without
locking the table on export.
Until now, we were marking routes as REF_STALE and REF_DISCARD to
cleanup old routes after route refresh. This needed a synchronous route
table walk at both beginning and the end of route refresh routine,
marking the routes by the flags.
We avoid these walks by using a stale counter. Every route contains:
u8 stale_cycle;
Every import hook contains:
u8 stale_set;
u8 stale_valid;
u8 stale_pruned;
u8 stale_pruning;
In base_state, stale_set == stale_valid == stale_pruned == stale_pruning
and all routes' stale_cycle also have the same value.
The route refresh looks like follows:
+ ----------- + --------- + ----------- + ------------- + ------------ +
| | stale_set | stale_valid | stale_pruning | stale_pruned |
| Base | x | x | x | x |
| Begin | x+1 | x | x | x |
... now routes are being inserted with stale_cycle == (x+1)
| End | x+1 | x+1 | x | x |
... now table pruning routine is scheduled
| Prune begin | x+1 | x+1 | x+1 | x |
... now routes with stale_cycle not between stale_set and stale_valid
are deleted
| Prune end | x+1 | x+1 | x+1 | x+1 |
+ ----------- + --------- + ----------- + ------------- + ------------ +
The pruning routine is asynchronous and may have high latency in
high-load environments. Therefore, multiple route refresh requests may
happen before the pruning routine starts, leading to this situation:
| Prune begin | x+k | x+k | x -> x+k | x |
... or even
| Prune begin | x+k+1 | x+k | x -> x+k | x |
... if the prune event starts while another route refresh is running.
In such a case, the pruning routine still deletes routes not fitting
between stale_set and and stale_valid, effectively pruning the remnants
of all unpruned route refreshes from before:
| Prune end | x+k | x+k | x+k | x+k |
In extremely rare cases, there may happen too many route refreshes
before any route prune routine finishes. If the difference between
stale_valid and stale_pruned becomes more than 128 when requesting for
another route refresh, the routine walks the table synchronously and
resets all the stale values to a base state, while logging a warning.
Implement BGP roles as described in RFC 9234. It is a mechanism for
route leak prevention and automatic route filtering based on common BGP
topology relationships. It defines role capability (controlled by 'local
role' option) and OTC route attribute, which is used for automatic route
filtering and leak detection.
Minor changes done by commiter.
Until now, if export table was enabled, Nest was storing exactly the
route before rt_notify() was called on it. This was quite sloppy and
spooky and it also wasn't reflecting the changes BGP does before
sending. And as BGP is storing the routes to be sent anyway, we are
simply keeping the already-sent routes in there to better rule out
unneeded reexports.
Some of the route attributes (IGP metric, preference) make no sense in
BGP, therefore these will be probably replaced by something sensible.
Also the nexthop shown in the short output is the BGP nexthop.
When f_line is done, we have to pop the stack frame. The old code just
removed nominal number of args/vars. Change it to use stored ventry value
modified by number of returned values. This allows to allocate variables
on a stack frame during execution of f_lines instead of just at start.
But we need to know the number of returned values for a f_line. It is 1
for term, 0 for cmd. Store that to f_line during linearization.
Passing protocol to preexport was in fact a historical relic from the
old times when channels weren't a thing. Refactoring that to match
current extensibility needs.
In the multithreaded environment, it is not supposed that anybody
traverses the routing table as the CLI show-route was doing. Now the
routing table traversal is gone and CLI won't hold the table locked
while computing filters.
Added an option for export filter to allow for prefiltering based on the
prefix. Routes outside the given prefix are completely ignored. Config
is simple:
export in <net> <filter>;
There were quite a lot of conflicts in flowspec validation code which
ultimately led to some code being a bit rewritten, not only adapted from
this or that branch, yet it is still in a limit of a merge.
Validation is called internally from route table at the same place where
nexthop resolution is done. Also accounting for rte->sender semantics
change (not a channel but the import hook instead).
The Babel seqno request code keeps track of which seqno requests are
outstanding for a neighbour by putting them onto a per-neighbour list. When
reusing a seqno request, it will try to remove this node, but if the seqno
request in question was a multicast request with no neighbour attached this
will result in a crash because it tries to remove a list node that wasn't
added to any list.
Fix this by making the list remove conditional. Also fix neighbor removal
which were changing seqno requests to multicast ones instead of removing
them.
Fixes: ebd5751cde ("Babel: Seqno requests are properly decoupled from
neighbors when the underlying interface disappears").
Based on the patch from Toke Høiland-Jørgensen <toke@toke.dk>,
bug reported by Stefan Haller <stefan.haller@stha.de>, thanks.
For now, all route attributes are stored as eattrs in ea_list. This
should make route manipulation easier and it also allows for a layered
approach of route attributes where updates from filters will be stored
as an overlay over the previous version.
As there is either a nexthop or another destination specification
(or othing in case of ROAs and Flowspec), it may be merged together.
This code is somehow quirky and should be replaced in future by better
implementation of nexthop.
Also flowspec validation result has its own attribute now as it doesn't
have anything to do with route nexthop.
This doesn't do anything more than to put the whole structure inside
adata. The overall performance is certainly going downhill; we'll
optimize this later.
Anyway, this is one of the latest items inside rta and in several
commits we may drop rta completely and move to eattrs-only routes.
The prefix hash table in BGP used the same hash function as the rtable.
When a batch of routes are exported during feed/flush to the BGP, they
all have similar hash values, so they are all crowded in a few slots in
the BGP prefix table (which is much smaller - around the size of the
batch - and uses higher bits from hash values), making it much slower due
to excessive collisions. Use a different hash function to avoid this.
Also, increase the batch size to fill 4k BGP packets and increase minimum
BGP bucket and prefix hash sizes to avoid back and forth resizing during
flushes.
This leads to order of magnitude faster flushes (on my test data).
The route scope attribute was used for simple user route marking. As
there is a better tool for this (custom attributes), the old and limited
way can be dropped.
Some tokens are both keywords and symbols. For now, we allow only
specific keywords to be redefined; in future, more of the keywords may
be added to this category.
The redefinable keywords must be specified in any .Y file as follows:
toksym: THE_KEYWORD ;
See proto/bgp/config.Y for an example.
Also dropped a lot of unused terminals.
Changes in internal API:
* Every route attribute must be defined as struct ea_class somewhere.
* Registration of route attributes known at startup must be done by
ea_register_init() from protocol build functions.
* Every attribute has now its symbol registered in a global symbol table
defined as SYM_ATTRIBUTE
* All attribute ID's are dynamically allocated.
* Attribute value custom formatting hook is defined in the ea_class.
* Attribute names are the same for display and filters, always prefixed
by protocol name.
Also added some unit testing code for filters with route attributes.
This commit removes the EAF_TYPE_* namespace completely and also for
route attributes, filter-based types T_* are used. This simplifies
fetching and setting route attributes from filters.
Also, there is now union bval which serves as an universal value holder
instead of private unions held separately by eattr and filter code.
When shutting down a Babel instance we send a wildcard retraction to make
sure all peers can quickly switch to other route origins. Add another small
optimisation borrowed from babeld: sending a Hello message (along with the
retraction) with a very low interval.
This will cause neighbours to modify their expiry timers for the node's
state to quickly time it out, thus conserving resources in the network.
The interface pointer was improperly converted to u32 and back. Fixing
this by explicitly allocating an adata structure for it. It's not so
memory efficient, we'll optimize this later.
Add BFD protocol option 'strict bind' to use separate listening socket
for each BFD interface bound to its address instead of using shared
listening sockets.
It is too cryptic to flush tmp_linpool in these cases and we don't want
anybody in the future to break this code by adding an allocation
somewhere which should persist over that flush.
Saving and restoring linpool state is safer.
A recent change in Babel causes ifaces to disappear after
reconfiguration. The patch fixes that.
Thanks to Johannes Kimmel for an insightful bugreport.
Implement flowspec validation procedure as described in RFC 8955 sec. 6
and RFC 9117. The Validation procedure enforces that only routers in the
forwarding path for a network can originate flowspec rules for that
network.
The patch adds new mechanism for tracking inter-table dependencies, which
is necessary as the flowspec validation depends on IP routes, and flowspec
rules must be revalidated when best IP routes change.
The validation procedure is disabled by default and requires that
relevant IP table uses trie, as it uses interval queries for subnets.
Attach a prefix trie to IP/VPN/ROA tables. Use it for net_route() and
net_roa_check(). This leads to 3-5x speedups for IPv4 and 5-10x
speedup for IPv6 of these calls.
TODO:
- Rebuild the trie during rt_prune_table()
- Better way to avoid trie_add_prefix() in net_get() for existing tables
- Make it configurable (?)
One of previous commits added error logging of invalid routes. This
also inadvertently caused error logging of route loops, which should
be ignored silently. Fix that.
Most error messages in attribute processing are in rx/decode step and
these use L_REMOTE log class. But there are few that are in tx/export
step and these should use L_ERR log class.
Use tx-specific macro (REJECT()) in tx/export code and rename field
err_withdraw to err_reject in struct bgp_export_state to ensure that
appropriate error reporting macros are called in proper contexts.
RFC 6810 and RFC 8210 specify that the "Max Length" value MUST NOT be
less than the Prefix Length element (underflow). On the other side,
overflow of the Max Length element also is possible, it being an 8-bit
unsigned integer allows for values larger than 32 or 128. This also
implicitly ensures there is no overflow of "Length" value.
When a PDU is received where the Max Length field is corrputed, the RTR
client (BIRD) should immediately terminate the session, flush all data
learned from that cache, and log an error for the operator.
Minor changes done by commiter.
This commit prevents use-after-free of routes belonging to protocols
which have been already destroyed, delaying also all the protocols'
shutdown until all of their routes have been finally propagated through
all the pipes down to the appropriate exports.
The use-after-free was somehow hypothetic yet theoretically possible in
rare conditions, when one BGP protocol authors a lot of routes and the
user deletes that protocol by reconfiguring in the same time as next hop
update is requested, causing rte_better() to be called on a
not-yet-pruned network prefix while the owner protocol has been already
freed.
In parallel execution environments, this would happen an inter-thread
use-after-free, causing possible heisenbugs or other nasty problems.
This basically means that:
* there are some more levels of indirection and asynchronicity, mostly
in cleanup procedures, requiring correct lock ordering
* all the internal table operations (prune, next hop update) are done
without blocking the other parts of BIRD
* the protocols may get their own loops very soon
There is a simple universal IO loop, taking care of events, timers and
sockets. Primarily, one instance of a protocol should use exactly one IO
loop to do all its work, as is now done in BFD.
Contrary to previous versions, the loop is now launched and cleaned by
the nest/proto.c code, allowing for a protocol to just request its own
loop by setting the loop's lock order in config higher than the_bird.
It is not supported nor checked if any protocol changed the requested
lock order in reconfigure. No protocol should do it at all.
In previous versions, every thread used its own time structures,
effectively leading to different time in every thread and strange
logging messages.
The time processing code now uses global atomic variables to keep
current time available for fast concurrent reading and safe updates.
* internal tables are now more standalone, having their own import and
export hooks
* route refresh/reload uses stale counter instead of stale flag,
allowing to drop walking the table at the beginning
* route modify (by BGP LLGR) is now done by a special refeed hook,
reimporting the modified routes directly without filters
Channels have now included rt_import_req and rt_export_req to hook into
the table instead of just one list node. This will (in future) allow for:
* channel import and export bound to different tables
* more efficient pipe code (dropping most of the channel code)
* conversion of 'show route' to a special kind of export
* temporary static routes from CLI
The import / export states are also updated to the new algorithms.
Routes are now allocated only when they are just to be inserted to the
table. Updating a route needs a locally allocated route structure.
Ownership of the attributes is also now not transfered from protocols to
tables and vice versa but just borrowed which should be easier to handle
in a multithreaded environment.