Add strict checking for netlink KRT dumps to avoid PMTU cache records
from FNHE table dump along with KRT.
Linux Kernel added FNHE table dump to the netlink API in patch:
8d3b68cd37.1561131177.git.sbrivio@redhat.com/
Therefore, since Linux 5.3 these route cache entries are dumped together
with regular routes during periodic KRT scans, which in some cases may be
huge amount of useless data. This can be avoided by using strict checking
for netlink dumps:
https://lore.kernel.org/netdev/20181008031644.15989-1-dsahern@kernel.org/
The patch mitigates the risk of receiving unknown and potentially large
number of FNHE records that would block BIRD I/O in each sync. There is a
known issue caused by the GRE tunnels on Linux that seems to be creating
one FNHE record for each destination IP address that is routed through
the tunnel, even when the PMTU equals to GRE interface MTU.
Thanks to Tomas Hlavacek for the original patch.
Kernel uses cloned routes to keep route cache entries, but reports them
together with regular routes. They were skipped implicitly as they
do not have rtm_protocol filled. Add explicit check for cloned flag
and skip such routes explicitly.
Also, improve debug logs of skipped routes.
Add option to socket interface for nonlocal binding, i.e. binding to an
IP address that is not present on interfaces. This behaviour is enabled
when SKF_FREEBIND socket flag is set. For Linux systems, it is
implemented by IP_FREEBIND socket flag.
Minor changes done by commiter.
Currently, BIRD ignores dead routes to consider them absent. But it also
ignores its own routes and thus it can not correctly manage such routes
in some cases. This patch makes an exception for routes with proto bird
when ignoring dead routes, so they can be properly updated or removed.
Thanks to Alexander Zubkov for the original patch.
Lexer expression for bytestring was too loose, accepting also
full-length IPv6 addresses. It should be restricted such that
colon is used between every byte or never.
Fix the regex and also add some test cases for it.
Thanks to Alexander Zubkov for the bugreport
Add operators .min and .max to find minumum or maximum element in sets
of types: clist, eclist, lclist. Example usage:
bgp_community.min
bgp_ext_community.max
filter(bgp_large_community, [(as1, as2, *)]).min
Signed-off-by: Alexander Zubkov <green@qrator.net>
The BSD kernel does not support the onlink flag and BIRD does not use
direct routes for next hop validation, instead depends on interface
address ranges. We would like to handle PtMP cases with only host
addresses configured, like:
ifconfig wg0 192.168.0.10/32
route add 192.168.0.4 -iface wg0
route add 192.168.0.8 -iface wg0
To accept BIRD routes with onlink next-hop, like:
route 192.168.42.0/24 via 192.168.0.4%wg0 onlink
BIRD would dismiss the route when receiving from the kernel, as the
next-hop 192.168.0.4 is not part of any interface subnet and onlink
flag is not kept by the BSD kernel.
The commit fixes this by assuming that for routes received from the
kernel, any next-hop is onlink on ifaces with only host addresses.
Thanks to Stefan Haller for the original patch.
RFC 6810 and RFC 8210 specify that the "Max Length" value MUST NOT be
less than the Prefix Length element (underflow). On the other side,
overflow of the Max Length element also is possible, it being an 8-bit
unsigned integer allows for values larger than 32 or 128. This also
implicitly ensures there is no overflow of "Length" value.
When a PDU is received where the Max Length field is corrputed, the RTR
client (BIRD) should immediately terminate the session, flush all data
learned from that cache, and log an error for the operator.
Minor changes done by commiter.
Compare all IA_* flags that are set by sysdep iface code.
The old code ignores IA_SECONDARY flag when comparing whether iface
address updates from kernel changed anything. This is usually not an
issue as kernel removes all secondary addresses due to removal of the
primary one, but it breaks when sysctl 'promote_secondaries' is enabled
and kernel promotes secondary addresses to primary ones.
Thanks to 'Alexander' for the bugreport.
For convenience, Trie functions generally accept as input values not only
NET_IPx types of nets, but also NET_VPNx and NET_ROAx types. But returned
values are always NET_IPx types.
This feature is intended mostly for checking that BIRD's allocation
strategies don't consume much memory space. There are some cases where
withdrawing routes in a specific order lead to memory fragmentation and
this output should give the user at least a notion of how much memory is
actually used for data storage and how much memory is "just allocated"
or used for overhead.
Also raising the "system allocator overhead estimation" from 8 to 16
bytes; it is probably even more. I've found 16 as a local minimum in
best scenarios among reachable machines. I couldn't find any reasonable
method to estimate this value when BIRD starts up.
This commit also fixes the inaccurate computation of memory overhead for
slabs where the "system allocater overhead estimation" was improperly
added to the size of mmap-ed memory.
The prefix trie now supports longest-prefix-match query by function
trie_match_longest_ipX() and it can be extended to iteration over all
covering prefixes for a given prefix (from longest to shortest) using
TRIE_WALK_TO_ROOT_IPx() macro.
This basically means that:
* there are some more levels of indirection and asynchronicity, mostly
in cleanup procedures, requiring correct lock ordering
* all the internal table operations (prune, next hop update) are done
without blocking the other parts of BIRD
* the protocols may get their own loops very soon
To access route attribute cache from multiple threads at once, we have
to lock the cache on writing. The route attributes data structures are
safe to read unless somebody tries to tamper with the cache itself.
This commit prevents use-after-free of routes belonging to protocols
which have been already destroyed, delaying also all the protocols'
shutdown until all of their routes have been finally propagated through
all the pipes down to the appropriate exports.
The use-after-free was somehow hypothetic yet theoretically possible in
rare conditions, when one BGP protocol authors a lot of routes and the
user deletes that protocol by reconfiguring in the same time as next hop
update is requested, causing rte_better() to be called on a
not-yet-pruned network prefix while the owner protocol has been already
freed.
In parallel execution environments, this would happen an inter-thread
use-after-free, causing possible heisenbugs or other nasty problems.
There is a simple universal IO loop, taking care of events, timers and
sockets. Primarily, one instance of a protocol should use exactly one IO
loop to do all its work, as is now done in BFD.
Contrary to previous versions, the loop is now launched and cleaned by
the nest/proto.c code, allowing for a protocol to just request its own
loop by setting the loop's lock order in config higher than the_bird.
It is not supported nor checked if any protocol changed the requested
lock order in reconfigure. No protocol should do it at all.
In previous versions, every thread used its own time structures,
effectively leading to different time in every thread and strange
logging messages.
The time processing code now uses global atomic variables to keep
current time available for fast concurrent reading and safe updates.
In some specific configurations, it was possible to send BIRD into an
infinite loop of recursive next hop resolution. This was caused by route
priority inversion.
To prevent priority inversions affecting other next hops, we simply
refuse to resolve any next hop if the best route for the matching prefix
is recursive or any other route with the same preference is recursive.
Next hop resolution doesn't change route priority, therefore it is
perfectly OK to resolve BGP next hops e.g. by an OSPF route, yet if the
same (or covering) prefix is also announced by iBGP, by retraction of
the OSPF route we would get a possible priority inversion.
* internal tables are now more standalone, having their own import and
export hooks
* route refresh/reload uses stale counter instead of stale flag,
allowing to drop walking the table at the beginning
* route modify (by BGP LLGR) is now done by a special refeed hook,
reimporting the modified routes directly without filters