For now, all route attributes are stored as eattrs in ea_list. This
should make route manipulation easier and it also allows for a layered
approach of route attributes where updates from filters will be stored
as an overlay over the previous version.
As there is either a nexthop or another destination specification
(or othing in case of ROAs and Flowspec), it may be merged together.
This code is somehow quirky and should be replaced in future by better
implementation of nexthop.
Also flowspec validation result has its own attribute now as it doesn't
have anything to do with route nexthop.
This doesn't do anything more than to put the whole structure inside
adata. The overall performance is certainly going downhill; we'll
optimize this later.
Anyway, this is one of the latest items inside rta and in several
commits we may drop rta completely and move to eattrs-only routes.
The route scope attribute was used for simple user route marking. As
there is a better tool for this (custom attributes), the old and limited
way can be dropped.
Changes in internal API:
* Every route attribute must be defined as struct ea_class somewhere.
* Registration of route attributes known at startup must be done by
ea_register_init() from protocol build functions.
* Every attribute has now its symbol registered in a global symbol table
defined as SYM_ATTRIBUTE
* All attribute ID's are dynamically allocated.
* Attribute value custom formatting hook is defined in the ea_class.
* Attribute names are the same for display and filters, always prefixed
by protocol name.
Also added some unit testing code for filters with route attributes.
Implicit paddings have undefined values in C. We want the eattr blocks
to be comparable by memcmp and eattrs settable directly by structrure
literals. This check ensures that all paddings in eattr and bval are
explicit and therefore zeroed in all literals.
This commit removes the EAF_TYPE_* namespace completely and also for
route attributes, filter-based types T_* are used. This simplifies
fetching and setting route attributes from filters.
Also, there is now union bval which serves as an universal value holder
instead of private unions held separately by eattr and filter code.
There were several requests to allow use of 240.0.0.0/4 as a private
range, and Linux kernel already allows such routes, so perhaps we can
allow that too.
Thanks to Vincent Bernat and others for suggestion and patches.
Alignment of slabs should be at least sizeof(ptr) to avoid unaligned
pointers in slab structures. Fixme: Use proper way to choose alignment
for internal allocators.
Attach a prefix trie to IP/VPN/ROA tables. Use it for net_route() and
net_roa_check(). This leads to 3-5x speedups for IPv4 and 5-10x
speedup for IPv6 of these calls.
TODO:
- Rebuild the trie during rt_prune_table()
- Better way to avoid trie_add_prefix() in net_get() for existing tables
- Make it configurable (?)
Add option to socket interface for nonlocal binding, i.e. binding to an
IP address that is not present on interfaces. This behaviour is enabled
when SKF_FREEBIND socket flag is set. For Linux systems, it is
implemented by IP_FREEBIND socket flag.
Minor changes done by commiter.
This feature is intended mostly for checking that BIRD's allocation
strategies don't consume much memory space. There are some cases where
withdrawing routes in a specific order lead to memory fragmentation and
this output should give the user at least a notion of how much memory is
actually used for data storage and how much memory is "just allocated"
or used for overhead.
Also raising the "system allocator overhead estimation" from 8 to 16
bytes; it is probably even more. I've found 16 as a local minimum in
best scenarios among reachable machines. I couldn't find any reasonable
method to estimate this value when BIRD starts up.
This commit also fixes the inaccurate computation of memory overhead for
slabs where the "system allocater overhead estimation" was improperly
added to the size of mmap-ed memory.
This commit prevents use-after-free of routes belonging to protocols
which have been already destroyed, delaying also all the protocols'
shutdown until all of their routes have been finally propagated through
all the pipes down to the appropriate exports.
The use-after-free was somehow hypothetic yet theoretically possible in
rare conditions, when one BGP protocol authors a lot of routes and the
user deletes that protocol by reconfiguring in the same time as next hop
update is requested, causing rte_better() to be called on a
not-yet-pruned network prefix while the owner protocol has been already
freed.
In parallel execution environments, this would happen an inter-thread
use-after-free, causing possible heisenbugs or other nasty problems.
There is a simple universal IO loop, taking care of events, timers and
sockets. Primarily, one instance of a protocol should use exactly one IO
loop to do all its work, as is now done in BFD.
Contrary to previous versions, the loop is now launched and cleaned by
the nest/proto.c code, allowing for a protocol to just request its own
loop by setting the loop's lock order in config higher than the_bird.
It is not supported nor checked if any protocol changed the requested
lock order in reconfigure. No protocol should do it at all.
In previous versions, every thread used its own time structures,
effectively leading to different time in every thread and strange
logging messages.
The time processing code now uses global atomic variables to keep
current time available for fast concurrent reading and safe updates.
* internal tables are now more standalone, having their own import and
export hooks
* route refresh/reload uses stale counter instead of stale flag,
allowing to drop walking the table at the beginning
* route modify (by BGP LLGR) is now done by a special refeed hook,
reimporting the modified routes directly without filters
We can also quite simply allocate bigger blocks. Anyway, we need these
blocks to be aligned to their size which needs one mmap() two times
bigger and then two munmap()s returning the unaligned parts.
The user can specify -B <N> on startup when <N> is the exponent of 2,
setting the block size to 2^N. On most systems, N is 12, anyway if you
know that your configuration is going to eat gigabytes of RAM, you are
almost forced to raise your block size as you may easily get into memory
fragmentation issues or you have to raise your maximum mapping count,
e.g. "sysctl vm.max_map_count=(number)".
Use 16-way (4bit) branching in prefix trie instead of basic binary
branching. The change makes IPv4 prefix sets almost 3x faster, but
with more memory consumption and much more complicated algorithm.
Together with a previous filter change, it makes IPv4 prefix sets
about ~4.3x faster and slightly smaller (on my test data).
Add support for specifying a password in hexadecimal format, The result
is the same whether a password is specified as a quoted string or a
hex-encoded byte string, this just makes it more convenient to input
high-entropy byte strings as MAC keys.
Import the blake2-kat.h header with test vector output from the blake
reference implementation, and add tests to mac_test.c to compare the
output of the Bird MAC algorithm implementations with that reference
output.
Since the reference implementation only has test vectors for the full
output size, there are no tests for the smaller-sized output variants.
The Babel MAC authentication RFC recommends implementing Blake2s as one of
the supported algorithms. In order to achieve do this, add the blake2b and
blake2s hash functions for MAC authentication. The hashing function
implementations are the reference implementations from blake2.net.
The Blake2 algorithms allow specifying an arbitrary output size, and the
Babel MAC spec says to implement Blake2s with 128-bit output. To satisfy
this, we add two different variants of each of the algorithms, one using
the default size (256 bits for Blake2s, 512 bits for Blake2b), and one
using half the default output size.
Update to BIRD coding style done by committer.
Add a wrapper function in sysdep to get random bytes, and required checks
in configure.ac to select how to do it. The configure script tries, in
order, getrandom(), getentropy() and reading from /dev/urandom.
When an interface disappears, all the neighbors are freed as well. Seqno
requests were anyway not decoupled from them, leading to strange
segfaults. This fix adds a proper seqno request list inside neighbors to
make sure that no pointer to neighbor is kept after free.
For numeric operators, comma is used for disjunction in expressions like
"10, 20, 30..40". But for bitmask operators, comma is used for
conjunction in a way that does not really make much sense. Use always
explicit logical operators (&& and ||) to connect bitmask operators.
Thanks to Matt Corallo for the bugreport.
Add support to set or read outgoing MPLS labels using filters. Currently
this supports the addition of one label per route for the first next hop.
Minor changes by committer.
Implement function flow_explicate_part() to convert flowspec numeric
expressions to a simple list of (disjoint, sorted) intervals. That could
be used in filters to build f_tree-based int-sets from them.
The code in tm_format_real_time() mixed up two buffers and their
sizes, which may cause crash in MRT dumping code.
Thanks to Piotr Wydrych for the bugreport.
From now, there are no auxiliary pointers stored in the free slab nodes.
This led to strange debugging problems if use-after-free happened in
slab-allocated structures, especially if the structure's first member is
a next pointer.
This also reduces the memory needed by 1 pointer per allocated object.
OTOH, we now rely on pages being aligned to their size's multiple, which
is quite common anyway.
In general, events are code handling some some condition, which is
scheduled when such condition happened and executed independently from
I/O loop. Work-events are a subgroup of events that are scheduled
repeatedly until some (often significant) work is done (e.g. feeding
routes to protocol). All scheduled events are executed during each
I/O loop iteration.
Separate work-events from regular events to a separate queue and
rate limit their execution to a fixed number per I/O loop iteration.
That should prevent excess latency when many work-events are
scheduled at one time (e.g. simultaneous reload of many BGP sessions).
The babel protocol code was initialising objects returned from the slab
allocator by assigning to each of the struct members individually, but
wasn't touching the NODE member while doing so. This leads to warnings on
debug builds since commit:
baac700906 ("List expensive check.")
To fix this, introduce an sl_allocz() variant of the slab allocator which
will zero out the memory before returning it, and switch all the babel call
sites to use this version. The overhead for doing this should be negligible
for small objects, and in the case of babel, the largest object being
allocated was being zeroed anyway, so we can drop the memset in
babel_read_tlv().