This variant of logging avoids calling write() for every log line,
allowing for waitless logging. This makes heavy logging less heavy
and more useful for race condition debugging.
The original logging routines were locking a common mutex. This led to
massive underperformance and unwanted serialization when heavily logging
due to lock contention. Now the logging is lockless, though still
serializing on write() syscalls to the same filedescriptor.
This change also brings in a persistent logging channel structures and
thus avoids writing into active configuration data structures during
regular run.
Add a current_time_now() function which gets an immediate monotonic
timestamp instead of using the cached value from the event loop. This is
useful for callers that need precise times, such as the Babel RTT
measurement code.
Minor changes by committer.
Now sk_open() requires an explicit IO loop to open the socket in. Also
specific functions for socket RX pause / resume are added to allow for
BGP corking.
And last but not least, socket reloop is now synchronous to resolve
weird cases of the target loop stopping before actually picking up the
relooped socket. Now the caller must ensure that both loops are locked
while relooping, and this way all sockets always have their respective
loop.
If there are lots of loops in a single thread and only some of the loops
are actually active, the other loops are now kept aside and not checked
until they actually get some timers, events or active sockets.
This should help with extreme loads like 100k tables and protocols.
Also ping and loop pickup mechanism was allowing subtle race
conditions. Now properly handling collisions between loop ping and pickup.
On large configurations, too many threads would spawn with one thread
per loop. Therefore, threads may now run multiple loops at once. The
thread count is configurable and may be changed during run. All threads
are spawned on startup.
This change helps with memory bloating. BIRD filters need large
temporary memory blocks to store their stack and also memory management
keeps its hot page storage per-thread.
Known bugs:
* Thread autobalancing is not yet implemented.
* Low latency loops are executed together with standard loops.
Log message before aborting due to watchdog timeout. We have to use
async-safe write to debug log, as it is done in signal handler.
Minor changes from committer.
Add option to socket interface for nonlocal binding, i.e. binding to an
IP address that is not present on interfaces. This behaviour is enabled
when SKF_FREEBIND socket flag is set. For Linux systems, it is
implemented by IP_FREEBIND socket flag.
Minor changes done by commiter.
There is a simple universal IO loop, taking care of events, timers and
sockets. Primarily, one instance of a protocol should use exactly one IO
loop to do all its work, as is now done in BFD.
Contrary to previous versions, the loop is now launched and cleaned by
the nest/proto.c code, allowing for a protocol to just request its own
loop by setting the loop's lock order in config higher than the_bird.
It is not supported nor checked if any protocol changed the requested
lock order in reconfigure. No protocol should do it at all.
In previous versions, every thread used its own time structures,
effectively leading to different time in every thread and strange
logging messages.
The time processing code now uses global atomic variables to keep
current time available for fast concurrent reading and safe updates.
In general, events are code handling some some condition, which is
scheduled when such condition happened and executed independently from
I/O loop. Work-events are a subgroup of events that are scheduled
repeatedly until some (often significant) work is done (e.g. feeding
routes to protocol). All scheduled events are executed during each
I/O loop iteration.
Separate work-events from regular events to a separate queue and
rate limit their execution to a fixed number per I/O loop iteration.
That should prevent excess latency when many work-events are
scheduled at one time (e.g. simultaneous reload of many BGP sessions).
When dynamic BGP with remote range is configured, MD5SIG needs to use
newer socket option (TCP_MD5SIG_EXT) to specify remote addres range for
listening socket.
Thanks to Adam Kułagowski for the suggestion.
The C11 specification allows only sig_atomic_t and _Atomic variable
access. All other accesses to global variables are undefined behavior.
Using int was probably OK on x86 and x86_64; yet there were some reports
from other architectures (especially some MIPS) that in rare cases,
after issuing SIGHUP, BIRD did strange things.
Support for dynamically spawning BGP protocols for incoming connections.
Use 'neighbor range' to specify range of valid neighbor addresses, then
incoming connections from these addresses spawn new BGP instances.
Since v2 we have multiple listening BGP sockets, and each BGP protocol
has associated one of them. Use listening socket that accepted the
incoming connection as a key in the dispatch process so only BGP
protocols assocaited with that listening socket can be selected.
This is necesary for proper dispatch when VRFs are used.
FreeBSD silently changes TTL to 1 when MSG_DONTROUTE is used, even when
it is explicitly set to another value. That breaks TTL security sockets,
including BFD which always uses TTL 255. Bad FreeBSD!
no more warnings
No more warnings over me
And while it is being compiled all the log is black and white
Release BIRD now and then let it flee
(use the melody of well-known Oh Freedom!)
BSD systems cannot use SO_DONTROUTE, because it does not work properly
with multicast packets (perhaps it tries to find iface based on multicast
group address). But we can use MSG_DONTROUTE sendmsg() flag for unicast
packets. Works on FreeBSD, is ignored on OpenBSD and is broken on NetBSD
(i guess due to integrated routing table and ARP table).
Use full time precision to initialize random generator. The old
code was prone to initialize it to the same values in specific
circumstances (boot without RTC, multiple VMs starting at once).
On Linux, setting the ToS will also set the priority and the range of
accepted values is quite limited (masked by 0x1e). Therefore, 0xc0 is
translated to a priority of 0, not something we want, overriding the
"7" priority which was set previously explicitely. To avoid that, just
move setting priority later in the code.
Thanks to Vincent Bernat for the patch.