
Programming Distributed Systems with YAMI4

5.4 Load Balancing And Fail-Over

The general-purpose libraries provide an additional high-level service for load balancing and fail-over.

If the outgoing message target has the following form:

failover:(sometarget|othertarget|...)

where the individual targets follow the usual format, then the semantics of message sending are extended as described in the example and discussion below.

For example, if there are two server agents listening on:

tcp://somehost:12345

and:

tcp://otherhost:12345

then the following group can be used:

failover:(tcp://somehost:12345|tcp://otherhost:12345)
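The composite target format can be illustrated with a small parser. This is only a sketch of the syntax shown above, not the library's own parsing code; the function name `parse_failover_target` and its error handling are assumptions made for illustration.

```python
def parse_failover_target(target: str) -> list[str]:
    """Split a composite target into its individual targets.

    Hypothetical helper: a target without the "failover:(" prefix is
    treated as a regular target and returned as a one-element list.
    """
    prefix = "failover:("
    if not target.startswith(prefix):
        return [target]
    if not target.endswith(")"):
        raise ValueError(f"malformed failover target: {target!r}")

    # Strip the prefix and the closing parenthesis, then split on '|'.
    inner = target[len(prefix):-1]
    parts = inner.split("|")
    if any(not part for part in parts):
        raise ValueError(f"empty target in group: {target!r}")
    return parts


group = "failover:(tcp://somehost:12345|tcp://otherhost:12345)"
print(parse_failover_target(group))
```

For the group above, the helper yields the two individual targets in the order they were written; any randomization would happen later, at sending time.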

When a new outgoing message is created for such a composite target, one of the servers is chosen at random. If the message succeeds, the process is finished and the other server is not contacted. If sending fails, the other server is tried.

This way, the group of targets provides randomized load balancing when both services are working properly and automatic fail-over when one of them is down.
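The selection logic described above can be sketched as follows. This is a simulation of the described semantics, not the library's implementation: `send_with_failover` and the `send_one` callback are hypothetical names, and returning `None` when every target fails is a placeholder, since the text does not say how that case is reported.

```python
import random
from typing import Callable, Optional, Sequence


def send_with_failover(
    targets: Sequence[str],
    send_one: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Try the targets in random order, one at a time.

    send_one(target) is a hypothetical blocking call that returns True
    once the message is confirmed as delivered and False once it is
    confirmed as failed.
    """
    rng = rng or random.Random()
    order = list(targets)
    rng.shuffle(order)  # random order gives randomized load balancing

    for target in order:
        # The next target is tried only after this one is confirmed failed.
        if send_one(target):
            return target
    return None  # assumption: all targets failed


# Usage: simulate a group in which somehost is down.
def fake_send(target: str) -> bool:
    return target != "tcp://somehost:12345"


chosen = send_with_failover(
    ["tcp://somehost:12345", "tcp://otherhost:12345"], fake_send
)
print(chosen)
```

With `somehost` simulated as down, the call always ends up at `otherhost`, either directly or after one failed attempt, depending on the random order.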

The use of target groups affects the asynchronous nature of messaging. Technically, the description of the target group is not part of the outgoing message, so the logic described above is implemented on top of regular messaging rather than being an inherent feature of it. In other words, the user thread that sends to a fail-over target has to wait until the success or failure of each attempted message is confirmed. As a result, fail-over messages are handled synchronously, and no further background activity related to the given message takes place after the sending operation returns.

An important property of the fail-over service is that the individual targets are processed in sequence. A single target has to be fully confirmed as failing before the next one is tried, so messaging delays accumulate as the (randomized) list of targets is processed. The delay for each target includes connection establishment, if the connection does not already exist (this can involve many connection retries, each with its own delay, according to the agent's configuration options), and the transmission of data packets, if transmission is attempted before the actual failure. Taking all these factors into account, there is no particular limit on the total accumulated delay in the fail-over scenario, so its use should be limited to systems that either are not sensitive to nondeterministic delays or do not introduce them. For these reasons, the fail-over functionality is mostly useful in local-area networks, where the delays related to connection management tend to be bounded.
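To make the accumulation concrete, the following sketch estimates the time spent on failing targets before the last target in the list is even tried. It is a simplified model under stated assumptions: every failing target is taken to fail during connection establishment after a fixed number of attempts with a fixed delay each, and transmission time is ignored. The parameter names and the numbers are made up for illustration and are not YAMI4 configuration options.

```python
def delay_before_last_target(
    num_targets: int,
    connect_attempts: int,
    delay_per_attempt_s: float,
) -> float:
    """Seconds spent on failing targets before the last one is tried.

    Simplified model: each failing target costs
    connect_attempts * delay_per_attempt_s, and targets are tried
    strictly one after another, so the costs add up.
    """
    failing_targets = max(num_targets - 1, 0)
    return failing_targets * connect_attempts * delay_per_attempt_s


# Three targets, the first two down, 5 attempts of 2 s each (made-up values):
print(delay_before_last_target(3, 5, 2.0))  # 20.0
```

Even in this optimistic model the delay grows linearly with the number of failing targets; in practice the per-target cost is not fixed, which is why no overall limit can be stated.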

Note:

In the current library version, the wait for completion or the wait for transmission for each tried target is performed without any timeout.