[MidoNet-dev] Feature Proposal: Tunnel Health checks

Navarro, Abel abel at midokura.com
Tue Feb 19 15:48:30 UTC 2013


What would be the maximum time to detect that a tunnel is down?
1 s <http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/fasthelo.html>?
50 ms <http://en.wikipedia.org/wiki/Resilient_Packet_Ring>?
30 s <http://en.wikipedia.org/wiki/Border_Gateway_Protocol>?
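For the counter-based scheme quoted below, the worst-case detection time can be bounded from the poll interval and the "no-increment" threshold. The sketch below is a rough estimate under stated assumptions, not a figure from the proposal. It assumes the alert fires once the counter exceeds the threshold, i.e. after threshold + 1 consecutive polls with no RX increment. It also assumes the failure can land just after a poll while one last packet still gets through, so the next poll sees an increment and resets the counter. The function names are hypothetical.

```python
def worst_case_detection_time(poll_interval_s: float, threshold: int) -> float:
    """Upper bound (seconds) between a tunnel failing and the alert firing.

    Assumption: the alert fires when the "no-increment" counter exceeds
    `threshold`, i.e. after threshold + 1 consecutive polls with no RX
    increment. In the worst case a last packet slips through right after a
    poll, just before the failure; the next poll still sees an increment and
    resets the counter, costing almost one extra interval.
    """
    if poll_interval_s <= 0 or threshold < 0:
        raise ValueError("need poll_interval_s > 0 and threshold >= 0")
    return (threshold + 2) * poll_interval_s


def max_poll_interval(target_s: float, threshold: int) -> float:
    """Largest poll interval that keeps the worst case within `target_s`."""
    if target_s <= 0 or threshold < 0:
        raise ValueError("need target_s > 0 and threshold >= 0")
    return target_s / (threshold + 2)
```

With a threshold of 3, for example, a 1 s target would mean polling at least every 0.2 s, a 50 ms target every 10 ms, and a 30 s target every 6 s. In the idle-probing variant discussed later in the thread, the probe round trip would add to this bound.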


On Tue, Feb 19, 2013 at 4:31 PM, Guillermo Ontañón <guillermo at midokura.jp> wrote:

> On Tue, Feb 19, 2013 at 4:21 PM, Navarro, Galo <galo at midokura.jp> wrote:
>
>> Hi Guillermo, thanks for the quick feedback! Some comments below
>>
>> >> - TunnelPorts become active on each side of the tunnel; the TunnelDoc
>> >>   becomes aware of the local ports and starts taking care of them.
>> >> - Periodically, for each tunnel it takes care of, the TunnelDoc:
>> >>     - Sends a packet to the other peer.
>> >>     - Logs the variation in the RX value of the PortStats on the
>> >>       tunnel's local port.
>> >>     - If the variation is 0, increments a "no-increment" counter.
>> >>     - If the "no-increment" counter exceeds a threshold, triggers an
>> >>       alert for lack of connectivity in the REVERSE direction of the
>> >>       tunnel (e.g.: if the TunnelDoc at A sees no RX, the alert refers
>> >>       to loss of connectivity from B to A).
>> >>     - Applies whatever corrective measures are needed upon receiving
>> >>       the alert (typically, the DatapathController could recreate the
>> >>       tunnel).
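The per-tunnel loop described above could look roughly like the following Python sketch. `TunnelHealthChecker`, `read_rx`, `send_probe` and `on_alert` are hypothetical names, not MidoNet APIs. RX is assumed to be a cumulative packet counter read from the tunnel port's PortStats. The alert is raised once per outage to avoid flooding.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TunnelState:
    last_rx: int = 0        # last cumulative RX count seen on the local tunnel port
    no_increment: int = 0   # consecutive polls with no RX variation
    alerted: bool = False   # avoid re-raising the same alert on every poll


class TunnelHealthChecker:
    """Hypothetical sketch of the TunnelDoc loop; not MidoNet's actual API."""

    def __init__(self, read_rx: Callable[[str], int],
                 send_probe: Callable[[str], None],
                 on_alert: Callable[[str], None],
                 threshold: int) -> None:
        self.read_rx = read_rx        # assumed: port id -> cumulative RX packets
        self.send_probe = send_probe  # assumed: sends one packet to the tunnel's peer
        self.on_alert = on_alert      # assumed: e.g. ask DatapathController to recreate it
        self.threshold = threshold
        self.tunnels: Dict[str, TunnelState] = {}

    def add_port(self, port_id: str) -> None:
        """Start taking care of a tunnel port once it becomes active."""
        self.tunnels[port_id] = TunnelState(last_rx=self.read_rx(port_id))

    def tick(self) -> None:
        """One periodic pass over all cared-for tunnels."""
        for port_id, st in self.tunnels.items():
            self.send_probe(port_id)
            rx = self.read_rx(port_id)
            if rx == st.last_rx:
                st.no_increment += 1
            else:
                st.no_increment = 0
                st.alerted = False
            st.last_rx = rx
            # No RX on our side means the REVERSE direction (peer -> us) is broken.
            if st.no_increment > self.threshold and not st.alerted:
                st.alerted = True
                self.on_alert(port_id)
```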
>>
>> > This is not a lot of extra traffic, but the number of tunnels grows
>> > quadratically with the number of MM agents. I propose a slight variation
>> > on the above that avoids sending traffic on non-idle tunnels, along the
>> > lines of what IPsec's dead peer detection does:
>> >
>> > http://www.ietf.org/rfc/rfc3706.txt
>> >
>> > Basically, from the POV of one of the nodes, it looks like this:
>> >
>> >    * Monitor idleness (by looking at RX as you outline above); do nothing
>> >      and consider the tunnel healthy while idleness stays below a
>> >      certain threshold.
>> >    * When the tunnel becomes idle, send an "are-you-there" packet to the
>> >      peer (we could just use the tunnel-key for this).
>> >    * When an "are-you-there" packet is received, reply to it with an Ack.
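The idle-triggered variation can be sketched per tunnel as a small state machine, loosely modelled on RFC 3706's R-U-THERE / R-U-THERE-ACK exchange. This is a hypothetical illustration. `DpdTunnel`, the message strings and both thresholds are assumptions. Declaring the tunnel down after `max_unanswered` unanswered probes is one possible policy that the thread does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DpdTunnel:
    """Hypothetical per-tunnel state for an RFC 3706-style idle check."""
    send: Callable[[str], None]  # assumed: sends a control message to the peer
    idle_threshold: int          # polls without RX before we start probing
    max_unanswered: int          # unanswered probes before declaring the tunnel down
    last_rx: int = 0
    idle_polls: int = 0
    unanswered: int = 0
    down: bool = False

    def on_control(self, msg: str) -> None:
        if msg == "ARE_YOU_THERE":
            self.send("ACK")     # always answer, so the prober sees return traffic
        elif msg == "ACK":
            self.idle_polls = 0
            self.unanswered = 0
            self.down = False

    def poll(self, rx: int) -> None:
        if rx != self.last_rx:   # data arrived: healthy, no probing needed
            self.last_rx = rx
            self.idle_polls = 0
            self.unanswered = 0
            self.down = False
            return
        self.idle_polls += 1
        if self.idle_polls > self.idle_threshold:
            if self.unanswered >= self.max_unanswered:
                self.down = True  # corrective action (e.g. recreate tunnel) would go here
            self.send("ARE_YOU_THERE")  # keep probing so recovery is noticed
            self.unanswered += 1
```

Busy tunnels never enter the probing branch, which is the point of the variation: probe traffic is only generated on idle tunnels.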
>>
>> This is definitely better. I messed up the copy-paste badly, but the idea
>> was basically what you explain: the "send packet to the other peer" step
>> would only happen after several cycles without an RX increment, as tracked
>> by the "no-increment" counter.
>>
>
>
> But I think that for this to work you need the "ack" reply; would it be
> included? Otherwise a host may be receiving traffic (non-idle) but not
> sending any, and it would never send "are-you-there" packets to the other
> side because its RX keeps increasing.
>
>
>>
>> >> The counters described above would be stored in a common data
>> >> structure in Cassandra so that API clients could easily retrieve the
>> >> list of failing tunnels.
>> >
>> > I wonder why Cassandra and not Zookeeper?
>>
>> Just because, as far as I could see, metrics are stored in Cassandra,
>> but if it makes more sense to keep them in ZK, that's perfectly OK with
>> me.
>>
>> Thanks!
>>
>> /g
>>
>
>
>
> --
> -- Guillermo Ontañón
>