[MidoNet-dev] Feature Proposal: Tunnel Health checks

de Palol, Marc marc at midokura.jp
Tue Feb 19 22:12:25 UTC 2013


Hi all,

I agree that the results will need to be stored in Cassandra, for the reason
Galo gave: the metrics are already there and the GUI already knows how to
fetch them.

About this 'are you there' problem: I wonder if we could use ZooKeeper's
ephemeral nodes. These nodes exist in ZooKeeper only as long as the session
that created them is alive. We could create a znode for every tunnel, tied
to the session. If a tunnel disappears or stops working, the ephemeral node
disappears (I don't know how yet, we should work out the details here).
Watchers could be set on these nodes to notify whoever is responsible for
recreating the tunnel.
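
To make the idea a bit more concrete, here is a toy sketch of the semantics I
have in mind. It is a pure in-memory model, not a ZooKeeper client: the class
ToyEphemeralStore, the /tunnels/A-B paths and the session names are all made
up for illustration. It covers the two ways a per-tunnel node can go away: the
owning session ends (the agent dies), or the node is deleted explicitly. The
second case matters because a broken tunnel does not end the agent's session
by itself, so something on the agent would have to remove the node.

```python
from collections import defaultdict


class ToyEphemeralStore:
    """In-memory stand-in for ephemeral-node semantics (not a real client)."""

    def __init__(self):
        self._owner = {}                    # path -> session id that created it
        self._watchers = defaultdict(list)  # path -> one-shot callbacks

    def create_ephemeral(self, session_id, path):
        if path in self._owner:
            raise KeyError(f"node already exists: {path}")
        self._owner[path] = session_id

    def exists(self, path):
        return path in self._owner

    def watch_deletion(self, path, callback):
        # One-shot: the callback fires once and is then forgotten.
        self._watchers[path].append(callback)

    def delete(self, path):
        if self._owner.pop(path, None) is None:
            return  # nothing to delete
        for callback in self._watchers.pop(path, []):
            callback(path)

    def close_session(self, session_id):
        # Every ephemeral node owned by the session goes away with it.
        for path in [p for p, s in self._owner.items() if s == session_id]:
            self.delete(path)


store = ToyEphemeralStore()
recreate_requests = []  # stands in for "notify whoever recreates tunnels"

# Hypothetical layout: one node per tunnel, owned by the agent's session.
store.create_ephemeral("agent-A", "/tunnels/A-B")
store.create_ephemeral("agent-A", "/tunnels/A-C")
store.watch_deletion("/tunnels/A-B", recreate_requests.append)
store.watch_deletion("/tunnels/A-C", recreate_requests.append)

# Case 1: the agent notices one broken tunnel and removes that node itself.
store.delete("/tunnels/A-B")

# Case 2: the whole agent dies, its session ends, remaining nodes follow.
store.close_session("agent-A")

print(recreate_requests)
```

In both cases the watcher is told which tunnel node vanished, which is all the
party responsible for recreating the tunnel would need.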


On Tue, Feb 19, 2013 at 5:27 PM, Navarro, Galo <galo at midokura.jp> wrote:

> Just to clarify after talking w. Guillermo:
>
> Even though we don't need to actively listen for an ACK (for
> the reasons explained before), we do need to implement a mechanism to
> receive and reply to "are-you-there" messages received on one side of
> the tunnel.
>
> /g
>
> On 19 February 2013 16:50, Navarro, Galo <galo at midokura.jp> wrote:
> > On 19 February 2013 16:31, Guillermo Ontañón <guillermo at midokura.jp> wrote:
> >> On Tue, Feb 19, 2013 at 4:21 PM, Navarro, Galo <galo at midokura.jp> wrote:
> >>>
> >>> Hi Guillermo, thanks for the quick feedback! Some comments below
> >>>
> >>> >> - TunnelPorts become active on each side of the tunnel, the TunnelDoc
> >>> >>   becomes aware of local ports and starts taking care of them.
> >>> >> - Regularly, for each cared-for tunnel the TunnelDoc:
> >>> >>     - Sends a packet to the other peer
> >>> >>     - Logs variation on RX value of the PortStats on the tunnel's
> >>> >>       local port.
> >>> >>     - If variation = 0, increment a "no-increment" counter
> >>> >>     - If "no-increment" counter > threshold, trigger alert message for
> >>> >>       lack of connectivity on the REVERSE direction of the tunnel
> >>> >>       (e.g.: if the TunnelDoc at A spots no RX, the alert refers to
> >>> >>       loss of connectivity from B to A).
> >>> >>     - Implement whatever corrective measures upon receiving the alert
> >>> >>       (typically, the DatapathController could recreate the tunnel)
> >>>
> >>> > This is not a lot of extra traffic, but the number of tunnels does grow
> >>> > quadratically with the number of MM agents. I propose a slight variation
> >>> > on the above to avoid sending traffic on non-idle tunnels, along the
> >>> > lines of what is done by IPsec's dead peer detection:
> >>> >
> >>> > http://www.ietf.org/rfc/rfc3706.txt
> >>> >
> >>> > Basically, from the point of view of one of the nodes, it looks like this:
> >>> >
> >>> >    * Monitor idleness (by looking at RX as you outline above) and do
> >>> >      nothing and consider the tunnel healthy while idleness doesn't go
> >>> >      above a certain threshold.
> >>> >    * When the tunnel becomes idle, send an "are-you-there" packet to the
> >>> >      peer (we could just use the tunnel-key for this).
> >>> >    * When an "are-you-there" packet is received, reply to it with an
> >>> >      Ack.
> >>>
> >>> This is definitely better. I messed up copypastes badly but the idea
> >>> was basically what you explain, the "send packet to another peer"
> >>> would be conditioned to several cycles without increment on the
> >>> "no-data-increment" counter.
> >>
> >>
> >>
> >> But I think that for this to work you need the 'ack' reply, would it be
> >> included? Otherwise a host may be receiving traffic (non-idle) but not
> >> sending, and would never send any 'are-you-there' packets to the other
> >> side because its RX is increasing.
> >
> > But note that A is only monitoring *incoming* connectivity (B->A).
> > This is because once the packet leaves A, its agent can tell that
> > something is broken in the line, but not in which direction (is the PING
> > lost because A->B is cut, or the ACK lost because B->A is cut?). We need
> > to report the health of each direction.
> >
> > So, A doesn't care about A->B. It only asserts that data is arriving
> > from B. With this in mind, once A's agent sends the "are-you-there"
> > message it doesn't really need to pay attention to the ACK.
> >
> > From the other side, B will do the same in reverse. If the
> > "are-you-there" never arrives because A->B is broken, B will notice
> > the static rx count and start a health check of the A->B direction.
> >
> > Does that make sense?
> > /g
>
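
The per-direction check discussed in the quoted thread could be sketched
roughly as below. This is only an illustration under assumptions: the names
TunnelRxMonitor and on_packet, the two thresholds and the string results are
invented here, and the RX value is simply passed in rather than read from the
real PortStats. Each side watches only its own RX counter, sends an
"are-you-there" probe once the tunnel has been idle for a while, and raises an
alert about the incoming direction if RX still does not move. Counter
wrap-around is ignored.

```python
class TunnelRxMonitor:
    """Watches the incoming direction (peer -> local) of one tunnel."""

    def __init__(self, idle_threshold, alert_threshold):
        assert 0 < idle_threshold < alert_threshold
        self.idle_threshold = idle_threshold    # idle cycles before probing
        self.alert_threshold = alert_threshold  # idle cycles before alerting
        self.last_rx = None
        self.idle_cycles = 0

    def tick(self, rx_packets):
        """Run one check cycle; returns 'healthy', 'probe' or 'alert'."""
        if self.last_rx is None or rx_packets > self.last_rx:
            self.idle_cycles = 0       # traffic arrived, direction is alive
        else:
            self.idle_cycles += 1      # the "no-increment" counter
        self.last_rx = rx_packets

        if self.idle_cycles >= self.alert_threshold:
            return "alert"             # report loss of peer -> local
        if self.idle_cycles >= self.idle_threshold:
            return "probe"             # send "are-you-there" to the peer
        return "healthy"


def on_packet(is_are_you_there):
    """Receiver side: answer a probe so the prober's RX counter moves."""
    return "ack" if is_are_you_there else None


monitor = TunnelRxMonitor(idle_threshold=2, alert_threshold=4)
rx_samples = [10, 15, 15, 15, 15, 15, 16]
states = [monitor.tick(rx) for rx in rx_samples]
print(states)
```

With these sample values the monitor stays quiet while RX grows, starts
probing after two idle cycles, alerts after four, and goes back to healthy as
soon as any packet (for example the peer's ack) arrives.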