[MidoNet-dev] MidoNet: L2 gateway (connecting physical L2 into the virtual topology)

Pino de Candia gdecandia at midokura.com
Mon Feb 11 11:11:13 UTC 2013


Hi Folks, 

Over the weekend I did some reading on link aggregation (see Wikipedia and http://standards.ieee.org/getieee802/download/802.1AX-2008.pdf) and this morning I had a chat with Abel, so I wanted to put some thoughts in writing.

Use case: connect MidoNet virtual bridges and routers to physical L2 segments. Let's focus on just connecting to L2 segments in the cloud's data center as opposed to the tenant's network, because today we don't have a VPN solution. Let's also leave aside the discussion of connecting physical L2 segments to MidoNet virtual routers: this presents similar issues, with the exception of bridging loops/STP.

Note that today a cloud administrator can already connect a physical segment to the virtual topology - but it requires a lot of thinking and manual configuration. Before jumping into feature specification in MidoNet, let's talk about what a cloud admin would do as things stand today:
1. Because of some higher level requirements, the cloud admin decides he wants to connect a MidoNet virtual bridge VB1 to a physical segment that e.g. has some legacy databases - let's call this physical segment L2-DB, and the databases are on vlan 100 (assume switches in L2-DB are all symmetric, all carrying the same set of VLANs).
2. Find a physical switch in L2-DB that has at least one free port and is close enough to run a line into a physical server with a free port that's already running MidoNet or where we can install MidoNet. Task done: cabling and MidoNet agent ready (running, in a tunnel-zone, tunnel interface ready). The server port is eth5 and the admin manually sets it to UP.
3. How should the switch port be configured? We decide it's not going to be trunked, it's going to be dedicated to VLAN100 which is the vlan the databases are in. Remember, packets arriving at the server already have their vlan tag stripped.
4. Go to MidoNet's GUI and add a vport on VB1. Bind this new vport to eth5 on the server.
5. A few seconds later the MN agent on the server learns about the binding and does the setup to hook eth5 into the virtual topology. Packets start flowing between VMs on VB1 and the databases. Woot!

So far, so good. Now a few things can happen:

1. The admin wakes up in the middle of the night thinking "Oh, boy, what about some redundancy/resilience of the connection between VB1 and L2-DB?"
2. The admin is asked to connect some other MidoNet virtual bridge, VB2, to the same vlan, vlan100 in L2-DB.
3. The admin is asked to connect some other MidoNet virtual bridge, VB2, to a different vlan, vlan200, in L2-DB.
4. The admin is asked to connect VB1 to another VLAN on the same physical segment.
5. The admin is asked to connect VB1 to a VLAN on a different physical segment.

Scenarios 4 and 5 are not realistic today because MN virtual bridges don't handle VLAN tags. You can send VLAN-tagged packets into the virtual bridge (we used to have code to drop these packets, but I don't remember if that made it into Caddo), but MAC-learning is not done per VLAN-tag. Alternatively, you can send VLAN-stripped packets from different VLANs into the virtual bridge, and hope that the packets don't interfere (e.g. the two vlans aren't using the same L3 address range). Basically, if you want a VM to be on multiple vlans, today you have to give the VM multiple vnics on different vbridges. I don't know whether this makes a case for VLAN support in our virtual bridge.
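To make the "MAC-learning is not done per VLAN-tag" point concrete, here is a minimal sketch comparing a table keyed by MAC only with one keyed by (VLAN, MAC). The class names, port names and the MAC address are made up for illustration; this is not MidoNet's actual bridge code. It assumes the same MAC shows up in two VLANs, e.g. a router with one MAC on several VLAN interfaces.

```python
from typing import Dict, Optional, Tuple


class VlanAgnosticTable:
    """Hypothetical MAC table keyed by MAC only (VLAN tag ignored)."""

    def __init__(self) -> None:
        self._ports: Dict[str, str] = {}

    def learn(self, mac: str, vlan: int, port: str) -> None:
        # The VLAN is ignored, so a later entry overwrites an earlier one.
        self._ports[mac] = port

    def lookup(self, mac: str, vlan: int) -> Optional[str]:
        return self._ports.get(mac)


class PerVlanTable:
    """Hypothetical MAC table keyed by (VLAN, MAC)."""

    def __init__(self) -> None:
        self._ports: Dict[Tuple[int, str], str] = {}

    def learn(self, mac: str, vlan: int, port: str) -> None:
        self._ports[(vlan, mac)] = port

    def lookup(self, mac: str, vlan: int) -> Optional[str]:
        return self._ports.get((vlan, mac))


ROUTER_MAC = "02:00:00:00:00:01"  # made-up address seen in both VLANs

agnostic, per_vlan = VlanAgnosticTable(), PerVlanTable()
for table in (agnostic, per_vlan):
    table.learn(ROUTER_MAC, 100, "port-1")
    table.learn(ROUTER_MAC, 200, "port-2")

# The agnostic table now forwards vlan100 traffic for this MAC to port-2.
print(agnostic.lookup(ROUTER_MAC, 100), per_vlan.lookup(ROUTER_MAC, 100))
```

With the MAC-only table the second learn event silently replaces the first, which is the kind of interference between VLANs described above.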


Scenario 2 doesn't make sense in MidoNet today. We don't allow connecting two MidoNet virtual bridges; the reasoning is that you can make your vbridges as large as you want (as many ports as you want), so there is no need to connect vbridges. Scenario 2 would essentially connect two MN virtual bridges via vlan100 (and risk loops). So the admin's reply to scenario 2 is: "Any device or VM in VB2 that needs to be in VLAN100 should be given an interface connected to a new vport in VB1. VB1 is already in vlan100".

Scenario 3: that's reasonable. How to do it? Should we repeat the process we followed for vlan100 for vlan200: choose a server and physical switch to connect, do the cabling, and configure the switch port and server port? That's annoying. When doing the setup for vlan100, I should have put the physical switch port in trunk mode, then I wouldn't need any new cabling today, and I wouldn't need to go down to the data center. Ok, this time I'll do it right:
1. Go to the data-center, put the physical port in trunk mode and go back to the office.
2. Log into the server, put eth5 in trunk mode (I don't know whether this happens automatically or not) and then make a virtual interface off of eth5 for vlan100 named eth5.100 (eth5.100 strips/adds the vlan tag on packets ingressing/egressing).
3. Destroy the current vport binding on eth5 and replace it by binding the same vport to eth5.100.
4. Now create sub-interface eth5.200, create a new vport on VB2 and bind the vport to eth5.200.
5. A few minutes later, packets are flowing between VB1 and the databases, and between VB2 and whatever's in vlan200. Woot! What's more, the admin is pleased that he can easily bridge any other vlan carried by L2-DB into the virtual topology.

For setting up vlan tagging/stripping in Linux see any of:
https://wiki.archlinux.org/index.php/VLAN
http://linux.die.net/man/8/vconfig
http://unixfoo.blogspot.com.es/2007/12/linux-vlan-configuration.html
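As a rough sketch of the server-side part of the steps above, the helper below only builds the iproute2 command lines for a VLAN sub-interface; it does not run them. It assumes the iproute2 `ip link ... type vlan` syntax rather than the `vconfig` tool from the links, and the function name is made up for illustration. The vport bindings themselves are done in MidoNet and are not covered here.

```python
from typing import List


def vlan_subinterface_commands(parent: str, vlan_id: int) -> List[str]:
    """Return iproute2 commands that create e.g. eth5.100 on top of eth5.

    Hypothetical helper: it only formats strings, so it is safe to run
    anywhere. Executing the commands needs root on the MidoNet server.
    """
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN id out of range: {vlan_id}")
    name = f"{parent}.{vlan_id}"
    return [
        f"ip link set dev {parent} up",
        # The sub-interface strips the tag on ingress and adds it on egress.
        f"ip link add link {parent} name {name} type vlan id {vlan_id}",
        f"ip link set dev {name} up",
    ]


# eth5.100 for VB1's vport and eth5.200 for VB2's vport, as in the steps above.
for vlan in (100, 200):
    for command in vlan_subinterface_commands("eth5", vlan):
        print(command)
```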



What about Scenario 1? We're becoming more and more reliant on that single link. We have 3 SPOFs: the physical switch, the cable, and the physical server.

A. First, let's eliminate the cable SPOF. Today, the admin can manually set up link aggregation between the physical switch and the server. We'll do our best: configure the switch for LACP, verify that the appropriate kmod is loaded on the physical server (running Linux), configure LACP on the server, run another cable between the physical switch and the server. eth5 on the server now has to be a logical interface that is sitting on top of eth6 and eth7 which are the aggregated server ports connected to the switch.
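For the server side of option A, a minimal sketch using the Linux bonding driver in 802.3ad (LACP) mode might look like the following. It only builds iproute2 command strings, the function name is made up, and it assumes the name eth5 is free to be used for the logical interface (i.e. the physical NICs are eth6 and eth7, as in the text). The switch-side LACP configuration is vendor specific and not shown.

```python
from typing import List


def lacp_bond_commands(bond: str, members: List[str]) -> List[str]:
    """Return iproute2 commands for an 802.3ad bond over the given NICs.

    Hypothetical helper: it formats strings only and executes nothing.
    """
    if not members:
        raise ValueError("a bond needs at least one member interface")
    commands = [f"ip link add {bond} type bond mode 802.3ad"]
    for nic in members:
        # Members are taken down before being enslaved to the bond.
        commands.append(f"ip link set dev {nic} down")
        commands.append(f"ip link set dev {nic} master {bond}")
    commands.append(f"ip link set dev {bond} up")
    return commands


for command in lacp_bond_commands("eth5", ["eth6", "eth7"]):
    print(command)
```

The MidoNet vport binding would then point at the logical interface rather than at either physical port.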

B. What about eliminating the server SPOF by running a cable from the switch to another physical server? We can't put another vport on VB1 connected to vlan100. Why? We would be creating an L2 loop, and our virtual bridge doesn't implement STP.
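The loop in option B can be shown with a toy model: treat VB1 (one distributed bridge) and vlan100 (one broadcast domain) as graph nodes and every vport-to-switch connection as an edge. A second edge between two nodes that are already connected closes a cycle. The sketch below uses a small union-find; the function name and node names are made up for illustration.

```python
from typing import Dict, Iterable, Tuple

Link = Tuple[str, str]


def creates_loop(existing: Iterable[Link], new_link: Link) -> bool:
    """Return True if adding new_link would close a cycle in the L2 graph."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in existing:
        parent[find(a)] = find(b)  # merge the two connected components
    a, b = new_link
    # Already connected by some path: the new link makes a second path.
    return find(a) == find(b)


current = [("VB1", "vlan100")]  # the single link through the first server
print(creates_loop(current, ("VB1", "vlan100")))  # second server, same VLAN
print(creates_loop(current, ("VB2", "vlan200")))  # scenario 3, separate VLAN
```

Without STP (or link aggregation making the two links look like one), nothing in the virtual bridge would block the redundant path.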

C. What about eliminating the physical switch SPOF? Assuming a few of the switches in L2-DB are by the same vendor, they might support multi-chassis link-aggregation (MLAG). In that case, run a line from another physical switch to the single physical server. On the switch side, configure MLAG; on the server side, configure normal LACP (I'm guessing this should work, the server side shouldn't care about the other side, because the switches will advertise themselves as a single system - note that this is really only a guess). Since this also eliminates the cable SPOF, don't bother running two lines from the first switch to the server (as we did in A).

-----------
Epilogue: the admin is really worried about the server SPOF AND bridging loops. MidoNet delivers features to deal with them: link aggregation and STP.

Bridging loops: we should support STP. What I don't know yet is:
- Exactly what flavor of STP?
- Is there a suitable flavor of per-VLAN STP that can work with our VLAN-agnostic bridges? Does per-vlan STP automatically imply the virtual bridges need to be vlan-aware?

- What about Shortest path bridging (802.1aq)? This is very new - I assume the cloud will have devices that don't support this - so I would punt.

For now, I'm going to assume we can keep our virtual bridges VLAN-agnostic. 

What about link aggregation? I'm going to assume that these aggregated links are trunked - they're going to carry multiple VLANs. Also, I'm going to assume that there's no Linux implementation of MLAG (to aggregate links across different Linux servers) - which would allow us to do most of the work in Linux vs. writing MidoNet code. Therefore, we're going to have to implement LACP inside MidoNet. That will be equivalent to a proprietary MLAG because our virtual bridges are distributed across different physical servers, and that will eliminate the MidoNet server SPOF in the admin's original configuration.

What does doing LACP in MidoNet imply? Well, LACP has to run at a layer below VLANs and STP (this is just my intuition, needs verification). Since we're implementing LACP in MidoNet to work across servers, we already knew that we couldn't use the Linux bonding driver to handle LACP. But now I also suspect that we won't be able to strip the vlan tags in Linux - because I think that needs to happen after the LACP negotiation (above it in the protocol stack). The same goes for the BPDUs (Bridge Protocol Data Units) used in STP. This is all very vague, and we have to figure out the exact interaction with LACP, but my intuition is that LACP has to happen lower in the protocol stack, so doing it in MidoNet means we also have to push VLAN handling into MidoNet.
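One way to look at the layering question is how the frames are identified on the wire. LACPDUs use the Slow Protocols EtherType 0x8809 with subtype 1, classic STP BPDUs are 802.3 frames carrying the LLC header 0x42 0x42 0x03, and 802.1Q tagged frames use EtherType 0x8100. LACPDUs are normally sent untagged on each physical link, which fits the intuition above, though it does not settle how MidoNet should structure this. The sketch below is a toy classifier with a made-up function name, not MidoNet code, and it ignores variants such as per-VLAN STP.

```python
import struct

SLOW_PROTOCOLS_ETHERTYPE = 0x8809  # LACP and other slow protocols
LACP_SUBTYPE = 0x01
VLAN_ETHERTYPE = 0x8100            # 802.1Q tag
STP_LLC = bytes([0x42, 0x42, 0x03])  # DSAP, SSAP, control for BPDUs


def classify(frame: bytes) -> str:
    """Classify a raw Ethernet frame as lacp, stp-bpdu, vlan-<id> or other."""
    if len(frame) < 14:
        return "too-short"
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    payload = frame[14:]
    if type_or_len == SLOW_PROTOCOLS_ETHERTYPE and payload[:1] == bytes([LACP_SUBTYPE]):
        return "lacp"
    # Values up to 1500 are an 802.3 length field, followed by an LLC header.
    if type_or_len <= 1500 and payload[:3] == STP_LLC:
        return "stp-bpdu"
    if type_or_len == VLAN_ETHERTYPE and len(payload) >= 2:
        (tci,) = struct.unpack("!H", payload[:2])
        return f"vlan-{tci & 0x0FFF}"  # low 12 bits are the VLAN id
    return "other"


SRC = bytes.fromhex("020000000001")  # made-up source MAC
lacpdu = bytes.fromhex("0180c2000002") + SRC + b"\x88\x09" + b"\x01" + bytes(40)
bpdu = bytes.fromhex("0180c2000000") + SRC + struct.pack("!H", 38) + STP_LLC + bytes(35)
tagged = bytes.fromhex("ffffffffffff") + SRC + b"\x81\x00" + struct.pack("!H", 100) + b"\x08\x00"

print(classify(lacpdu), classify(bpdu), classify(tagged))
```

A datapath that has to take part in LACP needs to see the first kind of frame on the raw physical port, before any VLAN sub-interface has had a chance to strip tags.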

Who should handle VLANs and LACP in the MidoNet virtual device model? We have a few choices:
1. Make the virtual bridges aware of VLAN and LACP.
2. Create a new concept - a meta-bridge. A meta-bridge (needs better name) is the equivalent of a physical switch. It may contain multiple virtual bridges in the same way that a physical bridge contains (or supports) multiple vlans.

Why do we need another layer of bridge software? Because multiple virtual bridges may be using the same aggregated links - so no single virtual bridge can be responsible for the LACP negotiation on behalf of the others (which would be a weird model). And we really do want to share/re-use those aggregated links between multiple vbridges/vlans.
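To illustrate the second choice, here is a very rough data-model sketch of a meta-bridge that owns one aggregated link and maps VLANs to virtual bridges. Every name in it (MetaBridge, AggregatedLink, the field names, "serverA:eth6") is made up for this example and nothing here reflects an existing MidoNet API. It also assumes each VLAN maps to at most one vbridge, in line with the reply to scenario 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AggregatedLink:
    """Hypothetical LAG whose members may sit on different servers."""
    name: str
    member_ports: List[str]


@dataclass
class MetaBridge:
    """Hypothetical owner of the LAG; vbridges hang off it per VLAN."""
    name: str
    uplink: AggregatedLink
    vlan_to_vbridge: Dict[int, str] = field(default_factory=dict)

    def attach(self, vlan_id: int, vbridge: str) -> None:
        # One vbridge per VLAN, so two vbridges are never joined via a VLAN.
        if vlan_id in self.vlan_to_vbridge:
            raise ValueError(f"vlan {vlan_id} is already attached")
        self.vlan_to_vbridge[vlan_id] = vbridge

    def vbridge_for(self, vlan_id: int) -> Optional[str]:
        """Pick the vbridge that should receive a frame tagged vlan_id."""
        return self.vlan_to_vbridge.get(vlan_id)


lag = AggregatedLink("lag0", ["serverA:eth6", "serverB:eth6"])
meta = MetaBridge("MB-L2-DB", lag)
meta.attach(100, "VB1")
meta.attach(200, "VB2")
print(meta.vbridge_for(100), meta.vbridge_for(200), meta.vbridge_for(300))
```

In this shape the LACP negotiation belongs to the meta-bridge's uplink, and the virtual bridges only ever see untagged traffic for their own VLAN, so they can stay VLAN-agnostic.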


-----------
Finally, and for completeness, note that we haven't


------------
As always, feedback is appreciated. I tried not to go too deep into implementation - just enough to understand how much work certain features imply. Let's try to keep the focus of this thread on defining the feature, not the implementation.

thanks,
Pino
