[MidoNet-dev] Feature Proposal: Resource Tagging

Ishimoto, Ryu ryu at midokura.com
Thu Mar 14 08:09:03 UTC 2013

Thanks for the feedback.

On Thu, Mar 14, 2013 at 12:50 AM, Navarro, Galo <galo at midokura.com> wrote:

> Just to add some info on the redundancy point: Solr can provide
> replication to Lucene (I used this a few years ago, when it was really
> based on rsync, but now it seems to have grown up a bit :) There is info on
> https://wiki.apache.org/solr/SolrReplication.

Very cool.  Instead of in-memory indexing, we could use Solr, which would
indeed solve the replication issue when we have multiple API servers.  The
downside is that we would need to have Solr set up as a separate service,
which I wanted to avoid, so I opted for in-memory Lucene indexing.  I hope
to design this so that a Solr deployment option is also possible, though.
SolrCloud sounds very interesting and it uses Zookeeper too:
http://wiki.apache.org/solr/SolrCloud

On 13 March 2013 14:41, Pino de Candia <gdecandia at midokura.com> wrote:

> Something was bothering me about this, but I had to sleep on it to figure
> it out: before we commit to the scan-ZK/index-in-memory approach, I'd like
> to compare to having the relations in a data-store because:
> - I'd like to keep them outside of ZK (minor point though)

In my proposal, most of the relationship data will remain in ZK, and only
the directories used for indexing are removed. For this project, I only see
Lucene as a way to complement our data store (ZK) with search
capabilities.  Whether we should keep our data in ZK or not probably
requires more (separate) discussion.

> - I'd like to be able to examine them without the API server, and to be
> able to run more than one API server.

Good point.  As for examining the data without the API server, because
I'm hoping to keep most of the relationship data in ZK, we won't lose
visibility into the data even without the API server.  There are tools to
inspect the indexed data as well.  With Lucene, you can also store the
indices in files or a DB, so perhaps those options should be enabled for
debugging purposes.  Do you think that's good enough?  I think whether we
are ok with not storing all the data in a central place is a very
important point.

> Can Cassandra serve this purpose? Doesn't it have most of the features
> you're looking for? The only comparison points I can come up with are:

One concern I had with Cassandra is with its search capabilities,
especially for tags.  Adam mentioned to me that he knows of people who have
expertise in Cassandra for searching (Datastax), so I will talk to them.
It looks like they are also using Lucene/Solr for search based on Cassandra
data (
This tells me that Cassandra on its own is not exactly good for
implementing rich search features.

> - inconsistency window with Cassandra vs. no server redundancy/scalability
> with Lucene.

Cassandra vs Lucene was definitely the biggest question while designing
this.  To me, it made more sense to keep Zookeeper as the master data store
and have Lucene just derive its indices from it, as opposed to designing a
way to sync data between Cassandra and Zookeeper.  My thinking was that it
would be good to have a way to easily get back to the correct state if
something goes wrong.  By re-indexing everything, Lucene should be able to
achieve this more easily.
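
To make the "re-index everything" recovery idea concrete, here is a minimal
sketch.  It uses plain Java collections in place of Lucene and a Map in place
of a ZK snapshot, so the class name, method names and resource IDs are all
assumptions for illustration, not the proposed implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a derived search index (Lucene in the proposal).
final class DerivedTagIndex {
    // tag -> resource IDs; purely derived, never the source of truth
    private final Map<String, Set<String>> idsByTag = new HashMap<>();

    // Discard whatever the index currently holds and rebuild it from a
    // snapshot of the master data (resource ID -> tags), which would come
    // from ZK in the proposal.
    void rebuildFrom(Map<String, Set<String>> masterTagsById) {
        idsByTag.clear();
        for (Map.Entry<String, Set<String>> entry : masterTagsById.entrySet()) {
            for (String tag : entry.getValue()) {
                idsByTag.computeIfAbsent(tag, t -> new TreeSet<>())
                        .add(entry.getKey());
            }
        }
    }

    // Return a copy of the IDs carrying the tag (empty if none).
    Set<String> search(String tag) {
        return new TreeSet<>(
                idsByTag.getOrDefault(tag, Collections.<String>emptySet()));
    }
}
```

Because the index holds nothing that cannot be recomputed, a stale or
corrupted index is repaired by calling rebuildFrom again with a fresh
snapshot; no two-way sync logic is needed.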

As for redundancy with Lucene, I think it needs to be implemented, similar
to how we replicate data in midolman right now using watchers (but clearly
you cannot watch all of the data, so we need to think about something
clever).  And for scalability, my guess is that Lucene should be able to
handle our data size with little problem, but I need to look more into that.

> - adding relationships after the fact without triggering ZK watchers.

So on this point, you are suggesting that we use ZK as the master data
store and use Cassandra just to keep the 'relationship' data, right?  No ZK
watcher is nice.  Replicated Solr would be a good candidate for this too.

> - ease of implementation - not sure which one wins.

One advantage we have with Lucene is that we know developers here who have
expertise in the technology.

> - API server startup speed - I think the Cassandra approach wins
> - query speed - the Lucene approach definitely wins.
> Separately, and it goes for whatever approach we take - assuming we have
> to migrate Netflix to this new model... how will that work?

Because the ZK changes only require removing existing directories and
adding new fields to existing directories, we may be able to do the upgrade
inside the code.  If that isn't possible, we'll have to provide them with a
migration script.  As for Lucene, since it's just a library embedded in
midonet-api, re-installing midonet-api should be enough, and starting the
API should create indices on the host.

Since you brought up a lot of good points, I would like to amend the current
proposal so that it allows you to easily swap in Solr, Lucene or even
Cassandra as the search service provider.  In the configuration, we would
specify the SearchProvider, and it could be configured to be Lucene or Solr
or Cassandra.  What this simply means is that we'll define interfaces for
resource creation and searching that these provider classes must implement
to store and fetch data.  The provider class is responsible for syncing
data between its data store and Zookeeper.  At least this way, we can
always switch if needed, and each one would provide different advantages.
What do you think?
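
As a rough sketch of what such a pluggable provider could look like: the
interface below, its method names, the "memory" config value and the factory
class are all hypothetical, and the in-memory implementation uses plain Java
collections only so that the example runs without Lucene, Solr or Cassandra.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical contract that each search backend would implement.
interface SearchProvider {
    void index(String resourceId, Set<String> tags); // create or replace
    void remove(String resourceId);
    Set<String> findByTag(String tag);
}

// Toy provider backed by maps; a real one would wrap Lucene, Solr, etc.
final class InMemorySearchProvider implements SearchProvider {
    private final Map<String, Set<String>> idsByTag = new HashMap<>();
    private final Map<String, Set<String>> tagsById = new HashMap<>();

    @Override
    public void index(String resourceId, Set<String> tags) {
        remove(resourceId); // drop old tags so re-indexing replaces them
        tagsById.put(resourceId, new HashSet<>(tags));
        for (String tag : tags) {
            idsByTag.computeIfAbsent(tag, t -> new TreeSet<>()).add(resourceId);
        }
    }

    @Override
    public void remove(String resourceId) {
        Set<String> oldTags = tagsById.remove(resourceId);
        if (oldTags == null) {
            return; // nothing indexed for this resource
        }
        for (String tag : oldTags) {
            Set<String> ids = idsByTag.get(tag);
            if (ids != null) {
                ids.remove(resourceId);
                if (ids.isEmpty()) {
                    idsByTag.remove(tag);
                }
            }
        }
    }

    @Override
    public Set<String> findByTag(String tag) {
        return new TreeSet<>(
                idsByTag.getOrDefault(tag, Collections.<String>emptySet()));
    }
}

// Hypothetical factory keyed by a configuration value.
final class SearchProviders {
    static SearchProvider fromConfig(String name) {
        switch (name) {
            case "memory":
                return new InMemorySearchProvider();
            case "lucene":
            case "solr":
            case "cassandra":
                // Placeholders: these backends are not part of this sketch.
                throw new UnsupportedOperationException(name + " not implemented");
            default:
                throw new IllegalArgumentException("Unknown search provider: " + name);
        }
    }
}
```

The API code would only ever see SearchProvider, so switching backends
becomes a configuration change.  Syncing each provider with Zookeeper is
left out of this sketch.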
