Monday, November 16, 2009

Consistent hashing for Mnesia fragments

If you work with Mnesia and use the mnesia_frag module for partitioning tables, you may know that Mnesia's default hashing module handles the situation in which you need to add one more fragment to a table very well. The rehashing scheme has two important properties:
  1. No data will need to be moved to any fragment that is not the new one.
  2. Only the keys in one of the old fragments have to be checked (and possibly moved to the new fragment).

But there are also two characteristics that you might not want for your application:
  1. Data is not very well distributed among the table fragments.
  2. If you are using disc_only_copies and one of your fragments is reaching the 2 GB limit, you may need to add a lot of fragments to make it shrink (this is, in a way, a consequence of property 2 above).
As Mnesia allows you to change the hashing scheme without having to patch OTP's code, I believe there's an easy way to overcome the two disadvantages above and keep property 1 (though losing part of property 2): implementing a form of consistent hashing to distribute data among table fragments.

I have implemented this in a module called mnesia_frag_chash, along with a modified version of Erlang's gb_sets module. The key piece of functionality is the geq_iterator that Richard O'Keefe provided in this post; I named the modified module ok_gb_sets.

I create 100 entries for each fragment, calculate a hash value for each entry, and store each one in the circular hash table used for consistent hashing (which is actually a tree: an ok_gb_sets). To find the fragment for a specific key, all I have to do is calculate a hash value H for the key, find the first entry whose hash value is greater than or equal to H, and then pick that entry's fragment (the actual implementation is just a little more complicated, as collisions have to be avoided).
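The actual modules are Erlang; here is a minimal, language-neutral sketch of the idea in Python (names like CHashState and hash_key are hypothetical, and a sorted list with bisect stands in for the ok_gb_sets tree):

```python
import hashlib
from bisect import bisect_left, insort

def hash_key(term):
    # Illustrative 64-bit hash; the Erlang module would use something
    # like erlang:phash2 instead.
    return int.from_bytes(hashlib.md5(repr(term).encode()).digest()[:8], "big")

class CHashState:
    """Sketch of the hash state: a sorted 'ring' of (hash, fragment) entries."""

    def __init__(self, n_frag_pieces=100):
        self.n_frag_pieces = n_frag_pieces  # entries created per fragment
        self.chash_table = []               # sorted list standing in for ok_gb_sets

    def add_fragment(self, frag):
        # Create n_frag_pieces entries for the fragment and insert them
        # into the ring, keeping it sorted by hash value.
        for i in range(self.n_frag_pieces):
            insort(self.chash_table, (hash_key((frag, i)), frag))

    def key_to_frag(self, key):
        # First entry with hash >= H; wrap around to the start if none.
        h = hash_key(key)
        i = bisect_left(self.chash_table, (h, -1))
        if i == len(self.chash_table):
            i = 0
        return self.chash_table[i][1]
```

Note that adding a fragment only inserts entries into the ring, so an existing key either keeps its old fragment or maps to the new one, which is how property 1 is preserved.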

You could use a smaller number of entries per fragment: that would make the tree smaller, and you would need to check fewer fragments when adding a new one. On the other hand, creating more entries per fragment provides an even better distribution of data among fragments.
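To get a feel for this trade-off, here is a small, self-contained Python experiment (hypothetical code, not part of mnesia_frag_chash) that measures how uneven the key distribution gets for a given number of entries per fragment:

```python
import hashlib
from bisect import bisect_left
from collections import Counter

def h(term):
    # Illustrative 64-bit hash standing in for an Erlang hash function.
    return int.from_bytes(hashlib.md5(repr(term).encode()).digest()[:8], "big")

def spread(n_frags, pieces_per_frag, n_keys=10000):
    """Worst fragment's share of keys, relative to a perfectly even split."""
    ring = sorted((h((f, i)), f)
                  for f in range(n_frags) for i in range(pieces_per_frag))
    hashes = [entry[0] for entry in ring]
    counts = Counter()
    for k in range(n_keys):
        i = bisect_left(hashes, h(k))
        counts[ring[i % len(ring)][1]] += 1
    return max(counts.values()) / (n_keys / n_frags)

# With a single entry per fragment the worst fragment usually receives far
# more than its fair share of keys; with 100 entries per fragment the
# ratio tends to be much closer to the ideal 1.0.
```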

The hash state holds two fields: chash_table, which is the ok_gb_sets, and n_frag_pieces, the number of entries created in the set for each fragment.

My main concern is the size of chash_table. I need to understand Mnesia's code better to be sure that using this hash_state will not cause performance problems.

Another problem introduced is that, when adding a new fragment, the keys in several fragments (100 in the worst case, though that can be tuned, as I said above) have to be scanned for Mnesia to find out which ones need to be moved to the new fragment. Although the total amount of data to be moved will still be small, the number of fragments locked and the number of keys checked will be considerably larger, so beware.
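To make the scanning cost concrete, here is a hedged Python sketch (hypothetical names; a sorted list again stands in for the ok_gb_sets) of how one could determine which existing fragments must be scanned when a new fragment joins:

```python
import hashlib
from bisect import bisect_left

def h(term):
    # Illustrative 64-bit hash standing in for an Erlang hash function.
    return int.from_bytes(hashlib.md5(repr(term).encode()).digest()[:8], "big")

def frags_to_scan(ring, new_frag, n_frag_pieces=100):
    """Fragments whose keys must be checked when new_frag is added.

    Each new entry takes over part of the arc owned by its successor on
    the ring, so only the successors' fragments can lose keys.
    """
    owners = set()
    for i in range(n_frag_pieces):
        pos = bisect_left(ring, (h((new_frag, i)), -1))
        owners.add(ring[pos % len(ring)][1])  # current owner of that arc
    return owners

# A ring with 4 existing fragments and 100 entries each:
ring = sorted((h((f, i)), f) for f in range(4) for i in range(100))
# With few existing fragments and 100 new entries, the scan set usually
# covers every existing fragment -- the locking cost discussed above.
```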

If my first concern turns out to be unimportant and the second one is not a big problem for you, you should consider using mnesia_frag_chash, as it solves the two problems discussed at the beginning of this post (I have run some simple experiments, and it seems to do a good job of distributing the data).

Full source code (just two files) is here.


  1. The second disadvantage also leads to a significant increase in network traffic, which can also be a huge problem for your system.

  2. You mean scanning the fragments to find the data that needs to be moved? If this is done on the nodes that actually contain the fragments, I don't believe there will be any significant impact on network traffic, as the amount of data actually moved is not big (compared to Mnesia's default scheme).

  3. What's the largest pool of mnesia nodes that you've deployed or read about? I'm finding it difficult to find good numbers. For instance... a setup where you have a small number of disc copied nodes servicing a far larger pool of connected nodes through the mnesia api as remote tables? Tens, hundreds, thousands?

  4. Eduardo, as I said in the post, "the total amount of data to be moved will still be small", so the increase in network traffic will not be that big.

  5. bile, I've never tested with anything near 100 nodes and I've never heard of anyone running Mnesia with hundreds of nodes.
    You may ask about that in the Erlang/OTP discussion list (but I don't think you'll get the straight answer you want).
    Have you thought of running some tests on Amazon's EC2?