The article referenced below is about achieving high write throughput in HBase. It says you should distribute your data evenly across regions by hashing the row key (or something similar). This is a common theme in these articles.
I'm a little confused about how scans work after doing these kinds of things to the data, though.
If you "pad the key with a random string" for example, wouldn't the records be in quite random order all over the different region servers? How can you scan that efficiently? I understand that HBase is basically indexed on the row key (I guess kind of like a clustered index in a database), but I don't understand how that is usable once you've somewhat randomized the data.
2. HBase row key design
The best way to distribute keys evenly across the regions is to hash them. But if you need to recover the key or use key scanning, you can consider hashing part of the prefix or padding the key with a random string.
Here’s an example in Python:
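    import struct, hashlib, binascii

    # Pack the three integer key parts as big-endian unsigned 32-bit values.
    rawkey = struct.pack("!III", key_part1, key_part2, key_part3)
    readable_key = binascii.hexlify(rawkey)  # or "%.8x%.8x%.8x" % (key_part1, key_part2, key_part3)
    # The MD5 hex digest of the readable key serves as the row key prefix.
    key_prefix = hashlib.md5(readable_key).hexdigest()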
After you have the key prefix, you can append whatever suffixes you would like to use for scanning purposes. For time series data, it’s not a bad idea to use a timestamp as the suffix if you want to do a time range query for certain metrics identified by your row key prefix. HBase provides various filters for the rows that will make your query easy.
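To make the prefix-plus-suffix idea concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes a row key layout the article does not spell out: the hashed prefix followed by a fixed-width big-endian timestamp. The helper names make_prefix, make_row_key and scan_bounds are hypothetical. Because every row for one metric shares the same prefix, those rows stay contiguous and sorted by time, so a time range query becomes an ordinary start-row/stop-row scan.

```python
import binascii
import hashlib
import struct


def make_prefix(key_part1, key_part2, key_part3):
    """Hashed prefix, built the same way as in the quoted snippet."""
    rawkey = struct.pack("!III", key_part1, key_part2, key_part3)
    readable_key = binascii.hexlify(rawkey)
    return hashlib.md5(readable_key).hexdigest().encode("ascii")


def make_row_key(prefix, timestamp):
    """Assumed layout: prefix + 8-byte big-endian timestamp.

    Big-endian unsigned integers sort byte-wise in numeric order,
    so rows under one prefix are ordered by time.
    """
    return prefix + struct.pack("!Q", timestamp)


def scan_bounds(prefix, start_ts, end_ts):
    """Start row (inclusive) and stop row (exclusive) for a time range."""
    return make_row_key(prefix, start_ts), make_row_key(prefix, end_ts)


# Simulate the sorted key space with a plain sorted list.
prefix = make_prefix(1, 2, 3)
keys = sorted(make_row_key(prefix, ts) for ts in (300, 100, 200))
start, stop = scan_bounds(prefix, 100, 300)
in_range = [k for k in keys if start <= k < stop]
```

Here in_range holds the rows for timestamps 100 and 200. The trade-off is that only scans within a single prefix stay sequential; a scan across many metrics has to issue one range per prefix.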
Best Practices for Managing HBase in a High Write Environment | AppFirst Blog