I have a requirement to store events generated by users, each identified by a userId. Each user belongs to a company, which is identified by a companyId. I came up with the following design for an HBase table:
rowkey : <companyId> <userId> <timestamp>
column-family : information (encapsulating a set of event attributes as shown below)
columns : <attr1>, <attr2> .... <attrn>
I know that this key design makes it easy to query data later by companyId and/or userId using a partial key scan. That said, I have some questions and concerns, and I would like to get some ideas.
1. If we have a read use case that fetches all data within a given time range, then with the current design we cannot use the rowkey for it. Instead, we would need to do a full table scan and filter rows on a timestamp field (stored separately as one of the attr columns). Am I completely off base here?
2. How should duplicates be handled? I know that HBase will create a new version of the cell in this case, but will that still allow reading the data later for the read use case mentioned in 1? I know that the number of versions can be controlled in the request, but is that good design, or is it a misuse of the versioning feature?
3. This one is about region server hotspotting. We do not have monotonically increasing keys, but we could still run into this problem if, say, one particular company or user is very active. Would hashing and bucketing based on the number of servers not work in this case? Perhaps we could hash the timestamp field and put the hash in the rowkey instead of the original value? But then scanning on the timestamp component would no longer be possible, and we would need a separate column (attr) in the column family to work around that. Any suggestions?
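To illustrate the kind of hashing I have in mind for question 3, here is a rough sketch of one variant: instead of replacing the timestamp with its hash, a one-byte bucket (salt) derived from a hash of the full key is prepended, and the original timestamp stays in the key. The bucket count `NUM_BUCKETS`, the use of MD5, the fixed-width integer encoding and the function names are all assumptions of mine, and the example is plain Python with no HBase calls.

```python
import hashlib
import struct

NUM_BUCKETS = 8  # assumed value, e.g. roughly the number of region servers


def salt_for(company_id: int, user_id: int, timestamp_ms: int) -> int:
    # Deterministic bucket derived from a hash of the unsalted key.
    raw = struct.pack(">QQQ", company_id, user_id, timestamp_ms)
    return hashlib.md5(raw).digest()[0] % NUM_BUCKETS


def salted_row_key(company_id: int, user_id: int, timestamp_ms: int) -> bytes:
    # <salt><companyId><userId><timestamp>, all fixed-width big-endian.
    salt = salt_for(company_id, user_id, timestamp_ms)
    return struct.pack(">BQQQ", salt, company_id, user_id, timestamp_ms)


def time_range_scans(company_id: int, user_id: int,
                     start_ms: int, end_ms: int) -> list[tuple[bytes, bytes]]:
    # One (start_row, stop_row) pair per bucket covering [start_ms, end_ms).
    return [
        (struct.pack(">BQQQ", bucket, company_id, user_id, start_ms),
         struct.pack(">BQQQ", bucket, company_id, user_id, end_ms))
        for bucket in range(NUM_BUCKETS)
    ]
```

If I understand the trade-off correctly, writes for a very active user would be spread over `NUM_BUCKETS` key ranges, while a time-range read for a known company and user would need one scan per bucket with the results merged on the client. This still would not help with question 1 when the time range has to be read across all companies and users.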
Thank you very much for any input (comments, links, books, ideas) you can provide.