HBase table row key design for duplicates and region server hotspotting

I have a requirement to store events generated by a user, who is identified by userId. Each user belongs to a company, which is identified by companyId. I came up with the following design for a table in HBase:

rowkey : <companyId> <userId> <timestamp>

column-family : information (encapsulating a set of event attributes as shown below)

columns : <attr1>, <attr2> .... <attrn>

I know that this key design will make it easier to query data later by companyId and/or userId with a partial key scan. Having said that, I have some questions and concerns, and I would like to get some ideas.
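To make the partial key scan concrete, here is a minimal sketch in plain Python that models the table as a sorted list of byte keys. It is not the HBase client API. The "|" delimiter, the 8-byte big-endian timestamp encoding, and the function names are assumptions made for illustration, and the IDs are assumed not to contain "|".

```python
import bisect
import struct


def make_row_key(company_id: str, user_id: str, ts_millis: int) -> bytes:
    """Build <companyId>|<userId>|<timestamp> as bytes.

    The timestamp is packed as an 8-byte big-endian integer so that
    byte-wise order matches chronological order for non-negative values.
    """
    return f"{company_id}|{user_id}|".encode() + struct.pack(">Q", ts_millis)


def stop_key_for(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`.

    Returns b"" when no such key exists (prefix empty or all 0xFF bytes).
    """
    p = bytearray(prefix)
    while p and p[-1] == 0xFF:
        p.pop()
    if not p:
        return b""
    p[-1] += 1
    return bytes(p)


def prefix_scan(sorted_keys, prefix: bytes):
    """Return all keys starting with `prefix` from a sorted key list.

    Two binary searches find the start and stop positions, which is the
    idea behind scanning with a start row and a stop row.
    """
    start = bisect.bisect_left(sorted_keys, prefix)
    stop_key = stop_key_for(prefix)
    stop = bisect.bisect_left(sorted_keys, stop_key) if stop_key else len(sorted_keys)
    return sorted_keys[start:stop]


keys = sorted([
    make_row_key("acme", "u1", 1000),
    make_row_key("acme", "u1", 2000),
    make_row_key("acme", "u2", 1500),
    make_row_key("globex", "u9", 1200),
])
company_rows = prefix_scan(keys, b"acme|")      # all events for one company
user_rows = prefix_scan(keys, b"acme|u1|")      # all events for one user
```

Because the timestamp is the last key component, a time range that spans users is not a contiguous key range in this layout, which is the concern behind question 1 below.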

1- If we have a read use case that reads all the data within a given time range, then with the current design we cannot use the row key. Instead, we will need to do a full scan and filter rows on the timestamp field (stored separately as one of the attr columns). Am I completely off here?

2- How should duplicates be handled? I know that HBase will create a new version of the cell in this case, but will that still allow reading later according to the read use case mentioned in 1? I know that versions can be specified at query time, but is that good design, or is it misusing the versioning feature?

3- This pertains to region server hotspotting. We do not have monotonically increasing keys, but we can still run into this problem if, say, one particular company or user is very active. Would hashing and salting based on the number of region servers not work in this case? Perhaps we could hash the timestamp field and put that in the row key instead of the original value? But then scanning on the timestamp component would not be possible, and we would need a separate column (attr) holding the timestamp to work around that. Any suggestions?

Thank you very much for any input (comment, link, book, idea) that can be provided.

1 answer

1: Read use case

It depends on your use case:

  • If you want all users' data for an org within a given time range, then your reading is correct: with the current key you will need to scan all of that org's data and filter on the timestamp.

  • If you want to read all the data for a given user, the current key design will be fine. That said, I would swap the positions of the org and user IDs to form a new key (rowkey : userId-companyId-timestamp). This works because different users' data do not overlap, so the rows no longer need to be grouped under the company.

  • If you move the timestamp to the front (rowkey : timestamp-companyId-userId), you can run a scan over all orgs and all users that is bounded by the time range, avoiding a full table scan.
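The timestamp-first option can be sketched the same way: when the timestamp leads the key, a time range becomes one contiguous slice of the sorted key space. This is a plain-Python model rather than HBase client code; the key encoding and the function names are assumptions for illustration.

```python
import bisect
import struct


def ts_first_key(ts_millis: int, company_id: str, user_id: str) -> bytes:
    """Build <timestamp>|<companyId>|<userId> with an 8-byte big-endian timestamp."""
    return struct.pack(">Q", ts_millis) + f"|{company_id}|{user_id}".encode()


def time_range_scan(sorted_keys, start_ms: int, stop_ms: int):
    """Return keys with start_ms <= timestamp < stop_ms.

    The packed start and stop timestamps act as start row and stop row.
    """
    lo = bisect.bisect_left(sorted_keys, struct.pack(">Q", start_ms))
    hi = bisect.bisect_left(sorted_keys, struct.pack(">Q", stop_ms))
    return sorted_keys[lo:hi]


def timestamp_of(key: bytes) -> int:
    """Decode the leading 8-byte timestamp from a key."""
    return struct.unpack(">Q", key[:8])[0]


keys = sorted([
    ts_first_key(1000, "acme", "u1"),
    ts_first_key(1500, "globex", "u9"),
    ts_first_key(2000, "acme", "u2"),
    ts_first_key(3000, "acme", "u1"),
])
hits = time_range_scan(keys, 1500, 3000)
```

The trade-off ties back to question 3: keys that start with a timestamp increase monotonically, so new writes tend to land on the same region.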

2: Duplicates

BEWARE: By default, HBase keeps only a limited number of versions of a cell: 3 in older releases, 1 since HBase 0.96. (Also, do not confuse these version timestamps with the timestamp in your row key.) You can raise this limit and read results from different versions, but a large number of versions is not recommended.

If you intend to overwrite previously saved values, I would recommend not relying on retrieving previous versions (although there are ways to do it). If you must be able to save and retrieve all previously written data, you can instead store each new value in a new column.
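The version limit can be illustrated with a toy in-memory model. This is not the HBase client API: `ToyCellStore`, its methods, and the column name `information:attr1` are hypothetical, and the model only mimics the rule that a cell keeps its newest `max_versions` values.

```python
from collections import defaultdict


class ToyCellStore:
    """Toy model of versioned cells (assumption: newest N versions are kept)."""

    def __init__(self, max_versions: int = 1):
        self.max_versions = max_versions
        # (row, column) -> {version_ts: value}
        self._cells = defaultdict(dict)

    def put(self, row: bytes, column: str, value: str, version_ts: int) -> None:
        versions = self._cells[(row, column)]
        versions[version_ts] = value  # same version timestamp overwrites
        # Drop everything except the newest `max_versions` entries.
        for old_ts in sorted(versions)[:-self.max_versions]:
            del versions[old_ts]

    def get(self, row: bytes, column: str, versions: int = 1):
        """Return up to `versions` (version_ts, value) pairs, newest first."""
        cell = self._cells.get((row, column), {})
        return sorted(cell.items(), reverse=True)[:versions]


store = ToyCellStore(max_versions=2)
row = b"acme|u1|event-at-1000"
store.put(row, "information:attr1", "first", version_ts=1)
store.put(row, "information:attr1", "second", version_ts=2)
store.put(row, "information:attr1", "third", version_ts=3)

latest = store.get(row, "information:attr1")
history = store.get(row, "information:attr1", versions=5)
```

In this model the first write is gone after the third one arrives, which is why relying on versions to preserve every duplicate event is fragile unless the limit is sized for the worst case.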

3: Hotspotting

  • If a company is very active, you can prefix your row key with a hash of companyId-userId. This will spread any one organization's records across regions.

  • If a user is very active, and there is a use case for retrieving all of that user's data in sorted order, then I'm not sure that hashing by key or by timestamp is a good solution. You will definitely want to keep a user's data together, and I'm not sure what the best solution would be here.

Based on how I understand your problem, I would probably build the row key as HASH(companyId-userId)-companyId-userId-timestamp
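A minimal sketch of that key layout follows. The choice of MD5, the 4-character hex prefix, the "|" delimiter, and the function names are assumptions for illustration; the answer does not specify a hash function or prefix length.

```python
import hashlib
import struct

SALT_LEN = 4  # assumed prefix length, in hex characters


def user_prefix(company_id: str, user_id: str) -> bytes:
    """Build HASH(companyId-userId)|companyId|userId| as bytes.

    The hash depends only on the company and user, so every event of one
    user shares the same prefix and stays contiguous in key order.
    """
    entity = f"{company_id}-{user_id}"
    salt = hashlib.md5(entity.encode()).hexdigest()[:SALT_LEN]
    return f"{salt}|{company_id}|{user_id}|".encode()


def salted_row_key(company_id: str, user_id: str, ts_millis: int) -> bytes:
    """Append an 8-byte big-endian timestamp so one user's rows sort by time."""
    return user_prefix(company_id, user_id) + struct.pack(">Q", ts_millis)


k1 = salted_row_key("acme", "u1", 1000)
k2 = salted_row_key("acme", "u1", 2000)
other = salted_row_key("acme", "u2", 1000)
```

With this layout a single user's events can still be read with one prefix scan on `user_prefix(...)`, in timestamp order. The cost is that users of the same company are scattered across the key space, so a company-wide read becomes one scan per user (or a full scan) rather than a single contiguous range.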
