HBase Thrift: how to connect to a remote HBase master / cluster?

Thanks to the Cloudera distribution, I have an HBase master/datanode plus a Thrift server running on my local machine, and I can code and test HBase client programs against it without problems.

However, now I need to use Thrift in production, and I cannot find documentation on how to get Thrift to work with the HBase production cluster.

From what I understand, I will need to run the hbase-thrift program on the client node, since the Thrift server is just an intermediary between the client and HBase.

So I assume I need to somehow point the HBase Thrift server at the master node's hostname/IP? How do I do that?

Also, any suggestions on how to scale this in production? I only need a setup like:

Client <-> Thrift client <-> HBase Master <-> Multiple HBase workers 
1 answer

Connecting

You do not need to run the Thrift server on your local machine; it can run anywhere, though the regionservers are usually a good place*. In your code you then connect to that server.

Python example:

 from thrift.transport import TSocket

 transport = TSocket.TSocket("random-regionserver", 9090)

Obviously, replace random-regionserver with one of the servers actually running a Thrift server.

The Thrift server gets its configuration from the usual places. If you use CDH, the configuration is in /etc/hbase/conf/hbase-site.xml, and you will need to add the hbase.zookeeper.quorum property:

 <property>
   <name>hbase.zookeeper.quorum</name>
   <value>list of your zookeeper servers</value>
 </property>

If you start the Thrift server from a downloaded Apache distribution, hbase-site.xml will likely live in a different directory.

Scaling

One easy way to scale right now is to keep a list of all the regionservers in your Thrift client and pick one at random for each connection. Alternatively, open several connections and use a random one for each request. Some language bindings (e.g. PHP) have a TSocketPool that can cycle through all your servers; otherwise you will need to do this manually.
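The random-selection idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real connection pool: the class name and the rs*.example.com hostnames are placeholders, and the chosen host/port pair would simply be handed to TSocket.TSocket as in the example above.

```python
import random


class ThriftServerPool:
    """Sketch: pick a random Thrift server for each new connection.

    The server list is assumed to be the regionservers on which you
    started a Thrift server (hostnames here are placeholders).
    """

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self):
        # Random choice spreads new connections across all servers.
        return random.choice(self.servers)


pool = ThriftServerPool([
    ("rs1.example.com", 9090),
    ("rs2.example.com", 9090),
    ("rs3.example.com", 9090),
])

host, port = pool.pick()
# host and port would then be passed to TSocket.TSocket(host, port)
```

For anything beyond a toy setup you would also want to drop servers that fail to connect and retry on another one, which is roughly what PHP's TSocketPool does for you.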

With this approach, reads and writes should be more or less evenly distributed across the Thrift servers in your cluster. Each read or write that arrives at a Thrift server is still translated into a Java API call, and the Thrift server then opens a network connection to the appropriate regionserver to perform the requested action.

This means you will not get the same performance as with the Java API directly. Caching region locations yourself and hitting the matching Thrift server can help, but even then the extra Java API call is made, even when it lands on the local server. HBASE-4460 would help in this scenario, but it is not included in CDH3u4 or CDH4.

* HBASE-4460 is the issue that embeds a Thrift server directly in each regionserver.


Source: https://habr.com/ru/post/1416434/
