Understanding the hive R package, Elastic MapReduce, RHIPE, and Distributed Text Mining with R

Having learned about MapReduce while working on a computer vision problem during my recent internship at Google, I felt enlightened. I had already used R for text mining, and I wanted to use R for large-scale text processing and for topic modeling experiments. I started reading tutorials and worked through some of them. Here is my understanding of each of the tools:

1) The R text mining package (tm): for local (client-side) text processing; uses the XML package (see the short sketch after this list)

2) hive (Hadoop InteractiVE): provides a framework for calling map/reduce and also provides a DFS interface for storing files in the DFS

3) RHIPE: the R and Hadoop Integrated Programming Environment

4) Elastic MapReduce with R: a MapReduce framework for those who do not own their own clusters

5) Distributed Text Mining with R: an attempt to make a seamless transition from local, client-side processing to server-side processing, from R-tm to R-distributed-tm
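For context, here is a minimal client-side sketch of item 1, assuming a recent tm release (content_transformer appeared in later versions); the directory path and the preprocessing choices are my own illustration, not from the question:

    library(tm)
    # Build a corpus from plain-text files in a local directory (hypothetical path).
    corpus <- Corpus(DirSource("texts/"))
    # Typical preprocessing before any modeling.
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    # Document-term matrix: the usual input for topic modeling.
    dtm <- DocumentTermMatrix(corpus)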

I have the following questions and misunderstandings regarding the above packages:

1) hive, RHIPE, and the distributed text mining utilities all need you to have your own cluster, right?

2) If I have just one computer, how would DFS work in the case of hive?

3) Are the above packages facing the problem of duplication of effort?

I hope to get insights into the above questions in the next few days.

+4
2 answers

(1) Well, hive and RHIPE do not need clusters; you can run them on a single-node cluster. RHIPE is basically a framework (an R package) that integrates R and Hadoop, so that you can leverage the power of R on top of Hadoop. To use RHIPE you do not need a cluster: you can run it either way, in cluster mode or in pseudo-distributed mode. Even if you have a Hadoop cluster of more than two nodes, you can still use RHIPE in local mode by specifying the property mapred.job.tracker = 'local', as sketched below.
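Here is a minimal word-count sketch of that local-mode trick. It assumes the rhwatch()/rhfmt() API of newer Rhipe releases (older ones use rhmr()/rhex() instead), and the HDFS paths are hypothetical:

    library(Rhipe)
    rhinit()  # initialize the R-Hadoop bridge

    # Map: emit (word, 1) for every whitespace-separated token.
    map <- expression({
      lapply(map.values, function(line) {
        for (w in unlist(strsplit(line, "[[:space:]]+"))) rhcollect(w, 1L)
      })
    })

    # Reduce: sum the counts collected for each word.
    reduce <- expression(
      pre    = { total <- 0L },
      reduce = { total <- total + sum(unlist(reduce.values)) },
      post   = { rhcollect(reduce.key, total) }
    )

    # mapred.job.tracker = 'local' forces local mode even on a real cluster.
    job <- rhwatch(map = map, reduce = reduce,
                   input  = rhfmt("/tmp/docs", type = "text"),
                   output = "/tmp/wordcount",
                   mapred = list(mapred.job.tracker = "local"))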

You can find my website by searching for "Bangalore R User Groups" and see how I tried to solve problems with RHIPE; I hope that will give you a fair idea.

(2) Well, by Hive do you mean the hive package in R? That package's name is somewhat misleading, since it is easily confused with Hive (the Hadoop data warehouse).

The hive package in R is similar to RHIPE, with only some additional functionality (I haven't gone through it completely). When I first saw the hive package I thought it integrated R with the Hive data warehouse, but after looking at the functionality it was not like that (see the DFS sketch below).
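Touching on your question (2), here is a sketch of the hive package's DFS helpers on a single machine running Hadoop in pseudo-distributed mode. It assumes the hive_start()/DFS_* functions from the CRAN hive package, and the file names are made up:

    library(hive)
    hive_start()          # start the locally configured Hadoop framework
    hive_is_available()   # should return TRUE once the daemons are up

    # Copy a local file into the (single-node) DFS and inspect it.
    DFS_put("report.txt", "/tmp/report.txt")
    DFS_list("/tmp")
    DFS_cat("/tmp/report.txt")

    hive_stop()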

Well, the Hadoop data warehouse, HIVE, is mainly useful if you are interested in some subset of results computed over a subset of the data, which you access through SQL-like queries. Queries in HIVE are very similar to SQL queries. To give you a very simple example: let's say you have 1 TB of stock data for different stocks over the past 10 years. The first thing you would do is store it in HDFS, then create a HIVE table on top of it. That's it... Now fire any query you want. You may also need to perform a complex calculation, for example finding a simple moving average (SMA); in that case you can write your own UDF (user-defined function). Besides that, you can also use a UDTF (user-defined table-generating function).
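As a HiveQL sketch of that workflow (the table name, columns, and HDFS location are hypothetical; the SMA itself is the part that would need a custom UDF):

    -- Expose stock data already sitting in HDFS as a Hive table.
    CREATE EXTERNAL TABLE stocks (
      symbol      STRING,
      trade_date  STRING,
      close_price DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/data/stocks';

    -- A plain SQL-style query over the subset of interest.
    SELECT symbol, AVG(close_price)
    FROM stocks
    WHERE symbol = 'AAPL'
    GROUP BY symbol;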

(3) If you have one system, it means you are running Hadoop in pseudo-distributed mode. Beyond that, you do not need to worry about whether Hadoop runs in pseudo-distributed or cluster mode, since Hive needs to be installed only on the NameNode and not on the data nodes. Once the configuration is done correctly, Hive takes care of submitting the job to the cluster. Unlike Hive, R and RHIPE have to be installed on all the data nodes as well as the NameNode. Even then, at any given time you can choose to run a job on the NameNode alone, as I mentioned above.
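For reference, the standard Hadoop 1.x pseudo-distributed settings look like this (localhost and these ports are the conventional defaults from the Hadoop setup guide, not anything specific to this answer):

    <!-- conf/core-site.xml -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>

    <!-- conf/hdfs-site.xml: one machine, so keep a single replica -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

    <!-- conf/mapred-site.xml -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>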

(4) One more thing: RHIPE is only for batch jobs, meaning an MR job runs over the entire dataset, whereas a Hive query can be restricted to a subset of the data.

(5) I would like to understand what exactly you are doing in text mining. Are you trying to do some kind of NLP, for example named-entity recognition using HMMs (hidden Markov models), CRFs (conditional random fields), feature vectors, or SVMs (support vector machines)? Or are you just trying to do document clustering, indexing, and so on? Well, there are packages like tm, openNLP, HMM, SVM, etc.

+2

I am not familiar with the distributed text mining application for R, but Hive can run on a local cluster or on a single-node cluster. This can be done for experimentation or practice, but it defeats the purpose of having a distributed file system for serious work. In terms of duplication of effort, Hive is meant to be a full SQL implementation on top of Hadoop, so there is some overlap in that both SQL and R can work with text data, but not that much, since each is a specialized tool with different strengths.

0
