Collections, Big Data and Best Practices

I have the following class

public class BdFileContent { String filecontent; } 

For example, file1.txt has the following content:

 This is test 
  • "This" is a separate instance of a file's content object.
  • "is" represents another file content object
  • "test" represents another file content object

Assume the following folder structure:

 lineage
 |
 +-folder1
 |  |
 |  +-file1.txt
 |  +-file2.txt
 |
 +-folder2
 |  |
 |  +-file3.txt
 |  +-file4.txt
 |
 +-...
 |
 +-...
    +-fileN.txt

N > 1000 files, and the value of N will be very large.

The BdFileContent class represents each string token of a line in a file within a directory.

I need to manipulate a lot of data and build a complex data structure from it. I have to perform calculations both in memory and on disk.

 ArrayList<ArrayList<ArrayList<BdFileContent>>> filecontentallFolderFileAsSingleStringToken = new ArrayList<>(); 

The object above represents the entire contents of the files in a directory. I have to add this object as a node in a tree, BdTree.

I am writing my own tree and adding filecontentallFolderFileAsSingleStringToken as a node; a sketch of how it could be populated is shown below.
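
A minimal sketch, assuming the folder layout above and the BdFileContent class in the same package, of how the nested structure could be populated: the outer list has one entry per folder, the middle list one entry per file, and the inner list one entry per token.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;

public class NestedContentBuilder {
    public static void main(String[] args) throws IOException {
        ArrayList<ArrayList<ArrayList<BdFileContent>>> filecontentallFolderFileAsSingleStringToken =
                new ArrayList<>();

        // "lineage" is the assumed root directory taken from the question's folder layout.
        try (DirectoryStream<Path> folders = Files.newDirectoryStream(Paths.get("lineage"))) {
            for (Path folder : folders) {
                if (!Files.isDirectory(folder)) {
                    continue;
                }
                ArrayList<ArrayList<BdFileContent>> perFolder = new ArrayList<>();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.txt")) {
                    for (Path file : files) {
                        ArrayList<BdFileContent> perFile = new ArrayList<>();
                        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                            for (String token : line.trim().split("\\s+")) {
                                if (token.isEmpty()) {
                                    continue;
                                }
                                BdFileContent content = new BdFileContent();
                                content.filecontent = token;   // one object per token
                                perFile.add(content);
                            }
                        }
                        perFolder.add(perFile);
                    }
                }
                filecontentallFolderFileAsSingleStringToken.add(perFolder);
            }
        }
        System.out.println("Folders loaded: " + filecontentallFolderFileAsSingleStringToken.size());
    }
}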

How well do the collection framework's data structures scale for huge data sets?

At this point, I want to get an idea of how large companies use data structures to manipulate the huge data sets generated every day.

Do they use the collection framework?

Do they use their own data structure?

Do they use a node data structure with each node running on a separate JVM?

So far, my collection objects run on a single JVM and cannot dynamically use another JVM when there are signs of memory overflow and a shortage of processing resources.

Typically, what data-structure approach do other developers take for big data?

How do other developers handle it?

I want some tips for real-world use cases and experiences.

3 answers

When you are dealing with big data, you have to change your approach. First of all, you should assume that your data will not fit into the memory of a single machine, so you need to split the data across several machines, let each compute its part of the result, and then put it all together. So you can use Collections, but only for part of the work.

I can suggest that you take a look at:

  • Hadoop: the first big data framework
  • Spark: another big data framework, often faster than Hadoop
  • Akka: a framework for writing distributed applications

While Hadoop and Spark are the de facto standards in the big data world, Akka is a general-purpose framework used in many contexts, not just big data: this means you would need to write a lot of the machinery that Hadoop and Spark already provide. I put it on the list just for the sake of completeness.

You can read the WordCount example, which is the equivalent of "Hello World" in the big data world, to get an idea of how MapReduce works in Hadoop, or you can take a look at the quick start guide to see the equivalent transformation using Spark; a sketch of the Hadoop version follows.
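
For reference, here is a sketch of the classic Hadoop MapReduce WordCount, essentially the standard tutorial version; the input and output HDFS paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token of every input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word; values are streamed, never all held in memory.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}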


Here are the answers to your questions (answered with Hadoop in mind):

Do they use a collection framework?

No. In the case of Hadoop, the HDFS file system is used.

Do they use their own data structure?

You need to understand HDFS, the Hadoop Distributed File System. Refer to the O'Reilly book Hadoop: The Definitive Guide, 3rd Edition. If you want to learn the basics without buying the book, try these links: HDFS Basics or Apache Hadoop. HDFS is a reliable and fault-tolerant file system.
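
To make this concrete, here is a minimal sketch of reading a file from HDFS with the Java FileSystem API; the path mirrors the question's layout and is hypothetical, and a configured Hadoop client (core-site.xml on the classpath) is assumed.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from the Hadoop configuration.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path, mirroring the question's folder layout.
        Path file = new Path("/lineage/folder1/file1.txt");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}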

Do they use a node data structure with each node running on a separate JVM?

Yes. See the Hadoop 2.0 YARN architecture.

Generally, what data-structure approaches do other developers take for big data?

There are many. Refer to Hadoop Alternatives.

How do other developers handle it?

The relevant techniques are provided through the framework - MapReduce in the case of Hadoop.

I want some tips for real-world use cases and experiences.

Big Data technologies are useful where an RDBMS falls short - data analytics and data warehousing (systems used for reporting and data analysis). Some of the use cases: recommendation engines (LinkedIn), ad targeting (YouTube), processing of large amounts of data (for example, finding the hottest/coldest day of a place across more than 100 years of weather data), stock price analysis, market trends, and so on.

See Big Data Use Cases for many more real-life examples.


When it comes to Big Data, the relevant technologies are a distributed file system such as HDFS (a variant of the Google File System), processing frameworks such as Hadoop MapReduce and Spark, and Hive (originally developed by Facebook). Since you are mainly asking about the data structures used in big data processing, you need to understand the role each of these systems plays.

Hadoop Distributed File System - HDFS

In very simple words, this is a file storage system that uses a cluster of cheap machines to store files in a way that is "highly available" and "fault tolerant". It thus becomes the data source in big data processing. The data can be structured (for example, comma-delimited records) or unstructured (the contents of all the books in the world).

How to work with structured data

One important technology used for structured data is Hive. It provides a relational-database-like view of the data. Note that it is not an actual relational database. The source of this view is, again, files stored on disk (or HDFS, as used by large companies). When you process data with Hive, the logic is applied to those files (internally through one or more MapReduce programs) and the result is returned. If you want to save that result, it again lands on disk (or HDFS) as a structured file.

Thus, a sequence of Hive queries helps you refine a large data set into the desired data set through stepwise transformations. Think of how you would extract data from a traditional database system using joins and then save the intermediate results in a temp table; a sketch follows below.
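
As an illustration, here is a minimal sketch of issuing such a query from Java through the HiveServer2 JDBC driver; the host, port, database, and the tokens table are assumptions for the example, not something from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the connection URL assumes a local default setup.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {
            // Stepwise refinement: derive a small, structured result set from a large table.
            ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS cnt FROM tokens GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}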

How to work with unstructured data

When it comes to unstructured data, the Map-Reduce approach is one of the most popular, along with Apache Pig (which is ideal for semi-structured data). The Map-Reduce paradigm primarily takes data from disk (or HDFS), processes it on multiple machines, and writes the result back to disk.

If you read the popular Hadoop book - O'Reilly's Hadoop: The Definitive Guide - you will find that Map-Reduce primarily works with key-value pairs (like a Map), but it never stores all the values in memory at any particular point in time. It looks more like this:

  • Read one key-value pair.
  • Do some processing.
  • Write data to disk through the context.
  • Do this for all key-value pairs, thus processing one logical block of the Big Data source at a time.

In the end, the output of one Map-Reduce program is written to disk, and you then have a new data set for the next level of processing (which may again be another Map-Reduce program), as sketched below.
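
A minimal sketch of that chaining, assuming hypothetical input/intermediate/output paths passed on the command line and using Hadoop's built-in identity Mapper and Reducer as placeholders where real processing logic would go:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of stage 1, input of stage 2
        Path output = new Path(args[2]);

        // Stage 1: identity Mapper/Reducer stand in for real processing logic.
        Job first = Job.getInstance(conf, "stage-1");
        first.setJarByClass(ChainedJobsDriver.class);
        first.setMapperClass(Mapper.class);    // placeholder: your own mapper would go here
        first.setReducerClass(Reducer.class);  // placeholder: your own reducer would go here
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2: consumes the on-disk output of stage 1 as its input.
        Job second = Job.getInstance(conf, "stage-2");
        second.setJarByClass(ChainedJobsDriver.class);
        second.setMapperClass(Mapper.class);   // placeholder
        second.setReducerClass(Reducer.class); // placeholder
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}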

Now, to answer your specific questions:

At this point, I want to get an idea of how large companies use data structures to manipulate the huge data sets generated every day.

They use HDFS (or a similar distributed file system) to store the large data sets. If the data is structured, Hive is a popular tool for processing it. Because Hive's query syntax for transforming data is close to SQL, the learning curve is really low.

Do they use the collection framework?

When processing big data, the full content is never held in memory (not even across the cluster nodes). It is more like one chunk of data being processed at a time. That chunk may be represented as a collection (in memory) during processing, but at the end the entire output data set is flushed back to disk in a structured form.

Do they use their own data structure?

Since the data is never held in memory in its entirety, there is no single custom in-memory data structure to speak of. However, data moving through Map-Reduce or across the network is represented in some data structure, so yes - there are data structures, but they are not that important from the application developer's point of view. The logic inside Map-Reduce (or other big data processing) is written by the developer, and you can always use any API (or custom types) to process the data; the data just has to be written back to disk in the structure the framework expects.

Do they use a node data structure with each node running on a separate JVM?

Big data in files is processed by several machines in blocks. For example, 10 TB of data might be processed in 64 MB blocks across a cluster of several nodes (each a separate JVM, and sometimes several JVMs per machine). But again, this is not one shared data structure across JVMs; rather, it is distributed data input (in the form of file blocks) across JVMs.

So far, my collection objects run on a single JVM and cannot dynamically use another JVM when there are signs of memory overflow and a shortage of processing resources.

You're right.

Typically, what data-structure approach do other developers take for big data?

From the data I/O perspective, it is always a file on HDFS. For the data processing (application logic), you can use any regular Java API that can run in the JVM. But the JVMs on a big data cluster also have resource limits, so you must write your application logic to work within those resources (just as you would for a regular Java program).

How do other developers handle it?

I would suggest reading The Definitive Guide (mentioned earlier) to understand the building blocks of big data processing. The book is amazing and touches on many aspects/problems and the approaches for solving them in big data.

I want some tips for real-world use cases and experiences.

There are many uses of big data processing, especially in financial institutions. Google Analytics is one of the prominent use cases: it lets you track user behavior on a website, for example to determine the best position on a web page for a Google ad unit. I work with a leading financial institution that loads user transaction data into Hive to run fraud detection based on user behavior.

