When it comes to Big Data, the main technologies available are the Hadoop Distributed File System, HDFS (a variant of the Google File System), Hadoop MapReduce, Spark, and Hive (originally developed by Facebook). Now, since you are mainly asking about the data structures used in big data processing, you need to understand the role each of these systems plays.
Hadoop Distributed File System - HDFS
In very simple words, this is a file storage system that uses a cluster of cheap machines to store files in a way that is "highly available" and "fault tolerant". It thus becomes the data source in big data processing. The data can be structured (for example, comma-delimited records) or unstructured (say, the contents of all the books in the world).
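For example, here is a minimal sketch of writing a small structured file into HDFS using the standard org.apache.hadoop.fs client API; the path and CSV content are purely illustrative, and a configured Hadoop client (core-site.xml / hdfs-site.xml on the classpath) is assumed:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes one small CSV file into HDFS. Behind the single create() call,
// HDFS splits the file into blocks and replicates them across the cheap
// machines of the cluster, which is where the high availability and
// fault tolerance come from.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/data/sample/records.csv")),   // hypothetical path
                     StandardCharsets.UTF_8))) {
            writer.write("user_id,country,amount\n");
            writer.write("42,US,19.99\n");
        }
    }
}
```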
How to work with structured data
One important technology used for structured data is Hive. It provides a relational-database-like view of the data. Please note that Hive is not an actual relational database; the source of this view is, again, files stored on disk (or on HDFS, as is the case in large companies). When you run a Hive query, the logic is applied to those files (internally via one or more MapReduce jobs) and the result is returned. If you want to save that result, it again lands on disk (or HDFS) as a structured file.
Thus, a sequence of Hive queries lets you refine a large data set into the desired data set through stepwise transformations. Think of how you would extract data from a traditional database using joins and then save the result into a temp table.
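As a rough sketch of such a sequence, here is how it might look from Java through the HiveServer2 JDBC driver; the connection URL, table names, and columns are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Runs a small Hive query sequence: join two tables, land the result in a
// temp-style table (a structured file on HDFS), then refine it further.
public class HiveQuerySequence {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // Step 1: join and materialise an intermediate (temp) table.
            stmt.execute(
                "CREATE TABLE tmp_user_orders AS "
                + "SELECT u.user_id, u.country, o.amount "
                + "FROM users u JOIN orders o ON u.user_id = o.user_id");

            // Step 2: the next query in the sequence refines the data set further.
            stmt.execute(
                "CREATE TABLE big_spenders AS "
                + "SELECT user_id, SUM(amount) AS total "
                + "FROM tmp_user_orders GROUP BY user_id HAVING SUM(amount) > 10000");
        }
    }
}
```

Each CREATE TABLE ... AS SELECT runs as one or more MapReduce jobs under the hood, and the intermediate table is just another structured file on HDFS.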
How to work with unstructured data
When it comes to unstructured data, the MapReduce approach is one of the most popular, along with Apache Pig (which is ideal for semi-structured data). The MapReduce paradigm primarily takes data from disk (or HDFS), processes it on multiple machines, and writes the result back to disk.
If you read the popular Hadoop book - O'Reilly's Hadoop: The Definitive Guide - you will find that MapReduce mainly works with key-value pairs (like a Map), but it never holds all the values in memory at any particular point in time. It looks more like:
- Read a key-value pair
- Do some processing
- Write the result to disk through the context
- Do this for all key-value pairs, thus processing one logical chunk of the Big Data source at a time.
At the end, the output of one MapReduce program is written to disk, and you now have a new data set for the next level of processing (which, again, may be another MapReduce program).
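As a sketch of that per-record pattern, here is the classic word-count mapper written against the standard org.apache.hadoop.mapreduce API; it emits each result through the context rather than accumulating anything in memory:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Processes one input key-value pair (file offset + line of text) at a time
// and writes each result to the context; nothing accumulates across records.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // "write the result to disk through the context"
            }
        }
    }
}
```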
Now, to answer your specific questions:
At this stage, I want to get an idea of how a large company uses data structures to manipulate the huge set of data created every day.
They use HDFS (or a similar distributed file system) to store the large data. If the data is structured, Hive is a popular tool for processing it. Because Hive's syntax for transforming data is close to SQL, the learning curve is really low.
Do they use collection frameworks?
When processing Big Data, all of the content is never held in memory at once (not even across the cluster nodes). It is more like one chunk of data being processed at a time. That chunk may be represented as a collection (in memory) during processing, but at the end the entire output data set is flushed back to disk in a structured form.
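To make that concrete, here is the matching word-count reducer as a sketch: it only ever sees the values for one key at a time (an Iterable, effectively a small in-memory collection), and the aggregated result is immediately written back out through the context:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Only the values for one key are streamed through at a time; the
// aggregated result is written back out rather than kept in memory.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {   // one chunk of values, not the whole data set
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);          // flushed back to disk in structured form
    }
}
```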
Do they use their own data structure?
Since not all the data is held in memory, there is no single custom data structure that contains it. However, data moving through MapReduce or across the network does take the form of a data structure, so yes - there is a data structure, but it is not that important from the application developer's point of view. As for the logic inside MapReduce (or other Big Data processing) that the developer writes, you can always use any API (or custom collection) to process the data; the data must, however, be written back to disk in the data structure the framework expects.
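For illustration, here is a minimal sketch of such a framework-expected structure in classic Hadoop: a hypothetical TransactionWritable that implements the Writable contract so the framework can serialise it to disk and ship it across the network:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A hypothetical record type: inside your map/reduce logic you may use any
// Java objects you like, but anything you emit must be serialisable in the
// form the framework expects - in classic Hadoop, a Writable.
public class TransactionWritable implements Writable {

    private long userId;
    private double amount;

    public TransactionWritable() { }                 // required no-arg constructor

    public TransactionWritable(long userId, double amount) {
        this.userId = userId;
        this.amount = amount;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // how it goes to disk/network
        out.writeLong(userId);
        out.writeDouble(amount);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // how it is read back
        userId = in.readLong();
        amount = in.readDouble();
    }
}
```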
Do they use a data structure spread across nodes, with each node running on a separate JVM?
Big data in files is processed by several machines in blocks. For example, 10 TB of data is processed in 64 MB blocks across a cluster of several nodes (each a separate JVM, and sometimes several JVMs on the same machine). But again, this is not a shared data structure spanning JVMs; rather, it is distributed data input (in the form of file blocks) across JVMs.
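A small sketch with the standard FileSystem API makes that block distribution visible (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists which cluster nodes hold the blocks of one HDFS file, showing that
// the "data structure" is really file blocks spread over machines.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/transactions/2015-06-01.csv");   // assumed path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
    }
}
```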
So far, a collection object runs on one JVM and cannot dynamically use another JVM when memory overflows and there is a lack of resources for processing.
You're right.
Typically, what other data structure approaches do developers take for big data?
From the data I/O perspective, it is always a file on HDFS. From the data processing (application logic) perspective, you can use any regular Java API that runs inside a JVM. Now, the JVMs in a big data cluster also have resource limits, so you must make your application logic work within those resources (just as for a regular Java program).
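For example, here is a sketch of declaring those limits when submitting a MapReduce job; the property names are the standard Hadoop 2 / YARN ones, but the values are assumptions that depend on what your cluster allows:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Constrains a job to the cluster's per-task resource limits; the heap
// (-Xmx) must fit inside the container memory requested from YARN.
public class ConfigureJobResources {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.map.memory.mb", "2048");        // container memory per map task
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");   // JVM heap inside that container
        conf.set("mapreduce.reduce.memory.mb", "4096");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");

        Job job = Job.getInstance(conf, "resource-aware job");
        // ... set mapper/reducer/input/output here, then job.waitForCompletion(true)
    }
}
```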
How do other developers handle it?
I would suggest reading The Definitive Guide (mentioned above) to understand the building blocks of Big Data processing. The book is amazing and touches on many aspects/problems and their solution approaches in Big Data.
I want some tips for real-world use cases and experiences.
There are many uses of Big Data processing, especially at financial institutions. Google Analytics is one of the prominent use cases: it lets you track user behaviour on a website in order to determine the best position on a web page to place a Google ad unit. I work with a leading financial institution that loads user transaction data into Hive to do fraud detection based on user behaviour.