Life without JOINs ... understanding and common practices

Many BAWs (big, high-traffic websites) use data storage and retrieval techniques that rely on huge indexed tables and on queries that won't or can't use JOINs (BigTable, HQL, etc.) in order to cope with scalability and sharding. How does that work when you have lots and lots of data that is highly related?

I can only speculate that most of this joining has to be done on the application side, but doesn't that get expensive? What if you need to make several queries against several different tables to assemble one piece of information? Doesn't hitting the database that many more times start to cost more than just using joins in the first place? I guess it depends on how much data you have?
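
For illustration, here is a minimal sketch (Python, with invented table and field names, and in-memory dicts standing in for the remote store) of what that kind of application-side join tends to look like; against a real store every lookup is a network round trip, which is where the expense comes from:

    # Application-side "join": the store can't do it, so the code does.
    ORDERS = {
        1: {"id": 1, "customer_id": 10, "total": 42.50},
        2: {"id": 2, "customer_id": 11, "total": 7.99},
    }
    CUSTOMERS = {
        10: {"id": 10, "name": "Alice"},
        11: {"id": 11, "name": "Bob"},
    }

    def orders_with_customers(order_ids):
        """Fetch each order, then its customer, and stitch them together in code."""
        joined = []
        for order_id in order_ids:
            order = ORDERS[order_id]                    # 1 lookup per order
            customer = CUSTOMERS[order["customer_id"]]  # +1 lookup per order
            joined.append({**order, "customer": customer})
        return joined

    print(orders_with_customers([1, 2]))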

And for the commonly available ORMs, how do they deal with the inability to use joins? Is there support for this in any of the ORMs in heavy use today? Or do most projects that have to approach this level of data end up rolling their own anyway?

So this doesn't apply to any current project I'm doing, but it's something that has been in my head for several months now, and I can only speculate as to what the "best practices" are. I've never needed to address it in any of my projects because they've never reached the scale where it's necessary. Hopefully this question helps other people as well.

As someone said below, ORMs "don't work" without joins. Are there other data-access layers already available to developers working with data at this level?

EDIT: For clarification, Vinko Vrsalovic said:

"I think giggling wants to talk about NO-SQL, where transactional data is denormalized and used in Hadoop or BigTable or Cassandra."

This is really what I'm talking about.

Bonus points for those who catch the xkcd reference.

+59
join orm nosql bigtable hadoop
Oct 07 '09 at 15:05
7 answers

The way I look at it, a relational database is a general-purpose tool for hedging your bets. Modern computers are fast enough, and RDBMSs are well enough optimized, that you can grow to quite a respectable size on a single box. By choosing an RDBMS you get very flexible access to your data and powerful correctness constraints that make it much easier to code against the data. However, the RDBMS is not going to be a good optimization for any particular problem; it just gives you the flexibility to change problems easily.

If you start growing rapidly and realize you are going to have to scale beyond the capacity of a single database server, then things suddenly get much harder. You will need to start identifying bottlenecks and removing them. The RDBMS will be one nasty snarled knot of co-dependency that you have to tease apart. The more interconnected your data is, the more work you will have to do, but maybe you won't have to completely disentangle the whole thing. If you're read-heavy, maybe you can get by with simple replication. If you're saturating your market and growth is leveling off, maybe you can partially denormalize and shard across a fixed number of database servers. Maybe you just have a handful of problem tables that can be moved to a more scalable data store. Maybe your usage profile is very cache-friendly and you can simply migrate the load onto a giant memcached cluster.
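
As a rough illustration of that last option, here is a minimal read-through cache sketch in Python; the dict stands in for a memcached cluster and the function names are invented for the example:

    CACHE = {}

    def load_from_database(user_id):
        """Placeholder for the expensive relational query being shielded."""
        return {"id": user_id, "name": "user-%d" % user_id}

    def get_user(user_id):
        """Read-through caching: serve from the cache when possible, fall back to the DB."""
        key = "user:%d" % user_id
        if key not in CACHE:
            CACHE[key] = load_from_database(user_id)  # miss: hit the DB once, remember the result
        return CACHE[key]

    print(get_user(7))  # miss -> database
    print(get_user(7))  # hit  -> cache only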

Where scalable key-value stores like BigTable come in is when none of the above can work, and you have so much data of a single type that even when it is denormalized into one table it is too much for one server. At that point you need to be able to partition it arbitrarily and still have a clean API to access it. Naturally, when the data is spread across that many machines you can't have algorithms that require the machines to talk to each other much, which many of the standard relational algorithms would require. These distributed query algorithms may well require more total computing power than the equivalent JOIN in a properly indexed relational database, but because they are parallelized the real-time performance is orders of magnitude better than any single machine could manage (assuming a machine that could hold the entire index even exists).
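
A toy sketch of that kind of arbitrary partitioning (Python; plain dicts stand in for the separate machines, and the routing scheme is deliberately simplified):

    # Route each key to one of N "servers" by hashing it. A lookup touches exactly
    # one shard, so the machines never need to talk to each other to answer it.
    import hashlib

    NUM_SHARDS = 4
    SHARDS = [dict() for _ in range(NUM_SHARDS)]

    def shard_for(key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % NUM_SHARDS]

    def put(key, value):
        shard_for(key)[key] = value

    def get(key):
        return shard_for(key).get(key)

    put("user:42", {"name": "Alice"})
    print(get("user:42"))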

Once you can scale your massive data set horizontally (just by plugging in more servers), the hard part of the scalability problem is done. Well, I shouldn't say done, because ongoing operations and development at this scale are much harder than running a single-server app, but the point is that application servers are typically trivial to scale via a shared-nothing architecture, as long as they can get the data they need in a timely fashion.

To answer the question of how commonly used ORMs handle the inability to use JOINs, the short answer is: they don't. ORM stands for Object-Relational Mapping, and most of an ORM's job is translating the powerful relational paradigm of predicate logic into simple object-oriented data structures. Most of what they give you simply isn't possible with a key-value store. In practice you will probably need to build and maintain your own data-access layer suited to your particular needs, because data profiles at these scales vary dramatically and I believe there are too many trade-offs for a general-purpose tool to emerge and become dominant the way the RDBMS has. In short, you will always have to do more legwork at this scale.
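
To make that concrete, here is a sketch (Python, invented names, a dict standing in for the key-value store) of the kind of small, purpose-built data-access layer you end up writing instead of leaning on a general-purpose ORM:

    import json

    STORE = {}

    class UserProfiles:
        """App-specific access layer: one denormalized document per user, no joins."""

        @staticmethod
        def save(user_id, profile):
            # The profile already embeds everything a page view needs (posts, counts, ...),
            # so reading it back is a single get rather than a join.
            STORE["profile:%d" % user_id] = json.dumps(profile)

        @staticmethod
        def load(user_id):
            raw = STORE.get("profile:%d" % user_id)
            return json.loads(raw) if raw is not None else None

    UserProfiles.save(1, {"name": "Alice", "post_count": 2, "recent_posts": ["hi", "bye"]})
    print(UserProfiles.load(1))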

That said, it will definitely be interesting to see what kind of relational or other aggregate functionality can be built on top of key-value primitives. I don't really have enough experience to comment here: there is a lot of knowledge about this in enterprise computing going back many years (e.g. Oracle), a lot of untapped theoretical knowledge in academia, and a lot of practical knowledge at Google, Amazon, Facebook, et al., but the knowledge that has filtered out into the wider development community is still fairly limited.

However, now that so many applications are moving to the web, and more and more of the world's population is online, more and more applications will inevitably have to scale, and best practices will begin to crystallize. The knowledge gap will be narrowed from both sides by cloud services such as AppEngine and EC2, and by open-source databases such as Cassandra. In some sense this goes hand in hand with parallel and asynchronous computation, which is also in its infancy. Definitely a fascinating time to be a programmer.

+35
Oct 16 '09 at 8:18

You are starting from a faulty assumption.

A data warehouse does not normalize data the same way a transactional application normalizes it. There are not "lots" of joins. There are relatively few.

In particular, second and third normal form violations are not a "problem", because data warehouses are rarely updated. And when they are updated, it is usually only a status-flag change to mark dimension rows as "current" or "not current".

Since you don't have to worry about updates, you don't decompose things down to the 2NF level where an update can't lead to anomalous relationships. No updates means no anomalies, no need for decomposition, and no joins. You can pre-join everything.

Generally, DW data is decomposed according to a star schema. That guides you to decompose the data into numeric "fact" tables containing the measures -- numbers with units -- plus foreign-key references to the dimensions.

A dimension (or "business entity") is best thought of as a real-world thing with attributes. Often this includes things like geography, time, product, customer, etc. These things frequently have complex hierarchies. The hierarchies are usually arbitrary, defined by various business reporting needs, and are not modeled as separate tables but simply as columns in the dimension used for aggregation.




To address some of your specific questions:

"this attachment must be done on the application side." View. Before downloading, the data is "pre-merged." Measurement data is often a combination of the corresponding source data for that measurement. It connects and loads as a relatively flat structure.

It isn't updated. Instead of updates, additional historical records are inserted.

"but doesn't it get expensive?" View. Downloading data requires some attention. However, there are not many reports / analyzes. Data is pre-merged.

The ORM issues are largely moot, since the data is pre-joined. Your ORM maps to a fact or a dimension as needed. Except in special cases, dimensions tend to be smallish and fit entirely in memory. The exception is when you are in finance (banking or insurance) or at a utility company and have massive customer databases. Those customer dimensions rarely fit in memory.

+21
Oct 07 '09 at 15:14

A JOIN is a purely relational term, and not all databases are relational.

Other database models have other ways of building relationships.

Network databases use endless chains of find a key - fetch the reference - find a key, which have to be programmed in a general-purpose programming language.

The code can run on the application side or on the server side, but it is not SQL and it is not even set-based.

With proper design, a network database can be much faster than a relational database.

For example, a network database can store a reference to another entity as a direct pointer to an offset in a file, or even to a block on disk where the information about that entity is stored.

This makes traversing the network extremely fast, provided you have written efficient code for it.
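
A rough way to picture this (Python, with object references standing in for the file or disk offsets a real network database would use):

    # Toy sketch of network-style traversal: each record holds a direct reference to the
    # related record, so following a relationship is a dereference, not a key lookup.
    class Department:
        def __init__(self, name):
            self.name = name

    class Employee:
        def __init__(self, name, department):
            self.name = name
            self.department = department  # direct "pointer"; nothing to resolve

    eng = Department("Engineering")
    alice = Employee("Alice", eng)

    print(alice.department.name)  # one dereference: no index, no B-Tree traversal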

A relational database can only store references as pairs of basic values, such as integers (or triples, or tuples of higher order).

To find these values in a relational database, the engine must perform the following actions:

  • Find out where the tuple containing the first value resides
  • Find the second value
  • Find the address of the root of the B-Tree that holds the second value
  • Traverse that tree
  • Find the pointer to the actual table (which may itself be stored as a B-Tree, in which case the pointer is the PRIMARY KEY value of the row we are after)
  • Find the table row by that pointer, or traverse the table
  • Finally, get the result.

And you can only control this to a limited extent. Beyond that, you just issue the SQL query and wait.

The relational model was made to simplify the developer's life, not to achieve ultimate speed always and no matter what.

It is the same as assembly language versus a higher-level language: the relational model is the higher-level language.

You may want to read the article on my blog in which I try to explain the differences between several commonly used database models.

+14
Oct 07 '09 at 15:13

When you denormalize your data in this fashion, you do it to avoid the cost of joining disparate items; you accept that some data may be duplicated and that some ways of combining it may be difficult, in exchange for the performance benefit of using simple queries.

If you're having to do any great number of joins at the application level, it means you haven't denormalized enough.

Ideally, you should be able to make a single query for any set of data you need. In practice you shouldn't have to use more than two or three queries for any aspect of your application, and any application-level joining will be more of a trivial extraction of stuff from the separate result sets to insert into the view.
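
As a sketch of that kind of "trivial" combining (Python, invented data): two simple queries, then a bit of dictionary work to shape the result for the view, with no real join anywhere:

    page_posts = [  # result of query 1: already denormalized per-post data
        {"id": 1, "author_id": 10, "title": "Hello"},
        {"id": 2, "author_id": 11, "title": "World"},
    ]
    page_authors = {  # result of query 2: the handful of authors this view needs
        10: {"id": 10, "name": "Alice"},
        11: {"id": 11, "name": "Bob"},
    }

    view_model = [
        {"title": post["title"], "author": page_authors[post["author_id"]]["name"]}
        for post in page_posts
    ]
    print(view_model)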

This kind of thing is really only needed for truly massive data sets, and there are all sorts of trade-offs involved. To give just one example: BigTable can't do aggregate queries, such as giving you a count. It can be used to give you a figure that is approximately right -- in the sense that if you have, say, 12,149,173 records, of which 23,721 were added in the last hour, it doesn't really matter that the best you can find out is that you have "about 12,100,000 records". If your application depends on knowing the precise figure at any given moment, then you shouldn't be using BigTable for it, is the general attitude.
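
The usual workaround, sketched below in Python (invented names; the list of counter shards stands in for rows in the store), is to maintain the count yourself as records are written rather than asking the store to aggregate, accepting that the figure is only approximately right at any instant:

    import random

    NUM_COUNTER_SHARDS = 8            # spread increments so no single counter is hot
    counter_shards = [0] * NUM_COUNTER_SHARDS

    def record_added():
        """Called on every insert: bump one randomly chosen counter shard."""
        counter_shards[random.randrange(NUM_COUNTER_SHARDS)] += 1

    def approximate_total():
        """'About how many records are there?' -- just sum the shards."""
        return sum(counter_shards)

    for _ in range(1000):
        record_added()
    print(approximate_total())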

+4
Oct 07 '09 at 15:24

Applications like Facebook have very few data changes; most of the time users are just posting new items. So the problem of many records needing to be updated when an item changes is a lesser one.

This allows the data to be left unnormalized without running into the usual update problems.

Applications like Amazon can afford to load all the data for a single user into RAM (how big is a shopping cart, after all?), then update the data in RAM and write it back out as a single data item.

Once again eliminating the need to keep most of the data normalized.
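
A sketch of that read-modify-write pattern (Python; the dict stands in for the store and the cart structure is invented): the whole cart lives under one key, so changing it means reading one item, mutating it in memory, and writing it back whole:

    STORE = {}

    def add_to_cart(user_id, product_id, qty):
        cart = STORE.get("cart:%d" % user_id, {"items": []})          # read the single item
        cart["items"].append({"product_id": product_id, "qty": qty})  # mutate in RAM
        STORE["cart:%d" % user_id] = cart                             # write it back whole

    add_to_cart(42, "book-123", 1)
    add_to_cart(42, "mug-7", 2)
    print(STORE["cart:42"])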

You are trading scalability for ease of application development; if you don't need to scale to great heights, you may want to keep the ease of development that an RDBMS provides.

+3
Oct 12 '09 at 11:30

I think in these situations you are pretty much on your own and will have to roll everything yourself. I haven't been there, but have considered it for some of our projects. You can get pretty large with relational databases (as SO demonstrates), so I will keep enjoying the relational goodness for now.

0
Oct 07 '09 at 15:15

Generally, a data warehouse is built using joins, with the data split into dimension and fact tables (in so-called "star schemas", etc.).

The joins will often be pre-computed and stored as denormalized tables.

I am not aware of any ORM tools that work with database systems that do not allow joins, since they are usually not considered to be traditional relational databases.

0
Oct 07 '09 at 15:17


