Is BigTable slow or am I dumb?

I basically have the classic many-to-many model: a user, an award, and a many-to-many mapping table between users and awards.

Each user has on the order of 400 awards, and each award is given to about half of the users.

I want to iterate over all of a user's awards and sum up their points. In SQL this would be a join across the many-to-many table and then a walk through each of the rows. On a decent machine with a MySQL instance, 400 rows shouldn't be a big deal at all.
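
Roughly speaking, the model and the summing loop look something like this (a simplified sketch, not my exact code):

    from google.appengine.ext import db

    class User(db.Model):
        name = db.StringProperty()

    class Award(db.Model):
        name = db.StringProperty()
        points = db.IntegerProperty()

    class UserAward(db.Model):  # the many-to-many mapping table
        user = db.ReferenceProperty(User, collection_name='user_awards')
        award = db.ReferenceProperty(Award, collection_name='award_users')

    def total_points(user):
        total = 0
        for mapping in user.user_awards:   # one query for the mapping rows...
            total += mapping.award.points  # ...plus one datastore get per row to dereference the award
        return total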

On App Engine I'm seeing around 10 seconds to compute the sum, with most of the time spent in Google's datastore. Here are the first few lines of the cProfile output:

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
       462    6.291    0.014    6.868    0.015  {google3.apphosting.runtime._apphosting_runtime___python__apiproxy.Wait}
       913    0.148    0.000    1.437    0.002  datastore.py:2424(_FromPb)
      8212    0.130    0.000    0.502    0.000  datastore_types.py:1345(FromPropertyPb)
       462    0.120    0.000    0.458    0.001  {google3.net.proto._net_proto___parse__python.MergeFromString}

Is my data model wrong? Am I doing the lookups wrong? Is this a deficiency I have to deal with via caching and bulk updating (which would be a royal pain in the ass)?

+25
google-app-engine django django-models bigtable
Jun 05 '09 at 8:15
5 answers

It may be both ;-)

If you're doing 400 queries on the Awards table, one for each result returned by the query on the mapping table, then I would expect that to be painful. The 1000-result limit on queries exists because BigTable reckons that returning 1000 results is about the limit of what it can do in a reasonable time. Based on the architecture, I'd expect the 400 queries to be far slower than one query returning 400 results (400 log N versus (log M) + 400).

The good news is that on GAE, memcaching a single hashtable containing all the awards and their point values is pretty straightforward (well, it looked pretty straightforward when I cast an eye over the memcache docs a while ago; I haven't needed to do it yet).
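
Something along these lines should do it (a rough sketch, assuming an Award model with a points property; I haven't tested it):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    def points_table():
        """Return a dict mapping award id/name -> points, cached in memcache."""
        table = memcache.get('points_table')
        if table is None:
            table = dict((a.key().id_or_name(), a.points)
                         for a in Award.all().fetch(1000))
            memcache.set('points_table', table, time=3600)  # refresh at most hourly
        return table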

Also, if you don't already know: for result in query.fetch(1000) is way faster than for result in query, and you're limited to 1000 results either way. The advantages of the latter are (1) it might be faster if you bail out early, and (2) if Google ever increases the limit beyond 1000, it gets the benefit without a code change.
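
Concretely, the difference is just this (hypothetical query, for illustration only):

    query = UserAward.all().filter('user =', user)

    # One RPC fetching up to 1000 entities at once:
    mappings = query.fetch(1000)

    # Versus iterating the query object directly, which fetches results in
    # smaller batches behind the scenes (more round trips, but you can stop early):
    for mapping in query:
        pass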

You might also have problems when you delete a user (or an award). I found in one test that I could delete 300 objects within the time limit. Those objects were more complex than your mapping objects, having 3 properties and 5 indices (including the implicit ones), whereas your mapping table probably has only 2 properties and 2 (implicit) indices. [Edit: just realised that I did this test before I knew that db.delete() can take a list, which is probably much faster].
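
For what it's worth, the batched version would look roughly like this (an untested sketch; the UserAward and user names are placeholders):

    # Delete all of a user's mapping entities in one batched call,
    # rather than one delete per entity.
    keys = UserAward.all(keys_only=True).filter('user =', user).fetch(1000)
    db.delete(keys)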

BigTable does not necessarily do the things that relational databases are designed to do well. Instead, it distributes data well across many nodes. But nearly all web sites run fine with a bottleneck on a single db server, and hence don't strictly need the thing that BigTable does.

One other thing: if you're doing 400 datastore queries on a single HTTP request, you will find that you hit your datastore fixed quota well before you hit your request fixed quota. Of course, if you're well within quotas, or if you hit something else first, this may be irrelevant for your app. But the ratio between the two quotas is something like 8:1, and I take that as a hint about what Google expects my data model to look like.

+20
Jun 05 '09 at 8:44

Is my data model wrong? Am I doing the lookups wrong?

Yes and yes, I'm afraid.

As far as your data model goes, by far the best way to handle this is to store the sum against the user record and update it when a user gains or loses an award. There is really no point in counting their score every time when the vast majority of the time it will be unchanged. If you make the UserAward entity type a child entity of the User, you can update the score and insert or delete the UserAward entry in a single atomic transaction, ensuring your count is always accurate.
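
A sketch of what that might look like (assuming a points_total property on User and a UserAward child entity with an award reference; untested):

    from google.appengine.ext import db

    def grant_award(user_key, award):
        def txn():
            user = db.get(user_key)
            user.points_total += award.points
            # UserAward is a child of the user, so both writes are in the same
            # entity group and commit atomically.
            UserAward(parent=user, award=award).put()
            user.put()
        db.run_in_transaction(txn)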

onebyone points out that you can memcache the awards table. That's a good idea, but given the limited amount of data, an even better one is to store it in local memory. Global members persist between HTTP requests, and since I presume you don't update the awards table often, you don't have to worry much about invalidating the cache. Just load it on the first request (or even hardcode it into your source). If you do change the list of awards, deploying a new minor update will reset all the instances, causing them to reload.
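
For example (a rough sketch; the names are made up):

    # Module-level cache: it lives for the lifetime of the instance, so the
    # awards are loaded from the datastore at most once per instance.
    _award_points = None

    def award_points():
        global _award_points
        if _award_points is None:
            _award_points = dict((a.key().id_or_name(), a.points)
                                 for a in Award.all().fetch(1000))
        return _award_points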

For the lookups, bear in mind that a substantial cost of datastore operations is the round-trip time. A get() operation, which looks up one or more records by ID (you can batch!), takes about 20-40 ms. A query, however, takes about 160-200 ms. Hence the power of denormalization.
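
To make that concrete, a batched lookup might look like this (a rough, untested sketch using hypothetical UserAward/Award models):

    # Fetch the mapping rows once, then batch-get the awards they point to.
    mappings = UserAward.all().filter('user =', user).fetch(1000)

    # Collect the referenced keys without dereferencing the ReferenceProperty
    # (dereferencing would cost one get() round trip per entity).
    award_keys = [UserAward.award.get_value_for_datastore(m) for m in mappings]

    awards = db.get(award_keys)           # a single batched round trip
    total = sum(a.points for a in awards)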

+19
Jun 05 '09 at 16:24

One important idiom of App Engine is that storage is cheap but a user's time is never in surplus. It seems the best way to do many-to-many relationships on App Engine is to simply store the information on both sides: i.e., a user has a list of awards, and each award has a list of users. To look up all the awards a user has, you simply query the awards table for that user.

This idea is well demonstrated here: Building Scalable, Complex Apps on App Engine
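
In App Engine terms that usually means key list properties on both entities, along these lines (a sketch; the property names are made up):

    from google.appengine.ext import db

    class User(db.Model):
        award_keys = db.ListProperty(db.Key)   # awards this user holds

    class Award(db.Model):
        points = db.IntegerProperty()
        user_keys = db.ListProperty(db.Key)    # users who hold this award

    # All of a user's awards in one batched get:
    def awards_for(user):
        return db.get(user.award_keys)

    # Or the query described above: awards whose user list contains this user.
    def awards_for_by_query(user):
        return Award.all().filter('user_keys =', user.key()).fetch(1000)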

+1
Oct 13

Google BigTable runs on the Google distributed file system.

The data is distributed. Maybe 400 MySQL rows still perform better, but for larger data sets Google BigTable might be faster.

I think that’s why they encourage us to use memcache to make it faster.

0
Jun 05 '09 at 8:44

Even though you mention BigTable, I think you are implementing a relational database on Cloud SQL.

Your model is fine, and this is the proper way to do something like this. I don't see a good enough reason to de-normalize the aggregates onto the user table.

Make sure you have created indexes for fast table joins. It is pretty simple: you may need BTree indexes for all the fields involved in the joins. There is no need to index the aggregated field (the one you take the SUM of). Basically, both foreign keys of the N:N table should be indexed. If those foreign keys refer to the primary keys of the other two tables, that is enough to go on.

Above about 100 rows, a simple BTree index on the foreign keys can give a decent, noticeable increase in throughput.
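
Since the question is tagged Django, at the model level that looks roughly like this (a sketch with made-up names; Django's ForeignKey fields get a BTree index by default, so the through table usually needs no extra index declarations):

    from django.db import models

    class Award(models.Model):
        name = models.CharField(max_length=100)
        points = models.IntegerField()

    class User(models.Model):
        name = models.CharField(max_length=100)
        awards = models.ManyToManyField(Award, through='UserAward')

    class UserAward(models.Model):
        # Both foreign keys of the N:N table are indexed automatically.
        user = models.ForeignKey(User, on_delete=models.CASCADE)
        award = models.ForeignKey(Award, on_delete=models.CASCADE)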

I am running a database on Cloud SQL where some edge tables have over 2 million records. Only past 2.5 million records am I considering some de-normalization, and that is with some extra indexes as well, and still aggregating for the SUM. Otherwise I would be making unnecessary updates to the SUM field whenever new records are added.

Only when a table accumulated more than 1 million records did we have to consider using a read replica, and only then could we separate the processes that only read certain tables from those that write.

If you use Django, beware of how it implements LIMIT, per its documentation, because it is very misleading: when you slice a queryset with [:100], the SQL that is actually sent to the SQL server is not necessarily what you expect. I had a very hard time figuring this out. Django is not a great choice if you plan to operate at a very large scale, but at around 1,000 records it is fine.
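
One way to see what Django will actually send for a slice is to inspect the query it builds (illustrative only; Award is a made-up model):

    qs = Award.objects.all()[:100]  # lazy: nothing has been sent to the server yet
    print(qs.query)                 # the SQL Django intends to run, including the LIMIT clause
    awards = list(qs)               # only now is the query actually executed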

0
Mar 01 '17 at 19:46


