Getting large rows from Azure SQL - but where to go? Tables, Blob, or something like MongoDB?

I have read a lot of comparisons between Azure Table, Blob, and SQL storage, and I think I have a good idea of all of them ... but I'm still not sure where to go for my specific needs. Maybe someone with experience in similar scenarios can make a recommendation.

What I have

An SQL Azure DB that stores articles as raw HTML in a varchar(max) column. Each row also has many metadata columns and many indexes for fast querying. The table has many relationships to "Users", "Subscriptions", "Tags", and so on, so my project will always need an SQL database.

The problem

I already have about 500,000 articles in this table, and I expect it to grow by millions of articles per year. The HTML content of each article can be anywhere between a few KB and 1 MB or, in very few cases, more than 1 MB.

Two problems arise: since SQL Azure storage is expensive, sooner rather than later I will have a real cost headache just for storing the content. On top of that, I will hit the 150 GB database size limit sooner rather than later. Those 500,000 articles already consume 1.6 GB of database space.

What I want

It is clear that the HTML content has to move out of the SQL database. The article table itself must stay there for joining against users, subscriptions, tags, etc. and for quickly finding the relevant articles relationally, but at least the column that holds the HTML content could be outsourced to cheaper storage.

At first glance, Azure Table storage looks perfect

Terabytes of data in one large table at a very low price with fast queries - it sounds perfect to have a single Table storage table holding the article content alongside the SQL DB.

But reading through the comparisons shows that this may not even be an option: 64 KB per column would be enough for 98% of my articles, but for the remaining 2% even the entire 1 MB row limit may not be enough for some individual articles.

Blob storage sounds completely wrong, but ...

So that leaves only one option on the Azure side: Blobs. Now, it may not be as bad as it sounds. In most cases I need the content of only one article at a time. That should work fine and fast enough with Blob storage.

But I also have queries where I need 50, 100 or even more rows at once, INCLUDING the content. So I would have to run an SQL query to fetch the required articles and then fetch each individual article from Blob storage. I have no experience with this, but I can't believe that I could stay in the millisecond range for query times with that approach. And queries that take several seconds are an absolute no-go for my project.

So this does not seem to be a suitable solution either.

Do I look like a guy with a plan?

At least I have something like a plan: only "outsource" the oversized records to Table storage and/or Blob storage.

Something like "while the content is <64 KB, export it to the table store, and also store it in the SQL table (or even export this single XL record to the BLOB store)"

That might work quite well. But it complicates things and may be unnecessarily error prone.

Other options

There are several other NoSQL DBs, such as MongoDB and CouchDB, that seem better suited to my needs (at least from my naive point of view, as someone who has only read the specs on paper; I have no experience with them). But they would need self-hosting, something I would like to avoid if at all possible. I am on Azure to do as little as possible in terms of self-hosting servers and services.

Have you really read this far?

Then thank you very much for your valuable time and for thinking about my problem :)

Any suggestions would be greatly appreciated. As you can see, I have my own ideas and plans, but nothing beats the experience of someone who has walked this road before :)

Thanks, Bernhard

+7
8 answers

My thoughts on this: going the MongoDB (or CouchDB) route is ultimately going to cost you additional Compute, since you will need to run several servers (for high availability). And depending on the performance you need, you may be running dual- or quad-core boxes. Three 4-core boxes are going to run you more than your SQL DB costs (plus the cost of storage, but MongoDB and friends would back their data with Azure blob storage anyway).

Now, as far as storing your html in blobs: offloading large objects to blob storage is a very common pattern. The GETs should be doable in a single call to blob storage (a single transaction), especially with the file sizes you mention. And you don't have to fetch each blob serially; you can use the TPL to download several blobs to your role instance in parallel.
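As an illustration of that parallel-fetch idea, here is a minimal sketch assuming the WindowsAzure.Storage client library; the "articles" container name and the "<id>.html" blob naming convention are made up:

    // Download many article blobs in parallel with the TPL.
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    static IDictionary<string, string> DownloadArticles(string connectionString, IEnumerable<string> articleIds)
    {
        CloudBlobContainer container = CloudStorageAccount.Parse(connectionString)
            .CreateCloudBlobClient()
            .GetContainerReference("articles");          // placeholder container name

        var htmlById = new ConcurrentDictionary<string, string>();
        Parallel.ForEach(articleIds, new ParallelOptions { MaxDegreeOfParallelism = 8 }, id =>
        {
            CloudBlockBlob blob = container.GetBlockBlobReference(id + ".html");
            htmlById[id] = blob.DownloadText();          // one GET per blob, issued concurrently
        });
        return htmlById;
    }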

One more thing: how do you use the content? If you serve it from your role instances, then what I said about the TPL should work well. If, on the other hand, you inject hrefs into your output page, you can simply put the blob URL directly into your html page. And if you are worried about privacy, make the blobs private and generate a short-TTL shared access signature granting access for a small time window (this only applies when putting the blob URL into another html page; it doesn't apply if you download to the role instance and then do something with the content there).
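A hedged sketch of that short-TTL shared access signature, again assuming the WindowsAzure.Storage library; the 10-minute window is an arbitrary example:

    // Generate a short-lived read-only URL for a private blob.
    using System;
    using Microsoft.WindowsAzure.Storage.Blob;

    static string GetTemporaryUrl(CloudBlockBlob blob)
    {
        string sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddMinutes(10)   // small time window
        });
        return blob.Uri + sas;   // put this URL in the page instead of the raw blob URL
    }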

+1

I signed up just to help with this question. In the past I have found useful answers to my problems on Stack Overflow - thank you, community - so I thought it would only be fair (fair is probably an understatement) to try to give something back with this question, since it falls right in my lane.

In short, given all the factors mentioned in the question, Table storage may be your best option - provided you can estimate the transactions per month correctly (there is a good article about this). The two limits you mentioned, the row and column size limits, can be worked around by splitting (plain-text chunking or serialization of) the document / html / data. Speaking from experience with 40 GB+ of data stored in Table storage, our application routinely retrieves more than 10 rows per page request in milliseconds - no argument there! If you need more than 50 rows, you are looking at low single-digit seconds, or you can fetch them in parallel (further splitting the data across partitions) or in some asynchronous way. Or read about the tiered caching proposed below.

A bit more detail: I have tried SQL Azure, Blob storage (both page and block blobs) and Table storage. I cannot speak for MongoDB because, partly for the reasons already mentioned here, I did not want to go down that route.

  • Table storage is fast: in the 20 to 50 millisecond range, sometimes even faster (depending, for example, on being in the same data center - I have seen it down to 10 milliseconds), when you query by partition key and row key. You can also have several partitions, chosen based on your data and what you know about it.
  • It scales better in terms of size (GB), but not in transactions.
  • The row and column limits you mentioned are a burden, agreed, but not a show stopper. I wrote my own solution for splitting entities - you can too, fairly easily (see the sketch right after this list) - or you can look at this already-written solution (it does not solve the whole problem, but it is a good start): https://code.google.com/p/lokad-cloud/wiki/FatEntities
  • Also keep in mind that loading data into Table storage is time consuming, even when batching entities, because of other limits (i.e. a request size of less than 4 MB, upload bandwidth, etc.).
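To illustrate the splitting idea, here is a simplified sketch assuming the WindowsAzure.Storage library; the property naming ("Html0", "Html1", ...) and the 32,000-character chunk size (64 KB of UTF-16) are my own choices, not taken from FatEntities:

    // Spread one large HTML string over several string properties so that no single
    // property exceeds the 64 KB limit.
    using System;
    using Microsoft.WindowsAzure.Storage.Table;

    static DynamicTableEntity ToSplitEntity(string partitionKey, string rowKey, string html)
    {
        const int ChunkChars = 32000;                        // ~64 KB per string property
        var entity = new DynamicTableEntity(partitionKey, rowKey);
        for (int i = 0, part = 0; i < html.Length; i += ChunkChars, part++)
        {
            string chunk = html.Substring(i, Math.Min(ChunkChars, html.Length - i));
            entity.Properties["Html" + part] = EntityProperty.GeneratePropertyForString(chunk);
        }
        return entity;                                       // the whole entity is still capped at ~1 MB
    }

Reading is just the reverse: concatenate the HtmlN properties in order. The roughly 1 MB per-entity cap still applies, which is where the blob tier further down comes in.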

But using Table storage exclusively may not be the best solution (thinking about growth and economics). The best solution we ended up implementing is multi-tiered caching / storage, starting with static classes, then a cache hosted in an Azure role, then Table storage, then block blobs. For readability, let's call these tiers 1A, 1B, 2 and 3 respectively. Using this approach, we use a single Medium instance (2 CPU cores and 3.5 GB of RAM - my laptop has better specs) and are able to process / query / evaluate 100 GB+ of data in seconds (95% of cases in under 1 second). I find that quite impressive given that we evaluate all the "articles" before displaying them (4+ million "articles"). First, this is complex and may or may not be feasible in your case. I do not know enough about your data and how it is consumed / processed, but if you can find a way to organize the data well, this may be ideal. I will make an assumption: it sounds like you are trying to search for and surface relevant articles given some user information and some tags (perhaps a news-aggregator variant, just a hunch). This assumption is only made to illustrate the suggestion, so even if it is wrong I hope it helps you or triggers new ideas on how to adapt it.

Tier 1A data. Identify the key entities, or some of their properties, and keep them in a static class (refreshed periodically, depending on how often you expect updates). Say we identify user preferences (e.g. demographics, interests, etc.) and tags (tech, politics, sports, etc.). This is used to quickly establish who the user is, their preferences, and the relevant tags. Think of it as key/value pairs; for example, the key is a tag and its value is a list of article IDs, or a range of them. This solves a small part of the problem, namely: given a set of keys (user prefs, tags, etc.), which articles are we interested in? This data should be small in size if organized properly (e.g. instead of storing the path to an article, you can store just a number). *Note: the problem with persisting data in a static class is that by default the application pool in Azure recycles after 20 minutes of inactivity, so the data in your static class is not persistent - also, sharing it across instances (if you have more than one) can be a burden. Enter Tier 1B to the rescue.

Tier 1B data. The solution we used is to keep the Tier 1A data in Azure Cache, with the sole purpose of re-populating the static object when needed. The Tier 1B data solves that problem. Also, if you run into issues with the application pool recycling, you can change that programmatically. So Tiers 1A and 1B hold the same data, but one is faster than the other (a fairly close analogy: CPU cache and RAM).

Discussing Tiers 1A and 1B a bit: one could object that using both a static class and the cache is overkill, since it consumes more memory. But what we found in practice is that, first, statics are faster. Second, the cache has some limits (i.e. 8 MB per object). With big data that is a small limit. By keeping the data in a static class, you can have objects larger than 8 MB and still store them in the cache by breaking them into pieces (we currently have over 40 pieces). By the way, please vote to increase this limit in the next release of Azure, thanks! Here is the link: www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting/suggestions/3223557-azure-preview-cache-increase-max-item-size
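To make the Tier 1A / 1B pairing concrete, here is a rough sketch assuming the Azure (AppFabric) Caching API; the key name, the dictionary payload, and the rebuild method are purely illustrative:

    // Tier 1A (static class, fastest) backed by Tier 1B (Azure Cache).
    using System.Collections.Generic;
    using Microsoft.ApplicationServer.Caching;

    public static class TagIndex
    {
        private static Dictionary<string, List<int>> articleIdsByTag;                        // Tier 1A
        private static readonly DataCache Cache = new DataCacheFactory().GetDefaultCache();  // Tier 1B

        public static Dictionary<string, List<int>> Get()
        {
            if (articleIdsByTag == null)   // null again after an app-pool recycle
            {
                articleIdsByTag = (Dictionary<string, List<int>>)Cache.Get("tag-index")
                                  ?? RebuildFromTableStorage();                              // Tier 2 as last resort
                Cache.Put("tag-index", articleIdsByTag);
            }
            return articleIdsByTag;
        }

        private static Dictionary<string, List<int>> RebuildFromTableStorage()
        {
            // hypothetical: query Table storage and rebuild the tag -> article-id index here
            return new Dictionary<string, List<int>>();
        }
    }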

Tier 2 data. Once we get the values from the key/value object (Tier 1A), we use the value to retrieve the data from Table storage. The value should tell you which partition key and row key you need. The problem being solved here: you only query the rows relevant to the user / search context. As you can see, the whole point of the Tier 1A data is to minimize the row queries against Table storage.
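As a minimal sketch of that Tier 2 lookup (fragments only, assuming the WindowsAzure.Storage table client; the entity shape and table name are invented for illustration):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class ArticleEntity : TableEntity
    {
        public string Summary { get; set; }   // Tier 2 payload: summary / first paragraph
        public string BlobUri { get; set; }   // pointer to the full article in Tier 3
    }

    // ... somewhere in the query path, with the PartitionKey / RowKey handed back by Tier 1A:
    CloudTable table = CloudStorageAccount.Parse(connectionString)
        .CreateCloudTableClient()
        .GetTableReference("ArticleSummaries");

    TableOperation lookup = TableOperation.Retrieve<ArticleEntity>(partitionKey, rowKey);
    var article = (ArticleEntity)table.Execute(lookup).Result;   // single-entity point read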

Tier 3 data. The Table storage data can hold a summary of your articles, or the first paragraph, or something like that. When you need to show the whole article, you get it from Blob storage. Table storage should also have a column that uniquely identifies the full article in blob storage. In blob storage you can organize the data in one of the following ways:

  • Store each article in its own file.
  • Group several articles into one file.
  • Group all articles into a single file (not recommended, although not as bad as it might seem at first).

For option 1, you store the path to the article in Table storage and then just fetch it directly from Blob storage. Thanks to the tiers above, you should only need to read a few complete articles here.

For options 2 and 3, you store the path to the file in Table storage, plus the start and end positions telling you where to begin reading and where to stop, and then read that range with a seek.

Here is an example code in C #:

    // Seek to the stored start offset and read up to the stored end offset.
    int numBytesToRead = (int)TableStorageData.end - (int)TableStorageData.start;
    byte[] bytes = new byte[numBytesToRead];
    int numBytesRead = 0;

    YourBlobClientWithReferenceToTheFile.Seek(TableStorageData.start, SeekOrigin.Begin);
    while (numBytesToRead > 0)
    {
        int n = YourBlobClientWithReferenceToTheFile.Read(bytes, numBytesRead, numBytesToRead);
        if (n == 0)
            break;                 // end of stream reached early
        numBytesRead += n;
        numBytesToRead -= n;
    }

I hope this did not turn into a book, and I hope it is helpful. Feel free to contact me if you have further questions or comments. Thanks!

+7

The proper storage for files is blobs. But if your query has to return several dozen blobs at once, it will be too slow, as you point out. So you could use a hybrid approach: use Azure Tables for 98% of your data, and when a record is too big, use a blob instead and store the blob URI in your table.

Also, do you compress your content at all? I would.
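If you are not compressing yet, even plain gzip goes a long way for HTML; a minimal sketch using the standard .NET classes (nothing Azure-specific):

    // Gzip the HTML on the way into storage and decompress on the way out.
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static byte[] Compress(string html)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(html);
                gzip.Write(raw, 0, raw.Length);
            }                                   // GZipStream must be closed before reading the buffer
            return output.ToArray();
        }
    }

    static string Decompress(byte[] data)
    {
        using (var gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }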

+2

You could use MongoDB's GridFS feature: http://docs.mongodb.org/manual/core/gridfs/

It splits the data into 256 KB chunks by default (configurable up to 16 MB) and lets you use the database as a file system for storing and retrieving files. If a file is larger than the chunk size, the MongoDB drivers handle splitting up / reassembling the data when the file needs to be retrieved. To add extra disk space, simply add extra shards.

You should be aware, however, that only some MongoDB drivers support this; it is a driver convention, not a server feature, that enables this behavior.
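For what it's worth, a hedged sketch of how GridFS looks from the official 1.x C# driver; the method names are from that driver and may differ in other drivers or later versions, and the database name, file names and connection string are placeholders:

    using System.IO;
    using MongoDB.Driver;
    using MongoDB.Driver.GridFS;

    MongoDatabase db = new MongoClient(connectionString).GetServer().GetDatabase("articles");
    MongoGridFS gridFs = db.GridFS;

    // store one article; the driver splits it into chunks
    using (Stream source = File.OpenRead("article-42.html"))
    {
        gridFs.Upload(source, "article-42.html");
    }

    // read it back; the driver reassembles the chunks transparently
    string html;
    using (var reader = new StreamReader(gridFs.OpenRead("article-42.html")))
    {
        html = reader.ReadToEnd();
    }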

+1

A few comments:

  • What you could do is ALWAYS store the HTML content in blob storage and keep the blob URL in table storage. I personally do not like the idea of storing data conditionally, i.e. if the HTML content is more than 64 KB then save it to blob storage, otherwise use table storage. Another advantage of this approach is that you can still query the data; if you store everything in blob storage, you lose the ability to query it. (A small sketch of this approach follows the list.)
  • Regarding using other NoSQL stores, the only problem I see with them is that they are not natively supported on Windows Azure, so you would also be responsible for managing them yourself.
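As referenced in the first bullet, a small sketch of the "always blob, URL in the table" approach, assuming the WindowsAzure.Storage library; the container, table and field names are invented, and ArticleEntity is the same illustrative class sketched in the earlier answer:

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;
    using Microsoft.WindowsAzure.Storage.Table;

    var account = CloudStorageAccount.Parse(connectionString);

    // 1. The raw HTML always goes to blob storage.
    CloudBlockBlob blob = account.CreateCloudBlobClient()
        .GetContainerReference("articles")
        .GetBlockBlobReference(articleId + ".html");
    blob.UploadText(html);

    // 2. The queryable metadata (plus the blob URI) goes to Table storage.
    CloudTable table = account.CreateCloudTableClient().GetTableReference("Articles");
    table.Execute(TableOperation.Insert(new ArticleEntity
    {
        PartitionKey = userId,
        RowKey = articleId,
        Summary = summary,
        BlobUri = blob.Uri.ToString()
    }));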
+1

Another option is to store your files in a VHD image in blob storage. Your roles can mount the VHD into their file system and read the data from there.

The complication seems to be that only one virtual machine can have read/write access to the VHD. The others can take a snapshot and read from that, but they will not see updates. Depending on how often your data is updated, that may work; e.g. if you update the data at known times, you could have all clients disconnect, take a new snapshot, and remount to get the new data.

You can also share a VHD using SMB sharing, as described in this MSDN blog post. That gives you full read/write access, but may be a little less reliable and a little more complex.

0

You don't say whether you compress your articles; if you don't, doing so would probably solve your problem - then just use table storage.

Otherwise, just use table storage with a unique partition key for each article. If an article is too big, put it in two rows; as long as you query by partition key you will get both rows, and you can then use the row key as an index indicating how the pieces fit back together.
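A rough sketch of that row-splitting idea (fragments only, assuming the WindowsAzure.Storage table client; the chunk entity shape, RowKey scheme and chunk size are my own illustration):

    // One partition per article; each row holds one <= 64 KB slice of the HTML.
    using System.Linq;
    using Microsoft.WindowsAzure.Storage.Table;

    public class ArticleChunk : TableEntity
    {
        public string Html { get; set; }   // one slice of the article
    }

    // write: PartitionKey = article id, RowKey = zero-padded chunk index ("000", "001", ...)
    // read:  a single partition query returns every chunk, ordered by RowKey
    TableQuery<ArticleChunk> query = new TableQuery<ArticleChunk>().Where(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, articleId));

    string html = string.Concat(table.ExecuteQuery(query).Select(c => c.Html));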

0

One idea I would look at is using a CDN to store your article content and linking to it directly from the client side, rather than any multi-phase job of getting the data from SQL and then going to some other storage. It would be something like

 http://<cdnurl>/<container>/<articleId>.html 

In fact, this can also be done with Blob storage alone.

The advantage is that it is insanely fast.

The disadvantage is that the security aspect is lost.

Something like a shared access signature could be explored for security, but I'm not sure how well that works for client-side links.

-1
