I signed up just to help with this question. I've found useful answers to my own problems on Stack Overflow many times - thank you, community - so I thought it would only be fair (fair is probably an understatement) to try to give something back, and this question falls in my lane.
In short, given all the factors mentioned in the question, Table Storage may be the best option - provided you can correctly estimate transactions per month (there is a good article on this). You can work around the two limits you mentioned, on row size and on columns, by splitting the document/HTML/data (either a simple text split or by serializing it). Speaking from experience with 40+ GB of data stored in Table Storage, our application often retrieves more than 10 rows per page view in milliseconds - no complaints there! If you need more than 50 rows, you are looking at low single-digit seconds, or you can fetch them in parallel (further breaking the data into different partitions) or in some asynchronous way. Or read about the layered caching proposed below.
A bit more detail. I have tried SQL Azure, Blob storage (both page and block blobs), and Table Storage. I can't speak for MongoDB because, partly for the reasons already mentioned here, I did not want to go down that route.
- Table Storage is fast: in the 20-50 millisecond range, or even faster (depending, for example, on being in the same data center, where I have seen it down to 10 milliseconds) when querying by partition key and row key. You can also have several partitions, chosen based on your data and your knowledge of it.
- It scales better in terms of GB, but not transactions.
- The row and column limits you mentioned are a burden, agreed, but not a showstopper. I wrote my own solution for splitting entities; you can too, quite easily, or you can look at this already-written solution (it does not solve the whole problem, but it is a good start): https://code.google.com/p/lokad-cloud/wiki/FatEntities . A rough sketch of the splitting idea is shown right after this list.
- Also keep in mind that uploading data into Table Storage takes time, even with batched entities, because of other restrictions (i.e., request size under 4 MB, upload bandwidth, etc.).
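To illustrate the entity-splitting idea from the list above, here is a minimal sketch, assuming the WindowsAzure.Storage client library. The chunk size, the "Chunk0"/"ChunkCount" property names and the EntitySplitter class are my own illustrative choices, not part of FatEntities or any library; the point is simply to keep every string property under the ~64 KB (32 K character) per-property limit.

    // Sketch: split a large document across several string properties of one
    // table entity so no single property exceeds the per-property size limit.
    // Property names and the 30,000-character chunk size are illustrative.
    using System;
    using Microsoft.WindowsAzure.Storage.Table;

    public static class EntitySplitter
    {
        private const int ChunkChars = 30000; // stays under the 32 K char/property limit

        public static DynamicTableEntity ToEntity(string partitionKey, string rowKey, string document)
        {
            var entity = new DynamicTableEntity(partitionKey, rowKey);
            int index = 0;
            for (int offset = 0; offset < document.Length; offset += ChunkChars)
            {
                int length = Math.Min(ChunkChars, document.Length - offset);
                entity.Properties["Chunk" + index++] =
                    EntityProperty.GeneratePropertyForString(document.Substring(offset, length));
            }
            entity.Properties["ChunkCount"] = EntityProperty.GeneratePropertyForInt(index);
            return entity;
        }

        public static string FromEntity(DynamicTableEntity entity)
        {
            int count = entity.Properties["ChunkCount"].Int32Value ?? 0;
            var parts = new string[count];
            for (int i = 0; i < count; i++)
                parts[i] = entity.Properties["Chunk" + i].StringValue;
            return string.Concat(parts);
        }
    }

If a single document also blows past the 1 MB per-entity limit, the same idea extends to splitting it across several row keys.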
But using Table Storage exclusively may not be the best solution (thinking about growth and economics). The best solution we ended up implementing was multi-level caching/storage, starting with static classes, then an Azure role-based cache, then Table Storage, and then block blobs. Let's call these levels 1A, 1B, 2, and 3, respectively, for readability. Using this approach, on a single medium instance (2 CPU cores and 3.5 GB of RAM - my laptop has better specs), we are able to process/query/rank 100+ GB of data in seconds (95% of cases in under 1 second). I find this quite impressive given that we check all the "articles" before displaying them (4+ million "articles"). First, this is complex and may or may not be possible in your case; I don't know enough about your data and how it is consumed/processed, but if you can find a way to organize the data well, this may be ideal. I will make an assumption: it sounds like you are trying to find relevant articles given some user information and some tags (perhaps a variant of a news aggregator - just a hunch). This assumption is only made to illustrate the proposal, so even if it is incorrect, I hope it helps you or sparks new ideas on how to adapt it.
Level 1A data. Define and add key entities or their properties to a static class (refreshed periodically, depending on how often you expect updates). Say we define user preferences (e.g., demographics, interests, etc.) and tags (technology, politics, sports, etc.). This is used to quickly look up a user, their preferences, and any tags. Think of it as a key/value store; for example, the key is a tag and its value is a list of article identifiers, or a range of them. This solves a small but important problem: given a set of keys (user preferences, tags, etc.), which articles are we interested in? This data should be small if it is organized correctly (for example, instead of storing the path to the article, you can store just a number). *Note: the problem with keeping data in a static class is that the application pool in Azure, by default, recycles after 20 minutes of inactivity, so the data in your static class is not persistent - and sharing it between instances (if you have more than one) can be a burden. Welcome Level 1B to the rescue.
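As a rough sketch of what a Level 1A key/value object could look like (all names here - TagIndex, ArticleIdsByTag, Refresh - are hypothetical and only for illustration):

    // Sketch: a Level 1A in-memory index, tag -> article ids.
    using System.Collections.Generic;

    public static class TagIndex
    {
        // e.g. "politics" -> [17, 394, 5120, ...] (numeric ids, not paths, to keep it small)
        public static Dictionary<string, List<int>> ArticleIdsByTag =
            new Dictionary<string, List<int>>();

        // Called on start-up and periodically; in our setup it is re-filled
        // from the Level 1B cache (see below) rather than rebuilt from scratch.
        public static void Refresh(Dictionary<string, List<int>> fresh)
        {
            ArticleIdsByTag = fresh;
        }
    }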
Level 1B data. The solution we used was to keep the Level 1A data in Azure Cache, with the sole purpose of re-populating the static object whenever it is needed; the Level 1B data solves that problem. Also, if the application pool recycling bothers you, you can change it programmatically. So Levels 1A and 1B hold the same data, but one is faster than the other (a fairly close analogy: CPU cache and RAM).
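Here is a hedged sketch of the "1B refills 1A" idea, assuming the role-based Azure Caching client (Microsoft.ApplicationServer.Caching) and the hypothetical TagIndex class from the previous sketch; the cache name, the key names and the BuildFromTableStorage helper are assumptions:

    // Sketch: re-populate the static Level 1A object from Azure Cache (Level 1B)
    // after the app pool has been recycled. Key names are illustrative.
    using System.Collections.Generic;
    using Microsoft.ApplicationServer.Caching;

    public static class Level1Loader
    {
        private static readonly DataCache Cache = new DataCache("default");

        public static void EnsureLoaded()
        {
            if (TagIndex.ArticleIdsByTag.Count > 0)
                return; // static data survived, nothing to do

            var cached = Cache.Get("tag-index") as Dictionary<string, List<int>>;
            if (cached != null)
            {
                TagIndex.Refresh(cached);              // warm the static class from the cache
            }
            else
            {
                var rebuilt = BuildFromTableStorage(); // slow path, hits Level 2
                Cache.Put("tag-index", rebuilt);
                TagIndex.Refresh(rebuilt);
            }
        }

        private static Dictionary<string, List<int>> BuildFromTableStorage()
        {
            // omitted: query Table Storage and rebuild the index
            return new Dictionary<string, List<int>>();
        }
    }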
Discussing Levels 1A and 1B a bit: one could point out that using both a static class and the cache is excessive, since it uses more memory. But in practice we found that, first, static classes are faster, and second, the cache has limits of its own (i.e., 8 MB per object). With big data, that is a small limit. By keeping the data in a static class you can have objects larger than 8 MB and still store them in the cache by breaking them into pieces (i.e., we currently have over 40 partitions). By the way, please vote to increase this limit in the next release of Azure, thanks! Here is the link: www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting/suggestions/3223557-azure-preview-cache-increase-max-item-size
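A minimal sketch of the workaround for the 8 MB item limit mentioned above: serialize the big object, cut the bytes into pieces below the limit, and store each piece under its own key. The 7 MB piece size and the "key:N" naming scheme are illustrative assumptions:

    // Sketch: store an object larger than the 8 MB cache item limit by
    // splitting its serialized bytes into smaller pieces.
    using System;
    using Microsoft.ApplicationServer.Caching;

    public static class ChunkedCache
    {
        private const int PieceSize = 7 * 1024 * 1024; // keep each piece below the 8 MB limit

        public static void PutLarge(DataCache cache, string key, byte[] data)
        {
            int pieces = (data.Length + PieceSize - 1) / PieceSize;
            cache.Put(key + ":count", pieces);
            for (int i = 0; i < pieces; i++)
            {
                int length = Math.Min(PieceSize, data.Length - i * PieceSize);
                var piece = new byte[length];
                Buffer.BlockCopy(data, i * PieceSize, piece, 0, length);
                cache.Put(key + ":" + i, piece);
            }
        }

        public static byte[] GetLarge(DataCache cache, string key)
        {
            object countObj = cache.Get(key + ":count");
            if (countObj == null) return null;
            int pieces = (int)countObj;

            var buffers = new byte[pieces][];
            int total = 0;
            for (int i = 0; i < pieces; i++)
            {
                buffers[i] = (byte[])cache.Get(key + ":" + i);
                total += buffers[i].Length;
            }

            var result = new byte[total];
            int offset = 0;
            foreach (var piece in buffers)
            {
                Buffer.BlockCopy(piece, 0, result, offset, piece.Length);
                offset += piece.Length;
            }
            return result;
        }
    }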
Level 2 data. Once we get the value(s) from the key/value object (Level 1A), we use the value to retrieve the data from Table Storage. The value should tell you which partition key and row key you need. Problem solved here: you only query the rows relevant to the user/search context. As you can see by now, the whole point of the Level 1A data is to minimize row queries against Table Storage.
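For completeness, a Level 2 point query by partition key and row key, assuming the WindowsAzure.Storage client library, looks roughly like this (the ArticleSummary entity and the table name are made up for the example):

    // Sketch: Level 2 - retrieve exactly one row using the partition key and
    // row key obtained from the Level 1A lookup.
    using Microsoft.WindowsAzure.Storage.Table;

    public class ArticleSummary : TableEntity
    {
        public string Title { get; set; }
        public string FirstParagraph { get; set; }
        public string BlobPath { get; set; }   // points to the full article (Level 3)
        public long BlobStart { get; set; }    // used when articles are grouped in one blob
        public long BlobEnd { get; set; }
    }

    public static class Level2Store
    {
        public static ArticleSummary GetSummary(CloudTableClient client,
                                                string partitionKey, string rowKey)
        {
            CloudTable table = client.GetTableReference("ArticleSummaries");
            TableOperation retrieve = TableOperation.Retrieve<ArticleSummary>(partitionKey, rowKey);
            TableResult result = table.Execute(retrieve);
            return (ArticleSummary)result.Result;
        }
    }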
Level 3 data. The Table Storage data may hold a summary of each article, or the first paragraph, or something like that. When you need to show the whole article, you get it from Blob storage. Table Storage should also have a column that uniquely identifies the full article in Blob storage. In Blob storage you can organize the data as follows:
- Separate each article into its own file.
- Group articles into one file.
- Group all the articles into one file (not recommended, although not as bad as first impressions might suggest).
For the first option, you save the path to the article in Table Storage and then just fetch it directly from Blob storage. Because of the levels above, you should only need to read a few full articles here.
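For option 1, again assuming the WindowsAzure.Storage blob client and the hypothetical ArticleSummary entity above, the fetch is essentially a one-liner (the container name is illustrative):

    // Sketch: option 1 - one blob per article; the path stored in Table Storage
    // (ArticleSummary.BlobPath above) is used to download the full article.
    using Microsoft.WindowsAzure.Storage.Blob;

    public static class Level3Store
    {
        public static string GetFullArticle(CloudBlobClient blobClient, string blobPath)
        {
            CloudBlobContainer container = blobClient.GetContainerReference("articles");
            CloudBlockBlob blob = container.GetBlockBlobReference(blobPath);
            return blob.DownloadText(); // the whole article as text
        }
    }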
For the second and third options, you store the path to the file in Table Storage, along with the start and end positions of where to begin reading and where to stop, using Seek.
Here is some example code in C#:

    // Read only the slice of the blob that holds the article, using the
    // start/end offsets stored in Table Storage.
    int numBytesToRead = (int)TableStorageData.end - (int)TableStorageData.start;
    int numBytesRead = 0;
    byte[] bytes = new byte[numBytesToRead];   // buffer for the article slice

    YourBlobClientWithReferenceToTheFile.Seek(TableStorageData.start, SeekOrigin.Begin);

    while (numBytesToRead > 0)
    {
        int n = YourBlobClientWithReferenceToTheFile.Read(bytes, numBytesRead, numBytesToRead);
        if (n == 0)
            break;                             // end of stream reached early
        numBytesRead += n;
        numBytesToRead -= n;
    }
I hope this did not turn into a book, and I hope it is useful. Feel free to reach out if you have further questions or comments. Thanks!