How to handle massive storage of records in a database for user authorization purposes?

I am using Ruby on Rails 3.2.2 and MySQL. I would like to know whether it is “appropriate” / “desirable” to store, in the database table associated with one class, a record for every “combination” of instances of two other classes.

That is, I have User and Article models. To store all the authorization objects for each article, I would like to implement an ArticleUserAuthorization model so that, if there are N users and M articles, there would be N * M ArticleUserAuthorization records.

Having done so, I can specify and use ActiveRecord::Associations as follows:

    class Article < ActiveRecord::Base
      has_many :user_authorizations, :class_name => 'ArticleUserAuthorization'
      has_many :users, :through => :user_authorizations
    end

    class User < ActiveRecord::Base
      has_many :article_authorizations, :class_name => 'ArticleUserAuthorization'
      has_many :articles, :through => :article_authorizations
    end

However, the aforementioned approach of persisting every combination would lead to a huge database table containing billions of rows! In addition, I plan to create all the authorization records when a User or Article object is created (i.e., to create all the previously mentioned “combinations” immediately or, better, in “delayed” batch jobs; either way, this process inserts billions of rows into the table!) and to do the reverse on destruction (deleting billions of rows!). On top of that, I plan to read and update those rows whenever a User or Article object is updated.

So my doubts are:

  • Is this approach “appropriate” / “desirable”? For example, could it cause performance problems, or is it a bad “practice” / “recipe” for administering databases with very large tables?
  • How could / should I proceed in my case (perhaps by “rethinking” how best to handle user authorizations altogether)?

Note: I would use this approach because, in order to retrieve only “authorized objects” when fetching User or Article objects, I think I need “atomic” user authorization rules (that is, one authorization record for each user-article pair), since the system is not based on user groups such as "admin", "registered", and so on. So I thought that having an ArticleUserAuthorization table would let me run the user-authorization methods (note: these methods involve some MySQL queries that could degrade performance; see my previous question for an example authorization-method implementation) for each fetched object simply by joining against the ArticleUserAuthorization table to get only the user-authorized objects.

+7
7 answers

The thing is, if you want article-level permissions per user, you need a way to associate Users with the Articles they can access. This requires a minimum of N * A records (where A is the number of uniquely permitted articles).

The 3NF approach to this, as you suggest, would be a UsersArticles join table... which, as you noted, would be a very large table.

Bear in mind that this table would also be accessed a lot. This looks to me like one of those situations where a more denormalized approach (or even NoSQL) is more appropriate.

Consider the model Twitter uses for its user tables:

Jeff Atwood on this

And the High Scalability Blog

The gist of those pieces is a lesson Twitter learned: querying followers from a normalized table put a huge load on the Users table. Their solution was to denormalize followers so that a user's followers are stored on the individual user record.

Denormalize a lot. It single-handedly saved them. For example, they store all of a user's friend ids together, which avoids a lot of costly joins. - Avoid complex joins. - Avoid scanning large sets of data.

I suggest a similar approach could be used for article permissions, avoiding an extremely busy standalone UsersArticles table.
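As a rough illustration of that denormalized direction, the authorized article ids could live on the user record itself instead of in a join table. This is only a plain-Ruby sketch under assumed names (UserRecord, grant, can_read?); in Rails the set would be a serialized column on users.

```ruby
require 'set'

# Hypothetical sketch: denormalized per-user authorization data, kept as a
# set of article ids on the user record instead of rows in UsersArticles.
class UserRecord
  attr_reader :id, :authorized_article_ids

  def initialize(id)
    @id = id
    @authorized_article_ids = Set.new  # would be a serialized column in MySQL
  end

  def grant(article_id)
    authorized_article_ids << article_id
  end

  def can_read?(article_id)
    authorized_article_ids.include?(article_id)
  end
end

user = UserRecord.new(1)
user.grant(42)
user.can_read?(42)  # => true
user.can_read?(7)   # => false
```

The trade-off is the usual one: reads avoid a join, but every permission change must rewrite the serialized set.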

+6

You do not need to reinvent the wheel. ACL (Access Control List) frameworks have been dealing with the same problems for a long time, and most effectively, if you ask me. You have resources (articles), or even better, resource groups (article categories / tags / etc.). On the other side you have users and user groups. Then you have a relatively small table that maps resource groups to user groups, and another relatively small table that holds the exceptions to this general mapping. Alternatively, you can define a set of rules for accessing an article. You can even have dynamic groups such as authors_friends, depending on your user-to-user relationships.

Just take a look at any decent ACL framework and you will get an idea of how to deal with this problem.
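For a concrete flavor of this, here is a minimal in-memory sketch of group-level rules with per-user exceptions. SimpleACL and the :allow / :deny symbols are my own names, not any real gem's API.

```ruby
# Hypothetical in-memory ACL: group-level rules plus per-user exceptions.
class SimpleACL
  def initialize
    @group_rules = {}  # [user_group, resource_group] => :allow or :deny
    @exceptions  = {}  # [user_id, article_id]        => :allow or :deny
  end

  def set_group_rule(user_group, resource_group, decision)
    @group_rules[[user_group, resource_group]] = decision
  end

  def set_exception(user_id, article_id, decision)
    @exceptions[[user_id, article_id]] = decision
  end

  # Per-user exceptions take precedence over group rules; default is deny.
  def allowed?(user_id:, user_group:, article_id:, resource_group:)
    decision = @exceptions.fetch([user_id, article_id]) do
      @group_rules.fetch([user_group, resource_group], :deny)
    end
    decision == :allow
  end
end

acl = SimpleACL.new
acl.set_group_rule(:registered, :free_articles, :allow)
acl.set_exception(9, 42, :deny)  # user 9 is specifically denied article 42
```

Both tables stay small because the vast majority of user-article pairs are covered by the group rule, not by a stored row.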

+5

If there really is the prospect of a “large database table containing billions of rows”, then perhaps you should design a solution for your specific needs around a (relatively) sparsely populated table.

Large database tables pose a significant performance problem: how quickly can the system find the relevant row or rows? Indexes and primary keys are really needed there, but they add to the storage requirements and also consume CPU cycles to maintain as records are added, updated, and deleted. Heavy-duty database systems also have partitioning features (see http://en.wikipedia.org/wiki/Partition_(database )) that address such row-lookup performance problems.

A sparsely populated table can probably serve the purpose, assuming a default (computed or constant) value can be used when no rows are returned: insert rows only where something other than the default is required. A sparsely populated table needs far less storage space, and the system will be able to find rows faster. (Custom functions or views can help keep the queries simple.)
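A minimal sketch of that idea, with made-up data: the hash stands in for the sparse UsersArticles table, and :read is an assumed business default.

```ruby
# Hypothetical sketch of the sparse-table idea: only exceptions are stored,
# and a miss falls back to a computed or constant default instead of a row.
AUTHORIZATIONS = {
  [1, 42] => :deny,   # the sparse rows: [user_id, article_id] => permission
  [2, 7]  => :edit,
}

DEFAULT_PERMISSION = :read  # assumed business default when no row exists

def permission_for(user_id, article_id)
  AUTHORIZATIONS.fetch([user_id, article_id], DEFAULT_PERMISSION)
end

permission_for(1, 42)  # => :deny  (explicit exception)
permission_for(5, 5)   # => :read  (no row, default applies)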

If you really cannot make a sparsely populated table work for you, then you are pretty much stuck. Perhaps you could break the huge table into a collection of smaller tables, although I doubt that helps much if your database system supports partitioning. Besides, a collection of smaller tables makes querying messier.

So, say you have millions or billions of Users who do or do not have certain privileges regarding the millions or billions of Articles in your system. What, then, determines at the business level whether a User has a given privilege on an Article? Must the user be a (paying) subscriber? Or may he be a guest? Does the user subscribe to (and pay for) a package of certain articles? Can a user earn the privilege of editing certain articles? And so on.

So, say some User wants to do something with a particular Article. With a sparsely populated table, a SELECT on that great UsersArticles table will return either one row or nothing. If it returns a row, you immediately have the relevant ArticleUserAuthorization and can proceed with the rest of the operation.

If there is no row, then perhaps it is enough to say that the User can do nothing with this Article. Or perhaps the User belongs to some UserGroup that is entitled to certain privileges on any Article having some ArticleAttribute (which this Article has or lacks). Or perhaps the Article has a default ArticleUserAuthorization (stored in some other table) for any User who does not already have an entry in UsersArticles. Or something...

The point is that many situations have structure and regularity that can be exploited to reduce the resources a system needs. For example, people can add two numbers of up to six digits each without consulting a table of a trillion entries; that exploits structure. As for regularity, most people have heard of the Pareto principle (the "80-20" rule - see http://en.wikipedia.org/wiki/Pareto_principle ). Do you really need "billions of billions of billions of rows"? Or would it be true to say that about 80% of Users will have (special) privileges on perhaps only hundreds or thousands of Articles - in which case, why spend storage on all those other billions of rows (in round numbers :-P)?

+4

You should look at role-based access control (RBAC) solutions with hierarchical roles. You should also consider sensible defaults.

  • Are all users allowed to read articles by default? Then store the deny exceptions.

  • Are all users denied reading articles by default? Then store the allow exceptions.

  • Does it depend on the article whether the default is allow or deny ? Then store the default on the article, and store both allow and deny exceptions.

  • Are articles gathered into issues, issues gathered into journals, and journals gathered into knowledge areas? Then store authorizations between users and those objects.

  • What if a User is allowed to read a Journal but denied a specific Article ? Then store User-Journal:allow and User-Article:deny , and let the most specific rule (here, the article) take precedence over the more general ones (here, the default and the journal).
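The precedence rule in the last bullet can be sketched like this (all names here are hypothetical; the rules hash stands in for the exception tables):

```ruby
# Hypothetical sketch of "the most specific rule wins": check the Article
# exception first, then the Journal, then fall back to a global default.
def read_allowed?(rules, user_id:, article_id:, journal_id:, default: :deny)
  decision =
    rules[[:article, user_id, article_id]] ||
    rules[[:journal, user_id, journal_id]] ||
    default
  decision == :allow
end

rules = {
  [:journal, 1, 10] => :allow,  # User 1 may read Journal 10...
  [:article, 1, 42] => :deny,   # ...but not Article 42 inside it.
}

read_allowed?(rules, user_id: 1, article_id: 42, journal_id: 10)  # => false
read_allowed?(rules, user_id: 1, article_id: 43, journal_id: 10)  # => true
```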

+1

Shard the ArticleUserAuthorization table on user_id. The principle is to reduce the effective size of the data set on the access path. Some data will be accessed more often than other data, and in a particular way; along that path the result set should be small. Here we achieve that with a shard. Also optimize that path further: add an index if it is a read-heavy workload, cache it, etc.

This shard is useful if you want all articles authorized for a given user.
If you also want to query by article, then duplicate the table and shard it on article_id. Once we have this second sharding scheme, we have denormalized the data: the data is now duplicated, and the application must do extra work to keep it consistent. Writes will also be slower; use a write queue.
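A toy sketch of the two sharding schemes; modulo sharding and in-memory arrays are stand-ins for real shard routing.

```ruby
# Toy sketch of the two sharding schemes: rows keyed by user_id for the
# "articles for a user" path, duplicated and keyed by article_id for the
# reverse lookup.
N_SHARDS = 4

def shard_for(id)
  id % N_SHARDS  # real systems often use consistent hashing instead
end

# Every write goes to both schemes: this is the duplication (and the extra
# consistency work) that denormalizing for the second access path costs.
def store_authorization(user_shards, article_shards, user_id:, article_id:)
  user_shards[shard_for(user_id)]       << [user_id, article_id]
  article_shards[shard_for(article_id)] << [user_id, article_id]
end

user_shards    = Array.new(N_SHARDS) { [] }
article_shards = Array.new(N_SHARDS) { [] }
store_authorization(user_shards, article_shards, user_id: 6, article_id: 9)
```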

The problem with sharding is that cross-shard queries are inefficient; you will need a separate reporting database. Choose the shard scheme carefully, and plan for re-sharding.

For truly massive databases, you would partition across physical machines, e.g. one or more machines per user-article shard.

Some nosql suggestions:

  • relationships are graphs, so look at graph databases; in particular
    https://github.com/twitter/flockdb
  • redis, keeping the authorized ids in a list.
  • a column-oriented database like HBase; you can think of it as a sparse nested hash

It all depends on the size of your database and the types of queries.

EDIT: modified answer. The question previously had a has_one relationship. Also added NoSQL suggestions 1 and 2.

0

First of all, it is better to think about defaults and derive behavior from them than to store them in the database. For example, if by default a user cannot read an article unless specified otherwise, then that does not need to be stored as false in the database.

My second thought is that you could have a users_authorizations column in the articles table and an articles_authorizations column in the users table. These two columns would store user ids and article ids in the form 3,7,65,78,29,78 . For the articles table, for example, this means that users with ids 3,7,65,78,29,78 can access the article. You would then modify your queries to find users this way:

    @article = Article.find(34)
    @users = User.find(@article.users_authorizations.split(','))
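The comma-separated column trick can be exercised in plain Ruby, independent of Rails; the helper names here are made up.

```ruby
# Made-up helpers showing the comma-separated column round-trip in plain
# Ruby: ids are joined on write and split back into integers on read.
def serialize_ids(ids)
  ids.join(',')
end

def parse_ids(column_value)
  column_value.to_s.split(',').map(&:to_i)
end

column = serialize_ids([3, 7, 65])  # what users_authorizations would hold
parse_ids(column)                   # => [3, 7, 65]
```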

Each time an article or user is saved or destroyed, you will need callbacks to update the authorization columns.

    class User < ActiveRecord::Base
      after_save :update_articles_authorizations

      def update_articles_authorizations
        #...
      end
    end

Do the same for the Article model.

Finally, if you have different types of authorizations, feel free to create more columns, like user_edit_authorization .

Combining these techniques keeps both the amount of data and the number of database accesses minimal.

0

After reading all the comments and the question, I still doubt that storing all the combinations is the right thing. Think of it differently: who would populate this table? The author of the article, a moderator, or someone else? And based on what rule? You can see for yourself how difficult this becomes. It is infeasible to populate all the combinations.

Facebook has a similar feature. When you write a post, you can choose whom you share it with: Friends, Friends of Friends, Everyone, or a custom list. A custom list lets you define who is included and who is excluded. In the same way, you only need to store the special cases, such as "include" and "exclude"; all other combinations fall back to the default case. This way, N * M can be reduced significantly. Post visibility
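That include/exclude idea can be sketched as follows; Post, visible_to? and the fields are hypothetical names.

```ruby
# Hypothetical sketch of storing only the special cases: explicit include
# and exclude lists, with everything else falling back to the default.
Post = Struct.new(:default_visible, :included, :excluded)

def visible_to?(post, user_id)
  return true  if post.included.include?(user_id)
  return false if post.excluded.include?(user_id)
  post.default_visible
end

post = Post.new(true, [], [42])  # visible to everyone except user 42
visible_to?(post, 1)   # => true
visible_to?(post, 42)  # => false
```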

0
