Redis sorted sets and best way to store uids

I have data consisting of user_ids and tags for these user ids. Each user_id occurs several times and has a predefined number of possible tags (500), though this may change in the future. What I need to store is the user_id, its tags and their scores, so that I can easily retrieve the top-scoring tags later, etc. Every time a tag occurs, its score is incremented.

My implementation in Redis uses sorted sets:

  • each user_id is a sorted set

  • the key is the user_id, which is a hexadecimal number

works as follows:

zincrby user_id:x 1 "tag0"

zincrby user_id:x 1 "tag499"

zincrby user_id:y 1 "tag3"

etc.

Bearing in mind that I want to retrieve the tags with the highest scores, is there a better way to model this?

The second problem is that I currently use "keys *" to retrieve these keys for client-side manipulation, which I know is not meant for production systems.

Plus, it would be great to keep memory under control by paging through a limited number of keys at a time (in the range of 10,000). I know the keys have to be stored in memory, but they do not follow a specific pattern that would allow partial retrieval, so I cannot avoid the "zmalloc" error (4 GB, 64-bit Debian server). The keys amount to 20 million. Any thoughts?

1 answer

My first point would be to note that 4 GB is tight for storing 20M sorted sets. A quick try shows that 20M users, each with 20 tags, take about 8 GB on a 64-bit box, i.e. roughly 400 bytes per user (and this accounts for the sorted set ziplist memory optimizations provided with Redis 2.4; do not even try this with earlier versions).

Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.
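In particular, retrieving the highest-scored tags for a given user is a single ZREVRANGE call on the corresponding zset. A minimal sketch, assuming a local Redis instance, a redis-py client, and the key naming from the question:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Top 10 tags of one user, highest score first, with their scores
# ("user_id:x" is just the example key used in the question)
print r.zrevrange("user_id:x", 0, 9, withscores=True)

The equivalent raw command is ZREVRANGE user_id:x 0 9 WITHSCORES.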

As you pointed out, KEYS cannot be used to iterate over the keys; it is rather meant as a debugging command. To support key iteration, you need to add a data structure that provides this access path. The only structures in Redis that can support iteration are the list and the sorted set (via their range methods). However, they tend to turn O(n) iteration algorithms into O(n^2) for a list or O(n log n) for a zset, since accessing a list at an arbitrary offset is itself O(n), and looking up a range in a zset costs O(log n). A list is also a poor choice for storing keys, because it is hard to maintain as keys are added and removed.

A more efficient solution is to add an index made of regular sets. Use a hash function to associate each user with a bucket, and add the user id to the set corresponding to that bucket. If the user id is a numeric value, a simple modulo function is enough; if it is not, a simple string hashing function will do the trick.
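As an illustration only (the helper name and the CRC32 choice are mine, not something imposed by Redis), the bucketing function could look like this:

import zlib

NBUCKETS = 1000

def bucket_key(user_id):
    # Return the name of the index set a given user id belongs to
    if isinstance(user_id, (int, long)):
        # Numeric id: a plain modulo is enough
        bucket = user_id % NBUCKETS
    else:
        # Non-numeric id (e.g. a hex string): any cheap, stable string
        # hash will do; CRC32 is only one possible choice
        bucket = (zlib.crc32(str(user_id)) & 0xffffffff) % NBUCKETS
    return "index:%d" % bucket

# Registering a user then amounts to: r.sadd(bucket_key(user_id), user_id)

Whatever function is chosen, it has to stay stable over time, since the same function is used both when adding users and when iterating over the buckets.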

For instance, to support iteration over user:1000, user:2000 and user:1001, let's choose a modulo 1000 function: user:1000 and user:2000 will be put in bucket index:0, while user:1001 will be put in bucket index:1.

So on top of the zsets, we now have the following keys:

index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]

In the sets, the key prefix is not needed, which lets Redis optimize memory consumption by serializing the sets, provided they are kept small enough (the intset optimization proposed by Sripathi Krishnan).
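If you want to check that a bucket actually gets this compact representation, the OBJECT ENCODING command reports the internal encoding of a key; a quick check with redis-py (assuming a client named r and a populated index:0 bucket) would be:

# A small set of numeric ids should report "intset"; once the set grows
# past the configured limit it falls back to a regular "hashtable"
print r.object("encoding", "index:0")

The limit is governed by the set-max-intset-entries directive in redis.conf (512 by default), so the number of buckets (or that directive) should be sized so that each bucket stays below it.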

The global iteration consists of a simple loop over the buckets from 0 to 1000 (exclusive). For each bucket, SMEMBERS is used to retrieve the corresponding set, and the client can then iterate over the individual items.

Here is an example in Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ----------------------------------------------------

import redis, random

POOL = redis.ConnectionPool(host='localhost', port=6379, db=0)

NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000

# ----------------------------------------------------
# Fill redis with some random data

def fill(r):
    p = r.pipeline()
    # Create only 10000 users for this example
    for id in range(0, NUSERS):
        user = "user:%d" % id
        # Add the user in the index: a simple modulo is used to hash the user id
        # and put it in the correct bucket
        p.sadd("index:%d" % (id % NBUCKETS), id)
        # Add random tags to the user
        for x in range(0, 20):
            tag = "tag:%d" % (random.randint(0, NTAGS))
            p.zincrby(user, tag, 1)
        # Flush the pipeline every 1000 users
        if id % 1000 == 0:
            p.execute()
            print id
    # Flush one last time
    p.execute()

# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags

def iterate(r):
    # Iterate on the buckets of the key index
    # The range depends on the function used to hash the user id
    for x in range(0, NBUCKETS):
        # Iterate on the users in this bucket
        for id in r.smembers("index:%d" % (x)):
            user = "user:%d" % int(id)
            print user, r.zrevrangebyscore(user, "+inf", "-inf", 0, 5, True)

# ----------------------------------------------------
# Main function

def main():
    r = redis.Redis(connection_pool=POOL)
    r.flushall()
    m = r.info()["used_memory"]
    fill(r)
    info = r.info()
    print "Keys: ", info["db0"]["keys"]
    print "Memory: ", info["used_memory"] - m
    iterate(r)

# ----------------------------------------------------

main()

By adjusting the constants, you can also use this program to estimate the global memory consumption of this data structure.

IMO this strategy is simple and efficient, because it offers O(1) complexity to add or remove users, and true O(n) complexity to iterate over all items. The only downside is that the key iteration order is random.

