Efficient data structure for tags?

Imagine that you wanted to serialize and deserialize Stack Overflow messages, including their tags, as space-efficiently as possible (in a binary format), while also keeping tag searches fast. Is there a good data structure for this kind of scenario?

There are about 28,532 different tags on Stack Overflow. You can create a table of all the tags and assign each one an integer. In addition, you can sort them by frequency so that the most common tags get the smallest numbers. Saving them simply as a string in the "1 32 45" format seems a bit inefficient in terms of both search and storage.
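As a rough illustration of the tag table described above, here is a minimal sketch that ranks tags by frequency and assigns the smallest integers to the most common ones. The function name `build_tag_table` and the sample posts are made up for the example.

```python
from collections import Counter

def build_tag_table(posts):
    """Map each tag to an integer ID; the most frequent tag gets ID 0."""
    counts = Counter(tag for tags in posts for tag in tags)
    # Sort by descending frequency, breaking ties alphabetically
    # so the assignment is deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return {tag: i for i, tag in enumerate(ranked)}

# Hypothetical sample data: each post is just its list of tags.
posts = [
    ["python", "list"],
    ["python", "regex"],
    ["python", "list", "sorting"],
]

tag_ids = build_tag_table(posts)
# Each post's tags become a sorted list of small integers.
encoded = [sorted(tag_ids[t] for t in tags) for tags in posts]
```

With real data the table would be built once from the full tag counts and stored alongside the serialized messages.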

Another idea is to save the tags as a bit array, which is attractive in terms of search and serialization. Since the most common tags come first, the tags of a typical message could potentially fit in a small amount of memory.

The problem would, of course, be that unusual tags would produce huge bit arrays. Is there any standard for "compressing" bit arrays that contain long runs of zeros? Or do you need to use a completely different structure?
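To make the idea of "compressing the zeros" concrete, here is a minimal sketch of one common approach: instead of storing the bits, store the gaps between the set bits as variable-length integers (7 data bits per byte, high bit set when more bytes follow). This is only one option, not a claim about what any particular standard does, and the function names are made up for the example.

```python
def encode_tags(tag_ids):
    """Serialize tag IDs as varint-encoded gaps between the sorted IDs."""
    out = bytearray()
    prev = -1
    for tag_id in sorted(set(tag_ids)):
        gap = tag_id - prev - 1  # zero bits skipped before this set bit
        prev = tag_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag
            gap >>= 7
        out.append(gap)  # final byte of this gap, continuation flag clear
    return bytes(out)

def decode_tags(data):
    """Inverse of encode_tags; returns the sorted list of tag IDs."""
    ids, prev, gap, shift = [], -1, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this gap
        else:
            prev += gap + 1
            ids.append(prev)
            gap, shift = 0, 0
    return ids
```

With this encoding the tags 1, 32 and 45 take three bytes, and a message tagged with IDs 0 and 28000 takes four bytes, where a plain bit array reaching bit 28000 would need about 3.5 KB. The sketch assumes the byte length of each encoded tag list is stored separately (for example as a length prefix), since the encoding has no terminator of its own.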

EDIT

I'm not looking for a database solution or a solution where I need to keep entire tables in memory, but for a structure for filtering individual elements.

+5
4 answers

, 28 . . , ? "" . , , , , ( ?).

, , , , 200 ?

: -)

, , -, ( , ). .

, . , , ; measure: -)

.

, , . , .

( ). , , , , , .

+3

: tag_id question_id

. tag_id, question_id question_id, tag_id - , .

+1

, ; , , .

, , , . , , , . .

+1

If you want to search efficiently for questions with a specific tag, you will need some kind of index. Perhaps each Tag object could hold an array of references (pointers, numeric IDs, etc.) to all the questions tagged with that particular tag. Then you only need to find the tag object, and you have an array pointing to all the questions with that tag.
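A minimal sketch of such an index, assuming questions are identified by integer IDs and tags by their names; the class `TagIndex` and its methods are hypothetical, chosen only for this illustration:

```python
from collections import defaultdict

class TagIndex:
    """Inverted index: each tag maps to the set of question IDs carrying it."""

    def __init__(self):
        self._questions_by_tag = defaultdict(set)

    def add(self, question_id, tags):
        for tag in tags:
            self._questions_by_tag[tag].add(question_id)

    def questions_with(self, *tags):
        """Return the sorted IDs of questions that carry every given tag."""
        if not tags:
            return []
        # .get avoids creating empty entries for unknown tags.
        sets = [self._questions_by_tag.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*sets))

# Hypothetical sample data.
index = TagIndex()
index.add(101, ["python", "list"])
index.add(102, ["python", "regex"])
index.add(103, ["java", "list"])

python_questions = index.questions_with("python")
python_list_questions = index.questions_with("python", "list")
```

Looking up one tag is a single dictionary access, and a search on several tags is a set intersection. The cost is that the index has to be kept in memory or stored somewhere, which may or may not fit the constraint in the question's edit.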

0