How to rank a million images using crowdsourcing

I would like to evaluate a collection of landscape images by creating a game in which website visitors can rate them to find out which images people find most attractive.

What would be a good way to do this?

  • Hot-or-Not style? That is, show one image and ask the user to rate it from 1-10. As I see it, this lets me average the scores, and I just need to ensure an even distribution of votes across all the images. Pretty simple to implement.
  • Choose A-or-B? That is, show two images and ask the user to pick the better one. This is attractive because there is no numerical rating; it's a pure comparison. But how do I implement it? My first thought was to do a quicksort with the comparison operations supplied by humans, and when it finishes, simply repeat the sort ad infinitum.

How do you do this?

If you need numbers, I'm talking about one million images on a site with 20,000 daily visits. Only a small fraction will play the game; for the sake of argument, let's say I can generate 2,000 human comparisons per day. This is a non-profit website, and the terminally curious will find it through my profile :)

+78
sorting algorithm crowdsourcing
Oct 02 '08 at 22:09
12 answers

As others have said, a 1-10 rating does not work very well because people calibrate their scores differently.

The problem with the Pick A-or-B method is that it is not guaranteed to be transitive (A can beat B, B can beat C, and yet C can beat A). Intransitive comparison operators break sorting algorithms. With quicksort, in this example, the items not chosen as the pivot can end up ordered incorrectly relative to each other.

At any given time you want an absolute ranking of all images (even if some or all of them are tied). You also want the ranking to stay stable unless someone votes.

I would use the Pick A-or-B (or tie) method, but determine the ranking with something like the Elo rating system, which is used for ranking in two-player games (originally in chess):

The Elo rating system compares players' match records against their opponents' match records and determines the probability of a player winning a match. This probability factor determines how much a player's rating goes up or down based on each match's result. When a player beats an opponent with a higher rating, the player's rating rises more than if he or she had beaten a player with a lower rating (since beating a higher-rated opponent is the less likely outcome).

Elo system:

  • All new players start with a base rating of 1600
  • WinProbability = 1 / (10 ^ ((OpponentRating - PlayerRating) / 400) + 1)
  • ScoringPt = 1 point if they win the match, 0 if they lose, and 0.5 for a tie.
  • NewRating = OldRating + K-Value * (ScoringPt - WinProbability)

Replace "players" with photos and you have an easy way to adjust the rating of both images after each comparison. You can then rank them by these numerical ratings. (The K-Value here is the "level" of the tournament: 8-16 for small local tournaments and 24-32 for large invitationals/regionals. You can simply use a constant like 20.)
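As a concrete illustration, here is a minimal sketch of that update rule in Python. The function name and the default K = 20 are my own choices for the example, not part of the original system:

```python
def elo_update(rating_a, rating_b, score_a, k=20):
    """Update two Elo ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    """
    # Probability that A beats B, per the WinProbability formula above.
    expected_a = 1.0 / (10 ** ((rating_b - rating_a) / 400.0) + 1)
    expected_b = 1.0 - expected_a

    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b

# Example: two fresh images, A wins the matchup.
a, b = 1600.0, 1600.0
a, b = elo_update(a, b, score_a=1.0)
print(a, b)  # 1610.0 1590.0
```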

With this method, you only need to store one number per image, which is far lighter on memory than storing each image's individual match-ups against every other image.

EDIT: added some meat based on comments.

+90
Oct 02 '08 at 23:05

The most naive approaches to the problem have some serious flaws. The worst is how bash.org and qdb.us display quotes: users can vote a quote up (+1) or down (-1), and the list of best quotes is sorted by the net total. This suffers from a terrible time bias: older quotes have accumulated huge numbers of positive votes through simple longevity, even if they are only marginally humorous. This algorithm might make sense if jokes got funnier as they got older, but (believe me) they don't.

There are various attempts to fix this: looking at the number of positive votes within a recent window, weighting newer votes more heavily, applying a decay to older votes, computing the ratio of positive to negative votes, and so on. Most of them suffer from other disadvantages.

The best solution, I think, is the one that the Funniest, the Cutest, the Fairest, and Best Thing websites use: a modified Condorcet voting system:

The system gives each item a score based on what percentage of the items it has faced it usually beats. So each item gets the percentage score NumberOfThingsIBeat / (NumberOfThingsIBeat + NumberOfThingsThatBeatMe). Additionally, items are barred from the top list until they have been compared against a reasonable percentage of the set.

If there is a Condorcet winner in the set, this method will find it. Since that is unlikely, given the statistical nature of the problem, it instead finds the item that is "closest" to being a Condorcet winner.

For more information on implementing such systems, the Wikipedia page on Ranked Pairs should be helpful.
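A minimal sketch of that scoring rule in Python; the data layout and the 10% eligibility threshold are assumptions for illustration, not something those sites publish:

```python
def condorcet_style_scores(wins, losses, total_items, min_coverage=0.10):
    """wins[x] / losses[x]: how many distinct items x beat / lost to.

    Items compared against fewer than min_coverage of the set are
    excluded from the ranking, as described above.
    """
    scores = {}
    for item in wins:
        faced = wins[item] + losses[item]
        if faced < min_coverage * total_items:
            continue  # barred from the top list until better covered
        scores[item] = wins[item] / faced
    # Highest win percentage first.
    return sorted(scores, key=scores.get, reverse=True)
```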

The algorithm requires people to compare two objects (your Pick A-or-B option), but frankly, that's a good thing. It is, I believe, very well accepted in decision theory that humans are far better at comparing two objects than at producing abstract ratings. Millions of years of evolution made us good at picking the better apple off the tree, but terrible at judging how close an apple we picked comes to some ideal platonic apple. (This, by the way, is why the Analytic Hierarchy Process is so elegant... but that's a bit off topic.)

One final point: SO uses an algorithm for surfacing the best answers that is very similar to the bash.org algorithm for surfacing the best quote. It works well here but fails badly there, largely because an old, highly voted, but now outdated answer here is likely to get edited. bash.org does not allow editing, and it is unclear what it would even mean to edit a decade-old joke to reference a current Internet meme... In any case, my point is that the right algorithm usually depends on the details of your problem. :-)

+39
Oct 02 '08 at

I know this question is pretty old, but I thought I'd contribute.

I would look at the TrueSkill system developed by Microsoft Research. It is similar to Elo but converges much faster (convergence looks exponential rather than linear), so you get more out of each vote. It is, however, mathematically more involved.

http://en.wikipedia.org/wiki/TrueSkill
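For example, with the third-party `trueskill` package on PyPI (an independent implementation, not Microsoft's original), a single vote could be processed like this:

```python
from trueskill import Rating, rate_1vs1  # pip install trueskill

image_a = Rating()  # new images start at mu=25, sigma~8.3
image_b = Rating()

# User picked A over B; winner goes first.
image_a, image_b = rate_1vs1(image_a, image_b)

# Rank by a conservative estimate: mean minus three standard deviations.
print(image_a.mu - 3 * image_a.sigma)
```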

+11
Dec 16 '09 at 18:06

I don't like the Hot-or-Not style. Different people choose different numbers even when they like an image equally. I also hate rating things out of 10; I never know which number to pick.

Choosing A-or-B is much easier and more fun. You are shown two images and pick the better one, and over time comparisons accumulate between all the images on the site.

+8
Oct 02 '08 at 22:20

These equations from Wikipedia make the Elo calculation easier and more efficient. The algorithm for images A and B is simply (a code sketch follows the list):

  • Get Ne, mA, mB and the current ratings RA, RB from your database.
  • Calculate KA, KB, QA, QB from the effective number of prior comparisons (Ne), the number of comparisons each image has completed (m), and the current ratings:

    KA = 800 / (Ne + mA)
    KB = 800 / (Ne + mB)
    QA = 10 ^ (RA / 400)
    QB = 10 ^ (RB / 400)

  • Calculate the expected scores EA and EB:

    EA = QA / (QA + QB)
    EB = QB / (QA + QB)

  • Score S: 1 for the winner, 0 for the loser, and 0.5 each for a draw.
  • Calculate the new ratings for both:

    RA' = RA + KA * (SA - EA)
    RB' = RB + KB * (SB - EB)

  • Write the new ratings RA, RB and the incremented counts mA, mB back to the database.
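Here is a minimal sketch of those steps in Python. The database is faked with a dict, Ne = 30 is an assumed tuning constant, and the K formula is the USCF-style variant from the Wikipedia Elo article, which I take to be the one intended here:

```python
ratings = {"A": {"R": 1600.0, "m": 0}, "B": {"R": 1600.0, "m": 0}}
NE = 30  # effective number of prior comparisons (assumed tuning constant)

def compare(winner, loser, draw=False):
    ra, rb = ratings[winner]["R"], ratings[loser]["R"]
    # K shrinks as an image accumulates comparisons, so ratings settle down.
    ka = 800.0 / (NE + ratings[winner]["m"])
    kb = 800.0 / (NE + ratings[loser]["m"])
    qa, qb = 10 ** (ra / 400.0), 10 ** (rb / 400.0)
    ea, eb = qa / (qa + qb), qb / (qa + qb)  # expected scores
    sa, sb = (0.5, 0.5) if draw else (1.0, 0.0)
    ratings[winner]["R"] = ra + ka * (sa - ea)
    ratings[loser]["R"] = rb + kb * (sb - eb)
    ratings[winner]["m"] += 1
    ratings[loser]["m"] += 1

compare("A", "B")  # user preferred image A
```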

+5
Dec 24 '08 at 13:42

You can go with a combination.

First step: Hot-or-Not style (though I would go with 3 options: Sucks, Meh/OK, Cool!).

Once the set is sorted into the 3 buckets, I would pick two images from the same bucket and move on to "Which is better?"

You could then use an English-football-style promotion and relegation system to move the top few images from "Sucks" into the Meh/OK bucket, to settle the edge cases.

+4
Oct 02 '08 at 22:14

A rating of 1-10 won't work; everyone calibrates differently. Someone who always rates between 3 and 7 will have their ratings drowned out by people who always give 1 or 10.

A-or-B is more effective.

+4
Oct 02 '08 at 22:26

Wow, I'm late to the game.

I really like the Elo system, but as Owen said, it seems like it would be slow to build up any meaningful results.

I believe people are capable of far more than just comparing two images, but you want to keep the interaction to a bare minimum.

So how about showing n images (n being whatever number you can comfortably display on screen; it might be 10, 20, or 30 depending on the user's preferences), having the user pick the one they think is best, and then falling back on Elo? You need to modify the rating system, but keep the same spirit: the chosen image has effectively been compared against the other n-1. So you perform the Elo update n-1 times, but divide each rating change by n-1 so that results for different values of n stay consistent with one another.
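A sketch of that modified update, reusing a standard two-player Elo step and scaling each change by 1/(n-1); the helper names are mine:

```python
def elo_expected(ra, rb):
    """Probability that the image rated ra beats the image rated rb."""
    return 1.0 / (10 ** ((rb - ra) / 400.0) + 1)

def rate_batch(ratings, chosen, shown, k=20):
    """ratings: dict image -> Elo rating; chosen beat every other shown image."""
    n = len(shown)
    base = ratings[chosen]  # compare everyone against the pre-click rating
    total = 0.0
    for other in shown:
        if other == chosen:
            continue
        e = elo_expected(base, ratings[other])
        # Each implied pairwise win is scaled by 1/(n-1) so that
        # batches of different sizes stay comparable.
        delta = k * (1.0 - e) / (n - 1)
        ratings[other] -= delta
        total += delta
    ratings[chosen] = base + total
```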

And you're done. You now have the best of all worlds: a simple rating system that handles many images with a single click.

+3
Mar 05

If you prefer the Pick A-or-B strategy, I would recommend this paper: http://research.microsoft.com/en-us/um/people/horvitz/crowd_pairwise.pdf

Chen, X., Bennett, P. N., Collins-Thompson, K., and Horvitz, E. (2013, February). Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (pp. 193-202). ACM.

The paper describes the Crowd-BT model, which extends the well-known Bradley-Terry pairwise comparison model to the crowdsourcing setting. It also provides an adaptive learning algorithm to improve time and space efficiency. You can find a Matlab implementation of the algorithm on GitHub (but I'm not sure whether it works).

+3
May 17 '15 at 3:01

The now-defunct website whatsbetter.com used an Elo-style method. You can read about the method in its FAQ via the Internet Archive.

+2
Dec 19 '08 at 6:36

Choose A-or-B is simpler and less prone to bias, but each human interaction yields much less information. I think the reduction in bias makes Pick A-or-B superior, and in the limit it provides you the same information.

A very simple scheme is to keep a single count per image: when an image wins a comparison, increment its count; when it loses, decrement it.

Sorting a list of 1 million integers is very fast and takes less than a second on a modern computer.
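A sketch of the counting scheme, with a `Counter` standing in for the database:

```python
from collections import Counter

scores = Counter()

def record_comparison(winner, loser):
    scores[winner] += 1
    scores[loser] -= 1

record_comparison("img_042", "img_731")

# Ranking even a million integer counts is cheap.
ranking = [img for img, _ in scores.most_common()]
```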

However, the problem is severely under-sampled: it will take you 50 days just to display each image once.

I bet you're mostly interested in the top-rated images, though? In that case you probably want to bias the choice of which images to show using the predicted rank, so that you are more likely to show images that have already won several comparisons. That way you start surfacing the "interesting" images faster.

+1
Oct 02 '08 at 22:27

I like the quicksort option, but I would make a few tweaks (a sketch of the comparison store follows the list):

  • Save the comparison results in a database and average them.
  • Get more than one comparison per page view by giving the user 4-6 images and having them put the images in order.
  • Choose which images to display by running the quicksort, recording results as you go and pruning branches you don't yet have enough data for. Then, once you have enough recorded comparisons, emit a ranked page.
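A minimal sketch of the first tweak: store every pairwise result and average them into a crowd-sourced comparator (the storage layout and function names are assumptions for illustration):

```python
from collections import defaultdict

votes = defaultdict(list)  # (img_a, img_b) -> list of +1 / -1 results

def record(a, b, a_won):
    """Store one human comparison, normalizing the pair's orientation."""
    key, sign = ((a, b), 1) if a < b else ((b, a), -1)
    votes[key].append(sign if a_won else -sign)

def crowd_compare(a, b):
    """Average the stored votes; returns > 0 if the crowd prefers a."""
    key = (a, b) if a < b else (b, a)
    results = votes[key]
    avg = sum(results) / len(results) if results else 0.0
    return avg if a < b else -avg
```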

Another interesting option would be to use the crowd to train a neural network.

+1
May 7, '09 at 17:04


