Algorithm for determining the most popular articles of the last week, month, and year?

I am working on a project where I need to sort a list of user-published articles by their popularity (over the last week, last month, and last year).

I thought about this for a while, but I'm not much of a statistician, so I figured I could get some input here.

Here are the available variables:

  • Time [date] the article was originally published.
  • Time [date] the article was recommended by the editors (if it was).
  • Number of votes received from users (total, last week, last month, last year).
  • Number of times the article has been viewed (total, last week, last month, last year).
  • Number of times the article has been downloaded by users (total, last week, last month, last year).
  • Comments on the article (total, last week, last month, last year).
  • Number of times a user has saved the article to their reading list (total, last week, last month, last year).
  • Number of times the article was featured in an editorial "best we have to offer" list (total, last week, last month, last year).
  • Time [date] the article was named "article of the week" (if it was).

Currently I apply a weight to each variable and divide by the number of times the article has been read. That's about all I could come up with after reading about weighted means. My biggest problem is that a few articles are always at the top of the popular list, probably because their authors are "cheating."

I am considering emphasizing how recent an article is, but I do not want to "punish" articles that are genuinely popular just because they are a bit old.

Anyone with a better statistical mind than mine willing to help me?

Thanks!

+7
sorting math algorithm statistics
3 answers

I think a weighted approach is a good one, but you need to figure out two things:

  • How to weight the criteria.
  • How to prevent "gaming" of the system.

How to weight the criteria

This question belongs to the field of multi-criteria decision analysis. Your approach is a weighted sum model. In any such decision process, ranking the criteria is often the hardest part. I suggest you go the route of pairwise comparisons: how important do you judge each criterion to be compared with the others? Build a table like this:

        c1    c2    c3   ...
  c1    1     4     2
  c2    1/4   1     1/2
  c3    1/2   2     1
  ...

This says that C1 is four times as important as C2, and that C2 is half as important as C3. Use a fixed total weight pool, say 1.0, since that is easy to work with. Treating C2 as the unit, we have C1 = 4*C2 and C3 = 2*C2, so C1 + C3 + C2 = 7*C2 = 1, or roughly C1 = 4/7, C3 = 2/7, C2 = 1/7. If there are discrepancies (for example, if you think C1 = 2*C2 = 3*C3, but also C3 = 2*C2), that is a good error indicator: it means your relative ratings are inconsistent, so go back and revise them. I forget the name of this procedure; comments here would be helpful. All of this is well documented.
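As a concrete illustration, here is a small Python sketch (the function and variable names are my own) that turns a pairwise table like the one above into normalized weights using row geometric means, one common way of doing this:

```python
from math import prod

def weights_from_pairwise(matrix):
    """Derive criterion weights from a pairwise-comparison matrix."""
    n = len(matrix)
    # geometric mean of each row captures that row's overall dominance
    gmeans = [prod(row) ** (1.0 / n) for row in matrix]
    total = sum(gmeans)
    # normalize so the weights sum to 1.0 (the "weight pool")
    return [g / total for g in gmeans]

# The table above: c1 is 4x as important as c2 and 2x as important as c3.
pairwise = [
    [1.0,  4.0, 2.0],   # c1 compared with c1, c2, c3
    [0.25, 1.0, 0.5],   # c2 ...
    [0.5,  2.0, 1.0],   # c3 ...
]
print(weights_from_pairwise(pairwise))  # roughly [4/7, 1/7, 2/7]
```

For a perfectly consistent table like this one, the result is exactly 4/7, 1/7, 2/7; for an inconsistent table the geometric means give a reasonable compromise.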

Now, this will probably seem a little arbitrary to you; these are for the most part numbers you pulled out of your head. So I would suggest taking a sample of about 30 articles and ranking them the way your gut says they should be ordered (your intuition is often better than what you can put into numbers). Then fiddle with the weights until they produce something close to that ordering.
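One simple way to check your weights against that gut ordering is to count pairwise disagreements between the two rankings (a Kendall-style measure). A minimal sketch, with made-up article ids and scores:

```python
def pairwise_disagreements(gut_order, scores):
    """gut_order: article ids listed best-first.
    scores: dict mapping article id -> computed score.
    Returns how many article pairs the scores rank in the wrong order."""
    bad = 0
    for i in range(len(gut_order)):
        for j in range(i + 1, len(gut_order)):
            a, b = gut_order[i], gut_order[j]
            if scores[a] < scores[b]:  # gut says a beats b, scores disagree
                bad += 1
    return bad

gut = ["a3", "a1", "a2"]                      # your hand-made ranking
scores = {"a1": 0.8, "a2": 0.5, "a3": 0.9}    # computed by your weights
print(pairwise_disagreements(gut, scores))    # 0: the orderings agree
```

Tune the weights to drive this count toward zero on your 30-article sample.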

Game prevention

This is the second important aspect. No matter which system you use, if you cannot prevent cheating, it will ultimately fail. You need to be able to limit voting (should the same IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important a criterion is, the harder you need to make it to game.
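As one small example of such a limit, a sketch of rejecting duplicate votes per (article, voter) pair. Keying on IP here is only for illustration; a real system would combine accounts, cookies, rate limits, and so on:

```python
class VoteLimiter:
    """Allows at most one vote per (article, voter) pair."""

    def __init__(self):
        self._seen = set()

    def try_vote(self, article_id, voter_ip):
        key = (article_id, voter_ip)
        if key in self._seen:
            return False  # duplicate vote, rejected
        self._seen.add(key)
        return True       # first vote from this voter, accepted

limiter = VoteLimiter()
print(limiter.try_vote("article-42", "10.0.0.1"))  # True
print(limiter.try_vote("article-42", "10.0.0.1"))  # False (repeat vote)
```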

+5

You can use outlier detection to find anomalies. A very naive way to find outliers is the Mahalanobis distance. It is a measure that takes the distribution of your data into account and computes a relative distance from the center; it can be interpreted as roughly the number of standard deviations an article lies from the center. This will also flag genuinely very popular articles, but it gives you a first indication that something is odd.

A second, more general approach is to build a model. You can regress the variables users can manipulate against those controlled by the editors. One would expect users and editors to agree to some extent; if they do not, that again indicates something is odd.

In both cases, you will need to pick some cutoff and derive a weight from it. A possible approach is to use the square root of the Mahalanobis distance as the reciprocal of the weight: if you are far from the center, your score gets pulled down. The same can be done with the residuals of the model, and there you can even take the sign into account. If the editorial rating is lower than expected from the user variables, the residual is negative; if it is higher than expected, the residual is positive, and it is then very unlikely that the article is being gamed. This lets you define rules to down-weight suspicious articles.

Example in R:

[Figure: two histograms, one of Mahalanobis distances and one of model residuals, with the spam-hyped article marked in red and the genuinely good article in green.]

The code:

 # Test data frame generated at random
 test <- data.frame(
   quoted   = rpois(100, 12),
   seen     = rbinom(100, 60, 0.3),
   download = rbinom(100, 30, 0.3)
 )

 # Create some link between the user variables and the editorial rating
 test <- within(test, {
   editorial <- round((quoted + seen + download) / 10 + rpois(100, 1))
 })

 # Add two test cases
 test[101, ] <- c(20, 18, 13, 0)  # bad article, hyped by a few spammers
 test[102, ] <- c(20, 18, 13, 8)  # genuinely good article

 # Mahalanobis distances
 mah <- mahalanobis(test, colMeans(test), cov(test))

 # Simple linear model
 mod <- lm(editorial ~ quoted * seen * download, data = test)

 # The plots
 op <- par(mfrow = c(1, 2))
 hist(mah, breaks = 20, col = "grey", main = "Mahalanobis distance")
 points(mah[101], 0, col = "red", pch = 19)
 points(mah[102], 0, col = "darkgreen", pch = 19)
 legend("topright", legend = c("high rated by editors", "gamed"),
        pch = 19, col = c("darkgreen", "red"))
 hist(resid(mod), breaks = 20, col = "grey", main = "Model residuals",
      xlim = c(-6, 4))
 points(resid(mod)[101], 0, col = "red", pch = 19)
 points(resid(mod)[102], 0, col = "darkgreen", pch = 19)
 par(op)
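The down-weighting rule described above (but not coded in the R example, which only plots the diagnostics) can be sketched in Python as follows. The threshold of 1 and the residual penalty shape are my own assumptions:

```python
import math

def adjusted_score(raw_score, mahalanobis_d, residual):
    """Shrink a raw popularity score using the two diagnostics above."""
    # far from the data's center: weight is the reciprocal of sqrt(distance)
    weight = 1.0 / math.sqrt(mahalanobis_d) if mahalanobis_d > 1.0 else 1.0
    # editors rated the article below what user numbers predict: suspicious
    if residual < 0:
        weight *= 1.0 / (1.0 - residual)
    return raw_score * weight

print(adjusted_score(100.0, 0.5, 2.0))    # 100.0: ordinary article, untouched
print(adjusted_score(100.0, 16.0, -3.0))  # 6.25: extreme and suspicious
```

Positive residuals (editors like it more than user numbers predict) deliberately carry no penalty, matching the reasoning above.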
+3

There are several ways to do this, and what works for you will depend on your actual data set and on the results you want for specific articles. As a rough guideline, however, I would suggest taking your weighted sum and dividing it by the article's age, since the older an article is, the more likely it is to have high numbers in every category.

for example

 // x[i] = any given variable above
 // w[i] = weighting for that variable
 // age  = days since published, OR
 //        days since editor recommendation, OR
 //        an average of both, OR ...
 score = (x[1]*w[1] + ... + x[n]*w[n]) / age
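A runnable version of that pseudocode, with illustrative numbers of my own choosing:

```python
def score(values, weights, age_days):
    """Weighted sum of an article's numbers, divided by its age in days."""
    assert len(values) == len(weights)
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / max(age_days, 1)  # avoid division by zero on day 0

# e.g. votes=120, views=3000, saves=45 with weights 4/7, 1/7, 2/7
print(score([120, 3000, 45], [4/7, 1/7, 2/7], age_days=30))  # 17.0
```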

As for your concern about promoting new articles without punishing genuinely popular old ones: you need to work out how to tell whether an article is genuinely popular, and then feed that "genuineness" into the algorithm to weight the votes or views, rather than using a static weighting. You can likewise make any of the other weights non-constant, i.e. apply non-linear weights to whichever variables you want.

 // Fw = some non-linear function (possibly multi-variable)
 //      that calculates a sub-score for the given variable(s)
 score = (Fw1(x[1]) + ... + FwN(x[n])) / FwAge(age)
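For instance, one plausible set of non-linear shapes (the functions and constants here are illustrative assumptions, not a recommendation): log-dampen view counts so that gamed spikes matter less, and soften the age penalty so old but genuinely popular articles are not crushed.

```python
import math

def sub_score_views(views):
    # diminishing returns: doubling views adds less than double the score
    return 10.0 * math.log1p(views)

def age_penalty(age_days):
    # gentler than dividing by raw age
    return math.sqrt(age_days + 1.0)

def score(views, votes, age_days):
    return (sub_score_views(views) + 3.0 * votes) / age_penalty(age_days)
```

With shapes like these, an old article with sustained genuine numbers keeps a respectable score, while a brand-new article with a sudden view spike gains less than a linear weighting would give it.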
+1