An algorithm for determining which words make a phrase popular

Suppose I had a list of slogans (short, multi-word phrases), people voted for the ones they liked best, and I wanted to work out which words, if any, made some slogans more popular than others. What would be the best way to do this? My first thought was simply to collect all the unique words across the slogans and score each one as the average vote count of the slogans containing it, but I think frequency should also come into play somehow, so something like the following seems fair (a rough sketch of this baseline appears after the list):

  • If Word A appears only in the slogan with the most votes, and Word B appears only in the slogan that came second, Word A is more "popularity-generating".
  • However, if Word A appears only in the top-ranked slogan, and Word B appears in both the second- and third-place slogans, Word B should win, because it pushes more slogans toward the top.
  • However, a single occurrence of Word A in the top slogan should still beat, say, three occurrences of Word B in other slogans if those slogans sit in the middle or lower half of the pack (that is, votes and frequency should be weighed against each other).

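For concreteness, here is a minimal sketch of that averaging baseline (the data layout, the stop-word list, and the logarithmic frequency bonus are my own assumptions, not part of the question):

```python
# Sketch of the baseline: score each word by the average votes of the
# slogans containing it, with a dampened bonus for appearing more often.
from collections import defaultdict
import math

def word_scores(slogans, stop_words=frozenset({"the", "a", "of", "or", "and"})):
    votes_per_word = defaultdict(list)
    for text, votes in slogans:                      # slogans: list of (text, votes)
        for word in set(text.lower().split()):
            if word not in stop_words:
                votes_per_word[word].append(votes)
    return {
        word: (sum(v) / len(v)) * math.log(1 + len(v))   # avg votes x frequency bonus
        for word, v in votes_per_word.items()
    }

slogans = [("give peace a chance", 120), ("peace and quiet", 45), ("just do it", 80)]
print(sorted(word_scores(slogans).items(), key=lambda kv: -kv[1]))
```

Whether log(1 + frequency) is the right balance between votes and frequency is exactly the open question above.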
I also want to exclude very common words (for example, "or"). This is related to earlier questions about identifying trending words, but it is different because change over time is not a factor here. I would be happy simply to be pointed in the right direction as far as the literature goes, but I am not quite sure what to search for. Is this a class of problem that other people have dealt with?

+4
3 answers

This is a machine learning problem: you are trying to learn a ranking model from supervised data (the vote counts). To do this, you can run a simple algorithm along the lines of the perceptron or SampleRank (PDF):

First you define features that apply to the words in a slogan. Features can be shared between words; for example, the features of the word "peace" could be:

  • "peace",
  • "noun",
  • "abstract noun",
  • "short noun",
  • "starts with p",
  • "ends with 's'-sound",
  • ...

The first feature of "peace" is a unique feature that fires only on "peace", while the other features can also fire on other words.
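A rough sketch of such word features, with plain string tests standing in for the linguistic ones (part of speech or abstractness would need an NLP toolkit), could look like this:

```python
# Hypothetical feature extraction: each word triggers one unique feature
# plus some shared features based on its surface form.
def word_features(word):
    feats = [f"word={word}"]          # unique feature, fires only on this word
    if len(word) <= 5:
        feats.append("short")
    if word.startswith("p"):
        feats.append("starts-with-p")
    if word.endswith("s") or word.endswith("ce"):
        feats.append("ends-with-s-sound")
    return feats

def slogan_features(slogan):
    feats = []
    for word in slogan.lower().split():
        feats.extend(word_features(word))
    return feats
```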

Each feature has a weight (higher is better), so you have a feature vector and a weight vector. This lets you assign a score (rating) to any slogan: simply the sum of the weights of all features that fire on the words in the slogan. All weights are initialized to 0.0.
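Continuing the sketch above, the slogan score is then just a sum over fired features, with weights kept in a dictionary that defaults to 0.0:

```python
from collections import defaultdict

weights = defaultdict(float)          # every feature weight starts at 0.0

def score(slogan):
    # Sum of the weights of all features that fire on the slogan's words.
    return sum(weights[f] for f in slogan_features(slogan))
```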

Now you start training:

You loop over all pairs of slogans. For each pair you know the true ranking (from the votes you have). Then you compute the ranking according to the features and their current weights. If the true ranking and the ranking according to your current weights (i.e., according to the current model) agree, you simply move on to the next pair. If your model ranked the pair wrongly, you correct the feature weights: you add 1.0 to the weights of the features that fire on the better slogan (the one that is better according to people's votes) and subtract 1.0 from the weights of the features that fire on the worse slogan (its score was clearly too high, so you lower it). These weight updates affect the scores your model assigns to subsequent pairs, and so on.

You run this loop several times, until your model gets most of the pairs right (or until some other convergence criterion is met).

In practice you do not add or subtract exactly 1.0, but eta times 1.0, where eta is a learning rate that you set experimentally. It is typically higher at the start of training and gradually decreases as your weights move in the right direction (see also stochastic gradient descent). To get started, you can simply keep it constant at 0.1. A sketch of the full loop follows.
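Here is one way the loop could look, reusing slogan_features, weights and score from the sketches above; the epoch count and eta are arbitrary starting values, and ties are counted as mistakes so the zero-initialized model can start moving:

```python
from itertools import combinations

def train(slogans, epochs=10, eta=0.1):
    for _ in range(epochs):
        mistakes = 0
        for (text_a, votes_a), (text_b, votes_b) in combinations(slogans, 2):
            if votes_a == votes_b:
                continue                              # no true ranking for this pair
            better, worse = (text_a, text_b) if votes_a > votes_b else (text_b, text_a)
            if score(better) <= score(worse):         # model ranks the pair wrongly (or ties)
                mistakes += 1
                for f in slogan_features(better):
                    weights[f] += eta                 # push the better slogan up
                for f in slogan_features(worse):
                    weights[f] -= eta                 # push the worse slogan down
        if mistakes == 0:                             # simple convergence criterion
            break

train(slogans)
```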

This procedure also takes care of stop words ("the", "of", ...), because they should occur about equally often in good and bad slogans (and if they actually do not, you will learn that too).

After training, you can compute a score for each word from the learned weights.
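With the learned weights from the sketch above, a word's score is just the total weight of the features it triggers, for example:

```python
def learned_word_score(word):
    return sum(weights[f] for f in word_features(word))

for w in ["peace", "chance", "it"]:
    print(w, round(learned_word_score(w), 3))
```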

+2

What about Bayesian inference?

0

I think I would use an algorithm that does this:

0
