Twitter Trending Topics: Combining Different Spellings

Twitter trends often consist of more than one word, and for such compound terms there are different ways of writing them, for example:

Half-Blood Prince / Halfblood Prince

To find all updates on a trending topic, you need all of those spellings. Twitter does this:

Twitter admin trending topics (screenshot): http://i26.tinypic.com/hu4uw1.png

You have the topic name on the left and the different spellings on the right. Do you think this is done manually or automatically? Is it possible to do it automatically? If so: how?

I hope you can help me. Thanks in advance!

+6
Tags: twitter, spelling
7 answers

I will try to answer my own question, based on Broken Link's comment (thanks for that):


You have extracted phrases consisting of 1 to 3 words from your document database. Among these extracted phrases are the following:

  • Half-Blood Prince
  • Half Blood Prince
  • Halfblood Prince

For each phrase, you remove all special characters and spaces and make the string lowercase:

    $phrase = 'Half-Blood Prince';
    $phrase = preg_replace('/[^a-z]/i', '', $phrase); // strip everything that is not a letter
    $phrase = strtolower($phrase);
    // the result is "halfbloodprince"

Once you have done this, all 3 phrases (see above) share one common spelling:

  • Half-Blood Prince => halfbloodprince
  • Half Blood Prince => halfbloodprince
  • Halfblood Prince => halfbloodprince

So, "halfbloodprince" is a parent phrase. You insert both into your database, the regular phrase, and the parent phrase.

To display the trending topics with their spellings, like in the Twitter admin screenshot above, you do the following:

    // first select the top 10 parent phrases
    $sql1 = "SELECT parentPhrase, COUNT(*) as cnt FROM phrases GROUP BY parentPhrase ORDER BY cnt DESC LIMIT 0, 10";
    $sql2 = mysql_query($sql1);
    while ($sql3 = mysql_fetch_assoc($sql2)) {
        $parentPhrase = $sql3['parentPhrase'];
        $childPhrases = array(); // set up an array for the child phrases
        $fifthPart = round($sql3['cnt'] * 0.2);
        // now select all child phrases which make up 20% of the parent phrase count or more
        $sql4 = "SELECT phrase FROM phrases WHERE parentPhrase = '".$sql3['parentPhrase']."' GROUP BY phrase HAVING COUNT(*) >= ".$fifthPart;
        $sql5 = mysql_query($sql4);
        while ($sql6 = mysql_fetch_assoc($sql5)) {
            $childPhrases[] = $sql6['phrase']; // note: $sql6, not $sql3
        }
        // now you have the parent phrase (the left side of the arrow) in $parentPhrase
        // and all child phrases (the right side of the arrow) in $childPhrases
    }

Is that what you meant, Broken Link? Will this work?

+6

Basically you want to find the similarity between two strings.

I think the Soundex algorithm is what you are looking for. It can be used to compare strings based on how they sound. Or, as Wikipedia describes it:

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

and

Using this algorithm [EDIT: encoding words as a letter followed by three digits], both "Robert" and "Rupert" return the same string "R163", while "Rubin" yields "R150". "Ashcraft" yields "A261".

There is also the Levenshtein distance.
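For what it's worth, PHP already ships with both of these as built-in functions (soundex() and levenshtein()), so you can experiment in a few lines; the example strings below are taken from this thread:

    // Soundex: "Robert" and "Rupert" get the same phonetic code
    echo soundex('Robert'), "\n";   // R163
    echo soundex('Rupert'), "\n";   // R163

    // Levenshtein: number of single-character edits between two strings
    echo levenshtein('Half-Blood Prince', 'Halfblood Prince'), "\n";  // 2 (drop the hyphen, fix the case)
    echo levenshtein('Half-Blood Prince', 'Half Blood Prince'), "\n"; // 1 (the hyphen becomes a space)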

Good luck.

+7

There are many ways to do this. One straightforward one is a Google-style "did you mean" check; this article, written by Peter Norvig, Director of Research at Google, is a good insight into how to achieve it:

http://norvig.com/spell-correct.html
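As a much-simplified sketch of that idea (Norvig's article builds a full probabilistic model over a large corpus; this just suggests the closest known phrase by edit distance, using made-up data):

    // Suggest the closest known phrase, or null if nothing is close enough
    function did_you_mean($input, array $knownPhrases, $maxDistance = 3) {
        $best = null;
        $bestDistance = $maxDistance + 1;
        foreach ($knownPhrases as $known) {
            $d = levenshtein(strtolower($input), strtolower($known));
            if ($d < $bestDistance) {
                $bestDistance = $d;
                $best = $known;
            }
        }
        return $best;
    }

    $known = array('Half-Blood Prince', 'Firefox 3', 'Michael Jackson');
    echo did_you_mean('Halfblod Prince', $known); // "Half-Blood Prince"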

+3

"anderstornvig" mentioned Levenshtein / editing distance, which is a great idea, but not entirely appropriate, because some permutations are more significant than other permutations. The problem is that we use a lot of domain knowledge when we determine which differences are “significant” and which are “not significant”. For example, we know that the hyphen in "Half-Blood Prince" is very important, but the number in "Firefox 3" is very important.

For this reason, you might consider parameterizing a simple metric such as Levenshtein: add options that let you configure which kinds of differences are important and which are irrelevant.

In particular, Levenshtein counts the number of "edits" (insertions, deletions, and substitutions) needed to turn one string into another, and it effectively weighs every edit the same. You could write an implementation that weighs different edits differently. For example, changing "-" to "" (i.e., removing a hyphen) should have a very low weight, indicating unimportance. Changing "3" to "2" when the digit stands alone should have a very high weight, indicating high significance.
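A sketch of what such a parameterized metric could look like; the concrete cost rules here (punctuation and case changes are almost free, digit changes are expensive) are illustrative assumptions, not a fixed recipe:

    // Illustrative cost function: tune these rules to your domain
    function edit_cost($a, $b) {
        if ($a === $b) return 0.0;
        $cheap = array('-', ' ', '');
        if (in_array($a, $cheap, true) && in_array($b, $cheap, true)) return 0.1; // punctuation tweaks barely matter
        if (ctype_digit($a) || ctype_digit($b)) return 5.0;                       // "Firefox 3" vs "Firefox 2" should stay apart
        if (strtolower($a) === strtolower($b)) return 0.1;                        // case changes barely matter
        return 1.0;                                                               // ordinary insertion/deletion/substitution
    }

    // Standard Levenshtein dynamic program, but with configurable edit costs
    function weighted_levenshtein($s, $t) {
        $n = strlen($s);
        $m = strlen($t);
        $d = array();
        $d[0][0] = 0.0;
        for ($i = 1; $i <= $n; $i++) $d[$i][0] = $d[$i - 1][0] + edit_cost($s[$i - 1], '');
        for ($j = 1; $j <= $m; $j++) $d[0][$j] = $d[0][$j - 1] + edit_cost('', $t[$j - 1]);
        for ($i = 1; $i <= $n; $i++) {
            for ($j = 1; $j <= $m; $j++) {
                $d[$i][$j] = min(
                    $d[$i - 1][$j]     + edit_cost($s[$i - 1], ''),          // deletion
                    $d[$i][$j - 1]     + edit_cost('', $t[$j - 1]),          // insertion
                    $d[$i - 1][$j - 1] + edit_cost($s[$i - 1], $t[$j - 1])   // substitution (0 if equal)
                );
            }
        }
        return $d[$n][$m];
    }

    echo weighted_levenshtein('Half-Blood Prince', 'Halfblood Prince'), "\n"; // 0.2: only a hyphen and a case change
    echo weighted_levenshtein('Firefox 3', 'Firefox 2'), "\n";                // 5: the digit is significant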

By parameterizing the calculation, you create a path for continuously improving your algorithm: build an initial configuration and run it on some test data, find places where the metric is weak (where it merges two terms you think should stay separate, for example), and adjust the parameters until you are satisfied.

This way, you can train your algorithm using your domain knowledge.

+2

Assuming the trending topics are generated computationally, the exact algorithm Twitter runs will be hard to guess. It is most likely confidential and perhaps even patented (as scary as patenting algorithms may sound).

I consider it reasonable to believe that they use some kind of natural language processing algorithm. Depending on the case, these can be computationally very demanding, but they will do what you want to some extent.

An obvious place to start reading is the Wikipedia article on natural language processing.

Good luck.

+1

Most likely they have some automatic system that suggests likely candidates for merging, and then a person makes the final call on combining them. Perhaps some of them are combined automatically.

  • Your suggestion of removing spaces and other punctuation is good. Most likely they automatically combine things that differ only in punctuation or whitespace.
  • Plural versus singular: searching for these differences is easy to automate and produces likely candidates for merging.
  • Common misspellings: there are databases of common misspellings. They may even rely on the Google API for spelling suggestions (I think Google exposes those).
  • Soundex (or something similar) is good for misspellings, but you would first run the two filters above (removing spaces/punctuation and handling plurals), and then someone would most likely have to make the call on whether two phrases really are the same. If you can present a graphical view showing clusters of identically or similarly sounding phrases, you make that part easy. You could automatically send a notification when a cluster starts to appear and to trend (they really only care about trends anyway, so even if an uncombined cluster is not trending, it can wait to be reviewed). A sketch of how these filters could be chained follows this list.
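A rough sketch of how the first filters plus a Soundex key could be chained into a single cluster key (the plural handling here is deliberately naive and purely illustrative):

    // Build a crude cluster key: strip punctuation/spaces, naive plural handling, then Soundex
    function cluster_key($phrase) {
        $key = strtolower(preg_replace('/[^a-z]/i', '', $phrase)); // drop punctuation, spaces, digits
        $key = preg_replace('/s$/', '', $key);                     // very naive plural -> singular
        return soundex($key);                                      // phonetic bucket catches misspellings
    }

    // All of these end up with the same key, so they become one merge candidate
    // (possibly shown to a human for confirmation):
    echo cluster_key('Half-Blood Prince'), "\n";
    echo cluster_key('Halfblood Princes'), "\n";
    echo cluster_key('Half Blood Prince'), "\n";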

Where you really need a person to intervene is with common nicknames: Michael Jackson, MJ, Michael, etc. Or MacDonalds, McD, Micky-D, etc. On the technical side you have Visual Studio, VS2008, VS, etc., or StackOverflow, SO, etc. Then C#, C-Sharp, and C#.NET are all the same, but C and C++ are different.

So it should be a combination. It can rely on a database of known variations and combinations built from previous analysis or other sources, but that database would need to be maintained regularly by a human editor.

+1

I remember that when Manchester United died, Twitter went back and manually fixed the topics to group the tweets about the death. It would be a lot to ask of a computer to do something similar automatically, although a human can do it easily.

0
