Programmatically extract keywords from domain names

Question


Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. However, I can see this being done on sites like DomainTools.com, Estibot.com, etc. For example:

 ilikecheese.com becomes "i like cheese"
 sanfranciscohotels.com becomes "san francisco hotels"
 ...

Any suggestions for an effective and efficient solution?

Edit: I would like to write this in PHP.

+6

string php keyword dns extraction

Kevin Aug 22 '09 at 7:14


7 answers

You might want to check out this SO question.

+3

Zed Aug 27 '09 at 7:03


You need to develop a heuristic that finds likely matches in the domain. The way I would do this is to first find a large corpus of text. For example, you could download Wikipedia.

Next, take your corpus and concatenate every two adjacent words. For example, if your sentence is:

 quick brown fox jumps over the lazy dog

You will create a list:

 quickbrown brownfox foxjumps jumpsover overthe thelazy lazydog

Each of these starts with a count of one. As you parse your corpus, keep track of the frequency of every two-word pair. In addition, for each pair you need to store the original two words.

Sort this list by frequency, and then try to find matches in your domain based on these words.

Finally, do a domain check for the top two-word phrases that are not registered!
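The pair-counting steps above can be sketched in PHP roughly as follows. This is only an illustration under assumptions: `buildPairTable` and `lookupPair` are hypothetical names, the corpus is a short inline string rather than a Wikipedia dump, and when two different word pairs concatenate to the same string, only the first pair seen is remembered.

```php
<?php
// Hypothetical helper: count concatenated adjacent-word pairs in a corpus.
function buildPairTable(string $corpus): array
{
    // Lowercase the text and keep only runs of letters as words.
    preg_match_all('/[a-z]+/', strtolower($corpus), $m);
    $words = $m[0];
    $table = [];
    for ($i = 0; $i + 1 < count($words); $i++) {
        $key = $words[$i] . $words[$i + 1];
        if (!isset($table[$key])) {
            // Store the original two words next to the count
            // (assumption: the first pair seen wins on collisions).
            $table[$key] = ['count' => 0, 'words' => [$words[$i], $words[$i + 1]]];
        }
        $table[$key]['count']++;
    }
    // Sort so the most frequent pairs come first.
    uasort($table, fn($a, $b) => $b['count'] <=> $a['count']);
    return $table;
}

// Hypothetical helper: return the stored words if the label is a known pair.
function lookupPair(array $table, string $label): ?array
{
    return $table[$label]['words'] ?? null;
}

$table = buildPairTable('The quick brown fox jumps over the lazy dog. The lazy dog sleeps.');
print_r(lookupPair($table, 'lazydog'));
```

This only recognizes domains made of exactly two words that appear next to each other in the corpus; longer names would need the table applied repeatedly or a different method.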

I think sites like DomainTools take a list of the highest-ranking words and try to parse those out first. Depending on the purpose, you may want to consider using MTurk to do the job. Different people will parse the same words differently, and might not do so in proportion to how common the words are.

+3

brianegge Aug 27 '09 at 7:26


 choosespain.com
 kidsexpress.com
 childrenswear.com
 dicksonweb.com

Have fun (and have a good lawyer) if you are going to try to parse the URL with a dictionary.

You might do better if you can find the same characters, but separated by whitespace, on the site's own pages.
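One way to look for the spaced-out form is a regular expression that allows optional whitespace between the characters of the domain label. The sketch below assumes the page text has already been fetched and stripped of markup (for example with `strip_tags`); `findSpacedForm` is a hypothetical name, not an existing function.

```php
<?php
// Hypothetical helper: find the label's characters, in order, with
// whitespace somewhere between them, inside already-fetched page text.
function findSpacedForm(string $label, string $pageText): ?string
{
    // Escape each character, then allow optional whitespace between characters.
    $chars   = array_map(fn($c) => preg_quote($c, '/'), str_split($label));
    $pattern = '/\b' . implode('\s*', $chars) . '\b/i';
    if (!preg_match_all($pattern, $pageText, $matches)) {
        return null;
    }
    foreach ($matches[0] as $match) {
        // Skip occurrences written without any spaces (the domain itself).
        if (preg_match('/\s/', $match)) {
            return strtolower(preg_replace('/\s+/', ' ', $match));
        }
    }
    return null;
}

$text = 'Welcome to ChooseSpain.com! Choose Spain for your next holiday.';
echo findSpacedForm('choosespain', $text), "\n";
```

The site's own wording settles ambiguous cases such as choosespain.com without any dictionary, but it only works when the page actually spells the name out.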

Other possibilities: extract data from the SSL certificate; query the top-level domain (TLD) name server; access the domain's own name server; or use one of the whois tools or services (just google "whois").

+2

Dipstick Aug 22 '09 at 7:45


If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time, using a backtracking algorithm. If you manage to use up the whole string, you are done. Keep in mind that the time complexity of this is not optimal :)
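A minimal PHP sketch of that backtracking idea might look like this. The function name `segment` and the tiny inline word list are assumptions for illustration; a real word list would be loaded from a dictionary file.

```php
<?php
// Hypothetical helper: split $text into dictionary words by backtracking.
// $dictionary has the valid words as array keys for fast isset() lookups.
function segment(string $text, array $dictionary): ?array
{
    if ($text === '') {
        return []; // The whole string was consumed: success.
    }
    // Try the longest prefix first, then progressively shorter ones.
    for ($len = strlen($text); $len >= 1; $len--) {
        $prefix = substr($text, 0, $len);
        if (!isset($dictionary[$prefix])) {
            continue;
        }
        $rest = segment(substr($text, $len), $dictionary);
        if ($rest !== null) {
            return array_merge([$prefix], $rest);
        }
        // The remainder could not be split: backtrack to a shorter prefix.
    }
    return null; // No valid split exists from this position.
}

$dictionary = array_flip(['i', 'like', 'cheese', 'san', 'francisco', 'hotel', 'hotels']);
print_r(segment('ilikecheese', $dictionary));
print_r(segment('hotelsan', $dictionary));
```

In the second call the longest prefix "hotels" leaves "an", which is not in the word list, so the search backs up and succeeds with "hotel" followed by "san". The function returns only the first split it finds, which is not necessarily the most plausible one.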

+1

Zed Aug 22 '09 at 7:39


    function getwords( $string ) {
        // Skip internationalized (punycode) names.
        if ( strpos( $string, 'xn--' ) !== false ) {
            return false;
        }
        $string = trim( str_replace( '-', '', $string ) );
        $pspell = pspell_new( 'en' ); // requires the pspell extension
        $words  = array();
        $length = strlen( $string );
        // $j is the start offset, $i the substring length (at least 4 characters).
        // substr() clamps at the end of the string, so long lengths repeat a
        // candidate; array_unique() removes those repeats below.
        for ( $j = 0; $j < ( $length - 5 ); $j++ ) {
            for ( $i = 4; $i < $length; $i++ ) {
                $candidate = substr( $string, $j, $i );
                if ( pspell_check( $pspell, $candidate ) ) {
                    $words[] = $candidate;
                }
            }
        }
        $words = array_unique( $words );
        if ( count( $words ) > 0 ) {
            return $words;
        }
        return false;
    }

    print_r( getwords( 'ilikecheesehotels' ) );

    Array
    (
        [0] => like
        [1] => cheese
        [2] => hotel
        [3] => hotels
    )

This is a simple start using pspell. You could compare the results, check whether you have the base form of a word without the "s" at the end, and merge the two.
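The merging step mentioned above could be sketched like this. It assumes that when both a word and the same word plus "s" were found, the longer form is the one to keep; `mergePlurals` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: drop a word when the same word plus "s" is also present.
// Assumption: the longer form is kept because it covers more of the domain.
function mergePlurals(array $words): array
{
    $present = array_flip($words); // words as keys for fast lookups
    $kept    = [];
    foreach ($words as $word) {
        if (isset($present[$word . 's'])) {
            continue; // The plural form is in the list; skip the base word.
        }
        $kept[] = $word;
    }
    return $kept;
}

print_r(mergePlurals(['like', 'cheese', 'hotel', 'hotels']));
```

This is purely string-based, so it would also drop a word like "new" if "news" were in the list; a stricter version would need real morphology.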

+1

Tobias Dec 9 '11 at 1:49


You would have to run a dictionary engine against the domain name to find valid words, and then run the dictionary engine against the result to make sure that it consists only of valid words.

0

austin cheney Aug 22 '09 at 7:18


Squarecog · Accepted Answer · 2009-08-29T20:19:27+0000

OK, I ran the script I wrote for this SO question, with a couple of minor changes: using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

For my corpus, I downloaded a bunch of files from Project Gutenberg. There was no real method to this; I just grabbed all the English-language files from etext00, etext01 and etext02.

Below are the results. I kept the top three for each combination.

  expertsexchange: 97 possibilities
  - experts exchange -23.71
  - expert sex change -31.46
  - experts ex change -33.86

 penisland: 11 possibilities
  - pen island -20.54
  - penis land -22.64
  - pen is land -25.06

 choosespain: 28 possibilities
  - choose spain -21.17
  - chooses pain -23.06
  - choose spa in -29.41

 kidsexpress: 15 possibilities
  - kids express -23.56
  - kid sex press -32.65
  - kids ex press -34.98

 childrenswear: 34 possibilities
  - children swear -19.85
  - childrens wear -25.26
  - child ren swear -32.70

 dicksonweb: 8 possibilities
  - dickson web -27.09
  - dick son web -30.51
  - dicks on web -33.63
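The script itself is not shown here, so the following PHP sketch is only a guess at the general idea: estimate word probabilities from a corpus, enumerate every split of the name into known words, score each split by the sum of its log probabilities, and keep the best three. All function names are hypothetical, the model is a plain unigram model without smoothing, and the toy corpus string stands in for the Gutenberg files.

```php
<?php
// Hypothetical helper: map each corpus word to its log probability.
function logProbabilities(string $corpus): array
{
    preg_match_all('/[a-z]+/', strtolower($corpus), $m);
    $counts = array_count_values($m[0]);
    $total  = array_sum($counts);
    $logp   = [];
    foreach ($counts as $word => $count) {
        $logp[$word] = log($count / $total);
    }
    return $logp;
}

// Hypothetical helper: every split of $text into known words, with its score.
function segmentations(string $text, array $logp): array
{
    if ($text === '') {
        return [['words' => [], 'score' => 0.0]];
    }
    $results = [];
    for ($len = 1; $len <= strlen($text); $len++) {
        $prefix = substr($text, 0, $len);
        if (!isset($logp[$prefix])) {
            continue; // Unknown words are not allowed (no smoothing).
        }
        foreach (segmentations(substr($text, $len), $logp) as $tail) {
            $results[] = [
                'words' => array_merge([$prefix], $tail['words']),
                // Adding logs replaces multiplying tiny probabilities,
                // which is what avoids floating-point underflow.
                'score' => $logp[$prefix] + $tail['score'],
            ];
        }
    }
    return $results;
}

// Hypothetical helper: the $limit highest-scoring splits.
function topSegmentations(string $text, array $logp, int $limit = 3): array
{
    $all = segmentations($text, $logp);
    usort($all, fn($a, $b) => $b['score'] <=> $a['score']);
    return array_slice($all, 0, $limit);
}

$logp = logProbabilities('pen pen pen island island is is is is land land penis the the the the');
foreach (topSegmentations('penisland', $logp) as $s) {
    printf("%s %.2f\n", implode(' ', $s['words']), $s['score']);
}
```

Enumerating every split grows exponentially with the length of the name, so a real implementation would memoize the results per suffix. The scores also depend entirely on the corpus, so they will not reproduce the numbers listed above.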
