Algorithms that recognize a physical address on a web page

What are the best algorithms for recognizing structured data in an HTML page?

For example, Google recognizes a home or company address in an email and offers a map of that address.

+5
9 answers

A named-entity extraction framework such as GATE has at least tackled the problem of extracting location information, using a gazetteer of known place names to help resolve common cases. Unless the pages were machine-generated from a common source, you will find regular expressions a bit weak for the job.
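To make the idea of matching against a list of known place names concrete, here is a minimal sketch in plain Python. It is not GATE's API; the tiny `GAZETTEER` set and the `find_places` helper are made up for illustration, and a real list would hold many thousands of entries.

```python
import re

# Hypothetical mini-gazetteer (an assumption for this sketch only).
GAZETTEER = {"new york", "san francisco", "london", "paris"}
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def find_places(text):
    """Return (place, word_index) pairs for gazetteer entries found in text."""
    words = re.findall(r"[A-Za-z]+", text)
    hits = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n]).lower()
            if candidate in GAZETTEER:
                hits.append((candidate, i))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no entry starts at this word
    return hits
```

A lookup like this only finds places it already knows about, which is why it tends to be combined with other signals rather than used alone.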

+10

, , Beautiful Soup . , . adr microformat. , .
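For pages that do mark up addresses with the adr microformat, extraction can be fairly mechanical. The sketch below uses only the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs. The `AdrParser` class and `extract_adr` helper are hypothetical names, and the code assumes reasonably well-formed markup where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Field class names defined by the adr microformat.
ADR_FIELDS = {"post-office-box", "extended-address", "street-address",
              "locality", "region", "postal-code", "country-name"}
# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class AdrParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.addresses = []  # one dict per class="adr" element
        self._stack = []     # role of each open element: "adr", a field, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if "adr" in classes:
            role = "adr"
            self.addresses.append({})
        else:
            fields = classes & ADR_FIELDS
            role = min(fields) if fields and "adr" in self._stack else None
        self._stack.append(role)

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or "adr" not in self._stack:
            return
        # Credit the text to the innermost enclosing adr field, if any.
        for role in reversed(self._stack):
            if role == "adr":
                return  # text sits directly in the adr element, not in a field
            if role is not None:
                addr = self.addresses[-1]
                addr[role] = (addr.get(role, "") + " " + text).strip()
                return


def extract_adr(html):
    parser = AdrParser()
    parser.feed(html)
    return parser.addresses
```

Most pages carry no such markup, so this only covers the easy case.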

+4

, ; .

+3

, Google ( , , ). , , , , - . , , , , , . , , , .

, , , - , .

+3

. HTML, , Python. BeautifulSoup. HTML BeautifulSoup.

, , , HTML, , ..

+2

, , , . , , , , . , . -, , , URL-.

, , , http://metacpan.org/pod/Regexp::Common::URI::http

+1

, .

- , (), () Street | Boulevard | Main ..
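A pattern built around street-type keywords such as Street or Boulevard might look roughly like the sketch below. The list of street types, the capitalized-word heuristic, and the `find_street_addresses` name are all assumptions for illustration; the pattern targets simple English-style addresses only and will miss many real ones.

```python
import re

ADDRESS_RE = re.compile(
    r"\b(?P<number>\d{1,6})\s+"                # house number
    r"(?P<name>(?:[A-Z][A-Za-z]*\s+){1,3}?)"   # one to three capitalized words
    r"(?P<type>Street|Boulevard|Avenue|Road|Drive|Lane)\b"  # street type keyword
)


def find_street_addresses(text):
    """Return (number, street name, street type) tuples found in text."""
    return [
        (m["number"], m["name"].strip(), m["type"])
        for m in ADDRESS_RE.finditer(text)
    ]
```

Something like `find_street_addresses(page_text)` gives a list of candidates that still need checking, since a number followed by a capitalized word and "Street" is not always an address.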

Firefox, , , , .

0
  • .

regex . . NLP (NER) POS. , , NER.

  • If you need specific content, such as paragraphs, extract it by its HTML tags.
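As a rough illustration of pulling content out by tag, the sketch below collects the text of every `<p>` element using the standard library's `html.parser`. The `ParagraphParser` class and `extract_paragraphs` helper are hypothetical names; a library such as BeautifulSoup would do the same job with less code.

```python
from html.parser import HTMLParser


class ParagraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # text chunks of the <p> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()        # an unclosed <p> ends when the next one starts
            self._buffer = []
        elif tag == "br" and self._buffer is not None:
            self._buffer.append(" ")  # treat a line break as whitespace

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def extract_paragraphs(html):
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs
```

The resulting paragraph strings can then be handed to whatever address matching you use, which keeps navigation menus and scripts out of the input.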
0
