What are the best algorithms for recognizing structured data in an HTML page?
For example, Google detects a home or company address in an email and offers a map of that address.
A named-entity extraction framework such as GATE has at least tackled the information-extraction problem for locations, using a gazetteer of well-known places to help resolve common cases. Unless the pages were machine-generated from a common source, you will find regular expressions somewhat weak for the job.
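To make the gazetteer idea concrete, here is a minimal sketch of dictionary-based place lookup in Python. It is not how GATE implements it; the `GAZETTEER` contents and the `find_places` helper are made up for illustration, and a real gazetteer would hold many thousands of entries.

```python
import re

# Hypothetical, tiny gazetteer used only for this illustration.
GAZETTEER = {"new york", "san francisco", "london", "paris"}

# Longest entry length in words, so we know how wide a window to scan.
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def find_places(text):
    """Return (token_index, place) pairs, preferring the longest match at each position."""
    tokens = re.findall(r"[A-Za-z]+", text)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the widest window first so "san francisco" wins over a shorter entry.
        for width in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + width]).lower()
            if candidate in GAZETTEER:
                hits.append((i, candidate))
                i += width
                break
        else:
            i += 1
    return hits
```

A lookup like this only finds places it already knows, which is why such frameworks combine the gazetteer with further rules.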
[...] Beautiful Soup [...] adr microformat [...]
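When a page marks up its addresses with the adr microformat, no guessing is needed: the address parts carry class names such as `street-address`, `locality`, `region` and `postal-code` inside an element with class `adr`. The sketch below reads them with Python's standard `html.parser` so that it runs without extra packages; Beautiful Soup could do the same job. The `AdrParser` class and `extract_adr` function are hypothetical names for this example, and nested `adr` blocks are not handled.

```python
from html.parser import HTMLParser

# Property class names defined by the adr microformat.
ADR_FIELDS = {
    "post-office-box", "extended-address", "street-address",
    "locality", "region", "postal-code", "country-name",
}
# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class AdrParser(HTMLParser):
    """Collect adr properties from elements inside class="adr" containers."""

    def __init__(self):
        super().__init__()
        self.addresses = []  # one dict per adr container found
        self._open = []      # (tag, is_adr_root, field) for each open element

    def _inside_adr(self):
        return any(is_root for _, is_root, _ in self._open)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        is_root = "adr" in classes
        if is_root:
            self.addresses.append({})
        field = None
        if is_root or self._inside_adr():
            field = next((c for c in classes if c in ADR_FIELDS), None)
        self._open.append((tag, is_root, field))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to the innermost enclosing adr property, if any.
        for _, _, field in reversed(self._open):
            if field:
                current = self.addresses[-1]
                current[field] = current.get(field, "") + data
                break


def extract_adr(html):
    """Return a list of dicts, one per adr block, with whitespace normalised."""
    parser = AdrParser()
    parser.feed(html)
    parser.close()
    return [{k: " ".join(v.split()) for k, v in adr.items()}
            for adr in parser.addresses]
```

Text outside an `adr` container is ignored even if it happens to use one of the property class names.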
[...] HTML [...] Python [...] BeautifulSoup [...]
[...] http://metacpan.org/pod/Regexp::Common::URI::http
[...] Street | Boulevard | Main [...]
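A keyword alternation such as Street | Boulevard | Main can anchor a simple address regex. The Python sketch below is one possible reading, not a complete address grammar: it assumes a house number of up to five digits, then up to three capitalised words, then one of the three keywords. The names `ADDRESS_RE` and `find_addresses` are made up for this example.

```python
import re

# Assumed shape: house number, up to three capitalised words, then a keyword.
# Only the three keywords come from the text above; everything else is a guess.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){0,3}(?:Street|Boulevard|Main)\b"
)


def find_addresses(text):
    """Return every substring that looks like '<number> <Words> <keyword>'."""
    return ADDRESS_RE.findall(text)
```

A pattern like this misses abbreviations such as "St." or "Blvd" and matches anything that merely looks similar, which illustrates why regular expressions alone are weak for this job.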
http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
[...] regex [...] NLP (NER) [...] POS [...] NER [...]