Geolocation from free text

I am working on a project that I am not quite sure how to approach. The problem can be summarized as follows:

  • For arbitrary text (such as a report), determine which geographic location each part of the text refers to.

Geographic locations range from the state level down to the county level (all within the United States), so their number is limited, but each report usually references several locations. For example, the first 5 paragraphs of a report may be about the state as a whole, and the next 5 may relate to individual counties in that state, or something like that.

I am curious what would be the best way to approach such a problem, ideally with specific recommendations for NLP or ML frameworks (Python or Java)?

4 answers

I can actually help a bit here (my research is in the area of toponym resolution).

If I understand you correctly, you are looking for a way to (1) find place names in the text, (2) disambiguate each place name to the correct geographic referent, and (3) spatially ground whole sentences or paragraphs.

There are many open-source packages that can do #1: Stanford CoreNLP and OpenNLP, for example.
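As a toy illustration of what step #1 produces (the real packages use trained statistical NER models rather than dictionary lookup, and the gazetteer below is a made-up stand-in), a naive gazetteer match might look like this:

```python
import re

# Tiny hypothetical gazetteer; real systems draw on USGS/GeoNames
# and use trained named-entity recognizers instead of exact lookup.
GAZETTEER = {"Texas", "Travis County", "Harris County", "Austin"}

def find_toponyms(text):
    """Return (start, end, name) spans for gazetteer entries found in text."""
    # Try longer names first so "Travis County" wins over a bare "Travis".
    pattern = "|".join(re.escape(name) for name in sorted(GAZETTEER, key=len, reverse=True))
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

print(find_toponyms("The report covers Texas, with detail on Travis County and Harris County."))
```

The output spans are exactly what steps #2 and #3 then have to disambiguate and ground.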

There are a few packages that can do #1 and #2. CLAVIN is probably the only ready-to-use open-source option at the moment. Yahoo Placemaker costs money but can also do it.

There is really no package that can do #3. There is a newer project called TEXTGROUNDER that does something called "document geolocation", but although the code is available, it is not set up to run on your own input texts. I recommend looking at it only if you feel like starting, or contributing to, a project that tries to do something like this.

All three tasks are still areas of ongoing research and can become incredibly complex depending on the details of the source text. You have not given much detail about your texts, but I hope this information helps.


An old question, but it may be useful for others to know that Apache OpenNLP has an add-on called GeoEntityLinker. It accepts a document's text and sentences, extracts entities (toponyms), searches the USGS and GeoNames gazetteers (Lucene indexes), resolves (or at least tries to resolve) the toponyms in several ways, and returns gazetteer entries tied to each sentence of the submitted document. It will be released with OpenNLP 1.6 if all goes well.... There is not much documentation, if any, at the moment.

This is the OpenNLP Jira ticket: https://issues.apache.org/jira/browse/OPENNLP-579

And this is the source code:

http://svn.apache.org/viewvc/opennlp/addons/geoentitylinker-addon/

FYI: I am the main committer working on this.


Identifying references to geographic locations is fairly trivial using OpenNLP or GATE, etc. The real problem comes after that, when you have to disambiguate places that share the same name. For example, there are 29 places in the US called "Bristol". Which one is correct?

There are several approaches you can use for disambiguation. A simple one is to collect the list of all places mentioned in the text, look up the candidate longitude/latitude for each, and then search for the assignment that has the minimum sum of pairwise distances.
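A minimal sketch of that distance-minimization idea, with made-up candidate coordinates standing in for real gazetteer lookups (brute force over all combinations, so it only suits small candidate sets):

```python
import itertools
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def disambiguate(candidates):
    """candidates: one list per mention, each of (name, (lat, lon)) options.
    Returns the combination minimizing the sum of pairwise distances."""
    best, best_cost = None, float("inf")
    for combo in itertools.product(*candidates):
        coords = [c[1] for c in combo]
        cost = sum(haversine_km(p, q) for p, q in itertools.combinations(coords, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return best

# Hypothetical gazetteer results for two mentions in the same report:
candidates = [
    [("Bristol, TN", (36.60, -82.19)), ("Bristol, CT", (41.67, -72.95))],
    [("Knoxville, TN", (35.96, -83.92)), ("Knoxville, IA", (41.32, -93.10))],
]
print(disambiguate(candidates))  # the two Tennessee readings, which are closest together
```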

The best solution I've seen for disambiguation is to fetch the Wikipedia articles for all candidate places, index them in a text database such as Lucene, and then use your text as a query to find the most likely candidate by some similarity score. The idea is that besides the word "Bristol", the article also mentions the name of a river, a person, or something similar that your text shares.
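The same idea can be sketched without Lucene, using TF-IDF cosine similarity in scikit-learn; the two "article" strings below are invented stand-ins for the real Wikipedia pages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the Wikipedia articles of two "Bristol" candidates:
candidate_articles = {
    "Bristol, Tennessee": "Bristol is a city in Sullivan County, Tennessee, "
                          "on the Virginia border, known for the Bristol Motor Speedway.",
    "Bristol, Connecticut": "Bristol is a city in Hartford County, Connecticut, "
                            "home of ESPN and the New England Carousel Museum.",
}

names = list(candidate_articles)
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(candidate_articles.values())

def best_candidate(context):
    """Rank candidates by TF-IDF cosine similarity between their
    article text and the context surrounding the mention."""
    scores = cosine_similarity(vectorizer.transform([context]), doc_matrix)[0]
    return names[scores.argmax()]

print(best_candidate("The race at the Bristol speedway drew fans from across Tennessee."))
```

Words like "speedway" and "Tennessee" in the context pull the score toward the right article even though "Bristol" itself is ambiguous.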


To do this, you will need a labeled training set. You then train a classification model on that set and use the model to predict the location of new text fragments. You can see how it all fits together in this code example built on top of scikit-learn: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Labeled training set:

You train the classifier over a training set where each sample is a (paragraph, region_id) pair. The region_id can be the identifier of a country, region, or city.

Training the classification model:

You build a bag-of-words representation (e.g., unigrams) of each sample and train a classifier (e.g., logistic regression with L1 regularization) over the labeled training set. You can use any tool, but I recommend scikit-learn in Python, which is very simple and efficient to use.

Prediction:

After training, given a paragraph or fragment of text, the trained model predicts a region_id for it based on the words used in the sample.

Remember to tune the regularization parameter on a development set to get good results (and to prevent overfitting).
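The three steps above can be sketched in a few lines of scikit-learn; the four training paragraphs and the region_ids "CO"/"TX" are invented for illustration, and a real training set would be far larger:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical labeled training set: (paragraph, region_id) pairs.
train_texts = [
    "Snowfall closed several passes in the Rockies near Denver.",
    "Ski resorts around Aspen and Denver reported record snow.",
    "Hurricane warnings were issued along the Gulf coast near Houston.",
    "Flooding near Houston forced evacuations along the bayou.",
]
train_labels = ["CO", "CO", "TX", "TX"]

# Bag-of-words unigrams + L1-regularized logistic regression.
# C controls regularization strength and should be tuned on a dev set.
model = make_pipeline(
    CountVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
model.fit(train_texts, train_labels)

print(model.predict(["Heavy snow blanketed Denver overnight."]))
```

Swapping CountVectorizer for TfidfVectorizer, or tuning C via GridSearchCV, are the obvious next steps once you have real data.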

See my paper on geolocation using text: http://www.aclweb.org/anthology/N15-1153

and the related poster: http://www.slideshare.net/AfshinRahimi2/geolocation-twittertextnetwork-48968497

I also wrote a tool called Pigeo that does just this and ships with a pretrained model. Beyond these works, there are many other research papers on text-based geolocation that you can find.

