Extract business titles and time periods from a string

Question

I am retrieving some company information from Reuters using Python. I was able to get the names of officers/executives, their biographies, and their compensation from this page.

Now I want to extract previous positions and companies from a biography section that looks something like this:

Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc. since May 2008. From 2007 to 2008, he was Executive Vice President and Chief Financial Officer of Keystone Automotive Operations, Inc. distributor of automotive accessories and equipment. Prior to Keystone, Mr. Grimes held a series of senior corporate and divisional financial roles at Brown-Forman Corporation, a producer and marketer of premium wines and spirits. While at Brown Forman, Mr. Grimes was Vice President, Director of Beverage Finance from 2006 to 2007; Vice President, Director of Corporate Planning and Analysis from 2003 to 2006; and senior vice president, chief financial officer, Brown-Forman Spirits America from 1999 to 2003.

I can use a simple regular expression to get the from and to years, but I don't understand how to write a regular expression to get the title and the name of the company. I know that the string format is inconsistent, so I would accept an answer that works in at least 70% of cases. Here's the output I need:

2007-2008, executive vice president and chief financial officer, Keystone Automotive operations
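
For context, the year-range part could look like the following minimal sketch. The name `YEAR_RANGE` and the exact pattern are assumptions for illustration; it only covers the "from YYYY to YYYY" phrasing seen in the biography above.

```python
import re

# Matches "from 2007 to 2008" (case-insensitive) and captures both years.
YEAR_RANGE = re.compile(
    r"\bfrom\s+((?:19|20)\d{2})\s+to\s+((?:19|20)\d{2})\b",
    re.IGNORECASE,
)

bio = (
    "From 2007 to 2008, he was Executive Vice President and Chief Financial "
    "Officer of Keystone Automotive Operations, Inc. While at Brown Forman, "
    "Mr. Grimes was Vice President, Director of Beverage Finance from 2006 to 2007."
)

# Join each captured (start, end) pair into the "YYYY-YYYY" form wanted in the output.
periods = [f"{start}-{end}" for start, end in YEAR_RANGE.findall(bio)]
print(periods)
```

Phrasings such as "since May 2008" are not covered and would need their own pattern.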

+2

python regex nlp

karlos Oct 13 '11 at 16:50

2 answers

I don't think there is a single regular expression you can use for this, at least not one that isn't hideous. I think the solution could be natural language processing. There are packages for this, but using them may not be easy.

Essentially, you want to take a sentence like "X is/was Y" and work out which part is the name, which part is the list of job titles, and which parts are irrelevant. Perhaps find sequences of words that are either capitalized or small words like "and" and "of"?

 (?:[A-Z]\w+)(?: (?:[A-Z]\w*|of|and))* #Note the space

[A-Z] means that the first character of each word (the start of the \w+ group) must be uppercase. I have not tested it, but it looks like it should work. This may be a nontrivial problem.
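
As a rough illustration of the capitalized-sequence idea in Python, here is a minimal sketch. Python's re module has no "next character is uppercase" escape, so the sketch uses the character class [A-Z] instead. The names `CAP_RUN` and `title_and_company`, and the rule of splitting on the last " of ", are assumptions made for this example, not a tested general solution.

```python
import re

# A capitalized word followed by any number of further capitalized words or
# the connectors "of"/"and", each preceded by a single space.
CAP_RUN = re.compile(r"[A-Z]\w*(?: (?:[A-Z]\w*|(?:of|and)\b))*")


def title_and_company(sentence):
    """Return (title, company) from the first capitalized run containing ' of '."""
    for run in CAP_RUN.findall(sentence):
        # Assumption: the last " of " separates the job title from the company.
        title, sep, company = run.rpartition(" of ")
        if sep:
            return title, company
    return None


sentence = (
    "From 2007 to 2008, he was Executive Vice President and Chief "
    "Financial Officer of Keystone Automotive Operations, Inc."
)
result = title_and_company(sentence)
print(result)
# ('Executive Vice President and Chief Financial Officer', 'Keystone Automotive Operations')
```

Commas end a run, so a comma-separated list of titles such as "Senior Vice President, Chief Financial Officer and Treasurer of ..." yields only the last title in the list. That fits the "works most of the time" goal, but not every sentence in the biography.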

+1

andronikus Oct 13 '11 at 19:27

bdk · Accepted Answer · 2011-10-14T02:04:51+0000

The problem you are trying to solve is well known and well researched, and you will find a large volume of research papers describing approaches and algorithms if you search Google for the terms "named entity extraction" and "relation extraction." Good starting points are:

Chapter 7 of the book "Natural Language Processing with Python"; in fact, the whole book is likely to be useful. The chapter is online here.
This paper on "Named relationship of entity relationships using Wikipedia."
This paper, "New Algorithms for Developing Relationships," which describes mining names and organizations as an example.

These are just a few of the links I found interesting. There are a ton more, probably better than these, but this should get you started.
