Given an input sentence with BIO chunk tags:
[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
I need to extract the relevant phrases. For example, to extract 'NP' chunks, I need to collect the runs of tuples tagged B-NP and I-NP.
[out]:
[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
(Note: the numbers in the extracted tuples are the token indices.)
I tried to extract it using the following code:
def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-' + chunk_type in pos:
            # Append the word to the current_chunk.
            current_chunk.append(word)
            current_chunk_position.append(idx)
        else:
            if current_chunk:
                # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk)
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str
                current_chunk = []
                current_chunk_position = []
    if current_chunk:
        # Flush the last chunk (indices are ints, so convert before joining).
        yield ' '.join(current_chunk), '-'.join(map(str, current_chunk_position))

tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print(list(extract_chunks(tagged_sent, chunk_type='NP')))
But when two chunks of the same type are adjacent:
tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
print(list(extract_chunks(tagged_sent, chunk_type='NP')))
It outputs this:
[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]
Instead of the desired:
[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]
How can this be fixed in the code above?
Apart from fixing the code above, is there a better way to extract the desired chunks of a particular chunk_type?
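For reference, here is a minimal sketch of one direction I am considering: treat every B- tag as the start of a new chunk instead of only checking whether the tag contains the chunk type. The name extract_chunks_bio is hypothetical, and treating a stray I- tag with no open chunk as a chunk start is my own assumption about malformed input, not something the tag scheme requires.

```python
def extract_chunks_bio(tagged_sent, chunk_type):
    """Yield (chunk_text, token_indices) for each chunk of chunk_type."""
    begin_tag = 'B-' + chunk_type
    inside_tag = 'I-' + chunk_type
    words, positions = [], []
    for idx, (word, tag) in enumerate(tagged_sent):
        if tag == inside_tag and words:
            # Continue the chunk that is currently open.
            words.append(word)
            positions.append(idx)
            continue
        if words:
            # Any other tag, including a new B- tag, closes the open chunk.
            yield ' '.join(words), '-'.join(map(str, positions))
            words, positions = [], []
        if tag in (begin_tag, inside_tag):
            # B- opens a new chunk; a stray I- is also treated as a start
            # (assumption for malformed input).
            words, positions = [word], [idx]
    if words:
        # Flush the last chunk at the end of the sentence.
        yield ' '.join(words), '-'.join(map(str, positions))


adjacent_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'),
                 ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'),
                 ('ate', 'B-VP'), ('ramen', 'B-NP')]
print(list(extract_chunks_bio(adjacent_sent, chunk_type='NP')))
```

On the adjacent-chunk sentence this should split at 'Managing' because its B-NP tag closes the previous chunk before opening a new one.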