Here is a functional alternative written in Python 3.5. I simplified your example to take only 5 words on each side. There are other simplifications in how unwanted values are filtered, but adapting them to your data requires only minor changes. I use the fn package from PyPI to make the functional code more natural to read.
    from typing import List, Tuple
    from itertools import groupby, filterfalse
    from fn import F
First we need to extract the column:
    def getcol3(line: str) -> str:
        # return the third tab-separated column
        return line.split("\t")[2]
Then we need to break the lines into blocks separated by a predicate:
    TARGET_WORDS = {"target1", "target2"}

    # this is our predicate
    def istarget(word: str) -> bool:
        return word in TARGET_WORDS
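To see what "blocks separated by a predicate" means in practice, here is a minimal sketch of how `itertools.groupby` behaves when the key function is a boolean predicate. The `sample` list is made-up data used only for illustration.

```python
from itertools import groupby

TARGET_WORDS = {"target1", "target2"}

def istarget(word: str) -> bool:
    return word in TARGET_WORDS

# Hypothetical sample input, already reduced to one word per item.
sample = ["a", "b", "target1", "c", "target2", "d", "e"]

# groupby yields (key, run) pairs: key is the predicate's boolean result,
# and run is a lazy iterator over consecutive items sharing that key.
blocks = [(key, list(run)) for key, run in groupby(sample, istarget)]

print(blocks)
# [(False, ['a', 'b']), (True, ['target1']), (False, ['c']),
#  (True, ['target2']), (False, ['d', 'e'])]
```

Note that the key is `True` or `False`, not the word itself, and that two adjacent targets would end up in the same run.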
Next, define a filter for junk values and a function that returns the first and last 5 words of a block:
    def isjunk(word: str) -> bool:
        return word == "(unknown)"

    def first_and_last(words: List[str]) -> Tuple[List[str], List[str]]:
        first = words[:5]
        last = words[-5:]
        return first, last
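As a quick illustration of how the slicing behaves, the sketch below calls `first_and_last` on blocks of different lengths. The input lists are invented for the example. When a block has fewer than 5 words, both slices return the whole block, so the two halves overlap.

```python
from typing import List, Tuple

def first_and_last(words: List[str]) -> Tuple[List[str], List[str]]:
    first = words[:5]
    last = words[-5:]
    return first, last

# Hypothetical blocks: one longer than 10 words, one shorter than 5.
long_block = ["w" + str(i) for i in range(12)]
short_block = ["a", "b", "c"]

long_first, long_last = first_and_last(long_block)
short_first, short_last = first_and_last(short_block)

print(long_first)   # ['w0', 'w1', 'w2', 'w3', 'w4']
print(long_last)    # ['w7', 'w8', 'w9', 'w10', 'w11']
print(short_first)  # ['a', 'b', 'c']
print(short_last)   # ['a', 'b', 'c']
```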
Now let's get the groups:
    # lines is any iterable of raw text lines, such as an open file
    words = (F()
             >> (map, str.strip)
             >> (filter, bool)
             >> (map, getcol3)
             >> (filterfalse, isjunk))(lines)
    groups = groupby(words, istarget)
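If you would rather not depend on fn, the same pipeline can be sketched with the standard library alone. This is only an equivalent under my reading of the chain above; `extract_words` and `sample_lines` are names I made up for the example, and `getcol3` and `isjunk` are repeated so the snippet runs on its own.

```python
from itertools import filterfalse
from typing import Iterable, Iterator

def getcol3(line: str) -> str:
    return line.split("\t")[2]

def isjunk(word: str) -> bool:
    return word == "(unknown)"

def extract_words(lines: Iterable[str]) -> Iterator[str]:
    # Each step is lazy, so nothing is read until the result is consumed.
    stripped = map(str.strip, lines)     # drop surrounding whitespace
    nonblank = filter(bool, stripped)    # skip empty lines
    column = map(getcol3, nonblank)      # keep the third column
    return filterfalse(isjunk, column)   # discard junk values

# Hypothetical input lines, including a blank line and a junk value.
sample_lines = ["_\t_\tword\n", "\n", "_\t_\t(unknown)\n", "_\t_\ttarget1\n"]
result = list(extract_words(sample_lines))
print(result)  # ['word', 'target1']
```

The steps run in the same order as in the `F()` chain, which reads left to right.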
Now process the groups:
    from typing import Iterable

    # groupby yields (key, run) pairs: key is the boolean returned by
    # istarget, and run is a lazy iterator over consecutive words.
    Group = Tuple[bool, Iterable[str]]

    def is_target_group(group: Group) -> bool:
        # the key already says whether this run consists of target words
        return group[0]

    def unpack_word_group(group: Group) -> List[str]:
        return [*group[1]]

    def unpack_target_group(group: Group) -> List[str]:
        return [*group[1]]

    def process_group(group: Group):
        return (unpack_target_group(group)
                if is_target_group(group)
                else first_and_last(unpack_word_group(group)))
And the last step:
    words = list(map(process_group, groups))
PS
This is my test case:
    from io import StringIO

    buffer = """
    _\t_\tword
    _\t_\tword
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\ttarget1
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\tword
    _\t_\ttarget2
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\ttarget1
    _\t_\tword
    _\t_\t(unknown)
    _\t_\tword
    _\t_\tword
    _\t_\tword
    """
    lines = StringIO(buffer)
Given this input, you will get this result:
    [(['word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word']),
     ['target1'],
     (['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word']),
     ['target2'],
     (['word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word']),
     ['target1'],
     (['word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word'])]
From this result you can drop the first 5 words of the first group and the last 5 words of the last group.