Python - joining two lines that overlap

Question

I am trying to create a full address, but the data that I have is presented as:

Line 1 | Line 2 | Postcode
1, First Street, City, X13
1, First Street First Street, City X13
1 1, First Street, City, X13 X13

There are several other permutations on how this data is created, but I want to be able to combine all this into one line where there is no overlap. So I want to create a line:
1, First Street, City, X13

But not 1, First Street, First Street, City, X13, etc.

How can I concatenate or combine them without duplicating data already there? There are also some cells, for example, on the top line, where there is no information preceding the first cell.

+7

python

Abi Dec 10 '15 at 10:19

2 answers

If you don't mind losing punctuation:

    from collections import OrderedDict
    from string import punctuation

    with open("test.txt") as f:
        next(f)  # skip the header row
        words = (word.strip(punctuation) for line in f for word in line.split())
        # OrderedDict.fromkeys keeps the first occurrence of each word, in order
        print(" ".join(OrderedDict.fromkeys(words)))

    1 First Street City X13

This approach breaks if an address legitimately contains the same word twice, but from your sample input there is no way to tell which combinations can occur. If the second line is always intact, you just need to pull the second line.
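The word-level approach above can also be tried without a file. This is only a minimal sketch: it assumes Python 3.7+ (where a plain dict preserves insertion order), and `dedupe_words` is a hypothetical helper name, not something from the answer. The second call illustrates the duplicate-word limitation.

```python
from string import punctuation


def dedupe_words(*parts):
    """Join address parts, keeping only the first occurrence of each word.

    `dedupe_words` is a hypothetical helper name used for illustration.
    """
    words = (
        word.strip(punctuation)
        for part in parts
        for word in part.split()
    )
    # dict.fromkeys drops repeats while keeping first-seen order (Python 3.7+);
    # the filter discards tokens that were pure punctuation.
    return " ".join(w for w in dict.fromkeys(words) if w)


# Overlapping cells collapse into one clean address.
print(dedupe_words("1, First Street", "First Street, City", "X13"))

# Limitation: the second, legitimate "New" is dropped as well.
print(dedupe_words("2, New Street", "New York", "X13"))
```

The first call prints `1 First Street City X13`; the second prints `2 New Street York X13`, which is why this only works when no word is supposed to appear twice.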

+2

Padraic Cunningham Dec 10 '15 at 10:47

Kasramvd · Accepted Answer · 2015-12-10T10:33:53+0000

If you have plain text, you can split it on \n to get a list of lines, then split each line on , to get the separate fields:

    >>> s = """1, First Street, City, X13
    ... 1, First Street First Street, City, X13
    ... 1 1, First Street, City, X13 X13"""
    >>>
    >>> lines = s.split('\n')
    >>>
    >>> splitted_lines = [line.split(',') for line in lines]

As a more Pythonic alternative, you can use the csv module to read the text, specifying a comma as the delimiter:

    import csv

    with open('file_name') as f:
        # Read all rows into a list before the file is closed
        splitted_lines = list(csv.reader(f, delimiter=','))
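For a version that runs without a file on disk, here is a small sketch that feeds the sample rows through `csv.reader` via `io.StringIO`. The `skipinitialspace=True` option is an addition not used in the answer; it is assumed to be wanted here because every comma in the sample is followed by a space.

```python
import csv
import io

# Sample rows from the answer; io.StringIO stands in for a real file.
sample = (
    "1, First Street, City, X13\n"
    "1, First Street First Street, City, X13\n"
    "1 1, First Street, City, X13 X13\n"
)

with io.StringIO(sample) as f:
    # skipinitialspace=True ignores the blank that follows each comma.
    # list() reads every row while the file object is still open.
    rows = list(csv.reader(f, delimiter=",", skipinitialspace=True))

for row in rows:
    print(row)
```

Wrapping the reader in `list()` matters because `csv.reader` is lazy: iterating it after the `with` block has closed the file would fail.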

You can then use the following list comprehension to get unique fields in each column:

    >>> import re
    >>> ' '.join([set([set(re.split(r'\s{2,}',i)).pop() for i in column]).pop() for column in zip(*splitted_lines)])
    '1 First Street City'

Here zip() yields the columns, re.split() with the regex r'\s{2,}' splits each element on runs of two or more whitespace characters, and set() keeps only the unique items.
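The same steps can be written out one at a time, which may be easier to follow than the one-liner. This sketch assumes, as the regex does, that repeated values inside a field are separated by two or more spaces; the sample rows below are made up to satisfy that assumption and have already been stripped of leading blanks.

```python
import re

# Hypothetical, pre-split rows; duplicates inside a field are
# separated by two spaces so that r"\s{2,}" can split them.
rows = [
    ["1", "First Street", "City", "X13"],
    ["1", "First Street  First Street", "City", "X13"],
    ["1  1", "First Street", "City", "X13  X13"],
]

# zip(*rows) transposes the rows, so each tuple holds one column.
columns = list(zip(*rows))

unique_per_column = []
for column in columns:
    values = set()
    for field in column:
        # Split on runs of two or more whitespace characters.
        values.update(re.split(r"\s{2,}", field.strip()))
    unique_per_column.append(values)

print(columns[1])
print(unique_per_column)
```

Note that a single space, as in "First Street", is not a split point, so multi-word values stay intact.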

Note: if you care about ordering, you can use collections.OrderedDict instead of set:

    >>> from collections import OrderedDict
    >>>
    >>> d = OrderedDict()
    >>> # list(...)[0] works on both Python 2 and 3; .keys()[0] is Python 2 only
    >>> ' '.join([list(d.fromkeys([set(re.split(r'\s{2,}',i)).pop() for i in column]))[0] for column in zip(*splitted_lines)])
    '1 First Street City X13'
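On Python 3.7+ a plain dict preserves insertion order, so an order-preserving variant does not need OrderedDict. The sketch below is an illustration under the same two-or-more-spaces assumption, with made-up sample rows and a hypothetical helper name `ordered_unique`; it is not the answer's exact method.

```python
import re


def ordered_unique(column):
    """Return the distinct values of a column in first-seen order.

    `ordered_unique` is a hypothetical helper name used for illustration.
    """
    pieces = (
        piece
        for field in column
        for piece in re.split(r"\s{2,}", field.strip())
    )
    # dict.fromkeys removes repeats and keeps insertion order (Python 3.7+).
    return list(dict.fromkeys(pieces))


# Hypothetical rows; duplicates inside a field are separated by two spaces.
rows = [
    ["1", "First Street", "City", "X13"],
    ["1", "First Street  First Street", "City", "X13"],
    ["1  1", "First Street", "City", "X13  X13"],
]

# Join the distinct values of each column, then join the columns with commas.
address = ", ".join(" ".join(ordered_unique(col)) for col in zip(*rows))
print(address)
```

For these rows the result is `1, First Street, City, X13`, the form asked for in the question.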
