Splitting lines on '"' from an input file in Python

Question


I have a number of input files, such as:

    chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";
    chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";
    chr1 hg19_refFlat exon 52870219 52870551 0.000000 + . gene_id "PRPF38A"; transcript_id "PRPF38A";
    chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
    chr1 hg19_refFlat exon 11839859 11840067 0.000000 + . gene_id "C1orf167"; transcript_id "C1orf167";
    chr1 hg19_refFlat exon 29037032 29037154 0.000000 + . gene_id "GMEB1"; transcript_id "GMEB1";
    chr1 hg19_refFlat exon 103356007 103356060 0.000000 - . gene_id "COL11A1"; transcript_id "COL11A1";

I am trying to capture 2 elements from each line: the first is the number that follows exon, and the second is the gene name (a combination of letters and digits surrounded by double quotes, for example "KDM4A"). My code:

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        genes = set([line.split('"')[1] for line in r])
    print len(start)
    print len(genes)

For some reason, start is populated correctly, but genes does not capture anything. Here is the output:

    48050
    0

I believe this is due to the double quotes surrounding the gene name, but if I try it in the terminal, it works fine:

    >>> x = 'A b P "G" m'
    >>> x
    'A b P "G" m'
    >>> x.split('"')[1]
    'G'
    >>>

Any solutions would be highly appreciated, even if it's a completely different way to capture the 2 data items from each line. Thanks.

+5

python split

user3062260 Sep 16 '15 at 12:14


5 answers

You can use a regex.

    import re

    with open(infile) as f:
        start = []
        genes = []
        for line in f:
            # group 1: the number after "exon"; group 2: the quoted gene_id value
            st, gen = re.search(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"', line).groups()
            start.append(st)
            genes.append(gen)
    print set(start)
    print set(genes)


+4

Avinash raj Sep 16 '15 at 12:20


You can load all the lines into a list, and then do the splitting on each element of that list (I'm not sure how efficient this is if the file is long):

    with open(infile) as r:
        lines = [line for line in r]
    start = set([line.strip().split()[3] for line in lines])
    genes = set([line.split('"')[1] for line in lines])

+2

tom Sep 16 '15 at 12:26


Using shlex (which splits a line the way a shell splits arguments) takes care of repeated spaces and of the quotes.
I'm not sure if it's faster, but it is safe and neat.

    import shlex

    with open(infile, 'r') as f:
        for line in f:
            # drop the semicolons so they don't stick to the unquoted values
            parts = shlex.split(line.replace(';', ''))
            print parts[3], parts[9]
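To make the indexing concrete, here is a minimal sketch of what shlex.split returns for one of the sample lines from the question. It is written with Python 3 print syntax (unlike the Python 2 code in this thread), and the variable names are only for illustration.

```python
import shlex

# One sample line copied from the question.
line = ('chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . '
        'gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";')

# Remove the semicolons, then split like a shell would: whitespace
# separates tokens and the surrounding double quotes are stripped.
parts = shlex.split(line.replace(';', ''))

# Index 3 is the exon start, 9 the gene_id value, 11 the transcript_id value.
print(parts[3], parts[9], parts[11])
```

Note that the indices depend on every line having the same number of fields before gene_id, which holds for the sample shown but is an assumption about the rest of the file.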

+2

saeedgnu Sep 16 '15 at 12:27


The reason genes came up empty is that you would need to start reading the file from the beginning again before the second loop. The following approach, which reads each line only once, should work:

    import re

    start = set()
    genes = set()

    with open('input.txt', 'r') as f_input:
        for line in f_input:
            # the greedy .* means the last quoted value on the line is captured
            s, g = re.match(r'(?:.*?\s+){3}(\d+).*"(\w+)"', line).groups()
            start.add(s)
            genes.add(g)

    print start
    print genes

Providing output:

    set(['44160380', '29037032', '103356007', '19563636', '53373540', '52870219', '11839859'])
    set(['COL11A1', 'PRPF38A', 'KDM4A', 'C1orf167', 'EMC1', 'GMEB1', 'ECHDC2_dup2'])
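Note that the output contains ECHDC2_dup2, which is the transcript_id rather than the gene_id of that line: the greedy .* in the pattern runs on to the last quoted value. The sketch below compares that pattern with a variant anchored on gene_id. The second pattern is my own assumption about what is wanted, not part of the answer, and the code uses Python 3 print syntax.

```python
import re

# The one sample line from the question where gene_id and transcript_id differ.
line = ('chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . '
        'gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";')

# Pattern from the answer: greedy .* captures the LAST quoted value.
last_quoted = re.match(r'(?:.*?\s+){3}(\d+).*"(\w+)"', line).groups()

# Hypothetical variant: skip three fields, then look for the gene_id key.
by_gene_id = re.match(r'(?:\S+\s+){3}(\d+).*?gene_id\s+"([^"]*)"', line).groups()

print(last_quoted)
print(by_gene_id)
```

For lines where both identifiers are the same, the two patterns give the same result, so the difference only shows up on entries such as the _dup2 one.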

+2

Martin evans Sep 16 '15 at 12:32


The6thSense · Accepted Answer · 2015-09-16T12:17:49+0000

This is because your file object is exhausted. You loop over it once here, start = set([line.strip().split()[3] for line in r]), and then you try to loop over it again here, genes = set([line.split('"')[1] for line in r]), but by then the file object has no lines left to yield.
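The effect is easy to reproduce without a real file. The following is a small sketch that uses io.StringIO as a stand-in for the open file (an assumption made only to keep the example self-contained) and Python 3 print syntax; the two lines are copied from the question.

```python
import io

# Stand-in for an open file object.
sample = (
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n'
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n'
)
r = io.StringIO(sample)

first_pass = [line for line in r]   # consumes every line
second_pass = [line for line in r]  # the read position is at the end, so nothing is left

print(len(first_pass), len(second_pass))
```

The second list is empty for the same reason genes was empty in the question: the read position is already at the end of the data.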

Solution:

You can seek back to the beginning of the file (this is one possible solution).

Modification of your code:

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        r.seek(0, 0)  # rewind to the start of the file before the second pass
        genes = set([line.split('"')[1] for line in r])
    print len(start)
    print len(genes)
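Seeking works, but it reads the file twice. If that matters, one possible alternative is to fill both sets in a single loop. This is only a sketch: the helper name collect_starts_and_genes is hypothetical, it assumes every non-blank line has the layout shown in the question, and it uses Python 3 print syntax.

```python
def collect_starts_and_genes(lines):
    """Collect exon start positions and gene names in one pass over lines."""
    start = set()
    genes = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        start.add(line.split()[3])      # fourth whitespace-separated field
        genes.add(line.split('"')[1])   # text inside the first pair of quotes
    return start, genes

# Two sample lines copied from the question.
sample = [
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n',
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n',
]
start, genes = collect_starts_and_genes(sample)
print(len(start), len(genes))
```

With a real file you would pass the open file object instead of the list, for example inside with open(infile) as r: and then start, genes = collect_starts_and_genes(r).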
