Splitting lines on '"' from an input file in Python

I have a number of input files, such as:

    chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";
    chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";
    chr1 hg19_refFlat exon 52870219 52870551 0.000000 + . gene_id "PRPF38A"; transcript_id "PRPF38A";
    chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
    chr1 hg19_refFlat exon 11839859 11840067 0.000000 + . gene_id "C1orf167"; transcript_id "C1orf167";
    chr1 hg19_refFlat exon 29037032 29037154 0.000000 + . gene_id "GMEB1"; transcript_id "GMEB1";
    chr1 hg19_refFlat exon 103356007 103356060 0.000000 - . gene_id "COL11A1"; transcript_id "COL11A1";

I am trying to capture two elements from each line: the first is the number that follows exon, the second is the gene name (a combination of letters and digits surrounded by double quotes, for example "KDM4A"). My code:

    with open(infile, 'r') as r:
        start = set([line.strip().split()[3] for line in r])
        genes = set([line.split('"')[1] for line in r])
    print(len(start))
    print(len(genes))

For some reason, start is populated fine, but genes does not capture anything. Here is the output:

    48050
    0

I believe this is due to the double quotes surrounding the gene name, but if I try the same split in the interpreter, it works fine:

    >>> x = 'A b P "G" m'
    >>> x
    'A b P "G" m'
    >>> x.split('"')[1]
    'G'
    >>>

Any solutions would be highly appreciated, even if it is a completely different way to capture the two data items from each line. Thanks.

+5
5 answers

This happens because the file object is exhausted after you loop over it once. The first comprehension, start = set([line.strip().split()[3] for line in r]), reads the file to the end, so the second one, genes = set([line.split('"')[1] for line in r]), iterates over an already exhausted file object and receives no lines.
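The exhaustion can be reproduced without a real file. A minimal sketch, assuming io.StringIO as a stand-in for the opened file (the two sample lines are copied from the question; the variable names are only illustrative):

```python
import io

# io.StringIO is used here as a stand-in for the opened file object.
sample = io.StringIO(
    u'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n'
    u'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n'
)

# The first comprehension consumes every line of the stream.
first_pass = [line.split()[3] for line in sample]

# The second comprehension starts at the end of the stream, so it sees nothing.
second_pass = [line.split('"')[1] for line in sample]

print(first_pass)
print(second_pass)
```

The second list is empty for the same reason genes is empty in the question: the read position is already at the end of the stream.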

Solution:

You can seek back to the beginning of the file before the second loop (this is one solution).

Modification of your code:

    with open(infile, 'r') as r:
        start = set([line.strip().split()[3] for line in r])
        r.seek(0, 0)  # rewind to the start of the file before the second pass
        genes = set([line.split('"')[1] for line in r])
    print(len(start))
    print(len(genes))
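Another option, not part of the original answer, is to avoid the second pass entirely and fill both sets in one loop. A rough sketch, where collect_fields is a hypothetical helper name and the skipping of malformed lines is an assumption about the desired behaviour:

```python
def collect_fields(lines):
    """Return (starts, genes) gathered in a single pass over an iterable of lines."""
    starts = set()
    genes = set()
    for line in lines:
        fields = line.split()
        # Assumption: lines with fewer than 4 fields or no quote are skipped.
        if len(fields) < 4 or '"' not in line:
            continue
        starts.add(fields[3])           # number after "exon"
        genes.add(line.split('"')[1])   # first quoted value (gene_id)
    return starts, genes


sample_lines = [
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";\n',
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";\n',
    '\n',
]
starts, genes = collect_fields(sample_lines)
print(len(starts))
print(len(genes))
```

With a real file, the open file object can be passed directly, for example collect_fields(r) inside the with block, since the helper only needs something it can iterate over once.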
+8

You can use regex.

    import re

    with open(infile) as f:
        start = []
        genes = []
        for line in f:
            # group 1: the number after "exon"; group 2: the quoted gene_id value
            st, gen = re.search(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"', line).groups()
            start.append(st)
            genes.append(gen)
    print(set(start))
    print(set(genes))
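Note that re.search returns None for a line that does not match, in which case calling .groups() raises AttributeError. A hedged sketch of a guarded variant, where parse_line and PATTERN are hypothetical names and returning None for unmatched lines is an assumption about how such lines should be handled:

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"')


def parse_line(line):
    """Return (start, gene) for a matching line, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.groups()


good = parse_line(
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";'
)
bad = parse_line('# a header or comment line')
print(good)
print(bad)
```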


+4

You can load all the lines into a list and then split each element of that list (not sure how efficient this is if the file is long).

    with open(infile) as r:
        lines = [line for line in r]  # read every line into memory once
        start = set([line.strip().split()[3] for line in lines])
        genes = set([line.split('"')[1] for line in lines])
+2

Using shlex (which splits the line the way a shell splits arguments) takes care of repeated spaces and of the quotes.
Not sure if it's faster, but it is safe and neat.

    import shlex

    with open(infile, 'r') as f:
        for line in f:
            # drop the semicolons, then split like a shell would (quotes are removed)
            parts = shlex.split(line.replace(';', ''))
            print("{} {}".format(parts[3], parts[9]))
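To see where indices 3 and 9 come from, it helps to print the tokens for one line. A small sketch using a sample line from the question (the variable names are illustrative):

```python
import shlex

line = ('chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . '
        'gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";')

# Semicolons are removed first; shlex.split then strips the double quotes.
parts = shlex.split(line.replace(';', ''))

for index, token in enumerate(parts):
    print("{} {}".format(index, token))
```

Index 3 is the exon start, index 9 is the gene_id value and index 11 is the transcript_id value. One caveat: replace(';', '') would also delete a semicolon that appears inside a quoted value, so this relies on the values not containing one.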
+2

genes ends up empty because the first comprehension already consumed the file; to read it a second time you would need to start again from the beginning. The following approach avoids that by extracting both values in a single pass:

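    import re

    start = set()
    genes = set()

    with open('input.txt', 'r') as f_input:
        for line in f_input:
            # Skip three whitespace-separated fields, then capture the number.
            # Note: the greedy .* makes the second group match the LAST quoted
            # value on the line (transcript_id); use .*? to get gene_id instead.
            s, g = re.match(r'(?:.*?\s+){3}(\d+).*"(\w+)"', line).groups()
            start.add(s)
            genes.add(g)

    print(start)
    print(genes)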

For the sample data, this gives the following output:

    set(['44160380', '29037032', '103356007', '19563636', '53373540', '52870219', '11839859'])
    set(['COL11A1', 'PRPF38A', 'KDM4A', 'C1orf167', 'EMC1', 'GMEB1', 'ECHDC2_dup2'])
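The output contains ECHDC2_dup2 rather than ECHDC2 because the greedy .* in the pattern runs to the end of the line and backtracks to the last quoted value, which is the transcript_id. If the gene_id is what is wanted, a lazy .*? stops at the first quoted value instead. A short comparison on one sample line (the variable names are illustrative):

```python
import re

line = ('chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . '
        'gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";')

# Greedy .* : the second group is the last quoted value (transcript_id).
greedy = re.match(r'(?:.*?\s+){3}(\d+).*"(\w+)"', line).groups()

# Lazy .*? : the second group is the first quoted value (gene_id).
lazy = re.match(r'(?:.*?\s+){3}(\d+).*?"(\w+)"', line).groups()

print(greedy)
print(lazy)
```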
+2
