I have a number of input files, such as:
chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . gene_id "KDM4A"; transcript_id "KDM4A";
chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . gene_id "EMC1"; transcript_id "EMC1";
chr1 hg19_refFlat exon 52870219 52870551 0.000000 + . gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1 hg19_refFlat exon 53373540 53373626 0.000000 - . gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1 hg19_refFlat exon 11839859 11840067 0.000000 + . gene_id "C1orf167"; transcript_id "C1orf167";
chr1 hg19_refFlat exon 29037032 29037154 0.000000 + . gene_id "GMEB1"; transcript_id "GMEB1";
chr1 hg19_refFlat exon 103356007 103356060 0.000000 - . gene_id "COL11A1"; transcript_id "COL11A1";
I am trying to capture two elements from each line: the first is the number that follows the word exon, and the second is the gene name (a combination of letters and digits surrounded by double quotes, for example "KDM4A"). My code:
with open(infile,'r') as r:
    start = set([line.strip().split()[3] for line in r])
    genes = set([line.split('"')[1] for line in r])
print len(start)
print len(genes)
For some reason, start is populated correctly, but genes does not capture anything. Here is the output:
48050
0
I believe this is due to the double quotes surrounding the gene name, but when I try the same split in the interactive interpreter, it works fine:
>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
Any solutions would be highly appreciated, even a completely different way of capturing the two items from each line. Thanks.
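For reference, a file object in Python is an iterator that can only be consumed once. The first comprehension reads `r` to the end, so a second comprehension over the same `r` sees no lines and produces an empty set, whether or not the quotes are handled correctly. Below is a minimal sketch of a single-pass alternative, assuming Python 3. The helper name `collect_starts_and_genes` and the `io.StringIO` stand-in for the opened file are hypothetical and used only for illustration.

```python
import io


def collect_starts_and_genes(lines):
    """Collect exon start positions and gene IDs in one pass over the lines."""
    starts = set()
    genes = set()
    for line in lines:
        fields = line.strip().split()
        # Skip blank or malformed lines instead of raising IndexError.
        if len(fields) < 4 or '"' not in line:
            continue
        starts.add(fields[3])            # fourth whitespace-separated column
        genes.add(line.split('"')[1])    # text between the first pair of quotes
    return starts, genes


# Hypothetical stand-in for the opened file: two lines from the sample above.
sample = io.StringIO(
    'chr1 hg19_refFlat exon 44160380 44160565 0.000000 + . '
    'gene_id "KDM4A"; transcript_id "KDM4A";\n'
    'chr1 hg19_refFlat exon 19563636 19563732 0.000000 - . '
    'gene_id "EMC1"; transcript_id "EMC1";\n'
)

starts, genes = collect_starts_and_genes(sample)
print(len(starts))  # 2
print(len(genes))   # 2
```

With a real file, the same function could be called as `collect_starts_and_genes(r)` inside the `with open(infile) as r:` block. If two separate comprehensions are preferred, reading the lines into a list first (`lines = r.readlines()`) or calling `r.seek(0)` between them should also avoid the empty result.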