Using Unix / Bash, how can I create a lookup table?

So, I have a .txt list of gene names and probe identifiers, originalFile.txt, for example:

    GENE_ID PROBE_ID
    10111   19873
    10112   284, 19983
    10113   187

This text file contains about 30,000 lines. I would like to create a new text file without commas in the second column, for example:

    GENE_ID PROBE_ID
    10111   19873
    10112   284
    10112   19983
    10113   187

... but also, I want all PROBE_IDs to come from another probes.txt text file that looks like this:

    19873
    284
    187

... so that I can make a finalProduct.txt file that looks like this:

    GENE_ID PROBE_ID
    10111   19873
    10112   284
    10113   187

If I wanted to type each line of the probes.txt file manually, I think I could achieve this result with something like:

    awk -F"\t" '{for(i=1;i<=NF;i++){if ($i ~ /probeID#/){print $i}}}' myGenes > test.txt

But of course, this would not put the comma-delimited probe identifiers on separate lines, and I would have to manually enter each of the thousands of probe identifiers.

Does anyone have any tips or better suggestions? Thanks!

EDIT FOR CLARITY
So, I think there are two parts to what I am asking. I would like to take originalFile.txt and end up with finalProduct.txt, using probes.txt. One way to describe it:

For each probe listed in probes.txt, find out whether it exists in originalFile.txt; if it does, print a line containing only that probe and the corresponding GENE_ID.

Or you could think of it as a kind of filtering join on originalFile.txt using probes.txt, where the PROBE_ID column of the output file contains the probes from probes.txt, each with its corresponding GENE_ID from originalFile.txt.

Or you could think of it as:

1. Make an intermediate file in which each line pairs a GENE_ID with a single PROBE_ID.
2. Delete all lines of this intermediate file whose PROBE_ID does not match an entry in probes.txt.
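The two steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the two columns of originalFile.txt are taken to be tab-separated, the sample files are recreated with printf so the snippet runs on its own, and intermediate.txt is a made-up name for the intermediate file.

```shell
# Recreate the sample inputs from the question (a tab between columns is assumed).
printf 'GENE_ID\tPROBE_ID\n10111\t19873\n10112\t284, 19983\n10113\t187\n' > originalFile.txt
printf '19873\n284\n187\n' > probes.txt

# Step 1: write one GENE_ID/PROBE_ID pair per line (intermediate.txt is a hypothetical name).
awk -F'\t' -v OFS='\t' '
    NR == 1 { print; next }                 # keep the header line as-is
    {
        n = split($2, ids, /, */)           # split the probe column on commas
        for (i = 1; i <= n; i++) print $1, ids[i]
    }
' originalFile.txt > intermediate.txt

# Step 2: keep the header plus the rows whose probe is listed in probes.txt.
awk -F'\t' '
    NR == FNR { wanted[$1]; next }          # first file: remember each probe
    FNR == 1 || ($2 in wanted)              # second file: header or matching rows
' probes.txt intermediate.txt > finalProduct.txt

cat finalProduct.txt
```

The NR == FNR test is only true while awk is reading the first file, so probes.txt is loaded into the wanted array before any line of intermediate.txt is examined.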

EDIT 2
Currently trying to adapt this one. No result yet, but perhaps the link will be useful.

1 answer

If probes.txt is small enough to fit in memory, you can try the following awk script:

    BEGIN {
        OFS="\t";
        # this is to handle the given input that has spaces after the comma
        # and tabs between gene and probes
        FS="[\t, ]+";
        # load probes into an array
        while ((getline probe < "probes.txt") > 0) {
            probes[probe] = 1;
        }
        close ("probes.txt");
    }
    {
        # for each probe, check if it in the array
        # and skip it if not
        for (i=2; i <= NF; i++) {
            if (probes[$i] == 1) {
                print $1, $i;
            }
        }
    }
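As a possible variant of the same idea, both files can be passed on the command line instead of using getline. The sketch below is not the script above, just one way it might be rewritten. It assumes the sample data from the question (recreated here with printf so the snippet runs on its own) and, unlike the script above, it also copies the header line through.

```shell
# Recreate the sample inputs from the question (a tab between columns is assumed).
printf 'GENE_ID\tPROBE_ID\n10111\t19873\n10112\t284, 19983\n10113\t187\n' > originalFile.txt
printf '19873\n284\n187\n' > probes.txt

# FS is changed between the two file operands, so probes.txt is read with the
# default field splitting and originalFile.txt with the tab/comma/space regex.
awk -v OFS='\t' '
    NR == FNR { probes[$1]; next }          # first file: load probes.txt
    FNR == 1  { print $1, $2; next }        # header line of originalFile.txt
    {
        for (i = 2; i <= NF; i++)           # every probe listed for this gene
            if ($i in probes) print $1, $i
    }
' probes.txt FS='[\t, ]+' originalFile.txt > finalProduct.txt

cat finalProduct.txt
```

Using `$i in probes` tests for membership without creating empty array entries, which a lookup like `probes[$i] == 1` would do for every probe that is not found.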