So, I have a .txt list of gene names and probe identifiers, originalFile.txt, for example:
GENE_ID PROBE_ID
10111 19873
10112 284, 19983
10113 187
This text file contains about 30,000 lines. I would like to create a new text file without commas in the second column, for example:
GENE_ID PROBE_ID
10111 19873
10112 284
10112 19983
10113 187
... but also, I want all PROBE_IDs to come from another probes.txt text file that looks like this:
19873
284
187
... so that I can make a finalProduct.txt file that looks like this:
GENE_ID PROBE_ID
10111 19873
10112 284
10113 187
If I wanted to type each line of the probes.txt file manually, I think I could achieve this result with something like:
awk -F"\t" '{for(i=1;i<=NF;i++){if ($i ~ /probeID#/){print $i}}}' myGenes > test.txt
But of course, this would not put the comma-delimited probe identifiers on separate lines, and I would have to manually enter each of the thousands of probe identifiers.
Does anyone have any tips or best practices? Thanks!
EDIT FOR CLARITY
I think there are two steps in what I am asking. I would like to take originalFile.txt and end up with finalProduct.txt, using probes.txt. One way to put it:
For each probe listed in probes.txt, check whether it exists in originalFile.txt; if it does, print a line containing only that probe and its corresponding GENE_ID.
Or you could think of it as a kind of join or filter on originalFile.txt using probes.txt, where the output file's PROBE_ID column contains only probes from probes.txt, each with its corresponding GENE_ID from originalFile.txt.
Or you could think of it as:
1. Make an intermediate file in which each line holds a single GENE_ID and a single PROBE_ID (a GENE_ID may appear on several lines).
2. Delete all lines of this intermediate file where the PROBE_ID does not match an entry in probes.txt.
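For illustration, the two steps described above could probably be folded into a single awk pass, without writing the intermediate file. This is only a minimal sketch under some assumptions that are not confirmed by the data shown here: the columns are separated by tabs or spaces, probes.txt is non-empty and has one probe per line, and the output should be tab-separated. The function name filter_probes is made up for this example.

```shell
# Hypothetical helper: filter_probes PROBES_FILE ORIGINAL_FILE
# Prints the header, then one "GENE_ID<TAB>PROBE_ID" line for every
# probe of ORIGINAL_FILE that is also listed in PROBES_FILE.
filter_probes() {
    awk '
        # First file (probe list): remember each wanted probe ID.
        NR == FNR { if ($1 != "") wanted[$1]; next }

        # First line of the second file: pass the header through.
        FNR == 1 { print $1 "\t" $2; next }

        # Data lines: field 1 is the gene, the remaining fields are probes
        # that may carry commas (e.g. "284," or "284,19983").
        {
            for (i = 2; i <= NF; i++) {
                n = split($i, parts, ",")
                for (j = 1; j <= n; j++)
                    if (parts[j] in wanted) print $1 "\t" parts[j]
            }
        }
    ' "$1" "$2"
}

# Usage (assumed file names from the question):
#   filter_probes probes.txt originalFile.txt > finalProduct.txt
```

Reading probes.txt first into an array means each line of originalFile.txt is checked with a simple lookup, so nothing has to be typed by hand. If a probe belongs to several genes, this sketch prints one line per gene.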
EDIT 2
Currently trying to adapt this one; no result yet, but perhaps the link will be useful.