Extract the first three columns from all tsv files in the folder

I have several tsv files in a folder totaling over 50 GB. To reduce memory use when loading these files into R, I want to extract only the first three columns of each file.

How can I extract these columns from all files at once in the terminal? I am running Ubuntu 16.04.

+6
5 answers

Something like the following should work:

#!/bin/bash
# Iterate over the glob directly so that file names containing spaces stay intact.
for f in /path/to/*
do
    # Do something for each file. In our case, just echo the first three fields:
    cut -f1-3 < "$f"
done

(see this web page for more information on iterating files in bash.)

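This loop only covers files directly inside the given folder and does not check that they really are tsv files. For anything more involved, such as descending into subdirectories, look at find.

Edit: to overwrite each original file with its trimmed version, use something like the following script: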

#!/bin/bash
for f in /path/to/*
do
    # Write the first three fields to a temporary file, then replace the
    # original with it only if cut succeeded:
    cut -f1-3 < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

cut writes its output to a .tmp file; that file then replaces the original.
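
With 50 GB of data, overwriting in place leaves no way back if something goes wrong. A minimal sketch of a more cautious variant writes the trimmed copies to a separate directory instead. The function name trim_tsv_dir, the directory arguments, and the sample file are hypothetical and only serve the illustration:

```shell
#!/bin/bash
# Sketch: copy the first three columns of every .tsv file in src_dir into a
# file of the same name in out_dir. The originals are left untouched.
# trim_tsv_dir, src_dir and out_dir are hypothetical names for illustration.
trim_tsv_dir() {
    local src_dir=$1 out_dir=$2 f
    mkdir -p "$out_dir"
    for f in "$src_dir"/*.tsv; do
        [ -e "$f" ] || continue    # skip if the glob matched nothing
        cut -f1-3 "$f" > "$out_dir/$(basename "$f")"
    done
}

# Small demonstration on throwaway data in a temporary directory.
demo=$(mktemp -d)
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > "$demo/sample one.tsv"
trim_tsv_dir "$demo" "$demo/trimmed"
cat "$demo/trimmed/sample one.tsv"
```

Once the trimmed copies look right, the originals can be deleted or archived by hand.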

+5

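You can do this with cut alone:

cut -d$'\t' -f 1-3 folder/*

-d sets the field delimiter (a tab here, which is also cut's default), -f selects the fields, and folder/* is a glob that expands to every file in the folder, so the first three columns of all files are printed one after another.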
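
When cut is given several files it writes everything to a single stream, with no file names or separators in between. A small sketch on throwaway data shows this; the files one.tsv and two.tsv are made up for the illustration:

```shell
#!/bin/bash
# Sketch: cut applied to a glob concatenates the selected columns of all
# matching files. The two sample files are hypothetical.
demo=$(mktemp -d)
printf 'a\tb\tc\td\n' > "$demo/one.tsv"
printf 'e\tf\tg\th\n' > "$demo/two.tsv"

# Tab is cut's default delimiter, so -d is not needed here.
combined=$(cut -f 1-3 "$demo"/*.tsv)
printf '%s\n' "$combined"
```

If each file starts with a header row, that row will therefore appear once per file in the combined output.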

+4

If you are loading the files into R anyway, data.table's fread can read just the columns you need:

fread("foo.tsv", sep = "\t", select=c("f1", "f2", "f3"))

+4

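Using find with awk:

find ./ -type f -name "*.tsv" -exec awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3 }' {} \;

This searches the current directory recursively for .tsv files and prints the first three tab-separated columns of each one.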
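
By default awk splits on any run of whitespace, which matters for tab-separated data whose fields contain spaces. The following sketch compares the default behaviour with an explicit tab separator on one made-up line:

```shell
#!/bin/bash
# Sketch: default awk field splitting versus an explicit tab separator.
# The sample line is hypothetical; its first field contains a space.
line=$(printf 'New York\t10\t20\t30')

# Default: "New" and "York" count as two separate fields.
default_split=$(printf '%s\n' "$line" | awk '{ print $1, $2, $3 }')

# FS and OFS set to tab: fields follow the tsv columns and stay tab-separated.
tab_split=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3 }')

printf '%s\n' "$default_split"
printf '%s\n' "$tab_split"
```

The first result holds "New", "York" and "10" joined by spaces; the second holds the three real columns joined by tabs.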

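To collect the results in a file instead, redirect the output of the whole find command, so that every awk run appends to the same file:

find ./ -type f -name "*.tsv" -exec awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3 }' {} \; >> someOtherFile
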
+3

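If you are working in R anyway, you can run cut from within R and read its output directly. There are several ways to do this.

Base R (returns a data.frame):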

> df1 = read.table(pipe("cut -f 1-3 *.tsv"), sep="\t", header=FALSE, quote="")

With tidyverse/readr (returns a tibble):

> df2 = read_tsv(pipe("cut -f 1-3 *.tsv"))

With data.table (returns a data.table, which is also a data.frame):

> df3 = fread("cut -f 1-3 *.tsv")

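Because the command runs in a unix shell, you can add further processing steps to it. For example, to read a random sample of 10 000 rows: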

> df4 = fread("cut -f 1,3 *.tsv | shuf -n 10000")


0
