Retrieving columns containing a specific name

Question

I am trying to use Python to manage data in large txt files.

I have a txt file with more than 2000 columns, and about a third of them have a header that contains the word "Net". I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do this?

I searched a little, but could not find anything that helps. Sorry if similar questions have been asked and resolved before.

EDIT 1: Thanks everyone! At the time of writing, 3 users have suggested solutions, and they all work very well. I honestly did not think that people would answer, so I did not check for a day or two, and I was pleasantly surprised. I'm very impressed.

EDIT 2: I added an image that shows what part of the source txt file might look like in case this helps someone in the future:

Sample from original txt-file

+5

python text-files extraction

Rickyboy May 04 '15 at 11:44

3 answers

This can be done, for example, using pandas:

    import pandas as pd

    df = pd.read_csv('path_to_file.txt', sep=r'\s+')
    print(df.columns)  # check that the columns are parsed correctly
    # Case-insensitive match, since the headers contain "Net"
    selected_columns = [col for col in df.columns if 'net' in col.lower()]
    df_filtered = df[selected_columns]
    df_filtered.to_csv('new_file.txt', index=False)  # do not write the row index

Of course, since we do not know the structure of your text file, you will have to adapt the read_csv arguments to make this work in your case (see the relevant documentation).

This will load the entire file into memory and then filter out unnecessary columns. If your file is so large that it cannot be loaded directly into RAM, there is a way to load only certain columns with the usecols argument.
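As a minimal sketch of the usecols approach: read only the header line first, pick the matching names, then load just those columns. This assumes a whitespace-separated file and a case-insensitive match; the helper name read_matching_columns and the tiny sample file are hypothetical and only serve the illustration.

```python
import os
import tempfile

import pandas as pd


def read_matching_columns(path, needle="net", sep=r"\s+"):
    """Load only the columns whose header contains `needle` (case-insensitive)."""
    # nrows=0 parses just the header line, so no data rows are loaded here.
    header = pd.read_csv(path, sep=sep, nrows=0).columns
    wanted = [col for col in header if needle.lower() in col.lower()]
    # usecols restricts parsing to the selected columns only.
    return pd.read_csv(path, sep=sep, usecols=wanted)


# Tiny made-up sample standing in for the real file.
sample = "a b_Net c d_net\n1 2 3 4\n5 6 7 8\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

df_net = read_matching_columns(tmp.name)
os.remove(tmp.name)
print(list(df_net.columns))
```

The file is opened twice, but the first pass stops after the header, so the extra cost should be small compared with loading all 2000 columns.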

+4

rth May 04 '15 at 12:05

You can use the pandas filter function to select the columns whose names match a regular expression:

 data_filtered = data.filter(regex='net')
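One thing to watch for: the pattern is matched case-sensitively, so 'net' will not match a header spelled "Net". A small sketch with made-up column names, using the inline (?i) flag to ignore case:

```python
import pandas as pd

# Hypothetical data; the column names are made up for illustration.
data = pd.DataFrame({"a": [0], "b_Net": [1], "c": [0], "d_net": [1]})

# Case-sensitive: only names containing lowercase "net" match.
lowercase_only = data.filter(regex="net")
print(list(lowercase_only.columns))

# The inline flag (?i) makes the pattern case-insensitive.
data_filtered = data.filter(regex="(?i)net")
print(list(data_filtered.columns))
```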

+3

Kathirmani sukumar May 04 '15 at 16:48

Marco nawijn · Accepted Answer · 2015-05-04T12:08:43+0000

One way to do this, without installing third-party modules like numpy / pandas, is as follows. Given the input file called "input.csv" as follows:

a, b, c_net, d, e_net

0,0,1,0,1

(delete the empty lines between them, they are only for formatting the content in this message)

The following code does what you want.

    import csv

    input_filename = 'input.csv'
    output_filename = 'output.csv'

    with open(input_filename, newline='') as infile, \
            open(output_filename, 'w', newline='') as outfile:
        # Instantiate a CSV reader, check if you have the appropriate delimiter
        reader = csv.reader(infile, delimiter=',')

        # Get the first row (assuming this row contains the header)
        input_header = next(reader)

        # Filter out the columns that you want to keep by storing the column
        # index
        columns_to_keep = []
        for i, name in enumerate(input_header):
            if 'net' in name:
                columns_to_keep.append(i)

        # Create a CSV writer to store the columns you want to keep
        writer = csv.writer(outfile, delimiter=',')

        # Construct the header of the output file
        output_header = []
        for column_index in columns_to_keep:
            output_header.append(input_header[column_index])

        # Write the header to the output file
        writer.writerow(output_header)

        # Iterate over the remainder of the input file, construct a row
        # with columns you want to keep and write this row to the output file
        for row in reader:
            new_row = []
            for column_index in columns_to_keep:
                new_row.append(row[column_index])
            writer.writerow(new_row)

Please note that there is no error handling. At least two cases need to be handled. The first is checking that the input file exists (hint: see the functionality provided by the os and os.path modules). The second is handling empty rows or rows with an inconsistent number of columns.
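A possible sketch of those two checks, written for Python 3. The function name extract_columns is hypothetical, and skipping malformed rows (rather than stopping with an error) is an assumption about what you want; adjust it if a bad row should abort the run.

```python
import csv
import os


def extract_columns(input_filename, output_filename, needle="net"):
    """Copy columns whose header contains `needle`; return the number of skipped rows."""
    # First check: the input file must exist.
    if not os.path.isfile(input_filename):
        raise FileNotFoundError(input_filename)

    skipped = 0
    with open(input_filename, newline="") as src, \
            open(output_filename, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return skipped  # empty input file: nothing to write
        keep = [i for i, name in enumerate(header) if needle in name]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            # Second check: skip empty rows and rows whose column count
            # differs from the header.
            if len(row) != len(header):
                skipped += 1
                continue
            writer.writerow([row[i] for i in keep])
    return skipped
```

Calling extract_columns('input.csv', 'output.csv') writes the filtered file and returns how many rows were dropped, which makes it easy to notice a badly formatted input.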
