I'm relatively new to the TensorFlow world and pretty puzzled by how you actually read CSV data into usable example/label tensors in TensorFlow. The example from the TensorFlow tutorial on reading CSV data is pretty fragmented and only gets you part of the way to being able to train on CSV data.

Here's the code I've pieced together, based on that CSV tutorial:
```python
from __future__ import print_function
import tensorflow as tf

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

filename = "csv_test_data.csv"
```
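The rest of the script just follows the tutorial's reader/session pattern and prints each row. Roughly like this (a reconstruction; the column names are placeholders I picked to match the data and the edit further down):

```python
# set up the text reader and CSV decoding, following the tutorial pattern
# (column names are assumed placeholders for the 4+1 columns shown below)
file_length = file_len(filename)
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)

record_defaults = [[0], [0], [0], [0], [0]]
colHour, colQuarter, colAction, colUser, colLabel = tf.decode_csv(
    csv_row, record_defaults=record_defaults)
features = tf.stack([colHour, colQuarter, colAction, colUser])

with tf.Session() as sess:
    tf.initialize_all_variables().run()

    # start populating the filename queue
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(file_length):
        # retrieve a single example/label pair and print it
        example, label = sess.run([features, colLabel])
        print(example, label)

    coord.request_stop()
    coord.join(threads)
```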
And here is a brief example from the CSV file I'm loading - pretty basic data - 4 feature columns and 1 label column:
```
0,0,0,0,0
0,15,0,0,0
0,30,0,0,0
0,45,0,0,0
```
All the code above does is print each example from the CSV file, one by one, which, while nice, is pretty useless for training.

What I'm struggling with is how you'd actually turn those individual examples, loaded one by one, into a training dataset. For example, here's a notebook I was working on in the Udacity Deep Learning course. I basically want to take the CSV data I'm loading and plop it into something like `train_dataset` and `train_labels`:
```python
def reformat(dataset, labels):
    # image_size and num_labels are defined earlier in the notebook
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # map 0 to [1.0, 0.0, 0.0, ...], 1 to [0.0, 1.0, 0.0, ...], and so on
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels
```
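For the CSV above, the end state I'm after would be something like the following (just a NumPy sketch to show the shapes I mean, assuming the 4-feature + 1-label layout with a header row):

```python
import numpy as np

# sketch of the arrays I want to end up with, assuming the
# 4-feature + 1-label column layout of the CSV sample above
data = np.loadtxt("csv_test_data.csv", delimiter=",", skiprows=1)
train_dataset = data[:, :4].astype(np.float32)  # shape: (num_rows, 4)
train_labels = data[:, 4].astype(np.int32)      # shape: (num_rows,)
```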
I've tried using `tf.train.shuffle_batch`, like this, but it just inexplicably hangs:
```python
for i in range(file_length):
    # retrieve a single instance
    example, label = sess.run([features, colLabel])
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=file_length, capacity=file_length,
        min_after_dequeue=10000)
    print(example, label)
```
So, to sum up, here are my questions:
- What am I missing in this process?
  - It feels like there is some key intuition I'm lacking about how to properly build an input pipeline.
- Is there a way to avoid having to know the length of the CSV file?
  - It feels pretty inelegant to have to know the number of lines you want to process (the `for i in range(file_length)` loop above).
Edit: Once Yaroslav pointed out that I was likely mixing up the imperative and graph-construction parts here, things started to become clearer. I was able to pull together the following code, which I think is closer to what would typically be done when training a model from CSV (excluding any model-training code):
```python
from __future__ import print_function
import numpy as np
import tensorflow as tf
import math as math
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('dataset')
args = parser.parse_args()

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    _, csv_row = reader.read(filename_queue)
    record_defaults = [[0], [0], [0], [0], [0]]
    colHour, colQuarter, colAction, colUser, colLabel = tf.decode_csv(
        csv_row, record_defaults=record_defaults)
    features = tf.stack([colHour, colQuarter, colAction, colUser])
    label = tf.stack([colLabel])
    return features, label

def input_pipeline(batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [args.dataset], num_epochs=num_epochs, shuffle=True)
    example, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch

file_length = file_len(args.dataset) - 1
examples, labels = input_pipeline(file_length, 1)

with tf.Session() as sess:
    tf.initialize_all_variables().run()
```
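From there, the session body would run the batch ops under a Coordinator until the epoch limit raises OutOfRangeError. A sketch of that driver loop, continuing inside the `with` block above and assuming the standard queue-runner pattern (I believe `num_epochs` creates a local variable, so local variables need initializing too):

```python
    # assumption: string_input_producer with num_epochs creates a local
    # variable, so local variables must be initialized as well
    tf.initialize_local_variables().run()

    # start populating the filename queue
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # fetch a shuffled batch of examples and labels per step
            example_batch, label_batch = sess.run([examples, labels])
            print(example_batch)
    except tf.errors.OutOfRangeError:
        print('Done: epoch limit reached.')
    finally:
        coord.request_stop()

    coord.join(threads)
```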