Best way to read from a large CSV file without loading everything into memory using JavaScript

I am using Atom/Electron to build an application that does data-driven visualization of videos. Each video has a corresponding CSV file with information for every frame. The videos are about 100 minutes long, so the files hold a lot of data!

The problem I am facing is that it takes a few seconds to load and parse the file. In most cases this is not a problem. But I need to build playlists out of parts of videos, and loading the whole CSV file every time the video changes is not a viable option.

I have looked at streaming options such as fast-csv, but I could not find a way to start reading at an arbitrary part of the file.

EDIT: from the fs documentation. Given this, the question becomes: how can I find out which byte corresponds to the position I want in the file?

options can include start and end values to read a range of bytes from the file instead of the entire file. Both start and end are inclusive and start counting at 0.
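
For what it's worth, here is a minimal sketch of one way to answer the byte-offset question: scan the file once up front, record the byte offset at which each line starts, and later use fs.createReadStream's start option to begin reading at any line. The helper and file names are illustrative, not from the question itself.

```javascript
const fs = require('fs');

// One-time pass: record the byte offset of the start of every line.
function buildLineIndex(path, callback) {
  const offsets = [0]; // line 0 starts at byte 0
  let position = 0;
  fs.createReadStream(path)
    .on('data', (chunk) => {
      for (let i = 0; i < chunk.length; i++) {
        if (chunk[i] === 0x0a) offsets.push(position + i + 1); // byte after '\n'
      }
      position += chunk.length;
    })
    .on('end', () => callback(offsets));
}

// Later: stream starting from an arbitrary line using the saved offsets.
buildLineIndex('frames.csv', (offsets) => {
  const startLine = 5000;
  fs.createReadStream('frames.csv', { start: offsets[startLine] })
    .pipe(process.stdout); // or pipe into a CSV parser instead
});
```

Since the index works in bytes rather than characters, it stays valid even for multi-byte UTF-8 content, and the index itself (one integer per line) is tiny compared to the file.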

What do you think is the best and most efficient approach to this situation?

Concretely:

Is there a way to start reading a stream from any part of the CSV file?

Do you think there is another storage method that would let me solve this problem better?

+8
javascript file csv electron
2 answers

As I mentioned in my comment, SQLite seems to be what you are looking for. It may not be your permanent solution in the long run, but it will certainly work until you decide whether to stick with it or code something of your own.

How SQLite works internally

SQLite is heavily optimized to its core, and three main features make it faster than plain disk reads, especially of CSV files (a short usage sketch follows the list below):

  • The entire database (every database you create) is stored in a single file, not spread across multiple files or records.
  • That file is paged into 1024-byte (1 KB) blocks, making it easy to jump around the data.
  • (Really an extension of the previous point.) The entire database and paging system is one massive binary tree that usually takes fewer than 10 hops to find any piece of data. So, in layman's terms: very fast!
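
To make that concrete, here is a minimal sketch of the kind of usage this suggests, using the sqlite3 npm package (my choice for illustration; any SQLite binding for JavaScript would do). The table and column names are made up for the example: the idea is to import the CSV once, then query only the frame range a playlist entry needs.

```javascript
const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('frames.db');

db.serialize(() => {
  // One-time setup: one table row per CSV row, keyed by frame number.
  db.run('CREATE TABLE IF NOT EXISTS frames (frame INTEGER PRIMARY KEY, data TEXT)');

  // Later: fetch only the rows for the clip being played,
  // instead of re-reading the whole CSV on every video change.
  db.all(
    'SELECT * FROM frames WHERE frame BETWEEN ? AND ?',
    [1500, 3000],
    (err, rows) => {
      if (err) throw err;
      console.log(rows.length, 'rows for this clip');
    }
  );
});
```

Because frame is the primary key, the BETWEEN query is served straight from the B-tree index described above, without scanning the rest of the data.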

If you are really interested in understanding the full extent of all this, I have not found a better explanation than this amazing Julia Evans blog post.

Possible disadvantages

Internals aside, SQLite is designed to be used on the client side. If that is not a viable solution for you, there are workarounds. For example, SQLite can sit behind a web server, but it really thrives in a standalone or mixed installation. Also remember that every client machine is different: one computer may process records faster than the next, but overall you don't need to worry, because client-side machines are usually under light load.

  • Standalone requires everything to live on the client side. This is how SQLite is typically used. I have used it for games in the past, connecting to the database from Java through the sqlite4java API; the API made the whole experience feel like PHP and MySQL on a server. You may need to find other APIs, since SQLite is written in C.
  • A mixed installation works the same way as a standalone one, but you build a link to an actual server into your program. For the games I helped with, we tracked things like scores and user data, then periodically, in the background, pushed them to a real server whenever we could get a connection. This also works in reverse: you can start the user with nothing, download everything needed on first launch, and from then on keep them in sync with what is on the server.

Summary

SQLite will work for what you need, but it may take a little homework to set it up the way you want. For example, sqlite4java is easy to install but confusing because its documentation is so poor; Stack Overflow got me through it, though. SQLite is also a set-it-and-forget-it kind of installation, so to answer your question: it will handle 25 rows a second like a piece of cake; you only need to worry about optimizing your own code.

+1

I would highly recommend Papaparse for this. It allows streaming a CSV line by line, and each line can be processed into JSON based on the headers in the file.

Inside the configuration object passed to the parse function, you can specify a "step" parameter: a function that will be executed for each row of the file as it is parsed.

Note: you can also enable a worker to improve performance when dealing with very large CSV files.

http://papaparse.com/docs
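
A minimal sketch of the step-based streaming described above (the file name and field access are illustrative; check the docs linked above for the exact shape of the results object in your Papaparse version):

```javascript
const fs = require('fs');
const Papa = require('papaparse');

// Stream the CSV row by row; the step callback fires once per row,
// so the whole file is never held in memory at once.
Papa.parse(fs.createReadStream('frames.csv'), {
  header: true, // map each row to an object keyed by the header line
  step: (results, parser) => {
    const row = results.data; // one parsed row
    // process the row here, e.g. feed it to the visualization;
    // parser.abort() stops early once the needed range has been read
  },
  complete: () => console.log('done'),
});
```

Combined with the byte-offset idea from the question's edit, you could also open the read stream at an arbitrary start position and hand that stream to Papaparse, so parsing begins mid-file rather than at the top.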

+2
