Read / write to a large file in Java

I have a binary file with the following format:

[N bytes identifier & record length] [n1 bytes data] [N bytes identifier & record length] [n2 bytes data] [N bytes identifier & record length] [n3 bytes data] 

As you can see, I have records of different lengths. In each record, the first N bytes contain both the record's ID and the length of the data in the record.

This file is very large and can contain 3 million records.

I want to open this file with an application and allow the user to view and edit entries. (Insert / Update / Delete Entries)

My initial plan is to create an index file from the source file and, for each record, store the addresses of the next and previous records so I can easily move back and forth (a sort of linked list, but kept in the file rather than in memory).

  • Is there a java library to help me implement this requirement?

  • any recommendations or experience that you think are helpful?

----------------- EDIT -----------------

Thanks for the tips and tricks,

Additional Information:

The source file and its format are beyond my control (it is a third-party file), and I cannot change the file format. But I have to read it, let the user navigate through the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.

Do you still recommend a database instead of a regular index file?

----------------- SECOND EDIT -----------------

Record size in update mode is fixed: the updated (edited) record has the same length as the original record, unless the user deletes the record and creates a new one in a different format.

Many thanks

+8
java file-io binaryfiles
6 answers

Seriously, you should NOT use a plain binary file for this. You should use a database.

The problem with trying to implement this as a regular file is that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere except at the end), update a record (changing its size), or delete a record, you will need to either:

  • rewrite the other records (after the insert / update / delete point) to make or reclaim space, or
  • implement some kind of free-space management within the file.

All of this is complicated and / or expensive.

Fortunately, there is a class of software that implements this kind of thing. It is called database software. There is a wide range of options, from full-blown RDBMSs down to lightweight solutions such as BerkeleyDB files.


In response to your first and second edits, a database would still be simpler.

However, there is an alternative that might handle this use case better than a database, and without complicated free-space management:

  • Read the file and create an in-memory index that maps identifiers to file locations.

  • Create a second file to store new and updated records.

  • Handle adds / updates / deletes like this:

    • An add is handled by writing the new record to the end of the second file and adding an index entry for it.

    • An update is handled by writing the updated record to the end of the second file and changing the existing index entry to point to it.

    • A delete is handled by removing the index entry for the record's key.

  • To save the file back (i.e. compact it):

    • Create a new file.

    • Read each record in the old file in order and look up its key in the index. If the index entry still points at that record's location, copy the record to the new file. Otherwise, skip it.

    • Repeat the previous step for the second file.

  • If all of this succeeds, delete the old file and the second file.

Note that this depends on being able to hold the index in memory. If that is not feasible, the implementation will be more complicated ... and more like a database.
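The scheme above can be sketched in Java. The question does not give the real header layout, so this assumes a hypothetical 8-byte header (a 4-byte int id followed by a 4-byte int payload length); the `ShadowStore` class and its method names are illustrative, not from any library:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

/** Sketch of the "second file" scheme: main file stays untouched,
 *  all adds/updates go to a shadow file, deletes only touch the index. */
public class ShadowStore {
    static final int HEADER = 8;                 // assumed: int id + int length
    record Loc(Path file, long offset) {}
    final Map<Integer, Loc> index = new LinkedHashMap<>();
    final Path main, shadow;

    ShadowStore(Path main, Path shadow) throws IOException {
        this.main = main; this.shadow = shadow;
        scan(main);                              // build the in-memory index
    }

    private void scan(Path f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "r")) {
            long pos = 0;
            while (pos < raf.length()) {
                raf.seek(pos);
                int id = raf.readInt(), len = raf.readInt();
                index.put(id, new Loc(f, pos));  // later records win
                pos += HEADER + len;             // seek past the payload
            }
        }
    }

    /** Add or update: append to the shadow file and (re)point the index. */
    void put(int id, byte[] data) throws IOException {
        long off = Files.exists(shadow) ? Files.size(shadow) : 0;
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(
                shadow, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
            out.writeInt(id);
            out.writeInt(data.length);
            out.write(data);
        }
        index.put(id, new Loc(shadow, off));
    }

    /** Delete: drop the index entry; compaction discards the dead bytes. */
    void delete(int id) { index.remove(id); }

    /** Compact: copy every record the index still points at, old file first. */
    void compact(Path dest) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(dest))) {
            copyLive(main, out);
            if (Files.exists(shadow)) copyLive(shadow, out);
        }
    }

    private void copyLive(Path f, DataOutputStream out) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f.toFile(), "r")) {
            long pos = 0;
            while (pos < raf.length()) {
                raf.seek(pos);
                int id = raf.readInt(), len = raf.readInt();
                byte[] data = new byte[len];
                raf.readFully(data);
                Loc loc = index.get(id);
                if (loc != null && loc.file().equals(f) && loc.offset() == pos) {
                    out.writeInt(id); out.writeInt(len); out.write(data);
                }
                pos += HEADER + len;
            }
        }
    }

    /** Demo: payloads that survive an update, a delete, an add and a compact. */
    static String demo() throws IOException {
        Path dir = Files.createTempDirectory("shadow");
        Path main = dir.resolve("main.bin"), shadow = dir.resolve("shadow.bin");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(main))) {
            out.writeInt(1); out.writeInt(3); out.write("one".getBytes());
            out.writeInt(2); out.writeInt(3); out.write("two".getBytes());
        }
        ShadowStore s = new ShadowStore(main, shadow);
        s.put(2, "two*".getBytes());             // update record 2
        s.delete(1);                             // delete record 1
        s.put(3, "three".getBytes());            // add record 3
        Path dest = dir.resolve("compacted.bin");
        s.compact(dest);
        StringBuilder sb = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(dest.toFile(), "r")) {
            while (raf.getFilePointer() < raf.length()) {
                raf.readInt();                   // id
                byte[] b = new byte[raf.readInt()];
                raf.readFully(b);
                sb.append(new String(b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());              // two*three
    }
}
```

Note the write-path asymmetry: reads hit either file via the index, but writes only ever append, which is what avoids free-space management.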

+2

Having a data file and an index file is the basic idea for such an implementation, but you will quickly run into data fragmentation from repeated updates / deletes and so on. A project like this should be a separate project in its own right, not part of your main application. In essence, a database is what you need, because it is designed precisely for these operations, and it also lets you search, sort and extend (alter) the data structure without reorganizing a home-grown solution.

May I suggest you download Apache Derby and create a local embedded database (Derby lets you open an embedded connection at runtime). It will not only be faster than anything you write yourself, it will also simplify your application.

Apache Derby is a single jar file that you can simply include and distribute with your project (check the license to see whether it poses any legal problem for your application). There is no need for a database server or third-party software; it is all pure Java.
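For a feel of what that looks like, here is a minimal sketch of an embedded Derby connection, assuming derby.jar is on the classpath; the `records` table and the in-memory `jdbc:derby:memory:` URL are just for the demo:

```java
import java.sql.*;

public class DerbyDemo {
    /** Opens an in-memory embedded Derby database, creates a table,
     *  inserts one row and returns the row count. */
    static int count() throws SQLException {
        // "create=true" makes Derby create the database on first connect;
        // use "jdbc:derby:recordsdb;create=true" for an on-disk database.
        String url = "jdbc:derby:memory:recordsdb;create=true";
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement()) {
            st.executeUpdate("CREATE TABLE records (id INT PRIMARY KEY, data BLOB)");
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO records VALUES (?, ?)")) {
                ps.setInt(1, 1);
                ps.setBytes(2, "payload".getBytes());
                ps.executeUpdate();
            }
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM records")) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        System.out.println(count());   // 1
    }
}
```

With variable-length records stored as BLOBs, inserts, updates and deletes become single SQL statements and Derby handles all the space management.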

The bottom line: it all depends on how big your application is, whether you need to share the data among many clients, whether speed is a critical aspect of your application, and so on.

For a stand-alone single-user project, I recommend Apache Derby. For an n-tier application, you could look at MySQL, PostgreSQL or (ahem) even Oracle. Using ready-made and tested solutions is not only smart, it also cuts development time (and maintenance effort).

Cheers.

+2

Generally, you are better off letting a library or database do the work for you.

You may not need an SQL database, and there are many simple databases that do not use SQL. http://nosql-database.org/ lists 122 of them.

At a minimum, if you are planning to write this yourself, I suggest you read the source of one of these databases to see how they work.


Depending on the size of the records, 3 million is not that many, and I suggest you keep as much of the data in memory as possible.

The first problem you will have is ensuring data consistency and recovery in the event of corruption. The second is handling fragmentation efficiently (something the brightest minds work on in GC design). The third is likely keeping the index transactionally consistent with the raw data so there are no discrepancies.

Although this may seem simple at first, there are significant difficulties in providing reliable, resilient and efficient access to data. This is why most developers use an existing database / datastore library and focus on the features that matter to their application.

+1

(Note: my answer addresses the problem in general, setting aside any specific Java libraries or, as the other answers suggested, using a database (library), which may well be better than reinventing the wheel.)

The idea of creating an index is good and will help performance a lot (although you wrote "index file", I think it should be kept in memory). Index generation should be pretty fast if you read the ID and record length for each record and then just seek past the data.
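That index-generation pass can be sketched like this, again assuming a hypothetical 8-byte header (int id + int payload length), since the question does not give the real N-byte layout:

```java
import java.io.*;
import java.util.*;

/** Builds an id -> file-offset index by reading only the headers
 *  and seeking past each payload. */
public class IndexBuilder {
    static Map<Integer, Long> buildIndex(File f) throws IOException {
        Map<Integer, Long> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            long pos = 0, len = raf.length();
            while (pos < len) {
                raf.seek(pos);
                int id = raf.readInt();
                int payload = raf.readInt();
                index.put(id, pos);     // id -> offset of the record start
                pos += 8 + payload;     // skip the data bytes entirely
            }
        }
        return index;
    }

    /** Demo on a tiny file: record 10 (4 data bytes), then record 20 (2 bytes). */
    static Map<Integer, Long> demo() throws IOException {
        File f = File.createTempFile("recs", ".bin");
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(10); out.writeInt(4); out.write(new byte[4]);
            out.writeInt(20); out.writeInt(2); out.write(new byte[2]);
        }
        return buildIndex(f);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

Because only headers are read and payloads are skipped with `seek`, the pass touches a tiny fraction of the file's bytes.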

You should also think about the editing functionality. Insertion and deletion in particular can be very slow on such a large file if done naively (for example, deleting a record and then shifting all the following records to close the gap).

The best option would be to just mark deleted records as deleted. When inserting, you can overwrite one of them or append to the end of the file.
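A sketch of that mark-as-deleted idea: overwrite part of the record header in place with `RandomAccessFile` instead of shifting any data. Both the 8-byte header layout (int id + int length) and the convention that an id of -1 means "free slot" are invented for this sketch:

```java
import java.io.*;

/** Tombstoning: a delete becomes a 4-byte in-place write,
 *  leaving the payload bytes where they are. */
public class Tombstone {
    static void markDeleted(File f, long recordOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.seek(recordOffset);
            raf.writeInt(-1);        // overwrite only the id field
        }
    }

    static int readId(File f, long recordOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(recordOffset);
            return raf.readInt();
        }
    }

    /** Demo: write one record (id 7), tombstone it, read the id back. */
    static int demo() throws IOException {
        File f = File.createTempFile("recs", ".bin");
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(7); out.writeInt(3); out.write("abc".getBytes());
        }
        markDeleted(f, 0);
        return readId(f, 0);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());  // -1
    }
}
```

Since the third-party format is fixed, the tombstones would have to be swept out (records compacted away) before saving the file back in its original format.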

0

Insert / Update / Delete Entries

Inserting (as opposed to just appending) and deleting records in a file is expensive, because you have to shift the entire contents after that point to make room for the new record or to reclaim the space it used. Updating is just as expensive if the update changes the record's length (you say they are variable length).

The file format you describe is fundamentally unsuited to the kinds of operations you want to perform. Others have suggested using a database. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index entries a fixed length.
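The point of fixed-length index entries is that entry k always sits at offset k * entrySize, so no scanning is needed. A sketch with an assumed 12-byte entry (int id + long data-file offset):

```java
import java.io.*;

/** Fixed-length index entries: 4-byte id followed by 8-byte offset. */
public class FixedIndex {
    static final int ENTRY = 12;

    static void append(File idx, int id, long offset) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(idx, true))) {   // append mode
            out.writeInt(id);
            out.writeLong(offset);
        }
    }

    /** Random access: jump straight to entry k and read its offset. */
    static long offsetOfEntry(File idx, int k) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(idx, "r")) {
            raf.seek((long) k * ENTRY + 4);           // skip the 4-byte id
            return raf.readLong();
        }
    }

    /** Demo: two entries, then fetch the second one's data offset. */
    static long demo() throws IOException {
        File idx = File.createTempFile("idx", ".bin");
        append(idx, 100, 0L);
        append(idx, 200, 4096L);
        return offsetOfEntry(idx, 1);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());   // 4096
    }
}
```

The same property also makes the index file binary-searchable if the entries are kept sorted by id.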

0

As others have stated, a database would be a better solution. Pure-Java SQL databases include H2, Derby and HSQLDB.

If you want to use an index file, look at Berkeley DB or a NoSQL store.

If there is some reason to stick with the file, check out JRecord. It has:

  • Several classes for reading / writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol file structures should do the job.
  • An editor for editing JRecord files. The latest version of the editor can handle large files (it uses a compressed / spill file). The editor does have to load the entire file, and only one user can edit the file at a time.

A JRecord solution will only work if:

  • There is a limited number of users (preferably one), all in one location.
  • The infrastructure is fast.
0
