Reading and deleting the first (or last) line from a txt file without copying

I want to read and delete the first line from a txt file (without copying, since this is a huge file).
I searched the web, but everyone just copies the desired content into a new file. I cannot do that.

Below is my first attempt. This code loops forever, since no lines are ever deleted. If the code deleted the first line of the file each time the file is opened, it would eventually reach the end.

#include <iostream>
#include <string>
#include <fstream>
#include <boost/interprocess/sync/file_lock.hpp>

int main() {
    std::string line;
    std::fstream file;
    boost::interprocess::file_lock lock("test.lock");
    while (true) {
        std::cout << "locking\n";
        lock.lock();
        file.open("test.txt", std::fstream::in | std::fstream::out);
        if (!file.is_open()) {
            std::cout << "can't open file\n";
            file.close();
            lock.unlock();
            break;
        } else if (!std::getline(file, line)) {
            std::cout << "empty file\n";  //
            file.close();                 // never
            lock.unlock();                // reached
            break;                        //
        } else {
            // remove first line
            file.close();
            lock.unlock();
            // do something with line
        }
    }
}
2 answers

What you want to do is really not easy.

If you open the same file for both reading and writing without being careful, you will end up reading back what you have just written, and the result will not be what you want.

Modifying the file in place is feasible: open it, seek to the right position, change the bytes, and close it. Here, however, you want to keep the entire contents of the file except the first K bytes. That means you have to iteratively read and write the whole file in chunks of N bytes.

Upon completion, K leftover bytes will remain at the end of the file and need to be removed. I don't think there is a way to do this with standard streams. You can use the ftruncate or truncate functions from unistd.h, or Boost.Interprocess, to do it.
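As an illustration of just that truncation step, here is a minimal sketch that shrinks a file by the length of its first line. It uses std::filesystem::resize_file (C++17) instead of the unistd.h functions mentioned above; the file name "test.txt" is only an example, and it assumes the remaining content has already been shifted to the front of the file:

#include <filesystem>
#include <fstream>
#include <string>

int main() {
    const char* path = "test.txt";               // example file name
    std::ifstream in(path);
    std::string firstLine;
    if (!std::getline(in, firstLine)) return 0;  // empty file, nothing to do
    in.close();

    // Drop the length of the first line (plus its '\n') from the end of the file.
    auto oldSize = std::filesystem::file_size(path);
    auto toDrop  = firstLine.size() + 1;         // assumes '\n' line endings
    if (oldSize >= toDrop)
        std::filesystem::resize_file(path, oldSize - toDrop);
}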

Here is an example (error checking omitted; it can easily be added):

#include <iostream>
#include <fstream>
#include <vector>
#include <unistd.h>

int main() {
    std::fstream file;
    file.open("test.txt", std::fstream::in | std::fstream::out);

    // First retrieve the size of the file
    file.seekg(0, file.end);
    std::streampos endPos = file.tellg();
    file.seekg(0, file.beg);

    // Then retrieve the size of the first line
    std::string firstLine;
    std::getline(file, firstLine);

    // We need two stream positions: the read one and the write one
    // (the +1 accounts for the '\n' terminator)
    std::streampos readPos = firstLine.size() + 1;
    std::streampos writePos = 0;

    // Read the whole file starting at readPos in chunks of bufferSize bytes
    std::size_t bufferSize = 256;
    std::vector<char> buffer(bufferSize);   // avoids a non-standard variable-length array
    bool finished = false;
    while (!finished) {
        file.seekg(readPos);
        if (readPos + static_cast<std::streampos>(bufferSize) >= endPos) {
            bufferSize = static_cast<std::size_t>(endPos - readPos);
            finished = true;
        }
        file.read(buffer.data(), bufferSize);
        file.seekp(writePos);               // position the put pointer before writing
        file.write(buffer.data(), bufferSize);
        readPos += bufferSize;
        writePos += bufferSize;
    }
    file.close();

    // No clean way to truncate streams, use the function from unistd.h
    truncate("test.txt", writePos);
    return 0;
}

I would really like there to be a cleaner solution for modifying the file in place, but I'm not sure one exists.


Here is a solution written in C for Windows. It runs to completion on a 245 MB file of 700,000 lines in very little time (0.14 s).

Basically, I memory-map the file so that I can access its contents with the functions normally used for raw memory. Once the file has been mapped, I simply use the strchr function to find the location of the character pair that marks EOL on Windows (\r\n) - this tells us the length of the first line in bytes.

From there I just memcpy from the first byte of the second line to the beginning of the memory-mapped region (basically, the first byte of the file).

Once this is done, the view of the file is unmapped, the file-mapping handle is closed, and the SetEndOfFile function is used to reduce the length of the file by the length of the first line. When we close the file, it has shrunk by that amount and the first line is gone.
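To make that last step concrete, here is a minimal sketch of just the shrink-by-N-bytes operation using SetFilePointer and SetEndOfFile. The file name and the bytesToDrop value are placeholders, error handling is minimal, and it assumes the remaining content has already been moved to the start of the file as described above:

#include <windows.h>
#include <stdio.h>

// Shrink an existing file by dropping `bytesToDrop` bytes from its end.
static void shrinkFile(const char* path, LONG bytesToDrop) {
    HANDLE h = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("could not open %s\n", path);
        return;
    }
    // Move the file pointer to (end of file - bytesToDrop) ...
    SetFilePointer(h, -bytesToDrop, NULL, FILE_END);
    // ... and make that the new end of the file.
    SetEndOfFile(h);
    CloseHandle(h);
}

int main(void) {
    shrinkFile("testInput2.txt", 42);   // 42 = hypothetical first-line length + 2
    return 0;
}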

Having the file already in memory (I had only just created and written it) obviously flatters the execution time a little, but the Windows caching mechanism is the "culprit" here - it is the very mechanism we use to perform the operation, and it is very fast.

The test data is the source of the program duplicated 100,000 times and saved as testInput2.txt (paste it 10 times, select everything, copy, paste 10 times to replace the original 10, for a total of 100 copies - repeat until the output is large enough. I stopped there because anything larger seemed to make Notepad++ a bit unhappy).

Error checking in this program is practically nonexistent, and the input is expected not to be UNICODE, i.e. one byte per character. The EOL sequence is 0x0D, 0x0A (\r\n).

Code:

#include <stdio.h>
#include <string.h>
#include <windows.h>

void testFunc(const char inputFilename[]) {
    int lineLength;
    HANDLE fileHandle = CreateFile(
        inputFilename,
        GENERIC_READ | GENERIC_WRITE,
        0,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
        NULL);
    if (fileHandle != INVALID_HANDLE_VALUE) {
        printf("File opened okay\n");
        DWORD fileSizeHi, fileSizeLo = GetFileSize(fileHandle, &fileSizeHi);
        HANDLE memMappedHandle = CreateFileMapping(
            fileHandle,
            NULL,
            PAGE_READWRITE | SEC_COMMIT,
            0,
            0,
            NULL);
        if (memMappedHandle) {
            printf("File mapping success\n");
            LPVOID memPtr = MapViewOfFile(
                memMappedHandle,
                FILE_MAP_ALL_ACCESS,
                0,
                0,
                0);
            if (memPtr != NULL) {
                printf("view of file successfully created\n");
                printf("File size is: 0x%04X%04X\n", fileSizeHi, fileSizeLo);
                // windows EOL sequence is \r\n - find the first \r
                char *eolPos = strchr((char*)memPtr, '\r');
                lineLength = (int)(eolPos - (char*)memPtr);
                printf("Length of first line is: %d\n", lineLength);
                // source and destination overlap, so use memmove, and skip the \r\n pair
                memmove(memPtr, eolPos + 2, fileSizeLo - (lineLength + 2));
                UnmapViewOfFile(memPtr);
            }
            CloseHandle(memMappedHandle);
        }
        // shrink the file by the length of the first line plus the \r\n pair
        SetFilePointer(fileHandle, -(lineLength + 2), 0, FILE_END);
        SetEndOfFile(fileHandle);
        CloseHandle(fileHandle);
    }
}

int main() {
    const char inputFilename[] = "testInput2.txt";
    testFunc(inputFilename);
    return 0;
}
