The best way to write to an SSD: append-only

I want to know the best way to write to an SSD. Think of something like a database log, where you write append-only, but you also need to fsync() after each transaction, or after every few transactions, to make the data durable at the application level.

I will talk a bit about how SSDs work, so if you already know all this, skim it anyway in case I am wrong about something. Some good reading material is Emmanuel Goossaert's Coding for SSDs series and the paper Don't Stack Your Log on My Log [pdf].

SSDs write and read only in whole pages. The page size differs from SSD to SSD, but it is usually a multiple of 4 KB. My Samsung 840 EVO has an 8 KB page size (which, incidentally, Linus calls "unusable shit" in his usual colorful way). SSDs cannot modify data in place; they can only write to free pages. Combining these two restrictions, updating a single byte on my EVO requires reading the 8 KB page, changing the byte, writing it out to a new free 8 KB page, and updating the FTL page mapping (an SSD-internal data structure) so that the logical address of the page, as the OS understands it, now points to the new physical page. Because the file data also no longer sits within one erase block (the smallest group of pages that can be erased), we are also building up a form of fragmentation debt that will cost us in future garbage collection inside the SSD. Horribly inefficient.
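
To make that remapping concrete, here is a toy sketch of what an FTL does on a logical page write. Every name here is invented for illustration; real mapping structures are far more elaborate:

    /* Toy FTL: a logical page write goes to a fresh physical page, the
       mapping table is redirected, and the old physical page is left
       stale for later garbage collection. Illustration only. */
    #include <stdint.h>

    #define LOGICAL_PAGES   1024
    #define PHYSICAL_PAGES  2048
    #define INVALID         UINT32_MAX

    static uint32_t ftl_map[LOGICAL_PAGES]; /* logical -> physical page,   */
                                            /* assume set to INVALID at    */
                                            /* mount time                  */
    static uint8_t  stale[PHYSICAL_PAGES];  /* 1 = old copy awaiting GC    */

    extern uint32_t alloc_free_physical_page(void);
    extern void     program_physical_page(uint32_t phys, const void *data);

    void ftl_write(uint32_t logical, const void *page_data)
    {
        uint32_t new_phys = alloc_free_physical_page(); /* never in place */
        program_physical_page(new_phys, page_data);
        if (ftl_map[logical] != INVALID)
            stale[ftl_map[logical]] = 1; /* old page becomes GC debt      */
        ftl_map[logical] = new_phys;     /* redirect the logical address  */
    }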

For reference, looking at my file system on this PC:

    C:\WINDOWS\system32>fsutil fsinfo ntfsinfo c:

It has a sector size of 512 bytes and an allocation (cluster) size of 4 KB. Neither of these matches the SSD page size, which is probably not very efficient.
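
On Linux, a rough counterpart is to query the block device with the BLKSSZGET and BLKPBSZGET ioctls, though, as with fsutil, neither value necessarily reveals the SSD's internal page size. The device path below is just an example:

    /* Query the logical and physical sector sizes the kernel reports
       for a block device. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int logical = 0;
        unsigned int physical = 0;
        ioctl(fd, BLKSSZGET, &logical);    /* logical sector, e.g. 512   */
        ioctl(fd, BLKPBSZGET, &physical);  /* physical block, e.g. 4096  */

        printf("logical %d B, physical %u B\n", logical, physical);
        close(fd);
        return 0;
    }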

There are some problems with simply writing, e.g. via pwrite(), to the kernel page cache and letting the OS handle the writeback. First, you need to issue an extra sync_file_range() call after each write() to actually start the I/O, otherwise everything waits until you call fsync() and then unleashes an I/O storm. Second, fsync() seems to block future write() calls on the same file. Finally, you have no control over how the kernel writes things out to the SSD, which it may do well or badly, causing write amplification.
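
A minimal sketch of that workaround, with a buffered pwrite() plus sync_file_range() to kick off writeback early (the function names are mine and error handling is omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    void append_and_kick(int log_fd, const void *buf, size_t len, off_t off)
    {
        pwrite(log_fd, buf, len, off);           /* lands in page cache  */
        sync_file_range(log_fd, off, len,
                        SYNC_FILE_RANGE_WRITE);  /* start writeback now, */
                                                 /* without waiting      */
    }

    void commit(int log_fd)
    {
        fsync(log_fd);   /* durability point (and the stall noted above) */
    }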

For the reasons above, and because I need AIO for reading the log anyway, I prefer to open the log with O_DIRECT and O_DSYNC and keep full control.

As I understand it, O_DIRECT requires all writes to be sector-aligned and a whole number of sectors long. So every time I decide to issue a log append, I need to pad the end with zeros to round it up to a whole number of sectors (if all writes are always a whole number of sectors long, they will also end up correctly aligned, at least in my code). OK, that is not so bad. But my question is: wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
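
For illustration, a minimal sketch of such a padded O_DIRECT append. SECTOR, the function name, and the open() flags shown are my assumptions about the setup described above, not code tested against any particular device:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SECTOR 512

    /* fd opened elsewhere, e.g.:
       open("journal", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644)  */
    int append_record(int fd, off_t off, const void *rec, size_t rec_len)
    {
        size_t padded = (rec_len + SECTOR - 1) / SECTOR * SECTOR;
        void *buf;
        if (posix_memalign(&buf, SECTOR, padded) != 0)
            return -1;
        memset(buf, 0, padded);        /* zero-fill the tail padding     */
        memcpy(buf, rec, rec_len);
        ssize_t n = pwrite(fd, buf, padded, off); /* off must also be    */
        free(buf);                                /* sector-aligned      */
        return n == (ssize_t)padded ? 0 : -1;
    }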

That could waste a lot of space, though, especially when writing small amounts of data to the log at a time (for example, a couple of hundred bytes). It may also be unnecessary. SSDs like the Samsung EVO have a write cache, and they do not flush it on fsync(); instead they rely on capacitors to write the cache out to flash in case of power loss. In that case, maybe the SSD does the right thing with a log that is appended whole sectors at a time: it may not write out the final partial page until the next append arrives and completes it (or until it is evicted by a lot of unrelated I/O). Since the answer probably depends on the device and possibly the file system: is there a way I can code up both possibilities and test my theory? How can I measure write amplification, or the number of updated/RMW'd pages, on Linux?
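
For the host side of such a measurement, /proc/diskstats exposes sectors written per device, so you can sample it before and after a test run. Note this captures only what the OS submits to the drive; NAND-level amplification inside the SSD is visible only through vendor-specific SMART attributes, so treat it as a lower bound. A sketch, with the device name as an example:

    #include <stdio.h>
    #include <string.h>

    /* Return cumulative sectors written (512 B units) for a device,
       parsed from /proc/diskstats. */
    unsigned long long sectors_written(const char *dev)
    {
        FILE *f = fopen("/proc/diskstats", "r");
        if (!f) return 0;
        char name[64];
        unsigned long long v[7];   /* reads, reads merged, sectors read,
                                      ms reading, writes, writes merged,
                                      sectors written                    */
        while (fscanf(f, "%*u %*u %63s %llu %llu %llu %llu %llu %llu %llu%*[^\n]",
                      name, &v[0], &v[1], &v[2], &v[3],
                      &v[4], &v[5], &v[6]) == 8) {
            if (strcmp(name, dev) == 0) {
                fclose(f);
                return v[6];       /* sectors written */
            }
        }
        fclose(f);
        return 0;
    }

    /* Usage: sample sectors_written("sda") before and after the test;
       the delta is what the kernel sent to the device. */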

1 answer

I will try to answer your question, since I had a similar task, though on SD cards, which are still flash memory.

Short answer

In flash memory, you can only write full pages of 512 bytes at a time. Since flash memory has limited write endurance, the kernel buffers writes to extend the life of your disk.

To write even one bit to flash, you must first erase the entire page it sits in. So if you want to append or change 1 byte in a page that already holds 400 bytes, the kernel will essentially do the following (sketched in code after the list):

  • Read the entire page into a buffer
  • Modify the buffer with the appended content
  • Erase the entire page
  • Rewrite the entire page from the modified buffer
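
A toy illustration of that cycle. The flash_* primitives are made-up stand-ins for whatever the physical driver actually provides:

    #include <stdint.h>

    #define PAGE_SIZE 512

    extern void flash_read_page(int page, uint8_t *buf);
    extern void flash_erase_page(int page);
    extern void flash_program_page(int page, const uint8_t *buf);

    void flash_modify_byte(int page, int offset, uint8_t value)
    {
        uint8_t buf[PAGE_SIZE];
        flash_read_page(page, buf);     /* 1. read whole page into buffer */
        buf[offset] = value;            /* 2. modify the buffer           */
        flash_erase_page(page);         /* 3. erase the whole page        */
        flash_program_page(page, buf);  /* 4. rewrite from the buffer     */
    }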

Long answer

Sectors (pages) mostly concern the flash hardware implementation and the physical flash driver, over which you have no control. Such a page must be erased and rewritten every time you change anything in it.

As you probably already know, you cannot overwrite a single bit in a page without erasing and rewriting the whole 512 bytes. Flash sectors have a write endurance of around 100,000 cycles before they may become damaged. To improve the lifetime, the physical driver, and sometimes the system, usually implements wear leveling, spreading writes around so that the same sector is not written over and over.

Clusters, on the other hand, are handled at a higher level by the file system, and over those you do have control. Typically, when formatting a new drive, you can choose the cluster size, which Windows calls the allocation unit size in the format dialog.

[Screenshot: FAT32 format dialog]

Most file systems I know of work with an index located at the beginning of the disk. This index tracks each cluster and what it is assigned to. This means a file occupies at least one cluster, even if it is much smaller than that.


Now, if your clusters are smaller, your index table will be larger and take up more space; but if you store many small files, you will get better space utilization.

On the other hand, if you only store large files, you will want to pick a larger cluster size, roughly matched to the size of your files.

Since your task is logging, I would recommend logging to a single, huge file with a large cluster size. Having experimented with this type of logging, a large number of files in one folder can cause problems, especially on embedded devices.


Implementation

Now, if you have raw disk access and really want to optimize, you can write directly to the disk without using a file system.

Positive:

  • Saves you a lot of disk space
  • Makes the disk more tolerant of failures, if your design is smart enough
  • Requires far fewer resources if you are on a constrained system

On the other hand:

  • More work and debugging
  • The drive will not be recognized by other systems

If you are just logging, you do not need a file system; you just need an entry point to the page where you write your data, a pointer that keeps increasing.

The implementation I made on the SD card was to reserve the first 100 pages of the flash to store the current write and read positions. This could be kept in a single page, but to avoid wearing out that one page, I wrote the position record to the 100 pages sequentially in a circular fashion, and then had an algorithm to determine which page held the most recent record.

A position record was written every 5 minutes or so, which means that in the event of a power outage I would lose only the last ~5 minutes of the log. It is also possible to check the sectors beyond the last recorded write position for valid data before writing further.
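
As a rough sketch of how such a recovery scan might look (the record layout, the sequence-number scheme, and read_reserved_page() are my guesses at the approach described, not the actual code):

    #include <stdint.h>

    #define RESERVED_PAGES 100

    struct pos_record {
        uint32_t seq;        /* monotonically increasing counter       */
        uint32_t write_pos;  /* current log write position (page no.)  */
        uint32_t read_pos;   /* current log read position (page no.)   */
    };

    /* Hypothetical driver call; returns 0 if the page holds a valid
       record (e.g. checksum passes). */
    extern int read_reserved_page(int page, struct pos_record *out);

    /* Scan the reserved pages and keep the record with the highest
       sequence number. Returns its page index, or -1 if none valid. */
    int find_latest_record(struct pos_record *latest)
    {
        int found = -1;
        uint32_t best_seq = 0;
        for (int p = 0; p < RESERVED_PAGES; p++) {
            struct pos_record r;
            if (read_reserved_page(p, &r) == 0 &&
                (found < 0 || r.seq > best_seq)) {
                best_seq = r.seq;
                found = p;
                *latest = r;
            }
        }
        return found;
    }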

This made for a very reliable solution, since table corruption becomes much less likely.

I also suggest buffering writes up to 512 bytes and writing page by page.


Other

You can also look into file systems designed specifically for logging; they may just do the job for you: log-structured file systems.

