Are POSIX read() and write() system calls atomic?

I am trying to implement a database index based on the data structure (B-link tree) and algorithms proposed by Lehman and Yao in this paper. On page 2, the authors state that:

The disk is partitioned into sections of a fixed size (physical pages; in this paper, these correspond to the nodes of the tree). These are the only units that can be read or written by a process. [emphasis mine] (...)

(...) a process is allowed to lock and unlock a disk page. This lock gives that process exclusive modification rights to that page; also, a process must have a page locked in order to modify that page. (...) Locks do not prevent other processes from reading the locked page. [emphasis mine]

I am not sure my interpretation is correct (I am not used to reading academic papers), but I think the emphasized sentences imply that the authors treat reading and writing a page as "atomic" operations, in the sense that if process A has already started reading (or writing) a page, another process B may not start writing (or reading) the same page until A has finished its read (or write). Of course, several processes reading the same page at the same time is legal, as are several processes performing arbitrary operations on distinct pages simultaneously (process A on page P, process B on page Q, process C on page R, etc.).

  • Is my interpretation correct?

  • Can I assume that the POSIX read() and write() system calls are "atomic" in the sense described above? Can I rely on these system calls having some internal logic that decides whether a given call to read() or write() should be temporarily blocked, depending on the position of the file descriptor and the size of the chunk to be read or written?

  • If the answer to the questions above is "no", how should I implement my own locking mechanism?

2 answers

I do not believe the text you quoted implies anything like that. It does not even mention read(), write(), or POSIX. In fact, read() and write() cannot be assumed to be atomic. The only thing POSIX says is that a write() must be atomic if the size of the write is no more than PIPE_BUF bytes, and even that applies only to pipes and FIFOs.
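To make the PIPE_BUF point concrete, here is a minimal sketch of how one might query that limit at run time. The function name pipe_atomic_limit is hypothetical; fpathconf() and _PC_PIPE_BUF are standard POSIX. Note that the value only describes pipes and FIFOs and says nothing about regular files.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <limits.h>
#include <unistd.h>

/* Hypothetical helper: returns the largest write() size that is guaranteed
   not to be interleaved with other writes on the pipe or FIFO behind fd.
   Returns -1 if fpathconf() fails or reports no limit. */
long pipe_atomic_limit(int fd)
{
    assert(fd >= 0);
    return fpathconf(fd, _PC_PIPE_BUF);
}
```

POSIX requires the result to be at least _POSIX_PIPE_BUF (512 bytes); the actual value is system-dependent.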

I did not read the context around the part of the paper you quoted, but it looks like the passage states constraints that an implementation must satisfy for the algorithm to work correctly. In other words, it says that locking is required to implement this algorithm.

How you do that locking is up to you (the implementer). If you are dealing with a regular file and several independent processes, you could try fcntl(F_SETLKW)-style locking. If your data structure is in memory and you are dealing with multiple threads in the same process, something else may be more appropriate.
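As a rough illustration of the fcntl(F_SETLKW) approach for the regular-file case, here is a minimal sketch that writes one page while holding an exclusive advisory lock on just that page's byte range. The names NODE_PAGE_SIZE, page_lock, and write_page_locked, as well as the 4096-byte page size, are assumptions made for the example and do not come from the paper.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define NODE_PAGE_SIZE 4096 /* assumed page (tree node) size */

/* Take (F_WRLCK) or release (F_UNLCK) an advisory lock covering one page.
   F_SETLKW blocks until the lock can be granted. Returns 0 or -1. */
static int page_lock(int fd, off_t page_no, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = page_no * NODE_PAGE_SIZE;
    fl.l_len = NODE_PAGE_SIZE;
    return fcntl(fd, F_SETLKW, &fl);
}

/* Write one whole page under an exclusive lock.
   Returns 0 on success, -1 on a lock failure or a short write. */
int write_page_locked(int fd, off_t page_no, const void *buf)
{
    assert(buf != NULL);
    if (page_lock(fd, page_no, F_WRLCK) == -1)
        return -1;
    ssize_t n = pwrite(fd, buf, NODE_PAGE_SIZE, page_no * NODE_PAGE_SIZE);
    int rc = (n == NODE_PAGE_SIZE) ? 0 : -1;
    if (page_lock(fd, page_no, F_UNLCK) == -1)
        rc = -1;
    return rc;
}
```

Keep in mind that these locks are advisory and owned by the process: they only help if every writer takes them, they do not exclude other threads of the same process, and closing any descriptor for the file releases all of the process's locks on it.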


Answers:

  1. Reads that run concurrently with writes may see torn (partially completed) writes, depending on the OS, the filing system, and which flags you opened the file with. Below is a brief summary by flags, OS, and filing system.

  2. You can lock byte ranges in a file before accessing them using fcntl() on POSIX or LockFile() on Windows.
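For the POSIX side of point 2, a reader can take a shared lock on the byte range before reading it, so that it never overlaps a writer that holds an exclusive lock on the same range. The sketch below assumes every writer cooperates by locking with F_WRLCK; the function names lock_range and read_range_shared are made up for the example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply an advisory lock of the given type (F_RDLCK, F_WRLCK or F_UNLCK)
   to bytes [start, start + len). Blocks until granted. Returns 0 or -1. */
int lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/* Read exactly len bytes at offset start while holding a shared lock.
   Returns 0 on a full read, -1 on a lock failure or a short read. */
int read_range_shared(int fd, void *buf, size_t len, off_t start)
{
    assert(buf != NULL && len > 0);
    if (lock_range(fd, F_RDLCK, start, (off_t)len) == -1)
        return -1;
    ssize_t n = pread(fd, buf, len, start);
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (lock_range(fd, F_UNLCK, start, (off_t)len) == -1)
        rc = -1;
    return rc;
}
```

Several readers can hold F_RDLCK on the same range at once, which matches the requirement that concurrent reads of a page stay legal. The descriptor must be open for reading to take F_RDLCK and for writing to take F_WRLCK.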


No O_DIRECT / FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = 1 byte

Linux 4.2.6 with ext4: update atomicity = 1 byte

FreeBSD 10.2 with ZFS: update atomicity = at least 1 MB, possibly infinite (*)

O_DIRECT / FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH is off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than something designed in (*).

Linux 4.2.6 with ext4: update atomicity = at least 1 MB, possibly infinite (*). Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes. XFS certainly used to have custom locking, but it looks like recent Linux has finally fixed this.

FreeBSD 10.2 with ZFS: update atomicity = at least 1 MB, possibly infinite (*)


You can see the raw empirical test results at https://github.com/BoostGSoC13/boost.afio/blob/master/fs_probe/fs_probe_results.yaml . The results were generated by a program written using asynchronous file I/O on all platforms. Note that we only test at offsets that are multiples of 512 bytes, so I cannot say whether a partial update of a 512-byte sector would tear during a read-modify-write cycle.
