Unfortunately, TCP cannot send messages, only byte streams. If you want to send messages, you have to apply a protocol on top. The best protocols for high performance are those that use a sanity-checkable header containing a message length - this allows you to read the correct amount of data directly into a suitable buffer object without iterating the stream byte-by-byte looking for a message terminator. You can then queue the buffer POINTER off to another thread and create a new buffer object for the next message. This avoids any copying of bulk data and, for large messages, is sufficiently efficient that using a non-blocking queue for the message-object pointers is somewhat pointless.
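A minimal sketch of such a length-prefixed header, assuming a hypothetical layout (the magic value and field names are illustrative, not from the original):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sanity-checkable header: a magic value guards against stream
// desynchronization, and len gives the exact body size to read in one go.
struct MsgHeader {
    uint32_t magic;   // expected to equal kMagic; reject the stream otherwise
    uint32_t len;     // body length in bytes
};
constexpr uint32_t kMagic = 0x4D534721; // arbitrary sentinel value

// Parse a header from raw bytes; returns false if it fails the sanity check.
bool parseHeader(const uint8_t* raw, MsgHeader& out, uint32_t maxLen) {
    std::memcpy(&out, raw, sizeof(MsgHeader));
    return out.magic == kMagic && out.len <= maxLen;
}
```

Once the header passes the check, `len` bytes can be read straight into the buffer object's data area with no per-byte scanning.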
The next optimization is pooling the *buffer objects to avoid continual new/dispose, recycling the *buffers in the consumer thread for reuse by the network-receive thread. This is fairly easy to do with a concurrent queue, preferably a blocking one, to allow flow control rather than data corruption or segfaults/AV if the pool runs empty temporarily.
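A minimal sketch of such a blocking pool in standard C++ (the class name and interface are assumptions; a blocking concurrent-queue library would serve the same purpose):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking pool: pop() waits while the pool is empty, giving the
// receive thread flow control instead of letting it run out of buffers.
template <typename T>
class BlockingPool {
    std::queue<T*> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T* buf) {                    // consumer returns a used buffer
        { std::lock_guard<std::mutex> lk(m); q.push(buf); }
        cv.notify_one();
    }
    T* pop() {                             // receive thread depools a buffer
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this]{ return !q.empty(); });
        T* buf = q.front();
        q.pop();
        return buf;
    }
};
```

The pool is pre-filled with buffer objects at startup; thereafter buffers simply circulate between the receive thread and the consumer thread.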
Then add a [cache-line-size] 'dead zone' at the start of each *buffer's data member, so preventing any thread from false-sharing data with any other.
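One way to realize the dead zone, sketched with an assumed 64-byte cache line (the original's 256-byte zone is a more conservative choice):

```cpp
#include <cstddef>

// Pad and align each buffer so its hot fields can never share a cache
// line with a neighbouring buffer's fields.
struct alignas(64) PaddedBuffer {
    char deadZone[64]; // keeps other allocations off this cache line
    int  dataLen;
    char data[4096];   // smaller than the original's 8 MB, for illustration
};

static_assert(alignof(PaddedBuffer) == 64,
              "buffer must be cache-line aligned");
```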
The result should be a high-bandwidth flow of complete messages into the consumer thread, with very little latency, CPU waste or cache-thrashing. All your 24 cores can run flat-out on different data.
Copying bulk data in multithreaded applications is an admission of poor design and defeat.
Followup:
It looks like you are stuck with iterating over the data because of the differing protocols :(
Simple-case, unflagged PDU buffer object, for example:
```cpp
typedef struct {
    char deadZone[256];   // anti-false-sharing
    int  dataLen;
    char data[8388608];   // 8 meg of data
} SbufferData;

class TdataBuffer {
private:
    TbufferPool *myPool;  // reference to the pool used, in case there is more than one
    EpduState PDUstate;   // enum state variable used to decode the protocol
protected:
    SbufferData netData;
public:
    virtual void reInit();                        // zeros dataLen, resets PDUstate etc. - call when depooling a buffer
    virtual int loadPDU(char *fromHere, int len); // loads protocol data unit
    void release();                               // pushes 'this' back onto 'myPool'
};
```
loadPDU gets passed a pointer to, and the length of, the raw network data. It returns either 0, meaning that it has not yet completely assembled a PDU, or the number of bytes it consumed from the raw network data to completely assemble one, in which case you queue the buffer off, depool another, and call loadPDU() with the unused remainder of the raw data, then continue with subsequent raw data.
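That consumption contract can be illustrated with a self-contained stand-in for the TdataBuffer interface, using a toy protocol (a 1-byte length prefix followed by that many body bytes) purely as an assumption for demonstration:

```cpp
// Toy stand-in for the loadPDU contract: a real implementation would run
// the actual protocol state machine instead of this 1-byte length prefix.
struct ToyBuffer {
    int  dataLen = 0;
    char data[255];
    bool complete = false;
    int  needed = -1; // -1: length prefix not yet seen

    // Consume raw bytes; return 0 while the PDU is incomplete, else the
    // number of bytes consumed from this chunk to complete it.
    int loadPDU(const char* fromHere, int len) {
        int used = 0;
        if (needed < 0 && used < len)           // read the length prefix
            needed = (unsigned char)fromHere[used++];
        while (used < len && dataLen < needed)  // copy body bytes
            data[dataLen++] = fromHere[used++];
        if (needed >= 0 && dataLen == needed) {
            complete = true;
            return used;                        // caller re-feeds the remainder
        }
        return 0;                               // PDU still incomplete
    }
};
```

If the returned count is less than the chunk length, the caller queues this buffer off, depools a fresh one, and feeds it the leftover bytes, exactly as described above.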
You can use pools of different derived buffer classes to serve different protocols, if necessary - a TbufferPool[Eprotocols] array. TbufferPool could just be a BlockingCollection queue. Management becomes almost trivial: buffers can be sent on queues all round your system, to a GUI to display stats, then perhaps on to a logger, as long as release() is called at the end of the queue chain.
Obviously, a 'real' PDU would have more methods, data unions/structs, maybe iterators and a state machine to operate the protocol, but that's the basic idea anyway. The main thing is easy management, encapsulation and, since no two threads can ever operate on the same buffer instance, no locking/synchronization is required for data parsing/access.
Oh, yes, and since no queue has to remain locked for longer than it takes to push/pop one pointer, the chances of actual contention are very low - even ordinary blocking queues would hardly ever need to resort to kernel locking.