I developed a prototype NiFi flow to ingest data into HDFS. Now I would like to improve its overall performance, but I seem unable to make any progress.
The flow takes CSV files as input (each row has 80 fields), splits them at row level, applies some transformations to the fields (using 4 custom processors executed sequentially), buffers the new rows into CSV files, and writes them to HDFS. I designed the processors so that the flow file content is accessed only once, when each individual record is read and its fields are moved to flow file attributes. Tests were run on an Amazon EC2 m4.4xlarge instance (16 CPU cores, 64 GB of RAM).
This is what I have tried so far:
- Moved the flow file repository and the content repository to separate SSD drives
- Moved the provenance repository to memory (NiFi could not keep up with the event rate)
- Configured the system according to the recommended configuration settings
- Tried assigning multiple threads to each of the processors, to reach different total thread counts
- Tried increasing nifi.queue.swap.threshold and setting back pressure so that the swap limit is never reached
- Tried different JVM memory settings, from 8 up to 32 GB (in combination with G1GC)
- Tried increasing the instance specs; nothing changed
From the monitoring I performed, the disks do not appear to be the bottleneck (they are idle most of the time, which suggests the computation is actually happening in memory), and the average CPU load is below 60%.
The most I can get is 215 thousand rows per minute, which is about 3.5 thousand rows per second. In terms of volume, that is only 4.7 MB/s. I am aiming for something definitely higher than that. For comparison, I created a flow that reads a file, splits it into rows, merges them back into blocks, and writes them to disk. There I get 12 thousand rows per second, or 17 MB/s. That is not surprisingly fast either, which makes me think I am probably doing something wrong. Does anyone have suggestions on how to improve performance? How much would I gain by running NiFi on a cluster rather than scaling up the instance specs? Thanks to everyone.
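As a quick sanity check on the figures above, here is a small Python sketch of the arithmetic. It uses only the numbers quoted in this post; the variable names are illustrative, and it assumes 1 MB = 10^6 bytes.

```python
# Figures quoted in the post.
rows_per_min = 215_000        # full flow with the 4 custom processors
custom_mb_per_s = 4.7         # volume for the full flow
baseline_rows_per_s = 12_000  # comparison flow: split, merge, write to disk
baseline_mb_per_s = 17.0      # volume for the comparison flow

# Assumption: 1 MB = 10**6 bytes.
BYTES_PER_MB = 10**6

# Rows per second for the full flow.
custom_rows_per_s = rows_per_min / 60

# Average row size implied by each measurement.
custom_bytes_per_row = custom_mb_per_s * BYTES_PER_MB / custom_rows_per_s
baseline_bytes_per_row = baseline_mb_per_s * BYTES_PER_MB / baseline_rows_per_s

# How much slower the full flow is than the comparison flow, in rows/s.
slowdown = baseline_rows_per_s / custom_rows_per_s

print(f"full flow:       {custom_rows_per_s:.0f} rows/s, "
      f"{custom_bytes_per_row:.0f} bytes/row")
print(f"comparison flow: {baseline_rows_per_s} rows/s, "
      f"{baseline_bytes_per_row:.0f} bytes/row")
print(f"slowdown factor: {slowdown:.2f}x")
```

Both measurements imply rows of roughly 1.3 to 1.4 KB, so the two tests are comparable, and the custom processing costs a bit more than a factor of three in row throughput relative to the plain split-and-merge flow.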