Best Spring Batch Scaling Strategy

We have simple batch processes that work fine. Recently we received a new requirement to implement a batch process that generates reports. We read from a data source to prepare these reports; specifically, we might have one database view per report.

Now we want to scale this process so that it completes as early as possible.

I am familiar with the multi-threaded step, but I am not sure about the other strategies (remote chunking and the partitioned step) or which one to use when.

In our case, processing + writing to a file is more resource-intensive than reading.

Which approach is best suited in that case?

Or, if we find that reading data from the DB is just as resource-intensive as processing + writing to a file, what is the best option for improving / scaling this process?

spring spring-batch parallel-processing scalability
1 answer

TL;DR

Based on your description, I think you could try a multi-threaded step with a synchronized reader, since you mention that processing and writing are the more expensive parts of your step.

However, seeing that your reader is a database, I think getting a partitioned step configured and working would be very beneficial. It takes a bit more work to set up, but it will scale better in the long run.

Multi-threaded step

Use for:

  • Speeding up an individual step
  • When load balancing can be handled by the reader (e.g. JMS or AMQP)
  • When using a custom reader that manually partitions the data being read

Do not use for:

  • Stateful item readers

Multi-threaded steps build on the chunk-oriented processing that Spring Batch already uses. When you multi-thread a step, Spring Batch can execute each chunk in its own thread. Note that this means the entire read-process-write cycle for your chunks of data occurs in parallel, so there is no guaranteed order in which your data is processed. Also note that this will not work with stateful ItemReaders (JdbcCursorItemReader and JdbcPagingItemReader are both stateful).
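To make the "one chunk per thread, no ordering guarantee" behaviour concrete, here is a minimal plain-Java sketch. It is not Spring Batch code: the class name, the `run` method and the "row-" prefix are made up for illustration, and reading is simplified to slicing an in-memory list up front, whereas in a real multi-threaded step each thread reads its own chunk from the shared reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a multi-threaded step; not a Spring Batch class.
class MultiThreadedStepSketch {

    // Splits the input into chunks and lets a pool thread handle each chunk
    // from start to finish ("process" and "write" happen on the same thread).
    static List<String> run(List<Integer> input, int chunkSize, int threads) {
        ConcurrentLinkedQueue<String> written = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int start = 0; start < input.size(); start += chunkSize) {
            List<Integer> chunk = new ArrayList<>(
                    input.subList(start, Math.min(start + chunkSize, input.size())));
            pool.submit(() -> {
                List<String> processed = new ArrayList<>();
                for (Integer item : chunk) {
                    processed.add("row-" + item);   // "process" one item
                }
                written.addAll(processed);          // "write" the chunk
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            input.add(i);
        }
        // Every item appears exactly once, but chunk order can differ between runs.
        System.out.println(run(input, 5, 4));
    }
}
```

Every item ends up in the output exactly once, but the order in which chunks land there depends on thread scheduling, which is the trade-off described above.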

Multi-threaded step with synchronized reader

Use for:

  • Speeding up processing and writing for an individual step
  • When reading is stateful

Do not use for:

  • Speeding up reading

There is one way around the limitation that multi-threaded steps cannot be used with stateful readers: you can synchronize their read() method. This essentially makes reads happen serially (order is still not guaranteed), while processing and writing still happen in parallel. This can be a good option when the bottleneck is processing or writing rather than reading.
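The idea can be sketched in plain Java as a thin wrapper whose read() is synchronized. The `SimpleReader` interface and both reader classes below are hypothetical stand-ins written for this example, not Spring Batch types; Spring Batch does ship a `SynchronizedItemStreamReader` that plays a similar role, so in a real job you would likely reach for that rather than hand-rolling a wrapper.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these types are Spring Batch classes.
class SynchronizedReaderSketch {

    // Minimal reader contract assumed for this example: null means "no more input".
    interface SimpleReader<T> {
        T read();
    }

    // A stateful reader that is NOT thread-safe: the iterator position is its state.
    static class ListReader<T> implements SimpleReader<T> {
        private final Iterator<T> iterator;

        ListReader(List<T> items) {
            this.iterator = items.iterator();
        }

        @Override
        public T read() {
            return iterator.hasNext() ? iterator.next() : null;
        }
    }

    // Wrapper that lets only one thread at a time call the delegate's read().
    static class SynchronizedReader<T> implements SimpleReader<T> {
        private final SimpleReader<T> delegate;

        SynchronizedReader(SimpleReader<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized T read() {
            return delegate.read();
        }
    }

    // Reads serially through the wrapper, but "processes" and "writes" in parallel.
    static Set<Integer> run(List<Integer> input, int threads) {
        SimpleReader<Integer> reader = new SynchronizedReader<>(new ListReader<>(input));
        Set<Integer> written = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                Integer item;
                while ((item = reader.read()) != null) {
                    written.add(item * 2);   // "process" and "write" outside the lock
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return written;
    }
}
```

Only the read() call is serialized, so the wrapper does nothing to speed up reading; the gain comes entirely from the work done after each item has been handed to a thread.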

Partitioning

Use for:

  • Speeding up an individual step
  • When reading is stateful
  • When the input data can be partitioned

Do not use for:

  • When the input data cannot be partitioned

Partitioning a step behaves somewhat differently from a multi-threaded step. With a partitioned step, you actually have complete, distinct StepExecutions. Each StepExecution works on its own partition of the data. This way the readers do not run into problems reading the same data, because each reader looks only at its own slice of the data. This method is extremely powerful, but it is also more complicated to set up than a multi-threaded step.
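As a rough illustration of "each worker gets its own slice", the sketch below splits a numeric id range into non-overlapping sub-ranges and processes each one independently. It is plain Java under stated assumptions: the input is assumed to have a contiguous numeric key, and the class, the method names and the "partition0" style keys are invented for this example. In Spring Batch the splitting would live in a `Partitioner`, whose `partition(int gridSize)` method returns one `ExecutionContext` per partition, and each worker step's reader would use the stored bounds in its query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of range partitioning; not a Spring Batch class.
class PartitioningSketch {

    // Splits [minId, maxId] into at most gridSize contiguous, non-overlapping ranges.
    static Map<String, long[]> partition(long minId, long maxId, int gridSize) {
        Map<String, long[]> partitions = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize;   // ceiling division
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long end = Math.min(start + size - 1, maxId);
            partitions.put("partition" + i, new long[] {start, end});
            start = end + 1;
        }
        return partitions;
    }

    // Runs one worker per partition; each worker touches only the ids in its own range.
    // Returns the total number of ids handled across all partitions.
    static long processAll(Map<String, long[]> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] range : partitions.values()) {
                Callable<Long> worker = () -> {
                    long handled = 0;
                    for (long id = range[0]; id <= range[1]; id++) {
                        handled++;   // stand-in for read/process/write of one row
                    }
                    return handled;
                };
                results.add(pool.submit(worker));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a partition failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `partition(1, 10, 3)` yields the ranges 1..4, 5..8 and 9..10. Because the ranges never overlap, the workers need no coordination while reading, which is why this approach can speed up reading as well as processing and writing.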

Remote chunking

Use for:

  • Speeding up processing and writing for an individual step
  • Stateful readers

Do not use for:

  • Speeding up reading

Remote chunking is a very advanced use of Spring Batch. It requires some form of durable middleware to send and receive messages (such as JMS or AMQP). With remote chunking, reading is still single-threaded, but as each chunk is read it is sent to another JVM for processing. In practice this is very similar to how a multi-threaded step works, but remote chunking can use more than one process, as opposed to more than one thread. This means that remote chunking lets you scale your application horizontally rather than vertically. (TBH, I think that if you are considering remote chunking, you should also take a look at something like Hadoop.)
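The hand-off pattern can be sketched in a single JVM, with a `BlockingQueue` standing in for the messaging middleware and threads standing in for the remote worker processes. This is only an analogy under those assumptions: the class, the `run` method and the end-of-input marker are invented for illustration, and a real setup would send chunks over JMS or AMQP to separate JVMs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-JVM analogy for remote chunking; not a Spring Batch class.
class RemoteChunkingSketch {

    // End-of-input marker, recognised by identity rather than by content.
    private static final List<Integer> END_OF_INPUT = new ArrayList<>();

    // Returns how many items the workers "wrote".
    static int run(List<Integer> input, int chunkSize, int workers) {
        BlockingQueue<List<Integer>> channel = new LinkedBlockingQueue<>();  // stand-in for middleware
        AtomicInteger written = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Workers: take whole chunks off the channel, then process and write them.
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                try {
                    List<Integer> chunk = channel.take();
                    while (chunk != END_OF_INPUT) {
                        written.addAndGet(chunk.size());   // "process" + "write"
                        chunk = channel.take();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Manager: reads on a single thread and ships each chunk as soon as it is read.
        for (int start = 0; start < input.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, input.size());
            channel.add(new ArrayList<>(input.subList(start, end)));
        }
        for (int w = 0; w < workers; w++) {
            channel.add(END_OF_INPUT);   // one marker per worker so that all of them stop
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return written.get();
    }
}
```

Reading stays on one thread throughout, which matches the point above that remote chunking helps with processing and writing but not with reading.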

Parallel step

Use for:

  • Speeding up overall job execution
  • When there are independent steps that do not rely on each other

Do not use for:

  • Speeding up the execution of a single step
  • Dependent steps

Parallel steps are useful when you have two or more steps that can execute independently. Spring Batch makes it easy to run such steps in parallel on separate threads.
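For the one-view-per-report case in the question, the shape of the idea looks roughly like the plain-Java sketch below: two independent "steps" run concurrently, and a step that depends on both runs only after they finish. The step names (reportA, reportB, summary) and the class are hypothetical. In Spring Batch itself this would typically be expressed as a split flow with a TaskExecutor rather than with CompletableFuture.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of parallel, independent steps; not a Spring Batch class.
class ParallelStepsSketch {

    // Runs two independent steps in parallel, then one dependent step.
    // Returns the step names in completion order.
    static List<String> runJob() {
        List<String> completed = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Void> reportA =
                    CompletableFuture.runAsync(() -> completed.add("reportA"), pool);
            CompletableFuture<Void> reportB =
                    CompletableFuture.runAsync(() -> completed.add("reportB"), pool);

            // Wait for both independent steps before starting the dependent one.
            CompletableFuture.allOf(reportA, reportB).join();
            completed.add("summary");
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(completed);
    }

    public static void main(String[] args) {
        // reportA and reportB may appear in either order; summary is always last.
        System.out.println(runJob());
    }
}
```

This shortens the job as a whole but does nothing for any single step, so it combines well with one of the per-step strategies above when an individual report is itself slow.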
