Process Files in Java EE

I have a system that needs to take large files containing documents, process them to split out the individual documents, and create document objects that will be saved using JPA (or at least that is the assumption in this question).

Each file contains anywhere from 1 to 100,000 documents. Files come in many types:

  • Compressed
    • ZIP
    • Tar + gzip
    • Gzip
  • Plain text
  • XML
  • PDF

Now, the biggest problem is that the specification forbids access to local files, at least the way I'm used to doing it.

I can save the files in a database table, but is this really a good way to do it? Files can reach 2 GB, and accessing files from the database will require you to download the entire file either to memory or to disk.

My first thought was to split this processing off from the application server and use a more traditional approach, but I keep wondering whether keeping it on the application server might pay off later, for purposes such as clustering, etc.

My questions are mostly

  • Is there a standard way or recommended way to solve this problem in Java EE?
  • Is there an application-server-specific way of doing this?
  • Would you justify taking this process out of the application server? And how would you create a communication channel between the two separate systems?
+4

5 answers

I will outline a few alternative approaches here and consider the following concerns:

  • scalability (file size, clustering, etc.)
  • batch architecture (job recovery, error handling, monitoring, etc.)
  • Java EE compliance

With JCA

JCA connectors are part of the Java EE umbrella and are the way to connect to/from the EJB world; JDBC and JMS are typically implemented as JCA connectors. An inbound JCA connector can use threading (via the Work abstraction) and transactions. It can then hand any processing over to a message-driven bean (MDB).

  • Write a JCA connector that polls for new files, then processes them and delegates further processing to MDBs in a synchronous manner (a minimal sketch follows this list).
  • The MDB can then store the information in a database using JPA.
  • The JCA connector has transaction control, and multiple MDB invocations can take part in the same transaction.
  • The file system is not transactional, so you will need to figure out how to deal with failures, e.g. with erroneous input files.
  • You can probably use streaming (InputStream) along the whole chain.
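
To make the JCA option a bit more concrete, here is a heavily simplified sketch of an inbound resource adapter that polls a directory on a container-managed Work thread and pushes each new file to an MDB. The FileListener interface, the directory, and the bookkeeping are my own assumptions for illustration; a real adapter also needs an ActivationSpec, RAR packaging, and beforeDelivery()/afterDelivery() calls for proper transaction control.

    import java.io.File;
    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import javax.resource.ResourceException;
    import javax.resource.spi.*;
    import javax.resource.spi.endpoint.MessageEndpoint;
    import javax.resource.spi.endpoint.MessageEndpointFactory;
    import javax.resource.spi.work.*;
    import javax.transaction.xa.XAResource;

    // Hypothetical listener interface that the MDB implements.
    interface FileListener {
        void onFile(File file);
    }

    public class FilePollingResourceAdapter implements ResourceAdapter {

        private WorkManager workManager;
        private volatile boolean running;
        private final Set<String> seen =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        public void start(BootstrapContext ctx) {
            workManager = ctx.getWorkManager();
            running = true;
        }

        public void stop() {
            running = false;
        }

        public void endpointActivation(final MessageEndpointFactory factory, ActivationSpec spec)
                throws ResourceException {
            // Poll on a container-managed thread (Work), not on a thread we create ourselves.
            workManager.scheduleWork(new Work() {
                public void run() {
                    while (running) {
                        File[] files = new File("/var/webapp/incoming").listFiles(); // assumed directory
                        if (files != null) {
                            for (File file : files) {
                                if (seen.add(file.getName())) {
                                    deliver(factory, file);
                                }
                            }
                        }
                        try { Thread.sleep(5000); } catch (InterruptedException e) { return; }
                    }
                }
                public void release() { running = false; }
            });
        }

        private void deliver(MessageEndpointFactory factory, File file) {
            try {
                // A real adapter would pass an XAResource and use beforeDelivery()/afterDelivery()
                // so that several deliveries can share one transaction.
                MessageEndpoint endpoint = factory.createEndpoint(null);
                ((FileListener) endpoint).onFile(file);   // synchronous call into the MDB
                endpoint.release();
            } catch (Exception e) {
                // The file system is not transactional: e.g. move the file to an error directory here.
            }
        }

        public void endpointDeactivation(MessageEndpointFactory factory, ActivationSpec spec) { }

        public XAResource[] getXAResources(ActivationSpec[] specs) { return null; }
    }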

With plain threads

We can achieve more or less the same as with the JCA approach by using threads started from a web ServletContextListener (or an EJB timer).

  • A thread polls for new files; when one is found, it processes the file and delegates further processing to a regular SLSB synchronously (see the sketch after this list).
  • The thread in the web container has access to the UserTransaction and can manage the transaction itself.
  • The EJB can be local, so the InputStream is passed by reference.
  • The web module and the EJB can be deployed together in an EAR.
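
A minimal sketch of this variant, assuming a hypothetical local SLSB interface DocumentImportService and a hard-coded incoming directory:

    import java.io.InputStream;
    import java.nio.file.*;

    import javax.annotation.Resource;
    import javax.ejb.EJB;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;
    import javax.transaction.UserTransaction;

    // Hypothetical local business interface of the SLSB that does the real work.
    interface DocumentImportService {
        void importDocuments(InputStream in) throws Exception;
    }

    @WebListener
    public class FilePollingListener implements ServletContextListener {

        @EJB
        private DocumentImportService importService;

        @Resource
        private UserTransaction utx;

        private volatile boolean running = true;
        private Thread poller;

        public void contextInitialized(ServletContextEvent sce) {
            poller = new Thread(new Runnable() {
                public void run() {
                    while (running) {
                        pollOnce(Paths.get("/var/webapp/incoming"));   // assumed directory
                        try { Thread.sleep(5000); } catch (InterruptedException e) { return; }
                    }
                }
            }, "file-poller");
            poller.start();
        }

        private void pollOnce(Path dir) {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                for (Path file : files) {
                    try {
                        utx.begin();
                        try (InputStream in = Files.newInputStream(file)) {
                            // Local EJB call: the stream is passed by reference, not serialized.
                            importService.importDocuments(in);
                        }
                        utx.commit();
                        Files.delete(file);                       // or archive the processed file
                    } catch (Exception e) {
                        try { utx.rollback(); } catch (Exception ignore) { }
                        // e.g. move the broken file to an error directory
                    }
                }
            } catch (Exception e) {
                // directory missing or unreadable: log and retry on the next cycle
            }
        }

        public void contextDestroyed(ServletContextEvent sce) {
            running = false;
            poller.interrupt();
        }
    }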

With JMS

To avoid the need for multiple parallel polling threads and the resulting contention/blocking problems, the actual processing can be made asynchronous with JMS. JMS is also interesting if you want to split the processing into smaller tasks.

  • A periodic task polls for new files. When a file is found, a JMS message is queued (see the sketch after this list).
  • When the JMS message is delivered, the file is read and processed, and the information is stored in a database with JPA.
  • If the JMS processing fails, the application server can retry automatically or move the message to a dead-letter queue.
  • Monitoring and error handling are a bit more complicated.
  • Streaming may or may not be usable along the whole chain here.
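
A rough sketch of this variant, with an EJB timer as the poller and an MDB as the consumer. The queue name, the directories, the Document entity, and the DocumentSplitter helper are assumptions, and I use the JMS 2.0 simplified API plus the destinationLookup activation property for brevity; older servers would use a ConnectionFactory/Session and slightly different activation config.

    import java.io.InputStream;
    import java.nio.file.*;

    import javax.annotation.Resource;
    import javax.ejb.*;
    import javax.inject.Inject;
    import javax.jms.*;
    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;

    // Producer side: a periodic timer that enqueues one JMS message per new file.
    @Singleton
    public class FileScanner {

        @Resource(lookup = "jms/FileQueue")   // assumed queue configured on the server
        private Queue fileQueue;

        @Inject
        private JMSContext jms;

        @Schedule(hour = "*", minute = "*", persistent = false)
        public void scan() {
            try (DirectoryStream<Path> files =
                    Files.newDirectoryStream(Paths.get("/var/webapp/incoming"))) {   // assumed directory
                for (Path file : files) {
                    jms.createProducer().send(fileQueue, file.toString());
                    // Move the file aside so it is not enqueued twice (the file system is not transactional).
                    Files.move(file, Paths.get("/var/webapp/queued", file.getFileName().toString()));
                }
            } catch (Exception e) {
                // log and retry on the next run
            }
        }
    }

    // Consumer side: the MDB reads the file and stores the documents via JPA. A thrown
    // exception rolls back the transaction, so the server redelivers the message or
    // eventually moves it to the dead-letter queue.
    @MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "jms/FileQueue"),
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue")
    })
    public class FileProcessorBean implements MessageListener {

        @PersistenceContext
        private EntityManager em;

        public void onMessage(Message message) {
            try {
                Path file = Paths.get(((TextMessage) message).getText());
                try (InputStream in = Files.newInputStream(file)) {
                    // DocumentSplitter and Document are hypothetical: splitting depends on the format.
                    for (Document doc : DocumentSplitter.split(in)) {
                        em.persist(doc);
                    }
                }
            } catch (Exception e) {
                throw new EJBException(e);
            }
        }
    }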

With ESB

In the last few years many integration projects and technologies have appeared: JBI, ServiceMix, OpenESB, Mule, Spring Integration, Java CAPS, BPEL. Some of them are technologies, some are platforms, and there is some overlap between them. All of them have generic connectors and support routing, transforming, and orchestrating message flows. IMHO, a message should be a small piece of information, and it may be hard to use these technologies to process your large data files. The Enterprise Integration Patterns website is a great place for more information.

IMO, the approach that fits the Java EE philosophy best is JCA. But the investment is relatively high. In your case, a plain thread that delegates further processing to an SLSB might be the easiest solution. The JMS approach (along the lines of P. Thivent's suggestion below) becomes interesting if the processing pipeline gets more complex. Using an ESB seems overkill to me.

+3

Is there a standard way or recommended way to solve this problem in Java EE?

For this, I would use a real integration layer (as in EAI) running as an external process. Integration tools (ETL, EAI, ESB) are designed specifically for... integration, and many of them provide everything you need out of the box (in short: transports, connectors, transformation, routing, security).

In principle, when working with files, a file connector is used to monitor a directory for incoming files, which are then parsed/split into messages (with optional transformations) and sent to an endpoint for the business processing.
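
Just to illustrate what such a file connector does for you (the integration tool provides this out of the box, so you would not normally write it yourself), a hand-rolled equivalent in plain Java might look roughly like this; the directory, the splitting, and the dispatch step are placeholders:

    import java.nio.file.*;

    public class DirectoryWatcher {

        public static void main(String[] args) throws Exception {
            Path inbox = Paths.get("/var/integration/inbox");   // placeholder directory

            WatchService watcher = FileSystems.getDefault().newWatchService();
            inbox.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

            while (true) {
                WatchKey key = watcher.take();                   // blocks until something arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path file = inbox.resolve((Path) event.context());
                    // Parse/split the file into messages and hand each one to the endpoint.
                    for (String message : split(file)) {
                        sendToEndpoint(message);                 // e.g. a JMS send, as sketched further down
                    }
                }
                key.reset();
            }
        }

        private static Iterable<String> split(Path file) throws Exception {
            // Placeholder: real splitting depends on the format (ZIP, gzip, XML, PDF, ...)
            // and should stream rather than read everything into memory.
            return Files.readAllLines(file);
        }

        private static void sendToEndpoint(String message) {
            // Placeholder for the transport (JMS, HTTP, ...).
            System.out.println("dispatching: " + message);
        }
    }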

Take a look at Mule ESB, for example (it has a file connector, supports many transports, and can run as a standalone process). Or maybe Spring Integration (in combination with Spring Batch?), which also has File and JMS adapters, but I don't have much experience with it, so I can't say much about it. Or, if you are rich, you can look at Tibco EMS, webMethods, etc. Or build your own solution using some parsing library (for example, jFFP or Flatworm).

Is there an application-server-specific way of doing this?

Not that I'm aware of.

Would you justify taking this process out of the application server? And how would you create a communication channel between the two separate systems?

As I said, I would use an external process for handling the files (it's the best fit) and send the contents of the files as messages via JMS to the application server for the business processing (and thus benefit from Java EE features such as load balancing and transaction management).
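
A rough sketch of that communication channel: a standalone (non-Java-EE) process looks up the server's JMS resources over JNDI and sends one message per extracted document. The JNDI properties, the resource names, and the DocumentSplitter helper are assumptions that depend on your application server and file formats.

    import java.io.BufferedReader;
    import java.nio.file.*;
    import java.util.Properties;

    import javax.jms.*;
    import javax.naming.Context;
    import javax.naming.InitialContext;

    public class FileSplitterClient {

        public static void main(String[] args) throws Exception {
            Properties env = new Properties();
            // Assumed values: the initial context factory and URL depend on your application server.
            env.put(Context.INITIAL_CONTEXT_FACTORY, "<your server's initial context factory>");
            env.put(Context.PROVIDER_URL, "<your server's naming URL>");
            Context ctx = new InitialContext(env);

            ConnectionFactory factory = (ConnectionFactory) ctx.lookup("jms/ConnectionFactory");
            Queue queue = (Queue) ctx.lookup("jms/DocumentQueue");   // assumed queue name

            Connection connection = factory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer = session.createProducer(queue);

                try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                    // Hypothetical splitter: stream the input and emit one message per document.
                    for (String document : DocumentSplitter.split(reader)) {
                        producer.send(session.createTextMessage(document));
                    }
                }
            } finally {
                connection.close();
            }
        }
    }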

+2

accessing files from the database will require you to download the entire file either to memory or to disk.

This is not entirely true. You are not forced to read the whole thing into an intermediate byte[] or so. You can just use streams. Obtain an InputStream from it with ResultSet#getBinaryStream() and immediately process it the usual way, e.g. by writing it to HttpServletResponse#getOutputStream(). The only cost is the buffer size, which you can choose yourself.
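
A minimal sketch of that idea, assuming a hypothetical files table with a BLOB content column and a container-managed DataSource:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.sql.*;

    import javax.annotation.Resource;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.*;
    import javax.sql.DataSource;

    @WebServlet("/files/*")
    public class FileStreamingServlet extends HttpServlet {

        @Resource(name = "jdbc/myDS")   // assumed DataSource name
        private DataSource dataSource;

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            long fileId = Long.parseLong(request.getPathInfo().substring(1));

            try (Connection con = dataSource.getConnection();
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT content FROM files WHERE id = ?")) {   // hypothetical table
                ps.setLong(1, fileId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) {
                        response.sendError(HttpServletResponse.SC_NOT_FOUND);
                        return;
                    }
                    // Stream straight from the database to the client.
                    try (InputStream in = rs.getBinaryStream("content");
                         OutputStream out = response.getOutputStream()) {
                        byte[] buffer = new byte[8192];   // the buffer is the only memory cost
                        for (int len; (len = in.read(buffer)) != -1; ) {
                            out.write(buffer, 0, len);
                        }
                    }
                }
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }
    }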

Is there a standard way or recommended way to solve this problem in Java EE?

Either the database, or a fixed file system path with read/write access for the application server. For instance, /var/webapp/files on local disk.

+1

I think the healthiest way to do this is to keep it outside the Java application server.

Application servers are designed to manage their resources (CPU, memory, threads) in their own way. Long-running, I/O-intensive batch processing disrupts that kind of resource management.

I suggest using an external process to split the batch files, so that disk usage stays under control, and having the AS read the results via file system access, as BalusC suggested.

I assume that concurrency issues would be handled by the JPA layer, which I admittedly don't know much about, but I believe that also works in plain Java SE.

+1

The specification only prohibits access to files via java.io. There are other legitimate ways to access files, e.g. via a DataSource/JDBC driver or through a resource connector.

See p. 545 of "JSR 220: Enterprise JavaBeans, Version 3.0, EJB Core Contracts and Requirements".


... using JDBC to access files. Could you explain this in more detail?

A file is a data store just like a database. It's a pretty good store for unstructured character data, and not so great when you need transactional safety, multi-user access, random write access, or structured binary data. In an enterprise system you will almost always have at least one of the latter requirements.

Although "there are no files in an enterprise system" is not literally true (log files exist, and almost all databases use files at a low level), it is a pretty good design rule, because of all the problems that data files create in a high-performance, multi-user, transactional enterprise system.

Unfortunately, the business world is full of business data stored in files, and you have to deal with it. Some file formats (such as Excel spreadsheets) have a lot in common with a simple database and can be accessed through a JDBC driver. I have never heard of anyone accessing plain text files through a JDBC driver, but you could, or you could use a more generic resource adapter (per the EJB 3 specification, JDBC is a resource manager API).

+1
