Where is the Pentaho Kettle architecture documented?

Where can I find the Pentaho Kettle architecture? I am looking for a short wiki, a design document, a blog post, anything that gives a good overview of how everything works. This question is not about specific beginner "how-to" tutorials; it is about getting a good look at the technology and the architecture.

Specific questions that I have:

  • How is data exchanged between steps? It would seem that everything is held in memory; am I right about that?
  • Does the above also apply across the various transformations?
  • How are the data-collecting steps executed?
  • Any specific usage recommendations?
  • Is the FTP task reliable and efficient?
  • Any other dos and don'ts?
2 answers

See this PDF.

  1. How is data exchanged between steps? It would seem that everything is held in memory. Am I right about this?

The data flow is row-based. Within a transformation, each step produces rows of fields, and each field is a pair of data and metadata. Each step has inputs and outputs: it accepts rows from its inputs, modifies them, and sends rows to its outputs. In most cases the rows in flight are held in memory, but steps read their data in a streaming manner (for example, over JDBC), so usually only a portion of the stream is in memory at any one time.
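As a rough illustration, a Kettle step implements a read-modify-write loop over rows. The following is a minimal sketch assuming the standard PDI plugin API (BaseStep with getRow() and putRow()); the actual field manipulation is left as a comment because it depends on the step.

    // Minimal sketch of a step's row loop, based on the PDI plugin API.
    // getRow() pulls one row from the step's input buffer; putRow() pushes
    // a row to the output buffer(s) read by the next step's thread.
    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
            throws KettleException {
        Object[] row = getRow();          // blocks until the previous step produces a row
        if (row == null) {                // null means the input stream is exhausted
            setOutputDone();
            return false;                 // tell the engine this step is finished
        }
        // A row is a plain Object[]; its field metadata (names, types) lives
        // separately in the RowMeta returned by getInputRowMeta().
        // ... read or modify field values here, depending on the step ...
        putRow(getInputRowMeta(), row);   // hand the row to the next step's input buffer
        return true;                      // ask the engine to call processRow() again
    }

Because each step runs in its own thread and the row buffers between steps are bounded, only a window of the stream sits in memory at once.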

  2. Does the above also apply across the various transformations?

There are two concepts: a "job" and a "transformation." Everything written above applies mostly to transformations. A transformation can contain very different steps, and some of them, such as the collecting steps, may try to gather all the data from the stream. Jobs are a way to perform actions that do not follow the streaming concept, for example sending an e-mail on success, downloading files from the network, or executing transformations one after another.
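The distinction shows up in how the two are executed: a job runs its entries sequentially, while a transformation starts all of its steps as parallel threads that stream rows to each other. Here is a minimal sketch of running a transformation from Java, assuming the standard PDI embedding API; the file name example.ktr is hypothetical.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                       // initialize the Kettle engine
            TransMeta meta = new TransMeta("example.ktr");  // load a transformation definition
            Trans trans = new Trans(meta);
            trans.execute(null);          // starts every step as its own thread
            trans.waitUntilFinished();    // block until all steps have drained their input
            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }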

  3. How are the data-collecting steps executed?

That depends entirely on the specific step. Typically, as stated above, collecting steps may try to gather all the data from the stream, and doing so can cause OutOfMemory errors. If the data is too large, consider a different approach to processing it, for example using steps that do not need to hold all the data at once.
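To make the memory behaviour concrete, here is a plain-Java illustration (not PDI API) of the difference: a collect-everything step behaves like the first loop below, while a streaming step behaves like the second.

    import java.util.ArrayList;
    import java.util.List;

    public class CollectVsStream {
        public static void main(String[] args) {
            // Collecting approach: buffer every row before producing any output.
            // A sort, for instance, cannot emit its first output row until it
            // has seen the last input row; with enough rows this buffering is
            // where OutOfMemoryError comes from.
            List<Long> buffered = new ArrayList<>();
            for (long i = 0; i < 1_000_000; i++) {
                buffered.add(i);              // memory grows with the row count
            }
            long sumCollected = 0;
            for (long v : buffered) {
                sumCollected += v;
            }

            // Streaming approach: keep only the running aggregate, so memory
            // use stays constant no matter how many rows pass through.
            long sumStreamed = 0;
            for (long i = 0; i < 1_000_000; i++) {
                sumStreamed += i;
            }

            System.out.println(sumCollected + " == " + sumStreamed);
        }
    }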

  4. Any specific usage recommendations?

A lot, but they depend on which steps and data sources are in play. I would rather talk about an exact scenario than give general recommendations.

  5. Is the FTP task reliable and efficient?

As far as I remember, FTP support is backed by the edtFTP library, and there can be problems with the FTP steps, such as some settings not being saved or an HTTP/FTP proxy not working. I would say that Kettle is generally reliable and performant, but for some rarely used scenarios this may not be the case.

  6. Any other dos and don'ts?

I would say the main "do" is to understand the tool before starting to use it intensively. As mentioned in this discussion, there are a couple of books on data integration with Kettle/Pentaho that you can try to find.

One of the benefits of Pentaho Data Integration/Kettle is its relatively large community, which you can ask about specific aspects.

http://forums.pentaho.com/

https://help.pentaho.com/Documentation

