Where is the Pentaho Kettle architecture documented?

Where can I find the Pentaho Kettle architecture? I am looking for a short wiki, a design document, a blog post, anything that gives a good overview of how everything works. This question is not about specific beginner "how-to" tutorials; it is about getting a good look at the technology and the architecture.

Specific questions that I have:

  • How is data exchanged between steps? It would seem that everything is held in memory; am I right about that?
  • Does the above also apply across the various transformations?
  • How are the data-collecting steps executed?
  • Any specific usage recommendations?
  • Is the FTP task reliable and efficient?
  • Any other dos and don'ts?
2 answers

See this PDF.

  1. How is data exchanged between steps? It would seem that everything is held in memory. Am I right about this?

The data flow is row-based. Within a transformation, each step produces rows of fields, and each field is a pair of data and metadata. Each step has inputs and outputs: it accepts rows from its inputs, modifies them, and sends rows to its outputs. In most cases the rows in flight are held in memory, but steps read their data in a streaming manner (for example, over JDBC), so usually only a portion of the stream is in memory at any one time.
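As a rough illustration, a Kettle step implements a read-modify-write loop over rows. The following is a minimal sketch assuming the standard PDI plugin API (BaseStep with getRow() and putRow()); the actual field manipulation is left as a comment because it depends on the step.

    // Minimal sketch of a step's row loop, based on the PDI plugin API.
    // getRow() pulls one row from the step's input buffer; putRow() pushes
    // a row to the output buffer(s) read by the next step's thread.
    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
            throws KettleException {
        Object[] row = getRow();          // blocks until the previous step produces a row
        if (row == null) {                // null means the input stream is exhausted
            setOutputDone();
            return false;                 // tell the engine this step is finished
        }
        // A row is a plain Object[]; its field metadata (names, types) lives
        // separately in the RowMeta returned by getInputRowMeta().
        // ... read or modify field values here, depending on the step ...
        putRow(getInputRowMeta(), row);   // hand the row to the next step's input buffer
        return true;                      // ask the engine to call processRow() again
    }

Because each step runs in its own thread and the row buffers between steps are bounded, only a window of the stream sits in memory at once.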

  2. Does the above also apply across the various transformations?

There are two concepts: a "job" and a "transformation." Everything written above applies mostly to transformations. A transformation can contain very different steps, and some of them, such as the collecting steps, may try to gather all the data from the stream. Jobs are a way to perform actions that do not follow the streaming concept, for example sending an e-mail on success, downloading files from the network, or executing transformations one after another.
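The distinction shows up in how the two are executed: a job runs its entries sequentially, while a transformation starts all of its steps as parallel threads that stream rows to each other. Here is a minimal sketch of running a transformation from Java, assuming the standard PDI embedding API; the file name example.ktr is hypothetical.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                       // initialize the Kettle engine
            TransMeta meta = new TransMeta("example.ktr");  // load a transformation definition
            Trans trans = new Trans(meta);
            trans.execute(null);          // starts every step as its own thread
            trans.waitUntilFinished();    // block until all steps have drained their input
            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }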

  3. How are the data-collecting steps executed?

That depends entirely on the specific step. Typically, as stated above, collecting steps may try to gather all the data from the stream, and doing so can cause OutOfMemory errors. If the data is too large, consider a different approach to processing it, for example using steps that do not need to hold all the data at once.
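To make the memory behaviour concrete, here is a plain-Java illustration (not PDI API) of the difference: a collect-everything step behaves like the first loop below, while a streaming step behaves like the second.

    import java.util.ArrayList;
    import java.util.List;

    public class CollectVsStream {
        public static void main(String[] args) {
            // Collecting approach: buffer every row before producing any output.
            // A sort, for instance, cannot emit its first output row until it
            // has seen the last input row; with enough rows this buffering is
            // where OutOfMemoryError comes from.
            List<Long> buffered = new ArrayList<>();
            for (long i = 0; i < 1_000_000; i++) {
                buffered.add(i);              // memory grows with the row count
            }
            long sumCollected = 0;
            for (long v : buffered) {
                sumCollected += v;
            }

            // Streaming approach: keep only the running aggregate, so memory
            // use stays constant no matter how many rows pass through.
            long sumStreamed = 0;
            for (long i = 0; i < 1_000_000; i++) {
                sumStreamed += i;
            }

            System.out.println(sumCollected + " == " + sumStreamed);
        }
    }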

  4. Any specific usage recommendations?

A lot, but they depend on which steps and data sources are in play. I would rather talk about an exact scenario than give general recommendations.

  5. Is the FTP task reliable and efficient?

As far as I remember, FTP support is backed by the edtFTP library, and there can be problems with the FTP steps, such as some settings not being saved or an HTTP/FTP proxy not working. I would say that Kettle is generally reliable and performant, but for some rarely used scenarios this may not be the case.

  6. Any other dos and don'ts?

I would say the main "do" is to understand the tool before starting to use it intensively. As mentioned in this discussion, there are a couple of books on data integration with Kettle/Pentaho that you can try to find.

One of the benefits of Pentaho Data Integration/Kettle is its relatively large community, which you can ask about specific aspects.

http://forums.pentaho.com/

https://help.pentaho.com/Documentation

