- How is data exchanged between steps? It seems everything is kept in memory. Am I right about this?
The data stream is row-based. In a transformation, each step produces "tuples", or rows, made up of fields; each field is a pair of data and metadata. Each step has an input and an output: a step accepts rows from its input, changes them, and sends rows to its outputs. In most cases, the data lives in memory. However, steps read data in a streaming manner (for example, the JDBC input steps), so usually only a portion of the stream is in memory at any given time.
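The row-streaming model above can be sketched with Python generators. This is a hypothetical analogy, not the Kettle API: rows are dicts mapping a field name to a (value, metadata) pair, and each "step" consumes rows from its input and yields rows to its output, so only a small window of the stream is in memory at a time.

```python
def input_step(raw_records):
    """Source step: turn raw records into rows of (value, metadata) pairs."""
    for rec in raw_records:
        yield {name: (value, {"type": type(value).__name__})
               for name, value in rec.items()}

def uppercase_step(rows, field):
    """Transforming step: change one field on each row as it passes through."""
    for row in rows:
        value, meta = row[field]
        row[field] = (value.upper(), meta)
        yield row

def output_step(rows):
    """Sink step: consume the stream and materialize plain values."""
    return [{name: pair[0] for name, pair in row.items()} for row in rows]

records = [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]
result = output_step(uppercase_step(input_step(records), "name"))
print(result)  # [{'name': 'ALICE', 'age': 30}, {'name': 'BOB', 'age': 25}]
```

Because the steps are chained generators, no step needs the whole data set at once: each row flows through the pipeline independently.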
- Is the above also true for the various transformations?
There is a concept of a "job" and a concept of a "transformation". Everything written above applies mostly to transformations. A transformation can contain very different steps, and some of them - for example, collecting steps - may try to gather all the data from the stream. Jobs are a way to perform actions that do not follow the streaming concept - for example, sending an e-mail on success, downloading files from the network, or running several transformations one after another.
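The job-versus-transformation distinction can be illustrated with a minimal sketch (again a hypothetical analogy, not the Kettle API): a job is an ordered sequence of actions executed one at a time, each running to completion before the next starts, with a failure routed to a follow-up action such as a notification.

```python
def run_job(entries, on_failure):
    """Execute job entries sequentially; stop and report on the first failure."""
    for name, action in entries:
        try:
            action()
            print(f"{name}: OK")
        except Exception as exc:
            print(f"{name}: FAILED ({exc})")
            on_failure(name, exc)
            return False
    return True

log = []
entries = [
    ("download file", lambda: log.append("downloaded")),
    ("run transformation", lambda: log.append("transformed")),
]
ok = run_job(entries, on_failure=lambda name, exc: log.append(f"mail about {name}"))
print(ok)  # True
```

Unlike the streaming pipeline of a transformation, nothing here flows row by row; each entry is a discrete, whole action.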
- How do data-collecting steps behave?
It depends on the specific step. As stated above, collecting steps may try to gather all the data from the stream, which can cause OutOfMemory errors. If the data set is too large, consider replacing collecting steps with a different approach (for example, use steps that do not need to hold all the data at once).
- Any specific recommendations for using it?
A lot. It depends on the transformation steps and data sources used. I would rather discuss a specific scenario than give general recommendations.
- Is the FTP task reliable and efficient?
As far as I remember, FTP support is based on the EdtFTP implementation, and there can be some problems, such as settings that are not saved or an HTTP-to-FTP proxy that does not work. I would say that Kettle is generally reliable and performant, but for some rarely used scenarios this may not be the case.
- Any other dos and don'ts?
I would say the main "do" is to understand the tool before starting to use it intensively. As mentioned in this discussion, there are a few books on data integration with Kettle / Pentaho that you can look for.
One of the benefits of Pentaho Data Integration / Kettle is its relatively large community, which you can ask about specific aspects:
http://forums.pentaho.com/
https://help.pentaho.com/Documentation