Explain Apache Beam Python syntax

I read the Beam documentation and also looked at the Python documentation, but did not find a good explanation of the syntax used in most Apache Beam examples.

Can anyone explain what _ , | and >> do in the code below? Also, is the text in quotation marks, e.g. 'ReadTrainingData', meaningful, or could it be replaced with any other label? In other words, how is this label used?

    train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
    evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)

    input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

    _ = (input_metadata
         | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
             os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
             pipeline=pipeline))

    preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
    (train_dataset, train_metadata), transform_fn = (
        (train_data, input_metadata)
        | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
            preprocessing_fn))
python apache-beam
1 answer

Operators in Python can be overloaded. In Beam, | is a synonym for apply, which applies a PTransform to a PCollection to produce a new PCollection. >> lets you name a step so that it is easier to display in the various UIs. The string between | and >> is used only for display purposes and to identify that particular application of the transform.

See https://beam.apache.org/documentation/programming-guide/#transforms
