Hadoop Cascading: CascadeException "no loops allowed in cascade" when using two CoGroups

Question


I am trying to write a Cascading (v1.2) Cascade ( http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844 ) consisting of two flows:

1) The first flow sinks URLs into a db table (in which each row is automatically assigned an id by an auto-incrementing id column). This flow also sinks pairs of URLs into a SequenceFile with the field names "urlTo" and "urlFrom".

2) The second flow reads from both of these sinks and does a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".

It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".

The two flows work individually, that is, if I call flow.complete() on the first before starting the second. But if I put the two flows in a Cascade object, I get the error

 cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}

when trying to configure a cascade.

I can see that this comes from the addEdgeFor function of the CascadeConnector, but I do not understand how to fix it.

I have never used Cascade / CascadeConnector before. Am I missing something?


hadoop cascading

Katie Jul 16 '13 at 14:30


2 answers

It seems that the paths for your source and sink are the same.

A Cascade builds itself as a directed graph of flows, so if a flow's source and sink point to the same place, that essentially creates a loop, which is not allowed in that graph, since the flow

does not go from:

Source Location A to Sink Location B

but instead goes from:

Source Location A to Sink Location A.
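
As a rough illustration of the point above (a minimal sketch, not Cascading's actual code): if each flow is treated as an edge from its source's identifier to its sink's identifier, then a flow whose source and sink share an identifier becomes an edge from a node back to itself. The names `CascadeLoopSketch`, `FlowEdge` and `isSelfLoop` are hypothetical and made up for this example.

```java
class CascadeLoopSketch {

    /** Hypothetical stand-in for a flow: only the identifiers of its source and sink. */
    static final class FlowEdge {
        final String name;
        final String sourceId;
        final String sinkId;

        FlowEdge(String name, String sourceId, String sinkId) {
            this.name = name;
            this.sourceId = sourceId;
            this.sinkId = sinkId;
        }
    }

    /** An edge whose two endpoints have the same identifier is a loop on one node. */
    static boolean isSelfLoop(FlowEdge flow) {
        return flow.sourceId.equals(flow.sinkId);
    }

    public static void main(String[] args) {
        // Assumption for illustration: both taps report only the connection URL.
        String db = "jdbc:mysql://localhost:3306/mydb";

        FlowEdge aToB = new FlowEdge("a-to-b", "location-A", "location-B");
        FlowEdge aToA = new FlowEdge("urlLink*url*url", db, db);

        System.out.println(aToB.name + " self-loop: " + isSelfLoop(aToB));
        System.out.println(aToA.name + " self-loop: " + isSelfLoop(aToA));
    }
}
```

In this model, the flow from the question looks like the second case if both of its database taps report the same identifier.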


Engineiro Jul 16 '13 at 21:34


Katie · Accepted Answer · 2013-07-17T15:33:46+0000

"A Tap is not given an explicit name by design. This is so that a given Tap instance can be reused in different {@link Flow}s that may expect a source or sink by a different logical name, but the same physical resource."

"In general, two instances of the same Tap class must have different identifiers (and must not be #equals)."

It turns out that JDBCTaps generate their identifier from the connection URL only (the table name is not included). Since I was reading from one table and writing to another table in the same database, it looked as though I was reading from and writing to the same Tap, which caused the loop.

As a workaround, I'm going to subclass JDBCTap and override the getIdentifier() method to include the table name.
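
The idea can be sketched without the Cascading library on the classpath. The classes below are hypothetical stand-ins, not the real JDBCTap: `UrlOnlyTap` models a tap whose identifier is just the connection URL, and `TableAwareTap` models the planned subclass. The "/" separator between URL and table name is an assumption chosen for illustration.

```java
class TapIdentifierSketch {

    /** Hypothetical stand-in for a JDBC tap that identifies itself by connection URL only. */
    static class UrlOnlyTap {
        final String connectionUrl;
        final String tableName;

        UrlOnlyTap(String connectionUrl, String tableName) {
            this.connectionUrl = connectionUrl;
            this.tableName = tableName;
        }

        String getIdentifier() {
            return connectionUrl;
        }
    }

    /** Hypothetical subclass: appends the table name so each table gets its own identifier. */
    static class TableAwareTap extends UrlOnlyTap {
        TableAwareTap(String connectionUrl, String tableName) {
            super(connectionUrl, tableName);
        }

        @Override
        String getIdentifier() {
            // Assumed format; any scheme that keeps tables distinct would serve.
            return connectionUrl + "/" + tableName;
        }
    }

    public static void main(String[] args) {
        String db = "jdbc:mysql://localhost:3306/mydb";

        UrlOnlyTap plainSource = new UrlOnlyTap(db, "urls");
        UrlOnlyTap plainSink = new UrlOnlyTap(db, "url_link");
        System.out.println("URL-only identifiers equal: "
                + plainSource.getIdentifier().equals(plainSink.getIdentifier()));

        UrlOnlyTap source = new TableAwareTap(db, "urls");
        UrlOnlyTap sink = new TableAwareTap(db, "url_link");
        System.out.println("Table-aware identifiers equal: "
                + source.getIdentifier().equals(sink.getIdentifier()));
    }
}
```

With the table name included, the source on `urls` and the sink on `url_link` no longer share an identifier, so in this model the flow is no longer an edge from a node to itself.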
