I have a simple task that scans websites and caches them in HDFS. It checks whether the URL already exists in HDFS; if so, it uses the cached copy, otherwise it loads the page and saves it to HDFS.
If a network error (404, etc.) occurs while loading a page, the URL is skipped entirely and nothing is written to HDFS. Whenever I run it over a small list of ~1000 websites, I always hit the error below, which repeatedly aborts the task on my pseudo-distributed installation. What could be the problem?
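For reference, the check-then-cache flow looks roughly like the minimal sketch below. The class and method names (PageCache, fetch) are mine, just to illustrate the logic described above; the real code derives the HDFS file name from a hash of the URL, as in the path shown in the stack trace.

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class PageCache {
        private final FileSystem fs;
        private final Path cacheDir;

        public PageCache(Configuration conf, String cacheDirName) throws Exception {
            this.fs = FileSystem.get(conf);
            this.cacheDir = new Path(cacheDirName);
        }

        /** Returns a stream over the cached page, fetching and caching it on a miss. */
        public InputStream fetch(String url, String key) throws Exception {
            Path cached = new Path(cacheDir, key);       // key is e.g. an MD5 of the URL
            if (fs.exists(cached)) {
                return fs.open(cached);                  // cache hit: read from HDFS
            }
            InputStream in = new URL(url).openStream();  // cache miss: load the page
                                                         // (throws on 404 etc., so the URL is skipped)
            FSDataOutputStream out = fs.create(cached);
            try {
                IOUtils.copyBytes(in, out, 4096, false); // copy the page bytes into HDFS
            } finally {
                out.close();                             // always close the HDFS stream
                in.close();
            }
            return fs.open(cached);
        }
    }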
I am running Hadoop 0.20.2-cdh3u3.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/raj/cache/9b4edc6adab6f81d5bbb84fdabb82ac0 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
hadoop cloudera
rsman Apr 03 '12 at 4:16