
Loading extremely long lines with TextLine in Cascading

Question asked by ivan_nikolaev on Aug 14, 2014
Latest reply on Aug 15, 2014 by Ted Dunning
Hello everyone,

I've been struggling with this for a few days now. I've posted the same question on Stack Overflow, with no luck. Any help is appreciated.

I'm using TextLine in Cascading to load files with very long lines: around 30 MB per line on average, with some much longer. When I run the job locally to test it, it runs fine, but when I run it on the cluster it fails after a period of intensive crunching. It gives errors like:

    cascading.tuple.TupleException: unable to read from input identifier: maprfs:/xxx/xxx/xxx/part-00001
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
    at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
    at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
    at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:127)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:443)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1122)
    at org.apache.hadoop.mapred.Child.main(Child.java:271)

It also sometimes complains about stale file handles. The file it's trying to read is definitely there. Can somebody help me, please?

Here is a link to a more complete stack trace: http://pastebin.com/9JCbsmcr. I've run this job on two different clusters with the same results. I really need to solve this problem because it is blocking my work.
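For anyone trying to reproduce this, one way to confirm how long the lines in a given input file really are is a small standalone check like the sketch below. It uses only the Java standard library and is not part of the Cascading job; the class name `LineLengthScanner` and the method `longestLineBytes` are hypothetical names chosen for illustration. It streams the file in fixed-size chunks, so it never holds a whole line in memory.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Cascading or Hadoop.
public class LineLengthScanner {

    /**
     * Returns the length in bytes of the longest line in the stream.
     * Lines are split on '\n'; the newline itself is not counted,
     * but a preceding '\r' (if any) is.
     */
    public static long longestLineBytes(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024]; // fixed-size chunk, independent of line length
        long current = 0;
        long longest = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    longest = Math.max(longest, current);
                    current = 0;
                } else {
                    current++;
                }
            }
        }
        // Account for a final line that has no trailing newline.
        return Math.max(longest, current);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: LineLengthScanner <local-file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("longest line (bytes): " + longestLineBytes(in));
        }
    }
}
```

This assumes the file has been copied to a local path first; it only measures line sizes and does not by itself explain the TupleException.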



Best regards,

Ivan
