
pyspark catch exception

Question asked by imazor on Dec 10, 2014
Hi,

Is it possible to catch exceptions in PySpark so that, in case of an error, the program does not fail and exit?

For example, if I am using (key, value) RDD functionality but the data does not actually have a (key, value) format, PySpark throws an exception (such as ValueError) that I am unable to catch.

I guess this is because the errors occur on each worker, where I don't have full control.
It is probably also because of the DAG. For example (see below), it is useless to catch an exception around .foldByKey, since it is a transformation and not an action; the transformation is only pipelined and materialized when some action is applied, such as .first().
But even trying to catch the exception around the action fails.

I would expect the different exceptions to eventually be collected and returned to the driver, where the developer could handle them and decide on the next step.

*** Of course, I could first check the input to verify that it matches the (key, value) format, but in my opinion this adds overhead and involves extra transformations.
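To illustrate what I mean by that pre-check, here is a rough sketch (is_key_value is just a helper name I made up, and pdata is the RDD from the code example below):

    def is_key_value(record):
        # hypothetical helper: only 2-tuples count as (key, value) records
        return isinstance(record, tuple) and len(record) == 2

    clean = pdata.filter(is_key_value)                 # the extra transformation I would rather avoid
    bad = pdata.filter(lambda r: not is_key_value(r))  # could be collected and inspected on the driver

    clean.foldByKey([], lambda v1, v2: v1 + [v2] if type(v2) != list else v1 + v2).first()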
 

code example:

    data = [((1,),'e'),((2,),'b'),((1,),'aa', 'e'),((2,),'bb', 'e'),((5,),'a', 'e')]
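    # note: the last three records are 3-tuples, not (key, value) pairs, so foldByKey will raise ValueError on the workers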
    pdata = sc.parallelize(data,3)
    t=pdata.foldByKey([], lambda v1, v2: v1+[v2] if type(v2) != list else v1+v2)

    t.first()

This also fails:

    try:
        t.first()
    except ValueError as e:
        pass
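For what it's worth, a broad catch on the driver does seem to keep the program from exiting; I assume the worker's ValueError comes back wrapped in a Py4JJavaError from the JVM side, which would explain why catching ValueError itself never matches. But the whole job is still aborted, which is not what I am after. A rough sketch:

    from py4j.protocol import Py4JJavaError

    try:
        t.first()
    except Py4JJavaError as e:
        # assumption: the worker-side ValueError surfaces here wrapped in the
        # JVM-side job-failure exception, so only a broader catch matches
        print("job failed: %s" % e)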


