Cannot delete directory after saving Dataframe by partition

Question asked by sjgx on Nov 13, 2017
Latest reply on Nov 21, 2017 by maprcommunity

I ran a simple PySpark script that saves a DataFrame partitioned by "user_id", which has over 6 million distinct values:


> df.write.partitionBy('_user_id').mode('overwrite').format("csv").save(output_file)
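For context, `partitionBy` writes one Hive-style subdirectory per distinct value of the partition column (`_user_id=<value>`), so roughly 6 million IDs should mean roughly 6 million directories under the output path. Below is a minimal pure-Python sketch of that layout. It does not call Spark; the helper `partition_dirs` and the sample rows are hypothetical and only illustrate how the directory count follows the number of distinct IDs.

```python
from collections import Counter


def partition_dirs(rows, column, output_path):
    """Map each Hive-style partition directory to the number of rows it would hold."""
    counts = Counter()
    for row in rows:
        # Spark names each partition directory "<column>=<value>".
        counts[f"{output_path}/{column}={row[column]}"] += 1
    return counts


# Hypothetical sample rows; the real DataFrame has about 6 million distinct IDs.
rows = [
    {"_user_id": 1, "value": "a"},
    {"_user_id": 2, "value": "b"},
    {"_user_id": 1, "value": "c"},
]

dirs = partition_dirs(rows, "_user_id", "/data/output_of_my_script.csv")
for path, n in sorted(dirs.items()):
    print(path, n)
```

The number of keys in `dirs` equals the number of distinct `_user_id` values, which is why the output directory itself becomes very large at this scale.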


When I run this, all stages complete, but at the very end the job hangs (for days) with state = "RUNNING", FinalStatus = "UNDEFINED" and FinishTime = "N/A". If I kill the script, I can see some of the output:

drwxr-xr-x   - username username      2294304 2017-11-13 09:01 /data/output_of_my_script.csv

Any action on this directory hangs (hadoop fs -rm -r or hadoop fs -ls). When I run this over a small amount of data (a CSV with fewer than 100 lines), I can eventually get the ls or rm commands to work, but this takes at least 24 hours.

I tried restarting YARN, but I get an error saying there is no space left on the device:

> FATAL resourcemanager.ResourceManager: Error starting ResourceManager org.apache.hadoop.service.ServiceStateException: Error: No space left on device(28), file: system, user name

Not sure what I did wrong or how to fix this.

Here is what I am using:

OS: Ubuntu 14.04

MapR: 5.2.0 Community Edition

Hadoop: Hadoop 2.7.0-mapr-1607

Spark: 2.0.1-mapr-1703


java version "1.7.0_121"
OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.3)
OpenJDK 64-Bit Server VM (build 24.121-b00, mixed mode)