Huge Pig job causes the local /tmp directory to run out of disk space

Document created by Hao Zhu on Feb 18, 2016. Last modified by Hao Zhu on Feb 18, 2016.
Version 2

Author: Hao Zhu

Original Publication Date: April 22, 2015

Symptom:

A huge Pig job causes the local /tmp directory to run out of disk space.

Env:

Pig 0.13

Root cause:

Per PIG-1838, Pig keeps the jar files for each job until the Pig script finishes. This means that if a single Pig script contains many MapReduce jobs, Pig creates many jar files in the /tmp directory on the node where the Pig job is submitted. Pig cleans up these temp jars only after the whole script finishes.

Tests:

For example, the Pig job below keeps 2 jars in the /tmp directory until the whole job finishes, because it contains 2 MapReduce jobs.

a = load '/dir' using ParquetLoader(); 
b = order a by price ;
STORE b INTO '/output' USING parquet.pig.ParquetStorer;

The temp jars in /tmp during execution:

Job4571716915535666311.jar 
Job3312616966593773080.jar

If two copies of the above Pig job are put into one Pig script, Pig keeps 4 temp jars in /tmp:

Job7482213044249144977.jar
Job4615931692370853067.jar
Job182685348991417556.jar
Job4601767432482914524.jar
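
To watch how much space these jars take while a long script is running, a small helper along the following lines could be used. This is only a sketch: the class name TempJarUsage and the method totalJobJarBytes are hypothetical, and the Job*.jar pattern is taken from the file names listed above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Pig.
class TempJarUsage {

    // Sums the sizes of the files named Job*.jar directly under dir.
    static long totalJobJarBytes(Path dir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "Job*.jar")) {
            for (Path jar : jars) {
                total += Files.size(jar);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Default to the JVM temp directory when no directory is passed in.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        System.out.println(dir + ": " + totalJobJarBytes(dir) + " bytes in Job*.jar files");
    }
}
```

Running it against /tmp on the submitting node while the script executes should show the total growing as more MapReduce jobs are launched.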

Source code analysis:

The logic is in the Pig source code, in JobControlCompiler.java, which calls the createTempFile() function in java.io.File:

import java.io.File; 
File submitJarFile = File.createTempFile("Job", ".jar");
log.info("creating jar file "+submitJarFile.getName());

Per the Java source code (File.java), the directory location is controlled by the java.io.tmpdir system property:

File tmpdir = (directory != null) ? directory : TempDirectory.location();

  private TempDirectory() { }

  // temporary directory location
  private static final File tmpdir = new File(fs.normalize(AccessController
      .doPrivileged(new GetPropertyAction("java.io.tmpdir"))));

  static File location() {
    return tmpdir;
  }
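
The behavior described by the excerpts above can be checked outside of Pig with a minimal sketch like the one below. It is not Pig code: the class name TempJarLocation and the method createJobJar are hypothetical, and only the File.createTempFile("Job", ".jar") call mirrors the original.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical demo class, not part of Pig.
class TempJarLocation {

    // Creates a temp file with the same prefix and suffix as the Pig excerpt.
    // With no directory argument, the JDK places it under java.io.tmpdir.
    static File createJobJar() throws IOException {
        File jar = File.createTempFile("Job", ".jar");
        jar.deleteOnExit(); // clean up the demo file when the JVM exits
        return jar;
    }

    public static void main(String[] args) throws IOException {
        File jar = createJobJar();
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        System.out.println("temp jar       = " + jar.getAbsolutePath());
    }
}
```

Starting the JVM with a different -Djava.io.tmpdir value should change the printed path accordingly. If that directory does not exist or is not writable, createTempFile() throws an IOException.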

Solution:

To avoid the /tmp directory running out of disk space, the available solutions are:

1. Split a huge Pig script into smaller pieces and run each piece separately.

Or

2. Set java.io.tmpdir to a directory with enough disk space in HADOOP_OPTS or PIG_OPTS before submitting the Pig job. For example:

export PIG_OPTS="-Djava.io.tmpdir=/dir_with_enough_disk_space" 
pig test.pig
