leonclayton

Track Space Consumption in a Cluster and Identify Wasted Space

Blog Post created by leonclayton Employee on Oct 19, 2016

Using DUC to Track Space Consumption

The duc program (http://duc.zevv.nl) lets you write a simple script that analyzes targeted areas of the MapR file system on demand. The program is written in C and, in the following example, reads the file system through the NFS mount point. It creates a hidden .duc.db file in the target location, which can then be browsed via command-line tools or a web page.

 

Caveats

  1. Running this on the entire cluster in one go is not recommended. Targeted use is preferable, as in the example below.
  2. Additional load is generated only on the first run and on subsequent index updates.
  3. An NFS connection to the MapR cluster is required.
  4. Additional cluster space will be used for the .duc.db files. These can easily be created on a file system outside the cluster if so desired.
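
As a minimal sketch of caveat 4, the helper below derives a database path outside the cluster from the path being indexed. The directory /var/tmp/duc, the DUC_DB_DIR variable, and the function name db_path_for are assumptions made for illustration; they are not part of duc.

```shell
#!/bin/bash
# Hypothetical helper: map an indexed path to a database file kept
# outside the cluster. DUC_DB_DIR is an assumed local directory.
DUC_DB_DIR="${DUC_DB_DIR:-/var/tmp/duc}"

db_path_for() {
    local target="${1%/}"     # drop one trailing slash, if present
    local name="${target#/}"  # drop the leading slash
    # Replace the remaining slashes so the result is a single file name.
    name=$(printf '%s' "$name" | tr '/' '_')
    printf '%s/%s.db\n' "$DUC_DB_DIR" "$name"
}

# Example (not run here):
#   mkdir -p "$DUC_DB_DIR"
#   duc index -d "$(db_path_for /mapr/demo.mapr.com/user/mapr)" /mapr/demo.mapr.com/user/mapr
```

The same -d path then has to be passed to whichever duc command is used to browse the index.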

 

The following shows how to compile duc, using the MapR 5.2 sandbox as an example. It was tested on “MapR-Sandbox-For-Hadoop-5.2.0-vmware.ova” running on VMware Fusion and assumes you have internet access.

 

# cat /etc/centos-release

CentOS release 6.7 (Final)

 

Install the build dependencies, then download, build, and install duc:

# yum groupinstall "Development Tools"

# yum install ncurses-devel pango-devel cairo-devel tokyocabinet-devel

# cd

# wget https://github.com/zevv/duc/archive/1.4.1.tar.gz

# tar -zxvf 1.4.1.tar.gz

# cd duc-1.4.1/

# autoreconf -i

# ./configure

# make

# make install

# which duc

/usr/local/bin/duc

 

Then test

# duc index /mapr/demo.mapr.com/user/mapr

# duc ui /mapr/demo.mapr.com/user/mapr

 

Let's hook it up to a web server and place the index in the same location as the data.

# cat ducduc.sh

#!/bin/bash

# Index the directory given as $1 and store the database inside it.
/usr/local/bin/duc index -p "$1" -d "$1/.duc.db"

 

[root@maprdemo ~]# ./ducduc.sh /mapr/demo.mapr.com/user/mapr/

497.5Mb in 56 files and 74 directories
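
Since targeted runs are preferable to indexing the whole cluster, the script above could be extended to loop over a handful of directories. The following is only a sketch under that assumption: the function name index_targets and the overridable DUC variable are illustrative and not part of duc itself.

```shell
#!/bin/bash
# Sketch: index several targeted directories, one .duc.db per directory.
# DUC can be overridden (for example DUC=echo) to preview the commands.
DUC="${DUC:-/usr/local/bin/duc}"

index_targets() {
    local target
    for target in "$@"; do
        target="${target%/}"   # normalize a trailing slash
        # Stop at the first failure so a broken mount is noticed early.
        "$DUC" index -p -d "$target/.duc.db" "$target" || return 1
    done
}

# Example (not run here):
#   index_targets /mapr/demo.mapr.com/user/mapr
```

Running it with DUC=echo first prints the duc command lines without touching the cluster, which is a cheap way to check the target list.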

 

Call the following CGI script from a web server:

# cat /srv/www/htdocs/test.cgi

#!/bin/sh

/usr/local/bin/duc cgi -d /mapr/demo.mapr.com/user/mapr/.duc.db

 

This produces an interactive, click-through view of a single volume's space usage.

There is also a console UI, which is interactive and very fast. Start it with the following command:

 

# duc ui /mapr/demo.mapr.com/user/mapr/
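
Note that ducduc.sh writes its index to a .duc.db file inside the target directory, whereas duc commands read a default database unless told otherwise. As a minimal sketch, assuming the index was created by ducduc.sh, the helper below passes that database explicitly with -d and prints a non-interactive listing with duc ls. The function name list_usage and the DUC override variable are illustrative assumptions.

```shell
#!/bin/bash
# Sketch: list the usage recorded in a per-directory index.
# DUC can be overridden (for example DUC=echo) to preview the command.
DUC="${DUC:-/usr/local/bin/duc}"

list_usage() {
    local target="${1%/}"   # normalize a trailing slash
    "$DUC" ls -d "$target/.duc.db" "$target"
}

# Example (not run here):
#   list_usage /mapr/demo.mapr.com/user/mapr
# The console UI accepts the same -d option:
#   duc ui -d /mapr/demo.mapr.com/user/mapr/.duc.db /mapr/demo.mapr.com/user/mapr
```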


 

Using RMLINT to Identify Wasted Space

PLEASE REVIEW ANY OUTPUT OF RMLINT BEFORE YOU DELETE DATA. MAPR HAS ZERO RESPONSIBILITY FOR THIS TOOL OR DATA IT DELETES. IT'S ONLY AN INDICATOR. 

 

rmlint (see the user manual for rmlint 2.4.4 "Myopic Micrathene") finds space waste and other broken things on your filesystem and offers to remove them. It is able to find:

  1. Duplicate files & directories.
  2. Non-stripped binaries.
  3. Broken symlinks.
  4. Empty files.
  5. Recursive empty directories.
  6. Files with broken user or group id.
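
Because rmlint's findings should be reviewed before anything is deleted, it can help to cross-check a few of the simpler categories independently. The sketch below uses plain find rather than rmlint, so it is a swapped-in technique and not how rmlint works internally. The function name list_simple_lint and the output labels are made up for illustration, and the -empty test is assumed to be available (it is in GNU and BSD find).

```shell
#!/bin/bash
# Cross-check sketch using plain find (not rmlint itself).
# Prints broken symlinks, empty files, and empty directories under $1.
list_simple_lint() {
    local root="$1"
    # "test -e" follows the link, so it fails when the target is missing.
    find "$root" -type l ! -exec test -e {} \; -print | sed 's/^/broken-symlink /'
    find "$root" -type f -empty -print | sed 's/^/empty-file /'
    find "$root" -type d -empty -print | sed 's/^/empty-dir /'
}

# Example (not run here):
#   list_simple_lint /mapr/demo.mapr.com/user/mapr
```

This only lists candidates and deletes nothing; it does not cover duplicates, non-stripped binaries, or broken user and group ids.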

 

Install/Compile on the 5.2 Sandbox

 

We need to install GLib >= 2.32 first.

# yum groupinstall "Development Tools"

# yum install libffi-devel zlib-devel scons libblkid-devel elfutils-libelf-devel python3-sphinx gettext

# cd /root/

# mkdir glib-source

# cd glib-source/

# wget http://ftp.gnome.org/pub/GNOME/sources/glib/2.32/glib-2.32.4.tar.xz

# tar -xf glib-2.32.4.tar.xz

# cd glib-2.32.4/

# ./configure --prefix=/usr/local/glib-2.32

# make

# make install

We also need to set system-wide environment variables so that the upgraded GLib is found.

# echo export LD_LIBRARY_PATH=/usr/local/glib-2.32/lib/ >> /etc/environment

# echo export PKG_CONFIG_PATH=/usr/local/glib-2.32/lib/pkgconfig >> /etc/environment
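
Once the new environment is active, it is worth confirming that the build will actually pick up the upgraded GLib. The following is a minimal sketch: version_ge is a hypothetical helper name, and it relies on the -V (version sort) option of GNU sort being available.

```shell
#!/bin/bash
# Return success if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 when that is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (not run here): check the GLib that pkg-config resolves.
#   version_ge "$(pkg-config --modversion glib-2.0)" 2.32 && echo "GLib is new enough"
```

If the check fails after logging back in, PKG_CONFIG_PATH is probably not set in the current shell.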

Now log out and log back in.

 

We also need to upgrade gcc to a newer version

# wget http://people.centos.org/tru/devtools-1.1/devtools-1.1.repo -P /etc/yum.repos.d

# sh -c 'echo "enabled=1" >> /etc/yum.repos.d/devtools-1.1.repo'

# yum install devtoolset-1.1

# scl enable devtoolset-1.1 bash

# gcc --version

gcc (GCC) 4.7.2 20121015 (Red Hat 4.7.2-5)

 

Then we need to compile and install rmlint from within the devtoolset shell started above:
# git clone https://github.com/sahib/rmlint.git
# cd rmlint/
# scons config
# scons -j4
# scons -j4 --prefix=/usr install

 

A guide to using rmlint is available at http://rmlint.readthedocs.io/en/latest/tutorial.html

Example:

# rmlint -b  /mapr/demo.mapr.com/user/mapr/

 

# Empty dir(s):

   rmdir '/mapr/demo.mapr.com/user/mapr/tmp/hive/mapr/09034c42-0645-4554-bf23-fe141fe366be/_tmp_space.db'

   rmdir '/mapr/demo.mapr.com/user/mapr/tmp/hive/mapr/09034c42-0645-4554-bf23-fe141fe366be'

   rmdir '/mapr/demo.mapr.com/user/mapr/tmp/hive/mapr'

   rmdir '/mapr/demo.mapr.com/user/mapr/tmp/hive'

   rmdir '/mapr/demo.mapr.com/user/mapr/tmp'

 

==> Note: Please use the saved script below for removal, not the above output.

==> In total 5 files, whereof 0 are duplicates in 0 groups.

==> This equals 0 B of duplicates which could be removed.

==> 5 other suspicious item(s) found, which may vary in size.

 

Wrote a sh file to: /mapr/demo.mapr.com/user/mapr/rmlint.sh

Wrote a json file to: /mapr/demo.mapr.com/user/mapr/rmlint.json
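
Before running the generated rmlint.sh, the JSON report can be used to get a quick overview of what was found. The sketch below counts findings per lint type. It assumes each finding in rmlint.json carries a "type" field; the function name summarize_rmlint_json is hypothetical, and the exact type names depend on the rmlint version.

```shell
#!/bin/bash
# Sketch: summarize an rmlint JSON report by lint type.
# Assumes each finding has a "type" field; entries without one are ignored.
summarize_rmlint_json() {
    local report="$1"
    # Pull out every "type": "..." pair, keep only the value, then count.
    grep -o '"type"[[:space:]]*:[[:space:]]*"[^"]*"' "$report" \
        | sed 's/.*"\([^"]*\)"$/\1/' \
        | sort | uniq -c | sort -rn
}

# Example (not run here):
#   summarize_rmlint_json /mapr/demo.mapr.com/user/mapr/rmlint.json
```

The output is one line per type with its count, most frequent first. It is only an overview; the individual paths in rmlint.sh still need to be reviewed before anything is removed.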
