The cluster initially ran on EMR 2 (MapR 3.1.0) and was upgraded to EMR 3 (MapR 4.0.2); the application was modified to support Hadoop 2.
- 1× r3.4xlarge master node
- 10× m3.2xlarge "core" nodes
- 6× m3.2xlarge "task" nodes
The task nodes are spot instances, while the other nodes are reserved instances.
The workload is composed of ETL-style MapReduce jobs written in Cascading, compiled against Cascading's Hadoop2 adaptor so that they run on Hadoop 2. These jobs run for a few hundred different clients and vary in data-set size and workload complexity.
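For context, building against Cascading's Hadoop2 adaptor typically means depending on the `cascading-hadoop2-mr1` artifact instead of the Hadoop 1 `cascading-hadoop` one. The version below is an assumption, not the exact one in use:

```xml
<!-- Hypothetical Maven dependency; the project's actual Cascading
     version is not stated in this report. -->
<dependency>
  <groupId>cascading</groupId>
  <artifactId>cascading-hadoop2-mr1</artifactId>
  <version>2.6.3</version>
</dependency>
```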
After running a large number of applications successfully, the cluster stops doing work. When this happens, the ResourceManager logs fill with resource requests that cannot be fulfilled. The suspicion is that the running jobs need more resources to complete, but the cluster is unable to allocate them.
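One way to confirm that requests are pending while capacity is free is to query the ResourceManager's REST API at `/ws/v1/cluster/metrics` (the hostname and the numbers below are illustrative, not taken from this cluster):

```python
import json

# Illustrative payload of the kind returned by
# http://<resourcemanager>:8088/ws/v1/cluster/metrics
# (hostname and values are assumptions for the sketch).
sample = """{"clusterMetrics": {
  "appsRunning": 12, "appsPending": 7,
  "availableMB": 98304, "allocatedMB": 24576,
  "containersAllocated": 12, "containersPending": 340}}"""

m = json.loads(sample)["clusterMetrics"]

# If memory is still available but containers remain pending, the
# scheduler rather than the hardware is the bottleneck -- consistent
# with the "locked" symptom described above.
print("available MB:      ", m["availableMB"])
print("pending containers:", m["containersPending"])
```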
On the MapR Control UI of these "locked" clusters, none of the cluster-wide resources (memory, CPU, disks) are saturated, so there is no sign of contention.
The Capacity Scheduler is in use.
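One scheduler-level setting worth checking in this situation (a hypothesis, not a confirmed diagnosis) is the Capacity Scheduler's cap on ApplicationMaster resources: when many jobs run concurrently, the AMs alone can consume the entire allowed share, leaving every job waiting for task containers that can never be scheduled, which matches the "idle but stuck" symptom. The property name is the stock YARN setting; the value shown is illustrative:

```xml
<!-- capacity-scheduler.xml: the fraction of cluster resources that may be
     used by ApplicationMasters. The default is 0.1; raising it (the 0.5
     below is illustrative) can unstick clusters where concurrent AMs
     starve out task containers. -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
```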