Assumptions: Spark jobs would also run on the same nodes, and each data node has 24 TB of storage.
Hi George M,
We recommend checking out Memory and Disk Space, and let us know if you have further questions.
Hi, the link above details the minimum RAM requirements for each node, but I wanted to understand the relationship between RAM and Disk storage. If I had a node with 24TB of storage, what would be the ideal amount of RAM taking into account the need to have Spark jobs on the same nodes?
I am inviting Vinayak Meghraj, our Spark expert, to join this discussion and share his Spark knowledge. Thank you Vinayak.
Spark Troubleshooting guide: Tuning Spark: Estimating Memory and CPU utilization for Spark jobs
Hi George M,
Please let us know if the support article helps you to resolve the issue.
Hi, I'm still not clear on the ratio. I have, however, decided to use 512 GB of RAM for a data node with 24 TB of data. I hope this will suffice without being over the top?
Spark does not need to load everything into memory to be able to process it, because Spark partitions the data into smaller blocks and operates on these separately. The number of partitions, and thus their size, depends on several things:
Where the file is stored. The storage options most commonly used with Spark already store the file as a set of blocks rather than as a single big piece of data. If it is stored in a distributed filesystem, for instance, these blocks are 256 MB by default, and the blocks are distributed (and replicated) across your nodes.
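As a rough sketch of what that block layout means for partition counts, the arithmetic below assumes one Spark partition per storage block and the 256 MB default block size mentioned above (the helper name and exact mapping are illustrative, not a Spark API):

```python
import math

# One partition per fixed-size storage block (256 MB default, per the
# description above). This is a back-of-the-envelope estimate only.
def estimated_partitions(file_size_bytes, block_size_bytes=256 * 1024 * 1024):
    return math.ceil(file_size_bytes / block_size_bytes)

# A 24 TB dataset split into 256 MB blocks:
tb = 1024 ** 4
print(estimated_partitions(24 * tb))  # -> 98304 partitions
```

Each of those partitions is processed independently, which is why total RAM does not need to match total disk: executors only hold the partitions they are currently working on.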
To control how much memory each executor can use: in YARN terminology, executors and application masters run inside "containers". Memory requests higher than the configured maximum won't take effect and will be capped to that value. The following properties control these limits:
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>20000</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>5</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>200000</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>5</value>
</property>

Best Practices for YARN Resource Management - https://mapr.com/blog/best-practices-yarn-resource-management/
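To see how these limits interact, here is a small sketch of the container-fitting arithmetic, using the example values from the properties above (the function name and the per-container request of 1 vcore are illustrative assumptions, not a YARN API):

```python
# A container must fit within both the NodeManager's memory budget and
# its vcore budget; the smaller of the two determines how many fit.
def max_containers(node_mem_mb, node_vcores, container_mem_mb, container_vcores):
    return min(node_mem_mb // container_mem_mb, node_vcores // container_vcores)

# Node offers 200000 MB and 5 vcores (per the config above); each
# executor container requests the 20000 MB maximum and 1 vcore.
print(max_containers(200000, 5, 20000, 1))  # -> 5
```

Note that although 10 containers would fit by memory alone (200000 / 20000), the 5 available vcores cap the node at 5 single-vcore containers, so CPU is the limiting resource here.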