If I want 10TB of usable space on a 5-node cluster, what specifications would I need? How much RAM on each node, how should the space be spread across the nodes, and what would the general specifications be?
It is very subjective and varies case by case. You will need to decide on the topology, services, workload, users, and other factors before you can jump into accurate sizing. I'll share some of my understanding below; hope it points you in the right direction.
Storage:
A general rule of thumb for MapR-FS storage sizing, with the assumptions below:
1. Default replication factor of 3: 10TB * 3 = 30TB.
2. Working/temp space of 25% to 30%: ~9TB.
3. CLDB containers use the same storage allocation as in item 1.
4. No use of mirrors/snapshots; otherwise you need to cater for their frequency and data volume changes.
5. Spread the storage equally across the 5 nodes. Best practice is to use the same size and quantity of disks on each node unless you have a specific storage volume topology to implement.
6. JBOD will do. No RAID or LVM except for OS/non-MapR disks.
7. You should be looking at roughly 8TB of storage per node at least ((30TB + 9TB) / 5 nodes). Go for more spindles.
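The arithmetic above can be sketched as a quick calculation (a rough sketch only; the 3x replication factor and 30% working-space figure are the assumptions stated above, not fixed requirements):

```python
def per_node_storage_tb(usable_tb, nodes, replication=3, temp_fraction=0.30):
    """Rough MapR-FS sizing: replicate the usable data, add working/temp
    space, then spread the total evenly across the nodes."""
    replicated = usable_tb * replication   # 10TB * 3 = 30TB
    temp = replicated * temp_fraction      # ~9TB of working/temp space
    return (replicated + temp) / nodes

# 10TB usable on a 5-node cluster -> 7.8TB per node, so plan for ~8TB+
print(round(per_node_storage_tb(10, 5), 1))
```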
Memory:
1. Depends on the services you will run in the cluster. If you have Drill or Spark with heavy querying and processing, 128GB/256GB per node is common.
2. Take note of multi-tenant requirements, if any.
3. 5 nodes is a small cluster, so you will probably have combined control-and-data nodes instead of separating them. You need to cater for these control services.
CPU:
1. Depends on the services you will run in the cluster. If you have Drill or Spark with heavy querying and processing, 12 cores and above is not uncommon.
2. 5 nodes is a small cluster, so you will probably have combined control-and-data nodes; you need to cater for these control services.
Network:
1. At least dual bonded 1Gbps. 10Gbps is common as well, especially for jobs with a lot of shuffling.
2. Dedicated. Don't mix with your corporate networks.
Eddy Lee gives you a good start; however, there's a bit more to building out a cluster, even a dev cluster.
First, we don't know you, your company, your budget, and your hardware vendor. All of those things are important.
We also don't know if you're planning on running community edition, or if you're buying a support license.
(This is also important because there are limitations that will impact cluster design decisions.)
We also don't know if you're trying to build this out as a developer's cluster or if you want to put this into production.
We don't know which tools you plan on using within core MapR, or additional tools.
So you need to take what is said with a grain of salt.
First, 5 nodes is too small for a production cluster.
(Too many details to type out here. ;-)
So, let's assume that this is a community edition, development cluster. Let's also assume that you have another machine which you will use as your client.
You will want to set up one machine for your CLDB and then four machines as your Data Nodes.
Community Edition doesn't have HA features enabled, so you might as well segment this to a single node. Having said that... you now have a SPOF (Single Point of Failure). You could run the CLDB side by side with your data volumes; however, a cluster this size is a throwaway, and you can mitigate some of the points of failure by using SSDs in a RAID 10 setup. (Yes, I know it's counterintuitive ;-) If you don't want to do that... you can run one disk in the CLDB volume and the other disks in your data volume(s).
In terms of disk space... you said 10TB. Assuming that's raw, you're going to want to have 6X the storage as a starting point. Here's why...
First, that 10TB is raw data. You will also have metadata surrounding the data and you didn't say which tools you're going to use. There's overhead in managing the data, depending on tool, schema, etc... and you will more than likely want to have copies of subsets and/or reorganization of the data so there will be some duplication as you transform the data.
Second, you will want to have some free space so as your job runs, you can store intermediate products and as you run your job, your output also has to be stored. You would be amazed at how fast disk space gets utilized.
So if you do the math... <75% utilization, plus extra space for metadata and copies, you end up with 6X over your raw data.
(Again, YMMV; it's just a simple rule of thumb.) 10TB means you will want 60TB over four nodes, or 15TB per node for data storage. (Four 4TB disks is 16TB, so that would cover what you need.) Here again, you need to talk to your vendor. They may give you a better price point for 2.5" 10K drives depending on your server, so you need to work with them to figure out your options.
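The 6X rule of thumb works out as follows (a sketch under the figures in this answer: a 6X multiplier over raw data and 4TB disks; both are rules of thumb, not hard requirements):

```python
import math

def disks_per_node(raw_tb, data_nodes, multiplier=6, disk_tb=4):
    """6X over raw data (replication + metadata + copies + <75% utilization),
    spread over the data nodes, rounded up to whole disks."""
    total_tb = raw_tb * multiplier          # 10TB raw -> 60TB
    per_node_tb = total_tb / data_nodes     # 60TB / 4 nodes = 15TB per node
    return math.ceil(per_node_tb / disk_tb) # 15TB / 4TB disks -> 4 disks (16TB)

print(disks_per_node(10, 4))
```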
You have two options... you can build out all of the machines the same, or you can specialize in the build.
If you're running the CLDB on a single node, you will want to RAID your drives using RAID 10. (Hardware or software RAID.)
You may want to use a good brand of SSDs. (I prefer SSDs to spinning disk even though both have the same warranty period: faster, and less heat/energy. YMMV.)
The other issue is number of cores and amount of memory. Again, your vendor will have different prices based on number of cores, TDP, and processing speed. Since we don't know your use cases and/or budget... pick one.
For each physical core, to run a bare-minimum Hadoop cluster, plan on 4GB per core. This doesn't give you a lot of headroom and can limit you if you want to use tools like YARN. 8 cores per CPU, 16 cores (dual socket), means 64GB. 8GB per core is better, so you would need 128GB. Since we don't know the number of cores, the total amount of memory will vary. As Eddy points out, 128 or 256GB is common. Again, talk with your vendor because they may have a better deal for you. Don't go below 4GB per physical core.
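The memory math above can be sketched like this (4GB per physical core as the floor, 8GB per core as the more comfortable target, per this answer):

```python
def node_memory_gb(sockets=2, cores_per_cpu=8, gb_per_core=4):
    """Memory sizing by physical core count; don't go below 4GB/core."""
    cores = sockets * cores_per_cpu
    return cores * gb_per_core

print(node_memory_gb())                # 16 cores * 4GB = 64GB minimum
print(node_memory_gb(gb_per_core=8))   # 16 cores * 8GB = 128GB, better
```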
You don't need redundant power supplies. Again the expectation is that this is a dev cluster and if your system fails, you can survive some down time.
Networking... Eddy is right. 10GbE is common in the DC. Just make sure your network card's ports match your switch: if you need RJ45, get RJ45, and if you need SFP+, get SFP+ (they're different sockets).
Note: you don't want to do OS-level bonding. Set up both ports independently and let MapR use both ports.
(With OS bonding, 1+1 ≈ 1.5; with MapR driving both ports, 1+1 = 2.)
There's more, but it should point you in the right direction.
Thanks for your information. The NIC point is a good reminder that MapR can load balance multiple NICs transparently. Cheers!
Thanks for the explanation. Sorry for not giving all the details. I was asked to make an architectural diagram but was not sure what it would look like. I agree that 5 nodes is small, but I only needed to show them a sample diagram, which I myself am not sure how to do. Thanks for this information; it helps me a lot.
Appreciate you sharing your thoughts and progress with us. Don't forget to endorse Eddy's and Michael's top skills under their profiles.