Hi Ted Dunning,
Why does FileServer take up so much memory? I have seen it take up to 80% of a node's memory with the default MFS settings in warden.conf. Does it also act as a memstore (cache) for data on disk, including CLDB containers and temporary shuffle data?
The amount of memory that the file server process takes is controlled by the settings in the config files. Normally the setting is 25% of memory on a file-system-only configuration and 35% on converged platforms that have streams and tables enabled. These settings can also be changed as needed; some table- or stream-heavy application mixes benefit from larger memory allocations. Your observation is much higher than expected for a vanilla setup, so it would be good to hear more about which configurations you have changed and what kind of machine you are running on.
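For reference, the MFS allocation is typically governed by the heapsize entries in warden.conf. The exact key names and defaults can vary by MapR release, so treat this as an illustrative excerpt rather than a canonical config:

```
# /opt/mapr/conf/warden.conf (illustrative excerpt; key names and
# default values may differ between releases)
service.command.mfs.heapsize.percent=35     # share of physical RAM given to MFS
service.command.mfs.heapsize.maxpercent=85  # upper bound warden will allow
service.command.mfs.heapsize.min=512        # floor, in MB
```

If you see MFS holding far more than the percent setting implies, it is worth diffing your warden.conf against a stock one before digging deeper.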
Regarding your question about the file system being like a cache, yes, the MFS process (the thing that actually does the data platform work) does act as a very sophisticated kind of cache. There are many priority levels of caching for different types of data and the mix of space devoted to each can be adjusted, but you should almost never do so. For example, the interior nodes of the b-trees that make up the internal structure of the file system are cached very aggressively even without much recent usage. This means that almost all data accesses require no disk I/O for metadata and the system can go directly to the desired disk blocks. Other cache levels include things like key partitions for tables and streams, directory contents, table keys, and parts of table values. File blocks are one of the lowest-priority caching levels. This makes a lot of sense because the other levels are typically quite small in comparison and because file access patterns involve much less re-reading than, say, table access.
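To make the idea of priority-tiered caching concrete, here is a minimal sketch (plain Python, not MapR code) of a cache where each entry belongs to a priority tier and the lowest tier is evicted first. The tier names and eviction policy are purely illustrative assumptions:

```python
from collections import OrderedDict

# Illustrative priority tiers, highest priority first.
# These are NOT MapR's actual cache levels, just stand-ins.
TIERS = ["btree_interior", "table_keys", "dir_contents", "file_blocks"]

class TieredCache:
    """Evicts from the lowest-priority tier first, LRU within a tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tiers = {t: OrderedDict() for t in TIERS}

    def _size(self):
        return sum(len(d) for d in self.tiers.values())

    def put(self, tier, key, value):
        self.tiers[tier][key] = value
        self.tiers[tier].move_to_end(key)
        while self._size() > self.capacity:
            # Walk tiers from lowest priority up, evicting the least
            # recently used entry in the first non-empty tier found.
            for t in reversed(TIERS):
                if self.tiers[t]:
                    self.tiers[t].popitem(last=False)
                    break

    def get(self, tier, key):
        d = self.tiers[tier]
        if key in d:
            d.move_to_end(key)  # refresh recency on a hit
            return d[key]
        return None
```

In a scheme like this, file blocks get pushed out long before b-tree interior nodes do, which matches the behavior described above: metadata stays hot even when bulk file data churns through.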
The behavior of the cache is also partially dependent on hints from the MapR client software. Some of these hints are as simple as adaptive read-ahead. Other hinting is much more subtle, as in the zero-copy tablet splitting that can happen with common table update patterns. Regardless, having a global view of the entire problem, from client API to bits on the disk, lets MFS use all the available information to do a better job of caching. Most application-specific caches really can't make these trade-offs due to visibility limits between layers of the system.
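Adaptive read-ahead is the easiest of those hints to picture. A client-side sketch (again illustrative Python, not the actual MapR client) might grow the prefetch window while access stays sequential and collapse it on a random seek:

```python
class ReadAheadHint:
    """Grows the prefetch window on sequential reads, resets on seeks.
    The constants are illustrative, not taken from the MapR client."""

    MIN_WINDOW = 64 * 1024        # 64 KiB initial prefetch
    MAX_WINDOW = 8 * 1024 * 1024  # 8 MiB cap

    def __init__(self):
        self.window = self.MIN_WINDOW
        self.next_expected = 0

    def advise(self, offset, length):
        if offset == self.next_expected:
            # Sequential access: double the window, up to the cap.
            self.window = min(self.window * 2, self.MAX_WINDOW)
        else:
            # Random access: drop back to the minimum.
            self.window = self.MIN_WINDOW
        self.next_expected = offset + length
        return self.window  # bytes the server could usefully prefetch
```

The point of pushing this hint down to the server is exactly the global-view argument above: the cache can prefetch and retain what the client is actually about to read, rather than guessing from disk-level patterns alone.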
Thanks, Ted, for the detailed explanation of the MapR filesystem!
Regarding our cluster setup, we do have the 35% allocation for the MFS heap size. What happened was that we tried to remove MR1 after migrating all our applications to MR2. We did this by setting the MR1 heap and CPU allocations to 0. However, we forgot to remove the JobTracker and TaskTracker roles from the node, we didn't remove those packages either, and then we restarted warden.

I think warden miscalculated the memory allocations. I'm not sure exactly what it did (perhaps you can answer that better), but it looked like it gave those MR1 resources to MR2; I could confirm that from the total cores and memory shown in the ResourceManager UI. But at the same time it also gave much more memory (80%) to itself, i.e. MFS! And I think it still kept resources reserved for MR1 as well, since warden.log showed lines about MR1 and the MFS memory setting after adjustment. In any case, the total of the MFS and NodeManager memory at that point was more than the total system memory, so our running MR2 jobs caused an OOM kernel panic that crashed the node. We opened a case today and were able to resolve it with help from support.
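For anyone who hits something similar, one quick sanity check is to confirm that the per-service percentage allocations in warden.conf don't add up to more than physical RAM. Here is a rough sketch of such a check; the key-name pattern is based on typical warden.conf files and this is a hypothetical helper, not a supported MapR tool (it also ignores services whose memory is configured elsewhere, such as NodeManager containers in yarn-site.xml):

```python
import re

# Match lines like: service.command.mfs.heapsize.percent=35
PERCENT_KEY = re.compile(r"^service\.command\.(\w+)\.heapsize\.percent=(\d+)")

def check_allocations(warden_conf_path):
    """Sum heapsize.percent entries and warn if they exceed 100% of RAM."""
    total = 0
    with open(warden_conf_path) as f:
        for line in f:
            m = PERCENT_KEY.match(line.strip())
            if m:
                service, pct = m.group(1), int(m.group(2))
                print(f"{service}: {pct}%")
                total += pct
    if total > 100:
        print(f"WARNING: allocations sum to {total}% of RAM -- OOM risk")
    else:
        print(f"Total allocated: {total}%")

check_allocations("/opt/mapr/conf/warden.conf")
```

A check like this would have flagged our node immediately, since the leftover MR1 reservations plus the inflated MFS share pushed the total past 100%.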
Yowza. That was an interesting problem.
Glad to hear the support folk fixed it.