Is it advisable to run antivirus software against the data residing on a MapR cluster?
Technically, scanning data residing on the MapR cluster over MapR's NFS mount-point will work. However, the performance hit will depend on the load on the cluster at the time of the scan, and this, of course, varies across use cases.
The scanning process has to walk the entire namespace, which puts load on disk I/O and CPU. Therefore, a slot when the load on the cluster is low will have to be identified to perform this task.
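As a rough illustration of gating the scan on a low-load slot, here is a minimal sketch. The function name `in_scan_window` and the 01:00 to 05:00 default are hypothetical; the real window depends on when your own cluster is quiet.

```python
from datetime import datetime, time


def in_scan_window(now, start=time(1, 0), end=time(5, 0)):
    """Return True if `now` falls inside the low-load window [start, end).

    The default window is only a placeholder assumption, not a recommendation.
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, e.g. 22:00 to 04:00.
    return t >= start or t < end


if __name__ == "__main__":
    if in_scan_window(datetime.now()):
        print("Inside the low-load window: OK to start the scan.")
    else:
        print("Outside the low-load window: skip the scan for now.")
```

A cron job or scheduler wrapper could call this check before kicking off the scan over the NFS mount.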
One could always apply exceptions to the content to be scanned, like -
It would also help to have the data primed (scanned) before loading it onto the MapR cluster itself. That way, it would be far more efficient to scan the end-points of usage/data access than to continually scan the central store.
Thanks to Leon Clayton, UV Saradhi, and Martijn Kieboom for their inputs.
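Scanning before loading could look roughly like the sketch below. It assumes a command-line scanner is installed on the ingest host; ClamAV's `clamscan` is used here only as an example (it exits with 0 for a clean file, 1 when a virus is found, and 2 on error). The function name `scan_then_ingest` and the paths in the usage note are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: ClamAV's `clamscan` CLI is available on this host.
# Any scanner that signals "clean" with exit code 0 could be swapped in.
DEFAULT_SCANNER = ("clamscan", "--no-summary")


def scan_then_ingest(src, dest_dir, scanner_cmd=DEFAULT_SCANNER):
    """Copy `src` into `dest_dir` only if the scanner reports it clean.

    Returns the destination path on success, or None if the file was
    rejected (reported infected, or the scanner itself failed).
    """
    src = Path(src)
    result = subprocess.run([*scanner_cmd, str(src)], capture_output=True)
    if result.returncode != 0:
        # Treat both "infected" and "scanner error" as reasons not to load.
        return None
    dest = Path(dest_dir) / src.name
    shutil.copy2(src, dest)
    return dest
```

For example, `scan_then_ingest("/landing/batch.csv", "/mapr/my.cluster.com/incoming")` would copy the file onto the NFS-mounted cluster path only after a clean scan; both paths are placeholders for your own landing zone and mount point.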
I agree with Mufeed Usman that the simplest thing to do would be to perform the anti-virus check when the data is being uploaded to the cluster, and there may be a way to do that without too much pain. However, let's look a little deeper at the underlying issue.
Why would you want to do this?
First, how did the virus get on the cluster and what threat does it pose?
Most likely this happens when you connect the cluster to a PC or Mac as a client instead of using an edge node, which would more than likely be a Linux box or VM. In most enterprise use cases this wouldn't happen. Typically you have a landing zone for inbound data, and it wouldn't be binary files but data files. Here, you have the ability to scan files as needed as part of the ingestion process. The one use case where it could happen is if you're using MapRFS as enterprise storage, letting PCs connect via NFS/FUSE/Samba/etc. to the cluster as an additional drive. Here's where you have the most risk.
IMHO, this is a fringe use case and not really recommended.
The second question... which anti-virus software will you use? I mean, if you're on the cluster in a map/reduce scenario checking file by file, what anti-virus software runs on Linux, or what libraries are available for your m/r program to call?
Also, when your users upload files, they would go to their personal directories. So, how would they infect other machines?
Which brings up a different issue... how does the virus get uploaded to the cluster, and why doesn't the PC's antivirus software run a scan on the mounted drive itself? To Mufeed's point, there would be an increase in I/O, probably saturating the NFS server, so you would see a spike if all connected PCs were to try to run their scans at the same time. Which brings us to a potential solution...
When a file is uploaded via NFS (and FUSE, I think) it's possible to catch this event. (There's a hook already present in NFS.) It's here that you could run an anti-virus program, however... which one will run on Linux?
So it's possible to stop this as the attack vector...
The larger question that should also be asked is about enterprise security: stopping the attack earlier, before it can get to the cluster. Or even whether this is the best use case for the cluster.
Just my $0.02