
Cause of flood of expandaudit "rpc error: Stale File handle" logs?

Question asked by reedv on Feb 15, 2018
Latest reply on Feb 22, 2018 by jbubier

I am running a script that runs expandaudit on a CSV list of MapR volumes. I am seeing many expandaudit error logs in the /opt/mapr/logs/ directory across nodes. All of the files look like this:

2018-02-15 10:10:37,1183 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10532 getAttr failed for fid:2754.66.263380, rpc error:Stale File handle
2018-02-15 10:10:37,1295 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10530 getAttr failed for fid:2754.135.263368, rpc error:Stale File handle
2018-02-15 10:10:37,1355 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10530 getAttr failed for fid:2754.136.263370, rpc error:Stale File handle
2018-02-15 10:10:37,1385 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10532 getAttr failed for fid:2754.67.263382, rpc error:Stale File handle
2018-02-15 10:10:37,1409 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10530 getAttr failed for fid:2754.66.263372, rpc error:Stale File handle
2018-02-15 10:10:37,1424 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10530 getAttr failed for fid:2754.67.263374, rpc error:Stale File handle
2018-02-15 10:10:37,1714 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3916 Thread: 10530 getAttr failed for fid:2754.69.263392, rpc error:Stale File handle
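
To get a sense of how widespread the errors are, the log lines can be tallied by fid and by thread. The following is only a minimal sketch: the regular expression is inferred from the sample lines above (not from any documented log format), and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this post only (an assumption,
# not a documented MapR log format).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ ERROR JniCommon \S+ "
    r"Thread: (?P<thread>\d+) getAttr failed for fid:(?P<fid>\d+\.\d+\.\d+), "
    r"rpc error:(?P<err>.+)$"
)


def summarize(lines):
    """Count getAttr failures per fid, per first fid component, and per thread."""
    fids = Counter()
    prefixes = Counter()  # first dotted component of each fid, e.g. "2754"
    threads = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a getAttr failure line
        fid = match.group("fid")
        fids[fid] += 1
        prefixes[fid.split(".")[0]] += 1
        threads[match.group("thread")] += 1
    return fids, prefixes, threads
```

Passing an open log file to `summarize` (any iterable of lines works) shows whether the failures cluster under one fid prefix, as they do in the sample above where every fid starts with 2754, or are spread more widely.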

Looking into one of the suspect FIDs

[mapr@mapr001 ~]$ maprcli fid dump -fid 2754.66.263380 -json
{
        "timestamp":1518725705156,
        "timeofday":"2018-02-15 10:15:05.156 GMT-1000 AM",
        "status":"ERROR",
        "errors":[
                {
                        "id":22,
                        "desc":"Unable to communicate with 172.18.4.100:7222. Retry after obtaining a new ticket using maprlogin"
                }
        ]
}

[mapr@mapr001 ~]$ maprcli fid dump -fid 2754.66.263380 -json
{
        "timestamp":1518725732866,
        "timeofday":"2018-02-15 10:15:32.866 GMT-1000 AM",
        "status":"ERROR",
        "errors":[
                {
                        "id":116,
                        "desc":"GetAttr failed, Error : Stale NFS file handle"
                }
        ]
}

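did not give me anything more to go on: the first attempt returned a communication/ticket error (id 22), and the second returned the same stale-handle error (id 116).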

It may be worth noting that the scripts that run the expandaudit process live in a directory of the MapR cluster mounted via NFS, and that the script itself is run periodically by cron on one particular node, generating these errors on each run.

From looking at the expanded audits through Drill, the audit logs appear to be expanded correctly, so I am curious what is going on here. Does anyone know what could be causing this and how to make it stop? Thanks.
