
getTable threads blocked on Inode.closeAll in MapR

Question asked by itrs.eng on Jul 17, 2018

We are experiencing an issue that first appeared to be a deadlock in MapR but now looks more like a livelock. We have several threads attempting to get a table.

 

Each thread has this stack trace: 

 

"default-akka.actor.default-dispatcher-2" #22 prio=5 os_prio=0 tid=0x00007f62f6fd70d0 nid=0x7e38 waiting for monitor entry [0x00007f62ac94e000] java.lang.Thread.State: BLOCKED (on object monitor) at com.mapr.fs.MapRFileSystem.initConfig(MapRFileSystem.java:592) - waiting to lock <0x000000008026f610> (a java.lang.Integer) at com.mapr.fs.MapRFileSystem.initialize(MapRFileSystem.java:345) at com.mapr.fs.MapRFileSystem.initialize(MapRFileSystem.java:335) at com.mapr.fs.MapRHTable.init(MapRHTable.java:98) - locked <0x0000000080acb6e8> (a com.mapr.fs.MapRHTable) at com.mapr.fs.hbase.HTableImpl.(HTableImpl.java:94) at com.mapr.fs.hbase.HTableImpl11.(HTableImpl11.java:57) at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hbase.client.mapr.GenericHFactory.getImplementorInstance(GenericHFactory.java:37) at org.apache.hadoop.hbase.client.HTable.createMapRTable(HTable.java:556) at org.apache.hadoop.hbase.client.HTable$2.run(HTable.java:519) at org.apache.hadoop.hbase.client.HTable$2.run(HTable.java:516) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1611) at org.apache.hadoop.hbase.client.HTable.initIfMapRTableImpl(HTable.java:515) at org.apache.hadoop.hbase.client.HTable.initIfMapRTable(HTable.java:473) at org.apache.hadoop.hbase.client.HTable.(HTable.java:230) at com.mapr.fs.hbase.MapRClusterConnectionImpl.getTable(MapRClusterConnectionImpl.java:174) at com.mapr.fs.hbase.MapRClusterConnectionImpl.getTable(MapRClusterConnectionImpl.java:53)

 

The thread that has the lock on this Integer has the following stack:

 

"default-akka.actor.default-dispatcher-3" #23 prio=5 os_prio=0 tid=0x00007f62f6feee30 nid=0x7e39 runnable [0x00007f62ac84e000]
java.lang.Thread.State: RUNNABLE
at com.mapr.fs.Inode.closeAll(Inode.java:1158)
at com.mapr.fs.BackgroundWork.close(BackgroundWork.java:99)
at com.mapr.fs.MapRFileSystem.close(MapRFileSystem.java:1612)
- locked <0x000000008026f610> (a java.lang.Integer)

 

All table access via getTable goes through a "using"-style construct, so we know we are closing tables. The file system access involved in the blocking is as follows:

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Acquire a FileSystem for the given configuration, run the thunk, then close it.
def withFileSystem[T](conf: Configuration = EmptyConfiguration)(thunk: FileSystem => T): T = {
  using(FileSystem.get(conf)) { fileSystem =>
    thunk(fileSystem)
  }
}

So we are closing the fs (i.e. using closes the fileSystem object once the block completes).
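For context, using here is our own loan-pattern helper rather than anything MapR ships. A minimal sketch of what it does, assuming the resource is AutoCloseable (our real helper may differ in detail), would be:

// Minimal sketch of a loan-pattern "using" helper (assumption: the resource is
// java.lang.AutoCloseable, which FileSystem satisfies via Closeable).
def using[R <: AutoCloseable, T](resource: R)(body: R => T): T = {
  try {
    body(resource)
  } finally {
    resource.close()   // always close, even if body throws
  }
}

So every withFileSystem call ends by invoking close() on the FileSystem returned by FileSystem.get(conf), which matches the MapRFileSystem.close() frame in the lock-holder's stack above.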

 

Quickly checking the MapR code in the IDE (which just decompiles it crudely), the lock appears to be on numInstances inside MapRFileSystem.

 

synchronized (numInstances) {
  Integer var2 = numInstances;
  numInstances = numInstances - 1;
  if (numInstances == 0) {
    BackgroundWork.close();
  }
}

 

BackgroundWork.close() calls Inode.closeAll(), and all of this runs while the monitor on numInstances is still held, which is why every thread sitting in MapRFileSystem.initConfig stays blocked until closeAll() returns.

 

Now, we have observed in the past that Inode.closeAll() can appear to hang, but in those cases it seemed to be related to a code version mismatch (we specifically saw this in 6.0.1). From what I can see we have no such mismatch this time around (although the line numbers in the stack don't match the decompiled code line numbers).

 

The code inside Inode.closeAll() is suspicious. It appears that Inode.List.first may not be thread-safe, or may be racy, because the only way for Inode.closeAll() to never complete is for Inode.List.first to keep returning a non-null item. All the while it holds the lock on numInstances in what amounts to an infinite busy spin. We end up with a CPU pegged at 100%, and the only way to stop it is to restart the application.
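To make the suspicion concrete, the failure mode we think we are seeing would look roughly like the following. This is purely illustrative, not MapR's actual code, and the inodeList name is hypothetical:

// Hypothetical sketch of the suspected spin inside Inode.closeAll (not MapR's code).
// If close() ever fails to unlink the node it just closed, or another thread keeps
// re-inserting entries, first() returns a non-null item forever and the loop never
// terminates, all while the caller still holds the numInstances monitor.
while (inodeList.first() != null) {
  inodeList.first().close()
}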

 

More info:

- ulimit -n is 64000

- MapRBuildVersion is 6.0.0.20171109191718.GA

- If we make numerous calls to getTable at around the same time (from multiple threads) we can reproduce this; with only a couple of threads running it does not occur. A rough reproduction sketch follows this list.

- This feels somewhat related to the post "MapR-DB Java API Thread Issue // Hang?"
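For reference, the reproduction is roughly as follows. The thread count, the "/tables/example" path, and the connection setup are placeholders for our real ones, and withFileSystem is the helper shown earlier:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory

// Rough shape of the reproduction: many threads each open a FileSystem and a
// table at roughly the same time. Table name and thread count are placeholders.
val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)

val threads = (1 to 16).map { _ =>
  new Thread(() => {
    withFileSystem() { fs =>
      val table = connection.getTable(TableName.valueOf("/tables/example"))
      try {
        // ... a few gets/puts against the table ...
      } finally {
        table.close()   // tables are always closed, mirroring our using-style access
      }
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())

With a handful of threads this runs fine; with more concurrent callers we eventually see the blocked initConfig stacks shown above.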

 

I think this needs to be resolved as it is causing us all kinds of problems.  So my questions are: 

 

- Is this a MapR code problem? (I think any potential for an infinite lock hold is pretty bad.)

- Is the MapRFileSystem class thread-safe? We think it should be, as implementations of FileSystem are supposed to be thread-safe.

- Are Inode.List and Inode itself thread-safe? There really seems to be potential for an infinite loop there.

- Could there be some underlying contention between getFileSystem and getTable?

- Is there anything else you have observed that might help us resolve this without a code patch?

 

Thanks for your assistance,

 

Paul
