Original Publication Date: December 16, 2014
Hive commands sometimes fail with errors such as:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.thrift.transport.TTransportException java.net.SocketTimeoutException: Read timed out)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
Root Cause:
When metastore timeouts occur, the first suspect is slow queries against the metastore's backing RDBMS, which Hive accesses through its DataNucleus component.
This behaviour was optimized from Hive 0.12 onwards through HIVE-4051, which added the option hive.metastore.try.direct.sql. When set to true, the metastore tries direct SQL queries instead of DataNucleus for certain read paths. This can improve metastore performance substantially when fetching many partitions or column statistics, by up to 20x in some cases.
This value can also be set on a per-client basis from Hive 0.14 onwards using the command "set metaconf:hive.metastore.try.direct.sql=<value>" (HIVE-7532).
Other sources of slowness in metastore operations are investigated in HIVE-7195. The codepaths suspected of causing the delay are the following:
1) When a client requests all partitions, the result is not returned through an iterator. Instead, a collection holding all of the data is built and then passed over the network as a single object.
2) Operations that need to look up data on HDFS are not cached, and the lookups are done serially.
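The difference described in item 1 can be illustrated with a small, generic sketch. This is not Hive's code or API: the function names and the partition strings are hypothetical, and the sketch only contrasts building the whole result up front with handing it back in bounded batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fetch_all(partitions: Iterable[str]) -> List[str]:
    """Materialize every partition before returning (the pattern in item 1)."""
    return list(partitions)


def fetch_in_batches(partitions: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield partitions in fixed-size batches so no single response holds everything."""
    it = iter(partitions)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical partition names, used only for illustration.
names = (f"part={i}" for i in range(10))
batches = list(fetch_in_batches(names, 4))
print([len(b) for b in batches])
```

With batching, each response stays bounded in size, so a single slow, very large transfer is less likely to run into the client's socket timeout.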
These are details internal to the Hive codepath and have not been fully fixed in that JIRA. However, a related JIRA, HIVE-7223 - Support generic PartitionSpecs in Metastore partition-functions (fixed in 0.14.0), partially addresses the problem by reducing the Thrift traffic that causes the metastore timeouts.
Another optimization was introduced in HIVE-7366 - getDatabase using direct sql (fixed in 0.14.0), where the get_database calls now use direct SQL instead of going through DataNucleus.
Even with all these optimizations, recent versions of Hive (0.14 onwards) raise the default timeout to 10 minutes. This was done as part of HIVE-7140, "Bump default hive.metastore.client.socket.timeout to 5 minutes". Despite the title, a later comment on the ticket corrects the figure: "For the record, the new default is 600 seconds so that's 10 minutes, not 5 minutes as stated in this ticket's title and first comment".
Given the above, the best way to fix issues of this pattern is to upgrade Hive, enable direct SQL (hive.metastore.try.direct.sql) wherever possible, and set a timeout value (hive.metastore.client.socket.timeout) high enough that your jobs do not fail.
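As a rough sketch of the configuration step, the following Python snippet renders the two properties discussed above as a hive-site.xml style fragment. The property names come from this article; the values shown (true, and 600 treated as seconds) are illustrative assumptions, and the helper name render_hive_site is hypothetical. Check the units and defaults against the documentation for your Hive version before applying them.

```python
import xml.etree.ElementTree as ET

# Property names are taken from the article; the values are illustrative assumptions.
METASTORE_SETTINGS = {
    "hive.metastore.try.direct.sql": "true",
    "hive.metastore.client.socket.timeout": "600",  # assumed to be read as seconds
}


def render_hive_site(settings):
    """Return a hive-site.xml style <configuration> element as a string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


hive_site_xml = render_hive_site(METASTORE_SETTINGS)
print(hive_site_xml)
```

The generated property elements would normally be merged into the existing hive-site.xml rather than replacing it, and the metastore and its clients need to pick up the new values before the change takes effect.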