Upgrading to Spark 2 on HDP 2.5 + YARN

Preface

My Hadoop cluster runs HDP 2.5 + YARN with Spark 1.6.

Recently I ran into a situation where a user wanted to query a huge Hive table,
one with roughly 300,000 partitions and several hundred TB of data.
Running that query through Spark 1.6's Spark SQL fails right away with an out-of-memory error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.StringBuilder.toString(StringBuilder.java:407)
        at java.io.ObjectStreamClass.getClassSignature(ObjectStreamClass.java:1552)
        at java.io.ObjectStreamClass.getMethodSignature(ObjectStreamClass.java:1567)
        at java.io.ObjectStreamClass.access$2500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$MemberSignature.<init>(ObjectStreamClass.java:1892)
        at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1819)
        at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:253)
        at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:251)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:250)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:611)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)

After some research I found that Spark 2 can handle this case,
so I decided to upgrade the environment to Spark 2.

Steps

I originally assumed I could just download Spark 2 and use it as-is, but I ran into a bunch of odd issues,
so I'm recording the steps here.
Reference: https://community.hortonworks.com/articles/53029/how-to-install-and-run-spark-20-on-hdp-25-sandbox.html

First, download the latest Spark tarball (I used spark-2.1.0-bin-hadoop2.7.tgz) into the home directory,
then proceed as follows:

$ sudo mkdir /usr/hdp/current/spark2-client
$ tar -xvf spark-2.1.0-bin-hadoop2.7.tgz
$ cd /usr/hdp/current/spark2-client
$ sudo mv ~/spark-2.1.0-bin-hadoop2.7/* .

Copy the old Spark client's configuration files:

$ sudo cp ../spark-client/conf/* conf/
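
On an HDP 2.5 node the old conf directory typically contains at least spark-defaults.conf, spark-env.sh, hive-site.xml and log4j.properties; a quick listing shows what actually got copied over (the exact set can vary per install):

$ ls conf/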

Up to this point the Spark 2 client still isn't usable;
a few settings need to be added or verified.
First, replace the old paths inside spark-env.sh with the new ones (a one-liner for this follows the snippet):

$ sudo vim conf/spark-env.sh

export SPARK_CONF_DIR=${SPARK_CONF_DIR:-/usr/hdp/current/spark2-client/conf}
....
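
If the copied file still references the old client anywhere else, a quick way to rewrite every occurrence (just a sketch, assuming the old client lives at the standard /usr/hdp/current/spark-client path) is:

# Assumes the only stale paths use the spark-client prefix
$ sudo sed -i 's|/usr/hdp/current/spark-client|/usr/hdp/current/spark2-client|g' conf/spark-env.sh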

Edit conf/spark-defaults.conf and add the following settings:

$ sudo vim conf/spark-defaults.conf

...
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native
 
spark.driver.extraJavaOptions -Dhdp.version=2.5.0.0-817
spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.0.0-817
# Required: setting this parameter to 'false' turns off ATS timeline server for Spark

spark.hadoop.yarn.timeline-service.enabled false
...
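
The value passed to -Dhdp.version has to match the HDP build installed on the cluster; mine is 2.5.0.0-817. One way to look it up, assuming the standard HDP layout and the hdp-select tool, is:

# Either of these should show the installed build string, e.g. 2.5.0.0-817
$ hdp-select versions
$ ls /usr/hdp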

This step is very important; without it, jobs will not start in YARN mode.

After that, check whether everything runs correctly:

$ bin/spark-shell  --master=yarn-client
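
Note that in Spark 2 the yarn-client master string is deprecated; --master yarn --deploy-mode client is the equivalent. Once the shell is up, a quick sanity check against the Hive metastore (assuming hive-site.xml was among the copied config files; the table name below is a placeholder) looks like this:

scala> spark.sql("show databases").show()
scala> spark.sql("select count(*) from some_db.some_big_table").show()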

2017/02/20

With the Spark 2 client configured, I originally wanted Zeppelin to use it directly,
but it kept failing with the following error:

...
$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:
/etc/hadoop/conf/secure: bad substitution

After some digging, the problem turned out to be that ${hdp.version} could not be resolved.
I tried setting the parameter in every way I could think of without success, and finally found an approach that works:
edit mapred-site.xml and add the following property:

...
    <property>
      <name>hdp.version</name>
      <value>2.5.xxxxx</value>
    </property>
...

Fill in hdp.version with the HDP version of your environment.
Recording this here for the record.
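
For completeness: the above assumes Zeppelin is already pointed at the Spark 2 client. On a stock Zeppelin install that is usually done by setting SPARK_HOME in zeppelin-env.sh and restarting the interpreter (a sketch; the exact config location depends on how Zeppelin was installed):

export SPARK_HOME=/usr/hdp/current/spark2-client
export HADOOP_CONF_DIR=/etc/hadoop/conf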
