Setting up a JupyterHub with LDAP login and Python 2/3 support

I was recently asked to set up a Jupyter environment for colleagues in my department.

The annoying part is that, out of the box, Jupyter doesn't really support multiple users.

It turns out there is JupyterHub for exactly that:
https://github.com/jupyterhub/jupyterhub

This post records how I set it up.

First, install Anaconda2 and Anaconda3 to support Python 2/3, with Python 3 as the primary environment.

$ cd /usr/local
$ wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
$ sh Anaconda3-5.0.1-Linux-x86_64.sh
...
Anaconda3 will now be installed into this location:
/root/anaconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda3] >>> /usr/local/anaconda3
..
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> yes


$ wget https://repo.continuum.io/archive/Anaconda2-5.0.1-Linux-x86_64.sh
$ sh Anaconda2-5.0.1-Linux-x86_64.sh
Anaconda2 will now be installed into this location:
/root/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/root/anaconda2] >>> /usr/local/anaconda2
..
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> no


# check version
$ source ~/.bashrc
$ python -V
Python 3.6.3 :: Anaconda, Inc.

Install a Python 2 kernel under the Python 3 environment:

/usr/local/anaconda2/bin/python -m pip install ipykernel
/usr/local/anaconda2/bin/python -m ipykernel install --prefix=/usr/local/anaconda3 --name 'python2'
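
To double-check that both kernels are registered under the Anaconda3 tree, a small sketch like this can help (this is my own addition, not part of the original steps; it only relies on jupyter_client, which ships with the notebook package):

# run with /usr/local/anaconda3/bin/python
from jupyter_client.kernelspec import KernelSpecManager

# lists every kernelspec Jupyter can see; both 'python2' and 'python3' should show up
for name, path in KernelSpecManager().find_kernel_specs().items():
    print(name, '->', path)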

Install JupyterHub:

$ conda install -c conda-forge jupyterhub
$ conda install notebook
$ pip install jupyterhub-ldapauthenticator
$ jupyterhub --generate-config
$ mkdir -p /etc/jupyterhub
$ mv jupyterhub_config.py /etc/jupyterhub

# Edit config
$ vim /etc/jupyterhub/jupyterhub_config.py
...
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.hub_ip = '0.0.0.0'
c.JupyterHub.hub_port = 8081
c.JupyterHub.port = 80
c.JupyterHub.authenticator_class = 'ldapauthenticator.ldapauthenticator.LDAPLocalAuthenticator'
c.LDAPAuthenticator.bind_dn_template = 'uid={username},ou=people,ou=account,ou=development,o=example'
c.LDAPAuthenticator.server_address = 'ldap://myldap.com'
c.LocalAuthenticator.create_system_users = True
c.LocalAuthenticator.add_user_cmd = ['useradd', '-m']
c.LDAPAuthenticator.use_ssl = False

One thing to note here: I used LDAPLocalAuthenticator as the authenticator class
because I want each LDAP account to automatically get a corresponding local home directory; otherwise you get an error like this:

Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tornado/web.py", line 1511, in _execute
    result = yield result
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/handlers/login.py", line 94, in post
    yield self.spawn_single_user(user)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 475, in spawn_single_user
    yield gen.with_timeout(timedelta(seconds=self.slow_spawn_timeout), finish_spawn_future)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 445, in finish_user_spawn
    yield spawn_future
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/user.py", line 439, in spawn
    raise e
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/user.py", line 378, in spawn
    ip_port = yield gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
  File "/usr/local/anaconda3/lib/python3.6/types.py", line 248, in wrapped
    coro = func(*args, **kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/spawner.py", line 968, in start
    env = self.get_env()
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/spawner.py", line 960, in get_env
    env = self.user_env(env)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/spawner.py", line 947, in user_env
    home = pwd.getpwnam(self.user.name).pw_dir
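
The failing call is pwd.getpwnam(), which raises KeyError when the LDAP user has no local passwd entry. A minimal stdlib sketch to check whether a given user already exists locally (the username 'alice' is just a placeholder of mine):

import pwd

def has_local_account(username):
    """True if the user exists in the local passwd database, which is what the spawner needs."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False

print(has_local_account('alice'))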

LDAPLocalAuthenticator is not implemented in the official ldapauthenticator yet;
you have to apply the patch yourself, see: https://github.com/jupyterhub/ldapauthenticator/pull/36/files

$ wget https://patch-diff.githubusercontent.com/raw/jupyterhub/ldapauthenticator/pull/36.patch
$ patch -u /usr/local/anaconda3/lib/python3.6/site-packages/ldapauthenticator/ldapauthenticator.py < 36.patch

After that, JupyterHub can be started with one of the following commands:

# foreground mode
$ jupyterhub --no-ssl --config=/etc/jupyterhub/jupyterhub_config.py
# background mode, piping the log to syslog
$ jupyterhub --no-ssl --config=/etc/jupyterhub/jupyterhub_config.py | logger -t jupyterhub &

Note 1:

During testing I ran into an LDAP authentication problem:

  File "/usr/local/anaconda3/lib/python3.6/site-packages/tornado/web.py", line 1511, in _execute
    result = yield result
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/handlers/login.py", line 83, in post
    user = yield self.login_user(data)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 328, in login_user
    authenticated = yield self.authenticate(data)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/jupyterhub/auth.py", line 227, in get_authenticated_user
    authenticated = yield self.authenticate(handler, data)
  File "/usr/local/anaconda3/lib/python3.6/types.py", line 248, in wrapped
    coro = func(*args, **kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/ldapauthenticator/ldapauthenticator.py", line 109, in authenticate
    if conn.bind():
  File "/usr/local/anaconda3/lib/python3.6/site-packages/ldap3/core/connection.py", line 594, in bind
    self.refresh_server_info()
  File "/usr/local/anaconda3/lib/python3.6/site-packages/ldap3/core/connection.py", line 1315, in refresh_server_info
    self.server.get_info_from_server(self)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/ldap3/core/server.py", line 446, in get_info_from_server
    self._get_schema_info(connection)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/ldap3/core/server.py", line 431, in _get_schema_info
    self._schema_info.other[attribute] = format_attribute_values(self._schema_info, attribute, self._schema_info.raw[attribute], self.custom_formatter)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/ldap3/protocol/formatters/standard.py", line 200, in format_attribute_values
    formatted_values = [formatter(raw_value) for raw_value in values]  # executes formatter
TypeError: 'NoneType' object is not iterable

It seems that ldap3 2.4.0 does not handle empty attributes; in theory the LDAP protocol should not produce empty attributes either,
so this is probably an incomplete implementation on the LDAP server side.
The current dev branch already fixes this issue and the fix should land in 2.4.1; until that release is out, you can work around it like this:

$ pip uninstall ldap3
$ git clone -b dev https://github.com/cannatag/ldap3.git
$ cd ldap3
$ python setup.py install

Note 2:

If you want to put nginx in front as a proxy, remember to keep base_url in line with the proxied path.
Assuming nginx proxies /jupyterhub:

    location /jupyterhub {
      proxy_pass http://jupyterhub_url:80;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
      proxy_set_header Origin "";
      client_max_body_size 5000m;
      proxy_connect_timeout 1d;
      proxy_send_timeout 1d;
      proxy_read_timeout 1d;
    }

base_url then has to be set to /jupyterhub as well:

jupyterhub --no-ssl --config=/etc/jupyterhub/jupyterhub_config.py --base-url=/jupyterhub
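
The same thing can also be set in the config file instead of on the command line; a minimal sketch of the relevant jupyterhub_config.py line (base_url is a standard JupyterHub option):

# /etc/jupyterhub/jupyterhub_config.py
c.JupyterHub.base_url = '/jupyterhub'   # must match the nginx location prefix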

Building a WebDAV binary repository with Apache + LDAP

I've been building a platform with Ansible lately,

and ran into a problem: when Ansible deploys an application,

an application that needs compiling gets rebuilt automatically on every deployment,

which is rather slow. Then I remembered that my previous company used a WebDAV share where people could put binary files to speed up deployments.

Today I set up WebDAV myself and took notes along the way.

Environment

  • CentOS 7.3
  • Apache 2.4

I want this WebDAV to:

  • allow downloads without any restriction
  • require an LDAP account and password for uploads and deletes

Apache has WebDAV support built in, so plain Apache is enough.

Installation and deployment:

# Install apache

$ yum -y install httpd mod_ldap
$ httpd -version
>Server version: Apache/2.4.6 (CentOS)
>Server built:   Nov 14 2016 18:04:44
$ systemctl enable httpd.service
$ systemctl status httpd.service
 
# Make webdav directory

$ mkdir -p /data_hdd/webdav
$ chown apache /data_hdd/webdav
$ chgrp apache /data_hdd/webdav
$ chmod 775    /data_hdd/webdav
$ ls    -ld    /data_hdd/webdav
 
# Check config

$ cat /etc/httpd/conf.modules.d/00-dav.conf
$ vim /etc/httpd/conf.d/dav.conf
DAVLockDB "/tmp/WebDAV.lock"
Alias /webdav "/data_hdd/webdav"
<Directory /data_hdd/webdav>
    DAV On
    Options Indexes MultiViews
    AuthType Basic
    AuthName "intra-auth"
    AuthBasicProvider          ldap
    AuthLDAPGroupAttribute     groupOfNames
    AuthLDAPURL                ldap://$LDAP_URL/ou=people,ou=account,ou=development,o=example.com?uid
    require ldap-group         ou=authgroup,ou=development,o=example.com
    <RequireAny>
        Require method GET POST OPTIONS
        Require valid-user
    </RequireAny>
</Directory>
 
 
# Start apache

$ systemctl restart httpd.service

The LDAP-related directives in the Apache config (AuthLDAPURL, require ldap-group, and so on)
need to be adapted to your actual LDAP directory layout.

Once it's up, browsing to http://<YOUR_HOST>/webdav/ shows a listing you can download from.

Usage:

Create a new folder

curl -X MKCOL --user <YOUR_SHORT_ACCOUNT> <WEBDAV_URL>/<FOLDER_NAME>

Example:

curl -X MKCOL --user <YOUR_LDAP_ACCOUNT> http://<YOUR_HOST>/webdav/test2

Delete a folder

curl -X DELETE --user <YOUR_SHORT_ACCOUNT> <WEBDAV_URL>/<FOLDER_NAME>/

Example:

curl -X DELETE --user <YOUR_LDAP_ACCOUNT> http://<YOUR_HOST>/webdav/test2/

Note: the trailing "/" MUST exist after the folder name.

Upload a file

curl -T <LOCAL_FILE_TO_UPLOAD> --user <YOUR_SHORT_ACCOUNT> <WEBDAV_URL>/<FOLDER_NAME>

Example

curl -T ~/Downloads/oracle_BlueKai.pdf --user <YOUR_LDAP_ACCOUNT> http://<YOUR_HOST>/webdav/test2

After that, files can be fetched directly with tools like wget.
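
The same operations can also be scripted; here is a minimal sketch using the Python requests library (the host, folder name, and credentials are placeholders following the curl examples above):

import requests

base = 'http://<YOUR_HOST>/webdav'
auth = ('<YOUR_LDAP_ACCOUNT>', '<YOUR_PASSWORD>')

# create a folder (MKCOL is passed as a custom HTTP method)
requests.request('MKCOL', base + '/test2', auth=auth).raise_for_status()

# upload a file with PUT
with open('oracle_BlueKai.pdf', 'rb') as f:
    requests.put(base + '/test2/oracle_BlueKai.pdf', data=f, auth=auth).raise_for_status()

# downloads need no credentials in this setup
print(requests.get(base + '/test2/oracle_BlueKai.pdf').status_code)

# delete the folder (note the trailing slash)
requests.request('DELETE', base + '/test2/', auth=auth).raise_for_status()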

Presto performance problems with concurrent jobs

I've been setting up Presto as a data warehouse, with Airbnb's superset as the data visualization tool.

During testing I ran into a problem: a superset dashboard fires many queries at Presto "at the same time" (a dozen or so),

and the whole dashboard ends up timing out because the responses are too slow.

At first I wondered whether Presto simply doesn't handle concurrent jobs.

A quick search turned up:
http://teradata.github.io/presto/docs/148t/admin/tuning.html#tuning-pref-task

so in theory concurrent jobs should be supported.

In practice, though, if one query takes N seconds, submitting 20 queries at once takes roughly 20*N seconds before results come back.

The test script:

prestocli='java -jar /usr/local/prestocli.jar --server server1 --catalog hive --schema default'
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20;
do
   time $prestocli --execute "SELECT count(1) FROM hive.mydb.mytable LIMIT 100" &
done

Running SELECT count(1) FROM hive.mydb.mytable LIMIT 100 once takes only about 2 seconds,
but submitting it 20 times at once takes close to 35~40 seconds to finish.

Searching around, others have hit a similar problem:
https://groups.google.com/forum/#!topic/presto-users/IEdzrGdMNXQ

It turns out some parameters can be tuned through the query queue config; among them, user_pipeline.${USER} defaults to just 1.
https://prestodb.io/docs/current/admin/queue.html

With nothing to lose, I tried adding it to my environment.
First, edit $PRESTO_HOME/presto/etc/config.properties to point at the queue config file:

$PRESTO_HOME/presto/etc/config.properties
...
query.queue-config-file=/usr/local/presto/etc/presto-queues.json

And add a new presto-queues.json:

/usr/local/presto/etc/presto-queues.json
{
  "queues": {
    "user.${USER}": {
      "maxConcurrent": 100,
      "maxQueued": 200
    },
    "user_pipeline.${USER}": {
      "maxConcurrent": 10,
      "maxQueued": 20
    },
    "pipeline": {
      "maxConcurrent": 10,
      "maxQueued": 100
    }
  },
  "rules": [
    {
      "queues": ["user.${USER}"]
    }
  ]
}

I ran the test again; this time it took about 28~32 seconds.

Hmm... not much of an improvement. Then I remembered that Presto supports multiple coordinators.

What if I add more machines to take queries?

So I added two more coordinators, three in total, to share the load.

Test again, with this script:

prestocli0='java -jar /usr/local/prestocli.jar --server server1 --catalog hive --schema default'
prestocli1='java -jar /usr/local/prestocli.jar --server server2 --catalog hive --schema default'
prestocli2='java -jar /usr/local/prestocli.jar --server server3 --catalog hive --schema default'
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20;
do
   ind=$(($i % 3))
   cmd=prestocli$ind
   time ${!cmd} --execute "SELECT count(1) FROM hive.mydb.mytable LIMIT 100" &
done

This time it took about 20~25 seconds, which is much better.
There is probably still room for tuning; I'll revisit it when I get the chance.
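
For what it's worth, the same concurrency test can also be driven from Python 3 with just the standard library; a sketch that mirrors the bash script above (the prestocli jar path, server names, and query are the same assumptions as before):

import subprocess, time
from concurrent.futures import ThreadPoolExecutor

SERVERS = ['server1', 'server2', 'server3']
QUERY = 'SELECT count(1) FROM hive.mydb.mytable LIMIT 100'

def run_query(i):
    server = SERVERS[i % len(SERVERS)]  # round-robin across the coordinators
    cmd = ['java', '-jar', '/usr/local/prestocli.jar',
           '--server', server, '--catalog', 'hive', '--schema', 'default',
           '--execute', QUERY]
    start = time.time()
    subprocess.run(cmd, stdout=subprocess.DEVNULL)
    return time.time() - start

with ThreadPoolExecutor(max_workers=20) as pool:
    durations = list(pool.map(run_query, range(20)))
print('slowest query: %.1fs' % max(durations))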

Problems upgrading TensorFlow to 1.0

TensorFlow 1.0 came out a few days ago, and I casually upgraded my environment.

That turned out to be a bad idea: after the upgrade, a pile of programs I had written earlier stopped running.

The 1.0 release changed a lot of the interfaces.

For example, the LSTM cell used to live at tf.nn.rnn_cell.BasicLSTMCell; the new version moved it to tf.contrib.rnn.BasicLSTMCell,

and so on. Seeing all of this nearly made me faint.

TensorFlow publishes an official migration guide. I haven't enjoyed any benefit from this upgrade yet, but I've already stepped on plenty of its downsides.
https://www.tensorflow.org/install/migration
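
A minimal sketch of the kind of change involved (assuming TensorFlow 1.0 and a plain single-layer LSTM cell):

import tensorflow as tf

num_units = 128

# before 1.0:
# lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# with 1.0 the cell classes moved under tf.contrib.rnn:
lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units)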

HDP 2.5 + YARN: upgrading to Spark 2

Background

My Hadoop cluster environment is HDP 2.5 + YARN with Spark 1.6.

Recently a user wanted to query a huge Hive table
with roughly 300,000 partitions and several hundred TB of data.
Running it with Spark 1.6's Spark SQL fails straight away with an out-of-memory error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.StringBuilder.toString(StringBuilder.java:407)
        at java.io.ObjectStreamClass.getClassSignature(ObjectStreamClass.java:1552)
        at java.io.ObjectStreamClass.getMethodSignature(ObjectStreamClass.java:1567)
        at java.io.ObjectStreamClass.access$2500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$MemberSignature.<init>(ObjectStreamClass.java:1892)
        at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1819)
        at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:253)
        at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:251)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:250)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:611)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)

Some research later, it turned out Spark 2 solves this problem,
so I decided to upgrade the environment to Spark 2.

Approach

I assumed I could just download Spark 2 and use it, but I ran into a bunch of odd problems,
so here are the steps.
Reference: https://community.hortonworks.com/articles/53029/how-to-install-and-run-spark-20-on-hdp-25-sandbox.html

First download the latest Spark tarball; I used spark-2.1.0-bin-hadoop2.7.tgz and put it in my home directory.
Then:

$ sudo mkdir /usr/hdp/current/spark2-client
$ tar -xvf spark-2.1.0-bin-hadoop2.7.tgz
$ cd /usr/hdp/current/spark2-client
$ sudo mv ~/spark-2.1.0-bin-hadoop2.7/* .

Copy the config files from the old Spark client:

$ sudo cp ../spark-client/conf/* conf/

At this point the Spark 2 client still does not work;
a few settings need to be added or confirmed.
First, replace the old paths inside spark-env.sh with the new ones:

$ sudo vim conf/spark-env.sh

export SPARK_CONF_DIR=${SPARK_CONF_DIR:-/usr/hdp/current/spark2-client/conf}
....

Edit conf/spark-defaults.conf and add the following settings:

sudo vim conf/spark-defaults.conf

...
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native
 
spark.driver.extraJavaOptions -Dhdp.version=2.5.0.0-817
spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.0.0-817
# Required: setting this parameter to 'false' turns off ATS timeline server for Spark

spark.hadoop.yarn.timeline-service.enabled false
...

This step is critical; without it, YARN mode won't start.

After that, try it out and see whether it works:

$ bin/spark-shell  --master=yarn-client
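
The same sanity check can be done from PySpark as well (a sketch, assuming the spark2-client layout above and that hive-site.xml was copied into conf/):

# launch with: /usr/hdp/current/spark2-client/bin/pyspark --master yarn
# inside the PySpark shell, `spark` is the pre-built SparkSession (Spark 2.x)
spark.range(100).count()              # small job that exercises the YARN executors
spark.sql("SHOW DATABASES").show()    # checks that the Hive metastore is reachable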

2017/02/20

With the Spark 2 client configured, I wanted Zeppelin to use it directly,
but it kept failing with

...
$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:
/etc/hadoop/conf/secure: bad substitution

That error comes from ${hdp.version} not being resolved.
I tried every way of passing the parameter I could think of without success; the approach that finally worked is
to add the following property to mapred-site.xml:

...
    <property>
      <name>hdp.version</name>
      <value>2.5.xxxxx</value>
    </property>
...

Fill in your environment's actual HDP version for hdp.version.
Noting it here for the record.

How to run a unit test for a single class with Maven

Lately I've been digging into Zeppelin's source code.
It contains a pile of sub-projects, each with unit tests large and small.
I looked into what command to use when I want to run the unit tests of one specific class.

The command:

$ mvn  -pl livy,zeppelin-interpreter -DfailIfNoTests=false  -Dtest=LivySQLInterpreterTest -Dcheckstyle.skip clean test 

I want to run the unit tests in LivySQLInterpreterTest.
Besides the basic mvn clean test, a few extra flags are needed:

  • -Dtest: the test class to run
  • -pl: which sub-projects to include; I'm testing livy, but livy depends on zeppelin-interpreter, so both are built together
  • -DfailIfNoTests: allow sub-projects with no matching tests to pass, since I'm not running any tests in zeppelin-interpreter
  • -Dcheckstyle.skip: skip the style check

When the project structure is simple, specifying -Dtest alone is enough:

$ mvn -Dtest=... clean test

That's it; just a note for the record.

Zeppelin 0.7 natively supports multi-user environments

Back in the post "Zeppelin multi-user support with Livy Server" I looked at using a Livy Server to get multi-user behaviour,
but the extra proxy layer made many problems painful to trace, and jar packages also had to be deployed onto the proxy.

Starting with Zeppelin 0.7 there is impersonation support, which gives a real multi-user environment.
See the official documentation for details.

Another thing worth mentioning: previously, to run SparkR on YARN you had to add a spark.yarn.dist.archives entry in ${SPARK_HOME}/conf/spark-defaults.conf to ship ${SPARK_HOME}/R/lib/sparkr.zip.
From this version on, Zeppelin packages it by itself
(see the source code),
so keeping a sparkr.zip entry in spark-defaults.conf will actually break Zeppelin.

So remove the existing spark.yarn.dist.archives setting first:

#spark.yarn.dist.archives /usr/hdp/2.4.2.0-258/spark/R/lib/sparkr.zip#sparkr

Zeppelin 0.7 hasn't been officially released yet, but the 0.7 branch is already on GitHub, so you can grab it and play with it:
https://github.com/apache/zeppelin

The configuration differs quite a bit from 0.6, so watch out.
First, the LDAP realm class in shiro.ini changed:

...
ldapRealm = org.apache.zeppelin.realm.LdapGroupRealm
## search base for ldap groups (only relevant for LdapGroupRealm):
ldapRealm.contextFactory.environment[ldap.searchBase] = dc=COMPANY,dc=COM
ldapRealm.contextFactory.url = ldap://ldap.test.com:389
ldapRealm.userDnTemplate = uid={0},ou=Users,dc=COMPANY,dc=COM
ldapRealm.contextFactory.authenticationMechanism = SIMPLE
...

Next, impersonation-related settings were added under zeppelin-env.sh:

...
export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
export ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER=false

Adding the two lines above lets the Zeppelin interpreter switch to a specific user before executing.
ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER has to be set to false; otherwise Spark is launched with --proxy-user,
which turns the proxied identity into that specific user and causes problems.

The really nice part is that interpreters can now be scoped Per User,
and User Impersonate can be enabled so the interpreter runs as that particular user.

With this feature, in some sense a Livy Server is no longer needed.

Even better, users can restart the interpreter for their own environment.

Before 0.7, if a user crashed Spark, the only option was to ask the administrator to restart all of Zeppelin's interpreters; with this feature,
users can restart their own environment whenever something goes wrong.

The official 0.7 release is not out yet, and building it yourself still turns up quite a few bugs,
but it is a version worth looking forward to.

Running TensorFlow with Keras

In the previous post, "Deep learning with tflearn to recognize Hatsune Miku", I played with tflearn.
I then looked at a few other popular deep learning packages. With TensorFlow as the low-level compute API, besides tflearn,
Keras is another package that can run on top of TensorFlow.

References:
https://keras.io/
http://tjo.hatenablog.com/entry/2016/06/09/190000 (in Japanese)

Keras also has a more active community than tflearn, so it's worth getting a feel for how it's used.

Environment setup

1. Install Anaconda

https://www.continuum.io/downloads

2. Install TensorFlow
$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.1-py2-none-any.whl
$ pip install  $TF_BINARY_URL
3. Install Keras
pip install keras
4. Test the environment
$ python

>>> import keras
Using TensorFlow backend.

Here you can see that Keras uses TensorFlow as its compute backend.
If it picks Theano rather than TensorFlow, you can change your personal config:

$ vim ~/.keras/keras.json

Check whether the backend setting says theano or tensorflow, and change it to whichever you want:

{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}
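
After editing the file, the active backend and the other values can be confirmed from Python (a small check; keras.backend.backend(), epsilon() and floatx() are part of the public Keras API):

from keras import backend as K

print(K.backend())   # expected: 'tensorflow'
print(K.epsilon())   # matches the "epsilon" entry in keras.json
print(K.floatx())    # matches the "floatx" entry in keras.json
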
5. Test Keras

I found a Keras MNIST example online to try:
Keras + TensorFlow, with the results viewed in TensorBoard.

References:
https://keras.io/callbacks/#tensorboard
http://qiita.com/supersaiakujin/items/568605f999ef5cc741be (in Japanese)

import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
import keras.callbacks
import keras.backend.tensorflow_backend as KTF
import tensorflow as tf

batch_size = 128
nb_classes = 10
nb_epoch   = 20
nb_data    = 28*28
log_filepath = '/tmp/keras_log'

# load data

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# reshape

X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])

# rescale

X_train = X_train.astype(np.float32)
X_train /= 255

X_test = X_test.astype(np.float32)
X_test /= 255

# convert class vectors to binary class matrices (one hot vectors)

Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

old_session = KTF.get_session()

with tf.Graph().as_default():
    session = tf.Session('')
    KTF.set_session(session)
    KTF.set_learning_phase(1)
    # build model

    model = Sequential()
    model.add(Dense(512, input_shape=(nb_data,), init='normal',name='dense1'))
    model.add(Activation('relu', name='relu1'))
    model.add(Dropout(0.2, name='dropout1'))
    model.add(Dense(512, init='normal', name='dense2'))
    model.add(Activation('relu', name='relu2'))
    model.add(Dropout(0.2, name='dropout2'))
    model.add(Dense(10, init='normal', name='dense3'))
    model.add(Activation('softmax', name='softmax1'))
    model.summary()

    model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.001), metrics=['accuracy'])

    tb_cb = keras.callbacks.TensorBoard(log_dir=log_filepath, histogram_freq=1)
    cbks = [tb_cb]

    history = model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch = nb_epoch, verbose=1, callbacks=cbks)

    score = model.evaluate(X_test, Y_test, verbose=0)
    print('Test score:', score[0])
    print('Test accuracy;', score[1])


KTF.set_session(old_session)

Output:

...
('Test score:', 0.44597396918535231)
('Test accuracy;', 0.86890000000000001)
[Finished in 242.44s]

The code above writes the TensorFlow logs to /tmp/keras_log; start TensorBoard with the following command and open it in a browser:

$ tensorboard --logdir=/tmp/keras_log

It runs, and the results look OK.
I'll play with other examples later.

Deep learning with tflearn to recognize Hatsune Miku

After getting a first taste of TensorFlow and the power of CNNs in the earlier post "Playing with TensorFlow: MNIST handwriting recognition",
today on a whim I wanted to write a neural network that automatically decides whether a picture contains Hatsune Miku.

Conveniently, I had just come across tflearn, a high-level package of deep learning algorithms implemented on top of TensorFlow.

The tflearn site states its positioning right up front: "Deep learning library featuring a higher-level API for TensorFlow." With raw TensorFlow you have to manage the input/output dimensions of every layer yourself; tflearn wraps the common deep learning building blocks into a far more intuitive API.

On to the main topic.
This post has two parts:

  • prepare training data and train a CNN
  • grab random images and test

Preparing training data and training the CNN

I want to train a network that tells me whether a picture is a picture of Hatsune Miku.
First I grabbed 20 random Miku JPGs from the web and resized them all to 100x100,

then grabbed another 20 random non-Miku pictures, also resized to 100x100.

Those 40 images are my training set.

Then I grabbed 5 more Miku pictures and 5 non-Miku pictures, resized to 100x100, as my validation set.

With that, the training data is ready: https://github.com/del680202/MachineLearning-memo/blob/master/src/tensorflow/cnn_dataset_mini.zip
The sample is tiny, but in theory deep learning should still manage a decent result even with little data.

With numpy and the other prerequisites installed, tensorflow and tflearn install easily with pip.
For numpy and friends I usually use Anaconda: https://www.continuum.io/downloads

Installing tensorflow and tflearn
(reference: https://www.tensorflow.org/get_started/os_setup):

$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.1-py2-none-any.whl
$ pip install  $TF_BINARY_URL
$ pip install tflearn

Unzip the prepared data into the three directories miku, no-miku, and test-set,
then create a cnn.py file and fill it in piece by piece as follows.

First, the imports:

import tensorflow as tf
import numpy as np
import os
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_image

Load the training images and convert them into a numpy array:

SCRIPT_PATH = os.path.dirname(os.path.abspath( __file__ ))
num = 20
imgs = []
for i in range(1, num + 1):
    imgs.append(np.asarray(load_image("%s/miku/%s.jpg" % (SCRIPT_PATH, i))))
for i in range(1, num + 1):
    imgs.append(np.asarray(load_image("%s/no-miku/%s.jpg" % (SCRIPT_PATH, i))))
imgs = np.array(imgs)
y_data = np.r_[np.c_[np.ones(num), np.zeros(num)],np.c_[np.zeros(num), np.ones(num)]]
print imgs.shape
print y_data.shape

tflearn's load_image loads an image as a PIL object, and np.asarray turns it into a numpy array.
For the labels: the first 20 images are Miku and get the vector [1, 0]; the last 20 are not Miku and get [0, 1].
The resulting dataset and label shapes are:
(40, 100, 100, 3) # 40 images of 100x100 pixels with 3 RGB channels
(40, 2)           # 40 label rows
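
For clarity, the np.r_/np.c_ one-liner above builds the same label matrix as this more explicit version (just an illustration, not part of the original script):

import numpy as np

num = 20
y_data = np.zeros((2 * num, 2))
y_data[:num, 0] = 1   # first 20 rows: Miku     -> [1, 0]
y_data[num:, 1] = 1   # last 20 rows:  not Miku -> [0, 1]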

Load the validation images the same way;
the first five are Miku pictures, the last five are not:

x_test = []
for i in range(1, 11):
    x_test.append(np.asarray(load_image("%s/test-set/%s.jpg" % (SCRIPT_PATH, i))))
x_test =  np.array(x_test)
y_test = np.r_[np.c_[np.ones(5), np.zeros(5)],np.c_[np.zeros(5), np.ones(5)]]
print x_test.shape
print y_test.shape
#output

#(10, 100, 100, 3)

#(10, 2)

With the data ready, it's time to build the CNN. I copied the example from the tflearn site and tweaked it;
compared with raw TensorFlow, there are noticeably fewer parameters to worry about.

# Building convolutional network

network = input_data(shape=[None, 100, 100, 3], name='input')
network = conv_2d(network, 64, 5, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = conv_2d(network, 128, 5, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = fully_connected(network, 512, activation='relu')
#network = dropout(network, 0.8)

network = fully_connected(network, 1024, activation='relu')
network = dropout(network, 0.8)
network = fully_connected(network, 2, activation='softmax')
network = regression(network, optimizer='adam', learning_rate=0.00001,
                     loss='categorical_crossentropy', name='target')

Once the CNN is built, training can start.
Train for 500 epochs and show the training progress as it goes;
when training finishes, save the model as miku_model.tflearn.

# Training

model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit({'input': imgs}, {'target': y_data}, n_epoch=500,
           validation_set=({'input': x_test}, {'target': y_test}),
           snapshot_step=100,show_metric=True, run_id='convnet_miku')
model.save('miku_model.tflearn')

During training you can see the current status,
including the optimizer, loss, accuracy, and so on. Very convenient.

---------------------------------
Run id: convnet_miku
Log directory: /tmp/tflearn_logs/
---------------------------------
Training samples: 40
Validation samples: 10
--
...
Training Step: 2  | total loss: 2.38131
| Adam | epoch: 002 | loss: 2.38131 - acc: 0.4950 | val_loss: 2.82867 - val_acc: 0.5000 -- iter: 40/40
Training Step: 2  | total loss: 2.38131

After training, the logs go to /tmp/tflearn_logs by default; open them with TensorBoard:

$tensorboard --logdir='/tmp/tflearn_logs'

Once the command above is running, http://localhost:6006 shows how the training progressed and what the network looks like.

Even with so little training data, accuracy still reached 80~90%, which is not bad at all.

Testing with random images

With the model trained, let's grab two or three images from the web and test it.
I prepared two Miku images, t1.jpg and t2.jpg, and two non-Miku images, t3.jpg and t4.jpg, under /tmp/,
and tested with the program below.

First, load the model that was just trained:

import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_image

# Building convolutional network

network = input_data(shape=[None, 100, 100, 3], name='input')
network = conv_2d(network, 64, 5, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = conv_2d(network, 128, 5, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = fully_connected(network, 512, activation='relu')
network = fully_connected(network, 1024, activation='relu')
network = dropout(network, 0.8)
network = fully_connected(network, 2, activation='softmax')
network = regression(network, optimizer='adam', learning_rate=0.00001,
                     loss='categorical_crossentropy', name='target')

model = tflearn.DNN(network)
model.load('miku_model.tflearn')

The key part here is using model.load to load the miku_model.tflearn saved earlier;
the model structure must be exactly the same as the one used for training.

Once the model is loaded, test it with the four images grabbed earlier.
Since the images come in random sizes, resize them to 100x100 while loading:

#Load test data

imgs = []
num = 4
for i in range(1, num + 1):
    img = load_image("/tmp/t%s.jpg" % (i))
    img = img.resize((100,100))
    img_arr = np.asarray(img)
    imgs.append(img_arr)
imgs = np.array(imgs)

#predict

print np.round(model.predict(imgs))

#output

[[ 1.  0.]
 [ 1.  0.]
 [ 0.  1.]
 [ 0.  1.]]

In the output the model says the first two images are Miku and the last two are not, which is in line with expectations.

All in all, tflearn is a pretty handy add-on for TensorFlow, and its future development should be worth watching.

Setting up a GPU compute environment on AWS to play with TensorFlow

After running a few deep learning examples lately, my little laptop started to feel underpowered.
Many deep learning frameworks now support GPU acceleration out of the box, but there isn't enough room at home and I didn't want another big desktop tower just to hold a graphics card.
There are external GPU enclosures for laptops these days, but I only run a few experiments for my own amusement,
so buying a graphics card right away felt premature. Then I noticed that AWS offers GPU compute instances, and since I had never used AWS before, it was a good excuse to try it.

This post records the whole process and the problems I ran into; I'm no AWS expert, I just requested one machine to play with.
The flow:

  • request an AWS GPU Spot Instance
  • log in and install the GPU driver, the machine learning environment, and TensorFlow
  • run a CNN test

Requesting an AWS GPU Spot Instance

First, sign up for an AWS account at
https://aws.amazon.com
The sign-up asks for credit card details, so have them ready; I'll skip the account creation itself
and jump straight to launching an instance.

Once the account is ready, go back to https://aws.amazon.com to reach the console and pick [Service] -> [EC2].

Then, in the instance management menu, choose [Spot Requests] -> [Request Spot Instances].

A quick aside on why Spot Requests instead of plain Instances.
First, price: an instance created through a Spot Request is not meant for long-term use, just a short-lived environment,
and for various reasons it ends up cheaper than launching an Instance directly.
Also, launching a GPU instance directly does not always succeed: the first time I tried,
I was told I had to request a limit increase for that instance type, which takes 24~48 hours of processing, and I only wanted to run a quick test.
For those reasons I went with a Spot Request.

Work through the request form step by step.
The first page is about the server requirements; the important items are:

  • Target capacity: how many instances to launch; I picked one for now
  • AMI: the OS type; I'm used to CentOS, so I chose Amazon Linux AMI
  • Instance type(s): the instance type, and the most important choice.
    Of the GPU instances AWS currently offers, the cheaper ones are g2.2xlarge and g2.8xlarge
    (ignore p2.xlarge and the rest for now); they come with an NVIDIA GPU and a CUDA-ready environment.
    Since this is only for testing, g2.2xlarge is enough.
    Reference: https://aws.amazon.com/cn/ec2/instance-types/

  • Network: make sure something is selected; the default is fine, and if there is none, create one with [Create new VPC]

  • Availability Zone: pick a zone; I went with ap-northeast-1c

Leave the rest at the defaults; when everything looks fine, click [Next] to go to the second page,
which covers networking and storage.

  • EBS volumes: the default is only 8 GB, so bump it up; I set 20 GB here because the GPU driver, the ML environment, and so on need space. The first time I didn't, and ran out of disk after installing two or three things
  • Key pair name: used to generate the ssh private key for logging in; if you don't have one, click [Create new key pair], download the key, and keep it safe
  • Security groups: use the default

When that's done, click [Review] to confirm, and then [Launch].

After a few minutes the g2.2xlarge shows up in the [Instances] list; click Connect to see the server's IP/hostname and how to reach it.
To log in, use any ssh client together with the ssh key downloaded at the [Key pair name] step when the instance was created.

One place I got stuck: at first I couldn't ssh in at all and got

ssh: connect to host xxx.xxx.xxx.xx port 22: Operation timed out

It turned out [Security groups] had to be configured first.
On the Instances page, scroll down and open the default Security group rule,
then check whether the [inbound] tab has an [SSH] entry; if not, click [Edit] and add it.



Rental cost

The environment is up, but it's worth watching how much it costs.
See: https://aws.amazon.com/cn/ec2/spot/pricing/
Ignoring traffic-based charges, AWS billing is per hour, and each "zone" is priced a bit differently.
Since I'm in Tokyo, I naturally picked the Tokyo zone,
where g2.2xlarge on Linux is listed at $0.3017 per hour.
I left it sitting there doing nothing for about two days and it cost around 10 USD.
Well... it was only a test rental, so never mind; apparently some zones are not billed the same way, so I'll try applying in another one next time.
To see how much you've spent so far, click your account name in the top-right corner and choose [My billing Dashboard].
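
As a rough sanity check of that figure (a back-of-the-envelope sketch of mine; spot prices fluctuate, so the actual charge usually comes out below the listed rate):

hourly_rate = 0.3017        # USD/hour, the listed g2.2xlarge Linux rate for the Tokyo zone
hours = 2 * 24              # roughly two days left running
print(hours * hourly_rate)  # ~14.5 USD at the listed rate; the actual spot charge was closer to 10 USD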

Logging in and installing the GPU driver, the machine learning environment, and TensorFlow

The environment is ready, so log in and try it. You need an ssh client; my Mac has one built in.
Next, put the downloaded ssh key in place:

local$ mv ${YOUR_SSH_KEY} ~/.ssh/myaws.pem
local$ chmod 600 ~/.ssh/myaws.pem

Set up the ssh config:

local$ vim ~/.ssh/config

Add the following:

Host aws
        HostName        ${YOUR_AWS_INSTANCE_IP_OR_HOST}
        User            ec2-user
        IdentityFile    ~/.ssh/myaws.pem

After that you can log straight in:

local$ ssh aws

Last login: Mon Jan  2 05:31:52 2017 from p2600006-ipngn22001marunouchi.tokyo.ocn.ne.jp

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
14 package(s) needed for security, out of 22 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-xxxxxxxx ~]$

After logging in, switch to root to work,
then confirm the GPU is there and check the OS version:

$ sudo su -   # switch to root

root$ lspci | grep -i nvidia  # check the GPU

00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

root$ uname -m && cat /etc/*release  # check the OS

Amazon Linux AMI release 2016.09

Once that checks out, start installing the GPU driver; two things are needed:

  • NVIDIA CUDA Toolkit
  • cuDNN library

The installation roughly follows: http://www.pyimagesearch.com/2016/07/04/how-to-install-cuda-toolkit-and-cudnn-for-deep-learning/

First, check which library version the tensorflow-gpu you are about to install expects;
my environment looks for the 8.0 driver.

Find the download path on NVIDIA's site:
https://developer.nvidia.com/cuda-toolkit

Installing the NVIDIA CUDA Toolkit

The CUDA Toolkit is the easier part; once you know the target version, you can download it directly:

root$ yum install -y gcc
root$ yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
root$ wget https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda_8.0.44_linux-run
root$ chmod 755 cuda_8.0.44_linux-run
root$ ./cuda_8.0.44_linux-run -extract=/root
root$ ./NVIDIA-Linux-x86_64-367.48.run -s
root$ ./cuda-linux64-rel-8.0.44-21122537.run -noprompt

Set the environment variables:

root$ vim ~/.bashrc
# add the following three lines

export CUDA_ROOT=/usr/local/cuda-8.0
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64
export PATH=$PATH:$CUDA_ROOT/bin

root$ source ~/.bashrc

Test the installation:

root$ nvidia-smi -q | head
==============NVSMI LOG==============

Timestamp                           : Mon Jan  2 09:39:51 2017
Driver Version                      : 367.48

Attached GPUs                       : 1
GPU 0000:00:03.0
    Product Name                    : GRID K520
    Product Brand                   : Grid

Installing the cuDNN library

Installing cuDNN is more of a hassle.
Again, start from NVIDIA's site:
https://developer.nvidia.com/cudnn

The cuDNN download page requires an account, so register and activate one, then download the library to your local machine.
After registering and logging in through the URL above you reach a download page;
pick the [cuDNN vx.x Library for Linux] that matches your CUDA Toolkit version.
In my case that was cuDNN v5.1 Library for Linux.

After downloading, upload it to the AWS environment:

local$ scp cudnn-8.0-linux-x64-v5.1.tgz  aws:/tmp/
local$ ssh aws
$ sudo su -
root$ cp /tmp/cudnn-8.0-linux-x64-v5.1.tgz .
root$ tar -zxvf cudnn-8.0-linux-x64-v5.1.tgz
root$ cp cuda/lib64/* /usr/local/cuda-8.0/lib64/
root$ cp cuda/include/* /usr/local/cuda-8.0/include/

With that, the driver setup is done.
Next, install the TensorFlow-related environment.

For the Python machine learning stack I simply installed the Anaconda Python 2.7 distribution.
Reference: https://www.continuum.io/downloads

root$ wget https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh
root$ sh Anaconda2-4.2.0-Linux-x86_64.sh
root$ source /root/.bashrc

With that installed, numpy, scipy, and the other common packages are ready to use.
Next, install tensorflow and tensorflow-gpu:

root$ pip install tensorflow
root$ pip install tensorflow-gpu

In my environment both came out as version 0.12.

Once they're installed, open Python and test:

import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally

sess=tf.Session()

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating Tensor

If all the CUDA libraries load successfully on import, everything is fine.
As for errors, I've run into these two:

  • libcudart.so not found

    import tensorflow as tf
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/root/anaconda2/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
    File "/root/anaconda2/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 60, in <module>
    raise ImportError(msg)
    ImportError: Traceback (most recent call last):
    File "/root/anaconda2/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
    File "/root/anaconda2/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
    File "/root/anaconda2/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
    ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory
    

    This means the CUDA Toolkit is not installed properly; go back to the CUDA Toolkit installation steps.

  • libcudnn.so not found

    ...
    I tensorflow/stream_executor/dso_loader.cc:119] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /usr/local/cuda-8.0/lib64
    I tensorflow/stream_executor/cuda/cuda_dnn.cc:3459] Unable to load cuDNN DSO
    

    This means the cuDNN library is not installed properly; go back to the cuDNN installation steps.

Running a CNN test

With everything installed, it's time to run a CNN deep learning test.
More advanced TensorFlow GPU usage is covered at:
https://www.tensorflow.org/how_tos/using_gpu/

The test code below is MNIST handwriting recognition,
a CNN trained for only 1000 steps.

cnn.py
from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# number 1 to 10 data

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre,1), tf.argmax(v_ys,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    result = sess.run(accuracy, feed_dict={xs: v_xs, ys: v_ys, keep_prob: 1})
    return result

def weight_variable(shape):
    inital = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(inital)

def bias_variable(shape):
    inital = tf.constant(0.1, shape=shape)
    return tf.Variable(inital)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')


# define placeholder for inputs to network

xs = tf.placeholder(tf.float32, [None, 784]) # 28x28

ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
## conv1 layer ##

W_conv1 = weight_variable([5, 5, 1, 32]) #patch 5x5, in channel size 1, out size 32

## pool1 layer ##

b_conv1 = bias_variable([32])
#Combine

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) #output size 28x28x32

h_pool1 = max_pool_2x2(h_conv1) #output size 14x14x32

## conv2 layer ##

W_conv2 = weight_variable([5, 5, 32, 64]) #patch 5x5, in channel size 32, out size 64

## pool2 layer ##

b_conv2 = bias_variable([64])
#Combine

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) #output size 14x14x64

h_pool2 = max_pool_2x2(h_conv2) #output size 7x7x64

## fc1 layer ##

W_fc1 = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64]) #[n_samples, 7,7,64]  => [n_samples, 7*7*64]

h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
## fc2 layer ##

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
# the error between prediction and real data

cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(1001):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob:0.5})
    if i % 50 == 0:
        print(compute_accuracy(
            mnist.test.images, mnist.test.labels))

Give it a run:

root$ time python cnn.py
...
real  0m43.571s
user  0m32.352s
sys 0m9.068s

Only 43 seconds, where the same program takes about 15 minutes on my little laptop; decent progress.
This was my first time with AWS and there is still a lot I don't understand; that's for another day.

2017/01/04

A performance comparison done by someone else: http://qiita.com/shouta-dev/items/5aa4a746b31b9be4838d