jubatus実験中の出来事

jubatusを使ってとある物を作りたく目下試行錯誤中なのですが、その過程で起きているjubatus周りの問題(環境依存?)を記録しておきます。

client*2 -> jubakeeper*1 -> jubaclassifier*6 (3*2)
という、インスタンス2つ(testとuidというname)3個構成で一台のCentOS仮想マシン環境上に全て起動。

clientはphpで、mysqlのデータをばっこんばっこんtrainしていたのだけれど、どうも、ほっとくとjubakeeperが落ちた後、jubaclassifierも2つ程落ちた。

ずっと気にしないようにしていたが、classifierを起動し、zookeeperが認識すると、定期的かつ頻繁に、

E1109 10:38:18.679916  2682 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9180
E1109 10:38:18.680157  2682 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1109 10:38:18.680387  2682 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9182
E1109 10:38:18.680423  2682 classifier_serv.cpp:328] mix: failed to get diff from 3 servers.

こういうメッセージが出力される。なお、IPアドレスは念の為伏せさせて頂きます。(***.***.***.***)
で、落ちた時のログは下記。

2011-11-05 02:14:07,406:21942(0x40d87940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 11ms
E1105 02:14:07.962105 21922 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1105 02:14:08.792454 21922 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9182
E1105 02:14:08.792552 21922 classifier_serv.cpp:328] mix: failed to get diff from 3 servers.
E1105 02:14:11.806685 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9180
E1105 02:14:12.669272 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
2011-11-05 02:14:12,775:21942(0x40d87940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 17ms
E1105 02:14:13.534540 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9182
E1105 02:14:13.534633 21928 classifier_serv.cpp:328] mix: failed to get diff from 3 servers.
2011-11-05 02:14:15,454:21942(0x40d87940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 11ms
E1105 02:14:20.201802 21934 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9180
E1105 02:14:21.054254 21934 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1105 02:14:21.955986 21934 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9182
E1105 02:14:21.956148 21934 classifier_serv.cpp:328] mix: failed to get diff from 3 servers.
E1105 02:14:23.354889 21934 zk.cpp:101] /jubatus/actors/test/master_lock/lock_0000019376: removal failed - operation timeout
2011-11-05 02:14:23,481:21942(0x40d87940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [127.0.0.1:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 7ms)
I1105 02:14:23.482103 21944 zk.cpp:211] zk connection expiration : type(-1) state(1)
2011-11-05 02:14:24,822:21942(0x40d87940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 1347ms
2011-11-05 02:14:24,822:21942(0x40d87940):ZOO_INFO@check_events@1585: initiated connection to server [127.0.0.1:2181]
E1105 02:14:26.592995 21922 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
2011-11-05 02:14:26,811:21942(0x40d87940):ZOO_INFO@check_events@1632: session establishment complete on server [127.0.0.1:2181], sessionId=0x1336c4924310023, negotiated timeout=4000
E1105 02:14:28.499533 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1105 02:14:28.513236 21928 map_fold_rpc.hpp:73] cannot connect: Connection refused***.***.***.***:9182
E1105 02:14:28.513319 21928 classifier_serv.cpp:328] mix: failed to get diff from 2 servers.
E1105 02:14:36.474046 21934 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - invalid zhandle state
E1105 02:14:36.474241 21934 zk.cpp:131] invalid zhandle state (/jubatus/actors/test/master_lock)
E1105 02:14:45.517791 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1105 02:14:45.517930 21928 classifier_serv.cpp:328] mix: failed to get diff from 1 servers.
E1105 02:15:02.555667 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1105 02:15:02.555769 21928 classifier_serv.cpp:328] mix: failed to get diff from 1 servers.
E1105 02:15:19.535115 21928 map_fold_rpc.hpp:73] cannot recv rpc result: type error***.***.***.***:9181
E1105 02:15:19.535222 21928 classifier_serv.cpp:328] mix: failed to get diff from 1 servers.

以降、延々とラスト二行が繰り返し出力されていた。

ちなみにこの後、jubakeeperを起動させて、jubaclassifierも落ちた2個を再起動させたら、しばらくロックしてるよーってメッセージが出た後に、jubaclassifier全部お亡くなりになった…。

2011-11-09 10:18:18,569:2568(0x41f83940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 14ms
2011-11-09 10:18:27,983:2568(0x41f83940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [127.0.0.1:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms)
E1109 10:18:28.247558  2563 zk.cpp:91] /jubatus/actors/uid/master_lock/lock_ failed in creation - operation timeout
E1109 10:18:28.312479 32062 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
E1109 10:18:28.313173 32187 zk.cpp:91] /jubatus/actors/uid/master_lock/lock_ failed in creation - operation timeout
E1109 10:18:28.307246 32069 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
E1109 10:18:28.313670 32055 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
I1109 10:18:28.410428  2570 zk.cpp:211] zk connection expiration : type(-1) state(1)
E1109 10:18:28.413918  2563 zk.cpp:101] : removal failed - bad arguments
E1109 10:18:28.414378 32062 zk.cpp:101] : removal failed - bad arguments
E1109 10:18:28.414834 32187 zk.cpp:101] : removal failed - bad arguments
E1109 10:18:28.415371 32069 zk.cpp:101] : removal failed - bad arguments
E1109 10:18:28.415837 32055 zk.cpp:101] : removal failed - bad arguments
E1109 10:18:28.401837  2634 zk.cpp:101] /jubatus/actors/uid/master_lock/lock_0000440898: removal failed - operation timeout
2011-11-09 10:18:29,311:2568(0x41f83940):ZOO_WARN@zookeeper_interest@1461: Exceeded deadline by 1329ms
2011-11-09 10:18:29,311:2568(0x41f83940):ZOO_INFO@check_events@1585: initiated connection to server [127.0.0.1:2181]
2011-11-09 10:18:29,312:2568(0x41f83940):ZOO_INFO@check_events@1632: session establishment complete on server [127.0.0.1:2181], sessionId=0x13382aa7fc4002e, negotiated timeout=4000
E1109 10:19:37.884022 32069 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
E1109 10:19:37.884137  2563 zk.cpp:91] /jubatus/actors/uid/master_lock/lock_ failed in creation - operation timeout
E1109 10:19:37.884224 32062 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
E1109 10:19:37.884371 32187 zk.cpp:91] /jubatus/actors/uid/master_lock/lock_ failed in creation - operation timeout
E1109 10:19:37.884474 32055 zk.cpp:91] /jubatus/actors/test/master_lock/lock_ failed in creation - operation timeout
E1109 10:19:38.795603  2634 zk.cpp:91] /jubatus/actors/uid/master_lock/lock_ failed in creation - operation timeout
E1109 10:19:40.592905 32062 zk.cpp:131] operation timeout (/jubatus/actors/test/master_lock)
E1109 10:19:40.593822 32069 zk.cpp:131] operation timeout (/jubatus/actors/test/master_lock)
E1109 10:19:40.594594  2563 zk.cpp:131] operation timeout (/jubatus/actors/uid/master_lock)
E1109 10:19:40.595274 32055 zk.cpp:131] operation timeout (/jubatus/actors/test/master_lock)
E1109 10:19:40.595989 32187 zk.cpp:131] operation timeout (/jubatus/actors/uid/master_lock)
E1109 10:19:41.491140  2634 zk.cpp:131] operation timeout (/jubatus/actors/uid/master_lock)

messageにもログが出ていた。

Nov  9 10:19:40 hiro kernel: jubaclassifier[32062]: segfault at 0000000000000000 rip 0000003e7fa9cf7b rsp 00000000428b8ea0 error 4
Nov  9 10:19:40 hiro kernel: jubaclassifier[32069]: segfault at 0000000000000000 rip 0000003e7fa9cf7b rsp 000000004266cea0 error 4
Nov  9 10:19:40 hiro kernel: jubaclassifier[2563]: segfault at 0000000000000000 rip 0000003e7fa9cf7b rsp 000000004359aea0 error 4
Nov  9 10:19:40 hiro kernel: jubaclassifier[32055]: segfault at 0000000000000000 rip 0000003e7fa9cf7b rsp 000000004340bea0 error 4
Nov  9 10:19:40 hiro kernel: jubaclassifier[32187]: segfault at 0000000000000000 rip 0000003e7fa9cf7b rsp 0000000042464ea0 error 4
Nov  9 10:19:41 hiro kernel: jubaclassifier[2634]: segfault at 0000000000000000 rip 0000003e7fa9cf7b rsp 00007fff185b4490 error 4

とりあえず、今日はこの原因を調査しながら全文検索環境を構築しよう…。

あ、あと、これは別問題かもしれないけれど、ZooKeeperを起動してから、jubakeeperを起動する流れで、jubakeeperが起動したりしなかったりする。
感覚的には、ZooKeeperとjubakeeperのコネクション確立にコケてjubakeeperが落ちてるような振る舞いを見せます。

zookeeper起動コマンドの後、sleep 1程度だと

./jubastart.sh: line 12:  2810 アボートしましたnohup jubakeeper --zookeeper=${ZOOKEEPER_HOST} --rpc-port=9198

となり、うまく起動してくれないが、sleepを15くらいにすると起動してくれる。

追記
hadoop-zookeeper.noarch 3.3.3+12.16-1 installed
zookeeper-c-client.x86_64 3.3.3-1 installed