
Key: KATTA-192
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Johannes Zillmann
Reporter: Johannes Zillmann
Votes: 0
Watchers: 0

Katta

unstable master failover on session reconnect

Created: 14/May/11 03:31 PM   Updated: 14/May/11 03:46 PM
Component/s: cluster
Affects Version/s: 0.6.2
Fix Version/s: 0.6.4


Description
Reported by Murali Krishna:
Hi,

I usually see the operator thread getting stopped and restarted immediately. Is that expected? In this particular instance, on one of the nodes it stopped for almost 2 hours, and no index deployment happened on this node during that time. 'listNodes' still showed the node as connected, though. I am using Katta 0.6.2.
    
2011-05-05 00:02:15,793 INFO net.sf.katta.master.OperatorThread:100 - operator thread stopped
2011-05-05 00:02:17,276 WARN org.I0Itec.zkclient.ZkEventThread:78 - Error handling event ZkEvent[State changed to SyncConnected sent to net.sf.katta.protocol.InteractionProtocol$1@64cbdef5]
org.I0Itec.zkclient.exception.ZkNodeExistsException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /katta/master
        at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:55)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
        at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304)
        at org.I0Itec.zkclient.ZkClient.createEphemeral(ZkClient.java:328)
        at net.sf.katta.protocol.InteractionProtocol.createEphemeral(InteractionProtocol.java:478)
        at net.sf.katta.protocol.InteractionProtocol.publishMaster(InteractionProtocol.java:351)
        at net.sf.katta.master.Master.becomePrimaryOrSecondaryMaster(Master.java:104)
        at net.sf.katta.master.Master.reconnect(Master.java:86)
        at net.sf.katta.protocol.InteractionProtocol$1.handleStateChanged(InteractionProtocol.java:95)
        at org.I0Itec.zkclient.ZkClient$5.run(ZkClient.java:484)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /katta/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:608)
        at org.I0Itec.zkclient.ZkConnection.create(ZkConnection.java:87)
        at org.I0Itec.zkclient.ZkClient$1.call(ZkClient.java:308)
        at org.I0Itec.zkclient.ZkClient$1.call(ZkClient.java:304)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
        ... 9 more
2011-05-05 02:05:50,000 INFO net.sf.katta.protocol.upgrade.UpgradeRegistry:52 - version of distribution 0.6.2
2011-05-05 02:05:50,000 INFO net.sf.katta.protocol.upgrade.UpgradeRegistry:53 - version of cluster 0.6.2
2011-05-05 02:05:50,016 INFO net.sf.katta.master.Master:149 - start managing nodes...

Thanks,
Murali Krishna
Yes, this is from the master log. I have 2 hosts, each running a master and a slave node process. The other node's master log was fine, and shards continued to deploy there.
Is it possible that it did not receive the ZooKeeper reconnect at all? This is happening multiple times a day and causing stability issues. Any pointers to fix this would be really helpful. I run hadoop+hbase+zookeeper+Katta on 3 hosts.


Johannes Zillmann added a comment - 14/May/11 03:46 PM
Did 2 things:
  1. made the master declaration truly synchronized (replaced the if-exists check with try-catch)
  2. ensured that the old operatorThread is stopped on reconnect
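
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the actual Katta patch: InMemoryStore and NodeExistsException are hypothetical stand-ins for ZooKeeper and zkclient's ZkNodeExistsException, and the Master class here only borrows the method names seen in the stack trace. The ownership check in the catch block is an additional assumption about how a master might recognize its own still-existing node after a reconnect.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch; none of these types are Katta's or zkclient's real classes. */
class MasterElectionSketch {

    static final String MASTER_PATH = "/katta/master";

    /** Stand-in for zkclient's ZkNodeExistsException. */
    static class NodeExistsException extends RuntimeException {
        NodeExistsException(String path) { super("NodeExists for " + path); }
    }

    /** Stand-in for ZooKeeper: creation is atomic, like creating an ephemeral znode. */
    static class InMemoryStore {
        private final ConcurrentMap<String, String> nodes = new ConcurrentHashMap<>();

        void createEphemeral(String path, String owner) {
            if (nodes.putIfAbsent(path, owner) != null) {
                throw new NodeExistsException(path);
            }
        }

        String read(String path) { return nodes.get(path); }
    }

    static class Master {
        private final InMemoryStore store;
        private final String name;
        private Thread operatorThread;

        Master(InMemoryStore store, String name) {
            this.store = store;
            this.name = name;
        }

        /** Returns true if this master is primary afterwards, false if secondary. */
        synchronized boolean becomePrimaryOrSecondaryMaster() {
            // Change 2: never leave an old operator thread running.
            stopOperatorThread();
            boolean primary;
            try {
                // Change 1: attempt the create and react to failure, instead of
                // checking exists() first and racing with other writers.
                store.createEphemeral(MASTER_PATH, name);
                primary = true;
            } catch (NodeExistsException e) {
                // Assumption: if the existing node is our own, we are still primary.
                primary = name.equals(store.read(MASTER_PATH));
            }
            if (primary) {
                startOperatorThread();
            }
            return primary;
        }

        /** Called when the session reports a reconnect. */
        synchronized boolean reconnect() {
            return becomePrimaryOrSecondaryMaster();
        }

        synchronized Thread operatorThread() { return operatorThread; }

        private void startOperatorThread() {
            operatorThread = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(50); // placeholder for processing operations
                    }
                } catch (InterruptedException e) {
                    // interrupted by stopOperatorThread(): exit the loop
                }
            }, "operator-" + name);
            operatorThread.setDaemon(true);
            operatorThread.start();
        }

        private void stopOperatorThread() {
            if (operatorThread == null) {
                return;
            }
            operatorThread.interrupt();
            try {
                operatorThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            operatorThread = null;
        }
    }
}
```

In this sketch a reconnect on the primary stops the previous operator thread before a new one is started, and a NodeExists failure no longer escapes the event handler as it does in the log above.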