
Key: KATTA-69
Type: Bug
Status: Resolved
Resolution: Duplicate
Priority: Major
Assignee: Peter Voss
Reporter: Ted Dunning
Votes: 0
Watchers: 0
Katta

Katta is much too sensitive to recoverable KeeperExceptions

Created: 04/Jun/09 10:23 PM   Updated: 14/Oct/09 09:44 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.6

File Attachments: 1. retry_for_exists.patch (5 kB)



 Description
If you get a ConnectionLossException when trying to deploy an index, the entire index is marked as ERROR and can never recover.

At the least, Katta should handle these situations more gracefully.

For instance, ZkClient.exists just blows out, transforming the recoverable exception into a non-recoverable KattaException.

It is dangerous to change too many of these, but some can probably be fixed.



 Comments
Ted Dunning added a comment - 04/Jun/09 10:30 PM - edited
But watch out for this. http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

I can see how to change "exists", but many others are more delicate because I can't tell if they are idempotent or not.
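
To illustrate the distinction, here is a minimal sketch of a bounded retry that is only safe for idempotent calls such as an existence check. It is not the attached patch and not Katta's ZkClient code. `RecoverableException` is a hypothetical stand-in for a recoverable error such as ConnectionLossException, and `IdempotentRetry`, `maxAttempts`, and `pauseMillis` are names invented for this example.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a recoverable error such as ConnectionLossException. */
class RecoverableException extends Exception {
    RecoverableException(String message) {
        super(message);
    }
}

/** Hypothetical helper: retries an operation that is known to be idempotent. */
final class IdempotentRetry {
    private IdempotentRetry() {
    }

    static <T> T call(Callable<T> operation, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RecoverableException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (RecoverableException e) {
                // Only the recoverable error is retried; anything else propagates at once.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        // Every attempt failed with a recoverable error: surface the last one.
        throw last;
    }
}
```

A non-idempotent call (for example, creating a sequential node) should not go through a helper like this, because a retry after a lost connection could apply the operation twice.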


Ted Dunning added a comment - 04/Jun/09 10:38 PM
Here is a patch for one problem at least.

Jason Venner added a comment - 30/Jun/09 07:00 PM
I have applied this patch to trunk 473, and the unit tests hang in the MasterTest.
The Master.shutdown is hanging in the join with the DistributeShardsThread, which is in _updateLock.getUpdatedCondition().await();
I am guessing that await is not honoring the interrupt.

jstack output follows:

Attaching to process ID 82990, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 11.3-b02-83
Deadlock Detection:

No deadlocks found.

Thread t@34051: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Interpreted frame)
  • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1925 (Interpreted frame)
  • net.sf.katta.master.DistributeShardsThread.run() @bci=190, line=141 (Interpreted frame)

Thread t@34307: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Interpreted frame)
  • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1925 (Interpreted frame)
  • java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=358 (Interpreted frame)
  • org.apache.zookeeper.ClientCnxn$EventThread.run() @bci=4, line=355 (Interpreted frame)

Thread t@34563: (state = IN_NATIVE)

  • sun.nio.ch.KQueueArrayWrapper.kevent0(int, long, int, long) @bci=0 (Compiled frame; information may be imprecise)
  • sun.nio.ch.KQueueArrayWrapper.poll(long) @bci=12, line=136 (Compiled frame)
  • sun.nio.ch.KQueueSelectorImpl.doSelect(long) @bci=46, line=69 (Compiled frame)
  • sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
  • sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
  • org.apache.zookeeper.ClientCnxn$SendThread.run() @bci=192, line=852 (Compiled frame)

Thread t@34819: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Interpreted frame)
  • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1925 (Interpreted frame)
  • java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=358 (Interpreted frame)
  • org.apache.zookeeper.ClientCnxn$EventThread.run() @bci=4, line=355 (Interpreted frame)

Thread t@35075: (state = IN_NATIVE)

  • sun.nio.ch.KQueueArrayWrapper.kevent0(int, long, int, long) @bci=0 (Compiled frame; information may be imprecise)
  • sun.nio.ch.KQueueArrayWrapper.poll(long) @bci=12, line=136 (Compiled frame)
  • sun.nio.ch.KQueueSelectorImpl.doSelect(long) @bci=46, line=69 (Compiled frame)
  • sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
  • sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
  • org.apache.zookeeper.ClientCnxn$SendThread.run() @bci=192, line=852 (Interpreted frame)

Thread t@35331: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Compiled frame)
  • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1925 (Compiled frame)
  • java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=358 (Compiled frame)
  • org.apache.zookeeper.server.PrepRequestProcessor.run() @bci=4, line=97 (Interpreted frame)

Thread t@35587: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Compiled frame)
  • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1925 (Compiled frame)
  • java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=358 (Compiled frame)
  • org.apache.zookeeper.server.SyncRequestProcessor.run() @bci=16, line=71 (Interpreted frame)

Thread t@35843: (state = BLOCKED)

  • java.lang.Object.wait(long) @bci=0 (Interpreted frame)
  • org.apache.zookeeper.server.SessionTrackerImpl.run() @bci=36, line=124 (Interpreted frame)

Thread t@36099: (state = IN_NATIVE)

  • sun.nio.ch.KQueueArrayWrapper.kevent0(int, long, int, long) @bci=0 (Compiled frame; information may be imprecise)
  • sun.nio.ch.KQueueArrayWrapper.poll(long) @bci=12, line=136 (Compiled frame)
  • sun.nio.ch.KQueueSelectorImpl.doSelect(long) @bci=46, line=69 (Compiled frame)
  • sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
  • sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
  • org.apache.zookeeper.server.NIOServerCnxn$Factory.run() @bci=20, line=142 (Interpreted frame)

Thread t@36355: (state = BLOCKED)

Thread t@36611: (state = BLOCKED)

Thread t@36867: (state = BLOCKED)

  • java.lang.Object.wait(long) @bci=0 (Interpreted frame)
  • java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Interpreted frame)
  • java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Interpreted frame)
  • java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame)

Thread t@37123: (state = BLOCKED)

  • java.lang.Object.wait(long) @bci=0 (Interpreted frame)
  • java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
  • java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Interpreted frame)

Thread t@37379: (state = BLOCKED)

  • java.lang.Object.wait(long) @bci=0 (Interpreted frame)
  • java.lang.Thread.join(long) @bci=38, line=1167 (Interpreted frame)
  • java.lang.Thread.join() @bci=2, line=1220 (Interpreted frame)
  • net.sf.katta.master.Master.shutdown() @bci=11, line=118 (Interpreted frame)
  • net.sf.katta.AbstractKattaTest$MasterStartThread.shutdown() @bci=4, line=277 (Interpreted frame)
  • net.sf.katta.master.MasterTest.testRebalanceIndexAfterNodeCrash() @bci=371, line=211 (Interpreted frame)
  • sun.reflect.NativeMethodAccessorImpl.invoke0(java.lang.reflect.Method, java.lang.Object, java.lang.Object[]) @bci=0 (Interpreted frame)
  • sun.reflect.NativeMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) @bci=87, line=39 (Interpreted frame)
  • sun.reflect.DelegatingMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) @bci=6, line=25 (Interpreted frame)
  • java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object[]) @bci=161, line=597 (Interpreted frame)
  • junit.framework.TestCase.runTest() @bci=96, line=154 (Interpreted frame)
  • junit.framework.TestCase.runBare() @bci=5, line=127 (Interpreted frame)
  • junit.framework.TestResult$1.protect() @bci=4, line=106 (Interpreted frame)
  • junit.framework.TestResult.runProtected(junit.framework.Test, junit.framework.Protectable) @bci=1, line=124 (Interpreted frame)
  • junit.framework.TestResult.run(junit.framework.TestCase) @bci=17, line=109 (Interpreted frame)
  • junit.framework.TestCase.run(junit.framework.TestResult) @bci=2, line=118 (Interpreted frame)
  • junit.framework.TestSuite.runTest(junit.framework.Test, junit.framework.TestResult) @bci=2, line=208 (Interpreted frame)
  • junit.framework.TestSuite.run(junit.framework.TestResult) @bci=31, line=203 (Interpreted frame)
  • org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run() @bci=431, line=420 (Interpreted frame)
  • org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(org.apache.tools.ant.taskdefs.optional.junit.JUnitTest, boolean, boolean, boolean, boolean, boolean, boolean) @bci=39, line=911 (Interpreted frame)
  • org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(java.lang.String[]) @bci=741, line=768 (Interpreted frame)
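
For reference, `Condition.await()` is specified to throw `InterruptedException` when the waiting thread is interrupted, so a worker that lets that exception end its loop can be joined from a shutdown method. The following is a minimal sketch of that pattern, not Katta's DistributeShardsThread or Master code. `InterruptibleWorker`, `signalUpdate`, and the timed `shutdown` are hypothetical names, and the sketch makes no claim about what actually caused the hang above.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical worker that waits on a Condition and stops when interrupted. */
class InterruptibleWorker extends Thread {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition updated = lock.newCondition();
    private boolean hasUpdate = false; // guarded by lock

    @Override
    public void run() {
        lock.lock();
        try {
            while (true) {
                while (!hasUpdate) {
                    // Throws InterruptedException if this thread is interrupted
                    // before or during the wait.
                    updated.await();
                }
                hasUpdate = false;
                // ... process the update here ...
            }
        } catch (InterruptedException e) {
            // Restore the flag and leave the loop instead of waiting again.
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
    }

    /** Wakes the worker so it processes one update. */
    void signalUpdate() {
        lock.lock();
        try {
            hasUpdate = true;
            updated.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Interrupts the worker and waits up to timeoutMillis; returns true if it stopped. */
    boolean shutdown(long timeoutMillis) throws InterruptedException {
        interrupt();
        join(timeoutMillis);
        return !isAlive();
    }
}
```

Joining with a timeout, as in this sketch, also keeps a test from hanging forever if the worker fails to stop.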

Jason Venner added a comment - 30/Jun/09 07:21 PM
This failure may be intermittent. The next time I ran the test set, the test completed normally.

java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03-211)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02-83, mixed mode)

Host environment: Mac OS X Leopard


Stefan Groschupf added a comment - 04/Oct/09 03:59 AM
Hi Peter, I think that with the zkclient refactoring this kind of issue should be solved.
Do you think we can close this issue?

Ted Dunning added a comment - 04/Oct/09 06:29 PM

There is a related issue in that if a client can't broadcast to all nodes because the search configuration has not yet updated, then the entire search is lost rather than just recording the exception.
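
As a rough sketch of "just recording the exception", a broadcast could collect per-node failures next to the successful results instead of aborting on the first error. This is not Katta's client code. `BroadcastResult`, `TolerantBroadcast`, and the use of plain node-name strings are assumptions made for this example, and the calls run sequentially for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical holder for partial results plus the failures seen per node. */
final class BroadcastResult<R> {
    final Map<String, R> results = new LinkedHashMap<>();
    final Map<String, Exception> errors = new LinkedHashMap<>();

    /** True only if every node answered without an error. */
    boolean isComplete() {
        return errors.isEmpty();
    }
}

/** Hypothetical broadcast that records failures instead of losing the whole search. */
final class TolerantBroadcast {
    private TolerantBroadcast() {
    }

    static <R> BroadcastResult<R> broadcast(List<String> nodes, Function<String, R> call) {
        BroadcastResult<R> out = new BroadcastResult<>();
        for (String node : nodes) {
            try {
                out.results.put(node, call.apply(node));
            } catch (RuntimeException e) {
                // Record the failure for this node and keep going with the others.
                out.errors.put(node, e);
            }
        }
        return out;
    }
}
```

The caller can then decide whether a partial answer is acceptable by checking `isComplete()` and inspecting `errors`.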

This is also related to the bug for restructuring the ZK data so that ZK can delete all state when a node is lost rather than depending on the master to do that (KATTA-43 and KATTA-58).


Stefan Groschupf added a comment - 14/Oct/09 09:44 PM
I am merging this issue into KATTA-43 as well, and we will solve the problem generally there.