Key: KATTA-58
Type: Improvement
Status: Resolved
Resolution: Duplicate
Priority: Major
Assignee: Unassigned
Reporter: Ted Dunning
Votes: 0
Watchers: 1
Katta

Change structure of data in Zookeeper to make all node data ephemeral on node connection

Created: 10/May/09 02:03 AM   Updated: 07/Dec/09 07:39 PM
Component/s: None
Affects Version/s: 0.5.1
Fix Version/s: None


Description

The idea here is that all data in katta that depends on a search node staying up should be ephemeral and created by that node.

Currently, the structure is something like

node2shards/<node-name>/shard*

and

shard2node/<shard>/<node-name>

The files under shard2node disappear correctly when the node disappears, but the files under node2shards are created by the master and are not necessarily removed when the node goes down. The master could be extended to remove them, but if the master ever lost track, the data would be left inconsistent. This currently happens in EC2 if the ZK connection parameters are not set with long timeouts, and it causes the entire cluster to appear to be down. Just making the shard files ephemeral does not work because the node directory would still exist.

What I propose is that the data about what shards a node is serving be kept in an ordinary file that lists the shards rather than as a directory with a single entry per shard. This file can then be node-wise ephemeral and will vanish correctly if the node drops out. To make this work, it would be necessary to have a separate file that the master would write containing the shard assignments per node or, better, have the master just write to this file with an indication that the shards are assigned, but not yet served. As the shards are downloaded and started, the node can update the status file to remove the pending mark.
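As a rough illustration of the proposal above, the per-node status file could hold a small map from shard name to state. This is only a sketch: the JSON encoding, the state names "assigned" and "serving", and every function name here are assumptions made for the example, not anything that exists in Katta.

```python
import json

# Hypothetical state markers; the names are assumptions for this sketch.
PENDING = "assigned"   # written by the master: assigned but not yet served
SERVING = "serving"    # written by the node: downloaded and started


def encode(status):
    """Serialize a {shard: state} mapping to bytes for storage in one file."""
    return json.dumps(status, sort_keys=True).encode("utf-8")


def decode(data):
    """Parse the file contents; an empty file means no shards."""
    return json.loads(data.decode("utf-8")) if data else {}


def assign(data, shards):
    """Master side: list new shards as pending, leaving existing entries alone."""
    status = decode(data)
    for shard in shards:
        status.setdefault(shard, PENDING)
    return encode(status)


def mark_serving(data, shard):
    """Node side: remove the pending mark once the shard is live."""
    status = decode(data)
    if shard not in status:
        raise KeyError(shard)
    status[shard] = SERVING
    return encode(status)


def served(data):
    """Return the shards this node is actually serving."""
    return sorted(s for s, state in decode(data).items() if state == SERVING)
```

Because the whole map lives in one ephemeral file owned by the node, there is no leftover per-node directory to clean up when the node's session ends.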

Obviously, if files are updated by more than one process some care has to be taken in the update. ZK makes this pretty easy, however.
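The care needed is essentially a versioned read-modify-write. ZooKeeper's set operation accepts an expected version and rejects the write if the file has changed since it was read. The sketch below imitates that behaviour with an in-memory stand-in so it can run without a ZK server; VersionedNode, BadVersion and update are names invented for this example.

```python
import threading


class BadVersion(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedNode:
    """In-memory stand-in for one ZK file: data plus a version counter."""

    def __init__(self, data=b""):
        self._lock = threading.Lock()
        self._data = data
        self._version = 0

    def get(self):
        with self._lock:
            return self._data, self._version

    def set(self, data, expected_version):
        # The write only succeeds if nobody else has written since our read.
        with self._lock:
            if expected_version != self._version:
                raise BadVersion(expected_version)
            self._data = data
            self._version += 1
            return self._version


def update(node, transform, max_attempts=10):
    """Read, transform, write; re-read and retry if another writer got in first."""
    for _ in range(max_attempts):
        data, version = node.get()
        try:
            node.set(transform(data), version)
            return
        except BadVersion:
            continue
    raise RuntimeError("gave up after repeated version conflicts")
```

Both the master and the node would go through the same update loop, so neither can silently overwrite the other's change.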



Comments
Ted Dunning added a comment - 04/Oct/09 06:27 PM - edited
KATTA-43 is closely related to this.

So is KATTA-69.

And so is KATTA-68.


Stefan Groschupf added a comment - 14/Oct/09 09:45 PM
Also merging this into KATTA-43.

Johannes Zillmann added a comment - 07/Dec/09 01:17 PM
The idea behind the non-ephemeral node2shards folder is that a node, once it starts, can read its old shard assignments and serve these 'old' shards without downloading them from the origin again.

Ted Dunning added a comment - 07/Dec/09 07:39 PM

That may be the intent, but in practice, this leads to real problems because many parts of the software assume that the presence of that directory means that the node still exists.

I don't believe that the fast-start capability works in any case.

Whether my suggestion is taken to heart or not, the way that status and assignments are handled has gotten confused over time.

I think a preferable approach would be for a node that is coming back up to simply advertise whatever shards it finds in a complete state. If a node goes down and comes back quickly, then the master should be able to use the advertised shards to guide the assignments it makes. If a node goes down and comes back much later, then the master can still use the advertised shards to guide rebalancing of the cluster.
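To make the idea concrete, here is a minimal sketch of how a master might use advertised shards when assigning. The greedy policy (prefer a node that already holds the shard, then the least-loaded node) and the function name plan are assumptions for illustration only; this is not Katta's actual distribution policy.

```python
def plan(shards, advertised, replicas=1):
    """Assign each shard to `replicas` nodes, preferring nodes that already hold it.

    `advertised` maps each live node name to the set of complete shards
    it reported on startup (an empty set for a fresh node).
    """
    load = {node: 0 for node in advertised}
    assignment = {}
    for shard in sorted(shards):
        # Nodes advertising the shard sort first; ties go to the least-loaded
        # node, then to the node name so the result is deterministic.
        ranked = sorted(
            load, key=lambda n: (shard not in advertised[n], load[n], n)
        )
        chosen = ranked[:replicas]
        for node in chosen:
            load[node] += 1
        assignment[shard] = chosen
    return assignment


if __name__ == "__main__":
    advertised = {"node1": {"s1", "s2"}, "node2": set()}
    print(plan(["s1", "s2", "s3"], advertised))
```

In this example node1 keeps s1 and s2 because it already has them on disk, and only s3 has to be downloaded, by node2.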

In either case, making the assignment non-ephemeral serves little purpose, and the contents of a missing node are of little interest as well.

In fact, you might consider putting all state (both from the master and from the node) into a single document. The document could be created on node start with current shard state (if any) and the master could update the document with assignments. As shards go live, the node would update the document repeatedly and the master could decide whether to change the assignment based on what other nodes are doing. This is actually simpler than the current system, no harder to implement and much easier to understand. The dual source of updates isn't a problem either (ZK was made for that and the update rate is low).