It is sometimes important to proceed even if some shards are not available. I suggest two changes in semantics:
1) If a request to a node fails in the threaded request loop in the client, the node will be marked as down and additional search requests will be created to repeat the request on other nodes that hold the same shards. If no other node has the shard, then the request will be marked as failed.
2) Once results from x% of the shards in the original request have been collected, a deadline will be set for t milliseconds in the future. If all results arrive before the deadline, then the search will proceed as normal. If the deadline arrives before all results have been collected, then if at least y% of the shards have results, the results will be returned as complete. If the deadline passes with fewer than y% of the shards having results, then an error will be raised. The values of x, y and t will be parameters of the search with reasonable defaults (such as 70%, 90%, 500ms).
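The retry behaviour in change 1 could be sketched roughly as follows. This is only an illustration, not code from Katta: the class `ShardFailover`, the method `replan`, and the shard-to-nodes map are hypothetical names, and the sketch assumes the client already knows which nodes host each shard.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Katta: re-plans the shards of one failed node request.
class ShardFailover {

    /** New per-node requests to issue, plus shards that have no live replica left. */
    static final class Plan {
        final Map<String, List<String>> retries = new LinkedHashMap<>();
        final List<String> failedShards = new ArrayList<>();
    }

    static Plan replan(String failedNode,
                       List<String> shardsOnFailedRequest,
                       Map<String, List<String>> shardToNodes,
                       Set<String> downNodes) {
        // Mark the node as down so it is not chosen again.
        downNodes.add(failedNode);

        Plan plan = new Plan();
        for (String shard : shardsOnFailedRequest) {
            String replacement = null;
            List<String> hosts = shardToNodes.getOrDefault(shard, Collections.<String>emptyList());
            // Pick the first node that hosts the shard and is not known to be down.
            for (String node : hosts) {
                if (!downNodes.contains(node)) {
                    replacement = node;
                    break;
                }
            }
            if (replacement == null) {
                // No other node has the shard: this part of the request fails.
                plan.failedShards.add(shard);
            } else {
                plan.retries.computeIfAbsent(replacement, n -> new ArrayList<>()).add(shard);
            }
        }
        return plan;
    }
}
```

The caller would submit one new search request per entry in `retries` and count each entry in `failedShards` as a shard without results.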
Note that it is important for x and y to be separate: x can then be set low enough that the deadline will always be triggered, while y stays high enough to guarantee reasonable results from all successful queries.
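The deadline rule in change 2, with x and y kept as separate thresholds, might look something like the sketch below. The class `PartialResultPolicy`, the `Decision` enum, and the `onProgress` method are hypothetical names chosen for illustration; the current time is passed in by the caller so that the logic stays independent of any particular threading model.

```java
// Hypothetical sketch of the x / y / t rule; not an existing Katta class.
class PartialResultPolicy {

    enum Decision { WAIT, RETURN_RESULTS, FAIL }

    private final int startDeadlinePercent; // x: share of shards that starts the deadline
    private final int minCompletePercent;   // y: share of shards needed once the deadline hits
    private final long deadlineMillis;      // t: length of the deadline
    private long deadlineAt = -1;           // -1 means the deadline has not been set yet

    PartialResultPolicy(int startDeadlinePercent, int minCompletePercent, long deadlineMillis) {
        this.startDeadlinePercent = startDeadlinePercent;
        this.minCompletePercent = minCompletePercent;
        this.deadlineMillis = deadlineMillis;
    }

    /** Call whenever a shard result arrives or a timer tick fires. */
    Decision onProgress(int shardsDone, int shardsTotal, long nowMillis) {
        if (shardsDone >= shardsTotal) {
            return Decision.RETURN_RESULTS; // everything arrived: proceed as normal
        }
        // Integer arithmetic avoids floating-point rounding at the thresholds.
        boolean reachedX = (long) shardsDone * 100 >= (long) startDeadlinePercent * shardsTotal;
        if (deadlineAt < 0 && reachedX) {
            deadlineAt = nowMillis + deadlineMillis;
        }
        if (deadlineAt < 0 || nowMillis < deadlineAt) {
            return Decision.WAIT; // deadline not set yet, or not yet reached
        }
        boolean reachedY = (long) shardsDone * 100 >= (long) minCompletePercent * shardsTotal;
        return reachedY ? Decision.RETURN_RESULTS : Decision.FAIL;
    }
}
```

With the example defaults (70, 90, 500) and ten shards, the seventh result starts a 500 ms deadline; when it expires, nine results are enough to return, while eight raise an error.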
Note also that the percentage of shards required to bring the cluster out of safe mode is an interesting factor that tells us something about what the defaults for x and y might be.
KATTA-54 has a patch related to this bug.