External tasks provide a Fetch & Lock method that locks a task for a given amount of time. However, this time limit is fixed once the lock is taken and cannot be changed afterwards.
This creates a problem when working with immutable servers and automatic server recovery. For example, say we have two servers up, and we Fetch & Lock a task and start a long-running process on one of those servers. This process could take 12 hours to complete, so we set the timeout accordingly. Two hours into the process, that server has a problem and is automatically torn down, and a new server is brought up in its place. Camunda will continue to wait for another 10 hours before retrying, because the lock has not yet timed out.
The preferred way to handle this would be to set a short timeout, for example 1 minute. The 12-hour process we are calling could then call back into a new Camunda endpoint every 30 seconds to say, "I'm still processing." Camunda would then reset the timeout to another minute. This would happen over and over until the process eventually finishes. Now, if the server fails and stops responding, Camunda notices the issue within a minute and starts a retry on another server in the cluster.
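To make the proposal concrete, here is a minimal in-memory sketch of the heartbeat idea. It is only a model and not Camunda code: `TaskLock`, `fetch_and_lock`, `extend_lock`, and `simulate` are hypothetical names invented for this illustration, and time is passed in as plain seconds so the behaviour is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskLock:
    """In-memory stand-in for an external task lock (hypothetical, not Camunda's API)."""
    worker_id: Optional[str] = None
    expires_at: float = 0.0

    def is_locked(self, now: float) -> bool:
        return self.worker_id is not None and now < self.expires_at

    def fetch_and_lock(self, worker_id: str, now: float, duration: float) -> bool:
        """Acquire the lock if it is free or has already expired."""
        if self.is_locked(now):
            return False
        self.worker_id = worker_id
        self.expires_at = now + duration
        return True

    def extend_lock(self, worker_id: str, now: float, duration: float) -> bool:
        """Heartbeat: reset the expiry to now + duration for the current holder only."""
        if not self.is_locked(now) or self.worker_id != worker_id:
            return False
        self.expires_at = now + duration
        return True


def simulate(crash_at: float, heartbeat: float = 30.0, lock_duration: float = 60.0) -> float:
    """Return the time (in seconds) at which the lock expires after the worker crashes."""
    lock = TaskLock()
    lock.fetch_and_lock("worker-a", now=0.0, duration=lock_duration)
    now = heartbeat
    while now < crash_at:
        # The worker is still alive, so it reports "I'm still processing."
        lock.extend_lock("worker-a", now=now, duration=lock_duration)
        now += heartbeat
    # After the crash no more heartbeats arrive; the lock simply runs out.
    return lock.expires_at


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    print("lock free again at second", simulate(crash_at=two_hours))
```

In this model, a worker that dies two hours in releases the task within one lock duration of its last heartbeat, instead of holding it until the original 12-hour lock runs out.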
Services like Amazon's SQS have a method that allows a time lock to be extended in this manner. This Amazon article describes how they implement this functionality in SQS and could be a good starting point: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ChangeMessageVisibility.html
I would like to request that more functionality be considered for external task timeouts to make them more responsive and resilient.