Question

Consider a custom network protocol. It could be used to control robotic peripherals over a LAN from a central .NET-based workstation. (If it matters, the robot is busy moving fabs in a chip production environment.)

There are only 2 parties in the conversation: the .NET station and the robotic peripheral board.
The robotic side can only receive requests and send responses.
The .NET side can only initiate requests and receive responses.
There should always be exactly one response per request.
Consecutive requests can follow one another immediately without waiting for a response, but the number of requests being served simultaneously must never exceed a fixed limit (for example, 5).
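
To make the last two rules concrete, here is a minimal sketch of how the requesting side might track outstanding requests. It is written in Python for brevity, not in .NET, and the `RequestWindow` class, its method names, and the use of numeric request IDs are all assumptions made for illustration; nothing in the actual design specifies them.

```python
import itertools


class RequestWindow:
    """Tracks requests awaiting a response, capped at a fixed limit (hypothetical helper)."""

    def __init__(self, max_in_flight=5):
        self.max_in_flight = max_in_flight
        self._ids = itertools.count(1)   # assumed: requests carry a numeric ID
        self.pending = {}                # request ID -> payload

    def can_send(self):
        # True while fewer than max_in_flight requests are unanswered.
        return len(self.pending) < self.max_in_flight

    def register(self, payload):
        # Call just before sending; refuses to exceed the limit.
        if not self.can_send():
            raise RuntimeError("in-flight limit reached")
        request_id = next(self._ids)
        self.pending[request_id] = payload
        return request_id

    def complete(self, request_id):
        # Call when a response arrives. A KeyError here means an unknown or
        # duplicate response, which would violate "one response per request".
        return self.pending.pop(request_id)
```

With a limit of 5, a sixth `register` call raises until one of the earlier requests has been completed.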

I had an exhaustive discussion with my friend (who owns the design; I took part as a bystander) about all the details and ideas. At the end of the discussion we strongly disagreed about the missing timeouts. My friend's argument is that the software on both sides should wait indefinitely. My argument was that any network protocol always needs timeouts. We simply could never agree.

One of my arguments is that in case of any failure you should "fail fast" whatever the cost, because once a failure has occurred, the cost of recovery keeps growing in proportion to the time it takes to learn about the failure. Say, after 1 minute on a LAN you should definitely stop waiting and raise an alarm.

But his argument was that recovery should consist of repairing exactly what failed (in this case, the network connection), and that even if it takes hours to figure out that the network was lost and to fix it, the software should simply continue running transparently as soon as the LAN cables are reconnected.

I had never seriously thought about protocols without timeouts until this discussion.

Which side of the argument is right? "Fail fast" or "never fail"?

Edit: An example of a failure is loss of communication, normally detected by the TCP layer. This part was also discussed. If the TCP layer returns an error, the higher-level custom protocol layer will retry sends, and there is no argument about that. The question is: for how long should the lower level be allowed to keep trying?

Edit for accepted answer: The answer is more complex than the 2 choices: "The most common approach is to never give up the connection until an actual attempt to send fails with solid confirmation that the connection is long lost. To determine that the connection is long lost, use heartbeats, but keep the age of the loss for this confirmation only, not for an immediate alarm."

Example: In a telnet session, you can keep your terminal open forever and never know whether, between presses of Enter, there were failures detectable by lower-level routines.

Answer 1

I prefer your "fail fast" method, but as I think you've discovered, this is largely a matter of preference.

The Cisco equipment that I work with works very similarly: you send a request, it responds (over telnet). The problem is when the network fails: I lose the TCP connection. However, neither side will close that connection until a data send is attempted, and since the Cisco side rarely does that, it never closes. Worse, you can only have 1 connection at a time, so if there's a network failure, you're locked out. (The devices can be reset, but it's just a hassle.)

Now, to test a network connection, you need some sort of ping, just an "are you still there?" message. Many protocols do this, such as AIM and IRC. But those pings cost bandwidth, depending on how often you send them.

So, is the error detection worth the cost in bandwidth? How big does a ping really need to be? I'd say you should be able to get it to <50 octets/ping, and if you ping once every 10s, 30s, or 1m, something like that, I'd say it's well worth it. The earlier you know you have a problem, the better. If the software itself can then use these pings to know it lost the connection and re-establish contact automatically, I'd say that's great, along the lines of "Computer, heal thyself", and it makes for less hassle for the operator.
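
As a rough illustration of the ping bookkeeping, here is a minimal Python sketch. The `HeartbeatMonitor` class and its method names are hypothetical, and the 10-second ping interval and 60-second alarm threshold are only the example figures mentioned in this thread. The sketch tracks how long the peer has been silent and leaves the actual sending of pings, and the decision about what to do once the connection looks lost, to the caller.

```python
import time


class HeartbeatMonitor:
    """Tracks ping timing and peer silence (hypothetical helper, no I/O)."""

    def __init__(self, interval=10.0, alarm_after=60.0, clock=time.monotonic):
        self.interval = interval        # seconds between pings (assumed value)
        self.alarm_after = alarm_after  # silence that counts as "lost" (assumed value)
        self._clock = clock             # injectable so the logic can be tested
        self._last_ping = clock()
        self._last_reply = clock()

    def ping_due(self):
        # True once at least `interval` seconds have passed since the last ping.
        return self._clock() - self._last_ping >= self.interval

    def ping_sent(self):
        self._last_ping = self._clock()

    def reply_received(self):
        # Call for any traffic from the peer, not only ping replies.
        self._last_reply = self._clock()

    def silence(self):
        # Age of the last sign of life from the peer, in seconds.
        return self._clock() - self._last_reply

    def is_lost(self):
        return self.silence() >= self.alarm_after
```

Keeping `silence()` separate from `is_lost()` lets the application choose between raising an alarm right away and merely recording the age of the loss for later confirmation.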

If you're using TCP/IP, it can do this automatically for you: see TCP keepalives. Alternatively, you can do it within your application's protocol, as AIM and IRC do.
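
For the TCP keepalive route, a minimal Python sketch might look like the following. `SO_KEEPALIVE` is the standard socket option; the tuning options (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) are platform-specific, so the sketch applies them only where the `socket` module exposes them. The `enable_keepalive` helper and its default values are assumptions for illustration.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive probes for `sock` (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional tuning: seconds idle before the first probe, seconds between
    # probes, and failed probes before the connection is declared dead.
    # These constants are not available on every platform.
    tuning = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue  # not exposed on this platform; keep the OS default
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # the OS rejected the option; keep its default
```

Without the tuning step, the operating system's defaults apply, and on some systems the first probe is sent only after hours of idle time, which may be too slow for this use case.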

Answer 2

In the scenario where ...

Controller has sent a request
Robot hasn't received the request
Network fails

... then the request has been sent, but has been lost and will never arrive.

Therefore, when the network is restored, the controller must resend the request: the controller cannot simply wait forever for the response.
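
A minimal Python sketch of that resend step, assuming each request carries an ID and that the robot can recognize a repeated ID (otherwise a request that did arrive before the failure could be executed twice). The `PendingRequests` class and the `send` callback are hypothetical names, not part of the design under discussion.

```python
class PendingRequests:
    """Remembers unanswered requests so they can be resent (hypothetical helper)."""

    def __init__(self):
        self._pending = {}  # request ID -> payload, in send order

    def sent(self, request_id, payload):
        self._pending[request_id] = payload

    def answered(self, request_id):
        # Forget the request once its single response has arrived.
        self._pending.pop(request_id, None)

    def resend_all(self, send):
        # Call after the connection has been re-established.
        # `send` is an assumed callback: send(request_id, payload).
        for request_id, payload in list(self._pending.items()):
            send(request_id, payload)
        return len(self._pending)
```

The requests stay in the pending set after a resend, so another network failure before the responses arrive simply triggers another resend.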
