Infinite timeouts or "fail fast" in custom network protocol?

Consider a custom network protocol used to control robotic peripherals over a LAN from a central .NET-based workstation. (If it matters, the robot is busy moving fabs in a chip production environment.)

  • There are only two parties in the conversation: the .NET station and the robotic peripheral board.
  • The robotic side can only receive requests and send responses.
  • The .NET side can only initiate requests and receive responses.
  • There should always be exactly one response per request.
  • Consecutive requests can follow one another immediately without waiting for a response, but the number of requests being served simultaneously must never exceed a fixed limit (for example, 5).
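
The last two rules can be sketched as a small bookkeeping class on the .NET side's behalf. This is only an illustration in Python, not part of the actual design: the class name `RequestWindow`, its methods, and the use of integer request ids are all assumptions.

```python
import itertools


class RequestWindow:
    """Hypothetical tracker for requests that still await a response."""

    def __init__(self, max_in_flight=5):
        self.max_in_flight = max_in_flight  # fixed limit from the rules above
        self._ids = itertools.count(1)      # assumed: requests carry integer ids
        self.pending = {}                   # request id -> payload

    def can_send(self):
        # A new request is allowed only while the window is not full.
        return len(self.pending) < self.max_in_flight

    def register(self, payload):
        if not self.can_send():
            raise RuntimeError("in-flight limit reached")
        request_id = next(self._ids)
        self.pending[request_id] = payload
        return request_id

    def complete(self, request_id):
        # Exactly one response per request: an unknown or repeated id is an error.
        if request_id not in self.pending:
            raise KeyError(f"no pending request with id {request_id}")
        return self.pending.pop(request_id)
```

A sender would call `register` before each send and `complete` when the matching response arrives; whether a full window should block, queue, or raise is a design choice the rules above leave open.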

I had an exhaustive discussion with my friend (who owns the design; I took part as a bystander) about all the details and ideas. At the end of the discussion we strongly disagreed about the missing timeouts. My friend's argument is that the software on both sides should wait indefinitely. My argument was that any network protocol needs timeouts. We simply could not agree.

One of my arguments is that in case of any failure you should "fail fast" whatever the cost: since the failure has already occurred, the cost of recovery keeps growing in proportion to the time it takes to learn about the failure. Say, after 1 minute on a LAN you should definitely stop waiting and raise an alarm.

His argument was that recovery should consist of repairing exactly what failed (in this case, restoring the network connection), and that even if it takes hours to figure out that the network was lost and to fix it, the software should simply continue running transparently as soon as the LAN cables are reconnected.

I would never have seriously considered timeout-free protocols before this discussion.

Which side of the argument is right: "fail fast" or "never fail"?

Edit: An example of failure is loss of communication, normally detected by the TCP layer. This part was also discussed. If the TCP layer returns an error, the higher-level custom protocol layer will retry sends, and there is no argument about that. The question is: for how long should the lower level be allowed to keep trying?

Edit for accepted answer: The answer is more nuanced than the two choices: "The most common approach is to never give up on a connection until an actual attempt to send fails with solid confirmation that the connection is long lost. To determine that the connection is long lost, use heartbeats, but use the age of the loss only for this confirmation, not to raise an immediate alarm."

Example: in a telnet session, you can keep your terminal open forever and never know whether, between presses of Enter, there were failures detectable by lower-level routines.

Best answer

I prefer your "fail fast" approach, but as I think you've discovered, this is largely a matter of preference.

The Cisco equipment I work with behaves very similarly: you send a request, it responds (over telnet). The problem comes when the network fails and I lose the TCP connection. Neither side will close that connection until it attempts to send data, and since the Cisco side rarely does that, the connection never closes. Worse, you can only have one connection at a time, so after a network failure you're locked out. (The devices can be reset, but it's a hassle.)

Now, to test a network connection, you need some sort of ping, just an "are you still there?" message. Many protocols do this, such as AIM and IRC. But those pings cost bandwidth, depending on how often you send them.

So, is the error detection worth the cost in bandwidth? How big does a ping really need to be? I'd say you should be able to get it to <50 octets/ping, and if you ping once every 10 s, 30 s, or 1 min, something like that, I'd say it's well worth it. The earlier you know you have a problem, the better. If the software itself can then use these pings to detect that it lost the connection and re-establish contact automatically, I'd say that's great, along the lines of "Computer, heal thyself", and it means less hassle for the operator.
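
To put a number on the cost: a 50-octet ping every 10 s is 5 octets/s, or 40 bit/s of application payload (TCP/IP headers come on top of that). The liveness bookkeeping on top of such pings can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `HeartbeatMonitor`, the 10 s interval, and the "three missed pings means lost" threshold are all assumptions.

```python
import time


class HeartbeatMonitor:
    """Hypothetical tracker of how long the peer has been silent."""

    def __init__(self, interval_s=10.0, missed_allowed=3, clock=time.monotonic):
        self.interval_s = interval_s          # assumed ping period
        self.missed_allowed = missed_allowed  # assumed tolerance before declaring loss
        self._clock = clock                   # injectable so the logic can be tested
        self.last_seen = clock()

    def on_pong(self):
        # Call whenever a ping reply (or any other traffic) arrives from the peer.
        self.last_seen = self._clock()

    def silence_s(self):
        # Age of the silence: how long since the peer was last heard from.
        return self._clock() - self.last_seen

    def is_lost(self):
        return self.silence_s() > self.interval_s * self.missed_allowed
```

What the application does once `is_lost()` turns true (alarm immediately, or keep reconnecting and only report the age of the loss) is exactly the policy question being debated here; the monitor only supplies the measurement.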

If you're using TCP/IP, it can do this automatically for you: see TCP keepalives. Alternatively, you can do it within your application's protocol, as AIM and IRC do.
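
As a rough sketch of the TCP keepalive route, here is how it might look with Python's standard `socket` module. `SO_KEEPALIVE` is widely available; the tuning options `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are platform-dependent (they exist on Linux), so the code only sets them where the module exposes them. The helper name `enable_keepalive` and the timing values are assumptions chosen for illustration.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive probes; tune them where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Platform-dependent tuning knobs; skipped silently where unavailable.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # the OS refused this option; keep the system default
```

With the assumed values, a dead peer would be reported after roughly 60 + 10 × 5 = 110 s of silence, as an error on the next socket operation.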

Answers

In the scenario where ...

  • Controller has sent a request
  • Robot hasn't received the request
  • Network fails

... then the request has been sent, but has been lost and will never arrive.

Therefore, when the network is restored, the controller must resend the request: the controller cannot simply wait forever for the response.
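
One way to make resending safe is sketched below. It goes a step beyond what this answer states: it assumes each request carries an id, and that the robot caches its response per id so a resent request is never executed twice (relevant when the request did arrive but the response was lost). The `Robot` and `Controller` classes and the `send` callable are hypothetical.

```python
class Robot:
    """Hypothetical robot side: executes each request id at most once."""

    def __init__(self):
        self.done = {}      # request id -> cached response
        self.executed = []  # commands that were actually carried out

    def handle(self, request_id, command):
        if request_id not in self.done:
            self.executed.append(command)
            self.done[request_id] = f"ok:{command}"
        return self.done[request_id]  # a duplicate gets the cached response


class Controller:
    """Hypothetical controller side: keeps requests until they are answered."""

    def __init__(self, send):
        self.send = send     # callable(request_id, command); raises ConnectionError
        self.pending = {}    # request id -> command still awaiting a response
        self.responses = {}  # request id -> response received

    def request(self, request_id, command):
        self.pending[request_id] = command
        self._try_send(request_id)

    def _try_send(self, request_id):
        try:
            response = self.send(request_id, self.pending[request_id])
        except ConnectionError:
            return  # keep it pending; it will be resent after reconnecting
        self.responses[request_id] = response
        del self.pending[request_id]

    def on_reconnect(self):
        # Resend everything that never got a response, reusing the same ids.
        for request_id in list(self.pending):
            self._try_send(request_id)
```

Reusing the original ids on resend is what lets the robot tell a retry from a new command, which matters when the command moves hardware.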




