Our product is a distributed system. The modules I work on are fairly new, quite rigorous, well tested. They were developed with recent best practices in mind. Other modules can be considered as legacy software.
While I'm vigilant about everything that happens within the modules I'm responsible for, I'm under constant pressure to work with bad data sent to me from the other modules. At heart, I'm a "fail fast" developer, and as a result, when problems arise I can usually rule out my own modules as the source of the error. It's not so much about blame, just about saving the effort wasted chasing bugs in the wrong places.
But the argument I keep coming up against is: "We can't let this stuff fail in production, the customer expects this to work, why don't you work around this problem?" And this would be an argument for robustness: be liberal in what you accept, conservative in what you send.
I should also note that these are mostly intermittent problems. We see them in integration tests but they are hard to reproduce. Timing and concurrency are involved.
I'm having a hard time balancing between the two principles. Part of it is my worry that if I start allowing and propagating exceptional data, I'm inviting trouble and I won't have as much confidence in my system. But I can't argue against keeping the system working even if other modules are sending me wrong data. The reason other modules aren't getting fixed is that they are too complex and fragile, while mine still appear clear and safe. But if I don't resist the pressure, my modules will slowly be saddled with the same problems I've been rejecting until now.
I should say that the system is not "crashing" in production, but my module may simply display an error to the operator and ask them to contact support. A crash would be a big problem, but if I'm reporting the error clearly, then isn't this the right thing to do? I suspect that my peers just don't want the customer to see any problems, period. But my module is rejecting data from other modules within our product, not customer input. So it seems to me that we are just not tackling problems.
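For illustration, the kind of boundary check described above might look roughly like the following minimal Python sketch. The message shape (`sensor_id`, `value`) and every name in it are hypothetical and not taken from the actual product; it only shows the pattern of rejecting bad inter-module data at the edge and reporting it to the operator instead of crashing or passing it on.

```python
from dataclasses import dataclass


class InvalidUpstreamData(Exception):
    """Raised when another module sends data that violates the contract."""


@dataclass(frozen=True)
class Reading:
    # Hypothetical message shape, assumed for this sketch only.
    sensor_id: str
    value: float


def parse_reading(raw: dict) -> Reading:
    """Fail fast: reject anything that breaks the assumed contract."""
    sensor_id = raw.get("sensor_id")
    value = raw.get("value")
    if not isinstance(sensor_id, str) or not sensor_id:
        raise InvalidUpstreamData(f"missing or empty sensor_id in {raw!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise InvalidUpstreamData(f"non-numeric value in {raw!r}")
    return Reading(sensor_id, float(value))


def handle_message(raw: dict) -> str:
    """Process one message; on bad data, report clearly instead of crashing."""
    try:
        reading = parse_reading(raw)
    except InvalidUpstreamData as exc:
        # The bad data stops here and is not propagated further.
        return f"Error: {exc}. Please contact support."
    return f"OK: {reading.sensor_id}={reading.value}"
```

The point of the split is that `parse_reading` stays strict, while `handle_message` decides how a rejection is surfaced, so the reporting policy can change without loosening the validation.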
So, do I need to be more pragmatic or hold my ground?