"Tests don't have to be held to the same level of reliability as production code"
I personally take issue with that statement. If your tests are not reliable and robust, you will stop trusting them. At that point, why even have tests if you cannot trust them to verify the correctness of your program (or at least increase your confidence that the program is not broken)?
"If your tests are not reliable and robust, you will stop trusting them."
This seems to be the paradox in much advocacy of strategies like TDD. We start with the premise that our code is likely to contain bugs. In order to detect those bugs, we write a lot of tests, perhaps doubling the size of our code base. Now, we can do whatever we like to the production half of our code base, as long as our tests in the other half all continue to pass when we run them, because magically that testing half of our code base is completely error-free.
Definitely. Tests are really like a dual to the code, in the sense that they are another way of describing what it is supposed to do. Like you say, as the test grows more complicated, the risk that the test code itself has bugs grows too.
The risk is small with simple declarative tests ("the output of f(x) == y"), but with complicated systems with large state spaces you start writing combinatorial tests to cover more of the space, and you incur a bigger risk that the test code itself is buggy.
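A minimal sketch of that contrast, using a hypothetical clamp function: the single declarative case is hard to get wrong, while the combinatorial sweep needs an oracle that duplicates the production logic, and can therefore carry its own bugs.

```python
import itertools

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# Simple declarative test: low risk the test itself is wrong.
assert clamp(5, 0, 10) == 5

# Combinatorial sweep over a small state space: more coverage, but
# the oracle below duplicates the logic, so a bug in the production
# code can be mirrored by the same bug in the test.
for x, lo, hi in itertools.product(range(-3, 4), repeat=3):
    if lo <= hi:
        expected = min(max(x, lo), hi)  # duplicated logic, duplicated risk
        assert clamp(x, lo, hi) == expected
```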
Few things are as frustrating as fixing a bug that should have been caught by a test and realizing the test wasn't actually working correctly. This is even more acute when working on code where speed is important (where performance is correctness, in other words) and you realize a test has been measuring the wrong thing, such as asynchronous run time being attributed to the wrong place.
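A sketch of that failure mode (function names hypothetical): the timer stops right after the work is dispatched to a background thread, so the expensive asynchronous part never shows up in the measurement.

```python
import threading
import time

results = []

def process_async(data):
    # The real work happens on a background thread.
    def work():
        time.sleep(0.2)              # stand-in for the expensive part
        results.append(len(data))
    t = threading.Thread(target=work)
    t.start()
    return t

start = time.perf_counter()
t = process_async([1, 2, 3])
elapsed = time.perf_counter() - start  # measures only the dispatch
t.join()

# The "benchmark" reports near-zero time while the real ~0.2s cost
# ran after the timer stopped - the test measures the wrong thing.
print(f"measured {elapsed:.4f}s for work that took ~0.2s")
```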
TDD has always seemed more like a useful tactic to be applied to certain kinds of problems than as a silver bullet.
That's not TDD. You define the expected behaviour first (which can still be wrong, now or in the future, since it's based on your current assumptions, spec, or some other mutable thing), then ensure the code produces output that matches your expectations. Your output is now as correct as your understanding of the solution.
Your output is still only as correct as your specification of your initial assumptions.
For example, let's say you make a mistake in your test so you are checking (result = expected_result), an assignment that is almost always truthy, instead of (result == expected_result). Now when you write your code, you run the test and it passes.
In this case your code may or may not be correct, and the test, which contains a bug, does not catch it. But the bug is not a fundamental misunderstanding of the problem, rather a simple mistake in writing the test. Following strict TDD doesn't prevent this.
You must see a test fail before you make it pass. It goes "Red -> Green -> Refactor", not just "Green -> Refactor". Even if a valid test case passes the first time, one should modify the test (or the code) in a known way to cause an expected failure.
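One way to apply that discipline, sketched with a hypothetical test: before trusting a test that went green on its first run, force a known failure and confirm the harness actually reports it.

```python
def multiply(a, b):
    return a * b

def test_multiply():
    assert multiply(3, 4) == 12

test_multiply()   # Green on the first run - suspicious under strict TDD

# Force Red: temporarily assert a deliberately wrong expectation.
saw_red = False
try:
    assert multiply(3, 4) == 13   # must fail if the assertion is live
except AssertionError:
    saw_red = True

# If saw_red were False, the test was never actually checking anything.
assert saw_red
```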
Your premise (tests are only as correct as the spec of initial assumptions) is correct, but your supporting example is not.
Tests you only partially trust are a lot better than no tests at all.
Also, the tests can be less reliable by generating false positives. As long as all positives are checked, and the tests are maintained to pass (by fixing the tests and production code as needed), then false positives are not going to hurt your production code's reliability.
As long as your less reliable tests are not generating false negatives, you are fine.
I interpreted the parent's usage of "reliability" to refer to engineering-lifecycle reliability. I completely agree that confidence in tests is critical, but I also agree that taking design shortcuts here and there in test code is fine, if it gets the job done and doesn't adversely impact the production API.
It is one of those fuzzy, learn-from-experience types of concepts.
In my experience, taking shortcuts in tests breeds more and more shortcuts (as less experienced developers look to existing tests as reference). As the shortcuts accumulate, the tests become more brittle, less effective and harder to maintain. This is the path to having a test suite that you no longer trust.
Any test that requires monkey patching or DI to work is prone to these reliability issues. I've seen plenty of DI'd mock objects that screwed up the test.
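A sketch of how an injected mock can make a test vacuous, using Python's unittest.mock (function names hypothetical): MagicMock fabricates truthy attributes on demand, so the assertion passes without verifying any real behaviour.

```python
from unittest.mock import MagicMock

def charge(gateway, amount):
    # Production code: returns whether the payment succeeded.
    response = gateway.charge(amount)
    return response.ok

# Injected mock: gateway.charge() returns a MagicMock, and .ok on a
# MagicMock is another MagicMock, which is truthy - so this "passes"
# even though nothing was verified and the amount is nonsense.
gateway = MagicMock()
assert charge(gateway, -100)   # a negative charge sails through
```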