Just found out that one of our (slind) recent nightly builds failed so badly that it destroyed the repository on the main server (that is, it rsynced up an empty repository). That in itself is not much of a problem, but it still sets the repository back several days, not to mention the wasted bandwidth. This, however, is not why we're here.
Investigating the cause of the failure took much less time than I had dreaded, and the problem itself turned out to be pretty simple. Each time the build starts, one of the scripts involved downloads a database file that is critical to the build infrastructure. It used to do so by issuing 'curl $URL'. Pretty simple, but not safe against network and/or server failures. Then one of our engineers decided to make it safer by patching it like this (these snippets are simplified, see below for original code):
- curl $URL
+ curl $URL || die "Failed to download a very important file!"
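...which is a nice try, except that, unless given the -f/--fail option, curl exits with code 0 whenever the transfer itself completes, regardless of the HTTP response code. So a 404 or 500 from the server still counts as success and die is never called. In other words, we still have incorrect code, but now it also gives us a false sense of security.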
And of course, had -f been there the night before, the repository wouldn't have been destroyed at all.
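For illustration, here is a minimal sketch of what a safer download step could look like. It is not the original script: the function name fetch_db, its two arguments, and the temporary-file naming are all made up for this example. It assumes the file is saved to disk, passes -f so that HTTP errors produce a non-zero exit code, and only replaces the existing copy after a non-empty download.

```shell
#!/bin/sh
# Sketch only: fetch_db, its arguments, and the temp-file naming are
# hypothetical, not taken from the original build scripts.

# Download $1 to $2; leave any existing $2 untouched if the download fails.
fetch_db() {
    url=$1
    dest=$2
    tmp="$dest.tmp.$$"

    # -f:  exit non-zero on HTTP errors (>= 400) instead of saving the error page
    # -sS: no progress meter, but still print error messages
    # -o:  write to a temporary file rather than stdout
    if curl -fsS -o "$tmp" "$url" && [ -s "$tmp" ]; then
        mv "$tmp" "$dest"   # replace the old copy only after a good download
    else
        rm -f "$tmp"
        return 1
    fi
}
```

With something like this in place, the patched line would read: fetch_db "$URL" "$DB_FILE" || die "Failed to download a very important file!" (DB_FILE being another placeholder). Note that the curl manual itself warns that -f is not fail-safe in every case, particularly when authentication is involved, so the non-empty check is cheap extra insurance rather than a guarantee.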
P.S.
And the commit in question is here.