One day I arrived at the office and the first thing that I heard was:
“The replication has been down since yesterday evening and we have over a million undistributed commands. The BI guys are not happy.”
Now that you are aware of the problem, here is the bet that I made:
“It cannot get worse than that” – I said to myself…
When we started investigating why so much data was sitting still in the distribution database (the problem with its size is a topic for another day), we quickly suspected that one of our network adapters was not working as expected. It turned out we were right: the network engineers had to go to the data center itself to fix some cabling problems. That took time, however, and the number of undistributed commands grew considerably. Once the cabling was fixed, the data started flying as fast as before, but that was not the end. No, no. The “next big thing” happened on the destination server, where one of the drives failed and a rebuild process started. That slowed us down a lot again, and this time there was nothing we could do. Nor could we do anything when, two days after the rebuild completed, the whole array went down…
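For anyone wanting to watch a backlog like this, one way to see how many commands are still waiting for a subscriber is the sp_replmonitorsubscriptionpendingcmds procedure on the distributor. A minimal sketch follows; all the server and database names are placeholders, not our actual topology:

```sql
-- Run on the distributor, in the distribution database.
-- Returns the count of commands not yet delivered to the subscriber
-- and an estimated time to apply them.
-- MyPublisher, MyPublisherDb, etc. are placeholder names.
USE distribution;
GO
EXEC sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'MyPublisher',
    @publisher_db      = N'MyPublisherDb',
    @publication       = N'MyPublication',
    @subscriber        = N'MySubscriber',
    @subscriber_db     = N'MySubscriberDb',
    @subscription_type = 0;  -- 0 = push, 1 = pull
```

Replication Monitor shows the same numbers, but having the query handy is useful when you want to log the backlog over time.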
So, “it cannot get worse”? … Once everything was up and running again, we played a bit with the Agent Profiles (part1 | part2) and the parameters there. Even though they all promise a boost in the delivery of your transactions, we were not able to speed things up that much (we were almost 8 million commands behind, with quite a lot more generated every second). So it was clear that, even after finding the best configuration for our profiles, we would simply have to wait for the replication to catch up, because at the end of the day it is the network that plays the most important part in a replication topology!
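To give an idea of the kind of knobs an agent profile exposes, here is an illustrative fragment of Distribution Agent parameters. The values below are examples only, not the configuration we ended up with; as we learned, no combination of them outruns a saturated network:

```sql
-- Illustrative Distribution Agent parameters (set via an agent profile
-- or on the distrib.exe command line). Values are examples, not recommendations.
-- -CommitBatchSize 500        -- transactions committed at the subscriber per batch (default 100)
-- -CommitBatchThreshold 2000  -- commands queued before a commit is forced (default 1000)
-- -SubscriptionStreams 4      -- parallel connections applying commands to the subscriber
```

Larger batches reduce per-commit overhead but hold locks longer at the subscriber, so these are trade-offs to test, not free throughput.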
That was my worst bet ever and I learned an important lesson from it – it can always get worse! 🙂
Thanks, Jason, for hosting this one! Already looking forward to seeing the other bets!