T-SQL Tuesday #051: The Bet

Once upon a time I was working for a customer that had quite an impressive HA + DR + Reporting solution. It was needed because the main systems were located in Asia and Europe, but the reporting part was actually done in the US. What the system looked like back then (as long as I can remember) was 3 four-node active-active clusters(1 per region) with 4 SQL Server instances each. Between all of those, however, there was transactional replication which was moving the data from Asia to Europe and from there to the US databases in order for the reporting teams to query the data(critical part of the service that was provided).

One day I arrived at the office and the first thing that I heard was:

The replication is down from yesterday evening and we have over a million undistributed commands. The BI guys are not happy.

Now when you are aware of the problem, here is bet that I made:

“It cannot get worse than that” – I said to myself…

When we started our investigation on what caused so much data to be sitting still in the distribution database(the problem with its size will not be discussed at all) we quickly noticed that one of our network adapters was not working as expected and it turned out we were right because the network engineers went to the data center itself to fix some cabling problems. However, this took time and the undistributed commands got quite a lot. Once that was fixed we just started seeing how the data started flying as fast as before, but that was not the end. No, no. The “next big thing” happened on the destination server where one of the drives failed and so a rebuild process started. That slowed us a lot again, but we were not able to do anything this time. And we were not also able to do anything when 2 days after the rebuild completed the whole array went down…

So “it cannot get worse”? … What we did once everything was up and running again was to play a bit with the Agent Profiles(part1|part2) and the parameters there and even though they all promise boost in the delivery of your transactions we were not able to fasten it that much (we were almost 8 million commands behind with quite a lot generated every second). So it was clear that we will have to wait for the replication to catch up once we found the best configuration for our profiles, because at the end of the day it is the network that is playing the most important part in a replication topology!

That was my worst bet ever and I learned an important lesson from it – it can always get worse! :)

Thanks, Jason, for hosting this one! Already looking forward to seeing the other bets!

Share on...

    Share This

    4 Comments on “T-SQL Tuesday #051: The Bet

    1. Pingback: T-SQL Tuesday #051: Bets and Results | SQL RNNR

    2. Pingback: T-SQL Tuesday #051: Bets and Results - SQL Server - SQL Server - Toad World

    What do you think?