It is my third month as a SQL Server DBA at HP and boy, what a few months it has been! I am learning a whole bunch of new things, I am surrounded by people who talk about SQL Server 24×7, and I can finally do all the things I know in the real world and see the difference between practicing at home and doing the same things for a Fortune 500 company.
Back on the subject! Two or three weeks ago, I made my first huge mistake as a DBA (or at least, I consider it huge). The customer I made it for – believe me – you don’t wanna know. And even if you do, I can’t tell you. Let’s just say that losing any data for this customer can be critical, so it is, let’s say, not recommended to even think about making mistakes in their environment!
Very, very simple – we received a ticket for a failed SQL Server Agent job (for those of you interested in my failure but without any knowledge of SQL Server: a SQL Server Agent job is, simply put, a set of steps, where every step executes a specific set of commands, and the whole job can be scheduled, e.g. to run every Monday at 6 p.m.). That’s it. Not a big deal!
What I did
I found out why the job had failed, fixed it (SQL geeks – it was the job owner…) and restarted it, hoping that this time it would complete successfully, which, by the way, it did.
Most of our customers have specific instructions or rules or… MUST DOs and DON’Ts, let’s call them that! The customer the ticket was for has a MUST DON’T rule, which says that an application job (any SQL Server Agent job that is not related to backups, restores or maintenance in some way) MUST NOT BE RESTARTED! Why? Because there’s a “little chance” of data loss! Cool, isn’t it?
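For the geeks, that rule can be pictured as a tiny pre-flight check to run before touching a failed job. What follows is only a sketch in Python under loud assumptions: the category labels, the `is_restart_allowed` and `next_action` helpers and the sample job names are all hypothetical, made up for illustration, and not taken from the customer’s real instructions.

```python
# Hypothetical pre-restart check; categories and job names are illustrative only.

# Job categories the rule treats as safe to rerun (assumed labels).
INFRASTRUCTURE_CATEGORIES = {"backup", "restore", "maintenance"}


def is_restart_allowed(job_category: str) -> bool:
    """Return True only for backup/restore/maintenance jobs.

    Anything else counts as an application job and must not be restarted.
    """
    return job_category.strip().lower() in INFRASTRUCTURE_CATEGORIES


def next_action(job_name: str, job_category: str) -> str:
    """Describe what the DBA on shift should do with a failed job."""
    if is_restart_allowed(job_category):
        return f"{job_name}: fix the cause, then restart"
    return f"{job_name}: do NOT restart, notify the customer"


if __name__ == "__main__":
    print(next_action("NightlyFullBackup", "Backup"))
    print(next_action("OrdersImport", "application"))
```

The point of the sketch is the default: a job is treated as an application job, and therefore left alone, unless it clearly belongs to one of the safe categories.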
What happened next?
When I told my colleague about the case (for no particular reason), she (yes, there are women DBAs!) was quite, hm… (searching for the word), amazed by what I had done. She didn’t waste any time and told me a story. Here it is, short and simple – once upon a time there was a DBA (a man, this time) and he did exactly what I had done – job failure, fix, restart. For the same customer. Some hours later, someone from the customer’s side called our team and said something like the following: “HP, huge thanks for your enthusiasm to work! Because of it, we just lost 6 hours of data…”!
Please, tell me, is this not Funny with a capital F?
My answer – NO! I went back home thinking and thinking, shaking from time to time and thinking again about what could have happened, what would happen if what I had done led to the same result as the one described just a few sentences above, and so on.
In the end
Because I wrote to our Database Delivery Lead for this customer right after I heard the story, I expected that if there had indeed been data loss, we could do something about it as soon as possible. Fortunately, there is a happy ending! There was no data loss, and nothing else bad happened to our customer’s environment!
Be extremely careful every time you have to do something in a production environment, and be twice as careful if you know that this “thing” you have to do can cause data loss! No matter what time of the day or night it is (in my case it was 4 a.m. when I restarted the job), be careful. If you are not sure what has to be done, ask. Have a backout plan. Protect yourself!
My lesson learned
I will never restart an application job for this customer again (because I do not want to be escalated, I don’t want to meet my boss for that reason, I want to sleep when I get back home after a night shift, and for hundreds and hundreds of other reasons)! Nor will I do it for any other customer, because it’s not in my scope. Our team does not support application jobs for any of our customers, and all they get when a job fails is a simple e-mail notification.
That’s it, folks! Hope you loved my failure! I now feel completely free to type my favourite last word.