T-SQL Tuesday is back around again, and this time it’s the first one of 2018! This wonderful idea by Adam Machanic (Blog | Twitter) is this time hosted by Arun Sirpal (Blog | Twitter).

The topic Arun has chosen is:

Write about and share with the world a time when you faced a technical challenge that you overcame and you can go technical with both the issue and solution if you like.

Over the years I have faced lots of technical challenges, and so have lots of other DBAs. But the one that sticks with me the most is the story of the failing cluster setup. It was back in 2008/2009, and I was a consultant at a hospital, hired to set up a failover cluster with SQL Server 2005.

If you’re wondering why SQL Server 2005: SQL Server 2008 had just been released, and the medical industry is not that keen on being an early adopter when it comes to technology. My task was to set up a failover cluster on a couple of HP servers.

They had already been prepared by the system administrators and installed with Windows Server. Seemed simple enough; I had done that many times before, and it shouldn’t have given me any trouble. Oh boy, was I wrong about that!

Having set up all the different parts of Windows Server, I started the installation of SQL Server 2005 on the first node. Everything went smoothly and the installation was successful.

Then came the installation of the second node. I started the installation to add the node to the cluster, and it got to about 90% and stopped. It gave me a general error that something in the installation had gone wrong and that I should check the installation logs for details.

I checked the log files and didn’t find anything that could explain why the installation failed. Let’s go through all the settings and try again. Had I entered the wrong credentials or gotten some other setting wrong?

After starting the installer, it failed at the 90% mark again. The installer rolled everything back, and I still wasn’t sure what had happened. Let’s check the installation logs again; hopefully this time they show what went wrong.

Unfortunately, they didn’t show me anything. I wasn’t going to give up. I tried different variations of the installation settings to see if I had made some sort of mistake. I checked the hardware for any defects, replaced the memory, and swapped the CPUs of the machines.

Nothing there could explain why the setup failed.

At this point it’s a good idea to get a higher power involved, so we created a ticket with Microsoft support. It even got escalated a few levels, and even they weren’t able to figure out why the installation failed. Their answer was: “Just use SQL Server 2008, that should work.”

Besides the fact that this was not the answer I was looking for, I couldn’t install SQL Server 2008 because it was not yet proven technology. So as the consultant, I was kind of in a pickle. How was I going to get the job done? My reputation was on the line.

I’m supposed to be the one with the knowledge, and there I was, struggling for over a week to get a cluster set up. At this point I had to start thinking outside the box, so I looked at the hardware again. The hardware was all approved and supported by Microsoft. We didn’t need any third-party drivers or anything like that.

Looking at the hardware, besides the motherboard, power supply, and some other parts, we were only able to control the CPU, memory, and hard disks.
The servers were identical, with the following configuration:

  • CPU: 4 x 6 cores
  • Memory: 64 GB
  • Hard disks:
    • System disk: 40 GB
    • Application disk: 30 GB
    • Data disk: 300 GB
    • Log disk: 100 GB

I first examined the disks. There was nothing special about them. The system and application disks were the only local disks; the data and log disks were LUNs.
The memory was nothing special either and was set up as 4 x 16 GB modules.
The CPUs were something I had not seen before, and what caught my attention was the 6-core specification. In total we had 24 cores, and for some reason that number seemed odd to me.

Let me explain why 24 was an odd number to me. When you work with computers, you’re used to working with powers of 2: 1, 2, 4, 8, 16, 32, 64, and so on, but not 24, which sits right between 16 and 32.

It’s possible to make Windows believe there is a different number of processors in your server. You can change this setting in the MSCONFIG utility by setting the number of processors on the BOOT.INI tab. At this point I was open to any new idea, so I changed the number of processors to 2. The number 2 is a power of 2, so let’s see what happens.
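For reference, MSCONFIG does nothing magical here: it just adds the /NUMPROC switch to the OS entry in BOOT.INI. On the Windows Server 2003 boxes of that era, the relevant line looked roughly like the sketch below; the ARC path and description are placeholders for whatever your server actually has.

    [operating systems]
    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /NUMPROC=2

With /NUMPROC=2, Windows only uses two of the available processors after the next reboot.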

I restarted the server, looked in Task Manager, and saw two processors. That all worked, so let’s run the setup of the SQL Server node again. I would not have believed it if it hadn’t happened to me, but for some reason the setup was successful.

At this point it could’ve been a fluke, and I wanted to be sure. I removed the node from the cluster, reset the number of processors to the original value, and ran the setup again. This time it failed, just as before.

That was still not good enough for me, but I was getting closer to a conclusion. Let’s see what happens if we change the number of processors to 8. I changed the value, rebooted the server, and started the setup again. This time it was successful again.

My conclusion was that, for some reason, the setup couldn’t handle a number of processors that was not a power of 2. In my mind this was insane, because I had never encountered such a problem before. Now that everything was installed, I could set the number of processors back to 24 and get on with installing the updates, like Service Pack 3.

I changed the number of processors and rebooted the server, and as soon as the server came up, my cluster failed. It couldn’t be the number of processors, right? Again there was no clear error, and because I now knew what was going wrong, I changed the number of processors back to 8 and rebooted. This time the server and the cluster came up fine.
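As a quick sanity check, you can also ask SQL Server itself how many processors it sees. The DMV below has been around since SQL Server 2005:

    -- cpu_count: logical CPUs visible to SQL Server
    -- hyperthread_ratio: logical cores exposed per physical processor package
    SELECT cpu_count, hyperthread_ratio
    FROM sys.dm_os_sys_info;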

I contacted Microsoft support and told them about my findings. They replied that indeed something odd had been fixed in SQL Server, and the fix was released in Service Pack 3. Back then, with SQL Server 2005, you weren’t able to slipstream your service packs.

You first had to install the initial release and then apply the service pack. So the next step was to install SP3 with the server still limited to 8 cores. I started the setup and, of course, things didn’t go as expected. The installation failed and gave no clear reason why.

Again I was stuck. I had an incomplete cluster installation because we couldn’t use the full amount of resources, and I couldn’t install the very update that would fix it. I started to dig into the changes that would be applied in SP3.

I noticed some changes to the system databases, especially the Resource database (mssqlsystemresource). The Resource database is a system database that’s not visible to the user and is only used internally by SQL Server. I ran the setup again, and this time I wanted to see what happened to the system databases during the installation. I saw the master database being replaced by a new version, then the msdb database, but as soon as the installation tried to finish the update of the Resource database, the new version disappeared, the original was put back, and the installation failed.
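Because the Resource database doesn’t show up in sys.databases, about the only way to see whether an update actually touched it is through SERVERPROPERTY (both properties exist from SQL Server 2005 onwards):

    -- Version of the hidden Resource database and when it was last updated
    SELECT SERVERPROPERTY('ResourceVersion')            AS ResourceVersion,
           SERVERPROPERTY('ResourceLastUpdateDateTime') AS ResourceLastUpdate;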

That’s it! That’s the reason the service pack was failing. With a little research and some gut feeling, I saw what had happened. But why did it happen? Was the installer not able to rename the database? To be sure, I did something drastic, and don’t ever do this in production! I went into the SQL Server settings and updated the physical file names of the Resource database.
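From memory, this boils down to the documented procedure for moving the Resource database in SQL Server 2005: start the instance in minimal configuration with trace flag 3608 and repoint the files. The statements below are a sketch from recollection rather than a script I’d hand anyone, and the paths are placeholders:

    -- Run after starting the instance with: sqlservr.exe -f -T3608
    -- (minimal configuration; only master is recovered)
    ALTER DATABASE mssqlsystemresource
        MODIFY FILE (NAME = data, FILENAME = 'D:\MSSQL\Data\mssqlsystemresource_old.mdf');  -- placeholder path
    ALTER DATABASE mssqlsystemresource
        MODIFY FILE (NAME = log,  FILENAME = 'D:\MSSQL\Data\mssqlsystemresource_old.ldf');  -- placeholder path
    -- The Resource database is supposed to stay read-only
    ALTER DATABASE mssqlsystemresource SET READ_ONLY;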

I shut down the instance and renamed the files for the database on disk. The instance started successfully, and when I checked the database, it showed the renamed files. I started the installation and watched the system database files being changed: first the master database, then msdb, and finally the Resource database.

I saw a new Resource database being created next to the original one, the old one was removed, and the installation finished! That was the problem with SP3: it was not able to do an in-place upgrade of the Resource database.

We were almost done. The only thing left to change back was the number of processors, followed by a reboot. I changed the setting, rebooted the server, and everything came up successfully. It was one hell of a journey, and it gave me the grey hairs I have now.

All this took about two weeks of my life for what could’ve been an afternoon. It was an experience that changed my way of thinking as a DBA, and I learned a lot along the way.

The moral of the story is that there is always a solution. Go and dig into the material. Think outside the box and do not let any challenge get the best of you.