Monday, June 25, 2018

RPC Server Unavailable when creating a SharePoint Farm… the curse of dodgy legacy NetDOM!

Every so often a real blast from the past comes back to haunt me. Usually it’s some obscure “infrastructure” gubbins – you know, the sort of thing that 80% of so-called IT Pros knew in 1999. These days though, not so much.

With SharePoint in particular there is a whole boatload of legacy. Not that legacy is bad. Lots of it is awesome. That’s why the product remains so successful. On the other hand some of it is real, real, real nasty! :)

It always seems to come in waves. Over the last two weeks I’ve had six emails regarding problems creating the first server in a SharePoint Farm. The ye olde “An error occurred while getting information about the user SPFarm at server fabrikam.com: The RPC server is unavailable”.

Naturally, there are a whole bunch of wild goose chases out there on the interwebz about how to potentially resolve this issue. Most of them are complete claptrap. Or even worse, a ‘support bot’ automated answer, something like “are you running as an administrator?” :)

Now this old chestnut can have a multitude of root causes, and the API that raises this exception isn’t very clever about that – it just bubbles it back up the stack. As you might imagine, when SharePoint was first developed nobody was sat there running through all the various deployment scenarios and fire testing every code path to ensure a nice neat experience for the next 20+ years. But more importantly, 17-18 years ago idempotency, being designed for management via automation, and state independence were not so much of a thing as they are now. Let’s say we do it via PowerShell using New-SPConfigurationDatabase. That isn’t doing one thing. It’s doing a WHOLE BUNCH of things. All masked via the Server Side OM. The same is true if we use the Configuration Wizard. After all, they are both simply masks or wrappers for the OM.

The real problem with this error (and others like it), when creating a Farm, is that it only partly fails. The Configuration database is created on the SQL Server. It’s sitting there pretty happy. Indeed, it will have 711 Objects. New-SPConfigurationDatabase has sorta kinda worked. Even though it’s raised an exception. But it’s not really worked, and as soon as we go to the next stage of farm creation (typically Initialize-SPResourceSecurity) that will fail as well, with the error “Cannot perform security configuration because the configuration database does not exist.  You must create or join to a configuration database before doing security configuration.”

Interesting. Most people will actually check if the Farm exists after New-SPConfigurationDatabase by calling Get-SPFarm. This is mainly because the very first “build a farm using PowerShell” posts included this. However, in this case it is entirely worthless, because Get-SPFarm will report that the farm exists and that it is Online.

The sample script that virtually everyone has been following since SharePoint 2010 is entirely flawed. There is zero point in calling Get-SPFarm at this point. If New-SPConfigurationDatabase works it appears to be a nice little bit of defensive coding. And it does actually deal with some other errors that can occur. But it won’t help in the case of bubbled up exceptions from dependencies.

What we ACTUALLY should be doing here is restarting the PowerShell session entirely. Indeed, if you try and do anything which would otherwise be sensible, like removing and adding the Snap In, it will likely crash the process anyway. And if you just leave it sitting there for a bit, the process will crash all on its own. How do you like those apples?

This is an absolutely critical thing to realise about the Administration OM and the PowerShell that wraps it. This is why delivering true desired state configuration for SharePoint is currently impossible. The only way to catch this is to catch it. Neither the OM nor the PowerShell cmdlets are designed or developed to be leveraged in an automation first approach. Doesn’t mean we can’t get pretty close, but as soon as you start delving into the details, it becomes obvious just how much work is required to do all the things the back end OM doesn’t.  And how exactly does one create a DSC resource that can detect a failed process, know exactly when it failed, and kick off another one to carry on? Yeah, exactly.

Again, SharePoint itself doesn’t handle the exception, it just bubbles it back. It does no clean up. Our box is shafted. So to speak. Until we delete the database, and of course restart (or create a new) PowerShell session. Thus, in order to deal with this properly, we’d need to catch the specific exception and handle it appropriately. Do some remoting and kill the database (using SQL Server PowerShell), then restart our session. In other words, clean up after the mess that SharePoint has made for us.
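As a rough illustration of that catch-and-clean-up idea, here is a minimal sketch. It is not a tested deployment script. The function names, parameter names and the message match are all my own assumptions: I match on the error text because the exact exception type isn’t something to rely on. It also assumes the SharePoint snap-in is already loaded and that Invoke-Sqlcmd (from the SqlServer module) is available on the box. On SharePoint 2016 and later you would also need -LocalServerRole, which is left out here.

```powershell
# Sketch only. All function and parameter names below are hypothetical.

function Test-IsRpcUnavailable {
    param([Parameter(Mandatory)][string]$Message)
    # Assumption: we identify the failure by its message text.
    return $Message -match 'RPC server is unavailable'
}

function Remove-OrphanedConfigDatabase {
    param(
        [Parameter(Mandatory)][string]$DatabaseServer,
        [Parameter(Mandatory)][string]$DatabaseName
    )
    # Escape any closing bracket so the name is safe inside [ ].
    $safeName = $DatabaseName.Replace(']', ']]')
    $query = "ALTER DATABASE [$safeName] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; " +
             "DROP DATABASE [$safeName];"
    # Invoke-Sqlcmd connects to the remote SQL Server instance directly.
    Invoke-Sqlcmd -ServerInstance $DatabaseServer -Database 'master' -Query $query
}

function New-ConfigDatabaseWithCleanup {
    param(
        [Parameter(Mandatory)][string]$DatabaseServer,
        [Parameter(Mandatory)][string]$DatabaseName,
        [Parameter(Mandatory)][string]$AdminContentDatabaseName,
        [Parameter(Mandatory)][pscredential]$FarmCredentials,
        [Parameter(Mandatory)][securestring]$Passphrase
    )
    try {
        New-SPConfigurationDatabase -DatabaseServer $DatabaseServer `
            -DatabaseName $DatabaseName `
            -AdministrationContentDatabaseName $AdminContentDatabaseName `
            -FarmCredentials $FarmCredentials `
            -Passphrase $Passphrase `
            -ErrorAction Stop
        return $true
    }
    catch {
        if (Test-IsRpcUnavailable -Message $_.Exception.Message) {
            Write-Warning 'Farm creation partly failed. Dropping the config database.'
            Remove-OrphanedConfigDatabase -DatabaseServer $DatabaseServer -DatabaseName $DatabaseName
            Write-Warning 'Fix the root cause, then retry from a NEW PowerShell session.'
            return $false
        }
        throw   # anything else is not ours to handle here
    }
}
```

Given that the session is liable to crash after the failure, it is probably safer in practice to run the clean up from a separate, fresh process rather than from the same session as shown here.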

But of course we also need to fix the root cause of the problem before retrying.

“The RPC server is unavailable” is one of the classic, generic, “we have no idea what happened, throw this” errors. Now that we live in a modern, transformed, cloud first world, we only have one generic error, “access denied” :). But back in the day, when type was understood, we had lots of them.

Sometimes the RPC server really is unavailable. But when creating a SharePoint Farm there is a 98% chance your machine has multiple Network Interfaces, and the prioritised (default) NIC cannot route to the SQL Server. It looks like this.

[Screenshot: 2018-06-25_12-51-09]

In this example “Ethernet1” is a shared network to a backup device. The customer is using this network to separate farm traffic from management traffic. Is that a good idea? Well that’s a post for another time (or perhaps not!) but the thing is it’s extremely common. Even in IaaS lots of people do this.

If I move the “10net” network to the top of the list, hey presto: no more “RPC server is unavailable”. It’s a common gotcha. The trouble is most people don’t know there even IS a NIC order, never mind where in the UI to configure it. For over a decade, all but one of the cases of this error where I’ve been involved, either as escalation or hands on, have been down to the NIC order. It’s in my checklist before deploying farms – not that I really do that anymore, but whatever, that checklist doesn’t get updated much.

The good news is we can fix it with PowerShell, decent PowerShell….
Set-NetIPInterface -InterfaceIndex <index> -InterfaceMetric <new metric>
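
Before changing anything it helps to see what the current order actually is, and which interface Windows picks to reach the SQL Server. A minimal sketch follows, with hypothetical helper names of my own. It assumes the SQL Server listens on the default TCP port 1433, and it borrows “10net” from the example above as an interface alias. A lower InterfaceMetric means a higher priority.

```powershell
# Sketch only. Helper names are hypothetical; the cmdlets are from the
# NetTCPIP / NetworkingDsc-free built-in networking modules on Windows.

function Get-InterfacePriorityOrder {
    param([Parameter(Mandatory)][object[]]$Interface)
    # Lowest metric first, i.e. the interface Windows prefers comes first.
    $Interface | Sort-Object -Property InterfaceMetric
}

function Show-InterfacePriority {
    # Current IPv4 interfaces in priority order.
    Get-InterfacePriorityOrder -Interface (Get-NetIPInterface -AddressFamily IPv4) |
        Select-Object -Property ifIndex, InterfaceAlias, InterfaceMetric, ConnectionState
}

function Test-SqlRoute {
    param(
        [Parameter(Mandatory)][string]$SqlServer,
        [int]$Port = 1433   # assumption: default instance port
    )
    # InterfaceAlias in the result shows which NIC was used for the attempt.
    Test-NetConnection -ComputerName $SqlServer -Port $Port |
        Select-Object -Property ComputerName, RemoteAddress, InterfaceAlias, TcpTestSucceeded
}

# Example usage (run elevated on the SharePoint server):
#   Show-InterfacePriority
#   Test-SqlRoute -SqlServer 'sql.fabrikam.com'
#   Set-NetIPInterface -InterfaceAlias '10net' -InterfaceMetric 5
#   Show-InterfacePriority   # confirm '10net' is now first
```

A successful TCP test to SQL doesn’t prove the farm creation will now work, but it does show which interface is being preferred, which is the thing we are trying to change.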

It of course shouldn’t be this way. The reason it is, is that the SharePoint APIs (especially the core Farm Admin ones) rely on some real legacy. In this case, what old timers refer to as the NetDOM stack. That’s an in-joke related to an old utility used to hack AD things, back before there was an RTM AD and Microsoft were still doing NetBIOS. Now, it works. But that is the depth of a product like this. A rabbit warren of more or less every API made by Microsoft from around 1996 through to today. It would take a brave (and quite possibly barking mad) person indeed to make the decision to re-architect and re-implement the core admin stack.

Anyway the point of this post was to:

  • document the 98% case cause and resolution (fix the NIC order, even if only temporarily whilst the farm is built) so I don’t have to actually explain it ever again (I hope!)
  • provide a worked example of the importance of understanding how things are actually built, rather than merely the mask or veneer that so many products have today.

Not to be all Donald Rumsfeld, but remember with most things in life, and especially SharePoint - The more you know, the more you know you don’t know. You know?! :)

s.


by Spence via harbar.net
