Some background. We have a mixed Windows Server environment of mostly Windows Server 2012R2 member servers and a handful of 2016 and 2008R2 servers. We are currently migrating our 2008R2 and 2012R2 systems to 2016. Almost all systems are Hyper-V guest VMs (including our Exchange servers) hosted on Windows Server 2016 Hyper-V hosts. Our domain is at a functional level of 2012R2 and is split over two sites, BC and DY, connected via a VPN tunnel over the internet. Each site has two domain controllers: both are 2016 in the BC site, while the DY site has one 2016 and one 2012R2. One of the DCs in the BC site carries all the FSMO roles.
Our current project is to replace our existing Exchange 2010 SP3 RU18 servers running on Windows Server 2008R2. We have two in place right now: one in the DY site with only a couple of mailboxes on a single mailbox database, and one in the BC site with a mailbox database of about 80 mailboxes plus an archive mailbox database. We have already extended AD and built two new Exchange 2016 CU7 servers on Windows Server 2016, again one in the BC site and one in the DY site.
Originally, the plan was for the DY Exchange server to host mailboxes for some other projects, but that never happened. Instead, we decided to keep an Exchange server in the DY site and use it for DR by deploying a DAG and replicating our BC-site mailboxes to the DY site. Rather than doing this on the 2010 servers, we are deploying the 2016 servers in this fashion and migrating the mailboxes to them.
On to the problem. As it stands, the Exchange 2016 servers are built with CU7, one in DY (called D-EXCHSRV1) and one in BC (called B-EXCHSRV1), and a DAG called BC-DY-DAG was configured in DAC mode. This DAG has a witness configured in BC and an alternate witness configured in DY. We moved a mailbox or two to a mailbox database in the DAG and tested it out. Test-ReplicationHealth on both servers reports no errors. Get-MailboxDatabaseCopyStatus shows all database copies as Healthy or Mounted, depending on their current owner. Moving the database from server to server using Move-ActiveMailboxDatabase works as expected. If we leave automatic activation on and shut one of the servers down, the databases mount on the other server as expected and fail back after the failback time has passed.
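For reference, the health checks we ran were along these lines (the database name MBXDB01 is illustrative; our server and DAG names are as described above):

```powershell
# Replication health on both DAG members
Test-ReplicationHealth -Server B-EXCHSRV1
Test-ReplicationHealth -Server D-EXCHSRV1

# Copy status for all copies of the database (name is an example)
Get-MailboxDatabaseCopyStatus -Identity MBXDB01\* |
    Format-Table Name, Status, CopyQueueLength, ContentIndexState

# Manual switchover of the active copy to the DY server
Move-ActiveMailboxDatabase -Identity MBXDB01 -ActivateOnServer D-EXCHSRV1
```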
Everything appears to work fine in this setup as configured. However, our plan was to disable automatic activation and only use the DY site server for DR, so we changed the DatabaseCopyAutoActivationPolicy to Blocked on both servers. To verify our solution, we killed power to the BC Exchange server and the witness server. We then followed the procedure at https://technet.microsoft.com/en-us/library/dd351049(v=exchg.160).aspx to verify we can bring the database copy online at the DY site in the event of a DR. On the DY site server, we first run Stop-DatabaseAvailabilityGroup, specifying the BC AD site and the -ConfigurationOnly parameter. We then stop the cluster service on the DY server and run Restore-DatabaseAvailabilityGroup, specifying the DY AD site.
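The exact sequence we ran on D-EXCHSRV1, following the TechNet datacenter switchover procedure, looks like this (our AD sites are named BC and DY; substitute your own site names):

```powershell
# Done earlier: block automatic activation on both members
Set-MailboxServer B-EXCHSRV1 -DatabaseCopyAutoActivationPolicy Blocked
Set-MailboxServer D-EXCHSRV1 -DatabaseCopyAutoActivationPolicy Blocked

# 1. Mark the failed BC site's DAG members as stopped (AD change only,
#    since the BC servers are unreachable)
Stop-DatabaseAvailabilityGroup -Identity BC-DY-DAG -ActiveDirectorySite BC -ConfigurationOnly

# 2. Stop the cluster service on the surviving DY node
Stop-Service ClusSvc

# 3. Restore the DAG using only the surviving DY site's members
Restore-DatabaseAvailabilityGroup -Identity BC-DY-DAG -ActiveDirectorySite DY
```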
This is where things get messed up. The Restore-DatabaseAvailabilityGroup cmdlet reports the following error:
[2018-01-29T16:48:03] Server 'B-EXCHSRV1' was marked as stopped in database availability group 'BC-DY-DAG' but couldn't be removed from the cluster. Error: A server-side database availability group administrative operation failed. Error The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: An error occurred while attempting a cluster operation. Error: Cluster API failed: "EvictClusterNodeEx('B-EXCHSRV1.Domain.local') failed with 0x46. Error: The remote server has been paused or is in the process of being started". [Server: D-EXCHSRV1.Domain.local]
Checking the Restore-DatabaseAvailabilityGroup log file in the C:\ExchangeSetupLogs\DagTasks\ folder shows that the local cluster service never completely starts up by the time the cmdlet attempts to remove the remote Exchange server from the cluster: the log shows the cluster service on the DY server in the "Joining" state, when it should presumably be "Up". Researching this error points to a timing issue, and most recommendations say to rerun the command, but that makes no difference in our case.
I have run Get-ClusterLog and found the section of the log where the service is started with ForceQuorum; the service never fully starts up, and the log shows it stopping with the following errors:
00004c50.00003fac::2018/01/29-16:47:33.125 INFO [VSAM] Node Id for FD info: 92ac05bc-1da2-8bc8-bd73-e800ddb1f70a
00004c50.0000121c::2018/01/29-16:47:33.126 INFO [VSAM] Node Id for FD info: 364faf35-a476-379f-9e67-bba72d7bd352
00004c50.00003fac::2018/01/29-16:47:33.126 INFO [VSAM] BuildNetworkTarget: remote endpoint , node id 1, bufsize 744
00004c50.0000121c::2018/01/29-16:47:33.126 INFO [VSAM] BuildNetworkTarget: remote endpoint \Device\CLUSBFLT\BlockTarget$, node id 2, bufsize 744
00004c50.0000121c::2018/01/29-16:47:33.126 INFO [VSAM] SetClusterViewWithTarget: nodeid 2, nodeset 0x2
00004c50.00003fac::2018/01/29-16:47:33.126 INFO [VSAM] SetClusterViewWithTarget: nodeid 1, nodeset 0x2
00004c50.0000121c::2018/01/29-16:47:33.126 ERR [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
00004c50.0000121c::2018/01/29-16:47:33.126 INFO [VSAM] SetClusterViewWithTarget: waiting for completion for node 2
00004c50.00003fac::2018/01/29-16:47:33.126 ERR [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
00004c50.00003fac::2018/01/29-16:47:33.126 INFO [VSAM] SetClusterViewWithTarget: waiting for completion for node 1
00004c50.0000121c::2018/01/29-16:47:34.127 ERR [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
00004c50.0000121c::2018/01/29-16:47:34.127 INFO [VSAM] SetClusterViewWithTarget: waiting for completion for node 2
00004c50.00003fac::2018/01/29-16:47:34.127 ERR [VSAM] IOCTL_CLUSPORT_GET_UPDATE_MEMBERSHIP_STATE failed: error 87
00004c50.00003fac::2018/01/29-16:47:34.127 INFO [VSAM] SetClusterViewWithTarget: waiting for completion for node 1
00001270.00001228::2018/01/29-16:47:34.990 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
00002d34.00001180::2018/01/29-16:47:34.990 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
00001270.00001228::2018/01/29-16:47:34.992 INFO [RHS] Exiting.
00002d34.00001180::2018/01/29-16:47:35.003 INFO [RHS] Exiting.
I can't find much on the above, but I do know that error 87 is "The parameter is incorrect", which is not very helpful in this case. If we try to force the cluster service to start manually with ForceQuorum, it never fully starts and gets stuck in a loop where it starts and stops constantly, logging "The parameter is incorrect" to the event log.
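For completeness, the manual forced start and log collection we attempted look like this, run elevated on D-EXCHSRV1:

```powershell
# Force the cluster service to start without quorum
net start clussvc /forcequorum

# Equivalent via the FailoverClusters module
Start-ClusterNode -Name D-EXCHSRV1 -ForceQuorum

# Dump the last 15 minutes of the cluster log for this node
Get-ClusterLog -Node D-EXCHSRV1 -Destination C:\Temp -TimeSpan 15
```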
We have since rebuilt and reconfigured both Exchange 2016 servers in an attempt to resolve this problem and ended up facing the exact same issue. I have included a link to the logs below, as they may contain information I am missing here.
https://1drv.ms/f/s!ApEl8Q3xIvLoiDrVH8juRS0TjQHB
Personally, I think this is a clustering issue, as we can't get the cluster service to start once the other Exchange server and witness are offline. We have configured multi-site SQL servers with AlwaysOn Availability Groups and have tested forced failover and forced quorum to bring those online without issue, so I am a bit surprised at this behavior. This is our first experience with a failover cluster without an administrative access point, but using the PowerShell cmdlets to check cluster, node, and resource health before the attempted failover shows everything in a good state. I'm not sure what else to look at. Any help with this would be greatly appreciated.
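The pre-failover cluster checks I mentioned were along these lines, run locally on a DAG member (since the cluster has no administrative access point, we query the local cluster rather than connecting by network name):

```powershell
Import-Module FailoverClusters

# Overall cluster and dynamic quorum state
Get-Cluster | Format-List Name, DynamicQuorum, WitnessDynamicWeight

# Node membership and per-node vote weight
Get-ClusterNode | Format-Table Name, State, DynamicWeight

# Witness/quorum configuration as the cluster sees it
Get-ClusterQuorum | Format-List
```

All of this reported both nodes Up and the file share witness online before we pulled power in the BC site.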