Storage Spaces Direct Overview

Storage Spaces Direct is an area which I have been meaning to look into, but for one reason or another it has slipped through the gaps until now.

What Is Storage Spaces Direct

Storage Spaces Direct is shared-nothing, software-defined storage that is part of the Windows Server 2016 operating system.  It creates a pool of storage from the local drives of two or more individual servers.

The storage pool is used to create volumes which have built-in resilience, so if a server or hard drive fails, data remains online and accessible.

What Is The Secret Sauce?

The secret sauce is the ‘storage bus’, which is essentially the transport layer that connects the physical disks across the network using SMB3. It allows each host to see all disks as if they were its own local disks, using Cluster Ports and Cluster Block Filter.

Cluster Ports acts like an initiator in iSCSI terms and Cluster Block Filter acts like the target, which allows each disk to be presented to each host as if it were its own.

For a Microsoft supported platform you will need a 10GbE network with RDMA-capable NICs, using either iWARP or RoCE, for the Storage Bus.

Disks

When it comes to Storage Spaces Direct, not all disks are equal, and a number of disk configurations can be used.  Drive choices are as follows:

  • All Flash NVMe
  • All Flash SSD
  • NVMe for Cache and SSD for Capacity (Writes are cached and Reads are not Cached)
  • NVMe for Cache and HDD for Capacity
  • SSD for Cache and HDD for Capacity (could look at using more expensive SSD for cache and cheaper SSD for capacity)
  • NVMe for Cache and SSD and HDD for Capacity

In an SSD and HDD configuration the Storage Bus Layer Cache binds SSD to HDD to create a read/write cache.

Using NVMe based drives will provide roughly three times the performance at typically 50% fewer CPU cycles versus SSD, but at a far greater cost.

It should be noted that, as a minimum, 2 x SSD and 4 x HDD are needed for a Microsoft supported configuration.

Hardware

In relation to the hardware, it must be in the Windows Server Catalog and certified for Windows Server 2016.  Both HPE DL380 Gen10 and Gen9 are supported along with HPE DL360 Gen10 and Gen9.  When deploying Storage Spaces Direct you need to ensure that the cluster creation passes all validation tests to be supported by Microsoft.

  • All servers need to be the same make and model
  • Minimum of an Intel Nehalem processor
  • 4GB of RAM per TB of cache drive capacity on each server to store metadata, e.g. 2 x 1TB SSD per server requires 8GB of RAM dedicated to Storage Spaces Direct
  • 2 x NICs that are RDMA capable with either iWARP or RoCE dedicated to the Storage Bus
  • All servers must have the same drive configuration (type, size and firmware)
  • SSDs must have power loss protection (enterprise grade)
  • Simple pass through SAS HBA for SAS and SATA drives
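
The 4GB per TB rule for cache metadata is simple arithmetic, and a small helper makes it easy to check against a proposed build. This is only a sketch: the function name and the gb_per_tb parameter are my own labels, not anything from Microsoft tooling.

```python
def s2d_metadata_ram_gb(cache_drive_sizes_tb, gb_per_tb=4):
    """Return the RAM (in GB) to set aside on one server for cache metadata.

    cache_drive_sizes_tb: sizes of the cache drives in ONE server, in TB.
    gb_per_tb: the 4GB-per-TB rule of thumb quoted above (an assumption
               if your guidance differs).
    """
    return sum(cache_drive_sizes_tb) * gb_per_tb


# The example from the list above: 2 x 1TB SSD cache drives per server.
print(s2d_metadata_ram_gb([1, 1]))  # 8
```

Remember this RAM is in addition to whatever the operating system and virtual machines on the host need.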

Things to Note

  • The cache layer is completely consumed by Cluster Shared Volume and is not available to store data on
  • Microsoft's recommendation is to make the number of capacity drives a multiple of the number of cache drives e.g. 2 x SSD per server then either 4 x HDD or 6 x HDD per server
  • Microsoft recommends a single Storage Pool per cluster e.g. all the disks across 4 x Hyper-V Hosts contribute to a single Storage Pool
  • For a 2 x Server deployment the only resilience choice is a two way mirror.  Essentially data is written to two different HDD in two different servers, meaning your usable capacity is reduced by 50%.
  • For a 3 + Server deployment Microsoft recommends a three way mirror.  Essentially three copies of data are kept across 3 x HDD on 3 x Servers, reducing usable capacity to 33%.  You can use single parity (à la RAID 5) but Microsoft do not recommend this.
  • Typically a 10% cache to capacity ratio is recommended e.g. 4 x 4TB capacity drives give 16TB of capacity, so 2 x 800GB SSD (1.6TB) should be used for cache.
  • When the Storage Pool is configured Microsoft recommend leaving 1 x HDD worth of capacity for immediate in-place rebuilds of failed drives.  So with 4 x 4TB you would leave 4TB unallocated in reserve
  • The recommendation is to limit storage capacity per server to 100TB, to reduce the resync of data after downtime, reboots or updates
  • Microsoft recommends using ReFS for Storage Spaces Direct for performance accelerations and built in protection against data corruption, however it does not support de-duplication yet.  See more details here https://docs.microsoft.com/en-us/windows-server/storage/refs/refs-overview
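
To see how the mirror overhead, the rebuild reserve and the 10% cache ratio combine, here is a rough sizing sketch. The function and field names are hypothetical, and it assumes the reserve is one capacity drive per server (the note above only gives a single 4 x 4TB example), so treat the output as a planning estimate rather than a supported figure.

```python
# Number of data copies kept by each resilience type mentioned above.
MIRROR_COPIES = {"two-way mirror": 2, "three-way mirror": 3}


def plan_pool(servers, capacity_drives_per_server, drive_tb, resilience,
              cache_ratio=0.10, reserve_drives_per_server=1):
    """Estimate pool figures in TB for a mirrored Storage Spaces Direct pool."""
    if servers < 2:
        raise ValueError("Storage Spaces Direct needs at least two servers")
    if resilience == "three-way mirror" and servers < 3:
        raise ValueError("a three-way mirror needs at least three servers")
    copies = MIRROR_COPIES[resilience]

    raw_per_server = capacity_drives_per_server * drive_tb
    raw_pool = servers * raw_per_server
    # Assumption: one drive's worth of capacity is left unallocated per server.
    reserve = servers * reserve_drives_per_server * drive_tb
    return {
        "raw_pool_tb": raw_pool,
        "reserve_tb": reserve,
        "usable_tb": (raw_pool - reserve) / copies,
        "cache_per_server_tb": raw_per_server * cache_ratio,
    }


# 4 x servers, each with 4 x 4TB capacity drives, three way mirror.
plan = plan_pool(4, 4, 4, "three-way mirror")
for name, value in plan.items():
    print(f"{name}: {value:.1f}")
```

With those inputs the raw pool is 64TB, 16TB is held in reserve and roughly 16TB is usable, with 1.6TB of cache per server (2 x 800GB SSD).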

Windows Server 2016 – Role Upgrades

On 19th October 2016, Microsoft clarified what can and cannot be upgraded in-place from Windows Server 2012 and 2012 R2 to Windows Server 2016.

The applications/services which cannot be directly upgraded are:

  • Active Directory Federation Services
  • Hyper-V
  • Print and Fax Services

More details can be found in the Server Role Upgrade and Migration Matrix for Windows Server 2016.

| Server Role | Upgradeable from Windows Server 2012 R2? | Upgradeable from Windows Server 2012? | Migration Supported? | Can migration be completed without downtime? |
|---|---|---|---|---|
| Active Directory Certificate Services | Yes | Yes | Yes | No |
| Active Directory Domain Services | Yes | Yes | Yes | Yes |
| Active Directory Federation Services | No | No | Yes | No (new nodes need to be added to the farm) |
| Active Directory Lightweight Directory Services | Yes | Yes | Yes | Yes |
| Active Directory Rights Management Services | Yes | Yes | Yes | No |
| DHCP Server | Yes | Yes | Yes | Yes |
| DNS Server | Yes | Yes | Yes | No |
| Failover Cluster | Yes with Cluster OS Rolling Upgrade process which includes node Pause-Drain, Evict, upgrade to Windows Server 2016 and rejoin the original cluster. Yes, when the server is removed by the cluster for upgrade and then added to a different cluster. | Not while the server is part of a cluster. Yes, when the server is removed by the cluster for upgrade and then added to a different cluster. | Yes | No for Windows Server 2012 Failover Clusters. Yes for Windows Server 2012 R2 Failover Clusters with Hyper-V VMs or Windows Server 2012 R2 Failover Clusters running the Scale-out File Server role. See Cluster OS Rolling Upgrade. |
| File and Storage Services | Yes | Yes | Varies by sub-feature | No |
| Hyper-V | Yes. (When the host is part of a cluster with Cluster OS Rolling Upgrade process which includes node Pause-Drain, Evict, upgrade to Windows Server 2016 and rejoin the original cluster.) | No | Yes | No for Windows Server 2012 Failover Clusters. Yes for Windows Server 2012 R2 Failover Clusters with Hyper-V VMs or Windows Server 2012 R2 Failover Clusters running the Scale-out File Server role. See Cluster OS Rolling Upgrade. |
| Print and Fax Services | No | No | Yes (Printbrm.exe) | No |
| Remote Desktop Services | Yes, for all sub-roles, but mixed mode farm is not supported | Yes, for all sub-roles, but mixed mode farm is not supported | Yes | No |
| Web Server (IIS) | Yes | Yes | Yes | No |
| Windows Server Essentials Experience | Yes | N/A – new feature | Yes | No |
| Windows Server Update Services | Yes | Yes | Yes | No |
| Work Folders | Yes | Yes | Yes | Yes from WS 2012 R2 cluster when using Cluster OS Rolling Upgrade. |

Credit to Mike Brannigan for bringing this to my attention.

vSphere 5.x Space Reclamation On Thin Provisioned Disks

Space reclamation can be performed either on vSphere after a Storage vMotion has taken place or when files have been deleted from within a guest operating system.

With the release of LeftHand OS 12.0, as covered in my post ‘How To: HP StoreVirtual LeftHand OS 12.0 With T10 UNMAP’, I thought it would be an idea to share the process of space reclamation within the guest operating system.

The reason for covering space reclamation within the guest operating system is that I believe it’s the more common scenario in business as usual operations.  Space reclamation on vSphere and Windows is a two step process.

  • Zero the space in the guest operating system if you are running Windows Server 2008 R2 or below.
    • UNMAP is enabled automatically in Windows Server 2012 and above
    • If VMDK is thin provisioned you might want to shrink it back down again
  • Zero the space on your VMFS file system

I’m going to run space reclamation on a Windows Server 2008 R2 virtual machine called DC01-CA01, which has the following storage characteristics:

Original Provisioned Space

  • Windows C: Drive – 24.9GB free space
  • Datastore – 95.47GB free space
  • Volume – 96.93GB consumed space
    • 200GB Fully Provisioned with Adaptive Optimisation enabled

Next I’m going to drop two files onto the virtual machine which total 2.3GB in space.  This changes the storage characteristics of DC01-CA01 to the following:

Increased Provisioned Space

  • Windows C: Drive – 22.6GB free space
    • 2.3GB increase in space usage
  • Datastore – 93.18GB free space
    • 2.29GB increase in space usage
  • Volume – 99.22GB consumed space
    • 2.29GB increase in space usage

Sdelete

Next I deleted the files from the C: Drive on DC01-CA01 and emptied the recycle bin, then ran sdelete with the command ‘sdelete.exe -z C:’.  This takes a bit of time, so I’m going to make a cup of tea!

WARNING: Running Sdelete will increase the size of the thin provisioned disk to its maximum size.  Make sure you have space to accommodate this on your volume(s).

VMKFSTools

Now sdelete has finished, we need to run vmkfstools on the datastore to shrink the thin provisioned VMDK back down to size. To do this the virtual machine needs to be powered off.

SSH into the ESXi Host and CD into the directory in which your virtual machine resides.  In my case this is cd /vmfs/volumes/DC01-NODR01/DC01-CA01

Next run the command ls -lh *.vmdk which shows the space being used by the virtual disks.  It currently stands at 40GB.

Next we want to get rid of the zero blocks in the VMDK by issuing the command vmkfstools --punchzero DC01-CA01.vmdk

Now that’s done let’s check our provisioned space to see what is happening.

Interim Provisioned Space

  • Windows C: Drive – 24.9GB free space
    • Back to the original size
  • Datastore – 95.82GB free space
    • 0.35GB less consumed than originally
  • Volume – 121.35GB consumed space
    • 24.42GB increase from the original size!

So what’s going on then?  Well, Windows is aware that blocks have been deleted and this has been passed on to the VMFS file system, where the VMDK has been shrunk using the vmkfstools --punchzero command.  However, no one has told my HP StoreVirtual that it can reclaim the space and allocate it back out again.

The final step is to issue the vmkfstools -y 90 command.  More details about this command are covered in Jason Boche’s excellent blog post entitled ‘Storage: Starting Thin and Staying Thin with VAAI UNMAP’.

Note: vmkfstools -y was deprecated in ESXi 5.5 and replaced with esxcli storage vmfs unmap -l datastorename.  See VMware KB2057513 for more details.

WARNING: Running vmkfstools -y 90 will create a balloon file on your VMFS datastore.  Make sure you have space to accommodate this on your datastore and that no operations will happen that could drastically increase the size of the datastore whilst the command is running
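
To get a feel for how big that balloon file could be, here is a quick back-of-envelope calculation. It assumes the number passed to vmkfstools -y is a percentage of the datastore's free space, which is my understanding of the command; the helper name is purely illustrative.

```python
def balloon_file_gb(datastore_free_gb, percent):
    """Estimate the temporary balloon file size created by vmkfstools -y.

    Assumes 'percent' is the percentage of FREE datastore space to reclaim.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return datastore_free_gb * percent / 100


free_gb = 95.82  # datastore free space from the interim check above
balloon = balloon_file_gb(free_gb, 90)
print(f"Balloon file: {balloon:.2f}GB, leaving {free_gb - balloon:.2f}GB free")
# Balloon file: 86.24GB, leaving 9.58GB free
```

On my datastore that leaves under 10GB of headroom while the command runs, which is why nothing else should be growing on the datastore at the same time.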

One final check of provisioned space now reveals the following:

Final Provisioned Space

  • Windows C: Drive – 24.9GB free space
    • Back to the original size
  • Datastore – 95.81GB free space
    • 0.34GB less consumed than originally
  • Volume – 95.04GB consumed space
    • 1.89GB decrease from the original size
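
The figures from the four checks can be lined up against the original values to show which layer moved at each step. This is just a sketch using the numbers recorded in this post; the dictionary keys and function name are labels I have made up for the example.

```python
# Measurements in GB taken at each check in this post.
STAGES = {
    "Original":  {"c_drive_free": 24.9, "datastore_free": 95.47, "volume_consumed": 96.93},
    "Increased": {"c_drive_free": 22.6, "datastore_free": 93.18, "volume_consumed": 99.22},
    "Interim":   {"c_drive_free": 24.9, "datastore_free": 95.82, "volume_consumed": 121.35},
    "Final":     {"c_drive_free": 24.9, "datastore_free": 95.81, "volume_consumed": 95.04},
}


def change_from_original(stage):
    """Return the change in each measurement relative to the Original check."""
    original = STAGES["Original"]
    return {key: round(value - original[key], 2)
            for key, value in STAGES[stage].items()}


for stage in ("Increased", "Interim", "Final"):
    print(stage, change_from_original(stage))
```

The Interim row is the interesting one: the guest and datastore are back to roughly where they started, while the volume on the StoreVirtual is 24.42GB larger until the final UNMAP step runs.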

Final Thought

Space reclamation has three different levels: the guest operating system, the VMFS file system and the storage system.  Reclamation needs to be performed on each of these layers in turn so that the layer beneath knows it can reclaim the disk space and allocate it out accordingly.

The process of space reclamation isn’t straightforward and should be run out of hours, as each step will have an impact on the storage subsystem, especially if it’s run concurrently across virtual machines and datastores.

My recommendation is to reclaim valuable disk space out of hours to avoid potential performance or capacity problems.

vCenter: Stuck On Applying Computer Settings

Problem Description

A Windows Server 2008 R2 vCenter Server is stuck on ‘Applying computer settings’.

The event logs show the following errors:

Event 4, Security Kerberos, The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server dc01-ad01$. The target name used was GC/DC01-AD01.gascompany.com/gascompany.com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (GASCOMPANY.COM) is different from the client domain (GASCOMPANY.COM), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Event 7038, Service Control Manager, The vpxd service was unable to log on as GASCOMPANY\Service.vCenter with the currently configured password due to the following error:
The trust relationship between this workstation and the primary domain failed.

Event 7000, Service Control Manager, The VMware VirtualCenter Server service failed to start due to the following error:
The service did not start due to a logon failure.

Resolution

The key is Event 7038: the trust relationship between this workstation and the primary domain failed.  To resolve the issue perform the following steps:

  • Power off vCenter and edit settings to disconnect the Network Adapter.  By doing this, you will be able to get to the login screen.

  • Log in to vCenter using Local Credentials (in my case this was DC01-VCT01\Administrator).  Re-enable the Network Adapter and ping another server by its DNS name to confirm that the TCP/IP network stack is functioning

Now that we are in the server, it is time to resolve the actual issue, being the trust relationship with the primary domain.

  • Run CMD as Administrator
  • Type netdom resetpwd /Server:DomainControllerName /UserD:Domain\Administrator /PasswordD:*

You will be prompted to enter your password.  If all details are correct, you will see a message confirming that the machine account password has been reset.

  • Reboot your vCenter

When you log in you will now see the message ‘the trust relationship between this workstation and the primary domain failed’

  • Select Switch User and log in using Local Credentials
  • Remove vCenter from the domain and join to a Workgroup
  • Remove the vCenter Computer Object from Active Directory
  • Reboot vCenter
  • Join the domain

PowerCLI Fails To Launch

I encountered a strange issue the other day after installing PowerCLI 5.0: as soon as I launched it, it closed.

Troubleshooting

Windows Event Logs for Windows PowerShell, Application and System revealed, erm, nothing.

Checking the properties of the PowerCLI shortcut, it is launched using the target C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -psc "C:\Program Files\VMware\Infrastructure\vSphere PowerCLI\vim.psc1" -noe -c ". \"C:\Program Files\VMware\Infrastructure\vSphere PowerCLI\Scripts\Initialize-PowerCLIEnvironment.ps1\""

and starts in "C:\Program Files\VMware\Infrastructure\vSphere PowerCLI\"

Trying to run the .ps1 from PowerShell resulted in the error ‘The term 'Initialize-PowerCLIEnvironment.ps1' is not recognized as the name of a cmdlet’.

Next, I went into the Windows folder C:\Program Files\VMware\Infrastructure\vSphere PowerCLI\, right clicked Initialize-PowerCLIEnvironment.ps1 and selected Run with PowerShell.

This time, I received the error message:

‘internal Windows PowerShell error com initialization failed while reading windows PowerShell console file with error 80010106’

Resolution

Googling the issue, I came across this PowerShell forum post which suggested changing ‘Number of recent items to display in Jump Lists’ to 10.

After making this change, I was able to launch PowerCLI!
