vSphere Replication has been embedded in the ESXi kernel for quite some time now. When a virtual machine performs a storage ‘write’, this is mirrored by the vSCSI filter at the ESXi Host level before it is committed to disk. The vSCSI filter sends its mirrored ‘write’ to the vSphere Replication Appliance, which is responsible for transmitting the ‘writes’ to its target, normally in a DR site.
The process is shown at a high level in the diagram below.
I’m often asked by customers if they should consider using it, given the benefits it provides, which include:
- Simplified management using hypervisor based replication
- Multi-point in time retention policies to store more than one instance of a protected virtual desktop
- Application consistency using Microsoft Windows Operating System with VMware Tools installed
- VMs can be replicated from and to any storage
- An initial seed can be performed
As an impartial adviser, I also have to point out the areas in which vSphere Replication isn’t as strong. These are the points I suggest are considered as part of any design:
- vSphere Replication relies on the vRA. If this is offline or unavailable, replication stops for all virtual machines.
- vSphere Replication requires the virtual machine to be powered on for replication to occur
- vSphere Replication is not usually as efficient as array-based replication, which often has compression and intelligence built into the replication process. If you have limited bandwidth you may violate recovery point objectives (RPOs)
- vSphere Replication will reduce the bandwidth available to other services/functions if you are using logically separated networks over 10GbE
- Note that Network IO Control can be used to prioritise access to bandwidth in times of contention, but requires Enterprise Plus licenses
- vSphere Replication requires manual routing to send traffic across a replication VLAN which increases the complexity of the environment
- vSphere Replication is limited to 200 virtual machines per Replication Appliance and 2000 virtual machines overall as detailed in VMware KB2102453
- After an unplanned failover and reprotect, vSphere Replication uses an algorithm to perform a checksum. This can result in a full sync, depending on the length of separation and the amount of data being changed.
- vSphere Replication only provides replication for powered on virtual machines
- An HA event on an ESXi Host at the Production site will trigger a full synchronisation of the virtual machines that resided on the failed host. See the vSphere Replication FAQ
The last point is, for me, a deal breaker. Let’s consider it again: if we have an ESXi Host that has a PSOD, then all of the VMs on it will require a full synchronisation.
What’s The Impact?
If we have an inter-site link of 100Mbps which has an overhead of 10%, this gives us an effective throughput of 90Mbps.
Take an average-sized VMware environment with a couple of VMs that each hold 2TB of data. Replicating both across that 100Mbps inter-site link means you are looking at over 4 days to perform a full synchronisation.
We also need to consider the impact on the rest of your VMs, which will have their recovery point objective violated while the bandwidth is being consumed by the 2 x 2TB VMs. Not exactly where you want to be!
The Maths Per 2TB VM
8Mb equals 1MB
2TB = 2,097,152 MB = 16,777,216 Mb
16,777,216 Mb / 90 Mbps = 186,414 seconds
186,414 seconds / 60 = 3,107 minutes
3,107 minutes / 60 = 51 hours 47 minutes
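If you want to plug in your own link speed and VM sizes, the arithmetic above can be sketched in a few lines of Python. The function names and the 10% default overhead are just illustrative assumptions, and the sketch uses binary units (1TB = 1,048,576MB) so that it matches the figures above.

```python
def full_sync_seconds(data_tb, link_mbps, overhead=0.10):
    """Seconds needed to send data_tb terabytes over a link of link_mbps.

    Assumes binary units (1 TB = 1024 * 1024 MB) and 8 megabits per megabyte.
    """
    megabits = data_tb * 1024 * 1024 * 8
    effective_mbps = link_mbps * (1 - overhead)
    return megabits / effective_mbps


def format_duration(seconds):
    """Render a number of seconds as whole hours and minutes."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


per_vm = full_sync_seconds(2, 100)
print(format_duration(per_vm))      # 51 hours 47 minutes
print(format_duration(2 * per_vm))  # 103 hours 34 minutes
```

Two 2TB VMs sent one after the other come out at a little over 103 hours, which is where the "over 4 days" figure comes from.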
Another good point is that the license comes as part of all editions of ESXi apart from the Essentials Kit. So as long as it meets your requirements, there is no need to purchase 3rd party or storage array based replication licenses.
Have you tested this, Craig, to see what actually happens when a host fails? The vSphere Replication documentation says “a full sync is likely to be required” and not “a full sync WILL be required”. In reality does the full sync take 2 days for 2TB or does the full sync compare the source and destination data and only send the differences?
Wouldn’t implementing something like Riverbed help reduce the network utilisation of the replication traffic between sites?
Read further down, where it states: ‘What happens to a virtual machine’s replication state if the host on which the virtual machine was running crashes during replication?
When a replicated virtual machine is powered back on, a full sync will be initiated. After the full sync has completed, regular delta syncs will continue.’
One thing to note about the “Full Sync”: despite the less than clear name, it is not copying all data again. Rather, it is comparing the source and target using checksums and copying only what is different.
For more detail on the different types of VR synchronization see this post: http://blogs.vmware.com/vsphere/2015/06/vsphere-replication-synchronization-types.html
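To illustrate the general idea of a checksum-based comparison, here is a minimal Python sketch. It is not VMware's actual algorithm: the block size, the use of SHA-256 and the function names are all assumptions made purely for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


def block_checksums(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]


def blocks_to_send(source, target, block_size=BLOCK_SIZE):
    """Return the indexes of source blocks that differ from, or are missing on, the target."""
    source_sums = block_checksums(source, block_size)
    target_sums = block_checksums(target, block_size)
    changed = []
    for index, checksum in enumerate(source_sums):
        if index >= len(target_sums) or checksum != target_sums[index]:
            changed.append(index)
    return changed


source = b"AAAABBBBCCCCDDDD"
target = b"AAAAXXXXCCCCDDDD"
print(blocks_to_send(source, target))  # [1]
```

Only the one changed block would cross the wire here, but every block on both sides still has to be read and hashed, so the comparison itself takes time that grows with the size of the disk.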
I have just done a test install of vSphere Replication, which I wanted to use to replace our EMC RecoverPoint array-based replication. I discovered a similar issue that looks like a show-stopper for us. A full sync is required for every reprotect operation! I want to be able to routinely test our DR networking by doing “planned migration” operations of production VMs to (and back from) the recovery site. With our array-based replication, a round-trip test can easily be completed in an hour or two pretty much regardless of the size of the VMs. But with vSphere Replication I have to wait 30 or 40 minutes to do a full sync in order to reprotect a 40GB VM. That means larger VMs could take all weekend! As somebody else already pointed out above, that ‘full sync’ is not retransmitting all the data but doing checksums on all the data in the VM (at both sites). That is still a big job for large VMs.
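As a rough way to size that up, the 40GB observation can be extrapolated to bigger VMs. This sketch assumes the checksum time scales linearly with VM size, which is a simplification and not something the test above confirms; the function name is made up for illustration.

```python
def estimate_checksum_hours(vm_gb, sample_gb, sample_minutes):
    """Scale an observed checksum time linearly to a VM of a different size.

    Linear scaling is an assumption; real timings depend on storage and host load.
    """
    return vm_gb / sample_gb * sample_minutes / 60


# Observed: a 40GB VM took between 30 and 40 minutes to reprotect.
for minutes in (30, 40):
    hours = estimate_checksum_hours(2048, 40, minutes)
    print(f"2TB VM at {minutes} min per 40GB: about {hours:.1f} hours")
```

On those assumptions a single 2TB VM lands somewhere between roughly one and one and a half days for each reprotect.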