Wednesday, August 17, 2011

vSphere Replication 1.0


With another problem comes another opportunity...

I have been working on upgrading our vSphere host hardware and migrating VMs from our old EMC Celerra NS350 to a newer HP EVA4400 (that in and of itself is worthy of its own blog post).  We had purchased the EVA two years ago to host our ERP data, and it has run with a single host accessing it ever since.

So we ordered additional disks and shelves for the EVAs at both the primary and DR datacenters.  I installed the HBAs in the hosts, added them to the fabric, created zones, created the LUNs, masked them off, etc., etc.  Everything was going great until...

I went to set up replication.  I set up array-based replication (ABR) for the first LUN - no problem.  Storage vMotioned a VM over to it and it replicated without issue.  Tried to set up replication for the second LUN - major obstacle time.  HP's replication mechanism for the EVA, Continuous Access (CA), is licensed based on capacity.  And of course, we had licensed 1TB but needed more like 12TB.  Great.  Meanwhile, there were grumblings and doubts among others on the IT team about whether CA was even the right choice for replicating this data.

Come on HP, really? Does any vendor license replication by capacity anymore?  You don't do this with LeftHand/P4000 or 3PAR arrays.  Frustrating...

Now I'll be the first to tell you that I hate, hate, hate vendor lock-in.  Technology changes so fast that whatever you're using today probably isn't what you'll be using 3, 5, or 10 years from now.  Again, a good topic that deserves its own post.  This is one reason that, as a vSphere and storage engineer, I've become a fan of host-based replication (HBR).  There are third-party products that provide this capability for virtual machines today: Veeam Backup and Replication and Quest vReplicator, just to name a couple.

But here comes vSphere 5 and SRM 5.  We'll be entitled to both when they're released.  As part of the upgrade, we'll get the capability to replicate VMs using vSphere Replication 1.0 for free.  I've started setting up a testing environment and will post my experiences with this new feature.  One thing I'm really curious about is how the bits actually get replicated.  Different arrays handle this differently.  I will have my investigative hat on at VMworld and will ask the storage vendors all the gory details.  I'll follow up with another article detailing how different vendors implement their replication (geesh, I've got a lot of writing to do!).

In the meantime, I've gathered some information on vSphere Replication 1.0, all of which is publicly available.  Exciting stuff!  Here are the details:
  • This feature is included with all editions of SRM 5
  • VMs can be replicated from any storage to any storage, including local disk
    • Replicated disks can be placed on any ESXi-compatible disks/filesystem
    • Breaks storage vendor lock-in
  • Replication is an attribute of the VM (not the LUN or some other element)
  • You can choose which VMDKs to replicate within the VM
    • In some cases you may not want to replicate the system drive/VMDK, only the data drive/VMDK
  • Disks are replicated in a "group consistent" manner
  • Does not use CBT (Changed Block Tracking) to track and replicate deltas.  Instead, VMware developed a separate technology that tracks I/O changes to VMDKs and captures them in a "PSF" or persistent state file.  It does not use VM snapshots
    • I'm not sure why they didn't leverage existing CBT technology - more details to follow
  • Initial "seed" copy can be made in advance by FTP, external disk/sneaker net, etc.
    • Saves bandwidth - great if you have a slower WAN connection and/or a large number of VMs to replicate
  • RPO can be set on a per-VM basis
    • 5 minutes to ?
    • If you need an RPO smaller than 5 minutes, you've got other challenges to face!

Some limitations:
  • VM must be powered-on
    • My guess is that the thinking here is that if a VM is powered off, it must not be critical enough to recover in a DR scenario.  I hope VMware reconsiders this one.  I don't have any of these today, but I can see the possibility of it in the future.
  • Will not replicate swap, logs, dumps
  • Will replicate VMs with snapshots.  However, snapshots will not be replicated.  Instead, the I/O from the source snapshot is written to the destination VM, effectively making the destination VM look like the source VM after collapsing the snapshot.
  • No FT VMs, linked clones, templates, physical RDMs, ISOs or floppies
  • Requires VM hardware version 7 or later

That wasn't too painful.  Here's what the (high-level) architecture looks like:
  • vRMS - vSphere Replication Management Server
    • Required at both sites
    • This is a virtual appliance (VA) imported into vCenter
  • vRA - vSphere Replication Agent
    • Required at the protected site
    • Runs on the ESXi 5 hosts
  • vRS - vSphere Replication Server
    • Runs on the recovery site
    • This too is a VA imported into the vCenter at the recovery site

Scalability info:
  • VM totals = 500 replicated (1000 total for SRM)
    • If you need to protect more than 500 VMs, not only do you have a large environment, you'll need to use ABR or find an alternative HBR solution that can scale higher (if it exists).  With an environment of that size, I'd recommend working with your VMware account representative and/or storage vendor.

For a storage geek like me this is pretty exciting stuff.  I think a lot of VMware customers, from small SMBs to mid-sized and even some larger companies, are going to benefit from this new feature.

Time to kick the tires, stay tuned!

Monday, August 1, 2011

ERROR: Cannot login vi-admin00@IPADDRESS

Don't you love it when, during a standard log review of your vSphere environment, you find an error like this that zaps the next four hours of your time?  Not!  Maybe this will save you some time.

Scenario
I had the ESXi 4.1 hosts in my vSphere cluster set up to send their syslog output to the VMware vMA appliance per Simon's excellent instructions:  Using vMA as Your ESXi Syslog Server
I recently upgraded our vSphere cluster hardware, which included a fresh installation of ESXi.
With that in mind, while reviewing tasks and events in vCenter, I noticed the error message "Cannot login vi-admin00@IPADDRESS", where IPADDRESS was the IP of the vMA system.  I found this error in all of the hosts' local events, and it occurred often.

Troubleshooting
Reading through the comments on the post above, I noticed someone else had hit the same problem, but there were no responses.  I made the "chown" change on the syslog directory, but it did not solve the problem.

I then ran the following command directly on the vMA appliance:
vilogger list --server SERVERNAME
Per the results, I found that the host was "enabled" but it had an "Authentication Failure".  This got me wondering about that vi-admin00 account in the original error message.  The vMA has a "vi-admin" local account, but what is "vi-admin00"?  I fired up the vSphere Client and logged directly in to one of the hosts.  Sure enough, the account didn't exist.

Solution
A little more investigation (er, Google searching), and I found the answer here:
How to Remove Stale Targets from vMA
Apparently, rebuilding/replacing the hosts wiped out the accounts vilogger creates, including vi-admin00!

The first step to fix this is to remove the server.  I did not need to use the "force" parameter:
sudo vifp removeserver SERVERNAME

Then add the server back in:
sudo vifp addserver SERVERNAME

Finally, re-register the host with vilogger:
vilogger enable --server SERVERNAME --numrotation 20 --maxfilesize 10 --collectionperiod 10

You'll know it worked if you get the green "Enabled" result messages.

To verify:
vilogger list --server SERVERNAME

You should see each of the three logs listed as "Enabled" and "Collecting".  I also WinSCP'ed to the system and made sure the logs were updating with new data.

Conclusion
Do this for all hosts in your cluster and you'll be back in business.  And don't forget to add this to the host rebuild/replace checklist.  It's always the little things, isn't it?
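Since the fix is the same three commands per host, here's a quick sketch of a loop that just prints the commands for a list of hosts (the host names are placeholders - substitute your own).  You could run the output on the vMA appliance, keeping in mind that "vifp addserver" will prompt for each host's root password:

```shell
# Sketch only: emit the vMA re-registration commands for each host.
# Host names below are placeholders -- substitute your own ESXi hosts.
for h in esx01 esx02; do
  echo "sudo vifp removeserver $h"
  echo "sudo vifp addserver $h"
  echo "vilogger enable --server $h --numrotation 20 --maxfilesize 10 --collectionperiod 10"
done
```

I print the commands rather than run them directly so you can sanity-check the list (and handle the interactive password prompts) before committing.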