Hey OP! I may be able to offer some suggestions. After reading through your post, I'll say that replacing this kind of infrastructure is exactly what the Scale team built Scale Computing for. With Scale clusters at both sites, you could radically streamline your processes, eliminate the SAN arrays entirely, eliminate your layer 2 iSCSI SAN subnets, use the built-in replication between the sites, and do it all for less than you would spend on a refresh of this infra. You could also leverage Acronis to send additional backups to literally ANY cost-effective storage location (up to 5 simultaneous storage locations).
If you happen to be interested, I'm happy to help get you connected with the team. Just send a private message my way!
Or, if you'd like to contact them directly, you can do so here: https://www.scalecomputing.com/contact
Good luck in your search! And, feel free to reach out.
I've recently started working at a place where I'm trying to plan out the IT strategy for business continuity.
The portion of this that I'm looking at right now is with server uptime, accessibility, and recovery. That is, I'm looking at fault tolerance, disaster recovery, and backups. (EDIT: Failover, not fault tolerance.)
Currently, we have very little in the way of DR or failover, and our backups are in poor shape.
We have two data centers located 25 miles apart.
In each location, we have Cisco UCS blade clusters with vSphere 6.7 (soon to be 7), running 95% Win Server. The SAN at each place is a Nimble iSCSI array that's several years old. At the primary site, it's a hybrid array, and at the secondary site it's SAS/NL-SAS only and significantly smaller. The Nimbles are replicating some of the most-critical datastores from the primary to the secondary. We have two new matching & much larger Nimble HF40 iSCSI arrays that will be put in place this week.
Backups are currently being handled by Avamar, going directly to a DataDomain 6300, which replicates to a matching unit at the secondary site.
The two sites are linked by two 1 Gbps circuits, and we may upgrade one or both of those later this year or early next year. We have about 60 VMs (all pets), but that's all production. We'll be putting a Dev > Test > Prod structure in place this year, so the number of VMs is about to go way up. That doesn't include potential failover clusters or VDI (which we'll look at next year). There is nothing in the cloud, although email & SharePoint will be moved to O365 soon. New VMs are currently deployed manually from templates.
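For a rough sense of what one of those 1 Gbps circuits can carry, here's a back-of-the-envelope sketch. The function names, the 70% usable-throughput default, and the 2 TB example are assumptions for illustration only, not measurements from this environment; real replication traffic also depends on compression, dedupe, and whatever else shares the link.

```python
def daily_capacity_tb(link_gbps: float, usable_fraction: float = 0.7) -> float:
    """Terabytes (decimal, 10**12 bytes) a link can move in 24 hours.

    usable_fraction is an assumed share of line rate left after protocol
    overhead and other traffic; 0.7 is a placeholder, not a measured value.
    """
    if link_gbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("link_gbps must be > 0 and usable_fraction in (0, 1]")
    usable_bits_per_sec = link_gbps * 1e9 * usable_fraction
    return usable_bits_per_sec * 86_400 / 8 / 1e12


def transfer_hours(data_tb: float, link_gbps: float, usable_fraction: float = 0.7) -> float:
    """Hours needed to move data_tb terabytes across the link."""
    if data_tb < 0:
        raise ValueError("data_tb must be >= 0")
    # 24 hours scaled by the share of a day's capacity that the data needs
    return 24 * data_tb / daily_capacity_tb(link_gbps, usable_fraction)


if __name__ == "__main__":
    # Hypothetical example: 2 TB of changed data over a single 1 Gbps circuit
    print(f"Daily capacity: {daily_capacity_tb(1):.2f} TB")
    print(f"Time for 2 TB:  {transfer_hours(2, 1):.1f} h")
```

At full line rate a 1 Gbps circuit tops out at 10.8 TB per day, so the daily change rate of the replicated VMs plus the backup copy traffic has to fit comfortably under that ceiling.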
We have a dozen VAS VULs to determine how we want to implement Veeam, and we'll order more once we've got that sorted out.
Initially, my thought was that we would stop replicating SAN snapshots and instead use Veeam CDP to replicate VMs to the new SAN at the secondary site for the best RPOs & RTOs. We would then bring the old SAN from the secondary site to the primary site, use both old SANs to land primary short-term backups, copy weekly/monthly backups to the primary DD 6300, and let it replicate to the secondary DD 6300, getting rid of Avamar entirely. I think this is a straightforward path to getting DR & backups taken care of (and it is basically what I put in place at my last company, except with Zerto + Veeam).
However, this doesn't address the possibility of reducing the likelihood of a disaster through failover clustering. It also leaves out the possibility that a faster, more automated way of deploying VMs could play a role in DR.
For clustering, I'm not sure what the best route may be. We have app, file, & SQL servers that would need to be clustered. This requires shared storage, or at least storage that appears to Windows to be shared. Given the infrastructure we have, what are our options? I don't know if the Nimble's snapshot replication would be ideal for this. Would we be better off with something like VMware vSAN, StarWind VSAN, SIOS DataKeeper, or something else?
For DR, do any of you have rules you follow for which servers get replicated vs which get rebuilt?
I'm open to suggestions on each and every part of this.