How We Migrated to Terremark’s Cloud, Part II
Created at: 19.05.2011 18:38, source: Engine Yard Blog, tagged: Technology migration Terremark
This is the second of two posts about the migration from our legacy infrastructure to xCloud on Terremark. You can find the first post here.
Step 4: VM Creation
After the archives for a complete grouping of VMs reached the Terremark migration server, we could create VMs. When creating VMs at Terremark, we normally use their vCloud-compatible API. It supports creating VMs from pre-defined templates and cloning existing VMs. Neither option was a good fit for us because we had a specific root filesystem. What we really wanted was to boot a “bare” VM with blank disks, lay down the root filesystem and the contents of the other archives, and then run a VM-specific configuration script.
The Terremark management UI allows creation of blank VMs, and we noticed that such VMs tried to PXE boot. PXE booting sounded like the ideal route because that’s how our legacy Xen hosts normally booted. Unfortunately, the API didn’t support creating blank VMs. Even if it had, we would have had to maintain the necessary PXE infrastructure in many places due to other aspects of how the API works.
We worked with Terremark professional services to create a custom process that would let us create blank VMs where they needed to be and maintain only one PXE setup. We needed professional services for these items:
- Creating VMs with a specific name, CPU, memory, and disk configuration on the same layer 2 network segment as the PXE setup running on the migration server.
- Provisioning VMs by moving configured VMs to their final home. This included placing them in the right VMware VDC and internal network, and hooking them up in the Terremark management UI so they could be managed through the UI and API.
Step 5: VM Configuration
The fetched and sourced script came from this mustache template. This step summarizes the interesting parts; the database-related parts are covered in Step 6.
The script kept the migration service updated during its execution with calls to the state_update and message_update functions. This let the migration service UI show state changes and log messages, and hold other VMs back from progressing if one in their group had failed. The script ran with the errexit shell option, which meant that if any command exited with a non-zero status and wasn’t checked by an if or similar statement, the script exited. The trap at line 2 told the migration service that an error had occurred.
To start, the script mounted a couple of NFS shares at the migration service’s IP. These shares contained the filesystem archives for the VMs being migrated and support data, such as a new kernel and modules.
The script partitioned the first disk, creating a small partition for /boot and another encompassing the rest of the disk. Because the VMs used LVM, any disks after the first had their partition tables zeroed. The second partition on the first disk and the whole of the other disks became physical volumes (PVs). A single volume group (VG) named local was set up to contain all the PVs. Using LVM on the migrated VMs was necessary to retain snapshot ability.
Logical volumes (LVs) were created to match the filesystem data from the migration document, and then formatted. For /boot, ext3 was used. For all other non-swap filesystems, we chose XFS. We were using ReiserFS on our legacy infrastructure but wanted to move to something better supported. XFS allows online growing, even for the root filesystem, and is supported by current distributions. After formatting, the filesystems were mounted in their hierarchy under /newroot.
For new, non-migrating VMs that had no archived root filesystem, our custom Gentoo distribution stage4 was used instead of a root archive.
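As a rough illustration of how the error handling and disk layout described above might fit together, here is a minimal dry-run sketch. It is not the original script: state_update and message_update are hypothetical stand-ins that print instead of calling the migration service, the choice of parted, dd, and the LVM tools is an assumption, and the disk names, partition sizes, and LV sizes are invented for the example. Every command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Dry-run sketch: commands are printed with a "+ " prefix, never executed.

# Hypothetical stand-ins for the helpers named above; the real ones reported
# to the migration service rather than printing.
state_update()   { printf 'state: %s\n' "$1"; }
message_update() { printf 'message: %s\n' "$1"; }

# Print a command instead of running it.
plan() { printf '+ %s\n' "$*"; }

# Usage: storage_layout FIRST_DISK [OTHER_DISK...]
# The body runs in a subshell so errexit and the trap stay local to it.
storage_layout() (
  set -o errexit                  # exit on any unchecked non-zero status
  trap 'state_update error' ERR   # report the failure before exiting
  [ "$#" -ge 1 ]                  # unchecked test: fails when no disk is given

  first=$1
  shift
  state_update partitioning
  plan parted -s "$first" mklabel msdos
  plan parted -s "$first" mkpart primary 1MiB 257MiB   # small /boot partition
  plan parted -s "$first" mkpart primary 257MiB 100%   # rest of the disk
  pvs="${first}2"                 # assumes /dev/sdX-style partition names
  for disk in "$@"; do
    plan dd if=/dev/zero of="$disk" bs=512 count=1     # zero the partition table
    pvs="$pvs $disk"
  done

  message_update "creating LVM volumes"
  plan pvcreate $pvs              # unquoted on purpose: one argument per PV
  plan vgcreate local $pvs
  plan lvcreate -n root -L 10G local   # hypothetical LVs and sizes
  plan lvcreate -n swap -L 1G local

  state_update formatting
  plan mkfs.ext3 "${first}1"
  plan mkfs.xfs /dev/local/root
  plan mkswap /dev/local/swap
  plan mount /dev/local/root /newroot
  plan mkdir -p /newroot/boot
  plan mount "${first}1" /newroot/boot
  state_update storage_ready
)

storage_layout /dev/sda /dev/sdb
```

Calling storage_layout with no arguments shows the failure path: the unchecked test fails, the ERR trap reports the error state, and errexit ends the subshell before any later step runs.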
Each VM archive was unpacked under /newroot/source_item, where source_item was the original path, such as /data. This produced a system identical to the snapshots and GFSes previously archived. For database VMs, the archived database dumps were copied to /newroot/database_data_dir, where database_data_dir was usually /db. They were left compressed because the database setup part of the script handled them.
The next sections of the script dealt with getting the VM ready to operate in its new Terremark environment. VMs running on our legacy infrastructure didn’t require a kernel or boot loader inside the VM. VMs on Terremark run under VMware, which is fully virtualized and requires a kernel inside the VM to boot. When we realized that we needed to change the kernel, we considered various options, including running recent vanilla kernels from kernel.org, but ultimately decided to use the CentOS 5.4 kernel. These kernels were unpacked and made available to the VMs via an NFS share. The script started by copying the kernel, source code, and modules into their appropriate locations. Our Terremark VMs run a CentOS kernel and our custom Gentoo userland.
We used a custom initrd to boot VMs. The mini system that was PXE booted to run the migration script was really a highly functional initrd. New VMs were booted with something slimmer: a modified LVM initrd script that loaded the disk and LVM modules and mounted the root filesystem. GRUB was installed as the boot loader on the new VMs. It booted the CentOS kernel using the custom initrd, telling the initrd to find the root filesystem at /dev/local/root.
After that, some higher-level configuration was done:
- A file was touched that let us and our tools see whether the VM was in the “migrating” phase.
- /etc/fstab was written.
- Puppet, the configuration management tool used on xCloud, was set up to run via init.
- The new network configuration was written to /etc/conf.d/net, /etc/resolv.conf, and other related files.
- The contents of /etc/conf.d/local.start were created.
- The contents of /firstboot.sh were created.
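To make the /etc/fstab item in the list above concrete, here is a minimal sketch of how such a file could be generated for the layout described in this step (ext3 for /boot, XFS elsewhere, LVs in the VG named local). The fstab_line and write_fstab helpers, the device names, the /data mount, and the mount options are all assumptions for illustration; the real script derived its entries from the migration document.

```shell
#!/usr/bin/env bash
# Print one tab-separated /etc/fstab line.
# Usage: fstab_line DEVICE MOUNTPOINT FSTYPE
fstab_line() {
  local device=$1 mountpoint=$2 fstype=$3 pass=2
  if [ "$fstype" = swap ]; then
    printf '%s\tnone\tswap\tsw\t0\t0\n' "$device"
    return
  fi
  # Conventional fsck order: 1 for the root filesystem, 2 for the rest.
  if [ "$mountpoint" = / ]; then
    pass=1
  fi
  printf '%s\t%s\t%s\tdefaults\t0\t%s\n' "$device" "$mountpoint" "$fstype" "$pass"
}

# Hypothetical layout; the real list came from the migration document.
write_fstab() {
  fstab_line /dev/sda1 /boot ext3
  fstab_line /dev/local/root / xfs
  fstab_line /dev/local/data /data xfs
  fstab_line /dev/local/swap none swap
}

# Printed here; a real run would redirect this to /newroot/etc/fstab.
write_fstab
```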
Step 6: VM Provisioning
Terremark shut down any VMs listed at the “needs provisioning” endpoint of the migration service, moved them to their final, customer-specific networks, and booted them. When a Gentoo VM boots, /etc/conf.d/local.start is run at the end of the boot order. For non-database VMs, this script checked back in with the migration service a final time to say “all done.” For database VMs, it loaded the customer’s database dumps and established replication before reporting in.
Much of the first-boot script deals with MySQL. Because most of our customer databases use MySQL, it was the most automated part of the migration process. PostgreSQL was handled with scripts outside the process and later by Puppet.
For MySQL, the process started with creating a config for importing data. It’s nothing special and not necessarily tuned for the VM’s makeup, but it is good enough to start with. Puppet came around later to tailor the config to the VM. Then, the bulk of the work was between lines 543 and 568. Most important to the migration process were the data load and the replication configuration. Database archives that had been copied to database VMs were decompressed with gunzip and piped into mysql -B. The data load was done simultaneously on the master and the replica and used set sql_log_bin=0 to prevent the master from writing binary logs, because replication was set up later. After loading the data, the master was done. The replica used the hold_until_complete function to query the migration service for the status of its master. After the master reported as “complete,” the replica used the establish_mysql_replication function to set up replication.
Step 7: Testing and Verification
After the VMs were configured and provisioned, it was time to test the new setup. The migration team had a collection of tools for making common application-related configuration changes, such as hostnames. Once that was complete, an initial test was as simple as accessing the site at the new IP. The migration team created a simple gem, eymigrate, to help with customer testing. During this phase, the migration team and customer were comfortable doing anything that might impact shared assets or database data because they would be reset during cutover. With iptables rules in place to prevent general outbound access from the VMs, we assured customers that no external production services could be contacted by the VMs being verified. When the customer was satisfied with their new setup, we proceeded to the cutover.
Step 8: Cutover
Before the cutover, there were a few things to take care of:
- Because public IPs were changing, we worked with our customers to lower DNS TTLs. We did this directly for customers whose domains we hosted.
- Rsync was used to “catch up” the customer’s shared assets, syncing from the still-live VMs on our legacy infrastructure to the new VMs. This reduced the downtime for the final sync during the cutover. We shortened the sync times by excluding directories containing transient session files or files only used for local processing.
- To cut over databases, we had two options: complete-dump-and-restore or cutover-via-replication. For smaller databases, dump-and-restore was simple and worked fine. Larger databases required converting the new master database VM into a replica of the current master on our legacy infrastructure. Using ssh tunnels between the VMs, our DBAs established replication and got the new master (acting as a replica) in sync, so the cutover just involved breaking the replication link and converting the new master back to a master.
- The migration team and customer agreed on the date and time to perform the cutover.
At cutover time:
- A maintenance page was posted and the application running on our legacy infrastructure was shut down.
- The database procedure was followed.
- The application was started on the new VMs.
- Necessary DNS changes were made. To complement this, we used iptables to create rules to forward traffic from each of the customer’s public IPs on our infrastructure to the corresponding public IP at Terremark.
- The “migrating” file was removed.
- Puppet was run to start things like cron.
- The iptables rules limiting outbound access were removed.
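Two of the preparation steps, the rsync catch-up and the replication-based database cutover, lend themselves to a short sketch. The helpers below only print the commands and SQL that might be used; their names, the hostnames, the excluded directories, the tunnel port, the credentials, and the binary log coordinates are all hypothetical, and the use of --delete is an assumption rather than something the post states.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints shell commands and SQL instead of running them.

# Pull shared assets from a legacy VM, skipping the given directories.
# Usage: catch_up_cmd LEGACY_HOST PATH [EXCLUDE...]
catch_up_cmd() {
  local host=$1 path=$2 cmd="rsync -az --delete" pattern
  shift 2
  for pattern in "$@"; do
    cmd="$cmd --exclude=$pattern"
  done
  printf '%s %s:%s/ %s/\n' "$cmd" "$host" "$path" "$path"
}

# Forward a local port to MySQL on the legacy master over ssh.
# Usage: tunnel_cmd LEGACY_MASTER LOCAL_PORT
tunnel_cmd() {
  printf 'ssh -f -N -L %s:127.0.0.1:3306 %s\n' "$2" "$1"
}

# SQL that turns the new master into a temporary replica through the tunnel.
# Usage: attach_sql LOCAL_PORT USER PASSWORD LOG_FILE LOG_POS
attach_sql() {
  printf "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=%s,\n" "$1"
  printf "  MASTER_USER='%s', MASTER_PASSWORD='%s',\n" "$2" "$3"
  printf "  MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" "$4" "$5"
  printf 'START SLAVE;\n'
}

# SQL run at cutover to break the link so the VM acts as a master again.
detach_sql() {
  printf 'STOP SLAVE;\nRESET SLAVE;\n'
}

catch_up_cmd legacy-app1 /data tmp/sessions/ tmp/cache/
tunnel_cmd legacy-db1 3307
attach_sql 3307 repl secret mysql-bin.000042 107
detach_sql
```

Running the rsync on the new VM and pulling from the legacy host keeps one side of the copy local, which rsync requires.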
With that, traffic was live on the new setup. Downtime with a maintenance page was generally less than 10 minutes.
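The iptables usage mentioned in Steps 7 and 8 might look roughly like the following dry-run sketch. The exact rules we ran are not given in this post, so the chains, matches, and addresses here are assumptions (the IPs are documentation addresses), and the helper names are hypothetical. The forwarding rules also assume the legacy host's FORWARD chain permits the traffic.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints iptables commands instead of running them.

# Rules for a VM under verification: allow loopback, replies to existing
# connections, and one internal network; reject other outbound traffic.
# Usage: outbound_lockdown_rules INTERNAL_CIDR
outbound_lockdown_rules() {
  echo "iptables -A OUTPUT -o lo -j ACCEPT"
  echo "iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT"
  echo "iptables -A OUTPUT -d $1 -j ACCEPT"
  echo "iptables -A OUTPUT -j REJECT"
}

# Rules for a legacy host: forward traffic for an old public IP to the new one.
# Usage: forward_rules OLD_PUBLIC_IP NEW_PUBLIC_IP
forward_rules() {
  echo "sysctl -w net.ipv4.ip_forward=1"
  echo "iptables -t nat -A PREROUTING -d $1 -j DNAT --to-destination $2"
  echo "iptables -t nat -A POSTROUTING -d $2 -j MASQUERADE"
}

# Turn "append" commands read from stdin into the matching "delete" commands.
# Lines that are not iptables appends (such as the sysctl line) are dropped.
removal_commands() {
  sed -n 's/^iptables \(.*\)-A /iptables \1-D /p'
}

outbound_lockdown_rules 10.0.0.0/8
forward_rules 192.0.2.10 198.51.100.20
outbound_lockdown_rules 10.0.0.0/8 | removal_commands
```

Deriving the delete commands from the same rule text used to add them is one way to keep the removal at cutover in step with what was installed during verification.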
Thanks
My team, including Edward Muller and Lee Jensen (now with Big Cartel), worked on the migration service and customer UI, but the migrations would not have been possible without our migration team. They worked tirelessly with customers to get migrations done while meeting and exceeding the requirements listed in the first post. So, to Kevin Rutten, Matt Dolian, Will Jessop (now with 37signals), Taylor Weibley (also now with 37signals), Matt Reider and Daniel Vu, I say thank you for all the work you did for our customers and Engine Yard.
Announcement: New Engine Yard Private Cloud Infrastructure
Created at: 11.02.2010 18:30, source: Engine Yard Blog, tagged: News Engine Yard Cloud Terremark
Today is an exciting day at Engine Yard, and I wanted you to hear about it from me first. We’ve selected Terremark, a major hosting and infrastructure provider, to provide the infrastructure for our next generation private cloud services.
For Engine Yard Cloud (Amazon Web Services) customers, this move will have no impact on you whatsoever.
When we opened for business more than three years ago, racking and stacking our own hardware wasn’t really a choice: being self-funded well before the concept of cloud computing existed, doing it ourselves was the only way we could introduce our customers to our vision for application deployment and management.
How times have changed! Infrastructure vendors now agree with most of the concepts that we pioneered back then, eliminating the need for us to do it ourselves. We’ve always felt that specializing in Ruby on Rails and the surrounding stack would allow us to make deploying and scaling Rails applications as easy and efficient as it is to create those applications.
Today’s announcement will allow us to further focus on enabling our customers to leverage today’s and tomorrow’s rapidly evolving infrastructure and providing the best Rails Platform-as-a-Service technologies and support.
While there are many advantages to the Terremark infrastructure, we’re most excited about their sophisticated fibre-channel storage area network. The Terremark SAN affords greater reliability and substantially higher throughput than our current storage system; we know that our customers will see great benefit and peace of mind from this.
Terremark has an excellent track record supporting the needs of large enterprises and federal government agencies. Their datacenters have SAS 70 Type II, PCI and HIPAA certifications, and we’re confident that our private cloud customers will find that this new infrastructure meets the most demanding application requirements.
Over the next six months, we will migrate all Slice, Fractional Cluster and Dedicated Cluster environments that currently reside on the Engine Yard private cloud to Terremark.
At a high level, not much will change for our private cloud customers. In particular, I want to emphasize that there are no changes in your support team or support processes.
Based on our extensive planning with Terremark, we expect migrations to require minimal effort for our private cloud customers.
If you’re a private cloud customer, you will hear from your Engine Yard account manager in the next few weeks to discuss a migration plan that makes sense for you.
