Virtualization: Behind the scenes
I recently wrapped up my second (nearly) annual look at three leading Intel Mac virtualization products—VMware Fusion, Parallels Desktop, and VirtualBox—along with an overview piece designed to help you figure out which one best suits your needs.
If you followed my Tweets during the development of these articles, you got a glimpse behind the scenes at what went into the project: “Spent 5+ hours yesterday setting up a test that will probably only merit a couple sentences in final writeup. But it had to be done.”
Based on the response to those Tweets and some e-mailed inquiries, it appeared there was some interest in a “behind the scenes” look at just how a large comparison review/roundup like this comes together. So if you’re interested, grab an All Access pass from the bin on the right, and join me on a tour of the virtualization review production studio. I’ll try to give you some idea what it takes to put together a comparison test like this, and why it can take so long for it to finally appear online. (I started this project in mid-October, and it wrapped up in mid-January.)
After deciding to tackle this project, I had to figure out how I was going to actually do it this time around. Last time, I didn’t include much specific testing data. This time out, I wanted to change that. But what to test, and on which operating systems, and on which computers? Hard decisions, indeed. In the end, I chose my Mac Pro (2.66GHz quad core, 8GB RAM) because it’s neither the fastest nor the slowest of Apple’s Intel-powered Macs. It also had a lot of drive space available, which I was going to need.
Because readers are primarily interested in running Windows on the Mac, that was an obvious focal point. XP Pro is the longstanding speed champ in virtualization, so I included it in the testing. Windows 7, as the newcomer, also gained a seat at the table, both one- and two-CPU versions; because 64-bit computing is emerging as a standard, I included the 64-bit version of Windows 7, too. Finally, I added Ubuntu 9.10 into the mix, though not for performance testing—I mainly used it to judge how well, or not well, the various apps handled OpenGL acceleration in Linux.
Why not Vista, you might ask? Vista never really took hold during its short lifespan, so it’s not widely used. For example, one study reported that 67% of web site visitors in December of 2009 were using Windows XP, versus only 18% on Windows Vista—and nearly 6% were using Windows 7, despite it only being available since late October. Given the limited time available, it made sense to focus on XP Pro and Windows 7.
To make the testing as fair as possible, I installed a fresh copy of OS X 10.6.2 on a spare hard drive, and then installed each of the virtualization apps on that drive. For each app, I then installed the four separate operating systems, a copy of Office 2007 Pro (in Windows), and ran the requisite software updates within each of these setups. This alone took up a lot of time. It also took up a lot of drive space: when everything was set up, I had over 170GB of drive space devoted to virtual machines.
To get a sense of the performance of Windows in the various apps, we ran a suite of tests:
- WorldBench 6 (These tests were run in the Macworld Labs, not on my Mac Pro.)
- File copy tests, to gauge disk and shared folder performance.
- Boot, sleep, wake, and shut down timing tests.
- Windows Media HD playback tests, using the Coral Reef Adventure 1080p video from Microsoft’s WMV HD Content Showcase page.
- 3D gaming framerate tests.
In total, I ran 90 tests per virtualization app, spread across four different operating systems; with three apps, that’s 270 tests in total, plus the WorldBench scores (which entail running one program, and then waiting a long time to see a number of different results). You can see all of the data in this spreadsheet (Excel 2008 required), which is what I used to track the results for each and every test.
I ran each benchmark at least twice (so that’s 540 tests), and would then run them twice more if there were gross disparities in the first two results. I also had to rerun the tests whenever one of the apps was updated. (That happened three times for VirtualBox, twice for Parallels, and once for Fusion.) If you’re scoring at home, that’s a total of at least 1,620 separate tests. Ugh.
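For anyone who wants to check that math, here is a rough sketch of how the numbers add up. It assumes that every update triggered a full rerun of that app's 90 tests, each run twice, and it ignores the extra runs prompted by inconsistent results. The variable names are purely illustrative.

```python
# Tally of benchmark runs, using the figures quoted above.
# Assumption: each app update meant rerunning that app's full
# set of 90 tests, with every test run twice.

TESTS_PER_APP = 90
RUNS_PER_TEST = 2

# Number of updates released for each app during the project.
UPDATES = {"VirtualBox": 3, "Parallels": 2, "Fusion": 1}

# One pass over all three apps, each test run once.
single_pass = TESTS_PER_APP * len(UPDATES)

# The initial round, with every test run twice.
initial_round = single_pass * RUNS_PER_TEST

# Each app gets its initial round plus one full rerun per update.
total_runs = sum(
    TESTS_PER_APP * RUNS_PER_TEST * (1 + updates)
    for updates in UPDATES.values()
)

print(single_pass, initial_round, total_runs)
```

In other words, the initial round accounts for only a third of the total; the other two-thirds came from rerunning everything after vendor updates.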
The benchmark testing was, by far, the most time-consuming part of the project. Even with all the testing I did, though, I barely scratched the surface of what’s possible. I could have tested on different hardware, with different guest operating systems and with more tests. I had to cut it off somewhere.
The unused benchmarks
I ran other tests, beyond those you see here, but didn’t include their results, for a variety of reasons. For instance, a test might have been too tricky to reliably repeat, the results might not have shown anything of interest, or the results might have been hard to interpret.
As an example, I thought it would be interesting to see how RAM and CPU resources were used in the course of a test: Did any of the programs exhibit huge increases in memory consumption or out-of-control CPU usage? To measure this, I used BigTop (included with Xcode) to record CPU and memory statistics while I had Windows 7 (64-bit dual CPU) first sit for 10 minutes with two Office files open, and then run a looping Windows Media Player movie for 10 more minutes. I spent a few hours pulling this all together, and when I was done, I had some very nice-looking graphs…that really didn’t say much of anything.
In the end, I chose not to use these graphs, because they would require a lot of interpretation, and nothing in any of them struck me as alarming. Still, if I hadn’t spent the time on it, I wouldn’t have known that, so it wasn’t a complete waste of time. (If you’d like to see these charts, they’re available in this 4.4MB PDF. To understand the memory usage charts, this Apple Knowledge Base article explains Free, Wired, Active, and Inactive memory.)
Still, even running 1,620 benchmark tests shouldn’t take three months, right? You’re correct. But with a project of this scope, there are little things that crop up all the time. For example, in one Windows installation one day, audio stopped working. Debugging that took a couple hours. Another day, another Windows installation simply failed to boot at all; that took up another couple of hours.
Windows updates, too, wrought havoc on my schedule. Anytime Microsoft issued an update, I had to install it 12 times (if it affected every version of Windows that I was testing). I thought keeping a few Macs current was hard work; keeping 12 Windows installations current was a near nightmare.
Oh, and that five-hour setup I Tweeted about? I needed to test Fusion’s new migration feature, for migrating an actual PC into a virtual machine. The only problem was that I didn’t have a physical PC in the house. First I tried migrating the Boot Camp partition from my MacBook Pro, but that didn’t work. So then I had to install a hard drive in my Linux machine, install Windows on that drive, set it up to boot into Windows, run all the software updates, and then (finally) test the migration assistant. As I expected, it worked just fine—but I didn’t know that until I did the test.
Parallels, too, demanded some extra testing work. The new version supports gestures in Windows guests on Mac laptops and the Magic Mouse. Not owning a Magic Mouse, I had to install Parallels and Windows on my MacBook Pro in order to test the gesture support. Again, this took a couple hours’ time for what wound up being a sentence or two in the final review. Again, it had to be tested.
Still, come late November, I was making progress and feeling good about the reviews and the roundup. Then the real trouble started.
The worst thing that can happen when you’re working on a project like this: the vendor issues an upgrade. First it was VirtualBox: I was informed that a relatively major “dot upgrade” (from 3.0 to 3.1) was coming in late November. So I kept working on the rest of the project until 3.1 was released. I then re-ran all of its benchmarks, and updated its review.
As I finished that rework, I heard that Fusion 3.0.1 was coming out in early December. So again, when it came out, I updated its review and benchmark results. Next it was Parallels, with an update in mid-December (build 9308). Again, I worked through the benchmark suite and review for Parallels, updating the results.
By the end of December, I thought I saw some light at the end of the tunnel. But then Macworld shut down for the week between Christmas and New Year’s. By early January, the end of the project really was in sight. I spent a week or so finishing up the three reviews, confirming that I had all the details correct. Then, finally, on January 11th, I sent the entire piece in for editing.
On Friday, January 15th, though, we seemed to have finally reached the end. Final edits were in place or happening, and we were all set to go live with all of it early in the week of January 18th. Then, on Sunday the 17th, Parallels released another update (build 9310).
Yes, that meant re-running all of the Parallels tests again, which I did. But then another gotcha popped up: Some of the test results for build 9310 were inconsistent with those for build 9308, on which I’d based my review. While most of the test times were about the same from build to build, in other cases the new version was two or three times slower than the previous one.
There was no detectable pattern to those differences, so I couldn’t point to any one function that was noticeably worse. Investigating and clarifying these discrepancies will take time, so we elected not to include them in the review. I’ll continue to investigate and talk to the vendor and will publish a follow-up once I know more.
Wrapping it all up
As you can see, getting this roundup done took a lot of time and effort. I hope the results reflect it. I’ve done my best to provide real-world data on how each of these virtualization apps performs when put to the test. (In some cases, again and again and again…)
For now, though, I’m signing out of virtualization land, and happily returning to my primarily OS X cocoon—at least until this time next year, I hope!