Thursday, November 19, 2015

Scaling ZFS on Linux to many CPUs - Part 1

As servers have gained CPU cores, locking within OpenZFS has become a bottleneck for some highly-parallel workloads.  This posting describes my entrée to the world of high-concurrency performance testing of ZFS on Linux (ZoL).

In February 2015, I posted the first pull request for a port of Prakash Surya's multilist and ARC re-work to ZFS on Linux.  The goal was to reduce the lock contention on arcs_mtx, a single mutex embedded within each of the per-state ARC lists.  The new multilist facility provided as part of this pull request is an almost drop-in replacement for the standard linked list type list_t.  Rather than maintaining a single lock for a single linked list (per ARC state), each list was split into a number of sub-lists, each with its own mutex.

The benchmark used for testing this work consists of numerous concurrent 4K reads of 100% cached data.  In the original OpenZFS ARC implementation, with a single arcs_mtx lock, the benchmark didn't scale well as additional reader tasks were added.  There was a great deal of contention on the single mutex.  The before-and-after results for illumos are described in the illumos link shown above.
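The post doesn't reproduce the exact benchmark, but a workload of this shape can be approximated with fio.  Everything below is an assumption for illustration: the dataset path, job names, and sizes are hypothetical, and the key points are simply buffered 4K random reads over a file set small enough to stay resident in the ARC, with many parallel readers:

```ini
; Hypothetical fio job approximating the cached-read benchmark.
[global]
ioengine=psync
rw=randread
bs=4k
direct=0            ; buffered reads, so data is served from the ARC
size=1g             ; per-job file size; keep total well under RAM
runtime=60
time_based

[cached-readers]
directory=/tank/fio ; assumed mountpoint of a ZFS dataset
numjobs=32          ; scale this up to expose arcs_mtx contention
```

Running one pass first to warm the cache, then measuring subsequent passes while increasing numjobs, is what exposes the lock-scaling behavior described above.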

Given ZoL's divergent development with respect to the "upstream" OpenZFS code (from illumos), porting this patch required dealing with a number of conflicts which developed over time.  Some of the issues are documented in the final commit.

Once the code was ported and in working condition, my next step was to try to duplicate the benchmark results under Linux.  My initial results were not encouraging: the performance wasn't improved much at all and, in some cases, was even worse.  My benchmarking was also handicapped by the lack of access to sufficiently "big" hardware.  The largest system to which I had direct access was a 2x6-core Opteron (2-node NUMA system) with only 64GiB RAM.  I began using large spot instances on Amazon EC2 to run the tests, but it wasn't very convenient.  It also brought to light the differences in the behavior of locking primitives under a virtualized (Xen) environment as opposed to running on bare metal.

I was eventually put in touch with the good people at ServerCentral who, in the name of furthering the ZoL development effort, gave me access to a dedicated server with 4 E7-4850 CPUs, each of which has 10 cores and 2 threads per core.  In all, the system has 80 hardware threads, 512GiB of RAM, and a bunch of hard drives in several JBODs.  In short, it's a perfect system on which to perform this type of testing.

Using this 4xE7 system, not only was I able to find some (rather trivial) bottlenecks which greatly improved the performance of the benchmark mentioned above, but I also found several other similar bottlenecks, some of which have been fixed and some of which have not been yet.

In subsequent postings, I'll outline some of the specific bottlenecks I encountered and their fixes, if any.  Pretty much every scaling-related fix or issue I posted or commented on regarding ZoL (the zfs or spl repositories) was discovered through testing on the E7 system.

  - Tim

Sunday, June 21, 2015

Keeping up with the Kernels

As filesystems go, ZoL (zfsonlinux) is unusual in that each release should build and run under any kernel from the 2.6.32 "enterprise" version up to the current version at the time of its release.  For example, ZoL version 0.6.4 should work properly under all release kernels from 2.6.32 to 4.0.  This post is the result of some of my observations over several years of following and contributing to the ZoL project.

Monday, June 15, 2015

Triviality

Even though I plan on using this blog mainly as a ZFS sounding board, I decided to rename it to better reflect the chaotic nature of my day-to-day work and also to give an excuse for writing posts about unrelated topics.

One such bit of triviality would be the silly shirt I'm wearing in my profile photo (which, BTW, is a copy of my gravatar.com picture).  It is, in fact, a servermonkey.com shirt which was given to me by an associate who picked it up from a trade show.  The other useless piece of information about the picture is that I am, in fact, in a bowling alley.

OK, there, I posted some useless bits of triviality.  Maybe I can post some interesting ZFS stuff soon.

Sunday, June 14, 2015

Yay, I've got a blog!

Uh oh, I've started a blog, what's next?

I'm calling this one "ZFS stuff", but I suppose the most important issue to start with is where this crazy handle of "dweeezil" came from.  The answer is very simple: I've always liked the connotations of the word "weasel" and would have used it as a handle if it were not such a commonly used word.  Since I was a bit of a Zappa fan in my high school days, and I always thought his son had a cool name, I came up with "dweeezil" as a compromise and started using it as a nickname in a few places.  Among others, I'm using it on GitHub.

Oh yeah, the ZFS part: I've been trying to help improve ZFS, mainly in its incarnation of OpenZFS on Linux, for several years now and decided I'd start writing about it outside the context of its GitHub site and its issue and pull request lists.

I figure this is as good a place as any to post rants, thoughts, ideas, etc. regarding ZFS.