Thursday, November 19, 2015

Scaling ZFS on Linux to many CPUs - Part 1

As servers have gained CPU cores, locking within OpenZFS has become a bottleneck for some highly-parallel workloads.  This posting describes my entrée to the world of high-concurrency performance testing of ZFS on Linux (ZoL).

In February, 2015, I posted the first pull request for a port of Prakash Surya's multilist and ARC re-work to ZFS on Linux.  The goal was to reduce the lock contention on arcs_mtx, a single mutex embedded within each of the per-state ARC lists.  The new multilist facility provided as part of this pull request includes an almost drop-in replacement to the standard linked list type list_t.  Rather than maintaining a single lock for a single linked list (per ARC state), the lists were split up into a number of sub-lists, each of which had their own mutex.

The benchmark used for testing this work consists of numerous concurrent 4K reads of 100% cached data.  In the original OpenZFS ARC implementation, with a single arcs_mtx lock, the benchmark didn't scale well as additional reader tasks were added.  There was a great deal of contention on the single mutex.  The before-and-after results for illumos are described in the illumos link shown above.

Given ZoL's divergent development with respect to the "upstream" OpenZFS code (from illumos), porting this patch required dealing with a number of conflicts which developed over time.  Some of the issues are documented in the final commit.

Once the code was ported and in working condition, my next step was to try to duplicate the benchmark results under Linux.  My initial results were not encouraging:  The performance wasn't improved much at all and in some cases, was even worse.  My benchmarking was also handicapped by the lack of access to sufficiently "big" hardware.  The largest system which I had direct access to was a 2x6-core Opteron (2-node NUMA system) with only 64GiB RAM.  I began using large spot instances on Amazon EC2 to run the tests but it wasn't very convenient.  It also brought to light the differences in the locking primitives under a virtualized (Xen) environment as opposed to running on bare metal.

I was eventually put in touch with the good people at ServerCentral and, in the name of furthering the ZoL development effort, gave me access to a dedicated server with 4 E7-4850 CPUs, each of which has 10 cores and 2 threads per core.  In all the system has 80 threads available and backing it is 512GiB of RAM, and a bunch of hard drives in several JBODs.  In short, it's a perfect system on which to perform this type of testing.

Using this 4xE7 system, not only was I able to find some (rather trivial) bottlenecks which greatly improved the performance of the benchmark mentioned above, but I also found several other similar bottlenecks, some of which have been fix, some of which have not yet.

In subsequent postings, I'll outline some of the specific bottlenecks I encountered and their fixes, if any.  Pretty much any scaling-related fix or issue I posted or commented on regarding ZoL (zfs or spl repositories) were discovered through testing on the E7 system.

  - Tim

No comments:

Post a Comment