2017-11-27 | Adam Boliński

NFS over RDMA

NFS over RDMA Network

Last month I was asked by a friend of mine, the CEO of a software development company, to build a low-latency I/O subsystem for Oracle Databases. This company mainly produces software for banking and insurance companies, and they needed something comparable in performance and latency to the I/O subsystems of big banks. It is very difficult to get that kind of infrastructure for a relatively small amount of money, and they asked me to do it.

The main requirements were:

  1. Easy to attach to over the network
  2. Low latency
  3. Cheap infrastructure components

Among all the possible options (SAN, dNFS, PCIe switches) I chose a low-latency network layer, an RDMA network (InfiniBand or RoCE), combined with an NFS service that supports NFS over RDMA, which gives us great performance and low latency. My idea was to put low-latency storage such as NVMe disks behind NFS over RDMA (the number of disks depends on how much bandwidth you need).

The first step was to build the InfiniBand/RoCE environment, and the cost of this step depends on your budget: you could buy Mellanox EDR (100 Gbit/s) equipment for 10k-12k USD, or Mellanox FDR (56 Gbit/s) equipment for 3k-4k USD – it is up to you. Here I must add one important piece of information: if you are hesitant or your budget is too low, you can still test an RDMA network using only HCA cards (for example 2-port ConnectX-3 cards) and copper cables connected directly (without a switch) between one or two servers. This brings the cost down to around 500 USD, so it is not a big deal.
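If you go with the back-to-back option, it is worth verifying the link before touching NFS at all. A minimal sketch, assuming an InfiniBand link and the infiniband-diags and opensm packages (with RoCE you can skip the subnet manager):

# check that the HCA is visible and what state the port is in
ibstat
ibv_devinfo | grep -e transport -e state

# for a back-to-back InfiniBand link (no switch) one node must run a subnet manager
yum install -y opensm
service opensm start

# after that both ports should report State: Active
ibstat | grep -e State -e Rate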

Before we start testing, I would like to explain briefly what RDMA is. In short, RDMA is Remote Direct Memory Access: the client can directly access a memory region on the server side, bypassing the CPU and using only the HCA's processor, so we skip a few layers of the stack and offload the work to the HCA.
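You can see this one-sided behaviour directly with the perftest tools, which issue raw RDMA reads against the remote HCA's registered memory without any filesystem in between. A quick sketch (assuming the perftest package and the 10.10.10.110 address used later in this post):

yum install -y perftest

# on the server: wait for an incoming RDMA connection
ib_read_lat

# on the client: RDMA-read from the server's memory and report the latency
ib_read_lat 10.10.10.110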

So from the hardware perspective you now know how to build an RDMA network, but what about the software layer? This is very easy: we will be using the OEL/CentOS operating system and the standard OFED yum repository, so the installation is straightforward:

public-yum-ol7.repo
[ol7_UEKR3_OFED20]
[ol7_UEKR4_OFED]
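You only need to enable the OFED channel that matches your UEK kernel. A sketch for UEK4 (pick ol7_UEKR3_OFED20 instead if you are on UEK3):

yum install -y yum-utils
yum-config-manager --enable ol7_UEKR4_OFED
# or edit /etc/yum.repos.d/public-yum-ol7.repo and set enabled=1 for that section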


Server Side
yum groupinstall "Infiniband Support"
# /etc/exports
/mount_share *(rw,async,insecure,no_root_squash)
modprobe svcrdma
service nfs start
echo rdma 20049 > /proc/fs/nfsd/portlist
cat /proc/fs/nfsd/portlist
rdma 20049
udp 2049
tcp 2049

Client Side
mount -o rdma,port=20049 10.10.10.110:/mount_share /mnt/orafiles
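It is worth double-checking that the mount really negotiated RDMA and did not silently fall back to TCP. A quick sanity check could look like this:

nfsstat -m
# the flags for /mnt/orafiles should include proto=rdma and port=20049;
# if you see proto=tcp the mount fell back to plain NFS over TCP
grep orafiles /proc/mounts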

After setting up this RDMA configuration we are ready to test and compare the performance against dNFS over 10 Gbit/s with jumbo frames. In this scenario we were using a consumer NVMe disk, a Samsung 960 EVO, and for testing we were using SLOB.
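For the dNFS side of the comparison the database needs the dNFS ODM library enabled and an oranfstab entry for the share. A minimal sketch under my setup – the server name and the local client IP below are just placeholders, adjust them to your network:

# enable the dNFS ODM library (run as the oracle software owner)
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk dnfs_on

# $ORACLE_HOME/dbs/oranfstab
server: nfsserver
local: 10.10.10.20
path: 10.10.10.110
export: /mount_share mount: /mnt/orafiles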


As you can see, the latency is similar between dNFS and NFS over RDMA. The test scenario was the same, but the number of transactions achieved in the two tests was different – we got about twice as many with NFS over RDMA: dNFS 79.4 TPS, NFS over RDMA 151.8 TPS.

What about CPU latency? It should be lower with NFS over RDMA compared to dNFS. To measure this I used eBPF (I strongly advise you to check out http://www.brendangregg.com/ebpf.html).
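I will not paste the whole tooling here, but as a sketch, the bcc collection gives you ready-made latency histograms without writing any eBPF code yourself (package name and availability depend on your distribution and kernel):

yum install -y bcc-tools
/usr/share/bcc/tools/runqlat 10 1      # scheduler (run queue) latency histogram over 10 s
/usr/share/bcc/tools/biolatency 10 1   # block I/O latency histogram on the NVMe server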

You can see that the CPU latency with dNFS is higher when comparing the two, so let's check how it looks inside the Oracle database by looking at the I/O wait histograms from a SLOB read-only run:
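The same data can be pulled straight from the database via V$EVENT_HISTOGRAM; a sketch of the kind of query behind these histograms (run it before and after the SLOB run and compare the counts):

sqlplus -s / as sysdba <<EOF
-- wait-time buckets for single-block reads, the event SLOB hammers
select wait_time_milli, wait_count
from   v\$event_histogram
where  event = 'db file sequential read'
order  by wait_time_milli;
EOF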

As you can see above, NFS over RDMA is about two times faster in latency compared to dNFS. But let's check how it looks when we try to reach the full speed of the InfiniBand/RoCE network (40 Gbit/s in my case).
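As a rough way to saturate the link, a large sequential read with fio against the mounted share does the job; the sizes and job count here are just an example:

fio --name=seqread --directory=/mnt/orafiles --rw=read --bs=1M \
    --size=8G --numjobs=4 --direct=1 --ioengine=libaio --group_reporting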

As you can see the bandwidth is around 3.65 GB/s, which is very promising. The next step of my work will be checking the new Oracle 12c functionality behind the underscore parameter _dnfs_rdma_enable.