2016-04-26 | Adam Boliński

Infiniband SR-IOV on Exadata OVM

Based on our session at Enkitec E4 in Barcelona on the pros and cons of Exadata, we prepared a live demo: SR-IOV on OVM in an Exadata environment.

Virtual hosts on Exadata with OVM are HVM, not PV. This is one of the limitations of InfiniBand SR-IOV: it cannot be used with PV guests. As a consequence, QEMU is used to emulate the hardware:

[root@exa2dbadm01 ~]# egrep "builder|qemu" /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
builder = 'hvm'
device_model = '/usr/lib/xen/bin/qemu-dm'
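
The guest type can also be cross-checked from inside the DOMU (a sketch; the virt-what package may need to be installed first, and on a Xen HVM guest it should report the xen and xen-hvm facts):

[root@exa2adm01vm02 ~]# virt-what
xen
xen-hvm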

While accessing a physical device from within a DOMU, we can see that the actual work is being done on the DOM0 machine.

Tracing disk I/O

Physical disks in DOMUs are presented as loop devices:

[root@exa2dbadm01 ~]# losetup -a | grep exa2adm01vm02.arrowecs.hub
/dev/loop4: [0803]:57999656 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img*)
/dev/loop5: [0803]:57999797 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0*)
/dev/loop6: [0803]:57999799 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexad*)
/dev/loop7: [0803]:57999798 (/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2*)

The "disk" variable actually points to symbolic links that resolve to the image files backing the loop devices above:

[root@exa2dbadm01 ~]# grep disk /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg
disk = ['file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img,xvda,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img,xvdb,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img,xvdc,w','file:/OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img,xvdd,w']
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img
lrwxrwxrwx 1 root root 62 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/cbb71de28cec45cbb5dac61020b12b44.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/System.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img
lrwxrwxrwx 1 root root 75 mar 30 23:02 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/625339924f0e48cb8d19dc107a2e0ce2.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/grid12.1.0.2.160119.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img
lrwxrwxrwx 1 root root 75 mar 30 23:03 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/3d6e30bfa5da4e9c84d02a0503c476dd.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/db12.1.0.2.160119-3.img
[root@exa2dbadm01 ~]# ls -al /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img
lrwxrwxrwx 1 root root 67 mar 30 23:04 /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/da9a404caad24d15bd92e8ebbe14c8b1.img -> /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img
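
The whole mapping can also be dumped in one pass (a one-liner sketch; note that it lists every virtual disk in the repository, not only this guest's):

[root@exa2dbadm01 ~]# for f in /OVS/Repositories/f48fa9c9a5a34638b1f52f50b53a995b/VirtualDisks/*.img; do echo "$f -> $(readlink $f)"; done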

Let’s check the filesystems and the LVM layout on the exa2adm01vm02 virtual host:

[oracle@exa2adm01vm02 ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
 24G 6.5G 17G 29% /
tmpfs 12G 164M 12G 2% /dev/shm
/dev/xvda1 496M 34M 437M 8% /boot
/dev/mapper/VGExaDb-LVDbOra1
 20G 15G 3.8G 80% /u01
/dev/xvdb 50G 14G 34G 30% /u01/app/12.1.0.2/grid
/dev/xvdc 50G 8.5G 39G 19% /u01/app/oracle/product/12.1.0.2/dbhome_1
[root@exa2adm01vm02 ~]# pvs
 PV VG Fmt Attr PSize PFree
 /dev/xvda2 VGExaDb lvm2 a-- 24,50g 508,00m
 /dev/xvdd1 VGExaDb lvm2 a-- 58,00g 1020,00m

If I create a tablespace in /u01/app/oracle/oradata, it will actually be located in the /EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/pv1_vgexadb.img file, which is accessible through the /dev/loop6 device.
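
This device chain can be sanity-checked from inside the guest (a sketch; lvs column names as in stock LVM2):

# if LVDbOra1 reports its extents on /dev/xvdd1, then writes under /u01
# end up in pv1_vgexadb.img, i.e. /dev/loop6 on DOM0
[root@exa2adm01vm02 ~]# lvs -o lv_name,vg_name,devices VGExaDb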

Let’s try to prove it. On DOM0 there is a kernel thread for each loop device:

[root@exa2dbadm01 ~]# ps aux | grep loop | grep -v grep
root 193012 0.0 0.0 0 0 ? S< Mar31 0:59 [loop4]
root 193075 0.0 0.0 0 0 ? S< Mar31 0:17 [loop5]
root 193101 0.0 0.0 0 0 ? S< Mar31 1:23 [loop6]
root 193130 0.0 0.0 0 0 ? S< Mar31 0:15 [loop7]
root 364287 0.0 0.0 0 0 ? S< Mar30 0:56 [loop0]
root 364346 0.0 0.0 0 0 ? S< Mar30 0:18 [loop1]
root 364372 0.0 0.0 0 0 ? S< Mar30 1:09 [loop2]
root 364399 0.0 0.0 0 0 ? S< Mar30 0:16 [loop3]

We will trace the [loop6] kernel thread while doing some I/O at the virtual host level.

Virtual host (exa2adm01vm02):

SQL> select file_name
 2 from dba_data_files
 3 where tablespace_name='TBS_TEST';
 
FILE_NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/RICO/datafile/o1_mf_tbs_test_cjncko3x_.dbf
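
While perf is recording on DOM0 (next step), some writes into TBS_TEST can be generated, for example (a minimal sketch; the t_io_test table name is made up):

[oracle@exa2adm01vm02 ~]$ sqlplus / as sysdba <<'EOF'
create table t_io_test tablespace tbs_test as select * from dba_objects;
insert /*+ append */ into t_io_test select * from t_io_test;
commit;
EOF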

DOM0 (exa2dbadm01):

[root@exa2dbadm01 ~]# perf record -g -p 193101
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.040 MB perf.data (~1759 samples) ]
[root@exa2dbadm01 ~]# perf report
# Events: 172 cpu-clock
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..............................
#
 53.49% loop6 [kernel.kallsyms] [k] xen_hypercall_xen_version
 |
 --- xen_hypercall_xen_version
 check_events
 |
 |--32.61%-- __blk_run_queue
 | |
 | |--86.67%-- __make_request
 | | generic_make_request
 | | submit_bio
 | | dio_post_submission
 | | __blockdev_direct_IO_bvec
 | | ocfs2_direct_IO_bvec
 | | mapping_direct_IO
 | | generic_file_direct_write_iter
 | | ocfs2_file_write_iter
 | | aio_write_iter
 | | aio_kernel_submit
 | | lo_rw_aio
 | | loop_thread
 | | kthread
 | | kernel_thread_helper
 | |
 | --13.33%-- blk_run_queue
 | scsi_run_queue
 | scsi_next_command
 | scsi_end_request
 | scsi_io_completion
 | scsi_finish_command
 | scsi_softirq_done
 | blk_done_softirq
 | __do_softirq
 | call_softirq
 | do_softirq
 | irq_exit
 | xen_evtchn_do_upcall
 | xen_do_hypervisor_callback
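
Note the ocfs2_* frames: the [loop6] kernel thread is writing to the backing image file through the filesystem hosting /EXAVMIMAGES. This can be confirmed on DOM0 (df -T prints the filesystem type, which should report ocfs2 here, matching the call graph):

[root@exa2dbadm01 ~]# df -T /EXAVMIMAGES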

Tracing the network

With network traffic we can observe a situation similar to the physical device emulation above. When transferring data to and from a virtual machine over Ethernet, we can see that DOM0 does the actual work with the xen_netback driver.

To measure this, I’ll transfer a file between a cell storage server and the virtual guest over the admin network, and during this operation I’ll measure the network traffic with the nettop.stp script (a SystemTap script provided at https://sourceware.org/systemtap/examples/).
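
On DOM0, the script runs under SystemTap like this (a sketch, assuming systemtap and matching kernel debuginfo are installed on DOM0):

[root@exa2dbadm01 ~]# stap -v nettop.stp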

From Virtual Guest:

[root@exa2adm01vm02 ~]# scp 10.8.8.53:*.rpm .
root@10.8.8.53's password:
cell-12.1.2.3.0_LINUX.X64_160207.3-1.x86_64.rpm

At the DOM0:

PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
35476 0 eth0 16351 0 960 0 netback/2
 0 0 eth0 0 14750 0 281612 swapper
 0 0 vif6.0 14747 0 281813 0 swapper
376527 0 eth0 0 181 0 3567 perl
376527 0 vif6.0 181 0 3569 0 perl
 0 0 eth1 0 12 0 0 swapper
338283 0 eth0 0 11 0 194 LGWRExaWatcher.
338283 0 vif6.0 11 0 194 0 LGWRExaWatcher.
39142 0 eth0 0 10 0 200 python
39142 0 vif6.0 10 0 200 0 python

At the top we can see the netback/2 process, which is responsible for transmitting the data. Below is the perf call graph of the netback/2 process while it transmits data through the public network:

[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~831 samples) ]
 
[root@exa2dbadm01 ~]# perf report
# Events: 86 cpu-clock
#
# Overhead Command Shared Object Symbol
# ........ ......... ................. ..............................
#
 69.77% netback/2 [kernel.kallsyms] [k] xen_hypercall_grant_table_op
 |
 --- xen_hypercall_grant_table_op
 |
 |--98.33%-- xen_netbk_rx_action
 | xen_netbk_kthread
 | kthread
 | kernel_thread_helper
 |
 --1.67%-- xen_netbk_tx_action
 xen_netbk_kthread
 kthread
 kernel_thread_helper

But when we record the netback/2 process while the same file is transferred over the InfiniBand address, we will see that no work has been captured. The transfer command below is only a sketch (the IB address is a placeholder for the cell’s actual IB IP):
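
[root@exa2adm01vm02 ~]# scp 192.168.10.53:*.rpm .    # hypothetical IB address; substitute the cell's real one

Meanwhile, on DOM0: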

[root@exa2dbadm01 ~]# perf record -g -p 35476
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (~491 samples) ]
 
[root@exa2dbadm01 ~]# perf report

The perf.data file has no samples!

This is because of the SR-IOV implementation in OVM. On DOM0 we can see one PF (physical function) and 16 VFs (virtual functions) of the InfiniBand PCI card:

[root@exa2dbadm01 ~]# lspci | grep -i infiniband
19:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
19:00.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:00.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.1 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.2 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.3 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.4 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.5 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.6 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:01.7 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
19:02.0 InfiniBand: Mellanox Technologies MT25400 Family [ConnectX-2 Virtual Function] (rev b0)
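
The VF count is a property of the mlx4_core driver configuration on DOM0 (a sketch; it assumes the num_vfs module parameter is exported read-only through sysfs, as in stock mlx4_core builds):

# should report 16 here, matching the lspci listing above
[root@exa2dbadm01 ~]# cat /sys/module/mlx4_core/parameters/num_vfs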

In the VM configuration file there is information about assigning the PCI address of the InfiniBand card to the virtual machine:

[root@exa2dbadm01 ~]# grep "ib_" /EXAVMIMAGES/GuestImages/exa2adm01vm0[1-2].arrowecs.hub/vm.cfg
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm01.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pfs = ['19:00.0']
/EXAVMIMAGES/GuestImages/exa2adm01vm02.arrowecs.hub/vm.cfg:ib_pkeys = [{'pf':'19:00.0','port':'1','pkey':['0xffff',]},{'pf':'19:00.0','port':'2','pkey':['0xffff',]},]
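
Inside the guest, the pkey actually applied to the VF can be read back through the standard InfiniBand sysfs tree (a sketch; the mlx4_0 device name and pkey index 0 are assumptions):

# expected to show 0xffff, matching ib_pkeys in vm.cfg
[root@exa2adm01vm02 ~]# cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/0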

Although the parameters are the same for both DOMUs, each virtual guest is assigned only one exclusive VF at system startup:

[root@exa2dbadm01 ~]# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 376916.6
exa2adm01vm01.arrowecs.hub 2 49152 8 -b---- 365576.8
exa2adm01vm02.arrowecs.hub 6 12288 2 -b---- 276681.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm01.arrowecs.hub
Vdev Device
04.0 0000:19:00.1
[root@exa2dbadm01 ~]# xl pci-list exa2adm01vm02.arrowecs.hub
Vdev Device
04.0 0000:19:00.2
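
From inside a DOMU, the passed-through VF shows up at the virtual slot from the Vdev column (a sketch of the check; in an HVM guest the VF typically appears as 00:04.0):

[root@exa2adm01vm02 ~]# lspci | grep -i mellanox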

After those assignments I have only 14 VFs left:

[root@exa2dbadm01 ~]# xl pci-list-assignable-devices
0000:19:00.3
0000:19:00.4
0000:19:00.5
0000:19:00.6
0000:19:00.7
0000:19:01.0
0000:19:01.1
0000:19:01.2
0000:19:01.3
0000:19:01.4
0000:19:01.5
0000:19:01.6
0000:19:01.7
0000:19:02.0

So this is actually a limit on how many virtual machines I can run in the Exadata OVM environment.
On X4, X5, and X6 you have 63 VFs.
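
A quick way to check the remaining headroom (each started DOMU consumes one VF from this pool; the count matches the 14 entries listed above):

[root@exa2dbadm01 ~]# xl pci-list-assignable-devices | wc -l
14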