
Tuesday, April 29, 2014

The Story of the 30-Second Freeze on a NetApp Filer Every Five Hours at 9am, 2pm, 7pm and 12am

We have two NetApp filers in an active/active HA pair. The first filer holds the SAS drives and mostly stores the OS disks for our VMware VMs; the second filer, filer02, hosts the data virtual disks for our VMware VMs, our CIFS shares, and the Exchange databases and logs.

A few weeks ago, we started noticing a 30-45 second freeze in Outlook at 9:00am and 2:00pm. We only noticed the freeze at these times because they fell during business hours. At first we thought it was something to do with the Exchange servers: there were events about delayed writes on the database and log LUNs. However, the issue was occurring on all the Exchange mailbox servers. We changed the backup schedules on all the servers (we take a backup every 4 hours), but the issue persisted.
We then thought it could be a network issue. We use dedicated switches for storage traffic, but our VMs ran fine and we did not notice the freeze on them. When we checked the performance logs for the SAN, we noticed a spike in NFS and CIFS latency around 9:01am, 2:01pm, 7:01pm and 12:00am. None of the de-dup schedules were running at any of these four times. In the syslogs of the filer we saw the following for each of the Exchange servers' iSCSI initiators at the times mentioned above.
Event: iscsi.notice
Severity: notice
Message: ISCSI: Initiator (iqn.1991-05.com.microsoft:server-mbx-05.domain.com) sent LUN Reset request, aborting all SCSI commands on lun 7
Triggered: Sun Apr 27 00:01:32 PDT

Event: iscsi.notice
Severity: notice
Message: ISCSI: New session from initiator iqn.1991-05.com.microsoft:srv-mbx-07.domain.com at IP addr 172.20.1.64
Triggered: Fri Apr 11 09:01:19 PDT

We still couldn't understand what was causing the LUNs to reset their connections.
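For reference, the equivalent checks from the Data ONTAP 7-mode CLI look roughly like this (a sketch, not exactly what we ran; filer02 is our filer name):

Watch the overall load on the filer while a freeze is happening:
filer02> sysstat -x 1

Check whether any de-dup jobs are running or scheduled around those times:
filer02> sis status
filer02> sis config

Read the iSCSI notices straight out of the filer's syslog:
filer02> rdfile /etc/messages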

We changed the schedule for all our VMware backups and any other backups that happened on the hour. There was still no change in this behavior.
Then I learned about the bug mentioned in an article about the NetApp and NFS issue. We applied the workaround by changing NFS.MaxQueueDepth to 64 on all our hosts, since the issue had started appearing after we had added a new host. It still did not make a difference. Something was off and didn't make sense, because our VMs were not showing the freeze condition.
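For reference, the change on each host was along these lines (NFS.MaxQueueDepth is the standard ESXi advanced setting; as far as I recall the host needs a reboot for the new value to take effect):

Check the current value, then lower it to 64:
~ # esxcfg-advcfg -g /NFS/MaxQueueDepth
~ # esxcfg-advcfg -s 64 /NFS/MaxQueueDepth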
We went through every NetApp log and every VMware log on each host, but the only thing that pertained to this situation was the following in /var/log/vmkernel.log on each host at the mentioned times.

2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410013247ce8  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x4100132692a8  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219b328  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001218a228  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410013242c68  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219fae8  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219e5e8  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219fe68  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410012190268  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410013234fe8  3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410012190968  3
2014-04-28T21:01:51.783Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013246468 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219b328 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410012190268 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013247ce8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219fe68 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013242c68 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410012190968 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219fae8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001218a228 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x4100132692a8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013234fe8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001323dda8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219e5e8 again
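
Spotting these across hosts is just a search of each host's log from the ESXi shell, something like:

~ # grep NFSLock /var/log/vmkernel.log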


We were about to give up and call support for VMware and NetApp when I thought of aggregate snapshots. When I checked the aggregate snapshot schedule... there it was: each aggregate snapshot was being taken at the mentioned times. I changed the schedule to off-peak hours, and the iscsi.notice events then showed up at the new times instead. I would never have thought aggregate snapshots could cause a freeze.
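For anyone chasing the same symptoms, the check and the fix on a 7-mode filer look roughly like this (a sketch; aggr1 is a placeholder for your aggregate name and the new hours are just an example of off-peak times). The stock aggregate snapshot schedule on many systems is 0 1 4@9,14,19, i.e. one nightly at midnight plus hourlies at 9am, 2pm and 7pm, which lines up exactly with the freeze times we were seeing.

Show the current aggregate snapshot schedule and the existing snapshots with their creation times:
filer02> snap sched -A aggr1
filer02> snap list -A aggr1

Move the aggregate snapshots to off-peak hours (or disable them entirely with "aggr options aggr1 nosnap on" if you don't need them):
filer02> snap sched -A aggr1 0 0 4@1,3,5,22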