We have two NetApp filers in an active/active HA pair. The first filer holds the SAS drives, mostly storing the OS disks for the VMware VMs; the second filer, filer02, hosts the data virtual disks for our VMware VMs, our CIFS shares, and the Exchange databases and logs.
A few weeks ago, we started noticing a 30-45 second freeze in Outlook at 9:00am and 2:00pm. We only noticed the freeze at these times because they fell during business hours. At first we thought it was something to do with the Exchange servers: there were events related to delayed writes on the database and log LUNs. However, the issue was occurring on all the Exchange mailbox servers, so we changed the backup schedules on all the servers (we take a backup every 4 hours). The issue persisted.
We then thought it could be a network issue, even though we use dedicated switches for storage traffic. However, our VMs ran fine and we did not notice the freeze on the VMs themselves. When we checked the performance logs for the SAN, we noticed spikes in NFS and CIFS latency around 9:01am, 2:01pm, 7:01pm, and 12:00am. None of the deduplication schedules were running at any of these four times. In the syslogs of the filer we saw the following for each of the Exchange servers' iSCSI initiators at the times mentioned above. We still couldn't understand what was causing the LUNs to reset the connection.
Event:     iscsi.notice
Severity:  notice
Message:   ISCSI: Initiator (iqn.1991-05.com.microsoft:server-mbx-05.domain.com) sent LUN Reset request, aborting all SCSI commands on lun 7
Triggered: Sun Apr 27 00:01:32 PDT

Event:     iscsi.notice
Severity:  notice
Message:   ISCSI: New session from initiator iqn.1991-05.com.microsoft:srv-mbx-07.domain.com at IP addr 172.20.1.64
Triggered: Fri Apr 11 09:01:19 PDT
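If you want to retrace this part yourself, the filer-side checks are plain 7-Mode console commands. A rough sketch of how one can watch filer load and pull these notices (filer02 is our filer name; the syslog path is the standard one):

    # Sample filer load in 1-second intervals; watch the CIFS/NFS ops
    # and disk/CP utilization columns around 9:00am and 2:00pm
    filer02> sysstat -x 1

    # Read the filer syslog, where the iscsi.notice LUN reset entries land
    filer02> rdfile /etc/messages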
We changed the schedule for all our VMware backups and any other backups that ran on the hour. There was still no change in this behavior.
Then I learned about a known bug involving NetApp NFS datastores and ESXi. We applied the workaround by changing NFS.MaxQueueDepth to 64 on all our hosts, since the issue had started appearing after we had added a new host.
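For reference, this is roughly how the workaround is applied per host from the ESXi shell (a sketch, assuming ESXi 5.x; my understanding is the new value needs a host reboot to take effect):

    # Check the current NFS queue depth setting
    esxcli system settings advanced list -o /NFS/MaxQueueDepth

    # Apply the workaround value of 64
    esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64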
It still did not make a difference. Something was off and didn't make sense, because our VMs were not showing the freeze condition.
We went through every NetApp log and every VMware log on each host, but the only thing that pertained to this situation was the following in /var/log/vmkernel.log on each host at the mentioned times.
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410013247ce8 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x4100132692a8 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219b328 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001218a228 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410013242c68 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219fae8 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219e5e8 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x41001219fe68 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410012190268 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410013234fe8 3
2014-04-28T21:01:42.787Z cpu9:10653)NFSLock: 608: Stop accessing fd 0x410012190968 3
2014-04-28T21:01:51.783Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013246468 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219b328 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410012190268 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013247ce8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219fe68 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013242c68 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410012190968 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219fae8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001218a228 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x4100132692a8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x410013234fe8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001323dda8 again
2014-04-28T21:01:51.801Z cpu6:8198)NFSLock: 568: Start accessing fd 0x41001219e5e8 again
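If you need to hunt for the same pattern, the NFSLock entries are easy to pull out of the vmkernel log on each host; the timestamp filter below is just an example matching our 21:01 UTC window:

    grep NFSLock /var/log/vmkernel.log | grep "2014-04-28T21:01"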
We were about to give up and call VMware and NetApp support when I thought of aggregate snapshots. When I checked the aggregate snapshots, there it was: each aggregate snapshot was taking place at the mentioned times. I changed the schedule to off-peak hours, and the iscsi.notice events moved to the new times I had scheduled. I would never have thought aggregate snapshots could cause a freeze.
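For anyone checking their own filers: on Data ONTAP 7-Mode the aggregate snapshot schedule can be inspected and changed with snap sched -A. The aggregate name below is a placeholder, and if I recall correctly the default aggregate schedule of 0 1 4@9,14,19 (nightly at midnight, hourly at 9, 14, and 19) lines up exactly with the 9:01am, 2:01pm, 7:01pm, and 12:00am spikes we saw:

    # Show the current aggregate snapshot schedule (weekly nightly hourly@times)
    filer02> snap sched -A aggr1

    # Move the hourly aggregate snapshots to off-peak hours instead
    filer02> snap sched -A aggr1 0 1 4@2,4,6,22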