Wednesday, April 25, 2007

lustre failover experimentation

'lconf --failover' didn't seem to work, but 'lconf --cleanup --force --service=mds /root/config.xml' did. It removed all the modules. Once I was satisfied that mds-1 was not using the device, I started the failover device, mds-2, just by running 'lconf --node mds-2 /root/config.xml'.
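The two steps above can be sketched as a small script. This is only an illustration: the two lconf commands and the config path are the ones quoted above, while the `run` wrapper, the `DRY_RUN` variable, and the function names are hypothetical additions so the sequence can be previewed without touching a live MDS.

```shell
#!/bin/sh
# Sketch of the manual MDS failover described above.
# The lconf invocations are copied from the post; DRY_RUN, run(),
# and the function names are hypothetical helpers for illustration.
CONFIG=${CONFIG:-/root/config.xml}
DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1, to be run on mds-1: stop the MDS service and unload the modules.
stop_primary_mds() {
    run lconf --cleanup --force --service=mds "$CONFIG"
}

# Step 2, to be run on mds-2: start the failover MDS.
start_failover_mds() {
    run lconf --node mds-2 "$CONFIG"
}
```

With the default DRY_RUN=1, calling stop_primary_mds or start_failover_mds only prints the command it would run. The script does not check that mds-1 has released the device; that check was done by hand here.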

On the client that was mounting the resource:

LustreError: 19533:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177507980, 0s ago) req@f7d55600 x8852471556/t0 o400->mds_UUID@mds-1_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 19533:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 29 previous similar messages
Lustre: 12:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,192.168.241.229@tcp,down,1177507956
LustreError: MDC_mds-1_mds_MNT_client-f70a0400: Connection to service mds via nid 192.168.241.229@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Changing connection for MDC_mds-1_mds_MNT_client-f70a0400 to mds-2_UUID/192.168.241.227@tcp
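The last log line is the one that shows the client switching to mds-2. A minimal sketch for pulling the new target out of the kernel log, assuming the message keeps exactly the "Changing connection for ... to ..." wording shown above (the function name failover_target is hypothetical):

```shell
# Read kernel log lines on stdin and print the target of every
# "Changing connection" message. Assumes the message format shown
# in the log above; other Lustre versions may word it differently.
failover_target() {
    sed -n 's/.*Changing connection for [^ ]* to \([^ ]*\)$/\1/p'
}
```

Something like 'dmesg | failover_target' on the client would then list each target the client has switched to.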

The share was fine from the client after the switchover.

There were no messages on the OSSes, so I checked connectivity to both MDS nodes from one of them:

oss-4:/home/cmcleay/lustre-1.4.10# lctl ping mds-2
12345-0@lo
12345-192.168.241.227@tcp
oss-4:/home/cmcleay/lustre-1.4.10# lctl ping mds-1
failed to ping 192.168.241.229@tcp: Input/output error
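The same check can be looped over both MDS nodes to see which one is answering. This is a rough sketch: it assumes 'lctl ping' exits non-zero when the ping fails (as the error above suggests, but I have not verified it on every version), and the LCTL variable and function names are hypothetical.

```shell
# LCTL can be overridden for testing; defaults to the real lctl binary.
LCTL=${LCTL:-lctl}

# Succeeds if the given host/NID answers an LNET ping.
# Assumes lctl ping returns a non-zero exit status on failure.
mds_reachable() {
    "$LCTL" ping "$1" >/dev/null 2>&1
}

# Print the first reachable host from the argument list;
# return 1 if none of them answer.
active_mds() {
    for host in "$@"; do
        if mds_reachable "$host"; then
            echo "$host"
            return 0
        fi
    done
    return 1
}
```

After the switchover above, 'active_mds mds-1 mds-2' run on oss-4 would be expected to print mds-2.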



I used the init scripts after I was satisfied with this - all worked well.
An 'ls' on the share hung at first, but it came back a little while after the failover mds-1 was started.

You need to have the config file in the right place for the init scripts to work properly.

This time there were some messages on the OSS:
Apr 25 23:40:48 oss-4 kernel: Lustre: 6:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,192.168.241.229@tcp,down,1176793786
Apr 25 23:44:11 oss-4 kernel: Lustre: 17433:0:(filter.c:3236:filter_set_info_async()) ost-beta: received MDS connection from 192.168.241.229@tcp


Then it all stopped working :(
