'lconf --failover' didn't seem to work, but 'lconf --cleanup --force --service=mds /root/config.xml' did. It removed all the modules. Once I was satisfied that mds-1 was not using the device, I started the failover device, mds-2, just by running 'lconf --node mds-2 /root/config.xml'.
On the client that was mounting the resource:
LustreError: 19533:0:(client.c:940:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177507980, 0s ago) req@f7d55600 x8852471556/t0 o400->mds_UUID@mds-1_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 19533:0:(client.c:940:ptlrpc_expire_one_request()) Skipped 29 previous similar messages
Lustre: 12:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,192.168.241.229@tcp,down,1177507956
LustreError: MDC_mds-1_mds_MNT_client-f70a0400: Connection to service mds via nid 192.168.241.229@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: Changing connection for MDC_mds-1_mds_MNT_client-f70a0400 to mds-2_UUID/192.168.241.227@tcp
The share was fine from the client after the switchover.
There were no messages on the OSSes. But I did this:
oss-4:/home/cmcleay/lustre-1.4.10# lctl ping mds-2
12345-0@lo
12345-192.168.241.227@tcp
oss-4:/home/cmcleay/lustre-1.4.10# lctl ping mds-1
failed to ping 192.168.241.229@tcp: Input/output error
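The ping check above can be wrapped in a small helper that reports which MDS node currently answers. This is only a sketch: first_alive and the PROBE variable are hypothetical names, and it assumes that 'lctl ping <node>' exits non-zero when the ping fails, as the "failed to ping" output suggests.

```shell
#!/bin/sh
# Sketch: print the first node that answers a probe command.
# PROBE defaults to "lctl ping"; override it to try this without Lustre.
# Assumption: the probe exits non-zero when the node is unreachable.
PROBE=${PROBE:-"lctl ping"}

first_alive() {
    for node in "$@"; do
        # $PROBE is left unquoted on purpose so "lctl ping" splits into words.
        if $PROBE "$node" >/dev/null 2>&1; then
            echo "$node"
            return 0
        fi
    done
    return 1   # no node answered
}
```

Called as 'first_alive mds-1 mds-2', it would print the name of whichever node responds first, or return a non-zero status if neither does.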
I used the init scripts after I was satisfied with this - all worked well.
An 'ls' on the share hung at first, but it came back a little while after I started mds-1 as the failover node.
You need to have the config file in the right place for the init scripts to work properly.
There were some messages on the oss:
Apr 25 23:40:48 oss-4 kernel: Lustre: 6:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,192.168.241.229@tcp,down,1176793786
Apr 25 23:44:11 oss-4 kernel: Lustre: 17433:0:(filter.c:3236:filter_set_info_async()) ost-beta: received MDS connection from 192.168.241.229@tcp
Then it all stopped working :(
Wednesday, April 25, 2007
lustre errors with a new config
The MDS writes a log file on the MDS device (if you mount an MDS volume, you can see it).
I tried re-writing it with --write-conf, but that failed with the same error (full output below), so I decided to reformat the device instead. I'd also tried a reboot, but that didn't help either.
MDSDEV: mds mds_UUID /dev/sdb1 ldiskfs no
! /usr/sbin/lctl (22): error: setup: Invalid argument
mds-1:~# lconf -v --node mds-1 /root/config.xml
configuring for host: ['mds-1']
Checking XML modification time
+ debugfs -c -R 'stat /LOGS' /dev/sdb1 2>&1 | grep mtime
xmtime 1177503678 > kmtime 1176793916
Error: MDS startup logs are older than config /root/config.xml. Please run --write_conf on stopped MDS to update. Use '--old_conf' to start anyways.
mds-1:~# lconf -v --node mds-1 --write-conf /root/config.xml
configuring for host: ['mds-1']
Service: network NET_mds-1_lnet NET_mds-1_lnet_UUID
loading module: libcfs srcdir None devdir libcfs
+ /sbin/modprobe libcfs
loading module: lnet srcdir None devdir lnet
+ /sbin/modprobe lnet
+ /sbin/modprobe lnet
loading module: ksocklnd srcdir None devdir klnds/socklnd
+ /sbin/modprobe ksocklnd
Service: ldlm ldlm ldlm_UUID
loading module: lvfs srcdir None devdir lvfs
+ /sbin/modprobe lvfs
loading module: obdclass srcdir None devdir obdclass
+ /sbin/modprobe obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
+ /sbin/modprobe ptlrpc
Service: mdsdev MDD_mds_mds-1 MDD_mds_mds-1_UUID
original inode_size 0
stripe_count 1 inode_size 512
loading module: lquota srcdir None devdir quota
+ /sbin/modprobe lquota
loading module: mdc srcdir None devdir mdc
+ /sbin/modprobe mdc
loading module: osc srcdir None devdir osc
+ /sbin/modprobe osc
loading module: lov srcdir None devdir lov
+ /sbin/modprobe lov
loading module: mds srcdir None devdir mds
+ /sbin/modprobe mds
loading module: ldiskfs srcdir None devdir ldiskfs
+ /sbin/modprobe ldiskfs
loading module: fsfilt_ldiskfs srcdir None devdir lvfs
+ /sbin/modprobe fsfilt_ldiskfs
Service: mdsdev MDD_mds_mds-1 MDD_mds_mds-1_UUID
original inode_size 0
stripe_count 1 inode_size 512
MDSDEV: mds mds_UUID /dev/sdb1 ldiskfs no
+ /usr/sbin/lctl
attach mds mds mds_UUID
quit
+ /usr/sbin/lctl
cfg_device mds
setup /dev/sdb1 ldiskfs
quit
+ /usr/sbin/lctl
ignore_errors
cfg_device $mds
cleanup
detach
quit
! /usr/sbin/lctl (22): error: setup: Invalid argument
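The "xmtime > kmtime" line in the output above appears to be lconf comparing the modification time of the XML config with that of /LOGS on the MDS device. The comparison itself can be sketched in shell. conf_newer_than_logs is a hypothetical helper, not part of Lustre, and the two numbers are simply the epoch timestamps copied from the log.

```shell
#!/bin/sh
# Sketch of the timestamp check that lconf reports.
# Assumption: xmtime is the config file's mtime and kmtime is the mtime
# of /LOGS on the MDS device, both in seconds since the epoch.
conf_newer_than_logs() {
    xmtime=$1
    kmtime=$2
    [ "$xmtime" -gt "$kmtime" ]
}

# Values taken from the lconf output above.
if conf_newer_than_logs 1177503678 1176793916; then
    echo "config is newer than the MDS startup logs"
fi
```

The two timestamps are roughly eight days apart, which matches a config edited on April 25 against logs written around April 17.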
Wednesday, April 18, 2007
gotcha with /dev and udev when copying system
Usually I use
cd /
tar -clf - .|(cd /mnt;tar -xpf -)
to make a copy of a system. However, on newer systems running udev this causes problems: /dev is mounted as its own filesystem, and the -l option (one file system in older GNU tar, spelled --one-file-system in newer versions) makes tar skip it. You will not be able to boot the copy unless you also copy the /dev entries from the root filesystem. That on-disk /dev holds a basic skeleton with vital files such as /dev/console and /dev/sda; anything not needed early in boot is created dynamically by udev.
Tuesday, April 17, 2007
Failover for Lustre nodes
Lustre does not provide the tool set for the system-level components necessary for a complete failover solution (node failure detection, power control, and so on), as this functionality has been available for some time from third party tools. CFS does provide the necessary scripts to interact with these packages, and exposes health information for system monitoring. The recommended choice is the Heartbeat package from linux-ha.org. Lustre will work with any HA software that supports resource (I/O) fencing. The Heartbeat software is responsible for detecting failure of the primary server node and controlling the failover.