Re-introduce vmcore creation notification to kdump
Motivation
==========

People may forget to recheck that kdump still works, and as a result no
vmcore may be generated after a real system crash. That is unexpected
for kdump.

It is highly recommended that people test kdump after any system
modification, such as:

a. after kernel patching or a full yum update, as it might break something
   on which kdump depends, for example by introducing a new bug.
b. after any change at the hardware level, such as storage, networking,
   or a firmware upgrade.
c. after deploying any new application, such as one that involves 3rd party
   modules.

Though these are beyond the scope of kdump, a simple vmcore creation
status notification is good to have for now.

Design
======

Kdump currently checks whether any related files/fs/drivers were modified
to determine if the initrd should be rebuilt on (re)start. A rebuild is an
indicator of such a modification, and kdump needs to be tested again. The
rebuild clears the vmcore creation status stored in $VMCORE_CREATION_STATUS,
and as a result a notification asking for a vmcore creation test is
printed.

To test kdump, there is an entry point, "kdumpctl test". It generates a
timestamp string as the ID of the current test and records it with a
"pending" status in $VMCORE_CREATION_STATUS, then triggers a real crash &
dump.
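
As a rough illustration of the "pending" step only, the sketch below writes
such a record to a scratch file. The helper name mark_test_pending is
hypothetical; only the "date +%s-%N" id format and the "pending" keyword come
from this commit, and GNU date is assumed for %N. It does not reload kdump or
crash the system.

```shell
#!/bin/sh
# Hypothetical helper (not part of kdumpctl): record a new "pending" test.
# $1: path of the status file to write
mark_test_pending() {
    _status_file=$1
    # <seconds>-<nanoseconds> since the epoch; %N assumes GNU date
    _test_id="kdump_test_id=$(date +%s-%N)"
    mkdir -p "$(dirname "$_status_file")" || return 1
    echo "pending $_test_id" > "$_status_file"
}

# Demo against a scratch file, never the real status file
demo_file=$(mktemp)
mark_test_pending "$demo_file"
cat "$demo_file"
rm -f "$demo_file"
```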

After the system reboots back to normal, a vmcore creation check runs on
"kdumpctl (re)start/status" and reports the result to users as a
success/fail/manual status.

To achieve that, the program first checks the status in
$VMCORE_CREATION_STATUS. If the status is "pending", the test result is
undetermined and needs to be retrieved from the remote/local dump folder.
If the test id is found in the dump folder and the vmcore is complete,
"pending" is overwritten by "success", which indicates a successful kdump
test. If the test id is found in the dump folder but the vmcore is
incomplete, it is a "fail" kdump test. If no test id is found, it is a
"manual" status, which indicates that users should check the test results
manually.

If $VMCORE_CREATION_STATUS already holds a success/fail/manual status, the
test result has already been determined, so the program does not access
the remote/local dump folder again. This avoids unnecessary access to the
dump target and shortens the time spent.
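
The decision described above can be sketched roughly as follows. This is a
simplified illustration that follows the wording of this message, not the
code in the patch: the function names resolve_test_status and final_status
are hypothetical, and the dump folder is assumed to be a locally reachable
directory. The fetch_status function in the diff draws the line slightly
differently (an unreachable dump target gives "manual", a missing status
file gives "fail").

```shell
#!/bin/sh
# Hypothetical sketch: resolve a "pending" test by looking in the dump folder.
# $1: dump folder, $2: test id. Prints success, fail or manual.
resolve_test_status() {
    _result="$1/kdump-test-$2/vmcore-creation.status"
    if [ ! -f "$_result" ]; then
        echo manual     # no trace of this test id, the user has to look
    elif grep -q '^success ' "$_result"; then
        echo success    # the dump ran to completion
    else
        echo fail       # the dump started but did not complete
    fi
}

# $1: status currently recorded locally, $2: dump folder, $3: test id
final_status() {
    case $1 in
        pending) resolve_test_status "$2" "$3" ;;
        *) echo "$1" ;; # already determined, the dump folder is not touched
    esac
}
```

Calling final_status with an already determined status returns it unchanged,
which is the shortcut that keeps repeated "kdumpctl status" calls from
touching the dump target.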

Users should check for the root cause of a fail/manual status when it is
reported.

$VMCORE_CREATION_STATUS records the vmcore creation status of the current
env. The format is:

   <status> kdump_test_id=<timestamp sec>-<timestamp nanosec>
e.g.:
   success kdump_test_id=1729823462-938751820

This means there has been a successful kdump test at the
$(date -d "@1729823462") timestamp for the current env. The nanosecond
part is only there to make the id string unique.
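
For illustration, a status line in this format can be taken apart with plain
parameter expansion. The helper names below are hypothetical, and
"date -u -d @<sec>" assumes GNU date; the output format string is chosen here
only to make the result independent of the locale.

```shell
#!/bin/sh
# Hypothetical helpers: split a "<status> kdump_test_id=<sec>-<nanosec>" line.
status_of_line() {
    echo "${1%% *}"                  # text before the first space
}

test_time_of_line() {
    _id=${1#*kdump_test_id=}         # <sec>-<nanosec>
    _sec=${_id%-*}                   # drop the nanosecond part
    date -u -d "@$_sec" '+%Y-%m-%d %H:%M:%S'   # GNU date, UTC
}

line="success kdump_test_id=1729823462-938751820"
status_of_line "$line"       # prints: success
test_time_of_line "$line"    # prints: 2024-10-25 02:31:02
```

The printed time is the same instant as the "Fri Oct 25 02:31:02 AM UTC 2024"
shown in the Usage section.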

Difference
==========
Previously, commit 88525eb ("Introduce vmcore creation notification to
kdump") was merged to address the same issue, but it was implemented
differently:

The previous one:
Save $VMCORE_CREATION_STATUS to the local drive during dumping in the 2nd
kernel. If the vmcore dump target is different from the drive holding
$VMCORE_CREATION_STATUS, the latter needs to be mounted in the 2nd kernel.

This one:
Save $VMCORE_CREATION_STATUS to the local drive only in the 1st kernel,
that is, the test result is retrieved after the 2nd kernel finishes
dumping. So it doesn't load or mount any other drive in the 2nd kernel.

The advantage:
Extra mounting in the 2nd kernel introduces a higher risk of failure and,
as a result, lowers the success rate of vmcore dumping, which is
unacceptable. So keeping the code for the 2nd kernel as simple as possible
is preferred.

Usage
=====
[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Notice: No vmcore creation test performed!

[root@localhost ~]# kdumpctl status
kdump: Kdump is operational
kdump: Notice: No vmcore creation test performed!

[root@localhost ~]# kdumpctl test

[root@localhost ~]# cat /var/lib/kdump/vmcore-creation.status
pending kdump_test_id=1729823462-938751820

[root@localhost ~]# kdumpctl status
kdump: Kdump is operational
kdump: Notice: Last successful vmcore creation on Fri Oct 25 02:31:02 AM UTC 2024

[root@localhost ~]# cat /var/lib/kdump/vmcore-creation.status
success kdump_test_id=1729823462-938751820

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Notice: Last successful vmcore creation on Fri Oct 25 02:31:02 AM UTC 2024

Note: the notification for kdumpctl (re)start/status can be disabled by
setting VMCORE_CREATION_NOTIFICATION="" in /etc/sysconfig/kdump. fadump
is NOT supported by this feature.

Signed-off-by: Tao Liu <ltao@redhat.com>
liutgnu committed Nov 29, 2024
1 parent 36845f4 commit 2b71e42
Showing 4 changed files with 244 additions and 2 deletions.
60 changes: 58 additions & 2 deletions dracut/99kdumpbase/kdump.sh
@@ -20,6 +20,8 @@ KDUMP_PATH="/var/crash"
KDUMP_LOG_FILE="/run/initramfs/kexec-dmesg.log"
KDUMP_LOG_DEST=""
KDUMP_LOG_OP=""
KDUMP_TEST_ID=""
KDUMP_TEST_STATUS=""
CORE_COLLECTOR=""
DEFAULT_CORE_COLLECTOR="makedumpfile -l --message-level 7 -d 31"
DMESG_COLLECTOR="/sbin/vmcore-dmesg"
@@ -154,7 +156,12 @@ dump_fs() {
;;
esac

_dump_fs_path=$(echo "$1/$KDUMP_PATH/$HOST_IP-$DATEDIR/" | tr -s /)
if [ -z "$KDUMP_TEST_ID" ]; then
_dump_fs_path=$(echo "$1/$KDUMP_PATH/$HOST_IP-$DATEDIR/" | tr -s /)
else
_dump_fs_path=$(echo "$1/$KDUMP_PATH/" | tr -s /)
fi

dinfo "saving to $_dump_fs_path"

# Only remount to read-write mode if the dump target is mounted read-only.
@@ -397,7 +404,12 @@ dump_raw() {
dump_ssh() {
_ret=0
_ssh_opts="-i $1 -o BatchMode=yes -o StrictHostKeyChecking=yes"
_ssh_dir="$KDUMP_PATH/$HOST_IP-$DATEDIR"
if [ -z "$KDUMP_TEST_ID" ]; then
_ssh_dir="$KDUMP_PATH/$HOST_IP-$DATEDIR"
else
_ssh_dir="$KDUMP_PATH"
fi

if is_ipv6_address "$2"; then
_scp_address=${2%@*}@"[${2#*@}]"
else
@@ -588,6 +600,48 @@ fence_kdump_notify() {
fi
}

kdump_test_set_status() {
_status="$1"

[ -n "$KDUMP_TEST_STATUS" ] || return

case "$_status" in
success|fail) ;;
*)
derror "Unknown test status $_status"
return 1
;;
esac

if is_ssh_dump_target; then
_ssh_opts="-i $SSH_KEY_LOCATION -o BatchMode=yes -o StrictHostKeyChecking=yes"
_ssh_host=$(echo "$DUMP_INSTRUCTION" | awk '{print $3}')

ssh -q $_ssh_opts "$_ssh_host" "mkdir -p ${KDUMP_TEST_STATUS%/*}" \
|| return 1
ssh -q $_ssh_opts "$_ssh_host" "echo $_status kdump_test_id=$KDUMP_TEST_ID > $KDUMP_TEST_STATUS" \
|| return 1
else
_target=$(echo "$DUMP_INSTRUCTION" | awk '{print $2}')

mkdir -p "$_target/$KDUMP_PATH" || return 1
echo "$_status kdump_test_id=$KDUMP_TEST_ID" > "$_target/$KDUMP_TEST_STATUS"
sync -f "$_target/$KDUMP_TEST_STATUS"
fi
}

kdump_test_init() {
is_raw_dump_target && return

KDUMP_TEST_ID=$(getarg kdump_test_id=)
[ -z "$KDUMP_TEST_ID" ] && return

KDUMP_PATH="$KDUMP_PATH/kdump-test-$KDUMP_TEST_ID"
KDUMP_TEST_STATUS="$KDUMP_PATH/vmcore-creation.status"

kdump_test_set_status 'fail'
}

if [ "$1" = "--error-handler" ]; then
get_kdump_confs
do_failure_action
@@ -614,6 +668,7 @@ if [ -z "$DUMP_INSTRUCTION" ]; then
DUMP_INSTRUCTION="dump_fs $NEWROOT"
fi

kdump_test_init
if ! do_kdump_pre; then
derror "kdump_pre script exited with non-zero status!"
do_final_action
@@ -634,4 +689,5 @@ if [ $DUMP_RETVAL -ne 0 ]; then
exit 1
fi

kdump_test_set_status "success"
do_final_action
4 changes: 4 additions & 0 deletions gen-kdump-sysconfig.sh
@@ -51,6 +51,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
# Logging is controlled by following variables in the first kernel:
# - @var KDUMP_STDLOGLVL - logging level to standard error (console output)
# - @var KDUMP_SYSLOGLVL - logging level to syslog (by logger command)
173 changes: 173 additions & 0 deletions kdumpctl
@@ -14,6 +14,7 @@ KDUMP_INITRD=""
TARGET_INITRD=""
#kdump shall be the default dump mode
DEFAULT_DUMP_MODE="kdump"
VMCORE_CREATION_STATUS="/var/lib/kdump/vmcore-creation.status"

# Some default values in case /etc/sysconfig/kdump doesn't include
KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug"
@@ -45,8 +46,10 @@ if ! dlog_init; then
fi

KDUMP_TMPDIR=$(mktemp --tmpdir -d kdump.XXXX)
TMPMNT="$KDUMP_TMPDIR/target"
trap '
ret=$?;
is_mounted $TMPMNT && umount -f $TMPMNT;
rm -rf "$KDUMP_TMPDIR"
exit $ret;
' EXIT
@@ -185,6 +188,11 @@ rebuild_initrd()
else
rebuild_kdump_initrd
fi

_ret=$?

set_vmcore_creation_status 'clear'
return $_ret
}

#$1: the files to be checked with IFS=' '
@@ -1674,6 +1682,170 @@ _should_reset_crashkernel() {
[[ $(kdump_get_conf_val auto_reset_crashkernel) != no ]] && systemctl is-enabled kdump &> /dev/null
}

set_kdump_test_id()
{
local _id=$1

KDUMP_COMMANDLINE_APPEND+=" $_id "
reload >& /dev/null

if [[ "$?" -ne 0 ]]; then
derror "Set kdump test id fail."
exit 1
fi
}

# $1: success/fail/pending/manual/clear
# $2: test id
set_vmcore_creation_status()
{
local _status=$1
local _kdump_test_id
_dir=$(dirname "$VMCORE_CREATION_STATUS")

[[ -d "$_dir" ]] || mkdir -p "$_dir"
[[ -w "$_dir" ]] || chmod +w "$_dir"

case "$_status" in
pending)
_kdump_test_id="kdump_test_id=$(date +%s-%N)"
set_kdump_test_id "$_kdump_test_id"
echo "$_status $_kdump_test_id" > "$VMCORE_CREATION_STATUS"
;;
success | fail | manual)
sed -E -i "s/^\w+/$_status/" "$VMCORE_CREATION_STATUS"
;;
clear)
rm -f "$VMCORE_CREATION_STATUS"
;;
*)
return
esac
sync -f "$_dir"
}

fetch_status()
{
local _test_id="$1" _mnt
local _status

is_raw_dump_target && return 2

_status="${OPT[path]}/kdump-test-$_test_id/vmcore-creation.status"

if is_nfs_dump_target || is_local_target; then
_mnt=$(get_mntpoint_from_target "${OPT[_target]}")
if [[ -z "$_mnt" ]] || ! is_mounted "$_mnt"; then
mkdir -p $TMPMNT
mount "${OPT[_target]}" "$TMPMNT" -t "${OPT[_fstype]}" -o defaults || \
{ dwarn "Failed to mount ${OPT[_target]}" && return 2; }
_mnt="$TMPMNT"
fi
_status="$_mnt/$_status"
elif is_ssh_dump_target; then
scp -i "${OPT[sshkey]}" -o BatchMode=yes \
"${OPT[_target]}:$_status" \
"$KDUMP_TMPDIR"
case "$?" in
0)
# success
;;
1)
# file not found
return 1
;;
255)
# no connection to host
return 2
esac
_status="$KDUMP_TMPDIR/vmcore-creation.status"
fi

[[ -f "$_status" ]] || return 1
grep -q "success" "$_status" && return 0 || return 1
}

check_vmcore_creation_status()
{
local _status _test_id _timestamp _status_date

[[ ${VMCORE_CREATION_NOTIFICATION,,} == "yes" ]] || return

[[ "$DEFAULT_DUMP_MODE" == "kdump" ]] || return

if [[ ! -s "$VMCORE_CREATION_STATUS" ]]; then
dwarn "Notice: No vmcore creation test performed!"
return
fi

[[ "${#OPT[@]}" -eq 0 ]] && { parse_config || return; }

read -r _status _test_id < "$VMCORE_CREATION_STATUS"
_test_id=${_test_id#*=}
_timestamp=${_test_id%-*}
_status_date=$(date -d "@$_timestamp")

if [[ "$_status" == "pending" ]]; then
fetch_status "$_test_id"
case "$?" in
0)
_status="success"
;;
1)
_status="fail"
;;
*)
_status="manual"
;;
esac
set_vmcore_creation_status "$_status"
fi

case "$_status" in
success)
dinfo "Notice: Last successful vmcore creation on $_status_date"
;;
fail)
dwarn "Notice: Last NOT successful vmcore creation on $_status_date"
;;
manual)
dwarn "Notice: Require manual check for kdump test of $_status_date"
;;
*)
derror "Unknown test status: $_status"
;;
esac
}

kdump_test()
{
if ! is_kernel_loaded "$DEFAULT_DUMP_MODE"; then
derror "Kdump needs be operational before test."
exit 1
fi

if [[ ! "$DEFAULT_DUMP_MODE" == "kdump" ]]; then
derror "Only kdump is supported for test."
exit 1
fi

if [[ ! "$1" == "--force" ]]; then
read -p "DANGER!!! Will perform a kdump test by crashing the system, proceed? (y/N): " input
case $input in
[Yy] )
dinfo "Start kdump test..."
;;
* )
dinfo "Operation cancelled."
exit 0
;;
esac
fi

set_vmcore_creation_status 'pending'
echo c > /proc/sysrq-trigger
}

main()
{
# Determine if the dump mode is kdump or fadump
@@ -1705,6 +1877,7 @@ main()
EXIT_CODE=3
;;
esac
check_vmcore_creation_status
exit $EXIT_CODE
;;
reload)
9 changes: 9 additions & 0 deletions kdumpctl.8
@@ -66,7 +66,16 @@ Note: The memory requirements for kdump varies heavily depending on the
used hardware and system configuration. Thus the recommended
crashkernel might not work for your specific setup. Please test if
kdump works after resetting the crashkernel value.
.TP
.I test [--force]
Test the kdump by actually trigger the system crash & dump, and check if a
vmcore can really be generated successfully based on current config and
environment. After system reboot back to normal, check the test result
by "kdumpctl status". Note, fadump is not supported.

If the optional parameter [--force] is provided, there will be no confirmation
before triggering the system crash. Dangerous though, this option is meant
for automation testing.
.SH "SEE ALSO"
.BR kdump.conf (5),
.BR mkdumprd (8)
