Unexpected DB outage when cluster is removed from maintenance mode after cluster service restart. #121
I am not able to push my code to a develop branch, so I am uploading the modified code as a .md file. I used the code from Jun 30, 2022, which is working, because the code committed on Sep 5th appears to have a bug. The modified code is from line 2713 to 2780. A SUSE support case (00360697) was already opened for the same issue, but the solution provided does not address the root cause.
Hello @sairamgopal, the maintenance procedure that you seem to follow looks incomplete. The maintenance procedure defined in https://github.com/SUSE/SAPHanaSR/blob/master/man/SAPHanaSR_maintenance_examples.7, in the section "Overview on maintenance procedure for Linux, HANA remains running, on pacemaker-2.0.", requires that the resources be refreshed so that the current status is known. I am quoting from the manpage: "6. Let Linux cluster detect status of HANA resource, on either node. crm resource refresh cln_... crm resource refresh msl_..." Moreover, the maintenance procedure has been updated to only set maintenance on the msl_ resource and not on the whole cluster. Please refer to section "11.3.2 Updating SAP HANA - seamless SAP HANA maintenance" of https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-costopt-15/#cha.hana-sr.administrate
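For reference, the refresh step from the manpage boils down to two crmsh calls; the resource names below are placeholders and have to match your cluster configuration:

    # Let the cluster re-probe the HANA resources so the node attributes
    # reflect the current status (resource names are examples only):
    crm resource refresh cln_SAPHanaTopology_HDB_HDB00
    crm resource refresh msl_SAPHana_HDB_HDB00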
Hi @ksanjeet,
The crm resource refresh commands didn't make any difference in
2. Yes, I have gone through those steps. The SUSE support engineer provided that document before, but I feel that the procedure of setting only the msl_ resource to maintenance is just a workaround for the issue. https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-maintenance.html#sec-ha-maint-shutdown-node-maint-mode
It says
If we include a check for the primary node status and set the Role attribute during the probe in the SAPHana resource agent, then the SAPHana resource agent will set the score based on the Role attribute, and that resolves the issue. Also, this will not disturb the cluster functionality, because we do this check only during the probe and set the attribute only when we identify a fully operational primary.
Dear @sairamgopal, as per the output of I have tested these procedures many times, and your result seems unexpected to me. Although the maintainer will look at your valuable patch in due course, may I request you to look at my blog, which details these procedures, available at https://www.suse.com/c/sles-for-sap-os-patching-procedure-for-scale-up-perf-opt-hana-cluster/, and an important blog about checking prerequisites before starting a maintenance procedure, available at https://www.suse.com/c/sles-for-sap-hana-maintenance-procedures-part-1-pre-maintenance-checks/. I am quite sure these will help you define a procedure for your organization and give you an optimal maintenance experience.
Hi @PeterPitterling, I saw that you made a commit on Sep 5th to get rid of the HDBSettings.sh call in favour of cdpy; python. The command below works fine if we are not changing to the python directory using
Example:
command with cdpy
I tried from my end on two different clusters, SLES 12 and SLES 15, and on both clusters I got this error.
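For illustration, the two call styles being compared look roughly like the following; the script name, timeout value, and user are assumptions for the example and are not taken from the resource agent itself:

    # Older style: let HDBSettings.sh set up the environment and run the script
    timeout 60 su - s4cadm -c "HDBSettings.sh systemReplicationStatus.py"

    # Newer style (Sep 5th commit): change into the python_support directory
    # via the cdpy alias and call python directly
    timeout 60 su - s4cadm -c "cdpy; python systemReplicationStatus.py"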
Hi @ksanjeet, thank you very much for your feedback. I have waited long enough, more than 15 minutes, but the result is the same. It is not happening on just one or two clusters; I see this behavior in all our clusters. If possible, and if you have any test clusters for HANA, I request you to test the steps below by putting the entire cluster into maintenance mode.
If you didn't face any issues with the above-mentioned procedure, then we can discuss or check other cluster parameters. If you face the same issue, then you can try using SAPHana_Jun30.md by renaming it to SAPHana, replacing your current SAPHana resource agent with it, and testing the same steps mentioned above. Thank you very much in advance.
Hi @sairamgopal, cdpy is an alias which is defined by the hdbenv.sh script. This script is invoked by this chain
What shell are you using for your s4cadm user?
su - s4cadm
alias
cdpy
BR
Hello @PeterPitterling, the alias is defined under the s4cadm user.
Running the cdpy command (or any alias) from root without timeout works fine.
Running the cdpy command with timeout works, but only when we run another command before the alias.
Running the cdpy command with timeout alone is failing.
Moving $pre_script to after the timeout value works.
So maybe the output command should be like this?
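For illustration, the working and failing cases described above correspond roughly to calls of the following shape (user, timeout value, and the trailing command are assumptions; the exact call used in the resource agent differs):

    # Reported to work: no timeout wrapper around the alias
    su - s4cadm -c "cdpy; pwd"

    # Reported to fail: the alias is the first command in the string
    # executed under timeout
    timeout 60 su - s4cadm -c "cdpy; pwd"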
Hello @sairamgopal, the document you are referring to, https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-maintenance.html#sec-ha-maint-shutdown-node-maint-mode, is generic HA documentation and not for SLES for SAP for HANA, for which the maintenance procedure is defined at https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-costopt-15/#cha.hana-sr.administrate. Setting maintenance on the msl_ resource instead of the whole cluster is not a workaround. The logic is that you need to have the SAPHanaTopology resource (the cln_ resource) running to check the current system replication status and the landscape configuration details. What do you think is the role of a second resource agent (SAPHanaTopology) for HANA, which you don't see for other databases like Sybase? This resource needs to be running, and when it is, you don't see the problem of unexpected node attributes and, eventually, a restart of the HANA DB on the primary node. As mentioned previously, you need to adapt your procedure by accommodating 2 changes:
There can be many ways to solve a problem. It is good to discuss better ways, and I support that we should go through your patch and give it careful consideration. However, the currently supported procedure requires that you set maintenance only on the msl_ resource and not on the whole cluster.
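In command form, the msl_-only flavour described above looks roughly like this (resource names are placeholders; see the referenced guide for the authoritative steps):

    # Put only the multi-state SAPHana resource into maintenance; the
    # SAPHanaTopology clone (cln_...) keeps running and keeps the node
    # attributes up to date.
    crm resource maintenance msl_SAPHana_HDB_HDB00 on

    # ... perform the maintenance work ...

    # Let the cluster re-detect the current HANA status, then end the
    # resource maintenance.
    crm resource refresh msl_SAPHana_HDB_HDB00
    crm resource maintenance msl_SAPHana_HDB_HDB00 off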
Hello @ksanjeet, hi @PeterPitterling, @angelabriel, @fdanapfel, @fmherschel, could you please provide your thoughts or suggestions on the comment below? The modified/additional code is from line 2713 to 2780.
@sairamgopal Nevertheless, just putting true; in front of cdpy will fix it.
This would need to be added in the calling function. Currently it is not clear whether this inner timeout is required at all; we will align and come back. See #122. Btw, the repo version is currently not shipped, so you are working more or less with an in-development version.
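For example, with that true; prefix applied, a call of the failing shape would look roughly like this (script name, user, and timeout are assumptions):

    # Prefix the alias with an ordinary command so the alias is no longer
    # the first command in the string executed under timeout:
    timeout 60 su - s4cadm -c "true; cdpy; python systemReplicationStatus.py"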
Hi @PeterPitterling, also could you please check and consider the other point to
Please create a separate issue for this request.
Issue:
On a fully operational cluster, when the cluster is put into maintenance mode and the Pacemaker/cluster service is restarted, then after removing the cluster from maintenance mode the DB on the primary is stopped and started again, which results in an outage for the customers.
Recreate the issue with the steps below:
After step 5, the DB on the primary will be restarted, or sometimes a failover is triggered.
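Condensed into commands, the maintenance-mode/restart cycle described above is roughly the following (an approximation, not the exact numbered steps; run as root on the cluster nodes):

    # Put the whole cluster into maintenance mode
    crm configure property maintenance-mode=true

    # Restart the cluster stack (Pacemaker) on the nodes
    systemctl restart pacemaker

    # Remove the cluster from maintenance mode again
    crm configure property maintenance-mode=false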
Reason:
This is happening because, if you attempt to start cluster services on a node while the cluster or node is in maintenance mode, Pacemaker will initiate a single one-shot monitor operation (a "probe") for every resource to evaluate which resources are currently running on that node. However, it will take no further action other than determining the resources' status.
So after step 4, a probe is initiated using the SAPHana and SAPHanaTopology resources.
In SAPHanaTopology, when the call is identified as a probe in the monitor clone function, it only checks and sets the attribute for the HANA version; it does not do any check of the current cluster state. Because of this, the "hana_roles" and "master-rsc_SAPHana_HDB42" attributes are not set on the cluster primary.
Also, the SAPHana resource agent tries to get the status of the roles attribute (which is not set by that time) and sets the score to 5 during the probe. Later, when the cluster is removed from maintenance mode, the resource agent checks the roles attribute and its score; as those values are not as expected, the agent tries to fix the cluster, and a DB stop-start happens.
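One way to observe this state is to query the node attributes while the cluster is still in maintenance mode; if the probe did not set them, the roles attribute is missing for the primary. The node name and the SID part of the attribute name below are placeholders:

    # Query the HANA roles attribute for a given node (SID "hdb" is an example)
    crm_attribute -N <primary-node> -n hana_hdb_roles -G -l reboot

    # Or list all SAPHanaSR-related attributes at once
    SAPHanaSR-showAttr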
Resolution:
To overcome this issue, if we add a check to identify the status of the primary node and set the "hana__roles" attribute during the probe, then when the cluster is removed from maintenance, the cluster will not try to stop and start the DB or trigger a failover, as it will see an operational primary node.
I have already modified the code and tested multiple scenarios; the cluster functionality is not disturbed, and the mentioned issue is resolved. I don't think these changes to the SAPHana resource agent will cause additional issues because, during the probe, we set the attributes only if we identify the primary node. But I need your expertise to check and decide whether this approach can be used, or to suggest another alternative/fix to overcome the issue mentioned above.
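As an illustration of the idea only (this is a simplified sketch, not the actual patch; the helper function, variable names, and the example roles string are assumptions):

    #!/bin/bash
    # Sketch: set the roles attribute during a probe when a healthy primary
    # is detected, so the cluster sees an operational primary after
    # maintenance mode is removed.

    check_primary_status() {
        # Stub only. The real resource agent would evaluate the output of
        # landscapeHostConfiguration.py / systemReplicationStatus.py here.
        return 0
    }

    probe_set_roles_attribute() {
        local sid="$1" node="$2"
        if check_primary_status; then
            # Example roles string for a fully operational primary; the real
            # value would come from the landscape/role evaluation.
            crm_attribute -N "$node" -n "hana_${sid}_roles" \
                -v "4:P:master1:master:worker:master" -l reboot
        fi
    }

    # Example call during a probe
    probe_set_roles_attribute "hdb" "$(hostname)"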