
SuSE cluster exhibiting new behavior since Hana Upgrade #39

Open
tjdorian opened this issue May 26, 2017 · 5 comments

Comments

@tjdorian

Hi:

I have a SuSE cluster running on SLES 11 SP4 for SAP. It was built about two years ago and has all of the latest SLES patches as of April 2017. I currently have a service request (682288682) open with Cisco. Yes, I said Cisco, as they provide my SuSE support. The reason for this SR is that since our Basis team upgraded HANA to 1.00.111.00.1454183496 we are witnessing new behavior we have not seen before, and it is adding an extra step and some confusion for our junior team members during maintenance activities. Let me explain.

In the past on our two-node cluster we could take the slave host out of operation by running “crm node standby”. This would shut the application down on the slave. After whatever work was planned we could run “crm node online” on the same host and the server would come back online, including resyncing the replication. Now, however, we get a different result. After bringing the slave back online, the start fails with the message below.
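For reference, the maintenance sequence that always used to work is roughly this (a sketch; the node name is our slave host, and the crm commands are run as root on a cluster node):

```shell
# Sketch of our usual maintenance procedure (node name is our slave host):
crm node standby mn500d5a209   # resource agent stops HANA on the slave
# ... perform the planned maintenance work ...
crm node online mn500d5a209    # node rejoins; HANA used to start and resync on its own
```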

```
Failed actions: rsc-hana-TP1_start_0 on mn500d5a209 'unknown error' (1): call=69, status=complete, exit-reason='none', last-rc-change='Mon May 8 09:57:52 2017', queued=0ms, exec=19762ms
```

The only way to clear this is to re-register the host with the sr_register command, telling it where to find the master (the other side of the cluster), and then clear the failed action. At this point the cluster starts and resumes replication. I gather we could overcome this by setting the parameter AUTOMATED_REGISTER=true, however that would introduce automatic registration on cluster failovers and we don’t want that. SAP also discourages it.
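For clarity, the manual recovery we are now forced into looks roughly like this (a sketch; the primary host, instance number, and site name are placeholders for our real values, which are not shown here):

```shell
# As the tp1adm user on the slave: re-register against the current primary.
# <primary-host>, <inst-nr>, <site-name> are placeholders for our values.
hdbnsutil -sr_register --remoteHost=<primary-host> --remoteInstance=<inst-nr> \
          --replicationMode=sync --name=<site-name>

# Then, as root, clear the failed start action so the cluster retries:
crm resource cleanup rsc-hana-TP1
```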

So the crux of the issue is that it never behaved like this before. Bringing a slave host up and down, either by putting it in standby mode or by stopping and starting openais, never used to force us to re-register the server for replication when it was previously already a slave. Somehow it understood that this was not necessary.

So a bit more information. We are using delta data shipping between hosts. The nameserver trace does kick out these messages below.

```
[29160]{-1}[-1/-1] 2017-05-25 08:14:05.599552 e sr_dataaccess DisasterRecoveryProtocol.cpp(01171) : This situation can come up, if the primary has been recovered to an older point in time and a new full data shipping is needed.
[29160]{-1}[-1/-1] 2017-05-25 08:14:05.599564 e sr_dataaccess DisasterRecoveryProtocol.cpp(01172) : If this the case, please execute 'hdbnsutil -sr_register ..' on the secondary to trigger a new full data shipping.
```

Another thing we see, from the SAPHana OCF script, is this…

```shell
if version "$hdbver" ">=" "1.00.110"; then
    newParameter=1
fi
```

This implies the script takes different actions than before because of our upgrade. I did not dig into it since you folks are already in the know. Is this normal behavior now, and what can we do to bring back the old behavior? Or is something missing and is a PTF required? I will try to get the corresponding SuSE SR number from Cisco.

Tim

@tjdorian
Author

The SuSE case number is 101068132821

@tjdorian
Author

Here’s an update. Out of curiosity we changed our sync mode from delta_datashipping to logreplay. The result did not change; same issue as stated earlier. What is even more interesting is that we changed the cluster configuration parameter AUTOMATED_REGISTER from false to true. At this point we expected the slave host to automatically register and HANA to start. However, that was not the case. The behavior was the same.
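For the record, this is roughly how we flipped the parameter (a sketch; the resource name matches our failed action above, and the exact crmsh syntax may differ slightly between crmsh versions):

```shell
# Sketch: change the SAPHana resource parameter via crmsh, then verify it.
crm resource param rsc-hana-TP1 set AUTOMATED_REGISTER true
crm resource param rsc-hana-TP1 show AUTOMATED_REGISTER   # confirm the new value
```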

Tim

@fmherschel
Owner

Hi Tim,

thank you for your bug report. Yes, please use either an SAP ticket or a SUSE SR to track your specific
situation in detail. This thread is not the correct place to share server details, logs, and so on.

SUSE support will contact me internally if they feel that this is really a bug. Fixes will of course then find
their way back to this repository :)

Yes, the SAPHana resource agents now need to distinguish between HANA versions, because SAP gave us
either new command line options (to allow easier or even more stable parsing) or discontinued some
command line options, depending on the SAP HANA database version.

If you look at the sources of SAPHana, the newParameter=1 tweak only uses the newer hdbnsutil command line options to avoid a nasty command line warning. We got feedback from customers that they
were worried about hdbnsutil claiming it had been called with deprecated parameters.

I have also seen situations where I needed to do a full replica. Maybe this is triggered by already-cleared logs or
files that would have needed to be available for an 'optimized' replication.

Regards
Fabian

@tjdorian
Author

tjdorian commented Jun 1, 2017 via email

@fmherschel
Owner

Hi Tim,

the issue is already being discussed between your HW partner and us. I suggest concentrating on the
formal process and not interfering by using different communication channels.

Regards
Fabian
