
SuSE cluster exhibiting new behavior since Hana Upgrade #39

Open
tjdorian opened this issue May 26, 2017 · 5 comments

Comments

@tjdorian

Hi:

I have a SuSE cluster running on SLES 11 SP4 for SAP. It was built about two years ago and has all of the latest SLES patches as of April 2017. I currently have a service request (682288682) open with Cisco. Yes, I said Cisco, as they provide my SuSE support. The reason for this SR is that since our Basis team upgraded HANA to 1.00.111.00.1454183496 we are witnessing new behavior we have not seen before, and it is adding an extra step and some confusion for our junior team members during maintenance activities. Let me explain.

In the past on our two-node cluster we could take the slave host out of operation by running “crm node standby”. This would shut the application down on the slave. After whatever work was planned we could run “crm node online” on the same host and the server would come back online, including resyncing the replication. Now, however, we get a different result. After bringing the slave back online, the start fails with the message below.
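For reference, the maintenance sequence that always used to work is roughly this (a sketch; the node name is our slave host, and the crm commands are run as root on a cluster node):

```shell
# Sketch of our usual maintenance procedure (node name is our slave host):
crm node standby mn500d5a209   # resource agent stops HANA on the slave
# ... perform the planned maintenance work ...
crm node online mn500d5a209    # node rejoins; HANA used to start and resync on its own
```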

```
Failed actions: rsc-hana-TP1_start_0 on mn500d5a209 'unknown error' (1): call=69, status=complete, exit-reason='none', last-rc-change='Mon May 8 09:57:52 2017', queued=0ms, exec=19762ms
```

The only way to clear this is to re-register the host with the sr_register command, telling it where to find the master (the other side of the cluster), and then clear the failed action. At this point the cluster starts and resumes replication. I gather we could overcome this by setting the parameter AUTOMATED_REGISTER=true, however that would introduce automatic registration on cluster failovers and we don’t want that. SAP also discourages it.
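For clarity, the manual recovery we are now forced into looks roughly like this (a sketch; the primary host, instance number, and site name are placeholders for our real values, which are not shown here):

```shell
# As the tp1adm user on the slave: re-register against the current primary.
# <primary-host>, <inst-nr>, <site-name> are placeholders for our values.
hdbnsutil -sr_register --remoteHost=<primary-host> --remoteInstance=<inst-nr> \
          --replicationMode=sync --name=<site-name>

# Then, as root, clear the failed start action so the cluster retries:
crm resource cleanup rsc-hana-TP1
```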

So the crux of the issue is that it never behaved like this before. Bringing a slave host up and down, either by putting it in standby mode or by stopping and starting openais, never used to force us to re-register the server for replication when it was previously already a slave. Somehow it understood that this was not necessary.

So a bit more information. We are using delta data shipping between hosts. The nameserver trace does kick out these messages below.

```
[29160]{-1}[-1/-1] 2017-05-25 08:14:05.599552 e sr_dataaccess DisasterRecoveryProtocol.cpp(01171) : This situation can come up, if the primary has been recovered to an older point in time and a new full data shipping is needed.
[29160]{-1}[-1/-1] 2017-05-25 08:14:05.599564 e sr_dataaccess DisasterRecoveryProtocol.cpp(01172) : If this the case, please execute 'hdbnsutil -sr_register ..' on the secondary to trigger a new full data shipping.
```

Another thing we see, from the SAPHana OCF script, is this…

```shell
if version "$hdbver" ">=" "1.00.110"; then
    newParameter=1
fi
```

This implies the script takes different actions than before because of our upgrade. I did not dig into it since you folks are already in the know. Is this normal behavior now, and what can we do to bring back the old behavior? Or is something missing and is a PTF required? I will try to get the corresponding SuSE SR number from Cisco.

Tim

@tjdorian
Author

The SuSE case number is 101068132821

@tjdorian
Author

Here’s an update. Out of curiosity we changed our sync mode from delta_datashipping to logreplay. The result did not change; same issue as stated earlier. What is even more interesting is that we changed the cluster configuration parameter AUTOMATED_REGISTER from false to true. At this point we expected the slave host to automatically register and HANA to start. However, that was not the case. The behavior was the same.
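For the record, this is roughly how we flipped the parameter (a sketch; the resource name matches our failed action above, and the exact crmsh syntax may differ slightly between crmsh versions):

```shell
# Sketch: change the SAPHana resource parameter via crmsh, then verify it.
crm resource param rsc-hana-TP1 set AUTOMATED_REGISTER true
crm resource param rsc-hana-TP1 show AUTOMATED_REGISTER   # confirm the new value
```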

Tim

@fmherschel
Owner

Hi Tim,

thank you for your bug report. Yes, please use either an SAP ticket or a SUSE SR to track your specific
situation in detail. This thread is not the correct place to share server details, logs, and so on.

SUSE support will contact me internally if they feel that this is really a bug. Fixes will of course then find
their way back to this repository :)

Yes, the SAPHana resource agents now need to distinguish between HANA versions, because SAP gave us
either new command line options (to allow easier or even more stable parsing) or discontinued some
command line options, depending on the SAP HANA database version.

If you look at the sources of SAPHana, the newParameter=1 tweak only uses the newer hdbnsutil command line options to avoid a nasty command line warning. We got feedback from customers that they
were worried about hdbnsutil claiming it had been called with deprecated parameters.

I have also seen situations where I needed to do a full replica. Maybe this is triggered by already-cleared logs or
files that would have needed to be available for an 'optimized' replication.

Regards
Fabian

@tjdorian
Author

tjdorian commented Jun 1, 2017 via email

@fmherschel
Owner

Hi Tim,

the issue is already being discussed between your HW partner and us. I suggest concentrating on the
formal process and not interfering by using different communication channels.

Regards
Fabian
