SuSE cluster exhibiting new behavior since Hana Upgrade #39
Comments
The SuSE case number is 101068132821 |
Here’s an update. Out of curiosity we changed our sync mode from delta_datashipping to logreplay. The results did not change: same issue as stated earlier. Even more interesting, we changed the cluster configuration parameter AUTOMATED_REGISTER from false to true. At this point we expected the slave host to register automatically and HANA to start. However, that was not the case. The behavior was the same. Tim |
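Hi Tim, thank you for your bug report. Yes, please use either a SAP ticket or a SUSE service request (SR) to track your specific situation in detail. This thread is not the correct place to share server details, logs and so on. SUSE support would contact me internally if they feel that this is really a bug. Fixes will of course then find their way back to this repository :) Yes, the SAPHana resource agents now need to differ between HANA versions, because SAP either gave us new command line options (to allow easier or even more stable parsing) or discontinued some command line options depending on the SAP HANA database status. If you look at the sources of SAPHana, the newParameter=1 code tweak only uses the newer hdbnsutil command line options to avoid a nasty command line warning. We got feedback from customers that they are worried about hdbnsutil claiming that it has been called with deprecated parameters. I also have seen situations where I needed to do a full replica. Maybe this is triggered by already cleared logs or files which would have needed to be available for an 'optimized' replication. Regards |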
Hi Fabian:
Thanks for getting back to me. I have had this issue logged for about 4
weeks with no fix. Out of desperation I posted it on GitHub. I'm starting
to think that a bug was introduced when we upgraded HANA. Yesterday I put
the cluster in maintenance mode. While in maintenance mode we were able to
stop and start the secondary by hand without having to re-register it. The
cluster software now always forces the sr_register, no matter what.
In my case I get SuSE support through Cisco. This is a horrible arrangement,
for they create their own ticket with SuSE and play middle-man. I'll reach
out to my SuSE rep for assistance. At this point I believe you're my only
hope.
Tim
…On Thu, Jun 1, 2017 at 11:01 AM, fmherschel ***@***.***> wrote:
Hi Tim,
thank you for your bug report. Yes, please use either a SAP ticket or SUSE
(SR) to track your specific
situation in detail. This thread is not the correct place to share server
details, logs and so on.
SUSE support would contact me internally, if they feel that this is really
a bug. Fixes will of course then find
their way back to this repository :)
Yes, the SAPHana resource agents now need to differ between HANA versions,
because SAP either gave us
new command line options (to allow easier or even more stable
parsing) or discontinued some
command line options depending on the SAP HANA database status.
If you look at the sources of SAPHana, the newParameter=1 code tweak only
uses the newer hdbnsutil command line options to avoid a nasty command line
warning. We got feedback from customers that they
are worried about hdbnsutil claiming that it has been called with
deprecated parameters.
I also have seen situations where I needed to do a full replica. Maybe
this is triggered by already cleared logs or
files which would have needed to be available for an 'optimized'
replication.
Regards
Fabian
|
Hi Tim, the issue has already been discussed between your HW partner and us. I suggest to concentrate on the Regards |
Hi:
I have a SuSE cluster running on SLES 11 SP4 for SAP. It was built about two years ago and has all of the latest SLES patches as of April 2017. I currently have a service request (682288682) open with Cisco. Yes, I said Cisco, for they provide my SuSE support. The reason for this SR is that since our Basis team upgraded HANA to 1.00.111.00.1454183496 we are witnessing behavior we have not seen before, and it is adding an extra step and some confusion for our junior team members during maintenance activities. Let me explain.
In the past, on our two-node cluster, we could take the slave host out of operation by running “crm node standby”. This would shut the application down on the slave. After whatever work was planned, we could run “crm node online” on the same host and the server would come back online. This included resyncing the replication. Now, however, we are getting a different result. After bringing the slave online, the start errors out with the message below.
Failed actions: rsc-hana-TP1_start_0 on mn500d5a209 'unknown error' (1): call=69, status=complete, exit-reason='none', last-rc-change='Mon May 8 09:57:52 2017', queued=0ms, exec=19762ms
The only way to clear this is to re-register the host with the sr_register command, telling it where to find the master (the other side of the cluster), and then clear the failed action. At this point the cluster starts and resumes replication. I gather we can overcome this by setting the parameter AUTOMATED_REGISTER=true; however, that would introduce registration on cluster failovers and we don’t want to do that. SAP also discourages this.
So the crux of the issue is that it never behaved like this before. Bringing a slave host up and down, either by putting it in standby mode or by stopping and starting openais, never used to force us to register the server for replication when it was previously already a slave. Somehow it understood that this was not necessary.
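For readers following along, the manual workaround described above can be sketched as a dry run that only prints the two commands. This is a sketch under stated assumptions, not the exact commands used on this cluster: the function name `print_reregister_steps`, the host name `primary-host`, the instance number `00`, the site name `SITE2` and the replication mode `sync` are all placeholders. The resource and node names are taken from the failed-action message. The `--replicationMode` option is the newer hdbnsutil spelling; older HANA revisions used `--mode` instead.

```shell
#!/bin/bash
# Dry-run sketch: prints the manual recovery steps instead of executing them.
# All argument values passed below are placeholders except the resource and
# node names, which come from the failed-action message in this thread.
print_reregister_steps() {
    local primary_host="$1"   # host currently running the HANA primary
    local instance_nr="$2"    # HANA instance number of the primary
    local site_name="$3"      # site name for this secondary (assumed)
    local resource="$4"       # cluster resource with the failed start
    local node="$5"           # node on which the start failed

    # Step 1: run as the <sid>adm user on the secondary while HANA is stopped.
    # The replication mode "sync" is an assumption; use the mode of your setup.
    echo "hdbnsutil -sr_register --remoteHost=${primary_host} --remoteInstance=${instance_nr} --replicationMode=sync --name=${site_name}"

    # Step 2: run as root to clear the failed action so the cluster retries.
    echo "crm resource cleanup ${resource} ${node}"
}

print_reregister_steps primary-host 00 SITE2 rsc-hana-TP1 mn500d5a209
```

Printing the commands rather than running them keeps the sketch safe to try on any machine; review the output before pasting it into a shell on the secondary.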
So a bit more information. We are using delta data shipping between hosts. The nameserver trace does kick out these messages below.
[29160]{-1}[-1/-1] 2017-05-25 08:14:05.599552 e sr_dataaccess DisasterRecoveryProtocol.cpp(01171) : This situation can come up, if the primary has been recovered to an older point in time and a new full data shipping is needed.
[29160]{-1}[-1/-1] 2017-05-25 08:14:05.599564 e sr_dataaccess DisasterRecoveryProtocol.cpp(01172) : If this the case, please execute 'hdbnsutil -sr_register ..' on the secondary to trigger a new full data shipping.
Another thing we see in the SAPHana OCF script is this:
if version "$hdbver" ">=" "1.00.110"; then
newParameter=1
This implies the script now takes different actions than before because of our upgrade. I did not dig into it since you folks are already in the know. Is this normal behavior now, and what can we do to bring back the old behavior? Or is something missing and a PTF required? I will try to get the corresponding SuSE SR number from Cisco.
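To illustrate why the upgrade flips this switch, here is a minimal sketch of a version gate of the same shape. It is not the resource agent's own `version` helper: `version_ge` is a hypothetical function written for this example, and it relies on GNU `sort -V` for the comparison.

```shell
#!/bin/bash
# Hypothetical helper, not taken from the SAPHana resource agent.
# Succeeds (exit status 0) when version $1 is greater than or equal to $2.
version_ge() {
    local have="$1" want="$2"
    # sort -V orders version strings numerically per dotted component;
    # if the wanted minimum sorts first (or is equal), the check passes.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

hdbver="1.00.111.00.1454183496"   # the HANA version reported in this thread
newParameter=0
if version_ge "$hdbver" "1.00.110"; then
    # From this revision on, the newer hdbnsutil options are selected.
    newParameter=1
fi
echo "newParameter=${newParameter}"
```

With the version from this thread the gate is taken, which matches the observation that the script follows a different code path after the upgrade.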
Tim