-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TDbCache::setValue collisions - TApplicationComponent::getClassFxEvents thrashing cache #940
Comments
Possibly related to issue #820 and this change in PR #841: https://github.com/pradosoft/prado/pull/841/files#diff-e89e3dcbd35b5518b965e802e9aa3d2d3ecb051a7c6fe6feaafd0928c17bc277 |
Indeed, I would confirm that problem is related to the issue #820. If I change the application mode to Normal instead of Performance, it seems to be ok. Yes, I'am with mariadb/innodb 10.11.2. The cache was removed completly. Maybe the folloing code here in the function TApplicationComponent::getClassFxEvents: if ($mode === TApplicationMode::Performance || isset(Prado::$classMap[$baseClassName])) { |
It is difficult to debug that function. I have three comments:
The whole point is to load all the class data once and put it in one place for loading once. |
@majuca The issue is that the code "shouldn't" be getting to that point. The TApplicationSTatePersister should have had the GetClassFxEvents loaded from cache already. That code only runs once to load the cache, and not a second time, unless the cache is manually busted. If that code is running multiple times (as it seems to be doing in your instance), then the classFxEvents are not being cached properly. |
PS. The TApplicationStatePersister (being loaded) is in the Class map, and would set the cache regardless of being in performance mode. That's the weird part. |
I think I got it. The scope of When |
@belisoful Many thanks for your realy fast fix. I will test it today, I have several server where I can check this. |
Mathematically (technically), this is the issue. and God bless you for having servers to test the correction. |
Bad news, I applied the fix on one server and the error still happens, some examples:
|
It still seems that the cache is not getting set properly and is setting the cache much more than needed. I can imagine the first second of a web server going online, loading all its classes, and there being a collision here or maybe there, but after loading the first time, the code being executed should never be hit, especially in a performance mode. That's the part that is boggling my mind. Are you running a shared DB cache system? Could it be that the other instances are clobbering the cache causing the correction to not work properly? If other instances are incorrectly invalidating the shared DB cache, the new code is going to have the same issue. |
I can see how your situation would happen in a shared DB cache environment where other instances are not behaving nicely like the new fix. The other misbehaving servers will thrash the cache despite the fix. In such a case, the solution is to deploy the micro-fix to all servers in the shared environment. It'll speed things up too. |
such suspense... lol. 😅 (j/k. more accurately: 🕵️) |
I doing some tests this week, last end of wee was very busy. But, in our application, the database is not shared. But when this problem happens, it seams on servers where we have a lot of request by seconds. |
Especially in performance mode, that code shouldn't be touched after initial load. So somehow, the cache is still being thrashed. |
Mmm, if I put a debug in the function getClassFxEvents, this one is not called only one time, or perhaps I misunderstood your remark concerning the initial load. Indeed, each time a new class is used (like a TPage), this the APP_COMPONENT_FX_CACHE cache is updated. I would like to say that the problem is coming from the function setValue of TDbCache. Indeed, this function is not thread safe (atomic):
As we delete the value and create it just after, two threads could have conflict and specialy if the two thread started a transaction. This explain the message :
To have the function setValue as an atomic function, I would propose the following code:
I currently trying this code on one server to see if it is better. |
There is SQL for adding or updating a key if it exists, which may be a better route than transaction with deleting then re-adding. The "Add but update if exists" may be more efficient than a transaction as I believe that would be atomic. What I am saying is that the first half of getClassFxEvents should be called with each new instance. but once the cache is loaded, the second half shouldn't be run, ever, especially in performance mode. your code keep running the second half and, again, it shouldn't be running the second half once you are cached and loaded. Somehow, your cache is getting busted... leading to "setValue" when it shouldn't be called. the fact that you are even having a "setValue" is the problem. That it isn't atomic is a second problem. To explain further, the first half gets the data, but all the data should be cached and shouldn't have to I hope this isn't lost in translation. |
I think the non-transaction atomic SQL we are looking for is |
Put simply.... if you are getting a a butch of requests at the same time doing a "setValue" (and even remotely possibly colliding) then the caching isn't working right!!!! |
Sure, you are right, your sql INSERT is for sure better than a SQL transaction and should fix the atomic problem of the funtion setValue. And concerning the function getClassFxEvents, for me is it now ok and the only problem is see is on the setValue. |
Thank you for that. The getClassFxEvents working right was keeping me up. |
This should work for you:
NM on the addition stuff. |
This is a better implementation of getClassFXEvents. it shaves a few milliseconds from the testing time. woot.
The getClassFxEvents still isn't working quite right because TDBCache isn't loaded for most of its use. It's used mainly in loading modules (and their global fx event handlers; speeding it up) but TDBCache is a module that isn't installed until after the modules are loaded. As such, getClassFXEvents shouldn't be using TDBCache. So this update to TDBCache is important for the DB Cache to work correctly in heavily concurrent environments, but it will be removed from this context. The TDBCache won't be any further issue in this function because it's being totally removed. The solution for getClassFXEvents is to do an exclusive lock write to a new FXEvents cache file in the "runtime" folder. When loading, it reads that file. If there are any new classes to be cached by any instance, they do an exclusive lock write to the file. If the file cannot be locked, the writing is (properly) skipped. this will stop any concurrency issues. This way, the cache is built and restored immediately and properly without TDBCache. I have a new function for setValue that takes into account most things. This SQL has been tested in SQLite3 and simulated Oracle and MSSQL (using ChatGPT as if it were an Oracle and MSSQL database). It also includes your proposed transaction fix for any unsupported DBs. Thank you for that.
|
Alright. I have a new getClassFxEvents method that doesn't use TDBCache. So the DB collisions won't be a problem at that point any more. In essence, due to TDBCache being a Module, it wasn't loading the cache properly for when the modules load... which is the point this method is designed to alleviate. TDBCache loads too late. This uses a file to store the cache as the cache is loaded and from then on its only read only, so no further writes when things are loaded up. Normal mode only stores the events of the framework, and performance mode stores the events of all classes. Regarding TDBCache, that isn't a problem but it still should be updated to handle concurrency better. I have included an interesting solution: Driver specific SQL for 3 different kinds of systems and a fallback to @majuca beautifully simple transaction based "set" (delete/insert, "upsert"). (I mean "simple" in the mathematical simple solutions that solve things like E=mc^2; where simplicity is the beauty of the thing) Others have run into this problem of "universal DB Set row" and reason why I think we should include a good solution in PRADO. Regarding testing the Oracle, DB2, MSSQL, and Postgres SQL.... Interestingly, I had ChatGPT pretend to be these databases in turn, "...accept only SQL statements that the DB would accept, and output only what the DB would output. Do not output anything but what the DB would output" I then entered the DB specific code (with prep work for the db having data to "set"). I tested with both existing data (update) and new data (insert). It worked properly. I believe these to be valid tests, I entered the SQL for the other DB systems and only got errors... as expected. I think the new code on both fronts will resolve your issues. I didn't expect this bug to expose the faults of getClassFXEvents. We found the faults, and the PR should fix it. |
Nice solution. I could try it on a server having a lot of requests and give you a feed back. |
Indeed. That would be lovely. This should speed things up a little bit too. |
…Value uses atomic SQL (#952) * #940 TApplicationComponent::getClassFxEvents This is a better implementation of getClassFXEvents. it shaves a few milliseconds from the testing time. woot. * #940 TDBCache::setValue uses Atomic SQL "INSERT... ON DUPLICATE" * TDBCache::setValue uses "REPLACE" rather than "INSERT INTO ON DUPLICATE KEY UPDATE" * shortened the sql * TDBCache::setValue selects on database type for SQL, fallback to transaction delete/add And corrected a "/" into "DIRECTORY_SEPARATOR" in TApplicationStatePersister * TApplicationComponent::getClassFxEvents uses runtime file cache Rather than waiting for TDBCache module to load, this uses a runtime file cache to store the fxEvents. * remove testing * getClassFXEvents doesn't cache in Debug Mode * removed cleanup that doesn't need to happen for Debug mode * no change to the ticket 589test
No news is good news. |
Are we ready to close this? |
Also, I'd like to reiterate that this bug is severe enough and the fix important enough as to maybe warrant releasing 4.2.3 ASAP (sooner rather than later). I think it's also important to get closures as PRADO event handlers out there too. |
Sorry for my delay, I was not able to give you a feedback these last months for personal reasons. I retry the patch this week and give you a feedback at least friday. |
@majuca I hope things are well for you and/or moving in the right direction. |
I try the patch on two server with a lot of traffic. I does't have any problem or error this last days. I would confirm that it is working as expected. |
Could we imagine making a new version of PRADO with just this one change? The difference is truly incredible. |
That's what I was suggesting.... 4.2.3 w/ all the event updates 'n stuff. I saw immediately how critical this update was when I put it in place. |
I can prepare a release, but since we bumped up requirements from php 7.4 to php 8.1 I'd go directly for a 4.3.0 if you don't mind |
Done, 4.3.0 is released. Updated prado-demos, prado-app, the pradoframework.net and api docs websites. |
yeeeeeesssss..... I entirely agree. and thank you. I hope people start contributing PRADO extensions in composer/packagist. That's a huge one. It was such a small change, with two functions, to unlock so much potential. |
Good news. A big thank you to both of you for the incredible job you're doing. |
HI, since the version 4.2.2, we have a lot error like this:
An idee?
The text was updated successfully, but these errors were encountered: