Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using cache.shared in JavaScript rules can lead to unexpected java.lang.IllegalStateException: Multi threaded access requested by thread #4413

Open
SkyLined opened this issue Oct 9, 2024 · 28 comments
Labels
bug An unexpected problem or unintended behavior of the Core

Comments

@SkyLined
Copy link

SkyLined commented Oct 9, 2024

Expected Behavior

The cache.shared object is documented as allowing any script to access it and "share" information. The documentation comes with an example for cache.private that we can adjust to use cache.shared:

var counter = cache.private.get('counter');
if (counter === null) {
  counter = { times: 0 };
  cache.private.put('counter', counter);
}
console.log('Count', counter.times++);

We would expect the above code to work when used by two rules; every time either of the rules fires, the code is executed, the counter increased and the old value logged.

Current Behavior

cache.shared allows sharing JavaScript objects between two rules. JavaScript code for two different rules will get executed in two different threads. The JavaScript engine used by OpenHAB is not designed to be run in multiple threads simultaneously. The Java code is aware of this and will throw a java.lang.IllegalStateException exception when two threads attempt to access the same JavaScript object. The message associate with this exception is:

Multi threaded access requested by thread Thread[<thread details>] but is not allowed for language(s) js.

This exception unexpectedly stop the execution of the rule using cache.shared and breaks its functionality.

Possible Solution

  • Create a lock that is acquired when JavaScript code accesses a shared object and released once that objects is garbage collected. This prevents two scripts accessing the same object simultaneously but does potentially introduce unexpected delays because scripts may end up waiting for each other to finish execution. If a script never releases a shared object, this might introduce a deadlock: other scripts that attempt to use the object will never be able to do so, as the lock is permanently held.
  • Create a global lock that is acquired before executing any JavaScript code and release the lock after. This makes sure JavaScript code never runs simultaneously in multiple threads. This can also introduce delays in rule execution, as one rule has to wait for another to finish. There is less risk of a deadlock, because if one script never releases a reference to an object, another can still execute and access the object. Only if a script gets into an infinite loop will no other scripts ever be able to execute. That itself seems like a bug in the script that should already be detected and the script terminated.
  • Document this behavior and advise people not to use cache.shared to share information between multiple rules. Obviously, limiting the amount of sharing seriously reduces the usefulness of cache.shared, so this is more of a stop-gap than a solution IMHO.

Steps to Reproduce (for Bugs)

This issue quite easy to reproduce If we create two JavaScript rules that accesses a shared object over a prolonged period and execute them at nearly the same time. In this set-up, the JavaScript code is almost guaranteed to simultaneously attempt to access the object, which triggers the exception. Here's a step by step guide to do this:

  1. Create two rules named Test1 and Test2
  2. Create a trigger for Test1 that fires every second (cron: * * * * * ? *)
  3. Test2 does not need a trigger: you will run it manually later.
  4. Add an action to both Test1 and Test2 to execute the same JavaScript code (found below).
  5. Start Test1 (the cron rule)
  6. Manually run Test2
  7. Check the log for exception reports.

Here is the code used in both rules:

step = "try {...}";
try {
  step = "cache.shared.get(...)"
  var counter = cache.shared.get('counter');
  if (counter === null) {
    counter = { times: 0 };
    step = "cache.shared.put(...)"
    cache.shared.put('counter', counter);
  }
  // repeatedly access the shared object for half a second to create a large time window
  // in which two threads are likely to attempt to access the shared object.
  for (const end_time = new Date().valueOf() + 500; new Date().valueOf() < end_time;) {
    step = "counter.times++";
    counter.times++;
  };
} catch (exception) {
  // Log in which step the exception happened and what event we were passed (timer or manual).
  console.log(`Error in step ${step} for event ${event}`);
  throw exception;
};

Here is an example log output showing the exception after starting the first rule and manually running the second:

2024-10-08 17:12:45.019 [INFO ] [g.openhab.automation.script.ui.Test1] - Error in step cache.shared.get(...) for event Timer 1 triggered.
2024-10-08 17:12:45.022 [ERROR] [internal.handler.ScriptActionHandler] - Script execution of rule with UID 'Test1' failed: java.lang.IllegalStateException: Multi threaded access requested by thread Thread[OH-rule-Test1-1,5,main] but is not allowed for language(s) js.
<snip: exception details and stack follow>

The exception can happen in either Test1 or Test2 depending on timing.

Context

I am creating a queue for audio files to be played. The queue is stored in an Array in cache.shared. Any rule can use Array.push() to add a file to the queue. A dedicated rule uses a timer to repeatedly check if it is time to play the next file and uses Array.shift() to get the next file from the queue and send it to the player. This system breaks when two rules try to add or remove files to/from the queue at the same time. These exceptions make the system unreliable.

Your Environment

  • openHAB 4.2.1 (Release Version)
  • Java 17.0.12
  • OS Linux/6.1.21-v8+ (aarch64) (Raspbian bullseye)
@rkoshak
Copy link

rkoshak commented Oct 22, 2024

Given this is only an issue for JS Scripting, this issue might need to be moved to the add-ons repo unless core is amenable to putting a lock on the shared cache for all languages. A lock on the private cache might be advisable as well given timers. If not, locks will need to be impelmented in the JS Scripting add-on similar to the locks already in place for timers, assuming such locks are feasible.

Note that if you have rules that call other rules, there are also no locks in place in that case so if two rules call the same other rule at the same time the exception will occur and as far as I've been told it is not technically feasible to put locks in place in this context, at least not in the JS Scripting add-on by itself.

@florian-h05, you are most familiar with the locking stuff that was done for JS Scripting so far, wdyt?

@SkyLined
Copy link
Author

As I work-around, I have been trying to create String Items to store values (as JSON) and Groups to simulate arrays of such values. This seems to mostly avoid the issue: I very infrequently encounter this exception now. It doesn't completely resolve it and does make the Items page near impossible to use (as explained in openhab/openhab-webui#2819). If I come up with a better work-around, I'll let you know.

@florian-h05
Copy link
Contributor

Wrt to the proposed solutions by OP, 2 and 3 are no option.
1 seems like the best of the proposed solutions, however I think any solution is difficult to implement.

Though I wonder if the shared cache does need a synchronisation in general, because even if the script engine allows multi-threaded access to it‘s context, modifying something in the shared cache by two different threads can lead to bad side-effects.
But in that case the difficulty would be that the lock for JS Scripting should be the lock of the one who stored the value to the shared cache.

If a solution is only to be implemented in JS Scripting (which however would not synchronise the access from scripts in other languages than JS), I think we could wrap core‘s shared cache and associate the lock of the engine with the value in the shared cache on value store.
Then on value retrieval, that lock would need to be locked and the script engine would need to unlock the lock after execution . This way we always have an „owner“ of something in the shared script and synchronise on the lock of that owner.

GraalJS approach to disallow multi-threaded access is sometimes a PITA, but I understand that decision given the side-effects that can occur and given that NodeJS is single-threaded as well.

A lock on the private cache might be advisable as well given timers.

I don‘t think that is needed, as the values in the private cache can only be accessed by their „owner“ engine. As timer execution is synchronised in the engine, it is not possible for a timer to execute while the script is active somewhere else.

@rkoshak
Copy link

rkoshak commented Oct 23, 2024

Though I wonder if the shared cache does need a synchronisation in general, because even if the script engine allows multi-threaded access to it‘s context, modifying something in the shared cache by two different threads can lead to bad side-effects.

I assumed calls to put would already be synchronized at least. If it's not I agree that's a problem above and beyond GraalVM's restrictions.

it is not possible for a timer to execute while the script is active somewhere else.
👍
I wasn't thinking that through. Of course there's no problem with private cache given the locks already in place for timers.

@florian-h05
Copy link
Contributor

I assumed calls to put would already be synchronized at least.

They are, just checked the core code 👍

@florian-h05
Copy link
Contributor

@J-N-K May I ask for your help here?

The best solution I can currently think of is that every value in the shared cache has an owner, that can provide a lock which has to be acquired if another script is accessing the value. The lock has to be unlocked after the execution of that other script. However this poses the risk that scripts block others from execution for a possibly long time.

@SkyLined
Copy link
Author

this poses the risk that scripts block others from execution for a possibly long time.

It sounds like that is going to be a solution that will cause unexpected and unexplained (e.g. hard to debug) freezes in rules. For instance, I have a rule that handles calendar events. The rule uses timers to do things a number of minutes before/after an event starts/ends. These timers are set as soon as the event is seen (i.e. one of the next two events on the calendar). This script may therefore set these timers hours or even days before they fire. This rule accesses values from the shared cache before setting these timers, so it will block access to those values pretty much continuously. If anyone has a similar rule, they will find themselves with a completely broken system when this fix is released. I'm sure that's not ideal.

@florian-h05
Copy link
Contributor

This rule accesses values from the shared cache before setting these timers, so it will block access to those values pretty much continuously.

Wrong. The shared cache is only blocked while the script engine is executing code. In case of timers, you run your script and create the timer in that run. The timer is then scheduled, and once the timer expires the callback is run by the same engine. In between, this engine can do other stuff and if it is not doing anything, the stuff in the shared cache it owns can be accessed by others.

@J-N-K
Copy link
Member

J-N-K commented Oct 24, 2024

I don't fully understand the issue. The shared caches uses locks on put/get/remove, so where do you see a problem here?

@florian-h05
Copy link
Contributor

You have two rules, both having their own GraalJS script engine.
Script A puts a JS object into the shared cache and does some heavy calculations.
At the same time, script B gets that JS object from the shared cache.

The problem here is not that the cache itself does not synchronize its operations, but that script B accesses an object that is owned by script A while script A is currently running and this is a illegal access to script A's context from script B's thread.
This problem can be fixed by having script B wait with accessing the object until script A is not running anymore, but the script B has to use the lock of the engine of script A for that.

@J-N-K
Copy link
Member

J-N-K commented Oct 24, 2024

IMO that's wrong usage of the cache. You should store unmodifiable objects there, and instead of modifying an existing object create a modified copy and put that to the cache.

@SkyLined
Copy link
Author

SkyLined commented Oct 24, 2024

Indeed: I can get a reference to a single JS object instance in two rules through the shared cache. These rules can run in separate threads and JavaScript engines are designed to run single-threaded. Using the cache to access properties of a single object shared between two rules at the same time appears to result in exceptions.

The example code I created writes to the object but the problem should exist if it reads properties as well. I will run a quick test to make sure.

@SkyLined
Copy link
Author

I've updated the repro code to only read a property and not modify the object:

step = "try {...}";
try {
  step = "cache.shared.get(...)"
  let object = cache.shared.get('object');
  if (object === null) {
    object = { property: 1 };
    step = "cache.shared.put(...)"
    cache.shared.put('object', object);
  }
  // repeatedly access the shared object for half a second to create a large time window
  // in which two threads are likely to attempt to access the shared object.
  let counter = 0;
  for (const end_time = new Date().valueOf() + 500; new Date().valueOf() < end_time;) {
    step = "read object.property";
    counter += object.property;
  };
} catch (exception) {
  // Log in which step the exception happened and what event we were passed (timer or manual).
  console.log(`Error in step ${step} for event ${event}`);
  throw exception;
};

Here's a sample of the errors I'm getting after starting Test1 to run every second and manually running Test2:

2024-10-24 16:43:51.454 [ERROR] [e.automation.internal.RuleEngineImpl] - Failed to execute rule 'Test1': Failed to execute action: 1(null)
2024-10-24 16:43:56.451 [ERROR] [e.automation.internal.RuleEngineImpl] - Failed to execute rule 'Test1': Failed to execute action: 1(null)
2024-10-24 16:43:57.450 [ERROR] [e.automation.internal.RuleEngineImpl] - Failed to execute rule 'Test1': Failed to execute action: 1(Multi threaded access requested by thread Thread[OH-rule-Test1-1,5,main] but is not allowed for language(s) js.)
2024-10-24 16:43:59.481 [INFO ] [g.openhab.automation.script.ui.Test2] - Error in step cache.shared.get(...) for event Execution triggered by manual
2024-10-24 16:43:59.488 [ERROR] [internal.handler.ScriptActionHandler] - Script execution of rule with UID 'Test2' failed: java.lang.IllegalStateException: Multi threaded access requested by thread Thread[qtp1338110219-1366,5,main] but is not allowed for language(s) js.

@SkyLined
Copy link
Author

SkyLined commented Oct 24, 2024

Note that sometimes I can catch an exception and sometimes I get the "Failed to execute rule..." error.

@J-N-K
Copy link
Member

J-N-K commented Oct 24, 2024

I think the issue really is that you try to use an object that itself does not support multi-thread access. It's not an issue of the cache. Try something simple, put an integer to the cache, start with 0, increase after every read and put the new number to the cache. This should work perfectly fine.

@florian-h05
Copy link
Contributor

I think the issue really is that you try to use an object that itself does not support multi-thread access.

That's the problem with objects in Graal ... primitives should work perfectly fine.
You should be able to store objects and arrays by serialising to JSON and storing the string, later parse the JSON string.

It's not an issue of the cache

👍

@SkyLined
Copy link
Author

Sure that would work, but I'm trying to allow two rules to communicate with more than just integers. I've got a number of rules that want to play messages through audio, these store a string that refers to the mp3 file to be played in a shared array. A single rule control the audio player and removes the top entry from the array to play, and the the next, etc. I cannot implement that with a shared cache that only allows integers.

@SkyLined
Copy link
Author

SkyLined commented Oct 24, 2024

Note that my original repro is based on an example from the documentation, which uses an object with a property that serves as a counter. This suggests that this is a totally normal thing to do and should not result in exceptions. See here: https://www.openhab.org/addons/automation/jsscripting/#cache

@SkyLined
Copy link
Author

SkyLined commented Oct 24, 2024

You should be able to store objects and arrays by serialising to JSON and storing the string, later parse the JSON string.

This will result in issues. One rule will read the string, parse it to an array, add an mp3 file, stringify and set. Meanwhile another rule will do the same. Both get the same initial string from the cache, add a string, and write to the cache. The last one will overwrite the changes made by the first one.

@florian-h05
Copy link
Contributor

Now that I know what you want to achieve, try that out:

Have one String Item. All rules that want to queue an mp3 file should sent the filename to this Item.
Use this Item as trigger for the audio player rule: If the audio player rule is triggered by that Item, add the received command to the array, which is internally stored by the audio player, either in private cache or if you use file-based scripting, a global variable.

@SkyLined
Copy link
Author

I had that setup initially but dropped it for some reason - I don't recall what it was. I can try it again and find out why that didn't work for me.

But if the shared cache should never contain anything other than primitives, that should be documented and it should throw an error when you try to do so IMHO.

@florian-h05
Copy link
Contributor

But if the shared cache should never contain anything other than primitives, that should be documented and it should throw an error when you try to do so IMHO.

Sure - the docs are only that way because no one noticed this problem before. I will take care of documentation and to log a warning when putting an object into shared cache.

@rkoshak
Copy link

rkoshak commented Oct 24, 2024

Have one String Item. All rules that want to queue an mp3 file should sent the filename to this Item...

This is almost exactly the use case I had when I created the Separation of Behavior and Gatekeeper design patterns and Gatekeeper class in openhab_rules_toole.

In either case you'd only need the private cache, not the shared cache, so there won't be any multithreaded access problems. However, note that you will want to use a String Item for this and not calling the rule from another rule becuase that latter approach is not locked so if two rules attempt to call the same rule at the same time the multithreaded exception will occur.

I still think having locks on the Objects in the shared cache is still a good idea because even if one used immutable Objects, the problem @SkyLined points out could still occur where two rules (regardless of the language) start at near the same time, each grab the value from the cache at near the same time, one modifies it and puts the modifications back into the cache. The other one modifies their now stale copy of the Object and puts that back into the cache, wiping out the changes from the first.

This remains a potential problem regardless of the rules language.

@florian-h05
Copy link
Contributor

However, note that you will want to use a String Item for this and not calling the rule from another rule becuase that latter approach is not locked so if two rules attempt to call the same rule at the same time the multithreaded exception will occur.

I think that there won't be a multithreaded exception, just tried the following: Have a rule A that has a JS script action with a sleep of 10 seconds to ensure the rule runs long enough. Have a rule B that attemps to run rule A and pass some information. Running B while A is running does not lead to a multithreaded exception but instead to:

17:38:56.055 [ERROR] [re.automation.internal.RuleEngineImpl] - Failed to execute rule ‘A' with status 'RUNNING'

florian-h05 added a commit to florian-h05/openhab-js that referenced this issue Oct 24, 2024
Update docs and log a warning if done so.

Refs openhab/openhab-core#4413.

Signed-off-by: Florian Hotze <dev@florianhotze.com>
florian-h05 added a commit to openhab/openhab-js that referenced this issue Oct 24, 2024
Update docs and log a warning if done so.

Refs openhab/openhab-core#4413.

Signed-off-by: Florian Hotze <dev@florianhotze.com>
@rkoshak
Copy link

rkoshak commented Oct 24, 2024

I routinely see Multi-threaded exceptions in my config. Something is not being properly locked in the combination of rules running which create timers which call another rule. Assuming that rule locking and timer locking are working all that's left is two rules calling the same other rule at the same time.

But I don't think that test case actually tests what I'm talking about. My internet is down right now so I can't test this just yet, but I'd test it as follows:

Create a script or rule A with a JS Action and make sure that the script also has a variable or two. Put the sleep in that for sure at the end to make sure it ruins long enough but I don't know that the sleep is necessary. But now create rules B and C with a cron trigger and JS action that calls rule A at the same time with passed values, at least one of which is an Object (I don't know it that matters or not, primites may also cause the error).

You need three rules to test the behavior I'm talking about because I don't think it has to do with the fact that rule A is already running as much as rule B and C are trying to call it at the same time. IIRC you've said that rules.runRule bypasses the normal rule locking so this is what I've figure is the root cause

Or maybe some recent changes have changed this behavior and the engine now generates a new error? Im on milestone releases. I wish I could get to my logs right now.

I'll try to remember to come back once I have internet again.

@SkyLined
Copy link
Author

I think I remember why I stopped using a single string Item to queue files and a rule that triggers when it's set: The rule needed a way to play files one after the other when two rules set the Item around the same time. For that I created a queue in a local variable, which did not cause problems, and started a timer to play the next file in the queue when the previous one was done. I experienced a lot of exceptions with these timers, which would break this setup. I attempted to work around that by creating a rule that runs every second in place of the timer, and use the shared cache to allow both rules to access the queue.

I'll revert back to using timers and see if I can pinpoint all multi-threaded access issues I get with them. Fixing that would make life a lot easier for me, and I expect everyone that uses them.

@florian-h05
Copy link
Contributor

@rkoshak

With the introduction of script pre-compilation, a regression wrt to multi-thread access synchronization has been introduced for UI-based scripts, which has been fixed just yesterday: openhab/openhab-addons#17510
This PR also fixes a multi-thread access synchronization issue with UI-based scripts that occured when a timer was waiting for a script execution to finish to take over the context. The cause of this issue is unknown - the debugger proved the lock was properly acquired and unlocked and during debugging the issue was gone. I suspect that there is some timing issue in the milliseconds inside the JVM, and I have been able to reproducible fix it on my dev and prod systems with openhab/openhab-addons#17510.
If you update to the latest milestone, you can use the script from openhab/openhab-addons#17510 description to verify the fix on your machine.

@SkyLined

If you have used UI-based rules for your first attempt, the above could apply to your problems.
If you want to be sure that there are no multi-threading issues, use file-based rules or at least use a file-based rule for the audio player - my production system uses a huge number of timers and I have never experienced any multi-threading problem with my 100% file-based rules.

@rkoshak
Copy link

rkoshak commented Oct 25, 2024

the debugger proved the lock was properly acquired and unlocked and during debugging the issue was gone.

Gotta love Heisenbugs. I'll see about updating to the snapshot this weekend and see if my periodic multithreaded exceptions go away.

The rule needed a way to play files one after the other when two rules set the Item around the same time.
Again, this is exactly why I created Gatekeeper. It does the queueuing and timer management.

They way I've set this up in the past was something along these lines.

I trigger the rule based on command to a String Item. The code for the Action is along these lines:

var {Gatekeeper} = require("openhab_rules_tools");
var gk = cache.private.get("gatekeeper", () => Gatekeeper());

// Calculate how long after this command runs to wait before the next one can run
var timeout = <some calculation, maybe based on the number of characters in event.itemCommand>;
gk.addCommand(timeout, () => {
  // what ever you want to do with this command, e.g. tts
});

The gatekeeper is in the private cache and it manages a queue of commands and a Timer to run a command (a passed in function) and wait the timeout time minus how ever long it took to run the command function before running the next command on the queue.

I don't think I ever experienced a multithreaded exception from Gatekeeper in my setup, so this approach might work around the regression that @florian-h05 just fixed too. Though it could also just be a fluke of timing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior of the Core
Projects
None yet
Development

No branches or pull requests

4 participants