Skip to content

Conversation

@ahgittin
Copy link
Contributor

Builds on #480, really quite small once that is merged, tidies up many of the "immediate"-related annoyances.

/cc @aledsage @grkvlt @neykov @sjcorbett

…-blocking.

Also updates tests. Mainly uses ImmediateSupplier and InterruptingImmediateSupplier
for true non-blocking evaluation, with some other tricks used in other places.

Some non-reliable calls may still fail, but most have been repaired,
and the rest should be.
(If the old semantics are _really_ needed you can resolve with a short wait.)

Re-enables many of the tests disabled for  https://issues.apache.org/jira/browse/BROOKLYN-272
@aledsage
Copy link
Contributor

aledsage commented Mar 2, 2017

@ahgittin can you rebase this against master please, now that #480 is merged

@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 6, 2017

@aledsage rebased and ready for review

Copy link
Contributor

@aledsage aledsage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahgittin I've been playing around with more tests, and I fear there's a chance we have a fundamental difference of opinion of what an "immediate supplier" should be allowed to do!

For me, "immediate supplier" means that the value's resolution is not dependent on other tasks executing within Brooklyn. However, it could do "blocking" calls such as IO, which would normally result in an InterruptedException being thrown (if the thread were set to interrupted). For example, the supplier could be defined as $brooklyn:external("vault", "aws.secretKey").

This use-case is supported by the explicit implementation in BrooklynDslCommon.DslExternal of the ImmediateSupplier interface. We are mostly ok with using this pattern, because we check if the value implements ImmediateSupplier, and call getImmediately() (without wrapping it in a InterruptingImmediateSupplier, which would cause it to be interrupted).

However, when I play around with a variant of EntityConfigTest.testGetImmediateSupplierNonBlockingMap() that uses a a value of type ImmediateSupplier, it now always fails. Previously, it passed (most of the time, failing occasionally on slow/overloaded servers!) because we gave it 500ms to execute. Now it wraps it in an InterruptingImmediateSupplier which causes it to fail.

See #581 for that test. If you agree that is a valid use-case, then can you please merge that PR first and rebase you branch accordingly?

Following the code path (see stacktrace snippet below):

  1. the AbstractConfigurationSupportInternal.getNonBlockingResolvingStructuredKey() has created a DynamicSequentialTask, whose job is to call get(key().
  2. BasicExecutionContext.getImmediately has retrieved the underlying job from this task, which is a DynamicSequentialTask.DstJob (that contains a single job).
  3. It therefore wraps it in InterruptingImmediateSupplier (because DstJob does not implement ImmediateSupplier).
  4. It executes the task, but with the task marked as cancelled. Therefore the Thread.sleep(1) gets interrupted immediately.
Thread [main] (Suspended)	
	BasicExecutionContext.getImmediately(Object) line: 141	
	AbstractEntity$BasicConfigurationSupport(AbstractConfigurationSupportInternal).getNonBlockingResolvingStructuredKey(ConfigKey<T>) line: 115	
	AbstractEntity$BasicConfigurationSupport(AbstractConfigurationSupportInternal).getNonBlocking(ConfigKey<T>) line: 82	
	EntityConfigTest2$ConfigNonBlockingFixture.runGetConfigNonBlockingInMap() line: 421	
	EntityConfigTest2.testGetExplicitImmediateSupplierNonBlockingMap() line: 534	

This looks tricky to fix "properly". Even if we avoid the use of DynamicSequentialTask, getImmediately() still only gets the wrapper job org.apache.brooklyn.core.objs.AbstractConfigurationSupportInternal$1. It therefore doesn't know that the underlying value is an ImmediateSupplier so it gets wrapped and thus interrupted.

I suggest for now that we put getNonBlockingResolvingStructuredKey() back to what it was like before (i.e. the ugly use of t.get(ValueResolver.NON_BLOCKING_WAIT)).

Otherwise, anyone who is using $brooklyn:external inside a MapConfigKey will never get their value if using "immediately".

.getMaybe();
if (resolved.isAbsent()) return Maybe.Absent.<T>castAbsent(resolved);
if (resolved.get()==marker) {
// TODO changed Feb 2017, previously returned absent, in contrast to what the javadoc says
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this code will ever be called. The .defaultValue(marker) only affects the behaviour of .get(). It doesn't affect the result of .getMaybe(), so we'll never have the marker object returned (now that you've changed it to use getMaybe()).

I'm fine with us leaving this in for your PR, but I'd be interested what situation you're trying to cover with this. I tried writing a few more tests but could never get into this code path.

My understanding of what it was doing previously... By setting the defaultValue(marker) and calling get(), it was trying to avoid the exception being thrown when getMaybe() would have return an absent (so instead it returned us the marker). That would seem to agree with the javadoc (which javadoc are you referring to?).

However, I agree that your change to call getMaybe() instead is much nicer - we get to keep the original Maybe.absent() value has the better explanation of why it's absent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you are correct here. I've removed the use of marker altogether.

@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 7, 2017

@aledsage Sorry for the confusion. Thanks for long exposition. I thought almost everyone hated the "wait" as part of "immediate", from discussions ages ago somewhere. Waiting 200ms is smelly. And it hurts us writing deterministic tests, and causes nasty delays when we validate (e.g. add 1000 unresolved dependents as config keys and startup will be 3m+ while it waits for them to "validate"!). Surprised that you do like it as usually it's me being loosy goosy!

Immediate eval was not guaranteed before in any circumstances (anything can take >200ms or whatever the threshhold was) so it is a strict improvement in contract that now we do guarantee immediate eval for non-blocking code.

Yes some blocking tasks which previously succeeded in practice for immediate eval now will not. However I think there are very few of these. None of the old tests failed which gave confidence for this change (much of which was done in #480 based on https://issues.apache.org/jira/browse/BROOKLYN-329 hence the brevity in this PR).

I'm investigating your new test but to me sleeping (for any length) while someone is waiting on your "immediate" value shouldn't be acceptable. If the caller does want short-timeout behaviour that codepath is still available (and used by immediate in a few cases, e.g. shared tasks as config value). (It might be appropriate when the REST API is doing get-all-config for instance, possibly as an API argument?)

As for external(...) is it really the case that such refs now all fail? Or is it just those that require blocking I/O? Either way I think we rewrite any bits that are important (local file access? though hardly necessary) and accept that immediate isn't supported for others. (Haven't looked at this yet -- wanted to reply conceptually before looking at code.)

I think there are some places that a suprising InterruptedException might be thrown (ISTR latches check this even when not blocking) but we can live with this.

@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 7, 2017

@aledsage I think I better understand now. It isn't the sleeps that are the offenders -- it is the DST/map interplay. Still getting my head round it but wanted to say that.

@neykov
Copy link
Member

neykov commented Mar 7, 2017

To add some more context (without having looked at the changes) - before adding ImmediateSupplier tasks with a short timeout (the way we implemented "immediate") would get cancelled even before the thread had the chance to get scheduled. So it's not all about whether the thread itself blocks for more than what we are ready to wait for.
I think it's fine for ImmediateSupplier to take longer than ~200ms. That's just a random limit we chose. It's easy to get over it on systems under load when threads are involved - it was easily reproducible with just the clocker blueprints. What matters is it knows it will finish in a sensible amount of time and is not blocked indefinitely by tasks it doesn't have control over.

@aledsage
Copy link
Contributor

aledsage commented Mar 7, 2017

@neykov you beat me to replying - agreed :-)

@ahgittin to give some more background, the original use-case for "immediate" stemmed from a failure in clocker: https://github.com/brooklyncentral/clocker/blob/v2.1.0/swarm/catalog/swarm/swarm.bom#L260-L270

In a Transformer enricher, the target value is evaluated each time the "trigger sensor" changes, but its value may not yet be resolvable. It uses the idea of "immediately" to avoid leaving a bunch of threads running - see https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/core/src/main/java/org/apache/brooklyn/enricher/stock/Transformer.java#L81-L98. Originally this was purely time-based (cancel after 200ms).

However, sometimes on a heavily loaded machine, the "immediately" would time out after 200ms even though the value really was available. The trigger sensor would never change again, so the target sensor would never be set.

By adding ImmediateSupplier we were able to call something that would only return "absent" if it really could not be resolved yet.

It is that style of use-case / usage that makes me want to let $brooklyn:external do IO operations (rather than aborting them by interrupting the thread).

I believe this model would be simpler for a user to understand, rather than explaining to tell them that their config resolution will fail in certain situations, such as the Transformer, if it requires IO operations (e.g. if it tries to contact vault, or read from a file, etc).

Longer term, I definitely want to get rid of the ugly/brittle "wait for up to 500ms", but I'd prefer we leave that in for a while longer rather than remove support for the above use-case.

@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 7, 2017

Thanks for the explanation. I realized from your (excellent) tests that the thing we want to allow is to permit an ImmediateSupplier to block, which was permitted in my code if it was a simple key but not if it was deeper eg in a map. I think we agree that a "dumb" non-immediate Supplier should fail if it tries to block, but we should be consistent wrt blocking behaviour in getImmediate, and I agree we should allow it (if someone writes a bad getImmediate that's on their head, GIGO).

I think we can do this by clearing an interrupt flag in BasicExecutionContext.getImmediately(...) if it is executing an ImmediateSupplier. Does that sound reasonable?

I think we can't avoid the fact that immediate behaviour is complex. However we can say something like:

Simplistically, immediate / non-blocking evaluation uses heuristics under the covers to make best-effort without blocking to find a value. This applies among other places to the validation we run during initialization: validation will not be performed where the value isn't yet available. To permit some types of very-brief blocking operations (I/O) or any action sensitive to thread interrupts (latch checking), the action should expose an ImmediateSupplier interface. If a caller needs fine-grained control over resolution timings, use ValueResolver.

(Should we add this to the docs?)

@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 7, 2017

All tests pass for me locally with this. And re-reading I think @aledsage we're not in disagreement; I simply overlooked the subtlety of getImmediate having explicit permission to block. Very good catch.

@ahgittin ahgittin force-pushed the config-immediate-more branch from c7de1a8 to 617256d Compare March 7, 2017 16:38
@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 7, 2017

Failure at https://builds.apache.org/job/brooklyn-server-pull-requests/1820/ seems due to build environment --

[INFO] Installing /home/jenkins/jenkins-slave/workspace/brooklyn-server-pull-requests/camp/camp-brooklyn/target/brooklyn-camp-0.11.0-SNAPSHOT-test-sources.jar to /home/jenkins/jenkins-slave/maven-repositories/0/org/apache/brooklyn/brooklyn-camp/0.11.0-SNAPSHOT/brooklyn-camp-0.11.0-SNAPSHOT-test-sources.jar
Build timed out (after 60 minutes). Marking the build as aborted.
[WARNING] Failed to notify spy hudson.maven.Maven3Builder$JenkinsEventSpy: java.lang.InterruptedException
[WARNING] Failed to notify spy hudson.maven.Maven3Builder$JenkinsEventSpy: java.util.concurrent.ExecutionException: Invalid object ID 36 iota=46

Doing git commit --amend to force rebuild.

@ahgittin
Copy link
Contributor Author

ahgittin commented Mar 8, 2017

This time error is

testDisablingOnFire(org.apache.brooklyn.policy.ha.ServiceFailureDetectorTest)  Time elapsed: 0.016 sec  <<< FAILURE!
java.lang.AssertionError: expected [running] but found [on-fire]
	at org.apache.brooklyn.policy.ha.ServiceFailureDetectorTest.testDisablingOnFire(ServiceFailureDetectorTest.java:223)

... unrelated non-deterministic ?

@ahgittin ahgittin force-pushed the config-immediate-more branch from 617256d to d214ca9 Compare March 8, 2017 09:30
Copy link
Member

@neykov neykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Looks like @ahgittin addressed the "blocking immediate" concern. Can you kick another build - the one is stuck in limbo.

Is it worth flipping immediate on the ValueResolver by default if it detects it's in a interrupted thread? This is to support immediate in deep/recursive resolving where not everything is ImmediateSupplier.

: Maybe.<T>absent();
.getMaybe();
if (resolved.isAbsent()) return Maybe.Absent.<T>castAbsent(resolved);
return TypeCoercions.tryCoerce(resolved.get(), key.getTypeToken());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that there's no marker object you can rely on ValueResolver's coercion directly (by passing in the expected type).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could, but not 100% sure. There may be some corner cases and some lossiness between as(Class<T>) and tryCoerce(..., TypeToken<T>). Have added a comment.

@ahgittin
Copy link
Contributor Author

@neykov I've added a comment in ValueResolver re your immediate flip

                // TODO svet suggested at https://github.com/apache/brooklyn-server/pull/565#pullrequestreview-27124074
                // that we might flip the immediately bit if interrupted -- or maybe instead (alex's idea)
                // enter this block
                // if (allowImmediateExecution && (isEvaluatingImmediately() || Thread.isInterrupted())
                // -- feels right, and would help with some recursive immediate values but no compelling 
                // use case yet and needs some deep thought which we're deferring for now

does that cover your thinking?

if not -- and/or if you're sure of the Object -> key.getType() change (your other comment) -- just open a new PR. going to merge this as i can't figure out how to unstuck the build. will babysit the official build.

@asfgit asfgit merged commit f4dc19d into apache:master Mar 20, 2017
asfgit pushed a commit that referenced this pull request Mar 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants