forked from apache/pulsar
[fix][flaky-test]Fix PersistentSubscriptionMessageDispatchStreamingDispatcherThrottlingTest.testMultiLevelDispatch #2
Open
HQebupt wants to merge 1,021 commits into master from fixFlakyTest1
Conversation
* Sync recent changes from apache#17030, apache#17039, apache#16315, and apache#17057
* fix apache#17119
* minor updates
* add link of release notes to navigation
* fix
* update release process as per PIP-190
* minor fix
* minor fix
* Update release-process.md
Signed-off-by: Zixuan Liu <nodeces@gmail.com>
… and added test cases including default time (apache#16130)
### Motivation
When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior. In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

### Analysis
Since we introduced Prometheus metrics in apache#294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) would essentially inherit the most recent value for the last 5 minute window. If there isn't one in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting in 2.0, there is new logic to consider a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time. This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one: https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use case. Most of the users who think they need it should actually not do it.
>
> The main use cases within that tiny niche are federation and mirroring the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp. The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps. As such, I think we are not a niche use case, and we should not add timestamps to our metrics.

### Reproducing the problem
1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
   ```yaml
   - job_name: broker
     honor_timestamps: true
     scrape_interval: 30s
     scrape_timeout: 10s
     metrics_path: /metrics
     scheme: http
     follow_redirects: true
     static_configs:
       - targets: ["localhost:8080"]
   ```
2. Start pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe the `pulsar_in_messages_total` metric for the topic in the Prometheus UI (localhost:9090).
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic's `pulsar_in_messages_total` metric is no longer reported.
8. Watch the metric continue to be reported in Prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.
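To make the timestamp removal concrete, here is a minimal sketch (illustrative only, not Pulsar's actual metrics generator) of the difference in the Prometheus text exposition format, where the trailing timestamp is an optional field in milliseconds:

```java
// Sketch of a Prometheus text-format metric line. The optional trailing
// timestamp pins the sample to an explicit time; omitting it lets
// Prometheus 2.x apply its improved staleness handling at scrape time.
public class ExpositionFormat {
    // With a timestamp: the old Pulsar behavior this PR removes.
    static String withTimestamp(String name, double value, long epochMillis) {
        return name + " " + value + " " + epochMillis;
    }

    // Without a timestamp: the series can be marked stale as soon as it
    // disappears from a scrape, e.g. after a topic is unloaded.
    static String withoutTimestamp(String name, double value) {
        return name + " " + value;
    }

    public static void main(String[] args) {
        System.out.println(withTimestamp("pulsar_in_messages_total{topic=\"t\"}", 42.0, 1660000000000L));
        System.out.println(withoutTimestamp("pulsar_in_messages_total{topic=\"t\"}", 42.0));
    }
}
```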
### Modifications
* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)

### Verifying this change
This change is accompanied by updated tests.

### Does this pull request potentially affect one of the following parts:
This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility. Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry-picked to older release branches. Technically, we can avoid cherry-picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them.

### Documentation
- [x] `doc-not-needed`
…17378) Co-authored-by: huangzegui <huangzegui@didiglobal.com>
…pulsar-perf-producer (apache#17381)
* [improve][cli] Add option to disable batching in pulsar-testclient

Co-authored-by: Vineeth <vineeth.polamreddy@verizonmedia.com>
Co-authored-by: Rajan Dhabalia <rdhabalia@apache.org>
…sOpenCacheSetEnabled=true (apache#17465)
* remove backlink from develop-schema.md
* fix backlink from reference-configuration.md
* partial fix backlink from sql-getting-started
* partial fix backlink from tiered-storage-aliyun
* partial fix backlink from tiered-storage-s3
* fix for Install built-in connectors
* fix backlinks for sql-getting-started
* fix backlinks to getting-started-standalone.md for tiered-storages
* Apply suggestions from code review

Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
Co-authored-by: labuladong <labuladong@foxmail.com>
* fix the note layout
* sync changes to 2.7.5 and more
* Update release-process.md
* Following the action plan defined in https://lists.apache.org/thread/yh72bmlf2w97593c5d5pqf3jz1ldb501
* A lazy consensus was reached on this matter to fix CI following the action plan.
- Follow up on apache#17539 changes due to error message https://lists.apache.org/thread/hj8b9n1n29vg4c6oslz6n1j230f9v284
- Following the action plan defined in https://lists.apache.org/thread/yh72bmlf2w97593c5d5pqf3jz1ldb501
- A lazy consensus was reached on this matter to fix CI following the action plan.
- move the filter from the build step to the build job
- only start the build job if the comment includes "/pulsarbot"
- no need to checkout code
- no need to run the tune VM action
Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com>
…ache#17252)
- fixes issue with stats where timestamps might be inconsistent because of visibility issues
- fields should be volatile to ensure visibility of updated values in a consistent manner
- in replication, the lastDataMessagePublishedTimestamp field in PersistentTopic might be inconsistent unless volatile is used
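The visibility concern described in that commit can be sketched in a few lines. This is not the actual PersistentTopic code; the class and method names here are illustrative:

```java
// Minimal sketch of cross-thread visibility for a stats timestamp field.
// Without volatile, a reader thread may observe a stale value written by
// another thread; volatile establishes a happens-before edge between the
// write and any subsequent read.
public class StatsHolder {
    private volatile long lastDataMessagePublishedTimestamp;

    // Called from the publish path (writer thread).
    public void recordPublish(long timestampMillis) {
        lastDataMessagePublishedTimestamp = timestampMillis;
    }

    // Called from the stats/replication path (reader thread).
    public long getLastPublishedTimestamp() {
        return lastDataMessagePublishedTimestamp;
    }
}
```

Note that volatile gives visibility, not atomicity of compound updates; it is sufficient here because each field is written and read independently.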
…he#16891)
* add reader config doc
* update to the versioned doc
* Update site2/docs/io-debezium-source.md
* revert changes to 2.10.1 and 2.9.3

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
Signed-off-by: Zixuan Liu <nodeces@gmail.com>

### Motivation
Improve the compactor tool by using a separate TLS config.

### Modifications
- Add a separate TLS config for the compactor; both Keystore and PEM formats are supported.
- Fix the use of the service URL according to the `brokerConfig.isBrokerClientTlsEnabled()` value.

### Verifying this change
A test has been added.

### Documentation
Check the box below or label this PR directly. Need to update docs?
- [ ] `doc-required` (Your PR needs to update docs and you will update later)
- [x] `doc-not-needed` (Please explain why)
- [ ] `doc` (Your PR contains doc changes)
- [ ] `doc-complete` (Docs have been already added)
…ts (apache#17686)
* [feat][broker] Add config to count filtered entries towards rate limits
* Make fixes for checkstyle
* Remove * import
* Fix incorrect conflict resolution in merge commit

### Motivation
Currently, when using entry filters, filtered out messages do not count against the rate limit. Therefore, a subscription that is completely filtered will never be throttled due to rate limiting. When the messages are delivered to the consumer for a filtered subscription, those messages will count against the rate limit, and in that case, the message filtering can be throttled because the check to delay `readMoreEntries()` happens before message filtering. Therefore, the rate limit will essentially be increased as a function of the percent of messages let through the filter (some quick math is that the new rate is likely `dispatchRate * (1 / percentDelivered)`, where percent delivered is a percent as a decimal). It's possible that some use cases prefer this behavior, but in my case, I think it'd be valuable to include these filtered messages in the dispatch throttling because these messages still cost the broker network, memory, and CPU. This PR adds a configuration to count filtered out messages towards dispatch rate limits for the broker, the topic, and the subscription.

### Modifications
* Add a configuration named `dispatchThrottlingForFilteredEntriesEnabled`. Default it to false so we maintain the original behavior. When true, count filtered messages against rate limits.
* Refactor the code to `acquirePermitsForDeliveredMessages` so that it is in the `AbstractBaseDispatcher`, which makes it available to the entry filtering logic.

### Verifying this change
A new test is added as part of this PR.

### Does this pull request potentially affect one of the following parts:
This PR introduces a new config while maintaining the current behavior.

### Documentation
- [x] `doc-not-needed` Config docs are auto-generated.
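A quick back-of-the-envelope check of the effective-rate formula quoted in that motivation, `dispatchRate * (1 / percentDelivered)`; the class and method names here are illustrative, not Pulsar APIs:

```java
// Sketch: when only delivered (unfiltered) messages count against the
// dispatch rate limit, the broker effectively reads entries at a higher
// rate, inversely proportional to the fraction of messages that pass
// the entry filter.
public class EffectiveRate {
    static double effectiveRate(double dispatchRate, double percentDelivered) {
        return dispatchRate * (1.0 / percentDelivered);
    }

    public static void main(String[] args) {
        // e.g. a 100 msg/s limit with only 25% of messages passing the
        // filter effectively reads 400 msg/s from the ledger.
        System.out.println(effectiveRate(100.0, 0.25)); // prints 400.0
    }
}
```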
…ache#17837)
Fixes: apache#14109

### Motivation
The expected execution flow for this test is:
1. send 505 messages
2. commit 10 transactions, where every transaction acks 50 messages
3. receive the last 5 messages in the last transaction, then wait for the transaction timeout
4. confirm that the last 5 messages can be consumed by a new consumer (highlight)

The default value for the transaction TTL is 10 seconds, and the default timeout for `Awaitility.await` is also 10 seconds, so this test is not stable. Note: this is a guessed cause; the problem has not been reproduced locally. But after the transaction TTL is set to 11s, the probability of the problem occurring is 100%.

### Modifications
Fix the flaky test: set the transaction TTL to 5s.

Other changes:
- define a name for the task thread
- acknowledge the last 5 messages

### Matching PR in forked repository
PR in forked repository:
- poorbarcode#13
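The timing hazard that commit describes can be sketched numerically. The values and names below are illustrative (this is not Awaitility's API), but they show why a wait window equal to the TTL is flaky while a TTL well below the window is deterministic:

```java
// Sketch of the race: the test waits up to `awaitMillis` for the
// transaction to time out, which takes `txnTtlMillis`. If the two are
// equal, scheduling jitter can push the abort past the wait window.
public class TimeoutMargin {
    static boolean isStable(long txnTtlMillis, long awaitMillis) {
        // The wait window must strictly exceed the TTL to leave margin.
        return awaitMillis > txnTtlMillis;
    }

    public static void main(String[] args) {
        System.out.println(isStable(10_000, 10_000)); // old setup: prints false
        System.out.println(isStable(5_000, 10_000));  // after the fix: prints true
    }
}
```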
…trics (apache#17701)
Master Issue: apache#15370

### Modifications
- Make the transaction `MLTransactionMetadataStoreProvider` & `MLPendingAckStoreProvider` support buffered writer metrics. (Motivation: apache#15370)
- Delete the constructor of `TxnLogBufferedWriter` that has no `metrics` parameter. (Motivation: it is unnecessary.)
- Add a default `DisabledTxnLogBufferedWriterMetricsStats` implementation.
- Remove the leftover check `if (metrics != null)` from the previous PR. (Motivation: apache#16758 (comment))
- Make the transaction log buffered writer only be created by `MLTransactionMetadataStoreProvider` & `MLPendingAckStoreProvider`. (Motivation: apache#16758 (comment))

### Documentation
- [x] `doc-not-needed`

### Matching PR in forked repository
PR in forked repository:
- poorbarcode#3
…tion (apache#15914)
* Truncate topic before deletion to avoid orphaned offloaded ledgers
* CR feedback
* [fix][proxy] Fix refresh client auth
* Fix style

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
Fixes: apache#15773 apache#16863 apache#16860 ### Motivation ``` Error: Tests run: 11, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 87.06 s <<< FAILURE! - in org.apache.pulsar.packages.management.storage.bookkeeper.BookKeeperPackagesStorageTest Error: setUp(org.apache.pulsar.packages.management.storage.bookkeeper.BookKeeperPackagesStorageTest) Time elapsed: 13.089 s <<< FAILURE! org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.waitForConnection(ZooKeeperWatcherBase.java:159) at org.apache.bookkeeper.zookeeper.ZooKeeperClient$Builder.build(ZooKeeperClient.java:260) at org.apache.bookkeeper.test.ZooKeeperUtil.restartCluster(ZooKeeperUtil.java:133) at org.apache.bookkeeper.test.ZooKeeperUtil.startCluster(ZooKeeperUtil.java:104) at org.apache.pulsar.packages.management.storage.bookkeeper.bookkeeper.test.BookKeeperClusterTestCase.startZKCluster(BookKeeperClusterTestCase.java:238) at org.apache.pulsar.packages.management.storage.bookkeeper.bookkeeper.test.BookKeeperClusterTestCase.setUp(BookKeeperClusterTestCase.java:178) at org.apache.pulsar.packages.management.storage.bookkeeper.bookkeeper.test.BookKeeperClusterTestCase.setUp(BookKeeperClusterTestCase.java:166) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132) at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61) at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366) at 
org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320) at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701) at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:527) at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174) at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46) at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822) at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128) at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at org.testng.TestRunner.privateRun(TestRunner.java:764) at org.testng.TestRunner.run(TestRunner.java:585) at org.testng.SuiteRunner.runTest(SuiteRunner.java:384) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337) at org.testng.SuiteRunner.run(SuiteRunner.java:286) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96) at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218) at org.testng.TestNG.runSuitesLocally(TestNG.java:1140) at org.testng.TestNG.runSuites(TestNG.java:1069) at org.testng.TestNG.run(TestNG.java:1037) at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135) at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112) at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123) at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90) at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146) at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
```
The root cause is that the ZooKeeper client randomly selects between IPv4 and IPv6 when resolving localhost; it can connect when using IPv4 but fails when using IPv6, so if it keeps randomly picking IPv6, the connection times out. https://github.com/apache/zookeeper/blob/bc1b231c9e32667b2978c86a6a64833470973dbd/zookeeper-server/src/main/java/org/apache/zookeeper/client/StaticHostProvider.java#L140-L146 Thanks to @poorbarcode for helping me locate the problem.

### Modifications
- add `@AfterMethod(alwaysRun = true)`
- use an explicit address instead of the host name

### Documentation
- [x] `doc-not-needed`

### Matching PR in the forked repository
PR in forked repository:
- congbobo184#1
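The root cause above can be observed with a small self-contained sketch: resolving `localhost` may yield both an IPv4 address (127.0.0.1) and an IPv6 address (::1), and a client that picks one at random can repeatedly hit the non-listening family. The class name is illustrative; only standard `java.net` calls are used:

```java
import java.net.InetAddress;

// Sketch of the localhost ambiguity: getAllByName can return multiple
// addresses across address families, while an explicit address (the fix
// in this commit) is unambiguous.
public class LocalhostResolution {
    public static void main(String[] args) throws Exception {
        // May print both 127.0.0.1 and 0:0:0:0:0:0:0:1 depending on the OS.
        for (InetAddress addr : InetAddress.getAllByName("localhost")) {
            System.out.println(addr.getHostAddress());
        }
        // Connecting via an explicit loopback address avoids the random pick.
        InetAddress loopback = InetAddress.getLoopbackAddress();
        System.out.println(loopback.getHostAddress());
    }
}
```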
…pic Deletion (apache#15914)" (apache#17889) This reverts commit 9026d19.
* PIP-209: Removed C++/Python clients from main repo
* Removed python directory from Dockerfile
* Fixed python client version argument scoping
* Fixed handling of pulsar.functions.serde
…patcherThrottlingTest.testMultiLevelDispatch
977a410 to 602aa1a
/pulsarbot run-failure-checks
The PR had no activity for 30 days; marked with the Stale label.
Motivation
Fixes apache#17225
Modifications
It is better to remove the check and keep only the time-limit check (8000 ms at most) at lines 476-478:
https://github.com/apache/pulsar/blob/0517423b0a8d9c981cc5550abfec9e60b55bf3e7/pulsar-broker/src/test/java/org/apache/pulsar/client/api/SubscriptionMessageDispatchThrottlingTest.java#L467-L479
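As a hypothetical illustration of the kind of time-limit check that remains (the real assertion lives in the linked `SubscriptionMessageDispatchThrottlingTest` file; the names below are invented for the sketch):

```java
// Sketch of an elapsed-time bound like the one kept in the test: the
// throttled dispatch must finish within the limit (8000 ms in the
// linked lines) for the test to pass.
public class ThrottleTiming {
    static boolean withinLimit(long startMillis, long endMillis, long limitMillis) {
        return (endMillis - startMillis) <= limitMillis;
    }

    public static void main(String[] args) {
        System.out.println(withinLimit(0, 7_500, 8_000)); // prints true
        System.out.println(withinLimit(0, 9_000, 8_000)); // prints false
    }
}
```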
Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes.
Documentation
Check the box below and label this PR (if you have committer privilege).
Need to update docs?
doc-not-needed