Conversation

@HQebupt (Owner) commented Sep 30, 2022

Motivation

Fixed apache#17225

Modifications

  • Produce messages asynchronously; the test used to send messages synchronously in lines 468-471 (a sketch of the change follows the linked snippet below).
  • The 2500ms lower bound on the elapsed time is unreasonable, because the time it takes to consume the messages depends on the environment, and this check causes spurious failures.
    It is better to remove that check and keep only the 8000ms upper bound in lines 476-478.

https://github.com/apache/pulsar/blob/0517423b0a8d9c981cc5550abfec9e60b55bf3e7/pulsar-broker/src/test/java/org/apache/pulsar/client/api/SubscriptionMessageDispatchThrottlingTest.java#L467-L479
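For illustration, a minimal sketch of the synchronous-to-asynchronous change, assuming a simple byte-array producer (message size, count, and helper names are hypothetical, not the exact test code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;

class AsyncProduceSketch {
    // Before: each send blocked until the broker acknowledged it,
    // so producing N messages took N round trips.
    static void produceSync(Producer<byte[]> producer, int numMessages) throws Exception {
        for (int i = 0; i < numMessages; i++) {
            producer.send(new byte[80]); // blocks per message
        }
    }

    // After: sends are issued asynchronously and awaited once at the end,
    // so the produce phase no longer dominates the measured dispatch time.
    static void produceAsync(Producer<byte[]> producer, int numMessages) {
        List<CompletableFuture<MessageId>> futures = new ArrayList<>();
        for (int i = 0; i < numMessages; i++) {
            futures.add(producer.sendAsync(new byte[80]));
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }
}
```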

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below and label this PR (if you have committer privilege).

Need to update docs?

  • doc-not-needed

nodece and others added 30 commits September 7, 2022 08:42
* Sync recent changes from apache#17030, apache#17039, apache#16315, and apache#17057

* fix apache#17119

* minor updates

* add link of release notes to navigation

* fix

* update release process as per PIP-190

* minor fix

* minor fix

* Update release-process.md
Signed-off-by: Zixuan Liu <nodeces@gmail.com>

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
### Motivation

When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.

In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

### Analysis

Since we introduced Prometheus metrics in apache#294, we have exported a timestamp along with most metrics. This is an optional, valid part of the exposition format defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that appeared in one scrape but not the next (which often happens for topics that are unloaded) would essentially inherit its most recent value for up to a 5 minute window; if there was no value in the past 5 minutes, the metric became "stale" and was not reported. Starting in 2.0, new logic considers a value stale the very first time it is missing from a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time.
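To make the difference concrete, here is a hedged sketch of the two exposition styles; the metric name matches Pulsar's, but the labels and values are illustrative, and this is not Pulsar's actual writer code:

```java
// Sketch only: illustrates the Prometheus exposition-format difference.
class TimestampExpositionSketch {
    public static void main(String[] args) {
        String labels = "{cluster=\"standalone\",topic=\"persistent://public/default/t1\"}";
        long value = 42;
        long scrapeTimeMillis = System.currentTimeMillis();

        // With a trailing timestamp, Prometheus never writes staleness markers
        // for the series, so the last value lingers for up to 5 minutes.
        System.out.println("pulsar_in_messages_total" + labels + " " + value + " " + scrapeTimeMillis);

        // Without a timestamp, Prometheus marks the series stale as soon as it
        // disappears from a scrape, which is the behavior this change wants.
        System.out.println("pulsar_in_messages_total" + labels + " " + value);
    }
}
```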

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use case. Most of the users who think they need it should actually not do it.
>
> The main use cases within that tiny niche are federation and mirroring the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp.

The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps.

As such, I think we are not a niche use case, and we should not add timestamps to our metrics.

### Reproducing the problem

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
```yaml
  - job_name: broker
    honor_timestamps: true
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets: ["localhost:8080"]
```
2. Start Pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe the `pulsar_in_messages_total` metric for the topic in the Prometheus UI (localhost:9090).
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported.
8. Watch the metric continue to be reported in Prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.

### Modifications

* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)

### Verifying this change

This change is accompanied by updated tests.

### Does this pull request potentially affect one of the following parts:

This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.

Given how frequently the set of metrics Pulsar exposes changes between scrapes, I think this is an important fix that should be cherry-picked to older release branches. Technically, we could avoid cherry-picking this change by advising users to set `honor_timestamps: false`, but I think it is better to just remove the timestamps.

### Documentation
- [x] `doc-not-needed`
…17378)

Co-authored-by: huangzegui <huangzegui@didiglobal.com>
…pulsar-perf-producer (apache#17381)

* [improve][cli] Add option to disable batching in pulsar-testclient

Co-authored-by: Vineeth <vineeth.polamreddy@verizonmedia.com>
Co-authored-by: Rajan Dhabalia <rdhabalia@apache.org>
* remove backlink from develop-schema.md

Signed-off-by: tison <wander4096@gmail.com>

* fix backlink from reference-configuration.md

Signed-off-by: tison <wander4096@gmail.com>

* partial fix backlink from sql-getting-started

Signed-off-by: tison <wander4096@gmail.com>

* partial fix backlink from tiered-storage-aliyun

Signed-off-by: tison <wander4096@gmail.com>

* partial fix backlink from tiered-storage-s3

Signed-off-by: tison <wander4096@gmail.com>

* fix for Install built-in connectors

Signed-off-by: tison <wander4096@gmail.com>

* fix backlinks for sql-getting-started

Signed-off-by: tison <wander4096@gmail.com>

* fix backlinks to getting-started-standalone.md for tiered-storages

Signed-off-by: tison <wander4096@gmail.com>

* Apply suggestions from code review

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: labuladong <labuladong@foxmail.com>

Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
Co-authored-by: labuladong <labuladong@foxmail.com>
* fix the note layout

* sync changes to 2.7.5 and more

* Update release-process.md
* Following the action plan defined in https://lists.apache.org/thread/yh72bmlf2w97593c5d5pqf3jz1ldb501 .
* A lazy consensus was reached on this matter to fix CI following the action plan.
- Follow up on apache#17539 changes due to error message https://lists.apache.org/thread/hj8b9n1n29vg4c6oslz6n1j230f9v284
)

- there's no need for this anymore since GitHub Actions has built-in support
  for cancelling duplicate workflows and this is used in Pulsar CI workflows.
- move the filter from the build step to the build job
   - only start the build job if the comment includes "/pulsarbot"
- no need to checkout code
- no need to run the tune VM action
Co-authored-by: Lari Hotari <lhotari@users.noreply.github.com>
mattisonchao and others added 22 commits September 28, 2022 21:55
…ache#17252)

- fixes an issue with stats where timestamps might be inconsistent because of visibility issues
  - the fields should be volatile to ensure that updated values are visible to other threads in a consistent manner

- in replication, the lastDataMessagePublishedTimestamp field in PersistentTopic might be inconsistent unless volatile is used
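As a minimal illustration of the visibility issue, assuming a simplified stand-in for the real class (names here are illustrative):

```java
// Sketch only: why the timestamp fields need to be volatile.
class TopicStatsSketch {
    // Without volatile, a reader thread (e.g., the stats generator) may observe
    // a stale value written by the publish thread, or see updates out of order.
    private volatile long lastDataMessagePublishedTimestamp;

    void onMessagePublished() {
        // written from the publish path
        lastDataMessagePublishedTimestamp = System.currentTimeMillis();
    }

    long getLastDataMessagePublishedTimestamp() {
        // read from the stats path; volatile guarantees the latest write is visible
        return lastDataMessagePublishedTimestamp;
    }
}
```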
…he#16891)

* add reader config doc

* update to the versioned doc

* Update site2/docs/io-debezium-source.md

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>

* Update site2/docs/io-debezium-source.md

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>

* revert changes to 2.10.1 and 2.9.3

Co-authored-by: momo-jun <60642177+momo-jun@users.noreply.github.com>
Signed-off-by: Zixuan Liu <nodeces@gmail.com>

Signed-off-by: Zixuan Liu <nodeces@gmail.com>

### Motivation

Improve the compactor tool by using a separate TLS configuration.

### Modifications

- Add a separate TLS configuration for the compactor; both KeyStore and PEM formats are supported (see the sketch below)
- Select the correct service URL based on the `brokerConfig.isBrokerClientTlsEnabled()` value
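A hedged sketch of what a separately configured TLS client could look like, using the public `ClientBuilder` API; URLs, paths, and passwords are placeholders, and this is not the compactor's actual wiring:

```java
import org.apache.pulsar.client.api.PulsarClient;

class CompactorTlsClientSketch {
    // PEM-format trust certs; the service URL should follow the broker's TLS
    // setting (e.g., brokerConfig.isBrokerClientTlsEnabled()).
    static PulsarClient buildPemClient() throws Exception {
        return PulsarClient.builder()
                .serviceUrl("pulsar+ssl://broker.example.com:6651")
                .tlsTrustCertsFilePath("/path/to/ca.cert.pem")
                .enableTlsHostnameVerification(true)
                .build();
    }

    // KeyStore format, the other supported option.
    static PulsarClient buildKeyStoreClient() throws Exception {
        return PulsarClient.builder()
                .serviceUrl("pulsar+ssl://broker.example.com:6651")
                .useKeyStoreTls(true)
                .tlsTrustStorePath("/path/to/truststore.jks")
                .tlsTrustStorePassword("changeit")
                .build();
    }
}
```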

### Verifying this change

A test has been added.
 
### Documentation

Check the box below or label this PR directly.

Need to update docs? 

- [ ] `doc-required` 
(Your PR needs to update docs and you will update later)
  
- [x] `doc-not-needed` 
(Please explain why)
  
- [ ] `doc` 
(Your PR contains doc changes)

- [ ] `doc-complete`
(Docs have been already added)
…ts (apache#17686)

* [feat][broker] Add config to count filtered entries towards rate limits

* Make fixes for checkstyle

* Remove * import

* Fix incorrect conflict resolution in merge commit

### Motivation

Currently, when entry filters are in use, filtered-out messages do not count against the dispatch rate limit, so a subscription whose messages are all filtered out will never be throttled. Messages that are delivered to the consumer do count against the limit, and in that case filtering can still be throttled, because the check that delays `readMoreEntries()` happens before message filtering. As a result, the effective rate limit grows with the fraction of messages the filter drops: roughly, the new rate is `dispatchRate * (1 / percentDelivered)`, where `percentDelivered` is the fraction of read messages that are delivered, expressed as a decimal.

It's possible that some use cases prefer this behavior, but in my case I think it is valuable to include filtered messages in dispatch throttling, because those messages still cost the broker network, memory, and CPU. This PR adds a configuration to count filtered-out messages towards dispatch rate limits at the broker, topic, and subscription levels.
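For a quick worked example of that formula (all numbers hypothetical):

```java
class EffectiveRateSketch {
    public static void main(String[] args) {
        double dispatchRate = 1000;     // configured limit, msg/s
        double percentDelivered = 0.20; // fraction of read messages that pass the filter
        // Only delivered messages consume permits, so the broker keeps reading
        // from storage until deliveries alone hit the configured limit:
        double effectiveReadRate = dispatchRate * (1 / percentDelivered);
        System.out.println(effectiveReadRate + " msg/s read from storage"); // 5000.0
    }
}
```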

### Modifications

* Add a configuration named `dispatchThrottlingForFilteredEntriesEnabled`. It defaults to false to preserve the original behavior; when true, filtered messages count against rate limits.
* Refactor `acquirePermitsForDeliveredMessages` into `AbstractBaseDispatcher`, which makes it available to the entry-filtering logic (see the sketch after this list).
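A minimal sketch of the accounting this flag enables; apart from the config name, all names and signatures here are assumptions, not the actual dispatcher code:

```java
// Sketch only: how filtered entries could count against the rate limiter
// when the new flag is enabled.
class FilteredEntryAccountingSketch {
    boolean dispatchThrottlingForFilteredEntriesEnabled;

    void onEntriesRead(int deliveredMessages, int filteredMessages,
                       RateLimiterSketch dispatchRateLimiter) {
        int permits = deliveredMessages;
        if (dispatchThrottlingForFilteredEntriesEnabled) {
            // Filtered-out messages still cost network, memory, and CPU,
            // so they consume permits too.
            permits += filteredMessages;
        }
        dispatchRateLimiter.tryAcquire(permits);
    }

    interface RateLimiterSketch {
        boolean tryAcquire(int permits);
    }
}
```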

### Verifying this change

A new test is added as part of this PR.

### Does this pull request potentially affect one of the following parts:

This PR introduces a new config while maintaining the current behavior.

### Documentation

- [x] `doc-not-needed` 
Config docs are auto-generated.
…ache#17837)

Fixes: apache#14109


### Motivation

The expected execution flow for this test is: 

1. send 505 messages
2. commit 10 transactions, where each transaction acks 50 messages
3. receive the last 5 messages in the last transaction and wait for the transaction to time out
4. confirm that the last 5 messages can be consumed by a new consumer

**(Highlight)** The default transaction TTL is 10 seconds and the default `Awaitility.await()` timeout is also 10 seconds, so this test is not stable.

Note: this is an educated guess, since the problem could not be reproduced locally. However, after setting the transaction TTL to 11s, the problem occurs 100% of the time.
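A hedged sketch of the timing relationship, using the real Awaitility API but simplified placeholder test logic:

```java
import java.util.concurrent.TimeUnit;

import org.awaitility.Awaitility;

class TransactionTimeoutSketch {
    // Awaitility.await() defaults to a 10-second timeout, the same as the
    // default transaction TTL, so the await could give up just as the
    // transaction times out. Lowering the TTL to 5s (as this PR does)
    // leaves clear headroom under an explicit 10s bound.
    void waitForTransactionTimeout() {
        Awaitility.await()
                .atMost(10, TimeUnit.SECONDS)
                .until(this::transactionHasTimedOut);
    }

    boolean transactionHasTimedOut() {
        return true; // placeholder for the real assertion
    }
}
```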

### Modifications

Fix the flaky test:
- set the transaction TTL to 5s

Other changes:
- give the task thread a descriptive name
- acknowledge the last 5 messages

### Matching PR in forked repository

PR in forked repository: 

- poorbarcode#13
…trics (apache#17701)

Master Issue: apache#15370

### Modifications

- Make transaction `MLTransactionMetadataStoreProvider` & `MLPendingAckStoreProvider` support buffered writer metrics.
  - Motivation: apache#15370

----

- Delete the constructor of `TxnLogBufferedWriter` that lacks the `metrics` parameter.
  - Motivation: it is unnecessary.

---- 

- Add a default `DisabledTxnLogBufferedWriterMetricsStats` implementation.

----

- Clean up code left over from the previous PR: remove the `if (metrics != null)` checks (a sketch of the pattern follows this list).
  - Motivation: apache#16758 (comment)

----

- Ensure the transaction-log buffered writer is created only by `MLTransactionMetadataStoreProvider` & `MLPendingAckStoreProvider`.
  - Motivation: apache#16758 (comment)
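As referenced above, a minimal sketch of the null-object pattern that removes the `if (metrics != null)` checks; all names here are illustrative rather than the actual Pulsar classes:

```java
// Sketch only: a disabled, no-op metrics implementation lets callers invoke
// metrics methods unconditionally instead of null-checking everywhere.
interface BufferedWriterMetricsSketch {
    void recordBatchFlush(int records, long bytes);
}

class DisabledBufferedWriterMetricsSketch implements BufferedWriterMetricsSketch {
    static final BufferedWriterMetricsSketch INSTANCE = new DisabledBufferedWriterMetricsSketch();

    @Override
    public void recordBatchFlush(int records, long bytes) {
        // intentionally a no-op
    }
}

class BufferedWriterSketch {
    private final BufferedWriterMetricsSketch metrics;

    BufferedWriterSketch(BufferedWriterMetricsSketch metrics) {
        // Callers that do not want metrics pass the disabled instance,
        // so `metrics` is never null and needs no guard.
        this.metrics = metrics == null ? DisabledBufferedWriterMetricsSketch.INSTANCE : metrics;
    }

    void flush(int records, long bytes) {
        metrics.recordBatchFlush(records, bytes); // no `if (metrics != null)` check
    }
}
```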

### Documentation

- [ ] `doc-required` 

- [x] `doc-not-needed` 

- [ ] `doc` 

- [ ] `doc-complete`


### Matching PR in forked repository

PR in forked repository: 

- poorbarcode#3
…tion (apache#15914)

* Truncate topic before deletion to avoid orphaned offloaded ledgers

* CR feedback
* [fix][proxy] Fix refresh client auth

Signed-off-by: Zixuan Liu <nodeces@gmail.com>

* Fix style

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
Fixes: apache#15773  apache#16863 apache#16860

### Motivation
```
  Error:  Tests run: 11, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 87.06 s <<< FAILURE! - in org.apache.pulsar.packages.management.storage.bookkeeper.BookKeeperPackagesStorageTest
  Error:  setUp(org.apache.pulsar.packages.management.storage.bookkeeper.BookKeeperPackagesStorageTest)  Time elapsed: 13.089 s  <<< FAILURE!
  org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
  	at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
  	at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.waitForConnection(ZooKeeperWatcherBase.java:159)
  	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$Builder.build(ZooKeeperClient.java:260)
  	at org.apache.bookkeeper.test.ZooKeeperUtil.restartCluster(ZooKeeperUtil.java:133)
  	at org.apache.bookkeeper.test.ZooKeeperUtil.startCluster(ZooKeeperUtil.java:104)
  	at org.apache.pulsar.packages.management.storage.bookkeeper.bookkeeper.test.BookKeeperClusterTestCase.startZKCluster(BookKeeperClusterTestCase.java:238)
  	at org.apache.pulsar.packages.management.storage.bookkeeper.bookkeeper.test.BookKeeperClusterTestCase.setUp(BookKeeperClusterTestCase.java:178)
  	at org.apache.pulsar.packages.management.storage.bookkeeper.bookkeeper.test.BookKeeperClusterTestCase.setUp(BookKeeperClusterTestCase.java:166)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
  	at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61)
  	at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366)
  	at org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320)
  	at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701)
  	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:527)
  	at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
  	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
  	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
  	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
  	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
  	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
  	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
  	at org.testng.TestRunner.privateRun(TestRunner.java:764)
  	at org.testng.TestRunner.run(TestRunner.java:585)
  	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
  	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
  	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
  	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
  	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
  	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
  	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
  	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
  	at org.testng.TestNG.runSuites(TestNG.java:1069)
  	at org.testng.TestNG.run(TestNG.java:1037)
  	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
  	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
  	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
  	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
  	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
  	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
  	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
  	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
  	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
```

The root cause is that the ZooKeeper client randomly picks between the IPv4 and IPv6 addresses when resolving `localhost`: the connection succeeds over IPv4 but fails over IPv6, so a run that keeps randomly landing on the IPv6 address times out.

https://github.com/apache/zookeeper/blob/bc1b231c9e32667b2978c86a6a64833470973dbd/zookeeper-server/src/main/java/org/apache/zookeeper/client/StaticHostProvider.java#L140-L146
Thanks to @poorbarcode for helping me locate the problem.

### Modifications
- add `@AfterMethod(alwaysRun = true)`
- use the address instead of the hostname (a sketch of the resolution behavior follows)
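A small sketch of the dual resolution of `localhost` using the standard JDK API (the output depends on the machine's network configuration):

```java
import java.net.InetAddress;

class LocalhostResolutionSketch {
    public static void main(String[] args) throws Exception {
        // On a dual-stack machine, "localhost" typically resolves to both
        // 127.0.0.1 (IPv4) and ::1 (IPv6). The ZooKeeper client picks among
        // these addresses, so a broken IPv6 loopback leads to random timeouts.
        for (InetAddress addr : InetAddress.getAllByName("localhost")) {
            System.out.println(addr.getHostAddress());
        }
        // Connecting with the literal address "127.0.0.1" avoids the lottery.
    }
}
```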

### Documentation

- [x] `doc-not-needed` 

### Matching PR in the forked repository

PR in forked repository: 

- congbobo184#1
* PIP-209: Removed C++/Python clients from main repo

* Removed python directory from Dockerfile

* Fixed python client version argument scoping

* Fixed handling of pulsar.functions.serde
…patcherThrottlingTest.testMultiLevelDispatch
@HQebupt (Owner, Author) commented Oct 1, 2022

/pulsarbot run-failure-checks

@github-actions
The PR has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label Nov 19, 2022

Development

Successfully merging this pull request may close these issues.

Flaky-test: PersistentSubscriptionMessageDispatchStreamingDispatcherThrottlingTest.testMultiLevelDispatch