Skip to content

TransientIOErrorsRetryingIterator trigger one more roundtrip when using Cosmos Spark connector to query CosmosDB data #47777

@BrunoLiu-MS

Description

@BrunoLiu-MS

When connecting to CosmosDB to query data through Spark connector, e.g. the total item count = 20k, setting maxItemCount = 5k in the client code, the round trip is supposed to be 4, however, it often somehow triggers 5 round trips, which brings extra RU consumption.

We found that by commenting out 'cancelOn' and recompiled the jar file for testing, the issue never reoccurred, roundtrip count is always 4, which is correct count. Is this a BUG? Or any particular purpose of the below code?
override def close(): Unit = {
lastPagedFlux.getAndSet(None) match {
case Some(oldPagedFlux) => oldPagedFlux.cancelOn(Schedulers.boundedElastic()).onErrorComplete().subscribe().dispose()
case None =>
}
}

case Some(oldPagedFlux) => oldPagedFlux.cancelOn(Schedulers.boundedElastic()).onErrorComplete().subscribe().dispose()

Below is the code I use in data bricks:

from pyspark.sql.types import StructType, StructField, StringType

cosmos_schema = StructType([
StructField('createddatetime', StringType(), True),
StructField('closeddatetime', StringType(), True),
StructField('id', StringType(), False),
StructField('owner', StringType(), True),
StructField('ownerregion', StringType(), True),
StructField('ownermanager', StringType(), True),
StructField('casenumber', StringType(), True)
])

cosmos_config = {
"spark.cosmos.accountEndpoint": COSMOS_ENDPOINT,
"spark.cosmos.accountkey": CosmosMasterKey,
"spark.cosmos.database": COSMOS_DATABASE,
"spark.cosmos.container": COSMOS_CONTAINER,
"spark.cosmos.diagnostics": "simple",
'spark.cosmos.read.inferSchema.enabled': 'false',
"spark.cosmos.read.customQuery": customQuery,
"spark.cosmos.read.maxItemCount": 5000
}

sc.setLogLevel("DEBUG")

df = spark.read.format("cosmos.oltp").options(**cosmos_config).schema(cosmos_schema).load()
total_item_count=df.count()
print(f'Total documents retrieved: {total_item_count}')

Metadata

Metadata

Labels

ClientThis issue points to a problem in the data-plane of the library.CosmosService AttentionWorkflow: This issue is responsible by Azure service team.customer-reportedIssues that are reported by GitHub users external to the Azure organization.needs-team-attentionWorkflow: This issue needs attention from Azure service team or SDK teamquestionThe issue doesn't require a change to the product in order to be resolved. Most issues start as that

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions