HADOOP-18073. Upgrade AWS SDK to v2 in S3A [work in progress] #5163

passaro
Contributor

@passaro passaro commented Nov 24, 2022

Description of PR

This is an initial draft PR containing all the changes implemented so far to upgrade S3A to the AWS SDK v2. Note that this is still a work in progress and we plan to further contribute to it to fill existing gaps and update the SDK when missing features are released (e.g. support for Client-side Encryption and public release of the new Transfer Manager, currently in preview).

In the meantime, this PR should provide a view of the whole set of changes and start a conversation on the remaining open questions and on how to handle breaking changes that affect S3A.

The new document at hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/aws_sdk_v2_changelog.md
discusses the key changes contained in this PR and is the suggested starting point for the review.

Further open questions to be discussed:

  1. The region logic. Previously, if an endpoint was configured but no region, the region was parsed from the endpoint. If the configured endpoint was the standard us-east-1 endpoint, the region was set to null and the SDK was left to figure it out. If no endpoint was configured, the region was set to us-east-1 and .withForceGlobalBucketAccessEnabled was set. SDK v2 has no cross-region access, so the correct region of the bucket needs to be set. We now get the region of the bucket with a HeadBucket request and set it on the client. In general, the guidance for the new SDK is to set only the region and let the SDK determine the endpoint.

  2. Bucket probes. These are currently done with doesBucketExist and doesBucketExistV2. Why do we need these two separate levels? SDK V2 has no doesBucketExist operation, so it will need to be replaced with a HeadBucket/GetBucketACL. Also consider that, with the new region logic, we will need to do a HeadBucket while configuring the client whenever the region isn’t specified.

  3. Progress Listeners. SDK V2 currently does not support attaching progress listeners on requests outside the Transfer Manager. We use them in Put and UploadPart in S3ABlockOutputStream. Are they required for the upgrade?

  4. ACLs. LogDeliveryWrite, which is a bucket-level ACL, is no longer supported in SDK V2. S3A seems to use ACLs at the object level only. Can this ACL be removed?

  5. Transfer Manager. You can no longer set a threshold for when to use the Transfer Manager. The default is 8MB.
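
The region rules in item 1 can be sketched as plain decision logic. This is only an illustration under stated assumptions, not the S3A implementation: the class `RegionSketch`, its method names, the simplified host pattern, and the `bucketRegionProbe` callback (standing in for the HeadBucket call) are all hypothetical, and the rule that an explicitly configured region always wins is assumed rather than stated above.

```java
import java.util.Optional;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the region rules from item 1; not S3A code. */
final class RegionSketch {
    static final String CENTRAL_ENDPOINT = "s3.amazonaws.com";
    static final String DEFAULT_REGION = "us-east-1";

    // Simplified assumption: regional hosts look like
    // s3.<region>.amazonaws.com or s3-<region>.amazonaws.com.
    private static final Pattern REGIONAL_HOST =
        Pattern.compile("^s3[.-]([a-z0-9-]+)\\.amazonaws\\.com$");

    /**
     * SDK v1 behaviour as described: an empty result means
     * "leave the region unset and let the SDK work it out".
     */
    static Optional<String> legacyRegion(String endpoint, String region) {
        if (region != null && !region.isEmpty()) {
            return Optional.of(region);          // assumed: explicit region wins
        }
        if (endpoint == null || endpoint.isEmpty()) {
            // No endpoint: us-east-1 (v1 also enabled global bucket access).
            return Optional.of(DEFAULT_REGION);
        }
        String host = endpoint.replaceFirst("^https?://", "");
        if (host.equals(CENTRAL_ENDPOINT)) {
            return Optional.empty();             // central endpoint: region left null
        }
        Matcher m = REGIONAL_HOST.matcher(host);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /**
     * SDK v2 behaviour as described: use the configured region,
     * otherwise ask the bucket (the probe stands in for HeadBucket).
     */
    static String v2Region(String region, Supplier<String> bucketRegionProbe) {
        if (region != null && !region.isEmpty()) {
            return region;
        }
        return bucketRegionProbe.get();
    }
}
```

Passing the probe as a `Supplier` keeps the sketch free of SDK dependencies and makes the extra request explicit: it only happens when no region is configured, which is the case item 2 flags as needing a HeadBucket during client setup.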

How was this patch tested?

Run mvn -Dparallel-tests -DtestsThreadCount=8 clean verify in eu-west-2.

The following tests are currently failing:

| Test Suite | Test Name | Reason |
|---|---|---|
| TestS3AExceptionTranslation | test301ContainsEndpoint | Missing endpoint in SDK exception (aws/aws-sdk#578) |
| TestStreamChangeTracker | testCopyETagRequired, testCopyVersionIdRequired | Transfer Manager response does not yet have CopyObjectResult |
| ITestCustomSigner | testCustomSignerAndInitializer | Signers not upgraded yet |
| ITestS3AFileContextStatistics | testStatistics | ProgressListeners not attached to non-TM uploads |
| ITestS3AEncryptionSSEC | multiple tests (14 out of 24) | Transfer Manager issue with SSE-C |
| ITestXAttrCost | testXAttrRoot | headObject() with empty key fails |
| ITestSessionDelegationInFileystem | testDelegatedFileSystem | Succeeds, but headObject() with empty key commented out |
| ITestS3ACannedACLs | testCreatedObjectsHaveACLs | AWSCannedACL.LogDeliveryWrite not supported in SDK v2 |

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

slfan1989 and others added 30 commits September 27, 2022 13:28
…during vectored read. (apache#4921)


part of HADOOP-18103.

Contributed by: Mukund Thakur
…er (apache#4917)

Contributed-by: navinko <nakumr@cloudera.com>
Make S3APrefetchingInputStream.seek() completely lazy. Calls to seek() will not affect the current buffer nor interfere with prefetching, until read() is called.

This change allows various usage patterns to benefit from prefetching, e.g. when calling readFully(position, buffer) in a loop for contiguous positions the intermediate internal calls to seek() will be noops and prefetching will have the same performance as in a sequential read.

Contributed by Alessandro Passaro.
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
Add to XMLUtils a set of methods to create secure XML Parsers/transformers, locking down DTD, schema, XXE exposure.

Use these wherever XML parsers are created.

Contributed by PJ Fanning
… Contributed by Ashutosh Gupta.

Reviewed-by: Akira Ajisaka <aajisaka@apache.org>
Signed-off-by: Chris Nauroth <cnauroth@apache.org>
…er if the item exist. (apache#4987). Contributed by ZanderXu.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
…ervice (apache#4775)

Co-authored-by: Ashutosh Gupta <ashugpt@amazon.com>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
…izedBlocks (apache#4942).  Contributed by ZanderXu.

Reviewed-by: Mingxiang Li <liaiphag0@gmail.com>
Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
apache#4948). Contributed by ZanderXu.

Signed-off-by: He Xiaoqiao <hexiaoqiao@apache.org>
HDFS-16774. Improve async delete replica on datanode to reduce the probability of ReplicationNotFoundException

Co-authored-by: Haiyang Hu <haiyang.hu@shopee.com>
Reviewed-by: He Xiaoqiao <hexiaoqiao@apache.org>
ChengbingLiu and others added 24 commits January 10, 2023 10:03
…atic members (apache#5246)

Co-authored-by: Chengbing Liu <liuchengbing@qiyi.com>
Signed-off-by: Erik Krogen <xkrogen@apache.org>
…ailed (apache#5280)

Reviewed-by: Takanobu Asanuma <tasanuma@apache.org>
Signed-off-by: Tao Li <tomscut@apache.org>
* YARN-11413. Fix Junit Test ERROR Introduced By YARN-6412.

* YARN-11413. Fix CheckStyle.

* YARN-11413. Fix CheckStyle.

Co-authored-by: slfan1989 <louj1988@@>
Signed-off-by: Tao Li <tomscut@apache.org>
Signed-off-by: Chris Nauroth <cnauroth@apache.org>
…dirs (apache#4237)

Signed-off-by: Chris Nauroth <cnauroth@apache.org>
…refreshUserToGroupsMappings API's for Federation. (apache#5193)
…der (apache#5019)

Co-authored-by: Ashutosh Gupta <ashugpt@amazon.com>
Reviewed-by: Shilun Fan <slfan1989@apache.org>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
…ugins (apache#5023)

Co-authored-by: Ashutosh Gupta <ashugpt@amazon.com>
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
…en transformer factories do not support attributes (apache#5253)


Part of HADOOP-18469 and the hardening of XML/XSL parsers.
Followup to the main HADOOP-18575 patch, to improve performance when
working with xml/xsl engines which don't support the relevant attributes.

Include this change when backporting.

Contributed by PJ Fanning.
Signed-off-by: Nikita Eshkeev <neshkeev@yandex.ru>
See aws_sdk_v2_changelog.md for details.

Co-authored-by: Ahmar Suhail <ahmarsu@amazon.co.uk>
Co-authored-by: Alessandro Passaro <alexpax@amazon.co.uk>
addresses review comments + yetus errors

Co-authored-by: Ahmar Suhail <ahmarsu@amazon.co.uk>
@ahmarsuhail ahmarsuhail force-pushed the feature-HADOOP-18073-s3a-sdk-upgrade branch from b2ef9e2 to 369fcfa Compare January 18, 2023 14:26
@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | patch | 0m 24s | | #5163 does not apply to feature-HADOOP-18073-s3a-sdk-upgrade. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help. |

| Subsystem | Report/Notes |
|---|---|
| GITHUB PR | #5163 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5163/7/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | patch | 0m 26s | | #5163 does not apply to feature-HADOOP-18073-s3a-sdk-upgrade. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help. |

| Subsystem | Report/Notes |
|---|---|
| GITHUB PR | #5163 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5163/8/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@asfgit asfgit merged commit 3671db2 into apache:feature-HADOOP-18073-s3a-sdk-upgrade Jan 19, 2023
@steveloughran
Contributor

merged!
