Transition to pyiceberg #5
lambda/app.py
Outdated
| catalog = GlueCatalog(glue_db_name)
| table = catalog.load_table((glue_db_name, glue_table_name))
| logger.info(f"current snapshot id={table.metadata.current_snapshot_id}")
| snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
You can use table.current_snapshot(); it returns the current snapshot instance directly. Why do you need the id if you can just get the current one?
Correct, I'm using the suggested version in other places. Will update 👍
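The suggested change might look like this (a minimal sketch; the `table` object stands in for a pyiceberg `Table` loaded elsewhere in the Lambda, and the helper name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def log_current_snapshot(table):
    """Log and return the table's current snapshot.

    table.current_snapshot() returns the Snapshot instance directly,
    so there is no need to resolve it by id via snapshot_by_id().
    """
    snapshot = table.current_snapshot()
    logger.info(f"current snapshot id={snapshot.snapshot_id}")
    return snapshot
```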
| )
| def send_files_metrics(table: Table, snapshot: Snapshot):
|     logger.info(f"send_files_metrics() -> snapshot_id={snapshot.snapshot_id}")
|     df = table.inspect.files().to_pandas()
Take into account that this method can be very expensive; for example, on a table with thousands of files it can take a long time. Does this metric help with understanding anything?
In the end, if the table is partitioned, what matters is the state of the partitions, not the average files across the entire table.
Initially this metric was exposed. I guess it's a question for the actual template users whether to use it or not...
@moryachok - wdyt?
The same could be true for partitions as well: with thousands of partitions this could be expensive. We can make send_files_metrics and send_partitions_metrics optional.
I agree that there can be thousands of partitions, but in that scenario the best case would be 1 file per partition, which would make it n files. I'm sure that would not be the case, though; in reality it's much more than 1 file per partition.
There will always be more files than partitions.
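One way to make the metrics optional, as suggested in this thread, is a small env-var toggle. This is a sketch; the variable names `SEND_FILES_METRICS` / `SEND_PARTITIONS_METRICS` are illustrative and not from the PR:

```python
import os

def metric_enabled(name: str, default: bool = True) -> bool:
    """Return True if the named env var enables this metric.

    For example, exporting SEND_FILES_METRICS=false would skip the
    potentially expensive table.inspect.files() scan entirely.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Hypothetical call sites in the Lambda handler:
# if metric_enabled("SEND_FILES_METRICS"):
#     send_files_metrics(table, snapshot)
# if metric_enabled("SEND_PARTITIONS_METRICS"):
#     send_partitions_metrics(table, snapshot)
```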
lambda/template.yaml
Outdated
| - CloudWatchPutMetricPolicy: {}
| - AWSLambdaBasicExecutionRole
| - AmazonS3ReadOnlyAccess
| - ECR-Pull-Read-Only
Hey @itakserman-cloudinary,
I am getting a CloudFormation error when trying to deploy. It looks like you're using the ECR-Pull-Read-Only policy, which isn't defined in SAM. There are pre-configured SAM policies that we can reuse; if the one you're suggesting isn't there, you will need to add the relevant permissions as a separate statement in template.yaml.
See the list of SAM policies here:
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-policy-templates.html
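If no pre-configured SAM policy template fits, the separate statement the reviewer mentions could look roughly like this in template.yaml (a sketch, not the PR's actual fix; the ECR action list assumes standard read-only pull permissions):

```yaml
Policies:
  - CloudWatchPutMetricPolicy: {}
  - AWSLambdaBasicExecutionRole
  - AmazonS3ReadOnlyAccess
  - Statement:
      - Effect: Allow
        Action:
          - ecr:GetAuthorizationToken
          - ecr:BatchCheckLayerAvailability
          - ecr:GetDownloadUrlForLayer
          - ecr:BatchGetImage
        Resource: "*"
```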
| ### Build and Deploy
|
| > ! Important - The guidance below uses AWS Serverless Application Model (SAM) for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
| > ! Important - The guidance below uses AWS Serverless Application Model (SAM) and Amazon ECR for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
Using ECR for the Lambda packaging block looks good, but I would make it easier for developers by suggesting bash env vars before the commands, like so:
export CLOUDWATCH_NAMESPACE={{ cw_namespace }}
export AWS_REGION={{ aws_region }}
export aws_account_id={{ aws_account_id }}
export ecr_repository_name={{ repository_name }}
export STACK_NAME={{ your stack name }}
export S3_ARTIFACTS_BUCKET_NAME={{ s3_bucket_name }}
export S3_ARTIFACTS_PATH={{ s3_bucket_path }}
export ecr_repository_uri=${aws_account_id}.dkr.ecr.$AWS_REGION.amazonaws.com/${ecr_repository_name}

Once those are defined, let them just run the code:
docker build -f Dockerfile --platform linux/amd64 -t ${ecr_repository_name}:main --build-arg CLOUDWATCH_NAMESPACE=$CLOUDWATCH_NAMESPACE .
sam build --use-container
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.${AWS_REGION}.amazonaws.com
aws ecr create-repository --repository-name $ecr_repository_name --region $AWS_REGION --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag ${ecr_repository_name}:main ${ecr_repository_uri}:latest
docker push ${ecr_repository_uri}:latest
sam deploy --debug --region $AWS_REGION \
--parameter-overrides ImageURL=${ecr_repository_uri}:latest \
--image-repository $ecr_repository_uri \
--stack-name $STACK_NAME --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
--s3-bucket $S3_ARTIFACTS_BUCKET_NAME --s3-prefix $S3_ARTIFACTS_PATH
README.md
Outdated
| aws ecr create-repository --repository-name iceberg-monitoring --region {{ aws_region }} --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
| docker tag iceberg-monitoring:main {{ ecr_repository_uri }}:latest
| docker push {{ aws_account_id }}.dkr.ecr.{{ aws_region }}.amazonaws.com/iceberg-monitoring:latest
| sam deploy --debug --region {{ aws_region }} \
I think that for the very first deploy you have to add the --guided flag.
lambda/app.py
Outdated
| snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
| snapshot = table.current_snapshot()
| logger.info(f"current snapshot id={snapshot.snapshot_id}")
| logger.info("Using glue IS to produce metrics")
@moryachok @itakserman-cloudinary Are you planning on merging this PR?
Description of changes:
Deprecate usage of the Glue session by transitioning Iceberg table metrics retrieval to pyiceberg.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.