Transition to pyiceberg #5
lambda/app.py
Outdated
| catalog = GlueCatalog(glue_db_name)
| table = catalog.load_table((glue_db_name, glue_table_name))
| logger.info(f"current snapshot id={table.metadata.current_snapshot_id}")
| snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
You can use table.current_snapshot(); it returns the current snapshot instance directly. Why do you need the id if you can just get the current one?
Correct, I'm using the suggested version in other places. Will update 👍
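The suggested change might look like this (a minimal sketch; the `table` object stands in for a pyiceberg `Table` loaded elsewhere in the Lambda, and the helper name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def log_current_snapshot(table):
    """Log and return the table's current snapshot.

    table.current_snapshot() returns the Snapshot instance directly,
    so there is no need to resolve it by id via snapshot_by_id().
    """
    snapshot = table.current_snapshot()
    logger.info(f"current snapshot id={snapshot.snapshot_id}")
    return snapshot
```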
| )
| def send_files_metrics(table: Table, snapshot: Snapshot):
|     logger.info(f"send_files_metrics() -> snapshot_id={snapshot.snapshot_id}")
|     df = table.inspect.files().to_pandas()
Take into account that this method can be very expensive; for example, on a table with thousands of files it can take a long time. Does this metric help with understanding anything?
In the end, if the table is partitioned, what matters is the state of the partitions, not the average files across the entire table.
Initially this metric was exposed. I guess it's a question for the actual template users whether to use it or not...
@moryachok - wdyt?
The same could be true for partitions as well: with thousands of partitions this could be expensive. We can make send_files_metrics and send_partitions_metrics optional.
I agree that there can be thousands of partitions, but in that scenario the best case would be 1 file per partition, which would make it n files. I'm sure that would not be the case, though; in reality it's much more than 1 file per partition.
There will always be more files than partitions.
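One way to make the metrics optional, as suggested in this thread, is a small env-var toggle. This is a sketch; the variable names `SEND_FILES_METRICS` / `SEND_PARTITIONS_METRICS` are illustrative and not from the PR:

```python
import os

def metric_enabled(name: str, default: bool = True) -> bool:
    """Return True if the named env var enables this metric.

    For example, exporting SEND_FILES_METRICS=false would skip the
    potentially expensive table.inspect.files() scan entirely.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Hypothetical call sites in the Lambda handler:
# if metric_enabled("SEND_FILES_METRICS"):
#     send_files_metrics(table, snapshot)
# if metric_enabled("SEND_PARTITIONS_METRICS"):
#     send_partitions_metrics(table, snapshot)
```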
lambda/template.yaml
Outdated
| - CloudWatchPutMetricPolicy: {}
| - AWSLambdaBasicExecutionRole
| - AmazonS3ReadOnlyAccess
| - ECR-Pull-Read-Only
Hey @itakserman-cloudinary,
I am getting a CloudFormation error when trying to deploy. It looks like you're using the ECR-Pull-Read-Only policy, which isn't defined in SAM. There are pre-configured SAM policies that we can reuse; if the one you're suggesting isn't there, you will need to add the relevant permissions as a separate statement in template.yaml.
See the list of SAM policies here:
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-policy-templates.html
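If no pre-configured SAM policy template fits, the separate statement the reviewer mentions could look roughly like this in template.yaml (a sketch, not the PR's actual fix; the ECR action list assumes standard read-only pull permissions):

```yaml
Policies:
  - CloudWatchPutMetricPolicy: {}
  - AWSLambdaBasicExecutionRole
  - AmazonS3ReadOnlyAccess
  - Statement:
      - Effect: Allow
        Action:
          - ecr:GetAuthorizationToken
          - ecr:BatchCheckLayerAvailability
          - ecr:GetDownloadUrlForLayer
          - ecr:BatchGetImage
        Resource: "*"
```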
| ### Build and Deploy
|
| > ! Important - The guidance below uses AWS Serverless Application Model (SAM) for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
| > ! Important - The guidance below uses AWS Serverless Application Model (SAM) and Amazon ECR for easier packaging and deployment of AWS Lambda. However if you use your own packaging tool or if you want to deploy AWS Lambda manually you can explore following files:
Using ECR for the Lambda packaging block looks good, but I would make it easier for developers by suggesting bash env vars before the commands, like so:
export CLOUDWATCH_NAMESPACE={{ cw_namespace }}
export AWS_REGION={{ aws_region }}
export aws_account_id={{ aws_account_id }}
export ecr_repository_name={{ repository_name }}
export STACK_NAME={{ your stack name }}
export S3_ARTIFACTS_BUCKET_NAME={{ s3_bucket_name }}
export S3_ARTIFACTS_PATH={{ s3_bucket_path }}
export ecr_repository_uri=${aws_account_id}.dkr.ecr.$AWS_REGION.amazonaws.com/${ecr_repository_name}

Once those are defined, let them just run the code:
docker build -f Dockerfile --platform linux/amd64 -t ${ecr_repository_name}:main --build-arg CLOUDWATCH_NAMESPACE=$CLOUDWATCH_NAMESPACE .
sam build --use-container
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.${AWS_REGION}.amazonaws.com
aws ecr create-repository --repository-name $ecr_repository_name --region $AWS_REGION --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
docker tag ${ecr_repository_name}:main ${ecr_repository_uri}:latest
docker push ${ecr_repository_uri}:latest
sam deploy --debug --region $AWS_REGION \
--parameter-overrides ImageURL=${ecr_repository_uri}:latest \
--image-repository $ecr_repository_uri \
--stack-name $STACK_NAME --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
--s3-bucket $S3_ARTIFACTS_BUCKET_NAME --s3-prefix $S3_ARTIFACTS_PATH
README.md
Outdated
| aws ecr create-repository --repository-name iceberg-monitoring --region {{ aws_region }} --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
| docker tag iceberg-monitoring:main {{ ecr_repository_uri }}:latest
| docker push {{ aws_account_id }}.dkr.ecr.{{ aws_region }}.amazonaws.com/iceberg-monitoring:latest
| sam deploy --debug --region {{ aws_region }} \
I think that for the very first deploy you have to add the --guided flag.
lambda/app.py
Outdated
| snapshot = table.metadata.snapshot_by_id(table.metadata.current_snapshot_id)
| snapshot = table.current_snapshot()
| logger.info(f"current snapshot id={snapshot.snapshot_id}")
| logger.info("Using glue IS to produce metrics")
@moryachok @itakserman-cloudinary Are you planning on merging this PR?
Description of changes:
Deprecate usage of the Glue session by transitioning Iceberg table metrics retrieval to pyiceberg.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.