Conversation
@@ -0,0 +1,3 @@
#!/bin/bash -xe

sudo aws s3 cp s3://dig-data-registry/hail.jar /usr/lib/spark/jars/
I built the hail.jar from the hail source (it needs to be compiled using Java 8, since that's what our EMR clusters use) and then uploaded it to S3. We probably need a better location in S3, but I used this one for now since it's not in production use. This solution also relies on EMR continuing to put /usr/lib/spark/jars on the classpath.
So for things like this we tend to use s3://dig-aggregator-data/bin/
psmadbec left a comment
Magical. For generating the hail.jar, we'll either want instructions, or probably something put into dig-analysis-data/scripts, which is where I put things that have a specific generation sequence, in case someone needs to generate or update the file itself.
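Such a generation script might look like the following — a minimal sketch, assuming Hail's Gradle shadowJar target and its default output path, and using the s3://dig-aggregator-data/bin/ location suggested in this thread. The exact build command and jar name should be confirmed against the Hail version actually in use.

```shell
#!/bin/bash -xe
# Sketch: build hail.jar with Java 8 and publish it to S3.
# Assumes JAVA_HOME points at a Java 8 JDK (EMR runs Java 8) and the
# AWS CLI is configured. Gradle target and output path may differ
# between Hail versions.

export JAVA_HOME=/usr/lib/jvm/java-1.8.0

git clone https://github.com/hail-is/hail.git
cd hail/hail

# Build the fat jar containing Hail and its Spark codecs
./gradlew shadowJar

# Publish to the conventional bin/ location
aws s3 cp build/libs/hail-all-spark.jar s3://dig-aggregator-data/bin/hail.jar
```

The bootstrap step can then fetch the jar from that location into /usr/lib/spark/jars/ on each cluster.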
Making this work comes down to specifying the codec when writing from pyspark and also having that codec on the Spark classpath.
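Concretely, that might look like the following pyspark sketch. The codec class name (Hail's BGzip codec, is.hail.io.compress.BGzipCodec) and the S3 paths are assumptions here for illustration; the write only succeeds if the jar containing the codec is on the Spark classpath, e.g. via the /usr/lib/spark/jars copy in the bootstrap script.

```python
# Sketch: write block-gzipped output from pyspark by naming the codec.
# Assumes hail.jar is already on the Spark classpath so the codec class
# can be resolved by name. Codec class name and paths are hypothetical
# and should be checked against the jar actually deployed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bgz-write").getOrCreate()

df = spark.read.json("s3://example-bucket/input/")  # hypothetical input

# The "compression" option accepts a fully qualified codec class name;
# Spark looks it up on the classpath at write time.
df.write \
    .option("compression", "is.hail.io.compress.BGzipCodec") \
    .csv("s3://example-bucket/output/")
```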