[Bug]: Main process does not fail on OOM of spark-job #131

@Phil1602

Description

What happened?

Spark jobs sometimes fail with an OutOfMemoryError (OOM). There are other open issues related to the OOM itself, e.g. this one.

Even worse: when the job runs as a Kubernetes Pod in embedded Spark mode (no dedicated Spark master), it does not fail but keeps running forever without completing.

If the Spark job fails, the main process should fail as well instead of just catching the OOM exception without handling it. Otherwise the Pod keeps blocking resources while the failure goes unnoticed.

Unfortunately, I'm not familiar enough with the Spark framework to suggest an implementable solution.

Steps to reproduce

  1. Deploy Spark-Dependency Job
  2. Collect a huge amount of traces
  3. Wait for the spark-dependency Job to "fail" because of OOM

Expected behavior

The main process should exit with a non-zero exit code, so that the container is marked as failed and can be restarted.
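
A minimal sketch of what failing fast could look like, assuming the job's main class is able to install a JVM-wide uncaught exception handler before it starts Spark. `FailFastOnOom`, `OOM_EXIT_CODE`, `handler` and `install` are hypothetical names and not part of spark-dependencies. Whether every Spark-internal thread actually reaches the default handler in embedded mode has not been verified here.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper (not part of spark-dependencies): stop the JVM when a thread dies with an OOM. */
final class FailFastOnOom {
    /** Arbitrary non-zero exit code used for OutOfMemoryError (an assumption, pick any non-zero value). */
    static final int OOM_EXIT_CODE = 3;

    private FailFastOnOom() {}

    /** Returns true if the throwable or any of its causes is an OutOfMemoryError. */
    static boolean isOom(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Builds a handler that passes OOM_EXIT_CODE to `exit` for an OOM and 1 for any other uncaught error. */
    static Thread.UncaughtExceptionHandler handler(IntConsumer exit) {
        return (thread, error) -> exit.accept(isOom(error) ? OOM_EXIT_CODE : 1);
    }

    /** Installs the handler as the JVM default; halt() is used so shutdown hooks cannot hang on a full heap. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
                handler(code -> Runtime.getRuntime().halt(code)));
    }
}
```

Calling `FailFastOnOom.install()` at the top of `main` would be the intended usage. Independently of any code change, the HotSpot flag `-XX:+ExitOnOutOfMemoryError` could be added to `JAVA_OPTS` so that the JVM exits on the first OutOfMemoryError.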

Relevant log output

2023-03-23T10:30:01.200696521Z WARNING: An illegal reflective access operation has occurred
2023-03-23T10:30:01.200724780Z WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/app/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar) to method java.nio.Bits.unaligned()
2023-03-23T10:30:01.200729036Z WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
2023-03-23T10:30:01.200732490Z WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
2023-03-23T10:30:01.200735314Z WARNING: All illegal access operations will be denied in a future release
2023-03-23T10:30:01.480102373Z 23/03/23 10:30:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-23T10:30:02.292442909Z 23/03/23 10:30:02 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2023-03-23T00:00Z, reading from jaeger-span-2023-03-23 index, result storing to jaeger-dependencies-2023-03-23
2023-03-23T11:04:55.809920514Z 
2023-03-23T11:04:55.809944477Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RemoteBlock-temp-file-clean-thread"
2023-03-23T11:05:24.292046946Z 
2023-03-23T11:05:24.292075347Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
2023-03-23T11:07:32.650964109Z 
2023-03-23T11:07:32.650988857Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "netty-rpc-env-timeout"
2023-03-23T11:08:58.216093703Z 
2023-03-23T11:08:58.216120176Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "driver-heartbeater"
2023-03-23T11:09:55.117794169Z 
2023-03-23T11:09:55.117816339Z Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeat-receiver-event-loop-thread"
2023-03-23T11:11:50.397346152Z Exception in thread "Executor task launch worker for task 8" java.lang.OutOfMemoryError: Java heap space

Screenshot

No response

Additional context

No response

Jaeger backend version

1.35.2

SDK

No response

Pipeline

No response

Storage backend

No response

Operating system

No response

Deployment model

Kubernetes via Jaeger Operator

Deployment configs

- name: JAVA_OPTS
  value: -XX:InitialRAMPercentage=50.0 -XX:MaxRAMPercentage=75.0 -XX:-UseCompressedOops
    -XX:+UseG1GC -XX:+UseContainerSupport
----
Image: ghcr.io/jaegertracing/spark-dependencies/spark-dependencies:latest
