Skip to content

Add new extension yagp hooks collector#1629

Open
leborchuk wants to merge 122 commits intoapache:mainfrom
open-gpdb:mergeYagpHooks
Open

Add new extension yagp hooks collector#1629
leborchuk wants to merge 122 commits intoapache:mainfrom
open-gpdb:mergeYagpHooks

Conversation

@leborchuk
Copy link
Contributor

@leborchuk leborchuk commented Mar 18, 2026

A diagnostic and monitoring extension for Cloudberry clusters

Briefly, the interaction scheme is:

Cloudberry (Run Executor hooks) -> hooks collector extension (Create protobuf messages and send them using UDS) -> local yagpcc agent

This is the part that sends data from the PG instance to an external agent.

External agent receives telemetry, store, aggregate and expose to the consumers. The agent source code is https://github.com/open-gpdb/yagpcc We're going to propose add it to Apache Cloudberry infrastructure later, after fixing all the tests.

This is quite similar to the #1085 idea. However, it has a different implementation. It is completely separate from database application and written in Go. Could be separately maintained. No influence to the main database, and so different development requirements.

Extension has pg_regress tests, they were copied from pg_stat_statements (PG18) and adjusted to the extension behaviour.

Original extension authors are Smyatkin-Maxim and NJrslv I left their commits as-is to have blame history commited source.

Just create the overal project structure with a Makefile to
generate protobufs, compile it into a shared library extension
and install it
as a fundation to build on
- borrow GlowByte code to generate plan text and SessionInfo
- borrow code from our in-house pg_stat_statements to generate
  query id and plan id
- refactor code to follow common name conventions and identations
- do some minor refactoring to follow common naming convention
- add additional message right after ExecutorStart hook
1) Query instrumentation
2) /proc/self/* stats
It allows finer granularity than executor hooks. Also removed some code
duplication and data duplicaton
1. Initialize query instrumentation to NULL so that it can be properly
checked later (temporary solution, need to find a proper fix)
2. Don't collect spillinfo on query end. Reason: a) it will always be
zero and b) it could crash if we failed to enlarge a spillfile. Seems
like we need some cummulative statistics for spillinfo. Need to check
what explain analyze use.
1. Sync with protobuf changes to collect segment info
2. Remove noisy logging
3. Fix some missing node types in pg_stat_statements
Reason: when query info hook is called with status 'DONE' planstate
is already deallocated by ExecutorEnd
1) Give higher gRPC timeouts to query dispatcher as losing messages
there is more critical
2) If we've failed to send a message via gRPC we notify a background
thread about it and refuse sending any new message until this thread
re-establishes the lost connection
Don't collect system queries with empty query text and ccnt == 0
Rethrowing them might break other extensions and even query execution
pipeline itself
NJrslv and others added 13 commits March 18, 2026 14:04
Change makefile, test and add it to CI of yagp_hooks_collector.
Add option --with-yagp-hooks-collector. Similarly to [1].

[1] open-gpdb/gpdb@7be8893
Correct CI for yagp_hooks_collector to use correct env script.
Correct defines for token ids copied from gram.y.
Similarly to [1] add missing executor query info hooks.

[1] open-gpdb/gpdb@87fc05d
Copy of [1] with additinal changed needed for Clouberry are described below:

The testing C functions have changed to set-returning ones if comparing with
[1] because we need a control over the place where function is executed -
either on master or segments, and in Cloudberry these functions must return
set of values so they were changed to return SETOF.

[1] open-gpdb/gpdb@989ca06
Copy of [1] - send() may return -1 in case of an error, do not add -1 to total_bytes sent.

[1] open-gpdb/gpdb@e1f6c08
The extension generates normalized query text and plan using jumbling functions.
Those functions may fail when translating to wide character if the current locale
cannot handle the character set.

Fix changes functions that generate normalized query text/plan to noexcept versions
so we can check if error occured and continute execution.

The test checks that even when those functions fail, the plan is still executed.
This test is partially taken from src/test/regress/gp_locale.sql.
Cloudberry builds treat compiler warnings as errors. For consistency,
this behavior has been enabled in yagp_hooks_collector.

This commit also fixes the warnings in yagp_hooks_collector.
We faced an issue - segments fail with backtrace
```
#7  0x00007f9b2adbf2e0 in set_qi_error_message (req=0x55f24a6011f0) at src/ProtoUtils.cpp:124
#8  0x00007f9b2adc30d9 in EventSender::collect_query_done (this=0x55f24a5489f0, query_desc=0x55f24a71ca68, status=METRICS_QUERY_ERROR) at src/EventSender.cpp:222
#9  0x00007f9b2adc23e1 in EventSender::query_metrics_collect (this=0x55f24a5489f0, status=METRICS_QUERY_ERROR, arg=0x55f24a71ca68) at src/EventSender.cpp:53
```

the root cause here is we're trying to send info about error message in a hooks collector. For some queries ErrorData struckture could be NULL despite the fact that an error has occurred. it depends on error type and location of the error. So we should check if we had info about error details before using it.
@leborchuk leborchuk marked this pull request as ready for review March 24, 2026 08:23
@leborchuk
Copy link
Contributor Author

Right now we are not fully support all the features from #1085 but I believe one day solution will be won't be any worse.

Copy link
Member

@tuhaihe tuhaihe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your great work! Just left some comments, FYI.

<exclude>gpcontrib/yagp_hooks_collector/protos/yagpcc_metrics.proto</exclude>
<exclude>gpcontrib/yagp_hooks_collector/.clang-format</exclude>
<exclude>gpcontrib/yagp_hooks_collector/Makefile</exclude>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can move these new lines to the Cloudberry part, like:

cloudberry/pom.xml

Lines 1263 to 1272 in 6c37f8f

<!-- The following files are from pg_cron under PostgreSQL
license, introduced by Cloudberry.
-->
<exclude>src/include/task/bitstring.h</exclude>
<exclude>src/include/task/cron.h</exclude>
<exclude>src/include/task/pg_cron.h</exclude>
<exclude>src/include/task/task_states.h</exclude>
<exclude>src/include/task/job_metadata.h</exclude>
<!-- The following files are used by PAX as the

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1 @@
This directory contains a slightly modified subset of pg_stat_statements for PG v9.4 to be used in query and plan ID generation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add the standard ASF license header to the markdown file, like:

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

{"test":"gpcontrib-yagp-hooks-collector",
"make_configs":["gpcontrib/yagp_hooks_collector:installcheck"],
"extension":"yagp_hooks_collector"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to add the new test to build-cloudberry-rocky8.yml or *.*-deb.yml file? FYI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I don't see a reason to not include it.


package yagpcc;
option java_outer_classname = "SegmentYAGPCCM";
option go_package = "a.yandex-team.ru/cloud/mdb/yagpcc/api/proto/common;greenplum";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 5: Is it private? If so, do we need to make it public?

Copy link
Contributor

@NJrslv NJrslv Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need them, so I just got rid of those options.

@@ -0,0 +1 @@
This directory contains a slightly modified subset of pg_stat_statements for PG v9.4 to be used in query and plan ID generation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add an ASF license header. FYI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- **GUCs:** `yagpcc.enable`, `yagpcc.min_analyze_time`, `yagpcc.enable_cdbstats`(ANALYZE), `yagpcc.enable_analyze`(BUFFERS, TIMING, VERBOSE).

#### 4. Other Metrics
- **What:** Captures Instrument, Greenplum, System, Network, Interconnect, Spill metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **What:** Captures Instrument, Greenplum, System, Network, Interconnect, Spill metrics.
- **What:** Captures Instrument, Apache Cloudberry (Greenplum), System, Network, Interconnect, Spill metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just got rid of it - done

@@ -0,0 +1,5 @@
# yagp_hooks_collector extension
comment = 'Intercept query and plan execution hooks and report them to Yandex GPCC agents'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
comment = 'Intercept query and plan execution hooks and report them to Yandex GPCC agents'
comment = 'Intercept query and plan execution hooks and report them to Cloudberry monitor agents'

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

configure.ac Outdated
# yagp_hooks_collector
#
PGAC_ARG_BOOL(with, yagp_hooks_collector, no,
[build with YAGP hooks collector extension])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[build with YAGP hooks collector extension])
[build with hooks collector extension])

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# to build Postgres with a different make, we have this make file
# that, as a service, will look for a GNU make and invoke it, or show
# an error message if none could be found.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no actual changes to Makefile, we can restore it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

configure.ac Outdated
AC_SUBST(with_zstd)

#
# yagp_hooks_collector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this new extension need other dependencies? Not sure about this. If so, we can add the pre-required dependencies here. FYI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the protoc, included it.

increment_command_count();

/* null this field until set by YAGP Hooks collector */
qd->yagp_query_key = NULL;
Copy link
Contributor

@NJrslv NJrslv Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename yagp_query_key, in similar places too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@NJrslv
Copy link
Contributor

NJrslv commented Mar 24, 2026

Hi all!

I want to extend the description of this project a bit.

Here we collect state of the query during the executor stages and send it via unix sockets. It was originally made for Greenplum, and I adapted it for Cloudberry here.

For testing purposes only, we support writing data to the catalog table so it can be selected during regression tests. The tuples are logged even during transaction aborts, using frozen tuples (only for the tests).

The state of the query is created before the executor_start stage and released after the executor_end stage. If the executor_end is missing for some unknown reason, the state is not freed. To fix this problem, new changes to the code outside of the project were introduced to correctly handle queries with a missing executor_end stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants