One of the problems we face at Grafana is profiling cluster-wide under a memory limit. Say we set the limit to 400MiB and the profiler consumes 350MiB on average. We also have some ClickHouse / Oracle databases running, and their binaries are huge. Processing such a binary takes seconds of CPU time and hundreds of MiB of memory (both heap and eBPF maps). Most of the time this leads to an OOM kill, which leaves us with either an unnecessarily high memory cgroup limit or a constantly OOMed profiler on some nodes. For example, extracting stack deltas from the ClickHouse binary takes over two seconds and ~128MiB of heap:
```go
func TestExtractStackDeltasFromFilename(t *testing.T) {
	elf, err := pfelf.Open("/home/korniltsev/Downloads/clickhouse-common-static-25.11.2.24/usr/bin/clickhouse")
	if err != nil {
		t.Fatal(err)
	}
	var data sdtypes.IntervalData
	t1 := time.Now()
	_ = extractFile(elf, nil, &data)
	// Heap consumed by the extracted stack deltas, in MiB.
	fmt.Println(len(data.Deltas) * int(unsafe.Sizeof(data.Deltas[0])) / 1024 / 1024) // 128
	fmt.Println(time.Since(t1)) // 2.276861659s
}
```

The other problem is short-lived processes, which exit by the time, or even before, we configure profiling for them.
Reported here: grafana#37 (comment)
What would be the right way to handle these cases?
While I agree that the configuration proposed in #955 is not flexible and cannot cover many possible setups, I do believe it is possible to design a better interface that lets users avoid wasting resources (primarily users of the profiler as a library, but collector users as well).
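To make this concrete, here is a minimal sketch of the kind of hook I have in mind for library users. Everything in it (`ProcessMeta`, `ProcessFilter`, `shouldProfile`) is a hypothetical name invented for illustration; none of this exists in the profiler today:

```go
package main

import "fmt"

// ProcessMeta is the minimum a filter needs to decide before the profiler
// spends CPU time and memory extracting unwind info for a process.
// (Hypothetical type, for illustration only.)
type ProcessMeta struct {
	PID  int
	Comm string // executable name, e.g. "clickhouse"
}

// ProcessFilter returns false to skip a process entirely.
// (Hypothetical hook, not an existing profiler API.)
type ProcessFilter func(ProcessMeta) bool

// shouldProfile applies the user-supplied filter, defaulting to
// "profile everything" when no filter is configured.
func shouldProfile(filter ProcessFilter, p ProcessMeta) bool {
	if filter == nil {
		return true
	}
	return filter(p)
}

func main() {
	// Example policy: skip huge database binaries whose unwind tables
	// would blow the memory budget.
	skipHeavy := func(p ProcessMeta) bool {
		return p.Comm != "clickhouse" && p.Comm != "oracle"
	}
	fmt.Println(shouldProfile(skipHeavy, ProcessMeta{PID: 4242, Comm: "clickhouse"})) // false
	fmt.Println(shouldProfile(skipHeavy, ProcessMeta{PID: 4243, Comm: "postgres"}))   // true
}
```

A predicate like this would let the profiler decide, before spending seconds of CPU and hundreds of MiB on a binary like ClickHouse, whether a process is worth profiling at all.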
The cons I see in process filtering:
- It is not idiomatic to the collector's processor architecture.
- It may complicate the profiler logic a bit, and therefore its maintenance.