Commit Graph

15845 Commits

Author SHA1 Message Date
György Krajcsovits
caff76d86b chore(promql): add public API to be able to evaluate duration expressions
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2025-07-10 13:09:41 +02:00
Björn Rabenstein
0672a5b045 Merge pull request #16847 from prometheus/beorn7/promql
promqltest: Test NaN sample values for quantile aggregator
2025-07-10 11:16:12 +02:00
Dmitry Ponomaryov
b18272a572 Add template functions to support various use cases. (#16619)
Presumably, this will help with Loki alerts, but the added functionality is also generally useful.

For one, this enables `parseDuration` to also accept negative duration (as that's something that is also used in PromQL by now).

This also adds a function `now` to return the evaluation time of the template (as seconds since epoch AKA Unix time) and a function `toDuration` (akin to `toTime`), which creates a Go `time.Duration` from a duration in seconds.

---------

Signed-off-by: Dmitry Ponomaryov <me@halje.ru>
Signed-off-by: Dmitry Ponomaryov <iamhalje@gmail.com>
2025-07-10 00:33:20 +02:00
machine424
846acc10bb chore(tsdb): remove NewLeveledCompactorWithChunkSize constructor as unused, library users can redefine it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
machine424
020e803ee0 chore(discovery): remove unused StaticProvider struct, library users can easily define it on their side
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-09 17:10:13 +01:00
George Krajcsovits
1d79f0f47e chore(tsdb): add a few more testcases for unlock of unlocked mtx 16332 (#16848)
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
2025-07-09 16:24:46 +02:00
Banana Duck
89f011ba13 fix: unlock of unlocked mutex (#16332)
* fix: unlock on unlocked mutex

Signed-off-by: Usama Alhanaqtah <a.usama@yandex.ru>

* test coverage

Signed-off-by: Usama Alhanaqtah <a.usama@yandex.ru>

---------

Signed-off-by: Usama Alhanaqtah <a.usama@yandex.ru>
Co-authored-by: alhanaqtah.usama <alhanaqtah.usama@DEV-254.local>
2025-07-09 15:37:55 +02:00
Björn Rabenstein
d86796863f Merge pull request #16764 from bboreham/go-get-no-d
[BUILD] Don't specify -d for go get
2025-07-09 14:14:05 +02:00
beorn7
107e4a00c3 promqltest: Test NaN sample values for quantile aggregator
Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-09 13:38:19 +02:00
Björn Rabenstein
181415c7b7 Merge pull request #16846 from liangmulu/main
docs: fix some minor issues in comments
2025-07-09 13:00:13 +02:00
liangmulu
b1a7df2c0c chore: fix some minor issues in comments
Signed-off-by: liangmulu <liangmulu@outlook.com>
2025-07-09 18:05:41 +08:00
Björn Rabenstein
d8c921804e Merge pull request #16824 from afhassan/main
tsdb: add count of histogram samples to block stats
2025-07-08 20:16:13 +02:00
Björn Rabenstein
dbee82267a Merge pull request #16725 from MichaHoffmann/mhoffmann/fix-topk-nan-arg-error-on-nonexisting-series
promql: fix topk error on NaN argument for non-existing series
2025-07-08 19:42:20 +02:00
Vlad Shulcz
19fa1ed008 test(rulefmt): fix description annotation index in TestParseFileSuccessWithAliases (#16839)
Signed-off-by: shulcz <vshulcz@gmail.com>
2025-07-08 18:38:34 +02:00
Björn Rabenstein
c565e95808 Merge pull request #16825 from prometheus/beorn7/histogram
promql: add tests to demonstrate extrapolation below zero
2025-07-08 16:42:56 +02:00
chenlujjj
a2735494e1 chore: complete error message in RegisterSDMetrics function (#14635)
Signed-off-by: chenlujjj <953546398@qq.com>
2025-07-08 12:05:24 +00:00
Ahmed Hassan
01be7bfb2e add NumFloatSamples to TSDB block stats
Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
2025-07-07 13:48:18 -07:00
machine424
ffcba01c5a chore: do not hardcode required versions in README.md
add links to the sources of truth.

It's hard to keep up to date, the "go" one
is "wrong" (not really, as an old 1.22 binary could still
download/use newer toolchains...) for example.

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
2025-07-07 08:42:31 +01:00
Charles Korn
1e58d792a5 storage/remote: fix "http: read on closed response body" errors if chunkedSeriesSet.Next is called again after the series set is exhausted (#16838)
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-07 09:23:34 +02:00
Michael Hoffmann
44ee5e2ad6 promql: fix topk error on NaN argument for non-existing series
Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-07 06:19:39 +00:00
RaphSku
938e5cb62b docs: Added documentation for promtool configuration with http.config.file (#16522)
Includes an example.

Signed-off-by: RaphSku <rapsku.dev@gmail.com>
2025-07-07 00:00:51 +02:00
beorn7
c0a13223e7 promql: add tests to demonstrate extrapolation below zero
This shows how float counters cannot go below zero when extrapolating
for rate/increase, and how histograms do not have that protection yet,
leading to an overestimation of the rate/increase.

This also demonstrates edge cases where the count extrapolation does
not need to be limited, but an individual bucket still goes below
zero.

Signed-off-by: beorn7 <beorn@grafana.com>
2025-07-06 23:42:55 +02:00
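The zero limit for float counters mentioned in commit c0a13223e7 can be sketched as follows. This is a simplified illustration rather than the PromQL engine's code; the function name `limitToZero` and its parameters (all durations as float seconds) are assumptions.

```go
package main

import "fmt"

// limitToZero caps how far back a counter may be extrapolated.
// A counter that started at `first` and grew by `increase` over
// `sampledInterval` seconds would have been zero
// sampledInterval * first / increase seconds before the first sample,
// so extrapolating further back than that would imply a negative counter.
func limitToZero(durationToStart, sampledInterval, first, increase float64) float64 {
	if increase > 0 && first >= 0 {
		durationToZero := sampledInterval * (first / increase)
		if durationToZero < durationToStart {
			return durationToZero
		}
	}
	return durationToStart
}

func main() {
	// First sample is 2, the counter grew by 8 over 60s, and the range
	// starts 20s before the first sample: the counter would hit zero after 15s.
	fmt.Println(limitToZero(20, 60, 2, 8)) // prints 15
	// With only 10s to the range start the limit does not apply.
	fmt.Println(limitToZero(10, 60, 2, 8)) // prints 10
}
```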
Michael Hoffmann
21b1536b5a storage: add projection fields to select hints (#16423)
This commit adds Projection metadata to SelectHints so that downstream
storage implementations can use it to save effort when answering
Select calls.

Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
2025-07-06 12:57:19 +02:00
Arve Knudsen
f561aa795d OTLP receiver: Generate target_info samples between the earliest and latest samples per resource (#16737)
* OTLP receiver: Generate target_info samples between the earliest and latest samples per resource

Modify the OTLP receiver to generate target_info samples between the earliest
and latest samples per resource instead of only one for the latest timestamp.
The samples are spaced lookback delta/2 apart.

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-04 14:38:16 +00:00
Jon Kartago Lamida
819500bdbc Add ByteSize method for Labels (#16717)
Add `ByteSize()` method to the different labels implementations.
One use case is tracking the memory used by Labels.

Signed-off-by: Jon Kartago Lamida <me@lamida.net>
2025-07-04 15:09:01 +01:00
Arve Knudsen
5a5424cbc1 Consolidate around prometheus/common/model.ValidationScheme (#16806)
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
2025-07-03 15:37:46 +02:00
Bartlomiej Plotka
419d436a44 Merge pull request #16822 from prometheus/bump-otlptranslator
Bump otlptranslator to latest SHA
2025-07-03 12:40:31 +01:00
Matthias Loibl
61064cb774 Merge pull request #16819 from jscheffner/prometheus-dashboard-uid
mixin: add uid to prometheus overview dashboard
2025-07-03 11:16:05 +02:00
Julien
011c7fe87d Merge pull request #16820 from prymitive/discoveryRace
discovery: fix a race in ApplyConfig while Prometheus is being stopped
2025-07-03 10:52:59 +02:00
github-actions[bot]
3c25eb2a0d Merge pull request #16815 from prometheus/dependabot/go_modules/github.com/oklog/run-1.2.0
build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
2025-07-03 10:09:10 +02:00
Ahmed Hassan
6d77b47d13 add numHistogramSamples to block stats
Signed-off-by: Ahmed Hassan <afayekhassan@gmail.com>
2025-07-02 19:52:04 -07:00
Arthur Silva Sens
0502f2d8fb Bump otlptranslator to latest SHA
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
2025-07-02 14:55:51 -03:00
Bryan Boreham
74aca682b7 Merge pull request #16807 from bboreham/test-sizeoflabels
[TESTS] Labels: Add a test for SizeOfLabels
2025-07-02 18:44:10 +01:00
Lukasz Mierzwa
b49d143595 Fix a race in discovery manager ApplyConfig & shutdown
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:

```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```

then calling prov.cancel() and finally waiting until all providers have run their done() function,
which is done by blocking on a wg.Wait() call.

For each provider there is a goroutine created by calling Manager.startProvider(*Provider):

```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)

	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()

	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```

It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of updater() method:

```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
[...]
```

we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().

cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:

```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```

The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:

- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever

This only happens if cancelDiscoverers() is run before ApplyConfig. If ApplyConfig runs first, done() will be called;
if cancelDiscoverers() is called first, it will stop updater() instances and so done() won't be called anymore.

Part of the problem is that there is no distinction between running and stopped providers. There is a Provider.IsStarted() method
that returns a bool based on the value of the cancel function, but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider, it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.

The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:03:10 +01:00
Lukasz Mierzwa
357e652044 Add a test for a rare shutdown hang
Doing a config reload that needs to stop some providers while also sending SIGTERM to Prometheus at the same time can sometimes hang:

1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
    sync         sema.go:110              runtime_SemacquireWaitGroup(*uint32(#166))
    sync         waitgroup.go:118         (*WaitGroup).Wait(*WaitGroup(#23))
    discovery    manager.go:276           (*Manager).ApplyConfig(#23, #167)
    main         main.go:964              main.func5(#120)
    main         main.go:1505             reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
    main         main.go:1182             main.func22()
    run          group.go:38              (*Group).Run.func1(*Group(#26), #51)

Add a test for it.

Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
2025-07-02 16:01:42 +01:00
wmTJc9IK0Q
c481aaf762 codemirror-promql: Preserve source files in npm package (#16804)
* Preserve source files in codemirror-promql package

This allows for sourcemaps to work when the package is imported via ESM-native CDNs such as esm.sh

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

* Preserve source files in lezer-promql package

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>

---------

Signed-off-by: wmTJc9IK0Q <171362836+wmTJc9IK0Q@users.noreply.github.com>
2025-07-02 15:31:02 +02:00
jscheffner
1be2deec88 mixin: add uid to prometheus overview dashboard
Signed-off-by: jscheffner <jscheffner@users.noreply.github.com>
2025-07-02 15:02:50 +02:00
Julien
f62d0e0385 Merge pull request #16777 from roidelapluie/add-step-promql
Add step(), min() and max() in promql duration expressions
2025-07-02 14:27:45 +02:00
Julien
432f130a32 PromQL: min/max/step: Address review comments
Signed-off-by: Julien <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:36 +02:00
Julien Pivotto
984c8de0da PromQL: Fix printing +min()
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Julien Pivotto
3af0bdee68 PromQL: min/max/step: add more tests
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Julien Pivotto
ee7d5158a7 Add step(), min(a,b) and max(a,b) in promql duration expressions
step() is a new keyword introduced to represent the query step width in duration expressions.

min(a,b) and max(a,b) return the min and max from two duration expressions.

Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
2025-07-02 11:17:17 +02:00
Bryan Boreham
4eafbcae93 lint
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:56:28 +01:00
Bryan Boreham
e7ac3f440d [TESTS] Labels: Add a test for SizeOfLabels
This requires a bit of repetition to cover all the different builds, but
it seems worth checking that the function does what is expected.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:31:27 +01:00
Bryan Boreham
507227781b [REFACTOR] Labels: Extract test case data from TestLabels_String
So we can use them in other tests.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2025-07-02 09:31:25 +01:00
Julius Volz
bfbae39931 Merge pull request #16716 from charleskorn/charleskorn/binops-docs
docs: clarify and expand binary operations documentation
2025-07-02 10:02:17 +02:00
dependabot[bot]
6bb7e088c5 build(deps): bump github.com/oklog/run from 1.1.0 to 1.2.0
Bumps [github.com/oklog/run](https://github.com/oklog/run) from 1.1.0 to 1.2.0.
- [Release notes](https://github.com/oklog/run/releases)
- [Commits](https://github.com/oklog/run/compare/v1.1.0...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/oklog/run
  dependency-version: 1.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-01 23:42:33 +00:00
Charles Korn
d19a9ab673 Remove other instances of "obvious"
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:13:46 +10:00
Charles Korn
1977452331 Address PR feedback: adjust docs to match current behaviour
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:10:20 +10:00
Charles Korn
665eb3d6cb Address PR feedback: remove use of "obvious"
Signed-off-by: Charles Korn <charles.korn@grafana.com>
2025-07-01 20:08:18 +10:00