Presumably, this will help with Loki alerts, but the added functionality is also generally useful.
For one, this enables `parseDuration` to also accept negative durations (as that's something that is also used in PromQL by now).
This also adds a function `now` to return the evaluation time of the template (as seconds since epoch AKA Unix time) and a function `toDuration` (akin to `toTime`), which creates a Go `time.Duration` from a duration in seconds.
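
As a rough illustration of the two new functions, here is a minimal sketch using Go's standard `text/template`. It is not the actual template code of this change: the function bodies, the fixed `evalTime`, and the template text are assumptions made only for the example.

```go
package main

import (
	"os"
	"text/template"
	"time"
)

// toDuration converts a duration given in seconds into a Go time.Duration.
// Hypothetical stand-in for the template function of the same name.
func toDuration(seconds float64) time.Duration {
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// Hypothetical evaluation time; a real template would get it from the evaluator.
	evalTime := time.Unix(1700000000, 0)

	funcs := template.FuncMap{
		// now returns the evaluation time as seconds since the Unix epoch.
		"now":        func() float64 { return float64(evalTime.UnixNano()) / float64(time.Second) },
		"toDuration": toDuration,
	}

	tmpl := template.Must(template.New("example").Funcs(funcs).Parse(
		"evaluated at {{ now }}, offset {{ toDuration -90.0 }}\n"))
	if err := tmpl.Execute(os.Stdout, nil); err != nil {
		panic(err)
	}
}
```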
---------
Signed-off-by: Dmitry Ponomaryov <me@halje.ru>
Signed-off-by: Dmitry Ponomaryov <iamhalje@gmail.com>
Add links to the sources of truth.
The list is hard to keep up to date; for example, the "go" entry
is "wrong" (not really, as an old 1.22 binary could still
download/use newer toolchains...).
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
This shows how float counters cannot go below zero when extrapolating
for rate/increase, and how histograms do not have that protection yet,
leading to an overestimation of the rate/increase.
This also demonstrates edge cases where the count extrapolation does
not need to be limited, but an individual bucket still goes below
zero.
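
The zero limit for float counters can be sketched roughly as follows. This is a simplified illustration, not the engine's code: the function name and parameters are hypothetical, and other details of the real extrapolation (such as how far past the samples it is willing to extrapolate) are left out. Passing `limitToZero=false` only approximates the unprotected case described above for histograms.

```go
package main

import "fmt"

// extrapolatedIncrease is a simplified, hypothetical sketch of range extrapolation.
// first and last are the first and last sample values in the range,
// sampledInterval is the time between them, and durationToStart/durationToEnd
// are the gaps between the range boundaries and those samples (all in seconds).
func extrapolatedIncrease(first, last, sampledInterval, durationToStart, durationToEnd float64, limitToZero bool) float64 {
	delta := last - first
	if limitToZero && delta > 0 && first >= 0 {
		// Time it would take the counter to reach zero when following
		// the observed slope backwards from the first sample.
		durationToZero := sampledInterval * (first / delta)
		if durationToZero < durationToStart {
			// Do not extrapolate further back than the point where the
			// counter would have been zero.
			durationToStart = durationToZero
		}
	}
	factor := (sampledInterval + durationToStart + durationToEnd) / sampledInterval
	return delta * factor
}

func main() {
	// Counter goes from 2 to 10 over 40s; the range starts 20s before the first sample.
	fmt.Println("with zero limit:   ", extrapolatedIncrease(2, 10, 40, 20, 0, true))
	fmt.Println("without zero limit:", extrapolatedIncrease(2, 10, 40, 20, 0, false))
}
```

With the limit, the result cannot exceed what the counter could have gained since it was at zero; without it, the extrapolation implies a negative starting value and overestimates the increase.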
Signed-off-by: beorn7 <beorn@grafana.com>
This commit adds Projection metadata to SelectHints so that downstream
storage implementations can use it to save effort when answering
Select calls.
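
One way a storage implementation might use such a hint is sketched below. Everything here is an assumption for illustration: the `projectionHints` type, its field names, and the include/exclude semantics are hypothetical and not taken from this commit.

```go
package main

import "fmt"

// projectionHints is a hypothetical stand-in for projection metadata.
// Field names and semantics are assumptions made for this sketch.
type projectionHints struct {
	Labels  []string // label names the hint refers to
	Include bool     // true: only these labels are needed; false: all labels except these
}

// project returns the subset of lbls that is still needed according to h.
func project(lbls map[string]string, h projectionHints) map[string]string {
	listed := make(map[string]bool, len(h.Labels))
	for _, name := range h.Labels {
		listed[name] = true
	}
	out := make(map[string]string)
	for name, value := range lbls {
		// Keep listed labels when including, unlisted labels when excluding.
		if listed[name] == h.Include {
			out[name] = value
		}
	}
	return out
}

func main() {
	series := map[string]string{"job": "api", "pod": "api-0", "zone": "eu"}
	fmt.Println(project(series, projectionHints{Labels: []string{"job"}, Include: true}))
	fmt.Println(project(series, projectionHints{Labels: []string{"pod"}, Include: false}))
}
```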
Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
* OTLP receiver: Generate target_info samples between the earliest and latest samples per resource
Modify the OTLP receiver to generate target_info samples between the earliest
and latest samples per resource instead of only one for the latest timestamp.
The samples are spaced lookback delta/2 apart.
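
The spacing can be sketched like this. It is a minimal illustration under assumptions: the function name is hypothetical, timestamps are plain milliseconds, and the sketch assumes a final sample is always emitted at the latest timestamp, which the message above does not spell out.

```go
package main

import "fmt"

// targetInfoTimestamps returns the timestamps (in ms) at which target_info
// samples would be generated for one resource: starting at the earliest
// sample, spaced lookbackDelta/2 apart, and ending at the latest sample.
// Hypothetical helper, not the receiver's actual code.
func targetInfoTimestamps(earliestMs, latestMs, lookbackDeltaMs int64) []int64 {
	interval := lookbackDeltaMs / 2
	if interval <= 0 || latestMs < earliestMs {
		// Degenerate input: fall back to a single sample at the latest timestamp.
		return []int64{latestMs}
	}
	var out []int64
	for ts := earliestMs; ts < latestMs; ts += interval {
		out = append(out, ts)
	}
	// Assumption: the latest timestamp always gets a sample.
	return append(out, latestMs)
}

func main() {
	// 5m lookback delta, samples between t=0 and t=400s.
	fmt.Println(targetInfoTimestamps(0, 400_000, 300_000))
}
```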
---------
Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Add `ByteSize()` method to different labels implementations.
One use case is tracking the memory used by Labels.
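
For a slice-based representation, such a method might look roughly like the sketch below. The `label` and `labelSet` types and the `uint64` return type are assumptions for illustration; the real implementations differ per build and are not reproduced here.

```go
package main

import (
	"fmt"
	"unsafe"
)

// label and labelSet are hypothetical stand-ins for a slice-based
// labels implementation.
type label struct{ Name, Value string }

type labelSet []label

// ByteSize estimates the memory used by the label set: the slice header,
// one label struct per element, and the bytes of every name and value.
func (ls labelSet) ByteSize() uint64 {
	size := uint64(unsafe.Sizeof(ls))
	for _, l := range ls {
		size += uint64(unsafe.Sizeof(l)) + uint64(len(l.Name)+len(l.Value))
	}
	return size
}

func main() {
	ls := labelSet{{"__name__", "http_requests_total"}, {"job", "api"}}
	fmt.Println("bytes:", ls.ByteSize())
}
```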
Signed-off-by: Jon Kartago Lamida <me@lamida.net>
If we call ApplyConfig() at the same time the manager is being stopped we might end up hanging forever.
This is because ApplyConfig() will try to cancel obsolete providers and wait until they are cancelled.
It's done by setting a done() function that calls Done() on a sync.WaitGroup:
```
if len(prov.newSubs) == 0 {
	wg.Add(1)
	prov.done = func() {
		wg.Done()
	}
}
```
ApplyConfig() then calls prov.cancel() and finally blocks on a wg.Wait() call
until every such provider has run its done() function.
For each provider there is a goroutine created by calling Manager.startProvider(*Provider):
```
func (m *Manager) startProvider(ctx context.Context, p *Provider) {
	m.logger.Debug("Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)
	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()
	go p.d.Run(ctx, updates)
	go m.updater(ctx, p, updates)
}
```
It creates a context that can be cancelled and that cancel function becomes prov.cancel. This is what ApplyConfig will call.
If we look at the body of the updater() method:
```
func (m *Manager) updater(ctx context.Context, p *Provider, updates chan []*targetgroup.Group) {
	// Ensure targets from this provider are cleaned up.
	defer m.cleaner(p)
	for {
		select {
		case <-ctx.Done():
			return
		[...]
```
we can see that it will exit if that context is cancelled and that will trigger a call to Manager.cleaner().
That cleaner() is where done() is called.
So ApplyConfig() -> calls cancel() -> causes cleaner() to be executed -> calls done().
cancel() is also called from cancelDiscoverers() method that will be called by Manager.Run() when Manager is stopping:
```
func (m *Manager) Run() error {
	go m.sender()
	<-m.ctx.Done()
	m.cancelDiscoverers()
	return m.ctx.Err()
}
```
The problem is that if we call both ApplyConfig and stop the manager at the same time we might end up with:
- We call Manager.ApplyConfig()
- We stop the Manager
- Manager.cancelDiscoverers() is called
- Provider.cancel() is called for every Provider
- cancel() causes provider context to be cancelled which terminates updater() for given Provider
- cancelling context causes cleaner() method to be called for given Provider
- cleaner() calls done() and exits
- Provider is considered stopped at this point, there is no goroutine running that will call done() anymore
- ApplyConfig iterates providers and decides that one is obsolete and must be stopped
- It sets a custom done() function body with a WaitGroup.Done() call in it
- Then ApplyConfig waits until all Providers run done()
- But they are all stopped and no done() will be run
- We wait forever
This only happens if cancelDiscoverers() runs before ApplyConfig. If ApplyConfig runs first, done() will be called.
If cancelDiscoverers() runs first, it stops the updater() instances, so done() is never called.
Part of the problem is that there is no distinction between running and stopped providers. There is Provider.IsStarted() method
that returns a bool based on the value of cancel function but ApplyConfig doesn't check it.
The second problem is that although there is a mutex on a Provider, it's not used much in the code, so two goroutines can try to read and/or write
provider.cancel and/or provider.done at the same time, making it all more likely to race.
The easiest way to fix it is to check if the provider is started inside ApplyConfig so we don't try to stop a provider that's already stopped.
For that we need to mark it as stopped after cancel() is called, by setting cancel to nil.
This also needs better lock usage to avoid different parts of the code trying to set cancel and done at the same time.
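
The idea behind the fix can be sketched as follows. This is a cut-down illustration and not the actual patch: the `provider` type and the `stop`, `cleanup`, and `stopAndWait` helpers are hypothetical names that only mirror the roles of cancel(), done(), cleaner(), and ApplyConfig() described above.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// provider is a hypothetical, cut-down stand-in for the discovery Provider.
type provider struct {
	mu     sync.Mutex
	cancel context.CancelFunc // nil once the provider has been stopped
	done   func()             // called by cleanup() after the updater exits
}

// stop cancels a running provider and marks it as stopped by clearing cancel.
// It returns false if the provider was already stopped, in which case done
// is not registered because nothing would ever call it.
func (p *provider) stop(done func()) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cancel == nil {
		return false
	}
	p.done = done
	p.cancel()
	p.cancel = nil
	return true
}

// cleanup plays the role of the cleaner: it runs done() at most once.
func (p *provider) cleanup() {
	p.mu.Lock()
	done := p.done
	p.done = nil
	p.mu.Unlock()
	if done != nil {
		done()
	}
}

// stopAndWait sketches the ApplyConfig side: wait only for providers that
// were actually running when we tried to stop them.
func stopAndWait(p *provider) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	stopped := p.stop(wg.Done)
	if !stopped {
		wg.Done() // already stopped, so do not wait for a done() that never comes
	}
	wg.Wait()
	return stopped
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	p := &provider{cancel: cancel}
	go func() { <-ctx.Done(); p.cleanup() }() // stands in for updater()

	p.stop(nil) // manager shutdown wins the race
	fmt.Println("waited for provider:", stopAndWait(p))
}
```

Because the check and the registration of done() happen under the same lock, the shutdown path and the reload path cannot interleave in a way that leaves a WaitGroup waiting on a stopped provider.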
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
Doing a config reload that needs to stop some providers while also sending SIGTERM to Prometheus at the same time can sometimes hang:
1: sync.WaitGroup.Wait [83 minutes] [Created by run.(*Group).Run in goroutine 1 @ group.go:37]
sync sema.go:110 runtime_SemacquireWaitGroup(*uint32(#166))
sync waitgroup.go:118 (*WaitGroup).Wait(*WaitGroup(#23))
discovery manager.go:276 (*Manager).ApplyConfig(#23, #167)
main main.go:964 main.func5(#120)
main main.go:1505 reloadConfig({#183, 0x1b}, 1, #40, #43, #50, {#31, 0xa, 0})
main main.go:1182 main.func22()
run group.go:38 (*Group).Run.func1(*Group(#26), #51)
Add a test for it.
Signed-off-by: Lukasz Mierzwa <l.mierzwa@gmail.com>
step() is a new keyword introduced to represent the query step width in duration expressions.
min(a,b) and max(a,b) return the min and max from two duration expressions.
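
In Go terms, the semantics amount to something like the sketch below. It is only an analogy for how the three constructs combine: `minDuration` and `maxDuration` are hypothetical helpers, and the `step` variable stands in for what step() would return for a given query.

```go
package main

import (
	"fmt"
	"time"
)

// minDuration returns the smaller of two durations.
func minDuration(a, b time.Duration) time.Duration {
	if a < b {
		return a
	}
	return b
}

// maxDuration returns the larger of two durations.
func maxDuration(a, b time.Duration) time.Duration {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Hypothetical query step width, standing in for step().
	step := 15 * time.Second

	// A range that follows the step but never drops below one minute.
	fmt.Println(maxDuration(step, time.Minute))
	// A range that follows the step but never exceeds one minute.
	fmt.Println(minDuration(step, time.Minute))
}
```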
Signed-off-by: Julien Pivotto <291750+roidelapluie@users.noreply.github.com>
This requires a bit of repetition to cover all the different builds, but
it seems worth checking that the function does what is expected.
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>