I'm curious: are many people here actually still running mainline Prometheus over one of the numerous compatible solutions that are more scalable and have better storage? (Mimir, Victoria, Cortex, OpenObserve, ...)
We’re running standard Prometheus on Kubernetes (14 on-prem Talos clusters: 191 nodes, 1.1k CPU cores, 4.75 TiB memory, and 4k pods in total). We use Thanos to store metrics in self-hosted S3 (SeaweedFS) with 30 days' retention, aggressively downsampling after 3 days.
It works pretty well, tbh. I’m excited about upgrading to version 3, as it does take a lot of resources to keep going, especially on clusters with a lot of pods being spawned all the time.
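For context, this kind of retention and downsampling policy lives on the Thanos compactor. A rough sketch of what such a setup might look like (the flag values are illustrative rather than the poster's exact config, and `bucket.yaml` is a placeholder object-store config pointing at the S3-compatible bucket, e.g. SeaweedFS); raw data is dropped after a few days while the 5m/1h downsampled series are kept for the full retention window:

```shell
# Thanos compactor: long-running mode, handling downsampling and
# per-resolution retention against the object-store bucket.
thanos compact \
  --wait \
  --objstore.config-file=bucket.yaml \
  --retention.resolution-raw=3d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=30d
```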
Hey robinhood, any feedback on Talos?
We've been using Talos for our internal clusters for a while, but with quite small ones (3 kube nodes, 5 worker nodes).
Upgrading has been generally a non-event, and we're quite happy with them.
How do you deploy Thanos? In one of the clusters?
We’re extremely pleased with Talos. Much more secure than Azure (our cloud of choice, unfortunately), which runs a full-blown Ubuntu underneath. We haven’t run into any issues with Talos, and upgrading is super easy with the talosctl tool, for both Kubernetes and Talos versions.
We currently have a Thanos instance in each cluster. We could move it to a separate cluster to reduce some overhead, but the current approach works. We’re ingesting about 60 GiB of metrics per day into the S3 bucket, so we might have to optimise that.
> We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.
Any reason to not just use Mimir for this?
I can’t recall the reason for using Thanos over Mimir, to be honest. I think Thanos seemed like a good choice given it’s part of the kube-prometheus-stack community Helm charts.
Using Victoria Metrics here. Very easy to set up and run. I monitor under 100 hosts and resource usage is low, performance is good.
One gripe is that they recently stopped publishing tarballs for LTS versions, which caused some grumbling in the community. Fair enough, since they're developing it for free, but it felt like a bait-and-switch.
Same. It's rock solid, low resource usage and good performance. The "single" version mostly just works.
I am curious to hear from people on this forum: at what point do you practically hit the limits of Prometheus, where straightforward division (e.g., separate Prometheus instances per cluster or environment) no longer works?
It usually comes with an increase in active series and churn rate. Of course, you can scale Prometheus horizontally by adding more replicas and by sharding scrape targets. But at some point you'd like to achieve the following:
1. Global query view. The ability to get metrics from all Prometheis with one request, or simply not having to think about which Prometheus has the data you're looking for.
2. Resource usage management. No matter how you try, scrape targets can't be sharded perfectly, so you'll end up with some Prometheis using more resources than others. This can backfire in weird ways down the line, reducing the stability of the whole system.
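For reference, sharding scrape targets across replicas is usually done with `hashmod` relabeling: each target address is hashed into a bucket, and each replica keeps only its own bucket. A minimal sketch of one shard's scrape config (the job name, service discovery, and modulus of 3 are illustrative assumptions, not anyone's actual setup):

```yaml
scrape_configs:
  - job_name: node
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Hash each target's address into one of 3 buckets.
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_shard
        action: hashmod
      # This replica keeps only targets in bucket 0;
      # the other replicas keep "1" and "2" respectively.
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```

The imperfection described above follows directly: the hash balances target counts, not the number of series each target exposes.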
Regular Prometheus inside clusters for collection and alerting, Thanos for cross-cluster aggregation and long retention.
Mimir
There is also Cortex, from which Mimir was forked.
Mimir, Loki, Tempo and Alloy.
Nope. Mimir. Before that, Thanos.
I didn't know this tool, and looking at the homepage, it's not clear to me whether it's something like Google Analytics for website-traffic metrics, or something more dev-oriented.
Can someone explain it to me, please? Maybe I'm a bit lost because I'm not a dev. I'm a designer considering self-hosted Google Analytics alternatives, and this one might be interesting to add to the research (so far I have Matomo, Plausible, OpenPanel, Umami, OpenReplay, Highlight). Thanks
It's usually meant to monitor infrastructure: things like memory and CPU usage of various processes and machines, requests per second for servers, the number of HTTP errors returned, etc.
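Concretely, Prometheus periodically scrapes numeric time series that applications expose over HTTP in a plain-text format. A typical counter looks like this (the metric name and values are the standard documentation example, not from any real system):

```
# HELP http_requests_total The total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
```

So it's about measuring systems and services over time, not tracking website visitors the way Google Analytics does.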
https://prometheus.io/docs/introduction/overview/#what-are-m...
Thank you! Now it's clearer to me.
> I'm a designer considering selfhosted Google Analytics alternatives
I wanted to ensure that you saw this currently on the front page: https://news.ycombinator.com/item?id=42270389 (Show HN: Vince – A self hosted alternative to Google Analytics; 66 comments)
That's good timing. Thank you, adding it to my list.
Also check out https://umami.is
That's good news; the reduced memory usage and OTLP ingestion support look especially nice. I have experimented with OTLP metrics before but eventually fell back to Prometheus to avoid adding another service to our systems.
Migrating to VictoriaMetrics has been on my list for nearly a year, but its licensing always scared me a bit. My main issue is the CPU and memory usage of Prometheus, so maybe this upgrade will fix that.
Apache 2 is scary? Maybe I'm missing something.
I think when I looked it was still half closed-source or something, or maybe I was misled by something I read. To be honest, I didn't take the time to properly investigate it.
Victoria Metrics is well worth the migration. Much better performance and lower resource utilization in my experience.
It's a reminder to us all that when we think, "Hey, why sweat over this memory layout or that extra CPU expenditure? It's small and nobody will notice," there will be times when everybody notices. Maybe enough to switch to a competitor's product.
Developers tend to ignore the constant C in complexity calculations, but customers don’t.
Game developers and HFT shops seem to understand this, and very few regular devs I’ve interacted with do. I’ve seen customers say they switched to someone else for speed reasons. And I’ve worked on projects where the engineers claimed "this is as fast as we can make it," and they were off by at least a factor of three.
We like to think that being off by 10 or 30% doesn’t matter that much, but lots of companies run on thin margins, and since publicly traded companies’ stock prices reflect EBITDA, it matters. Particularly in the cloud era, where it’s much easier to see how sloppy programming leads directly to excess hardware cost (as opposed to already-purchased servers running closer to capacity).
Those margins also mean you have to pick your battles. Most software is not as performance sensitive as video games or HFT.
I take an efficient-market view on this. Obviously devs can make stuff faster, and they do where it matters, as can be seen in games and HFT. In other software, it’s a trade-off discussion with product.
In most of the scenarios I’ve mentioned, everyone but the devs disagreed about it being good enough. Too many people think the flame graph is the pinnacle of performance analysis, when it’s just table stakes.
VM was a game changer for us: a 7x reduction in memory and 3x in CPU, plus the scaling flexibility.
Hmmm, but the documentation seems poorly written. Who is the team behind it?
Specifically the documentation, or VictoriaMetrics overall? The latter was started by a small number of Ukrainians.
What makes you think that about the docs? It was, of course, written by developers, not tech writers. But still, what do you think could be improved?
> Native Histograms are still experimental and not yet enabled by default, and can be turned on by passing --enable-feature=native-histograms. Some aspects of Native Histograms, like the text format and accessor functions / operators are still under active design.
Ah, slightly disappointed :). Looking at the major version bump, I thought it was going to be all about Native Histograms.
A solid upgrade :)))
I've read the entire page and still don't know what it is. Release notes are a communication tool, and this was a failure as such. You are losing random passersby by not saying what your product is in the first sentence of the release notes, especially for an x.0.
Before reading the entire page, did you consider clicking the header? That will bring you to the main landing page of the product/project, which more often than not contains a helpful summary of what it is and why it exists. You can also apply this pattern to other unknown things you come across.
I think most people that read release notes and changelogs want the text to be concise and easy to interpret when they're doing due diligence to decide when to start rolling out upgrades. They know what the software is about and don't care for some sales pitch.
Sir, for thought leaders and CTOs they have a home page.
This is for DevOps minions.
If you can't tell what it is but still read the whole thing... I think the problem is on you?