I'm curious: are many people here actually still running mainline Prometheus over one of the numerous compatible solutions that are more scalable and have better storage? (Mimir, Victoria, Cortex, OpenObserve, ...)
We’re running standard Prometheus on Kubernetes (14 on-prem Talos clusters: 191 nodes, 1.1k CPU cores, 4.75 TiB memory, and 4k pods in total). We use Thanos to store metrics in self-hosted S3 (SeaweedFS) with 30 days' retention, aggressively downsampling after 3 days.
It works pretty well, tbh. I’m excited about upgrading to version 3, as it does take a lot of resources to keep going, especially on clusters with a lot of pods being spawned all the time.
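For context, this kind of retention and downsampling policy lives on the Thanos compactor. A rough sketch of what such a setup might look like (the flag values are illustrative rather than the poster's exact config, and `bucket.yaml` is a placeholder object-store config pointing at the S3-compatible bucket, e.g. SeaweedFS); raw data is dropped after a few days while the 5m/1h downsampled series are kept for the full retention window:

```shell
# Thanos compactor: long-running mode, handling downsampling and
# per-resolution retention against the object-store bucket.
thanos compact \
  --wait \
  --objstore.config-file=bucket.yaml \
  --retention.resolution-raw=3d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=30d
```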
Hey robinhood, any feedback on Talos?
We've been using Talos for our internal clusters for a while, but with quite small ones (3 kube nodes, 5 worker nodes).
Upgrading has been generally a non-event, and we're quite happy with them.
How do you deploy Thanos? In one of the clusters?
We’re extremely pleased with Talos. Much more secure than Azure (our cloud of choice, unfortunately), which runs a full-blown Ubuntu underneath. We haven’t run into any issues with Talos, and upgrading is super easy with the talosctl tool, for both Kubernetes and Talos versions.
We currently have a Thanos instance in each cluster. We could move it to a separate cluster to reduce some overhead, but the current approach works. We’re ingesting about 60 GiB of metrics per day into the S3 bucket, so we might have to optimise that.
> We use Thanos to store metrics in self-hosted S3 (seaweedfs) with 30 days retention, aggressively downsample after 3 days.
Any reason to not just use Mimir for this?
I can’t recall the reason for using Thanos over Mimir, to be honest. I think Thanos seemed like a good choice given it’s part of the kube-prometheus-stack community Helm charts.
Using Victoria Metrics here. Very easy to set up and run. I monitor under 100 hosts and resource usage is low, performance is good.
One gripe is that they recently stopped publishing tarballs for LTS versions, which caused some grumbling in the community. Fair enough, since they're developing it for free, but it felt like a bait-and-switch.
Same. It's rock solid, low resource usage and good performance. The "single" version mostly just works.
I am curious to hear from people on this forum: at what point do you practically hit the limits of Prometheus, where straightforward division (e.g., separate Prometheus instances per cluster or environment) no longer works?
It usually comes with an increase in active series and churn rate. Of course, you can scale Prometheus horizontally by adding more replicas and by sharding scrape targets. But at some point you'd like to achieve the following:
1. Global query view. The ability to get metrics from all Prometheis with one request, or simply not having to think about which Prometheus has the data you're looking for.
2. Resource usage management. No matter how you try, scrape targets can't be sharded perfectly, so you'll end up with some Prometheis using more resources than others. This can backfire in weird ways down the line, reducing the stability of the whole system.
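For reference, sharding scrape targets across replicas is usually done with `hashmod` relabeling: each target address is hashed into a bucket, and each replica keeps only its own bucket. A minimal sketch of one shard's scrape config (the job name, service discovery, and modulus of 3 are illustrative assumptions, not anyone's actual setup):

```yaml
scrape_configs:
  - job_name: node
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Hash each target's address into one of 3 buckets.
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_shard
        action: hashmod
      # This replica keeps only targets in bucket 0;
      # the other replicas keep "1" and "2" respectively.
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```

The imperfection described above follows directly: the hash balances target counts, not the number of series each target exposes.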
Regular Prometheus inside clusters for collection and alerting, Thanos for cross-cluster aggregation and long retention.
Mimir
There is also Cortex, from which Mimir was forked.
Mimir, Loki, Tempo and Alloy.
Nope. Mimir. Before that, Thanos.
I didn't know this tool, and looking at the homepage, it's not clear to me whether it's something like Google Analytics for website-traffic metrics, or something more dev-oriented.
Can someone explain it to me, please? Maybe I'm a bit lost because I'm not a dev. I'm a designer considering self-hosted Google Analytics alternatives, and this one might be interesting to add to the research (so far I have Matomo, Plausible, OpenPanel, Umami, OpenReplay, Highlight). Thanks
It's usually meant to monitor infrastructure: things like memory and CPU usage of various processes and machines, requests per second for servers, the number of HTTP errors returned, etc.
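Concretely, Prometheus periodically scrapes numeric time series that applications expose over HTTP in a plain-text format. A typical counter looks like this (the metric name and values are the standard documentation example, not from any real system):

```
# HELP http_requests_total The total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
```

So it's about measuring systems and services over time, not tracking website visitors the way Google Analytics does.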
https://prometheus.io/docs/introduction/overview/#what-are-m...
Thank you! Now it's clearer to me.
> I'm a designer considering selfhosted Google Analytics alternatives
I wanted to ensure that you saw this currently on the front page: https://news.ycombinator.com/item?id=42270389 (Show HN: Vince – A self hosted alternative to Google Analytics; 66 comments)
That's good timing. Thank you, adding it to my list.
Also check out https://umami.is
That's good news; the reduced memory usage and OTLP ingestion support look especially nice. I have experimented with OTLP metrics before but eventually fell back to Prometheus to avoid adding another service to our systems.
Migrating to VictoriaMetrics has been on my list for nearly a year, but its licensing always scared me a bit. My main issue is the CPU and memory usage of Prometheus, so maybe this upgrade will fix that.
Apache 2 is scary? Maybe I'm missing something.
I think when I looked it was still half closed-source or something, or maybe I was misled by something I read. To be honest, I didn't take the time to properly investigate it.
Victoria Metrics is well worth the migration. Much better performance and lower resource utilization in my experience.
It's a reminder to us all that when we think, "Hey, why sweat over this memory layout or that extra CPU expenditure? It's small and nobody will notice," there will be times when everybody notices. Maybe enough to switch to a competitor's product.
Developers tend to ignore the constant C in complexity calculations, but customers don’t.
Game developers and HFT shops seem to understand this, and very few regular devs I’ve interacted with do. I’ve seen customers say they switched to someone else for speed reasons. And I’ve worked on projects where the engineers claimed "this is as fast as we can make it," and they were off by at least a factor of three.
We like to think that being off by 10 or 30% doesn’t matter that much, but lots of companies run on thin margins, and since publicly traded companies’ stock prices reflect EBITDA, it matters. Particularly in the cloud era, where it’s much easier to see how sloppy programming leads directly to excess hardware cost (as opposed to already-purchased servers running closer to capacity).
Those margins also mean you have to pick your battles. Most software is not as performance sensitive as video games or HFT.
I take an efficient-market view on this. Obviously devs can make stuff faster, and they do where it matters, as can be seen in games and HFT. In other software, it’s a trade-off discussion with product.
In most of the scenarios I’ve mentioned, everyone but the devs disagreed about it being good enough. Too many people think the flame graph is the pinnacle of performance analysis, when it’s just table stakes.
VM was a game changer for us: a 7x reduction in memory and 3x in CPU, plus the scaling flexibility.
Hmmm, but the documentation seems poorly written. Who is the team behind it?
Specifically the documentation, or VictoriaMetrics overall? The latter was started by a small number of Ukrainians.
What makes you think that about the docs? It was, of course, written by developers, not tech writers. But still, what do you think could be improved?
> Native Histograms are still experimental and not yet enabled by default, and can be turned on by passing --enable-feature=native-histograms. Some aspects of Native Histograms, like the text format and accessor functions / operators are still under active design.
Ah, slightly disappointed :). Looking at the major version bump, I thought it was going to be all about Native Histograms.
A solid upgrade :)))
I've read the entire page and still don't know what it is. Release notes are a communication tool, and this was a failure as such. You are losing random passersby by not saying what your product is in the first sentence of the release notes, especially for an x.0.
Before reading the entire page, did you consider clicking the header? That will bring you to the main landing page of the product/project, which more often than not contains a helpful summary of what it is and why it exists. You can also apply this pattern to other unknown things you come across.
I think most people that read release notes and changelogs want the text to be concise and easy to interpret when they're doing due diligence to decide when to start rolling out upgrades. They know what the software is about and don't care for some sales pitch.
Sir, for thought leaders and CTOs they have a home page.
This is for DevOps minions.
If you can't tell what it is but still read the whole thing... I think the problem is on you?