I also dropped InfluxDB at work due to its terrible performance. VictoriaMetrics is great
I was using Promscale (TimescaleDB), but they EOL'd it, which forced us to VictoriaMetrics. Either way, both are much faster than Influx
Don't get fooled by the latest InfluxDB rewrite. I think the latest version is cloud-hosted only, too? So stupid
InfluxData is killing InfluxDB with their changes. Their v1->v2->v3 changes are beyond insane. They revealed that Flux, the main selling point of v2, is deprecated in v3. In the database domain you want stability, not breaking changes every 2-3 years. See https://news.ycombinator.com/item?id=37206194
I saw the writing on the wall with InfluxDB v2 (doubling down on closed platform / SaaS) and advocated exploring VictoriaMetrics, even though we had some Influx v1 running. No regrets.
I also prefer the golang-esque simplicity of the Prometheus ecosystem. Monitoring is the last place I want unnecessary abstraction layers and complicated configuration files.
Honestly, the database isn't half as useful as the tool they wrote to grab the metrics. At least I think Telegraf was written by the same people? It seems to have the exact opposite design philosophy.
Yup, Telegraf is what keeps me with InfluxDB, even though I couldn't care less about the database, or the pain in the ass it is to operate.
VictoriaMetrics does claim to be compatible with the Influx line protocol, which makes me wonder if it is possible to keep using Telegraf with it.
You can.
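The switch is basically pointing Telegraf's stock InfluxDB output at VictoriaMetrics, since VM accepts the Influx line protocol on its write endpoint. A minimal sketch, assuming a default single-node VM install (the hostname is made up):

    # telegraf.conf: ship Influx line protocol to VictoriaMetrics
    [[outputs.influxdb]]
      # single-node VictoriaMetrics listens on port 8428 by default
      urls = ["http://victoriametrics:8428"]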
I also use Telegraf with TimescaleDB.
Telegraf actually made interop with their competitors super easy
One of my fondest memories as a summer student at CERN in 1993 (in the Electronics and Computing for Physics department) was the visit to the basement beneath the main computing facility, where a colossal tape robot was in operation. Even at that time, CERN was grappling with exceedingly vast amounts of data.
I was a summer student this summer, working right next to that exact building. Unfortunately we did not get to see the tape robot, though we did get visits to the experiments.
I can say from what I've seen that the amount of data they have to deal with is still the #1 problem. Before the multi-layered filters, they generate a petabyte of data per second.
I saw the same at FermiLab as a high school kid in 1985. Those things were neat.
I really like VictoriaMetrics's architecture
vmagent takes care of all the pesky edge things, like emulating Prometheus config parsing and the various scraping bits. It also buffers data in case you lose network connectivity for a while, and accepts a wide spread of different protocols.
vminsert/vmselect scale separately from each other, and your queries don't disturb your ingest all that much.
vmstorage does just that: storage. The only thing that bothers me (compared to, say, Elasticsearch) is that data can't migrate between nodes, so you can't "just" start a new one and drain an old one. But a tiny bit of ops work in rare cases is IMO a price worth paying for the straightforwardness of the stack.
PromQL compatibility is also great, tools like Grafana "just work" without anyone having to write support for it.
We started migrating from InfluxDB at work, and on my private stuff I already have. So much less memory usage, too.
What version of Influx were you running? I'm interested if v3 will be more competitive than v2.
1.8; the migration path to 2.0 was a no-no. I don't remember the exact reasons back then, but we decided to take a wait-and-see approach and watch how the alternatives matured, since our data generally grows at a predictable rate.
Also, frankly, Prometheus support is a massive positive. For better or worse, the industry standardized on Prometheus as the metrics ingest for apps, and most of the related materials will of course give examples in PromQL.
Flux is frankly hieroglyphs to people who use it 20 minutes a month, like our developers.
This is an example of raising a value to the power of two in Flux:

    |> map(fn: (r) => ({ r with _value: r._value * r._value }))

This is the same in PromQL:

    value ^ 2

This is an example of calculating a percentage in Flux (from their webpage):

    data
        |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
        |> map(
            fn: (r) => ({
                _time: r._time,
                _field: "used_percent",
                _value: float(v: r.used) / float(v: r.total) * 100.0,
            }),
        )

And this is how you do it in PromQL:

    space_used / space_total * 100
Flux is atrocious for "normal users".
Flux looks like it has the opposite design philosophy of PromQL: it makes simple things harder and hard things simpler.
I probably wouldn't default to it, but I've had to solve some issues that were complicated in PromQL that would probably be easier here. E.g. I had to work on a Prometheus monitoring/alerting setup where metrics were all recorded in UTC, but alerts should only fire during local business hours. We ended up with something like a thousand-character PromQL query that was utterly illegible (have pity on whatever poor soul has to update it when DST goes away).
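The UTC part alone is easy; something like this gates an alert to fixed UTC hours (metric name and numbers are made up):

    # fire only between 09:00 and 17:00 UTC
    my_error_rate > 0.05
      and on() (hour() >= 9 and hour() < 17)

It's translating that to local wall-clock time across DST transitions that turns it into the thousand-character monster.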
Flux looks like it would have a more legible solution to that, though at the cost of making simple analysis way more complicated than it probably needs to be.
The original comment mentioned Influx v3, which doesn't support Flux at all but uses standard SQL instead. Influx v2 still supported InfluxQL from v1, so Flux wasn't essential there either. Flux doesn't seem worth focusing on here.
I've personally found InfluxDB more user-friendly than Prometheus for metrics work, since I've used both at scale and compared my own experiences. The industry leans towards Prometheus nowadays, so I'm used to dealing with Prometheus, but I've found PromQL particularly unpleasant for any kind of complex aggregations. You may disagree, and that's okay. Regular SQL seems far nicer to me in comparison, so it's nice to see Influx v3 focusing on that. I wish Prometheus and VM would develop a standard SQL interface.
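To make the comparison concrete, the disk-usage percentage from upthread would come out roughly like this in Influx v3's SQL dialect (table and column names assumed from the usual Telegraf disk measurement):

    SELECT time, used * 100.0 / total AS used_percent
    FROM disk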
VictoriaMetrics doesn't fully support PromQL either; its MetricsQL is only about 73% compatible according to this article[0] that the VM docs link to. I certainly hope that VM took this incompatibility and used it as an opportunity to make MetricsQL more enjoyable to use than PromQL.
[0]: https://medium.com/@romanhavronenko/victoriametrics-promql-c...
AFAIK it's not a serious issue for the many users and companies who successfully migrated from Prometheus/Thanos/InfluxDB/OpenTSDB/Graphite, or even Cortex, to the VictoriaMetrics stack. Think about it ;)
Flux is noisy, yes. But PromQL isn't exactly the pinnacle of query languages either. Simple stuff is simple, but once you start writing more complex queries with sum by/without, it makes absolutely no sense where the parentheses go, and you can place some of the operators on either side of the inner expression.
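For example, both of these are valid and equivalent PromQL; the grouping modifier can sit on either side of the aggregated expression (metric name is just an illustration):

    sum by (instance) (rate(http_requests_total[5m]))
    sum(rate(http_requests_total[5m])) by (instance)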
Can you write alerting rules in Prometheus syntax too? And use them with the Prometheus Alertmanager?
Yes! Alerting and recording rules are supported by vmalert [0]. vmalert integrates with Alertmanager for sending alerts, and Alertmanager then dispatches the notifications. Besides this, vmalert has retroactive rule evaluation [1] (both alerting and recording) and detection of never-firing alerting rules [2].
[0] https://docs.victoriametrics.com/vmalert.html [1] https://docs.victoriametrics.com/vmalert.html#rules-backfill... [2] https://docs.victoriametrics.com/vmalert.html#never-firing-a...
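For reference, vmalert consumes the standard Prometheus rule-file format, so a rule like this should work unchanged (metric and threshold invented for illustration):

    groups:
      - name: example
        rules:
          - alert: HighErrorRate
            expr: rate(http_errors_total[5m]) > 0.1
            for: 10m
            labels:
              severity: warning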
Missing from the title: leaving InfluxDB and Prometheus for VictoriaMetrics.
This is puzzling. I'm not sure how VictoriaMetrics solved the cardinality problem. When running an aggregate query that sums up counters for a single metric over the instance dimension, in a time window larger than a few hours, VictoriaMetrics would barf with an error about the query having too many time series (or data points? I forget the exact wording). This clearly shows that 1/ VictoriaMetrics does not treat a time series with multiple dimensions as a single time series; 2/ VictoriaMetrics does not perform hierarchical aggregation.
That is, VictoriaMetrics has not really built a true time series DB that handles reasonable cardinalities.
There are a number of circuit breakers in VictoriaMetrics that cap the number of time series/data points a query can touch, to limit CPU/RAM usage. These can be tweaked with the -search.max* command line flags. I think the default query limits are 30M data points per series and 300K series.
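For example, on a single-node instance you can raise them like this (flag names are from that -search.max* family; the values are just illustrative):

    ./victoria-metrics-prod \
      -search.maxUniqueTimeseries=1000000 \
      -search.maxSamplesPerSeries=100000000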
What do you consider reasonable cardinalities or a true TSDB?
In my particular example, the cardinality of the instances (i.e., the number of unique instances in the query's time range) should not even matter. I was summing a counter over all the instances, so VictoriaMetrics should just add up the counters for each time unit while scanning all the data points -- this is implemented by pretty much all the OLAP engines. Put another way, logically I was querying a single time series; it's just that each point in that series was associated with multiple values. Given such a logical model, a good time series database should not even bother me with any concern about cardinality.
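Concretely, the query was shaped something like this (names hypothetical), which logically collapses all the per-instance series into one:

    sum(rate(requests_total[5m]))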
https://github.com/VictoriaMetrics/VictoriaMetrics#cardinali...
If I understand this correctly, it deals with high cardinality by dropping data. The operators need to monitor for this and adjust their data to lower the cardinality.
From your link:
> By default VictoriaMetrics doesn't limit the number of stored time series.
They have put out some benchmarks showing VictoriaMetrics ingesting 40M time series: https://valyala.medium.com/high-cardinality-tsdb-benchmarks-...
This is the only long-term scalable solution. High cardinality in TSDBs has to be dealt with by dropping; otherwise you run out of storage, write rate, memory, or network.
It's possible to smooth this loss out (by assuming a normal distribution of the lost data) if it's noticed and limited, though I don't think any commercial TSDB does that automatically.
That’s one hell of an endorsement. Marketing team won the jackpot.
tl;dr:
Speaking to The Register, Roman Khavronenko, co-founder of VictoriaMetrics, said the previous system had experienced problems with high cardinality, which refers to the level of repeated values – and high churn data – where applications can be redeployed multiple times over new instances.
Implementing VictoriaMetrics as backend storage for Prometheus, the CMS monitoring team progressed to using the solution as front-end storage to replace InfluxDB and Prometheus, helping remove cardinality issues, the company said in a statement.
We've been using and building with VictoriaMetrics for a while at Batteries Included. I have probably created and torn down 100+ clusters by now. It's a remarkably easy-to-use piece of software for something with its capabilities.
At the end of the article it says
"InfluxDB said in March this year it had solved the cardinality issue with a new IOx storage engine."
Does this mean that in the end it wasn't really necessary to switch to VictoriaMetrics' offering?
That should read:
InfluxDB said in March this year it had solved the cardinality issue with a new IOx storage engine, for its hosting customers only.
There isn't an InfluxDB you can download with the cardinality issue solved.
As of today, they announced a plan for open sourcing a version of v3, but they make it clear in the article that it won't be a great fit for every existing use case.
https://www.influxdata.com/blog/the-plan-for-influxdb-3-0-op...
isn't this like their fourth storage engine? doesn't that get old after a while?
One half-baked thing after another!
Over a 24-hour period that's more than 11 gigabytes per second, or roughly 100 Gbps. Those shards must be pretty crazy.
The headline is about the data processed on their compute; the amount of data in the monitoring system is considerably smaller (but still not small data):
> But Brij Kishor Jashal, a scientist in the CMS collaboration, told The Register that his team were currently aggregating 30 terabytes over a 30-day period to monitor their computing infrastructure performance.
So 1 TB/day, which is about 12 MB/s.
Weird that the title talks about the petabyte a day, while the article is actually about their monitoring tooling, not the thing ingesting the data from experiments, iiuc.
How come they don't support wire protocols for analytical workloads, like Arrow streaming or ClickHouse's? Looks like they don't want to compete with ClickHouse.
I'm amazed at how, years after I first heard about it, the tiny VictoriaMetrics team keeps outdoing what far bigger organizations manage. The biggest objection I've heard to adopting it is that if the lead maintainer gets hit by a bus, the project is liable to fall apart.
I looked at some of the alternatives to VictoriaMetrics for Prometheus and they all seem… much, much worse…
I'm surprised nobody has mentioned Grafana Mimir yet. You get all the niceties of the Prometheus ecosystem with a backend that can scale into the billions of metric streams.
Grafana Mimir is a great product, but it also has some rough edges - https://victoriametrics.com/blog/mimir-benchmark/
This is nothing compared to what dragnet surveillance has to deal with.
And that's all on MSSQL or RDS, right?
Amazing how much data you can generate with a small cluster piping out /dev/urandom continuously over every possible socket.
I don't have a very positive experience with InfluxDB, but the strange thing for me is that ClickHouse was not even mentioned in the article.
Just use sqlite amirite
Nah it's all about Microsoft Access '97 if you actually care about your data
I said a real database (Excel)
Perfection
stream to a csv file!
OPENSOURCE, APACHE2 LICENSE
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/mast...
I can do this on my laptop
/tumbleweed...