PS- I just skimmed https://opentelemetry.io, which your readme.md links to.
Good stuff. Much industry progress since I was last in the arena.
Their site has words about manual and automatic instrumentation. I'd have to dig a bit to see what they mean.
So. Remembering a bit more... Will try to keep this brief; you're a busy person.
> tend to log useless information or fail to tag them in ways that are actually searchable
#1 - I don't know how to manage the lifecycle of metadata. Who needs what? When is it safe to remove stuff?
We logged a lot of URLs. So many URL params. And when that wasn't crazy enough, overflow into HTTP headers. Plus partially duplicated, often incorrect, info in the payloads, à la SOAP. ("A person with two watches has no idea what time it is.")
When individual teams were uncertain, they'd just forward everything they received (copypasta), and add their own stuff.
Just replace all that context with correlation IDs, right?
Ah, but there's "legacy". And unsupported protocols, like Redis and JDBC. And brain dead 3rd party services, with their own brain dead CSRs and engrs.
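For what it's worth, the correlation-ID idea fits in a few lines. A minimal sketch (the header name is a common convention, not a standard, and the helper is my invention):

```python
import uuid

# A widely used convention; W3C Trace Context defines "traceparent" instead.
CORRELATION_HEADER = "X-Correlation-ID"

def ensure_correlation_id(headers: dict) -> dict:
    """Return a copy of headers guaranteed to carry a correlation ID,
    reusing one if an upstream hop already set it, minting one if not."""
    out = dict(headers)
    out.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return out
```

Every log line downstream gets tagged with that one ID instead of a pile of copied-forward context. The hard part, of course, is every hop actually forwarding it.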
This is really bad, and just propagates badness, but a few times, in a pinch, I've created a Q&D "logging proxy". Just to get some visibility.
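A Q&D logging proxy is basically a tee: relay bytes from one side to the other and write a copy aside. A toy sketch (the function name is mine; real versions wrap sockets, add timestamps, and handle both directions):

```python
def tee_stream(src, dst, log, chunk_size=4096):
    """Copy bytes from src to dst, appending a copy to log.
    src, dst, and log are binary file-like objects; sockets
    wrapped with makefile('rb')/makefile('wb') work too."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # EOF: the peer closed its side
            break
        dst.write(chunk)
        log.write(chunk)
```

Dumb, but it gives you visibility into protocols (Redis, JDBC, whatever) that your tracing stack doesn't speak.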
So dumb. And yet... Why stop there? Just have "the fabric" record stuff. Repurpose WireGuard into an Omniscient Logger. (Like the NSA does. Probably.) That'd eliminate most I/O trace style logging, right?
Imagine all these "webservices" and serverless apps without any need for instrumentation. Just have old school app level logging.
#2 - So much text processing.
An egregious example is logging HTTP headers. Serialize them as JSON and send that payload to a logging service. Which then rehydrates and stores it somewhere.
My radical idea, which exactly no one has bought into, is to just pipe HTTP (Requests and Responses) as-is to log files. Then rotate, groom, archive, forward, ingest, compress, whatever as desired.
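As a sketch of the "pipe it raw, groom later" idea (the class name and size-based rotation policy are made up for illustration):

```python
import os

class RawRotatingLog:
    """Append raw bytes (e.g. whole HTTP requests/responses, as-is)
    to a file, rotating to <path>.1 when a write would push the file
    past max_bytes. Rotation, archiving, compression, etc. can all
    happen downstream, on the log's schedule, not the app's."""

    def __init__(self, path, max_bytes=10_000_000):
        self.path, self.max_bytes = path, max_bytes

    def write(self, payload: bytes):
        if (os.path.exists(self.path)
                and os.path.getsize(self.path) + len(payload) > self.max_bytes):
            os.replace(self.path, self.path + ".1")  # toy policy: keep one old file
        with open(self.path, "ab") as f:
            f.write(payload)
```

No serialization, no rehydration; the bytes on disk are the bytes on the wire.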
That's what I did on the system I mentioned. All I/O was just streamed to files. And in the case of HL7 (medical records stuff), it was super easy to extract the good bits, use them as Lucene field metadata, and store the whole message as the Lucene document.
I know such a radical idea is out of scope for your work. Just something fun to think about.
> if none of your logs let you link service performance to customer X
Yup. Just keep adding servers. Kick the can down the road.
One team I helped had stuff randomly peg P95. And then sometimes a seemingly unrelated server would tip over. Between timeouts, retries, and load balancers, it really seemed like the ankle bone was connected to the shoulder bone. It just made no sense.
Fortunately, I had some prior experience. Being new to nodejs, maybe 5 years ago, I was shocked to learn there was no notion of backpressure. It was a challenging concept to explain to those teammates. But the omission of backpressure, and a hunch, gave me a good place to start. (I'm no Dan Luu or Bryan Cantrill.)
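The simplest demo of backpressure I know is a bounded queue: when the buffer fills, the producer gets pushed back instead of memory growing without bound. A minimal sketch (Python here rather than nodejs, same idea):

```python
import queue

# A bounded buffer between a fast producer and a slow consumer.
q = queue.Queue(maxsize=2)

q.put("req-1")
q.put("req-2")

# The queue is now full. A non-blocking put fails fast: that failure
# IS the backpressure signal -- the producer must slow down, shed load,
# or block. An unbounded buffer would just quietly eat memory instead.
try:
    q.put("req-3", block=False)
    overflowed = False
except queue.Full:
    overflowed = True
```

Without that signal, the "unrelated server tipping over" pattern is exactly what you get: queues grow silently until timeouts and retries cascade somewhere else entirely.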
I'd like to think that with proper end-to-end logging, and the ability to find signal in the noise, diagnosis would have been more mundane.