Cases where full scans are better than indexes
Just because the table scan is under some threshold doesn't automatically make it better. If a table scan takes 250ms vs 0.01 for a indexed lookup, you're still gonna have to justify to me why making silicon work that damn hard is worth even the electrical use. Are you inserting and deleting so many rows that maintaining the index is prohibitive? Do you have space concerns, and are not able to keep the index in memory, or maybe even on disk? The Bash history thing makes sense, because indexing is a lot of work. But otherwise, just use an index and move on to the next problem.
EDIT: Has anyone else forgotten to add an index, then noticed the performance regression a year later? That's always fun, because now adding the index could mean downtime, and even knowing if or how much is difficult because putting that much data somewhere else to test isn't trivial. No thank you.
For small enough tables doing a full scan will be faster than an index. This is also true for regular non-database applications like checking if an item is present in a small collection: it's faster to do a linear scan over a small vector than it is to do a lookup in a hash table or b-tree. With a linear scan (whether it's for data on disk, or a data structure in memory) you don't have to do hashing or branching (except possibly to terminate the loop). With a binary search you basically get worst possible case for branch predictor, because if you're searching random values whether you go right/left on each recursion level is basically random. This is true for in-memory data structures, and it's even more true on disk, since disks (even SSDs) are especially optimized for linear scans compared to random accesses.
It's been a while since I looked at this, but I think MySQL and Postgres both take this into account and will let you add indexes to small tables but will actually ignore them and do a full table scan for all queries if the table is small enough.
Yes, that's correct. If you do an explain for a tiny table, as long as your stats are up to date, the index will be ignored. In that case, it's there for insurance if the table grows in the future.
Benchmarking this (for non-database applications) is really interesting because you get to find the crossover point, and it's usually in the hundreds of items. An algorithm that searches maximum 50 items in a hash table but does it across many hash tables could be made much faster using a linear search.
Some of his examples aren't great, though. The cost of adding an index to a database you're already using and which has a reasonably small amount of data is low, and the benefit of getting a few hundred ms less latency for your users is underappreciated. A hundred million records is definitely above the limit at which e.g. prefix search on a mobile device becomes painful.
> With a binary search you basically get worst possible case for branch predictor, because if you're searching random values whether you go right/left on each recursion level is basically random.
I think folks writing databases know a thing or two about writing faster search and sort algorithms.
I'm not sure what you are trying to say, given the context of what you quoted was. Binary search has optimal worst case runtime complexity, but for small tables it is, on real computers, overall still slower than a linear scan, which effectively has the worst worst case runtime complexity (unless you start to get pathological). This is because the constant in front of the runtime complexity, the one that O-notation explicitly ignores, starts to matter: Branch prediction and cache locality come into play.
What exactly do you mean with "writing faster search and sort algorithms"?
> What exactly do you mean with "writing faster search and sort algorithms"?
- Branchless binary search: https://news.ycombinator.com/item?id=35737862 / https://news.ycombinator.com/item?id=14598098
- Static B-Trees for faster binary search: https://news.ycombinator.com/item?id=30376140
Presumably there's even more literature on the topic.
Those are neat optimizations of the constant, but even with that, a linear scan can still be much faster under the right (and not unrealistic) circumstances.
But I see now you were referring to the branch predictor part specifically, so all good.
Which is why they use b-trees and not binary search.
Why do you think that? Seems like a hard problem.
All of this looks accurate, but it's worth contextualizing: this is an optimization that bears the most fruit in "local frames of reference" -- the timescales where linear scans beat index lookups are likely to be strictly dominated by the network latency to transceive from the database. The conclusion is then that the optimization ~only optimizes for cases that effectively don't matter.
What do you think the threshold number is where scanning a list will outperform a hash table? 10 items? 100? 1000? For what it’s worth, for small data sets I agree with you. But I think it’s very hard to be calibrated correctly on what small means here.
Re: binary search, the biggest cost to scanning data in a database is loading that data from persistent storage. A few mispredicted branches aren’t going to matter much if it means you can do fewer reads.
> But I think it’s very hard to be calibrated correctly on what small means here.
If you measure or determine somehow the sizes of the L1-L2-L3 caches in the CPU, the relative speed and size of the main memory, and the relative speed and size of the local storage (in contrast with for example network-remote data), then it is a matter of performing some arithmetic, and just to be on the safe side, some benchmarks.
Any data set that fits into the CPU cache is definitely small, for a particular CPU.
If you want to decide a number that applies to every and all hardware, then it goes from hard to basically impossible.
The Rust BTreeMap implementation is an example of this as well. It does a linear scan through the nodes instead of a binary search because it's actually faster in this case!
Did we both just watch the latest Crust of Rust, or did you just know that? If the latter, I’m impressed!
>Did we both just watch the latest Crust of Rust
Yes that's exactly how I knew about this!
Interesting. But the doc doesn't go as far as to say it's faster, just fast - any measurements for eg string keys?
I can't seem to find any benchmarks for this but Gankra did a long write up about Rust's btree implementation!
I did not read this entire thing so it's very possible it contains a benchmark somewhere but I couldn't find any when doing a quick skim.
Yes but when the table is so small even if the index is slower it hardly matters because the table is so small.
Or in other words add the index every single time, unless you have so many writes and such a large table that the actual writes slow down the index.
Leaving out an index on a small table never helps anything.
I am glad the query execution planner (mostly) takes the strain. When it doesn’t it is a 99% chance you need to promote an ORM generated query to a, human<<<<< well probably GPT generated explicit one.
Comment was deleted :(
it's more then that, PostgreSQL will choose to ignore the index on a per query basis, if the query will return more then some % of the table it's faster to just do the scan.
Justify the dev time to save micro-pennies worth of electricity to me instead.
A typical naive index won't help with my regular expression based queries, which aren't easily accelerated by an index. Or given an in-memory index, you've just increased memory use from O(1) to O(N), and I'll OOM on large files. Perhaps you'll throw a database at the problem, complicating I/O (especially when the data is generated/accessed by third party tools that aren't database based), tying me to yet another library's update cycle, and perhaps forcing me to tackle the additional problem of cache invalidation. Perhaps I need reverse lookups, in which case whatever forward indexing I might've pessemistically added ahead of time will be of no help at all.
If it'a a 5 second "this is probably the right choice" kneejerk reaction, maybe it's fine. Or if you're google indexing the internet, I suppose. But I am frequently plagued by shit, buggy, useless indexes that merely distract from the proper alexanderian solution to the gordian knot - brute force - wasting more time than they ever saved, for both people and CPUs.
> Justify the dev time to save micro-pennies worth of electricity to me instead.
I mean, it's a dozen characters. Do you need to know how fast I type before you run the calculation?
If you're already using a relational database you should almost certainly set up indexes on your table ids and foreign keys. But that's pretty different from the examples in the post!
I'm not anti-index, I'm anti-"if you ever have a full table scan in production you're doing it wrong".
‘If you ever have a non-deliberate full table scan in production you are doing it wrong’ then?
If you ever have non-deliberate anything in production there's a non zero chance you're doing it wrong.
> Do you need to know how fast I type before you run the calculation?
I'll assume 100WPM, call that two words, billed at $200/hour and call that $0.06, falling under "too cheap to be worth arguing against", which falls under the aforementioned:
>> If it'a a 5 second "this is probably the right choice" kneejerk reaction, maybe it's fine.
That said, there's a decent chance those 6 cents won't pay for themselves if this is a login on a single user system, with any theoretical benefits of O(...) scaling being drowned out by extra compile times, extra code to load - and I'd be plenty willing to NAK code reviews that merely attempt to replace /etc/passwd and /etc/shadow with this, as the extra time code reviewing the replacement still has negative expected ROI, and it'll be a lot more involved than a mere dozen characters.
Now, maybe we're attempting to centralize login management with Kerberos or something, perhaps with good reasons, perhaps which does something similar under the hood, and we could talk about that and possible positive ROI, despite some actual downtime and teething concerns?
Now document the two words, run the test suite to verify no-breakagem commit to source control, and push to production.
Suddenly those two words cost a lot more than $0.06, and that's IF everything goes smoothly.
This falls under the “if we need to test that indexes still work on our mysql/postgres we have bigger problems” header.
Depending on your IO pattern, adding an index can make writes quite a bit slower.
First tell me the minimum amount of time typing this would have to take for you to agree it's not worth it and I will try to keep adding things like the time it takes for someone to ask you to do this, for you to start VS Code, find the file, press ctrl-s, deploy the changes, and possibly some estimation of how long it takes a new developer to read and understand this system (please tell me how fast you speak and an agreeable value for how fast the average developer reads as well for this part) vs a simpler one until it goes over that limit.
> Justify the dev time to save micro-pennies worth of electricity to me instead.
The time spent justifying it is longer than the dev time itself. Any semi experienced engineer will throw basic indexes into their data model without even thinking and cover the most common use cases.
If they never use them… who cares? It took no time to add.
An RDBMS is not the best data store for all data. Sometimes flat files or other are the simplest tool for the job, as shown in the article. Adding a database to any of those tools would definitely not be worth the trade-off.
My statement assumed that as a starting point. To be fair, the decision to use a database or other format, given a reasonable amount of experience, is also a fairly quick thought process. By the time I open my editor to write my first line of code I’ve already made my choice.
Sometimes, but I lean towards going with the RDBMS first, and then switch to flat text files if that proves to be a better choice.
Because 90% of the time the RDBMS is.
not sure if this is trolling?
> Justify the dev time
And this is exactly the sentiment that got us where we are.
The mistake in your statement I think, is assuming there is a course of action that wouldn’t result in a problem somewhere.
It may look a little different, but there is nothing without tradeoffs.
Including indexing everything.
> If a table scan takes 250ms vs 0.01 for a indexed lookup
I'm not sure whether the units of that latter number are still ms or now s or whatever, but either way isn't that where you are wrong? On real computers, there are lots of situations where linear access is trivially faster than a "better" data structure with theoretically better runtime complexity.
1000 accesses to a single page for a linear scan are going to be many, many orders of magnitude faster than 5 accesses chasing pointers across 5 pages that have no TLB entry, or excruciatingly worse, that need to be paged in from disk.
This is an extreme (but absolutely not unrealistic!) example for illustration. Slightly more subtle scenarios involving e.g. branch prediction have already been named by eklitzke.
This problem exists across multiple layers, not just close to the CPU like above. (As eklitzke also already mentioned with disk involvement, even if it's an SSD.)
If your table is that small won't the index also be a single page? I don't understand this scenario.
So, forgetting an index and then growth pushing you over the threshold is a valid concern. I think every dev has run into that at some early point in their career.
But what your comment is skipping past is there's potentially a 100x or higher difference in throughput for sequential scans vs indexes. If you know the data will have a bounded size this means indexes aren't necessarily a good choice. SSDs have narrowed this gap a great deal but it still exists because of latency and unpredictable prefecting. It even exists with pure in memory applications. Another key aspect is how much of the data the query ends up touching. If you're hitting 25% of the data anyhow a linear scan is likely faster.
There's also more niche ideas like arranging to convoy multiple queries along one linear scan, something impossible with indexed scans.
It's useful to know about this asymmetry and that sometimes a brute force linear scan is in fact the best tool.
I agree - for 90% of cases.
There are situations where the indexes end up larger than, or the same size as, the actual data and the query doesn’t meaningfully benefit from having the data indexed because, for instance, all the data is going to be analyzed anyway, or the search type doesn’t index usefully with the normal index types (like geo searches, or clustering type queries).
Adding indexes that never get used have real costs on an ongoing basis with insert/updates and schema updates too, as it adds potentially significant overhead on every operation and can make certain schema operations impossible without downtime too.
Foreign key columns, ‘soft delete’ columns (like deleted/not), basic auditing type stuff (created on, updated on, etc), ‘unique’ or external reference values like a order id or whatever (even if not a primary/unique key in the schema), basic numeric/analysis columns are almost always worth indexing though, to your point.
Stuff that is not always a clear win without some real thinking is Freeform text fields, structured binary data (like a PDF or image), geo location data without a clear existing use (tends to require specialized index types which are also expensive to load/use), etc.
Many times some preprocessing is necessary anyway to convert what you have to what you actually want, and putting that in a secondary column to be indexed is far more valuable.
Agree but you can always add an index without downtime. It just becomes a money and complexity issue ;)
can you do this (on a large table) without adding significant IO load/clogging up replication for an extended period of time?
It’s worth noting that if your DB instance is so heavily loaded that this is a real concern, you already have a huge problem that needs fixing.
It may be so loaded from all the full table scans it's doing.
AWS is particularly bad with their performance credit system on RDS... and there's to my knowledge no way to tell MySQL to limit index creation IOPS, which means in the worst case you're stuck with a system swamped under load for days and constantly running into IO starvation, if you forget to scale up your cluster beforehand.
Even if the cluster is scaled to easily take on your normal workload, indexing may prove to be too much for IO burst credits
That does seem like a real problem! Adding indexes periodically is a pretty regular thing for any production system where I come from.
You can use gh-ost (or most other online schema change tools) to get around that. You'll still have some necessary increase in write load since you are double-writing that table but you can control the pacing of the data backfill (gh-ost gives even more control as it doesn't use triggers to duplicate writes). It's not quite as simple as just running ALTER TABLE on the master though.
I have never had any problems with CONCURRENT index creations under significant load using Postgres, fwiw
“We’ll add an index when it gets slow” is saying “screw the guy who’s on call the night this system starts timing out”.
Invisible cliffs in your code that it will fall off at some unknown point in the future are the antithesis of engineering.
If you deliberately aren’t implementing indexing, know at what point indexing would begin to matter. Put in place guard rails so the system doesn’t just blow up unexpectedly.
There are 100% cases where you shouldn’t use indexes. That’s not the same thing as being able to get away without an index.
> Invisible cliffs in your code that it will fall off at some unknown point in the future are the antithesis of engineering.
I think this describes the general shape of problem seen in any engineering domain. After a certain point of time (i.e. unknown), we can no longer guarantee certain properties of the system. This is why we add all kinds of qualifiers and constraints in our discussions with the customer.
Certainly, you can make some predictions about how tables will grow over time, but there are also some potential headaches that can be incurred if you arbitrarily add indexing. "We might need to expire records to manage growth" so you go ahead and index that CreatedUtc column. Turns out, the usage patterns of your app changed such that that table is now experiencing 20x more insert volume than you could have ever anticipated and your prediction is now a gigantic liability. With 20x insert volume you would have _never_ indexed that column. You would have used some alternative path involving temporary tables, etc. Now, you are stuck with essentially same problem - a judgement call at 2am regarding whether or not you should drop an index while the rest of your team is in bed.
Since I am unable to predict the future I feel like waiting until the actual problem arises and then dealing with it at that moment is the best option. Strategically, not having an index will likely fail more gracefully than if an index is bottlenecking requests during peak load (i.e. instantaneous vs cumulative data volume).
The key to sleeping at night is to add metrics and alarms near those cliffs or 'edges of the map' so they aren't invisible anymore. It's hard to anticipate every angle to this but in any case where you're making this kind of assumption or tradeoff it's a good idea.
A simple example is setting an alarm on your max load test traffic. When you get that alarm you know your system is now operating in unknown territory and it's probably time to do another load test or at least take a close look at scaling.
Not advocating for building indexes for queries you might later want to run.
But for the queries you are building today? Don’t build them such that they will gradually slow down over time.
Largely agree with this take, it often becomes "add an index because query might be slow" without much discussion or trade offs around query patterns. There's a lot of "what if" over engineering that happens where you end up in hypothetical scenarios.
Look at your real world use cases, query patterns and think hard about it.
Comment was deleted :(
> screw the guy who’s on call the night this system starts timing out
This was a very small billing practice, and that person was going to be me. I thought then, and still think now, that I made a reasonable trade off between what would be work at the time and potential future urgent work.
Additionally, this wasn't the sort of thing that would fail suddenly when you hit a critical point. Instead, running full text searches (a very small fraction of what the database was handling) would just slow down in a way that would have been very clear to the users.
Stealing time from future you also doesn’t pay off. Future you wants to take vacations and change jobs and have family events and sleep and stuff.
It doesn’t take much work to do the back of envelope math:
How long should these queries take? Less than two seconds?
How much slower do they get as record counts increase? 100ms every 250,000 records? Okay, so this will become intolerably slow when the record count hits about 5 million. When’s that going to happen? Never? Then we’re done here. Within a few months? Well let’s plan for that now. Within a few years? Maybe not within the lifetime of this system? Judgement call. Let’s put something in place that forces us to at least look back at this before it gets bad. Even if that’s just a google calendar reminder.
When building systems we are always making trade-offs between the present and the future. It's only "stealing time from future you" when you make bad trade-off; otherwise it's prioritization.
In this case, leaving out an index for full text search meant I didn't need to maintain the software for that index either, which would have been stand-alone at the time I was building this. Possibly this even was a choice that was, in expectation, time-saving for future me.
Feeling a little bit like you’re applying a motte and Bailey argument here.
The bold claim in the article was that there are many circumstances where adding an index isn’t necessary. Diverse examples were given. These included a MySQL database, where adding an index is no additional maintenance overhead whatsoever (FULLTEXT idx in the table DDL). The implication was that there are many circumstances that affect many people where even cheap and easy to implement indexes were not needed.
Now when challenged you’re retreating to the more defensible position that in the case where you’re querying an ad hoc data structure where you would have to implement your own index that that is often a bad trade off. Or maybe it was a bad tradeoff because this was a small app you were building that you knew you would be the long term maintainer for.
But that’s not what I’m talking about is it, and I don't know all the specific circumstances of your single specific project where this was the right choice, so we’re not going to get anywhere.
> These included a MySQL database, where adding an index is no additional maintenance overhead whatsoever (FULLTEXT idx in the table DDL)
MySQL didn't add support for FULLTEXT until v5.6, released 2013-02, a few years after I was working on this. At the time if I had wanted a full text search index it would have needed to be an additional service.
Looking now (no longer on my phone) it's a bit more complex than that: while MySQL has supported FULLTEXT indexes since 3.23.23 (2000-09-01)  if you wanted to use InnoDB (and you probably did -- it was much better than MyISAM ) you initially couldn't use FULLTEXT. That was added in v5.6 , and at the time I was developing this software the standard option was to set up Sphinx.
I've edited the post to add some of this history, so future readers understand this was about whether to add a dependency on an external indexing service.
 http://dev.cs.ovgu.de/db/mysql/News-3.23.x.html ("Full-text search via the MATCH() function and FULLTEXT index type (for MyISAM files). This makes FULLTEXT a reserved word.")
 https://downloads.mysql.com/docs/mysql-5.6-relnotes-en.pdf ("MySQL now supports FULLTEXT indexes for InnoDB tables.")
Most systems get scrapped and never go to production. Early stage is validation and search for a useful concept. It often pays off very well to steal time from the future.
> Invisible cliffs in your code that it will fall off at some unknown point in the future are the antithesis of engineering.
On the other hand, they are the essence of _software_ engineering.
Indices are far more likely to cause a cliff than to remove one, IME. Full table scans are linear and grow linearly. Indices are crazy fast until one day they're suddenly not.
Can you clarify?
I don't think I've ever seen such a cliff where the database is using the same query plan? There's sometimes small incremental slow downs as you pass btree depth thresholds and have to add more, but tbh even that's not usually noticeable.
I have seen plenty of instances where the database decides an index is no longer the right access,but gets that wrong, and the result is terrible.. but I see that as a different issue.
> There's sometimes small incremental slow downs as you pass btree depth thresholds and have to add more, but tbh even that's not usually noticeable.
Even this kind of "narrow" slowdown can be substantial. Well, I don't know specifically that it's depth thresholds, but that kind of behaviour - the table passes a certain size and it's suddenly taking 30% longer to insert/read than it did when it was 2% smaller.
> I have seen plenty of instances where the database decides an index is no longer the right access,but gets that wrong, and the result is terrible.. but I see that as a different issue.
IMO it's the same issue, at least from the perspective of the poor chump on call.
> Even this kind of "narrow" slowdown can be substantial. Well, I don't know specifically that it's depth thresholds, but that kind of behaviour - the table passes a certain size and it's suddenly taking 30% longer to insert/read than it did when it was 2% smaller.
This has usually been a memory problem for us - our hot dataset no longer fits in it.
Joins, however, scale exponentially if you only have full table scans.
Yeah but it's not like you get 10 new customers and that suddenly adds 10 more joins to your query in the middle of the night. At least I hope not.
Comment was deleted :(
As the number of cliffs increases, so does the likelihood that you will hit a second cliff in the middle of fixing the first one. At some point you hit a Kessler Syndrome situation where people start quitting and the people who should have fixed the problem in the first place aren't around anymore.
Nobody wants to stick around for a lecture they can see coming.
Usually things will slow down gradually or fail right away under production loads.
Works until critical threshold.
In enterprise software I noticed software tends to work in a WUCT'ed up manner. Things may slow down over time, but no one complains about it because "the software is just slow", then suddenly one day you hit the timeout component of some higher layer and then the software is completely and utterly broke.
Well, this isn't a valid reason. It's not the software that is causing the WUCT.
If you let WUCT run free in any complex process, you are just asking for everything to break all the time and people to do nothing but fire-fighting. So, you are much better dealing with the WUCT causes instead of insisting that software or any other component scales forever? (Because they never will, and you will just make something else bream.)
WUCT is one of the main reasons why we monitor system operations.
Everything has a critical threshold. Even your perfect db schema will still eventually run out of disk. There are no magic solutions that work on all scale levels.
But there are ways to tell that that is going to happen before it happens.
Unless production loads are non-uniform in time, think Black Friday or similar.
Yup. Quarter end reporting has been going swimmingly for years but this time the orders query is hitting a connection timeout and the numbers are due tomorrow.
That threshold doesn’t gradually creep up on you, it hits you all at once at the worst possible time.
But when you have a 100 places all adding little slowdowns it can be difficult to realize.
When I wrote Reddit Gold (now called Reddit Premium) I intentionally left the paying-user table index-free, looking forward to the day it would become necessary.
It hadn’t happened by the time I left the company, even though sales were exceeding expectations.
That goes against my intuition, because the performance would be more impacted (beneficial to have an index) where you have very few bits set on a boolean field.
If you have 10m users and 100 have "IsPremium = 1" then an index massively speeds up anything with WHERE IsPremium = 1, compared to if the data is fairly evenly spread between IsPremium = 1 and IsPremium = 0 where an index won't be much benefit.
So low sales would increase the need for an index.
That said, I'm assuming searching for "Users WHERE IsPremium = 1" is a fairly rare thing, so an index might not make much difference either way.
We had lots of indexes on the user table; there was a separate table with details of paying users.
You wouldn't need any of these WHERE clauses if you have all the premium users in one table. Even managing the users in both tables in pretty trivial when you just add someone as they pay and remove them as the premium account expires.
> paying-user table index-free
Presumably, that means they created a new table intended to be straight-joined back to the user table. No need to search a column.
Joining a table on a column without an index will require a linear scan of the column. You almost always want an index on JOIN columns
The primary key (userid?) will almost always be indexed implicitly. You’d usually have to go out of your way to avoid it.
So it was probably being joined by an indexed column, but without an explicitly defined index.
Two relations with a one-to-one relationship implies identical primary keys. Most implementations default to creating an index on the primary key.
In this case, the premium user table only needs the user id primary key/surrogate key because it only contains premium users. A query starting with this table is naturally constrained to only premium user rows. You can think of this sort of like a partial index.
One consequence of this approach is that queries filtering only for non-premium users will be more expensive.
Comment was deleted :(
Having a separate table containing all the premium users is different than an extra column in the normal user table. In the extra table example you don't really need an index (in premium user table) if you have only 100 premium users
Indexes are such an easy thing to add. I don't get it. Seems like under optimization when you consider the tradeoffs.
They add overhead to updates and inserts. If you don't need them, why take that hit. Know your data.
If the table is small enough that indexes aren't critical for reads, they probably don't impact the rate of writes either.
Of course, if the object is large scale bulk inserts, with only very occasional selects, then yes.
Wait a second, the whole thing is you don't mind O(N) overhead in searching on every request, but you mind O(log N) overhead for updates and inserts?
Well, my guess is that updates and inserts are much more frequent than searches in their use case. You're assuming a balanced frequency for these operations and it hardly ever happens.
This is a read-heavy workload per the OP: https://news.ycombinator.com/item?id=36071799
It was neither read-heavy nor write-heavy.
Ah, thanks for the correction; I’m guessing I read your comment too literally.
Would strikethrough my prior comment if it weren’t past the edit window…
My experience is that it's uaually better to add indices where it's expected to be needed beforehand. Adding indices on large production tables will millions of rows can bring down databases for hours or even days worst case. It's tricky to manage.
Considering how much data is written to support redo/undo/wal logs, a single index for user id is very little overhead.
Consider the example from the post of searching your shell history. If I don't need indexes I can just have it all in one flat file and use grep. Switching to a tool that supports indexing adds a lot of potential complexity and ways things could go wrong.
Or consider the example of a developer querying frontend logs: queries are many orders of magnitude less common than writes, and (at the scale I used to work at) an index would be incredibly expensive.
I take the point of your examples as written in the post, but I think both of those are a bad comparison to the Reddit Premium subscriber table being discussed, because:
- We’re already using a database, so there’s very minimal added complexity
- This is an extremely hot table getting read millions of times a day
- The scale of the data isn’t well-bounded
- Usage patterns of the table are liable to change over time
It wasn't an extremely hot table, and the scale of the data was well-bounded insofar as the only way for it to become non-negligible would be for us to have had revenue beyond our then-wildest dreams.
If you have nothing to optimize yet, how will you know if you’re optimizing it right?
Typically, you create a large number of test records and see how your expected access patterns perform.
> Indexes are such an easy thing to add.
Indexes can have catastrophic consequences depending on the access patterns of your table. I use indexes very sparingly and only when I know for sure the benefit will outweigh the cost.
So no primary key or uniqueness constraints?
Unlikely, given Reddit's past schema design. One table of "things", and then another table of attributes of those things in a entity,key,value format.
I built something inspired by this very post in 2013/2014. Not sure how the scale compares, but we insert ~10 million “things” with an average of 30 data attributes per day with a 30 day window. It definitely uses primary and foreign keys. It took some time to tune. Had to support an additional access pattern using a non-unique index. Had to work with a DBA to get partitioning right to handle the large write volume and manage expiration efficiently. It worked out great and is still chugging along. It does suck not having all the tools of an RDBMS at your disposal, but it was a good trade off.
Comment was deleted :(
That doesn't mean those tables didn't have primary or unique keys.
According to that post, in 2010, they had about 10 million users. At a conservative 10 fields per user, you're looking at 100 million records.
I'm a bit skeptical that they table scanned 100 million records anytime they wanted to access a user's piece of data back in 2010.
Modern DB engines can enforce Unique or PK constraints without indexes. Yes, they perform scans.
Every major RDBMS requires an index for both constraints.
FK constraints however are a pretty common gotcha. You have a table that allows deletes, and every table that has a column pointing to that table gets scanned on each delete operation.
So you have to add an index on that column, or start talking about tombstoning data instead of deleting it, but in which case you may still need the FK index to search out references to an old row to replace with a new one.
An FK also adds the burden of keeping a copy of each related row during the lifespan of a transaction. This means small-but-frequently-updated tables that are joined to a lot can be a source of unexpected headaches.
After checking the docs, you are right, I stand corrected. I was pretty sure Oracle and DB2 don't.
It's not really index-free if a separate table has constraints of any kind.
What was the reasoning?
I had read a Joel Spolsky post about how they used to get an email every time they made a sale (“Ding!”) and the day they finally turned it off was full of pride. (I did that too, originally not even filtered out of my inbox.)
Wow really? This seems really surprising to me.
Speculating, but presumably this table only needed consultation during creation of a new user session. That’s probably a pretty heavyweight operation to begin with, so adding a scan of a few KB (user IDs being modestly sized integers) for every thousand currently paying users is NBD.
Nope; it was used every time we needed to look up whether someone was a subscriber, which was cached for some things but not all of them.
Self-replying to add: This wasn’t a particularly frequent occurrence.
What sort of events would trigger this lookup if it was infrequent?
It's hard to remember after 13 years, but for example if someone made a purchase and redeemed it, or checked the status of their subscription.
Makes sense :)
If the data is small, then the index will also be small, so it doesn't really matter either way. The reason to avoid indexes is if you can't accept the time or space cost. Only the last of OP's examples is a situation where choosing an index causes a problem; the rest are examples where an index isn't necessary but does not cause problems. Several examples are so trivial that you'd be crazy to use an external database at all. I'm not sure I'm really sold here.
In cases where your data is small enough that you don't need an index, an index is adding technical complexity. In some cases that will be small (you are already using a database and it has built-in support for the kind of index you need) and then, sure, go ahead. But in other cases it will not be so small (you can't get indexes without switching to a more complicated tool, you'd need a somewhat unusual kind of index which requires a plugin, etc) and it's worth thinking about whether your system will be more reliable and low-maintenance long-term if you go without.
I think we're in agreement here; this is the part where I said "Several examples are so trivial that you'd be crazy to use an external database at all". For your shell history, you not only don't need indexes, you don't need the database either (as you conclude in the post). But if you're already in the database, it falls into that first "the index feature is right there" category. This HN thread is full of people skipping the index on their SQL tables in situations where I would write the index. Mostly I think people are just missing that you need to write indices in response to queries rather than what you're imagining when you first create the table.
Unless the small data changes often -- then DBMS needs to change both data and each index.
By definition, the changes will also be small. It's just not possible for it to blow up in a way that would cause a problem, because we're only talking about tiny tables. There's nothing you can do in a single index for a table with 350 rows that will cause any heartache vs. an unindexed table with 350 rows. 4 of the 5 examples in the article are just "this table is too small to care."
 Edited per discussion; a large number of indices on a small table can still get you in trouble.
> There's nothing you can do in an index for a table with 350 rows
I meant if you have 100s of indices -- it might be slow even on small tables.
I actually had a production issues from using a lot of indices, but it's not apples-to-apples with current discussion, because the table sizes, DBMS and update rates were much larger, fixed by removing indices and splitting table into multiple.
Wow! You're right; I wasn't considering that anyone would try to have that many indices. 100 is a lot. I should qualify my statement and say that you won't get in trouble with one index on a small table. I've edited my post accordingly.
In those sort of cases your schema is probably not the correct pattern and needs to be reviewed.
I had one proj where they had 40 indexes on one table. Inserts were terrible for that table. No one wanted to touch it as it happened to be the table where everything was at. They started pulling out 'nosql' sharding and other sorts of tools to fix it. Anything to not change that table with 200+ columns and 30 indexes and the flotilla of stored procs to go with it. They even ported it from MSSQL to Oracle. Thinking somehow magically oracle was going to make the poor schema run faster.
> By definition, the changes will also be small
Even if all you're updating is an int32 column from 1 to 2, the RDBMS is still writing much more data than 4 bytes. At a minimum it's making a copy of the full page the data is in prior to the update (so 8k for pg, 16k for mysql) and usually multiples of that (mysql writes both before/after).
In the case i mention above of log data where data access patterns dominate storage without retrieval computing indices up front is absurdly expensive and challenging at extreme scales. In that case minimal time and perhaps probabilistic indices computed on ingestion and extremely parallel brute force search on search works out both in terms of unbounded scale and cost - by like one to two orders of magnitude.
Indeed, we also use an unindexed store for our logs (having migrated to that from a system that was indexed). This is clearly not the "data is so trivially small that nothing matters" situation that the OP is describing in 4 of the 5 of their examples. I hope you can see this isn't at all incompatible with my post.
I can. I’m just pointing out there are definitely cases where indexing sucks. I agree the original post doesn’t touch those cases.
Once upon a time (25+ years ago) I used to maintain an Oracle database that had around 50M records in its main table and 2M records in a related table. There were a dozen or so dimension (lookup) tables.
Each day, we would receive on the order of 100K records that had to be normalized and loaded into database. I am simplifying this a lot, but that was the gist of the process.
Our first design was a PL/SQL program that looped through the 100K records that would be loaded into a staging table from a flat file, performing the normalization and inserts into the destination tables. Given the size of our large tables, indices and the overhead of the inserts, we never could finish the batch process in the night that was reserved for the records intake.
We finally rewrote the entire thing using SQL, parallel table scans, staging tables and disabling and rebuilding the indexes and got the entire end-to-end intake process to finish in 45 minutes.
Set theory is your friend and SQL is great for that.
> loaded into a staging table from a flat file, performing the normalization and inserts into the destination tables.
This is how I've always done things, but I'm starting to think decoupling normalization from the entire process might be a good idea. This would have simplified design on a few projects (especially the ones that grew "organically") and seems like it would scale nicely.
I'm starting to look at normalization as a completely separate concern and this has led to some fun ideas. E.g. if a system operated on flat files that were already normalized it could save time writing tools; you're at the command line and everyone involved can get their grubby mitts on the data.
> Set theory is your friend and SQL is great for that.
Could you tell us how set theory was useful in your case?
The implication is that the PL/SQL job was operating rowwise.
Batch processing, backed by set theory in this case, is far more efficient than rowwise operations because of the potential for greater parallelism (SIMD) and fewer cache misses.
E.g., if inserting into a table with a foreign key pointing to the user table, batch processing means you can load the user id index once, and validate referential integrity in one fell swoop. If you do that same operation rowwise, the user id index is liable to get discarded and pulled from disk multiple times, depending on how busy the DB is.
Also, each insert will have to acquire locks, verify referential integrity, and write to any indexes individually. This is time consuming.
INTAKE_TABLE contains 100k records and each record has up to 8 names and addresses. The following SQL statements would perform the name de-duplication, new id assignment and normalization.
The database can do sorting and de-duplication (i.e., the query UNION operation) much much faster than any application code. Even though the INTAKE_TABLE (100k records) are TABLE-SCANNED 8 times, the union query runs quite fast.
-- initialize truncate table STAGING_NAMES_TABLE; -- get unique names insert /*+ APPEND PARALLEL(STAGING_NAMES_TABLE, 4) */ into STAGING_NAMES_TABLE (name) select /*+ PARALLEL(INTAKE_TABLE, 4) */ name1 from INTAKE_TABLE union select /*+ PARALLEL(INTAKE_TABLE, 4) */ name2 from INTAKE_TABLE union ... union select /*+ PARALLEL(INTAKE_TABLE, 4) */ name8 from INTAKE_TABLE; commit; -- assign new name ids using the name_seq sequence number insert /*+ APPEND */ into NAMES_TABLE (name, name_id) select name, name_seq.NEXT_VAL from (select name from STAGING_NAMES_TABLE minus select name from NAMES_TABLE); commit; -- normalize the names insert /*+ APPEND PARALLEL(NORMALIZED_TABLE, 4) */ into NORMALIZED_TABLE ( rec_id, name_id1, name_id2, ... name_id8 ) select /*+ PARALLEL(SNT, 4) */ int.rec_id rec_id, nt1.name_id name_id1, nt2.name_id name_id2, ..., nt8.name_id name_id8 from INTAKE_TABLE int, NAMES_TABLE nt1, NAMES_TABLE nt2, ... NAMES_TABLE nt8 where int.name1 = nt1.name and int.name2 = nt2.name and ... int.name8 = nt8.name; commit;
The new id generation does a set exclusion (i.e., the MINUS operation) and then generates new sequence numbers for each new unique name and adds the new records to the table.
The normalization (i.e., the name lookup) step joins the NAME_TABLE that now contains the new names and performs the name to id conversion with the join query.
Realizing that the UNION, MINUS and even nasty 8-way joins can be done by the database engine way faster than application code was eye-opening. I never feared table scans after that. Discovering that the reads (i.e., SELECTs) and writes (i.e., INSERTs) can be done in parallel with optimization hints such as the APPEND hint was a superpower.
Using patterns such a TRUNCATE TABLE (for staging tables) at the top of the SQL script made the scripts idempotent. i.e., we could trivially run the script again. The subsequent runs will not generate the new sequence numbers for the names. With some careful organization of the statements, this became rigorous.
Although I haven't shown here, we used to do the entire normalization process in staging tables and finally do a copy over to the main table using a final INSERT / SELECT statement.
My Oracle SQL-fu is a bit rusty. Apologies for any syntax errors.
One of the problems of the RDF and SPARQL world is that the standard index structure considers the 6 possible permutations of
You can answer many queries with good efficiency with those indexes but building them gets brutal when you are handling billions of triples.
?subject ?predicate ?object.
I went to a talk at Dreamforce years ago where the CTO explained their database architecture and it is interesting that the core table in Saleforce is literally a collection of triples. They have a smart optimizer that builds views, materialized views, indexes, etc. in Oracle based on the queries you run.
I built a bot that would download a lot of records from an organization, build a "machine learning" model (built by a geospatial analyst and we didn't call it that these days) that automatically assigns opportunities to salespeople, and then pushes the assignments into the system. The first time I ran the big query in the test system it timed out, then I went to the bathroom and when I came back the second time it worked perfectly. When it went to production exactly the same thing happened.
> it is interesting that the core table in Saleforce is literally a collection of triples
This is... not entirely true. Most of the data for each "Object" in salesforce are stored in a series of columns on a single table. These are called "flex columns" in the official whitepaper describing the architecture. 
However, _foreign key relationships_ are indeed stored in a table that acts like a collection of triples. This is used in place of native foreign keys, and is indeed deeply integrated with the optimizer.
Slightly off topic, but just the other day we had a case where we had an index, and deleting ~100k entries from a table with a few, small columns caused a huge delay.
Context: Mariadb 10.2. In this table we stored traffic statistics, aggregated by IP address and day. Probably less than 200 bytes per row. They were also written per day, and then cleaned up by a cron job (every month an entire month worth of data was discarded), so "DELETE FROM traffic_stats WHERE day < ?"
This worked fine on the primary, but our monitoring told us that the replicas started to lag, and showed that they spent inordinate amounts of time on the delete statement above. Even though we had an index on the "day" column.
Even weirder, the problem didn't show up in our QA database cluster, which had far less resources (vmware vm in QA vs. pretty solid bare metal in prod).
Even more puzzling, skipping the replication step and running the DELETE statement manually was quite fast (a second or so).
After some frantic search, it turned out that, for reasons lost in time, the QA db cluster used "mixed" replication, while the prod cluster used row-based replication. In row-based replication, the leader communicates to the replicas which rows were deleted. And since we didn't have a primary key on that table, the replicas did a full-table scan to find which record to delete, and repeated that for each of the 100k records to be deleted.
Adding a primary key to the table fixed the issue immediately, and we switched to mixed replication shortly after.
I've very familiar with this sort of problem. I used to work in high-volume electronics manufacturing and coercing high & variable transaction OLTP databases to be able to be 1) backupable, and 2) replicable for analytics workloads often feels like a dark art. In a lot of cases, trial and error is a perfectly reasonable approach, for better or for worse.
Most databases will ignore indexes if the table is small enough, so it sort of does this for you. So the index can just be a waste of space, until it isn't.
Why not just add the index? It future-proofs the performance and, as mentioned in the article, for small datasets, performance is not an issue anyway, so adding an index won't affect the small-dataset performance.
I agree, you don't want performance to suddenly degrade and then you have to find out why. This can be especially bad when this happens with many tables that should have indexes, and performance is sluggish, but not for any specific action.
Add indexes appropriately is a best practice.
> I use grep on a flat file, and testing now it takes 200ms for a query across my 180k history entries
200ms is order of magnitude more than I'd prefer for interactive use
200ms for 180k entries sounds way too slow. Just tested on my 150k line log file and it takes about 10ms on average for various kinds of patterns with various hit frequencies.
On an unrelated note, from a quick check with "strace", gnu grep seems to detect when output is redirected to /dev/null as no write syscalls are being made in that case.
Unrelated note intrigued me, so I went looking in the source code; detection is here: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c#n...
Interesting that they would optimize for this use case.
Especially since grep -q exists. It would never come to my mind to direct grep's output to /dev/null, I wonder how this optimization came to be?
I was going to say that it was "instant", because that's how it feels when using it, but got 200ms from running "time" on grepping it to include a number in the post.
Have you tried out using ripgrep  or fzf ? 200ms is quite a lot for history search, I'd expect search to be 20ms or less. In fact, it should be instantaneous as it should all fit in memory really. I've been using fzf and the searches are all faster than my monitor can display frames (so less than 16ms), although I only have 10k entries on my history as I clear it up once in a while.
Trying now, "cat > /dev/null" takes ~150ms, so I suspect it's related to keeping my history in Google Drive (with local mirroring).
If it gets annoyingly slows I'll stop doing that and use a different syncing system.
Actually, I was an idiot and misreading these numbers. It's 0.020s and 0.015s, which are 20ms and 15ms, not 200ms and 150ms.
Yeah that was my thought too, and I thought about fishing out the unbranded $1 1Gb microsd in my drawer I use for 3d printing to help him out with his 0.5Gb storage challenge
CloudWatch Logs Insights, Snowflake, Grafana Loki all do minimally indexed queries with highly parallel brute force scans of tons and tons of smallish s3 objects. A lot of data, especially data like logs, the ratio of “retrieved:ingested” can be in excess of 1:100. That makes it very much worth while to not index up front but to pay the retrieval cost for that rare amount of data that is actually retrieved. The minimal indexing mentioned is typically time and sometimes bloom filteresque term existence indices at the object level.
> CloudWatch Logs Insights, Snowflake, Grafana Loki
And the common thing in this list of tools? They're all laggy and not a nice experience to use.
Even if queries are rare, there is often an actual human sitting waiting for the results of a query. Whereas while ingesting logs, there is no human waiting. I would prefer to burn more CPU time to save some human time.
I use all three and for large scales they’re considerably faster than the alternative; which at some scales (peta/exabyte) is “none.”
It is infinitely easier to drop an index on a table that is causing poor performance than it is to add a good index to a table that suddenly needs one.
Index everything, drop those that make things worse.
Adding the index may be physically hard because it takes a while and steals resources from your server. But cognitively, it is easy - find the query that is slow and add the index to make it fast. It's usually very obvious from the query what index you need to add
Deleting an index might be physically easy to do. But figuring out if it's safe to do is hard. You have to inspect every query you make and make sure none need that index. Once the index is there, in practice it's usually there forever.
On the flip side, a missing index can bring down production.
We experienced that just a couple of weeks ago, where a missing index and an unexpected 1000x increase in volume at a customer brought the DB to its knees. Sure it was still serving queries but at such a low rate it was effectively useless.
For queries that will be run more than once, I try make sure there's an index it can use for something fairly unique.
I used to be one of the "keep the database running" people at a high frequency trading firm. So, super high pressure; any outage meant we were potentially hemorrhaging money at a brain-melting pace.
My lessons from that experience were twofold. First, you can't plan for sudden performance regressions ahead of time. The number of stars that have to align for that anticipatory index to actually help you are just too great. Even if you correctly guessed that a given table might be affected by one in the future, you'll never guess what fields it needs to include, and in which order, to support whatever new query or query plan change or whatever caused the problem. Second, Murphy's Law dictates that any attempt at preventing future read performance regressions by speculatively creating an index will end up causing a write performance problem instead.
Better instead to just get in the habit of periodically reviewing query plans for every query in the database. If you know the scaling characteristics of the various operations and the kinds of statistics changes are likely to cause the optimizer to choose different ones (and you should), then it's easy enough to select for better scaling characteristics. This is, incidentally, an excellent excuse for making the engineers use sprocs or a clean data access layer or something instead of a heavyweight ORM, so that you have some sensible way of getting a warning before they change queries on you. You can't effectively aim for a target that's bouncing around erratically. Even better still - and only realistically feasible if you somehow managed to win that bigger war I just proposed - set up a tool to monitor query plans and send you an alert when they change unexpectedly, so you can hear about the problem when it's still small.
Are you just trying to create employment for yourself? This is amazingly bad advice. I literally get hired to fix these setups by people that take this approach.
Yes, it's good to periodically review query plans.
But designing and planning for data retrieval is part of a design process. And it's easy. Even Django since 2003 allowed you to define indexes inside your models. Got a 100m record table and you're adding a column you want to query by interatively? Rollout an index. This is normal day-to-day software development...
The write overhead is wild speculation. 99% of business are read heavy.
It sounds like he's thinking of it from the DBA perspective, where they have to react to sudden changes in behavior from devs with little or no warning - since the devs don't talk to them.
DBAs doing proactive index creation when they don't know what the devs are doing is indeed futile.
The devs, however, should definitely be doing proactive database design/planning whenever they do anything, since they can (and will!) cause emergencies and downtime by not considering the impact of how they are interacting with/using the database.
If the devs are directly writing SQL, this is also relatively easy to get them into doing. If they're using a heavyweight ORM, it's nearly impossible to figure out what SQL it's going to run sometimes (and difficult to even trigger in a test), and so the devs often won't even try.
The company had actually banned new uses of ORM and was in the process of eliminating existing usage of it while I was there. We had discovered that teams that used ORM had much higher production incident rates than teams that didn't, and it was fairly directly attributable to lack of understanding and predictability of what was happening in the database.
Maybe not a huge deal if you're in a low-load situation. But HFT requires the A game at all times because you never know when some exciting news that causes trading volume to increase 50-fold almost instantaneously might happen.
For the record, I was a dev and not a DBA. But I did work closely with the DBAs. And I was pretty irritated when I found out ORM was banned, because it definitely made the "writing new code" part of the job more laborious. But, hey, learning experience - it turns out that it was the right move. In the long run we were able to move faster once we stopped having to deal with the blowback from breaking things quite so often.
It's a little bit like when I play Mario Kart with my kids. Why do I go so much faster than them? Mostly because they are constantly pushing hard on the accelerator button, while I ease off the gas and even hit the brakes sometimes. They think speed management is annoying. I think that I spend a lot less time bouncing off of walls and giving precious coins to Lakitu.
Fortunately we're small enough that us devs are effectively the DBAs as well. Meaning, we write the queries and we maintain the DB schema.
I can indeed imagine life being a lot different if you're on the receiving end of an unknown barrage of queries.
Was this because of a missing index or because the optimizer vomited on the queries without an index to provide it statistics? My gripe with RDBs is generally the query optimizer is non deterministic with respect to the query, I.e., it can make a great decision with certain statistics and flip to a terrible one with slightly different statistics even though the original query would have performed basically the same.
I’d rather have a database that query performance degrade gracefully and predictably with scale than some sort of black magic mumbo jumbo wake me up in the middle of the night because it lost its statistical little puny pea brain simply because it couldn’t compute a bounded expectation any more.
I really want the optimizer to make an estimate of the CPU/IO to complete a query. Then, during the query, if that much effort has been expended and we haven't yet completed the query, then update the estimate. If the updated estimate now shows that the query plan is no longer quickest, then abort the query and restart with a different plan.
Years ago I forked postgres and tried to implement the above. Initial results were promising - there were a good chunk of queries that ended up taking a different plan, and sometimes returning 100x quicker.
Alas, the postgres codebase is heindously complex, and implementing the above to be production grade would be many months work - and, due to the way postgres streams results to the client, might have actually required a change to the wire format.
In this case it was a missing index, on a query run on every order line when saving aka a lot.
It had gone under the radar because the volume with our other customers had been low enough that the table scan of the queried table wasn't noticed.
As mentioned a customer suddenly got 1000x the volume and it quickly became an issue.
But yea, we have a job running in the weekends to recalculate statistics on key tables, as we've had issues with that grinding production to a halt before.
And recently I sped up a query by 100x by removing a filter from the where clause that for some weird reason caused the DB to run a really poor plan. It was just a simple check intended to filter out some duplicates in a few edge cases, but couldn't find a way to make the DB run it as a post-predicate. Moved it to my code for the 100x performance win...
> ...an unexpected 1000x increase in volume at a customer brought the DB to its knees.
From an outside perspective it seems like the huge increase in volume was more the issue! It sounds like an index helped a lot, but it would also have added cost for all those customers who didn't see that jump in volume.
Well, of course the volume had something to do with it, but adding the missing index meant the system could easily handle that volume.
The other customers pay for that index as well of course but either the volume is low enough that it's trivial or it's large enough that they too saw ann increase in speed.
The inverse is also true, often you have linear or even quadratic lookup times when something like an index can give you the result "immediately".
I wish languages other than SQL had a concept of adding an index. Imagine a variant of Python's list or C++'s vector where you can add indicies on arbitrary fields of the contents, like a more flexible dict. Something like (pseudocode)
and then you could up by title or author without traversing the whole list. Sure you can just use a separate dict, but then you have to keep both in sync, it is harder to bind it to UI controls, and so on.
books = List<Book, indices=[Book.title, Book.author]>
In my experience these type of abstractions end up bloated with too many concerns which adds maintenance overhead. It's a lot simpler to simply add a `HashMap<Book.title, Book>` and cover with tests, than use an abstraction.
I want languages to do less magic, not more, since I read code more than I write it.
For C++ there's Boost's multi_index.
I used multi_index a lot in the past. However, since I also frequently use sqlite, I have decided to exclusively use SQLite for multi_index in memory. If there is a problem, temporarily using sqlite as a file makes it a lot easier to track the issue. When multi_index patterns become too complex, it's natural to wrap them in a unit test specifically designed for sqlite (multi_index).
> I recently came across someone maintaining a 0.5GB full text index to support searching their shell history, 100k commands. I use grep on a flat file, and testing now it takes 200ms for a query across my 180k history entries.
I just started using a tool that stores my bash history and replaces C-R that I found on HN, it's written in Rust and uses SQLite of course. But it'll randomly cause commands to pause for a few seconds before executing (I think the hook to write the command to history gets blocked). Probably some simple bug, but I could just have my command history in a flat file and use rg or fzf in a 5 line shell function and have no lag.
Okay getting real off-topic now (welcome to HN) but this is not what I expected:
Apparently when SQLite calls `ftruncate` on the `-shm` file, ZFS will sometimes block for several seconds. Maybe it's waiting for txg timeout? Strange.
Can you share the tool? I've been thinking of making something similar for fun backed by a suffix array, interested in what else is out there.
Moral of the post: don't do premature optimization. Common adage but it's a good reminder and example of it.
Aside from one case where OP argued that the queries were so rare compared to the data updates that maintaining the index is more expensive. Which is also pretty classic when you're taught about indexes.
What I recently learned the hard way about indexes is that they're slow when you have to read "large" chunks of a table to find an answer, in my case computing a median for an area. I'm indexing geospatial data and Europe+NA are overrepresented. When you view an area the size of ~Germany, the query would take 20 minutes and compute the median over ~3% of all records whereas a full table scan (whole world) took something like 40 seconds (with the median being computed over the same 3%, it would just evaluate the WHERE clause against every row instead of finding rows to include via the index). That's the power of a sequential read as compared to reading an Rtree. I haven't had such bad experiences with Btree indexes, not sure if that would behave just as badly. On the other hand, if you ask for an area the size of Amsterdam, the index is normal fast, and for <50 rows it's basically instant whereas the full table scan would still take the same 40 seconds.
The cause of this is the likely N+1-like behaviour of indexes on un-clustered data. Two options for speeding this up a ton (they dont' make sense to use together),
1. Cluster your data using the Gist/r-tree index. postgis docs  great explanation
2. Use the r-tree as covering index. IE add your associated data as dimensions into the index. The gist index becomes (lat, long, weather_c, avg_altitude), etc. This avoids having to load a separate page for each row in the spatial index 
Thanks! I don't think MariaDB offers the former but it's good to know this exists and I may want to look elsewhere. This isn't the only thing I ran into with MariaDB, but I'm also hesitant to subscribe to maintaining another service into eternity when the existing database has done fine for everything (more than a decade of projects) up until this project. If I had to redo it, I'd definitely try this out though to see if it's worth it! And at work it's also a different equation. So it's useful info :)
The latter, I considered but didn't think would cut the time by more than half whereas it would need an order of magnitude or more speedup to be useful. Doing a full table scan and caching the result for certain zoom levels was the solution for me.
Adding an index is an incredibly expensive task on a dataset so big that it needs one to be added.
It's something that is likely to require massive engineering effort on a live system.
Removing an index, on the other hand, is never an issue.
It's not at all premature optimization, it's simply basic software design.
Read OP's post please. They're talking of systems where it was acceptable to always do full table scans, in one case having 350 rows.
Not "an incredibly expensive task on a dataset so big that it needs one". Thus I read OP's post as recommending to not do premature optimization (without using those words literally).
1) Postgres optimizer at some point prefers full scans. It depends on your data, your query and configuration. So on read-heavy load you basically lose nothing by adding all appropriate indexes from the get-go. Overhead having index on small tables is small.
2) If your load is write-heavy, you should consider trade-offs. Yes, full scans are sometimes your friends. Interesting thing to notice here that write amplification is basically same even if table is small.
Don't add indexes. Add them later when things are slow and everything is on fire!
Source: guy that gets hired to fix database stuff on fire.
Comment was deleted :(
Indexes got me from 15 seconds down to under a second. I'll always use them even if they wind up being as much space as the actual data. MySQL is just weird that way.
Edit: believe me, I tried it all... refining my queries, multiple tables, bigger hardware, faster disks, changing all the tuneables... nothing was as good as adding a whack-tonne of indexes.
I think the risk of creating indexes prematurely is really that you will make assumptions about how your table is going to be queried that don't turn out to be correct, so you create indexes that you will never need, even if the table gets large.
That probably isn't the end of the world, but it can make it harder for others to understand what is going on in your database (indexes can also be a form of documentation) and could make your ORM code somewhat more complex.
The worst case scenario, I suppose, would be you create an index on a table that grows large
FWIW, most DBMS have built-in index usage stats, and it's not too difficult to query. In my previous job we had a Postgres health dashboard that notably showed `log10(index size) / index hits`, making it pretty clear when some recently added index was useless.
My opinion is that indexes should be either logical (ex: used for exclusion constraints) or purely for performance (tradeoff space for better QPS). Query patterns change, specs change, so monitoring your DB's performance is the way to go.
This has matched my experience too. Most experienced developers can probably identify an inappropriately hot table and add an index to it. Where a DB with dozens of individually modest indexes that add up to be huge is much trickier and more labor intensive to confidently improve.
Lately, I've been investigating 'serverless' databases (PlanetScale, Turso) that charge based on the number of rows read/scanned. As a result, instead of just relying on the perceived performance of a query, I've started monitoring the number of rows scanned, which has emphasized the importance of indexes to me. Granted, the cost per rows read is minuscule (PlanetScale charges $1 per billion!), but it's still something I keep in mind.
I feel this somewhat generalizes. "Cases where a simple list/array is better than a hashed collection." Which, oddly has the perverse problem of "when you should give up on a single collection of your data, and embrace storing it in different ways."
That is, few folks seem to internalize or consider that every index is essentially a copy of the data. A complete copy, if you are doing a fully projected index. So, when is it better to just do a full scan over something else? Well, its probably never better in isolation of every other concern. But, when is it better to just do a scan of the data over duplicating it in a place that would let you hash in? Depends on how much extra work you are doing because of it?
Having typed that, really I think the trick is to ask yourself roughly (not exactly, amusingly) what it means to add an index. Or to use a hashtable. Or to use a list and a hashtable. And realize that the question of using an index is the same as the question of keeping another copy of your data somewhere.
Saying there are cases where full scan is better than an index is really just describing an edge case where the default index implementation is non-optimal.
There are edge cases where default implementation of an index will not result in good performance in some use cases. For example, write heavy loads where the updated tuple values are frequently changing position in the index constantly. However, the solution is rarely "no index".
A solution is to throttle the index update by an acceptable delta through a non-default implementation.
A solution is to develop your own application-side index.
A solution is a custom PSQL function that interacts with an index in a more complicated way.
Maybe a non-default implementation is not priority right now, e.g. if scale is low. But it is always good to think ahead and to have a plan and to log the tech debt and track it responsibly so there are no surprises later. Don't build on shifty foundations.
I take the author's point for single-user write-frequently search-rarely use cases. Plenty of small greenfield projects look like that.
Bigger projects, or projects that are enjoying fast growth, may well need indexing. Why? to keep search performance up under a multi-user workload and/or a mixed write-query workload. Complex multi-table searches often also benefit from judicious indexing.
Here's something to keep in mind: Successful projects often find that an unpredicted query pattern has become very popular and/or a bottleneck. Often the bottleneck is remediated in production by adding appropriate indexes. SQL's designed to add indexing without changes to the data model, for precisely that reason. It's normal to add indexes as projects grow, and no sign of bad design or expensive technical debt. (Unsuccessful projects, or projects that don't need to grow? They don't have the problem.)
Anyone remember gsearch by Jeff Dean circa 2006 at Google? It was a full scan regex search, using 40-80 machines in parallel, not an index
Also sourcerer was a full scan of VCS metadata, on a single machine. i.e. if you listed all the commits by a certain person, it would just do a for loop in C++ going through about 1-2 M commits. They were just C++ structs in memory, and it was probably submillisecond times, even with hardware at the time
I remember reading those 2 programs and being impressed that they were both very fast, and full scans! (gsearch got slow later as the codebase grew, but the first version was a <1 second distributed grep of the whole Google codebase.)
Indexing doesn't add much for very small tables. It also doesn't really help much if the cardinality of the data is low.
I usually index any columns that are used in joins, or frequently used in predicates.
What is the rule that you use on which fields to index? I've been following the general when adding fields: if the field is mentioned in a sql WHERE, ORDER BY, or GROUP BY
Don't overthink that part. Just rely on EXPLAIN. It will tell you if the query had an index to use or if it scanned. You'd be surprised when a DB does/doesn't use an index if you try to figure it out yourself.
A run book for any org when writing queries should be
1) Write the query
2) Before shipping the query run the query through EXPLAIN. Did it scan or did it use an index?
3) If it scanned can you re-write it to use an existing index (the usual answer honestly).
4) Create the index.
The answer will heavily depend on what you're doing.
CRUD patterns where you're pulling out a small group of rows from several tables, you probably want indexes on `JOIN` clauses so that you can quickly find those links using nested loop index scans. And maybe a few indexes on common ways to find records.
Analytics patterns where you're aggregating a large number of rows, indexing `JOIN`s won't help you, because you really want hash or merge joins anyway. Instead, tune your indexes for `WHERE` clauses, because you want a filtered index scans instead of table scans on the reads.
You also want to learn about cardinality and selectivity. Ideally your indexes are organized by selectivity.
You should also consider how unique that field is. If half your widgets are blue and half your widgets are red, indexing the color is probably not helpful (ie, the query planner will ignore the widget_color index and do a full table scan). A rule of thumb number I vaguely recall is 10%
My recollection in postgres anyway is that low cardinality indexes aren't useful, because it doesn't take into account which side of the 1/99% you're on when determining to use the index.
What is useful is to do a partial index on values where foo=small % ,value because then it will use that index when it matches the query, but not when it's in the majority case.
> My recollection in postgres anyway is that low cardinality indexes aren't useful, because it doesn't take into account which side of the 1/99% you're on when determining to use the index.
It does take that into account. Demo:
=# CREATE TABLE low_cardinality AS SELECT generate_series(-10, 1000000) < 0 AS col; =# CREATE INDEX ON low_cardinality (col); =# ANALYZE low_cardinality; =# EXPLAIN SELECT * FROM low_cardinality WHERE col = false; ┌─────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────────────────┤ │ Seq Scan on low_cardinality (cost=0.00..14425.11 rows=1000011 width=1) │ │ Filter: (NOT col) │ └─────────────────────────────────────────────────────────────────────────┘ (2 rows) =# EXPLAIN SELECT * FROM low_cardinality WHERE col = true; ┌────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Index Only Scan using low_cardinality_col_idx on low_cardinality (cost=0.42..4.44 rows=1 width=1) │ │ Index Cond: (col = true) │ └────────────────────────────────────────────────────────────────────────────────────────────────────┘ (2 rows)
Guess they've improved -- That used to be a thing. Looks like it was probably in the PG13 changes to the statistics and btree indexes that did that, though it's hard to tell exactly.
It worked like this for much longer. The oldest version I have around is 9.2 and it behaves the same. You probably are thinking of a somewhat more complex scenario...
Forgot to say: Obviously a partial index is going to be considerably better, performance and size wise.
I don't get it...they are just bragging about writing terrible code? There isn't a single reason given in the article about why a full scan is better than an index other than "I don't need these fancy index things, I'm perfectly happy with my 200ms query time".
Well add the index anyway and add an extra 195ms sleep() call if you really want to. It's still better for your hardware and electricity bill.
While I'm not sure why you moved their recently I can see why you wouldn't have noticed how nice it used to be. As I said, it already started the decline a decade ago.
Comment was deleted :(
It is never really mentioned why adding an index upfront is such a burden?
I don't really understand how the downside of adding an index almost ever outweighs the potential downsides of not adding an index. I'm sure there are some cases, and that was what I was expecting to see in this article, but it just seems to boil down to being so lazy that he wants to avoid typing a few more characters.
He didn't show that full scans are better than indexes, just that they aren't obviously worse when the data is small.
Comment was deleted :(
Comment was deleted :(
Tiddlywiki only uses an indices for tags/fields yet it still works pretty well so long as the wiki isn't huge. There's something to be said for the flexibility an approach like that provides.
The HN title says something different than the blog post title / intent. The HN title is incorrect and misleading imo.
The blog post can be summarized as "you don't need indexes on tiny tables".
Comment was deleted :(
"fits in ram" ok got it
rule of thumb: O(n) latency is almost always fine when (n==1)
Crafted by RajatSource Code