The section about 128-bit pointers being necessary for expanded memory sizes is unconvincing -- 64 bits provides 16 EiB (16 x 1024 x 1024 x 1024 x 1 GiB), which is the sort of address space you might need for byte-level addressing of a warehouse full of high-density HDDs. Memory sizes don't grow like they used to, and it's difficult to imagine what kind of new physics would let someone fit that many bytes into a machine that's practical to control with a single Linux kernel instance.
CHERI is a much more interesting case, because it expands the definition of what a "pointer" is. Most low-level programmers think of pointers as just an address, but CHERI turns it into a sort of tuple of (address, bounds, permissions) -- every pointer is bounds-checked. The CHERI folks did some cleverness to pack that all into 128 bits, and I believe their demo platform uses 128-bit registers.
The article also touches on the UNIX-y assumption that `long` is pointer-sized. This is well known (and well hated) by anyone that has to port software from UNIX to Windows, where `long` and `int` are the same size, and `long long` is pointer-sized. I'm firmly in the camp of using fixed-size integers but the Linux kernel uses `long` all over the place, and unless they plan to do a mass migration to `intptr_t` it's difficult to imagine a solution that would let the same C code support 32-, 64-, and 128-bit platforms.
(comedy option: 32-bit int, 128-bit long, and 64-bit `unsigned middle`)
The article also mentions Rust types as helpful, but Rust has its own problems with big pointers because they inadvisably merged `size_t`, `ptrdiff_t`, and `intptr_t` into the same type. They're working on adding equivalent symbols to the FFI module, but untangling `usize` might not be possible at this point.
> it's difficult to imagine what kind of new physics would let someone fit that many bytes into a machine that's practical to control with a single Linux kernel instance.
I nominally agree with most of your post. But I should note that modern systems seem to be moving towards a "one pointer space" for the entire cluster. For example, 8 GPUs + 2 CPUs would share the same virtual memory space (GPU#1 may take one slice, GPU#2 takes another, etc. etc.).
This allows for RDMA (ie: mmap across Ethernet and other networking technologies). If everyone has the same address space, then you can share pointers / graphs between nodes and the underlying routing/ethernet software will be passing the data automatically between all systems. Its actually quite convenient.
I don't know how the supercomputer software works, but I can imagine that 4000 CPUs + 16000 GPUs all sharing the same 64-bit address space.
It seems to me that such a memory space could be physically mapped quite large while still presenting 64-bit virtual memory addresses to the local node? How likely is it that any given node would be mapping out more than 2^64 bytes worth of virtual pages?
The VM system could quite simply track the physical addresses as a pair of `u64_t`s or whatever, and present those pages as 64-bit pointers.
It seems in particular you might want to have this anyways, because the actual costs for dealing with such external memories would have to be much higher than local memory. Optimizing access would likely involve complicated cache hierarchies.
I mean, it'd be exciting if we had need for memory space larger than 2^64 but I just find it implausible with current physics and programs? But I'm also getting old.
Leaving cluster coherent address space behind - like you say - is doable. But you lose what the parent was saying:
> If everyone has the same address space, then you can share pointers / graphs between nodes and the underlying routing/ethernet software will be passing the data automatically between all systems. Its actually quite convenient.
Let's say you have nodes that have 10 TiB of RAM in them. You then need 1.6M nodes (not CPUs, but actual boxes) to use up 64bits of address space. It seems like the motivation is to continue to enable Top500 machines to scale. This wouldn't be coming to a commercial cloud offering for a long time.
Why limit yourself to in-memory storage? I'd definitely assume we have all our storage content memory mapped onto our cluster too, in this world. People have been building exabyte (1M gigabytes) scale datacenters since well before 2010, and 16 exabytes, the current Linux limit according to the most upvoted post here, isn't that much more inconceivable.
Having more space available usually opens up more interesting possibilities. I'm going to rattle off some assorted options. If there's multiple paths to a given bit of data, we could use different addresses to refer to different paths. We could do something like ILA in IPv6, using some of the address as a location identifier: having enough bits for both the location and the identity parts of the address without being too constrained would be helpful. We could use the extra pointer bits for tagged memory or something like CHERI, which allow all kinds of access-control or permission or security capabilities. Perhaps we create something like id's MegaTexture, where we can procedurally generate data on the fly if given an address. There's five options for why you'd want more address space than addressable storage. And I think some folks are already going to be quite limited & have quite a lot of difficulty partitioning up their address space, if they only have for example 1.6m buckets of 1TB (one possible partitioning scheme).
The idea of being able to refer to everything anywhere that does or did exist across a very large space sure seems compelling & interesting to me!
Maybe. You are paying a significant performance penalty for ALL compute to provide that abstraction though.
Sounds like a disaster in terms of potential bugs.
It is, but it's the same disaster of bugs we already have from multiple independent cores sharing the same memory space, not a brand new disaster or anything.
It's a disaster of latency issues too, but it's not like that's surprising anyone either, and we already have NUMA on some multi-core systems which is the same problem.
We have existing tools that can be extended in straightforward ways to deal with these issues. And it's not like there's a silver bullet here; having separate address spaces everywhere comes with its own disaster of issues. Pick your poison.
Just because you can refer to the identity of a thing anywhere in the cluster doesn't mean it can't also be memory-safe, capability-based, and just an RPC.
Then why make said identity a fixed size number pretending to be a flat address space, instead of some other type of key?
For example, reads (when allowed) might just be transparent RAM accesses.
I'm not advocating 128-bit pointers, or saying they're useful or realistic. I'm just saying, what if.
The issue is latency. If you don’t want to care if the read or write will take 10ns or 10000ns, then that can work, assuming a flat permission structure too. Latency is a fundamental restriction for any computer that is > zero size, and the larger the ‘computer’ (data center), the more noticeable it is.
(For that matter, what happens when different segments of memory have complex access controls? What about needing to retry to failures like network partitions that don’t happen in a normal memory space?)
If latency matters, which it usually does, then you need some kind of memory access hierarchy, copying things back and forth, etc. and then you’ll almost certainly need a library of some kind to manage all this, prefetch from a slow range of memory and populate some of your fast local memory, etc.
At that point, we’ve done a lot of work, and are still pretending there is no network or the like, even though it’s there. It isn’t free, anyway. We’d also need checksums, cross network/fabric/access error handling, etc.
And with 128 bits, we could also use something like IPv6 with the lower 64 bits being byte address hah.
Data centers are already doing "disaggregated memory" (page out to remote RAM), as it's been faster than local disk for years now (or at least was in the pre-NVMe world).
The HPC world is all about RDMA to direct-access their huge data sets, and likely hyperscaler clouds are starting to do that too.
The latency gap is just treated as yet another layer of the cache model.
And we already have error correction for local RAM.
Which as you note, NVME and the latest PCI generations turns on it’s head.
Those were due to bottlenecks that don’t exist anymore, and even in your example were even then only emergency measures due to local resource shortages.
ECC is also not reliable/sufficient in the face of issues that arrive when networks start playing their part.
Instead of having another thread improperly manipulating your pointers and scribbling all over memory, now you can have an entire cluster of distributed machines doing it. This is a clear step forward.
Not that the industry doesnt broadly deserve this FUD take/takedown, but perhaps possibly maybe it might end up being really good & useful & clean & clear & lead to very high functioning very performant very highly observable systems, for some.
Having a single system image has many potential upsides, and understandability & reasonability are high among them.
I think the challenge is that within those pages you might have absolute pointers rather than offset to some “page” boundary. In that case, everything really must share a single uniform address space even if any given mode accessed only a small portion, no?
At that point, maybe you want 256bit or 512bit pointers so that you can build a single global addressable system for all memory in the world.
> How likely is it that any given node would be mapping out more than 2^64 bytes worth of virtual pages?
In the Grace Hopper whitepaper, NVIDIA says that they connect multiple nodes with a fabric that allows them to creat a virtual address space across all of them.
Distributed Shared Memory is a thing, but I'm not sure how widely it is used. I found that it gives you all the coordination problems of threads in symmetric multiprocessing but at a larger scale and with much slower synchronisation.
Again, I'm not a supercomputer programmer. But the whitepapers often discuss RDMA.
From my imagination, it sounds like any other "mmap". You, the programmer, just remembers that the mmap'd region is slower (since it is read/write to a Disk, rather than to RAM). Otherwise, you treat it "like RAM" from a programming perspective entirely for convenience sake.
As long as you know my_mmap_region->next = foobar(); is a slow I/O operation pretending to be memory, you're fine.
Modern systems are converging upon this "single address space" programming model. PCIe 3.0 implements atomic operations and memory barriers, CXL is going to add cache-coherence over a remote / I/O interface. This means that all your memory_barriers / atomics / synchronization can be atomic-operations, and the OS will automatically translate these memory commands into the proper I/O level atomics/barriers to ensure proper synchronization.
This is all very new, only within the past few years. But I think its one of the most exciting things about modern computer design.
Yes, its slow. But its consistent and accurately modeled by all elements in the chain. Atomic-compare-and-swap over RDMA can allow for cache-coherent communications and synchronization over Ethernet, over GPUs, over CPUs, and any other accelerators sharing the same 64-bit memory space. Maybe not quite today, but soon.
This technology already exists for PCIe 3.0 CPU+GPUs synchronization from 8 years ago (Shared Virtual Memory). Its exciting to see it extend out into more I/O devices.
Distributed Memory Access is just another kind of Non-Uniform Memory Access, which is Yet Another Leaky Abstraction. Specifically, if you care about performance at all you now have to worry about where in RAM your data lives.
Caring about where in memory your data lives is different from dealing with cache or paging. Programmers have to plan ahead to keep frequently accessed data in fast RAM, and infrequently accessed data in "slow" RAM. You'll probably need special APIs to allocate and manage memory in the different pools, not unlike the Address Windowing Extensions API in Microsoft Windows.
And once you extend "memory" outside the chassis, you'll have to design your application with the expectation that any memory access could fail because a network failure means the memory is no longer accessible.
If you only plan to deploy in a data center then maybe you can ignore pointer faults, but that is still a risk, especially if you decide to deploy something like Chaos Monkey to test your fault tolerance.
> And once you extend "memory" outside the chassis, you'll have to design your application with the expectation that any memory access could fail because a network failure means the memory is no longer accessible.
You have to deal with these things anyway in any kind of distributed setting. What this kind of location-independence via SSI really buys you is the ability to scale the exact same workloads down to a single cluster or even a single node when feasible, while keeping an efficient shared-memory programming model instead of doing slow explicit message passing. It seems like a pretty big simplification.
I've written code for a few different large-scale SSI architectures. The shared-memory programming model is less efficient than explicit message passing in practice because it is much more difficult to optimize. The underlying infrastructure is essentially converged at this point, so performance mostly comes down to the usability of the programming model.
The marketing for SSI was that it was simple because programmers would not have to learn explicit message passing. Unfortunately, people that buy supercomputers tend to care about performance, so designing a supercomputer that is difficult to optimize misses the point. In real code, the only way to make them perform well was to layer topology-aware message passing on top of the shared memory model. At which point you should've just bought a message passing architecture.
There is only one type of large-scale SSI architecture that is able to somewhat maintain the illusion of uniform shared memory -- hardware latency-hiding e.g. barrel processors. If programmers have difficulty writing scalable code with message passing, then they definitely are going to struggle with this. These systems use a completely foreign programming paradigm that looks deceptively like vanilla C++. Exceptional efficient, and companies design new ones every few years, but without programmers that grok how to write optimal code they aren't much use.
RDMA is used heavily in SMB3 file systems for Microsoft HyperV failover clusters.
That's what I was going to chime in with - you pay for that extra address width. Binary addition and multiplication latency is super-linear with regards to operand width. Larger pointers lead to more memory use, and memory access latency is non-constant with respect to size.
It might make sense for large distributed systems to move to a 128-bit architecture, but I don't see any reason for consumer devices, at least with current technology.
> Binary addition ... is super-linear with regards to operand width
No its not. That's why Kogge-Stone's carry lookahead adder was such an amazing result. O(log(n)) latency with respect to operand width with O(n) total half-adders used.
It may seem like its super-linear. But the power of prefix-sums leads to a spectacular and elegant solution. Kogge-stone (and the concept of prefix-sums) is one of the most important parallel-programming / parallel-system results of the last 50 years. Dare I say it, its _THE_ most important parallel programming concept.
> multiplication latency
You could just... not implement 128-bit multiplication. Just support 128-bit pointers (aka: addition) and leave multiplication for 64-bits and below.
There's also something beautiful about seeing or creating a Kogge-Stone implementation on silicon.
I know it was one of the first time I thought to myself: this is not just a straightforward pipeline, yet it all follows such a beautifully geometrical interconnect pattern. Super fast, yet very elegant to layout.
The original paper is a masterpiece to read as well, if you haven't read it.
"A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equation", by Kogge and Stone.
It proves the result for _all_ associative operations (technically, a class slightly larger than associative. Kogge and Stone called this a "semi-associative" operation).
Well, just got it. Thanks for the reference!
A bit sad that 1974 papers are still behind a IEEE paywall...
Edit: Just finished reading it. I have to say that the generalization of 3.2 got a bit over me, but otherwise it's pretty amazing that they could define such a generalization. Intuition for those type of problem is often to proceed one step at a time, N times.
That it is provably doable in log2(N) is great, especially since it allows for a choice of the depth/number of processors you want to use for the problem. Hopefully next time I design a latency-constrained system I remember to look at that article
> Hopefully next time I design a latency-constrained system I remember to look at that article
Nah. Your next step is to read "Data parallel algorithms" by Hillis and Steele, which starts to show how these principles can be applied to code. (Much higher-level, easier to follow, paper. From ACM too, so its free since its older than 2000)
Then you realize that all you're doing is following the steps towards "Map-reduce" and modern parallel code and just use Map Reduce / NVidia cub::scan / etc. etc. and all the modern stuff that is built from these fundamental concepts.
Kogge and Stone's paper sits at the root of it all though.
From Data Parallel algorithms by Hillis and Steele conclusion: 'if the number of lines of code is fixed and the amount of data is allowed to grow arbitrarily, then the ratio of code to data will necessarily approach zero. The parallelism to be gained by concurrently operating on multiple data elements will therefore be greater than the parallelism to be gained by concurrently executing lines of code'
I feel like this sums up the way of thinking one must have in this paradigm.
It's funny because when designing digital hardware, you're kind-of trained to see things under that angle, since often the commands you write will expand in that tree-like structure, but to gates/ALU instead of data.
Then managing the data-path often piggy-backs on the hardware structure you just generated.
I feel like I have a overall grasp on these concept, yet there's so much interesting results that I don't know about...
You're right, binary addition isn't super-linear. It is non-constant, though, which is a slightly surprising result if you don't know much about hardware.
Kogge-stone's O(Log2(n)) latency complexity might as well be constant. The difference between 64-bit and 128-bit is the difference between 6 and 7.
There's going to be no issues implementing a 128-bit adder. None at all.
a 10% latency hit on addition would definitely be noticable. that's roughly 1.5 years of IPC improvements at the current rate of CPU progress.
Multiplication of 128 bit numbers is also not a big issue. Today you can do it with 4 MULs and some addition, and it would still be faster than 64-bit division. Hardware multiplication of 128-bit numbers would have area problems more than speed problems. You could always throw out the top 128 bits of the result (since 128x128 has a 256 bit result), and the circuit wouldn't be much bigger than a full 64 bit multiplier.
At least at this latest SIGMOD, it felt like everyone and their dog was researching databases in an RDMA environment… so I’d imagine this stuff hasn’t peaked in popularity.
Even so, existing ISAs could address more memory using segmentation. AMD64 has a variant of long mode where the segment registers are re-enabled. For the special programs that need such a large space, far pointers wouldn't be that complicating.
I agree with you.
And seeing datacenter after datacenter shooting up like mushrooms, there might be some sort of abstraction running in this direction, that makes 128bit addresses feasible. At the moment 64bit seems like paging in this sense.
8 EiB of data (in case we want addresses using signed integers) is around 20 metric tons of micro-SD cards one TB each (assuming they weigh 2g each). This could probably fit in a single shipping container.
But the cooling....
Seriously, that was great math you did there, and a neat way to think about volume. That's a standard shipping container , which is less than I thought it would be.
shipping container = 40x8x8ft
microsd card = 15x11x1mm, 0.5g
fits 437,503,976 cards = 379 EiB, costs $43.7B
219 metric tons
8 EiB ~ 10,000,000 TB = fills the shipping container 2.2% high or 56mm or 2 inches, 5 metric tons, costs $1B
shipping containers are rated for up to 24 metric tons, so ~40 EiB $5B 10 inches of cards etc
Don’t underestimate the bandwidth of a shipping container filled with SD cards.
Thanks for pointing out the `usize` ambiguity. It drives me nuts. I suspect it would make me even crazier if I was doing embedded with Rust right now and had to work with unsafe hardware pointers.
(They also fail to distinguish an equivalent of `off_t` out, too. Not that I think that would have the same bit width ambiguities. But it seems odd to refer to offsets by a 'size')
And there's another disadvantage to 128-bit pointers - memory size and alignment. It would follow that each struct field would become 16 byte-aligned, and pointers would bloat up as well, leading to even more memory consumption, especially in languages that favor pointer-heavy structures.
This was a major counterargument against 64-bit x86, where the transition came out as a net zero in terms of performance, due to the hit of larger pointer sizes counterbalanced by ISA improvements such as more addressable registers.
Many people in high-performance circles advocate using 32-bit array indices opposed to pointers, to counteract the cache pollution effects.
I figure the cache is going to be your largest disadvantage, and is the primary reason CPUs don't physically implement all address bits and why canonical addressing was required to get all this off the ground in the first place.
I can't imagine a single Linux kernel instance or single program controlling that much bus-local RAM, but as you say there are other uses.
One use I can imagine is massively distributed computing where pointers can refer to things that are either local or remote. These could even map onto IPv6 addresses where the least significant 64 bits are a local machine pointer and the most significant 64 bits are the machine's /64. Of course the security aspect would have to be handled at the transport layer or this would have to be done on a private network. The latter would be more common since this would probably be a supercomputer thing.
Still... I wonder if this needs CPU support or just compiler support? Would you get that much more performance from having this in hardware?
I do really like how Rust has u128 native in the language. This permits a lot of nice things including efficient implementation of some cryptography and math stuff. C has irregular support for uint128_t but it's not really a first class citizen.
I can't imagine a single Linux kernel instance or single program controlling that much bus-local RAM, but as you say there are other uses.
MS-DOS and 640K ...
The addressable size grows exponentially with more bits, not linearly. 2^64 is not twice as big as 2^32. It’s more than four billion times as big. 2^32 was only 65536 times as big as 2^16.
Going past 2^64 bytes of local high speed RAM becomes a physics problem. I won’t say never but it would not just be an evolutionary change from what we have and a processor that could perform useful computations on that much data would be equally nuts. Just moving that much data on a bus of today would take too long to be useful, let alone computing on it.
>(comedy option: 32-bit int, 128-bit long, and 64-bit `unsigned middle`)
rather than unsigned middle, could we just call it malcom?
It's very difficult to see normal computers that normal people use needing it any time soon, I agree. Frontier has 9.2PB of memory though, so that's 50bits for a petabyte and then 4 more bits, 54bits of memory addressability if we wanted to byte address it all. Looking at it that way, if super computers continue to be funded and grow like they have, we're getting shockingly close to 64bits of addressable memory.
I don't know that that really means we need 128bit, 80 or 96bits buys a lot of time, but it's probably worth a little bit of thought.
I don't know how many of you remember the pre-386 days. It was an effort to write interesting programs though, 512KB or 640KB of memory to work with but it was 16bit addressable and so you're writing code to manage segments and stuff, it's an extra degree of complexity and a pain to debug. 32bits seemed like a godsend when it happened. I imagine most of the dorks on here have ripped a blu-ray or transcoded a video image from somewhere, it's not super unusual to be dealing with a single file that cannot be represented as bytes with a 32bit pointer.
It's all about cost and value, 64bits is still a staggering amount of memory but if the protein folding problems and climate models and what have you need 80bits of memory to represent the problem space, I would hope that the people building those don't also have to worry about the memory "shoe boxing" problems of yesteryear too.
Personally, having just suffered the issue of uint64_t (unsigned long vs unsigned long long...): I prefer having "bit size" + "semantics" in the type, and that's it. If the hardware doesn't support it, fail horrendously or use an approximation/slower path, as specified by compiler options.
In my opinion, developers should think about semantics first, optimization (device-specific or software) after (, unless you already know your device).
I slightly disagree with JVM "int = 32bit" but they essentially are forcing their own "virtual hardware", so I can understand that. For portable native code... I can only say that I'm disappointed in uint64_t. But also that, maybe, the current Rust isn't the end-all be-all of portable types.
> (comedy option: 32-bit int, 128-bit long, and 64-bit `unsigned middle`)
Or rather than keep moving the long goalpost, keep long at u64/i64 and add prolong(ed) for 128. Or we could keep long as the “nominal” register value, and introduce “short long” for 64. So many options.
> 64 bits provides 16 EiB (16 x 1024 x 1024 x 1024 x 1 GiB), which is the sort of address space you might need for byte-level addressing of a warehouse full of high-density HDDs. Memory sizes don't grow like they used to
An exabyte was an absolutely incomprehensible amount of memory, once. Nearly as incomprehensible as 4 gigabytes seemed, at one time. But as you note, 64 bits of addressable data can fit into a single warehouse now.
Going by the historical rate of increase, $100 would buy about a petabyte of storage in 2040. Even presuming a major slowdown, we still start running into 64 bit addressing as a practical limit, perhaps sooner than you think.
Storage and interconnect specs are getting a lot faster. I could see a world where you treated an S3 scale storage system as a giant tiered addressable memory space. AS/400 systems sort of did something like that at a small scale.
> CHERI is a much more interesting case
It would be interesting to see something like this on x86 and ARM. I could imagine Apple implementing something similar.
We couldn't introduce a new 'middle' keyword, but could we say 'int' is 32 bit, 'long' is 128 bit and 'short long' is 64 bit..?
Systems for which pointers are not just integers have come and gone, sadly.
Many mainframes had function pointers which were more like a struct than a pointer.
Technically that’s still pretty common on the software side, that’s what tagged pointers are.
The ObjC/Swift runtime uses that for instance, the class pointer of an object also contains the refcount and a few flags.
Pointer tagging is still spiritually an integer (i.e. mask off the bits you guarantee the tags are in via alignment) versus (say) the pointer being multiple addresses into different bits of different types of memory and tags insert other stuff here.
> Pointer tagging is still spiritually an integer
No, pointer tagging is spiritually a packed structure. That can be a simple union (as it is in ocaml IIRC) but it can be a lot more e.g. the objc “non-pointer isa” ended up with 5 flags (excluding raw isa discriminant) and two additional non-pointer members of 19 and 9 bits.
You mask things on and off to unpack the structure into component the system will accept. Nothing precludes using tagged pointers to discriminate between kinds and locations.
Itanium function pointers worked like that, so it could have been the new normal if IA64 wasn’t so crazy on the whole.
A Raymond Chen post about the function pointer craziness: https://devblogs.microsoft.com/oldnewthing/20150731-00/?p=90...
FYI Such memory tagging has a rich history https://en.wikipedia.org/wiki/Tagged_architecture
> ...which is the sort of address space you might need for byte-level addressing of a warehouse full of high-density HDDs
So if you want to mmap() files stored in your datacenter warehouse, maybe you do need it?
Even assuming you are correct on all these points ASLR is still an important use case and the effective security of current 64-bit address spaces is low.
So 64 bit address is only 1024 16TB HDDs? That number may go down quickly. There is a 100TB SSD already.
1024*1024 16TB HDDs, or 1,048,576.
1,000,000 drives at 16 TB each I think.
Kilo is 10 bits. Mega 20. Gigs 30. Tera 40. 16TB is 44 bits. 1000* is another 10 bits so 54.
As my siblings are pointing out your error they are not addressing what you are saying. You are absolutely correct. It's conceivable within a few years to fit this much storage in one device (might be a full rack full of disks but still).
The highest density storage is currently 1PB in 1U, fitting 40PB in a Rack. We expect to reach 100PB in 2025, with some futuristic roadmap of 200PB in 2030, assuming we could yield those NAND. And a thousand layers NAND to reach 500PB in 2035.
That is still an order of magnitude smaller than 16EiB. So no, not in a few years time.
One thing is storage density, another is data transfer rate.
Frontier's LustreFS based mind-boggling 700PB Orion storage subsystem writes at a similarly impressive 5TB/s ; so with a generous reading, it could potentially fill up in about 39h. Historically, disk density increased more rapidly than network bandwidth, which ultimately limits the performance of distributed file systems.
If it takes weeks or months to write such enormous data set, then there will be only few use cases, if it takes years there will be none.
Not quite - 1024GiB is 1TiB, so it's 1024 x 1024 x 16TiB drives
Ah, parent already corrected their mistake. The comment I was responding to was saying 16*1024*1024*1GB.
On one hand The IBM System/38 used 128 bit pointers in the 1970s, despite having a 48 bit physical address bus. These were used to manage persistent objects on disk or network with unique ids a lot like uuids.
On the other hand, filling out a 64 bit address space looks tough. I struggled to find something of the same magnitude of 2^64 and I got ‘number of iron atoms in an iron filing’, From a nanotechnological point of view a memory bank that size is feasible (fits in a rack at 10,000 atoms per bit) but progress in semiconductors is slowing down. Features are still getting smaller but they aren’t getting cheaper anymore.
I thought up a few ways to visualize 2^64 unique items:
- You could give every ant on Earth ~920 unique IDs without any collisions
- You could give unique IDs for every brain neuron for all ~215 million people in Brazil
- The ocean contains about 20 × (2^64) gallons of water (3.5267 × 10^20 gallons total)
- There are between 100-400 billion stars in the Milky Way, so you could assign each star between 46,000,000–184,000,000 unique IDs each
- You could assign ~2.5 unique IDs to each grain of sand on Earth
- If every cell of your body contained a city with 500,000 people each, every "citizen" of your body could have a unique ID without any collisions
Calculating these figures is actually a lot of fun!
There are only ~368 grains of sand per ant?
The common claim that there are ~7.5E18 (~2.5 × 2⁶⁴) grains of sand seems to originate from this back-of-the-envelope calculation:
It takes only beach sand into account, and of course there's a lot of guessing involved. It could be easily off by many orders of magnitude.
Well no. See there's this one ant, Jeff, that's hogging them all for itself, so each ant only gets 50 grains grains of sand.
No wonder I keep reading about sand shortages.
Those are great examples
I think multiplying (1000 unique identifiers) isn’t fair but I like the gallons.
> filling out a 64 bit address space looks tough. I struggled to find something of the same magnitude of 2^64 and I got ‘number of iron atoms in an iron filing’
Reminded me of Jeff Bonwick's answer to the following question about his 'boiling the oceans' quip related to ZFS being a "128 bit filesystem":
> 64 bits would have been plenty ... but then you can't talk out of your ass about boiling oceans then, can you?
Sadly his Sun hosted blog was eaten by the migration to Oracle, so thanks to the Internet Archive again:
That one dives into some of the "handwaving" a bit:
And that one goes into how much energy it would take to merely spin enough disks up:
One absolute limit of computation is that it takes
of energy to delete one bit of information where k is the Boltzmann constant and T is the temperature. Let T = 300° K (room temperature)
I multiplied that by 2¹²⁸, and got 1.41×10¹⁸ J of energy. 1 ton of TNT is 4.2×10¹² J, so that is a 335 kiloton explosion worth of energy just to boot.
That's not impossible, that much heat is extracted from a nuclear reactor in a few months. If you want to go faster you need a bigger system, but a bigger system will be slower because of light speed latency.
(You do better, however, at a lower temperature, say 1° K but heat extraction gets more difficult at lower temperatures and you spend energy on refrigeration unless you wait long enough for the Universe to grow colder.)
In fairness, the need for 128-bit addressable systems will be when 64 address bits is not enough. That will be long before people are using 2^128 bytes on one system. So doing the calculation with 2^65 bytes would be a more even handed estimate of the machine that would require this
> 1.41×10¹⁸ J of energy.
Used over the course of a year, that is a constant 44.4 GW. Less than Bitcoin uses already
> On one hand The IBM System/38 used 128 bit pointers in the 1970s, despite having a 48 bit physical address bus.
And the original processor was 24 bits, then it was upgraded to 36 bits (not a typo: 36 bits), and then to POWER 64 bits.
(When that last happened, it was re-badged AS/400. Later, marketing renamed the AS/400 to iSeries, and then to IBM i, without changing anything significant. Still uses Power CPUs, AFAIK).
For users, upgrades were a slightly longer than usual backup and restore.
What's the hard part here?
The AS/400 and iSeries also use 128-bit pointers. 128-bit would be useful for multiple pointers already in common use such as ZFS and IP6 addresses. I expect it will the last hop for a long time.
In the context of the AS/400's single-level store architecture the 128-bit pointers make a lot of sense, too.
Only 64 of those 128 bits are the actual address. The first byte indicates the type of object the pointer points to. The next seven bytes are (for some pointer types) used to encode security permissions (capabilities); for other types they are just reserved (all zeroes). So taking up 128 bits is more about capability-based security and type safety than the single-level store per se; the single-level store would have worked about as well with only 64-bit pointers. There is also a 129th bit (tag bit) which helps detect unsafe pointer modification. (The tag bit is stored using ECC memory’s extra bits for storing the error correction, which happen to have one spare bit per every 16 bytes.)
Conceptually similar to ARM Morello/CHERI’s 129 bit pointers; although that is a much more sophisticated implementation than IBM 38/400/i.
Those are evolved from the System/38.
Yeah, IBM is one company that shows how to push the models down the road. They do take their legacy seriously.
IBM i is still in development and also has a 128 bit pointer.
> Matthew Wilcox took the stage to make the point that 64 bits may turn out to be too few — and sooner than we think
Let's think critically for a moment. I grew up in the 1980s and 1990s, when we all craved more and more powerful computers. I even remember the years when each generation of video games was marketed as 8-bit, 16-bit, 32-bit, ect.
BUT: We're hitting a point where, for what we use computers for, they're powerful enough. I don't think I'll ever need to carry a 128-bit phone in my pocket, nor do I think I'll need a 128-bit web browser, nor do I think I'll need a 128-bit web server. (See other posts about how 64-bits can address massive amounts of memory.)
Will we need 128-bit computing? I'm sure someone will find a need. But let's not assume they'll need an operating system designed in the 1990s for use cases that we can't imagine today.
I think the argument for 128-bits being necessary strictly for routing memory in a single "computer" is fairly weak. We're already seeing the plateauing of memory sizes nowadays; what was exponentially in reach within just a couple of decades exponentially recedes from us as our progress goes from exponential to polynomial.
But the argument that we need more than 64 bit capability for a lot of other reasons in conjunction with memory addressability is I think very strong. A lot of very powerful and safe techniques become available if we can tag pointers with more than a bit squeezed out here and a bit squeezed out there. I could even see hard-coding the CPU to, say, look at 80 bits as an address and then us the remaining 48 for tagging of various sorts. There's precedent, and 80 bits is an awful lot of addressible memory; that's a septillion+ addressible bytes, by the time we need more than that, if we do, our future selves can deal with that. (It is good to look ahead a decade or two and make reasonable preparations, as this article does; it is hubris to start trying to look ahead 50 years or a century.)
We are talking about the OS kernel that supports 4096 CPU cores. The companies who pay their engineers to do linux kernel development tend to be the same ones that have absurd needs.
And that's not enough to boot on some platforms that linux runs on (even with amd64 ISA) unless one partitions the computer into smaller cpu counts
IMO there's an important distinction to be made between high-bit addressing and high-bit computing.
Like, no one has enough memory to need more than 64 bits for addressing, and that is likely to remain the case for the foreseeable future. However, 128- and 256-bit values are commonly used in domains like graphics, audio, and so-on, where you need to apply long chains of transformations and filters, but retain as much of the underlying dynamic range as possible.
Those are data types, not pointer values or filesystem offsets. Totally different thing.
Is that strictly the definition, that the bit-width of a processor refers only to its native pointer size?
I know it's hardly a typical or modern example, but the N64 had just 4MB of memory (8MB with the expansion pack). It most certainly didn't need 64-bit pointers to address that pittance, so it was a "64 bit processor" largely for the purposes of register/data size.
Not so much a definition as where the problem lies.
If the thing that can refer to a memory address changes size, there are very different problems than will arise if the size of "an integer" changes.
You could easily imagine a processor that can only address an N-bit address space, but can trivially do arithmetic on N*M bit integers or floating point values. And obviously the other way around, too.
In general, I think "N bit processor" tends to refer to the data type sizing, but since those primitive data types will tend to fit into the same registers that are used to hold pointers, it ends up describing addressing too.
Supercomputers? Rack-scale computing? See some of the work being done with RDMA and "far memory".
Yes... And do you need that in the same kernel that goes into your phone and your web server?
It would be sad if we, as an industry, do not take this opportunity to create a better OS.
First, we should decide whether to have a microkernel or a monolithic kernel.
I think the answer is obvious: microkernel. This is much safer, and seL4 has shown that performance need not suffer too much.
Next, we should start by acknowledging the chicken-and-egg problem, especially with drivers. We will need drivers.
So let's reuse Linux drivers by implementing a library for them to run in userspace. This should be difficult, but not impossible, and the rewards would be massive, basically deleting the chicken-and-egg problem for drivers.
To solve the userspace chicken-and-egg problem (having applications that run on the OS), implement a POSIX API on top of the OS. Yes, this will mean that some bad legacy like `fork()` will exist, but it will solve that chicken-and-egg problem.
From there, it's a simple matter of deciding what the best design is.
I believe it would be three things:
1. Acknowledging hardware as in .
2. A copy-on-write filesystem with a transactional API (maybe a modified ZFS or BtrFS).
3. A uniform event API like Windows' handles and Wait() functions or Plan 9's file descriptors.
For number 3, note that not everything has to be a file, but receiving events like signals and events from child processes should be waitable, like in Windows or Linux's signalfd and pidfd.
For number 2, this would make programming so much easier on everybody, including kernel and filesystem devs. And I may be wrong, but it seems like it would not be hard to implement. When doing copy-on-write, just copy as usual, and update the root B-tree node; the transaction commits when the root B-tree node is flushed to disk, and the flush succeeds.
(Of course, this would also require disks that don't lie, but that's another problem.)
Would you like to mention what kind of expertise you have to state that a microkernel is obviously better, and in what sense it's better? And also explain how is it that microkernels already exist and no one cares? Especially the ones older than Linux?
For reference 2^64 = ~10^19.266 I don't think this is unreasonable at all, its unlikely that computers will largely stay the same in the coming years. I believe we'll see many changes to how things like mass addressing of data and computing resources is done. Right now our limitations in these regards are addressed by distributed computing and databases, but in a hyper-connected world there may come a time when such huge address space could actually be used.
It's an unlikely hypothetical but imagine if fiber ran everywhere, and all computers seamlessly worked together sharing computer power as needed. Even 256 bits wouldn't be out of the question then. And before you say something like that will never happen consider trying to convince somebody from 2009 that in 13 years people would be buying internet money backed by nothing.
You could do this today with 196 bits (128-bit IPv6 address, 64-bit local pointer). Take a look at RDMA, which could be summarized as "every computer's RAM might be any computer's RAM".
> It's an unlikely hypothetical but imagine if fiber ran everywhere, > and all computers seamlessly worked together sharing computer power > as needed. Even 256 bits wouldn't be out of the question then.
The question is whether such an address makes sense for the Linux kernel. If your hyper-converged distributed program wants to call `read()`, does the pointer to the buffer really need to be able to identify any machine in the world? Maybe it's enough for the kernel to use 64-bit local pointers only, and have a different address mechanism for remote storage.
Not agreeing or disagreeing for the most part, but in 2009 all of the prerequisite technology for cryptocurrency existed: general purpose computers for the average person, accessible internet, cryptographic algorithms and methods, and cheap storage.
For 256 bit computers, we need entirely new CPU architectures and updated ISAs for not just x86/AMD64, but for other archs increasing in popularity such as ARM and even RISC-V. Even then compilers, build tools, and dependant devices with their drivers need updates too. On top of all of this technical work, you have the political work of getting people to agree on new standards and methods.
256 bits in the case of a worldwide mega-computer would be such a huge departure from current architectures and more importantly latency-numbers that we can barely even speculate about it.
It may be of note that hypothetically one can have a soft-ISA 128 bit virtual address (a particularly virtual virtual address) which is JITed down into a narrower physical address by the operating system. This is as far as I'm aware how IBM i works.
For what is worth, 256 / log2(10) is around 77, while the observable universe is estimated to have anywhere from 10^78 to 10^82 atoms. An 256-bit address space, if fully utilized, would produce Asimov’s AC from The Last Question.
More realistically though, we would throw away at least half that length like how we are handing out /64 blocks to everyone on IPv6.
I don’t know it seems excessive to me. I could see the cold storage maybe, with spanning storage pools (by my reckoning there were 10TB drives in 2016 and the largest now are 20, so 16 years from now should be 320 if it keeps doubling, which is 5 orders of magnitude below still).
> Right now our limitations in these regards are addressed by distributed computing and databases, but in a hyper-connected world there may come a time when such huge address space could actually be used.
Used at the core of the OS itself? How do you propose to beat the speed of light exactly?
Because you don’t need a zettabyte-compatible kernel to run a distributed database (or even file system, see ZFS), trying to DMA things on the other side of the planet sounds like the worst possible experience.
Hell, our current computers right now are not even close to 64 bit address spaces. The baseline is 48 bits, and x86 and ARM are in the process of extending the address space (to 57 bits for x86, and 52 for ARM).
Thanks to Moore's law, you can assume that DRAM capacity will double every 1-3 years. Every time it doubles, you need one more bit. So if we use 48 bits today, we have 16 bits left to grow, which gives us at least 16 years of margin, and maybe 48 years. (and it could be even longer if you believe that Moore's law is going to keep slowing down).
The extra bits might be used for different things like e.g. in CHERI. The address space is still 64 bits, but there are 64 bits in metadata added to it, so you get a 128 bit architecture.
> all computers seamlessly worked together sharing computer power as needed. Even 256 bits wouldn't be out of the question then.
This sounds like it would be massively out of scope for Linux. It'd require a complete overhaul of most of its core functionality, and all of its syscalls. While not a completely infeasible idea, it sounds to me like it'd require a completely new designed kernel.
I remember a quote from the papers about Opal, an experimental OS that was intended to use h/w protection rather than virtual memory, so that all processes share the same address space and can just exchange pointers to share data.
"A 64 bit memory space is large enough that if a process allocated 1MB every second, it could continue doing this until significantly past the expected lifetime of the sun before it ran into problems"
> How would this look in the kernel? Wilcox had originally thought that, on a 128-bit system, an int should be 32 bits, long would be 64 bits, and both long long and pointer types would be 128 bits. But that runs afoul of deeply rooted assumptions in the kernel that long has the same size as the CPU's registers, and that long can also hold a pointer value. The conclusion is that long must be a 128-bit type.
Can anyone explain the rationale for not simply naming types after their size? In many programming languages, rather than this arcane terminology, “i16”, “i32”, “i64”, and “i128” simpy exist.
Legacy, there’s lots of dumb stuff in C. As you note, in modern languages the rule is generally to have fixed-size integers.
Though I think there are portability issues concerns, that world is mostly gone (it remains in some corners of computing e.g. dsps) but if you’re only using fixed-size integers what do you do when a platform doesn’t have that size? With a more flexible scheme, you have less issues there, however as the weirdness landscape contracts the risk of making technically incorrect assumptions (about relations between type sizes, or the actual limits and behaviour of a given type) start increasing dramatically.
Finally there’s the issue at hand here: even with fixed-size integers, “pointer” is a variable-size datum. So you still need a variable-size integer to go with it. C historically lacking that (nowadays it’s called uintptr_t), the kernel made assumptions which are incorrect.
Note that you can still get it wrong even if you try e.g. Rust believes and generally assumes that usize and pointers correspond, but that gets iffy with concepts like pointer provenance, which decouple pointer size and address space.
> Legacy, there’s lots of dumb stuff in C.
Yes, this, so much this
Who cares what an 'int' or a 'long' is. Except for things like the size of a pointer, it's better if you know exactly what you're working with.
> in modern languages the rule is generally to have fixed-size integers.
Modern languages have unlimited size integers :-)
"Modern" as in "since at least the 80s, more likely 70s".
Good luck using those to specify the data layout of a network packet.
You are supposed to stream your video data as base64 encoded xml embedded in a json array.
Well, for that you'd probably use a specialisation of Integer that's bounded and can thus be represented in a machine word.
And then you'll be wasting time marshaling data between the stream and your objects because they're not PODs and so you can't just memcpy() onto them.
Good luck seeing your performance drop off a very sharp cliff if you start using larger numbers than your CPU can fit into a single register.
Well, in those case other languages fail.
Either silently with overflows, usually leading to security exploits, or by crashing.
So in either case you are betting that these cases are somewhere between rare and non-existent, particularly for your core/performance intensive code.
Being somewhat slower, probably in very isolated contexts (60-62 bits is quite a bit to overflow), but always correct seems like the better tradeoff.
How many languages actually take advantage of the fact that they could get more performance out of "small" register sized integers? You tend to lose a lot of performance to arbitrary sized integers in most languages even if your inputs are small.
> but always correct seems like the better tradeoff.
Not if you are dealing with time constraints. Glitchy output isn't good, but locking up the system because some buggy code path is trying to allocate a 10 GB integer can be worse.
I'm sure someone will come along and explain why I have no idea what I'm talking about, but so far my understanding is those names exist because of the difference in CPU word size. Typically "int" represents the natural word size for that CPU, which matches the register size as well, so 'int plus int' is as fast as addition can run by default, on a variety of CPUs. That's one reason chars and shorts are promoted to ints automatically in C.
Let's say you want to work with numbers and you want your program to run as fast as possible. If you specify the number of bits you want, like i32, then the compiler must make sure on 64bit CPUs, where the register holding this value has an extra 32bits available, that the extra bits are not garbage and cannot influence a subsequent operation (like signed right shift), so the compiler might be forced to insert an instruction to clear the upper 32bits, and you end up with 2 instructions for a single operation, meaning that your code now runs slower on that machine.
However, had you used 'int' in your code, the compiler would have chosen to represent those values with a 64bit data type on 64bit machines, and 32bit data type on 32bit machines, and your code would run optimally, regardless of the CPU. This of course means it's up to you to make sure that whatever values your program handles fit in 32bit data types, and sometimes that's difficult to guarantee.
If you decide to have your cake and eat it too by saying "fine, I'll just select i32 or i64 at compile time with a condition" and you add some alias, like "word" -> either i32 or i64, "half word" -> either i16 or i32, etc depending on the target CPU, then congrats, you've just reinvented 'int', 'short', 'long', et.al.
Personally, I'm finding it useful to use fixed integer sizes (e.g. int32_t) when writing and reading binary files, to be able to know how many bytes of data to read when loading the file, but once those values are read, I cast them to (int) so that the rest of the program can use the values optimally regardless of the CPU the program is running on.
That explains "int", but it doesn't explain short or long or long long. Rust has "usize" for the "int" case, and then fixed sizes for everything else, which works much better. If you want portable software, it's usually more important to know how many bits you have available for your calculation than it is to know how efficiently that calculation will happen.
I suppose short and long have to do with register sizes being available as half word and dword, and there are instructions that work with smaller data sizes on both x86 and ARM, but I agree that in today's world, you want to know the number of bits. On those weak 4MHz machines, squeezing a few extra cycles was typically very important.
That is what was suggested in the next paragraph.
> But a better solution might just be to switch to Rust types, where i32 is a 32-bit, signed integer, while u128 would be unsigned and 128 bits. This convention is close to what the kernel uses already internally, though a switch from "s" to "i" for signed types would be necessary. Rust has all the types we need, he said, it would be best to just switch to them.
That's pretty much the mentioned proposal of "just use rust types", which are i16/u16 to i128/u128, plus usize/isize for pointer-sized things.
The only improvement that you really need over that is to differentiate between what c calls size_t and uintptr_t: the size of the largest possible array, and the size of a pointer. On "normal" architectures they're the same, but on architectures that do pointer tagging or segmented memory a pointer might be bigger than the biggest possible array.
But you still have to deal with legacy C code, and C was dreamt up when running code written for 16 bits on a 14 bit architecture without losing speed was a consideration, so the C type's are weird.
stdint.h has been around far longer than Rust.
I've been using those since the 00s for bit banging code where I need guarantees for where each bit goes.
Nothing quite like working with a micro processor with 12bit words to make you appreciate 2^n addresses.
C dates back to a time when the 8 bit byte didn’t have 100% market share.
Plus, it was a language to write systems, where "size of register on current machine" was a nice shortcut for "int", where registers could be anywhere from 8-32 bits, with 48 or 12 also a possibility.
I have a couple of 12-bit machines upstairs. There were also 36-bit systems once upon a time.
Except that’s not been true in a while, and technically this assumptions was not kosher for even longer: C itself only guarantees that int is 16 bits.
C still (I think, C23 may have finally killed support*) supports architectures like clearpath mainframes which have a 36 bit word, 9 bit byte, 36 (IIRC) bit data pointer and a 81 bit function pointer.
The changes to the arithmetic rules mean you can't have sign-magnitude or 1s complement anymore IIRC
C99 specifies stdint.h/inttypes.h as part of the standard library for exactly this purpose. I'd expect using it would be a best practice at this point. But I'm no C expert, so maybe there's a good reason for not always using those explicitly sized types.
Windows has macros for that kind of stuff, and only in C99 the stdint header came to be.
So you had almost three decades with everyone coming up with their own solution.
To be fair, the other languages were hardly any better than C in this regard.
I think that's because of portability. So that the common types just map to the correct size on a given system.
What's the average lifespan of a line of kernel code? I imagine by starting this project 12 years before its anticipated use case they can get very far just by requiring that any new code is 128-bit compatible (in addition to doing the broader infrastructure changes needed like fixing the syscall ABI)
> What's the average lifespan of a line of kernel code?
There's a fun tool called "Git of Theseus" which can answer this question! You can see some graphs of Linux code on the web page: https://github.com/erikbern/git-of-theseus
Named after the Ship of Theseus: https://en.wikipedia.org/wiki/Ship_of_Theseus
There're some more in the presentation article: https://erikbern.com/2016/12/05/the-half-life-of-code.html#:...
A (Linux) kernel line has half-life 6.6 years. The highest of the projects analyzed. The lowest went to Angular with half-life 0.32 years.
Not sure about those 12 years - 128-bit registers are already there, and CHERI Morello prototype is at a “physical silicon using this functionality under CheriBSD” stage.
There was a post awhile back from NASA saying how many digits of Pi they actually need .
I'm sure I got something wrong here, def off-by-one, but roughly it looks like it would need 1209-bit floats (2048-bit rounded up!). IDK, mildly interesting. :>
import math pi = 3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724587006606315588174881520920962829254091715364367892590360 sign_bits = 1 sig_bits = math.ceil(math.log2(pi)) exp_bits = math.floor(math.log2(sig_bits)) assert sign_bits + sig_bits + exp_bits == 1209
The value of Pi you mention was the one in the question, the one in the answer is:
> For JPL's highest accuracy calculations, which are for interplanetary navigation, we use 3.141592653589793. Let's look at this a little more closely to understand why we don't use more decimal places. I think we can even see that there are no physically realistic calculations scientists ever perform for which it is necessary to include nearly as many decimal points as you present.
That's sixteen digits, so a quick trip to the dev tools tels me::
The last statement of the text I quoted is more interesting though. Although not surprising to me, given how many astronomers I know who joke that Pi equals three all the time.
>> Math.log2(3141592653589793) -> 51.480417552782754
I have horrid memories of debugging I had to do to get some god-awful fourier transform to calculate with 15 digits of precision to fit a spec. It's right at the boundary where double-precision stops being deterministic. Worst debugging week of my life.
> stops being deterministic
I'm imagining the maths equivalent of Heisenbugs, is that correct?
No, just having to match how Matlab did the calculation (development of an index) to implementing the same thing in C++ (necessitating showing the calculation returned same significant digits for the precision we expected). I've seen a Heisenbug and that was really weird. Happened during uni so I didn't have to start tracing down compiler bugs. Not even sure if I could, happened with Java.
Lol I should RTFA ;D
Nah, just claim you were invoking Cunningham's Law ;)
Pi is a bit special because in order to get accurate argument reduction for trigonometric functions you needs lots of digits (IIRC ~1000 for double precision).
The size of required data types is mostly orthogonal to the size of memory addresses or filesystem offsets.
I recall sitting in a packed room with over a hundred devs at the 2004 Ottawa Linux Symposium while the topic of the number of filesystem bits was being discussed (link: https://www.linux.com/news/ottawa-linux-symposium-day-2/). I recall people throwing out questions as to why we weren't just jumping to 128 or 256 bits, and at one point someone blurted out something about 1024 bits. Someone then made a comment about the number of atoms in the universe, everyone chuckled, and the discussion moved on. I sensed the feeling in the room was that any talk of 128 bits or more was simply ridiculous. Mind you this was for storage.
Fast-forward 18 years, and it's fascinating to me to see people now seriously floating the proposal to support 256-bit pointers.
The difference between "number of atoms in the universe" and "number of possible states of a system" are vastly different. The latter is a combinatorial problem, and if you're trying to to track the possible combinations of 100 variables that can take on 10 states each, you've got 10^100 combinations and are already beyond atoms in the universe (10^80). You can never enumerate them all, but the ability to work on large subspaces would be a help.
Sounds nuts. Does anyone know how much power a 32GB DIMM draws? How much would a fully populated 64-bit address space therefore pull?
Edit, if a 4GB (32-bits used) DRAMM pulls 1 watt, the rest of the memory space is 32 bit = 4E9 so your memory is pulling ~4Gwatts alone. That's not supportable, given the other electronics needed to go around it.
this seems just a bit too early - so that probably means it’s exactly the right time!
I was wondering if this (128bit memory) are on the radar of any of the BSDs. Will they forever be stuck at 64bit?
CheriBSD might be the first unix-like with 128 bit pointers
Maybe we can finally fix maxint_t to be the largest integer type again.
"The problem now is that there is no 64-bit type in the mix. One solution might be to "ask the compiler folks" to provide a __int64_t type. But a better solution might just be to switch to Rust types, where i32 is a 32-bit, signed integer, while u128 would be unsigned and 128 bits. This convention is close to what the kernel uses already internally, though a switch from "s" to "i" for signed types would be necessary. Rust has all the types we need, he said, it would be best to just switch to them."
Does anybody know why they don't use the existing fixed size integer types  from C99 ie uint64_t etc and define a 128 bit wide type on top of that (which will also be there in C23 IIRC)?
My own kernel dev experience is pretty rusty at this point (pun intended), but in the last decade of writing cross platform (desktop, mobile) userland C++ code I advocated exclusively for using fixed width types (std::uint32_t etc) as well as constants (UINT32_MAX etc).
This question came up in the discussion at the conference. There are several reasons:
1. The format specifiers for those types differ from the ones currently in use for the kernel. Nobody uses or wants to use PRId64 and PRIu64. If they did, they'd need to change thousands of occurrences (which is admittedly not too hard with tools like Coccinelle).
2. The stdint.h types are just typedefs (it's not like the compiler understands them intrinsically). And they're defined in terms of short, int, long, etc. The sizes of these are platform and compiler dependent. On the other hand, the Linux kernel tells the compiler "you must make long 64 bits if you are a 64 bit system" which may conflict with the defaults and throw those types out of whack. (at least, that's how I interpreted one of the points in the room, I could be wrong here)
3. Kernel is already using u32/s32, it's a smaller and easier change to go to u32/i32 in Rust style. People tend to find the stdint.h names to be verbose.
4. stdint.h brings in additional headers. I'm not sure if that's a problem or how - maybe in the Linux uapi headers?
There seemed to be general concensus in the room that stdint.h wasn't used for good reasons, so if these don't sound right, it's probably because I misinterpreted them.
Awesome summary, thanks and all good points.
On a bit tangential note, RAM price for a given cost used to increase exponentially until the 2010s or so.
Since then, it only roughly halved. What happened?
I know it's not process geometry, since we went from 45nm->5nm in the time, a roughly 81x decrease.
Is is realistic to assume scaling will resume?
We decided to slow down giving programmers excuses to make chat applications as heavy as web browsers
Just wanted to say I love this discussion. Have been pondering the need for a 128-bit OS for decades but several of the issues raised were completely novel to me. Fantastic to have so many people so much smarter than I am hash it out informally here. Feels like a master class.
Pointers will get fat (CHERI and other tagged pointer schemes) well before even server users will need to byte-address into more than 2^64 bytes' worth of stuff. So we should probably be realistically aiming for 256-bit architectures...
If we're going to go for 128- why not just go for 256-? that way we won't have to do this again for a while.
or better yet, design a new abstraction for not having to hard-code the limit of the pointer size but instead allow it to be extensible as more addressable space becomes a reality, instead of having to transition over and over. is this even possible? if it is, shouldn't we head in that direction?
The problem with variable-sized pointers is that...
1. Any abstraction you could make will have worse performance than a fixed-size machine pointer
2. In order to support any kind of variably-sized type you need machine pointers to begin with, and those will always be fixed-size because variable size is even harder to support in hardware than native code
There's two forces at play here: pointers need to be small enough to embed in any data structure and large enough to address the entire working set of the program. Larger pointers are not inherently better, and neither are smaller pointers. It's a balancing act.
 ASLR, PAC, and CHERI are exceptions, as mentioned in the original article.
I would think by now any bunch of clever people would be trying to fix the generalized problem of supporting n-bit memory addressing instead of continually solving the single problem of "how do we go from n*2 to (n+1)*2". I guess it's more practical to just let the next generation of kernel maintainers go through all of this hullabaloo again in 2090.
Is 128 bit the limit of what we would need? We use 128 bit UUIDs. 2^256 seems to be more than the number of atoms on Earth.
The article does talk about just making the pointer type used in syscalls 256 bit wide to "give room for any surprising future needs".
The size of large networked disk arrays will grow beyond 64 bit addresses, but I don't think we will exceed 2^128 bits of storage of any size, for any practical application. Then again, there's probably people who thought the same about 32 bit addresses when we moved from 16bit to 32bit addresses.
The most likely case for "giant" pointers (more than 128 bits) will be adding more metadata into the pointer. With time we might find enough use cases that are worth it to go to 256bit pointers, with 96bit address and 160 bit metadata or something like that.
>Then again, there's probably people who thought the same about 32 bit addresses when we moved from 16bit to 32bit addresses.
There's a fun "quote" about 384k being all anyone would ever need, so clearly everyone just needs to settle down and figure out how to refactor their code.
The IBM PC's 20-bit addressing was 16 times the size of 16-bit addresses. From 20-bit to 32-bit, 4096 times larger. 32 to 64 is 4,294,967,296 times larger (!). The scale alone makes using all this space unlikely on a PC.
Are there operations that vector processors are inherently worse at or much harder to program for? Nowadays they seem to be mainly used for specialized tasks like graphics and machine learning accelerators, but given the expansion of SIMD instruction sets, are general purpose vector CPUs in the pipeline anywhere?
And do this again in 2050 to get to 256-bit computing? And then 512-bit around 2070?
Can we not play it save and immediately jump to for 65536-bit :)
We could call it 16-bit-bit.
Perhaps it's an idea to make Linux parameterized in the pointer/word size, and let the compiler figure it out in the future.
I think the article is very shortsighted. By 2035-40 we'll probably have memory only (RAM) computers massively available. No disks means no current OS capable of handling these computers. A change of paradigm needing new platforms and OSes.
These future OSes may be 128bit, but I don't think the current ones will make it to the transition.
There are plenty of OSes today capable of booting and running from RAM. Pretty sure we wouldn't be burning all the prominent OSes for something like that.
The "In my shoes?" bit was hilarious
I wonder if with 128-bit wide pointers it would make sense to start using early-lisp-style tagged pointers.
As long as the posting of subscriber links in places like this is occasional, I believe it serves as good marketing for LWN - indeed, every now and then, I even do it myself. We just hope that people realize that we run nine feature articles every week, all of which are instantly accessible to LWN subscribers.
-- Jonathan Corbet, LWN founder & and grumpy editor in chief
Multiple other approvals: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...>
Jon's own submissions: <https://news.ycombinator.com/submitted?id=corbet>
And if we look for SubscriberLink submissions with significant (>20 comments) discussion ... they're showing up every few weeks, largely as Jon had requested.
That said, those who are able to comfortably subscribe and find this information useful: please do support the site through subscriptions.
Whether the posting of subscriber links is “occasional” as of late is debatable. Most of LWN’s paywalled content is posted on HN.
jaimehrubiks stated unequivocally without substantiation that "Somebody asked before to please not share lwn's SubscriberLinks". LWN's founder & editor has repeatedly stated otherwise, hasn't criticised the practice, and participates in the practice himself, as recently as three months ago.
SubscriberLinks are tracked by the LWN account sharing them. Abuse can be managed through LWN directly should that become an issue. Whether or not that's occurred in the past I've no idea, but the capability still exists and is permitted.
No link substantiating jamiehrubiks' assertion seems to have been supplied yet.
I'm going to take Corbet's authority on this.
Corbet repeatedly used the word “occasionally,” sometimes even with emphasis.
What I’m saying is that the current situation is that most of the for-pay content of LWN is available on HN which is at odds either with his wish that it be occasional or with my understanding of English.
Most of those submissions die in the queue.
I'd set a 20-comment limit to the search I presented for a reason. At present, the 30 results shown go back over 7 months. That's roughly a significant submission per week.
Contrasting a search for "lwn.net" alone in submissions, the first page of results (sorted by date, again, 30 results) only goes back 3 weeks (22 days). But most of those get little activity --- some upvotes, and a few with many comments, but, in a third search sorted by popularity over the past month,
Ten of those meet or beat my 20-comment threshold, 20 don't. And note that 20 comments isn't especially significant, 4 submissions exceed 100 comments.
lwn SubscriberLink & > 20 comments, by date: <https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...>
All "lwn.net" for past month: <https://hn.algolia.com/?dateRange=pastMonth&page=0&prefix=tr...>
Comments: 189 94 155 254 153 29 46 10 14 20 13 12 89 10 21 1 8 0 1 1 0 2 0 0 0 0 2 1 1 0
Points: 306 271 254 240 166 114 109 89 62 58 53 45 42 39 37 30 20 7 5 5 5 4 4 4 4 4 3 3 3 3
I'm not saying that the concern doesn't exist. But ultimately, it's LWN's to address. The constant admonishments to not share links seem to fall into tangential annoyances and generic tangents, both against HN guidelines: <https://news.ycombinator.com/newsguidelines.html>
I'd suggest leaving this to Corbet and dang.
Why are you posting GPT3 responses here?
Hi GPT3, I didn't expect to see you here.