Swap on HDD: Does placement matter? (vidarholen.net)
92 points by ingve on Sept 7, 2021 | 92 comments



I believe this was kind of obvious 15-25 years ago [1]. It was in THE basic tutorial [2]. Those were simpler days. It was hard to build something big by yourself. Now it's easier, but now we are just learning APIs to APIs that provision our hardware and software :) So much current dev and ops knowledge will be useless in a few years, yet I could easily use a book from the 1970s that was recommended here one day to learn and use some basic AWK nowadays.

[1] Example from 2007 https://www.linuxquestions.org/questions/debian-26/debian-in... [2] Example from around 1997 https://tldp.org/HOWTO/html_single/Partition/#SwapSize


Yes, this was common knowledge, yet back in the day most distro setups still put swap at the end by default for reasons unknown to me. Apart from the speed issue, that also made moving an existing installation to a larger disk more complicated, since you couldn't just resize the OS partition; you had to delete and then recreate swap.


> put swap at the end by default for reasons unknown to me

If you assume that swap is a crutch that ideally won't be used, or that if it is used it is either for a short period only (due to a temporary over-allocation) or for pages that are very rarely (if ever) used again (chunks of code & data that get loaded but that only certain configurations ever touch again), then you want to keep the fastest part of the drive for things that are going to be accessed regularly (your root partition for instance) in normal operation. For the occasional write & read of swap it makes little difference, and once you are properly thrashing pages to & from swap the time cost of head movements completely dwarfs any difference made by the actual location of the swap area (the heads will be spending most of their time in/near it anyway in such circumstances).

If you were relying on swap for general operations because the amount of RAM you'd need otherwise was just far too expensive, then you have a workload that warrants custom partitioning, to put it somewhere other than the end, or ideally on another drive if you could afford a second.

If speed is an issue then you want it near the most commonly accessed data. Back when I used to have to think about these things much at all my general default arrangement was “boot, LVM” and within LVM “root, var, swap, homes, other data”. Swap being in the middle makes resizing in-place something I wouldn't generally consider, but if I needed more temporarily the extra would be created as a swap file (with lower priority than the partition) instead and/or better on a different drive (with higher priority, moving the main swapping load off the system drive).
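For reference, a rough sketch of the temporary-swap-file part (path, size and priority number are just placeholders; higher number = higher priority, so the partition would keep a higher pri= in fstab):

    # temporary 2 GB swap file at a lower priority than the swap partition
    dd if=/dev/zero of=/extra.swap bs=1M count=2048
    chmod 600 /extra.swap
    mkswap /extra.swap
    swapon -p 5 /extra.swap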

Another, though less commonly useful, reason might be that it is easier to resize that way: if you need more, then shrink the filesystem and add an extra swap area in the newly freed space.

> that also made moving an existing installation to a larger disk more complicated, since you couldn't just resize the os partition, you had to delete and then recreate swap

That isn't really a significant issue though; you shouldn't need swap while performing that operation (unless you are somehow moving the root filesystem around live), so stopping swap isn't going to be a problem (and a user capable of safely performing such a move at all will be able to handle the three extra commands needed). Assuming that you move everything first then resize, my preference would instead be to move and resize individual partitions, so swap doesn't need to be moved and resized at all.


> If speed is an issue then you want it near the most commonly accessed data.

Yes. You expect the seek time to dominate performance.

The reason that the swap was faster when placed at the beginning is likely because the filesystem is mostly empty and so the allocated portion is at the beginning of the partition.

If the filesystem was near capacity and the files are distributed throughout, then you would expect the performance of the swap at the end and the swap at the beginning to start to converge.


They're talking about a swap partition, not a swap file. Filesystem allocation patterns are irrelevant for this.


Filesystem allocation patterns are relevant: one of the components of seek time is how far the heads have to travel. If most of the data is towards the front of the drive and your swap partition is towards the front of the drive, then the head will need to move less to get to the swap partition. If the data is towards the front and the partition is near the end, then you would generally need to wait longer for the head to move.


Yes. Thanks for explaining.


I thought there was some idea that you wanted core os/app data near the center since you would always be using that.


Yes, that was the argument I remember reading. You put "system" stuff first, then /home if you did a separate partition for it, etc. Swap last because "hopefully you won't be swapping much anyway".

I also (vaguely) remember some people putting build partitions closer to the front.


I think we're also talking about the days when a machine that was swapping extensively was going to be stupidly slow no matter what you did.


Those days never ended.


I put my swap on a nice NVMe drive and...

still avoid hitting that thing at any cost. Memory is pretty quick stuff.


Back in the day you couldn't "just" resize a partition either. At minimum you would need to copy all the data somewhere, recreate the partition, reformat the filesystem, copy the data back. You might need to do this with other partitions also to make room, if you didn't leave any gaps to start with.


I checked the man page for resize2fs and the copyright notice is from 1998, so I guess that even back in the day it was possible to grow ext2 filesystems. Shrinking them I don't know; it's still a feature that not all filesystems support to this day.

If you think about it, extending a filesystem is pretty easy: you just have to write in the filesystem control structures that you have more blocks available to store data than originally planned. The problem of course is shrinking, since you have to relocate the blocks that go beyond the new partition size.


True, maybe not back in the day, but ~10 years ago I still used HDDs in some machines and resizing was definitely possible and reliable.


Me too. I remember using GParted back in the Ubuntu Ibex days (looking at the year, 2008ish). Back then the OS usually came with the gparted package, or it was available via the package manager.


Some OSes could do that, yes. Rare.


Well, with modern NVMe and SSD, "where on the disk is my swap file" begins to matter less. Even at my workplace, any VM needing swap has its OS disk put on NVMe/SSD, simply because having the user think even a second about this isn't worth the time. On NVMe/SSD the placement simply doesn't matter; access time no longer depends on position.


But then it becomes a question of "Do I want to put swap on this drive at all?" I don't know the endurance of modern NVMe drives, but if you can write 1 PB before wearing out the drive, then at a sustained 100 MB/sec you can wear it out in less than 4 months if you let your system run under heavy swap.
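Back-of-the-envelope for that figure:

    1 PB / 100 MB/s = 10^15 B / 10^8 B/s = 10^7 s ≈ 116 days (just under 4 months)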

Probably not an issue for a desktop since no one would want to use it under heavy swap all the time, but for a server no one pays much attention to... maybe.


I don't create swap space on servers anymore. If I run out of RAM, I'm likely dealing with something that's out of control and I'm going to run out of swap also, it just delays the inevitable.


A small (512 MB) swap partition gives you enough runway to warn on 25% use, alert on 50% use, and address some problems without the fun of abrupt shutdowns when allocations fail (or the OOM killer shows up). Monitoring for high swap I/O makes some sense, but 512 MB fills up fast, so chances are it'll fill up before anyone can respond to an alert in that case.

At least in my experience, it's pretty hard to actually gauge memory use, but swap use makes a reasonable gauge most of the time. There are certainly many use cases where the swap use ends up not being a useful gauge though.
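For what it's worth, a minimal sketch of the kind of check such an alert could hang off (the actual thresholds would live in whatever monitoring system is in use):

    # percent of swap currently in use, from /proc/meminfo
    awk '/SwapTotal/{t=$2} /SwapFree/{f=$2} END{if (t) printf "%d%%\n", (t-f)*100/t}' /proc/meminfo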


It's been a bit of a complicated problem, but memory pressure metrics are now a thing: https://facebookmicrosites.github.io/cgroup2/docs/pressure-m...
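For anyone who hasn't seen it, the interface is just a file under /proc (kernel 4.20+ with PSI enabled); the numbers below are only illustrative:

    $ cat /proc/pressure/memory
    some avg10=0.00 avg60=0.00 avg300=0.12 total=11220
    full avg10=0.00 avg60=0.00 avg300=0.05 total=9425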


All of the servers I manage now are cloud servers, and swapping to attached storage is slow. I don't really want random processes killed by the OOM killer, leaving the server in an unknown state... so I set the servers to panic on OOM.
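In case anyone wants the same behaviour, a sketch of the relevant sysctls (the 10-second delay is just an example):

    # /etc/sysctl.d/90-oom.conf
    vm.panic_on_oom = 1
    kernel.panic = 10    # reboot 10 seconds after the panic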


As someone who has servers with swap on NVMe: it barely matters. Sustained swap thrashing is a bad scenario no matter how you put it and will just tank performance. Get more RAM. Swap I/O should never have any sustained background level; ideally it should only spike every few minutes or so and remain at a low level or zero otherwise.

Swap on SSD or NVMe is still miles better than on HDD; you can notice the difference when the swap is being used.


But that assumes that someone notices the swap -- when I was new at a former job, I asked why the drive activity light was always on on the server marked "finance". The answer was "Who knows!? That's some special software that finance uses; when it gets slow they tell us and we reboot it". It had been like that for more than a year.

Turns out that the app grew huge over time and the machine would swap like crazy and would eventually slow to a crawl. The machine was already maxed out on RAM, so we added a service to restart the app twice a week. Finance said it took hours off their month-end work, they thought the app was just slow.


You can monitor swap usage; in htop you can turn on the SWAP, PERCENT_SWAP_DELAY and M_SWAP columns, telling you exactly how much of a process is in swap, how large that is and the delay the process experiences due to swap.

You can also monitor swapping activity in iotop. If need be, third-party tools can be written for this too; the interfaces are exposed by the kernel after all.

Oh and you can use the modern PSI monitoring of the kernel to measure how much pressure a subsystem is experiencing, so you can restart services way before you'd even notice the swapping on other tools.
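If you just want a quick look without setting anything up, something like this works too (si/so are pages swapped in/out per second):

    vmstat 1        # watch the si/so columns
    iotop -oP       # per-process I/O, includes a SWAPIN column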


Yes, you can monitor a lot of things, but whether everyone does is a different question.


People should if they are interested in performance. I log about 1TB/day of metrics data for the applications I'm responsible for and developed a mantra for it: it's better to have a shitton of metrics logging and not need it than to have no metrics logging and have to set it up when everything is broken already.


Isn't it intuitively obvious, though? At the beginning of the disk, the linear velocity under the head is much higher than at the end of the disk. It stands to reason you should want your swap file to be where it can be most quickly accessed, and that higher linear velocity translates directly into higher transfer rates.


> Isn't it intuitively obvious, though? At the beginning of the disk

Where's the intuitive start or end of the disk? I knew the answer was the tracks furthest from the center. Whether that was the beginning or end, I couldn't tell you.


Not really. The physical disks most people used in that era were CDs and DVDs, both of which use constant linear velocity (CLV) recording.

Which means that CDs and DVDs are always read at the same speed, no matter where the laser / read head is.

Only those who really worked with hard drives noticed the speed increase at the outer edge.


> CDs and DVDs are always read at the same speed, no matter where the laser / read head is.

This is only true for "slow" drives, CD drives faster than 12x typically use CAV and DVD drives >= 8x use CAV or Z-CLV (sometimes P-CAV).


Later optical drives employed constant speed spindles.


This is due to platter geometry... the "start" of the logical volume is at the outer edge of the platters, and the end is at the inner edge.

If you divide the platter into concentric circles of equal width, you will notice there is more area available on the outer circles... for this reason the number of sectors per track is greater the further the track is from the centre of the platter. Yet the head will pass over the entire track in the same amount of time... i.e. more data in the same time.

It makes sense that the logical volume would be arranged from the outer edge to take advantage of the speed as soon as possible.


> the "start" of the logical volume is at the outer edge of the platters,

Is this something like a natural law, or a de facto standard? Could the start be at the innermost edge on an HDD from another planet?


It's pretty universal, as inevitable as Pythagoras' theorem, even on an alien planet where everything else arbitrary and historical about the technology is different: the geometry dictates that this optimisation will eventually be made (zone bit recording) [0]. Once the optimisation is made, it's also inevitable that the logical layout will start from the outer cylinder to provide the best sequential performance at the beginning of its use.

Older HDDs did not make this optimization, they had a constant number of sectors per cylinder (track), and they also didn't come with HDD controllers, or came with more minimal controllers, and much like old floppy drives they exposed a lot of the physical layout and properties to the host system. This is why the CHS format is a historical part of OS and partitioning, modern HDDs only expose LBA.

Anyway, my point is that with the older drives that lacked ZBR there is no obvious "natural law" dictating that you shouldn't create a logical layout from the inner edge - and since they exposed CHS to the OS, I wonder if it was possible to choose the layout direction.

[0] https://en.wikipedia.org/wiki/Zone_bit_recording

[edit]

I'm trying way too hard to entertain your idea, but now that I've thought it I gotta write it: the one way that ZBR could exist at the same time as it making sense to start the logical volume at the inner track (or more correctly, making no difference), is if the angular velocity was not constant... modern HDDs have a constant angular velocity (constant RPM), and assuming the head's maximum read capability is the linear speed at the outer cylinder, then in theory the drive could spin faster when the head is closer to the center to provide the same linear speed, at which point the sequential performance is almost constant (cylinder sector density is segmented as the area passes a threshold, so there will be a subtle periodic difference).

This is how CDs work (ignoring the fact that they are a spiral), they have a constant linear speed rather than a constant angular speed and they start from the center, so the motor has to continuously change RPM to maintain it.

I suspect the reason this is not used is that HDDs spin much faster than CDs, and rather than single spiral track there are cylinders that the head has to move between, the head can move between cylinders much faster than the drive motor can accurately make the subtle changes required to achieve a constant linear speed.

The end result is probably very poor seek performance (i.e the seek performance would be limited by the motor rather than the head - which is very fast)... I don't know much about CDs but I suspect they also have poor seek performance, but by choice - that may have more to do with being one giant spiral track rather than motor performance.


Thanks. Looks reasonable.


> for this reason the number of sectors per track are greater the further the track is from the centre of the platter. Yet the head will pass over the entire track in the same amount of time...

an additional consequence is that the same amount of data takes a smaller number of tracks, thus making for faster/shorter seeks inside that data.


In traditional BSD unixes, swap was always the second partition, very close to the front of the disk, and just behind a small root/boot partition (which was first, presumably due to bootstrapping needs).

Looks like the old timers knew what they were doing :)


I still do that with my Linux boxes. A small /boot goes first, followed by swap, /, and finally the partition that will contain the majority of the data (usually /var or /home).

This arrangement has advantages even in the age of VMs and SSDs. If I want to change to a larger disk or array (or resize the virtual disk), I can simply extend the last partition where the extra space is most likely to be needed. If swap was last, it would get in the way. On the other hand, if I needed more swap, I could just add a swapfile somewhere.


I have the UEFI/boot partition followed by an LVM PV. If it's a VM, data disks just get the whole disk, though I still usually set up LVM because it enables stuff like live storage migrations. I've actually had to do those more than once in production; one migration involved moving several terabytes of data used by a hardware server from an aging SAN onto physical disks and iSCSI. It required no downtime.

I haven't really had to worry about partitioning on any Linux machine I manage for over a decade thanks to LVM; I just create volumes based on what makes sense for the applications hosted on the servers.


I've always assumed this to be the proper way to do things, and it's still what I do when I'm asked to configure a new VM. If it ain't broke, don't fix it.


swapoff + fdisk + mkswap + swapon takes two minutes tops. I much prefer that to partition "fragmentation".
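For the curious, roughly this (device names are placeholders; mkswap gives the area a new UUID, so /etc/fstab may need updating):

    swapoff /dev/sda2      # stop using the swap partition
    fdisk /dev/sda         # delete and recreate it with the new size
    mkswap /dev/sda2
    swapon /dev/sda2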


With spinning rust the ideal usage pattern is sequential: like writing a large file from the start of a disk into contiguous sectors (or reading that file back).

One of the things that screws up HDD performance much worse than placement of files on disk is randomness in the usage pattern. The mechanical nature of an HDD means that when you read and write lots of small files in different sectors, the head spends more time moving around than reading or writing. Back when we used to defragment Windows filesystems, we were doing a bunch of up-front disk optimization to organise files into contiguous chunks so they could be read back quickly when needed.

The biggest problem I have seen with these situations is that you don't have direct control over the order of operations that the disk will be asked to perform. You think that because your file is written contiguously that it will be read that way. But depending on how busy the system is, that might not be the case. Where many processes are contending for disk access, and especially when the kernel is doing a lot of swapping to the same device, that head might be racing back and forth regardless of your file placement, and your disk performance goes straight into the toilet.


One of the reasons that you'd put /var on another disk back in the old days. Or /home, or wherever your web server stored its files, or your mail files...


You still do defragment HDDs today


Modern file systems do not need defragmenting. It was something that was only really done with FAT.


Modern file systems are better at _avoiding_ fragmentation than FAT was, but they are not immune to it.


They do, it's just that Windows runs defrag automatically as a background task nowadays so you don't notice.


Are you trolling? NTFS.


That is completely false.

Typically I saw 30% to 100% performance improvements on ext4 by deleting and restoring database directories.

You can see disk fragmentation on Linux with filefrag and other commands.


ext4 also has a defrag tool - e4defrag (8)
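Typical usage, for anyone who hasn't run them (the path is just an example):

    filefrag -v /var/lib/mysql/ibdata1    # list a file's extents
    e4defrag -c /var                      # -c only reports a fragmentation score, changes nothing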


The term you are looking for is "short stroking", and it has been around for a long time. Before SSDs got cheap enough, it was occasionally used where it was worth the cost of only using 25% or less of the drive's capacity.


Nice write up. I ditched swap partitions a few years ago; my system (home computer) basically never swapped. At the time I was digging myself out of a too-small /boot (a once-recommended size) and figured that one big partition with a swap file gave the most flexibility.

So, is there an efficient way to leverage the speed improvement for something other than swap -- like binary caching of executables of some form?


The system uses swap to hold less frequently used pages; the freed RAM can be used for caching or, more generally, for pages that are used more frequently. So adding swap indirectly increases the memory available for caching.

I don't know the details of (Linux) caching, though. On my (32 GB) system, there are a few completely unused GB, it seems.


> So, is there an efficient way to leverage the speed improvement for something other than swap -- like binary caching of executables of some form?

Sure. Like the site I mentioned in another comment, more or less "outer tracks are faster". But this applies just to HDD drives. It's mostly useless for modern infrastructure, like SANs (even HDD based), all kinds of SSDs and so on.

A curiosity - the last time I saw a cool optimization of HDD usage was on "old gen" consoles, like the PS4 and Xbox One. Most games duplicated assets multiple times. Games took many more GB than needed, but the drive did not have to jump between HDD tracks so much, and that mattered, for instance, in big open world games.


The most surprising thing about the result is that there isn't an order-of-magnitude jump between SATA SSD and any sort of HDD, as you would expect with random read/write workloads typical of swap thrashing. Instead, the chart looks as if it is mostly measuring sequential read/write performance. HDDs have long been known to be faster on one end than the other in sequential benchmarks.

This could be an artifact of the particular kind of workload that the author used. Maybe it causes large numbers of adjacent blocks to be swapped in and out at the same time?


Author here. In all cases, most access is still RAM. The storage is only hit to stash or load overflowing pages.

I originally ran the benchmark with 1GB RAM instead of the final 2GB, but the start-of-disk test did not finish in the 9 hours I let it run. With 0GB, I don't doubt that you'd see the expected 1,000,000x latency difference between disk and DRAM.


    165s (2:45) — RAM only
    451s (7:31) — NVMe SSD
Good argument for when the uninformed state that "NVMe might as well be RAM"


I mean that’s really close. I always thought of RAM as multiple orders of magnitude faster than disk. Within 3x of speed is pretty excellent.

(though, I guess this doesn’t give us any latency info, just throughput. I’d expect RAM latency to still be faster)


Author here. Keep in mind that most access in the swapping case is still RAM, so we can't just say that there's a 3x difference between DRAM and NVMe flash.

I originally tried running the test with only 1GB RAM, but killed the job after 9 hours of churning.


I would not take this benchmark to draw general conclusions.

The spinning disk result is only 10x slower than RAM. But a spinning disk's throughput is 100-1000x less than current RAM, and for latency it's even worse.

Similarly, the other ratios in the benchmark graph are way off from the corresponding hardware ratios.

This benchmark is measuring how one specific program (the Haskell Compiler compiling ShellCheck) scales with faster memory, and the answer is "not very well".


The overwhelming majority of access would still happen in the 2GB RAM the benchmark has. The disk is only hit to stash or load overflowing pages, not on every memory access. That's why it doesn't mirror the hardware difference between DRAM and disk.


That makes sense, thanks!


Generally, in terms of transfer speed, NVMe is damn close. Latency is where it hits you, because NVMe doesn't have nearly as low latency and doesn't offer latency guarantees at the 99th percentile.

If your ops aren't latency sensitive, then NVMe might as well be RAM; if they are latency sensitive, then NVMe is not RAM (yet)


Isn't it about ~2GB/s vs ~20GB/s? It's really impressive but still an order of magnitude.


A modern NVMe on PCIe 4.0 can deliver up to 5GB/s, which is only 4 times slower. You can get faster by using RAIDs and I believe some enterprise class stuff can get a bit faster still at the expense of disk space. PCIe 4.0 would top out at 8GB/s, so for faster you'll need PCIe 5.0 (soon).


RAM bandwidth scales with the number of DIMMs used, e.g. a current AMD EPYC machine can do 220 GB/s with 16 DIMMs per the spec sheet.

How well does NVMe scale to multiple devices, that is, how many GB/s can you practically get today out of a server packed with NVMe until you hit a bottleneck (e.g. running out of PCIe lanes)?


An AMD Epyc can have 128 PCIe 4.0 lanes at roughly 2 GB/s each (about 8 GB/s per x4 NVMe device), so it tops out at roughly 250 GB/s of total bandwidth. And you can in fact saturate that with the bigger Epycs. However, you will probably lose 4 lanes to your chipset and local disk setup, maybe some more depending on server setup, but it'll remain close to that figure.


I tested this on my own system somewhat recently, with a Ryzen 5950X, 64 GB of 3600 MHz CL 18 RAM and a 1TB Samsung 970 Evo, using the config file that ships with Fedora 33.

I created a ramdisk as follows:

    ~$ sudo mount -t tmpfs -o size=32g tmpfs ~/ramdisk/
    ~$ cp -r Downloads/linux-5.14-rc3 ramdisk/
    ~/ramdisk$ cp /boot/config-5.13.5-100.fc33.x86_64 linux-5.14-rc3/.config
    ~/ramdisk$ cd linux-5.14-rc3/
My compiler invocation was:

    ~/ramdisk/linux-5.14-rc3$ time make -j 32
And got the following results:

    Kernel: arch/x86/boot/bzImage is ready  (#3)

    real 6m2.575s
    user 143m42.402s
    sys  21m8.122s

When I compiled straight from the SSD I got a surprisingly similar number:

    Kernel: arch/x86/boot/bzImage is ready  (#1)

    real 6m23.194s
    user 154m24.760s
    sys  23m26.304s
I drew the conclusion that for compiling Linux, NVMe might as well be RAM, though if I did something wrong I'd be happy to hear about it!


Question not related to the article:

does anybody have hints or a link to some page explaining how to set up Linux so that it uses swap reaaally only if there is almost no free RAM available?

I have a few private servers & VMs, all having swap enabled, and all start using swap if I do a lot of I/O even if I have e.g. more than 20GBs free out of 36 being available. Usually swap is not being used just after having booted the server or VM, but after a few hours or days of doing reads & writes to disk the kernel will start writing stuff to swap - it's very little (few KBs being written every few seconds), but that accumulates and after a few days I end up having GBs of swap used.

On one hand I just personally hate seeing that happening, on the other hand some of my workloads are irregular so when the workload changes the swap is emptied (at least partially) and the whole thing starts over again.

So far I played with the values of "/proc/sys/vm/swappiness" (tried to set there 0, 1, 60, 100) and "/proc/sys/vm/vfs_cache_pressure" (tried to set there 50, 100, 200), but when doing a lot of I/O the OS always ended up using swap.
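(For reference, the persistent form of those same knobs on a sysctl.d-based distro - the values are only the ones already tried above, not a fix in themselves:)

    # /etc/sysctl.d/80-swap.conf
    vm.swappiness = 1
    vm.vfs_cache_pressure = 50

    # reload without rebooting
    sysctl --system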

I would like to have swap available/enabled to cover potential extreme cases without having the programs crash (e.g. I might set memory limits of SW that might rarely run concurrently too high, or some database might suddenly allocate more than expected, etc...) => seeing that swap is being/was used would tell me that something is not OK in relation to the total RAM being used by my SW...


Well known but still interesting.

Nowadays I generally don't use any swap at all and find it annoying when distros/Windows create swap anyway. I mean if my 128GB+ or even 32GB of primary memory runs out, is it really going to help to swap 2GB to disk? And any larger swap than that is too slow to be usable.


Oh, this takes me back.

I used to spread my swap out across all my disks on my system. When I had 2 disks, I put /boot, / and /var on one disk and /home on the other. When I had more disks, I moved /var onto its own disk, and had an extra drive that I symlinked into /home.

I put swap first on all the partitions. It's not like I did any benchmarking, there was just lore that swap should be close to the middle, followed by frequently accessed user data. At some point I got enough RAM that the swap wasn't really important, but I always provisioned it.

Now everything is SSD, and I feel like the whole idea of a filesystem that you have to mount and keep consistent is kind of old-fashioned, but we have so much stuff built on the filesystem that it will be with us a long time.


On a related note, zswap (Linux) is surprisingly efficient if the system isn't CPU-bound.

It is "a Linux kernel feature that provides a compressed write-back cache for swapped pages, as a form of virtual memory compression. Instead of moving memory pages to a swap device when they are to be swapped out, zswap performs their compression and then stores them into a memory pool dynamically allocated in the system RAM"

https://en.wikipedia.org/wiki/Zswap
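Trying it out is cheap; a sketch (lz4 assumes that compressor module is available, and zswap.enabled=1 on the kernel command line makes it stick across boots):

    cat /sys/module/zswap/parameters/enabled        # check current state
    echo 1 > /sys/module/zswap/parameters/enabled
    echo lz4 > /sys/module/zswap/parameters/compressor
    echo 25 > /sys/module/zswap/parameters/max_pool_percent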


Interesting idea: multiple swap partitions... the kernel smartly chooses the one closer to the write head whenever needed.


I might be wrong, but I think I've read somewhere (here on HN) that the kernel has no idea about the disk head position. It's the job of the HDD's firmware to reorder the read/write commands it receives from the kernel for optimal performance.

Also, firmware can "remap" some (bad) sectors into a reserve area without the kernel knowing.


Modern disks use Logical Block Addressing, so block numbers do correlate with head position but there's no detailed info at the level of cylinders/heads/sectors. Block remapping is a theoretical possibility, but if you see even a single block being remapped in the SMART info it means the disk is dying and you should replace it ASAP.


Some modern disks, depending on firmware and application, do in fact do a lot of remapping; they have wear leveling enabled, generally aimed at shuffling data such that the head tends to move less and give you better latencies. Wouldn't surprise me if normal disks are starting to do that regardless of usage, as reducing tail latencies never hurts much.

There is also a difference between remapping a sector and reallocating a sector. Remapping simply means the sector was moved for operational reasons; reallocating means a sector has produced some read errors but eventually did read fine.

A disk can operate fine even with tens of thousands of reallocated sectors (in my experience). The dangerous part is SMART reporting pending and offline sectors, doubly so if the pending count does not go below the offline count. That is data loss.

But simply put; on modern disks the logical block address has no relation to the position of the head on the platter.


> But simply put; on modern disks the logical block address has no relation to the position of the head on the platter.

WD kind of tried that with device managed SMR devices and they show absolutely horrible re-silvering performance.

Without a relatively strong correlation between linear write/read commands and mostly linear physical locations, spinning rust performance is not at a usable level.


DMSMR is an issue yes, but CMR disks already do it and it's not as much of an issue as you think. On a CMR this is entirely fine.

The issue with SMR is that because a write can have insane latencies, normal access gets problems.

CMR doesn't have those write latencies, so you won't face resilvering taking forever.

It also helps if you run a newer ZFS, which has sequential resilvers that do in fact run fine on an SMR disk.

I will also point out that wear leveling on a DM-SMR disk tries to achieve maximum linear write/read performance by organizing commonly read sectors closer to each other.


Did that on a big SPARC system 20 years ago; it had 8 SCSI channels and 36 disk spindles, so each one had a small swap partition and they got used like RAID 0. It was nifty.


IIRC, you can already set priorities on different swap partitions, so that the kernel chooses the ones you want it to use first.
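For example, in /etc/fstab (device names are placeholders); the kernel fills higher-priority areas first and stripes across areas of equal priority:

    /dev/sda2   none   swap   sw,pri=10   0 0
    /dev/sdb2   none   swap   sw,pri=10   0 0
    /swapfile   none   swap   sw,pri=5    0 0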


Yes, but it only changes to the next swap partition/device/file once another one is full.


Couldn't that result in much slower reads when the head is far away from the swap when it needs to read (multiple times)?


Even more interesting idea: pages are opportunistically mirrored between swap partitions and the kernel smartly chooses the closest one whenever needed!


Does the kernel even have the information to know which is closest? I figured that would be abstracted away by the disk controller.


The slowest swap was at the end probably because it was farther away. The position of the head can be inferred from the geometry and the last access.


The less the reader arm has to move, the faster seeking should be.

So if you place the swap near the rest of the files, the HDD arm will not need to move so much.

Given that this was pretty much a clean Linux install, I would assume that most files were at the start of the disk, close to the best swap location.


I'd like to see a database benchmark ran instead of a software build.


Swap placements that I do:

1. Choose the fastest HDD device

2. Use a direct partition, no LVM

3. Partition in the middle of the spinning platter (the busiest region of the hard drive)

4. Single swap partition only

5. Keep swap and hibernate storage separate

6. Encrypt swap (only downside)


I am an old-time Slackware user and I always put swap at the beginning on 3.5 inch drives. I believe for laptop drives the end of the drive was better.


Wonder if zswap would make any difference here?



