The Unwritten Contract of Solid State Drives (2017) (acm.org)
110 points by jitl on May 24, 2021 | 51 comments


The five unwritten rules are:

Request Scale rule: SSD clients should issue large data requests or multiple concurrent requests. A small request scale leads to low resource utilization and reduces both immediate and sustainable performance.

Locality rule: SSD clients should access with locality. Workloads without locality can incur poor immediate performance because frequent cache misses lead to many translation-related reads and writes. Poor locality also impacts sustainable performance because data movement during garbage collection and wear-leveling requires translations and mapping updates.

Aligned Sequentiality rule: clients of SSDs with hybrid FTLs should start writing at the aligned beginning of a block boundary and write sequentially. This rule does not affect immediate performance since the conversion happens later, but violating this rule degrades sustainable performance because of costly data movement during the delayed conversion.

Grouping By Death Time rule: clients of SSDs should group data by death time. The death time of a page is the time the page is discarded or overwritten by the host. If a block holds data with different death times, there is a time window between the first and last page invalidations within which both live and dead data reside in the block, forcing live pages to be copied during garbage collection. There are two practical ways to achieve this grouping: 1) gather data with similar death times in the write sequence; 2) place death groups in different logical segments to isolate them physically.

Uniform Lifetime rule: clients of SSDs should create data with similar lifetimes. Lack of lifetime uniformity does not directly impact immediate performance, but it impacts sustainable performance because it necessitates wear-leveling and leads to loss of capacity.
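The Grouping By Death Time rule can be illustrated with a toy garbage-collection model (a hypothetical Python sketch, not from the paper): any page still live when its block is reclaimed must be copied elsewhere first, so co-locating data with similar death times drives that copy cost toward zero.

```python
def gc_cost(blocks, now):
    """Reclaim every block containing dead pages; live pages must be copied out first.
    Each block is a list of page death times."""
    copies = 0
    for block in blocks:
        dead = [d for d in block if d <= now]
        if dead:  # block has garbage worth reclaiming
            copies += len(block) - len(dead)
    return copies

# Same eight pages, two placements (values are death times).
mixed   = [[1, 9, 2, 9], [1, 9, 2, 9]]  # short- and long-lived data interleaved
grouped = [[1, 1, 2, 2], [9, 9, 9, 9]]  # similar death times share a block

# Collecting garbage at time 5: the mixed layout forces four live-page copies,
# the grouped layout forces none.
print(gc_cost(mixed, 5), gc_cost(grouped, 5))  # -> 4 0
```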


The concerns in the paper motivating the locality rule really only apply to low-end consumer SSDs. Enterprise/datacenter SSDs and high-end consumer SSDs all carry plenty of DRAM for the FTL, so they don't suffer as much penalty for poor locality and it's really only worth drawing a distinction between sequential and non-sequential access.

I also don't think it's common for SSDs to use the kind of hybrid page/block mapping FTL that the Aligned Sequentiality rule is about. These days, if you want a SSD that cuts down on DRAM costs, you want the NVMe Zoned Namespaces feature.

The problems that the Uniform Lifetime rule and Grouping By Death Time rule are concerned with can be handled explicitly with features like NVMe's Streams directives, or Zoned Namespaces.

Overall, this article doesn't seem to have aged particularly well, and I'm not sure it was that useful and relevant even in 2017.


At the application level (e.g. a database storage layout), even for zoned namespaces the GENERALIZED principles seem logical. Even in the old days of tiny 512-byte sectors, the seek and re-write time for a partial update likely influenced designs. Today I consider myself lucky to get storage with 4K sectors, and I recall seeing rumors that game console SSDs would ship with far larger erase blocks.

I would prefer all storage came with interfaces and common access protocols for utilizing the raw features of the device directly: an enumeration of raw permanent storage addresses (possibly read-only, write-only in one packet, or write via write-zone promotion), (readable?) write buffer locations which can be promoted to long-term storage, etc.

That might be NVMe 'streams', which I haven't read about yet. A quick search makes me think they might be close and could have evolved there, but they weren't there 2 years ago. https://www.anandtech.com/show/14543/nvme-14-specification-p...


> Today I consider myself lucky to get storage with 4K sectors and I recall seeing rumors that game console SSDs would ship with far larger erase blocks.

The erase blocks on current NAND flash are already up to 16+MB, while the page sizes are typically 16kB. So even presenting the illusion of 4kB sectors requires a lot of extra background work from the SSD controller, and that's been true for generations.
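To put rough numbers on that mismatch (illustrative figures only, assuming a 16 MB erase block and 16 kB pages):

```python
ERASE_BLOCK = 16 * 1024 * 1024  # bytes; erasure is only possible at this granularity
PAGE        = 16 * 1024         # bytes; smallest programmable unit
SECTOR      = 4 * 1024          # bytes; the unit the host believes it is writing

pages_per_block  = ERASE_BLOCK // PAGE   # pages the controller manages per erase unit
sectors_per_page = PAGE // SECTOR        # host sectors sharing one flash page

# A single 4 kB overwrite invalidates 1/4 of a page and 1/4096 of an erase
# block; the controller must remap it and eventually migrate the block's
# remaining live data before the block can be erased.
print(pages_per_block, sectors_per_page)  # -> 1024 4
```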

Also, the NVMe Streams feature is from the 1.3 spec in 2017: https://www.anandtech.com/show/11436/nvme-13-specification-p... Like most of the more advanced NVMe features, it's targeted at enterprise or datacenter use cases and most of the initial adoption is among the hyperscale cloud providers that have the resources and incentive to optimize their IO stack from top to bottom.


The paper has other locality concerns which apply to all SSDs:

“Locality is not only valuable for reducing required RAM for translations, but also for other purposes. For example, all types of SSDs are sensitive to locality due to their data cache. In addition, for SSDs that arrange flash chips in a RAID-like fashion, writes with good locality are more likely to update the same stripe and the parity calculation can thus be batched and written concurrently [92], improving performance.”
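For reference, the parity batching the paper mentions is ordinary XOR parity: a full-stripe write lets the parity be computed once per stripe instead of read-modify-written per update. A minimal RAID-5-style sketch (illustrative only):

```python
from functools import reduce

def stripe_parity(chunks):
    """XOR parity over the data chunks of one stripe (RAID-5 style).
    Writing a whole stripe at once means one parity computation and one
    parity write, instead of one per scattered update."""
    return bytes(reduce(lambda a, b: a ^ b, cols) for cols in zip(*chunks))

stripe = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
print(stripe_parity(stripe).hex())  # -> 152a
```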


SSDs don't have much data caching. They use the large amount of DRAM for metadata (the FTL), and have small amounts of write buffering. Read caching and prefetching tends to be minimal and amount to the page buffers on each plane of each die, ie. the hardware has to perform a read of 16kB even if you only ask for 4kB.

The bit about updating the same stripe also sounds like hard drive talk, where data can be updated in-place. It doesn't make sense in an SSD context where all modifications go to new stripes and the drive will readily combine a batch of writes to disparate LBAs into a burst of writes to contiguous physical pages.

There's just not much truth behind the Locality rule that isn't already covered by the Request Scale rule.


Regarding the FTL acronym:

"To hide the complexity of SSD internals, the controller usually contains a piece of software called an FTL (Flash Translation Layer); the FTL provides the host with a simple block interface and manages all the operations on the flash chips."

...

"A hybrid FTL uses page-level mappings for new data and converts them to block-level mappings when it runs out of mapping cache."
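The DRAM cost that pushes vendors toward hybrid or block-level mappings is easy to estimate (back-of-the-envelope, with illustrative parameters):

```python
CAPACITY = 1 * 2**40  # 1 TiB drive
PAGE     = 4 * 2**10  # 4 KiB mapping granularity
ENTRY    = 4          # bytes per logical-to-physical mapping entry

# A full page-level map for this drive needs about 1 GiB of DRAM; this is the
# source of the common "1 GB of DRAM per 1 TB of flash" rule of thumb.
table_bytes = CAPACITY // PAGE * ENTRY
print(table_bytes // 2**20, "MiB")  # -> 1024 MiB
```

Block-level entries cover thousands of pages each, shrinking the table by roughly that factor, which is the hybrid FTL's appeal when DRAM is scarce.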


I know journals differ in their guidance, but shouldn't an abstract say something about the findings? That one reads more like the back cover of a paperback...

(No offense intended to the authors for this fine piece of work, nor to paperback writers for their ability to create suspense. Just a gripe about metadata!)


> shouldn't an abstract say something about the findings?

One counter-example is in security / defence work where the existence of a project is unclassified but the details are protected in some way. Not reporting the findings allows the paper to be indexed in libraries / registries without 'contaminating' the library indexes with classified information.

The other example (rightly or wrongly) is where a publisher wants you to be able to find via bibliographic search tools a paper hidden behind a paywall but still wants you to pay to access the paper. If the abstract revealed too much, it defeats the paywall, from the publisher's perspective.


Indeed! And where is HN when you need it ;)

I for one will be refreshing this page over the course of this day hoping for an unsung hero to come along, and I shall be upvoting.


teekert and sideshowb sat idly on the couch, uninspired to scroll down to the discussion section. Would a tall, handsome hero show up to lift them to new heights of summarization? Only time would tell...


I doubt they would show up this time as the study has no single conclusion, just a number of observations, nothing really striking.


There are links to eReader and PDF in bright buttons at the top of the page.

https://dl.acm.org/doi/epdf/10.1145/3064176.3064187

https://dl.acm.org/doi/pdf/10.1145/3064176.3064187


The abstract on those is the same in that it does not summarize findings.


I have old pictures I took on an SD card. For some reason a few of those files are 0 bytes, meaning they probably became corrupted at some point.

It seems to happen on my SSD too. For some other reason, Windows Explorer seems to crash or hang when I browse folders with a lot of images. SSDs are known to have a short life expectancy.

I wonder if there is a market for very-long-life HDDs. Any type of data storage that can last 10 years without failure would have value.

Of course magnetic tape systems exist, but I would guess that a very durable HDD would still be better.


HDDs have awful MTTF compared to SSDs. Where HDDs shine is when you fill them up, unplug them, and put them in a box for decades.

SSDs are much more reliable than HDDs if you want to access the data, they just need to be powered and never written to.


SSD data retention is also excellent if you only ever fill it up once. Data retention only becomes a problem if you use up most of the write endurance. Whether the SSD is powered or not during the long periods where you're not accessing the data has minimal effect.

They're still a horrible data archival solution, but mainly because of the high price per TB.


> Whether the SSD is powered or not during the long periods where you're not accessing the data has minimal effect.

I've heard conflicting reports on this.

Ultimately, any storage technology is about storing data. This can be done magnetically (aka: bits are tiny little magnets pointing in certain directions), as long as the magnets don't change, you can read back the data fine.

Or it can be done optically (ex: the dye inside of CD-roms, DVD or BluRay). As long as the dye remains a certain color, you can read it back fine.

Magnets lose their magnetization over time, and dyes can change color (especially if they're exposed to the sun). You won't notice over a year or even two, but after 5, 10, or 15 years, things change enough that you start getting errors.

SSDs store data by punching an electrical charge into a transistor. The barrier is designed so that the charge doesn't want to leak out, but what is the rate of that leakage? All materials leak charge over time, and once the stored voltage drifts far enough, it no longer reads back as the same value (and thus you lose data over time).
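Nobody publishes that leakage rate directly, but the shape of the problem can be sketched with a toy exponential-decay model (the half-life here is entirely made up for illustration):

```python
def charge_remaining(t_years, half_life_years):
    """Fraction of the programmed charge left after t_years, assuming simple
    exponential leakage. Real NAND retention also depends on temperature,
    P/E wear, and read disturb, and is not literally exponential."""
    return 0.5 ** (t_years / half_life_years)

# With a hypothetical 30-year charge half-life, ~79% of the charge remains
# after a decade. Whether that still reads back as the same value depends on
# how finely the cell's voltage range is divided (SLC vs MLC/TLC/QLC).
print(round(charge_remaining(10, 30), 3))  # -> 0.794
```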

--------------------------------

Some SSDs, such as the Samsung 830, were well known for leaking charge and causing issues. Today, we use stacked 3D NAND, which is a different transistor type than the planar NAND on the Samsung 830, but that doesn't fundamentally answer the question of "What is the leakage current" and "how long do we have before enough voltage leaves the transistor to corrupt the data?"


> I've heard conflicting reports on this.

I've seen conflicting reports, too—at least in the popular press and tech forums. Filtering down to reports with clear sourcing back to actual academic or industry research, the confusion pretty much goes away. This is an issue where the actual facts have been oversimplified and dumbed down and wrong interpretations have caught on and been repeated by people with no understanding of the underlying mechanisms for data loss and for forestalling the data loss.

In particular, there are tons of people on forums like /r/hardware who think that plugging in the SSD will prevent charge from leaking out of a floating gate memory cell or the charge trap memory cells now used by almost all SSDs. It won't. Using the SSD can lead to data being re-written before it has degraded unrecoverably, through several possible mechanisms. But aside from drives like the Samsung 840 and 840 EVO that were retrofitted with explicit data degradation checks as part of their firmware's operation, merely leaving the drive plugged in will not help directly, and the temperature effects compared to leaving it on a shelf will be a slight detriment to data retention.

It's helpful to recall that the JEDEC standards for SSD write endurance require consumer drives to provide 1 year of unpowered data retention for a drive that has reached the end of its write endurance. For brand-new flash, the leakage is orders of magnitude slower than for worn-out flash, so it's quite reasonable to expect data retention comparable to some of the storage media that actually make economic sense for archival.


> Using the SSD can lead to data being re-written before it has degraded unrecoverably, through several possible mechanisms. But aside from drives like the Samsung 840 and 840 EVO that were retrofitted with explicit data degradation checks as part of their firmware's operation, merely leaving the drive plugged in will not help directly, and the temperature effects compared to leaving it on a shelf will be a slight detriment to data retention.

How about a monthly full drive read? That's easy to set up.


Yep, a monthly full drive read or even better a scrub with verification of the filesystem's checksums is great. It's not often enough to seriously impact performance in production or meaningfully increase the rate of read disturb errors, but it will allow the SSD to catch data degradation before it becomes a real problem, and to refresh any data that starts showing elevated error rates (or take other measures like start adjusting read voltage thresholds for that block). I don't actually have a cron job set up for it, but I tend to scrub my SSD array every month or two when checking on free space and SMART logs.
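A minimal sketch of such a periodic full read, for a filesystem without built-in checksums (the manifest format and paths are hypothetical):

```python
import hashlib
import os

def scrub(root, manifest):
    """Read every file under root in full and compare against previously
    recorded SHA-256 digests. The read itself lets the SSD spot and refresh
    degrading blocks; the digest check catches corruption the drive missed.
    Returns the list of paths whose contents no longer match the manifest."""
    bad = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            expected = manifest.get(path)
            if expected is not None and expected != h.hexdigest():
                bad.append(path)
    return bad
```

Running this from cron once a month approximates what a ZFS scrub does natively.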


CDs are probably the sweet spot


they WERE the sweet spot at one point (at least, DVDs were), but they've fallen behind and are no longer worth it.

All the issues with hard drives (a single "read" mechanism, limiting bandwidth) apply doubly when the medium exists outside the read mechanism and multiple manufacturers need to stay compatible with each other. It seems difficult to improve the medium independently of the read mechanism.

I don't really know how the tape-drive industry manages to keep up. Tape just has some huge fundamental advantages in surface area, I guess... but tape also only promises read compatibility for two generations, so the whole industry manages to move forward on a decent cadence.

---------

SSDs are now the right size and price for typical end-user devices. But HDDs are still the king of capacity for datasets below 50 TB, while tape reigns supreme at the highest end in price per TB.


> they WERE the sweet spot at one point (at least, DVDs were), but they've fallen behind and are no longer worth it.

Are you including M-Disk in that group?


From memory, it feels like M-Disk was a bit "too late" to the party.

By that point, BluRays came out and it was cheaper/more reasonable to just double-write your data on two BluRays. Eventually, M-Disk BluRays came out, but no one used optical anymore.

M-Disks were probably the best archival format for optical when they came out (at least, if their archive-quality claims are truly as good as they claim). But they always were a bit more expensive and/or late to the party. Being #1 in optical when optical is dying is not really a win...


I get your point. Certainly not if you want to buy blank media, or replace a burner.

Like MiniDisc. That was a fantastic medium. I loved those discs as artifacts, like having your old mix tapes. But once the player broke... only some random players on eBay.


For cheaply storing data, sure. The dye on burnable discs tends to degrade over time, though, making them poor for archival purposes.


Everything degrades over time. An archival-quality DVD has dyes that resist bitrot. Archival-quality magnetic media (HDDs, tape, or floppies) also resist bitrot... but they're not immune.

The only way to fully avoid bitrot is to constantly read-and-rewrite the data. You need to read the data before it degrades: that could be once a year, or maybe every few months.

But cold storage is just a devilishly hard problem. It's probably best not to deal with that at all: just get a NAS and read/rewrite the data automatically (aka "scrub" your data, in ZFS terms).


They used to be. These days spinning rust is cheaper than optical.


I just bought 30 new 64GB X25-E 50nm Intel SSDs from 2011 on eBay. They are SLC and should manage 100,000 writes per bit, and I don't think we'll ever see that kind of quality again.

I'm going to use ext4 (with the 'small' type, because my custom distributed database writes many small JSON files) and hope for the best on my Atom 8/16-core machines that have 12 SATA ports.

I won't use RAID but instead plan to mount these "manually".

Would love some feedback! :S


In my experience the weak point here is the controller/firmware. We bought a bunch of high-endurance Enterprise MLC drives (also Intel but newer) and one of them has already bricked itself, way before reaching any kind of wear worth mentioning.


This is extremely common with old SSDs.


Ok, so I will google firmware patches, thx!


A firmware patch doesn't fix thermal runaway on a linear regulator. Don't trust media too much.


I plan to mount these vertically outside the case for optimal passive cooling. I think these are not much worse off than today's drives when it comes to heat; maybe I should strip the 2.5" cases!

Good point! Unless we have an EMP! ;)

Edit: I found the M2 spacers I ordered by mistake 5 years ago; they are a bit wide at 18mm, but that is good for thermals! Genius!

Edit2: Ordered 6mm spacers! You can get an idea how it will look here: http://move.rupy.se/file/SSD_unit.png


Heat over time matters, but the thermal runaway scenario I referenced is the most common failure mechanism I know of in small electronics. It's just something that they do. It's not a matter of if, but when.

More important than cooling is the voltage rails. Reliable PSUs that don't overvolt the load under any scenario are important.


Ah, crap! I'm relying on PicoPSUs for this build, thx for the heads up!

Will use Mean Well from 220V -> 12V... but after that it's PicoPSUs! (160/180/200W; I will try many to see how their 3.3/5/12V rails look. I think this drive uses 5V, but how can I find out? Also, measuring voltages on a live circuit has burned me before; how do you make sure not to short/fry stuff?!)


If you want to mess around with small fast drives with high endurance relative to their capacity, you should consider the less outdated Intel Optane Memory cache drives, which are also far faster than ancient 3Gbps SATA drives.


I saw those on eBay, but the connector was weird, and I don't need more speed; these machines will be behind 2x separate home 1Gb/s (up and down) fiber...

Another thing I had to consider was power, since I plan to back these up with lead-acid batteries for power outages. These drives draw 2W when active... so that is 18W for 8x (512GB) DB drives and one OS drive! Compared to the 25/32W of the 8/16-core motherboards that's a lot... but mechanical drives are insane on electricity, up to 10W per drive!!!

But good point, do you know what connector these use?
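The battery sizing behind that power budget can be roughed out like this (illustrative numbers: the 100 Ah battery, the ~50 W combined load, and the 50% usable depth of discharge are all assumptions, and regulator/inverter losses are ignored):

```python
def runtime_hours(battery_ah, battery_v, load_w, usable=0.5):
    """Crude lead-acid runtime estimate: usable stored energy divided by load.
    Lead-acid is typically only discharged to ~50% to preserve cycle life."""
    return battery_ah * battery_v * usable / load_w

# Hypothetical 100 Ah 12 V battery feeding ~50 W (drives plus motherboard):
print(runtime_hours(100, 12, 50))  # -> 12.0 hours
```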


The Optane Memory drives use a M.2 connector, keyed for carrying two lanes of PCI Express 3.0, over which they use the NVMe protocol.

If you're concerned about battery life, you really shouldn't be using lots of small drives, especially when the controllers are so ancient. You'll get at least 5x the performance per Watt from a single newer flash-based SSD and much better idle power behavior. Consumer SSDs also offer much more in the way of low-power idle states than enterprise SSDs. At the extreme, you can get a $75 512GB drive that provides significantly more performance than 8x X25-Es in aggregate, while never reaching 3W power draw under load and idling at under 10mW.


Sure, but the 100,000 writes per bit!?

To me data integrity and longevity are paramount!

These have the same idle power, btw. Of course 2W on read/write is bad, but I'd rather use these $1000-in-2011 50nm SLC industrial drives and compress my database than buy some MLC/TLC/QLC drive made today at 14nm!


You should be aware that SSDs perform wear leveling, so the key metric is not how many program/erase cycles the underlying flash can handle on a per-block basis, but the total amount of writes the drive can handle, usually expressed in TB or PB.
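To make that concrete (illustrative numbers, assuming a write amplification of 1):

```python
def endurance_tb(capacity_gb, pe_cycles, write_amp=1.0):
    """Approximate total host writes in TB: capacity times rated P/E cycles,
    divided by write amplification. Vendors quote this as TBW or DWPD."""
    return capacity_gb * pe_cycles / write_amp / 1000

# 64 GB of SLC at 100,000 P/E cycles vs. a 512 GB TLC drive at 3,000 cycles:
print(endurance_tb(64, 100_000))  # -> 6400.0 (TB)
print(endurance_tb(512, 3_000))   # -> 1536.0 (TB)
```

Per-cell endurance still favors the SLC drive, but sheer capacity (and wear leveling across it) closes much of the gap, which is part of why the market moved on.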

Unless you have measured your own workload in enough detail to know otherwise, it is extremely unlikely you actually need to use SLC NAND. The market has almost completely abandoned SLC NAND for good reason, and your use case probably isn't that special (though if it is, I'm sure we could have an interesting conversation about it). Absent clear evidence to the contrary, I'm going to assume that a modern TLC SSD could easily fulfill your actual write endurance and data retention requirements, because that's the conclusion that has been reached by all of the big corporations who have been using modern SSDs in production at scale.


The wildcard here is the garbage collection in the drive, and bad firmware with high write amplification. I’ve seen TLC devices go bad within a year where the vendor optimized for full page P/E cycles.

That said these were high transaction server workloads, and typical home use would be challenged to get to that kind of wear.


Maybe, but I'm factoring in time spent on maintenance at 100x the cost, because that is not something I want me or my kids to ever have to do! Meaning with 5 million customers on my MMO, we have 250 years of all of them playing every day with these drives. Which gives us margin!


> I don't think we'll ever see that kind of quality again.

Some companies still make new "pseudo-SLC" drives [1] which store a single bit per flash cell. The downside is that they're expensive: $1/gigabyte instead of $0.1/gigabyte.

And at that price, RAID mirroring starts to look like a great deal!

[1] https://uk.farnell.com/c/office-computer-networking-products...


I got these drives at $0.6/GB...

These drives seem to be built like tanks and they fill the 2.5" enclosure to the brim.

They are also the first drive with a RAM buffer on the read lane, which is interesting in its own right; meaning the motherboard never has to wait for writes, but the disk has to wait for the motherboard on reads!!!


I have several similar X25-Ms still in use after over a decade; they are great drives.


Good to hear. The M is MLC, I think, so the X25-E should be even better!?


It's weird they didn't test with ZFS, UFS/FFS, or ReiserFS.


> Its weird they didn't test with ZFS

Maybe because of the numerous ways it can be configured that might affect results. ZFS is not as simple as ext4 or XFS to configure, and it might have just made it harder to get useful information from the results. At the same time, someone can probably use the results here, plus some knowledge of SSDs, to make educated guesses about how ZFS can/should be configured for SSDs.

> UFS/FFS

The purpose isn't to test filesystems, it's to figure out how to best leverage SSDs specific performance and design characteristics for both immediate and long term performance.

> or ReiserFS

Uh, is ReiserFS still actively developed? I used to follow it quite actively, but I admit I sort of lost interest after it seemed to stall following Hans Reiser's murder conviction.


RE: ZFS - Testing with modern filesystems should be a key point for benchmarking and "getting the most out of a device". Because ZFS contains both logical volume management and a variety of other safety/speed controls, it really should've been included.

RE: UFS/FFS - These perform quite well with SSDs and in various configurations will outperform ext4/etc in terms of sheer write/read performance and also data safety.



