partitioning - Is partition alignment to SSD erase block size pointless?

Friday, September 5, 2014

Many people seem to believe (1, 2, 3, 4, 5) that aligning the start of your SSD partitions to a multiple of the SSD erase block size is somehow beneficial. I do not see the benefit. Consider the following partitioning (please suspend your disbelief about the 16K erase blocks; in practice they are likely to be much larger, and so are the partitions):

Partitions:      [    1   ]              [        2        ]
Logical blocks:  [ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ]
Physical blocks: [ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ]
Erase blocks:    [          16K         ][          16K         ]

Now if logical block K corresponded to physical block K for every K (i.e. if the SSD controller did no wear-levelling), then there might be some theoretical merit to this. Suppose for example that partition 2 in the above figure started one logical / physical block earlier. Then any write at the beginning of partition 2 would cause the erasure of the first erase block, as would any write to partition 1, and that particular erase block would suffer additional wear.
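The scenario above can be sketched in a few lines of Python. This is only an illustration under the stated assumption of a direct logical-to-physical mapping (no wear-levelling). The sizes come from the figure; the extent of partition 1 (two 4K blocks) and the helper name erase_blocks_touched are assumptions made for the example.

```python
BLOCK = 4 * 1024          # logical / physical block size from the figure
ERASE_BLOCK = 16 * 1024   # erase block size from the figure


def erase_blocks_touched(start, length, erase_block=ERASE_BLOCK):
    """Return the erase-block indices covered by a byte range,
    assuming logical block K maps directly to physical block K."""
    first = start // erase_block
    last = (start + length - 1) // erase_block
    return set(range(first, last + 1))


# Assumption: partition 1 occupies the first two 4K blocks.
part1 = erase_blocks_touched(0, 2 * BLOCK)

# Partition 2 as drawn: starts at block 4, on an erase-block boundary.
aligned = erase_blocks_touched(4 * BLOCK, 4 * BLOCK)

# Partition 2 moved one block earlier: starts at block 3.
shifted = erase_blocks_touched(3 * BLOCK, 4 * BLOCK)

print(part1 & aligned)   # erase blocks shared when aligned
print(part1 & shifted)   # erase blocks shared when shifted
```

With the aligned layout the two partitions share no erase block; with the shifted layout they share erase block 0, which is the extra wear described above.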

With wear-levelling, however, there is no fixed correspondence between logical and physical blocks (i.e. logical block K can correspond to an arbitrary physical block L), so erase-block alignment should be completely meaningless. Alignment to the (4K) block size should be sufficient, so that pages (for swapping) and filesystem blocks (for data) written out to the partition do not occupy more blocks on the SSD than necessary.

Answer

This question is hard to answer, especially since SSD technology is
constantly evolving and modern operating systems keep improving their
handling of SSDs.

In addition, I'm not sure that your problem is with wear leveling.
It has more to do with SSD optimizations designed to avoid block erases.

Let us first get our terms right:

An SSD block or erase block is the unit that the SSD can erase in one atomic operation. It can be as large as 4 MB
(but 128 KB or 256 KB are more common).
An SSD cannot overwrite data in a block without erasing the block first.

An SSD page is the smallest atomic unit that the SSD software can track.
A block contains multiple pages, each usually up to 4 KB in size.
For each page, the SSD keeps a mapping of where the OS thinks it is located
on the disk (the SSD writes pages wherever it prefers, although the OS
thinks in terms of a sequential disk).
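The per-page mapping can be pictured with a toy model. The sketch below is not how any particular SSD firmware works; the class name TinyFTL and its trivial allocator are hypothetical and exist only to show that the logical position seen by the OS and the physical position chosen by the SSD are independent.

```python
class TinyFTL:
    """Toy mapping layer: logical page number -> physical page number."""

    def __init__(self):
        self.mapping = {}     # logical page -> physical page
        self.next_free = 0    # next unwritten physical page (hypothetical allocator)

    def write(self, logical_page):
        # Every write goes to a fresh physical page; any older copy
        # of the same logical page simply becomes stale.
        self.mapping[logical_page] = self.next_free
        self.next_free += 1
        return self.mapping[logical_page]


ftl = TinyFTL()
ftl.write(7)   # logical page 7 lands on physical page 0
ftl.write(2)   # logical page 2 lands on physical page 1
ftl.write(7)   # rewriting logical page 7 moves it to physical page 2
print(ftl.mapping)
```

After these three writes, logical page 7 lives on physical page 2 and physical page 0 holds a stale copy that must eventually be reclaimed.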

A sector is the smallest element that the operating system thinks a hard disk
can write in one operation. The OS will also think in terms of disk cylinders
and tracks, even though they do not apply to SSDs.
The OS will usually inform the SSD when a sector becomes free
(TRIM).
Smart SSD firmware will usually announce its page size to the OS as the sector size where possible.

It is clear that the SSD firmware would prefer always writing to empty blocks,
as they are already erased. Otherwise, adding a page to a block that already
contains data requires the sequence read-block / store-page / erase-block / write-block.
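The four-step sequence can be sketched as follows. This is a simplified illustration, not real firmware: a block is modelled as a Python list of page contents, None marks an empty page, and the function name add_page_to_used_block is made up for the example.

```python
def add_page_to_used_block(block, index, data):
    """Place `data` in page slot `index` of a block that already holds data.

    Returns (new_block, pages_programmed).
    """
    buffer = list(block)                  # read-block: copy contents to controller RAM
    buffer[index] = data                  # store-page: update the in-memory copy
    new_block = [None] * len(block)       # erase-block: the whole block is cleared
    programmed = 0
    for i, page in enumerate(buffer):     # write-block: program every page holding data
        if page is not None:
            new_block[i] = page
            programmed += 1
    return new_block, programmed


old = ["A", "B", "C", None]
new, programmed = add_page_to_used_block(old, 3, "D")
print(new, programmed)
```

In this example one page of new data costs one block erase and four page writes, which is why the firmware prefers writing to blocks that are already erased.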

Applying the above too liberally will cause pages to be dispersed all over
the SSD and most blocks to become partially empty, so the SSD may soon run out
of empty blocks. To avoid that, the SSD will continuously do
garbage collection in the background, consolidating partially written
blocks and ensuring that enough empty blocks are available.
This operation may look like this:

[Figure: garbage collection on an SSD]
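As a rough sketch of the consolidation step, the toy function below packs the valid pages of several partially written blocks into as few blocks as possible. It assumes 4 pages per block and models stale or empty pages as None; the name garbage_collect and the data layout are illustrative only, and real firmware is far more selective about which blocks it collects.

```python
PAGES_PER_BLOCK = 4  # assumed for illustration


def garbage_collect(blocks):
    """Pack all valid pages into as few blocks as possible.

    Returns (compacted_blocks, pages_copied).
    """
    valid = [page for block in blocks for page in block if page is not None]
    compacted = []
    for i in range(0, len(valid), PAGES_PER_BLOCK):
        chunk = valid[i:i + PAGES_PER_BLOCK]
        chunk += [None] * (PAGES_PER_BLOCK - len(chunk))
        compacted.append(chunk)
    # The remaining blocks are fully erased and ready for new writes.
    while len(compacted) < len(blocks):
        compacted.append([None] * PAGES_PER_BLOCK)
    return compacted, len(valid)


blocks = [
    ["A", None, "B", None],
    [None, "C", None, None],
    ["D", None, None, "E"],
]
compacted, copied = garbage_collect(blocks)
print(compacted, copied)
```

Here three partially written blocks become one full block, one block with a single page, and one completely empty block, at the cost of copying five pages that the OS never asked to rewrite.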

Garbage collection introduces another factor, write amplification,
meaning that one OS write to the SSD may need more than one physical write
on the SSD.
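Write amplification is commonly expressed as the ratio of data physically written to flash to data written by the host. A minimal sketch, with made-up numbers and a hypothetical function name:

```python
def write_amplification(host_pages_written, flash_pages_written):
    """Ratio of pages physically written to flash to pages written by the OS."""
    if host_pages_written <= 0:
        raise ValueError("host_pages_written must be positive")
    return flash_pages_written / host_pages_written


# Example figures (invented): the OS writes 100 pages, and garbage
# collection relocates another 50 pages while making room for them.
host_pages = 100
relocated_pages = 50
factor = write_amplification(host_pages, host_pages + relocated_pages)
print(factor)
```

A factor of 1.0 would mean the SSD wrote exactly what the OS asked for; anything above that is extra wear caused by internal housekeeping.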

As an SSD block can only be erased and written a certain number of times before
it dies, wear leveling is designed to distribute block writes uniformly
across the SSD so that no block is written much more than the others.
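One simple way to picture this is a policy that always picks the least-erased block for the next write. The sketch below assumes such a policy purely for illustration; actual firmware algorithms are proprietary and more elaborate, and pick_block is a hypothetical name.

```python
def pick_block(erase_counts, candidates):
    """Choose the candidate block that has been erased the fewest times."""
    return min(candidates, key=lambda b: erase_counts[b])


erase_counts = [0, 0, 0, 0]   # four blocks, none erased yet
for _ in range(8):            # eight writes, each costing one erase
    block = pick_block(erase_counts, range(len(erase_counts)))
    erase_counts[block] += 1

print(erase_counts)
```

Even if the OS rewrote the same logical location eight times, the erases end up spread evenly over all four blocks.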

The question of partition alignment

From the above, it looks like the mechanism that allows the SSD to map pages
to any physical location, while keeping track of where the OS thinks they are
stored, voids the need for partition alignment. Since a page is not written
where the OS thinks it is written, it no longer matters where the OS
thinks it writes the data.

However, this ignores the fact that the OS itself attempts to optimize
disk accesses. For a classical hard disk it will attempt to minimize head
movements by allocating data accordingly on different tracks.
Clever SSD firmware should manipulate the fictional cylinder and track
information that it reports to the OS so that the track size equals
the block size and the sector size equals the page size.

When the view the OS has of the SSD is somewhat more in line with reality,
the optimizations done by the OS may avoid the need for the SSD to remap pages
and to run garbage collection, which will reduce write amplification and
increase the lifetime of the SSD.

It should be noted that too much fragmentation of the SSD (meaning too much
mapping of pages) increases the amount of work done by the SSD.
The 2009 article
"Long-term performance analysis of Intel Mainstream SSDs"
indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degradation is permanent, and that with wear leveling this condition may extend to more
of the drive.
This condition is the reason why many SSD owners see performance degrade
over time.

My final advice is to align partitions so that they respect the erase-block layout.
The OS will assume that a partition is well aligned with respect to the disk,
and its decisions on the placement of files may then be more
intelligent. As always, individual idiosyncrasies of the OS driver
versus the SSD firmware may invalidate such concerns, but it is better to play it safe.
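Checking whether a partition start respects a given erase-block size is simple arithmetic. The sketch below assumes 512-byte sectors and a 128 KB erase block (one of the common sizes mentioned earlier); the real erase-block size of a drive is often not published, so treat the default as a placeholder. The function names and the start sectors 63 and 2048 are example values only.

```python
def is_aligned(start_sector, sector_size=512, erase_block=128 * 1024):
    """True if the partition's starting byte offset is a multiple of the erase block."""
    return (start_sector * sector_size) % erase_block == 0


def next_aligned_sector(start_sector, sector_size=512, erase_block=128 * 1024):
    """Smallest sector >= start_sector that lies on an erase-block boundary."""
    sectors_per_block = erase_block // sector_size
    # Ceiling division, then scale back up to a sector number.
    return -(-start_sector // sectors_per_block) * sectors_per_block


for sector in (63, 2048):
    print(sector, is_aligned(sector), next_aligned_sector(sector))
```

With these assumptions a partition starting at sector 63 is misaligned and would have to move to sector 256, while one starting at sector 2048 (a 1 MB offset) is already aligned.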
