README.md 6.5 KB

finitary-derive

What's this all about, then?

Have you ever written an Unbox instance for a user-defined type? I hope not, because it's a uniquely tedious chore. If your type is more complex, this can be difficult, fiddly, and frustrating. Storable is not much better. This is the kind of 'work' that we as Haskellers ought not to put up with.

Now, you don't have to! As long as your type is Finitary, you can now get Unbox and Storable (as well as a whole bunch of other) instances almost automagically:

{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DerivingVia #-}

import Data.Finitary
import Data.Finitary.Finiteness
import Data.Finitary.PackInto
import Data.Word
import Data.Hashable

import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Storable as VS

data Foo = Bar | Baz (Word8, Word8) | Quux Word16
  deriving (Eq, Generic, Finitary)
  deriving (Ord, Bounded, Hashable, NFData, Binary) via (Finiteness Foo)

someVector :: VU.Vector (PackInto Foo Word64)
someVector = VU.fromList . fmap Packed $ [Bar, Baz 0x0 0xf, Quux 0x134]

someStorableVector :: VS.Vector (PackInto Foo Word64)
someStorableVector = VS.fromList . fmap Packed $ [Bar, Baz 0x0 0xf, Quux 0x134]

If you don't have access to DerivingVia, you can still get the benefits of this library -- just use Finitary a instead of a. As it is a newtype, you can coerce through it if you care about efficiency.

What's the deal with Unbox and Storable exactly? What's with all the Pack types?

Essentially, being Finitary means that there's a finite set of indexes, one for each inhabitant. That means we can essentially represent any inhabitant as a fixed-length number. It's on the basis of this that we can 'magic up' Storable and Unbox.

However, how we represent this fixed-length number isn't immediately obvious. We have a couple of options:

  • A string of bits
  • A string of bytes
  • An array of machine words

Additionally, if we have another finitary type whose cardinality is not smaller, we could potentially 'borrow' its instances as well. Which of these choices is appropriate isn't obvious in general: it depends on whether you care about space or speed, the cardinality of the type, and a bunch of other things too. As we believe that the best people to judge tradeoffs like these are the people using our library, we provide all of these options for you to choose from, so that you can choose the one that best suits you.

So... what's the difference exactly?

PackBits represents indexes as strings of bits. This is the most compact representation possible (honestly, maths says so), but the least efficient, as accessing individual bits is slower on most architectures than whole bytes or words. Unless you've got large Vectors, you probably don't need this encoding, but if space is at an absolute premium, this is the best choice.

PackBytes instead represents indexes as byte strings. This is a more efficient choice than a string of bits, but can still be slow for architectures which prefer whole-word access. It's also fairly compact, especially if your architecture has big Words.

PackWords represents indexes as fixed-length arrays of Words. This is the most efficient encoding from the point of view of random reads and writes, but will likely waste a lot of space, unless your type is extremely large (as in, multiple copies of Word large).

Lastly, PackInto lets you choose another finitary type whose instances you want to 'borrow', and will use that type as a representation. This is the most flexible, and should be preferred whenever possible. However, it requires that a type of appropriate cardinality (at least as big as the one you want to encode) exists, and has the appropriate instances.

Why can't I DerivingVia through these Pack types?

For Unbox, the short answer is 'role restrictions on unboxed vectors'. If you want a more detailed explanation, check out the GHC wiki on roles, as well as the implementation of Data.Vector.Unboxed. You might also want to check out stuff about data families.

Additionally, there is some tension in the design. We could have made one of two choices: either define Pack types as transparent newtypes, and encode or decode whenever a type class method required it; or define Pack types as opaque, and encode or decode only when the values were constructed or deconstructed. Ultimately, we went with the second option, as it makes the occurences of encodes and decodes explicit to the user. Had we gone with the first choice, it would be unclear where encodes and decodes occur, especially when using functions built from type class methods. We believe this clarity is worth the inability to use DerivingVia to define Storable instances.

Why do PackBytes, PackWords and PackInto have Storable instances, but not PackBits?

Because it's not clear what this instance should look like. Let's suppose you want to bit-pack a type Giraffe with cardinality 11 - what should sizeOf for PackBits Giraffe be? How about alignment? The only obvious solution is padding, but in this case, you might as well use PackBytes, PackWords or PackInto, since then you'll at least know what you're getting, and are explicit about it.

Sounds good! Can I use it?

Certainly - we've tested on the following (all x86_64 only):

  • GNU/Linux: GHC 8.4.4, 8.6.5, 8.8.1
  • macOS: GHC 8.8.1
  • Windows: GHC 8.10.4, 9.0.1

If you would like support for any additional GHC versions, let us know. Unfortunately, while the library will build on 8.4.4, due to hedgehog-classes being limited to 8.6+, tests cannot be run on this version.

If you build and use this library successfully on any other platforms, we'd like to know too - it'd be beneficial even if nothing breaks, and especially if something does.

License

This library is under the GNU General Public License, version 3 or later (SPDX code GPL-3.0-or-later). For more details, see the LICENSE.md file.