-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HashMapT salt, which allows setting of salt with Nat. #321
base: master
Are you sure you want to change the base?
Add HashMapT salt, which allows setting of salt with Nat. #321
Conversation
70f5bdc
to
58a344f
Compare
benchmarks (without fixing boxing issues) I ran
|
So, If I understand correctly, the boxing issue has little impact on performance. I kindly invite others to reproduce these results, or perhaps point out mistakes I've made. I'm new to this after all. |
7a4681c
to
73fa02e
Compare
This allows clients to create custom salted hashmaps. For backwards compatibility we use ```haskell -- backwards compatibility type HashMap = HashMapT DefaultSalt ``` Then modify the functions to be free of salt if they can, for example insert: ```haskell insert :: forall k v salt . (Eq k, Hashable k) => k -> v -> HashMapT salt k v -> HashMapT salt k v insert k v m = insert' (hash salt k) k v m where salt = natVal (Proxy :: Proxy salt) ``` This allows the default HashMap with backwards compatibility, but also any other HashMapT I think this solves the issue with having different salts in an intersect: ```haskell intersection :: (Eq k, Hashable k) => HashMapT salt k v -> HashMapT salt k w -> HashMapT salt k v ``` Because salt is the same type variable for all arguments, it's enforced to be the same. Then you can also provide a function to resalt if the user ever ends up with different salts and still wants to do an intersect. (which results in a reconstruction of the hashmap). See thread: haskell-unordered-containers#319 I know these libraries will at least fail with the changes because they have instances on `HashMap`, which now need to be `HashMapT salt`: quickcheck-instances-0.3.25.2.drv semirings-0.6.drv relude-0.7.0.0.drv semigroupoids-5.3.5.drv I did this by stubbing out unordered containers in a 100k loc codebase to see what issues would come up in CI.
73fa02e
to
89c4f40
Compare
We could easily fix the compatibility issues by making |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed the whole patch, but I hope you'll find some of my suggestions useful. I'm still not too clear on the purpose of this whole exercise. It still won't lead to a secure implementation; is it solely for the purpose of preventing accidentally repeated bad performance? If so, is it really worth the complexity?
| Collision !Hash !(A.Array (Leaf k v)) | ||
deriving (Typeable) | ||
|
||
type role HashMap nominal representational | ||
-- backwards compatibility | ||
type HashMap = HashMapT DefaultSalt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me a bit nervous. Beginners might be a bit frightened by HashMap
being a type synonym for some wacky type-tagged thing.
@@ -196,8 +205,10 @@ import Data.Coerce (coerce) | |||
------------------------------------------------------------------------ | |||
|
|||
-- | Convenience function. Compute a hash value for the given value. | |||
hash :: H.Hashable a => a -> Hash | |||
hash = fromIntegral . H.hash | |||
hash ::forall a (salt :: Nat) . H.Hashable a => KnownNat salt => Proxy salt -> a -> Hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for the double =>
rather than the conventional constraint tuple? If so, please add an explanatory comment. Also, is there a reason you use Proxy
here and not Proxy#
? I'd prefer the latter to be more confident we won't pass an actual run-time value.
hash ::forall a (salt :: Nat) . H.Hashable a => KnownNat salt => Proxy salt -> a -> Hash | ||
hash proxy = fromIntegral . H.hashWithSalt (fromIntegral salt) | ||
where | ||
salt = natVal proxy -- TODO ensure is unboxed, ensure is a Word |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much doubt that it's possible to unbox it with current GHC, but I'd love to be proven wrong.
Question: would we get better code/specialization/performance with an auxiliary class? I'm not at all sure, but it seems vaguely plausible.
wordVal :: forall n. KnownWord n => Proxy# n -> Word
wordVal _ = unTagged (wordVal' :: Tagged n Word)
class KnownWord (n :: Nat) where
wordVal' :: Tagged n Word
instance KnownNat n => KnownWord n where
#if MIN_VERSION_base(4,8,0)
wordVal = Tagged $ fromIntegral (natVal' (proxy# :: Proxy# n))
#else
wordVal _ = Tagged $ fromIntegral (natVal (Proxy :: Proxy n))
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about this, how do I know if the resulting code boxes it or not?
My understanding of boxing is in C:
int* salt = malloc(sizeof(int)); // reserves memory somehwere, pointer gets pushed on stack
(*salt) = 50; // put 50 into memory somehwere using pointer
rather then
int salt = 50; // put on stack
How do I make ghc tell me which of those two options it chooses?
|
||
-- these come from | ||
-- https://github.com/haskell-unordered-containers/hashable/blob/master/src/Data/Hashable/Class.hs#L221 | ||
-- Can't have negatives in a natural. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just don't use a Natural
.
type DefaultSalt = 14695981039346656037 -- fromInteger (-3750763034362895579) :: Word64 | ||
#else | ||
type DefaultSalt = 15868100553162883236 -- old values https://github.com/haskell-unordered-containers/hashable/blame/ade7f97d1c59e9cfbad49b6ed130b90805311758/Data/Hashable/Class.hs#L202 | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand these comments a bit to add context?
empty = empty' | ||
|
||
-- | like 'empty' but allows a custom salt to be set | ||
empty' :: forall k v (salt :: Nat) . HashMapT salt k v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the salt should be the first type argument for use with TypeApplications
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, for new API's I can do this, for old API's I need to retain push the salt to the final position (in case someone used type applications).
HashMap is now an alias to HashMapT with a hardcoded DefaultSalt. | ||
These changes allow salt to be anything. | ||
semigroup operations (a -> a -> a) can use the same salt to guarantee | ||
not having to rebuild the hahsmap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not having to rebuild the hahsmap. | |
not having to rebuild the hashmap. |
usually it's good enough to provide the instance with a free salt: | ||
|
||
```haskell | ||
instance SomeTypeClass (HashMap salt k v) where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't that be HashMapT
?
singleton = singleton' | ||
|
||
-- | like 'singleton' but allows a custom salt to be set | ||
singleton' :: forall k v (salt :: Nat) . (Hashable k, KnownNat salt) => k -> v -> HashMapT salt k v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think the salt should be the first type argument. That may even always be the case.
|
||
-- | /O(1)/ Construct a map with a single element. | ||
singleton :: (Hashable k) => k -> v -> HashMap k v | ||
singleton k v = Leaf (hash k) (L k v) | ||
singleton :: forall k v . (Hashable k) => k -> v -> HashMap k v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're here, remove the redundant parentheses.
Good question. It'd solve the instance head problem, but introduce another problem for existing functions, consider empty :: forall k v . HashMap k v
insert :: forall k v salt . (Eq k, Hashable k, KnownNat salt) => k -> v -> HashMapT salt k v -> HashMapT salt k v The insert :: forall k v salt . (Eq k, Hashable k, KnownNat salt, Coercable (map k v) (HashMapT salt k v) , Coercable (HashMapT salt k v) (map k v) => k -> v -> map k v -> map k v
insert k v m = coerce $ ...
where
orig :: HashMapT salt k v
orig = coerce m I think that option has to be weighed against the cost of fixing downstream instance heads, which (mostly (1) ) can be solved with a find and replace.
(1): I had to introduce an additional typeable constraint for Data.Data in older ghc < 8 versions. I'll try fixing review comments after work. |
Thanks for giving us a concrete implementation to play with, @jappeace! :) Before I dive into the implementation, I'd first like to discuss the basic design and how it helps address #319. In particular, I'm wondering how this design compares to a more simplistic approach to #45, like
I'm also not convinced yet that we need to have different hash salts per Does one of these approaches make it more difficult to leak a secret salt? I'm not sure! My suggestion would be to discuss these design issues separately from this PR, in a new issue. |
I have to answer all these questions later (after work), because it involves some verifying of my intuition, but you're right this is just a possible solution (don't mean to pressure anyone). I just saw treeowl also has some more fundemental questions. |
Without suggesting that any of this is ultimately right for the library, the |
It'll give the user the capability to use a different salt, per hashmap value. getRngSalt <- randomIO
case someNatVal getRngSalt of
(Just (SomeNat (proxy :: Proxy n))) ->
print (HML.singleton' @_ @_ @n "hi" 0 ) -- here we can use n with a random value.
Nothing -> pure () This won't make HashMap perse more robust against collision attacks, But, if the salt can be changed it starts making more sense to look at SipHash because we now can hide the
But I really need to read the paper to verify this claim.
|
I think treeowl already did a good job of explaining, but I'll expand it with some examples. Consider union union :: (Eq k, Hashable k, KnownNat salt) =>
HashMap k v -- salt of 5
-> HashMap k v -- salt of 6
-> HashMap k v now the hashable instance will produce misaligning hashes for the same value. For example a key hashWithSalt 5 "hello world" = 2407350307841482486
hashWithSalt 6 "hello world" = -5301410815273744343 The implementation however doesn't do a full reconstruct, instead it aligns the datastructures, for example: -- empty vs. anything
go !_ t1 Empty = t1
go _ Empty t2 = t2
-- leaf vs. leaf
go s t1@(Leaf h1 l1@(L k1 v1)) t2@(Leaf h2 l2@(L k2 v2))
| h1 == h2 = if k1 == k2
then Leaf h1 (L k1 (f k1 v1 v2))
else collision h1 l1 l2
| otherwise = goDifferentHash s h1 h2 t1 t2 If we'd do a simpler approach we'd have to always reconstruct one of them on unequal salts (which is slow). Another alternative is to tell the user the salts don't allign: data UnionErrors = SaltsNoAlign Int Int
union :: (Eq k, Hashable k, KnownNat salt) => HashMapT salt k v -> HashMapT salt k v -> Either UnionErrors (HashMapT salt k v) This breaks the API.
it would be better, although FNV isn't really resistant against hash flooding attacks. |
I understand that the type-level salt ensures that we use the fast non-rehashing algorithms for combining I think the advantage of a simple "stashed salt" approach is that we could use the same API for both cases. In the implementation, we'd check whether the salts are the same, and, depending on the outcome, use either the fast non-rehashing algorithm or a slower algorithm that involves rehashing the keys of one of the maps.
My understanding is that, as long as FNV is used, even a random seed per request would be pretty useless, since it's possible to generate seed-independent collisions for FNV: https://medium.com/@robertgrosse/generating-64-bit-hash-collisions-to-dos-python-5b21404a5306 Also, since you brought up SipHash: My current perspective is that any approach involving SipHash would be pointless, because SipHash seems to be too slow: #319 (comment). We need a faster hash function. |
Using this I'd suggest maybe going even further and using data Salt = Salt Nat | GlobalRandomSalt
type DefaultSalt = 'Salt _ One of the problems of this approach of using |
We don't actually need to use |
Probably but there are some tricks to cover most cases. At least under -O2, GHC is pretty good at not boxing static But to really maximise the chances that at every site the salt is reified from type to value, it's represented as a constant wordVal# :: KnownWord w => Proxy# w -> Word#
wordVal# = ...
{-# INLINE insert #-}
insert :: forall salt k v. (KnownWord salt, Eq k, Hashable k) => k -> v -> HashMapT salt k v -> HashMapT salt k v
insert k v m = unsafeInsert# (wordVal# (Proxy# :: Proxy# salt)) k v m
-- whatever original inlinable/inline/noinline goes pragma here
unsafeInsert# :: forall salt k v. (KnownWord salt, Eq k, Hashable k) => Word# -> k -> v -> HashMapT salt k v -> HashMapT salt k v
unsafeInsert# = ... Which would definitely turn this into a pretty big/ugly refactor. AFAIK wherever the |
This allows clients to create custom salted hashmaps.
For backwards compatibility we use
Then modify the functions to be free of salt if they can, for example insert:
This allows the default HashMap with backwards compatibility,
but also any other HashMapT
I think this solves the issue with having different salts in
an intersect:
Because salt is the same type variable for all arguments,
it's enforced to be the same.
Then you can also provide a function to resalt if the user ever ends
up with different salts and still wants to do an intersect.
(which results in a reconstruction of the hashmap).
See thread: #319
I know these libraries will at least fail with the changes because they have instances on
HashMap
,which now need to be
HashMapT salt
:quickcheck-instances-0.3.25.2.drv
semirings-0.6.drv
relude-0.7.0.0.drv
semigroupoids-5.3.5.drv
I did this by stubbing out unordered containers in a 100k loc codebase
to see what issues would come up in CI.