Add more detailed data structure information #161

treeowl · 2017-06-26T17:12:14Z

Off the top of my head:

Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?
Exactly how is the bitmap represented? That seems relatively tricky, and particularly important for improving the test suite.
What is the endianness of hash consumption?

The text was updated successfully, but these errors were encountered:

tibbe · 2017-06-27T03:56:40Z

Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?

They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree. It would be possible to refactor the data structure to

data HashMap k v = Empty | Tree k v
data Tree k v = ...  -- the rest

I don't know if that's a win or not. One related thing I've considered is to have

data HashMap k v = Empty | Root k v
data Root k v = Root !Size !(Tree k v)
data Tree k v = ...

if we ever wanted to stored the size (for O(1) lookup). Function like insert would have to return the diff in size and update the root.

Exactly how is the bitmap represented? That seems relatively tricky, and particularly important for improving the test suite.

Just like the HAMT data structure presented in the original paper: the N bits of the bitmap represent the N-subtrees that are present. The array only contains the present subtrees with no empty trees in-between. We use popcount to go between the logical and compact indices.

What is the endianness of hash consumption?

Are you asking which end of the hash we use first when indexing or are you asking how hashable hashes various machine word-sized types?

treeowl · 2017-06-27T16:12:32Z

On Jun 26, 2017 11:56 PM, "Johan Tibell" <notifications@github.com> wrote: Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others? They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree. Well, if they're not supposed to be in non-empty maps, that can be part of a validity check. It would be possible to refactor the data structure to data HashMap k v = Empty | Tree k vdata Tree k v = ... -- the rest I don't know if that's a win or not. Well, it adds an indirection, which would matter if you have a container of maps. I really wish the newtype optimization were available for GADTs under special circumstances; such circumstances apply here and would make this free. One related thing I've considered is to have data HashMap k v = Empty | Root k vdata Root k v = Root !Size !(Tree k v)data Tree k v = ... if we ever wanted to stored the size (for O(1) lookup). Function like insert would have to return the diff in size and update the root. I don't see how that will work for unions or intersections. *Exactly how is the bitmap represented?* That seems relatively tricky, and particularly important for improving the test suite. Just like the HAMT data structure presented in the original paper: the N bits of the bitmap represent the N-subtrees that are present. The array only contains the present subtrees with no empty trees in-between. We use popcount to go between the logical and compact indices. So the popcount of the bitmap must equal the size of the array? That sounds like another good fact for generating test cases and checking validity. Given a logical index, we mask off all but that many bits of the bitmap and popcount to get the compact index? Given a compact index, I don't see a particularly nice way to get the logical index; does that use a binary search, or is it just never necessary? What is the endianness of hash consumption? Are you asking which end of the hash we use first when indexing or are you asking how hashable hashes various machine word-sized types? I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#161 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABzi_ZBJH_HQHyFzZpq44P1Y6D-lRj7eks5sIH14gaJpZM4OFmPM> .

tibbe · 2017-06-28T00:49:56Z

Well, if they're not supposed to be in non-empty maps, that can be part of
a validity check.

Sure. Perhaps add a macro-like check like we have for other things. You might have to add it to every function that mutates the map however. Or perhaps use smart constructors that validate the invariants.

I don't see how that will work for unions or intersections.

I haven't thought it through. Perhaps we'd just recompute the size from scratch while doing the union. Each recursive call to union could return the subtree size.

I meant which end of the hash is used first, but I guess the Hashable
question must interact with that for performance.

It uses the low order bits first. See D.HM.Base.index. Either could work, it depends on which side you expect to be better behaved. For example, if you were storing pointer and the Hashable instance for those was identity then you'd expect the lowest three bits to be zero on 64-bit platforms due to alignment.

treeowl · 2017-06-28T01:41:55Z

On Tue, Jun 27, 2017 at 8:49 PM, Johan Tibell ***@***.***> wrote: Well, if they're not supposed to be in non-empty maps, that can be part of a validity check. Sure. Perhaps add a macro-like check like we have for other things. You might have to add it to every function that mutates the map however. Or perhaps use smart constructors that validate the invariants.

I just meant that the test suite can say `valid result .&&. ....` in each test.

I don't see how that will work for unions or intersections. I haven't thought it through. Perhaps we'd just recompute the size from scratch while doing the union. Each recursive call to union could return the subtree size.

Can't unions skip all the processing of subtrees that are only found in one of the maps? Descending them all just to count could be pricy.

I meant which end of the hash is used first, but I guess the Hashable question must interact with that for performance. It uses the low order bits first. See D.HM.Base.index. Either could work, it depends on which side you expect to be better behaved. For example, if you were storing pointer and the Hashable instance for those was identity then you'd expect the lowest three bits to be zero on 64-bit platforms due to alignment.

I'd expect the pointer hash to rotate three bits to the right.

tibbe · 2017-06-28T02:49:53Z

Can't unions skip all the processing of subtrees that are only found in one
of the maps? Descending them all just to count could be pricy.

Possibly. As I said I haven't thought this through fully.

I'd expect the pointer hash to rotate three bits to the right.

Yes that's a common approach (or xor high and low bits).

sjakobi · 2020-07-13T16:36:23Z

Are Empty subtrees trees allowed everywhere? Nowhere? Some places but not others?

They are allowed everywhere but as an optimization they should only appear in an empty map. The point of BitmapIndexed is to avoid having to store pointers to Empty in the tree.

Based on #154 (comment), it seems that the Eq instance relies on Empty only existing at the top-level, not as a subtree.

treeowl · 2020-07-13T16:40:22Z

Correct. Empty only for the whole tree.

sjakobi · 2020-07-30T13:55:15Z

For reference, there was some discussion of invariants in #282 (comment).

IMHO it's very important that these invariants are documented (ideally in the code), and also tested.

sjakobi · 2022-05-03T16:06:21Z

#444 has made some progress on documenting and checking the invariants.

Better documentation in proper prose and with examples is still needed though.

EDIT: #425 and #450 are related.

treeowl assigned tibbe Jun 26, 2017

sjakobi added the documentation label May 31, 2020

sjakobi mentioned this issue Jul 13, 2020

Fast inclusion operation on hashmaps and hashsets #282

Merged

sjakobi mentioned this issue Mar 4, 2022

Invariants should be documented and tested #366

Closed

tibbe removed their assignment May 17, 2022

sjakobi mentioned this issue May 24, 2022

Towards unlifted HashMaps #463

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add more detailed data structure information #161

Add more detailed data structure information #161

treeowl commented Jun 26, 2017

tibbe commented Jun 27, 2017

treeowl commented Jun 27, 2017 via email

tibbe commented Jun 28, 2017

treeowl commented Jun 28, 2017 via email

tibbe commented Jun 28, 2017

sjakobi commented Jul 13, 2020

treeowl commented Jul 13, 2020

sjakobi commented Jul 30, 2020

sjakobi commented May 3, 2022 •

edited

Loading

Add more detailed data structure information #161

Add more detailed data structure information #161

Comments

treeowl commented Jun 26, 2017

tibbe commented Jun 27, 2017

treeowl commented Jun 27, 2017 via email

tibbe commented Jun 28, 2017

treeowl commented Jun 28, 2017 via email

tibbe commented Jun 28, 2017

sjakobi commented Jul 13, 2020

treeowl commented Jul 13, 2020

sjakobi commented Jul 30, 2020

sjakobi commented May 3, 2022 • edited Loading

sjakobi commented May 3, 2022 •

edited

Loading