IndexMap::insert_unique_unchecked #200
Insert a key-value pair into the map without checking if the key already exists in the map.

This operation is safe if the key does not exist in the map. However, if the key already exists in the map, the behavior is unspecified: this operation may panic, or any following operation on the map may panic or return an arbitrary result.

This operation is faster than a regular insert because it does not perform a lookup before insertion.

This operation is useful during initial population of the map. For example, when constructing a map from another map, we know that the keys are unique.

A simple benchmark of `insert` vs `insert_unique_unchecked` is included:

```
test insert                  ... bench:      14,929 ns/iter (+/- 2,222)
test insert_unique_unchecked ... bench:      11,272 ns/iter (+/- 1,172)
```
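To make the "no lookup before insertion" point concrete, here is a minimal sketch of the idea on a toy open-addressing table. `ToyMap` and all of its methods are hypothetical names invented for this illustration; this is not how `indexmap` or `hashbrown` are implemented, and the toy never grows. The regular `insert` compares keys along the probe sequence, while `insert_unique_unchecked` only looks for the first free slot.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy fixed-size table with linear probing (hypothetical, for illustration only).
pub struct ToyMap<K, V> {
    slots: Vec<Option<(K, V)>>,
    len: usize,
}

impl<K: Hash + Eq, V> ToyMap<K, V> {
    pub fn with_capacity(cap: usize) -> Self {
        // At least twice the requested capacity, rounded up to a power of two.
        let n = (cap.max(1) * 2).next_power_of_two();
        ToyMap { slots: (0..n).map(|_| None).collect(), len: 0 }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// First slot of the probe sequence for `key`.
    fn start(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) & (self.slots.len() - 1)
    }

    /// Regular insert: compares keys while probing and replaces on a match.
    pub fn insert(&mut self, key: K, value: V) -> Option<V> {
        let mask = self.slots.len() - 1;
        let mut i = self.start(&key);
        loop {
            match self.slots[i].as_mut() {
                Some((k, v)) if *k == key => return Some(std::mem::replace(v, value)),
                Some(_) => i = (i + 1) & mask,
                None => break,
            }
        }
        // Always keep one slot empty so that probing terminates.
        assert!(self.len + 1 < self.slots.len(), "toy table is full");
        self.slots[i] = Some((key, value));
        self.len += 1;
        None
    }

    /// Unchecked insert: never compares keys, just takes the first free slot.
    /// Memory-safe, but a duplicate key leaves two entries in the table.
    pub fn insert_unique_unchecked(&mut self, key: K, value: V) {
        assert!(self.len + 1 < self.slots.len(), "toy table is full");
        let mask = self.slots.len() - 1;
        let mut i = self.start(&key);
        while self.slots[i].is_some() {
            i = (i + 1) & mask;
        }
        self.slots[i] = Some((key, value));
        self.len += 1;
    }

    pub fn get(&self, key: &K) -> Option<&V> {
        let mask = self.slots.len() - 1;
        let mut i = self.start(key);
        while let Some((k, v)) = &self.slots[i] {
            if k == key {
                return Some(v);
            }
            i = (i + 1) & mask;
        }
        None
    }
}
```

In this toy, inserting the same key twice through `insert_unique_unchecked` stays memory-safe but leaves the table logically inconsistent: `len()` counts both entries and `get` only ever sees the first one. That is the kind of unspecified result the description warns about.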
Does hashbrown itself have any operation like this? Maybe If this method is good - is it something we can use internally somewhere? |
Hashbrown's several I'm not aware of anything public like this on |
I like Give me some time, I'll open a PR to hashbrown and hear what people say there (about the name, and generally about the idea of exposing map internals). One possibility is that they say, just use |
Sometimes a map is constructed when it is known that all keys are unique (e.g. if the keys are coming from another map or from a sorted/deduplicated iterator). In this case we can make insertion faster by skipping the check that a key already exists in the map. `insert_unique_unchecked` is guaranteed to be memory-safe, but does not guarantee anything beyond that: if the inserted key is not unique, `HashMap` can panic, loop forever, return an incorrect entry, etc. Added a simple benchmark. `insert_unique_unchecked` is about 30% faster than `insert`. Your mileage may vary, of course. A similar PR was [added to the `indexmap` crate](indexmap-rs/indexmap#200), and they asked to discuss the name of the operation with the `hashbrown` crate owners to come to the same naming convention (if `hashbrown` is willing to have the same operation).
Hashbrown PR: rust-lang/hashbrown#293 |
90c45fd to dd3b2d8
Updated the PR with However, unlike hashbrown, I opted not to return |
Summary: Before this diff `kwargs` were collected into `Dict`, which is `SmallMap<Hashed<Value>, Value>`. Now `kwargs` are collected into `SmallMap<Hashed<StringValue>, Value>` and that `SmallMap` is "coerced" into a map with a `Value` key. When inserting keys we look up previous keys (this should be partially addressed by [this PR in indexmap](indexmap-rs/indexmap#200)), and the equality operation on `StringValue` is cheaper than equality on `Value` because there are no dynamic casts. We don't do real equality often (because hash collisions are rare), but having `StringValue` instead of `Value` may generate more efficient machine code. This also makes the code a little more type-safe. Reviewed By: ndmitchell Differential Revision: D30921794 fbshipit-source-id: cf2b4fa72eeef150e6308d2fe2d7c16f59166586
/// Insert a key-value pair into the map without checking
/// if the key already exists in the map.
///
/// Returns a reference to the key and value just inserted.
Looks good! It does not return a reference in this version (this doc and the other both say this).
I can see the problem: we'd need some unsafe code to do the index -> &mut V lookup without extra cost here. I think it's fine to avoid this and just return nothing, unless there is a compelling, nameable use case.
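As a rough illustration of the index -> &mut V lookup mentioned above, the sketch below uses a toy entries vector. `Entries`, `push`, `push_and_get`, and `get` are hypothetical names made up for this example and are not indexmap's internals. The point is only that the safe way to hand back references after an insert is to index the vector again, which goes through a bounds check that the optimizer may or may not remove.

```rust
/// Toy insertion-ordered entry storage (hypothetical, for illustration only).
pub struct Entries<K, V> {
    entries: Vec<(K, V)>,
}

impl<K, V> Entries<K, V> {
    pub fn new() -> Self {
        Entries { entries: Vec::new() }
    }

    /// Append an entry and return only its index.
    pub fn push(&mut self, key: K, value: V) -> usize {
        let index = self.entries.len();
        self.entries.push((key, value));
        index
    }

    /// Append an entry and return references to it, using safe code only.
    /// The `self.entries[index]` expression is a bounds-checked lookup.
    pub fn push_and_get(&mut self, key: K, value: V) -> (usize, &K, &mut V) {
        let index = self.push(key, value);
        let (k, v) = &mut self.entries[index];
        (index, &*k, v)
    }

    pub fn get(&self, index: usize) -> Option<(&K, &V)> {
        self.entries.get(index).map(|(k, v)| (k, v))
    }
}
```

Whether the extra lookup matters in practice is an empirical question, which is what the benchmarks later in this thread try to answer.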
@cuviper This is a new feature, and in theory we'd bump the version to 1.8 since it's a feature update. I guess it's only debatable where to draw the line between an incremental fix and an improvement, but I don't mind using the feature bump for new methods (for new trait impls, it depends, I think). |
I prefer to see that we try to keep map-set parity when possible. Should there be a set method for this? |
dd3b2d8 to 21edbab
Updated docs. There's a |
I'm conflicted about matching the name without properly matching the API in the return value. I don't see why you would change |
FWIW, the initial numbers are much closer on my machine, using
Adding the return references does consistently slow that down:
I also tried running with 10k insertions, and the benefit narrowed further:
and with return references:
@cuviper looks bad. Let me check again. |
On my laptop, MacBook Pro, Intel Core i9, rustc 1.57.0-nightly (8c2b6ea37 2021-09-11) the numbers I gave are correct. |
For 10k items:
For 100:
I'm not too concerned about that version criteria, personally. I do prefer minor bumps for pseudo-breaking things like MSRV, but otherwise it's fine either way. We also have #195 and #196 pending release that are featureful.
That's fine, benchmarks vary. FWIW mine is a desktop running Fedora 34 with an AMD Ryzen 7 3800X. |
And if I do that, there's no regression anymore (at least I don't see it easily: rust builtin benchmarks are too unreliable). I'll update the PR now. |
OK, proper (more or less) benchmark for 1000 inserts. 100 runs. Raw data: https://gist.github.com/stepancheg/b234abc8da06de88acc18c3a1c7adfe3 Yes, I gave a too high number initially. Sorry. |
21edbab to 969115b
I'll post an update after another benchmark finishes. |
OK, more benchmarks. Returning references from How I checked. I created a simple utility:
Compiled it with this branch. Then applied a patch which removes returning values and compiled the utility again (version C). And applied a patch to this branch which uses unsafe (version B):
Then I used my absh utility https://github.com/stepancheg/absh which I used for proper A/B benchmarking. Utility is invoked like this:
After 500 iterations the output is:
Again: So, simply returning references (A) causes about 3% regression against version with returning nothing (C), but using |
Hi, this is an interesting feature. What can I do to push it forward? The PR in hashbrown got merged. Do we also need it on the HashMap of std? |
The ideal would be to have this added and stabilized in |