Update benchmark results after migrating to AWS. #127

Merged
scottcanoe merged 3 commits into thousandbrainsproject:main from aws_benchmarks on Jan 9, 2025

Conversation

scottcanoe
Contributor

@scottcanoe scottcanoe commented Jan 8, 2025

This PR updates the benchmark results tables after migrating to AWS. No changes have been made to tbp.monty's core code since the last complete set of benchmarks was run.

@scottcanoe scottcanoe marked this pull request as ready for review January 8, 2025 15:46
@tristanls tristanls added documentation Improvements or additions to documentation triaged This issue or pull request was triaged labels Jan 8, 2025
@nielsleadholm
Contributor

Thanks very much for running these, Scott. Are the results deterministic within AWS? I.e., if a new instance is spun up, do we at least get exactly the same results across those cases? In that case I would be less concerned about any differences between these results and Oracle (e.g., the accuracy changes look equivocal).

@scottcanoe
Contributor Author

> Thanks very much for running these, Scott. Are the results deterministic within AWS? I.e., if a new instance is spun up, do we at least get exactly the same results across those cases? In that case I would be less concerned about any differences between these results and Oracle (e.g., the accuracy changes look equivocal).

@nielsleadholm I'm rerunning benchmarks now, and I'll post an update when they're done. It'd be really nice to have a complete answer about reproducibility (i.e., rerun all benchmarks, including laptop). I think I can probably get this done in 2-3 hours with AWS.

@scottcanoe
Contributor Author

Reproducibility Report

Since we are on new infrastructure and some results differed from those on OCI, I ran the benchmarks a second time. Below are tables for both batches of runs. I'm happy to report that everything is identical except for small variations in run times.
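A check like this can also be done mechanically by comparing the two batches row by row while ignoring the timing columns. The following is only a minimal sketch: `diff_runs`, the dict layout, and the column names are hypothetical and not part of tbp.monty. The sample values are copied from two rows of the 10-object tables.

```python
# Hypothetical helper, not part of tbp.monty: compare two benchmark batches
# while ignoring columns that are expected to vary between runs (run times).
TIMING_COLUMNS = {"run_time", "episode_run_time_s"}


def diff_runs(run1, run2, ignore=TIMING_COLUMNS):
    """Return {experiment: {column: (value1, value2)}} for every mismatch."""
    mismatches = {}
    for name in sorted(set(run1) | set(run2)):
        row1, row2 = run1.get(name, {}), run2.get(name, {})
        changed = {
            col: (row1.get(col), row2.get(col))
            for col in sorted((set(row1) | set(row2)) - set(ignore))
            if row1.get(col) != row2.get(col)
        }
        if changed:
            mismatches[name] = changed
    return mismatches


# Two rows copied from the 10-object tables (Run 1 and Run 2).
run1 = {
    "base_config_10distinctobj_dist_agent": {
        "pct_correct": 99.29, "pct_used_mlh": 5.00, "num_matching_steps": 34,
        "rotation_error_rad": 0.27, "run_time": "6m", "episode_run_time_s": 20,
    },
    "randrot_noise_10distinctobj_dist_agent": {
        "pct_correct": 98.00, "pct_used_mlh": 6.00, "num_matching_steps": 47,
        "rotation_error_rad": 0.45, "run_time": "5m", "episode_run_time_s": 31,
    },
}
run2 = {
    "base_config_10distinctobj_dist_agent": {
        "pct_correct": 99.29, "pct_used_mlh": 5.00, "num_matching_steps": 34,
        "rotation_error_rad": 0.27, "run_time": "6m", "episode_run_time_s": 19,
    },
    "randrot_noise_10distinctobj_dist_agent": {
        "pct_correct": 98.00, "pct_used_mlh": 6.00, "num_matching_steps": 47,
        "rotation_error_rad": 0.45, "run_time": "7m", "episode_run_time_s": 38,
    },
}

# An empty dict means the non-timing columns are identical.
print(diff_runs(run1, run2))
```

Passing `ignore=()` would report the timing columns as well, which is a quick way to see how much the run times moved between batches.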

Shorter Experiments with 10 Objects

Run 1

| Experiment | % Correct | % Used MLH | Num Matching Steps | Rotation Error (radians) | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| base_config_10distinctobj_dist_agent | 99.29% | 5.00% | 34 | 0.27 | 6m | 20s |
| base_config_10distinctobj_surf_agent | 100.00% | 0.00% | 28 | 0.17 | 4m | 19s |
| randrot_noise_10distinctobj_dist_agent | 98.00% | 6.00% | 47 | 0.45 | 5m | 31s |
| randrot_noise_10distinctobj_dist_on_distm | 100.00% | 2.00% | 36 | 0.26 | 4m | 28s |
| randrot_noise_10distinctobj_surf_agent | 99.00% | 0.00% | 28 | 0.33 | 4m | 27s |
| randrot_10distinctobj_surf_agent | 100.00% | 0.00% | 29 | 0.40 | 3m | 19s |
| randrot_noise_10distinctobj_5lms_dist_agent | 100.00% | 7.00% | 52 | 0.86 | 18m | 86s |
| base_10simobj_surf_agent | 95.00% | 7.86% | 70 | 0.16 | 8m | 41s |
| randrot_noise_10simobj_dist_agent | 82.00% | 40.00% | 182 | 0.61 | 16m | 116s |
| randrot_noise_10simobj_surf_agent | 90.00% | 34.00% | 180 | 0.50 | 24m | 203s |
| randomrot_rawnoise_10distinctobj_surf_agent | 73.00% | 78.00% | 15 | 1.54 | 11m | 12s |
| base_10multi_distinctobj_dist_agent | 69.29% | 47.14% | 25 | 0.82 | 1h6m | 2s |

Run 2

| Experiment | % Correct | % Used MLH | Num Matching Steps | Rotation Error (radians) | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| base_config_10distinctobj_dist_agent | 99.29% | 5.00% | 34 | 0.27 | 6m | 19s |
| base_config_10distinctobj_surf_agent | 100.00% | 0.00% | 28 | 0.17 | 4m | 17s |
| randrot_noise_10distinctobj_dist_agent | 98.00% | 6.00% | 47 | 0.45 | 7m | 38s |
| randrot_noise_10distinctobj_dist_on_distm | 100.00% | 2.00% | 36 | 0.26 | 4m | 29s |
| randrot_noise_10distinctobj_surf_agent | 99.00% | 0.00% | 28 | 0.33 | 5m | 36s |
| randrot_10distinctobj_surf_agent | 100.00% | 0.00% | 29 | 0.40 | 3m | 19s |
| randrot_noise_10distinctobj_5lms_dist_agent | 100.00% | 7.00% | 52 | 0.86 | 16m | 77s |
| base_10simobj_surf_agent | 95.00% | 7.86% | 70 | 0.16 | 9m | 48s |
| randrot_noise_10simobj_dist_agent | 82.00% | 40.00% | 182 | 0.61 | 16m | 117s |
| randrot_noise_10simobj_surf_agent | 90.00% | 34.00% | 180 | 0.50 | 22m | 189s |
| randomrot_rawnoise_10distinctobj_surf_agent | 73.00% | 78.00% | 15 | 1.54 | 15m | 16s |
| base_10multi_distinctobj_dist_agent | 69.29% | 47.14% | 25 | 0.82 | 1h5m | 2s |
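Run times are the only values that differ between the two batches above. To put a number on that spread, the run-time strings can be converted to seconds first. This is a rough sketch only: `to_seconds` and `relative_change` are hypothetical helpers, and the parser assumes the `1h6m` / `16m` / `20s` forms seen in these tables.

```python
import re

# Matches strings such as "1h6m", "16m", or "20s"; every part is optional.
_RUN_TIME = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def to_seconds(text):
    """Convert a run-time string like '1h6m' into a number of seconds."""
    match = _RUN_TIME.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized run time: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def relative_change(first, second):
    """Fractional change in run time from the first batch to the second."""
    a, b = to_seconds(first), to_seconds(second)
    return (b - a) / a


# Run Time values for two rows of the 10-object tables (Run 1 -> Run 2).
print(to_seconds("1h6m"))  # 3960
print(f"{relative_change('1h6m', '1h5m'):+.1%}")  # base_10multi_distinctobj_dist_agent
print(f"{relative_change('11m', '15m'):+.1%}")  # randomrot_rawnoise_10distinctobj_surf_agent
```

Because the Run Time column is only reported to the minute, changes computed this way are coarse for the shorter experiments.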

Longer Experiments with all 77 YCB Objects

Run 1

| Experiment | % Correct | % Used MLH | Num Matching Steps | Rotation Error (radians) | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| base_77obj_dist_agent | 93.07% | 14.72% | 86 | 0.33 | 1h4m | 197s |
| base_77obj_surf_agent | 98.27% | 5.19% | 57 | 0.21 | 31m | 96s |
| randrot_noise_77obj_dist_agent | 87.01% | 29.87% | 148 | 0.69 | 1h33m | 314s |
| randrot_noise_77obj_surf_agent | 94.81% | 19.91% | 107 | 0.61 | 55m | 198s |
| randrot_noise_77obj_5lms_dist_agent | 84.42% | 9.09% | 64 | 1.07 | 42m | 800s |

Run 2

| Experiment | % Correct | % Used MLH | Num Matching Steps | Rotation Error (radians) | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| base_77obj_dist_agent | 93.07% | 14.72% | 86 | 0.33 | 1h4m | 196s |
| base_77obj_surf_agent | 98.27% | 5.19% | 57 | 0.21 | 28m | 88s |
| randrot_noise_77obj_dist_agent | 87.01% | 29.87% | 148 | 0.69 | 1h36m | 323s |
| randrot_noise_77obj_surf_agent | 94.81% | 19.91% | 107 | 0.61 | 57m | 205s |
| randrot_noise_77obj_5lms_dist_agent | 84.42% | 9.09% | 64 | 1.07 | 47m | 944s |

Unsupervised Learning

Run 1

| Experiment | % Correct - 1st Epoch | % Correct - >1st Epoch | Mean Objects per Graph | Mean Graphs per Object | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| surf_agent_unsupervised_10distinctobj | 80.00% | 86.67% | 1.11 | 1.11 | 16m | 10s |
| surf_agent_unsupervised_10distinctobj_noise | 80.00% | 67.78% | 1.09 | 2.78 | 22m | 13s |
| surf_agent_unsupervised_10simobj | 50.00% | 76.67% | 2.75 | 2.20 | 25m | 15s |

Run 2

| Experiment | % Correct - 1st Epoch | % Correct - >1st Epoch | Mean Objects per Graph | Mean Graphs per Object | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| surf_agent_unsupervised_10distinctobj | 80.00% | 86.67% | 1.11 | 1.11 | 17m | 10s |
| surf_agent_unsupervised_10distinctobj_noise | 80.00% | 67.78% | 1.09 | 2.78 | 23m | 14s |
| surf_agent_unsupervised_10simobj | 50.00% | 76.67% | 2.75 | 2.20 | 26m | 16s |

Monty-Meets-World

Run 1

| Experiment | % Correct | % Used MLH | Num Matching Steps | Rotation Error (radians) | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| randrot_noise_sim_on_scan_monty_world | 80.00% | 85.83% | 437 | 0.94 | 54m | 25s |
| world_image_on_scanned_model | 66.67% | 87.50% | 453 | 2.05 | 16m | 19s |
| dark_world_image_on_scanned_model | 43.75% | 77.08% | 433 | 1.87 | 15m | 18s |
| bright_world_image_on_scanned_model | 47.92% | 83.33% | 457 | 2.16 | 22m | 27s |
| hand_intrusion_world_image_on_scanned_model | 54.17% | 47.92% | 333 | 1.79 | 11m | 13s |
| multi_object_world_image_on_scanned_model | 41.67% | 39.58% | 298 | 1.67 | 10m | 12s |

Run 2

| Experiment | % Correct | % Used MLH | Num Matching Steps | Rotation Error (radians) | Run Time | Episode Run Time (s) |
|---|---|---|---|---|---|---|
| randrot_noise_sim_on_scan_monty_world | 80.00% | 85.83% | 437 | 0.94 | 57m | 27s |
| world_image_on_scanned_model | 66.67% | 87.50% | 453 | 2.05 | 20m | 24s |
| dark_world_image_on_scanned_model | 43.75% | 77.08% | 433 | 1.87 | 17m | 21s |
| bright_world_image_on_scanned_model | 47.92% | 83.33% | 457 | 2.16 | 18m | 21s |
| hand_intrusion_world_image_on_scanned_model | 54.17% | 47.92% | 333 | 1.79 | 10m | 12s |
| multi_object_world_image_on_scanned_model | 41.67% | 39.58% | 298 | 1.67 | 9m | 11s |

Contributor

@nielsleadholm nielsleadholm left a comment

Thanks for running those, Scott.

Given that the results are identical across the two AWS runs, I'm not concerned about issues with our seed fixing, etc., and I assume that some of the other potential factors we discussed might explain the Oracle vs. AWS differences. The accuracy changes between Oracle and AWS look equivocal to me, so I think we should merge this rather than spend a week or more chasing possible causes, given that they haven't had a negative effect.

@vkakerbeck, just tagging you to make sure you agree before we merge it.

Contributor

@vkakerbeck vkakerbeck left a comment

Yes, that makes sense. Thanks for updating those, Scott, and for making sure the new infrastructure still produces consistent results!

@scottcanoe scottcanoe merged commit a6464bc into thousandbrainsproject:main Jan 9, 2025
13 checks passed
@scottcanoe scottcanoe deleted the aws_benchmarks branch January 9, 2025 22:13
4 participants