feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454

knzo25 · 2024-11-25T08:31:06Z

Description

This PR is part of a series of PRs that aim to accelerate the Sensing/Perception pipeline through an appropriate use of CUDA.

List of PRs:

feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454 (pointcloud preprocessing)
feat(autoware_pointcloud_preprocessor): cuda-accelerated pointcloud concatenation #9455 (concatenation)
feat(autoware_lidar_centerpoint): added the cuda_blackboard to centerpoint #9453 (centerpoint)
transfusion (TODO)
feat(autoware_probabilistic_occupancy_grid_map): cuda accelerated implementation #9542 (OGM - the first implementation will be independent of the blackboard to ease the transition)
feat: acceleration and transport layer tier4/aip_launcher#348 (aip_launcher)
feat: acceleration and transport layer sample_sensor_kit_launch#111 (sample_sensor_kit_launch)

To use these branches, the following additions to the autoware.repos are necessary:

  vendor/cuda_blackboard:
    type: git
    url: git@github.com:knzo25/cuda_blackboard.git
    version: main
  vendor/negotiated:
    type: git
    url: https://github.com/osrf/negotiated.git
    version: master

Depending on your machine and how many nodes are in a container, the following branch may also be required:
https://github.com/knzo25/launch_ros/tree/fix/load_composable_node
There seems to be a bug in ROS where, if too many service requests are sent at once, some of them are lost, and launch_ros cannot handle that.

How was this PR tested?

The sensing/perception pipeline was tested up to centerpoint for TIER IV's taxi using the logging simulator.
The following tests were executed on a laptop equipped with an RTX 4060 (laptop) GPU and an Intel(R) Core(TM) Ultra 7 165H (22 cores).

Node / processing time [ms]	Current	PR
/sensing/lidar/top/crop_box_filter_self/debug/processing_time_ms	5.81	N/A
/sensing/lidar/top/crop_box_filter_mirror/debug/processing_time_ms	4.59	N/A
/sensing/lidar/top/distortion_corrector/debug/processing_time_ms	10.96	N/A
/sensing/lidar/top/ring_outlier_filter/debug/processing_time_ms	10.69	N/A
/sensing/lidar/top/cuda_organized_pointcloud_adapter/debug/processing_time_ms	N/A	3.75
/sensing/lidar/top/cuda_pointcloud_preprocessor/debug/processing_time_ms	N/A	1.00
/sensing/lidar/concatenate_data_synchronizer/debug/processing_time_ms	7.83	0.70
Total	39.88	5.45

Notes for reviewers

The main branch that I used for development is feat/cuda_acceleration_and_transport_layer.
However, the changes were too big, so I split them into separate PRs. That being said, any further development will still happen on that branch (and then be cherry-picked to the respective PRs), and review changes will be cherry-picked back into the development branch.

Interface changes

An additional topic is added to perform type negotiation:
Example: input/pointcloud -> input/pointcloud and input/pointcloud/cuda

Effects on system behavior

Enabling this preprocessing in the launchers should considerably reduce latency and CPU usage (at the cost of higher GPU usage).

…sonal repository Signed-off-by: Kenzo Lobos-Tsunekawa <kenzo.lobos@tier4.jp>

github-actions · 2024-11-25T08:31:26Z

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

You've checked our contribution guidelines.
Your PR follows our pull request guidelines.
All required CI checks pass before marking the PR ready for review.

mojomex

Thank you for the amazing PR, these performance improvements are desperately needed.

I haven't checked the PR for functionality yet, but I'll leave my first round of comments here.

The main points I'd like to address are

memory safety and idiomatic C++ (there is currently a lot of raw-pointer code which should be avoided whenever possible)
modularity: currently the pipeline is hard-coded and all in one place. This makes the module hard to adapt to different projects, and makes it hard to maintain individual modules in the pipeline.

Thank you for your time!

sensing/autoware_cuda_pointcloud_preprocessor/README.md

mojomex · 2024-11-26T06:18:39Z

sensing/autoware_cuda_pointcloud_preprocessor/config/cuda_pointcloud_preprocessor.param.yaml

+    self_crop.min_x: 1.0
+    self_crop.min_y: 1.0
+    self_crop.min_z: 1.0
+    self_crop.max_x: -1.0
+    self_crop.max_y: -1.0
+    self_crop.max_z: -1.0
+    mirror_crop.min_x: 1.0
+    mirror_crop.min_y: 1.0
+    mirror_crop.min_z: 1.0
+    mirror_crop.max_x: -1.0
+    mirror_crop.max_y: -1.0
+    mirror_crop.max_z: -1.0


Instead of having two hard-coded crop-box filters here, a list would be more easily extensible, and it should be quite straightforward to change the implementation accordingly.
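To make the suggestion concrete, here is a minimal host-side sketch of what a list of crop boxes could look like. It is not the PR's code: `CropBox`, `parseCropBoxes`, `insideAnyBox`, and the flat six-values-per-box layout are all hypothetical names and choices made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical axis-aligned crop box; the name is illustrative, not from the PR.
struct CropBox
{
  float min_x, min_y, min_z;
  float max_x, max_y, max_z;
};

// Builds a list of boxes from a flat array laid out as
// [min_x, min_y, min_z, max_x, max_y, max_z] per box (assumed layout).
inline std::vector<CropBox> parseCropBoxes(const std::vector<double> & flat)
{
  if (flat.size() % 6 != 0) {
    throw std::invalid_argument("crop box list must hold 6 values per box");
  }
  std::vector<CropBox> boxes;
  boxes.reserve(flat.size() / 6);
  for (std::size_t i = 0; i < flat.size(); i += 6) {
    boxes.push_back(CropBox{
      static_cast<float>(flat[i]), static_cast<float>(flat[i + 1]),
      static_cast<float>(flat[i + 2]), static_cast<float>(flat[i + 3]),
      static_cast<float>(flat[i + 4]), static_cast<float>(flat[i + 5])});
  }
  return boxes;
}

// True when the point lies inside at least one box of the list.
// A box whose min exceeds its max contains no points, so it never matches.
inline bool insideAnyBox(const std::vector<CropBox> & boxes, float x, float y, float z)
{
  for (const auto & b : boxes) {
    const bool inside = x >= b.min_x && x <= b.max_x && y >= b.min_y && y <= b.max_y &&
                        z >= b.min_z && z <= b.max_z;
    if (inside) {
      return true;
    }
  }
  return false;
}
```

With a layout like this, adding a third crop region would be a change to the parameter list only; the same per-point loop could be mirrored in a kernel by passing the box array and its length.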

sensing/autoware_cuda_pointcloud_preprocessor/README.md

sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md

mojomex · 2024-12-24T05:41:22Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+
+  std::size_t max_ring = 0;
+
+  for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height;


Iteration without explicit bounds checking of the underlying array is not memory-safe. Thus, I would suggest using the abovementioned PointCloud2Iterators here.
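As a ROS-independent illustration of the kind of check meant here, the sketch below validates the buffer size against the declared dimensions before reading any ring value. `RawCloud` is a hypothetical stand-in that mirrors only the `width`, `height`, `point_step`, and `data` members of `sensor_msgs::msg::PointCloud2`; `maxRing`, the caller-supplied ring offset, and the 16-bit ring type are assumptions for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the relevant members of sensor_msgs::msg::PointCloud2.
struct RawCloud
{
  std::uint32_t width{0};
  std::uint32_t height{0};
  std::uint32_t point_step{0};
  std::vector<std::uint8_t> data;
};

// Writes the largest ring id to max_ring_out and returns true, or returns false
// when the cloud is empty or the buffer is inconsistent with its declared size.
inline bool maxRing(const RawCloud & cloud, std::size_t ring_offset, std::uint16_t & max_ring_out)
{
  // Widen before multiplying so width * height is not computed in 32 bits.
  const std::size_t num_points = static_cast<std::size_t>(cloud.width) * cloud.height;
  if (num_points == 0 || cloud.point_step == 0) {
    return false;
  }
  // The ring field must fit inside a single point.
  if (ring_offset > cloud.point_step || cloud.point_step - ring_offset < sizeof(std::uint16_t)) {
    return false;
  }
  // The buffer must hold every declared point (division avoids overflow).
  if (cloud.data.size() / cloud.point_step < num_points) {
    return false;
  }

  std::uint16_t max_ring = 0;
  for (std::size_t i = 0; i < num_points; ++i) {
    std::uint16_t ring = 0;
    // memcpy avoids unaligned or aliasing-unsafe reads from the byte buffer.
    std::memcpy(&ring, cloud.data.data() + i * cloud.point_step + ring_offset, sizeof(ring));
    max_ring = std::max(max_ring, ring);
  }
  max_ring_out = max_ring;
  return true;
}
```

The loop itself is unchanged from a plain index-based iteration; the difference is that a truncated or mislabeled message is rejected up front instead of being read past its end.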

mojomex · 2024-12-24T05:41:27Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+  num_rings_ = std::max(num_rings_, static_cast<std::size_t>(16));
+  std::vector<std::size_t> ring_points(num_rings_, 0);
+
+  for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height;


Iteration without explicit bounds checking of the underlying array is not memory-safe. Thus, I would suggest using the abovementioned PointCloud2Iterators here.

mojomex · 2024-12-24T05:42:34Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+    max_ring = std::max(max_ring, ring);
+  }
+
+  // Set max rings to the next power of two


Admittedly a niche problem, but not all sensors have 2^n rings (e.g. the Pandar40P has 40).

Although auto-detecting the number of rings is nice, there is no hard guarantee that the result is accurate (e.g. if the sensor is under a cover when turned on, the cloud contains no points).

Does cuda_pointcloud_preprocessor support changing dimensions of input pointclouds across iterations (e.g. cloud 1 has 0 rings, the next has 64 rings with 2000 points each, and the one after has 64 rings with 5000 points each)?
If not, I'd suggest making n_rings and max_points_per_ring parameters so that we can guarantee correct behavior at runtime.
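To illustrate both points: rounding a 40-ring sensor up to the next power of two yields 64 rows, and a fixed layout lets each incoming cloud be checked against known limits. This is only a sketch under assumptions: `nextPowerOfTwo`, `OrganizedLayout`, and `fitsLayout` are hypothetical helpers, and only the parameter names `n_rings` and `max_points_per_ring` come from the suggestion above.

```cpp
#include <cstddef>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n == 0).
// Assumes n does not exceed the largest power of two representable in std::size_t.
inline std::size_t nextPowerOfTwo(std::size_t n)
{
  std::size_t p = 1;
  while (p < n) {
    p <<= 1;
  }
  return p;
}

// Hypothetical fixed layout taken from parameters instead of auto-detection.
struct OrganizedLayout
{
  std::size_t n_rings;
  std::size_t max_points_per_ring;
};

// True when the observed per-ring point counts fit the configured layout.
// An empty cloud (no rings observed) trivially fits.
inline bool fitsLayout(
  const std::vector<std::size_t> & points_per_ring, const OrganizedLayout & layout)
{
  if (points_per_ring.size() > layout.n_rings) {
    return false;
  }
  for (const std::size_t count : points_per_ring) {
    if (count > layout.max_points_per_ring) {
      return false;
    }
  }
  return true;
}
```

With explicit parameters, a cloud that exceeds the configured size can be rejected or logged deterministically, regardless of what the first received cloud happened to contain.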

mojomex · 2024-12-24T05:47:44Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+bool CudaOrganizedPointcloudAdapterNode::orderPointcloud(
+  const sensor_msgs::msg::PointCloud2::ConstSharedPtr input_pointcloud_msg_ptr)
+{
+  const autoware::point_types::PointXYZIRCAEDT * input_buffer =


Same comment about bounds/type checking as above 🙇
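For the type-checking part, one possible shape is sketched below: verify that the message's point size matches the struct and that the buffer is large enough, then copy into typed storage rather than casting the raw pointer. `PointSketch` is a hypothetical stand-in whose field layout is an assumption and not a verified copy of `autoware::point_types::PointXYZIRCAEDT`; `copyPoints` is likewise illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical point layout used only for this example (assumed, not verified).
struct PointSketch
{
  float x, y, z;
  std::uint8_t intensity;
  std::uint8_t return_type;
  std::uint16_t channel;
  float azimuth, elevation, distance;
  std::uint32_t time_stamp;
};
static_assert(
  std::is_trivially_copyable<PointSketch>::value, "memcpy requires a trivially copyable type");

// Copies num_points points from a raw byte buffer into typed storage.
// Returns false, leaving out untouched, when the point size or buffer size does not match.
inline bool copyPoints(
  const std::vector<std::uint8_t> & data, std::size_t point_step, std::size_t num_points,
  std::vector<PointSketch> & out)
{
  // Type check: each serialized point must have exactly the size of the struct.
  if (point_step != sizeof(PointSketch)) {
    return false;
  }
  // Bounds check: the buffer must hold every declared point (division avoids overflow).
  if (data.size() / point_step < num_points) {
    return false;
  }
  out.resize(num_points);
  if (num_points > 0) {
    std::memcpy(out.data(), data.data(), num_points * sizeof(PointSketch));
  }
  return true;
}
```

A fuller check would also compare the field names, offsets, and datatypes listed in the message against the expected struct; the size comparison here only catches the coarsest mismatches.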

mojomex · 2024-12-24T05:57:53Z