Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zenoh 1.0.0 Porting #276

Merged
merged 148 commits into from
Dec 6, 2024
Merged

Zenoh 1.0.0 Porting #276

merged 148 commits into from
Dec 6, 2024

Conversation

YuanYuYuan
Copy link
Contributor

@YuanYuYuan YuanYuYuan commented Sep 5, 2024

Notes

  • Remove some unused dependencies.
  • Use z_view_keyexpr_t rather z_owned_keyexpr_t if possible. This keeps the reference only and doesn't need to be dropped.
  • The attachment type has been removed. Use the plain z_bytes instead.
  • Attachment is nothing but z_bytes. Use an iterator to get value by key.
  • z_check is now internal. We check zenoh entities' health by the creation functions' return.

Mallets and others added 30 commits August 9, 2024 14:32
Signed-off-by: Luca Cominardi <luca.cominardi@gmail.com>
Signed-off-by: Luca Cominardi <luca.cominardi@gmail.com>
Signed-off-by: Luca Cominardi <luca.cominardi@gmail.com>
chore: checkout dev/1.0.0
Copy link
Member

@Yadunund Yadunund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed 75% of the changes and provided some feedback. My primary concern is the introduction of registering a function with the atexit signal handler. As explained inline, there is still a problem of the static variable being invalidated and i'm confused why we need it with dev/1.0.0 when we do not need it rolling.

Secondly, I think the diff in the PR is way larger than it needs to be. Imo the changes only need reflect changes in the zenoh-c API (and things like attachment_helpers) but there are changes in rmw_zenoh.cpp wrt to checking if entities are valid, zenoh utils, and various cleanups to headers.
My recommendation is to take all the changes that remove cleanup headers and target a PR to the current rolling branch. And save other changes unrelated to breaking zenoh-c API for a future PR.

rmw_zenoh_cpp/src/zenohd/main.cpp Outdated Show resolved Hide resolved
@@ -361,28 +351,21 @@ rmw_ret_t ClientData::take_response(
}

// Fill in the request_header
request_header->request_id.sequence_number =
rmw_zenoh_cpp::get_int64_from_attachment(&sample->attachment, "sequence_number");
rmw_zenoh_cpp::attachment_data_t attachment(z_sample_attachment(sample));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API feels awkward and requires you to define the move constructor. Why not just pass const z_loaned_sample_t * and construct the z_sample_attachment internally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to miss this. The attachment is nothing but the usual zenoh bytes in 1.0.
On the other hand, zenoh-c basically doesn't provide any special structure for the users.
That's why we need to define an attachment data type and its (de)serialization.

The rmw_zenoh_cpp::attachment_data_t in this PR is considered to mapped to a z_bytes. One of the constructor is defined as

attachment_data_t::attachment_data_t(const z_loaned_bytes_t * attachment)
{
  ...
}

which means we can parse the given attachment from raw zenoh bytes to the desired data types.

We also use this pattern in zenoh-cpp.

auto attachment = sample.get_attachment();
if (!attachment.has_value()) return;
// Assume we convert the raw bytes to a `std::unordered_map<std::string, std::string>`
auto attachment_data = ext::deserialize<std::unordered_map<std::string, std::string>>(attachment->get());

Comment on lines 538 to 516
if (zc_liveliness_token_check(&token_)) {
zc_liveliness_undeclare_token(z_move(token_));
}
if (z_check(z_loan(keyexpr_))) {
z_drop(z_move(keyexpr_));
}
zc_liveliness_undeclare_token(z_move(token_));

z_drop(z_move(keyexpr_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We previously needed this check since it was possible for token_ and keyexpr_ to be uninitialized when shutdown is called within ClientData::make. Do these functions now check whether the tokens are valid before dropping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is one of the misunderstandings we would like to avoid.

  1. There's no way to check an uninitialized zenoh entity.
z_owned_keyexpr_t keyexpr;
z_check(z_loan(keyexpr))   // This is an undefined behavior.

A proper way to use z_check is to use it on an initialized data.

z_owned_keyexpr_t keyexpr;
z_keyexpr_from_str(&keyexpr, "foo");
z_check(z_loan(keyexpr));  // This is okay.

z_owned_keyexpr_t keyexpr;
z_null(&keyexpr);
z_check(z_loan(keyexpr)); // This is also fine.
  1. The check itself checks the gravestone status of zenoh data. It doesn't do functional checks like liveliness token querying over the network. These functions are necessary for zenoh-cpp development yet are easy to misuse. That's why we finally decided to make them internal. See z_check and z_null made internal eclipse-zenoh/zenoh-c#605 for details.

  2. Regarding the contention issue you mentioned, I assume that the function ClientData::shutdown can only be called once a valid ClientData instance has been created. However, if we prefer to relax this restriction, we could introduce a flag to indicate whether the class object has been properly validated. This would help address the issue. What are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining that. Right now ClientData::shutdown can be invoked even if the tokens are not properly initialized. So it sounds like we would need to introduce an initialized_ flag to the implementation.

Copy link
Contributor Author

@YuanYuYuan YuanYuYuan Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've uniformly applied the initialized_ flag on all the detail/rmw_*_data in ZettaScaleLabs@4c63e8d

rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/rmw_zenoh.cpp Outdated Show resolved Hide resolved
Comment on lines 396 to 399
if (!context_impl->session_is_valid()) {
RMW_SET_ERROR_MSG("zenoh session is invalid");
return nullptr;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Contributor Author

@YuanYuYuan YuanYuYuan Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer: the z_session_check function has been removed. The z_check function has been changed to internal in eclipse-zenoh/zenoh-c#605. The check itself is to check the gravestone health of the zenoh data. If we want to check whether the session is constructed successfully, we should use the return code during the creation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say the session was constructed successfully but was closed later on. How would we check if the session is still open?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a function z_session_is_closed and it's been used as the implementation of rmw_context_impl_s::Data session_is_valid. I think I can add it back to this.

Copy link
Contributor Author

@YuanYuYuan YuanYuYuan Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in ZettaScaleLabs@09b45c3.

rmw_zenoh_cpp/src/rmw_zenoh.cpp Outdated Show resolved Hide resolved
Comment on lines 754 to 757
if (!pub_data->liveliness_is_valid()) {
return RMW_RET_ERROR;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated above, the z_check has been removed.

Copy link
Contributor Author

@YuanYuYuan YuanYuYuan Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in ZettaScaleLabs@09b45c3.

Addressed in ZettaScaleLabs@84d1267.

Comment on lines 929 to 932
if (!context_impl->session_is_valid()) {
RMW_SET_ERROR_MSG("zenoh session is invalid");
return nullptr;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, why remove this here and in other functions below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated above, the z_check has been removed.

Copy link
Contributor Author

@YuanYuYuan YuanYuYuan Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in ZettaScaleLabs@09b45c3.

@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented Dec 2, 2024

Latency comparison

branch payload min p05 p50 p95 max
rolling 32b 229 247 430 514 2850
session 32b 187 195 219 438 4219
rolling 64k 422 443 625 742 4535
session 64k 203 215 232 459 4729

@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented Dec 2, 2024

Pacakge Test rolling session dev/1.0 Note
rclcpp 18 - test_allocator_memory_strategy (Timeout) X
rclcpp 23 - test_clock X
rclcpp 28 - test_client_common X #328
rclcpp 62 - test_publisher X X X Identified
rclcpp 91 - test_wait_for_message X
rclcpp 95 - test_executors X X
rclcpp 99 - test_executors_busy_waiting F X
rclcpp 118 - test_executor (Timeout) F
rclcpp_action 1 - test_client X X X #326
rclcpp_action 2 - test_server X X severe
rclcpp_lifecycle 6 - test_lifecycle_node F ?
rclcpp_lifecycle 9 - test_lifecycle_service_client X X ? To be checked
test_communication 73 - test_publisher_subscriber__rclpy__rmw_zenoh_cpp X
test_communication 78 - test_action_client_server__rclpy__rclcpp__rmw_zenoh_cpp X
test_communication 81 - test_action_client_server__rclcpp__rclpy__rmw_zenoh_cpp X
test_communication 83 - test_requester_replier__rclcpp__rmw_zenoh_cpp X X Related to #329
test_communication 84 - test_action_client_server__rclcpp__rmw_zenoh_cpp X
test_rclcpp 70 - test_parameter_server_cpp__rmw_zenoh_cpp (Timeout) X X To be checked
test_rclcpp 74 - test_n_nodes__rmw_zenoh_cpp F F #329

@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented Dec 4, 2024

Follow-up on Review Feedback

z_owned_publisher_t pub;
auto free_type_hash_c_str = rcpputils::make_scope_exit([&]() {
    z_drop(z_move(pub));
});
if (z_declare_publisher(session, &pub, ...) != Z_OK) {
    return;
}

Since we actually don't need to z_drop the zenoh entity if we fail to establish it. This can be reordered in a safer way in case we drop an uninitialized memory.

z_owned_publisher_t pub;
if (z_declare_publisher(session, &pub, ...) != Z_OK) {
    return;
}
auto free_type_hash_c_str = rcpputils::make_scope_exit([&]() {
    z_drop(z_move(pub));
});

(*) This means the change is held on the branch https://github.com/ZettaScaleLabs/rmw_zenoh/tree/wip/session-clone and not yet merged into dev/1.0.0.

@clalancette
Copy link
Collaborator

All right, so I just tried out the latest here (I'm sorry I haven't gotten to this earlier).

First of all, I needed the following patch to compile:

diff --git a/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp b/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp
index 3c89983..ed3bd9a 100644
--- a/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp
+++ b/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp
@@ -152,10 +152,10 @@ public:
         const z_loaned_sample_t * sample = z_reply_ok(z_loan(reply));
         z_view_string_t keystr;
         z_keyexpr_as_view_string(z_sample_keyexpr(sample), &keystr);
-        std::string livelines_str(z_string_data(z_loan(keystr)), z_string_len(z_loan(keystr)));
+        std::string liveliness_str(z_string_data(z_loan(keystr)), z_string_len(z_loan(keystr)));
         // Ignore tokens from the same session to avoid race conditions from this
         // query and the liveliness subscription.
-        graph_cache_->parse_put(std::move(livelines_str), true);
+        graph_cache_->parse_put(std::move(liveliness_str), true);
       } else {
         RMW_ZENOH_LOG_DEBUG_NAMED(
           "rmw_zenoh_cpp", "[rmw_context_impl_s] z_call received an invalid reply.\n");
@@ -439,8 +439,8 @@ static void graph_sub_data_handler(z_loaned_sample_t * sample, void * data)
   }
 
   // Update the graph cache.
-  std::string livelines_str(z_string_data(z_loan(keystr)), z_string_len(z_loan(keystr)));
-  data_shared_ptr->update_graph_cache(z_sample_kind(sample), std::move(livelines_str));
+  std::string liveliness_str(z_string_data(z_loan(keystr)), z_string_len(z_loan(keystr)));
+  data_shared_ptr->update_graph_cache(z_sample_kind(sample), std::move(liveliness_str));
 }
 
 ///=============================================================================
diff --git a/rmw_zenoh_cpp/src/detail/rmw_publisher_data.cpp b/rmw_zenoh_cpp/src/detail/rmw_publisher_data.cpp
index cbc7e08..5966de6 100644
--- a/rmw_zenoh_cpp/src/detail/rmw_publisher_data.cpp
+++ b/rmw_zenoh_cpp/src/detail/rmw_publisher_data.cpp
@@ -173,8 +173,7 @@ std::shared_ptr<PublisherData> PublisherData::make(
   z_view_keyexpr_from_str(&liveliness_ke, liveliness_keyexpr.c_str());
   z_owned_liveliness_token_t token;
   if (z_liveliness_declare_token(
-      z_loan(owned_session), &token, z_loan(liveliness_ke),
-      NULL) != Z_OK)
+      session, &token, z_loan(liveliness_ke), NULL) != Z_OK)
   {
     RMW_ZENOH_LOG_ERROR_NAMED(
       "rmw_zenoh_cpp",

After that, I tried to run the tests in rcl, and I'm getting a lot of crashing tests:

The following tests FAILED:
	  9 - test_context__rmw_zenoh_cpp (Failed)
	 15 - test_init__rmw_zenoh_cpp (Failed)
	 16 - test_node__rmw_zenoh_cpp (Failed)
	 19 - test_guard_condition__rmw_zenoh_cpp (Failed)
	 37 - test_time (Failed)

I also tried out the wip/session branch, and with that one I'm seeing about the same crashing tests. So something is going on there that I need to investigate. I'm spending some time on it now, but I thought I'd put a brief report on what I'm seeing first.

@clalancette
Copy link
Collaborator

After that, I tried to run the tests in rcl, and I'm getting a lot of crashing tests:

Ah, never mind, this was a local problem. Things are working OK now (though I still need the patch above). Will continue testing.

{
std::lock_guard<std::recursive_mutex> lock(mutex_);
return zc_liveliness_token_check(&token_);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that these z_check functions have been made internal but having a method to check whether the liveliness token is still valid is useful. How about we simply return !is_shutdown_ here since we unregister the token only after the entity is shutdown. Then keep the function call in rmw_zenoh.cpp.

Copy link
Contributor Author

@YuanYuYuan YuanYuYuan Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we only check liveliness_is_valid in rmw_zenoh_cpp/src/rmw_zenoh.cpp. Let me change it with is_shutdown() there. ZettaScaleLabs@84d1267

rolling branch

rmw_zenoh_cpp/src/rmw_zenoh.cpp
754:  if (!pub_data->liveliness_is_valid()) {

rmw_zenoh_cpp/src/detail/rmw_service_data.hpp
63:  bool liveliness_is_valid() const;

rmw_zenoh_cpp/src/detail/rmw_service_data.cpp
242:bool ServiceData::liveliness_is_valid() const

rmw_zenoh_cpp/src/detail/rmw_publisher_data.cpp
431:bool PublisherData::liveliness_is_valid() const

rmw_zenoh_cpp/src/detail/rmw_publisher_data.hpp
74:  bool liveliness_is_valid() const;

rmw_zenoh_cpp/src/detail/rmw_subscription_data.hpp
90:  bool liveliness_is_valid() const;

rmw_zenoh_cpp/src/detail/rmw_client_data.cpp
281:bool ClientData::liveliness_is_valid() const

rmw_zenoh_cpp/src/detail/rmw_subscription_data.cpp
389:bool SubscriptionData::liveliness_is_valid() const

rmw_zenoh_cpp/src/detail/rmw_client_data.hpp
64:  bool liveliness_is_valid() const;

@YuanYuYuan
Copy link
Contributor Author

After that, I tried to run the tests in rcl, and I'm getting a lot of crashing tests:

Ah, never mind, this was a local problem. Things are working OK now (though I still need the patch above). Will continue testing.

The branch is somewhat out of sync with the wip/session-clone. I just pushed a fix with the typo fix. Thanks!

@YuanYuYuan
Copy link
Contributor Author

New results

Pacakge Test Session Rolling
rclc 1 - rclc_test (Failed) X
rclc_parameter 1 - rclc_parameter_test (Timeout) X X
rclcpp 118 - test_executor (Timeout) X
rclcpp 28 - test_client_common (Failed) X
rclcpp 62 - test_publisher (Failed) X X
rclcpp 86 - test_time (Failed) X
rclcpp 96 - test_executors_timer_cancel_behavior (Failed) X
rclcpp_action 1 - test_client (Failed) X X
rclcpp_action 2 - test_server (Failed) X
rclcpp_lifecycle 9 - test_lifecycle_service_client (Failed) X X
rclpy 7 - test_client (Failed) X
test_communication 83 - test_requester_replier__rclcpp__rmw_zenoh_cpp (Failed) X
test_rclcpp 70 - test_parameter_server_cpp__rmw_zenoh_cpp (Timeout) X
test_rclcpp 74 - test_n_nodes__rmw_zenoh_cpp (Failed) X

Signed-off-by: Yadunund <yadunund@gmail.com>
@clalancette clalancette dismissed stale reviews from ahcorde and themself December 6, 2024 14:36

Stale review.

@clalancette clalancette merged commit 0293278 into ros2:rolling Dec 6, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.