-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE-666] - Migrate daemon monitoring to use health checks #783
Conversation
WalkthroughThe changes introduce a comprehensive health monitoring system for daemons, including new structures, methods, and tests for health checks. Old monitoring functions are removed from the daemon server, and new metrics capture is implemented. Client and integration tests are updated to incorporate the health monitor. The changes also include new flags for daemon failure behavior and unhealthy duration, with corresponding updates in tests and CLI test suites. Changes
Please note that the changes are grouped based on the relatedness of changes across multiple files, and each row provides a high-level summary of the changes made to the files included in that set. TipsChat with CodeRabbit Bot (
|
7715685
to
9a48642
Compare
protocol/app/app.go
Outdated
@@ -290,6 +290,9 @@ type App struct { | |||
|
|||
PriceFeedClient *pricefeedclient.Client | |||
LiquidationsClient *liquidationclient.Client | |||
BridgeClient *bridgeclient.Client | |||
|
|||
HealthMonitor *daemonservertypes.HealthMonitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Health Monitor does not need access to daemon server client request contents anymore, and has been moved to live on the app.
5a6a503
to
25e93be
Compare
def8cbf
to
42bc296
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 3
Configuration used: CodeRabbit UI
Files selected for processing (14)
- protocol/app/app.go (10 hunks)
- protocol/app/app_test.go (2 hunks)
- protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
- protocol/daemons/server/bridge.go (2 hunks)
- protocol/daemons/server/liquidation.go (3 hunks)
- protocol/daemons/server/pricefeed.go (3 hunks)
- protocol/daemons/server/server.go (3 hunks)
- protocol/daemons/server/server_test.go (3 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
- protocol/daemons/types/health_checkable.go (3 hunks)
- protocol/mocks/HealthCheckable.go (1 hunks)
- protocol/mocks/Makefile (1 hunks)
- protocol/testutil/app/app.go (1 hunks)
Files skipped from review due to trivial changes (2)
- protocol/daemons/server/server_test.go
- protocol/mocks/Makefile
Additional comments: 24
protocol/daemons/server/pricefeed.go (1)
- 43-48: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [33-47]
The removal of the
ExpectPricefeedDaemon
method and the response reporting check in theUpdateMarketPrices
method is consistent with the shift to the new health check-based monitoring system. However, ensure that all references to these removed functionalities are also updated or removed throughout the codebase to prevent any broken dependencies or calls.protocol/mocks/HealthCheckable.go (1)
- 1-63: The mock implementation for
HealthCheckable
is correctly following the pattern for mocks using the testify library. It includes methods forHealthCheck
,ReportFailure
,ReportSuccess
, andServiceName
, which are essential for testing the health check functionality. TheNewHealthCheckable
function is also correctly setting up a new mock with a cleanup function to assert expectations, which is a good practice for ensuring that mock behavior is verified after tests run. This should help in writing effective unit tests for the health monitoring system.protocol/daemons/server/liquidation.go (2)
2-7: The import of the
telemetry
package might no longer be necessary if telemetry functionality has been removed from theLiquidationServer
. Verify if this import can be removed to clean up the code.36-41: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [34-41]
The use of telemetry metrics (
telemetry.ModuleSetGauge
) should be reassessed. Since the pull request indicates that telemetry and metrics-related code has been removed from theServer
struct, this code might be obsolete. If telemetry is no longer used, this block should be removed to avoid dead code and potential confusion.protocol/testutil/app/app.go (1)
- 490-494: The change from
DisableUpdateMonitoringForTesting
toDisableHealthMonitorForTesting
aligns with the new health monitoring system introduced in the pull request. This update is necessary to ensure that the health monitoring behavior is correctly disabled during testing, which is important for test stability and performance. The change is consistent with the overall shift from the old update monitoring system to the new health check-based approach.protocol/daemons/pricefeed/client/client_integration_test.go (4)
5-11: The import of the
app
package is correctly added to support the newHealthMonitor
functionality. However, ensure that all other imports are still necessary after the refactoring and that there are no unused imports left.220-223: The addition of the
healthMonitor
field to thePriceDaemonIntegrationTestSuite
struct is consistent with the summary provided. This field is necessary for the integration of the health monitoring system into the test suite.280-292: The initialization of the
healthMonitor
field within theSetupTest
function is correctly implemented. TheNewHealthMonitor
function is called with appropriate parameters, and thehealthMonitor
is then used to register services. This aligns with the new health check-based monitoring system.337-342: The registration of the
pricefeedDaemon
service with thehealthMonitor
in thestartClient
function is correctly implemented. This ensures that the health of thepricefeedDaemon
service will be monitored as part of the new health check system.protocol/daemons/types/health_checkable.go (5)
22-26: The addition of the
ServiceName
method to theHealthCheckable
interface is a good practice as it provides a standardized way to reference services during health checks. This change will require all implementers of theHealthCheckable
interface to provide a unique service name, which can be useful for logging and error reporting.67-73: The
serviceName
field has been added to thetimeBoundedHealthCheckable
struct to store the name of the service being monitored. This is a good practice as it associates a human-readable identifier with the health checkable service, which can be used for logging and monitoring purposes.81-88: The
NewTimeBoundedHealthCheckable
function initializes thetimeBoundedHealthCheckable
to an unhealthy state by reporting an error during its creation. This is a good practice as it ensures that the health checkable service starts in a known state and requires an explicit successful update to be considered healthy.91-94: The implementation of the
ServiceName
method for thetimeBoundedHealthCheckable
struct is straightforward and follows the interface contract by returning the stored service name. This is a simple and effective way to fulfill the requirements of theHealthCheckable
interface.96-98: The
ReportSuccess
method is correctly implemented to be thread-safe by using a mutex lock. This is important to prevent data races when multiple goroutines may be updating the health check status concurrently.protocol/app/app_test.go (2)
3-9: The import of the
mocks
package is necessary for the new test functionTestMonitorDaemon_Panics
that uses a mock implementation of theHealthCheckable
interface. However, the import ofgopkg.in/typ.v4/slices
does not seem to be used in the provided code. If it's not used elsewhere in the file, it should be removed to keep the code clean and avoid unnecessary dependencies.228-237: The new test
TestMonitorDaemon_Panics
is designed to ensure that theMonitorDaemon
method panics when a service is registered twice. This is a good test for the health monitoring system's robustness. However, it's important to ensure that the panic is caused by the expected condition (double registration) and not by some other issue. It would be beneficial to assert the panic message to confirm the cause of the panic.protocol/app/app.go (8)
298-298: The addition of the
HealthMonitor
field to theApp
struct is a critical change for the new health monitoring system. It's important to ensure that theHealthMonitor
is properly initialized and that all necessary daemons are registered with it. This change aligns with the pull request's goal of integrating health checks into the daemon services.595-602: The instantiation of the
HealthMonitor
within theApp
struct is a key part of the new health monitoring system. It's crucial to verify that the constantsDaemonStartupGracePeriod
andHealthCheckPollFrequency
are set to sensible defaults and that they can be configured if needed. Additionally, ensure that theLogger
is correctly passed to theHealthMonitor
for logging purposes.611-616: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [612-627]
The code for starting the liquidations client and registering it with the health monitor is crucial. It's important to ensure that the
MaximumDaemonUnhealthyDuration
is being used correctly and that the daemon is properly monitored. If the daemon fails to start or becomes unhealthy, the system should respond appropriately, which may include logging, alerting, or taking corrective action.
- 623-635: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [631-648]
The price feed client is being started and monitored similarly to the liquidations client. It's important to verify that the
StartNewClient
function is correctly implemented and that theMaximumDaemonUnhealthyDuration
is being used appropriately. The health monitoring system should be robust enough to handle any issues with the price feed client.
- 645-660: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [652-670]
The bridge client is being started and monitored. As with the other daemons, it's important to verify that the
MaximumDaemonUnhealthyDuration
is being used correctly and that the health monitoring system can handle any issues with the bridge client. The commented TODO on line 654 suggests that there may be pending work related to the bridge daemon, which should be tracked and resolved.
688-688: The comment on line 688 indicates a TODO related to the metrics daemon and its health tracking. It's important to follow up on this TODO to ensure that the metrics daemon's health is properly monitored and that any necessary changes are made in a timely manner.
1233-1251: The
MonitorDaemon
method is a new addition that registers a daemon service with the health monitor. It's critical to ensure that this method is called for all daemons that need to be monitored and that the error handling is appropriate. The use ofpanic
on line 1249 should be carefully considered, as it will cause the application to crash if the daemon cannot be registered. It may be more appropriate to handle the error in a way that does not bring down the entire application.1253-1256: The
DisableHealthMonitorForTesting
method is a useful addition for testing purposes. It's important to ensure that this method is only used in test environments and that it does not accidentally get called in production, as it would disable an important monitoring feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (1)
- protocol/app/app.go (10 hunks)
Additional comments: 5
protocol/app/app.go (5)
5-17: The import block has been modified to remove unused imports. This is a good practice for maintainability and to avoid confusion about dependencies. However, ensure that these imports are indeed not used anywhere in the codebase to prevent breaking changes.
298-298: The
HealthMonitor
field has been added to theApp
struct, which is a significant change. This aligns with the previous comment about moving health monitoring to live on the app. Ensure that theHealthMonitor
is properly initialized and integrated with the rest of the application's lifecycle.595-603: The
HealthMonitor
is being initialized with default values for the grace period and poll frequency. Ensure that these values are appropriate for the application's needs and consider making them configurable for flexibility.645-660: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [611-670]
The
MonitorDaemon
method is being used to register daemon services with the health monitor. It is crucial to ensure that all daemon services are registered correctly and that theMaximumDaemonUnhealthyDuration
is set appropriately. Additionally, the error handling here is to panic, which is a drastic measure. Consider if there are more graceful ways to handle initialization failures that won't necessarily bring down the entire application.
- 1227-1253: The
MonitorDaemon
andDisableHealthMonitorForTesting
methods have been added to theApp
struct. These methods are part of the new health monitoring system and should be reviewed to ensure they are correctly implemented and used throughout the application. TheDisableHealthMonitorForTesting
method, in particular, should be used with caution to avoid disabling health checks in production environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (4)
- protocol/app/app.go (9 hunks)
- protocol/daemons/bridge/client/client.go (1 hunks)
- protocol/daemons/liquidation/client/client.go (1 hunks)
- protocol/daemons/pricefeed/client/client.go (1 hunks)
Files skipped from review due to trivial changes (2)
- protocol/daemons/bridge/client/client.go
- protocol/daemons/liquidation/client/client.go
Additional comments: 11
protocol/daemons/pricefeed/client/client.go (1)
- 81-84: The addition of the
ServiceName
method to theClient
struct is consistent with the implementation of theHealthCheckable
interface, which requires aServiceName
method. This change will allow instances ofClient
to be used in the context of health monitoring where a service name identifier is required.protocol/app/app.go (10)
5-14: The import block includes several standard library packages. Ensure that all imported packages are used within the file to avoid unnecessary imports that can lead to increased compilation time and binary size.
298-298: The
HealthMonitor
field has been added to theApp
struct as expected, following the removal of the old monitoring mechanisms. This change aligns with the pull request's intent to introduce health checks into the daemon services.595-602: The
HealthMonitor
is instantiated with default values for the startup grace period and health check poll frequency. Ensure that these values are appropriate for the application's needs and consider making them configurable if necessary.611-619: The
MonitorDaemon
method is used to register theLiquidationsClient
with the health monitor. It is important to ensure that theMaximumDaemonUnhealthyDuration
is an appropriate threshold for the liquidations daemon's health checks.629-636: The
PriceFeedClient
is registered with the health monitor using theMonitorDaemon
method. Verify that theMaximumDaemonUnhealthyDuration
is suitable for the price feed daemon's health checks and that the daemon's operation is critical to the application's functionality.653-661: The
BridgeClient
is registered with the health monitor using theMonitorDaemon
method. As with other daemons, ensure that theMaximumDaemonUnhealthyDuration
is an appropriate threshold for the bridge daemon's health checks.684-689: The
metricsclient.Start
function is invoked without being monitored by the health monitor. This is intentional as the metrics daemon is used for observability and should not bring the application down. Ensure that this aligns with the application's reliability requirements.1230-1246: The
MonitorDaemon
method is used to register daemon services with the health monitor. It is critical to ensure that all daemon services are registered correctly and that the application's startup sequence allows for the daemons to be initialized before the health checks begin.1250-1252: The
DisableHealthMonitorForTesting
method provides a way to disable the health monitor during testing. This is a useful feature for testing scenarios where the health checks may interfere with the test conditions or outcomes.1227-1228: The
New
function correctly returns an instance of theApp
struct. This is the expected behavior for the constructor function of the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (1)
- protocol/app/app.go (9 hunks)
Additional comments: 8
protocol/app/app.go (8)
298-298: The
HealthMonitor
field has been added to theApp
struct, which is responsible for monitoring the health of daemon services. Ensure that theHealthMonitor
is properly initialized and used throughout the application.595-602: The
HealthMonitor
is being initialized with default values for the startup grace period and health check poll frequency. It's important to ensure that these values are appropriate for the application's requirements and that they can be configured if necessary.612-617: The
MonitorDaemon
method is being used to monitor the health of theLiquidationsClient
. Ensure that theMaximumDaemonUnhealthyDuration
is appropriate for this daemon and that the monitoring is correctly implemented.630-635: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [630-648]
The
PriceFeedClient
is being monitored using theMonitorDaemon
method. Verify that theMaximumDaemonUnhealthyDuration
is suitable for the price feed daemon's requirements and that the monitoring logic is correctly implemented.
651-658: The
BridgeClient
is being monitored using theMonitorDaemon
method. As with the other daemons, ensure that theMaximumDaemonUnhealthyDuration
is appropriate and that the monitoring is correctly implemented.684-689: The
metricsclient.Start
method is being called without monitoring. Since the metrics daemon is used for observability and should not bring the app down, ensure that this behavior is intentional and that the lack of monitoring is acceptable.1230-1246: The
MonitorDaemon
method registers a daemon service with the health monitor and panics if registration fails. Ensure that this panic behavior is acceptable and that the error handling is consistent with the application's requirements.1250-1252: The
DisableHealthMonitorForTesting
method has been added to disable the health monitor during testing. Verify that this method is only used in test environments and that it does not affect production code paths.
server.gsrv.Stop() | ||
} | ||
|
||
// DisableUpdateMonitoringForTesting disables the update monitor for testing purposes. This is needed in integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed health monitor from daemon server and attached it to the app as a top level field, so this pass-through interface is no longer necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant is not needed anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file renamed to health_monitor.go
@@ -22,6 +22,8 @@ type HealthCheckable interface { | |||
ReportFailure(err error) | |||
// ReportSuccess records a successful update. | |||
ReportSuccess() | |||
// ServiceName returns the name of the service being monitored. This name is expected to be unique. | |||
ServiceName() string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this field to easily key health checkables by service name.
aa7e502
to
db8d8d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 8
Configuration used: CodeRabbit UI
Files selected for processing (14)
- protocol/app/app.go (9 hunks)
- protocol/app/app_test.go (2 hunks)
- protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
- protocol/daemons/server/bridge.go (2 hunks)
- protocol/daemons/server/liquidation.go (3 hunks)
- protocol/daemons/server/pricefeed.go (3 hunks)
- protocol/daemons/server/server.go (3 hunks)
- protocol/daemons/server/server_test.go (3 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
- protocol/daemons/types/health_checkable.go (3 hunks)
- protocol/mocks/HealthCheckable.go (1 hunks)
- protocol/mocks/Makefile (1 hunks)
- protocol/testutil/app/app.go (1 hunks)
Files skipped from review due to trivial changes (3)
- protocol/daemons/server/liquidation.go
- protocol/daemons/server/server_test.go
- protocol/mocks/Makefile
Additional comments: 24
protocol/daemons/server/bridge.go (2)
3-7:
The removal oftypes
andtime
imports suggests that they are no longer used in this file. Verify that no other parts of the file require these imports to function correctly.20-25:
The removal ofExpectBridgeDaemon
function should be verified to ensure that its responsibilities are now correctly handled by the new health check system.protocol/testutil/app/app.go (1)
- 490-494: The change from
DisableUpdateMonitoringForTesting
toDisableHealthMonitorForTesting
aligns with the pull request's goal of transitioning from update monitoring to health checks. However, ensure that theDisableHealthMonitorForTesting
method has been implemented correctly in theApp
struct and that it effectively disables the health monitor during tests.protocol/mocks/HealthCheckable.go (1)
- 1-63:
The mock implementation for theHealthCheckable
interface appears to be correctly generated and includes all necessary methods and a constructor with a cleanup function for testing.protocol/app/app_test.go (3)
3-9: The import of the
time
package is added, which is necessary for the new testTestMonitorDaemon_Panics
that usestime.Minute
. This change is correct and relevant to the new functionality being tested.4-4: The import of the
mocks
package is added, which is necessary for the new testTestMonitorDaemon_Panics
that usesmocks.HealthCheckable
. This change is correct and relevant to the new functionality being tested.228-237: The new test
TestMonitorDaemon_Panics
is added to ensure that theMonitorDaemon
function panics when a health checkable service is registered twice. This is a good test to ensure that the health check monitor behaves as expected in the case of duplicate registrations.protocol/daemons/server/server.go (3)
3-12:
The removal of imports should be verified to ensure that they are not used elsewhere in the file or that their removal does not affect other functionalities.21-26: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [14-26]
The removal of the update monitoring feature from the
Server
struct is consistent with the changes described in the summary.
- 46-49:
TheStop
method has been simplified to only stop the gRPC server, which aligns with the removal of the update monitoring feature.protocol/daemons/types/health_checkable.go (4)
26-26:
The addition ofServiceName()
method to theHealthCheckable
interface is a good practice for identifying services.72-72:
TheserviceName
field is correctly added as a read-only field to store the service name.81-88:
TheNewTimeBoundedHealthCheckable
function correctly initializes theserviceName
field and sets the initial state to unhealthy.91-93:
TheServiceName
method is correctly implemented to return the service name.protocol/daemons/pricefeed/client/client_integration_test.go (4)
5-11:
Ensure that the new import"github.com/dydxprotocol/v4-chain/protocol/app"
is used within the file, otherwise it should be removed to avoid unnecessary imports.220-223:
The declaration ofhealthMonitor
as a global variable within the test suite is appropriate given the context of the change summary. It is used to register services for health checks.280-292:
The initialization ofhealthMonitor
and its integration withdaemonServer
is consistent with the pull request's goal of refactoring the update monitor to use health checks.337-342:
The registration ofpricefeedDaemon
withhealthMonitor
is correctly implemented with error handling. However, the use ofapp.MaximumDaemonUnhealthyDuration
suggests a dependency on a global variable which should be verified for correctness.protocol/app/app.go (6)
298-298: The
HealthMonitor
field has been added to theApp
struct, which is in line with the summary of changes. Ensure that theHealthMonitor
is properly initialized and used throughout the application.611-617: The
MonitorDaemon
method is used to register theLiquidationsClient
with the health monitor. Ensure that theMaximumDaemonUnhealthyDuration
is properly passed and handled in theMonitorDaemon
method.645-648: The
MonitorDaemon
method is used to register thePriceFeedClient
with the health monitor. Ensure that theMaximumDaemonUnhealthyDuration
is properly passed and handled in theMonitorDaemon
method.658-658: The
MonitorDaemon
method is used to register theBridgeClient
with the health monitor. Ensure that theMaximumDaemonUnhealthyDuration
is properly passed and handled in theMonitorDaemon
method.1230-1248: The
MonitorDaemon
method registers a daemon service with the health monitor and panics if registration fails. Ensure that this panic is an acceptable behavior and that the health monitor is correctly initialized before this method is called.1250-1253: The
DisableHealthMonitorForTesting
method has been added to disable the health monitor during testing. Verify that this method is only used in test environments and does not affect production code.
@@ -47,12 +36,6 @@ func (s *Server) LiquidateSubaccounts( | |||
metrics.Received, | |||
metrics.Count, | |||
) | |||
// If the daemon is unable to report a response, there is either an error in the registration of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my own understanding, why are we removing this error logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error here was emitted by the daemon monitor (which lived on the daemon server) if it failed to register the response. We don't report daemon responses to the monitor anymore; instead, we query the daemon for it's health directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to still log these errors for debugging purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to make sense to log this as we never expect the report response call to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"...we never expect the response call to fail" - yes, since we've removed health monitoring code from this method, there's nothing here but telemetry and this call will never generate an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (7)
- protocol/app/app.go (9 hunks)
- protocol/daemons/server/bridge.go (3 hunks)
- protocol/daemons/server/liquidation.go (3 hunks)
- protocol/daemons/server/pricefeed.go (6 hunks)
- protocol/daemons/server/server.go (4 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
Files skipped from review due to trivial changes (1)
- protocol/daemons/server/types/health_monitor_test.go
Additional comments: 17
protocol/app/app.go (3)
1230-1248: The
MonitorDaemon
method correctly registers a daemon service with the health monitor and panics if registration fails, ensuring that the application does not run with unmonitored critical services.1250-1253: The
DisableHealthMonitorForTesting
method is a good addition for testing purposes, allowing tests to run without the health monitor interfering.645-661: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [611-661]
The
startDaemons
function correctly starts daemon clients and registers them with the health monitor. However, there is a TODO comment regarding the bridge daemon that should be addressed in the future to ensure it is properly integrated.protocol/daemons/server/bridge.go (3)
2-8: The removal of imports suggests that the associated functionality is no longer needed. Verify that no other parts of the codebase require these imports to function correctly.
21-22: The
WithBridgeEventManager
method correctly sets thebridgeEventManager
field and returns theServer
instance for method chaining.33-36: The
AddBridgeEvents
method now unconditionally callss.reportResponse(types.BridgeDaemonServiceName)
. Ensure that this change aligns with the new health check monitoring system and that the absence of error handling here is intentional and safe.protocol/daemons/server/liquidation.go (2)
2-10: Verify that the "time" package is not used elsewhere in the file since it has been removed from the imports.
37-42: The call to
s.reportResponse
has been removed, which is consistent with the transition to the new health check monitoring system. Ensure that the new health monitoring system is correctly capturing the necessary information that was previously handled byreportResponse
.protocol/daemons/server/pricefeed.go (6)
6-6: The import alias
types
has been removed, ensure that there are no remaining references totypes
that would cause compilation errors.39-39: The call to
s.reportResponse
has been removed. Confirm that this is intentional and that the removal aligns with the new health check monitoring system.31-42: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [41-46]
The telemetry code for measuring latency has been removed. Ensure that this is consistent with the new health check monitoring system and that no alternative metrics are required.
48-54: The panic now uses the
daemontypes.ErrServerNotInitializedCorrectly
error constant. Confirm that this change is consistent with the new error handling strategy.69-71: The error constant
daemontypes.ErrPriceFeedMarketPriceUpdateEmpty
is used for validation. Confirm that all error constants have been updated to use the newdaemontypes
package.97-111: The error constants
daemontypes.ErrPriceFeedInvalidPrice
anddaemontypes.ErrPriceFeedLastUpdateTimeNotSet
are used for validation. Confirm that these constants are correctly defined in thedaemontypes
package and that their usage here is appropriate.protocol/daemons/server/server.go (3)
8-14: The removal of certain imports suggests that the associated functionality has been deprecated or refactored. Ensure that no other parts of the codebase rely on these imports to prevent any breakage.
24-29: The
Server
struct still contains embedded fields forBridgeServer
,PriceFeedServer
, andLiquidationServer
. Ensure that the new health monitoring system is compatible with these services and that their integration is properly handled.43-58: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [54-66]
The
reportResponse
method has been refactored to only report metrics. Confirm that this change aligns with the new health monitoring system's requirements and that no other functionality is lost due to the removal of the error return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5fe6028
to
bc6fe09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 3
Configuration used: CodeRabbit UI
Files selected for processing (14)
- protocol/app/app.go (9 hunks)
- protocol/app/app_test.go (2 hunks)
- protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
- protocol/daemons/server/bridge.go (3 hunks)
- protocol/daemons/server/liquidation.go (3 hunks)
- protocol/daemons/server/pricefeed.go (6 hunks)
- protocol/daemons/server/server.go (4 hunks)
- protocol/daemons/server/server_test.go (3 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
- protocol/daemons/types/health_checkable.go (3 hunks)
- protocol/mocks/HealthCheckable.go (1 hunks)
- protocol/mocks/Makefile (1 hunks)
- protocol/testutil/app/app.go (1 hunks)
Files skipped from review due to trivial changes (2)
- protocol/daemons/server/server_test.go
- protocol/mocks/Makefile
Additional comments: 17
protocol/app/app_test.go (1)
- 3-4: The addition of the
github.com/dydxprotocol/v4-chain/protocol/mocks
import is appropriate for the new test functionTestMonitorDaemon_Panics
that uses mocks.protocol/daemons/pricefeed/client/client_integration_test.go (4)
5-11: The addition of the import statement aligns with the changes described in the summary.
284-288: The initialization of
healthMonitor
is correctly implemented and follows the new health check monitoring system's design.340-341: The registration of the
s.pricefeedDaemon
service with thehealthMonitor
is correctly implemented and is a critical part of the new health check monitoring system.280-292: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [5-342]
Overall, the changes in the provided hunks are consistent with the pull request's intent to implement a robust health check monitoring system and replace the previous update monitoring approach.
protocol/daemons/server/bridge.go (1)
- 33-36: The
AddBridgeEvents
method now unconditionally callss.reportResponse(types.BridgeDaemonServiceName)
. Ensure that thereportResponse
method is designed to handle all scenarios appropriately since the conditional panic has been removed.protocol/daemons/server/liquidation.go (3)
3-10: The import reorganization and addition of
liquidationtypes
appear to be correct and necessary for the new functionality.37-42: Ensure that the
reportResponse
method is correctly implemented and used in the context of the new health monitoring system, as it is not visible in the provided code snippet.37-43: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [29-39]
Verify whether the
telemetry.ModuleSetGauge
call is still required, as the summary indicates that telemetry and metrics functionality is being stripped from the package.protocol/daemons/server/pricefeed.go (1)
- 48-54: The use of
panic
here will crash the server ifs.marketToExchange
is not initialized. Ensure that this is the intended behavior and that there are proper recovery mechanisms in place to handle such crashes gracefully.protocol/daemons/server/server.go (1)
- 43-58: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [54-66]
The
reportResponse
method has been correctly updated to remove references to theupdateMonitor
and now focuses solely on metrics collection. Ensure that the metrics collection system is still compatible with these changes.protocol/daemons/types/health_checkable.go (4)
22-26: The addition of the
ServiceName
method to theHealthCheckable
interface is consistent with the summary provided. This change will require all implementing types to provide a unique service name, which can be useful for identifying services during health checks.71-72: The
serviceName
field has been correctly added to thetimeBoundedHealthCheckable
struct to store the name of the service being monitored. This aligns with the summary and the newServiceName
method implementation.81-88: The initialization of
timeBoundedHealthCheckable
in theNewTimeBoundedHealthCheckable
function correctly sets theserviceName
,timeProvider
, andlogger
. Additionally, it initializes the health checkable to an unhealthy state, which is a good practice to ensure that the service is checked for health immediately after creation.91-93: The implementation of the
ServiceName
method in thetimeBoundedHealthCheckable
struct is correct and returns the stored service name as expected.protocol/mocks/HealthCheckable.go (1)
- 1-63: The mock implementation for
HealthCheckable
and theNewHealthCheckable
function are correctly set up and follow the standard practices for using the testify mock package.protocol/testutil/app/app.go (1)
- 493-493: The change from
tApp.App.Server.DisableUpdateMonitoringForTesting()
totApp.App.DisableHealthMonitorForTesting()
aligns with the refactoring from an update monitoring system to a health check monitor system. Ensure that theDisableHealthMonitorForTesting
method is implemented correctly in theApp
struct and that it effectively disables health monitoring during tests.
@@ -47,12 +36,6 @@ func (s *Server) LiquidateSubaccounts( | |||
metrics.Received, | |||
metrics.Count, | |||
) | |||
// If the daemon is unable to report a response, there is either an error in the registration of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to make sense to log this as we never expect the report response call to fail.
@lcwik reached out to you via slack to discuss the test case error I'm seeing consistently in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (7)
- protocol/app/app.go (10 hunks)
- protocol/app/app_test.go (2 hunks)
- protocol/daemons/server/bridge.go (2 hunks)
- protocol/daemons/server/liquidation.go (2 hunks)
- protocol/daemons/server/pricefeed.go (5 hunks)
- protocol/daemons/server/types/health_checker.go (1 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
Additional comments: 17
protocol/app/app_test.go (2)
3-9: The addition of new imports for
github.com/dydxprotocol/v4-chain/protocol/mocks
andtime
is consistent with the summary and the PR objective of refactoring the daemon monitoring system.228-241: The addition of the
TestRegisterDaemonWithHealthMonitor_Panics
test function is consistent with the summary and aligns with the PR objective to ensure robustness in the health check monitor system by not allowing the same service to be registered more than once.protocol/daemons/server/bridge.go (1)
- 26-41: The change in the function signature of
AddBridgeEvents
to return a pointer toapi.AddBridgeEventsResponse
and anerror
separately is consistent with the PR's objective to refactor the daemon monitoring system. Ensure that all calls to this function are updated to handle the new signature correctly.protocol/daemons/server/liquidation.go (3)
2-10: The import changes are appropriate and align with the restructuring of the codebase.
27-50: The function signature change for
LiquidateSubaccounts
is correct and reflects the new health check monitoring system's requirements.46-47: The addition of
reportValidResponse
is a good practice to capture valid responses in metrics, aligning with the new health check monitoring system.protocol/daemons/server/pricefeed.go (6)
9-9: The previous comment about removing the
gometrics
import if it's not used is still valid. Please verify if this import is necessary, and if not, remove it to keep the code clean.6-6: The import alias
servertypes
has been correctly updated totypes
to reflect the changes in the import path.47-81: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [34-72]
The function signature of
UpdateMarketPrices
has been updated to use named return values, which is consistent with the PR objective and summary.
75-79: The error types in the
validateMarketPricesUpdatesMessage
andvalidateMarketPriceUpdate
functions have been updated to usedaemontypes.Err*
, aligning with the PR objective and summary.52-58: The comment explaining why the panic in
UpdateMarketPrices
would be unexpected has been added, providing the necessary context as per the previous discussion.70-70: The response reporting in
UpdateMarketPrices
now usestypes.PricefeedDaemonServiceName
, which is consistent with the PR objective and summary.protocol/daemons/server/types/health_checker.go (4)
- 22-22: The previous comment suggested renaming the
Update
method toUpdateLastErr
to clarify that it updates the error and sets the timestamp only if it's the first error in a streak. This change is not reflected in the provided code.- func (u *timestampWithError) Update(timestamp time.Time, err error) { + func (u *timestampWithError) UpdateLastErr(timestamp time.Time, err error) {
- 38-38: The previous comment suggested renaming the
IsZero
method toIsUnset
to better reflect its purpose. This change is not reflected in the provided code.- func (u *timestampWithError) IsZero() bool { + func (u *timestampWithError) IsUnset() bool {
- 69-69: The previous comment suggested renaming the field
mostRecentFailureStreakError
tolastErrWithFirstErrTimestamp
for clarity. This change is not reflected in the provided code.- mostRecentFailureStreakError timestampWithError + lastErrWithFirstErrTimestamp timestampWithError
- 1-218: Overall, the code aligns with the PR objective of refactoring the daemon monitoring system into a health check monitor. The new structures and methods introduced in this file are consistent with the goal of re-issuing panics if a daemon's unhealthiness persists beyond a maximum allowable threshold.
protocol/daemons/server/types/health_monitor.go (1)
- 1-223: The changes in the provided hunk align with the PR objective and the generated summaries. The new health monitoring system is implemented with proper synchronization, error handling, and logging. The code is well-structured and follows best practices. It appears ready for further testing and integration.
err error | ||
} | ||
|
||
func (u *timestampWithError) Update(timestamp time.Time, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any thoughts on this?
u.err = nil | ||
} | ||
|
||
func (u *timestampWithError) IsZero() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any thoughts here?
// This field is used to determine how long the daemon has been unhealthy. If this timestamp is nil, then either | ||
// the service has never been unhealthy, or the most recent error streak ended before it could trigger a callback. | ||
// Access to mostRecentFailureStreakError is synchronized. | ||
mostRecentFailureStreakError timestampWithError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any thoughts here?
// maxAcceptableUnhealthyDuration is the maximum acceptable duration for a health-checkable service to | ||
// remain unhealthy. If the service remains unhealthy for this duration, the monitor will execute the | ||
// specified callback function. | ||
maxAcceptableUnhealthyDuration time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we name this to maxDaemonUnhealthyDuration
so it's consistent w #802?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ it's renamed in that PR, which will be merged into this one before merging to main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see here
} | ||
|
||
// Schedule next poll. | ||
hc.mutableState.SchedulePoll(hc.pollFrequency) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm, we want to schedule the next poll when the daemon is unhealthy right? Makes sense to me, but would be nice to add some comments why this is intended.
if maximumAcceptableUnhealthyDuration <= 0 { | ||
return fmt.Errorf( | ||
"health check registration failure for service %v: "+ | ||
"maximum acceptable unhealthy duration %v must be positive", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: just to confirm, will you be updating this text too to match maxDaemonUnhealthyDuration
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's updated here
u.err = nil | ||
} | ||
|
||
func (u *timestampWithError) IsZero() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like IsUnset
…PC query. Update error for a more specific test. PR comments. Revert liquidations health reporting to occur only at task loop boundary. Checkpoint. checkpoint - implementation, moved methods to app. need to migrate / add testing. Update test app. Checkpoint for tests. Finish test cases. App unit test + cleanup. Cleanup, rename. cleanup. Remove todo for refactoring metrics daemon. Move monitoring call to before client start. Update placement of monitor code for pricefeed client. Remove defunct constants. Revert ServiceName changes.
Remove superfluous methods. Comment nit. Method name fix.
cd77cee
to
f996947
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 10
Configuration used: CodeRabbit UI
Files selected for processing (18)
- protocol/app/app.go (10 hunks)
- protocol/app/app_test.go (2 hunks)
- protocol/daemons/metrics/client/client.go (1 hunks)
- protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
- protocol/daemons/server/bridge.go (2 hunks)
- protocol/daemons/server/liquidation.go (2 hunks)
- protocol/daemons/server/pricefeed.go (5 hunks)
- protocol/daemons/server/server.go (4 hunks)
- protocol/daemons/server/server_test.go (3 hunks)
- protocol/daemons/server/types/health_checker.go (1 hunks)
- protocol/daemons/server/types/health_checker_test.go (1 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
- protocol/daemons/server/types/stoppable.go (1 hunks)
- protocol/daemons/types/health_checkable.go (3 hunks)
- protocol/mocks/HealthCheckable.go (1 hunks)
- protocol/mocks/Makefile (1 hunks)
- protocol/testutil/app/app.go (1 hunks)
Additional comments: 46
protocol/app/app.go (5)
189-189: As previously discussed, ensure that the follow-up PR to make
MaximumDaemonUnhealthyDuration
configurable has been merged or is tracked for merging to accommodate different environments and use cases.298-298: The addition of
DaemonHealthMonitor
to theApp
struct is consistent with the PR's objective to refactor daemon monitoring to use health checks.1233-1256: The addition of
RegisterDaemonWithHealthMonitor
andDisableHealthMonitorForTesting
methods to theApp
struct aligns with the PR's objective to enhance daemon health monitoring.645-661: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [611-661]
The
startDaemons
function correctly starts various daemon clients and registers them with the health monitor, which is in line with the PR's objective to monitor daemon health.
- 672-680: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [672-692]
Ensure that a comment is added explaining why the Metrics Daemon is not monitored by the health check system, as discussed in the previous comments.
protocol/app/app_test.go (2)
3-4: The addition of the import statement for
"github.com/dydxprotocol/v4-chain/protocol/mocks"
is appropriate for the new test function that utilizes mocks.228-241: The new test
TestRegisterDaemonWithHealthMonitor_Panics
correctly verifies that the health monitor panics when a service is registered more than once, aligning with the PR's objective to handle daemon unhealthiness.protocol/daemons/metrics/client/client.go (1)
- 20-24: The comment added provides a clear explanation for the absence of a health-check implementation for the metrics daemon, which is consistent with the PR's objective to refactor daemon monitoring.
protocol/daemons/pricefeed/client/client_integration_test.go (4)
5-11: The addition of the import statement aligns with the PR objectives and the summary provided.
220-223: The declaration of the
healthMonitor
variable is consistent with the PR objectives and the summary provided.280-292: The modifications in the
SetupTest
function to initialize thehealthMonitor
variable align with the PR objectives and the summary provided.337-342: The addition of a call to
healthMonitor.RegisterService
in thestartClient
function is consistent with the PR objectives and the summary provided.protocol/daemons/server/bridge.go (2)
26-40: The changes to the
AddBridgeEvents
function signature and the direct handling of errors within the function body align with the PR's objectives to refactor daemon monitoring to use health checks. The use of named return values (response
anderr
) is a common Go idiom that can improve readability, especially when dealing with multiple return values.2-8: The removal of the
ExpectBridgeDaemon
method is consistent with the PR's objectives to transition from an update monitor to a health check monitor. Ensure that any code that previously relied on this method has been updated to work with the new health check system.protocol/daemons/server/liquidation.go (3)
2-10: The import modification is non-functional and does not impact the behavior of the code.
27-50: The function signature change for
LiquidateSubaccounts
is correct and aligns with the PR objectives.36-42: The addition of telemetry metrics is consistent with the PR objectives and the previous comments.
protocol/daemons/server/pricefeed.go (4)
- 9-9: The import
gometrics "github.com/armon/go-metrics"
appears to be unused in the provided code. If it is indeed not used, it should be removed to keep the code clean.
The import
gometrics "github.com/armon/go-metrics"
is used in the fileprotocol/daemons/server/pricefeed.go
. Therefore, it should not be removed.
6-6: The import alias
servertypes
is still being used in the code. If the import path has been changed totypes
as mentioned in the summary, this alias should be updated accordingly.52-58: The panic condition in the
UpdateMarketPrices
function has been updated to usedaemontypes.ErrServerNotInitializedCorrectly
. This change aligns with the PR objectives to use health checks for daemon monitoring.47-81: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [34-73]
The removal of the
ExpectPricefeedDaemon
function is consistent with the PR objectives to refactor the daemon monitoring system into a health check monitor.protocol/daemons/server/server.go (1)
- 24-29: The summary does not mention the addition of
fileHandler
andsocketAddress
fields to theServer
struct, which are present in the hunks. This should be included in the summary to accurately reflect the changes made to theServer
struct.protocol/daemons/server/server_test.go (1)
- 171-176: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [161-175]
The changes from lines 161 to 175 show the
createServerWithMocks
function without theserver.DisableUpdateMonitoringForTesting()
call, which aligns with the PR's objective to migrate from update monitoring to health checks. Ensure that the removal of this call is consistent with the new health check system and that no other parts of the test suite rely on this method.protocol/daemons/server/types/health_monitor.go (12)
15-19: The constants
HealthCheckPollFrequency
andHealthMonitorLogModuleName
are well-named and documented, providing clear context for their usage.21-34: The
healthMonitorMutableState
struct is well-encapsulated, with a mutex to protect its mutable state. This should help prevent data races when accessing theserviceToHealthChecker
map and thestopped
anddisabled
flags.43-50: The
DisableForTesting
method is a good practice for testing purposes, allowing the health monitor to be disabled without affecting the production code path.52-68: The
Stop
method correctly checks if the monitor has already been stopped before proceeding, which is a good safeguard against redundant operations.70-116: The
RegisterHealthChecker
method is synchronized and handles the case where the monitor is disabled or stopped, which is crucial for avoiding race conditions and ensuring that no new health checkers are registered after the monitor is stopped.118-128: The
HealthMonitor
struct is well-defined with clear separation of mutable and immutable state. The immutable fields are set during construction and are not modified afterward, which is a good practice for maintaining thread safety.130-141: The
NewHealthMonitor
constructor function initializes theHealthMonitor
with a logger that includes the module name, which is good for traceability in logs.148-178: The
RegisterServiceWithCallback
method ensures that themaximumAcceptableUnhealthyDuration
is positive before proceeding with registration, which is a good validation check to prevent misconfiguration.181-187: The
PanicServiceNotResponding
function provides a callback that panics with a service-specific message, which aligns with the PR objective to re-issue panics for unhealthy daemons.189-200: The
LogErrorServiceNotResponding
function is a good alternative to panicking, allowing for logging errors without stopping the service. This could be useful in non-critical environments or during development.203-218: The
RegisterService
method provides a convenient wrapper aroundRegisterServiceWithCallback
, using the panic callback by default. This aligns with the PR objective and simplifies the registration process for services that should panic when unhealthy.220-223: The
Stop
method in theHealthMonitor
struct delegates to the mutable state'sStop
method, maintaining consistency in the stopping behavior across the system.protocol/daemons/server/types/stoppable.go (1)
- 3-7: The
Stoppable
interface has been correctly defined according to the PR objectives and summary.protocol/daemons/types/health_checkable.go (7)
22-26: The addition of the
ServiceName
method to theHealthCheckable
interface aligns with the PR objectives and the provided summaries.69-73: The inclusion of the
serviceName
field in thetimeBoundedHealthCheckable
struct and the implementation of theServiceName
method are consistent with the PR objectives and the provided summaries.81-88: The changes to the
NewTimeBoundedHealthCheckable
function, including the addition of theserviceName
parameter and its assignment to theserviceName
field, are in line with the PR objectives and the provided summaries.91-93: The implementation of the
ServiceName
method in thetimeBoundedHealthCheckable
struct is correct and aligns with the PR objectives and the provided summaries.96-98: The
ReportSuccess
method in thetimeBoundedHealthCheckable
struct is correctly implemented and thread-safe, which is consistent with the PR objectives and the provided summaries.98-98: The
ReportFailure
method in thetimeBoundedHealthCheckable
struct is correctly implemented and thread-safe, which is consistent with the PR objectives and the provided summaries.98-98: The
HealthCheck
method in thetimeBoundedHealthCheckable
struct is correctly implemented and thread-safe, which is consistent with the PR objectives and the provided summaries.protocol/mocks/HealthCheckable.go (1)
- 1-63: The changes to the
HealthCheckable
mock type are consistent with the PR objectives and summaries provided. The mock functions forHealthCheck
,ReportFailure
,ReportSuccess
, andServiceName
are correctly implemented.protocol/mocks/Makefile (1)
- 24-24: The changes correctly add the mock generation command for the
HealthCheckable
interface, aligning with the PR objectives and the summary provided.protocol/testutil/app/app.go (1)
- 492-493: The change from
DisableUpdateMonitoringForTesting
toDisableHealthMonitorForTesting
correctly reflects the PR's objective to migrate daemon monitoring to health checks.
func (u *timestampWithError) Update(timestamp time.Time, err error) { | ||
// If the timestamp is zero, this is the first update, so set the timestamp. | ||
if u.timestamp.IsZero() { | ||
u.timestamp = timestamp | ||
} | ||
|
||
u.err = err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming the Update
method to UpdateLastErr
to clarify that it updates the error and the timestamp only if it's the first error in a streak.
// createTestMonitor creates a health monitor with a poll frequency of 10ms and a zero duration grace period. | ||
func createTestMonitor() (*types.HealthMonitor, *mocks.Logger) { | ||
logger := &mocks.Logger{} | ||
logger.On("With", "module", "health-monitor").Return(logger).Once() | ||
return types.NewHealthMonitor( | ||
ZeroDuration, | ||
10*time.Millisecond, | ||
logger, | ||
), logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The createTestMonitor
function sets up a health monitor with a very short poll frequency of 10ms. While this is likely intended to make the tests run quickly, it's important to ensure that this does not lead to flaky tests in slower environments or CI systems. Consider parameterizing the poll frequency or using a mock clock to control time in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (1)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
Additional comments: 1
protocol/daemons/server/types/health_monitor_test.go (1)
- 202-206: The error message in the assertion on line 206 has been corrected to match the nature of the error being tested.
Co-authored-by: ttl33 <19664986+ttl33@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// Update updates the timeStampWithError to reflect the current error. If the timestamp is zero, this is the first | ||
// update, so set the timestamp. | ||
func (u *timestampWithError) Update(timestamp time.Time, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming the Update
method to UpdateLastErr
to clarify that it updates the error and the timestamp only if it's the first error in a streak.
} | ||
|
||
// IsZero returns true if the timestampWithError is zero, indicating that the service is healthy. | ||
func (u *timestampWithError) IsZero() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method IsZero
should be renamed to IsUnset
as per previous discussions to better reflect its purpose.
// This field is used to determine how long the daemon has been unhealthy. If this timestamp is nil, then either | ||
// the service has never been unhealthy, or the most recent error streak ended before it could trigger a callback. | ||
// Access to mostRecentFailureStreakError is synchronized. | ||
mostRecentFailureStreakError timestampWithError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming the mostRecentFailureStreakError
field to lastErrWithFirstErrTimestamp
for clarity, as previously suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 6
Configuration used: CodeRabbit UI
Files ignored due to filter (1)
- protocol/docker-compose.yml
Files selected for processing (11)
- protocol/app/app.go (8 hunks)
- protocol/app/flags/flags.go (1 hunks)
- protocol/daemons/flags/flags.go (5 hunks)
- protocol/daemons/flags/flags_test.go (4 hunks)
- protocol/daemons/pricefeed/client/client_integration_test.go (3 hunks)
- protocol/daemons/server/types/health_checker.go (1 hunks)
- protocol/daemons/server/types/health_monitor.go (1 hunks)
- protocol/daemons/server/types/health_monitor_test.go (1 hunks)
- protocol/x/clob/client/cli/cancel_order_cli_test.go (2 hunks)
- protocol/x/clob/client/cli/liquidations_cli_test.go (2 hunks)
- protocol/x/clob/client/cli/place_order_cli_test.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- protocol/daemons/pricefeed/client/client_integration_test.go
Additional comments: 24
protocol/app/app.go (1)
- 8-8: The
math/big
package import is still present, which contradicts the summary stating it was removed. If this package is no longer used, it should be removed to keep the code clean.- "math/big"
protocol/app/flags/flags.go (1)
- 3-7: The summary indicates that the "fmt" package import was removed, but the import statement is still present in the hunk. This discrepancy should be corrected in the summary.
protocol/daemons/flags/flags.go (5)
9-17: The addition of new flags
FlagPanicOnDaemonFailureEnabled
andFlagMaxDaemonUnhealthySeconds
is consistent with the PR objectives and the summary provided. These flags are crucial for the new health check monitoring system.30-37: The
SharedFlags
struct has been correctly updated to include the new fieldsPanicOnDaemonFailureEnabled
andMaxDaemonUnhealthySeconds
, aligning with the PR's goal to enhance daemon health monitoring.80-86: The default values for the new flags
PanicOnDaemonFailureEnabled
andMaxDaemonUnhealthySeconds
have been set appropriately in theGetDefaultDaemonFlags
function. This ensures that the daemons will panic on failure and have a default unhealthy duration threshold, which is in line with the PR's objectives.117-132: The command line flag additions for
FlagPanicOnDaemonFailureEnabled
andFlagMaxDaemonUnhealthySeconds
are correctly implemented, allowing users to configure the behavior of the health check monitoring system via command line options.196-208: The parsing logic for the new flags
FlagPanicOnDaemonFailureEnabled
andFlagMaxDaemonUnhealthySeconds
is correctly implemented, ensuring that the values can be retrieved and set from the command line options.protocol/daemons/flags/flags_test.go (4)
17-24: The addition of new flags
FlagPanicOnDaemonFailureEnabled
andFlagMaxDaemonUnhealthySeconds
to the command line interface is correctly implemented and tested for presence.43-50: The test function
TestGetDaemonFlagValuesFromOptions_Custom
correctly sets custom values for the new flags and verifies their retrieval from the options map.68-79: The assertions for the new flags
FlagPanicOnDaemonFailureEnabled
andFlagMaxDaemonUnhealthySeconds
in theTestGetDaemonFlagValuesFromOptions_Custom
test function are correctly implemented.91-97: The
TestGetDaemonFlagValuesFromOptions_Default
test function correctly checks for the default values of the daemon flags, ensuring that the new flags will have their expected default values when not explicitly set.protocol/daemons/server/types/health_checker.go (1)
- 182-186: The implementation of the
Poll
method correctly schedules the next poll regardless of the health check result, which is consistent with the intended behavior as per the PR's objective and previous comments. The comments within the code provide clear explanation for this behavior, which is to ensure continuous monitoring even when the daemon is unhealthy and the callback does not halt the daemon.protocol/daemons/server/types/health_monitor.go (5)
21-34: The mutable state struct
healthMonitorMutableState
is well encapsulated and uses a mutex for synchronization, which is good for thread safety.188-194: The
PanicServiceNotResponding
function provides a clear panic message including the service name and error, which is helpful for debugging.196-208: The
LogErrorServiceNotResponding
function is a good counterpart toPanicServiceNotResponding
, providing a non-terminating error logging option. It's important to ensure that the logger is properly configured to capture these error messages.210-233: The
RegisterService
method provides a clear API for registering services with the health monitor, with the behavior toggled by theenablePanics
flag. This is a good use of a configuration flag to alter behavior without changing the interface.235-238: The
Stop
method in theHealthMonitor
struct delegates to the mutable state'sStop
method, maintaining consistency in the approach to stopping services.protocol/x/clob/client/cli/cancel_order_cli_test.go (3)
11-11: The addition of the
math
package import is correct and necessary for the use ofmath.MaxUint32
later in the code.77-82: The configuration setting to disable the health monitor panic timeout by setting
daemonflags.FlagMaxDaemonUnhealthySeconds
tomath.MaxUint32
is consistent with the PR objectives and is a valid approach for integration tests where the health check panic is not desirable.74-86: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [8-86]
The changes in the provided hunks are consistent with the PR objectives and the AI-generated summary. There are no alterations to the declarations of exported entities or any other issues that need to be addressed.
protocol/x/clob/client/cli/liquidations_cli_test.go (2)
5-10: The addition of the "math" package import is correctly reflected in the hunk and is used within the test setup.
78-83: The setting for
daemonflags.FlagMaxDaemonUnhealthySeconds
is correctly added to disable the health monitor panic timeout for the tests.protocol/x/clob/client/cli/place_order_cli_test.go (2)
- 5-8: ```shell
Verify that the "math" package is used elsewhere in the file.
ast-grep --lang go --pattern $'import("math")'
* 72-77: ```shell # Verify that the setting for FlagMaxDaemonUnhealthySeconds is not used elsewhere in a way that conflicts with this test configuration. ast-grep --lang go --pattern $'Set(daemonflags.FlagMaxDaemonUnhealthySeconds, $_)'
// The following tests may still intermittently fail or report false positives on an overloaded system as they rely | ||
// on callbacks to execute before the termination of the `time.Sleep` call, which is not guaranteed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests still rely on time.Sleep
for synchronization, which can lead to flaky tests. Consider refactoring the tests to use synchronization primitives or mocking time-dependent functionality to make the tests more deterministic and reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (1)
- protocol/daemons/server/types/health_checker.go (1 hunks)
Additional comments: 1
protocol/daemons/server/types/health_checker.go (1)
- 1-217: The summary mentions the introduction of new global variables and flags, such as
MaximumDaemonUnhealthyDuration
andFlagMaxDaemonUnhealthySeconds
, but these are not visible in the provided code. Please ensure that all relevant changes are included in the review or update the summary accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (1)
- protocol/daemons/server/types/health_checker.go (1 hunks)
Additional comments: 6
protocol/daemons/server/types/health_checker.go (6)
77-87:
TheReportSuccess
method correctly updates the last success timestamp and resets the error streak, ensuring that the health checker's state is accurately maintained after a successful health check.89-97:
TheReportFailure
method properly updates the error streak with the most recent error and calculates the duration of the current failure streak, which is essential for determining if the maximum unhealthy duration has been exceeded.100-113:
TheSchedulePoll
method correctly checks if the health checker has been stopped before scheduling the next poll, which prevents unnecessary operations after the health checker has been stopped.115-127:
TheStop
method ensures that the health checker is stopped only once and that subsequent calls to stop do not have any effect, which is a good practice to prevent redundant operations.156-179:
ThePoll
method encapsulates the logic for executing a health check, handling both healthy and unhealthy responses, and scheduling the next poll. The method also correctly invokes the unhealthy callback when the service has been unhealthy for longer than the maximum acceptable duration.186-209:
TheStartNewHealthChecker
function initializes and starts a new health checker with the provided parameters, including a startup grace period before the first poll, which allows the service to initialize properly.
// Schedule next poll. We schedule another poll whether the callback was invoked or not, as callbacks are not | ||
// guaranteed to panic or otherwise halt the daemon. In such cases, we may end up invoking the callback several | ||
// times once the service exceeds the maximum unhealthy duration. For example, a callback that emits error logs | ||
// will continue to emit error logs every 5s until the service becomes healthy again. | ||
hc.mutableState.SchedulePoll(hc.pollFrequency) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code schedules a new poll after each health check, regardless of the health check's outcome. It would be beneficial to add a comment explaining the rationale behind this decision, as previously suggested by other reviewers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func TestRegisterService_RegistrationFailsAfterStop(t *testing.T) { | ||
ufm, _ := createTestMonitor() | ||
ufm.Stop() | ||
|
||
stoppableHc := &stoppableFakeHealthChecker{} | ||
hc2 := mockFailingHealthCheckerWithError("test-service-2", TestError1) | ||
|
||
// Register unhealthy services. These services are confirmed to trigger a panic if registered when the monitor is | ||
// not stopped. | ||
// Register a stoppable unhealthy service. | ||
err := ufm.RegisterService(stoppableHc, 10*time.Millisecond) | ||
require.Nil(t, err) | ||
|
||
// Register a non-stoppable unhealthy service. | ||
err = ufm.RegisterService(hc2, 10*time.Millisecond) | ||
require.Nil(t, err) | ||
|
||
// Since the max allowable unhealthy duration is 10ms, and the polling period is 10ms, 100ms is long enough to wait | ||
// in order to trigger a panic if a service is polled. | ||
time.Sleep(100 * time.Millisecond) | ||
|
||
// Assert that the monitor proactively stops any stoppable service that was registered after the monitor was | ||
// stopped. | ||
require.True(t, stoppableHc.stopped) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case does not assert the expected behavior for non-stoppable services after the monitor is stopped. Consider adding an assertion to verify that non-stoppable services are not stopped by the monitor.
}() | ||
|
||
// Give the monitor time to poll the health checkable service. Polls occur once every 10ms. | ||
time.Sleep(100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests still rely on time.Sleep
for synchronization, which can lead to flaky tests. Consider refactoring the tests to use synchronization primitives or mocking time-dependent functionality to make the tests more deterministic and reliable.
Also applies to: 119-119, 159-159, 185-185, 234-234
Changelist
Refactor the existing update monitor to be a health check monitor, and to re-issue panics after a sustained period of daemon unhealthiness meets the maximum allowable threshold.
Test Plan
Author/Reviewer Checklist
state-breaking
label.PrepareProposal
orProcessProposal
, manually add the labelproposal-breaking
.feature:[feature-name]
.backport/[branch-name]
.refactor
,chore
,bug
.