Add an ADR to keep document history in sync #9666

147 changes: 147 additions & 0 deletions docs/adr/0005-keep-document-history-in-sync-with-rabbit-mq.md
@@ -0,0 +1,147 @@
# 5. Keep document history in sync with Publishing API via RabbitMQ

Date: 2025-11-27

## Status

Accepted

## Context

When Content Blocks created in the Content Block Manager are used in documents, we want to be able to
record when a change to a content block triggers an update to the host document. Currently this works
like so:

* Content block is updated
* We find all documents that use the content block
* Each document is then re-presented to the Content Store with the updated content block details

This all happens in Publishing API, so there is no record in Whitehall (or any other publishing app)
of when a change to a document has been triggered by an update to a content block.

With this in mind, we need to find some way of enabling two-way communication between Publishing API
and Whitehall, so publishers can see when content blocks that their document uses have been updated.

There are two potential solutions, each with their own advantages and drawbacks:

### Solution 1: Interweave content block updates in with Whitehall's history

In order to do this, we need to update the Publishing API to record an event when a document has been
republished as a result of a change to a content block. We also need to add an endpoint that returns
the events for a particular document and supports filtering by event type and date.

A JSON representation of an event object will look like this:

```json
{
  "id": 115,
  "action": "HostContentUpdateJob",
  "user_uid": null,
  "created_at": "2024-11-28T14:14:11.375Z",
  "updated_at": "2024-11-28T14:14:11.375Z",
  "request_id": "91cfbab2f3ff8889ff55a1c7b308d60c",
  "content_id": "0c643225-b5ae-4bd4-8c5d-9d8911433e28",
  "payload": {
    "locale": "en",
    "message": "Host content updated by content block update",
    "content_id": "0c643225-b5ae-4bd4-8c5d-9d8911433e28",
    "source_block": {
      "title": " Universal Credit Helpline ",
      "content_id": "a55a917b-740f-466b-9b31-9a9df4526de4"
    }
  }
}
```
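As a minimal sketch of the filtering the proposed endpoint would support, the Ruby below selects events shaped like the example above by `action` and `created_at`. The method name `filter_events` and its keyword arguments are assumptions for illustration only; they are not part of Publishing API.

```ruby
require "time"

# Hypothetical helper: filters event hashes (shaped like the JSON example)
# by event type and by an optional created_at date range.
def filter_events(events, action: nil, from: nil, to: nil)
  events.select do |event|
    created_at = Time.iso8601(event.fetch("created_at"))

    (action.nil? || event["action"] == action) &&
      (from.nil? || created_at >= from) &&
      (to.nil? || created_at <= to)
  end
end
```

A caller would pass only the filters it needs, for example `filter_events(events, action: "HostContentUpdateJob")`.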

When a document is loaded in Whitehall, we could then call the API and weave these events into the timeline.
However, this is complicated by the fact that Whitehall's document history is paginated, so we won't necessarily
have the full Whitehall history at load time and won't necessarily know the full date window of Publishing events
to fetch. For example:

A document has the following range of event datetimes for the first page:

```
2024-03-23T09:23:00
.....
2023-12-10T11:13:00
```

And a range of event datetimes for the second page:

```
2023-11-22T12:27:00
...
2023-09-12T15:17:00
```

If a Publishing API event happens between `2023-11-22T12:27:00` (the newest event on the second page) and
`2023-12-10T11:13:00` (the oldest event on the first page), it won't get picked up, because it falls
outside the date range of both pages.
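The gap can be illustrated with a short Ruby sketch. The page windows reuse the example datetimes above (treated as UTC for simplicity); `PAGE_WINDOWS` and `page_for` are hypothetical names and do not exist in Whitehall.

```ruby
require "time"

# Hypothetical page windows built from the example datetimes (oldest..newest).
PAGE_WINDOWS = [
  Time.parse("2023-12-10T11:13:00Z")..Time.parse("2024-03-23T09:23:00Z"), # page 1
  Time.parse("2023-09-12T15:17:00Z")..Time.parse("2023-11-22T12:27:00Z"), # page 2
].freeze

# Returns the 1-based page whose window covers the event, or nil if none does.
def page_for(event_time, windows = PAGE_WINDOWS)
  index = windows.index { |window| window.cover?(event_time) }
  index && index + 1
end

# An event between the two pages is covered by neither window.
gap_event = Time.parse("2023-12-01T10:00:00Z")
page_for(gap_event) # nil
```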

We could get around this by making a request to get the datetime of the first event on the next page, thus
giving us a full window of dates to interleave, but this makes an already [complex class][1] harder to understand.
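As a rough illustration of that workaround (not the actual `PaginatedTimeline` code), the sketch below widens a page's window so that its lower bound is the newest event on the next page. `interleave_window` and `in_window?` are hypothetical helpers, and the sketch assumes the first page would need separate handling for events newer than its newest Whitehall event.

```ruby
require "time"

# Hypothetical helper: returns [lower, upper] for the Publishing API events to
# interleave with a page. `next_page_newest` is the datetime of the first
# (newest) event on the next page, or nil when this is the last page.
def interleave_window(page_times, next_page_newest)
  lower = next_page_newest || Time.at(0).utc
  upper = page_times.max
  [lower, upper]
end

# The lower bound is exclusive so that an event is never shown on two pages.
def in_window?(event_time, window)
  lower, upper = window
  event_time > lower && event_time <= upper
end
```

With the example datetimes, the first page's window would run from `2023-11-22T12:27:00` (exclusive) to `2024-03-23T09:23:00`, which closes the gap at the cost of one extra query for the next page's first event.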

Additionally, making an extra database query and calling out to an API endpoint could have performance impacts.

It's also worth considering that currently, we display 10 events on each "page" of results. If we are interleaving
new events with each page of results, this could be confusing for the user if they only expect to see 10 results.

Another solution could be sending a request to the Publishing API endpoint before we fetch the history and then creating
new events, however:

1. This will result in an API call every time a user views a document; and
2. Carrying out an INSERT query on a GET request isn't a pattern we want to encourage.

### Solution 2: Add a new message consumer in Whitehall

This would involve setting up a new RabbitMQ message topic in Publishing API that sends
messages when a content block update triggers a change to a document. This would be a brand new
topic that contains a thin message that includes the `content_id` of the document that has
been updated, when it was updated and information about the content block that triggered the update:
Comment on lines +97 to +100
Contributor

This is interesting. In my imagination we were going to send just a message for the content block and then leave Whitehall to do the lookup of all the documents that link to the block. However, I realise now that would mean searching through govspeak because Whitehall doesn't have a structured reference to content blocks. I can see now why the original plan was to use the existing message, because we'd now have to send two messages for each document. Hmm.

Whitehall does a thing where it parses out the references from Govspeak to contacts and to other editions each time an edition is saved (link). We could do the same with content block references perhaps, but I have not really thought through the consequences of that (performance etc.). It would allow us to just send the content block ID to Whitehall and update the document history though.

Contributor Author

We have got a similar thing (contained within the Content Block Tools Gem) that we use to extract references from Govspeak (It's actually used in Whitehall at the mo too), so that's entirely a thing we could do


```json
{
  "locale": "en",
  "content_id": "0c643225-b5ae-4bd4-8c5d-9d8911433e28",
  "updated_at": "2024-11-28T14:14:11.375Z",
  "content_block": {
    "title": " Universal Credit Helpline ",
    "content_id": "a55a917b-740f-466b-9b31-9a9df4526de4"
  }
}
```

We will then set up a queue in Whitehall to listen for events with the relevant key. When an
event has been received, we create a new event in Whitehall (something like an `EditorialRemark`)
for the document with that `content_id`.
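A minimal sketch of what the consumer's message handling might look like is shown below. It is an assumption-laden illustration: `HistoryEntry` stands in for whatever Whitehall record is chosen (for example an `EditorialRemark`), `HostContentUpdateProcessor` is a hypothetical class name, and the RabbitMQ wiring and database persistence are deliberately left out.

```ruby
require "json"
require "time"

# Hypothetical stand-in for a Whitehall history record such as an EditorialRemark.
HistoryEntry = Struct.new(:content_id, :body, :created_at, keyword_init: true)

# Hypothetical processor. A real consumer would be attached to a RabbitMQ queue
# and would persist to the database rather than append to an in-memory array.
class HostContentUpdateProcessor
  attr_reader :entries

  def initialize
    @entries = []
  end

  # Parses a thin message (shaped like the JSON above) and records an entry
  # against the document identified by the message's content_id.
  def process(raw_payload)
    message = JSON.parse(raw_payload)
    block = message.fetch("content_block")

    entry = HistoryEntry.new(
      content_id: message.fetch("content_id"),
      body: "Content block updated: #{block.fetch('title').strip}",
      created_at: Time.iso8601(message.fetch("updated_at")),
    )
    @entries << entry
    entry
  end
end
```

Using `fetch` means a message missing an expected key raises a `KeyError` rather than silently creating an incomplete history entry; whether such messages should be retried or discarded is left open here.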

This will require a bit more work on both the Publishing API and Whitehall side and will involve
a degree of opacity (as well as extra lines on an architecture graph), but this will avoid complexity
when rendering the history of the document.

## Decision

We propose going with Solution 2.

## Consequences

We will need to set up a RabbitMQ consumer in Whitehall, which will require some minor work on the
ops side of things. It will also mean we will need to consider two-way communication between the
two applications when thinking about the publishing platform architecture.

However, once this is set up, it could open up the possibility of more two-way
communication between Whitehall and Publishing API in the future, such as feeding back to
the user when something has not published successfully.

## Alternatives considered

We could remove pagination entirely from the events, or carry out in-memory pagination, but these
options could result in performance issues, especially with older documents. We would also have to
make an API call to Publishing API each time a document is loaded, which could slow things down.

Another option could be to treat Publishing API as the source of truth for the history of a document,
but this could be a considerably more complex piece of work, which we would have limited resource for.
If we decided in the future that it was worth the investment of time, we could still do this further
down the line.

[1]: https://github.com/alphagov/whitehall/blob/main/app/models/document/paginated_timeline.rb