A common question is: why not just use NGINX instead of the custom proxy? The reason is the dynamic routing for the applications, e.g. URLs like https://jupyterlab-abcde1234.mydomain.com/some/path: each one has a lot of fairly complex requirements.
While it would not be impossible to leverage NGINX and move some code out of the proxy, there would still need to be custom code, and NGINX would have to communicate with that custom code via some mechanism to achieve all of the above: extra HTTP or Redis requests, or maybe a custom NGINX module. It is suspected that this would make things more complex rather than less, and increase the burden on the developer.
+We will use a custom proxy for Data Workspace, rather than simply using NGINX.
This avoids the burden on the developer that custom NGINX modules or extra HTTP or Redis requests would have introduced, all of which would still have required custom code.
+Using the custom proxy allows for all of the complex requirements and dynamic routing of our applications over which we have absolute control.
+Initial difficulty when onboarding new team members as they will need to understand these decisions and requirements.
+There is an extra network hop compared to not having a proxy.
+The proxy fits the typical use-case of event-loop based programming: low CPU but high IO requirements, with potentially high number of connections.
+The asyncio library aiohttp provides enough low-level control over the headers and the bytes of requests and responses to work as a controllable proxy. For example, the typical HTTP request cycle can be programmed fairly explicitly.
+An incoming request begins: its headers are received.
The library also allows for receiving and making WebSockets requests. This is done without knowing ahead of time which paths are WebSockets and which are HTTP. This is something that doesn't seem possible with, for example, Django Channels.
+Requests and responses can be of the order of several GBs, so this streaming behaviour is a critical requirement.
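As a rough sketch of what this low-level control looks like (a minimal, hypothetical example rather than the actual Data Workspace proxy, with an assumed upstream address and simplified header handling), an aiohttp handler can stream an incoming request body to an upstream service and stream the response back without buffering whole bodies in memory:

import aiohttp
from aiohttp import web

UPSTREAM = "http://localhost:8888"  # illustrative upstream only
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade", "content-length",
}

async def handle(request: web.Request) -> web.StreamResponse:
    # Copy headers, dropping hop-by-hop ones that must not be forwarded
    headers = {k: v for k, v in request.headers.items() if k.lower() not in HOP_BY_HOP}
    async with aiohttp.ClientSession(auto_decompress=False) as session:
        async with session.request(
            request.method,
            UPSTREAM + request.rel_url.path_qs,
            headers=headers,
            data=request.content if request.body_exists else None,  # stream the request body as it arrives
            allow_redirects=False,
        ) as upstream:
            response = web.StreamResponse(
                status=upstream.status,
                headers={k: v for k, v in upstream.headers.items() if k.lower() not in HOP_BY_HOP},
            )
            await response.prepare(request)
            async for chunk in upstream.content.iter_chunked(65536):
                await response.write(chunk)  # stream the response back in chunks
            await response.write_eof()
            return response

app = web.Application()
app.add_routes([web.route("*", "/{path:.*}", handle)])
# web.run_app(app, port=8000)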
+We will use the asyncio library aiohttp.
+Allows for critical requirement of streaming behaviour.
+We can stream HTTP(S) and Websockets requests in an efficient way with one cohesive Python package.
+A core bit of infrastructure will depend on a flavour of Python unknown to even experienced Python developers.
Aiohttp is unable to proxy protocols that are not HTTP or WebSockets, such as SSH. This is why GitLab isn't behind the proxy.
+This section contains a list of Architecture Decision Records (ADRs).
+As an example, from the point of view of user abcde1234
, https://jupyterlab-abcde1234.mydomain.com/ is the fixed address of their private JupyterLab application. Going to https://jupyterlab-abcde1234.mydomain.com/ in a browser will:
If the application is stopped, then a visit to https://jupyterlab-abcde1234.mydomain.com/ will repeat the process. The user will never leave https://jupyterlab-abcde1234.mydomain.com/. If the user visits https://jupyterlab-abcde1234.mydomain.com/some/path, they will also remain at https://jupyterlab-abcde1234.mydomain.com/some/path, to ensure, for example, that bookmarks to any in-application page work even if they need to start the application to view them.
The browser will only make GET requests during the start of an application. While potentially a small abuse of HTTP, it allows the straightforward behaviour described: no HTML form or JavaScript is required to start an application (although JavaScript is used to show a countdown to the user and to check if an application has loaded), and the GET requests are idempotent.
The proxy, however, has a more complex behaviour. On an incoming request from the browser for https://jupyterlab-abcde1234.mydomain.com/:
it will GET the details of the application with the host jupyterlab-abcde1234 from an internal API of the main application;
if the GET returns a 404, it will make a PUT request to the main application that initiates creation of the Fargate task;
if the GET returns a 200, and the details contain a URL, the proxy will attempt to proxy the incoming request to it;
it does not treat errors connecting to a SPAWNING application as a true error: they are effectively swallowed;
if the application is returned from the GET as STOPPED, which happens on error, it will DELETE the application, and show an error to the user.
The proxy itself only responds to incoming requests from the browser, and has no long-lived tasks that go beyond one HTTP request or WebSockets connection. This ensures it can be horizontally scaled.
+ + + + + + +In addition to being able to run any Docker container, not just JupyterLab, Data Workspace has some deliberate architectural features that are different to JupyterHub.
+All state is in the database, accessed by the main Django application.
+Specifically, no state is kept in the memory of the main Django application. This means it can be horizontally scaled without issue.
+The proxy is also stateless: it fetches how to route requests from the main application, which itself fetches the data from the database. This means it can also be horizontally scaled without issue, and potentially independently from the main application. This means sticky sessions are not needed, and multiple users could access the same application, which is a planned feature for user-supplied visualisation applications.
+Authentication is completely handled by the proxy. Apart from specific exceptions like the healthcheck, non-authenticated requests do not reach the main application.
+The launched containers do not make requests to the main application, and the main application does not make requests to the launched containers. This means there are fewer cyclic dependencies in terms of data flow, and that applications don't need to be customised for this environment. They just need to open a port for HTTP requests, which makes them extremely standard web-based Docker applications.
There is a notable exception to the statelessness of the main application: the launch of an application is made of a sequence of calls to AWS, and is done in a Celery task. If this sequence is interrupted, the launch of the application will fail. This is a solvable problem: the state could be saved into the database and the sequence resumed later. However, since this sequence of calls lasts only a few seconds, and the user will be told of the error and can refresh to try to launch the application again, at this stage of the project this has been deemed unnecessary.
+ + + + + + +Data Workspace is made of a number of components. This page explains what those are and how they work together.
To understand the components of Data Workspace's architecture, you should have familiarity with:
+At the highest level, users access the Data Workspace application, which accesses a PostgreSQL database.
+graph
+ A[User] --> B[Data Workspace]
+ B --> C["PostgreSQL (Aurora)"]
+The architecture is heavily Docker/ECS Fargate based.
+graph
+ A[User] -->|Staff SSO| B[Amazon Quicksight];
+ B --> C["PostgreSQL (Aurora)"];
+ A --> |Staff SSO|F["'The Proxy' (aiohttp)"];
+ F --> |rstudio-9c57e86a|G[Per-user and shared tools];
+ F --> H[Shiny, Flask, Django, NGINX];
+ F --> I[Django, Data Explorer];
+ G --> C;
+ H --> C;
+ I --> C;
+
+
+
Main application: A Django application to manage datasets and permissions and to launch containers, a proxy to route requests to those containers, and an NGINX instance to route to the proxy and serve static files.
+JupyterLab: + Launched by users of the main application, and populated with credentials in the environment to access certain datasets.
+rStudio: + Launched by users of the main application, and populated with credentials in the environment to access certain datasets.
+pgAdmin: + Launched by users of the main application, and populated with credentials in the environment to access certain datasets.
+File browser: + A single-page-application that offers upload and download of files to/from each user's folder in S3. The data is transferred directly between the user's browser and S3.
+metrics: + A sidecar-container for the user-launched containers that exposes metrics from the ECS task metadata endpoint in Prometheus format.
s3sync: A sidecar container for the user-launched containers that syncs to and from S3 using mobius3. This allows file persistence on S3 without using FUSE, which at the time of writing is not possible on Fargate.
dns-rewrite-proxy: The DNS server of the VPC that launched containers run in. It selectively allows only certain DNS requests through, to mitigate the chance of data exfiltration through DNS. When this container is deployed, it changes DHCP settings in the VPC, and will most likely break aspects of user-launched containers.
+healthcheck: + Proxies through to the healthcheck endpoint of the main application, so the main application can be in a security group locked-down to certain IP addresses, but still be monitored by Pingdom.
+mirrors-sync: + Mirrors pypi, CRAN and (ana)conda repositories to S3, so user-launched JupyterLab and rStudio containers can install packages without having to contact the public internet.
+prometheus: + Collects metrics from user-launched containers and re-exposes them through federation.
+registry: + A Docker pull-through-cache to repositories in quay.io. This allows the VPC to not have public internet access but still launch containers from quay.io in Fargate.
+sentryproxy: + Proxies errors to a Sentry instance: only used by JupyterLab.
+Contributions to Data Workspace are welcome, such as reporting issues, requesting features, making documentation changes, or submitting code changes.
+Suspected issues with Data Workspace can be submitted at Data Workspace issues. +An issue that contains a minimal, reproducible example stands the best chance of being resolved. However, it is understood that this is not possible in all circumstances.
+A feature request can be submitted using the Ideas category in Data Workspace discussions.
+The source of the documentation is in the docs/
directory of the source code, and is written using Material for mkdocs.
Changes are then submitted via a Pull Request (PR). To do this:
+Decide on a short hyphen-separated descriptive name for your change, prefixed with docs/
, for example docs/add-example
.
Make a branch using this descriptive name:
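git checkout -b docs/add-example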
+ +Make your changes in a text editor.
+Preview your changes locally:
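pip install -r requirements-docs.txt # Only needed once
mkdocs serve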
+ +Commit your change and push to your fork. Ideally the commit message will follow the Conventional Commit specification:
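git add docs/getting-started.md # Repeat for each file changed
git commit -m "docs: add an example"
git push origin docs/add-example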
+ +Raise a PR at https://github.com/uktrade/data-workspace/pulls against the master branch in data-workspace.
+Wait for the PR to be approved and merged, and respond to any questions or suggested changes.
+When the PR is merged, the documentation is deployed automatically to https://data-workspace.docs.trade.gov.uk/.
+Changes are submitted via a Pull Request (PR). To do this:
+Decide on a short hyphen-separated descriptive name for your change, prefixed with the type of change. For example fix/the-bug-description
.
Make a branch using this descriptive name:
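git checkout -b fix/a-bug-description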
+ +Make sure you can run existing tests locally, for example by running:
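make docker-test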
+ +See Running tests for more details on running tests.
+Make your changes in a text editor. In the cases of changing behaviour, this would usually include changing or adding tests within dataworkspace/dataworkspace/tests, and running them.
+Commit your changes and push to your fork. Ideally the commit message will follow the Conventional Commit specification:
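git add my_file.py # Repeat for each file changed
git commit -m "fix: the bug description"
git push origin fix/the-bug-description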
+ +Raise a PR at https://github.com/uktrade/data-workspace/pulls against the master branch of data-workspace.
+Wait for the PR to be approved and merged, and respond to any questions or suggested changes.
+Data Workspace is essentially an interface to a PostgreSQL database, referred to as the datasets database. Technical users can access specific tables in the datasets database directly, but there is a concept of "datasets" on top of this direct access. Each dataset has its own page in the user-facing data catalogue that has features for non-technical users.
+Conceptually, there are 3 different types of datasets in Data Workspace: source datasets, reference datasets, and data cuts. Metadata for the 3 dataset types is controlled through a single administration interface, but how data is ingested into these depends on the dataset.
+In addition to the structured data exposed in the catalogue, data can be uploaded by users on an ad-hoc basis, treated by Data Workspace as binary blobs.
Data Workspace is a Django application, with a staff-facing administration interface, usually referred to as Django admin. Metadata for each of the 3 types of dataset is managed within Django admin.
+A source dataset is the core Data Workspace dataset type. It is made up of one or more tables in the PostgreSQL datasets database. Typically a source dataset would be updated frequently.
+However, ingesting into these tables is not handled by the Data Workspace project itself. There are many ways to ingest data into PostgreSQL tables. The Department for Business and Trade uses Airflow to handle ingestion using a combination of Python and SQL code.
+Note
+The Airflow pipelines used by The Department for Business and Trade to ingest data are not open source. Some parts of Data Workspace relating to this ingestion depend on this closed source code.
+Reference datasets are datasets usually used to classify or contextualise other datasets, and are expected to not change frequently. "UK bank holidays" or "ISO country codes" could be reference datasets.
+The structure and data of reference datasets can be completely controlled through Django admin.
+Data isn't ingested into data cuts directly. Instead, data cuts are defined by SQL queries entered into Django admin that run dynamically, querying from source and reference datasets. As such they update as frequently as the data they query from updates.
+A datacut could filter a larger source dataset for a specific country, calculate aggregate statistics, join multiple source datasets together, join a source dataset with a reference dataset, or a combination of these.
Each user is able to upload binary blobs on an ad-hoc basis to their own private prefix in an S3 bucket, as well as to any authorised team prefixes. Read and write access to these prefixes is provided by 3 mechanisms:
+Through a custom React-based S3 browser built into the Data Workspace Django application.
+From tools using the S3 API or S3 SDKs, for example boto3.
Certain parts of each user's prefix are automatically synced to and from the local filesystem in on-demand tools they launch. This gives users the illusion of a permanent filesystem in their tools, even though the tools are ephemeral.
+Data Workspace contains code that helps it be deployed using Amazon Web Services (AWS). This page explains how to use this code.
+To deploy Data Workspace to AWS you must have:
+data-workspace
. See Running locally for detailsYou should also have familiarity with working on the command line, working with Terraform, and with AWS.
+Each deployment, or environment, of Data Workspace requires a folder for its configuration. This folder should be within a sibling folder to data-workspace
.
The Data Workspace source code contains a template for this configuration. To create a folder in an appropriate location based on this template:
+Decide on a meaningful name for the environment. In the following production
is used.
Ensure you're in the root of the data-workspace
folder that contains the cloned Data Workspace source code.
Copy the template into a new folder for the environment:
+ +This folder structure allows the configuration to find and use the infra/
folder in data-workspace
which contains the low level details of the infrastructure to provision in each environment.
Before deploying the environment, it must be initialised.
+Change to the new folder for the environment:
+ +Generate new SSH keys:
+ +Install AWS CLI and configure an AWS CLI profile. This will support some of the included configuration scripts.
+You can do this by putting credentials directly into ~/.aws/credentials
or by using aws sso
.
Create an S3 bucket and dynamodb table for Terraform to use, and add them to main.tf
. --bucket
will provide the base name for both objects.
Enter the details of your hosting platform, SSH keys, and OAuth 2.0 server by changing all instances of REPLACE_ME
in:
admin-environment.json
gitlab-secrets.json
main.tf
Initialise Terraform:
+ +Check the environment you created has worked correctly:
+ +If everything looks right, you're ready to deploy:
+ + + + + + + +It should possible to deploy to platforms other than Amazon Web Services (AWS), but at the time of writing this hasn't been done. It may involve a significant amount of work.
+You can start a discussion on how best to approach this.
+ + + + + + +Data Workspace's user-facing metadata catalogue uses Django. When developing Data Workspace, if a change is made to Django's models, to reflect this change in the metadata database, migrations must be created and run.
+To create migrations you must have the Data Workspace prerequisites and cloned its source code. See Running locally for details.
+After making changes to Django models, to create any required migrations:
+docker compose build && \
+docker compose run \
+ --user root \
+ --volume=$PWD/dataworkspace:/dataworkspace/ \
+ data-workspace django-admin makemigrations
+
The migrations must be committed to the codebase, and will run when Data Workspace is next started.
+This pattern can be used to run other Django management commands by replacing makemigrations
with the name of the command.
Turn an existing govuk styled table into a govuk styled ag-grid grid.
+enhanced-table
.<thead>
and one <tbody>
.data-size-to-fit
to ensure columns fit the whole width of the table:Configuration for the columns is done on the <th>
elements via data attributes. The options are:
data-sortable
- enable sorting for this column (disabled by default).data-column-type
- use a specific ag-grid column type.data-renderer
- optionally specify the renderer for the column. Only needed for certain data types.data-renderer="htmlRenderer"
- render/sort column as html (mainly used to display buttons or links in a cell).data-renderer="dateRenderer"
- render/sort column as dates.data-renderer="datetimeRenderer"
- render/sort column as datetimes.data-width
- set a width for a column.data-min-width
- set a minimum width in pixels for a column.data-max-width
- set a maximum width in pixels for a column.data-resizable
- allow resizing of the column (disabled by default).<table class="govuk-table enhanced-table data-size-to-fit">
+ <thead class="govuk-table__head">
+ <tr class="govuk-table__row">
+ <th class="govuk-table__header" data-sortable data-renderer="htmlRenderer">A link</th>
+ <th class="govuk-table__header" data-sortable data-renderer="dateRenderer">A date</th>
+ <th class="govuk-table__header" data-width="300">Some text</th>
+ <th class="govuk-table__header" data-column-type="numericColumn">A number</th>
+ </thead>
+ <tbody class="govuk-table__body">
+ {% for object in object_list %}
+ <tr>
+ <td class="name govuk-table__cell">
+ <a class="govuk-link" href="#">The link</a>
+ </td>
+ ...
+ </tr>
+ {% endfor %}
+ </tbody>
+</table>
+
Add the following to your page:
+<script src="{% static 'ag-grid-community.min.js' %}"></script>
+<script src="{% static 'dayjs.min.js' %}"></script>
+<script src="{% static 'js/grid-utils.js' %}"></script>
+<script src="{% static 'js/enhanced-table.js' %}"></script>
+<link rel="stylesheet" type="text/css" href="{% static 'data-grid.css' %}"/>
+<script nonce="{{ request.csp_nonce }}">
+ document.addEventListener('DOMContentLoaded', () => {
+ initEnhancedTable("enhanced-table");
+ });
+</script>
+
As ipdb
has some issues with gevent and monkey patching we are only able to debug using vanilla pdb
currently.
To set this up locally:
+pip install remote-pdb-client
or just pip install -r requirements-dev.txt
. dev.env
:PYTHONBREAKPOINT=remote_pdb.set_trace
REMOTE_PDB_HOST=0.0.0.0
REMOTE_PDB_PORT=4444
breakpoint()
s liberally in your code.docker compose up
. remotepdb_client --host localhost --port 4444
.To debug via the pycharm remote debugger you will need to jump through a few hoops:
+Configure docker-compose.yml
as a remote interpreter:
+
Configure a python debug server for pydev-pycharm
to connect to. You will need to ensure the path mapping
+is set to the path of your dev environment:
+
Bring up the containers:
+ docker compose up
Start the pycharm debugger:
+
Add a breakpoint using pydev-pycharm:
+
Profit:
+
Below are the basic steps for debugging remotely with vscode. They are confirmed to work but may needs some tweaks so feel free to update the docs:
+launch.json
:
+ breakpoint()
docker compose up
To develop features on Data Workspace, or to evaluate if it's suitable for your use case, it can be helpful to run Data Workspace on your local computer.
+To run Data Workspace locally, you must have these tools installed:
+ +You should also have familiarity with the command line, and editing text files. If you plan to make changes to the Data Workspace source code, you should also have familiarity with Python.
+To run Data Workspace locally, you must also have the Data Workspace source code, which is stored in the Data Workspace GitHub repository. The process of copying this code so it is available locally is known as cloning.
+If you don't already have a GitHub account, create a GitHub account.
+Create a new fork of the Data Workspace repository. Make a note of the owner you choose to fork to. This is usually your GitHub username. There is more documentation on forking at GitHub's guide on contributing to projects.
+If you're a member if the uktrade GitHub organisation you should skip this step and not fork. If you're not planning on contributing changes, you can also skip forking.
+Clone the repository by running the following command, replacing owner
with the owner that you forked to in step 3. If you skipped forking, owner
should be uktrade
:
This will create a new directory containing a copy of the Data Workspace source code, data-workspace
.
Change to the data-workspace
directory:
In order to be able to properly test cookies that are shared with subdomains, localhost is not used for local development. Instead, by default the dataworkspace.test domain is used. For this to work, you will need the below in your /etc/hosts
file:
127.0.0.1 dataworkspace.test
+127.0.0.1 data-workspace-localstack
+127.0.0.1 data-workspace-sso.test
+127.0.0.1 superset-admin.dataworkspace.test
+127.0.0.1 superset-edit.dataworkspace.test
+
To run tool and visualisation-related code, you will need subdomains in your /etc/hosts
file, such as:
Set the required variables:
+ +Start the application:
+ +The application should then visible at http://dataworkspace.test:8000.
+Then run docker compose
using the superset profile:
You can then visit http://superset-edit.dataworkspace.test:8000/ or http://superset-admin.dataworkspace.test:8000/.
+We use node-sass to build the front end css and include the GOVUK Front End styles.
+To build this locally requires NodeJS. Ideally installed via nvm
https://github.com/nvm-sh/nvm:
# this will configure node from .nvmrc or prompt you to install
+ nvm use
+ npm install
+ npm run build:css
+
We're set up to use django-webpack-loader for hotloading the React app while developing.
+You can get it running by starting the dev server:
+ +and in a separate terminal changing to the js app directory and running the webpack hotloader:
+ +For production usage we use pre-built JavaScript bundles to reduce the pain of having to build npm modules at deployment.
+If you make any changes to the React apps you will need to rebuild and commit the bundles.
+This will create the relevant js files in /static/js/bundles/
directory:
cd dataworkspace/dataworkspace/static/js/react_apps/
+# this may about 10 minutes to install all dependencies
+npm install
+npm run build
+git add ../bundles/*.js ../stats/react_apps-stats.json
+
If you have issues building the containers try the following:
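DOCKER_DEFAULT_PLATFORM=linux/amd64 docker compose up --build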
+ + + + + + + +Running tests locally is useful when developing features on Data Workspace to make sure existing functionality isn't broken, and to ensure any new functionality works as intended.
+To create migrations you must have the Data Workspace prerequisites and cloned its source code. See Running locally for details.
+To run all tests:
+ +To only run Django unit tests:
+ +To only run higher level integration tests:
+ +To run the tests locally without having to rebuild the containers every time append -local
to the test make commands:
To run specific tests pass -e TARGET=<test>
into make:
make docker-test-unit-local -e TARGET=dataworkspace/dataworkspace/tests/test_admin.py::TestCustomAdminSite::test_non_admin_access
+
We have some Selenium integration tests that launch a (headless) browser in order to interact with a running instance of Data Workspace to assure some core flows (only Data Explorer at the time of writing). It is sometimes desirable to watch these tests run, e.g. in order to debug where it is failing. To run the selenium tests through docker compose using a local browser, do the following:
+Download the latest Selenium Server and run it in the background, e.g. java -jar ~/Downloads/selenium-server-standalone-3.141.59 &
.
Run the selenium tests via docker-compose, exposing the Data Workspace port and the mock-SSO port and setting the REMOTE_SELENIUM_URL
environment variable, e.g. docker compose --profile test -p data-workspace-test run -e REMOTE_SELENIUM_URL=http://host.docker.internal:4444/wd/hub -p 8000:8000 -p 8005:8005 --rm data-workspace-test pytest -vvvs test/test_selenium.py
.
We use pip-tools to manage dependencies across two files - requirements.txt
and requirements-dev.txt
. These have corresponding .in
files where we specify our top-level dependencies.
Add the new dependencies to those .in
files, or update an existing dependency, then (with pip-tools
already installed), run make save-requirements
.
Host your own data analysis platform
Data Workspace is an open source data analysis platform with features for users with a range of technical skills. Features include:
Data Workspace has been built with features specifically for the Department for Business and Trade. However, we are open to contributions to make this more generic. See Contributing for details on how to make changes for your use case.