
OCRVS-6000: Infrastructure deployment, monitoring and maintenance updates #789

Merged
merged 198 commits into from
Jan 24, 2024

Conversation

euanmillar (Collaborator) commented Nov 10, 2023

Cherry-picked improvements from the planned 1.4 upgrade:

  • (Riku) The root user should be a configurable option. The default should not be root

  • (Riku) Encrypt backups with OpenSSL: openssl enc -aes-256-cbc -salt -in MyData.tar.gz -out MyData.tar.gz.enc

  • Install Filebeat

  • Production alerts: successful SSH login, backup failure (these require Filebeat to be installed)

  • Hide openhim-console from traefik by default

  • 2FA SSH login:

    Ubuntu 2FA setup requires us to:
    Set up separate user accounts for anyone with access (this should be done anyway)
    Install the authenticator software on the servers
    On SSH login, require users to set up their 2FA

  • Alerts on SSH login

  • Metabase SQL shouldn't be copied in deploy but instead supplied by a config docker compose file

  • (PR) Create a directory server-setup/tasks and split the singular ansible playbook files into separate modules, so that user management, ufw, fail2ban and data directory setup tasks each live in their own module

Import the files in the root ansible playbook like this

    - include_tasks:
        file: tasks/fail2ban.yml
        apply:
          tags:
            - security
      tags:
        - security
  • Create a provisioning script for the VPN server if the provisioning environment is "staging". Add VPN details to docker-compose.staging
  • Create a backup server provisioning pipeline
  • Deprecate SUDO_PASSWORD (Euan)
  • Have a single .ini file with all environments in it, with the 3- or 5-server cluster options commented out
  • We need to deprecate the KNOWN_HOSTS secret and instead use the .known_hosts file (Euan)
  • Create a cron job on staging to restore the latest prod backup
  • Elastalert doesn't work with Office365, so let's configure Elastalert to send HTTP requests to countryconfig; that way we can use the SMTP details and deprecate EMAIL_API_KEY & ELASTALERT_SLACK_WEBHOOK entirely.
  • Stop users from elevating their permissions
  • Create a mirror production server so that the production backup automatically restores there rather than onto QA
  • Create a cron job to delete legacy backups older than, for example, 4 months and report to Slack or the alert email.
  • Deprecate external_backup_server_remote_directory: ${{ vars.BACKUP_DIRECTORY }}
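The backup encryption bullet above can be sketched as a full round trip. This is a hedged sketch: the file names and the literal `-pass pass:` passphrase are illustrative (in CI the passphrase would come from a secret such as BACKUP_ENCRYPTION_PASSPHRASE), and `-pbkdf2` is added on top of the quoted command for stronger key derivation on modern OpenSSL:

```shell
set -e

# Build a small demo archive standing in for the real backup
echo "demo backup data" > MyData.txt
tar -czf MyData.tar.gz MyData.txt

# Encrypt the archive before shipping it to the backup server
openssl enc -aes-256-cbc -salt -pbkdf2 \
  -in MyData.tar.gz -out MyData.tar.gz.enc \
  -pass pass:example-passphrase

# Decrypt during restore
openssl enc -d -aes-256-cbc -pbkdf2 \
  -in MyData.tar.gz.enc -out MyData.restored.tar.gz \
  -pass pass:example-passphrase

# The decrypted archive must be byte-identical to the original
cmp MyData.tar.gz MyData.restored.tar.gz
```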

Documentation

  • We need documentation on how to configure the VPN with Screenshots of the UI and links to VPN clients that the users and development team can use to access.
  • Document how to create a github environment
  • Migration plan for older installations
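The legacy-backup cleanup cron job listed above can be sketched with `find -mtime`. A hedged sketch: the directory, file naming, and the 120-day cutoff are assumptions, and the Slack/alert-email reporting step is left as a comment:

```shell
set -e

# Demo directory standing in for the real backup location
BACKUP_DIR=./demo-backups
mkdir -p "$BACKUP_DIR"
touch -d "200 days ago" "$BACKUP_DIR/old-backup.tar.gz.enc"
touch "$BACKUP_DIR/fresh-backup.tar.gz.enc"

# Delete encrypted backups older than ~4 months (120 days),
# printing each removed file
find "$BACKUP_DIR" -name '*.tar.gz.enc' -mtime +120 -print -delete
# A real cron job would pipe the -print output to a Slack webhook
# or the alert email instead of discarding it.
```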

}

const octokit = new Octokit({
auth: process.env.GITHUB_TOKEN
Collaborator

What permissions would the token need?

Collaborator Author

@euanmillar Dec 7, 2023

@rikukissa is it "repo" & "codespaces"? Anything else?

[Screenshots: GitHub token permission settings, 2023-12-07]

Collaborator Author


Seems to be enough after testing.

Comment on lines 33 to 49
#
# Uncomment if using VPN
#
#- name: Install openconnect ppa
# run: sudo add-apt-repository ppa:dwmw2/openconnect -y && sudo apt update

#- name: Install openconnect
# run: sudo apt install -y openconnect

#- name: Connect to VPN
# run: |
# echo "${{ secrets.VPN_PWD }}" | sudo openconnect -u ${{ secrets.VPN_USER }} --passwd-on-stdin --protocol=${{ secrets.VPN_PROTOCOL }} ${{ secrets.VPN_HOST }}:${{ secrets.VPN_PORT }} --servercert ${{ secrets.VPN_SERVERCERT }} --background

#- name: Test if IP is reachable
# run: |
# ping -c4 ${{ secrets.SSH_HOST }}

Member


As per our discussions, let's move these comments to internal docs

Comment on lines 135 to 136
# SUDO_PASSWORD: ${{ secrets.SUDO_PASSWORD }}
# ELASTALERT_SLACK_WEBHOOK: ${{ secrets.ELASTALERT_SLACK_WEBHOOK }}
Member


Suggested change
# SUDO_PASSWORD: ${{ secrets.SUDO_PASSWORD }}
# ELASTALERT_SLACK_WEBHOOK: ${{ secrets.ELASTALERT_SLACK_WEBHOOK }}

Comment on lines 357 to 360
echo -e "$SSH_KEY" > /tmp/private_key_tmp
chmod 600 /tmp/private_key_tmp
echo -e "$KNOWN_HOSTS" > /tmp/known_hosts
chmod 600 /tmp/known_hosts
Member


Aren't both the private key and the known hosts already on the Github Runner machine at this point?

They get set up here:

known_hosts: ${{ secrets.KNOWN_HOSTS }}
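The quoted `known_hosts:` input matches the shape of the shimataro/ssh-key-action step. A hedged sketch of what the setup step being referenced typically looks like; the action version and secret names here are assumptions, not taken from this repository:

```yaml
- name: Install SSH key on the runner
  uses: shimataro/ssh-key-action@v2
  with:
    key: ${{ secrets.SSH_KEY }}
    known_hosts: ${{ secrets.KNOWN_HOSTS }}
```

If this step runs earlier in the job, writing the key and known_hosts to /tmp again by hand would indeed be redundant.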

@@ -329,15 +372,18 @@ rotate_authorized_keys() {

# Download base docker compose files to the server

rsync -rP /tmp/docker-compose* infrastructure $SSH_USER@$SSH_HOST:/opt/opencrvs/
sudo rsync -e 'ssh -o UserKnownHostsFile=/tmp/known_hosts -i /tmp/private_key_tmp' -rP /tmp/docker-compose* infrastructure $SSH_USER@$SSH_HOST:/opt/opencrvs/
Member


Is sudo needed here?

Comment on lines 16 to 18
# Set ssh user in logrotate.conf
sed -i -e "s%{{SSH_USER}}%$SSH_USER%" /opt/opencrvs/infrastructure/logrotate.conf

Member


It's best if we keep logrotate and all other server configuration in the server provisioning stage

Collaborator Author

@euanmillar Dec 7, 2023

The problem I faced in Niue was that an error is thrown when the logs are rotated, and they don't in fact rotate. This was because they are created by the root user in logrotate.conf, while the stack runs under the SSH_USER account, which is not necessarily root.

Member


29fe10e

This commit should now make it so that SSH_USER can access and write to those files
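A minimal sketch of how logrotate can rotate files owned by a non-root user, which is the fix being discussed here. The `su` and `create` directives are standard logrotate; the log path, rotation schedule, the `application` group name, and the `{{SSH_USER}}` placeholder are assumptions for illustration:

```
/var/log/opencrvs*.log {
    weekly
    rotate 8
    compress
    missingok
    # run rotation as the deploy user rather than root
    su {{SSH_USER}} application
    # recreate the log file with group write access for the app group
    create 0660 {{SSH_USER}} application
}
```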

Comment on lines 213 to 221
if [ -z "$KNOWN_HOSTS" ] ; then
echo 'Error: Missing environment variable KNOWN_HOSTS.'
print_usage_and_exit
fi

if [ -z "$SUDO_PASSWORD" ] ; then
echo 'Info: Missing optional sudo password'
fi

Member


These are probably not needed right? Ideally we wouldn't use sudo at all in any of our automation as there are security risks involved

Comment on lines 405 to 419
- name: Create secret logfile
ansible.builtin.file:
path: /var/log/rotate-secrets.log
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
state: touch
mode: 'u+rwX,g+rwX,o-rwx'

- name: Create backup logfile
ansible.builtin.file:
path: /var/log/opencrvs-backup.log
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
state: touch
mode: 'u+rwX,g+rwX,o-rwx'
Member


It would be best if we created a directory under /var/log, for instance /var/log/opencrvs, owned by root with group access for the app group. @naftis has done this before and can advise
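The suggestion above could be sketched as a single ansible task. This is a hedged sketch, not the repository's actual playbook: the `application` group name and the setgid mode are assumptions (setgid makes new files inherit the group, which fits the "group access for the app group" intent):

```yaml
- name: Create OpenCRVS log directory owned by root with app group access
  ansible.builtin.file:
    path: /var/log/opencrvs
    state: directory
    owner: root
    group: application # hypothetical app group name
    mode: '2770'       # setgid: files created inside inherit the group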

Comment on lines 446 to 450
- name: Allow all access to tcp port 443
ufw:
rule: allow
port: '443'
proto: tcp
Member


This is opened by docker swarm as it takes precedence over UFW

@@ -0,0 +1,67 @@
# This Source Code Form is subject to the terms of the Mozilla Public
Member

@rikukissa Nov 17, 2023

I'm slightly worried about Github runners being able to access backups directly. My advice for the infrastructure setup is that the backup server should be hidden in an internal network. We need some automation to make sure backups are what we expect them to be, but I think we need to discuss how exactly it should work.

For this Farajaland demo it's not that big of a deal. Just need to make sure countries do not adopt the same mechanism

@rikukissa rikukissa added the documentation Improvements or additions to documentation label Nov 29, 2023
@rikukissa rikukissa force-pushed the infra-improvements branch 2 times, most recently from 5e3b7f4 to 7f2a14e Compare November 29, 2023 12:21
@rikukissa rikukissa merged commit d0da5bb into release-v1.4.0 Jan 24, 2024
3 checks passed
rikukissa added a commit that referenced this pull request Jan 24, 2024
…ates (#789)

* fix conflicts

* add amends from cdpi-living-lab repository

* setup pem file

* fix merge conflict

* add libsodium to dev dependencies

* configure provisioning and deployment script so that any user with privilege escalation access can provision the host machine

* compress and encrypt backup directories before sending to backup server

* supply backup password to backup cronjob

* supply backup encryption passphrase from github secrets

* hide openhim-console by default

* hide openhim-api by default

* Modularise playbook tasks, use only one playbook for all deployment sizes (#798)

* split playbooks to different task modules, use only one playbook for all deployment sizes

* update provisioning pipeline

* try initialising the provision pipeline by adding a temporary push trigger

* setup ssh key before trying to provision

* add known hosts file

* do not try to mount cryptfs partition to /data if it's already mounted

* add filebeat so logs can be accessed, monitored by kibana

* fix kibana address

* Setup new alerts: SSH login, error in backup logs, available disk space in data partition

* add ansible task for creating user accounts for maintainers with 2FA login enabled

* add new alerts for log alerts and ssh alerts

* pass initial metabase SQL file to metabase as a config file so deployment doesn't have to touch the /data directory

* temporarily allow root login again until we set up deployment users

* add port to port forwarding container names so multiple ports can be opened from one container

* Changes to environment provisioning script and log file handling

* remove vagrant files

* remove references to sudo password

sudo operations should only be performed by humans, as sudo gives permission to do root-level operations. Automated users should have the required permissions set by provisioning playbooks

* remove VPN mentions for now

* remove elastalert slack alert environment variable as it's not referred anywhere

* remove extra environment variables from deploy script call

* remove proxy config from backup script

* generate BACKUP_ENCRYPTION_PASSPHRASE for all github environments

* make log files be accessible by application group so SSH_USER can read and write to them

* remove node version matrices from new pipelines

* add separate inventory files for all environments

* make docker manager1 reference dynamic

* Combine country config compose files to base deployment compose files, include replica compose files in environment-specific compose files (#808)

* Production VPN (#809)

* add initial wireguard server setup

* move vpn to QA server

* remove unused HOSTNAME parameter

* fix a bug in environment creator script, make sure secrets are never committed

* add development environment to provisioning scripts

* add development machine to inventory

* remove unnecessary PEM setup step

* always use the same ansible variables

* fix ansible variable reference

* remove global ansible user setting

* add back missing dockerhub username

* disable SSH root login if provisioning is not done as root

* convert inventory files to yml so ssh keys and users can be directly defined in them

* add Tahmid's public key

* fix inventory file reference

* add development to machines that can be deployed to

* fix known hosts mechanism in deployment pipelines

* make environment selection in deploy.sh dynamic

* volume mount metabase init file as docker has a file size limit of 500kb for config files

* copy the whole project directory to the server

* send core compose files to the server

* fix common file paths

* fix environment compose file

* use absolute paths in the compose file

* add debug log

* remove deploy log file temporarily

* remove matrices from deployment pipelines

* add debug log

* debug github action

* fix deploy pipeline syntax

* add variables to debug step

* make debugging an option

* fix pipeline syntax

* just a commit to make pipeline update on github

* more syntax fixes

* more syntax fixes

* more syntax fixes

* only define overlay net in the main deploy docker compose so that it keeps attachable

* remove files from the target server infrastructure directory if those files no longer exist in the repo

* fix deploy path

* do a docker login as part of deployment

* only volume link minio admin's config to the container so it won't write anything new to the source code directory

* remove container names as docker swarm does not support those

* fix path for elasticsearch config

* change the clear data script so that it doesn't touch /data directory directly. This helps us restrict deployment user's access to data

* add missing env variables

* do not use interactive shell

* stop debug mode from starting if it's not explicitly enabled

* add development to seed pipeline

* add pipeline for clearing an environment

* rename pipeline

* temporarily add a push trigger to clear environment

* Revert "temporarily add a push trigger to clear environment"

This reverts commit 882c432.

* fix reset script file reference, reuse clear-environment pipeline in deploy pipeline

* run clearing through ssh

* add missing ssh secrets

* fix pipeline reference in deploy script

* make clear-environment reusable

* debug why no reset

* add migration run to clear-environment pipeline

* remove data clearing from deploy script

* try without conditionals

* try with a true string

* use singlequotes

* update staging server fingerprint

* add output for reset step

* fix syntax

* change staging IP

* fix pexpect reference

* remove pyexpect completely

* remove python3-docker module as we do not have any ansible docker commands

* try again with the module as its needed for logging in to docker

* run provisioning tasks through qa

* add jump host

* update known hosts once more

* add more logging

* update qa fingerprint

* lower timeout limits

* restart ssh as root

* change ssh restart method for ubuntu 23

* make a 1-1 mapping to github environments and deployed environments. Demo should have its own Github environment and not use production

* add back docker login

* make it possible to pass SSH args to deploy script

* fix

* make it possible to supply additional ssh parameters for clear script

* updates to create environment script

* configure jump host for production

* update production ssh fingerprint

* make production a 2-server deployment

* add missing jump host definition for docker-workers

* ignore VPN and other allowed addresses in fail2ban

* update staging and prod docker compose files

* fix jinja template

* configure rsync to not change file permissions

* add debug

* remove -a from rsync so it doesn't try to change permissions

* add wireguard data partition, ensure files in deployment directory are owned by application group

* make setting ownership recursive

* set read permissions for others in /opt/opencrvs so docker users can read the files

* increase fail2ban limits

* attach traefik to vpn network

* make ssh user configurable for port-forwarding script

* update wg-easy

* update wg-easy

* fix cert resolver for vpn

* use github container registry and latest version for wg-easy

* pass wireguard password variable through deployment pipeline

* pass all github deployment environment variables to docker swarm deployment

* move environments variables to right function

* make a separate function that reads and supplies the env variables

* remove KNOWN_HOSTS from env variables

* remove more variables, fix escape

* make sure KNOWN_HOSTS wont leak to deploy step

* remove debug logging

* only set traefik to vpn network on QA where Wireguard server is

* add validation to make sure all environment variables are set

* download core compose files before validating environment variables

* fix curl urls when downloading core compose files

* remove default latest value from country config version

* fix country config version variable not going to docker compose files

* fix compose env file order

* fix environment variable filtering

* add pipeline for resetting user's 2FA

* fix name of the pipeline

* trick github into showing the new pipeline

* fetch repo first

* use jump host

* add debug step

* remove unnecessary matrix definition

* remove debugging code

* use docker config instead of volume mounts where possible

* add read and execute rights for others to the deployment directory as sometimes users inside docker containers do not match the host machine users

* create a jump user for QA, allow defining multiple ssh keys for users

* do not add 2factor for jump users

* use new jump user in inventory files as well

* set infobip environment variables as optional, add missing required environment variables to environment creator script

* add support for 1-infinite replicas

* add missing network

* add missing export to VERSION variable

* remove demo deployment configuration for now

* Create a backup restore cron on staging (#812)

* Create a backup restore cron on staging

* allowed label to be passed to script for snapshot usage

* Updated release action

* Add approval step to production deploys

* Add Riku's username to prod deploys

* add separate config flag for provisioning to indicate whether the server should back up its data somewhere else or periodically restore data

* make configuration so that qa can allow connections through the provision user to other machines

* create playbook for backup servers and the connection between app servers and backups

* add tags

* add tag to workflow

* add task to ensure ssh dir exists for backup user

* create home directory for backup

* ensure backup task is always applied for root's crontab

* add default value for periodic_restore_from_backup

* make it possible to deploy production with current infrastructure

* Revert "make it possible to deploy production with current infrastructure"

This reverts commit 36edf30.

* fix wait hosts definition for migrations

* make production a qa environment temporarily

* add shell for backup user so rsync works

* explicitly define which user is the one running crontab, ensure that user's key gets to backup server

* ensure .ssh directory exists for crontab user

* get user home directories dynamically

* add missing tags

* add become

* fix file path

* define backup machine in staging config as well

* remove condition from fetch

* always create public key from private key

* use hardcoded file name for public key

* fix syntax

* make staging a QA environment so it reflects production

* separate backup downloading and restoring into two different scripts, use the production server's encryption key on the machine that restores the backup (staging)

* fix an issue with a running OpenHIM while we restore backup

when I cleared the database and then restored data there, the restore process failed if the running OpenHIM process had written new documents during this period

* restart minio after restoring data

---------

Co-authored-by: Riku Rouvila <riku.rouvila@gmail.com>

* fix snapshot script restore reference

* remove openhim base config

* remove WIREGUARD_ADMIN_PASSWORD reference from production deployment pipelines

* remove authorized_keys file

* add debug logging for clear all data script

* define REPLICAS variable before validating it

* fix syntax error in clear script

* automate updating branches on release

* switch back to previous traefik port definition

https://github.com/opencrvs/opencrvs-farajaland/pull/789/files/7a034732d3f38cfdb00d919f470bb7e48d587cdd#r1449976486

* rename 2factor to two_factor

* add default true value for two_factor

* [OCRVS-6437] Forward Elastalert emails through country config (#851)

* forward Elastalert emails first to country config's new /email endpoint and forward from there

* add NOTIFICATION_TRANSPORT variable to deployments scripts

* fix deployment

* move dotenv to normal deps

* add back removed environment variable

* fix email route definition

* make default route ignore the /email path

* add missing environment variables for dev environment

* [OCRVS-6350] Disable root (#849)

* disable root login completely

* stop users from using 'su'

* only disable root login if ansible user being used is not root

* add history timestamps for user terminal history (#848)

* add playbook for ubuntu to update security patches automatically (#846)

* fix staging + prod key access to backup server

* update prod & staging jump keys

* fix manager hostname reference

* add a mechanism for defining additional SSH public keys that can login to the provisioning user

---------

Co-authored-by: naftis <pyry.rouvila@gmail.com>
Co-authored-by: Riku Rouvila <riku.rouvila@gmail.com>
@rikukissa rikukissa deleted the infra-improvements branch May 7, 2024 12:01
Labels
documentation Improvements or additions to documentation 💥 Hot Fixes 🚧 work in progress
Projects
Status: In Code Review
4 participants