
OCRVS-6000: Infrastructure deployment, monitoring and maintenance updates #789

Merged
merged 198 commits into from
Jan 24, 2024

Conversation

euanmillar (Collaborator) commented Nov 10, 2023

Cherry-picked improvements from the planned 1.4 upgrade:

  • (Riku) The root user should be a configurable option. The default should not be root

  • (Riku) Encrypt backups with OpenSSL: openssl enc -aes-256-cbc -salt -in MyData.tar.gz -out MyData.tar.gz.enc

  • Install Filebeat

  • Production alerts: successful SSH login, backup failure (these require Filebeat to be installed)

  • Hide openhim-console from traefik by default

  • 2FA SSH login:

    Ubuntu 2FA setup requires us to:
    Set up separate user accounts for anyone with access (this should be done anyway)
    Install the authenticator software on the servers
    On SSH login, require users to set up their 2FA

  • Alerts on SSH login

  • Metabase SQL shouldn't be copied in deploy but instead supplied by a config docker compose file

  • (PR) Create a directory server-setup/tasks and split the singular ansible playbook files into separate modules, so that user management, ufw, fail2ban and data directory setup tasks each live in their own module

Import the files in the root ansible playbook like this

    - include_tasks:
        file: tasks/fail2ban.yml
        apply:
          tags:
            - security
      tags:
        - security
  • Create a provisioning script for the VPN server if the provisioning environment is "staging". Add VPN details to docker-compose.staging
  • Create a backup server provisioning pipeline
  • Deprecate SUDO_PASSWORD (Euan)
  • Have a single .ini file with all environments in it, with the 3- or 5-server cluster options commented out
  • We need to deprecate the KNOWN_HOSTS secret and instead use the .known_hosts file (Euan)
  • Create a cron job on staging to restore the latest prod backup
  • Elastalert doesn't work with Office365, so let's configure Elastalert to send HTTP requests to countryconfig; that way we can use the SMTP details and deprecate EMAIL_API_KEY & ELASTALERT_SLACK_WEBHOOK entirely.
  • Stop users from elevating their permissions
  • Create a mirror production server so that the production backup automatically restores there rather than onto QA
  • Create a cron job to delete legacy backups older than, for example, 4 months and report to Slack or the alert email.
  • Deprecate external_backup_server_remote_directory: ${{ vars.BACKUP_DIRECTORY }}
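The backup encryption bullet above can be sketched as a full round trip. This is a hedged sketch: the file names and the literal `-pass pass:` passphrase are illustrative (in CI the passphrase would come from a secret such as BACKUP_ENCRYPTION_PASSPHRASE), and `-pbkdf2` is added on top of the quoted command for stronger key derivation on modern OpenSSL:

```shell
set -e

# Build a small demo archive standing in for the real backup
echo "demo backup data" > MyData.txt
tar -czf MyData.tar.gz MyData.txt

# Encrypt the archive before shipping it to the backup server
openssl enc -aes-256-cbc -salt -pbkdf2 \
  -in MyData.tar.gz -out MyData.tar.gz.enc \
  -pass pass:example-passphrase

# Decrypt during restore
openssl enc -d -aes-256-cbc -pbkdf2 \
  -in MyData.tar.gz.enc -out MyData.restored.tar.gz \
  -pass pass:example-passphrase

# The decrypted archive must be byte-identical to the original
cmp MyData.tar.gz MyData.restored.tar.gz
```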

Documentation

  • We need documentation on how to configure the VPN with Screenshots of the UI and links to VPN clients that the users and development team can use to access.
  • Document how to create a github environment
  • Migration plan for older installations
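The legacy-backup cleanup cron job listed above can be sketched with `find -mtime`. A hedged sketch: the directory, file naming, and the 120-day cutoff are assumptions, and the Slack/alert-email reporting step is left as a comment:

```shell
set -e

# Demo directory standing in for the real backup location
BACKUP_DIR=./demo-backups
mkdir -p "$BACKUP_DIR"
touch -d "200 days ago" "$BACKUP_DIR/old-backup.tar.gz.enc"
touch "$BACKUP_DIR/fresh-backup.tar.gz.enc"

# Delete encrypted backups older than ~4 months (120 days),
# printing each removed file
find "$BACKUP_DIR" -name '*.tar.gz.enc' -mtime +120 -print -delete
# A real cron job would pipe the -print output to a Slack webhook
# or the alert email instead of discarding it.
```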

}

const octokit = new Octokit({
auth: process.env.GITHUB_TOKEN
Collaborator

What permissions would the token need?

Collaborator Author

@euanmillar Dec 7, 2023

@rikukissa is it "repo" & "codespaces"? Anything else?

[Screenshots: GitHub token permission settings, 2023-12-07]

Collaborator Author


Seems to be enough after testing.

Comment on lines 33 to 49
#
# Uncomment if using VPN
#
#- name: Install openconnect ppa
# run: sudo add-apt-repository ppa:dwmw2/openconnect -y && sudo apt update

#- name: Install openconnect
# run: sudo apt install -y openconnect

#- name: Connect to VPN
# run: |
# echo "${{ secrets.VPN_PWD }}" | sudo openconnect -u ${{ secrets.VPN_USER }} --passwd-on-stdin --protocol=${{ secrets.VPN_PROTOCOL }} ${{ secrets.VPN_HOST }}:${{ secrets.VPN_PORT }} --servercert ${{ secrets.VPN_SERVERCERT }} --background

#- name: Test if IP is reachable
# run: |
# ping -c4 ${{ secrets.SSH_HOST }}

Member


As per our discussions, let's move these comments to internal docs

Comment on lines 135 to 136
# SUDO_PASSWORD: ${{ secrets.SUDO_PASSWORD }}
# ELASTALERT_SLACK_WEBHOOK: ${{ secrets.ELASTALERT_SLACK_WEBHOOK }}
Member


Suggested change
# SUDO_PASSWORD: ${{ secrets.SUDO_PASSWORD }}
# ELASTALERT_SLACK_WEBHOOK: ${{ secrets.ELASTALERT_SLACK_WEBHOOK }}

Comment on lines 357 to 360
echo -e "$SSH_KEY" > /tmp/private_key_tmp
chmod 600 /tmp/private_key_tmp
echo -e "$KNOWN_HOSTS" > /tmp/known_hosts
chmod 600 /tmp/known_hosts
Member


Aren't both the private key and the known hosts already on the Github Runner machine at this point?

They get set up here:

known_hosts: ${{ secrets.KNOWN_HOSTS }}
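The quoted `known_hosts:` input matches the shape of the shimataro/ssh-key-action step. A hedged sketch of what the setup step being referenced typically looks like; the action version and secret names here are assumptions, not taken from this repository:

```yaml
- name: Install SSH key on the runner
  uses: shimataro/ssh-key-action@v2
  with:
    key: ${{ secrets.SSH_KEY }}
    known_hosts: ${{ secrets.KNOWN_HOSTS }}
```

If this step runs earlier in the job, writing the key and known_hosts to /tmp again by hand would indeed be redundant.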

@@ -329,15 +372,18 @@ rotate_authorized_keys() {

# Download base docker compose files to the server

rsync -rP /tmp/docker-compose* infrastructure $SSH_USER@$SSH_HOST:/opt/opencrvs/
sudo rsync -e 'ssh -o UserKnownHostsFile=/tmp/known_hosts -i /tmp/private_key_tmp' -rP /tmp/docker-compose* infrastructure $SSH_USER@$SSH_HOST:/opt/opencrvs/
Member


Is sudo needed here?

Comment on lines 16 to 18
# Set ssh user in logrotate.conf
sed -i -e "s%{{SSH_USER}}%$SSH_USER%" /opt/opencrvs/infrastructure/logrotate.conf

Member


It's best if we keep logrotate and all other server configuration in the server provisioning stage

Collaborator Author

@euanmillar Dec 7, 2023

The problem I faced in Niue was that an error is thrown when the logs are rotated, and they don't in fact rotate. This was because they are created by the root user in logrotate.conf, while the stack runs under the SSH_USER account, which is not necessarily root.

Member


29fe10e

This commit should now make it so that SSH_USER can access and write to those files
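A minimal sketch of how logrotate can rotate files owned by a non-root user, which is the fix being discussed here. The `su` and `create` directives are standard logrotate; the log path, rotation schedule, the `application` group name, and the `{{SSH_USER}}` placeholder are assumptions for illustration:

```
/var/log/opencrvs*.log {
    weekly
    rotate 8
    compress
    missingok
    # run rotation as the deploy user rather than root
    su {{SSH_USER}} application
    # recreate the log file with group write access for the app group
    create 0660 {{SSH_USER}} application
}
```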

Comment on lines 213 to 221
if [ -z "$KNOWN_HOSTS" ] ; then
echo 'Error: Missing environment variable KNOWN_HOSTS.'
print_usage_and_exit
fi

if [ -z "$SUDO_PASSWORD" ] ; then
echo 'Info: Missing optional sudo password'
fi

Member


These are probably not needed right? Ideally we wouldn't use sudo at all in any of our automation as there are security risks involved

Comment on lines 405 to 419
- name: Create secret logfile
ansible.builtin.file:
path: /var/log/rotate-secrets.log
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
state: touch
mode: 'u+rwX,g+rwX,o-rwx'

- name: Create backup logfile
ansible.builtin.file:
path: /var/log/opencrvs-backup.log
owner: '{{ ansible_user }}'
group: '{{ ansible_user }}'
state: touch
mode: 'u+rwX,g+rwX,o-rwx'
Member


It would be best if we created a directory under /var/log, for instance /var/log/opencrvs, owned by root with group access for the app group. @naftis has done this before and can advise
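The suggestion above could be sketched as a single ansible task. This is a hedged sketch, not the repository's actual playbook: the `application` group name and the setgid mode are assumptions (setgid makes new files inherit the group, which fits the "group access for the app group" intent):

```yaml
- name: Create OpenCRVS log directory owned by root with app group access
  ansible.builtin.file:
    path: /var/log/opencrvs
    state: directory
    owner: root
    group: application # hypothetical app group name
    mode: '2770'       # setgid: files created inside inherit the group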

Comment on lines 446 to 450
- name: Allow all access to tcp port 443
ufw:
rule: allow
port: '443'
proto: tcp
Member


This is opened by docker swarm as it takes precedence over UFW

@@ -0,0 +1,67 @@
# This Source Code Form is subject to the terms of the Mozilla Public
Member

@rikukissa Nov 17, 2023

I'm slightly worried about Github runners being able to access backups directly. My advice for the infrastructure setup is that the backup server should be hidden in an internal network. We need some automation to make sure backups are what we expect them to be, but I think we need to discuss how exactly it should work.

For this Farajaland demo it's not that big of a deal. Just need to make sure countries do not adopt the same mechanism

@rikukissa rikukissa added the documentation Improvements or additions to documentation label Nov 29, 2023
@rikukissa rikukissa force-pushed the infra-improvements branch 2 times, most recently from 5e3b7f4 to 7f2a14e Compare November 29, 2023 12:21
@rikukissa rikukissa merged commit d0da5bb into release-v1.4.0 Jan 24, 2024
3 checks passed
rikukissa added a commit that referenced this pull request Jan 24, 2024
…ates (#789)

* fix conflicts

* add amends from cdpi-living-lab repository

* setup pem file

* fix merge conflict

* add libsodium to dev dependencies

* configure provisioning and deployment script so that any user with privilege escalation access can provision the host machine

* compress and encrypt backup directories before sending to backup server

* supply backup password to backup cronjob

* supply backup encryption passphrase from github secrets

* hide openhim-console by default

* hide openhim-api by default

* Modularise playbook tasks, use only one playbook for all deployment sizes (#798)

* split playbooks to different task modules, use only one playbook for all deployment sizes

* update provisioning pipeline

* try initialising the provision pipeline by adding a temporary push trigger

* setup ssh key before trying to provision

* add known hosts file

* do not try to mount cryptfs partition to /data if it's already mounted

* add filebeat so logs can be accessed, monitored by kibana

* fix kibana address

* Setup new alerts: SSH login, error in backup logs, available disk space in data partition

* add ansible task for creating user accounts for maintainers with 2FA login enabled

* add new alerts for log alerts and ssh alerts

* pass initial metabase SQL file to metabase as a config file so deployment doesn't have to touch the /data directory

* temporarily allow root login again until we set up deployment users

* add port to port forwarding container names so multiple ports can be opened from one container

* Changes to environment provisioning script and log file handling

* remove vagrant files

* remove references to sudo password

sudo operations should only be performed by humans, as sudo gives permission to do root-level operations. Automated users should have the required permissions set by provisioning playbooks

* remove VPN mentions for now

* remove elastalert slack alert environment variable as it's not referred anywhere

* remove extra environment variables from deploy script call

* remove proxy config from backup script

* generate BACKUP_ENCRYPTION_PASSPHRASE for all github environments

* make log files be accessible by application group so SSH_USER can read and write to them

* remove node version matrices from new pipelines

* add separate inventory files for all environments

* make docker manager1 reference dynamic

* Combine country config compose files to base deployment compose files, include replica compose files in environment-specific compose files (#808)

* Production VPN (#809)

* add initial wireguard server setup

* move vpn to QA server

* remove unused HOSTNAME parameter

* fix a bug in environment creator script, make sure secrets are never committed

* add development environment to provisioning scripts

* add development machine to inventory

* remove unnecessary PEM setup step

* always use the same ansible variables

* fix ansible variable reference

* remove global ansible user setting

* add back missing dockerhub username

* disable SSH root login if provisioning is not done as root

* convert inventory files to yml so ssh keys and users can be directly defined in them

* add Tahmid's public key

* fix inventory file reference

* add development to machines that can be deployed to

* fix known hosts mechanism in deployment pipelines

* make environment selection in deploy.sh dynamic

* volume mount metabase init file as docker has a file size limit of 500kb for config files

* copy the whole project directory to the server

* send core compose files to the server

* fix common file paths

* fix environment compose file

* use absolute paths in the compose file

* add debug log

* remove deploy log file temporarily

* remove matrices from deployment pipelines

* add debug log

* debug github action

* fix deploy pipeline syntax

* add variables to debug step

* make debugging an option

* fix pipeline syntax

* just a commit to make pipeline update on github

* more syntax fixes

* more syntax fixes

* more syntax fixes

* only define overlay net in the main deploy docker compose so that it keeps attachable

* remove files from the target server infrastructure directory if those files no longer exist in the repo

* fix deploy path

* do a docker login as part of deployment

* only volume link minio admin's config to the container so it won't write anything new to the source code directory

* remove container names as docker swarm does not support those

* fix path for elasticsearch config

* change the clear data script so that it doesn't touch /data directory directly. This helps us restrict deployment user's access to data

* add missing env variables

* do not use interactive shell

* stop debug mode from starting if it's not explicitly enabled

* add development to seed pipeline

* add pipeline for clearing an environment

* rename pipeline

* temporarily add a push trigger to clear environment

* Revert "temporarily add a push trigger to clear environment"

This reverts commit 882c432.

* fix reset script file reference, reuse clear-environment pipeline in deploy pipeline

* run clearing through ssh

* add missing ssh secrets

* fix pipeline reference in deploy script

* make clear-environment reusable

* debug why no reset

* add migration run to clear-environment pipeline

* remove data clearing from deploy script

* try without conditionals

* try with a true string

* use singlequotes

* update staging server fingerprint

* add output for reset step

* fix syntax

* change staging IP

* fix pexpect reference

* remove pyexpect completely

* remove python3-docker module as we do not have any ansible docker commands

* try again with the module as its needed for logging in to docker

* run provisioning tasks through qa

* add jump host

* update known hosts once more

* add more logging

* update qa fingerprint

* lower timeout limits

* restart ssh as root

* change ssh restart method for ubuntu 23

* make a 1-1 mapping to github environments and deployed environments. Demo should have its own Github environment and not use production

* add back docker login

* make it possible to pass SSH args to deploy script

* fix

* make it possible to supply additional ssh parameters for clear script

* updates to create environment script

* configure jump host for production

* update production ssh fingerprint

* make production a 2-server deployment

* add missing jump host definition for docker-workers

* ignore VPN and other allowed addresses in fail2ban

* update staging and prod docker compose files

* fix jinja template

* configure rsync to not change file permissions

* add debug

* remove -a from rsync so it doesn't try to change permissions

* add wireguard data partition, ensure files in deployment directory are owned by application group

* make setting ownership recursive

* set read permissions for others in /opt/opencrvs so docker users can read the files

* increase fail2ban limits

* attach traefik to vpn network

* make ssh user configurable for port-forwarding script

* update wg-easy

* update wg-easy

* fix cert resolver for vpn

* use github container registry and latest version for wg-easy

* pass wireguard password variable through deployment pipeline

* pass all github deployment environment variables to docker swarm deployment

* move environments variables to right function

* make a separate function that reads and supplies the env variables

* remove KNOWN_HOSTS from env variables

* remove more variables, fix escape

* make sure KNOWN_HOSTS wont leak to deploy step

* remove debug logging

* only set traefik to vpn network on QA where Wireguard server is

* add validation to make sure all environment variables are set

* download core compose files before validating environment variables

* fix curl urls when downloading core compose files

* remove default latest value from country config version

* fix country config version variable not going to docker compose files

* fix compose env file order

* fix environment variable filtering

* add pipeline for resetting user's 2FA

* fix name of the pipeline

* trick github into showing the new pipeline

* fetch repo first

* use jump host

* add debug step

* remove unnecessary matrix definition

* remove debugging code

* use docker config instead of volume mounts where possible

* add read and execute rights for others to the deployment directory as sometimes users inside docker containers do not match the host machine users

* create a jump user for QA, allow defining multiple ssh keys for users

* do not add 2factor for jump users

* use new jump user in inventory files as well

* set infobip environment variables as optional, add missing required environment variables to environment creator script

* add support for 1-infinite replicas

* add missing network

* add missing export to VERSION variable

* remove demo deployment configuration for now

* Create a backup restore cron on staging (#812)

* Create a backup restore cron on staging

* allowed label to be passed to script for snapshot usage

* Updated release action

* Add approval step to production deploys

* Add Riku's username to prod deploys

* add separate config flag for provisioning to indicate whether the server should back up its data somewhere else or periodically restore data

* make configuration so that qa can allow connections through the provision user to other machines

* create playbook for backup servers and the connection between app servers and backups

* add tags

* add tag to workflow

* add task to ensure ssh dir exists for backup user

* create home directory for backup

* ensure backup task is always applied for root's crontab

* add default value for periodic_restore_from_backup

* make it possible to deploy production with current infrastructure

* Revert "make it possible to deploy production with current infrastructure"

This reverts commit 36edf30.

* fix wait hosts definition for migrations

* make production a qa environment temporarily

* add shell for backup user so rsync works

* explicitly define which user is the one running crontab, ensure that user's key gets to backup server

* ensure .ssh directory exists for crontab user

* get user home directories dynamically

* add missing tags

* add become

* fix file path

* define backup machine in staging config as well

* remove condition from fetch

* always create public key from private key

* use hardcoded file name for public key

* fix syntax

* make staging a QA environment so it reflects production

* separate backup downloading and restoring into two different scripts, use the production server's encryption key on the machine that restores the backup (staging)

* fix an issue with a running OpenHIM while we restore backup

when I cleared the database and then restored data there, the restore process failed if the running OpenHIM process had written new documents during this period

* restart minio after restoring data

---------

Co-authored-by: Riku Rouvila <riku.rouvila@gmail.com>

* fix snapshot script restore reference

* remove openhim base config

* remove WIREGUARD_ADMIN_PASSWORD reference from production deployment pipelines

* remove authorized_keys file

* add debug logging for clear all data script

* define REPLICAS variable before validating it

* fix syntax error in clear script

* automate updating branches on release

* switch back to previous traefik port definition

https://github.com/opencrvs/opencrvs-farajaland/pull/789/files/7a034732d3f38cfdb00d919f470bb7e48d587cdd#r1449976486

* rename 2factor to two_factor

* add default true value for two_factor

* [OCRVS-6437] Forward Elastalert emails through country config (#851)

* forward Elastalert emails first to country config's new /email endpoint and forward from there

* add NOTIFICATION_TRANSPORT variable to deployments scripts

* fix deployment

* move dotenv to normal deps

* add back removed environment variable

* fix email route definition

* make default route ignore the /email path

* add missing environment variables for dev environment

* [OCRVS-6350] Disable root (#849)

* disable root login completely

* stop users from using 'su'

* only disable root login if ansible user being used is not root

* add history timestamps for user terminal history (#848)

* add playbook for ubuntu to update security patches automatically (#846)

* fix staging + prod key access to backup server

* update prod & staging jump keys

* fix manager hostname reference

* add a mechanism for defining additional SSH public keys that can login to the provisioning user

---------

Co-authored-by: naftis <pyry.rouvila@gmail.com>
Co-authored-by: Riku Rouvila <riku.rouvila@gmail.com>
@rikukissa rikukissa deleted the infra-improvements branch May 7, 2024 12:01
Labels
documentation Improvements or additions to documentation 💥 Hot Fixes 🚧 work in progress
Projects
Status: In Code Review
4 participants