
Production VPN #809

Merged
merged 3 commits into infra-improvements from production-vpn on Dec 13, 2023

Conversation

rikukissa
Member

No description provided.

@rikukissa rikukissa requested a review from euanmillar December 12, 2023 15:55
documents:
  environment:
    - NODE_ENV=production

wg-easy:
Collaborator
@euanmillar euanmillar Dec 12, 2023

I think we need to move the wg-easy block to docker-compose-vpn.deploy.yml. The deploy.sh script should then pick that up only if we are provisioning a WireGuard VPN.
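
A minimal sketch of how deploy.sh could pick that up, assuming a PROVISION_VPN-style flag (the flag name, file names and stack name here are illustrative, not the actual deploy.sh variables):

# Sketch only: include the VPN compose file only when a WireGuard VPN is provisioned.
COMPOSE_FILES="--compose-file docker-compose.deploy.yml"
if [ "${PROVISION_VPN:-false}" = "true" ]; then
  COMPOSE_FILES="$COMPOSE_FILES --compose-file docker-compose-vpn.deploy.yml"
fi
docker stack deploy $COMPOSE_FILES --with-registry-auth opencrvs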

Member Author
@rikukissa rikukissa Dec 13, 2023

I was thinking of something similar in the compose file PR, for making the MongoDB clusters dynamic with a docker-compose.cluster.yml. In the end, as per our conversation yesterday, we decided to configure the deployment sizes and replicas per environment in the environment files, so they only reflect how the Farajaland setup is.

I agree that a VPN should be a default for production environments. Doing it the way we're doing it here, i.e. deploying it to a production machine, is not the right choice for real production environments, and the VPN configuration will differ between network setups. What I've configured here is very much a Mickey Mouse version of a real VPN setup.

If we want to make this generic, we would at least need to:

  • Ensure the production machine is in a private internal network
  • Provide a "jump host" mechanism for provisioning, deployment and other pipelines. This is an SSH proxy setting that lets you connect to machine B on the internal network by first connecting to machine A on the public internet (see the sketch after this list).
  • Run a DNS server for clients connecting with a domain. If the application server doesn't have a public IP address, there's nothing to attach the register.crvs.fr domain to, so the DNS server needs to resolve register.crvs.fr to the internal address, e.g. 192.168.1.4.
  • Make it configurable which internal network(s) the WireGuard server attaches the connecting client to
  • Set up the WireGuard server with the provisioning scripts instead of running it as a Docker container, so it can access the host machine's network interfaces and make the previous point possible
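
As a rough illustration of the jump-host point above (host names, users and the internal IP are placeholders):

# Connect to the application server (machine B, internal network)
# by first hopping through the publicly reachable jump host (machine A).
ssh -J provision@jump.example.com deploy@192.168.1.4

# The same thing as a reusable ~/.ssh/config entry:
#   Host app-internal
#     HostName 192.168.1.4
#     User deploy
#     ProxyJump provision@jump.example.com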

In the long term, I'd also like to phase out deploy.sh completely, bringing more logic into the compose files, similar to what a Kubernetes + Helm combination would offer.

Collaborator
@euanmillar euanmillar Dec 13, 2023

Let's move this to the QA compose file.

; This configuration variable blocks all access to the server, including SSH, except from the IP addresses specified below.
; This should always be set when configuring a production server if there is no other firewall in front of the server.
; SSH and other services should never be exposed to the public internet.
only_allow_access_from_addresses=165.22.205.62,165.22.110.53
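
For illustration, an allow-list like this is typically enforced with firewall rules along these lines (ufw shown as an example; whether the provisioning playbook uses ufw for this setting is an assumption):

# Deny everything by default, then allow only the listed addresses.
ufw default deny incoming
ufw allow from 165.22.205.62
ufw allow from 165.22.110.53
ufw enable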
Collaborator

Move this to staging

Collaborator

If the QA server is also behind the VPN, this would not be necessary; if the VPN goes down, we could use SSH with a public and private key pair to access the VPN server. So perhaps we can remove this when we finish testing.
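
A sketch of that key-based access (key path, user and host are placeholders):

# Generate a dedicated key pair and install the public key on the VPN host.
ssh-keygen -t ed25519 -f ~/.ssh/vpn_host -C "vpn-server access"
ssh-copy-id -i ~/.ssh/vpn_host.pub user@vpn.example.com

# Log in with the private key; password authentication can then be disabled on the host.
ssh -i ~/.ssh/vpn_host user@vpn.example.com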

@rikukissa rikukissa merged commit b205399 into infra-improvements Dec 13, 2023
1 check passed
rikukissa added a commit that referenced this pull request Jan 24, 2024
…ates (#789)

* fix conflicts

* add amends from cdpi-living-lab repository

* setup pem file

* fix merge conflict

* add libsodium to dev dependencies

* configure provisioning and deployment script so that any user with privilege escalation access can provision the host machine

* compress and encrypt backup directories before sending to backup server

* supply backup password to backup cronjob

* supply backup encryption passphrase from github secrets

* hide openhim-console by default

* hide openhim-api by default

* Modularise playbook tasks, use only one playbook for all deployment sizes (#798)

* split playbooks to different task modules, use only one playbook for all deployment sizes

* update provisioning pipeline

* try initialising the provision pipeline by adding a temporary push trigger

* setup ssh key before trying to provision

* add known hosts file

* do not try to mount cryptfs partition to /data if it's already mounted

* add filebeat so logs can be accessed, monitored by kibana

* fix kibana address

* Setup new alerts: SSH login, error in backup logs, available disk space in data partition

* add ansible task for creating user accounts for maintainers with 2FA login enabled

* add new alerts for log alerts and ssh alerts

* pass initial metabase SQL file to metabase as a config file so deployment doesn't have to touch the /data directory

* temporarily allow root login again until we set up deployment users

* add port to port forwarding container names so multiple ports can be opened from one container

* Changes to environment provisioning script and log file handling

* remove vagrant files

* remove references to sudo password

Sudo operations should only be performed by humans, as they give permission to do root-level operations. Automated users should have the required permissions set by the provisioning playbooks.

* remove VPN mentions for now

* remove elastalert slack alert environment variable as it's not referred to anywhere

* remove extra environment variables from deploy script call

* remove proxy config from backup script

* generate BACKUP_ENCRYPTION_PASSPHRASE for all github environments

* make log files be accessible by application group so SSH_USER can read and write to them

* remove node version matrices from new pipelines

* add separate inventory files for all environments

* make docker manager1 reference dynamic

* Combine country config compose files to base deployment compose files, include replica compose files in environment-specific compose files (#808)

* Production VPN (#809)

* add initial wireguard server setup

* move vpn to QA server

* remove unused HOSTNAME parameter

* fix a bug in environment creator script, make sure secrets are never committed

* add development environment to provisioning scripts

* add development machine to inventory

* remove unnecessary PEM setup step

* always use the same ansible variables

* fix ansible variable reference

* remove global ansible user setting

* add back missing dockerhub username

* disable SSH root login if provisioning is not done as root

* convert inventory files to yml so ssh keys and users can be directly defined in them

* add Tahmid's public key

* fix inventory file reference

* add development to machines that can be deployed to

* fix known hosts mechanism in deployment pipelines

* make environment selection in deploy.sh dynamic

* volume mount metabase init file as docker has a file size limit of 500kb for config files

* copy the whole project directory to the server

* send core compose files to the server

* fix common file paths

* fix environment compose file

* use absolute paths in the compose file

* add debug log

* remove deploy log file temporarily

* remove matrices from deployment pipelines

* add debug log

* debug github action

* fix deploy pipeline syntax

* add variables to debug step

* make debugging an option

* fix pipeline syntax

* just a commit to make pipeline update on github

* more syntax fixes

* more syntax fixes

* more syntax fixes

* only define overlay net in the main deploy docker compose so that it stays attachable

* remove files from target server infrastructure directory if those files do not exist in the repo anymore

* fix deploy path

* do a docker login as part of deployment

* only volume link minio admin's config to the container so it won't write anything new to the source code directory

* remove container names as Docker Swarm does not support those

* fix path for elasticsearch config

* change the clear data script so that it doesn't touch /data directory directly. This helps us restrict deployment user's access to data

* add missing env variables

* do not use interactive shell

* stop debug mode from starting if it's not explicitly enabled

* add development to seed pipeline

* add pipeline for clearing an environment

* rename pipeline

* temporarily adda a push trigger to clear environment

* Revert "temporarily adda a push trigger to clear environment"

This reverts commit 882c432.

* fix reset script file reference, reuse clear-environment pipeline in deploy pipeline

* run clearing through ssh

* add missing ssh secrets

* fix pipeline reference in deploy script

* make clear-environment reusable

* debug why no reset

* add migration run to clear-environment pipeline

* remove data clearing from deploy script

* try without conditionals

* try with a true string

* use singlequotes

* update staging server fingerprint

* add output for reset step

* fix syntax

* change staging IP

* fix pexpect reference

* remove pyexpect completely

* remove python3-docker module as we do not have any ansible docker commands

* try again with the module as it's needed for logging in to docker

* run provisioning tasks through qa

* add jump host

* update known hosts once more

* add more logging

* update qa fingerprint

* lower timeout limits

* restart ssh as root

* change ssh restart method for ubuntu 23

* make a 1-1 mapping between GitHub environments and deployed environments. Demo should have its own GitHub environment and not use production

* add back docker login

* make it possible to pass SSH args to deploy script

* fix

* make it possible to supply additional ssh parameters for clear script

* updates to create environment script

* configure jump host for production

* update production ssh fingerprint

* make production a 2-server deployment

* add missing jump host definition for docker-workers

* ignore VPN and other allowed addresses in fail2ban

* update staging and prod docker compose files

* fix jinja template

* configure rsync to not change file permissions

* add debug

* remove -a from rsync so it doesn't try to change permissions

* add wireguard data partition, ensure files in deployment directory are owned by application group

* make setting ownership recursive

* set read permissions for others in /opt/opencrvs so docker users can read the files

* increase fail2ban limits

* attach traefik to vpn network

* make ssh user configurable for port-forwarding script

* update wg-easy

* update wg-easy

* fix cert resolver for vpn

* use github container registry and latest version for wg-easy

* pass wireguard password variable through deployment pipeline

* pass all github deployment environment variables to docker swarm deployment

* move environments variables to right function

* make a separate function that reads and supplies the env variables

* remove KNOWN_HOSTS from env variables

* remove more variables, fix escape

* make sure KNOWN_HOSTS won't leak to deploy step

* remove debug logging

* only set traefik to vpn network on QA where Wireguard server is

* add validation to make sure all environment variables are set

* download core compose files before validating environment variables

* fix curl urls when downloading core compose files

* remove default latest value from country config version

* fix country config version variable not going to docker compose files

* fix compose env file order

* fix environment variable filtering

* add pipeline for resetting user's 2FA

* fix name of the pipeline

* trick github into showing the new pipeline

* fetch repo first

* use jump host

* add debug step

* remove unnecessary matrix definition

* remove debugging code

* use docker config instead of volume mounts where possible

* add read and execute rights for others to the deployment directory as sometimes users inside docker containers do not match the host machine users

* create a jump user for QA, allow defining multiple ssh keys for users

* do not add 2factor for jump users

* use new jump user in inventory files as well

* set infobip environment variables as optional, add missing required environment variables to environment creator script

* add support for 1-infinite replicas

* add missing network

* add missing export to VERSION variable

* remove demo deployment configuration for now

* Create a backup restore cron on staging (#812)

* Create a backup restore cron on staging

* allowed label to be passed to script for snapshot usage

* Updated release action

* Add approval step to production deploys

* Add Riku's username to prod deploys

* add separate config flag for provisioning for indicating if the server should back up its data somewhere else or if it should periodically restore data

* make configuration so that qa can allow connections through the provision user to other machines

* create playbook for backup servers and the connection between app servers and backups

* add tags

* add tag to workflow

* add task to ensure ssh dir exists for backup user

* create home directory for backup

* ensure backup task is always applied for root's crontab

* add default value for periodic_restore_from_backup

* make it possible to deploy production with current infrastructure

* Revert "make it possible to deploy production with current infrastructure"

This reverts commit 36edf30.

* fix wait hosts definition for migrations

* make production a qa environment temporarily

* add shell for backup user so rsync works

* explicitly define which user is the one running crontab, ensure that user's key gets to backup server

* ensure .ssh directory exists for crontab user

* get user home directories dynamically

* add missing tags

* add become

* fix file path

* define backup machine in staging config as well

* remove condition from fetch

* always create public key from private key

* use hardcoded file name for public key

* fix syntax

* make staging a QA environment so it reflects production

* separate backup downloading and restoring to two different scripts, use production server's encryption key on the machine that restores the backup (staging)

* fix an issue with a running OpenHIM while we restore backup

When I cleared the database and then restored data there, the restore process failed if the running OpenHIM process had written new documents during that period.

* restart minio after restoring data

---------

Co-authored-by: Riku Rouvila <riku.rouvila@gmail.com>

* fix snapshot script restore reference

* remove openhim base config

* remove WIREGUARD_ADMIN_PASSWORD reference from production deployment pipelines

* remove authorized_keys file

* add debug logging for clear all data script

* define REPLICAS variable before validating it

* fix syntax error in clear script

* automate updating branches on release

* switch back to previous traefik port definition

https://github.com/opencrvs/opencrvs-farajaland/pull/789/files/7a034732d3f38cfdb00d919f470bb7e48d587cdd#r1449976486

* rename 2factor to two_factor

* add default true value for two_factor

* [OCRVS-6437] Forward Elastalert emails through country config (#851)

* forward Elastalert emails first to country config's new /email endpoint and forward from there

* add NOTIFICATION_TRANSPORT variable to deployments scripts

* fix deployment

* move dotenv to normal deps

* add back removed environment variable

* fix email route definition

* make default route ignore the /email path

* add missing environment variables for dev environment

* [OCRVS-6350] Disable root (#849)

* disable root login completely

* stop users from using 'su'

* only disable root login if ansible user being used is not root

* add history timestamps for user terminal history (#848)

* add playbook for ubuntu to update security patches automatically (#846)

* fix staging + prod key access to backup server

* update prod & staging jump keys

* fix manager hostname reference

* add a mechanism for defining additional SSH public keys that can login to the provisioning user

---------

Co-authored-by: naftis <pyry.rouvila@gmail.com>
Co-authored-by: Riku Rouvila <riku.rouvila@gmail.com>
@rikukissa rikukissa deleted the production-vpn branch May 7, 2024 11:57