This is a Python 3 based daily scraper that collects data on actively listed ETFs using the Alpha Vantage and Yahoo Finance APIs (via the yfinance package).
The infrastructure of the scraper includes:
- Amazon EventBridge: Triggers a Lambda function to run daily at 5:00 PM EST / 4:00 PM CST on weekdays after the market closes.
- AWS Lambda: Starts an AWS Fargate task, which runs the containerized application code.
- AWS Fargate: Executes the application code to collect and process ETF data, then stores the data in an S3 bucket as either a Parquet file or a CSV file.
For a detailed walkthrough of the project, check out the following blog post: Scraping ETF KPIs with AWS Lambda, AWS Fargate, and Alpha Vantage.
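For reference, the Lambda function essentially just starts the Fargate task. Below is a minimal sketch of such a handler using boto3's run_task; the environment variable names and network settings are placeholders for illustration, not necessarily those used in this repository:

import os

import boto3

ecs = boto3.client("ecs")


def lambda_handler(event, context):
    # Start a single Fargate task that runs the containerized scraper.
    # Cluster, task definition, subnets, and security group are read from
    # placeholder environment variables; substitute your own resource names.
    response = ecs.run_task(
        cluster=os.environ["ECS_CLUSTER"],
        taskDefinition=os.environ["ECS_TASK_DEFINITION"],
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": os.environ["SUBNET_IDS"].split(","),
                "securityGroups": [os.environ["SECURITY_GROUP_ID"]],
                "assignPublicIp": "ENABLED",
            }
        },
    )
    return {"tasks_started": len(response.get("tasks", []))}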
Fork the repository and clone the forked repository to your local machine:
# HTTPS
$ git clone https://github.com/YOUR_GITHUB_USERNAME/etf-kpis-scraper.git
# SSH
$ git clone git@github.com:YOUR_GITHUB_USERNAME/etf-kpis-scraper.git
Install poetry using the official installer for your operating system; detailed instructions can be found in Poetry's Official Documentation. Make sure to add poetry to your PATH, following the steps for your operating system in the documentation linked above.
There are three primary methods to set up and use poetry for this project:
Configure poetry to create the virtual environment inside the project's root directory (and do so only for the current project, via the --local flag):
poetry config virtualenvs.in-project true --local
cd path_to_cloned_repository
poetry sync --only main
With pyenv, ensure that the required Python version is installed (3.12 is the default for this project):
# List available Python versions 10 through 12
$ pyenv install --list | grep " 3\.\(10\|11\|12\)\."
# Install Python 3.12.8
$ pyenv install 3.12.8
# Activate Python 3.12.8 for the current project
$ pyenv local 3.12.8
# Use currently activated Python version to create the virtual environment
$ poetry sync --all-groups
- Create a new conda environment named etf_kpis_scraper with Python 3.12:
yes | conda create --name etf_kpis_scraper python=3.12
- Install the project dependencies (ensure that the conda environment is activated):
cd path_to_cloned_repository
conda activate etf_kpis_scraper
poetry sync --all-groups
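Whichever method is used, once the dependencies are installed the scraper can be invoked inside the Poetry-managed environment (assuming main.py as the entrypoint, as noted in the local-run section below):

# Run the entrypoint inside the project's virtual environment
$ poetry run python main.py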
To test-run the code locally, create a .env file in the root directory with the following environment variables:
API_KEY=your_alpha_vantage_api_key
S3_BUCKET=your_s3_bucket_name
IPO_DATE=threshold_for_etf_ipo_date
MAX_ETFS=maximum_number_of_etfs_to_scrape
PARQUET=True
Set ENV to dev (the default) to run the scraper in dev mode when running the entrypoint main.py locally.
Details on these environment variables can be found in the Modules subsection of the blog post.
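As a rough illustration of how these variables might be read at runtime (assuming the python-dotenv package; the defaults and date format below are assumptions for the sketch, not necessarily the values the project uses):

import os

from dotenv import load_dotenv

# Load variables from the .env file in the project root
load_dotenv()

API_KEY = os.getenv("API_KEY")                            # Alpha Vantage API key
S3_BUCKET = os.getenv("S3_BUCKET")                        # target S3 bucket name
IPO_DATE = os.getenv("IPO_DATE", "2010-01-01")            # assumed default and date format
MAX_ETFS = int(os.getenv("MAX_ETFS", "100"))              # assumed default
PARQUET = os.getenv("PARQUET", "True").lower() == "true"  # Parquet vs. CSV output
ENV = os.getenv("ENV", "dev")                             # "dev" is the default per this README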
The workflows require the following secrets:
- AWS_GITHUB_ACTIONS_ROLE_ARN: The ARN of the IAM role that GitHub Actions assumes to deploy to AWS.
- AWS_REGION: The AWS region where the resources are deployed.
- ECR_REPOSITORY: The name of the ECR repository where the Docker image is stored.
- S3_BUCKET: The name of the S3 bucket where the ETF data is stored.
- LAMBDA_FUNCTION: The name of the Lambda function that triggers the Fargate task.
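Inside the workflow YAML, these secrets are referenced through the secrets context. A hypothetical excerpt, assuming the workflows assume the IAM role via OIDC with aws-actions/configure-aws-credentials (check the actual files under .github/workflows for the real steps):

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC role assumption
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_GITHUB_ACTIONS_ROLE_ARN }}
          aws-region: ${{ secrets.AWS_REGION }}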
To deploy the resources programmatically via Terraform instead of the AWS console, ensure that the AWS CLI is installed on the local machine and configured with the necessary credentials; follow the instructions in the AWS CLI Documentation.
A simple starting point, though it may violate the principle of least privilege, is to use the AdministratorAccess policy.
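For example, credentials can be set up and verified from the terminal (the profile name here is only an illustration):

# Configure credentials and default region (interactive prompts)
$ aws configure --profile etf-kpis

# Confirm the credentials resolve to the expected IAM identity
$ aws sts get-caller-identity --profile etf-kpis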