This is a Python 3 based daily scraper that collects data on actively listed ETFs using the Alpha Vantage and Yahoo Finance APIs (via the yfinance package).
The infrastructure of the scraper includes:
- Amazon EventBridge: Triggers a Lambda function to run daily at 5:00 PM EST / 4:00 PM CST on weekdays after the market closes.
- AWS Lambda: Starts an AWS Fargate task, which runs the containerized application code.
- AWS Fargate: Executes the application code to collect and process ETF data, then stores the data in an S3 bucket as either a Parquet file or a CSV file.
For a detailed walkthrough of the project, check out the following blog post: Scraping ETF KPIs with AWS Lambda, AWS Fargate, and Alpha Vantage.
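For reference, the Lambda function essentially just starts the Fargate task. Below is a minimal sketch of such a handler using boto3's run_task; the environment variable names and network settings are placeholders for illustration, not necessarily those used in this repository:

import os

import boto3

ecs = boto3.client("ecs")


def lambda_handler(event, context):
    # Start a single Fargate task that runs the containerized scraper.
    # Cluster, task definition, subnets, and security group are read from
    # placeholder environment variables; substitute your own resource names.
    response = ecs.run_task(
        cluster=os.environ["ECS_CLUSTER"],
        taskDefinition=os.environ["ECS_TASK_DEFINITION"],
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": os.environ["SUBNET_IDS"].split(","),
                "securityGroups": [os.environ["SECURITY_GROUP_ID"]],
                "assignPublicIp": "ENABLED",
            }
        },
    )
    return {"tasks_started": len(response.get("tasks", []))}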
Fork the repository and clone the forked repository to your local machine:
# HTTPS
$ git clone https://github.com/YOUR_GITHUB_USERNAME/etf-kpis-scraper.git
# SSH
$ git clone git@github.com:YOUR_GITHUB_USERNAME/etf-kpis-scraper.git
Install poetry using the official installer for your operating system; detailed instructions can be found in Poetry's Official Documentation. Make sure to add poetry to your PATH, following the steps for your operating system in the documentation linked above.
There are three primary methods to set up and use poetry for this project:
Configure poetry to create the virtual environment inside the project's root directory (and do so only for the current project, via the --local flag):
poetry config virtualenvs.in-project true --local
cd path_to_cloned_repository
poetry sync --only main
With pyenv, ensure that the required Python version is installed (3.12 is the default for this project):
# List available Python versions 10 through 12
$ pyenv install --list | grep " 3\.\(10\|11\|12\)\."
# Install Python 3.12.8
$ pyenv install 3.12.8
# Activate Python 3.12.8 for the current project
$ pyenv local 3.12.8
# Use currently activated Python version to create the virtual environment
$ poetry sync --all-groups
- Create a new conda environment named etf_kpis_scraper with Python 3.12:
yes | conda create --name etf_kpis_scraper python=3.12
- Install the project dependencies (ensure that the conda environment is activated):
cd path_to_cloned_repository
conda activate etf_kpis_scraper
poetry sync --all-groups
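Whichever method is used, once the dependencies are installed the scraper can be invoked inside the Poetry-managed environment (assuming main.py as the entrypoint, as noted in the local-run section below):

# Run the entrypoint inside the project's virtual environment
$ poetry run python main.py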
To test-run the code locally, create a .env file in the root directory with the following environment variables:
API_KEY=your_alpha_vantage_api_key
S3_BUCKET=your_s3_bucket_name
IPO_DATE=threshold_for_etf_ipo_date
MAX_ETFS=maximum_number_of_etfs_to_scrape
PARQUET=True
Set ENV to dev (the default) to run the scraper in dev mode when running the entrypoint main.py locally.
Details on these environment variables can be found in the Modules subsection of the blog post.
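As a rough illustration of how these variables might be read at runtime (assuming the python-dotenv package; the defaults and date format below are assumptions for the sketch, not necessarily the values the project uses):

import os

from dotenv import load_dotenv

# Load variables from the .env file in the project root
load_dotenv()

API_KEY = os.getenv("API_KEY")                            # Alpha Vantage API key
S3_BUCKET = os.getenv("S3_BUCKET")                        # target S3 bucket name
IPO_DATE = os.getenv("IPO_DATE", "2010-01-01")            # assumed default and date format
MAX_ETFS = int(os.getenv("MAX_ETFS", "100"))              # assumed default
PARQUET = os.getenv("PARQUET", "True").lower() == "true"  # Parquet vs. CSV output
ENV = os.getenv("ENV", "dev")                             # "dev" is the default per this README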
The workflows require the following secrets:
- AWS_GITHUB_ACTIONS_ROLE_ARN: The ARN of the IAM role that GitHub Actions assumes to deploy to AWS.
- AWS_REGION: The AWS region where the resources are deployed.
- ECR_REPOSITORY: The name of the ECR repository where the Docker image is stored.
- S3_BUCKET: The name of the S3 bucket where the ETF data is stored.
- LAMBDA_FUNCTION: The name of the Lambda function that triggers the Fargate task.
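Inside the workflow YAML, these secrets are referenced through the secrets context. A hypothetical excerpt, assuming the workflows assume the IAM role via OIDC with aws-actions/configure-aws-credentials (check the actual files under .github/workflows for the real steps):

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC role assumption
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_GITHUB_ACTIONS_ROLE_ARN }}
          aws-region: ${{ secrets.AWS_REGION }}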
To deploy the resources programmatically via Terraform instead of the AWS console, ensure that the AWS CLI is installed on the local machine and configured with the necessary credentials; follow the instructions in the AWS CLI Documentation.
A simple starting point, though it may violate the principle of least privilege, is to use the AdministratorAccess policy.
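For example, credentials can be set up and verified from the terminal (the profile name here is only an illustration):

# Configure credentials and default region (interactive prompts)
$ aws configure --profile etf-kpis

# Confirm the credentials resolve to the expected IAM identity
$ aws sts get-caller-identity --profile etf-kpis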