Podify is the first podcast streaming service specifically designed for academic research. It closely resembles existing modern streaming services and is designed to scale to large user studies. It implements a customisable catalogue search, manual playlist creation and curation, podcast listening, and explicit and implicit feedback collection mechanisms. Because all user interactions are automatically logged by the platform and can be exported in a readable format for subsequent analysis, Podify aims to reduce the overhead researchers face when conducting user studies.
This repository contains the source code for the platform described in the demonstration paper "Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research", accepted at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023).
For the YouTube presentation of this platform, please click here.
To learn more about our research activities at NeuraSearch Laboratory, please follow us on Twitter (@NeuraSearch). To be notified of future uploads, please subscribe to our YouTube channel!
- Ruby v3.0.2
- It is recommended to use a version manager such as rbenv
- Bundler:
gem install bundler
- Ruby on Rails:
cd Podify
bundle install
- Redis
- Docker
- FFmpeg
- geoip-database
- It is preinstalled on Heroku
sudo apt-get install geoip-database
- Tailwind CSS
cd Podify
./bin/rails tailwindcss:install
In a terminal window, from the root folder (cd Podify), run:
./bin/dev
Navigate to localhost:3000 in your local browser. Before interacting with Podify, however, please complete all the steps outlined below.
Run the following command prior to creating and seeding the database.
docker run \
-d \
--name elasticsearch-podify \
--publish 9200:9200 \
--env "discovery.type=single-node" \
--env "cluster.name=elasticsearch-rails" \
--env "cluster.routing.allocation.disk.threshold_enabled=false" \
--rm \
docker.elastic.co/elasticsearch/elasticsearch-oss:7.6.0
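Before creating and seeding the database, it can help to confirm that the container is actually answering on the published port. The following is a minimal sketch, not part of Podify itself: the helper name `elasticsearch_info` is hypothetical, and it assumes the default URL http://localhost:9200 that results from the `--publish 9200:9200` flag above.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_info(url="http://localhost:9200", timeout=5):
    """Return the JSON body of the Elasticsearch root endpoint, or None if unreachable.

    Hypothetical helper for illustration; the URL is an assumption based on
    the port published by the docker command.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


if __name__ == "__main__":
    info = elasticsearch_info()
    if info is None:
        print("Elasticsearch is not reachable yet; wait a few seconds and retry.")
    else:
        print("Elasticsearch is up:", info.get("version", {}).get("number"))
```

The container may need a short while to start, so a `None` result immediately after `docker run` does not necessarily indicate a problem.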
Create and seed the PostgreSQL database, as specified in db/seeds.rb:
rails db:reset
In this step, an admin user is also created, with the following credentials:
- Username: admin@example.com
- Password: password
These credentials can be used to access the admin dashboard, available at: localhost:3000/admin
Podify uses AWS S3 buckets to generate the catalogue and to download and store the audio, transcript, and image files.
Please create an S3 bucket and then edit the following credentials, making sure to provide the access_key_id, secret_access_key, region, and bucket_name values:
EDITOR="code --wait" bin/rails credentials:edit
Since Podify expects RSS feeds, its use is not restricted to any single source such as the Spotify Podcast Dataset. However, the RSS feeds originating from the Spotify Podcast Dataset were used for the demonstration paper. Thus, in order to pre-process that data for the creation of the catalogue, the following scripts have to be executed:
python3 utils/1-extract_episodes.py
- [Requirement]: metadata.tsv of the Spotify Podcast Dataset
- This script creates episodes.json from metadata.tsv. Only the episodes with valid metadata and RSS feed are included in the JSON file. This list of episodes will be the catalogue.
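As a rough illustration of the filtering described above, the sketch below reads tab-separated metadata and keeps only the rows whose required fields are filled and whose RSS link looks like an HTTP(S) URL. It is not the code in utils/1-extract_episodes.py: the function names, the column names (`episode_uri`, `episode_name`, `rss_link`), and the validity rule are all assumptions made for this example, so check them against the header of your own metadata.tsv.

```python
import csv
import io
import json

# Assumed column names; verify them against the header row of metadata.tsv.
REQUIRED_COLUMNS = ("episode_uri", "episode_name", "rss_link")


def extract_episodes(tsv_text):
    """Return the rows with all required columns filled and an HTTP(S) RSS link."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    episodes = []
    for row in reader:
        values = [(row.get(col) or "").strip() for col in REQUIRED_COLUMNS]
        if not all(values):
            continue  # skip rows with missing metadata
        if not values[2].startswith(("http://", "https://")):
            continue  # skip rows whose RSS link is not a web URL
        episodes.append(dict(row))
    return episodes


def write_catalogue(episodes, path="episodes.json"):
    """Write the selected episodes to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(episodes, fh, ensure_ascii=False, indent=2)
```

Quoting is disabled (`csv.QUOTE_NONE`) so that stray double quotes inside free-text descriptions are not interpreted as field delimiters.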
python3 utils/2-download_audio_files.py
- [Requirement]: set up rclone as documented in the Spotify Podcast Dataset README.md file
- This script creates a new folder and downloads the audio files from the Spotify Podcast Dataset for the episodes listed in episodes.json
python3 utils/3-convert_transcripts_to_vtt.py
- [Requirement]: a folder (podcasts-transcripts) that contains all the transcript files of the Spotify Podcast Dataset. The tar.gz files have to be extracted. The resulting podcasts-transcripts folder will be used by this script
- This script converts the transcripts to the VTT format and to a word-level representation. The transcript will be uploaded during the catalogue creation to be indexed by the Elasticsearch instance
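To make the VTT conversion more concrete, here is a minimal sketch of turning word-level timings into WebVTT cues. It is an illustration only, not the logic of utils/3-convert_transcripts_to_vtt.py: the function names, the `(word, start_seconds, end_seconds)` tuple layout, and the fixed cue size of `max_words` are assumptions made for this example.

```python
def vtt_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_vtt(words, max_words=10):
    """Group (word, start, end) tuples into cues of at most max_words words."""
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(" ".join(word for word, _, _ in chunk))
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("hello", 0.0, 0.4), ("world", 0.5, 1.0), ("again", 1.2, 1.8)]
    print(words_to_vtt(sample, max_words=2))
```

Grouping by a fixed number of words is the simplest possible cue policy; splitting on pauses or sentence boundaries would be a natural refinement.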
python3 utils/4-extract_transcript_files.py
- Similar to step (2), this script creates a new folder and fetches only the transcript files for the episodes listed in episodes.json
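The selection step can be sketched as follows. This is a hedged example rather than the contents of utils/4-extract_transcript_files.py: the function name `collect_transcripts` is hypothetical, and it assumes that each transcript is a `.json` file whose name (without extension) matches an episode identifier taken from episodes.json.

```python
import shutil
from pathlib import Path


def collect_transcripts(episode_prefixes, source_dir, target_dir):
    """Copy <prefix>.json files found anywhere under source_dir into target_dir.

    Assumes target_dir is not located inside source_dir.
    Returns the names of the copied files, in sorted path order.
    """
    wanted = set(episode_prefixes)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).rglob("*.json")):
        if path.stem in wanted:
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

For example, `collect_transcripts(["ep1"], "podcasts-transcripts", "selected-transcripts")` would copy only `ep1.json`, wherever it sits in the extracted folder tree; both folder names and the identifier here are placeholders.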
Whilst Podify is built in Ruby on Rails, these scripts have been provided in Python. This is to make it easier for researchers to customise and adapt these procedures to their own needs.
In a terminal window, from the root folder (cd Podify), run Sidekiq:
bundle exec sidekiq
With Sidekiq operating and ready to accept incoming jobs, the following task will create the catalogue. Please be aware that this process may take some time, depending on the number of episodes that are going to be uploaded onto Podify.
rails episodes:seed_episodes bucket_segments_object_key="episodes.json"
Once the catalogue is fully created (the pending jobs, if any, can be found at localhost:3000/admin/sidekiq), the Sidekiq process can be stopped and the terminal closed. Although user behaviour can be manually downloaded via the admin dashboard, a cron schedule is also implemented to avoid any potential data loss. Please note that this requires a running Sidekiq process.
Install the Heroku CLI by following this guide: https://devcenter.heroku.com/articles/heroku-cli
Once the CLI is installed and you are logged in (heroku login), run the following:
cd Podify
heroku apps:create --stack=heroku-20 neurasearch-podify
heroku buildpacks:set heroku/nodejs --index 1
heroku buildpacks:set heroku/ruby --index 2
heroku buildpacks:add --index 3 https://github.com/jonathanong/heroku-buildpack-ffmpeg-latest.git
git push heroku main
heroku run rake db:migrate
heroku ps:scale web=1
heroku open
Please cite this work as follows:
@inproceedings{10.1145/3539618.3591824,
author = {Meggetto, Francesco and Moshfeghi, Yashar},
title = {Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research},
year = {2023},
isbn = {9781450394086},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3539618.3591824},
doi = {10.1145/3539618.3591824},
booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {3215–3219},
numpages = {5},
keywords = {user behaviour, platform, podcast, logging, listening, search},
location = {Taipei, Taiwan},
series = {SIGIR '23}
}
Francesco Meggetto and Yashar Moshfeghi. 2023. Podify: A Podcast Streaming Platform with Automatic Logging of User Behaviour for Academic Research. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23). Association for Computing Machinery, New York, NY, USA, 3215–3219. https://doi.org/10.1145/3539618.3591824