A showcase project for my talk about typelevel stack architecture.

kubukoz/dropbox-demo

dropbox-demo

An application accompanying my talk about structuring functional applications in Scala.

Run application

Backend

Prerequisites: sbt, Elasticsearch running on localhost:9200, and a runnable tesseract binary on the PATH.

Elasticsearch is available in the attached docker-compose setup. tesseract is available if you enter the attached nix-shell. The shell will also load environment variables from the env.sh file, if you have one (it's ignored in git).

Once you have these, start the application with sbt run. At the time of writing, it starts on 0.0.0.0:4000; this can be configured with the HTTP_HOST and HTTP_PORT environment variables.
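As an illustration of how the HTTP_HOST/HTTP_PORT override could behave, here is a minimal sketch using only the Scala standard library (2.13+). It is not the project's actual configuration code (that is loaded with ciris); the HttpConfig name and the validation rules are assumptions made for this example.

```scala
// Hypothetical sketch, not the project's ciris-based loader.
final case class HttpConfig(host: String, port: Int)

object HttpConfig {
  // Defaults mirror the address mentioned above (0.0.0.0:4000).
  // Taking the environment as a Map keeps the function easy to test.
  def fromEnv(env: Map[String, String]): Either[String, HttpConfig] = {
    val host = env.getOrElse("HTTP_HOST", "0.0.0.0")
    env.get("HTTP_PORT") match {
      case None => Right(HttpConfig(host, 4000))
      case Some(raw) =>
        raw.toIntOption
          .filter(p => p > 0 && p <= 65535) // assumed validation rule
          .map(p => HttpConfig(host, p))
          .toRight(s"HTTP_PORT is not a valid port: $raw")
    }
  }
}
```

In an application this would be called as HttpConfig.fromEnv(sys.env), failing fast on a Left.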

Frontend

Prerequisites: Node 14.x, npm (provided in a nix-shell in the frontend directory).

cd frontend
npm start

Project goals

Search images from some storage by the text on them.

I have tens of thousands of images to search, and the OCR (optical character recognition) process takes ~0.5 seconds per image for relatively small images, so live decoding is a no-no. Instead, we will allow the user to index a path from the store, and later search the database populated in that process.

The indexing will happen in the background, so the user gets a response without having to wait for it to complete. We'll run the whole process in constant memory (although this is partly up to Tesseract; I haven't investigated its memory usage on large images yet). Both downloading the list of images to index and downloading their actual bytes are done in a streaming fashion, thanks to fs2.
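To make the "constant memory" idea concrete, here is a minimal sketch that reads a byte source in fixed-size chunks. It deliberately uses a plain standard-library Iterator as a stand-in for fs2's Stream, so it shows the principle only and is not the project's code; ChunkedRead and its methods are hypothetical names.

```scala
import java.io.InputStream

object ChunkedRead {
  // Lazily yields chunks of at most `chunkSize` bytes (chunkSize must be > 0).
  // Only the chunk currently being processed is held in memory.
  def chunks(in: InputStream, chunkSize: Int): Iterator[Array[Byte]] =
    Iterator
      .continually {
        val buf = new Array[Byte](chunkSize)
        val n = in.read(buf) // -1 signals end of stream
        if (n <= 0) Array.emptyByteArray else buf.take(n)
      }
      .takeWhile(_.nonEmpty)

  // Consumes the stream chunk by chunk, keeping only a running total.
  def totalBytes(in: InputStream, chunkSize: Int): Long =
    chunks(in, chunkSize).foldLeft(0L)((acc, chunk) => acc + chunk.length)
}
```

Unlike this sketch, fs2 also takes care of resource safety (closing the source) and of composing such streams with effects.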

Infrastructure

At the time of writing, this only runs on a local machine. Tesseract is provided via the nix shell, and Open Distro for Elasticsearch runs in Docker. The application can be started using bloop.

Tech stack

The backend is built in Scala (obviously - that was the point of the talk), using the following libraries:

  • Cats Effect 3, for several things, such as monadic composition of asynchronous tasks (e.g. Elasticsearch client) and interop with other libraries from the ecosystem
  • fs2 - for streaming data, so that we can run the indexing process in constant memory, as long as the OCR implementation can do so
  • http4s - for the HTTP server, as well as a custom client for Dropbox
  • ciris - for compositionally loading configuration
  • circe - for decoding/encoding JSON
  • log4cats - for logging
  • Elasticsearch high-level Java client - for talking to Elasticsearch. Normally you could use something like elastic4s, but I only needed a subset of its functionality and wanted to show how this can be wrapped in cats.effect.IO
  • weaver - for testing
  • chimney - for transforming similar datatypes

The frontend is built with React + TypeScript. You'll need Node 14.x and npm, both of which are provided by the attached nix shell in the frontend directory.

Architecture

First of all, the data flow in processes that can be triggered by the user:

[Diagram: data flows]

There are three main processes:

Indexing

  • Wait for the user to provide a directory to index (a path to the directory in Dropbox)
  • Download metadata of all images within that directory (recursively)
  • For each entry, get a stream of bytes of its content and push it to Tesseract
  • Pass the metadata and the OCR decoding result to the indexer (Elasticsearch)
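
The steps above can be sketched as a single lazy pipeline. This is only an illustration under stated assumptions: the four function parameters are hypothetical stand-ins for the Dropbox listing, the streaming download, Tesseract, and the Elasticsearch indexer, and standard-library Iterators replace the fs2 streams used in the real project.

```scala
object IndexingSketch {
  // Hypothetical shapes; the project's real types may differ.
  final case class FileMetadata(path: String)
  final case class Document(path: String, content: String)

  // Returns the number of documents indexed.
  def indexDirectory(
      listImages: String => Iterator[FileMetadata],    // stand-in: recursive listing
      download: FileMetadata => Iterator[Array[Byte]], // stand-in: streamed file bytes
      ocr: Iterator[Array[Byte]] => String,            // stand-in: Tesseract
      index: Document => Unit                          // stand-in: Elasticsearch
  )(directory: String): Int =
    listImages(directory)
      // Iterator.map is lazy: each file is downloaded and decoded
      // only when the fold below asks for the next document.
      .map(meta => Document(meta.path, ocr(download(meta))))
      .foldLeft(0) { (count, doc) => index(doc); count + 1 }
}
```

Because every stage pulls one element at a time, no step needs the full list of files or a full file in memory, apart from whatever the OCR stand-in itself buffers.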

API:

# paths must start with /
http :4000/index path="/images"
POST /index HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 27
Content-Type: application/json
Host: localhost:4000
User-Agent: HTTPie/2.4.0

{
    "path": "/images"
}


HTTP/1.1 202 Accepted
Content-Length: 0
Date: Sun, 02 May 2021 17:57:19 GMT

Search

  • Ask the user for a query
  • Pass the query to Elasticsearch with some level of fuzziness
  • Pass results back to the user (metadata of matching files)

API:

http :4000/search\?query=test
GET /search?query=test HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: localhost:4000
User-Agent: HTTPie/2.4.0



HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 02 May 2021 17:58:41 GMT
Transfer-Encoding: chunked

[
    {
        "content": "Some decoded text",
        "imageUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile1.jpg",
        "thumbnailUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile1.jpg"
    },
    {
        "content": "Another file with test text",
        "imageUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg",
        "thumbnailUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg"
    }
]
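
Fuzziness explains why a query for "test" can also return a document containing only "text", as in the sample response above: the two words are one edit apart. Elasticsearch's fuzzy matching is defined in terms of edit distance; the sketch below implements plain Levenshtein distance to show the idea. It is not what Elasticsearch runs internally (which, among other things, can also treat transpositions as a single edit), and the Fuzzy object is a hypothetical name.

```scala
object Fuzzy {
  // Classic dynamic-programming Levenshtein distance, one row at a time.
  def editDistance(a: String, b: String): Int = {
    // Distance from the empty prefix of `a` to each prefix of `b`.
    val firstRow = (0 to b.length).toVector
    val lastRow = a.foldLeft(firstRow) { (prev, ca) =>
      b.zipWithIndex.foldLeft(Vector(prev.head + 1)) { case (row, (cb, j)) =>
        val substitution = prev(j) + (if (ca == cb) 0 else 1)
        val insertion = row.last + 1
        val deletion = prev(j + 1) + 1
        row :+ math.min(substitution, math.min(insertion, deletion))
      }
    }
    lastRow.last
  }

  // A term matches if it is within `maxEdits` edits of the query, ignoring case.
  def matches(query: String, term: String, maxEdits: Int): Boolean =
    editDistance(query.toLowerCase, term.toLowerCase) <= maxEdits
}
```

For example, Fuzzy.matches("test", "text", 1) is true, while Fuzzy.matches("test", "decoded", 1) is false.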

Download

  • Take a concrete file's path from the user (from file metadata)
  • Return stream of bytes for that file

API:

http :4000/view/%2Fimages%2Ffile2.jpg
GET /view/%2Fimages%2Ffile2.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: localhost:4000
User-Agent: HTTPie/2.4.0



HTTP/1.1 200 OK
Content-Type: image/jpeg
Date: Sun, 02 May 2021 18:01:30 GMT
Transfer-Encoding: chunked



+-----------------------------------------+
| NOTE: binary data not shown in terminal |
+-----------------------------------------+
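
Note that the file's path travels as a single URL-encoded path segment (/images/file2.jpg becomes %2Fimages%2Ffile2.jpg). A minimal sketch of building such a URL on the client side, assuming java.net.URLEncoder is acceptable for the job; ViewUrl is a hypothetical helper, not part of the project:

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object ViewUrl {
  // URLEncoder targets form encoding, so it turns spaces into "+";
  // replace those with "%20" to get a valid path segment.
  // (A literal "+" in the input is already encoded as "%2B".)
  def encodePath(path: String): String =
    URLEncoder.encode(path, StandardCharsets.UTF_8.name()).replace("+", "%20")

  def viewUrl(baseUrl: String, path: String): String =
    s"$baseUrl/view/${encodePath(path)}"
}
```

With the base URL from the search response above, ViewUrl.viewUrl("http://127.0.0.1:4000", "/images/file2.jpg") yields http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg.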

Module graph

[Diagram: modules]

The blue boxes correspond to the processes outlined above (core logic), while the green boxes are high-level adapters for the underlying vendor-specific implementations. This is similar to the hexagonal architecture / Ports & Adapters pattern.

These correspond almost directly to the interfaces (Tagless Final algebras) in the project.
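
For readers unfamiliar with the term, a Tagless Final algebra is an interface parameterised by an effect type F[_], with one or more interpreters behind it. The sketch below is a hypothetical example in that style: the Indexer trait, its methods, and the in-memory interpreter are invented for illustration and are not copied from the project, where F would typically be cats.effect.IO and the interpreter would call Elasticsearch.

```scala
object AlgebraSketch {
  final case class Document(path: String, content: String)

  // The "port": core logic depends only on this interface.
  trait Indexer[F[_]] {
    def index(doc: Document): F[Unit]
    def search(query: String): F[List[Document]]
  }

  // The simplest possible effect type, used here to avoid any dependencies.
  type Id[A] = A

  // An "adapter": an in-memory interpreter, handy for tests.
  // It does exact, case-insensitive substring matching (no fuzziness).
  final class InMemoryIndexer extends Indexer[Id] {
    private var docs = Vector.empty[Document]

    def index(doc: Document): Unit = { docs = docs :+ doc }

    def search(query: String): List[Document] =
      docs.filter(_.content.toLowerCase.contains(query.toLowerCase)).toList
  }
}
```

Swapping the interpreter changes the vendor behind the port without touching the code that uses Indexer.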
