GPT-2 support (#44)
* GPT-2 first support

* GPT-2 first support

* cover TRT, ORT and PT (GPT-2)

* optimize perf

* makes ONNX Runtime fast

* makes ONNX Runtime fast

* remove pycuda dependency
use torch tensors as input / output for tensorrt

* move ort bindings to lib
fix export
fix trt gpt2

* fix sentence transfo export
add doc

* make int32 tensor default

* small notebook fixes

* change command line
add new python generative task

* delete old src

* fix linter

* add cache / no cache XP

* make config more modular
add gpt2 to the documentation

* update Python tests

* fix test

* config for generative models

* refactoring

* fix sentence transformer

* fix gpt2 generation

* fix format

* fix test

* move infinity demo in a specific folder

* fix decoder configuration generation

* fix model call

* add measures for CPU (gpt2)

* fix code

* add cache use in GPT-2 notebook

* update notebook
add tests
update docker Triton image
update ORT
bump VERSION

* add getting started section to README
add sentence-transformers to Docker
improve Makefile
update docker version in documentation
use tensorrt Python lib 8.2.2 (like Triton Docker image)
Add tolerance check message

* doc update

* docker image includes notebook and pytorch-quantization
documentation update
change in index titles
complete README.md

* add social logo

* add social logo

* update notebook text

* update notebook

* fix Makefile
pommedeterresautee authored Feb 8, 2022
1 parent c87a39e commit ccfeb21
Showing 52 changed files with 2,982 additions and 818 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-app.yml
@@ -14,7 +14,7 @@ jobs:
steps:
- uses: actions/checkout@v2

- name: Set up Python 3.9
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: "3.9"
2 changes: 1 addition & 1 deletion .gitignore
@@ -213,7 +213,7 @@ cython_debug/

# custom
*.onnx
*.plan
.idea/
TensorRT/
triton_models/
demo/roberta-*/
5 changes: 3 additions & 2 deletions Dockerfile
@@ -1,8 +1,9 @@
FROM nvcr.io/nvidia/tritonserver:21.12-py3
FROM nvcr.io/nvidia/tritonserver:22.01-py3

# see .dockerignore to check what is transfered
COPY . ./

RUN pip3 install -U pip && \
pip3 install nvidia-pyindex && \
pip3 install -e ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir
pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
pip3 install sentence-transformers notebook pytorch-quantization
14 changes: 9 additions & 5 deletions Makefile
@@ -24,17 +24,21 @@ test:
test_ci:
pytest -m "not gpu" || exit 1

.PHONY: build_docker
build_docker:
.PHONY: docker_build
docker_build:
DOCKER_BUILDKIT=1 docker build \
--rm \
-t ghcr.io/els-rd/transformer-deploy:latest \
-t ghcr.io/els-rd/transformer-deploy:$(VERSION) \
-f Dockerfile .

.PHONY: build_push_docker
build_push_docker:
.PHONY: docker_build_push
docker_build_push:
! docker manifest inspect ghcr.io/els-rd/transformer-deploy:$(shell cat VERSION) > /dev/null || exit 1
${MAKE} build_docker || exit 1
${MAKE} docker_build || exit 1
docker push ghcr.io/els-rd/transformer-deploy:latest || exit 1
docker push ghcr.io/els-rd/transformer-deploy:$(VERSION) || exit 1

.PHONY: documentation
documentation:
PYTHONWARNINGS=ignore::UserWarning:mkdocstrings.handlers.python mkdocs serve
240 changes: 236 additions & 4 deletions README.md
@@ -1,6 +1,6 @@
# Hugging Face Transformer submillisecond inference️ and deployment to production: 🤗 → 🤯

[![Documentation](https://img.shields.io/website?label=documentation&style=for-the-badge&up_message=online&url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests&style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge)](./LICENCE) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-360/)
[![Documentation](https://img.shields.io/website?label=documentation&style=for-the-badge&up_message=online&url=https%3A%2F%2Fels-rd.github.io%2Ftransformer-deploy%2F)](https://els-rd.github.io/transformer-deploy/) [![tests](https://img.shields.io/github/workflow/status/ELS-RD/transformer-deploy/tests/main?label=tests&style=for-the-badge)](https://github.com/ELS-RD/transformer-deploy/actions/workflows/python-app.yml) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg?style=for-the-badge)](https://www.python.org/downloads/release/python-360/) ![Twitter Follow](https://img.shields.io/twitter/follow/pommedeterre33?color=orange&style=for-the-badge)


### Optimize and deploy in **production** 🤗 Hugging Face Transformer models in a single command line.
@@ -18,7 +18,7 @@ We have tested many solutions, and below is what we found:

[`Pytorch`](https://pytorch.org/) + [`FastAPI`](https://fastapi.tiangolo.com/) = 🐢
Most tutorials on `Transformer` deployment in production are built over Pytorch and FastAPI.
Both are great tools but not very performant in inference.
Both are great tools but not very performant in inference (actual measures below).

[`Microsoft ONNX Runtime`](https://github.com/microsoft/onnxruntime/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = ️🏃💨
Then, if you spend some time, you can build something over ONNX Runtime and Triton inference server.
@@ -30,6 +30,8 @@ You will usually get 5X faster inference compared to vanilla Pytorch.
Sometimes it can reach up to **10X faster inference**.
Buuuuttt... TensorRT can take some effort to master, it requires tricks not easy to come up with; we implemented them for you!

[Detailed tool comparison table](https://els-rd.github.io/transformer-deploy/compare/)

## Features

* heavily optimize transformer models for inference (CPU and GPU) -> between 5X and 10X speed-up
@@ -39,10 +41,240 @@ Buuuuttt... TensorRT can take some effort to master, it requires tricks not easy
* supported models: any model that can be exported to ONNX (-> most of them)
* supported tasks: classification, feature extraction (aka sentence-transformers dense embeddings)

<!--why-end-->

> Want to understand how it works under the hood?
> read [🤗 Hugging Face Transformer inference UNDER 1 millisecond latency 📖](https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915)
> <img src="resources/rabbit.jpg" width="120">
## Want to check for yourself in 3 minutes?

To get a rough idea of the kind of acceleration you will get on your own model, you can try the `docker`-only run below.
For GPU runs, you need Nvidia drivers and the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) installed on your machine.

**3 tasks are covered** below:

* classification
* feature extraction (text to dense embeddings)
* text generation (GPT-2 style)

Moreover, we have added a GPU quantization notebook that opens directly in `Docker` for you to play with.

### Classification/reranking (encoder model)

Classification is a common task in NLP, and large language models have shown great results.
This task is also used by search engines to provide Google-like relevancy (cf. [arxiv](https://arxiv.org/abs/1901.04085)).

#### Optimize existing model

This will optimize the model and generate the Triton configuration and folder layout in a single command:

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
--seq-len 16 128 128"

# output:
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=5.43ms, sd=0.70ms, min=4.88ms, max=7.81ms, median=5.09ms, 95p=7.01ms, 99p=7.53ms
# [Pytorch (FP16)] mean=6.55ms, sd=1.00ms, min=5.75ms, max=10.38ms, median=6.01ms, 95p=8.57ms, 99p=9.21ms
# [TensorRT (FP16)] mean=0.53ms, sd=0.03ms, min=0.49ms, max=0.61ms, median=0.52ms, 95p=0.57ms, 99p=0.58ms
# [ONNX Runtime (FP32)] mean=1.57ms, sd=0.05ms, min=1.49ms, max=1.90ms, median=1.57ms, 95p=1.63ms, 99p=1.76ms
# [ONNX Runtime (optimized)] mean=0.90ms, sd=0.03ms, min=0.88ms, max=1.23ms, median=0.89ms, 95p=0.95ms, 99p=0.97ms
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
```

The command outputs mean latency and other statistics.
`Nvidia TensorRT` is usually the fastest option, with `ONNX Runtime` a strong second.
On ONNX Runtime, `optimized` means that kernel fusion and mixed precision are enabled.
`Pytorch` is never competitive on transformer inference, with or without mixed precision, whatever the model size.
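
If you are curious about what the `optimized` setting roughly corresponds to, below is a hedged sketch built on the `onnxruntime.transformers` optimizer; the file names, head count and hidden size are placeholders, and the exact settings used by `convert_model` may differ.

```python
# rough sketch of what "ONNX Runtime (optimized)" corresponds to: transformer-specific
# kernel fusion plus FP16 conversion. File names, head count and hidden size are
# placeholders, and the exact settings used by convert_model may differ.
from onnxruntime import InferenceSession
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "model.onnx",       # the ONNX export of the Hugging Face model
    model_type="bert",
    num_heads=12,
    hidden_size=384,
    use_gpu=True,
)
opt_model.convert_float_to_float16()  # mixed precision
opt_model.save_model_to_file("model-optimized.onnx")

# run it on GPU with the CUDA execution provider
session = InferenceSession("model-optimized.onnx", providers=["CUDAExecutionProvider"])
```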

#### Run Nvidia Triton inference server

Note that we install `transformers` at run time.
For production, it's advised to build your own 3-line Docker image with `transformers` pre-installed.

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
# ...
# I0207 09:58:32.738831 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 09:58:32.739875 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 09:58:32.782066 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
```

#### Query inference

Query the ONNX model (replace `transformer_onnx_inference` with `transformer_tensorrt_inference` to query the TensorRT engine):

```shell
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
--data-binary "@demo/infinity/query_body.bin" \
--header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,2],"data":[-3.431640625,3.271484375]}]}
```

The model output is at the end of the JSON response (`data` field).
[More information about how to query the server from `Python`, and other languages](https://els-rd.github.io/transformer-deploy/run/).

To get very low latency inference in your Python code (no inference server): [click here](https://els-rd.github.io/transformer-deploy/python/)
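
For a quick idea of what such a `Python` client can look like, here is a hedged sketch using the `tritonclient` package; the `TEXT` input name, its shape and the `BYTES` datatype are assumptions made for illustration, check the generated `triton_models/*/config.pbtxt` for the real values.

```python
# hedged sketch of a Python client for the classification endpoint above.
# Input/output names and shapes are assumptions: check triton_models/*/config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["This live event is great. I will sign-up for Infinity."]], dtype=object)
model_input = httpclient.InferInput("TEXT", [1, 1], "BYTES")
model_input.set_data_from_numpy(text)

result = client.infer(
    model_name="transformer_onnx_inference",  # or transformer_tensorrt_inference
    inputs=[model_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output"))  # e.g. [[-3.43, 3.27]], one logit per class
```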

### Feature extraction / dense embeddings

Feature extraction in NLP is the task of converting text to dense embeddings.
It has gained some traction as a robust way to improve search engine relevancy (increase recall).
This project supports models from [sentence-transformers](https://github.com/UKPLab/sentence-transformers).

#### Optimize existing model

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && \
convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
--backend tensorrt onnx \
--task embedding \
--seq-len 16 128 128"

# output:
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=5.19ms, sd=0.45ms, min=4.74ms, max=6.64ms, median=5.03ms, 95p=6.14ms, 99p=6.26ms
# [Pytorch (FP16)] mean=5.41ms, sd=0.18ms, min=5.26ms, max=8.15ms, median=5.36ms, 95p=5.62ms, 99p=5.72ms
# [TensorRT (FP16)] mean=0.72ms, sd=0.04ms, min=0.69ms, max=1.33ms, median=0.70ms, 95p=0.78ms, 99p=0.81ms
# [ONNX Runtime (FP32)] mean=1.69ms, sd=0.18ms, min=1.62ms, max=4.07ms, median=1.64ms, 95p=1.86ms, 99p=2.44ms
# [ONNX Runtime (optimized)] mean=1.03ms, sd=0.09ms, min=0.98ms, max=2.30ms, median=1.00ms, 95p=1.15ms, 99p=1.41ms
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
```

#### Run Nvidia Triton inference server

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
bash -c "pip install transformers && tritonserver --model-repository=/models"

# output:
# ...
# I0207 11:04:33.761517 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 11:04:33.761844 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 11:04:33.803373 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002

```

#### Query inference

```shell
curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
--data-binary "@demo/infinity/query_body.bin" \
--header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,768],"data":[0.06549072265625,-0.04327392578125,0.1103515625,-0.007320404052734375,...
```
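
The 768 floats in the `data` field are a dense embedding. As a hedged illustration of what to do with them (the `TEXT` input name is an assumption, and the HTTP/JSON flavour of the protocol is used instead of the binary body above), you could rank documents against a query with cosine similarity:

```python
# hedged sketch: get embeddings over HTTP/JSON and rank documents by cosine similarity.
# The "TEXT" input name is an assumption: check triton_models/*/config.pbtxt.
import numpy as np
import requests

URL = "http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer"


def embed(text: str) -> np.ndarray:
    payload = {"inputs": [{"name": "TEXT", "shape": [1, 1], "datatype": "BYTES", "data": [text]}]}
    data = requests.post(URL, json=payload).json()["outputs"][0]["data"]
    return np.asarray(data, dtype=np.float32)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = embed("what is the capital of France?")
for doc in ["Paris is the capital of France.", "GPT-2 is a text generation model."]:
    print(round(cosine(query, embed(doc)), 3), doc)
```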

### Generate text (decoder model)

Text generation seems to be the way to go for NLP.
Unfortunately, these models are slow to run; below we accelerate the most famous of them: GPT-2.

#### Optimize existing model

Like before, the command below prepares everything the Triton inference server needs.
One point to keep in mind is that Triton runs:
- the inference engines (`ONNX Runtime` and `TensorRT`)
- `Python` code in charge of the `decoding` part; the `Python` code delegates model management to the Triton server.

The `Python` code is in `./triton_models/transformer_tensorrt_generate/1/model.py`

```shell
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && \
convert_model -m gpt2 \
--backend tensorrt onnx \
--seq-len 6 256 256 \
--task text-generation"

# output:
# ...
# Inference done on NVIDIA GeForce RTX 3090
# latencies:
# [Pytorch (FP32)] mean=9.43ms, sd=0.59ms, min=8.95ms, max=15.02ms, median=9.33ms, 95p=10.38ms, 99p=12.46ms
# [Pytorch (FP16)] mean=9.92ms, sd=0.55ms, min=9.50ms, max=15.06ms, median=9.74ms, 95p=10.96ms, 99p=12.26ms
# [TensorRT (FP16)] mean=2.19ms, sd=0.18ms, min=2.06ms, max=3.04ms, median=2.10ms, 95p=2.64ms, 99p=2.79ms
# [ONNX Runtime (FP32)] mean=4.99ms, sd=0.38ms, min=4.68ms, max=9.09ms, median=4.78ms, 95p=5.72ms, 99p=5.95ms
# [ONNX Runtime (optimized)] mean=3.93ms, sd=0.40ms, min=3.62ms, max=6.53ms, median=3.81ms, 95p=4.49ms, 99p=5.79ms
# Each inference engine output is within 0.3 tolerance compared to Pytorch output
```

#### Run Nvidia Triton inference server

To run the decoding algorithm server side, we need to install `Pytorch` in the `Triton` Docker image.

```shell
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 8g \
-v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
bash -c "pip install transformers torch==1.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
tritonserver --model-repository=/models"

# output:
# ...
# I0207 10:29:19.091191 1 grpc_server.cc:4195] Started GRPCInferenceService at 0.0.0.0:8001
# I0207 10:29:19.091417 1 http_server.cc:2857] Started HTTPService at 0.0.0.0:8000
# I0207 10:29:19.132902 1 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
```

#### Query inference

Replace `transformer_onnx_generate` with `transformer_tensorrt_generate` to query the `TensorRT` engine.

```shell
curl -X POST http://localhost:8000/v2/models/transformer_onnx_generate/versions/1/infer \
--data-binary "@demo/infinity/query_body.bin" \
--header "Inference-Header-Content-Length: 161"

# output:
# {"model_name":"transformer_onnx_generate","model_version":"1","outputs":[{"name":"output","datatype":"BYTES","shape":[],"data":["This live event is great. I will sign-up for Infinity.\n\nI'm going to be doing a live stream of the event.\n\nI"]}]}
```

Ok, the output is not very interesting (💩 in -> 💩 out) but you get the idea.
The source code of the generative model is in `./triton_models/transformer_tensorrt_generate/1/model.py`.
You may want to tweak it to your needs (the default is greedy search with 64 output tokens).
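
To give an idea of the structure, here is a heavily simplified, hypothetical sketch of such a Triton `Python` backend doing greedy decoding; the tensor and model names used below (`TEXT`, `input_ids`, `logits`, `transformer_onnx_model`) are assumptions, and the generated `model.py` is more complete.

```python
# hypothetical, heavily simplified sketch of a Triton Python backend doing greedy decoding.
# Tensor and model names ("TEXT", "input_ids", "logits", "transformer_onnx_model") are
# assumptions; the model.py generated by convert_model is more complete.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy().reshape(-1)
            prompt = raw[0].decode("utf-8") if isinstance(raw[0], bytes) else str(raw[0])
            input_ids = self.tokenizer(prompt, return_tensors="np").input_ids.astype(np.int32)

            for _ in range(64):  # greedy search, 64 new tokens
                # delegate the forward pass to the engine managed by Triton (BLS call)
                infer_request = pb_utils.InferenceRequest(
                    model_name="transformer_onnx_model",
                    requested_output_names=["logits"],
                    inputs=[pb_utils.Tensor("input_ids", input_ids)],
                )
                logits = pb_utils.get_output_tensor_by_name(infer_request.exec(), "logits").as_numpy()
                next_token = logits[:, -1, :].argmax(axis=-1).reshape(1, 1).astype(np.int32)
                input_ids = np.concatenate([input_ids, next_token], axis=1)

            generated = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
            out = pb_utils.Tensor("output", np.array([generated.encode("utf-8")], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```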

#### Python code

You may be interested in running optimized text generation directly in Python, without using any inference server:

```shell
docker run -p 8888:8888 -v $PWD/demo/generative-model:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

### Model quantization on GPU

Quantization is a generic method to get a 2X speedup on top of other inference optimizations.
GPU quantization on transformers is almost never used because it requires modifying the model source code.

We have implemented in this library a mechanism that updates the Hugging Face transformers library to support quantization.
It makes quantization easy to use.
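
Under the hood, GPU quantization relies on Nvidia's `pytorch-quantization` toolkit (installed in the Docker image). As a rough, hypothetical sketch of that underlying flow only, not of this library's own API, calibration followed by enabling fake quantization looks like this:

```python
# rough, hypothetical sketch of the underlying pytorch-quantization flow
# (calibration, then fake quantization); this library's own helpers differ.
import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from transformers import AutoModelForSequenceClassification, AutoTokenizer

quant_modules.initialize()  # monkey-patch torch.nn layers with quantization-aware versions

model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1. collect activation statistics on a few representative sentences (calibration)
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.disable_quant()
        module.enable_calib()

with torch.no_grad():
    for text in ["a few", "representative", "sentences"]:
        model(**tokenizer(text, return_tensors="pt").to("cuda"))

# 2. freeze the ranges and enable fake quantization (QDQ nodes) before ONNX export
for module in model.modules():
    if isinstance(module, quant_nn.TensorQuantizer):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()
```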

To play with it, open this notebook:

```shell
docker run -p 8888:8888 -v $PWD/demo/quantization:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root"
```

<!--why-end-->

## Check our [documentation](https://els-rd.github.io/transformer-deploy/) for detailed instructions on how to use the package, including setup, GPU quantization support and Nvidia Triton inference server deployment.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.3.0
0.4.0
