Deploy a Transformer as a REST API
As a data scientist, building an efficient machine learning model is very rewarding, but it is only one part of the problem. To be consumed in the real world, the model has to be accessible to users and developers so that they can take full advantage of what you have built. This can be done in different ways; one common way, which we’ll describe in this post, consists in serving the trained model as a REST API.
In my previous posts PART-1 and PART-2 I showed you how to use transformers in two different ways to model six analysis metrics of argumentative essays: cohesion, syntax, vocabulary, phraseology, grammar, and conventions.
In this post, we will focus on the deployment part: we will walk through all the steps of building and deploying the fine-tuned transformer with FastAPI and Docker.
It will cover the following parts:
- Introduction to different methods to deploy an ML model
- Building a REST API with FastAPI
- Package the API with Docker
- Continuous integration with GitHub Actions
The full project is available on GitHub.
Different ways to deploy a model:
There are different ways to use a trained model in production:
Batch predictions:
The very first canonical usage of a trained machine learning model consists in computing predictions to be injected into a production workflow. A common form of this usage is batch prediction: the option to choose when you process accumulated data flows and have no urgent need for immediate results.
For example, a delivery company may need a monthly forecast of the package flows to deliver in order to optimize its staffing schedules.
Usually, tools like Airflow are used to schedule the pipeline: ingest the new incoming data, use the pre-trained model to make batch predictions, and dump the results onto the data storage (it could be AWS S3, Cloud Storage, etc.).
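As an illustration, here is a minimal sketch of such a batch-scoring pipeline scheduled with Airflow (the task logic, names and schedule are placeholders, not the actual pipeline of this project):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_data():
    ...  # pull the newly accumulated data from storage (placeholder)

def batch_predict():
    ...  # load the pre-trained model and score the ingested batch (placeholder)

def dump_results():
    ...  # write the predictions back to storage, e.g. S3 / Cloud Storage (placeholder)

with DAG(
    dag_id="monthly_batch_predictions",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",  # run once a month, like the delivery-forecast example
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_data", python_callable=ingest_data)
    predict = PythonOperator(task_id="batch_predict", python_callable=batch_predict)
    dump = PythonOperator(task_id="dump_results", python_callable=dump_results)

    ingest >> predict >> dump
```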
Embedded model
During the last few years, great technological advances in ML optimization, microprocessor architectures, software frameworks, and embedded hardware have made it possible to run complex machine learning algorithms on the smallest devices, like micro-controllers.
The field of embedding ML algorithms on such devices is called TinyML, and it is a fast-growing one!
TensorFlow Lite is the most popular framework to deploy models on mobile, micro-controllers and other edge devices. It offers many types of optimizations such as distillation, quantization and weight pruning; you can check this post and this one if you want more details about TensorFlow Lite optimization techniques.
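For instance, post-training quantization with TensorFlow Lite can be as simple as the following sketch (the SavedModel path is a placeholder):

```python
import tensorflow as tf

# Convert a SavedModel to TensorFlow Lite with default post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```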
Micro-service Model Serving
In this architecture, the model is served as a separate micro-service. This type of serving is the most popular one, for many reasons: the model release is very flexible since serving is done outside of the main application, online/real-time predictions become possible, and monitoring is easy.
This type of API is called an inference endpoint; in this post we will use FastAPI as a back-end framework to develop the REST endpoint allowing us to predict writing skills from a POST request.
In the Micro-service Architecture, each application is encapsulated in a Docker container, so that it can be deployed independently of the host environment.
We will explore this micro-service based infrastructure to deploy a fine-tuned deberta model (all the fine-tuning steps are described in my previous post).
As a reminder, the fine-tuned model evaluates writing skills from a given essay: it takes the essay string as input and gives as output six writing-skill metrics: cohesion, syntax, vocabulary, phraseology, grammar, and conventions, each of them ranging from 0 to 5. The model is saved as an artifact on the Weights & Biases platform.
Building a REST API with FastAPI
Let’s start by creating a project repository
Repository template:
The project repository is available on GitHub.
The tree structure of our repository should be as follows:
The `data` and `notebooks` directories are optional here: they were used at the exploratory stage. In the `/src` directory we will mainly use the following scripts:
- `predict.py`: contains the functions needed to predict/score the essay inputs with the transformer model
- `custom_transformer.py`: from this Python file we will import the custom transformer class that defines the fine-tuned model, as well as the custom `torch.utils.data.Dataset` called `EssayIterator` that will be used to make batch predictions (I’ll describe in more detail later how to define the API allowing to get batch predictions)
I invite you to consult my source code here
Some Good Practices:
1. `config.ini` file:
In the `config.ini` file we define all the required config parameters such as paths, model hyper-parameters and target label names (= the writing-skill names); then we use the `configparser` module to read and parse it into a config dictionary, as follows:
import configparser
config = configparser.ConfigParser()
config.read('config.ini')
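Individual parameters can then be read with the usual mapping syntax, for example `config["labels"]["target_names"]` (the section and key names here are purely illustrative; the real ones are those defined in the repository’s `config.ini`).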
For more details about the `configparser` module you can check this doc.
2. Env variables:
In this project we only need the wandb API token to connect to the model artifact (I’ll explain later how to get a wandb API token for free from the W&B platform).
One simple way to set your environment variable is to use the export command:
$ export WANDB_API=VALUE
then use `os.environ.get("WANDB_API")` inside the script to get your variable.
A better way to handle your environment variables is to create a `.env` file to store your secrets/env variables, and use the `dotenv` Python package to load them.
3. `__init__.py` file:
Both the config dictionary and the environment variables are loaded in the `__init__.py` script so that they can later be imported directly from other locations of the project:
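A minimal sketch of what this `__init__.py` could contain, assuming the `config.ini` and `.env` setup described above (the actual file in the repo may differ):

```python
# __init__.py (sketch)
import configparser
import os

from dotenv import load_dotenv

# Load secrets/env variables from the .env file into the environment
load_dotenv()
WANDB_API = os.environ.get("WANDB_API")

# Parse config.ini into a config object importable from anywhere in the project
config = configparser.ConfigParser()
config.read("config.ini")
```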
Now we will focus on the most important part, which contains the main application: `api/main.py`.
Load the model locally:
Before defining anything related to the APIs, we have to retrieve the model artifact from W&B (I made it public). In the build phase (that we’ll see later in the Docker part), the first step consists in connecting to wandb and downloading the model artifact:
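A minimal sketch of this step; the artifact path below is illustrative (the real entity/project/artifact names come from the repo’s configuration):

```python
import os

import wandb

# Authenticate with the WANDB_API token exposed as an environment variable
wandb.login(key=os.environ.get("WANDB_API"))

# Download the (public) model artifact locally; the artifact path is a placeholder
api = wandb.Api()
artifact = api.artifact("my-entity/essay-evaluation/deberta-model:latest")
model_dir = artifact.download(root="artifacts/")
```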
Once the model is downloaded locally, all we have to do is instantiate the custom model using the `FeedBackModel` class defined in `src/custom_transformer.py`, then load the weights from the local PyTorch checkpoint. Moreover, we need to load the corresponding tokenizer to be able to generate predictions with the transformer model: the model takes the tokenizer encodings as input, transforms them into embeddings, then generates the related hidden states (as described in more detail in my previous post).
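In code, this could look roughly like the following; the backbone name, the checkpoint file name and the `FeedBackModel` constructor arguments are assumptions (the real ones are defined in `config.ini` and `src/custom_transformer.py`):

```python
import torch
from transformers import AutoTokenizer

from src.custom_transformer import FeedBackModel  # custom model class from the repo

# Instantiate the custom architecture, then load the fine-tuned weights downloaded from W&B
model = FeedBackModel("microsoft/deberta-v3-base")  # constructor args are an assumption
state_dict = torch.load("artifacts/model.pth", map_location="cpu")  # file name is an assumption
model.load_state_dict(state_dict)
model.eval()

# Load the matching tokenizer to encode the incoming essays
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
```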
Later in `main.py` we will define the API routes as well as their corresponding pydantic models:
Pydantic models
For the requests we will consider two cases:
a- Provide the API with a single essay to score:
- Request: we define the `SingleRequest` body as a subclass of the `BaseModel` class, having a single attribute: `essay`. Later, if the model is extended with new features such as language, number of stop-words, essay length…, the `SingleRequest` model can be modified just by adding the new feature names in the model definition.
- Response: we define the `EssayScores` model containing the `text` field, which corresponds to the essay input from the `SingleRequest` request, plus the corresponding scores returned by the transformer model: `cohesion`, `syntax`, `vocabulary`, `phraseology`, `grammar` and `conventions`, each of them a float.
b- Provide the API with multiple essays (a batch) to score:
- Request: similarly, the `MultipleRequest` model is defined by a field called `essays`, which is a list of strings.
- Response: the multi-text response is defined by the `EssaysScores` model, in which we set a field called `batch` containing a list of the pre-defined `EssayScores` response model.
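Putting it together, a minimal sketch of these pydantic models (field names follow the description above; the actual definitions live in the repo’s `api/main.py`) could be:

```python
from typing import List

from pydantic import BaseModel


class SingleRequest(BaseModel):
    essay: str

class MultipleRequest(BaseModel):
    essays: List[str]

class EssayScores(BaseModel):
    text: str
    cohesion: float
    syntax: float
    vocabulary: float
    phraseology: float
    grammar: float
    conventions: float

class EssaysScores(BaseModel):
    batch: List[EssayScores]
```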
Build REST API
FastAPI has recently become one of the most popular web frameworks used to develop micro-services in Python, for many reasons: combined with uvicorn, we get one of the fastest web servers. Besides, unlike Flask, which is built upon the Web Server Gateway Interface (WSGI), FastAPI can handle asynchronous requests as it is based on the Asynchronous Server Gateway Interface (ASGI).
We’ll define three main routes:
- `/index`: the index route, a GET request that returns by default a sample text example that we can use in the other POST requests.
- `/single_essay`: a POST request allowing to predict a single text input; this API can be the best choice for online applications. The `/single_essay` route function uses the `single_prediction` function defined in the `src/predict.py` file.
- `/multiple_essay`: sometimes we might want to score multiple essays, or different parts of a single essay; instead of requesting the `/single_essay` API for each of them, we implement the `/multiple_essay` API that supports batch predictions. Similarly, the `/multiple_essay` route function uses a `batch_prediction` function imported from the `src/predict.py` file.
The API functions can look like this:
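(Below is a simplified sketch; it assumes the pydantic models above are defined in the same module and that `single_prediction`/`batch_prediction` return the six scores for each essay. The exact code is in the repo.)

```python
from fastapi import FastAPI

from src.predict import single_prediction, batch_prediction  # helpers from the repo

app = FastAPI(title="Essay evaluator")

@app.get("/index")
def index():
    # Return a sample essay that can be reused in the POST requests
    return {"sample_essay": "I think that students would benefit from ..."}

@app.post("/single_essay", response_model=EssayScores)
def score_single_essay(request: SingleRequest):
    # Score a single essay with the fine-tuned transformer
    scores = single_prediction(request.essay)  # assumed to return a dict of the six metrics
    return EssayScores(text=request.essay, **scores)

@app.post("/multiple_essay", response_model=EssaysScores)
def score_multiple_essays(request: MultipleRequest):
    # Batch-score a list of essays
    all_scores = batch_prediction(request.essays)  # assumed to return one dict per essay
    batch = [
        EssayScores(text=essay, **scores)
        for essay, scores in zip(request.essays, all_scores)
    ]
    return EssaysScores(batch=batch)
```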
Now let’s test our app locally:
As a reminder, the FastAPI framework is built on ASGI. For that reason, we will be using uvicorn to serve the app, as it is an ASGI web server.
From the project directory we launch this command:
$ uvicorn api.main:app --reload --port 8000
Then you can use many tools to query, for example, the /single_essay API, such as the curl command or the Postman application.
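Alternatively, a quick Python test with the `requests` library (the essay text below is just a placeholder) could look like this:

```python
import requests

payload = {"essay": "Some essay text to evaluate ..."}
response = requests.post("http://localhost:8000/single_essay", json=payload)
print(response.json())  # expected: the six writing-skill scores plus the original text
```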
Let’s check an essay from an extract of the great Stefan Zweig, taken from here. 👉 It returned pretty good writing skills 💖💖 (all the evaluation metrics are around 4, and the max is 5):
Building the Docker image
Once the app is tested locally, we want to fully package it so that it can run in an isolated environment.
We will use Docker, which allows us to package our repository within a container: basically a virtual runtime that contains everything needed to run our code (installs, libraries, system versions, etc.).
To define the Docker image we need to write the build instructions in a file called `Dockerfile`:
- The first line, `FROM python:3.9`, provides the base image for the container, which will be pulled from Docker Hub.
- Set the container’s working directory to the `/api` dir.
- As uvicorn will run on port 8000, we add the `EXPOSE 8000` instruction to tell Docker that our container listens for traffic on that port.
- As explained previously, we will download the model artifact from W&B and save it locally in the `artifacts/` directory, so we need to:
  1. Create the `./artifacts` directory
  2. Create a user called `api` with the `useradd` command
  3. Grant the `api` user ownership of the artifact directory with the `chown` command
- By default, containers run as root. A safer practice consists in using the `USER` instruction to specify a non-root user for the container; that’s why we added the `USER api` instruction.
- To install dependencies, we copy the `requirements.txt` file into the `api` dir, then run the related pip install command.
- Then we need to `COPY` some elements from other parts of the repo to correctly execute the app: basically the `api` and `src` directories and the `config.ini` file.
Let’s build the image using the `docker build` command.
PS: in the GitHub repo I used the docker-compose tool to build and execute the Docker image; this is optional in our case, as we have only a single container.
$ docker build -t essayevaluator .
In the stdout we see the build process, which looks like this:
Then, to run the Docker image, you need to create a wandb API token:
- Sign up for a free account at https://wandb.ai/site and then login to your wandb account.
- Retrieve your API token directly from https://wandb.ai/authorize
Once you get your API token, export it as an env variable:
$ export WANDB_API=YOUR_WANDB_TOKEN
Then execute the docker image with the following command:
$ docker run -e "WANDB_API=$WANDB_API" -p 8000:8000 -t essayevaluator
Then you can use Postman on http://0.0.0.0:8000/ to test the image.
Continuous integration with GitHub Actions
Now that we have managed to execute the code locally and to create the related Docker image so that it can run on any host, we want to make sure that every time we make changes to our code and commit to the master branch, the API remains functional and no API route shows unexpected behavior: this practice is called Continuous Integration (CI).
In general, CI is a software development best practice for ensuring tasks such as revision control, build automation and automated testing. Many continuous integration tools are widely used, such as Jenkins, TeamCity, etc.
In our case we will combine pytest (for the testing) with GitHub Actions (for CI) to automate testing at each push on the master branch:
1. First, create the testing script and add it to the repo: the test code is located in `/api/test_main.py`; you can check the source code here (a minimal sketch of such a test is shown at the end of this section).
2. Go to the Actions tab and click on "set up a workflow yourself".
3. Describe the CI jobs in the yml file:
At the root dir of the repository, a `.github/workflows` folder will be automatically created, in which we will find a file called `main.yml`: in this file we describe the jobs that will be triggered at each push on the master branch:
- `actions/checkout`: an action to checkout/clone your repo in your workflow
- `actions/setup-python`: an action used to install a specific version of Python (and optionally cache dependencies for pip, pipenv and poetry, which makes the CI much faster)
- The next action installs the dependencies from the `requirements.txt` file
- flake8 action: optional but recommended; it executes flake8 stylistic and logical linting of the Python source files
- The final action runs the `pytest` command, using `WANDB_API` as an environment variable

Beforehand, you will have to save the WANDB_API value as a secret in your repo: go to the Settings tab, click on Secrets in the left tab, then on Actions in the drop-down menu. The GitHub secrets are saved in a key/value format:
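As an illustration of step 1 above, a minimal version of `api/test_main.py` using FastAPI’s `TestClient` could look like this (the real tests in the repo may differ):

```python
# api/test_main.py (sketch)
from fastapi.testclient import TestClient

from api.main import app

client = TestClient(app)


def test_index():
    response = client.get("/index")
    assert response.status_code == 200

def test_single_essay():
    payload = {"essay": "A short essay used only for testing."}
    response = client.post("/single_essay", json=payload)
    assert response.status_code == 200
    body = response.json()
    # The response should contain the six writing-skill scores
    for metric in ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]:
        assert metric in body
```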
On the next commit and push to the main branch, the CI jobs will be automatically triggered: in the Actions tab you will see a new workflow appear, having the same name as your commit comment:
When you click on it, all the build steps are shown:
Credits:
I want to mention some resources that helped me a lot in realizing this work; do not hesitate to visit them:
- Ahmed Besbes’s great post that helped me a lot, especially with the continuous integration part
- Matthew Stewart post about TinyML: https://towardsdatascience.com/tiny-machine-learning-the-next-ai-revolution-495c26463868
- https://blog.dennisokeeffe.com/blog/2021-08-08-pytest-with-github-actions
- https://neptune.ai/blog/deploy-nlp-models-in-production
Conclusion
Thanks a lot for reading 🥰. I invite you to visit my previous posts to see how the deployed transformer model was trained: the first post describes a singular way to use a pre-trained transformer as a feature extractor to train another regressor, and the second post shows how to create a custom transformer from the pre-trained model and how to fine-tune it on a multi-regression task.
I am planning to benchmark open-source solutions to optimize inference, such as TensorRT and ONNX Runtime; I’ll let you know about it 😉