Deploy a Transformer as a REST API
As a data scientist, building an efficient machine learning model is very rewarding, but it is only one part of the problem. To be consumed in the real world, the model has to be accessible to users and developers so that they can take full advantage of what you have built. This can be done in different ways; one common way, which we’ll describe in this post, consists in serving the trained model as a REST API.
In my previous posts PART-1 and PART-2 I showed you how to use transformers in two different ways to model six analysis metrics of argumentative essays: cohesion, syntax, vocabulary, phraseology, grammar, and conventions.
In this post, we will focus on the deployment part: we will walk through all the steps of building and deploying the fine-tuned transformer with FastAPI and Docker.
It will cover the following parts:
- Introduction to different methods to deploy an ML model
- Building a REST API with FastAPI
- Package the API with Docker
- Continuous integration with GitHub Actions
The full project is available on GitHub.
Different ways to deploy a model:
There are different ways to use a trained model in production:
Batch predictions:
The very first canonical usage of a trained machine learning model consists in computing predictions to be injected into a production workflow. A common form of this usage is batch prediction: the option to choose when you process accumulated data flows and have no urgent need for immediate results.
For example, a delivery company may need a monthly forecast of the package flows to deliver in order to optimize its staffing schedules.
Usually, tools like Airflow are used to schedule the pipeline: ingest the new incoming data, use the pre-trained model to make batch predictions, and dump the results onto the data storage (it could be AWS S3, Cloud Storage, etc.).
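As an illustration, here is a minimal sketch of such a batch-scoring pipeline scheduled with Airflow (the task logic, names and schedule are placeholders, not the actual pipeline of this project):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_data():
    ...  # pull the newly accumulated data from storage (placeholder)

def batch_predict():
    ...  # load the pre-trained model and score the ingested batch (placeholder)

def dump_results():
    ...  # write the predictions back to storage, e.g. S3 / Cloud Storage (placeholder)

with DAG(
    dag_id="monthly_batch_predictions",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",  # run once a month, like the delivery-forecast example
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_data", python_callable=ingest_data)
    predict = PythonOperator(task_id="batch_predict", python_callable=batch_predict)
    dump = PythonOperator(task_id="dump_results", python_callable=dump_results)

    ingest >> predict >> dump
```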
Embedded model
During the last few years, great technological advances in ML optimization, microprocessor architectures, software frameworks, and embedded hardware have made it possible to run complex machine learning algorithms on the smallest devices, like micro-controllers.
The field of embedding ML algorithms on such devices is called TinyML, and it is a fast-growing one!
TensorFlow Lite is the most popular framework to deploy models on mobile, micro-controllers and other edge devices. It offers many types of optimizations such as distillation, quantization and weight pruning; you can check this post and this one if you want more details about TensorFlow Lite optimization techniques.
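For instance, post-training quantization with TensorFlow Lite can be as simple as the following sketch (the SavedModel path is a placeholder):

```python
import tensorflow as tf

# Convert a SavedModel to TensorFlow Lite with default post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```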
Micro-service Model Serving
In this architecture, the model is served as a separate micro-service. This type of serving is the most popular one, for many reasons: the model release is very flexible since serving is done outside of the main application, online/real-time predictions become possible, and monitoring is easy.
This type of API is called an inference endpoint; in this post we will use FastAPI as a back-end framework to develop the REST endpoint allowing us to predict writing skills from a POST request.
In the Micro-service Architecture, each application is encapsulated in a Docker container, so that it can be deployed independently of the host environment.
We will explore this micro-service based infrastructure to deploy a fine-tuned deberta model (all the fine-tuning steps are described in my previous post).
As a reminder, the fine-tuned model evaluates writing skills from a given essay: it takes the essay string as input and gives as output six writing-skill metrics: cohesion, syntax, vocabulary, phraseology, grammar, and conventions, each of them ranging from 0 to 5. The model is saved as an artifact on the Weights & Biases platform.
Building a REST API with FastAPI
Let’s start by creating a project repository
Repository template:
The project repository is available on GitHub.
The tree structure of our repository should be as follows:
The `data` and `notebooks` directories are optional here: they were used at the exploratory stage. In the `/src` directory we will mainly use the following scripts:
- `predict.py`: contains the functions needed to predict/score the essay inputs with the transformer model
- `custom_transformer.py`: from this Python file we will import the custom transformer class that defines the fine-tuned model, as well as the custom `torch.utils.data.Dataset` called `EssayIterator` that will be used to make batch predictions (I’ll describe in more detail later how to define the API allowing to get batch predictions)
I invite you to consult my source code here
Some Good Practices:
1. `config.ini` file:
In the `config.ini` file we define all the required config parameters such as paths, model hyper-parameters and target label names (= the writing-skill names); then we use the `configparser` module to read and parse it into a config dictionary, as follows:
import configparser
config = configparser.ConfigParser()
config.read('config.ini')
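Individual parameters can then be read with the usual mapping syntax, for example `config["labels"]["target_names"]` (the section and key names here are purely illustrative; the real ones are those defined in the repository’s `config.ini`).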
For more details about the `configparser` module you can check this doc.
2. Env variables:
In this project we only need the wandb API token to connect to the model artifact (I’ll explain later how to get a wandb API token for free from the W&B platform).
One simple way to set your environment variable is to use the export command:
$ export WANDB_API=VALUE
then use `os.environ.get("WANDB_API")` inside the script to get your variable.
A better way to handle your environment variables is to create a `.env` file to store your secrets/env variables, and use the `dotenv` Python package to load them.
3. `__init__.py` file:
Both the config dictionary and the environment variables are loaded in the `__init__.py` script so that they can later be imported directly from other locations of the project:
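A minimal sketch of what this `__init__.py` could contain, assuming the `config.ini` and `.env` setup described above (the actual file in the repo may differ):

```python
# __init__.py (sketch)
import configparser
import os

from dotenv import load_dotenv

# Load secrets/env variables from the .env file into the environment
load_dotenv()
WANDB_API = os.environ.get("WANDB_API")

# Parse config.ini into a config object importable from anywhere in the project
config = configparser.ConfigParser()
config.read("config.ini")
```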
Now we will focus on the most important part, which contains the main application: `api/main.py`.
Load the model locally:
Before defining anything related to the APIs, we have to retrieve the model artifact from W&B (I made it public). In the build phase (that we’ll see later in the Docker part), the first step consists in connecting to wandb and downloading the model artifact:
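A minimal sketch of this step; the artifact path below is illustrative (the real entity/project/artifact names come from the repo’s configuration):

```python
import os

import wandb

# Authenticate with the WANDB_API token exposed as an environment variable
wandb.login(key=os.environ.get("WANDB_API"))

# Download the (public) model artifact locally; the artifact path is a placeholder
api = wandb.Api()
artifact = api.artifact("my-entity/essay-evaluation/deberta-model:latest")
model_dir = artifact.download(root="artifacts/")
```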
Once the model is downloaded locally, all we have to do is instantiate the custom model using the `FeedBackModel` class defined in `src/custom_transformer.py`, then load the weights from the local PyTorch checkpoint. Moreover, we need to load the corresponding tokenizer to be able to generate predictions with the transformer model: the model takes the tokenizer encodings as input, transforms them into embeddings, then generates the related hidden states (as described in more detail in my previous post).
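In code, this could look roughly like the following; the backbone name, the checkpoint file name and the `FeedBackModel` constructor arguments are assumptions (the real ones are defined in `config.ini` and `src/custom_transformer.py`):

```python
import torch
from transformers import AutoTokenizer

from src.custom_transformer import FeedBackModel  # custom model class from the repo

# Instantiate the custom architecture, then load the fine-tuned weights downloaded from W&B
model = FeedBackModel("microsoft/deberta-v3-base")  # constructor args are an assumption
state_dict = torch.load("artifacts/model.pth", map_location="cpu")  # file name is an assumption
model.load_state_dict(state_dict)
model.eval()

# Load the matching tokenizer to encode the incoming essays
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
```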
Later in `main.py` we will define the API routes as well as their corresponding pydantic models:
Pydantic models
For the requests we will consider two cases:
a- Provide the API with a single essay to score:
- Request: we define the `SingleRequest` body as a subclass of the `BaseModel` class, having a single attribute: `essay`. Later, if the model is extended with new features such as language, number of stop-words, essay length…, the `SingleRequest` model can be modified just by adding the new feature names in the model definition.
- Response: we define the `EssayScores` model containing the `text` field, which corresponds to the essay input from the `SingleRequest` request, plus the corresponding scores returned by the transformer model: `cohesion`, `syntax`, `vocabulary`, `phraseology`, `grammar` and `conventions`, each of them a float.
b- Provide the API with multiple essays (a batch) to score:
- Request: similarly, the `MultipleRequest` model is defined by a field called `essays`, which is a list of strings.
- Response: the multi-text response is defined by the `EssaysScores` model, in which we set a field called `batch` containing a list of the pre-defined `EssayScores` response model.
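Putting it together, a minimal sketch of these pydantic models (field names follow the description above; the actual definitions live in the repo’s `api/main.py`) could be:

```python
from typing import List

from pydantic import BaseModel


class SingleRequest(BaseModel):
    essay: str

class MultipleRequest(BaseModel):
    essays: List[str]

class EssayScores(BaseModel):
    text: str
    cohesion: float
    syntax: float
    vocabulary: float
    phraseology: float
    grammar: float
    conventions: float

class EssaysScores(BaseModel):
    batch: List[EssayScores]
```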
Build REST API
FastAPI has recently become one of the most popular web frameworks used to develop micro-services in Python, for many reasons: combined with uvicorn, we get one of the fastest web servers. Besides, unlike Flask, which is built upon the Web Server Gateway Interface (WSGI), FastAPI can handle asynchronous requests as it is based on the Asynchronous Server Gateway Interface (ASGI).
We’ll define three main routes:
- `/index`: the index route, a GET request that returns by default a sample text example that we can use in the other POST requests.
- `/single_essay`: a POST request allowing to predict a single text input; this API can be the best choice for online applications. The `/single_essay` route function uses the `single_prediction` function defined in the `src/predict.py` file.
- `/multiple_essay`: sometimes we might want to score multiple essays, or different parts of a single essay; instead of requesting the `/single_essay` API for each of them, we implement the `/multiple_essay` API that supports batch predictions. Similarly, the `/multiple_essay` route function uses a `batch_prediction` function imported from the `src/predict.py` file.
The API functions can look like this:
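(Below is a simplified sketch; it assumes the pydantic models above are defined in the same module and that `single_prediction`/`batch_prediction` return the six scores for each essay. The exact code is in the repo.)

```python
from fastapi import FastAPI

from src.predict import single_prediction, batch_prediction  # helpers from the repo

app = FastAPI(title="Essay evaluator")

@app.get("/index")
def index():
    # Return a sample essay that can be reused in the POST requests
    return {"sample_essay": "I think that students would benefit from ..."}

@app.post("/single_essay", response_model=EssayScores)
def score_single_essay(request: SingleRequest):
    # Score a single essay with the fine-tuned transformer
    scores = single_prediction(request.essay)  # assumed to return a dict of the six metrics
    return EssayScores(text=request.essay, **scores)

@app.post("/multiple_essay", response_model=EssaysScores)
def score_multiple_essays(request: MultipleRequest):
    # Batch-score a list of essays
    all_scores = batch_prediction(request.essays)  # assumed to return one dict per essay
    batch = [
        EssayScores(text=essay, **scores)
        for essay, scores in zip(request.essays, all_scores)
    ]
    return EssaysScores(batch=batch)
```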
Now let’s test our app locally:
As a reminder, the FastAPI framework is built on ASGI. For that reason, we will be using uvicorn to serve the app, as it is an ASGI web server.
From the project directory we launch this command:
$ uvicorn api.main:app --reload --port 8000
Then you can use many tools to query, for example, the /single_essay API, such as the curl command or the Postman application.
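Alternatively, a quick Python test with the `requests` library (the essay text below is just a placeholder) could look like this:

```python
import requests

payload = {"essay": "Some essay text to evaluate ..."}
response = requests.post("http://localhost:8000/single_essay", json=payload)
print(response.json())  # expected: the six writing-skill scores plus the original text
```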
Let’s check an essay from an extract of the great Stefan Zweig, taken from here. 👉 It returned pretty good writing skills 💖💖 (all the evaluation metrics are around 4, and the max is 5):
Building the Docker image
Once the app is tested locally, we want to fully package it so that it can run in an isolated environment.
We will use Docker, which allows us to package our repository within a container: basically a virtual runtime that contains everything needed to run our code (installs, libraries, system versions, etc.).
To define the Docker image we need to write the build instructions in a file called `Dockerfile`:
- The first line, `FROM python:3.9`, provides the base image for the container, which will be pulled from Docker Hub.
- Set the container’s working directory to the `/api` dir.
- As uvicorn will run on port 8000, we add the `EXPOSE 8000` instruction to tell Docker that our container listens for traffic on that port.
- As explained previously, we will download the model artifact from W&B and save it locally in the `artifacts/` directory, so we need to:
  1. Create the `./artifacts` directory
  2. Create a user called `api` with the `useradd` command
  3. Grant the `api` user ownership of the artifact directory with the `chown` command
- By default, containers run as root. A safer practice consists in using the `USER` instruction to specify a non-root user for the container; that’s why we added the `USER api` instruction.
- To install dependencies, we copy the `requirements.txt` file into the `api` dir, then run the related pip install command.
- Then we need to `COPY` some elements from other parts of the repo to correctly execute the app: basically the `api` and `src` directories and the `config.ini` file.
Let’s build the image using the `docker build` command.
PS: in the GitHub repo I used the docker-compose tool to build and execute the Docker image; this is optional in our case, as we have only a single container.
$ docker build -t essayevaluator .
In the stdout we see the build process, which looks like this:
Then, to run the Docker image, you need to create a wandb API token:
- Sign up for a free account at https://wandb.ai/site and then login to your wandb account.
- Retrieve your API token directly from https://wandb.ai/authorize
Once you get your API token, export it as an env variable:
$ export WANDB_API=YOUR_WANDB_TOKEN
Then execute the docker image with the following command:
$ docker run -e "WANDB_API=$WANDB_API" -p 8000:8000 -t essayevaluator
Then you can use Postman on http://0.0.0.0:8000/ to test the image.
Continuous integration with GitHub Actions
Now that we have managed to execute the code locally and to create the related Docker image so that it can run on any host, we want to make sure that every time we make changes to our code and commit to the master branch, the API remains functional and no API route shows unexpected behavior: this practice is called Continuous Integration (CI).
In general, CI is a software development best practice for ensuring tasks such as revision control, build automation and automated testing. Many continuous integration tools are widely used, such as Jenkins, TeamCity, etc.
In our case we will combine pytest (for the testing) with GitHub Actions (for CI) to automate testing at each push on the master branch:
1. First, create the testing script and add it to the repo: the test code is located in `/api/test_main.py`; you can check the source code here (a minimal sketch of such a test is shown at the end of this section).
2. Go to the Actions tab and click on "set up a workflow yourself".
3. Describe the CI jobs in the yml file:
At the root dir of the repository, a `.github/workflows` folder will be automatically created, in which we will find a file called `main.yml`: in this file we describe the jobs that will be triggered at each push on the master branch:
- `actions/checkout`: an action to checkout/clone your repo in your workflow
- `actions/setup-python`: an action used to install a specific version of Python (and optionally cache dependencies for pip, pipenv and poetry, which makes the CI much faster)
- The next action installs the dependencies from the `requirements.txt` file
- flake8 action: optional but recommended; it executes flake8 stylistic and logical linting of the Python source files
- The final action runs the `pytest` command, using `WANDB_API` as an environment variable

Beforehand, you will have to save the WANDB_API value as a secret in your repo: go to the Settings tab, click on Secrets in the left tab, then on Actions in the drop-down menu. The GitHub secrets are saved in a key/value format:
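As an illustration of step 1 above, a minimal version of `api/test_main.py` using FastAPI’s `TestClient` could look like this (the real tests in the repo may differ):

```python
# api/test_main.py (sketch)
from fastapi.testclient import TestClient

from api.main import app

client = TestClient(app)


def test_index():
    response = client.get("/index")
    assert response.status_code == 200

def test_single_essay():
    payload = {"essay": "A short essay used only for testing."}
    response = client.post("/single_essay", json=payload)
    assert response.status_code == 200
    body = response.json()
    # The response should contain the six writing-skill scores
    for metric in ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]:
        assert metric in body
```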
On the next commit and push to the main branch, the CI jobs will be automatically triggered: in the Actions tab you will see a new workflow appear, having the same name as your commit comment:
When you click on it, all the build steps are shown:
Credits:
I want to mention some resources that helped me a lot in realizing this work; do not hesitate to visit them:
- Ahmed Besbes’s great post that helped me a lot, especially with the continuous integration part
- Matthew Stewart post about TinyML: https://towardsdatascience.com/tiny-machine-learning-the-next-ai-revolution-495c26463868
- https://blog.dennisokeeffe.com/blog/2021-08-08-pytest-with-github-actions
- https://neptune.ai/blog/deploy-nlp-models-in-production
Conclusion
Thanks a lot for reading 🥰. I invite you to visit my previous posts to see how the deployed transformer model was trained: the first post describes a singular way to use a pre-trained transformer as a feature extractor to train another regressor, and the second post shows how to create a custom transformer from the pre-trained model and how to fine-tune it on a multi-regression task.
I am planning to benchmark open-source solutions to optimize inference, such as TensorRT and ONNX Runtime; I’ll let you know about it 😉