LitServe and LitGPT

Creating LLM apps is one thing, deploying them is another. Things like dockerizing LLMs can be daunting, especially if you need GPU support. To this end the folks at Lightning AI have developed a couple of frameworks which help get things done. I refrain from saying they 'make it easy', since that phrase is so over(ab)used.

LitServe builds on top of FastAPI and wraps the essentials of an AI API.

The following is a basic image upload and classification example using ResNet:

import io

import litserve as ls
import PIL
import torch
import torchvision
from fastapi import File, UploadFile


class ImageClassifierAPI(ls.LitAPI):
    def setup(self, device):
        # load the pretrained ResNet-152 weights together with the matching preprocessing
        weights = torchvision.models.ResNet152_Weights.DEFAULT
        self.image_processing = weights.transforms()
        self.model = torchvision.models.resnet152(weights=weights).eval().to(device)
        print("Device:", device)

    def decode_request(self, request: UploadFile = File(...)) -> torch.Tensor:
        # turn the uploaded file into a preprocessed batch of size 1
        contents = request.file.read()
        pil_image = PIL.Image.open(io.BytesIO(contents)).convert("RGB")
        processed_image = self.image_processing(pil_image)
        return processed_image.unsqueeze(0).to(self.device)

    def predict(self, x: torch.Tensor) -> str:
        with torch.inference_mode():
            outputs = self.model(x)
            _, predictions = torch.max(outputs, 1)
        prediction = predictions.tolist()
        class_idx = prediction[0]
        # map the class index to its human-readable ImageNet category
        weights = torchvision.models.ResNet152_Weights.DEFAULT
        return weights.meta["categories"][class_idx]

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    api = ImageClassifierAPI()
    server = ls.LitServer(api)
    server.run(port=8000)
This will:
- automatically download the necessary model files
- generate a Swagger (OpenAPI) spec for the API based on the data types
- make use of the available GPU (or MPS if on Mac)
- run things with uvicorn
and more. To achieve the same with standard FastAPI would take a lot more code and know-how.
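Once the server is running, the endpoint can be exercised with any HTTP client. Below is a minimal sketch using requests; it assumes the default /predict route, that the uploaded file is sent under a multipart field named request (matching the parameter of decode_request), and cat.jpg is just a placeholder file name:

import requests

# post an image to the classifier served above (assumed to be running locally on port 8000)
with open("cat.jpg", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"request": f},
    )

# encode_response wraps the category name, e.g. {"output": "tabby"}
print(response.json())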
There is also automatic dockerization:

litserve dockerize api.py --port 8000 --gpu

creates the Dockerfile for you. The --gpu switch is optional and you can run things on CPU, though it will be slower. Within the container the model will be downloaded, which makes the container a standalone LLM API.
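From there it is the standard Docker workflow; a minimal sketch, with the image name being an arbitrary example:

docker build -t image-classifier .
docker run --gpus all -p 8000:8000 image-classifier

The --gpus all flag only makes sense if the image was generated with GPU support and the host has the NVIDIA container toolkit; drop it to run on CPU.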
A bot based on Qwen can also be assembled:
import torch
import litserve as ls
from transformers import AutoTokenizer, AutoModelForCausalLM


class QwenAPI(ls.LitAPI):
    def __init__(
        self,
        model_name: str = "Qwen/Qwen2.5-7B",
        max_new_tokens: int = 500,
    ):
        super().__init__()
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens

    def setup(self, device):
        # load the model in bfloat16 and let device_map place it on the available hardware
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name, torch_dtype=torch.bfloat16, device_map="auto"
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, legacy=False)

    def decode_request(self, request: str) -> str:
        return request

    def predict(self, text: str) -> str:
        # tokenize the prompt and generate a continuation
        inputs = self.tokenizer(text, return_tensors="pt")
        input_ids = inputs.to(self.model.device)["input_ids"]
        output_ids = self.model.generate(
            input_ids, max_new_tokens=self.max_new_tokens
        )[0]
        text = self.tokenizer.decode(output_ids, skip_special_tokens=True)
        return text

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    api = QwenAPI()
    server = ls.LitServer(api)
    server.run(port=8000)
To pre-download a model you can use the (fast) huggingface_hub:
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-7B")
or via the HF CLI:
huggingface-cli download Qwen/Qwen2.5-7B
and it will return the location of the download.
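Since snapshot_download returns the local snapshot directory, you can capture it and reuse the path directly; a small sketch:

from huggingface_hub import snapshot_download

# the return value is the path of the local snapshot folder
local_dir = snapshot_download(repo_id="Qwen/Qwen2.5-7B")
print(local_dir)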
A pre-downloaded model can be executed with something like the following:
import os

from transformers import AutoTokenizer, AutoModelForCausalLM

# pre-download with: huggingface-cli download Qwen/Qwen2.5-0.5B --local-dir ~/temp/qwen
model_path = os.path.expanduser("~/temp/qwen")  # expand ~ so the local path resolves
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=False)

text = "Hello, my name is Anna. What is your name?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=500,
    pad_token_id=tokenizer.eos_token_id,
)[0]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
All of this does not replace the grand power of dedicated services like Azure AI, OpenAI or Vertex AI, but if you really need an independent, self-contained container, you can build one, and LitServe simplifies the job.