LitServe and LitGPT

LLM
Flexible, high-throughput serving of LLMs.

Creating LLM apps is one thing; deploying them is another. Things like dockerizing LLMs can be daunting, especially if you need GPU support. To this end, the folks at Lightning AI have developed a couple of frameworks that help get things done. I refrain from saying they 'make it easy', as that phrase is overused (and abused) in so many ways.

LitServe builds on top of FastAPI and wraps the essentials of serving an AI model as an API.

The following is a basic image upload and classification example using ResNet:

import io

import litserve as ls
import torch
import torchvision
from fastapi import UploadFile, File
from PIL import Image


class ImageClassifierAPI(ls.LitAPI):
    def setup(self, device):
        # Load the pretrained ResNet-152 weights together with their preprocessing transforms
        weights = torchvision.models.ResNet152_Weights.DEFAULT
        self.image_processing = weights.transforms()
        self.model = torchvision.models.resnet152(
            weights=weights).eval().to(device)
        print("Device:", device)

    def decode_request(self, request: UploadFile = File(...)) -> torch.Tensor:
        # Read the uploaded image, convert it to RGB and apply the model's transforms
        contents = request.file.read()
        pil_image = Image.open(io.BytesIO(contents)).convert("RGB")
        processed_image = self.image_processing(pil_image)
        return processed_image.unsqueeze(0).to(self.device)

    def predict(self, x: torch.Tensor) -> str:
        # Forward pass, then map the top class index to its ImageNet label
        with torch.inference_mode():
            outputs = self.model(x)
            _, predictions = torch.max(outputs, 1)
            prediction = predictions.tolist()
        class_idx = prediction[0]
        weights = torchvision.models.ResNet152_Weights.DEFAULT
        return weights.meta["categories"][class_idx]

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    api = ImageClassifierAPI()
    server = ls.LitServer(api)
    server.run(port=8000)

This will spin up an inference server on port 8000, expose a /predict endpoint, take care of device placement, request decoding and response encoding, and more. To achieve the same with plain FastAPI would take a lot more code and know-how.
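
As a quick check, assuming the server is running locally and you have an image file at hand, you can post it to the default /predict route; the multipart field name here is assumed to match the request parameter of decode_request:

import requests

# hypothetical local test: adjust host, port and image path to your setup
with open("cat.jpg", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"request": f}
    )

print(response.json())  # something like {"output": "tabby"}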

There is also an automatic dockerization with

litserve dockerize api.py --port 8000 --gpu

creating the Dockerfile for you. The --gpu switch is optional; you can run things on CPU, though it will be slower. The model is downloaded inside the container, which makes the container a standalone LLM API.
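
A minimal sketch of building and running the generated image (the tag name is my own choice; --gpus all assumes the NVIDIA Container Toolkit is installed):

docker build -t image-classifier-api .
docker run --gpus all -p 8000:8000 image-classifier-api
# drop --gpus all to run on CPU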

A bot based on Qwen can also be assembled:

import torch
import litserve as ls
from transformers import AutoTokenizer, AutoModelForCausalLM


class QwenAPI(ls.LitAPI):

    def __init__(
        self,
        model_name: str = "Qwen/Qwen2.5-7B",
        max_new_tokens: int = 500
    ):
        super().__init__()
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens

    def setup(self, device):
        # Load the model in bfloat16 and let device_map place it on the available hardware
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name, torch_dtype=torch.bfloat16, device_map="auto"
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name, legacy=False)

    def decode_request(self, request: dict) -> str:
        # Expects a JSON body like {"prompt": "..."}
        return request["prompt"]

    def predict(self, text: str) -> str:
        inputs = self.tokenizer(text, return_tensors='pt').to(self.model.device)
        output_ids = self.model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            pad_token_id=self.tokenizer.eos_token_id,
            max_new_tokens=self.max_new_tokens
        )[0]
        text = self.tokenizer.decode(output_ids, skip_special_tokens=True)
        return text

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    api = QwenAPI()
    server = ls.LitServer(api)
    server.run(port=8000)
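
Again as a quick local test, assuming the server is up on port 8000 and the JSON body format expected by decode_request above:

import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Hello, my name is Anna. What is your name?"}
)
print(response.json()["output"])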

To pre-download a model you can use the (fast) huggingface_hub library:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="Qwen/Qwen2.5-7B")

or via the HF CLI:

huggingface-cli download Qwen/Qwen2.5-7B

Both return the local path of the downloaded snapshot.
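
Since snapshot_download returns that path, you can pass it straight to from_pretrained; a minimal sketch:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# downloads the snapshot (or reuses the cached copy) and returns its local path
local_path = snapshot_download(repo_id="Qwen/Qwen2.5-7B")

model = AutoModelForCausalLM.from_pretrained(local_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(local_path, legacy=False)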

A pre-downloaded model can be executed with something like the following:

import os

from transformers import AutoTokenizer, AutoModelForCausalLM


# pre-download with: huggingface-cli download Qwen/Qwen2.5-0.5B --local-dir ~/temp/qwen
model_path = os.path.expanduser("~/temp/qwen")  # from_pretrained does not expand "~" itself
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=False)

text = "Hello, my name is Anna. What is your name?"
inputs = tokenizer(text, return_tensors='pt').to(model.device)
output_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=500
)[0]
output = tokenizer.decode(output_ids, skip_special_tokens=True)

print(output)

All of this does not replace the raw power of dedicated services like Azure AI, OpenAI or Vertex AI, but if you really need an independent, self-contained container, LitServe makes it straightforward.