Jae-Kyung Cho Being unique is better than being perfect

MLOps study - Raviraja Week 0: Pytorch Lightning

I’m going to study MLOps by referring to Raviraja’s blog posts. Starting is half the battle!! :D

Week 0 sets up the environment for studying MLOps. Since Raviraja researches NLP, keep in mind that the setup leans toward NLP, specifically a small text-classification task, and move on. (Later I’ll extend this to MLOps that uses RL.) Everything is built on the Pytorch-Lightning library, which is a kind of wrapper around Pytorch :D

The Pytorch Lightning workflow here breaks down into 4 parts: the DataModule, the LightningModule, the Trainer, and a separate Inference class. Let’s look at them in turn.




DataModule

Pytorch Lightning’s DataModule plays a role similar to Pytorch’s DataLoader, but covers more ground: all the preprocessing that normally happens before the data reaches a DataLoader lives inside the module as well.

< methods you need to define >

  • prepare_data -> download data
  • setup -> preprocess data
  • train_dataloader, val_dataloader, test_dataloader -> data loaders

< tasks performed inside the DataModule >

  • Download / tokenize / process
  • Clean and save to disk
  • Load inside Dataset
  • Apply transforms (rotate, tokenize, etc…)
  • Wrap inside a DataLoader (Pytorch)
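
import torch
import pytorch_lightning as pl
from datasets import load_dataset
from transformers import AutoTokenizer


class DataModule(pl.LightningDataModule):
    def __init__(self, model_name="google/bert_uncased_L-2_H-128_A-2", batch_size=32):
        super().__init__()

        self.batch_size = batch_size
        # tokenizer that matches the pretrained Transformer (BERT) checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)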

    def prepare_data(self):
        # download and cache the dataset only; Lightning may run this hook on
        # a single process, so no state is assigned to self here
        load_dataset("glue", "cola")

    def tokenize_data(self, example):
        # tokenize, then pad or truncate every sentence to 256 tokens
        return self.tokenizer(
            example["sentence"],
            truncation=True,
            padding="max_length",
            max_length=256,
        )

    def setup(self, stage=None):
        if stage == "fit" or stage is None:
            # read the cached dataset and keep the splits on the module
            cola_dataset = load_dataset("glue", "cola")
            self.train_data = cola_dataset["train"]
            self.val_data = cola_dataset["validation"]

            self.train_data = self.train_data.map(self.tokenize_data, batched=True)
            self.train_data.set_format(
                type="torch", columns=["input_ids", "attention_mask", "label"]
            )

            self.val_data = self.val_data.map(self.tokenize_data, batched=True)
            self.val_data.set_format(
                type="torch", columns=["input_ids", "attention_mask", "label"]
            )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_data, batch_size=self.batch_size, shuffle=True
        )

    def val_dataloader(self):
        return torch.utils.data.DataLoader(
            self.val_data, batch_size=self.batch_size, shuffle=False
        )
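
To see what `truncation=True`, `padding="max_length"` and `max_length` do in `tokenize_data`, here is a minimal sketch in plain Python. It is not the Hugging Face implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, a pad id of 0 is assumed, and a real tokenizer also takes care of special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids plus the matching attention_mask."""
    # truncation: drop everything past max_length
    kept = token_ids[:max_length]
    # padding: fill the remaining positions with pad_id (assumed to be 0)
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# made-up token ids, shorter and longer than max_length
short = pad_or_truncate([11, 12, 13], max_length=5)
too_long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short)
print(too_long)
```

Since every example comes out with the same length, the DataLoader can stack them into one tensor per batch without any extra collate logic.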




LightningModule

Just as a plain Pytorch model subclasses torch.nn.Module, a Pytorch-Lightning model subclasses pl.LightningModule. With nn.Module you only had to define forward; here you need to define a few additional methods. (Document)

< methods you need to define >

  • forward -> model forward
  • training_step -> loss computation for one batch (the Trainer runs the backward pass and the optimizer update)
  • validation_step
  • test_step (optional)
  • configure_optimizers -> Optimizer initialization
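
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from sklearn.metrics import accuracy_score
from transformers import AutoModel


class ColaModel(pl.LightningModule):
    def __init__(self, model_name="google/bert_uncased_L-2_H-128_A-2", lr=1e-2):
        super().__init__()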
        self.save_hyperparameters()

        self.bert = AutoModel.from_pretrained(model_name)
        self.W = nn.Linear(self.bert.config.hidden_size, 2)
        self.num_classes = 2

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        h_cls = outputs.last_hidden_state[:, 0]
        logits = self.W(h_cls)
        return logits

    def training_step(self, batch, batch_idx):
        logits = self.forward(batch["input_ids"], batch["attention_mask"])
        loss = F.cross_entropy(logits, batch["label"])
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch["input_ids"], batch["attention_mask"])
        loss = F.cross_entropy(logits, batch["label"])
        _, preds = torch.max(logits, dim=1)
        val_acc = accuracy_score(preds.cpu(), batch["label"].cpu())
        val_acc = torch.tensor(val_acc)
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", val_acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams["lr"])
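
The two numeric steps inside `training_step` and `validation_step` are the cross-entropy loss and the argmax prediction. As a rough sketch in plain Python of what `F.cross_entropy(logits, labels)` (with its default mean reduction) and `torch.max(logits, dim=1)` compute, using made-up logits; `cross_entropy` and `predict_classes` are hypothetical helpers written for this note, not library functions:

```python
import math


def cross_entropy(logits, labels):
    """Mean over the batch of -log softmax(logits)[label]."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp, shifted by the row maximum for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)


def predict_classes(logits):
    """Index of the largest logit in each row (argmax over classes)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# made-up logits for a batch of 3 sentences and 2 classes
logits = [[2.0, -1.0], [0.3, 0.1], [-0.5, 1.5]]
labels = [0, 1, 1]

loss = cross_entropy(logits, labels)
preds = predict_classes(logits)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(loss, preds, accuracy)
```

The second sentence is misclassified here, so the accuracy is 2 out of 3 while the loss stays positive; this is the same pair of numbers that `validation_step` logs as `val_loss` and `val_acc`.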




Trainer

The Trainer takes the DataModule and the Pytorch-Lightning model and runs the training loop. You could see it as playing a role similar to Tensorflow’s Session.

< examples of options the Trainer can use >

  • logging
  • gradient accumulation
  • half precision training
  • distributed computing

< Loggers >

  • TensorboardLogger
  • WandbLogger

< Callbacks > Documents

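import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

cola_data = DataModule()
cola_model = ColaModel()

checkpoint_callbacks = [
    # save the checkpoint with the lowest validation loss under ./models
    ModelCheckpoint(dirpath="./models", monitor="val_loss", mode="min"),
    # stop once val_loss has failed to improve for 3 validation checks in a row
    EarlyStopping(monitor="val_loss", patience=3, verbose=True, mode="min"),
]

trainer = pl.Trainer(
    # `gpus` is the argument in the Lightning version used here; the 2.x
    # releases replaced it with `accelerator=` / `devices=`
    gpus=(1 if torch.cuda.is_available() else 0),
    max_epochs=1,
    fast_dev_run=False,  # True: run a single train and validation batch, for debugging
    logger=pl.loggers.TensorBoardLogger("logs/", name="cola", version=1),  # directory: logs/cola/version_1
    # logger = pl.loggers.WandbLogger(name='cola',project='pytorchlightning')
    callbacks=checkpoint_callbacks,
)
trainer.fit(cola_model, cola_data)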
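
The `EarlyStopping(monitor="val_loss", patience=3, mode="min")` callback above follows a simple rule: stop once the monitored value has gone `patience` validation checks in a row without improving. A minimal sketch of that rule in plain Python, ignoring options such as `min_delta`; `first_stop_epoch` is a hypothetical helper and the loss values are made up:

```python
def first_stop_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive checks without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:  # mode="min": lower is better
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# made-up validation losses, one per epoch
improving = [0.90, 0.80, 0.70, 0.65]
plateau = [0.90, 0.80, 0.85, 0.82, 0.81, 0.70]
print(first_stop_epoch(improving))
print(first_stop_epoch(plateau))
```

With `max_epochs=1` as in the Trainer above, and assuming the default schedule of one validation run per epoch, the rule never gets a chance to trigger; it only starts to matter once `max_epochs` is raised.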




Inference

MLOps keeps the model’s Training and Inference code in separate modules. This is because, even while training is still in progress on the server, you need to be able to freeze a trained model, manage its versions, and debug it.

< methods you need to define >

  • predict

< tasks performed inside Inference >

  • Load the trained model
  • Get the input
  • Convert the input in the required format
  • Get the predictions

import torch


class ColaPredictor:
    def __init__(self, model_path):
        self.model_path = model_path
        # loading the trained model
        self.model = ColaModel.load_from_checkpoint(model_path)
        # keep the model in eval mode and disable gradients
        self.model.eval()
        self.model.freeze()
        # the DataModule is reused here only for its tokenizer
        self.processor = DataModule()
        # logits[0] is 1-D, so dim=0 normalizes over the two classes
        self.softmax = torch.nn.Softmax(dim=0)
        self.labels = ["unacceptable", "acceptable"]

    def predict(self, text):
        # text => run time input
        inference_sample = {"sentence": text}
        # tokenizing the input
        processed = self.processor.tokenize_data(inference_sample)
        # add a batch dimension and run the model on its own device
        logits = self.model(
            torch.tensor([processed["input_ids"]], device=self.model.device),
            torch.tensor([processed["attention_mask"]], device=self.model.device),
        )
        scores = self.softmax(logits[0]).tolist()
        # pair each class label with its probability
        return [
            {"label": label, "score": score}
            for score, label in zip(scores, self.labels)
        ]
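
The post-processing at the end of `predict` turns two logits into labelled probabilities. A minimal sketch of that step in plain Python, with made-up logits; `to_predictions` is a hypothetical helper, and picking the top label with `max` is an extra step that the class above does not perform:

```python
import math

LABELS = ["unacceptable", "acceptable"]


def to_predictions(logits, labels=LABELS):
    """Softmax the logits and pair each probability with its label."""
    m = max(logits)  # shift by the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [{"label": lab, "score": e / total} for lab, e in zip(labels, exps)]


predictions = to_predictions([-1.2, 2.3])  # made-up logits for one sentence
best = max(predictions, key=lambda p: p["score"])
print(predictions)
print(best["label"])
```

The scores sum to 1, so a caller can read them as the model's confidence in each label, or keep only the highest-scoring entry as shown with `best`.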

Honestly, it doesn’t feel like that big of a change from plain Pytorch, but they say that if Pytorch is the ice cream, then Pytorch Lightning is the cherry on top. I’m still not sure which of these features count as MLOps, but given how well it fits with Pytorch, it seems like I’ll be able to use them much more simply :O

Download the ipynb file



