MLOps study - Raviraja Week 0: PyTorch Lightning
11 Oct 2022
I’m going to study MLOps by working through Raviraja’s blog posts. Starting is half the battle!! :D
Week 0 sets up the environment for studying MLOps. Raviraja works on NLP, so keep in mind that the setup leans toward NLP and, to some extent, classification. (Later I plan to extend this to MLOps that uses RL.) The code is built on the PyTorch Lightning library, which is a kind of wrapper around PyTorch :D
A PyTorch Lightning project broadly consists of 4 parts. Let’s look at them in turn.
DataModule
PyTorch Lightning uses a DataModule, which plays a role similar to PyTorch’s DataLoader. Data usually has to be downloaded and preprocessed before it reaches a DataLoader, and you can think of the DataModule as one class that contains all of those steps.
< methods you need to define >
- prepare_data -> download data
- setup -> preprocess data
- train_dataloader, val_dataloader, test_dataloader -> data loaders
< tasks performed inside the DataModule >
- Download / tokenize / process
- Clean and save to disk
- Load inside Dataset
- Apply transforms (rotate, tokenize, etc…)
- Wrap inside a DataLoader (Pytorch)
import torch
import pytorch_lightning as pl
from datasets import load_dataset
from transformers import AutoTokenizer


class DataModule(pl.LightningDataModule):
    def __init__(self, model_name="google/bert_uncased_L-2_H-128_A-2", batch_size=32):
        super().__init__()
        self.batch_size = batch_size
        # Tokenizer that matches the pretrained Transformer (BERT) model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def prepare_data(self):
        # Download the CoLA dataset from the GLUE benchmark.
        # Note: Lightning advises against assigning state in prepare_data,
        # since it runs on a single process; this is fine for single-device training.
        cola_dataset = load_dataset("glue", "cola")
        self.train_data = cola_dataset["train"]
        self.val_data = cola_dataset["validation"]

    def tokenize_data(self, example):
        # Tokenize the sentence(s), truncating or padding to 256 tokens
        return self.tokenizer(
            example["sentence"],
            truncation=True,
            padding="max_length",
            max_length=256,
        )

    def setup(self, stage=None):
        if stage == "fit" or stage is None:
            # Tokenize both splits and expose only the needed columns as tensors
            self.train_data = self.train_data.map(self.tokenize_data, batched=True)
            self.train_data.set_format(
                type="torch", columns=["input_ids", "attention_mask", "label"]
            )
            self.val_data = self.val_data.map(self.tokenize_data, batched=True)
            self.val_data.set_format(
                type="torch", columns=["input_ids", "attention_mask", "label"]
            )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_data, batch_size=self.batch_size, shuffle=True
        )

    def val_dataloader(self):
        return torch.utils.data.DataLoader(
            self.val_data, batch_size=self.batch_size, shuffle=False
        )
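As a rough illustration of what truncation=True and padding="max_length" do in tokenize_data, here is a minimal sketch in plain Python. The function pad_or_truncate and the token ids are made up for illustration. A real tokenizer also takes care of special tokens when truncating, which this toy version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop any tokens beyond max_length.
    ids = list(token_ids[:max_length])
    # The attention mask is 1 for real tokens and 0 for padding.
    mask = [1] * len(ids)
    # Padding: fill the remainder with pad_id.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence gets padded, a long one gets cut (ids are hypothetical).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the DataLoader can stack them into one batch tensor, and the attention mask tells the model which positions to ignore.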
LightningModule
Just as a model in plain PyTorch inherits from torch.nn.Module, a PyTorch Lightning model inherits from pl.LightningModule. Before, you only had to define forward, but here you need to define a few additional methods. (See the LightningModule documentation.)
< methods you need to define >
- forward -> model forward
- training_step -> loss computation for one training batch (the Trainer then runs the backward pass and the optimizer update)
- validation_step
- test_step (optional)
- configure_optimizers -> Optimizer initialization
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from sklearn.metrics import accuracy_score
from transformers import AutoModel


class ColaModel(pl.LightningModule):
    def __init__(self, model_name="google/bert_uncased_L-2_H-128_A-2", lr=1e-2):
        super().__init__()
        # Store model_name and lr in self.hparams (and in checkpoints)
        self.save_hyperparameters()
        self.bert = AutoModel.from_pretrained(model_name)
        # Linear classification head on top of the [CLS] representation
        self.W = nn.Linear(self.bert.config.hidden_size, 2)
        self.num_classes = 2

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Hidden state of the first token ([CLS]) for each sequence
        h_cls = outputs.last_hidden_state[:, 0]
        logits = self.W(h_cls)
        return logits

    def training_step(self, batch, batch_idx):
        logits = self.forward(batch["input_ids"], batch["attention_mask"])
        loss = F.cross_entropy(logits, batch["label"])
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch["input_ids"], batch["attention_mask"])
        loss = F.cross_entropy(logits, batch["label"])
        # Predicted class = index of the largest logit
        _, preds = torch.max(logits, dim=1)
        # accuracy_score expects (y_true, y_pred)
        val_acc = accuracy_score(batch["label"].cpu(), preds.cpu())
        val_acc = torch.tensor(val_acc)
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", val_acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams["lr"])
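To make the two quantities logged in validation_step more concrete, here is a small plain-Python sketch of cross-entropy and accuracy computed from logits. The helper names and the logit values are hypothetical, and this is only meant to mirror what F.cross_entropy (with its default mean reduction) and the argmax-based accuracy compute, not to replace them.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one example: logsumexp(logits) - logits[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def accuracy(batch_logits, labels):
    """Fraction of examples whose largest logit sits at the true label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up logits for a batch of three sentences and their true labels.
batch_logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.2]]
labels = [0, 1, 1]

# Mean loss over the batch, plus the batch accuracy.
mean_loss = sum(cross_entropy(l, y) for l, y in zip(batch_logits, labels)) / len(labels)
val_acc = accuracy(batch_logits, labels)
print(mean_loss, val_acc)
```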
Trainer
The Trainer takes the DataModule and the LightningModule and runs the training loop with them. You can think of it as playing a role loosely similar to TensorFlow’s Session.
< examples of options the Trainer can use >
- logging
- gradient accumulation
- half precision training
- distributed computing
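Of these options, gradient accumulation is the one I find easiest to show with numbers. The sketch below uses a toy one-parameter model in plain Python (the data, the helper mse_grad, and the learning rate are all made up) to show that accumulating scaled gradients over two micro-batches gives the same gradient as one larger batch. In Lightning this idea is exposed through the Trainer's accumulate_grad_batches argument; the code here is only the general idea, not Lightning's implementation.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# One large batch of 4 examples.
full_grad = mse_grad(w, data)

# Two micro-batches of 2 examples, accumulated before a single update.
accumulate = 2
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro in micro_batches:
    # Scale each micro-batch gradient so the sum matches the full-batch mean.
    accumulated_grad += mse_grad(w, micro) / accumulate

lr = 0.01
w_after_step = w - lr * accumulated_grad  # one update per `accumulate` micro-batches
print(full_grad, accumulated_grad)
```

The practical point is memory: only one micro-batch has to fit on the device at a time, while the update behaves like one with a larger batch.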
< Loggers >
- TensorBoardLogger
- WandbLogger
< Callbacks > (see the Callbacks documentation)
- ModelCheckpoint
- EarlyStopping
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

cola_data = DataModule()
cola_model = ColaModel()

checkpoint_callbacks = [
    # Save the checkpoint with the lowest validation loss
    ModelCheckpoint(dirpath="./models", monitor="val_loss", mode="min"),
    # Stop when val_loss has not improved for 3 validation checks
    EarlyStopping(monitor="val_loss", patience=3, verbose=True, mode="min"),
]

trainer = pl.Trainer(
    # Note: `gpus` was removed in Lightning 2.0; newer versions use accelerator/devices
    gpus=(1 if torch.cuda.is_available() else 0),
    max_epochs=1,
    fast_dev_run=False,  # True: run one training and one validation batch, for debugging
    logger=pl.loggers.TensorBoardLogger("logs/", name="cola", version=1),  # directory: logs/cola
    # logger=pl.loggers.WandbLogger(name="cola", project="pytorchlightning")
    callbacks=checkpoint_callbacks,
)
trainer.fit(cola_model, cola_data)
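The EarlyStopping callback above is configured with monitor="val_loss", patience=3 and mode="min". A simplified sketch of that patience logic in plain Python looks like this. The function stopping_epoch and the loss values are hypothetical, and Lightning's actual callback has more options and checks, so treat this as the general idea only.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # validation checks since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value (0.60) is never beaten afterwards.
losses = [0.70, 0.62, 0.60, 0.61, 0.63, 0.60, 0.65]
stop_at = stopping_epoch(losses, patience=3)
print(stop_at)
```

With these numbers the counter starts after the third value, and the third check without a strict improvement ends the run.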
Inference
In MLOps, the model’s training and inference code are kept as separate modules. This is because, even while training is still in progress on the server, you need to be able to freeze a trained model, manage its versions, and debug it.
< methods you need to define >
- predict
< tasks performed inside Inference >
- Load the trained model
- Get the input
- Convert the input in the required format
- Get the predictions
import torch


class ColaPredictor:
    def __init__(self, model_path):
        self.model_path = model_path
        # Load the trained model from a checkpoint
        self.model = ColaModel.load_from_checkpoint(model_path)
        # Keep the model in eval mode with frozen parameters
        self.model.eval()
        self.model.freeze()
        # The DataModule is reused here only for its tokenizer
        self.processor = DataModule()
        self.softmax = torch.nn.Softmax(dim=0)
        self.labels = ["unacceptable", "acceptable"]

    def predict(self, text):
        # text => run time input
        inference_sample = {"sentence": text}
        # Tokenize the input
        processed = self.processor.tokenize_data(inference_sample)
        # Add a batch dimension and run the model
        logits = self.model(
            torch.tensor([processed["input_ids"]]),
            torch.tensor([processed["attention_mask"]]),
        )
        # Convert the logits of the single example into probabilities
        scores = self.softmax(logits[0]).tolist()
        predictions = []
        for score, label in zip(scores, self.labels):
            predictions.append({"label": label, "score": score})
        return predictions
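The last step of predict turns two logits into a list of label/score pairs. Here is that conversion in plain Python, as a minimal sketch: the logit values are made up, and picking the top entry with max is just one way a caller might use the result.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_names = ["unacceptable", "acceptable"]
logits = [-1.0, 2.0]  # hypothetical model output for one sentence

scores = softmax(logits)
predictions = [{"label": name, "score": s} for name, s in zip(label_names, scores)]

# A caller that only wants one answer can take the highest-scoring entry.
best = max(predictions, key=lambda p: p["score"])
print(predictions)
print(best["label"])
```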
Honestly, it doesn’t look like that big of a change from plain PyTorch, but as the saying goes, if PyTorch is the ice cream, then PyTorch Lightning is the cherry on top. I’m still not sure which of these features actually count as MLOps, but given how compatible it is with PyTorch, I expect to be able to use them much more simply :O