Jae-Kyung Cho Being unique is better than being perfect

MLOps study - Raviraja Week 3: DVC

Week 3 is about DVC (Data Version Control). Ordinary source code is managed with Git, but ML models and datasets are usually too large to commit to Git directly. The key idea of DVC is to represent each large file with a very small metafile that Git can track without any trouble, while the large file itself is stored separately in remote storage such as Google Drive.
With Git and DVC together, we can 1) manage versions, 2) handle large files, and 3) make the project reproducible.
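
To make the "small metafile" idea concrete, here is a toy sketch that uses only standard command-line tools. It is not DVC's real file format: the file names (model.bin, model.bin.ptr) and the pointer layout are made up for illustration, and it assumes md5sum is available (GNU coreutils). It only shows that a content hash plus a size is enough to identify a big file in a few dozen bytes.

```shell
# Illustration only: this mimics the idea behind a DVC metafile,
# not the actual format DVC writes.
workdir=$(mktemp -d)
model="$workdir/model.bin"          # hypothetical stand-in for a trained model
pointer="$workdir/model.bin.ptr"    # hypothetical pointer file name

# Create a 5 MB dummy "model".
dd if=/dev/zero of="$model" bs=1024 count=5120 2>/dev/null

# Record the content hash and the size in a tiny text file.
hash=$(md5sum "$model" | cut -d' ' -f1)
model_bytes=$(wc -c < "$model" | tr -d ' ')
printf 'md5: %s\nsize: %s\npath: %s\n' "$hash" "$model_bytes" "model.bin" > "$pointer"

# Compare the two sizes: only the small file would go into Git.
pointer_bytes=$(wc -c < "$pointer" | tr -d ' ')
echo "model:   $model_bytes bytes"
echo "pointer: $pointer_bytes bytes"
```

The pointer changes whenever the model's contents change, so committing the pointer to Git is what gives the large file a version history.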




Start DVC

python -m pip install dvc
dvc init

One thing to be careful about: run dvc init from the top-level folder of the project, that is, the directory that contains the .git folder. Once the .dvc folder and the .dvcignore file have been created, you’re done!
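
If you are not sure whether you are in the right folder, a small helper like the one below can find it for you. This is just a sketch: find_git_root is a name I made up, and it simply walks upward until it sees a .git entry.

```shell
# Walk upward from a start directory (default: current directory)
# until a directory containing a .git entry is found, then print it.
find_git_root() {
    dir=$(cd "${1:-.}" && pwd) || return 1
    while [ "$dir" != "/" ]; do
        if [ -e "$dir/.git" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        dir=$(dirname "$dir")
    done
    echo "No .git folder found above the start directory." >&2
    return 1
}

# Typical use (left commented out so the snippet has no side effects):
# root=$(find_git_root) && cd "$root" && dvc init
```

Git can answer the same question itself with git rev-parse --show-toplevel; the function above just makes the logic visible.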




Configuring remote storage

DVC stores large files such as model parameters or datasets in a separate remote storage rather than on GitHub. Many backends can be used, such as Amazon S3 or Google Drive, but here we’ll look at how to store files directly on a server you already use, over SSH.

First, to use SSH storage in DVC you need the dvc-ssh module. After that, you just save the SSH connection details. The important thing is that the password must be set with the --local option. With --local, the value is written to .dvc/config.local, which Git does not track; without it, the password goes into .dvc/config and your server password could end up published on Git, so be careful!!

python -m pip install dvc-ssh
dvc remote add -d storage ssh://xxx.xxx.xxx.xxx/<data_dir>
dvc remote default storage
dvc remote modify <the remote name you set> user <server user name>
dvc remote modify <the remote name you set> port <the port you opened>
dvc remote modify --local <the remote name you set> password <server password>

ref: Using an SSH storage in DVC
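
As a quick safety check before committing, something like the sketch below can confirm that no password line has slipped into the tracked config. The function name check_no_tracked_password is made up for this post, and it assumes the default locations: .dvc/config (tracked by Git) and .dvc/config.local (not tracked).

```shell
# Fail if the Git-tracked DVC config file contains a "password = ..." entry.
# The argument is the config file to inspect (default: .dvc/config).
check_no_tracked_password() {
    config="${1:-.dvc/config}"
    if [ -f "$config" ] && grep -Eq '^[[:space:]]*password[[:space:]]*=' "$config"; then
        echo "Password found in $config; set it again with --local instead." >&2
        return 1
    fi
    return 0
}

# Typical use from the repository root (commented out to avoid side effects):
# check_no_tracked_password && git add .dvc/config && git commit -m "Configure DVC remote"
```

If the check fails, removing the password line from .dvc/config and running the dvc remote modify --local command again puts the secret where it belongs.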




Saving model to the remote storage

dvc add <model name>
dvc push

When you run dvc add, a <model name>.dvc file and a .gitignore file are newly created. If you want to manage the dvc files by putting them in one folder, you can do as follows.

dvc add <model name> --file <dvc folder>/<dvc file name>
dvc push <dvc folder>/<dvc file name>

The interesting thing is that when you look inside the storage, you won’t find the file under its own name; there is only a bunch of strangely named folders. (I struggled with this for quite a while!!!) That is expected: DVC stores each file under a path derived from the hash of its contents, so the folder and file names are pieces of a hash rather than the original file name. The model has been saved properly, so you can rest assured!! To check, delete the local copy of the file you saved and run the command below; the model is downloaded again right away.

dvc pull <dvc folder>/<dvc file name> 
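
To connect those strangely named folders with a specific model, you can read the hash out of the .dvc file and work out where the object should sit. The sketch below assumes the .dvc file records the hash on a line of the form "md5: <hash>" and that objects are laid out as <first two characters>/<rest of the hash>. As far as I know, newer DVC versions add a files/md5/ prefix in front of that, so treat the exact layout as an assumption and check it against your own storage. Both function names are hypothetical.

```shell
# Print the first md5 value recorded in a .dvc file.
dvc_file_md5() {
    sed -n 's/^[[:space:]-]*md5:[[:space:]]*//p' "$1" | head -n 1
}

# Turn a hash into a storage-relative path: the first two characters
# become the folder name, the remaining characters the file name.
hash_to_path() {
    prefix=$(printf '%s' "$1" | cut -c1-2)
    rest=$(printf '%s' "$1" | cut -c3-)
    printf '%s/%s\n' "$prefix" "$rest"
}

# Typical use (commented out; the path is a placeholder from this post):
# hash_to_path "$(dvc_file_md5 "<dvc folder>/<dvc file name>")"
```

Looking for the printed path under the remote's data directory on the server should turn up the uploaded model, just under a name you would never have guessed.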




Versioning the model

This is the most important part, model versioning. The order is as follows, so let’s get it into our muscle memory so we don’t skip a step!!

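  1. dvc add <model name> --file <dvc folder>/<dvc file name>
  2. dvc push <dvc folder>/<dvc file name>
  3. git tag -a "<version>" -m "<tag message>"
  4. git push origin <version>
  5. git push origin <branch name>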

This binds that model to a git tag, and as a result, when you go to that tag, you’ll be able to load the model from that point in time.

git checkout tags/<version> -b <branch name>
dvc pull
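
So that no step gets skipped, the whole routine can be wrapped in two small functions. This is only a sketch under a few assumptions of mine: the function names (run, publish_model, restore_model) and commit messages are made up, the model is added without --file so its .dvc file sits next to it, and a git add / git commit step is included because a tag points at a commit, so the .dvc file has to be committed before tagging for the tag to capture it. Setting DRY_RUN=1 prints the commands instead of running them, which is handy for checking the order first.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# publish_model <model path> <version tag>
# Track the model with DVC, commit the metafile, upload the data, tag, and push.
publish_model() {
    model="$1"
    tag="$2"
    run dvc add "$model" || return 1
    run git add "$model.dvc" "$(dirname "$model")/.gitignore" || return 1
    run git commit -m "Track $model as $tag" || return 1
    run dvc push "$model.dvc" || return 1
    run git tag -a "$tag" -m "Model $tag" || return 1
    run git push origin HEAD || return 1    # push the current branch
    run git push origin "$tag"              # push the tag itself
}

# restore_model <version tag> <new branch name>
# Check out the tagged commit on a new branch and download the matching model.
restore_model() {
    run git checkout "tags/$1" -b "$2" || return 1
    run dvc pull
}

# Preview the commands without touching anything:
# DRY_RUN=1 publish_model model.ckpt v1.0
# DRY_RUN=1 restore_model v1.0 try-v1.0
```

Each step stops the function if it fails, so a failed dvc push never ends up hidden behind a tag that looks fine on the Git side.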



