Being GPU-broke and repeatedly running into frustrating OOM errors led me to wonder whether models could be compressed to run on edge devices and other hardware with limited compute. That search led me to quantization.¹

¹ I remember someone telling me that Tim Dettmers wrote the CUDA kernel for quantization in one sitting. Absolutely legendary!
I am writing this blog to share what I have learned and to test my own understanding of the topic.