Various strategies have been proposed to overcome optimization difficulty and accuracy degradation when compressing large models. model_compression/bert/config/XTC/ds_config_layer_reduction_W1Q8_fp32.json in DeepSpeedExamples is the example configuration where we enable layer reduction on top of model_compression/bert/config/XTC/ds_config_W1A8_Qgroup1_fp32.json. Reducing the size of large models is critical when deploying them on both servers and client devices. We find that under FP16 training, a smaller number of quantization groups (e.g., 1 or 2) can lead to unstable training. Overall, this work introduces a simple yet effective compression pipeline for extreme compression in pretrained transformers, providing a possible solution for deploying such models. It allows for easy composition of a multitude of features within a single training, inference or compression pipeline. The kernels also fuse quantization and dequantization operations before and after GeMM, further reducing the kernel invocation overhead and improving the memory bandwidth utilization. To tease apart their effects, we perform a systematic study of the impact of various techniques currently used for extreme compression. The core piece of DeepSpeed Compression is a component called compression composer, which includes several significant features. After the DNN model has been compressed, DeepSpeed Compression replaces the compressed layers with highly optimized kernels in the DeepSpeed Inference engine to maximize hardware efficiency. The example includes the following changes to the client code (model_compression/bert/ in DeepSpeedExamples): (1) When initializing the model, the number of layers in the model config should be the same as keep_number_layer in the DeepSpeed config JSON file. For more details about ZeroQuant, refer to ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers.
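The requirement that the model's layer count match keep_number_layer can be checked before training starts. The following is a minimal sketch, not code from DeepSpeedExamples: the helper name `check_layer_reduction` is hypothetical, and the nesting of `keep_number_layer` under `compression_training` and `layer_reduction`, along with the `enabled` flag, is an assumption that should be verified against the actual JSON file.

```python
def check_layer_reduction(ds_config: dict, model_num_layers: int) -> int:
    """Return keep_number_layer if it matches the student depth, else raise.

    Hypothetical helper. The key nesting below is an assumption about the
    DeepSpeed config layout; adjust it to match your JSON file.
    """
    layer_cfg = ds_config.get("compression_training", {}).get("layer_reduction", {})
    if not layer_cfg.get("enabled", False):
        raise ValueError("layer reduction is not enabled in the config")
    keep = layer_cfg["keep_number_layer"]
    if keep != model_num_layers:
        raise ValueError(
            f"model has {model_num_layers} layers but keep_number_layer is {keep}"
        )
    return keep


# Illustrative config fragment (values chosen for the example only).
example_config = {
    "compression_training": {
        "layer_reduction": {"enabled": True, "keep_number_layer": 6}
    }
}

# A 6-layer student model passes the check; a 12-layer one would raise.
student_layers = check_layer_reduction(example_config, model_num_layers=6)
```

In practice, `example_config` would be loaded from the JSON file with `json.load`, and `model_num_layers` would come from the model config used to build the student.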
The DeepSpeed library (this repository) implements and packages the innovations and technologies in DeepSpeed Training, Inference and Compression Pillars into a single easy-to-use, open-sourced repository. System optimizations and model compression are very much complementary, and they can be synergistically combined to provide a multiplicative reduction in inference latency and cost. Extreme Compression for Pre-trained Transformers Made Simple and Efficient, ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. Existing methods have limited composability from two aspects. Installation: Examples of XTC extreme compression for BERT models are at model_compression/bert/bash_script/XTC in DeepSpeedExamples. Tutorial for ZeroQuant: efficient and affordable post-training quantization. One can run our layer reduction example in DeepSpeedExamples by: To apply layer reduction for task-agnostic compression, we provide an example of how to do so in the GPT pre-training stage. With the above layer-reduced models ready, we now continue to compress the model with 1/2-bit quantization. One can run our sparse pruning example in DeepSpeedExamples by: Row pruning sets all the elements in certain rows of the weight matrix to zero. One can run our activation quantization example in DeepSpeedExamples by: Pruning aims to reduce the number of parameters and operations involved in generating a prediction by removing network connections. By combining extreme quantization and lightweight layer reduction, we can further improve the binarized model, achieving 50x model size reduction while retaining 97% of the accuracy. With pruning, you can lower the overall parameter count in the network. model_compression/bert/config/XTC/ds_config_layer_reduction_fp16.json in DeepSpeedExamples is the example configuration for reducing the 12-layer BERT-base to a 6-layer one.
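Row pruning, as described above, zeroes every element of selected rows of a weight matrix. The sketch below illustrates the idea on a plain list-of-lists matrix. It is not DeepSpeed's implementation: the function name `prune_rows`, the `dense_ratio` parameter, and the choice to drop the rows with the smallest L1 norm are all assumptions made for illustration.

```python
def prune_rows(weight, dense_ratio):
    """Zero out the rows with the smallest L1 norm.

    Keeps round(dense_ratio * n_rows) rows (at least one) and replaces the
    rest with zeros. The L1-norm criterion is an illustrative assumption.
    """
    n_rows = len(weight)
    n_keep = max(1, round(dense_ratio * n_rows))
    norms = [sum(abs(x) for x in row) for row in weight]
    # Row indices ordered from largest to smallest norm; keep the first n_keep.
    ranked = sorted(range(n_rows), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [
        list(row) if i in keep else [0.0] * len(row)
        for i, row in enumerate(weight)
    ]


weight = [[0.5, -1.0], [0.1, 0.05], [2.0, 0.3], [-0.2, 0.1]]
pruned = prune_rows(weight, dense_ratio=0.5)
# Rows 1 and 3 have the smallest L1 norms, so they are set to zero.
```

Because whole rows become zero, the matrix can later be stored and multiplied as a smaller dense matrix, which is what distinguishes this structured form from element-wise sparse pruning.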
For example, we developed variations of efficient low-bit computation such as INT8 GeMM kernels. With this config, we quantize the existing fine-tuned models downloaded from Hugging Face. Please make sure they match the teacher model dimensions in the checkpoint. BLOG DeepSpeed Compression: A composable library for extreme compression and zero-cost quantization. This is the number of data samples that leads to one step of model update. It offers multiple cutting-edge compression methods, as shown in Table 1, including extreme quantization, head/row/channel pruning, and knowledge distillation, that can effectively reduce model size and inference cost. Currently, DeepSpeed Compression includes seven compression methods: layer reduction via knowledge distillation, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning. By default, it will load the bottom layers of the teacher models for initialization, but you can pass your own checkpoints for initialization. We provide the zero-shot perplexity results on WikiText-2 and LAMBADA in the following table. In addition, users can also choose whether to reinitialize the input/output layers from the given model (teacher model) via other_module_name. It can improve computation efficiency similar to weight quantization.
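To make the weight and activation quantization methods listed above more concrete, here is a minimal sketch of symmetric quantization with one scale per group. It is a plain-Python illustration under stated assumptions, not DeepSpeed's quantizer or its INT8 kernels: the function name `quantize_symmetric` and its parameters are hypothetical, and the values are quantized and immediately dequantized so the rounding error is visible.

```python
def quantize_symmetric(values, num_bits=8, num_groups=1):
    """Quantize then dequantize a flat list using one scale per group.

    Illustrative sketch only: each group is scaled so that its largest
    magnitude maps to the largest signed integer for num_bits.
    """
    if len(values) % num_groups != 0:
        raise ValueError("len(values) must be divisible by num_groups")
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    group_size = len(values) // num_groups
    out = []
    for g in range(num_groups):
        group = values[g * group_size:(g + 1) * group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = max(-qmax, min(qmax, round(v / scale)))  # integer code
            out.append(q * scale)  # dequantized value
    return out


values = [0.01, -0.02, 0.03, 10.0, -5.0, 2.5]
one_group = quantize_symmetric(values, num_bits=8, num_groups=1)
two_groups = quantize_symmetric(values, num_bits=8, num_groups=2)
```

With a single group, the large entries set the scale and the three small entries all round to zero. With two groups, the small entries get their own scale and are preserved much more accurately, which is the usual motivation for using more quantization groups.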