Implementation of Language Models within an Infrastructure Designed for Natural Language Processing
Abstract
This paper explores cost-effective alternatives for deploying language models in resource-constrained environments, investigating methods such as quantization and CPU-based inference.
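To make these two directions concrete, here is a minimal sketch contrasting 8-bit GPU quantization (via Hugging Face Transformers with bitsandbytes) with CPU-only inference on a GGUF-quantized model (via llama-cpp-python). The checkpoint name, file path, and thread count are illustrative placeholders, not the configurations evaluated in the paper.

# Minimal sketch of the two cost-reduction strategies; the model name and
# path are placeholders, not the setups evaluated in the paper.

# (a) GPU inference with 8-bit weight quantization (transformers + bitsandbytes)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weights
    device_map="auto",  # place layers on available GPUs
)
inputs = tokenizer("Summarize this document:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

# (b) CPU-only inference on a 4-bit GGUF model (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_threads=8)  # placeholder path
print(llm("Summarize this document:", max_tokens=64)["choices"][0]["text"])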
The study addresses the computational efficiency of language models during inference and the development of infrastructure for text document processing.
The paper discusses related technologies, the architecture of the CLARIN-PL infrastructure, and implementations of small and large language models, with emphasis on model formats, data precision, and runtime environments (GPU and CPU). Optimal configurations are identified through extensive experimentation.
In addition, the paper advocates a more comprehensive approach to performance evaluation. Instead of reporting only average token throughput, it suggests examining the shape of the throughput curve as a function of the number of generated tokens, which can range from constant to monotonically increasing or decreasing. Evaluating token throughput at several points on this curve, in particular for different output token counts, gives a more informative picture.
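As a sketch of this evaluation protocol, the hypothetical helper below samples the throughput curve at several output lengths; the generate callable is an assumed stand-in for any model's generation API (prompt and token budget in, text out), not an interface defined in the paper.

import time
from typing import Callable, List, Sequence, Tuple

def throughput_curve(
    generate: Callable[[str, int], str],  # assumed API: (prompt, max_new_tokens) -> text
    prompt: str,
    output_lengths: Sequence[int] = (32, 64, 128, 256, 512),
) -> List[Tuple[int, float]]:
    """Sample (output_tokens, tokens_per_second) points of the throughput curve."""
    points = []
    for n in output_lengths:
        start = time.perf_counter()
        generate(prompt, n)  # generate exactly n new tokens
        elapsed = time.perf_counter() - start
        points.append((n, n / elapsed))
    return points

# A flat curve means constant per-token cost; a monotonically decreasing one
# reveals per-token cost growing with sequence length (e.g., attention over an
# ever-longer context), detail that a single averaged figure hides.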