Implementation of Language Models within an Infrastructure Designed for Natural Language Processing



This paper explores cost-effective ways to run language models in resource-constrained environments, investigating methods such as quantization and CPU-based model implementations.
The study addresses the computational efficiency of language models during inference and the development of infrastructure for text document processing.
The paper discusses related technologies, the CLARIN-PL infrastructure architecture, and implementations of small and large language models. The emphasis is on model formats, data precision, and runtime environments (GPU and CPU). It identifies optimal solutions through extensive experimentation.
In addition, the paper advocates a more comprehensive approach to performance evaluation. Instead of reporting only average token throughput, it suggests considering the shape of the throughput curve, which can be constant, monotonically increasing, or monotonically decreasing. Evaluating token throughput at several points on the curve, especially for different output token counts, gives a more informative picture.
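The evaluation idea above can be illustrated with a minimal sketch. It is not the benchmark used in the paper: the function names (`measure_timings`, `throughput_curve`, `classify_shape`), the 5% tolerance, and the example timings are assumptions chosen for illustration. The sketch computes throughput separately for each output token count and labels the resulting curve, so that it can be compared with the single average figure.

```python
import time
from typing import Callable, Dict, Iterable, List


def measure_timings(generate: Callable[[int], object],
                    token_counts: Iterable[int],
                    clock: Callable[[], float] = time.perf_counter) -> Dict[int, float]:
    """Time one call of `generate(n)` for each requested output token count n.

    `generate` is a hypothetical callable that produces n output tokens.
    """
    timings = {}
    for n in token_counts:
        start = clock()
        generate(n)
        timings[n] = clock() - start
    return timings


def throughput_curve(timings: Dict[int, float]) -> Dict[int, float]:
    """Map each output token count to tokens per second."""
    return {n: n / seconds for n, seconds in sorted(timings.items())}


def average_throughput(timings: Dict[int, float]) -> float:
    """Single aggregate figure: total tokens divided by total time."""
    return sum(timings) / sum(timings.values())


def classify_shape(throughputs: List[float], rel_tol: float = 0.05) -> str:
    """Label a curve as constant, increasing, decreasing, or non-monotonic.

    The curve counts as constant when its spread is within `rel_tol`
    of the maximum value (an illustrative threshold, not from the paper).
    """
    low, high = min(throughputs), max(throughputs)
    if high - low <= rel_tol * high:
        return "constant"
    diffs = [b - a for a, b in zip(throughputs, throughputs[1:])]
    if all(d >= 0 for d in diffs):
        return "increasing"
    if all(d <= 0 for d in diffs):
        return "decreasing"
    return "non-monotonic"


# Made-up timings (seconds) for 16, 64, and 256 output tokens.
example_timings = {16: 0.5, 64: 4.0, 256: 32.0}
curve = throughput_curve(example_timings)
print(curve)                                 # per-point throughput
print(classify_shape(list(curve.values())))  # shape of the curve
print(average_throughput(example_timings))   # the single average figure
```

In this made-up example the average alone would not show that throughput falls as the output grows, which is the kind of information the per-point evaluation is meant to expose.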








Applied Informatics