FedDDF: Dynamic Dataset Filtering in Federated Large Language Model Training

Data-efficient centralized training for large language models (LLMs) aims to match or surpass the performance of traditional training while using much smaller datasets. By avoiding reliance on massive web-scale or document corpora, these approaches significantly reduce computational costs. However, although data-efficient training mitigates some limitations of traditional training, it does not inherently address data leakage, which remains a critical concern. This need has driven growing interest in federated learning (FL), a decentralized paradigm that enables collaborative model updates without exposing local data. Despite these advantages, FL is challenged by varying dataset quality across clients, which can lead to inconsistencies and performance degradation if data efficiency is not prioritized. Many existing approaches tackle this issue by integrating additional filtering modules, but these introduce high computational overhead, which increases overall training time and limits the efficiency of federated LLM training. To overcome these limitations, we introduce the Federated Dynamic Dataset Filtering system (FedDDF), which optimizes distributed LLM training by selecting high-quality training data to maximize training performance while minimizing training time. FedDDF leverages perplexity-based influence scoring; experimental results demonstrate that it accelerates training by a factor of 1.12 to 2.04 compared to the baseline and enhances overall model performance. Consequently, it reduces the need for computationally expensive hardware, making large-scale distributed training more efficient and accessible, even in resource-constrained environments. Moreover, our results indicate that the benefits of data pruning may extend beyond centralized learning to distributed learning.
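
As a rough illustration of what perplexity-based data filtering can look like on a single client, the following minimal sketch scores each sample by its perplexity and keeps a fixed fraction of the samples. It is not the FedDDF algorithm, whose scoring and selection rules are not given in this abstract. The function names, the `keep_fraction` parameter, and the choice to keep the lowest-perplexity samples are all assumptions made for illustration, and the per-token log-probabilities are assumed to come from whatever language model the client already runs.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sample from its per-token natural-log probabilities.

    Uses the standard definition: exp of the negative mean log-probability.
    """
    if not token_logprobs:
        raise ValueError("sample has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(samples, keep_fraction=0.5):
    """Keep the fraction of samples with the lowest perplexity.

    `samples` is a list of (text, token_logprobs) pairs. Keeping the
    lowest-perplexity samples is an assumption of this sketch; the paper's
    influence score may rank samples differently.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Sort ascending by perplexity so the most predictable samples come first.
    ranked = sorted(samples, key=lambda sample: perplexity(sample[1]))
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return [text for text, _ in ranked[:n_keep]]
```

In a federated setting, a filter of this kind would run locally on each client before its training round, so only the selected samples contribute to that client's model update.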

Publisher: Proceedings of the International Workshop on Secure and Efficient Federated Learning, in Conjunction with ACM AsiaCCS 2025 (FL-AsiaCCS 2025)

Article number: 3

Keywords

  • Data Quality Filtering
  • Federated Learning
  • Large Language Models

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Computational Theory and Mathematics

Publication year

2025
