FedDDF: Dynamic Dataset Filtering in Federated Large Language Model Training

Data-efficient centralized training for large language models (LLMs) aims to match or surpass the performance of conventional training while using much smaller datasets. By avoiding reliance on massive web-scale or document corpora, these approaches significantly reduce computational cost. However, although they mitigate some limitations of conventional training, they do not inherently address data leakage, which remains a critical concern. This concern has driven growing interest in federated learning (FL), a decentralized paradigm that enables collaborative model updates without exposing clients' local data. Despite these advantages, FL is challenged by variation in dataset quality across clients, which can lead to inconsistencies and performance degradation if data efficiency is not prioritized. Many existing approaches tackle this issue by integrating additional filtering modules, but these introduce high computational overhead, increasing overall training time and limiting the efficiency of federated LLM training. To overcome these limitations, we introduce the Federated Dynamic Dataset Filtering system (FedDDF), which is designed to optimize distributed LLM training by selecting high-quality training data so as to maximize model performance while minimizing training time. Experimental results demonstrate that FedDDF, which leverages perplexity-based influence scoring, accelerates training by 1.12× to 2.04× compared to the baseline and improves overall model performance. Consequently, it reduces the need for computationally expensive hardware, making large-scale distributed training more efficient and accessible, even in resource-constrained environments. Moreover, our results suggest that the benefits of data pruning may extend beyond centralized learning to distributed learning.
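As a rough illustration of what perplexity-based filtering on a single client could look like, the sketch below scores each sample by its perplexity and keeps a fixed fraction of the data. It is not the FedDDF algorithm: the abstract does not specify how influence scores are derived from perplexity, so the function names, the `keep_fraction` parameter, and the choice to keep the lowest-perplexity samples are all assumptions made for illustration. Per-token log-probabilities are assumed to come from the client's current copy of the model.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sample, given per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("sample has no tokens")
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(
    samples_logprobs: Sequence[Sequence[float]],
    keep_fraction: float,
) -> List[int]:
    """Return indices of the kept samples, ordered from lowest perplexity up.

    Assumption (not from the source): lower perplexity is treated as
    higher quality, and a fixed fraction of the local data is retained.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    scores = [perplexity(lp) for lp in samples_logprobs]
    n_keep = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=scores.__getitem__)
    return ranked[:n_keep]
```

In a federated setting, each client could plausibly rerun a selection step like this at the start of every round with the latest global model, so that the retained subset changes as training progresses. Whether FedDDF ranks, thresholds, or weights samples this way is not stated in the abstract.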
