Decoding the DNA of Large Language Models: A Comprehensive Survey on Datasets, Challenges, and Future Directions