Training LLM - Dataset

Large Language Models (LLMs) like OpenAI's GPT-4 are trained on a diverse and extensive range of internet text sources. The data used to train these models typically includes:

1. Books: A wide variety of books, covering fiction, non-fiction, and textbooks, provides a rich source of diverse vocabulary and complex sentence structures.

2. Websites: Content from a broad array of websites, including news sites, blogs, and forums, contributes to the model's understanding of current events, colloquial language, and varied viewpoints.

3. Scientific Papers: Academic and scientific literature helps train the model to understand and generate text on scientific, technical, and academic topics.

4. Wikis: Sites like Wikipedia offer structured, well-organized information on a vast array of topics, which is valuable for teaching models factual and encyclopedic knowledge.
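As a toy illustration of how such a mixed corpus might be assembled, the sketch below samples documents from the source categories above according to fixed mixture weights. The category names and weights here are hypothetical, not GPT-4's actual data mix (which OpenAI has not disclosed); real pipelines tune these proportions carefully.

```python
import random

# Hypothetical mixture weights for the source categories described above.
SOURCES = {
    "books":    0.25,
    "websites": 0.50,
    "papers":   0.15,
    "wikis":    0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick one data source category according to the mixture weights."""
    categories = list(SOURCES)
    weights = [SOURCES[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]

# Draw a small batch and tally how often each source is chosen;
# over many draws, the tallies approach the mixture weights.
rng = random.Random(0)
counts = {name: 0 for name in SOURCES}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

In a real training setup the same idea applies at the document level: each step, the data loader picks a source shard with some probability, then reads the next document from that shard, so higher-quality sources can be over-represented relative to their raw size.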