
LLMs

Bpetokenizer

A Byte Pair Encoding (BPE) tokenizer that algorithmically follows the GPT tokenizer (tiktoken) and lets you train your own tokenizer. The tokenizer can handle special tokens and uses a customizable regex pattern for tokenization (including the GPT-4 regex pattern). It supports saving and loading tokenizers in JSON format, and it also supports pretrained tokenizers.
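A typical workflow looks something like the sketch below. This is a hypothetical usage sketch based on the description above; the method names (`train`, `encode`, `decode`, `save`, `load`) are assumptions, so check the library's README for the confirmed API.

```python
# Hypothetical usage sketch -- method names are assumptions, not the confirmed API.
from bpetokenizer import BPETokenizer

tokenizer = BPETokenizer()

# Train on your own corpus up to the desired vocabulary size.
tokenizer.train("your training text corpus here", vocab_size=512)

# Encode text to token ids and decode back.
ids = tokenizer.encode("hello world")
print(tokenizer.decode(ids))  # -> "hello world"

# Persist the learned merges/vocab as JSON and reload later.
tokenizer.save("my_tokenizer")
tokenizer.load("my_tokenizer.json")
```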

Overview

The Byte Pair Encoding (BPE) algorithm is a simple yet powerful method for building a vocabulary of subword units from a given text corpus. This tokenizer can be used to train your LLM's tokenizer on text corpora in various languages.

This algorithm was first introduced in the paper Neural Machine Translation of Rare Words with Subword Units and was later used in the GPT-2 tokenizer (Language Models are Unsupervised Multitask Learners).
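To make the algorithm concrete, here is a minimal sketch of the core BPE training loop in plain Python. It illustrates the technique itself, not the library's internals; the helper names (`get_pair_counts`, `merge`, `train_bpe`) are made up for this example.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent token-id pair."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn (pair -> new token id) merges until vocab_size is reached."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = {}
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

merges = train_bpe("low lower lowest", vocab_size=260)
print(merges)  # e.g. {(108, 111): 256, (256, 119): 257, ...}
```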

Every LLM (LLaMA, Gemini, Mistral, ...) uses its own tokenizer trained on its own text dataset.

How to Make the Best Out of Open-Source LLMs

Ever found yourself wondering if you could tap into the magic of top-notch language models without breaking the bank? Well, good news – open-source LLMs like Gemma, Mistral, Phi2, and a bunch of others are here to save the day! And guess what? They won't cost you a single penny. If you're scratching your head about how to get them up and running on your own machine and use them for all sorts of cool stuff, you're in the right place. Let's dive in!

Why Bother with Open-Source LLMs?

Okay, so GPT-4 is like the rockstar of AI models, but let's be real – it's got a pretty steep price tag attached. And for us students, coughing up a bunch of cash for a project just isn't in the cards. But fear not, my friend, because here's where open-source LLMs come strutting in to save the day. They're like the friendly neighborhood superheroes of the AI world, and they're totally awesome.

RAG

What interested me in doing this...

I was really interested in AI, which I see as one of the most important tools for enhancing everyone's life and solving real-world problems that otherwise require significant human effort. By handing that work to AI, we can address many of these issues.

I became interested in building projects around the APIs of models, GPTs, and OSS models. When I was a beginner, it seemed really cool to generate new content from foundation models. However, when I asked questions about current events or anything outside the model's training data, I often received default answers like, "I'm only trained on data up to 2022."

This made me question: how can I train a foundation model on real-time data and make it available to users? This curiosity led me to delve into fine-tuning, which involves training the foundation model on your own private data.

Let's talk about LLMs

My experience with LLMs

I've explored a variety of Large Language Models (LLMs), ranging from commercial models like GPT-4, GPT-3.5-turbo, and gemini-pro to open-source alternatives like Mistral-7B. Additionally, I've explored smaller language models, like Microsoft's Phi-2, an open-source model trained with just 2.7B parameters.

Among these, GPT-4 stands out for its efficiency and the quality of its responses. To my knowledge, no LLM has matched the quality of GPT-4's responses, but its information is limited to 2022.

I recommend using gemini-pro for more up-to-date information and faster responses compared to GPT-4.

What are LLMs?

Large Language Models are AI models trained on large text datasets, which include articles, books, and other texts.

all you need is pydantic

Hey there, language enthusiasts! Ever wondered about the dynamic duo of Pydantic and the OpenAI instructor library? Well, you’re in for a treat. This blog is your ticket to exploring how these two pals can tag-team to make your language model interactions not just effective but downright awesome. Join me as we uncover the magic of combining Pydantic’s finesse with the OpenAI instructor library’s wizardry for a seamless and efficient NLP journey.

Purpose

Why are we diving into this combo, you ask?

Simple. Pydantic and the OpenAI instructor library aren’t just tools; they’re superheroes in the world of language processing. Together, they form a powerhouse that not only prompts models like a champ but also ensures the responses are top-notch and well-behaved. This blog? It’s your guide to making this dynamic duo work wonders. Perfect for developers and language lovers who want to make their NLP game strong!
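To give you a taste of the pattern before we dig in: you define a Pydantic model describing the response you want, and instructor patches the OpenAI client so the completion is parsed and validated into that model. The sketch below assumes instructor's `from_openai` entry point and the `response_model` keyword; exact names can vary across library versions, so treat it as an illustration rather than gospel.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

# The shape we want the model's answer to take.
class UserInfo(BaseModel):
    name: str
    age: int

# Patch the OpenAI client so responses are validated into Pydantic models.
# (Assumes instructor's `from_openai` entry point; older versions used `instructor.patch`.)
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserInfo,  # instructor handles parsing + validation
    messages=[{"role": "user", "content": "Extract: John Doe is 30 years old."}],
)

print(user.name, user.age)  # -> John Doe 30
```

If the model's output doesn't fit the schema, Pydantic's validation kicks in and instructor can retry the request, which is exactly the "well-behaved responses" part of the duo.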