Thuan: Alex, I use AI every day. ChatGPT, Claude, Copilot. But I realized something embarrassing — I don’t actually understand how they work. Like, what’s really happening when I type a prompt and get an answer?
Alex: You’re in good company. Most developers use AI without understanding the mechanics. And that’s OK for basic use. But if you want to build with AI, or make smart decisions about it, a little understanding goes a long way.
Thuan: So explain it to me. Start from zero.
Alex: OK. Let’s start with the biggest misconception. AI doesn’t “think.” It doesn’t “understand.” It does something much simpler and much more impressive at the same time: it predicts the next word.
Thuan: Wait, that’s it? All of ChatGPT is just… predicting the next word?
Alex: At its core, yes. When you ask, “What’s the capital of France?” the model processes your text and generates, word by word: “The” — most likely next word — “capital” — most likely next word — “of” — “France” — “is” — “Paris.” Each word is a prediction based on everything that came before it.
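That idea can be sketched in a few lines of Python. This is a toy bigram model over a made-up corpus, not a real LLM (which uses a neural network over tokens), but the generation loop has the same shape: each word is predicted from what came before.

```python
from collections import Counter, defaultdict

# Toy "predict the next word" model: count which word follows which
# in a tiny made-up corpus. Real LLMs learn these patterns with a
# neural network over billions of sentences, not a lookup table.
corpus = (
    "the capital of france is paris . "
    "the capital of italy is rome . "
    "the capital of france is paris ."
).split()

next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return next_counts[word].most_common(1)[0][0]

# Generate word by word, each prediction based on the previous word.
word, answer = "of", ["of"]
for _ in range(3):
    word = predict_next(word)
    answer.append(word)

print(" ".join(answer))  # "of france is paris"
```

"France" follows "of" more often than "italy" in this corpus, so the model picks it, and each pick feeds the next prediction, exactly the word-by-word loop Alex describes.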
Thuan: But it gives intelligent answers. How can predicting the next word produce something that seems smart?
Alex: Because it was trained on an enormous amount of text. Billions and billions of sentences. It saw patterns. It saw that “the capital of France” is almost always followed by “Paris.” It saw that technical explanations follow certain structures. It learned the statistical patterns of human language at a scale that no human could process.
Machine Learning: Teaching a Computer to Find Patterns
Thuan: OK, let’s back up. What is machine learning in the simplest terms?
Alex: Machine learning is teaching a computer to find patterns in data, instead of programming the rules yourself. Here’s an analogy. Imagine you want to teach a child to recognize cats. The traditional programming approach: you’d write rules. “A cat has four legs, pointy ears, whiskers, a tail, and says meow.” But what about cats with floppy ears? Three-legged cats? Cats that don’t meow?
Thuan: The rules would get impossibly complex.
Alex: Exactly. The machine learning approach is different. You don’t write rules. Instead, you show the child ten thousand pictures of cats and ten thousand pictures of not-cats. You say, “This is a cat. This is not a cat.” Over and over. Eventually, the child learns to recognize cats — not because they memorized a rule, but because they absorbed the patterns.
Thuan: And that’s what training a model means? Showing it millions of examples?
Alex: Yes. During training, the model adjusts millions of internal numbers — called parameters or weights — to get better at predicting the right answer. At first, it’s random. It guesses wrong constantly. But with each example, it adjusts slightly. After millions of examples, those tiny adjustments add up to something that looks like intelligence.
Thuan: It’s like tuning a guitar by ear. Each string needs thousands of tiny adjustments, and eventually, it sounds right.
Alex: That’s a beautiful analogy. And just like a guitar, if you tune it wrong — bad training data, wrong approach — it’ll sound terrible no matter how expensive the guitar is.
Large Language Models: The Really Smart Parrot
Thuan: So what makes LLMs — Large Language Models — special?
Alex: Scale. Regular machine learning models might have thousands or millions of parameters. GPT-4 has hundreds of billions. It’s like the difference between a parrot that learned 50 phrases and a parrot that read every book in every library in the world.
Thuan: A really smart parrot.
Alex: Exactly. And here’s the key: the parrot doesn’t understand what it’s saying. It doesn’t have beliefs or experiences. But it has absorbed so many patterns that its responses are often indistinguishable from someone who does understand.
Thuan: Is that why they’re called “stochastic parrots” by some researchers?
Alex: Yes. Stochastic means probabilistic — based on probability. The model doesn’t “know” the answer. It generates the most probable sequence of words given the input. Sometimes that probability aligns perfectly with truth. Sometimes it doesn’t. And when it doesn’t, you get…
Thuan: Hallucinations.
Alex: Right. Confident, well-structured, completely wrong answers. The model isn’t lying. It’s not trying to deceive you. It’s just generating the most probable text, and sometimes the most probable text is factually wrong. It’s fluent nonsense.
Embeddings: Turning Words Into Math
Thuan: You mentioned AI is math. How does text even become math?
Alex: Through embeddings. This is actually one of the most elegant ideas in AI. An embedding turns a word — or a sentence, or a whole document — into a list of numbers. A vector. And here’s the magical part: words with similar meanings end up close together in “number space.”
Thuan: Give me an example.
Alex: Imagine a simple two-dimensional space. The word “king” might be at position [3, 5]. The word “queen” might be at [3, 6]. They’re close because they’re related. The word “dog” might be at [8, 2]. Far away from king and queen because it’s a completely different concept.
Thuan: And real embeddings have more than two dimensions?
Alex: Way more. Typically 768 or 1,536 dimensions. Impossible to visualize, but the math works the same. Similar concepts cluster together. And you can even do math with them. The classic example: “king” minus “man” plus “woman” equals something very close to “queen.” The model learned the relationship between gender and royalty just from reading text.
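The "king minus man plus woman" arithmetic can be shown with toy vectors. These four-dimensional embeddings are hand-made for illustration (real models learn vectors with hundreds of dimensions from text), but the nearest-neighbor math is the same.

```python
import math

# Hand-made toy "embeddings", purely illustrative. Real models learn
# 768+ dimensional vectors from data rather than having them written in.
emb = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.9, 0.1, 0.8, 0.2],
    "man":   [0.1, 0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9, 0.1],
    "dog":   [0.1, 0.2, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: near 1.0 means similar direction/meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "king" - "man" + "woman", component by component.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Find the known word whose vector is closest to the result.
nearest = max(emb, key=lambda word: cosine(emb[word], target))
print(nearest)  # "queen"
```

Subtracting "man" removes the male component, adding "woman" adds the female one, and the nearest remaining vector is "queen". "Dog" stays far away throughout, just as in Alex's two-dimensional example.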
Thuan: That’s amazing. And where do embeddings get used in practice?
Alex: Everywhere. Semantic search — instead of matching exact keywords, you match meaning. If someone searches for “how to fix a slow website,” traditional keyword search looks for those exact words. Embedding search understands that “optimize web performance” and “speed up page load” mean the same thing.
Recommendation systems use embeddings too. Netflix embeds movies into vector space. Movies similar to what you watched are nearby vectors. “You liked this action movie? Here are movies nearby in embedding space.”
RAG: Teaching AI About Your Stuff
Thuan: OK, here’s something I’ve been hearing a lot. RAG — Retrieval Augmented Generation. What is it and why does everyone talk about it?
Alex: RAG solves a fundamental problem with LLMs. LLMs were trained on internet data from a specific time. They don’t know about your company’s private data. They don’t know what happened last week. They don’t know your internal documentation. If you ask ChatGPT about your company’s HR policy, it’ll make something up.
Thuan: Right. I’ve tried that. Confidently wrong.
Alex: RAG fixes this by adding a step before generation. Here’s the flow:
Step 1: Retrieve. Take the user’s question. Search your own data — documents, databases, knowledge bases — for relevant information. This is the “R” in RAG.
Step 2: Augment. Take the relevant information you found and add it to the prompt. “Here’s context from our internal docs. Now answer the user’s question using this context.” This is the “A.”
Step 3: Generate. The LLM generates an answer based on both the question and the provided context. This is the “G.”
Thuan: So instead of relying on the LLM’s training data, you’re giving it the right information at question time?
Alex: Exactly. Think of it like an open-book exam. Without RAG, the LLM takes the exam from memory — and memory can be wrong. With RAG, the LLM can look up the answers in the textbook. It’s still responsible for formulating a good answer, but it has the right source material.
Thuan: How does the retrieval part work? How does it find the right documents?
Alex: Embeddings! You take all your documents and convert them into embeddings — vectors. You store them in a vector database like Pinecone, Weaviate, or Chroma. When a user asks a question, you convert the question into an embedding too. Then you find the documents whose embeddings are closest to the question’s embedding. Those are the most relevant documents.
Thuan: So the whole flow is: question becomes embedding, embedding finds similar document embeddings, documents get added to the prompt, LLM generates answer using those documents as context.
Alex: Exactly. And this is how most “chat with your documents” products work. It’s not magic. It’s embeddings plus search plus an LLM.
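The whole flow fits in a short sketch. Everything here is a stand-in: `embed` is a toy bag-of-words counter instead of a real embedding model, the document list plays the role of a vector database, and the final LLM call is replaced by printing the augmented prompt.

```python
# Minimal RAG pipeline sketch: embed -> retrieve -> augment.
# `embed` is a toy stand-in for a real embedding model.
def embed(text):
    """Toy embedding: word counts over a tiny fixed vocabulary."""
    vocab = ["refund", "policy", "vacation", "days", "password", "reset"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Vacation policy: employees get 25 vacation days per year.",
    "Password reset: use the self-service portal to reset your password.",
]
doc_vectors = [embed(d) for d in docs]   # index documents once, up front

question = "How many vacation days do I get?"
q_vec = embed(question)                  # question becomes an embedding

# Retrieve: the document whose embedding is closest to the question's.
best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vectors[i]))

# Augment: add the retrieved context to the prompt; Generate would be
# the LLM call that receives this prompt.
prompt = f"Context: {docs[best]}\nQuestion: {question}\nAnswer using the context."
print(prompt)
```

The question about vacation days lands nearest the vacation policy document, which then rides along in the prompt: the open-book exam in about thirty lines.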
Tokens: The Currency of AI
Thuan: One more thing I keep seeing — tokens. People say “GPT-4 has a 128K token context window.” What’s a token?
Alex: A token is roughly a word or a piece of a word. The model doesn’t read words — it reads tokens. The word “hello” is one token. The word “unbelievable” might be three tokens: “un”, “believ”, “able.” On average, one token is about four characters, or roughly three-quarters of a word.
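That four-characters-per-token rule of thumb makes a quick back-of-the-envelope estimator. It is only an approximation; real tokenizers split text into learned subword units, and counts vary by model and language.

```python
# Rough token estimate using the ~4 characters per token rule of thumb.
# Real tokenizers produce exact counts; this is for quick budgeting only.
def estimate_tokens(text):
    return max(1, round(len(text) / 4))

text = "Explain retrieval augmented generation in one paragraph."
print(estimate_tokens(text))  # 14
```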
Thuan: And the context window is the maximum number of tokens the model can process at once?
Alex: Yes. Think of it as the model’s working memory. A 128K context window means the model can consider about 96,000 words at once — your input plus its output. That’s roughly a 300-page book. Sounds like a lot, but for complex analysis of large codebases, it fills up fast.
Thuan: And processing more tokens costs more money?
Alex: Yes. API pricing is typically per token. Input tokens — your prompt — and output tokens — the response — are priced separately. Output tokens are usually more expensive. This is why prompt engineering matters. A concise, well-structured prompt saves money and often gets better results.
The Limits You Need to Know
Thuan: What are the real limits of current AI that every developer should understand?
Alex: Five important ones.
Limit 1: No real understanding. The model doesn’t understand concepts. It predicts text. It can write perfect code for a sorting algorithm because it’s seen thousands of examples. But it doesn’t “understand” what sorting means.
Limit 2: Training data cutoff. The model doesn’t know about events after its training data ends. RAG can help, but basic LLM queries are stuck in the past.
Limit 3: Confidently wrong. The model never says “I don’t know” unless trained to. It always generates an answer. And wrong answers look exactly like right answers — same confidence, same fluency.
Limit 4: Context window limits. Even 128K tokens isn’t infinite. For very large codebases or documents, you need strategies like chunking, summarization, or RAG.
Limit 5: Non-deterministic. Ask the same question twice, get slightly different answers. This is by design — there’s randomness in the generation process, controlled by a parameter called “temperature.” At temperature 0 the model always picks its most probable next token, so output is close to deterministic and “safe.” Higher temperature means more creative and varied.
Key Takeaways You Can Explain to Anyone
Thuan: Summary time. If my mom asks me “what is AI?” — what do I say?
Alex:
- AI predicts the next word. It doesn’t think. It generates the most probable text based on patterns it learned from billions of examples.
- Machine learning is pattern recognition. Instead of programming rules, you show the computer examples and it learns the patterns itself.
- Embeddings turn words into numbers. Similar meanings become close numbers. This makes search, recommendations, and AI understanding possible.
- RAG gives AI your data. Instead of relying on its training, you retrieve relevant documents and include them in the prompt. Open-book exam, not memory test.
- AI is confidently wrong sometimes. Always verify. Never trust without checking. The more important the decision, the more important the verification.
Thuan: That last point is crucial for us as developers. AI is a powerful tool, but it’s a tool. It doesn’t replace judgment.
Alex: Exactly. Use AI to go faster. Use your brain to go in the right direction.
Thuan: Next time — putting AI into production. Because building an AI demo is easy. Making it work reliably on Monday morning is a whole different story.
Alex: Oh, I have war stories for that one.
This is Part 5 of the Tech Coffee Break series — casual conversations about real tech concepts, designed for listening and learning.
Next up: Part 6 — Cool Demo, But Will It Work Monday Morning?