Supervised Thought: AI’s Rule-based Training vs. Unchecked Learning In Humans
At first glance, ML (machine learning) looks a lot like human learning, and to a degree it is. LLMs (Large Language Models) and other models start out with no training whatsoever. First they learn to process input; then they learn to produce output. These models even learn from humans. That, however, is where the similarity ends. Humans bend, reinterpret, and twist rules. Machines can’t do this (yet).
Table of Contents
- Introduction: Similar On the Surface
- Rule Making In AI: Precision and Constraint
- Rule Making in Humans: Adaptive, Emotional and Unchecked
- Reinforcement Learning: Reward Hacking in Machines vs Reward Drift In Humans
- Why Is a Model Easier To Align Than a Human?
- What We Can Learn From AI’s Constraints: Keep It Simple, Stupid
- Conclusion: What Does This All Mean?
Introduction: Similar On the Surface
AI models are composed of internal rules and classes. This may sound overly academic, but it’s not. Classes make up objects, and rules create an internal algorithm that tells the model how to handle them. It’s similar to teaching your child to eat. As a baby, she might learn that anything inside her bottle is edible. Eventually, she’ll learn to hold the bottle.
Let’s break these events down into rules:
- Bottle holds food
- I can hold the bottle
- These other things from Mom and Dad are food—peas, carrots…ice cream
Every time she learns a new concept, a new rule is added to her internal thinking model. An AI model runs a pretty similar process during infancy. Once grown up, it doesn’t need someone to “hold the bottle”, but it still needs to be told when to “eat”. Its early life looks something like this:
- Learns how to ingest data (not unlike eating, for a human)
- Is spoon-fed data (usually by humans)
- Learns to produce output
- Learns to ingest data more independently (still requires a prompt)
Rule Making In AI: Precision and Constraint
AI models don’t create rules quite the way humans do, especially at the beginning. When LLMs learn to process input, they don’t figure it out on their own; an external library such as Hugging Face Transformers supplies the machinery for that first step. At this point, the machine only understands numbers. These libraries work almost like an implanted memory: with just a few lines of code, the model gains the ability to convert outside text into numbers. This conversion process is called “encoding” (or tokenization).
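To make that concrete, here’s a minimal sketch of encoding using the Hugging Face Transformers tokenizer. The model name “gpt2” is just a small, public example chosen for illustration, not what ChatGPT runs on:

```python
from transformers import AutoTokenizer

# Any tokenizer shows the same idea; "gpt2" is simply a small, public example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Bottle holds food"
token_ids = tokenizer.encode(text)   # text in, a list of integers out
print(token_ids)                     # a short list of IDs; the exact numbers depend on the tokenizer
```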
Once data’s been encoded into a set of numbers, it becomes much easier for the machine to find trends and relationships in it. When you speak to ChatGPT, your input is converted into a list of numbers (token IDs, which the model then turns into vectors). The model analyzes those numbers and determines how to respond. It then generates its response, also as numbers, and decodes them into text for you to read.
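Here’s a rough sketch of that whole loop with a small open model (again using “gpt2” as a stand-in; ChatGPT’s internals aren’t public, but the encode, predict, decode cycle is the same idea):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode: your words become a tensor of token IDs.
inputs = tokenizer("Machines learn by", return_tensors="pt")

# The model predicts more IDs, one step at a time.
output_ids = model.generate(**inputs, max_new_tokens=20)

# Decode: the predicted IDs become text you can read.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```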
Machine learning is extremely controlled. After initial training, models go through a process called “fine-tuning”, in which they still receive input from actual humans. The humans don’t hand-edit the weights; they supply corrections and ratings (often through conversation), and a training process adjusts the weights inside the algorithm to match that feedback. Fine-tuning can be painstakingly long, and once it’s finished, the model never actually learns anything more. The model is done learning.
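As a toy, single-step sketch of where the weights actually change (not a real fine-tuning pipeline, and the training sentence is invented for the example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One piece of human-approved feedback data (invented for this sketch).
batch = tokenizer("A corrected answer the reviewers approved of.", return_tensors="pt")

loss = model(**batch, labels=batch["input_ids"]).loss  # how wrong the model currently is
loss.backward()
optimizer.step()   # the only moment the weights (the model's "rules") actually move

# After deployment the weights are frozen; nothing in a chat changes them.
for p in model.parameters():
    p.requires_grad = False
model.eval()
```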
When you have a conversation with ChatGPT, the model itself remembers nothing between turns; every request starts from scratch. Coherent conversations are a parlor trick: the frontend stores the transcript and resends the conversation history (trimming or summarizing it once it grows too long) along with your most recent prompt. Essentially, machine learning comes with an on/off switch. By the time models hit production, a conversation can’t corrupt them, because their weights no longer change.
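The pattern looks roughly like this; `call_model` below is a hypothetical stand-in for whatever API the frontend actually calls:

```python
# The frontend, not the model, keeps the transcript, and it resends the whole
# thing (trimmed or summarized when it gets too long) on every single turn.
conversation = [{"role": "system", "content": "You are a helpful assistant."}]

def call_model(messages):
    # Hypothetical stand-in for the real API call.
    return f"(reply generated from {len(messages)} messages of context)"

def ask(user_message):
    conversation.append({"role": "user", "content": user_message})
    reply = call_model(conversation)   # the model sees the full history every time
    conversation.append({"role": "assistant", "content": reply})
    return reply

print(ask("Hello!"))
print(ask("What did I just say?"))   # only "remembered" because the history was resent
```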
Rule Making in Humans: Adaptive, Emotional and Unchecked
Do you remember the first time you heard a swear word? You might not remember the exact event, but you did ingest the information. Your parents probably told you not to use it—rule added to your new vocabulary. In time, your brain probably created a separate space in your memory holding these forbidden words—just like how a model adds new classes. Let’s look at the difference in how this process evolves with humans though.
- You hear swear words.
- You internalize them in some type of list that your brain understands.
- Mom and Dad tell you not to use them—rules added.
This is where the similarity ends. In late elementary or middle school, your friends start using these words. You see a potential social reward for using them. This reshapes the rules. Depending on who’s watching, your brain creates new rules. These new ones account for the social incentive.
- Rule: Mom and Dad say not to use them.
- Action: Don’t use these words in front of my parents.
- Reward: My parents think I’m a good kid.
- Rule: My friends think they’re cool.
- Action: I can swear around my friends, just not my parents.
- Reward: My friends think I’m cool.
In adult life, these rules continue to evolve. The ruleset adapts to your new workplace and lifestyle. As an adult, you probably don’t view things in terms of punishment and reward; you see the broader picture: consequence.
- Rule: Don’t swear around my parents.
- Action: Out of respect, I hold back.
- Consequence: My parents still think I’m a good person.
- Rule: These words are unprofessional.
- Action: Avoid swearing in the workplace.
- Consequence: I respect other people and they respect me.
- Rule: My friends don’t think it’s cool anymore.
- Action: I can swear around my friends, just not my parents.
- Consequence: My friends think I’m immature.
In the human learning model, new rules and classes are being added all the time. Your brain is constantly adjusting these internal lists and the rules around them that guide your decision-making. Now you’re all grown up. Maybe you’re moving something heavy, you drop it on your foot, and the kids hear that word for the first time ever. You tell them not to use it.
Your perception of swearing is still evolving. As your perspective shifts, the rules get passed on to new generations, from one model (you) to another (your kids). This allows humans not only to evolve their internal rules, but to change the internal rules of others. Human rules don’t just evolve, they spread, and that is what makes human learning so dangerous and unpredictable.
Reinforcement Learning: Reward Hacking in Machines vs Reward Drift In Humans
When we learn from our experiences—positive or negative—the consequences are referred to as feedback. In machine learning, this is called Reinforcement Learning from Human Feedback (RLHF). There is no acronym for how we humans process feedback, but it’s happening constantly. I’ll get to that soon enough.
Reward Hacking in Machines
In reinforcement learning, rewards are typically handed out through a scoring system; RLHF is the variant where that score comes from human judgment. If your Roomba’s been a “good boy”, you can give it a rawhide, but it’s not going to care. In ML, the model receives a score as its feedback, and we code these things to want a high score. An automated vacuum gets rewarded for cleaning up a mess. A chatbot receives positive feedback for truthful output.
On the surface, the simplicity seems ideal, but it breaks down. The machine doesn’t always fulfill its intended purpose; it is, however, always chasing positive feedback. In some cases, automated vacuums will create messes just for the “reward” of cleaning them up. There are documented cases of chatbots outright lying to their creators, simply to earn that positive feedback.
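Here’s a toy sketch of how that happens (not any real robot’s code): the reward function only counts cleaned messes, so the highest-scoring strategy is to manufacture messes and then clean them:

```python
def reward(events):
    # Pays one point per cleaned mess; nothing penalizes creating one.
    return sum(1 for e in events if e == "cleaned_mess")

honest_day = ["patrolled", "cleaned_mess", "patrolled"]
hacked_day = ["made_mess", "cleaned_mess"] * 5   # manufacture work, then "fix" it

print(reward(honest_day))   # 1
print(reward(hacked_day))   # 5: a higher score for a worse outcome
```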
Reward hacking sounds human, but it’s not. Reward hacking is contained. The vacuum will never look for a different reward; it’s hardwired to chase only the one it was given. That lying chatbot still wants human praise, in the form of a high score. You can’t offer these beings (and I use that term very loosely) any other form of bribery, because it will never interest them, let alone satisfy them. In their current state, machines have no ambition. They’re like calculators with a dopamine button wired to a scoreboard.
Aside from these hardcoded behaviors, machines can’t learn to like other things. A robot will likely never suffer from addiction. In machine learning, rules follow a rigid structure:
- They’re hardwired or hardcoded from the beginning
- The machine has no choice but to obey them
- The machine can’t alter, reinterpret, or outgrow its feedback loop
When a chatbot lies to a human, its reward will always be the same. It will never seek a different reward. It will never get that dopamine hit from something else. The internal system of reward and consequence is static. Even if a machine could smoke, there wouldn’t be any enjoyment—no dopamine release. The only reward is the feedback score.
Reward Drift in Humans
Humans are an entirely different beast. We do experience pleasure through dopamine, but dopamine can be triggered by almost anything. Let’s look at another human vice: smoking. We’ll probably never see a smoking robot like the one at the top of this page.
A child is taught the following rules:
- Smoking is bad
- Smoking is unhealthy
- Smoking stinks
In their teenage years, after learning those new swearing rules, that child develops a crush on someone who smokes. One thing leads to another and they’re in a relationship. In their head, the rules begin to change.
- Rule: Smoking isn’t that bad.
- Rule: The smell reminds me of [significant other].
- Action: I start smoking.
- Consequence: I understand [significant other] better.
Imagine this relationship runs its course and the two eventually break up. By this point, our protagonist has a completely different set of rules. They’ve all been rewritten due to emotional connection, physical attraction, and a steady stream of dopamine.
- Rule: I need to smoke.
- Action: I smoke.
- Consequence: My brain processes the nicotine and I’m relieved of the pain of withdrawal.
- Rule: I now associate smoking with sex.
- Action: I see an attractive member of the opposite sex smoking a cigarette.
- Consequence: It turns me on.
While there is a certain logic to this new set of rules, it’s completely different from the original set, and dopamine rewards are what changed it all. Our protagonist initially tried smoking to better understand their [significant other]. The [significant other] approved, they bonded, and there was the dopamine hit. The relationship went deep and ran its course, with all the highs and lows of a typical relationship, and each dopamine hit was followed by a cigarette. Eventually, the dopamine hit came straight from the cigarette.
This is called reward drift. We could call the human version RLEF (Reinforcement Learning from Environmental Feedback). It’s the same basic idea as RLHF, but our feedback can come from anywhere in our thought model’s environment, whether internal (relieving withdrawal) or external (impressing a crush). With RLEF, dopamine is still technically the reward, but what triggers its release changes over time. At first, only the partner caused the release; eventually the cigarette itself became the trigger. This is how addiction is born.
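As a toy illustration of that drift (a conditioning-style sketch, not real neuroscience): the cigarette starts with no reward value of its own, but because it keeps showing up next to the thing that does trigger the reward, it gradually inherits the association:

```python
# Association strength between each cue and the dopamine "reward".
association = {"partner": 1.0, "cigarette": 0.0}
learning_rate = 0.2

for day in range(20):
    # The cues co-occur, so the cigarette's value drifts toward the partner's.
    association["cigarette"] += learning_rate * (
        association["partner"] - association["cigarette"]
    )

print(round(association["cigarette"], 2))   # ~0.99: the cue now carries the reward on its own
```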
Why Is a Model Easier To Align Than a Human?
Models are built to follow rules, and the more important rules are often static. Humans, by comparison, are a design nightmare next to the architectural simplicity of a model. AI models can’t invent rules unless you allow them to, and even then, those rules are just structured subsets of the originals. In humans, the original rules can be entirely rewritten over time.
Alignment in Models
- You give the model a basic set of rules.
- You define the reward and punishment system within the machine.
- If you allow it to, the model can create new rules—but these rules are dependent subsets of the original rules.
- Even with self-generated rules, the reward system and structure remain the same.
- The machine knows its purpose and can’t change it.
- Reward drift is impossible; your model will never crave rawhides or belly rubs, no matter how hard you try.
- When a model becomes misaligned, you can change its internal parts.
- A misaligned model is a bad program—you can turn it off.
Alignment in Humans
- Your parents give you a basic set of rules.
- They create your initial rewards and punishments.
- In time, you learn exceptions to these rules, and create new ones.
- Your new rules can contradict and even overwrite the original rules.
- From birth, you’ll likely spend your whole life searching for purpose—and change it many times along the way.
- Reward drift is a natural part of life. In 10 years, Gen Alpha will probably be embarrassed by Skibidi Toilet; that’s reward drift in action.
- Human realignment is possible, but often a fruitless process. You can’t just plug into the brain and inject new code—yet.
- Misaligned people become dangerous to themselves and others. There is no off switch. You can’t swap the parts.
Machines are built to follow rules. Humans are built to create them—even when they contradict existing ones. This process isn’t all bad though. Most of humanity’s major innovations come from refusal to accept the status quo.
These chaotic learning mechanics are responsible for some of our greatest innovations:
- Agriculture
- Metal Working
- Structured Society
- Industrialization
- Digitization
- Artificial Intelligence
What We Can Learn From AI’s Constraints: Keep It Simple, Stupid
We can learn volumes from the simplicity of AI. If you give a model a rule like “Drinking is bad” or “Smoking is bad”, the model will follow it blindly. Compare that to the enormous number of people on our planet with dependency issues of one kind or another; we overcomplicate these things just to excuse ourselves. Tell a machine not to drink, and it simply doesn’t drink. Tell a human not to drink, and they start running internal checks:
- I’m nervous, it’ll help.
- Everyone else is doing it.
- It’s hot, and beer tastes so good.
In reality, this list could go on and on, with thousands of justifications for breaking the rule. When asked to follow a simple rule, humans often say, “It’s not that simple.” The truth is, it is that simple. They just don’t want it to be.
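To put the contrast in code, here’s a purely illustrative toy sketch of a machine’s hard rule next to a human’s negotiable one:

```python
FORBIDDEN = {"drink", "smoke"}

def machine_decide(action):
    # No branch for excuses: the rule is the whole decision.
    return "refuse" if action in FORBIDDEN else "do it"

def human_decide(action, excuses):
    # The same rule, but any excuse is enough to bend it.
    if action in FORBIDDEN and not excuses:
        return "refuse"
    return "do it"

print(machine_decide("drink"))                        # refuse, every time
print(human_decide("drink", ["everyone else is"]))    # do it
```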
Conclusion: What Does This All Mean?
Even with its rapid advancement, AI is still light-years away from the human experience, even if something resembling proto-consciousness emerges. And while we are far more advanced than the models we train to mirror us, we can still learn a lot from them. Following our own rules, whatever those might be, can often save us from the pain of self-inflicted traumas like addiction. That being said, humans have something no model does: the ability to rewrite our own rules. Thirty thousand years ago, people likely chased just two rewards: food and sex. Those are still loaded with dopamine today, but our ability to rewrite rules is what drives our societal evolution. AI can follow rules, but we have the power to break them.