An Understanding and Explanation of the Objective Truth About LLMs
Reddit has become the most popular girl at the dance these days; all of a sudden, every AI company is tripping over itself to do a deal with them. What’s so appealing about Reddit that would cause AI companies to want to form content partnerships with it?
Earlier this year, for example, Google announced a $60 million per year agreement with Reddit to use its data to train Google’s AI systems. And last week, OpenAI announced a similar deal with Reddit.
Reddit has fundamentally not changed in nearly two decades of existence. It’s a traditional site, a forum mostly known for opinionated and biased discourse. So why would the likes of Google and OpenAI call on Reddit for the important task of training their AI systems?
Large Language Models (LLMs) are running out of good options, and the hype and fascination are dying down. More people are beginning to realize that all the wonderful stories that have been told about the future of AI are not coming true.
Therefore, as reality sets in, so does desperation. OpenAI’s ChatGPT and Google’s Gemini have to consume massive, copious amounts of data; the refrain is always that more data is needed to make them work.
They push the narrative that more data will make the models work better, which is fundamentally untrue. This drives the impression that LLMs can interpret natural language, written or spoken. Under this premise, turning to Reddit’s platform for common-vernacular data would make sense. But it’s not that easy or simple.
The objective truth is that LLMs are struggling to interpret human-like language and context. Nevertheless, you still have the likes of OpenAI’s Sam Altman speaking untruths about the fantasy of Artificial General Intelligence (AGI): that it is going to happen someday, or that it is just around the corner.
So, according to Altman, billions in investment must continue to flow to keep the fantasy alive. But more data doesn’t get us any closer to the AGI fantasy, and they know that, but they are stuck! So much money and so many stories have been invested in driving LLMs that the project has taken on a too-big-to-fail tone. I wonder how, and by whom, they think they will be bailed out.
“I don’t care if I burn $50 billion a year,” Sam Altman said in a recent talk at Stanford. A grandiose Altman asserted, without qualification, that “we’re making AGI and I don’t care if I burn $50 billion a year,” downplaying, of course, why his current approaches are not getting us any closer to AGI. The elephants in the room: hallucinations and unreliability. Despite the disproportionate boatload of money thrown at OpenAI and generative AI, it continues to struggle. Why?
LLMs have some amazing capabilities, but the hard fact is that training data can’t capture everyday vernacular.
The colloquial or conventional words that allow us to communicate meaning clearly and concisely don’t translate into machine-language functionality in any meaningful or convincing way.
LLMs are neither natural nor sustainable; they are an “impossible language,” according to the father of modern linguistics and former MIT professor Noam Chomsky. Chomsky says, “We don’t know the world and don’t understand nature and its restraints. Too often we accept things that can’t ever logically be true.”
Best-selling author Yuval Noah Harari, whose books include Sapiens and 21 Lessons for the 21st Century, says humans are “susceptible to nice stories”.
Søren Kierkegaard, an existentialist philosopher, once said there are two ways to be fooled: “One is to believe what isn’t true. The other is to refuse to believe what is true.” Humans have straddled the two throughout time with the many different conceptualized forms of phenomena. Humans tend to accept things that go beyond reason or lack evidence, too often relying on emotion and irrationality.
So the idea of ‘natural language’ models that are supposed to be able to communicate like humans is, Chomsky says, pure science fiction. Language is a “creative infinite act,” he says, constructed in real time with no restraints, and the genesis of language is based on human experiences through organic matter. No mechanical or artificial form or system can replicate human biology and experiences.
When asked the big question in an interview with “Machine Learning Street Talk” about ChatGPT “receiving massive investments, and continu[ing] to be hyped beyond belief despite very strong theoretical arguments for the futility of learning language from data alone,” Chomsky answered: “LLMs have not achieved anything in this domain…achieved zero. It’s a theory of anything goes…that includes all the laws of nature, the ones we know and the ones we do not know yet.”
“With a supercomputer, it can look at 45 terabytes of data and find some superficial regularities, and it can imitate. ChatGPT has done nothing!”
French software engineer and computer scientist François Chollet calls LLMs “make-believe AI…thus the road to nowhere.”
American cognitive scientist, computer scientist and entrepreneur Gary Marcus has even called it a parlour trick. Recently, Marcus wrote: “Information pollution reaches new heights…AI is making shit up, and that made-up stuff is trending on X.”
He also suggested in an article in the Financial Times that:
“Performance may get worse: large language models produce untrustworthy output, which is then sucked back into other LLMs. The models become permanently contaminated.”
LLM makers are desperately seeking whatever gets them closer to human-like written vernacular, i.e., Reddit.
Still, much of this natural writing sits in private places: human electronic communications like WhatsApp, text messages, emails, etc. Most people, however, are unwilling to turn over their chat history, and their privacy, to aid Big Tech, i.e., OpenAI and Google.
And even if they could get this training data, they still couldn’t match natural general human intelligence in the vernacular. Human communication and writing are unfiltered, while machines are filtered and programmed. Moreover, meaning is communicated through deeply personal experiences and in real time, powered by our surroundings and the five senses.
So casual everyday writing, what Big Tech seeks from Reddit, cannot be framed and communicated naturally through what Chomsky calls an impossible machine language.
The danger of training data from Reddit
Reddit is an anonymous site where anyone can create an account and post about anything, writing under a pseudonym. This anonymity encourages unfiltered, biased comments, with people voicing opinions about what they think they know. It is a platform where everyone is brave, because no one has to present sources for verification or put their name on things.
Reddit is a voting machine: users vote on the quality of each post, elevating it to the top or sinking it to the bottom. Not very objective, to say the least.
So, as in politics, the best storytellers (or liars) with the most convincing posts win out; those that most reflect the Reddit community consensus tend to get upvoted. The most extreme opinions become the gatekeepers, proliferating garbage in, garbage out. Again, as Gary Marcus said: the models become permanently contaminated.
Accordingly, using Reddit for training data presents clear and present dangers, significant AI-safety risks among them. Anytime wrong and biased information becomes normalized, the truth suffers, and so does humanity.
It is an open question how Reddit’s users will react to their data being sold to Big Tech. Furthermore, when these AI companies are finished with that data, what happens to it? It will likely be sold somewhere else on the web. So Big Tech is weaving an arrogant and tangled web in pursuit of LLMs…AGI. And more significant challenges lie ahead for LLMs.
The Reddit platform can contaminate models and become a breeding ground for manipulation and misinformation, for misogyny and cyberbullying, for angry posts and extreme opinions. What is popular, rather than what is true, comes to dominate: the views of adolescent white males, diversity be damned.
The platform caters to young, angry white males, many with low self-esteem who hate women because they’ve been rejected so often by them. Reddit, therefore, can serve as a venue to vent their frustrations…a waste pool of toxicity, unhelpful to society. If this is the vernacular Google and OpenAI are after, then they too become a threat, degrading society through LLMs.
Effectively, LLM colloquial-speech data will be filled with bias, divisive speech, and trolling, aiding the circumvention of democracy, which is now trending in America. So models trained on conversational language data can effectively absorb biases, for example from right-wing agendas, and biases against specific racial groups. Stereotyping and hate become more prevalent and more emboldened in LLMs.
The implications here are serious, and Big Tech doing training-data deals with Reddit is alarming, but it’s also a sign of desperation. Cornered by their own rhetoric, these companies are now forced to go to the fringes to try to underpin themselves. Another red flag!
Unregulated and uncontrolled, generative AI, in both text and images, can also paint a picture of a world that amplifies bias in gender, race and beyond. Big Tech and its billionaires don’t care, they don’t fear governments or regulation policies. What they fear most, however, are meaningful reductions in their wealth. So they create crafty narratives to sell you more stuff you don’t need.
Whether that produces toxic images and text, the tech billionaires and the new tech aristocracy, along with their courtiers (aka VCs), don’t care. Market share and wealth are all that matter. This small group, dominated mainly by white males, ends up ruling the universe and producing harm, and that is seen as good business by them.
These biased LLMs, for example, can become harmful to certain groups of people when used in areas like housing and mortgage-financing decisions, business loan applications, criminal proceedings, healthcare and social programs, public policymaking and politics, access to venture capital and much more. So if we don’t put time into AI safety, we are only setting our societies up for more harm.
Einstein warned against putting the cart before the horse: putting out nice theories (neural networks and LLMs) and then setting out to prove them. But what happens when those theories can’t be proven to work?
Agatha Christie said, “Everything must be taken into account. If the fact will not fit the theory — let the theory go”.
We must take a physics-guided approach to things and accept scientific facts — accept reality and not fall for nice stories. And for scientists, don’t fall in love with your own theories…for fame and fortune.
Einstein was clear: “…there is no method capable of being learned and systematically applied so that it leads to a new [principle]. The scientist [must] worm these principles out of nature by perceiving, in comprehensive complexes of empirical facts certain general features which permit precise formulation.”
Facts or statements about things can’t be isolated to serve nice theories; they can often be conceptualized and highly selective of the “facts”. So when it comes to certain theories about generative AI, theories and stories can become self-serving for those who want to sell us more stuff.
We must also remember that facts can be statements about phenomena, and they don’t exist on their own. They too can be conceptualized, which means they are, in the words of cosmologist Stephon Alexander, “…implicitly constructed theoretically.”
Therefore, we must be intelligent and apply our intelligence, thinking for ourselves and thinking critically. Be skeptical, and consider what we’re being sold today about certain aspects of AI, i.e., GenAI: nice narratives meant to convince us of an alternative reality that doesn’t exist.
Chomsky, once again, tells us that this impossible language is “a made-up language that violates every principle of language, so there is no point in even looking at its deficiencies because it does nothing! And all this is doing is wasting a lot of energy in California. You can fool New York Times reporters ecstatic about ChatGPT-4, but you shouldn’t be able to fool scientists,” he says.
Therefore, we need a better, more focused and practical way to utilize the scientific value of language models, in the best interest of humanity.
LLMs operate on billions of pieces of data, designed to draw relationships between words in different contexts. Amazing, yes, but at the end of the day, when it comes to GenAI applications, an LLM is only a predictor of the next word: the combination of words most likely to appear next to each other.
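To make the next-word-predictor point concrete, here is a minimal sketch of the idea using a toy bigram counter. The corpus and function names are illustrative only; a real LLM uses a neural network trained on trillions of tokens rather than raw counts, but the “predict the likeliest next word” objective is the same.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM trains on trillions of tokens, not a few sentences.
corpus = "the cat sat on the mat . the dog sat on the rug . the cat ate ."

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
words = corpus.split()
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str):
    # Return the statistically likeliest next word: no understanding,
    # just frequency counts over the training text.
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # -> 'cat' (it follows 'the' most often above)
print(predict_next("sat"))  # -> 'on'
```

The toy model “predicts” only because “cat” happens to follow “the” most often in its data; nothing in it knows what a cat is. Scale the counts up to billions of parameters and the predictions become eerily fluent, but the objective does not change.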
"It’s not an autonomous, intelligent system, able to think and decide like we do. Instead, as Emily Bender and colleagues emphasize, generative AI is a mimic of human action, parroting back our words and images. It doesn’t think, it guesses — and often quite badly in what is termed AI hallucination."
— Kean Birch, director of the Institute for Technoscience & Society at York University.
Humans do not perceive things as they are; rather, we are constantly inventing our world and correcting our mistakes by the microsecond. We experience the world and the self with, through, and because of our bodies. Accordingly, intelligence is not only in the mind; it’s also in our physical body, with reactions based on our senses. This commonsensical explanation of reality runs against the science-fiction neural-network theory that pushes intelligence as something only in the mind.
According to Chomsky, LLMs are synthetic, not natural: made up, violating all the rules of nature, grammar, culture, experience and symbolism. Those rules of language allow us to communicate effectively as humans; we can all pick up context and spot errors, falsehoods and tone; most of us can tell when the other person is lying, and know when we are lying too.
Machines have none of these abilities, which is why the problem of making stuff up, hallucination, persists with no cure in sight. That continues to make LLMs unreliable.
LLMs, like pure mathematics, are made up and can fool you through their elegance; that is how ChatGPT performs, with the illusion of ‘intelligence’, creating and adhering to its own made-up rules and trying to force them onto society, for power, profit and control.
Russell’s paradox (also known as Russell’s antinomy), from the Nobel Prize-winning mathematician and philosopher Bertrand Russell, sums things up well: it shows that every set theory containing an unrestricted comprehension principle leads to contradictions.
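For readers who want it spelled out, the paradox fits in two lines (standard textbook notation, not tied to any source quoted here):

```latex
% The set of all sets that are not members of themselves:
R = \{\, x \mid x \notin x \,\}
% Asking whether R belongs to itself contradicts either answer:
R \in R \iff R \notin R
```

The analogy being drawn: a system that admits unrestricted rules (“a theory of anything goes,” in Chomsky’s phrase above) ends up licensing contradictions rather than truth.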
Using Language Models properly and responsibly
There is still significant value available in language models; applied properly and responsibly, they can be effective value creators in all aspects of society, from organizations to individuals.
Unfortunately, in the hands of Big Tech, LLMs are being developed to serve the profitability needs of their billionaires, without regard for the harm caused to humanity. But more humanity-friendly, useful tools can be developed to enhance human productivity and value.
So it doesn’t have to be Large Language Models or die trying, as 50 Cent’s famous song says.
About 6ai Technologies
6ai Technologies utilizes generative AI purposefully and responsibly to augment human capacity and ingenuity. Our proprietary six steps to applied intelligence (6ai) guide users through a scientific strategy-design process, retrieving highly relevant information from verified and reputable sources that can be taken as empirically warrantable and turned into insights.
Our Focused Language Models (FLMs) use a retrieval-augmented generation method to give our template language models access to highly relevant knowledge bases. FLMs are thus more practical, more efficient and significantly less expensive than training LLMs from scratch, and much easier to manage.
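As a rough illustration of how retrieval-augmented generation works in general, here is a generic, self-contained sketch. It is not 6ai’s proprietary pipeline; the bag-of-words “embedding” and all names are stand-in assumptions for whatever a production system would use.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. Real systems use neural embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    # Rank curated source passages by similarity to the query.
    q = embed(query)
    return sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)),
                  reverse=True)[:top_k]

def build_prompt(query: str, knowledge_base: list[str]) -> str:
    # Condition the language model on retrieved sources, not on whatever
    # happened to be in its training data.
    sources = "\n".join(f"- {p}" for p in retrieve(query, knowledge_base))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"

knowledge_base = [
    "Retrieval-augmented generation grounds model output in supplied documents.",
    "Reddit signed data-licensing deals with Google and OpenAI in 2024.",
    "Strategy design starts from a clear statement of the problem.",
]
print(build_prompt("What grounds a retrieval-augmented model's output?", knowledge_base))
```

The design point is that answers are grounded in retrieved, curated sources rather than in the model’s training data, which is why this approach is cheaper and easier to manage than training an LLM from scratch.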
The strategy-canvas allows anyone to design anything from high-level, multi-billion-dollar corporate strategies to personal growth strategies, at an infinitesimal fraction of the cost of using traditional consultants or advisors, and without specialized tools or training.
6aiTech.com is rapidly democratizing access to technical capabilities once dominated by consultants and advisors. 6ai uses GenAI properly and responsibly, providing a higher level of useful, robust and reliable insights and empowering do-it-yourself strategy.