Scrapbooking Intelligence

Gargee Dixit

SY B.Sc. 

Source: Pinterest 

Yes, another mediocre AI article in the vast sea of the Internet; but I think I am approaching this topic with a fresh perspective (backed by John Green). When people think about AI, they often consider the philosophical arguments about Artificial Intelligence or how to ‘harness’ its potential. Rarely, though, do they reflect on how AI was constructed and how it derives its ‘thinking’ power.

Generative AI is a specific type of AI that uses neural networks to identify patterns within the dataset it has been given in order to generate new, ‘original’ content (the debate about whether AI content is original will be explored later). One example of this is Large Language Models (LLMs): essentially your ChatGPT, Gemini, Snapchat AI, etc. When you ask ChatGPT to generate a text or an image, it is not doing so based on its ‘own’ intelligence; it is borrowing and reorganising information from the web to provide you with an answer. It is scrapbooking it together without adding any real value. So if GenAI isn’t creating its own content out of thin air, whom is it borrowing from? Whose shoulders is it standing on and relying upon?
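A deliberately crude way to see the ‘scrapbooking’ at work is a bigram model: it generates text purely by recombining word pairs it has already seen, so every word and every transition it emits was lifted from its training text. The tiny corpus below is made up for illustration (real LLMs are vastly more sophisticated), but the borrow-and-reorganise flavour is similar:

```python
import random
from collections import defaultdict

# A made-up toy corpus standing in for "the web".
corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "and the cat chased the dog across the mat"
).split()

# Bigram table: for each word, which words followed it in the corpus?
following = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word].append(next_word)

def generate(start, length, seed=0):
    """Generate text by repeatedly sampling a word that followed the current one."""
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        candidates = following.get(words[-1])
        if not candidates:  # dead end: this word never appeared mid-corpus
            break
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("the", 10))
```

Whatever sentence comes out, every word and word pair in it already existed in the corpus; the model has rearranged, not invented.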

The answer is a bit complicated and, as you will see later, comes with a lot of legal issues. The reason AI works is its large processing power. In its current form, AI trains on a huge dataset, from which it gathers information, establishes patterns and observes behaviour. Just like a regression model in statistics, the model is only as good as the data you give it, and that is the whole crux of the problem. You need data, and you need good data. The fact that ChatGPT cannot understand the sarcastic reporting of ‘The Onion’ and takes everything at face value is proof of that. In the 21st century, data is the new oil. Every company is hungry for it, and just like with oil, the US is the leading entity. Without good data to train on, the entire point of the Internet falls apart. The algorithms that govern our activity and the ways we research on the internet are all woven together with data and how companies choose to use it… and sell it.
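The regression analogy can be made concrete with a toy sketch (the numbers and the corruption scheme here are my own invention, purely for illustration): fit a line to clean data, then to data where a third of the labels are junk, and watch the fitted slope drift away from the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

# Clean data: the true relationship is y = 2x + 1, plus mild noise.
y_clean = 2 * x + 1 + rng.normal(0, 0.5, x.size)

# "Bad" data: same relationship, but a third of the labels are scrambled,
# standing in for mislabeled or low-quality training data.
y_bad = y_clean.copy()
corrupt = rng.choice(x.size, size=x.size // 3, replace=False)
y_bad[corrupt] = rng.uniform(0, 30, corrupt.size)

slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_bad, _ = np.polyfit(x, y_bad, 1)
print(f"slope from clean data: {slope_clean:.2f}")  # close to the true slope of 2
print(f"slope from junk-laced data: {slope_bad:.2f}")  # pulled away from 2
```

Same fitting procedure, same amount of data; only the data quality differs, and the model trained on junk is measurably worse.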

Companies have sold their data to Meta, Microsoft and OpenAI so they could train their LLMs, while some websites (like Reddit) have given LLMs free rein to roam around and scour their data. This raises the question: which human-created content is off limits? Well, YouTube videos should be. A vast database of videos, 14 billion of them in fact, is available on YouTube; it would be foolish not to train GenAI on such a platform. And indeed, according to John Green, Google is using YouTube videos to train its Gemini AI without the knowledge of the YouTubers.

That raises serious questions about the consent not only of YouTubers, but of all of us as internet users. There is a difference between open-source content and content in public spaces. The former is copyright-free and can be modified and redistributed. The latter is owned by companies or individuals who can very much retaliate against Google. But how? There is no legal precedent for AI disputes, nor are there laws or regulations that take AI into consideration. The New York Times is currently suing OpenAI for illegally using and profiting off its work, but the legal proceedings have yet to unfold. A legal argument against the YouTubers can be made that AI is technically ‘learning’ from their content the same way humans learn from it and improve their knowledge. This brings us back to the philosophical paradox of constructing laws around AI.

Do you treat AI akin to humans, or akin to technology? Do you create a separate legal classification for AI? Well, the European Union is certainly taking a crack at it, but its effort is still premature and cannot answer the question relevant to us: how do you classify AI? Is it a human-like intelligence that actually creates value by training and ‘learning’ on pre-existing data, just as we draw inferences from the past and build new theories? Or is AI just a con artist, trying to mimic human intelligence (omg Oh My Sapien name drop) and human-ness while being a glorified big-data processor?

One can ask: why do you have to train AI on human-generated content? Why can’t we train AI on AI-generated content itself? Well, like infinite-money-loop schemes, it doesn’t work. Some researchers included AI-generated data in the training samples, and the model collapsed into itself over successive iterations. It is poisonous to itself, and that is very concerning for the future. How will one separate AI-generated data from non-AI data when forming a training sample? How does one know the sample is not contaminated, when our AI detectors aren’t robust yet? This situation very much reminds me of the space debris debacle. We, as ambitious humans, kept sending satellites and rockets into space until a layer of junk formed around the Earth, and now it is actively hurting our space missions.
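The devolving effect can be caricatured in a few lines, under one big simplifying assumption of mine: the ‘model’ is just a fitted bell curve that, like real generators, rarely emits low-probability outliers. Each generation trains only on the previous generation’s output, and the spread of the data shrivels toward nothing:

```python
import random
import statistics

random.seed(0)

# Generation 0: "human" data, drawn from a wide distribution.
data = [random.gauss(0, 1) for _ in range(2000)]

spread = []
for generation in range(8):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    spread.append(sigma)
    # The "model" resamples from what it learned, but (like real generators)
    # rarely emits low-probability outliers, so the tails get clipped.
    samples = (random.gauss(mu, sigma) for _ in range(4000))
    data = [s for s in samples if abs(s - mu) < 1.5 * sigma][:2000]

print([round(s, 2) for s in spread])  # the spread shrinks every generation
```

Each round of train-on-your-own-output loses the tails of the distribution, so the data grows blander and blander, which is one intuition for why recursive training degrades real models too.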

So how is all this relevant for the average person, who probably uses ChatGPT to generate Instagram captions? If YouTubers’ data is being used to ‘train’ AI models without their knowledge, and human-generated content is a necessity, is it a big leap to ask whether the data we willingly provide to Google, Meta and Microsoft is being used to feed ‘human-made’ content to these generative AIs? Are you sure that you, as a consumer (and a product) of the internet, are okay with ChatGPT or Gemini going through your photos, videos and texts, and using your likeness to generate content?

This is just one of many stones left unturned in the field of AI. AI has changed how we view the world economically, socially and legally, and we need to keep thinking about how it will affect our everyday lives. So I will leave you with arguably the most famous and important quote regarding AI, by Joanna Maciejewska:

“I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes.”
