It's easy to play with a new AI tool for fun or for non-mission-critical work. But when using an AI tool for anything important (i.e., anything with real-life consequences), it's worth knowing a little about how it was created (including what data it uses) and how it works, since those things can greatly affect what it can do and the answers or output it gives.
In this micro-lesson, you will learn how GenAI (in particular, ChatGPT) is created, what data it is built on, and where that data comes from (where this is knowable). You will also learn how that affects what AI can (and can't) know or do, and begin to identify the privacy implications and the human roots of AI bias.
Read:
How ChatGPT and our foundation models are developed and How your data is used to improve model performance (5 min read)
AI Automated Discrimination. Here’s How to Spot It (15 min read; if you can’t access that link, try this)
Metz, C., Kang, C., Frenkel, S., Thompson, S. A., & Grant, N. (2024, April 6). How Tech Giants Cut Corners to Harvest Data for A.I. The New York Times. https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html (18 min optional but highly recommended read)
Discuss:
What are some of the implications you can see from how AI models are trained and what does that mean for the output you get from generative AI tools?
20 replies on “thing 10: How AI Was Created and Why It Matters”
One implication of how AI models are trained is that both correct and incorrect information is used to train the model. While OpenAI is purposeful in excluding materials from the dark web or adult websites, does it do the same for conspiracy theories that don’t fall into the filtered categories? Maybe it was this microlearning session or somewhere else, but there is a concern amongst political scientists that AI could be used to spread disinformation, and with automation bias, people might just be inclined to believe the disinformation.
We should be cognizant that AI is limited by human fallibility and always will be. As such, we should practice checking the AI's sources and fact-checking against other trusted sources.
I can only wonder how AI models trained on so much data will be affected and constrained by the growing impact of compliance with copyright law, privacy laws, and limited access to personal data. When there is no more new data to train the models on, will reliance on synthetic data corrupt the validity of GenAI, a sort of overfarming with the same negative effects?
Based on the readings on OpenAI's policies, I was struck by how vague some of the language was. Saying things like they "avoid" allowing the models to interact with and access different kinds of information was interesting, because why wouldn't they outright say that they restrict, prevent, or prohibit their models from doing so? I understand that they are covering themselves legally with that specific language, but I think it has big implications for how the models are trained and the results we see. Using vague language and saying they try to tell the models not to be biased or seek private information can give the uneducated user a feeling of assurance in the outputs that they shouldn't have. I think this makes it even more important to fact-check what we receive from GenAI. I also find it telling how many hoops one must jump through just to learn how to turn off the setting that allows the model to train on your conversations, and how even turning it off may not fully prevent training (as noted with the like and dislike feedback scenario).
GenAI's training makes it inherently biased, because the humans giving it information are subconsciously biased themselves. This might be too meta, but eventually I see us getting to a point where AI trains AI, eliminating the human input. That can only be beneficial if it is able to overcome human bias. I do think it is interesting that people seem more concerned with systematically fixing AI bias than human bias. And by interesting I mean disappointing.
But until we can get better-than-human input, we won’t get improved outputs. Just like any other work done by people, we need to double check the AI’s output to ensure its accuracy.
Very interesting to think about the biases that may be a part of AI. While AI is trained on lots of information, it strikes me that there is still a lot of information that isn't available (rightfully so) for it to be trained on. It seems that while AI is a good tool to use, it should not be the end-all, be-all.
This race for data at scale has seemingly overridden everything else at these companies: there is no way to have any quality control when you are looking at trillions of tokens. Copyright law and understanding of fair use have lagged dramatically behind technology and emerging trends for years, and generative AI has really amplified this. I wonder too if this has an impact on the quality of output as the hallucination rates have seemed to increase, not decrease over time.
While training on a large quantity of data would help the algorithm give the best answers, and could potentially reduce bias by including underrepresented sources and images, it comes at real harm to the people who created those works. I have seen that play out in multiple spaces I am in online, where small authors, YouTube creators, and artists worry about their work being used in ways that could directly compete with their ability to continue supporting themselves through their creations.
I recently read an article about how many of the Chinese models are open source, while there are almost no open-source US models. I wonder if this is part of the larger issue, where everything is being done with little transparency by individual groups driven by profit-maximizing factors, rather than approached as a bigger discussion about how we can fairly compensate people, produce better-quality outputs, and stay on the cutting edge of this emerging technology.
I worry about how the training might be manipulated by people. Could people start producing lots of publicly available documents with incorrect information on purpose to adjust what AI sees and then gives back to us?
I did not see any problems with the way AI systems are trained.
One thing that stood out for me was that AI is consuming data faster than it is being produced, and it is predicted that companies will run out of data by 2026. I did a bit more research and found that date has now shifted to 2028, but that’s still disturbing. One possible solution (which some companies are already using) is that AI will produce “synthetic” data and use that data to train itself. Researchers are divided on whether this will work. I’m doubtful.
My whole life I have been learning about biases and discrimination. We are very concerned with what is fair or not from a very early age and for good reasons. These issues were a problem before GenAI. We all potentially have biases and may discriminate inappropriately because of them. I try to identify those biases in things I see or read or in actions people take. We should continue those efforts in all endeavors, including with GenAI.
Gen AI is a mere reflection of the ideas people already hold. This includes biased ideas and is why critical engagement is necessary for those who choose to use it. This alone proves why the work we do in the humanities and social sciences is so important because no matter if we’re talking about ideas in more traditional mediums or AI, human assessment is still necessary.
The way ChatGPT works, using the "next most likely word when generating a response, one word at a time," shows that an AI response is really pattern prediction, choosing the most likely word to follow, rather than understanding of the text. It comes down to what the AI models are trained on to ensure the best possible answers can be deduced. The amount of accurate data needed to outweigh data that may be fictional or opinion-based is huge. I especially see an issue with older texts and historic narratives (tales are written by the ruler) being able to skew answers. "AI can only learn from what it's been given."
I can attest that there is bias against non-native speakers in all kinds of voice-controlled devices and software. Simple things, like talking to a navigation app do not work for me as well as for my husband.
AI entrepreneurs argue that scale is the most important thing and that long-form writing is the best way to train GenAI, but they don’t want to pay authors for their work. They are happy to pour millions of dollars into data centers, but they are not willing to appropriately value human creative labor. I think this is an important facet of GenAI that requires our ethical attention.
As a statistician, one of the things I worry about is the representativeness of the data that I am using to develop my models. This isn’t a new question. For the GenAI models, however, the amount of data they use takes the issue to a new scale. When you are considering huge portions of the internet, it is difficult to assess all (or even a fraction) of the dimensions along which a data set might be unrepresentative.
One of the things that strikes me is the vagueness of the OpenAI guidelines. They’re blocking certain content from being used for training, but what happens when that changes? It seems that the companies running these services could flip a switch at any moment to include or exclude information they deem suitable (or unsuitable).
I thought a striking thing in the ChatGPT notes was how vague they were about what constituted the "public" vs. "private" data it pulled from. The Vox article and NYT article both confirmed why: the large companies are drawing on a lot of data that a conventional user might not consider freely in the public sphere. These conversations remind me a lot of how the law in general is struggling to redefine "public" for an online age. Madison and the Founders were terrified of someone reading the letters locked in their desks; the possibility of Google Docs being scraped to train models should be taken just as seriously.
I was also struck by how open the AI companies themselves were with how little “good” data there is left to scrape. Where will the next database come from, as we push for more and more noticeable improvements? Once we tell all the truths, are there only lies left?
The biggest issues with training AI models are where the input data comes from, how thorough it is, and the underlying weights given to responses. The input data could be incorrect or not thorough enough to produce a proper output; in other words, garbage in, garbage out. One example: if someone wanted to create a facial recognition AI model that detects faces to determine a person's age and gender, but trained it only on images of 40-year-old men from one city, it might be good at the one thing it was trained on, but anything outside that would most likely be wrong or return no result.
AI bias reflects the same issue internet search tools experienced in the 1990s: the results come from the data fed to the algorithms. When the majority of the data was produced by a white, Euro-centric society, those societal biases appear in the output. I disagree with Sam Altman's approach, that "systems will eventually fix themselves." Why not bring a cultural perspective into development by consciously selecting training data that reflects the diversity of the human world? That's intelligence.
Knowing how AI was created and how it trains is important for people to know. Even as AI improves on accuracy and bias, we always need to remember that humans are behind it; it is not perfect; it is not all-knowing. If there is very little information in the world on a subject, AI will not be as helpful or will simply make something up. As higher education has become less popular among some, I find the teaching and research done at our institutions all the more important in the age of AI and misinformation.
AI models are trained on massive datasets of text, images, and other media, which means their outputs reflect the strengths and weaknesses of that training data. On the positive side, this allows them to generate responses that are broad, flexible, and often highly creative. But it also means outputs can carry biases, gaps in knowledge, or inaccuracies inherited from the data. The tools may also ingest private personal information. While ChatGPT may use our personal data lawfully, we would still prefer some information not to become public.