It's easy to play with a new AI tool for fun or for non-mission-critical work. But when using an AI tool for anything important (i.e., anything with real-life consequences), it's worth knowing a little about how it was created (including what data it uses) and how it works, since those things can greatly affect what it can do and the answers or output it gives.
In this micro-lesson, you will learn how GenAI (in particular, ChatGPT) is created, what data it is built on, and where that data comes from (where this is knowable). You will also learn how that affects what AI can (and can't) know or do, and begin to identify the privacy implications and the human roots of AI bias.
Read:
How ChatGPT and our foundation models are developed and How your data is used to improve model performance (5 min read)
AI Automated Discrimination. Here’s How to Spot It (15 min read; if you can’t access that link, try this)
Metz, C., Kang, C., Frenkel, S., Thompson, S. A., & Grant, N. (2024, April 6). How Tech Giants Cut Corners to Harvest Data for A.I. The New York Times. https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html (18 min optional but highly recommended read)
Discuss:
What are some of the implications you can see from how AI models are trained and what does that mean for the output you get from generative AI tools?
20 replies on “thing 10: How AI Was Created and Why It Matters”
One implication of how AI models are trained is that both correct and incorrect information is used to train the model. While OpenAI is purposeful in excluding materials from the dark web or adult websites, does it do the same for conspiracy theories that don’t fall into the filtered categories? Maybe it was this microlearning session or somewhere else, but there is a concern amongst political scientists that AI could be used to spread disinformation, and with automation bias, people might just be inclined to believe the disinformation.
We should be cognizant that AI is limited by human fallibility and always will be. As such, we should practice checking the AI's sources and fact-checking against other trusted sources.
I can only wonder how AI models trained on so much data will be affected and constrained by the growing impact of compliance with copyright law, privacy laws, and limited access to personal data. When there is no more new data to train the models on, will reliance on synthetic data corrupt the validity of GenAI, a sort of overfarming with the same negative effects?
Based on the readings on OpenAI's policies, I was struck by how vague some of the language was. Saying things like they "avoid" allowing the models to interact with and access different kinds of information was interesting, because why wouldn't they outright say that they restrict, prevent, or prohibit their models from doing so? I understand that they are covering themselves legally with that specific language, but I think it has big implications for how the models are trained and the results we see. Using vague language and saying they try to tell the models not to be biased or seek private information can give the uneducated user a feeling of assurance in the outputs that they shouldn't have. I think this makes it even more important to fact-check what we receive from GenAI. I also find it telling how many hoops one must jump through just to learn how to turn off the setting that allows the model to train on your conversations, and how even turning it off may not fully prevent training (as noted with the like and dislike feedback scenario).
GenAI's training makes it inherently biased, because the humans giving it information are subconsciously biased themselves. This might be too meta, but eventually I see us getting to a point where AI trains AI, eliminating the human input. That can only be beneficial if it is able to overcome human bias. I do think it is interesting that people seem more concerned with systematically fixing AI bias than human bias. And by interesting I mean disappointing.
But until we can get better-than-human input, we won’t get improved outputs. Just like any other work done by people, we need to double check the AI’s output to ensure its accuracy.
Very interesting to think about the biases that may be a part of AI. While AI is trained on lots of information, it strikes me that there is still a lot of information that isn't available (rightfully so) for it to be trained on. It seems that while AI is a good tool to use, it should not be the end-all, be-all.
This race for data at scale has seemingly overridden everything else at these companies: there is no way to have any quality control when you are looking at trillions of tokens. Copyright law and understanding of fair use have lagged dramatically behind technology and emerging trends for years, and generative AI has really amplified this. I wonder too if this has an impact on the quality of output as the hallucination rates have seemed to increase, not decrease over time.
While training on a large quantity of data would help the algorithm give the best answers, and could potentially reduce bias by including underrepresented sources and images, it comes at real harm to the people who created those works. I have seen that play out in multiple spaces I am in online, where small authors, YouTube creators, and artists worry about their work being used in ways that could directly compete with their ability to continue supporting themselves through their creations.
I recently read an article about how many of the Chinese models are open source, while there are almost no open-source US models. I wonder if this is part of the larger issue, where everything is being done with little transparency by individual groups driven by profit-maximizing factors, rather than approached as a bigger discussion about how we can fairly compensate people, produce better-quality outputs, and stay on the cutting edge of this emerging technology.
I worry about how the training might be manipulated by people. Could people start producing lots of publicly available documents with incorrect information on purpose to adjust what AI sees and then gives back to us?
I did not see any problems with the way AI systems are trained.
One thing that stood out for me was that AI is consuming data faster than it is being produced, and it is predicted that companies will run out of data by 2026. I did a bit more research and found that date has now shifted to 2028, but that’s still disturbing. One possible solution (which some companies are already using) is that AI will produce “synthetic” data and use that data to train itself. Researchers are divided on whether this will work. I’m doubtful.
My whole life I have been learning about biases and discrimination. We are very concerned with what is fair or not from a very early age and for good reasons. These issues were a problem before GenAI. We all potentially have biases and may discriminate inappropriately because of them. I try to identify those biases in things I see or read or in actions people take. We should continue those efforts in all endeavors, including with GenAI.
Gen AI is a mere reflection of the ideas people already hold. This includes biased ideas and is why critical engagement is necessary for those who choose to use it. This alone proves why the work we do in the humanities and social sciences is so important because no matter if we’re talking about ideas in more traditional mediums or AI, human assessment is still necessary.
The way ChatGPT works, using the "next most likely word when generating a response, one word at a time," shows that an AI response is really pattern prediction, choosing the most likely word to follow, rather than understanding of the text. It comes down to what the AI models are trained on to ensure the best possible answers can be deduced. The amount of accurate data needed to outweigh data that may be fictional or opinion-based is huge. I especially see an issue with older texts and historic narratives (tales are written by the ruler) being able to skew answers. "AI can only learn from what it's been given."
I can attest that there is bias against non-native speakers in all kinds of voice-controlled devices and software. Simple things, like talking to a navigation app do not work for me as well as for my husband.
AI entrepreneurs argue that scale is the most important thing and that long-form writing is the best way to train GenAI, but they don’t want to pay authors for their work. They are happy to pour millions of dollars into data centers, but they are not willing to appropriately value human creative labor. I think this is an important facet of GenAI that requires our ethical attention.
As a statistician, one of the things I worry about is the representativeness of the data that I am using to develop my models. This isn’t a new question. For the GenAI models, however, the amount of data they use takes the issue to a new scale. When you are considering huge portions of the internet, it is difficult to assess all (or even a fraction) of the dimensions along which a data set might be unrepresentative.
One of the things that strikes me is the vagueness of the OpenAI guidelines. They’re blocking certain content from being used for training, but what happens when that changes? It seems that the companies running these services could flip a switch at any moment to include or exclude information they deem suitable (or unsuitable).
I thought a striking thing in the ChatGPT notes was how vague they were about what constituted the "public" vs. "private" data it pulled from. The Vox article and NYT article both confirmed why: the large companies are drawing on a lot of data that a conventional user might not consider freely in the public sphere. These conversations remind me a lot of how the law in general is struggling to redefine "public" for an online age. Madison and the Founders were terrified of someone reading the letters locked in their desks; the possibility of Google Docs being scraped to train models should be taken just as seriously.
I was also struck by how open the AI companies themselves were with how little “good” data there is left to scrape. Where will the next database come from, as we push for more and more noticeable improvements? Once we tell all the truths, are there only lies left?
The biggest issues with training AI models are where the input data comes from, how thorough it is, and the underlying weights given to responses. The input data could be incorrect or not thorough enough to produce a proper output; in other words, garbage in, garbage out. One example: if someone wanted to create a facial recognition AI model that detects faces to determine a person's age and gender, but trained it only on images of 40-year-old men from one city, it might be good at the one thing it was trained on, but anything outside that would most likely be wrong or return no result.
AI bias reflects the same issue internet search tools experienced in the 1990s: the results come from the data fed to the algorithms. When the majority of the data was produced by a white, Euro-centric society, those societal biases appear in the output. I disagree with Sam Altman's approach, that "systems will eventually fix themselves." Why not bring a cultural perspective into development by consciously selecting training data that reflects the diversity of the human world? That's intelligence.
Knowing how AI was created and how it trains is important for people to know. Even as AI improves on accuracy and bias, we always need to remember that humans are behind it; it is not perfect; it is not all-knowing. If there is very little information in the world on a subject, AI will not be as helpful or will simply make something up. As higher education has become less popular among some, I find the teaching and research done at our institutions all the more important in the age of AI and misinformation.
AI models are trained on massive datasets of text, images, and other media, which means their outputs reflect the strengths and weaknesses of that training data. On the positive side, this allows them to generate responses that are broad, flexible, and often highly creative. But it also means outputs can carry biases, gaps in knowledge, or inaccuracies inherited from the data. The tools may also ingest private personal information. While ChatGPT may use our personal data lawfully, we would still prefer some information not to become public.