Introduction
In this lesson, you will explore accessibility in AI use, use AI to generate accessible content, and reflect on how well AI accomplished the task.
Read
Read “Inclusivity in Generative AI Should Be an Attribute, Not an Add-On” (~10 minutes).
Activity
Visually impaired individuals rely on alt-text to experience digital images. Alt-text is a short (usually 1-2 sentences) text-based description of an image that communicates the most important aspects of that image. Here’s an example of an image with corresponding alt-text:
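On a web page, alt-text is supplied through the `img` element's `alt` attribute, which screen readers announce in place of the image. A minimal sketch (the filename and description here are hypothetical, not the example image from this lesson):

```html
<!-- The alt attribute carries the alt-text; screen readers read it aloud
     in place of the image. Filename and description are hypothetical. -->
<img src="wren-building.jpg"
     alt="Students walk along a brick path toward the historic Wren Building on a sunny day.">
```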

Imagine that you are designing a web site for new employee orientation. Log in to Microsoft Copilot (https://copilot.microsoft.com) using your W&M credentials. Find an image that you think might be useful for this orientation. Upload an image and prompt Copilot to generate alternative text for the photo.
Example prompt: “You are designing an accessible web site for new employees. Generate 1-2 sentences of alternative text for the uploaded image. Highlight the most important aspects of the image.”
Discuss
Does the AI-generated alt-text accurately capture the image and its elements? Is anything missing that should be highlighted?
30 replies on “thing 7: Whose voices? Accessibility in AI”
The alternative text generated was very basic. The group of people in the image I uploaded were actually walking away from Wren, but the alt-text generated said they were walking toward Wren – which was interesting. There was also limited reference to, and detail about, the Wren Building itself, noting only that it is historic.
I uploaded an image of the reception area to the Bee McLeod Recreation Center with the example prompt and it perfectly told me what was in the image. It highlighted the high ceilings, information on the walls, and the images in the floor emblem. However, I needed to give it more context because it did not know the significance or purpose of the building. The image is indexed on Google Maps so I am surprised it didn’t recognize the location.
I tried to create a PDF file including a photo and some facts about Jones Hall (where I worked from 1990 until 2016). Copilot found a nice photo and wrote an acceptable description. However, the PDF file it generated at my prompt was corrupted. Maybe it was my fault…
I uploaded the Year of the Environment emblem (the answer: 🌿 Decorative emblem with a crown at its center surrounded by the words “YEAR OF THE ENVIRONMENT” and the year “2025,” flanked by nature-themed illustrations such as leaves, fish, and acorns. The design emphasizes ecological awareness and environmental celebration.) I think Copilot did a really good job explaining the picture in this case. However, I believe added context might be helpful even here.
I uploaded an image of a dog with a hand hovering over it — this is what I initially got “A small dog with a white body and brown ears lies down calmly, while a hand hovers above its head, suggesting an affectionate gesture” but the dog seemed scared to me. I entered that and got this revised version “A small dog with a white body and brown ears lies down tensely, appearing scared, while a hand hovers above its head as if about to pet it”. Can this feature be automated to fix my “strive for 85” issue?
I found a picture of students moving in to their new residence hall on move-in day. Copilot analyzed the photo, blurred the faces for privacy and generated this response: “It looks like this image captures a college or university move-in day. A group of people—likely students and volunteers—are gathered outside a large building, possibly a dormitory. The white cart in the foreground is probably being used to transport belongings, and the blue shirts might indicate volunteers or staff helping with the move-in process.” Lastly, it provided multiple suggestions for captions for the photo.
I uploaded a photo of Wren with folks walking nearby. This is the text Copilot created: “A historic brick building with arched doorways and windows, framed by trees with overhanging branches. Several people walk along a pathway leading to the building under a partly cloudy sky with patches of blue.” A very accurate description of the photo.
I uploaded a photo of a piece of artwork from W&M’s Brafferton exhibit: https://libraries.wm.edu/exhibits/remembering-william-marys-brafferton-indian-school-1723-2023
The piece shows colonial persons with a variety of skin tones standing in front of the Brafferton building. Copilot suggested: “A quilted artwork depicting a scene with a large brick house in the center, flanked by two trees. In front of the house, several children dressed in historical clothing are playing and interacting on a grassy area. The sky is light blue with white clouds, and the quilt has a decorative border featuring a pattern of interlocking shapes.” This is one of the things that humans will do better than AI because we know what context we want to include for an image.
AI hallucination showed up in my results. It was a picture of students serving as the welcoming committee at orientation. Some of them are wearing foam fingers. Here’s what Copilot wrote: “A group of people wearing neon yellow-green shirts and pink headbands enthusiastically hold up green foam fingers that say ‘ROCK THE BLOCK’ in yellow text. The scene takes place outdoors with trees and a brick wall in the background. Faces are pixelated for privacy.” The fingers actually read “Botetourt.” So much was correct, but it did have an error.
When I changed the building to the Reves Center, it was unable to provide a picture, but it did tell me what the Reves Center is and does. When I clicked on events, it told me the events associated with Reves.
I uploaded a photo of a business card and used the example prompt. It was interesting to see what Copilot thought was the most important aspects of the photo. It got the name of the person and the company. However, it really emphasized the fact that the company logo was displayed prominently. I think I would have to tweak the prompt a little bit because that’s not what I’d have thought were the most important parts of the card. Honestly, the first sentence was probably all I needed.
“The image displays a business card for [NAME], Director of Corporate Civic Responsibility at [COMPANY], featuring key contact details including his email, phone number, and office address. The [COMPANY] logo is prominently placed, emphasizing the company’s branding.”
I uploaded a picture of a group of people in front of the Sadler Center. It gave me a generic but descriptive alt text at first, and then I provided the building name and campus. It then gave me: “A group of new employees walks along a brick pathway on the William & Mary campus, led by a guide pointing toward the Sadler Center. The scene features the historic brick architecture of the Sadler Center, surrounded by greenery and set against a clear sky, emphasizing orientation and welcome.” I think it needed a clearer prompt than the suggested one, plus it just made up that the group was new employees when it was actually a tour group of parents and prospective students.
I uploaded an image of the Muscarelle Art Museum related to its reopening. It took me three prompts to get to this text, and if I were going to use it for an actual website, I would probably stop here and edit the rest myself: “A modern art museum with large glass windows and a colorful facade is located on the employee campus, surrounded by accessible walking paths, open green spaces, and seating areas. The environment is inclusive and welcoming, with curb cuts, wide pathways, and clear signage visible, supporting mobility and orientation for all visitors under a sunny blue sky.” I do appreciate that when I brought up accessibility it highlighted the accessibility features present in the image – I had not thought of that, but I imagine that may be helpful to those using alt text.
I put in a photo of Lord Botetourt’s statue in front of the Wren. I asked it to write 1-2 sentences for W&M’s new employee page and it gave me this:
“Here’s a concise alt text option for the image:
‘Statue of a historical figure on a carved pedestal, set outdoors with trees and a brick building in the background.’
Would you like a version that includes the statue’s identity or location, if known?”
When I replied with a yes, this is what I got:
“Thanks! Here’s an updated alt text for the image, tailored for the W&M new employee webpage:
‘Statue of Lord Botetourt on the campus of William & Mary in Williamsburg, Virginia, with trees and a brick academic building in the background.’
Would you like a shorter version or one that emphasizes the welcoming or historical aspect for new employees?”
So with just a minor tweak it definitely improved it, but I had a vague prompt and it gave me a vague answer, which was no surprise.
–
Side note: the article was good, but it left out a big aspect of the possibilities of this new tech – it will be cost prohibitive for many people. I think if we are going to have a conversation about how it can help people, we also need to be open and realistic about how this can, and likely will, increase inequity between those with and without the financial ability to afford the assistive devices. For example, low-income schools where teachers barely survive on their salaries won’t likely be buying any gen-AI robots – unless this is a federally funded program, which I wouldn’t put my money on happening.
I uploaded two pictures of the Crim Dell bridge. AI fell short in accurately describing it. In one, it thought it was in a park-like setting, and in the other, it thought it was over a river. One text generated was “A serene winter landscape with a red wooden bridge arching over a still river, framed by snow-covered trees and frosted branches. The peaceful setting highlights the quiet beauty of nature in winter.” The other was, “A snow-covered red bridge with decorative posts and black metal balusters in a park-like setting, surrounded by snow-laden trees and bushes, with a building partially visible in the background.” It completely missed that it was on William & Mary’s college campus. But then, how would it know? Perhaps I should have mentioned that in my prompt.
The alternative text generated from an uploaded photo of fireworks over the Wren Building was adequate to describe the scene. But it does so on such a generic level that it misses making a connection to its audience. The generated text doesn’t capture the essence of the Wren Building or its significance to W&M and beyond. It kind of falls flat.
I uploaded a schematic for employee orientation found on the web. The alt text was concise and accurate. Perplexity offers a screenshot upload, which is likely to make generating alt text easier for parts of my existing PowerPoint slides.
I uploaded an image from the women’s first round NCAA tournament game which featured President Rowe and several other leaders from W&M in the stands. It was easy and quick to upload the picture and generate the caption. While it knew the score of the game and that there were VIPs present, it needed more context to capture which college teams were playing in the game and who the VIPs were.
I liked that Copilot offered to format a WCAG compliant version. When I asked it to do that, it then explained the requirements that it was meeting with its result:
To meet WCAG (Web Content Accessibility Guidelines) standards, alternative text should be:
Concise (typically under 125 characters)
Descriptive of essential content
Context-aware (relevant to the page’s purpose)
Here’s a WCAG-compliant version of the alt text for the image of Blair Hall Foyer:
“Blair Hall Foyer with checkered floor, green walls, wood paneling, and ornate lighting.”
Asking Gemini to do the same: “The Blair Hall foyer features a black and white checkered floor, with dark wood paneling and large columns framing the room. A green wall and a pair of chandeliers complete the elegant and historic look.”
Not bad. Copilot also offered to create a long description, which is helpful for more complicated images.
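The WCAG criteria quoted above lend themselves to a quick automated sanity check. A minimal sketch in Python (a hypothetical helper, not an official WCAG validator; the 125-character limit is the rule of thumb mentioned above):

```python
# Hypothetical helper: flag common problems in a candidate alt-text string,
# based on the rules of thumb quoted above. Not an official WCAG validator.

def check_alt_text(alt: str, max_chars: int = 125) -> list[str]:
    """Return a list of warnings; an empty list means no obvious issues."""
    warnings = []
    stripped = alt.strip()
    if not stripped:
        warnings.append("alt text is empty")
    elif len(stripped) > max_chars:
        warnings.append(
            f"alt text is {len(stripped)} characters; aim for under {max_chars}"
        )
    if stripped.lower().startswith(("image of", "picture of", "photo of")):
        warnings.append("drop redundant openers like 'image of'")
    return warnings

# The Copilot-generated alt text for Blair Hall passes the length check:
print(check_alt_text(
    "Blair Hall Foyer with checkered floor, green walls, "
    "wood paneling, and ornate lighting."
))  # → []
```

A check like this only catches mechanical issues; whether the description captures the right context is still a human judgment, as many of the comments here point out.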
The description was generic. While providing a good starting point, it would be useful for the user to go back and edit the AI-generated description. This is the case for all AI-generated material though.
I asked it to summarize an image of the William & Mary women’s basketball team cheering from the sidelines. It recognized our university’s name, it correctly counted the number of people on the team, and it also correctly read the jersey numbers. It was very accurate and thorough.
I also chose a photo of the Wren. Copilot’s response was: “The Wren Building at William & Mary, a historic colonial-style brick structure with arched entrances and a central clock tower, stands as one of the oldest college buildings in the U.S.” Context feels appropriate, as we get a sense from that text that this building is distinct from other styles. What the AI does not notice is that it is a picture of the Wren in the rain, there is a weather vane on top, and most importantly – it’s not a clock tower!!! That is a circular window!
I uploaded a picture of Crim Dell bridge (well, actually the university’s Zoom background using the bridge). The response that was generated, “Alt text: A scenic wooden bridge with red supports and beige railings crosses a tranquil body of water, surrounded by dense green forest. The image includes the text “William & Mary Chartered 1693,” highlighting the historical context of the location.” It does describe the photo. The question for me is whether more context about the bridge is appropriate in the alt-text, or whether that should come from additional research.
I chose an image of a group of graduates in their regalia. Copilot said: “A group of graduates stands together outdoors, dressed in traditional black graduation gowns and caps. Some gowns feature green trim on the sleeves, while others have white collars or colored cords draped around their necks, indicating academic honors or distinctions. The background is softly blurred, revealing a natural setting with trees and greenery.”
I was surprised that it went into so much detail about the graduation gowns and setting. Some of the wording is a little misleading (the green trimmed sleeves and hood colors for instance), but overall I’m impressed.
The image I uploaded was five students gathered around the Griffin at a festive outdoor event. Copilot’s alt text was “Five people pose with a large bird mascot outdoors on a sunny day, smiling and making peace signs. Trees and colorful balloons in the background add to the festive and welcoming atmosphere.” It did a great job of capturing the basic elements in the image and the sense of the scene. It did not describe the individuals as people of color.
This was a good test in meaning-making. I uploaded a map for a project, and it was able to accurately describe the map, labels and even the color scale, but not identify which provinces were darker shaded or lighter shaded. For someone familiar with the geography, certain shapes are quickly interpretable and jump out, but that relies on a visual convention — an important reminder for the work I’m using this for to make it specific! (Also, I asked, “Is North Carolina a lot or a little on the map?” and Copilot got the answer totally backwards from what the map is intended to communicate.)
I liked the reading. Euphonia is interesting, I’m sure there are unintended consequences about better voice accents being able to make more empathetic appeals (for better and for worse), but it seems broadly well-intentioned.
And, to counter my cynicism, it also reminds me of the urban planning axiom that inclusive/universal design helps everyone, often in unexpected ways. A ramp primarily intended for wheelchair users also helps delivery folks with handcarts, parents with strollers, and people walking their bicycles. Needless to say, I’m also concerned about the economic inclusivity of these platforms if they’re being developed in Silicon Valley by some exceedingly affluent people, and lower classes are only included as training data…
I used a rendering of ISC 4. “A contemporary building with large glass windows and a brick exterior, surrounded by a landscaped outdoor area with seating. The ‘Goody Clancy’ logo appears in the bottom right corner, identifying the architecture firm.” It never identified the building, which might be because it was a rendering rather than an actual photograph. I don’t think it is worth using AI to help with alt text, since alt text is typically very short and a human can provide the context needed to describe the most important aspects of the image in the context it is being used on the webpage.
The description I got for the image (people attending a seminar) was very basic, and it incorrectly assumed this was an orientation (I guess the prompt was misleading). But I also wonder whether it would make sense to remove purely decorative images from accessible versions. They are often designed to be eye candy, not to convey any useful information, so for someone with impaired vision they are just noise. It’s like automatic closed captions trying to transcribe a background song and confusing someone only reading the captions.
I used a photo of the Wren Building. The text was basic, but it did a fairly good job of describing the look of the building. It mentioned that the people were walking toward the building, but they were walking away from it, so you would need to be sure to carefully check and edit.
I uploaded a photo of our workplace (the Hive at Swem Libraries), and the description I got was very basic: “well-lit basement space in a bee theme that everyone loves”. I was expecting more fanfare, to be honest; the description isn’t exactly accurate, and it actually hallucinated a bit in the bee description.