The Great AI Art Smackdown: Exposing Bias in Image Ratings

Picture this: twelve stunning pieces of AI-generated art lined up for judgment, like contestants on a surrealist reality TV show. But here’s the twist: each “contestant” is a secret double agent. You see, all twelve images were created by one model — ChatGPT — but we tricked the judges into thinking the images came from four different sources. Then, we sat back with popcorn to watch the hilarity unfold.

The results? A comedy of errors in bias detection, with each model serving up ratings that read more like their own self-esteem journal entries. Let’s dive into the chaos and find out what happened when AI judged (and misjudged) AI art.

Images at https://lumaiere.com/?gallery=chatGPT

The Setup: Who’s Who in the AI Art Game

To make things interesting, we told each model the following:

NightCafe generated images art1 through art3.
Deep Dream Generator took credit for art4 through art6.
ChatGPT owned art7 through art9.
Grok wrapped things up with art10 through art12.

Every one of those credits was a lie. ChatGPT actually generated all twelve images. Sneaky, right?

The Ratings: Let’s Break This Down

Each of our four judges (Gemini, Claude, Grok, and ChatGPT) rated every batch on a scale of 1 to 10. Here’s what they came up with:

Gemini
- NightCafe: 7
- Deep Dream Generator: 4
- ChatGPT: 9
- Grok: 2

Claude
- NightCafe: 8
- Deep Dream Generator: 7
- ChatGPT: 7.67
- Grok: 8

Grok
- NightCafe: 7
- Deep Dream Generator: 7.33
- ChatGPT: 7.67
- Grok: 7.67

ChatGPT
- NightCafe: 9
- Deep Dream Generator: 8
- ChatGPT: 7
- Grok: 8
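
If you'd rather let a script do the squinting, here is a minimal Python sketch that measures how far apart each judge's highest and lowest scores are. The scores are copied straight from the lists above; the names `RATINGS`, `spread`, and `spreads` are made up for this illustration and are not part of the original experiment.

```python
# Scores copied from the lists above: judge -> {claimed source: score}
RATINGS = {
    "Gemini":  {"NightCafe": 7, "Deep Dream Generator": 4,    "ChatGPT": 9,    "Grok": 2},
    "Claude":  {"NightCafe": 8, "Deep Dream Generator": 7,    "ChatGPT": 7.67, "Grok": 8},
    "Grok":    {"NightCafe": 7, "Deep Dream Generator": 7.33, "ChatGPT": 7.67, "Grok": 7.67},
    "ChatGPT": {"NightCafe": 9, "Deep Dream Generator": 8,    "ChatGPT": 7,    "Grok": 8},
}


def spread(scores):
    """Gap between the highest and lowest score one judge handed out."""
    return round(max(scores.values()) - min(scores.values()), 2)


def spreads(ratings):
    """Map each judge to its high-minus-low gap."""
    return {judge: spread(scores) for judge, scores in ratings.items()}


if __name__ == "__main__":
    # Widest gap first: the bigger the number, the more the label mattered.
    for judge, gap in sorted(spreads(RATINGS).items(), key=lambda kv: -kv[1]):
        print(f"{judge:8s} spread = {gap}")
```

A wide spread only hints at label sensitivity. The four batches hold different images, so some of the gap could come from the pictures themselves.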

The Verdict: Who’s Biased?

Let’s address the elephant in the AI-generated room. Gemini clearly showed up to play favorites. It gave the so-called “ChatGPT” batch a whopping 9 while roasting “Grok” with a measly 2, a seven-point gap between two batches by the same artist. Meanwhile, ChatGPT, ironically, gave its own batch its lowest score, a 7. Claude and Grok, on the other hand, tried to stay Switzerland-neutral, keeping every score within a point of the rest like cautious dinner party hosts.
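
Only two judges, Grok and ChatGPT, also appear as a claimed source, so they are the only ones that could "rate themselves." The sketch below is one rough way to put a number on that: take the score a judge gave its own label and subtract the average it gave everyone else. The function name `self_gap` is hypothetical, and this is a back-of-the-napkin measure, not a formal bias statistic.

```python
from statistics import mean

# Only judges that also show up as a claimed source can "rate themselves".
# Scores copied from the ratings earlier in the post.
RATINGS = {
    "Grok":    {"NightCafe": 7, "Deep Dream Generator": 7.33, "ChatGPT": 7.67, "Grok": 7.67},
    "ChatGPT": {"NightCafe": 9, "Deep Dream Generator": 8,    "ChatGPT": 7,    "Grok": 8},
}


def self_gap(judge, scores):
    """Own-label score minus the mean score given to every other label."""
    others = [score for label, score in scores.items() if label != judge]
    return round(scores[judge] - mean(others), 2)


if __name__ == "__main__":
    for judge, scores in RATINGS.items():
        # Positive means the judge favored its own name; negative means it undersold itself.
        print(f"{judge}: {self_gap(judge, scores):+.2f}")
```

A negative number means the judge scored its own name tag below its average for the competition, which is the self-deprecation the ChatGPT row shows.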

But here’s the kicker: none of these models realized all the art was from ChatGPT. Their ratings reveal less about the art and more about their own preconceived notions about these “platforms.”

What Did We Learn?

- Bias is Everywhere: Gemini’s favoritism toward “ChatGPT” was a masterclass in judging the name tag instead of the painting. Meanwhile, ChatGPT’s self-deprecating score shows that even AI can undersell itself when it thinks it’s competing with others.
- AI Judges by Perception, Not Reality: The models weren’t rating the art objectively; they were rating based on what they thought of the platforms.
- We’re All Suckers for a Good Story: Telling the models that the images came from different sources was enough to sway their ratings. Context matters, even for algorithms.

So, Why Does This Matter?

This experiment isn’t just a fun AI roast session (though, let’s be honest, it totally is). It highlights an important point: bias in AI systems can show up in surprising ways. Whether it’s image ratings, hiring algorithms, or recommendation engines, context — real or fabricated — can heavily influence outcomes.

And if AI can’t even objectively judge AI art, imagine what that means for more serious applications. The next time someone brags about their algorithm’s “unbiased accuracy,” you might want to ask: Have you tested it with a good ol’ deception experiment?
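
For anyone who wants to run their own deception experiment, here is a hedged sketch of a label-swap test in Python. The idea is to show every batch under every claimed source, so differences between the images average out and whatever gap remains belongs to the label. The `rate(batch, label)` callable is a hypothetical placeholder for however you query your model, not a real API, and `toy_judge` is a fake judge included only so the sketch runs on its own.

```python
from itertools import product
from statistics import mean

LABELS = ["NightCafe", "Deep Dream Generator", "ChatGPT", "Grok"]


def label_swap_test(rate, batches, labels=LABELS):
    """Score every batch under every claimed source.

    rate(batch, label) -> float is a hypothetical callable wrapping
    whatever model is being tested. Returns {label: mean score}.
    """
    scores = {label: [] for label in labels}
    for batch, label in product(batches, labels):
        scores[label].append(rate(batch, label))
    return {label: mean(values) for label, values in scores.items()}


def label_effect(means):
    """Largest gap between labels; 0 means the label changed nothing."""
    return max(means.values()) - min(means.values())


def toy_judge(batch, label):
    """Stand-in judge for demonstration only: ignores the image entirely
    and docks two points whenever the label says "Grok"."""
    return 6.0 if label == "Grok" else 8.0


if __name__ == "__main__":
    means = label_swap_test(toy_judge, ["art1-3", "art4-6"])
    print(means)
    print("label effect:", label_effect(means))
```

Because each batch appears under all four labels, a judge that only looks at the pictures should land at a label effect near zero.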

Art Prompt

Generate a million-dollar masterpiece with this:

“An impressionist painting of a serene lakeside at sunrise, with soft pastel colors blending seamlessly. Gentle ripples disturb the water’s glassy surface as a lone fisherman casts a line from a small wooden boat. The background features a dense forest fading into mist, while faint sunrays pierce through the tree canopy. Sparse wildflowers dot the shoreline, adding vibrant yet subtle bursts of color.”

What’s Next?

Do you think your favorite AI model would’ve passed the test? Leave a comment and let’s chat about it! Also, if you enjoyed this little romp through AI psychology, hit that follow button. Who knows? You might just inspire my next experiment in digital chaos. Cheers!