[GOOGLE] - A Stolen Imag(EN)ation
The Synopsis:
Google is being sued for using copyrighted material to train its text-to-image model, Imagen. Imagen was trained on LAION-400M, a dataset of 400 million images assembled by the Large-scale Artificial Intelligence Open Network (LAION), which allegedly includes the plaintiffs’ copyrighted work. The plaintiffs, Sarah Andersen, Hope Larson, Jessica Fink, and Jingna Zhang, are artists whose works have been used to train Imagen, and they have received no consent, credit, or compensation for them.1
The Public Commentary:
On WCCFTECH, there is commentary about the lawsuit:
User luigi commented, “This would be saying that it is now illegal to look at pictures.”
User 94Ecl1pseGST commented, “AI training is really no different than artists being inspired by others art. It does not store a copy of the image if it did models would be 100s of gbs.”
User Sate1122 commented, “…Personally I don’t see what the alarmism is. It’s just a fun tool for the general user to play around with.”
The Analysis:
I am not surprised that Google has trained on copyrighted content without consent, compensation, or credit. On July 11, 2023, a class-action lawsuit was filed against Google for scraping copyrighted material to train its Large Language Model (LLM), Bard (now called Gemini). I am uncertain whether companies like Google grasp the implications of training their models on publicly posted copyrighted data; to someone unaware of the ethical stakes, Google’s training may appear innocuous, since it merely scrapes publicly available information and produces answers. However, as the plaintiffs’ complaint notes, much of that information is intended for education and research, not commercial endeavors. Google did not obtain the plaintiffs’ permission to use their work to train models for commercial products, and those models may attempt to replicate their art, eroding the creative-content industry.
Pertaining to the Public Commentary, I suspect these users have not created content themselves; otherwise, they would not scoff at the idea of creators’ work being used without consent. Mocking content creators’ pleas for consent, credit, and compensation for their marvelous work is inconsiderate and inhumane; our ability, as humans, to respond to wrongdoing with empathy and compassion is the key to our humanity. If we lose those qualities, we undermine our capacity to ever build and sustain a civilization. I have written a blog post, Algorithmic Justice League - Fight for the Right to Write, that addresses the same plight facing writers; we need to keep pressing to hold companies accountable for their actions.
If you have been negatively impacted by AI systems, you can share your story at https://report.ajl.org.
The Terminology:
Infiniset - An amalgamation of various internet content curated to improve a model’s conversational abilities. Infiniset was used to train Bard (the LLM’s previous name), now Gemini (its current name).2
C4 - The Colossal Clean Crawled Corpus, a dataset created by Google in 2020 and used to train Bard/Gemini.2
Common Crawl Dataset - An open repository of web pages and websites collected for over 12 years; it is maintained by a non-profit for research purposes, not commercial use.2
The Endnotes:
1 Blake Brittain, “Google sued by US artists over AI image generator”, Reuters, accessed May 3, 2024,
2 Gianluca Campus, “Generative AI: the US class action against Google Bard (and other AI Tools) for web scraping”, Kluwer Copyright Blog, accessed May 3, 2024,