[OpenAI] - Data Contamination of Closed-Source LLMs
The Synopsis:
OpenAI has released Large Language Models (LLMs), such as GPT-3.5 and GPT-4, to facilitate research and content creation. However, OpenAI's LLMs have been subject to data contamination, a phenomenon in which an LLM is evaluated on data it was trained on. Researchers from the Institute of Formal and Applied Linguistics at Charles University's Faculty of Mathematics and Physics reviewed 255 papers and estimated that roughly 42% of them leaked data to the two models, amounting to about 4.7 million samples from 263 datasets; the researchers analyzed the papers in which data contamination occurred and provided evaluations, recommendations, and a public project to which other researchers can contribute their findings.1
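To make the definition concrete, here is a minimal sketch in Python of what "evaluated on data the model was trained on" means: if a test item shares long word sequences with a training document, a score on that item measures memorization rather than ability. The toy strings and the 8-gram threshold are illustrative assumptions of mine, not part of the cited study.

# Minimal sketch: detect overlap between a (hypothetical) training document
# and a benchmark test item via shared word 8-grams. Toy data only.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

train_doc = "the treaty was signed in 1648 ending the thirty years war in europe"
test_item = "the treaty was signed in 1648 ending the thirty years war"

overlap = ngrams(train_doc) & ngrams(test_item)
print(f"shared 8-grams: {len(overlap)}")  # > 0 means the test item overlaps the training data

With closed-source models the training corpus is not available, which is exactly why such a check cannot be run directly and contamination has to be inferred by other means.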
The Public Commentary:
Timnit Gebru, founder and director of the Distributed AI Research Institute, and Simone Balloccu, an author of the paper, lambasted companies that are not transparent about their models, alluding to the dishonesty of OpenAI's LLMs and others like them2: "In what world is it acceptable to have a product whose behavior is not reproducible at all?"
The Analysis:
OpenAI's LLMs are proprietary and closed source; their model weights, training data, and carbon footprint are hidden from the public. If researchers cannot access these details, then the models' efficacy is inconclusive and unverifiable. Additionally, the ChatGPT models have been suspected of data contamination. Consider an example: a research project studies different eye colors and their prevalence in a population. The LLM is prompted to list different shades of brown eyes, yet it keeps outputting the same shade, even though brown eyes come in many shades, such as Honey, Cognac, Chestnut, Russet, and Chocolate3. If the model outputs only Honey as the 'standard' brown-eye color, the research is biased toward Honey-colored eyes and is inconclusive, because Honey is not representative of the entire spectrum of brown eyes.
Data contamination in LLMs is concerning, and the researchers propose protocols for evaluating closed-source foundation models (a minimal probe along these lines is sketched after the list):
- Access the model in a way that does not leak data [stay up to date on the LLM vendor's data-use policy].
- Interpret performance with caution [because of possible data contamination, do not assume the foundation model is working perfectly].
- Avoid using closed-source models where possible [otherwise the LLM vendor's interests, not the people who use it, take priority].
- Adopt a fair and objective comparison [when comparing closed and open LLMs, use the same datasets and approaches].
- Make the evaluation reproducible [ensure that the results can be reproduced by others].
- Report indirect data leaking [report any data contamination that occurs].
These protocols are essential to ensuring that foundation models are human-centered and beneficial in both the short and long term. We have to be our own inquirers and investigators, studying companies' products before using them.
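As one way to "interpret performance with caution", here is a hedged, minimal sketch of a memorization probe for a closed-source model. It is not the authors' method; query_model is a placeholder for whatever API is being evaluated, and the prefix ratio is an arbitrary assumption. The idea is to send only a truncated prefix of a benchmark sample and check whether the model reproduces the held-back continuation almost verbatim, which would hint at contamination, without ever sending the full labeled example (so the probe itself does not leak the test set).

# Sketch of a memorization probe; `query_model` is a stand-in, NOT a real API call.
from difflib import SequenceMatcher

def query_model(prompt: str) -> str:
    """Placeholder: replace with a call to the model under evaluation."""
    return "..."

def memorization_score(sample: str, prefix_ratio: float = 0.5) -> float:
    """Similarity between the model's continuation and the held-back half of a sample."""
    cut = int(len(sample) * prefix_ratio)
    prefix, held_back = sample[:cut], sample[cut:]
    continuation = query_model(prefix)[: len(held_back)]
    return SequenceMatcher(None, continuation, held_back).ratio()

# Scores near 1.0 across many samples would suggest the benchmark leaked into training data.
benchmark = ["The quick brown fox jumps over the lazy dog."]
for text in benchmark:
    print(round(memorization_score(text), 2))

Keeping the probe's prompts, model version, and date in the write-up also serves the reproducibility protocol above.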
Pertaining to the Public Commentary, Timnit Gebru's indignation is justifiable because A.I. tools are being used in public and private sectors in ways that affect lives. If a company releases a product that is troublesome to analyze, then it should not be on the market. Gebru has stated that these companies regard their LLMs as a 'digital god' to evade responsibility. When companies misplace accountability, as this statement alludes, the trustworthiness of the product erodes; an LLM is not responsible for its actions, since it is merely the expression and physical manifestation of its creator, the developer. Responsible A.I. is imperative for the longevity of these tools because our lives literally depend on the benevolent creation and implementation of A.I. tools; the tools must abide by A.I. ethics guidelines for a sustainable and human-focused algorithmic society.
The Questions:
Q1: Do you believe there are advantages to closed-source systems, despite their nebulous structure?
Q2: Have you used closed-source and open-source LLMs? What challenges did you face in using one or the other?
Q3: If you could suggest one or more implementations for OpenAI's closed-source LLMs, what would they be?
Q4: What are some strategies to mitigate data contamination in LLMs?
The Terminology:
Data Contamination - Data contamination occurs when a model's training data includes information that should only be present in the test set; the model then regurgitates information it has seen during training rather than processing new information.4
Data Leakage - Data leakage happens when a user inadvertently improves or updates a model's training set through prompting, for example by sending evaluation data to a closed-source model whose provider may train on user interactions.1
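As a minimal illustration of how to avoid such leakage, the sketch below checks whether a prompt contains a labeled test example verbatim before it is sent to a closed-source chat interface. The TEST_SET contents and the safe_to_send helper are hypothetical, not part of any real evaluation library.

# Sketch of a guard against indirect data leakage: refuse to send prompts
# that expose a labeled test-set sample to the vendor. Toy data only.

TEST_SET = {
    "What is the capital of France? Answer: Paris",
}

def safe_to_send(prompt: str) -> bool:
    """Return False if the prompt would expose a test-set sample verbatim."""
    return not any(sample in prompt for sample in TEST_SET)

prompt = "Please solve this item: What is the capital of France? Answer: Paris"
print(safe_to_send(prompt))  # False -- sending this would leak the labeled example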
The Endnotes:
1 Simone Balloccu et al., "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs", arXiv, accessed Feb 22, 2024,
https://arxiv.org/pdf/2402.03927.pdf
2 Timnit Gebru, "In what world is it acceptable to have a product whose behavior is not reproducible at all?", X, accessed Feb 22, 2024,
https://twitter.com/timnitGebru/status/1756531975522070856
3 Autumn Sprabary, "Brown Eyes: Facts About the Most Common Eye Color", Eyeglasses News, Advice, and Tips, accessed Feb 22, 2024,
https://www.eyebuydirect.com/blog/brown-eyes-facts-about-the-most-common-eye-color
4 Raghunadha Kotha, "Large Language Models and the Challenge of Data Contamination", LinkedIn, accessed Feb 22, 2024.