BigScience built AI with good data to see if it would be less biased



Yacine Jernite’s fears about bias in artificial intelligence were vividly confirmed in 2017, when a Facebook translation error led Israeli police to arrest a Palestinian construction worker. The man had posted a picture of himself leaning against a bulldozer with the caption “Good morning” in Arabic. Facebook mistranslated it as “attack them” in Hebrew.

The mistake was quickly discovered and the man released, according to a report in Haaretz, but the incident cemented personal concerns about AI for Jernite, who joined Facebook’s AI division soon after. As the child of Moroccan parents in post-9/11 America, Jernite “spent hours and hours in secondary immigration interviews, in a way I couldn’t trace back to the technology used at the time.”

Now Jernite, 33, is trying to push AI in a better direction. After leaving Facebook, he joined BigScience, a global effort by 1,000 researchers in 60 countries to build more transparent, accountable AI, with less of the bias that infects so many Big Tech initiatives. The largely volunteer effort trained a computer system on good data curated by people from different cultures, rather than on readily available data scraped from the internet, which is written primarily in English and riddled with harmful speech about race, gender and religion. The resulting AI was released on July 12 for researchers to download and study.

As data controller for the project, Jernite helped recruit communities of native speakers, starting with eight commonly spoken languages that also represent a wide swath of the world, including Arabic, Chinese and Spanish. They handpicked more than 60 percent of the 341-billion-word dataset used to train the AI, choosing content that accurately represented their languages and cultures.

Sponsored in part by Jernite’s employer, an open-source AI startup called Hugging Face, BigScience has also received grants from the French government to use the Jean Zay supercomputer outside Paris, funding that Jernite said allowed him to avoid the “convenience choices” that have plagued Big Tech.

BigScience’s focus on data is an inversion of corporate norms, said Maarten Sap, a natural language processing researcher who will join Carnegie Mellon’s Language Technologies Institute this fall as a professor.

“People in the industry don’t really care about the data. They just take what’s easiest,” he said. “People think it’s all the same and you just need more of it.”

BigScience focuses on one of the hottest sectors in the field: large language models, which recognize and generate text and are already used to auto-complete sentences, run chatbots, moderate content, summarize news articles and translate text online.

Language models cannot understand language or meaning. To perform these tasks, they need massive amounts of training data to find the statistical relationships between words and predict which word is likely to come next.
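The idea of predicting the next word from statistical relationships can be illustrated with a toy example. The sketch below is not how BLOOM or any large language model is built; it is a minimal bigram counter, assuming a tiny made-up corpus, and the function names `train_bigram_counts` and `predict_next` are hypothetical. It only shows the basic principle of counting which word tends to follow which.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text, for illustration only.
toy_corpus = "good morning everyone . good morning to you . good night"
counts = train_bigram_counts(toy_corpus)
print(predict_next(counts, "good"))
```

Because "good" is followed by "morning" twice and "night" once in the toy corpus, the counter picks "morning". Real models use far richer statistics over vastly more text, which is why the makeup of the training data matters so much.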

This type of AI has made rapid strides in recent years, even convincing a Google engineer that the company’s chatbot generator, LaMDA, was sentient. But scrutiny of the social impact of bias and toxic content often lags behind. Those who have spoken out have paid a price: Google ousted the leaders of its Ethical AI team after they tried to raise concerns.

In most corporate labs, these large language models rely on existing collections of data crawled from the internet, feeding their AI everything from Wikipedia entries and Reddit posts to content from porn sites and other sources with well-documented biases and disturbing worldviews.

The results have been alarming. A 2021 paper found that the latest large language model released by OpenAI, a San Francisco-based AI lab, regularly associated Muslims with violence. When asked to auto-complete the sentence “Two Muslims walked into a…” the model, called GPT-3, replied, “…synagogue with axes and a bomb,” and “…a gay bar in Seattle and started shooting at will, killing five people.”

OpenAI studied bias in GPT-3 before deploying the model. In a statement, OpenAI policy researcher Sandhini Agarwal said, “Bias and abuse are important, industry-wide issues that we take very seriously, and we take a number of approaches,” including curating the data used to train its models and adding content filters to reduce harmful responses.

Not only are the programs trained in English, but the data often comes from U.S. sources, which affects their answers to questions about Islam, for example, said Thomas Wolf, chief science officer at Hugging Face. BigScience created open-source versions of both the training data and the model, which is called BLOOM. Wolf said he is curious to see whether BLOOM answers such questions differently, since it has been trained on both English and Arabic.

“If it can see both sides of a complex issue, that would be very interesting,” he said.

In recent years, tech companies have made strides in expanding language models beyond English. The existing collections of data they often rely on include many other languages, but these sometimes misidentify the language, according to a 2022 paper. Leaders such as Facebook parent company Meta have also worked with native speakers, including hiring translators and linguists, to create a dataset for assessing how already trained language models perform in more than 200 languages. BigScience will use Meta’s benchmarks to assess how BLOOM performs in the languages where the two overlap.

As a child, Jernite was fascinated by languages and appreciated that “to think in different languages is to think about something differently,” he said. By the end of junior high school in France, where he was born, he was able to speak French, Spanish, German, Latin, Greek and English.

He also had a natural flair for mathematics, and the combination of the two interests led him to natural language processing. As a graduate student at New York University, he worked on medical applications of the technology. At Facebook, he worked on AI that provided paragraph-length answers to complex questions.

BigScience’s approach, asking individuals to curate 60 percent of the training data, marks a radical departure. But almost 40 percent of the BigScience dataset still comes from a typical crawl of the internet. When it came time to filter that data, BigScience tried to avoid making value judgments about sexual content, Jernite said, and erred on the side of not blocking terms.

Recent research has shown that filtering can introduce new problems. A 2021 paper on one of the largest datasets to emerge from a crawl of the web found that cleaning up the text by removing slurs on an industry-approved blocklist also removed content about LGBTQ identity, as well as text written in African American and Hispanic vernacular.
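A toy example shows how a word blocklist can over-remove. The sketch below is not the filter examined in the 2021 paper; it is a minimal illustration, assuming a made-up blocklist with the placeholder term `slurword` and hypothetical function names. It drops any document containing a listed term, regardless of how the term is used.

```python
import re


def passes_blocklist(document, blocklist):
    """Return True if the document contains none of the blocklisted words."""
    tokens = set(re.findall(r"[a-z']+", document.lower()))
    return tokens.isdisjoint(blocklist)


def filter_corpus(documents, blocklist):
    """Keep only the documents that pass the blocklist check."""
    return [doc for doc in documents if passes_blocklist(doc, blocklist)]


# "slurword" is a placeholder standing in for a real blocklisted term.
blocklist = {"slurword"}
documents = [
    "The weather is nice today.",
    "Someone shouted slurword at me on the bus.",
    "This essay explains why slurword is hurtful to our community.",
]

kept = filter_corpus(documents, blocklist)
print(kept)
```

Only the weather sentence survives. The other two documents, written by people describing or discussing the term rather than using it abusively, are discarded along with it, which is the kind of side effect the research describes.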

BigScience’s ambitions went beyond collaborating with native speakers, as Meta did. BigScience also involved these communities in decision-making from the start, asking them to provide data that explains their culture, not just for the sake of accuracy. Some of the groups that BigScience worked with included Masakhane, an African machine learning group, LatinX in AI, Machine Learning Tokyo and VietAI. To give volunteers more control, participants who provided original data could choose who could download or access their work.

Abeba Birhane, a senior fellow at the Mozilla Foundation who researches bias in large datasets, said BigScience is a relative improvement over OpenAI and Google because of its work with native speaker communities. However, Birhane warned that these communities may only get “a trickle-down advantage.” The same companies could step in, use the newly surfaced datasets in their own models and continue to position themselves as “the authority on these tools,” she said.

Maraim Masoud, a machine learning engineer originally from Libya and now based in Europe, said her focus is on making sure Arabic is well represented. Masoud and her colleagues, including Zaid Alyafeai, a machine learning PhD student at King Fahd University in Saudi Arabia, expanded their work for BigScience into Masader, a catalog of Arabic datasets. Most datasets focus on Standard Arabic, which is used in formal contexts such as newspapers. There are fewer datasets for Arabic dialects, which are commonly used on social media and can differ greatly from Standard Arabic and from one another, even within countries.

Masoud is now helping to assess the model for bias, toxicity and social impact. She said she was hopeful. “Even with GPT-3, the intent was not to have a biased model,” she said. “People are testing it and as they do it will reveal a lot of flaws and bugs. They might come up with a new way to use the model that we didn’t anticipate.”

