MLCommons, the nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to publish one of the world’s largest public domain collections of voice recordings for AI research.
The dataset, called Unsupervised People’s Speech, contains more than a million hours of audio spanning at least 89 languages. According to MLCommons, it was motivated to create the dataset by a desire to support R&D “in different areas of speech technology.”
“Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people worldwide,” the organization wrote in a blog post Thursday. “We expect the research community to continue to build on and develop the dataset, especially around improving low-resource language speech models, better handling of varied accents and dialects, and new applications of speech synthesis.”
It’s a laudable goal, to be sure. But AI datasets like Unsupervised People’s Speech can carry risks for the researchers who choose to use them.
Biased data is one of those risks. The recordings in Unsupervised People’s Speech come from Archive.org, the nonprofit perhaps best known for its Wayback Machine web archival tool. Because many of Archive.org’s contributors are English-speaking and American, almost all of the recordings in Unsupervised People’s Speech are in American-accented English, according to the readme on the official project page.
That means that, without careful filtering, AI systems such as speech recognition and voice synthesizer models trained on Unsupervised People’s Speech could exhibit some of the same biases. They might, for example, struggle to transcribe English spoken by a non-native speaker, or have trouble generating synthetic voices in languages other than English.
Unsupervised People’s Speech may also contain recordings from people who are unaware that their voices are being used for research purposes, including commercial applications. While MLCommons says that all of the recordings in the dataset are public domain or available under Creative Commons licenses, there is a possibility that mistakes were made.
According to an MIT analysis, hundreds of publicly available AI training datasets lack licensing information and contain errors. Creator advocates, including Ed Newton-Rex, CEO of the AI ethics-focused nonprofit Fairly Trained, have argued that creators shouldn’t be expected to “opt out” of AI datasets because of the heavy burden opting out places on them.
“Many creators […] have no way of opting out,” Newton-Rex wrote in an X post last June. “For the creators who can opt out, there are multiple overlapping opt-out methods that are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI competes with them using their own work; many simply don’t realize they can opt out.”
MLCommons says it is committed to updating, maintaining, and improving the quality of Unsupervised People’s Speech. But given the potential flaws, developers would do well to exercise caution.