Your Personal Information Is Probably Being Used to Train Generative AI Models

2023-10-20 17:37:37

Artists and writers are up in arms about generative artificial intelligence systems—understandably so. These machine learning models are only capable of pumping out images and text because they’ve been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers, including OpenAI, Meta and Stability AI, now face multiple lawsuits over the practice. Such legal claims are supported by independent analyses; in August, for instance, The Atlantic reported finding that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books.

And training data sets for these models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online. It also means that supposedly neutral models could be trained on biased data. A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data—but Scientific American spoke with some AI experts who have a general idea.

Where do AI training data come from?

To build large generative AI models, developers turn to the public-facing Internet. But “there’s no one place where you can go download the Internet,” says Emily M. Bender, a linguist who studies computational linguistics and language technology at the University of Washington. Instead developers amass their training sets through automated tools that catalog and extract data from the Internet. Web “crawlers” travel from link to link indexing the location of information in a database, while Web “scrapers” download and extract that same information.
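
As a rough illustration of that division of labor, here is a minimal sketch, in Python, of a crawler that walks links and records where pages live, paired with a scraper that downloads and extracts the text itself. It is not any company’s actual pipeline; the seed URL is a placeholder, and it assumes the third-party requests and beautifulsoup4 packages are installed.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(seed_url, max_pages=10):
    """Crawler step: follow links breadth-first and index page locations."""
    queue, index, seen = [seed_url], [], set()
    while queue and len(index) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages in this sketch
        index.append(url)  # record where the content lives
        soup = BeautifulSoup(page.text, "html.parser")
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))
    return index


def scrape(url):
    """Scraper step: download one indexed page and extract its visible text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    for page_url in crawl("https://example.com"):
        print(page_url, len(scrape(page_url)), "characters of text")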

A very well-resourced company, such as Google’s owner, Alphabet, which already builds Web crawlers to power its search engine, can opt to employ its own tools for the task, says machine learning researcher Jesse Dodge of the nonprofit Allen Institute for AI. Other companies, however, turn to existing resources such as Common Crawl, which helped feed OpenAI’s GPT-3, or databases such as the Large-Scale Artificial Intelligence Open Network (LAION), which contains links to images and their accompanying captions. Neither Common Crawl nor LAION responded to requests for comment. Companies that want to use LAION as an AI resource (it was part of the training set for image generator Stable Diffusion, Dodge says) can follow these links but must download the content themselves.
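
To make that last step concrete, here is a minimal, hypothetical sketch of the follow-the-links workflow: the resource itself supplies only image URLs and captions, and whoever wants the actual pixels must fetch them from the original hosts. The CSV layout and file names are illustrative assumptions, not LAION’s real schema, and the sketch again relies on the third-party requests package.

import csv
import pathlib

import requests


def download_from_link_list(csv_path, out_dir):
    """Read (url, caption) rows and download each linked image locally."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):  # assumes 'url' and 'caption' columns
            try:
                resp = requests.get(row["url"], timeout=10)
            except requests.RequestException:
                continue  # the original host may have removed the image
            if resp.ok:
                (out / f"{i:06d}.jpg").write_bytes(resp.content)  # the image itself
                (out / f"{i:06d}.txt").write_text(row["caption"], encoding="utf-8")  # its caption


if __name__ == "__main__":
    download_from_link_list("image_links.csv", "downloaded_images")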

Web crawlers and scrapers can easily access data from just about anywhere that’s not behind a login page. Social media profiles set to private aren’t included. But data that are viewable in a search engine or without logging into a site, such as a public LinkedIn profile, might still be vacuumed up, Dodge says. Then, he adds, “there’s the kinds of things that absolutely end up in these Web scrapes”—including blogs, personal webpages and company sites. This includes anything on popular photograph-sharing site Flickr, online marketplaces, voter registration databases, government webpages, Wikipedia, Reddit, research repositories, news outlets and academic institutions. Plus, there are pirated content compilations and Web archives, which often contain data that have since been removed from their original location on the Web. And scraped databases do not go away. “If there was text scraped from a public website in 2018, that’s forever going to be available, whether [the site or post has] been taken down or not,” Dodge notes.

Some data crawlers and scrapers are even able to get past paywalls (including Scientific American’s) by disguising themselves behind paid accounts, says Ben Zhao, a computer scientist at the University of Chicago. “You’d be surprised at how far these crawlers and model trainers are willing to go for more data,” Zhao says. Paywalled news sites were among the top data sources included in Google’s C4 database (used to train Google’s LLM T5 and Meta’s LLaMA), according to a joint analysis by the Washington Post and the Allen Institute.

Web scrapers can also hoover up surprising kinds of personal information of unclear origins. Zhao points to one particularly striking example where an artist discovered that a private diagnostic medical image of herself was included in the LAION database. Reporting from Ars Technica confirmed the artist’s account and that the same data set contained medical record photographs of thousands of other people as well. It’s impossible to know exactly how these images ended up being included in LAION, but Zhao points out that data get misplaced, privacy settings are often lax, and leaks and breaches are common. Information not intended for the public Internet ends up there all the time.

In addition to data from these Web scrapes, AI companies might purposefully incorporate other sources—including their own internal data—into their model training. OpenAI fine-tunes its models based on user interactions with its chatbots. Meta has said its latest AI was partially trained on public Facebook and Instagram posts. According to Elon Musk, the social media platform X (formerly known as Twitter) plans to do the same with its own users’ content. Amazon, too, says it will use voice data from customers’ Alexa conversations to train its new LLM.

But beyond these acknowledgements, companies have become increasingly cagey about revealing details on their data sets in recent months. Though Meta offered a general data breakdown in its technical paper on the first version of LLaMA, the release of LLaMA 2 a few months later included far less information. Google, too, didn’t specify the data sources for its recently released PaLM 2 AI model, beyond saying that much more data were used to train PaLM 2 than to train the original PaLM. OpenAI wrote that it would not disclose any details on its training data set or method for GPT-4, citing competition as a chief concern.

Why are dodgy training data a problem?

AI models can regurgitate the same material that was used to train them—including sensitive personal data and copyrighted work. Many widely used generative AI models have blocks meant to prevent them from sharing identifying information about individuals, but researchers have repeatedly demonstrated ways to get around these restrictions. For creative workers, even when AI outputs don’t exactly qualify as plagiarism, Zhao says they can eat into paid opportunities by, for example, aping a specific artist’s unique visual techniques. But without transparency about data sources, it’s difficult to blame such outputs on the AI’s training; after all, it could be coincidentally “hallucinating” the problematic material.

A lack of transparency about training data also raises serious issues related to data bias, says Meredith Broussard, a data journalist who researches artificial intelligence at New York University. “We all know there is wonderful stuff on the Internet, and there is extremely toxic material on the Internet,” she says. Data sets such as Common Crawl, for instance, include white supremacist websites and hate speech. Even less extreme sources of data contain content that promotes stereotypes. Plus, there’s a lot of pornography online. As a result, Broussard points out, AI image generators tend to produce sexualized images of women. “It’s bias in, bias out,” she says.

Bender echoes this concern and points out that the bias goes even deeper—down to who can post content to the Internet in the first place. “That is going to skew wealthy, skew Western, skew towards certain age groups, and so on,” she says. Online harassment compounds the problem by forcing marginalized groups out of some online spaces, Bender adds. This means data scraped from the Internet fail to represent the full diversity of the real world. It’s hard to understand the value and appropriate application of a technology so steeped in skewed information, Bender says, especially if companies aren’t forthright about potential sources of bias.

How can you protect your data from AI?

Unfortunately, there are currently very few options for meaningfully keeping data out of the maws of AI models. Zhao and his colleagues have developed a tool called Glaze, which can be used to make images effectively unreadable to AI models. But the researchers have only been able to test its efficacy with a subset of AI image generators, and its uses are limited. For one thing, it can only protect images that haven’t previously been posted online. Anything else may have already been vacuumed up into Web scrapes and training data sets. As for text, no similar tool exists.

Website owners can insert digital flags telling Web crawlers and scrapers not to collect site data, Zhao says. It’s up to the scraper’s developer, however, whether to abide by these notices.
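
The most widely used of these flags is a robots.txt file placed at a site’s root, which names crawler user agents and the paths they are asked to avoid. Below is a minimal sketch of how a well-behaved scraper could check such a file with Python’s standard-library urllib.robotparser before fetching a page; the user-agent strings and URL are illustrative examples, and, as noted above, nothing forces a scraper to perform this check.

# A hypothetical robots.txt asking two AI-related crawlers to stay away:
#   User-agent: GPTBot
#   Disallow: /
#   User-agent: CCBot
#   Disallow: /
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def may_fetch(user_agent, url):
    """Return True if the site's robots.txt permits this agent to fetch the URL."""
    parts = urlsplit(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # download and parse the site's robots.txt
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant scraper would skip this page if may_fetch returns False.
    print(may_fetch("GPTBot", "https://example.com/some-article"))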

In California and a handful of other states, recently passed digital privacy laws give consumers the right to request that companies delete their data. In the European Union, too, people have the right to data deletion. So far, however, AI companies have pushed back on such requests by claiming the provenance of the data can’t be proven—or by ignoring the requests altogether—says Jennifer King, a privacy and data researcher at Stanford University.

Even if companies respect such requests and remove your information from a training set, there’s no clear strategy for getting an AI model to unlearn what it has previously absorbed, Zhao says. To truly pull all the copyrighted or potentially sensitive information out of these AI models, one would have to effectively retrain the AI from scratch, which can cost up to tens of millions of dollars, Dodge says.

Currently there are no significant AI policies or legal rulings that would require tech companies to take such actions—and that means they have no incentive to go back to the drawing board.
