The Latest AI Chatbots Can Handle Text, Images and Sound. Here’s How

2023-10-11 04:49:38

Slightly more than 10 months ago OpenAI’s ChatGPT was first released to the public. Its arrival ushered in an era of nonstop headlines about artificial intelligence and accelerated the development of competing large language models (LLMs) from Google, Meta and other tech giants. Since that time, these chatbots have demonstrated an impressive capacity for generating text and code, albeit not always accurately. And now multimodal AIs that are capable of parsing not only text but also images, audio, and more are on the rise.

OpenAI released a multimodal version of ChatGPT, powered by its LLM GPT-4, to paying subscribers for the first time last week, months after the company first announced these capabilities. Google began incorporating image and audio features similar to those offered by the new GPT-4 into some versions of its LLM-powered chatbot, Bard, back in May. Meta, too, announced big strides in multimodality this past spring. Though it is in its infancy, the burgeoning technology can perform a variety of tasks.

What Can Multimodal AI Do?

Scientific American tested out two different chatbots that rely on multimodal LLMs: a version of ChatGPT powered by the updated GPT-4 (dubbed GPT-4 with vision, or GPT-4V) and Bard, which is currently powered by Google’s PaLM 2 model. Both can hold hands-free vocal conversations using only audio, and they can describe scenes within images and decipher lines of text in a picture.

These abilities have myriad applications. In our test, using only a photograph of a receipt and a two-line prompt, ChatGPT accurately split a complicated bar tab and calculated the amount owed for each of four different people—including tip and tax. Altogether, the task took less than 30 seconds. Bard did nearly as well, but it interpreted one “9” as a “0,” thus flubbing the final total. In another trial, when given a photograph of a stocked bookshelf, both chatbots offered detailed descriptions of the hypothetical owner’s supposed character and interests that were almost like AI-generated horoscopes. Both identified the Statue of Liberty from a single photograph, deduced that the image was snapped from an office in lower Manhattan and offered spot-on directions from the photographer’s original location to the landmark (though ChatGPT’s guidance was more detailed than Bard’s). And ChatGPT also outperformed Bard in accurately identifying insects from photographs.
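To make the arithmetic of that first test concrete, here is a minimal Python sketch of the kind of proportional split the chatbot produced. The line items, tax rate, and tip percentage below are invented for illustration and are not the actual figures from the receipt used in the test.

```python
# Hypothetical bar tab: each person's pre-tax total, in dollars (invented numbers).
items = {"Alice": 18.00, "Bob": 24.50, "Chen": 12.00, "Dana": 31.50}

subtotal = sum(items.values())
tax = round(subtotal * 0.08875, 2)   # assumed sales-tax rate, for illustration only
tip = round(subtotal * 0.20, 2)      # assumed 20% tip on the pre-tax subtotal

# Split tax and tip in proportion to what each person ordered.
for name, amount in items.items():
    owed = amount + (amount / subtotal) * (tax + tip)
    print(f"{name}: ${owed:.2f}")
```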

Based on this photograph of a potted plant, two multimodal AI-powered chatbots—OpenAI’s ChatGPT (a version powered by GPT-4V) and Google’s Bard—accurately estimated the size of the container. Credit: Lauren Leffer

For disabled communities, the applications of such tech are particularly exciting. In March OpenAI started testing its multimodal version of GPT-4 through the company Be My Eyes, which provides a free description service through an app of the same name for blind and low-vision people. The early trials went well enough that Be My Eyes is now in the process of rolling out the AI-powered version of its app to all its users. “We are getting such exceptional feedback,” says Jesper Hvirring Henriksen, chief technology officer of Be My Eyes. At first there were lots of obvious issues, such as poorly transcribed text or inaccurate descriptions containing AI hallucinations. Henriksen says that OpenAI has improved on those initial shortcomings, however—errors are still present but less common. As a result, “people are talking about regaining their independence,” he says.

How Does Multimodal AI Work?

In this new wave of chatbots, the tools go beyond words. Yet they’re still based around artificial intelligence models that were built on language. How is that possible? Although individual companies are reluctant to share the exact underpinnings of their models, these corporations aren’t the only groups working on multimodal artificial intelligence. Other AI researchers have a pretty good sense of what’s happening behind the scenes.

There are two primary ways to get from a text-only LLM to an AI that also responds to visual and audio prompts, says Douwe Kiela, an adjunct professor at Stanford University, where he teaches courses on machine learning, and CEO of the company Contextual AI. In the more basic method, Kiela explains, AI models are essentially stacked on top of one another. A user inputs an image into a chatbot, but the picture is filtered through a separate AI that was built explicitly to spit out detailed image captions. (Google has had algorithms like this for years.) Then that text description is fed back to the chatbot, which responds to the translated prompt.
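As a rough sketch of that stacked approach, the Python below wires a stand-alone captioning step into an ordinary text prompt. Both helper functions, the model behavior, and the caption text are hypothetical placeholders, not any vendor's actual API.

```python
def caption_image(image_bytes: bytes) -> str:
    """Hypothetical wrapper around a separate image-captioning model."""
    # In a real system this would call a dedicated vision model,
    # e.g. return captioner.generate(image_bytes)
    return "A bar receipt listing twelve drinks, sales tax, and a tip line."

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around a text-only large language model."""
    # e.g. return llm.complete(prompt)
    return "Split four ways, each person owes roughly ..."

def stacked_multimodal_chat(user_question: str, image_bytes: bytes) -> str:
    # The image never reaches the LLM directly: it is first translated into
    # a caption, and only that text is spliced into the prompt.
    caption = caption_image(image_bytes)
    prompt = f"The user attached an image described as: '{caption}'\n\n{user_question}"
    return ask_llm(prompt)
```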

In contrast, “the other way is to have a much tighter coupling,” Kiela says. Computer engineers can insert segments of one AI algorithm into another by combining the computer code infrastructure that underlies each model. According to Kiela, it’s “sort of like grafting one part of a tree onto another trunk.” From there, the grafted model is retrained on a multimedia data set—including pictures, images with captions and text descriptions alone—until the AI has absorbed enough patterns to accurately link visual representations and words together. It’s more resource-intensive than the first strategy, but it can yield an even more capable AI. Kiela theorizes that Google used the first method with Bard, while OpenAI may have relied on the second to create GPT-4. This idea potentially accounts for the differences in functionality between the two models.
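A minimal PyTorch sketch of that tighter coupling follows, under the assumption (not confirmed by either company) that image features are projected into the language model's embedding space and the combined model is then retrained. Every module name and dimension here is invented for illustration; real systems are vastly larger and are trained end to end on image-caption data.

```python
import torch
import torch.nn as nn

class GraftedMultimodalLM(nn.Module):
    """Toy model: vision features grafted into a text transformer's input sequence."""
    def __init__(self, vocab_size=32000, d_model=512, d_vision=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)   # language-side word embeddings
        self.vision_proj = nn.Linear(d_vision, d_model)        # stand-in for a grafted vision encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_patches, token_ids):
        # Project image features into the same vector space as word embeddings,
        # then let one transformer attend over the combined sequence.
        vision_tokens = self.vision_proj(image_patches)              # (B, P, d_model)
        text_tokens = self.token_embed(token_ids)                    # (B, T, d_model)
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)    # (B, P+T, d_model)
        hidden = self.backbone(sequence)
        return self.lm_head(hidden[:, image_patches.size(1):])       # next-word logits for text positions

# Stand-in data: 16 fake image-patch features and a 10-token prompt.
model = GraftedMultimodalLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 10)))
print(logits.shape)  # torch.Size([1, 10, 32000])
```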

Regardless of how developers fuse their different AI models together, under the hood, the same general process is occurring. LLMs function on the basic principle of predicting the next word or syllable in a phrase. To do that, they rely on a “transformer” architecture (the “T” in GPT). This type of neural network takes something such as a written sentence and turns it into a series of mathematical relationships that are expressed as vectors, says Ruslan Salakhutdinov, a computer scientist at Carnegie Mellon University. To a transformer neural net, a sentence isn’t just a string of words—it’s a web of connections that map out context. This gives rise to much more humanlike bots that can grapple with multiple meanings, follow grammatical rules and imitate style. To combine or stack AI models, the algorithms have to transform different inputs (be they visual, audio or text) into the same type of vector data on the path to an output. In a way, it’s taking two sets of code and “teaching them to talk to each other,” Salakhutdinov says. In turn, human users can talk to these bots in new ways.
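That “web of connections” can be made concrete with a toy self-attention computation in plain NumPy: each word's vector gets rewritten as a weighted mix of every other word's vector. The sentence, the dimensions, and the random embeddings are invented; a real transformer adds learned weight matrices, many layers, and many attention heads.

```python
import numpy as np

words = ["the", "bank", "of", "the", "river"]
d = 6
rng = np.random.default_rng(42)
X = rng.standard_normal((len(words), d))        # one vector per word (random stand-ins)

# Single-head self-attention with no learned weights, just the mechanism:
scores = X @ X.T / np.sqrt(d)                   # how strongly each word attends to every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
contextual = weights @ X                        # each word's vector now blends the whole sentence

print(np.round(weights, 2))                     # a 5x5 "web of connections" mapping out context
```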

What Comes Next?

Many researchers view the present moment as the start of what’s possible. Once you begin aligning, integrating and improving different types of AI together, rapid advances are bound to keep coming. Kiela envisions a near future where machine learning models can easily respond to, analyze and generate videos or even smells. Salakhutdinov suspects that “in the next five to 10 years, you’re just going to have your personal AI assistant.” Such a program would be able to navigate everything from full customer service phone calls to complex research tasks after receiving just a short prompt.

The author uploaded this image of a bookshelf to the GPT-4V-powered ChatGPT and asked it to describe the owner of the books. The chatbot described the books displayed and also responded, “Overall, this person likely enjoys well-written literature that explores deep themes, societal issues, and personal narratives. They seem to be both intellectually curious and socially aware.” Credit: Lauren Leffer

Multimodal AI is not the same as artificial general intelligence, a holy grail goalpost of machine learning wherein computer models surpass human intellect and capacity. Multimodal AI is an “important step” toward it, however, says James Zou, a computer scientist at Stanford University. Humans have an interwoven array of senses through which we understand the world. Presumably, to reach general AI, a computer would need the same.

As impressive and exciting as they are, multimodal models have many of the same problems as their singly focused predecessors, Zou says. “The one big challenge is the problem of hallucination,” he notes. How can we trust an AI assistant if it might falsify information at any moment? Then there’s the question of privacy. With information-dense inputs such as voice and visuals, even more sensitive information might inadvertently be fed to bots and then regurgitated in leaks or compromised in hacks.

Zou still advises people to try out these tools—carefully. “It’s probably not a good idea to put your medical records directly into the chatbot,” he says.
