ChatGPT was released just over a year ago (at the end of November 2022), and countless people have already written about their experiences using it in all sorts of settings. (I even contributed my own hot take last year with my O’Reilly Radar article Real-Real-World Programming with ChatGPT.) What more is left to say by now? Well, I bet very few of those people have actually chatted with ChatGPT. And by “chat” I mean the original sense of the word: to hold a back-and-forth verbal conversation with it, just like how you would chat with a fellow human being. I recently chatted with ChatGPT, and I want to use that experience to reflect on the usability of voice interfaces for AI tools based on Large Language Models. I’m personally interested in this topic since I am a professor who researches human-computer interaction, user experience design, and cognitive science, so AI voice interfaces are fascinating to me.
Here’s what I did: In December 2023 I installed the official ChatGPT iOS app from OpenAI on my iPhone and used its voice input mode to hold several hour-long conversations with it while driving long-distance on California highways. I wore standard Apple earbuds with a built-in mic and talked with ChatGPT just like how I would talk to someone on the phone while driving. These long solo drives were the perfect opportunity to try out ChatGPT’s voice feature because I couldn’t interact with the app using my hands for safety reasons.
I had a very clear use case in mind: I wanted a conversation partner to keep me awake and alert while driving long-distance by myself. I’ve found that listening to music or podcasts doesn’t keep me alert when I’m tired because it’s such a passive experience, but what does keep me awake is having someone to talk to, either in the car or remotely on the phone. Could ChatGPT replace a human conversation partner in this role?
The Good: ChatGPT Made Personalized Podcasts to Keep Me Engaged While Driving
Not to bury the lede, it turns out that it did a remarkable job! As I was driving I was able to engage in several hour-long conversations with ChatGPT that ended only because I had to take a rest stop or hit the usage limit for GPT-4. (I pay for a ChatGPT Plus subscription so I can use the most advanced GPT-4 model, but that comes with a usage limit that I usually hit after about an hour.)
The best way to describe my experience is (borrowing a wonderful term my friend coined) that it felt like listening to a personalized podcast. Since ChatGPT did most of the talking, it was a mostly passive listening experience on my part except for times when I wanted to ask follow-up questions or direct it to change topics. Critically, this meant I could still focus most of my attention on driving safely with a level of distraction on par with listening to a podcast. But it kept me more alert than a regular podcast since I could actively direct the flow of the conversation.
For a concrete example of what such a personalized podcast felt like, I started one conversation by straight-up asking ChatGPT to keep me awake while I was driving in Southern California from Los Angeles to San Diego. So it started by making small talk about road trips in general and asking me about various California landmarks that I’ve visited, culminating in asking me more about San Diego (where I live). When it asked me what places I liked visiting the most here, I mentioned the San Diego Zoo and it started telling me a bit about what makes this particular zoo notable. It mentioned the concept of “naturalistic enclosures” (a term I had not heard before), so I asked it to elaborate on what this meant. ChatGPT’s explanation of this concept got me interested in the history of zoos, especially the progression from keeping animals in cages to today’s cage-less naturalistic enclosures, which aim to be better for animal welfare. During that segment it mentioned the term “menagerie” in passing, which I had not heard of in that context before, so I asked it to elaborate more. It then went back farther in history to describe how a menagerie refers to the phenomenon of historical rulers keeping exotic animals for display without as much regard for the animals’ well-being. Listening to that made me realize that I had actually heard the term menagerie in reference to a Star Trek episode of some sort, but I forgot which one, so I asked ChatGPT to jog my memory. It turns out that The Menagerie was a very famous episode of the original Star Trek television series, so after chatting about that episode and other famous Star Trek episodes for a bit, we got onto the topic of why that show was canceled after only three seasons but later found a much larger audience in syndication (i.e., re-runs).
That in turn got me curious about the concept of syndication in the television business, so ChatGPT dived more into this topic. A few more conversational twists and turns later, I suddenly realized that the hour had flown by and it was time to pull over for a bathroom break. Success!
Now, I don’t expect you to care at all about the details of the conversation I just described since it wasn’t your conversation; it was mine! But I certainly cared about it at the time since I was genuinely curious to learn more about the topics that ChatGPT mentioned, often offhand in the midst of telling me about something else. It felt a bit like diving down a Wikipedia rabbit hole of following related links, where each follow-up question I asked led it down another meandering path. It was perfect for keeping me from getting bored and sleepy during my long drive.
ChatGPT isn’t just good at this sort of superficial “personalized podcast about Wikipedia-level trivia” … it can also engage me in a more substantive conversation about a task I actually needed help with at the moment. In another hour-long car chat, I prompted ChatGPT to help me design a way to organize my huge collection of almost 30 years’ worth of personal and work-related files for backup. I’ve been diligent about data backup throughout my life, but my files are fragmented among different media over the years: CDs and DVDs burned back in the day, several generations of external hard drives (which are in various states of decay), university servers, Dropbox, and other cloud services. For years I had an aspirational goal of unifying all of my backups into one central directory tree, akin to the concept of a monorepo in software development. I’ve recently been brainstorming ideas for how to design such a system and how to deal with the practical challenges of scaling and maintenance. So I figured that ChatGPT could help me brainstorm during one of my long drives. Again it did a good job at engaging me in this bespoke conversation, and the hour flew by before I needed to take a rest stop. I won’t bore you with details of what we discussed, but it felt like talking with an expert in data management who was giving me advice about how to deal with my particular challenge.
Intermission: Why It Feels Kind of Magical
Skeptical readers may be thinking at this point, “What’s the big deal, it’s just ChatGPT under the hood. I can already do all this from my computer by typing into the ChatGPT text box!” Although that’s technically true, there’s something magical about being able to do this all hands-free via voice. If you don’t believe me, just try it for an hour. My folk theory is that speaking and listening are hard-wired into our brain’s innate language circuitry, but writing and reading are learned skills (i.e., “software” rather than “hardware” in our brains). And that’s why it feels more magical to hold a verbal conversation with an AI versus having the exact same conversation in a text box on a screen. If the AI is good enough, then it almost feels like you’re talking to a real person … at certain times when I was getting deep into a back-and-forth conversation I nearly forgot I was talking to a machine. However, that illusion broke in several ways …
The Not-So-Good: Usability Limitations of the ChatGPT Voice Interface
Despite my positive experiences with ChatGPT’s voice mode, it still didn’t live up to the gold standard of feeling like I was talking with a fellow human being. That’s okay, though, since that is an extremely high bar! Here are some of the ways it fell short.
- Must speak entire request: Most notably, it felt unnatural to have to speak my entire request without pausing. Whenever I paused for too long, ChatGPT would interpret what I had said so far as my request and start processing it. As an analogy, when typing a request in a text chat, you can hit the Enter or Send buttons … imagine how weird it would be if ChatGPT started answering you the very moment you stopped typing for one second! Note that in human conversations, especially face-to-face, we use visual cues to tell whether our conversation partner is done talking or whether they’re pausing a bit to think about the next thing to say. Even over the phone, we can tell by vocal inflections whether they’re briefly paused and want to keep talking, or whether they’re done with their turn and ready for us to respond. Since ChatGPT can’t do any of that (yet!) I often had to think hard about what I wanted to say and then say it without pausing. This was fine for simple requests like “Tell me more about naturalistic enclosures in zoos,” but for more complex requests like describing some facet of my data backup setup, it was painful to have to blurt out as much as I could without pausing. Even more annoyingly, I would sometimes make mistakes when talking so much without pausing. Ideally the app would do a better job at detecting pauses in human speech, taking both context and vocal intonations into account. An easier hack would be to have a voice command like “DONE” or “OVER” (like when people use walkie-talkies) to signal that I am done talking; however, this would also feel unnatural for casual users.
- Unpredictable wait times: Wait times (latency) for ChatGPT’s responses are unpredictable, and there aren’t audio cues to help me establish an expectation for how long I need to wait before it responds. There’s a click sound when it starts processing my request, but then I may have to wait several seconds in silence before hearing a response … maybe it’s only one second or maybe it’s five seconds. That said, if I ask it to browse the web, then it plays a continuous waiting sound; web browsing takes longer, maybe ten to twenty seconds, but at least I get to hear a “waiting” sound. (I don’t mind ChatGPT taking longer here since a human would also take more time to browse the web. However, web browsing is annoying when I don’t explicitly ask it to browse. Oftentimes I want a quick answer but something I say triggers a browse without me intending to.) In contrast, when speaking with a human face-to-face, I can use visual cues to tell whether the other person is deep in thought or when they’ll likely respond; and even over the phone the other person may say “ummm” or “hold on one sec, lemme think” or “ok let me look this up on the web, hang tight for a while …” if they need more time to think through their response. However, since I don’t get any of these verbal cues from ChatGPT, unpredictable wait times break the illusion of talking to a person.
- Cannot interrupt while it’s speaking: I always had to wait for ChatGPT to completely finish talking before it would listen to my next request. And since I never knew ahead of time how long it planned to talk during a particular turn (i.e., how many words its LLM-generated response would be), when I wanted to say something midway it was frustrating to have to wait. I later saw that I could actually interrupt it by tapping on the app on my phone screen, but since I was driving and hands-free, I couldn’t safely do that. Also, that feels like a cumbersome interaction; I should be able to just talk when I want to, even if it’s talking. This limitation made the conversation feel like we were using a walkie-talkie where only one party can talk at once. And it’s not just me: overlapping speech is widely studied in linguistics and communication research. Humans naturally talk over one another for various reasons, so not being able to do this with ChatGPT made our conversation feel less fluid. Even implementing a feature like a voice command for interruption would be great, like maybe if I say “pause” or “wait” then it could stop and wait for my request.
- Speech recognition errors: ChatGPT’s speech recognition system (presumably based on OpenAI’s open-source Whisper model) is very good, but it does at times misinterpret what I’m saying. What’s stranger is that sometimes it thinks I said something when I didn’t, maybe because it picked up on background rumbles in my car. Several times I wouldn’t be saying anything and it would suddenly respond out of the blue; and when I checked the written transcript later, it thought that I had said something like “Thank you for watching!” (which I never said). At other times it tried to prematurely end the conversation even though I wasn’t done, maybe because it mistakenly detected that I said something along the lines of “Thanks …” without any follow-up. Misrecognizing words is forgivable, but I feel that it shouldn’t ever interpret background sounds as words. Of course, if there were other people in the car with me and either they talked or I was talking to them, then I could also understand how ChatGPT would mistakenly interpret that as being a request for it; always-listening home assistants like Alexa have had this issue for years. A more advanced AI would learn to filter out other people’s voices and also infer when I was speaking with someone else and not it. For instance, when it detects that my sentence is way off topic, maybe that means I’m speaking with someone else in the car; it could at least ask me “Were you talking to me just now?” when it’s unsure. More generally, the idea of explicitly asking me for clarification when it’s unsure would go a long way toward making these interactions feel more human; that’s what I (a representative human!) would do if I were on a noisy phone connection with someone and didn’t hear them clearly.
- Overly-agreeable artificial tone: Lastly, it’s still ChatGPT under the hood, so all the usual limitations of ChatGPT apply here. Most notably, ChatGPT is tuned to be overly-friendly and overly-agreeable (sounding like a customer service agent) so it will simply go along with whatever you say. Thus, by default it will not be good at pushing back on you or challenging your thinking in any meaningful ways, just like how you wouldn’t expect a customer service agent to challenge what you say. Moreover, the overly-friendly tone of its responses may come off as insincere and almost sarcastic at times, even though that wasn’t the designers’ intent. Relatedly, it had a tendency to ask me superficial questions after it responded, which sounded mildly condescending and broke the flow of our chat, like, “Sooo, what do YOU think about the San Diego Zoo? What’s YOUR favorite part of the zoo?!?” … when a normal human wouldn’t break the conversational flow so awkwardly like that. Lastly, ChatGPT is trained on data on the public internet (and can also browse the web to get more updated web contents), so it won’t do as well if you’re asking about things that haven’t been discussed much online.
To summarize the above limitations, chatting with ChatGPT on my phone felt like using a walkie-talkie over a noisy channel to talk to an overly-agreeable but socially-unaware customer service agent who has extensive knowledge about the contents of the public internet.
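To make the voice-command workarounds from the list above a bit more concrete, here is a minimal Python sketch of how an app might triage each transcribed chunk of speech: an explicit “done” or “over” ends my turn, “pause” or “wait” interrupts the assistant while it is speaking, and a transcript that looks like a typical misrecognition triggers a clarifying question instead of a response. Everything in it is hypothetical (the word lists, the function names, the returned labels); it is not how the ChatGPT app or any OpenAI API actually works, just one way the ideas could fit together.

```python
# Hypothetical sketch only: these names and word lists are illustrative
# assumptions, not part of the ChatGPT app or any real speech API.

END_OF_TURN_WORDS = {"done", "over"}        # walkie-talkie style turn enders
INTERRUPT_WORDS = {"pause", "wait"}         # spoken while the assistant talks
SUSPICIOUS_PHRASES = {"thank you for watching", "thanks"}  # likely misheard noise


def _normalize(text):
    """Lowercase the transcript and drop surrounding whitespace/punctuation."""
    return text.lower().strip().strip(".!?,").strip()


def classify_segment(transcript, assistant_is_speaking):
    """Decide what to do with one transcribed chunk of user speech.

    Returns one of: "interrupt", "end_of_turn", "ask_clarification", "continue".
    """
    cleaned = _normalize(transcript)
    if not cleaned:
        # Nothing recognizable was said, so keep listening.
        return "continue"

    if assistant_is_speaking:
        # Only a bare interrupt command stops the assistant mid-sentence.
        return "interrupt" if cleaned in INTERRUPT_WORDS else "continue"

    if cleaned in SUSPICIOUS_PHRASES:
        # Instead of answering (or ending the chat), ask
        # "Were you talking to me just now?"
        return "ask_clarification"

    last_word = cleaned.split()[-1].strip(".!?,")
    if last_word in END_OF_TURN_WORDS:
        # The user explicitly signalled the end of their request.
        return "end_of_turn"

    # Otherwise treat a pause as thinking time and keep accumulating speech.
    return "continue"
```

With this sketch, `classify_segment("Tell me more about zoos. Over.", False)` yields `"end_of_turn"`, while a trailing-off fragment like “so my backups are on three drives and” yields `"continue"`. The obvious weakness is that a sentence naturally ending in “over” (say, “the show is over”) would also end the turn, which is one reason this kind of hack would feel unnatural for casual users.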
Parting Thoughts: Cautiously Optimistic About the Future
Despite these limitations, I’m excited to see what’s in store for future voice interfaces to LLM-based AI tools like ChatGPT. My early experiences of talking with ChatGPT while driving gave me a glimpse into what many of us have seen growing up in sci-fi shows such as Star Trek, where people can talk to an omnipresent computer to ask questions, hold conversations, or issue commands. Hands-free operation isn’t useful only while driving: it can make computing truly ubiquitous by letting us seamlessly interact with computation while we’re in the midst of doing housework, cooking, or childcare; and it can make computing more accessible to broader groups of people, such as those with mobility impairments.
We still have a long way to go, though. Right now the ChatGPT iPhone app isn’t hooked up to external tools besides a basic web browser, but with the recently-announced GPT store (and likely upcoming LLM app stores from other companies) it will soon be possible to hook up LLMs to a variety of tools that can manage our emails, shopping lists, personal finances, home automation, and more. Recent research has started exploring these ideas by connecting ChatGPT to home assistants such as Amazon Alexa (2023 arXiv paper PDF). Another promising line of work is better context awareness: for instance, Meta and Ray-Ban recently announced new Smart Glasses which allow users to talk with an AI assistant that can see what they’re seeing (review from The Verge). In my driving scenario, you could imagine wearing these glasses and having the AI act more like a passenger sitting alongside you in the car seeing what you see rather than someone on the other end of a phone call. Critically, a passenger can pause the conversation and tell you to watch the road more carefully if they see a possible hazard ahead; a future AI powered by such smart glasses may be able to do the same thing. Alternatively, cars are now starting to directly embed AI into entertainment systems (e.g., Volkswagen announcement at CES 2024), so future iterations could integrate cameras and 3-D tracking to augment LLMs. One could also imagine smartglasses-based multimodal interactions where you point to objects in any physical environment and start conversations with the AI assistant about your surroundings (check out this MKBHD YouTube Short showing AI chat with smart glasses).
Of course, these increasingly-intense levels of AI interaction and automation come with risks, such as user overreliance, unintended command execution, mental or physical health hazards, and security/privacy violations. Thus, it will be important to design ways to both manage these risks and educate users about how to safely operate these increasingly-powerful systems. Thank you very much for reading. Sooo, what do YOU think about ChatGPT’s voice mode?!? What are YOUR favorite and least favorite parts?