On speaking terms with machines

We have interacted with our computers in mostly the same way for almost 60 years. But now we’re entering the age of conversational interfaces. Schibsted’s Futures Lab has been experimenting to understand their capabilities and constraints. The experience was surreal.

By Christopher Pearsell-Ross

With the invention of the mouse in the 1960s, command-line interfaces gave way to a visual paradigm defined by graphical user interfaces (GUIs). Icons, menus, and windows made computing more accessible to more people, and more applicable to a broader range of tasks.

In the mobile age, we have left our cursors behind in favour of the touchscreen. Now more than ever, we are reliant on visual metaphors to interact with our machines. We discover, create and explore our digital worlds with clicks and scrolls, taps and swipes, but this reliance on two-dimensional GUIs does not fully reflect our shared vision of how future technology should look.

These visions, exemplified by scenes in science fiction film and television, help define our shared expectations for what our technology should be capable of. In the future we are often shown, machines will speak and understand us. They will know us, anticipate our needs, and for better or worse, have the agency to act on our behalf. Large language models and tools like ChatGPT appear to be changing the paradigm again, bringing these sci-fi visions closer to reality.

Developed in 1964

These conversational interfaces are by no means new. Eliza, the first convincing chatbot, was developed at MIT in 1964 using simple pattern matching and rule-based responses. Siri was launched in 2011 as part of iOS, using machine learning to recognise speech and to make sense of our intentions, letting many of us speak to our computers with our own voices for the first time.

But these interfaces have been limited to the responses and actions their programmers pre-defined. AI might have changed the input side of the equation, but these tools are still a lot closer to Eliza than we might care to admit. Advancements in AI technology over the last few years are radically altering this equation.

The introduction of generative AI, built on advanced neural networks called transformers, is reshaping the way our computers understand, process, and even create text. These AI models are what revolutionary new products like ChatGPT are built on, but they are also driving incredible improvements beyond text generation, including new capabilities in speech recognition, voice synthesis, sentiment analysis, image and video generation, and even the creation of 3D assets and animations.

In the year since ChatGPT was released, several key tech trends have been shaping the future of conversational interfaces. Context windows are growing, essentially giving these tools longer memories and leading to more nuanced and relevant conversations. These tools are also getting connected to external data sources and digital services, enabling them to provide reliable and referenced answers, perform calculations and data analysis, and even take actions on behalf of the user. Lastly, as a recent ChatGPT release shows, these tools are becoming multi-modal, meaning they are capable of processing not only text, but also audio and images as both inputs and outputs, further expanding their versatility.

Until now, conversational interfaces have been limited to pre-defined responses and actions. AI is radically altering this equation.

Aside from technology, social trends are also shaping this conversational paradigm. Firstly, populations in the developed world are ageing as birth rates decline, life expectancies increase, and immigration and healthcare systems struggle to keep up. At the same time, feelings of loneliness and isolation are growing. In 2022, the number of single-person households in Sweden grew to over two million, and in 2023, the US Surgeon General warned of the public health effects of a growing epidemic of loneliness. Finally, in many parts of the world, education gaps are also growing. Inequities like racial, gender, and economic disparities mean more people around the world are left out and left behind when it comes to the opportunities that education affords.

Taken together, we are seeing signs that point toward a future in which we increasingly rely on our technology for tasks and companionship that have traditionally been performed by people. There are opportunities and risks here. Conversational tools might enable new forms of healthcare and companionship services, give knowledge workers new superpowers, or provide personalised tutors to children who need them. And they might also replace human connection, displace workers, or widen inequities.

Conversational user interfaces can bridge the best of what computers and humans have to offer.

While looking at hypothetical scenarios and possible outcomes is an important part of how we inform our strategy, our mission at Futures Lab doesn’t end there. To learn more about what we can and should do today, we need to get our hands dirty with practical experimentation.

Speculative prototyping is like a kind of time travel – it allows us to rehearse possible futures, and to experience what it might feel like to be there ourselves. In this case, we built a phone powered by ChatGPT to learn about how we might talk with AI-enabled devices in the future.

Inspired by science fiction examples like Samantha from the film “Her”, we set out to build an audio-only interface. Our goal was to explore the technical maturity, usability, and applicability of CUIs in today’s landscape.

We scoured Finn.no for a suitable device to house our new tool and settled on a 1970s-era Ericofon 700. To provide a context for our experiment, we decided to explore workplace productivity and set out to design a weekly debriefing tool to help us reflect on our work and keep our stakeholders updated.

We were able to use the original speaker, but replaced the dialling mechanism with a Raspberry Pi minicomputer, a new microphone, and a proximity sensor so we could tell when the phone was lifted. Using OpenAI’s Whisper service for voice recognition, we sent a transcript of what users said to ChatGPT along with a custom system prompt. This prompt tells GPT how to respond, what role to play, and which tone of voice to use. Finally, the system’s response is played back to the user using Google Cloud’s text-to-speech functionality.
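The conversation loop at the heart of this pipeline can be sketched roughly as follows. This is an illustration only, assuming a chat-completion-style API; `SYSTEM_PROMPT`, `run_turn`, and `chat_fn` are hypothetical names, not the Lab’s actual code, and the model call is injected as a function so the flow can be shown without live API credentials.

```python
# Hypothetical sketch of one turn of the phone's conversation loop.
# SYSTEM_PROMPT below is an invented stand-in for the Lab's real prompt.
SYSTEM_PROMPT = (
    "You are a weekly debriefing assistant. Interview the user about "
    "their week at work, ask relevant follow-up questions, and keep "
    "the conversation on task."
)

def run_turn(user_text, history, chat_fn):
    """Send one transcribed utterance plus prior turns to the model.

    chat_fn takes the full message list and returns the assistant's
    reply as a string; on the real device it would wrap a
    chat-completion API call. Returns (reply, updated_history).
    """
    # The system prompt always leads the message list, so the model
    # keeps its role and tone across the whole conversation.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_text})

    reply = chat_fn(messages)

    # Append both sides of the exchange so the next turn has context.
    new_history = history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": reply},
    ]
    return reply, new_history
```

On the device itself, `user_text` would come from Whisper’s transcription of the handset microphone, and the returned reply would be sent to the text-to-speech service before being played through the original speaker.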

The result was compelling and eerily similar to some examples from science fiction. While you still need to take turns speaking and listening, the conversation flows fairly naturally. Our AI agent can ask highly relevant follow-up questions, keep the conversation on task, and help users reflect on their work in new ways. Once the system determines it has enough information (usually after a few minutes of back-and-forth conversation), it writes a summary for the user, which it can either re-write or submit to a database at the user’s instruction. From there the summaries can be used in any number of ways, from providing a searchable archive of our progress to creating tailored newsletters and Slack updates.

The audio-only experience allows us to assess what actually speaking with our machines in open-ended, flowing conversations might be like, without relying on the graphical and visual indicators we normally use.

Using these new interfaces has been as informative as it has been surreal. The scenes from “Her” and “Star Trek” that we took as inspiration are very quickly becoming reality. Testing prototypes like this can help us understand the capabilities and limitations of the technology, how to design usable products, and where and when CUIs are an appropriate choice.

Impressed by the quality

The people who have tested our phone interface were impressed by the overall quality of the conversations and the relevance of the follow-up questions. Being able to go off-script and have an actual voice conversation with a computer has been revelatory, though not without its frustrations.

Audio-only experiences are clearly outliers, but prototyping in this extreme format and comparing the experience to conventional chatbots has highlighted some important usability considerations. The things we may take for granted when using well-designed GUIs – namely, seeing the system status, understandable actions with clear icons and buttons, and information hierarchies that prevent cognitive overload – become more complicated when we only have our ears to rely on.

When it comes to usability and user experience, preferences are strongly divided between the audio and text-based interfaces. Some users felt the intimacy, distraction-free focus, and ability to speak plainly without pretension or self-editing created a novel experience, one in which they were prompted to reflect and to share with a sense of openness and safety. Other users expressed a strong preference for text-based communication. They cited the efficiency of typing, the ability to refer to questions and previous answers, having time to formulate and edit their responses, as well as the ability to read and paste in other materials as important factors.

An important consideration in both text- and audio-based CUIs is expectation management. These tools have come a long way and are able to converse at such a high level that many users will expect capabilities and understanding far beyond what the systems can actually do. We can blame this partly on the quality of synthesised voices available today – the more human the system sounds, the more human we expect it to behave.

ChatGPT and other conversational tools like it are already demonstrating two key superpowers. First, they are great conversationalists and interviewers – they are able to understand our meaning clearly, provide tailored answers, and ask highly relevant questions. Second, they are able to translate human language into data consumable by machines, and to take complex data and translate it back into comprehensible human language.

We see these tools being most useful in contexts in which both of these abilities can be leveraged. Obvious applications include games and interactive media, personalised content production in news media, customer service, sales, and product discovery. They are already proving highly useful as task-specific assistants in programming, research, and data analysis, and we expect them to be applied as pervasive personal assistants and tutors in the very near future. Less obvious, and perhaps more far-fetched and ethically challenging, applications include AI therapists, healthcare advisors, and personal companions for the vulnerable.

Combination of superpowers

Conversational user interfaces can bridge the best of what computers and humans have to offer. They can leverage the high-speed data analysis and computational superpowers of computers, while making sense of the messy, creative, and intuitive understanding we have as humans. In the best-case scenario, this combination of superpowers will help make the world more accessible to people with visual and cognitive differences, help make education more accessible and tailored to individual needs, increase our productivity at work, and free up more of our time for the things that truly matter. On the other hand, these tools also have significant potential to disrupt labour with new forms of automation and to create emotionally impactful, biased content that drives further social isolation, misinformation, and inequity. The reality is that both scenarios are likely to coexist.

This is a rapidly changing landscape, and things we thought of as science fiction are already becoming reality. We can’t predict the future, but through foresight and experimentation we can better prepare ourselves for the changes, challenges, and opportunities that are coming. That’s the approach we try to take at Schibsted’s Futures Lab. We are seeing a new paradigm of interaction on the verge of being born. CUIs have incredible potential to empower people in their daily lives – if we can get it right.

This text was written by humans on the Futures Lab team. ChatGPT was used as a sparring partner and writing critic throughout the process. Special thanks to our summer intern Lucía Montesinos for driving much of this work.

Christopher Pearsell-Ross
UX designer, Schibsted Futures Lab
Years in Schibsted: 2.5
My favourite song of the last decade: Your Best American Girl – Mitski