It is a great privilege to be one of the last fully human beings.

Technically Incorrect

I realize that in the foreseeable future, the artists formerly known as humans will become walking hybrids of flesh and chips.

Perhaps I shouldn’t have been surprised when Microsoft researchers came along to slightly hasten the hopeless future.

It all seemed so innocent and so scientific. The headline of the researchers’ paper was creatively opaque: “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.”

What do you think this means? Is there a new, faster way for machines to write down your spoken words?

The researchers’ abstract begins gently enough. It uses many words, phrases, and acronyms that are unfamiliar to, say, many lay human language models. It explains that the neural codec language model is called VALL-E.

Surely this name should put you at ease. What’s so scary about technology that sounds like a cute robot from a heartwarming movie?

Well, perhaps this: “VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.”

I have often wanted to develop learning abilities. Instead, I had to resort to waiting for them to appear.

And what emerges from the researchers’ final words is a shudder. From just three seconds of your speech, Microsoft’s big brains can forge longer sentences, perhaps even whole speeches, that you never made but that sound just like you.

I won’t get too deep into the science, as neither of us would benefit from it.

Suffice it to say that VALL-E uses an audio library created by one of the most admired and trusted companies in the world, Meta. Called LibriLight, it is a repository of 7,000 people talking for a total of 60,000 hours.

Of course, I also listened to some of VALL-E’s work.

I listened to a man speaking for three seconds. Then I listened to eight seconds of his VALL-E version reading out a prompt.

You won’t notice much of a difference, if there is any.

It’s true that many of the prompts sounded like very bad snippets of 18th-century literature. Sample: “Thus, this humane and righteous father comforted his unfortunate daughter, and the mother hugged her daughter again and did all she could to ease her daughter’s feelings.”

But what could I do but listen to more of the examples presented by the researchers? Some VALL-E versions were a little more suspect than others. The diction wasn’t quite right. They felt spliced together.

However, the overall effect is appropriately horrifying.

Of course, you have already been warned. You know that when scammers call, you shouldn’t talk to them, in case they record you and then recreate your abstracted voice to make it seem that you ordered some expensive product.

However, this seems like another level of sophistication. Perhaps I’ve seen too many episodes of Peacock’s “The Capture,” where deepfakes are presented as a natural part of government. Perhaps I shouldn’t worry, because Microsoft is such a nice, inoffensive company these days.

But the idea that someone, anyone, can easily be tricked into believing I said something I never said doesn’t comfort me, especially as the researchers claim VALL-E can also reproduce the “emotion and acoustic environment” of the initial three seconds of speech.

The researchers acknowledge as much: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker.”

The solution? The researchers say they are building a detection system.

One or two people may ask, “So why did you do this?”

In the world of technology, the answer is often “Because we can.”

