
Just as a heads up, the HazCams on Perseverance are in fact in color (Source: https://link.springer.com/article/10.1007/s11214-020-00765-9 - "The Mars 2020 Navcams and Hazcams offer three primary improvements over MER and MSL. The first improvement is an upgrade to a detector with 3-channel, red/green/blue (RGB) color capability that will enable better contextual imaging capabilities than the previous engineering cameras, which only had a black/white capability.") Your observations are correct though - the stereo precision is important, so there was additional analysis of the stereo depth computation to make sure it wouldn't cause an issue.
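For anyone wondering why a color detector touches the stereo pipeline at all: depth comes from matching features between the left and right images and converting the disparity, so anything that changes the pixel data being matched can affect precision. A back-of-the-envelope sketch of the standard pinhole stereo relation (numbers made up for illustration, nothing to do with the actual flight cameras):

    # Standard pinhole stereo: depth Z = f * B / d, with focal length f in
    # pixels, baseline B between the two cameras, and disparity d in pixels.
    def depth_from_disparity(focal_px, baseline_m, disparity_px):
        return focal_px * baseline_m / disparity_px

    f_px = 1200.0   # hypothetical focal length in pixels
    base = 0.42     # hypothetical stereo baseline in meters

    for d in (40.0, 39.5):  # a half-pixel matching error
        print("disparity %.1f px -> depth %.3f m" % (d, depth_from_disparity(f_px, base, d)))

Even a half-pixel matching error moves the depth estimate by more than 10 cm at that range, which is why the precision analysis matters.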


Huh, I guess so. Looking over the study, it sounds like they ran into issues when imaging dirt in the scoops: they couldn't tell whether they were seeing Martian dirt or a shadow.

I have a feeling I'd be the angry guy in the meeting who wouldn't accept the consensus. "But what about latency! What about the descent and landing!" *shakes fist*


Nah, your concerns are 100% reasonable - they just apply in a different context. On Earth, latency is king. On Mars, especially until the Primary Mission is complete, it's all about risk mitigation. Since we're light-minutes away from Earth, a few frames of latency is nothing. At the same time, you want to avoid breaking your $3B machine, which is hard to operate given the speed-of-light delay and comms limitations. Just a different set of tradeoffs. IIRC they first tested on-device deep learning for hazard avoidance on Curiosity, but don't quote me on that.
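To put rough numbers on "light-minutes away" (approximate distances, just divided by the speed of light):

    # Rough one-way light time to Mars at closest approach and near conjunction.
    C_KM_S = 299_792.458  # speed of light, km/s

    for label, dist_km in (("closest approach", 54.6e6), ("near conjunction", 401e6)):
        print("%s: ~%.1f light-minutes one way" % (label, dist_km / C_KM_S / 60))

That's roughly 3 to 22 minutes each way, so a round-trip command cycle dwarfs any on-board frame latency.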

- Worked at JPL for a few years and have dozens of friends there, a few of them on the vision system.


FYI - the HazCams on Perseverance are in fact in color (this is new, they were black and white on Curiosity)! Stereo precision was a concern based on the switch to color sensors, so there was some algorithmic work done to make sure it wouldn't cause an issue. (Source: https://link.springer.com/article/10.1007/s11214-020-00765-9 - "The Mars 2020 Navcams and Hazcams offer three primary improvements over MER and MSL. The first improvement is an upgrade to a detector with 3-channel, red/green/blue (RGB) color capability that will enable better contextual imaging capabilities than the previous engineering cameras, which only had a black/white capability.")


Interesting, I didn't know that. I knew the CacheCam was color, but somehow missed that detail, despite actually seeing the camera in person at one point...


Wow, real upgrades all around compared to Curiosity!

What are they going to do next? Put a solar-powered Mars helicopter on board?? ;-)


I work in this space, so I'd love to give a bit of detail on the ML if you're interested:

Style transfer is trickier to do with speech than with images! One significant issue is the lack of a good "content" versus "style" distinction. In images you can get great results by calling the higher-level features of an object classifier network "content" and holding that constant. Some people have tried this for audio with e.g. a phoneme classifier, but there are additional characteristics (such as inflection) that convey the emotional content of speech and wouldn't be held constant.
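To make the image analogy concrete, here's roughly what "hold the higher-level features constant" looks like on the image side - a minimal sketch using a pretrained Keras VGG19, not anything speech-specific:

    import tensorflow as tf

    # "Content" = activations of a higher layer of a pretrained classifier;
    # the style transfer loss penalizes the stylized image for drifting away
    # from the original's activations while the style terms do their thing.
    vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
    content_layer = tf.keras.Model(vgg.input, vgg.get_layer("block4_conv2").output)

    def content_loss(original_img, stylized_img):
        # both are float tensors of shape (1, H, W, 3), already VGG-preprocessed
        return tf.reduce_mean(
            tf.square(content_layer(original_img) - content_layer(stylized_img)))

The speech version would swap the object classifier for something like a phoneme classifier, and that's exactly where inflection and emotion slip through the cracks.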

Another issue is that much of the speech classification work is done in spectrogram (or, with further processing, MFCC) space, which lets you treat audio similarly to images and leverage a bunch of the technology we have for classifying those. But for synthesizing speech, spectrograms aren't a fantastic representation, because small errors in spectrogram space can translate into large errors in the waveform which are very clearly audible, and humans in general are pretty sensitive to audio errors. There are cool neural spectrogram inversion methods out there which can help, but those still need to be trained to be robust to the kinds of errors that a style transfer algorithm would make, so it's still pretty tricky.
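If you want to hear the problem for yourself, here's a quick librosa experiment (the file names are just placeholders): keep only the magnitude spectrogram and rebuild the waveform with Griffin-Lim phase estimation. Even with zero error in the magnitudes, the result already has audible artifacts.

    import librosa
    import soundfile as sf

    # Discard the phase, keep the magnitude STFT, and estimate a new phase
    # with Griffin-Lim - then listen to the difference.
    y, sr = librosa.load("speech.wav", sr=None)
    mag = abs(librosa.stft(y, n_fft=1024, hop_length=256))
    y_rec = librosa.griffinlim(mag, n_iter=60, hop_length=256)
    sf.write("reconstructed.wav", y_rec, sr)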

My company, Modulate, is building speech style transfer tech, and we've found a lot more success with adversarial methods on raw audio synthesis, where the adversary forces the generator to produce plausible speech from the target speaker!
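For the curious, the general shape of the adversarial setup (emphatically not our actual architecture, just the textbook version on raw waveforms): a generator maps source audio to audio in the target voice, and a discriminator is trained to tell generated audio from real recordings of the target speaker.

    import tensorflow as tf

    def make_discriminator():
        # Tiny stand-in discriminator over ~1/3 s of 48 kHz raw audio.
        return tf.keras.Sequential([
            tf.keras.layers.Conv1D(64, 15, strides=4, activation="relu",
                                   input_shape=(16384, 1)),
            tf.keras.layers.Conv1D(128, 15, strides=4, activation="relu"),
            tf.keras.layers.GlobalAveragePooling1D(),
            tf.keras.layers.Dense(1),  # real/fake logit
        ])

    bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

    def adversarial_losses(disc, real_audio, fake_audio):
        real_logits = disc(real_audio)
        fake_logits = disc(fake_audio)
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)  # generator wants "real"
        return d_loss, g_loss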

One of the coolest parts of the kind of BMI research in this article, to me, is the potential to buy back some latency margin for speech conversion! If you're working on already-produced speech, there are super tight latency requirements if you want to hear your own speech in the converted voice - go over 20-30ms for the entire audio loop and you start to get echo-like feedback that makes speaking difficult. Even without looping back, you don't want more than 100-200ms of latency in a conversation before it starts impeding the flow of dialogue. This means your style transfer algorithm gets almost no future context, which limits the kinds of manipulations you can do (not to mention the size of the network you can do them with, depending on available compute power!).
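Back-of-the-envelope, with made-up buffer sizes, just to show how fast that budget disappears:

    SR = 48_000              # samples per second
    budget_ms = 25           # roughly where echo-like feedback kicks in
    io_buffers_ms = 5 + 5    # hypothetical input + output device buffers
    model_ms = budget_ms - io_buffers_ms

    print("total budget: %d ms = %d samples" % (budget_ms, SR * budget_ms // 1000))
    print("left for the model, including any lookahead: %d ms = %d samples"
          % (model_ms, SR * model_ms // 1000))

A brain-machine interface that gives you the content of speech before the audio is produced would effectively hand some of that budget back.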


My company, Modulate, is building "voice skins" for exactly this purpose: customizing your voice in chat for online videogames!

It's a really interesting technical challenge. On the one hand, changing your timbre to another human voice is much more complicated than basic pitch shifting, so we ended up using deep neural networks for a kind of simultaneous speech-recognition and speech-synthesis approach. Training this system to preserve e.g. the emotional complexity of your input speech while still changing out the timbre convincingly is difficult - we use adversarial training on the raw audio waveform, which is powerful but also pretty much unknown territory compared to images.

On the other hand, it's important to run with very low latency on your device while you're playing, which means that we can't simply "throw the biggest network we can" at the problem. So we have a tradeoff between model latency, which is easy to characterize, and audio quality / voice skin plausibility, which is pretty ambiguous and subjective.
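The "easy to characterize" half looks something like this (the model path and frame size are placeholders): time a single-frame forward pass a few hundred times and look at the worst case, because real-time audio cares about the slowest frame, not the average.

    import time
    import numpy as np
    import tensorflow as tf

    model = tf.keras.models.load_model("voice_skin.h5")   # placeholder path
    frame = np.zeros((1, 480, 1), dtype=np.float32)       # e.g. 10 ms at 48 kHz

    timings_ms = []
    for _ in range(200):
        start = time.perf_counter()
        model(frame, training=False)
        timings_ms.append((time.perf_counter() - start) * 1000)

    print("median %.2f ms, worst %.2f ms" % (np.median(timings_ms), max(timings_ms)))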

Finally, as this kind of tech improves, the potential for misuse becomes an important problem, so we need to build in protections (like watermarking the audio) that can help prevent fraud while not compromising the speed of the algorithm or the quality of the output audio.
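To give a flavor of what I mean by watermarking (this is not our actual scheme, just the simplest textbook illustration): mix in a very low-level pseudorandom signature keyed by a secret seed, and detect it later by correlating against the same signature.

    import numpy as np

    def embed(audio, seed, strength=1e-3):
        # Add an inaudibly quiet pseudorandom signature keyed by `seed`.
        rng = np.random.default_rng(seed)
        return audio + strength * rng.standard_normal(len(audio))

    def detect(audio, seed):
        # Correlation is near zero for unwatermarked audio and clearly
        # positive (around `strength`) when the mark is present.
        rng = np.random.default_rng(seed)
        return float(np.dot(audio, rng.standard_normal(len(audio))) / len(audio))

Real schemes need to survive compression and resampling, which is where the tension with speed and output quality comes from.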


> Finally, as this kind of tech improves, the potential for misuse becomes an important problem, so we need to build in protections (like watermarking the audio) that can help prevent fraud while not compromising the speed of the algorithm or the quality of the output audio.

Legitimate question (absolutely not rhetorical, I don't know the answer!):

What is the potential misuse of a gender voice changer? If someone wants to change their perceived gender in a video game (presumably to avoid harassment), should that be detectable? If both genders should ideally receive identical treatment anyway, is there any harm in swapping?


I should definitely expand a bit: the voice skins that we're building give you a _specific_ person's timbre (or you can do some cool "voice space" vector manipulations to combine timbres). The cool application here is being able to sound like a character or celebrity in the game, but the risk of misuse when you have a specific person's vocal cords is much greater than when you're just swapping your gender.

That said, there _are_ some interesting things to be careful around, even for changing your gender, or age, or other basic variables of your voice. We're mostly worried about the impact this could have on communities built around these kinds of commonalities: for example, is it okay for a child to masquerade as an adult in an adults-only social group? I don't think there's a clear answer to all of those situations - but until we see more use of realistic voice skins in the real world, we're playing it safe and building in these kinds of tools!


Catfish (verb) - lure (someone) into a relationship by means of a fictional online persona.


Interesting approach; is your system low enough latency to avoid the delayed audio feedback problem?


Modulate | Boston, MA, USA | Machine Learning Engineer, Core Software Engineer | Onsite | Full-time

Modulate is developing real-time voice skins to allow gamers and social app users to digitally customize their voice as they speak! Sound like the character you're playing; your favorite celebrity; or design a whole new voice unique to your online persona. (Try it at https://modulate.ai for yourself!)

We're looking for developers to help us deliver our tech to our pilot customers. Our machine learning engineer will work directly with our CTO to continue researching new techniques to improve the quality and speed of our voice skins, while the core engineer will help drive our integrations with a variety of platforms without sacrificing real-time performance.

We use TensorFlow and Python for prototyping, and C and C++ in production.

Learn more at https://modulate.ai/careers, or email careers@modulate.ai to apply.

