From a technology point of view, this is really cool.
From the view of someone that occasionally watches videos on YouTube, I am trying to figure out a nice way to say... I hate it. Or more specifically, I hate that it generates the voice, and basically enables video content spam.
What we don't need more of is cheap, easy-to-generate videos that are basically spam and/or clickbait, trying to get views. The problem with auto-generated voices in videos like this is as a viewer I can't distinguish between work that someone put deliberate production time into, and something churned out by a content farm. The demo video even tricked me at first: I didn't realize it was a generated voice until a couple of sentences in, at which point I had a visceral negative reaction, the same as when I accidentally click on a content farm-generated video.
It seems a major feature is automatically syncing the narration to the slides. Perhaps a way to enhance this while avoiding spam generation is to use the generated voice only for internal timing, and generate a karaoke-like display for a (human) narrator to read? As a paid service, you could even provide professional voice-overs as an add-on.
> The problem with auto-generated voices in videos like this is as a viewer I can't distinguish between work that someone put deliberate production time into, and something churned out by a content farm.
If machine voiced vs human voiced is the only discernible difference in the end, this seems like a non-argument.
As someone that is building a tool in roughly the same space (machine voiced video generation), I can just say that the use-cases go far beyond "content-farm". It also enables a lot of useful content, e.g. internal training videos, or paired with browser automation, you can have narrated always up-to-date video manuals of your product. In the education space, it enables a more iterative way to produce material where you previously couldn't afford to tweak parts of a video, as you would have to narrate it again.
And I also don't think that it will amplify the existence of such videos significantly. There are YouTube channels that already do just that, and people don't seem to mind. E.g. there is a channel that uploads "car news" content, which basically just has a narration on top of a series of pictures of a car, and the number of views and the rating on those videos are pretty good. In the end it's just a few fact bulletins stretched into an overly long video using the same old worn-out phrases (just like regular "car news"), and I don't see why a human would need to waste their time to voice that.
>> The problem with auto-generated voices in videos like this is as a viewer I can't distinguish between work that someone put deliberate production time into, and something churned out by a content farm.
> If machine voiced vs human voiced is the only discernible difference in the end, this seems like a non-argument.
The problem is getting to the end -- I don't want to spend several minutes trying to decide if it's spam or useful. It's simply easier and safer for me to use "contains auto-generated voice" as a filter to avoid watching garbage. Specifically I'm talking about videos like the ones discussed in this video [1].
Though I'd generally agree that good quality content is good quality content, I personally think there's something lost by using a machine-generated voice. Good human narrators add nuance and emphasis and energy, and it's much more interesting when someone is passionate or excited about the topic they're talking about and you can hear that come through.
Some humans are bad narrators, of course, and the machine-generated voice may not be worse by comparison. The problem is I'd just rather not listen to an emotionless voice -- whether it's machine-generated or human -- read a script, I'd rather just read it myself.
Maybe I'm wrong and the generated voices are much better than I've heard (any examples?) but I think part of the problem remains in that unless I'm forced to watch (eg, internal training) or have a recommendation come from someone I trust, it's still safer to filter out videos with machine-generated voice as "probably spam/garbage".
> it enables a more iterative way to produce material where you previously couldn't afford to tweak parts of a video, as you would have to narrate it again
I think this is a very compelling feature, but as a potential consumer of these videos (either accidentally on youtube or forced via internal training) I wish someone would come up with a way to enable this without having to resort to using the emotionless robot voice.
This again could just be my personal preference: I think emotionless robot voice is pretty much going to always mean somewhere between low- and mediocre-quality video, and I also think a low quality video is significantly worse than just having an easily-updatable HTML/PDF/whatever document with pictures/screenshots/diagrams as appropriate.
> Good human narrators add nuance and emphasis and energy, and it's much more interesting when someone is passionate or excited about the topic they're talking about and you can hear that come through.
And... there are some humans tasked with making videos for others and they're just really bad. Again, internal/training videos, etc, done by people without much passion for, or even knowledge of, the task they're training you on. I prefer machine generated voice in those cases, or perhaps even some sort of subtitling that could be piped to the TTS engine of my choice.
I get your point, and in large parts agree with you, but sometimes it's just nicer to have a video to watch (especially if there is also an important visual component to it) than reading the equivalent script.
> Some humans are bad narrators, of course, and the machine-generated voice may not be worse by comparison. The problem is I'd just rather not listen to an emotionless voice -- whether it's machine-generated or human -- read a script, I'd rather just read it myself.
If it's in the end really just a script that's read off, I'd rather have it auto-generated. You are right, there are some people that are "bad narrators" on a technical level (looking at you, people in the CircleCI YouTube ads, which prompted me to start this project), but even "good narrators", e.g. the guy from the Kurzgesagt videos, oftentimes don't convey any more emotion, and could be replaced with an auto-generated voice.
> Maybe I'm wrong and the generated voices are much better than I've heard (any examples?)
The best ones you can get off the shelf, in my opinion, are the Google WaveNet[0] ones. They are the least "tinny" ones, with good pronunciation. Out of the open source ones, Mozilla TTS[1] has some very good results[2], but like all other open source ones it's very hard to get running, and even then it has a much more limited feature set (languages, pronunciation, etc.). Happy to hear suggestions here!
I think we've already crossed a point where the quality is good enough for a lot of applications (= it doesn't distract from the script through constant wrong pronunciation), and the future for the field is looking pretty good.
>Good human narrators add nuance and emphasis and energy
Imagine you have a product that is used in 3 or 4 different countries and you need to produce regular training videos for customers in those languages.
You seem to be making this "either / or"; it's simply another tool.
The primary issue I have with auto-generated video is its ability to systematically reduce accuracy over time due to being generated from out-of-date information, unmaintained APIs, or simple typos.
Manual editing and manual narration tend to act as a forcing function to review the information and approve its accuracy before publishing.
Auto-generated videos can often be published without a final review or fact check. As we see auto-generated video for things like product demos, and company training, it will open a new problem domain of catching “bugs” in those auto-generated videos.
you can easily replace generated voice with professionally recorded voice later, and have it re-sync everything. generated voice is great for experimenting and iterating.
> The problem with auto-generated voices in videos like this is as a viewer I can't distinguish between work that someone put deliberate production time into, and something churned out by a content farm.
There's a big difference between good content that is automated into a video, and spam. The key use case for this was helping me focus more on the content, rather than on fiddling with synchronisation and resizing assets. I'm not a native English speaker, and although I speak at quite a few conferences per year, listening to my broken English accent (which sounds like a Bond villain) in a video is quite distracting, even for me. Even with my best efforts to record my own voice professionally, generated voice sounds a lot better than what I can do.
We need better reporting and labelling of farm-generated or auto-generated videos - possibly ML models that detect this. However it will cost YouTube revenue because they're profiting off of content farms. I don't see any other way to fix the problem.
Bear in mind that English is everyone's second-favorite language, which means that probably half its speakers don't always feel comfortable recording or public speaking. This helps them over the hump.
Definitely (example: me), but it can also apply to a native speaker of any language. Maybe s/he doesn't have a good voice, has no money to spend on an actor or no time to invest in finding one, or isn't of the gender that best suits the voice for the video. Furthermore, a synthetic voice makes everything faster. There's also no need for a silent recording environment (again, cheaper and faster).
that for the ads of that specific video? Or uploading a few of these would make the whole channel un-monetizable?
I mean, YT now has restrictions on how much engagement you need before you start monetizing. One could use those videos to bump up the numbers, and then monetize their real videos.
I haven't gotten the chance to try it out yet, but an alternative in this space is Komposition, which bills itself as "a video editor built for screencasters". I gather that mostly means that if you take certain liberties when recording your screen and voice (putting pauses in the right places), Komposition will take care of automatically splitting your input media based on when it determines a transition.
Slightly different aim compared to Video Puppet (the source being plain text is not the goal, which means you will likely have to edit and re-record a script multiple times) but still interesting, especially if you'd rather avoid an auto-generated voice.
you can easily replace auto-generated voice with your own, or a professional recording in Video Puppet scripts. Just add (audio: file.wav) to your scene.
This is amazing! I'm going to have a lot of fun with this. I would love to be able to save these as videos but I guess I could just use my mac's screen record functionality
edit: I'm going down a rabbit hole looking through your site. Digging the twisted early internet aesthetic.
This is really cool that you took notice of my projects - thank you for the kindness, solstice. Please - a link to anything you are working on - or where I can find your feed - would be appreciated.
Nothing to share with the wider internet at the moment, I'm afraid. We're just ships passing each other at night in the middle of the ocean. Smooth sailing!
Getting a real kick out of using Video Puppet. The idea of creating a video from assets and a script is not a new one, I first saw it in the context of Real Estate at a Kaltura conference back in 2012:
The existing tools for doing this sort of thing seem to either require quite a bit of programming / video skills e.g. Media Lovin' Toolkit, ffmpeg, sox, jimp, ImageMagick etc or they are templated / opinionated tools like https://www.magisto.com/
What I love about Video Puppet is that it provides a simple and easy to use set of tools and an API that through GitHub actions allows you to put version control and early/often feedback loops at the heart of your projects.
I'm using it to document the development story and back story of an Indie Video Game I'm working on. Previously I was doing it as a Google doc which I was sharing with my collaborators.
With Video Puppet, it requires little more overhead - I was writing this stuff already - but when I see and hear the results played back I can immediately see whether the story makes sense or not. I can see if I am jumping into talking about something I haven't set up properly or if I am trying to say too much.
One thing that would help me is to get feedback on fails in the markdown script quicker, before even pushing to GitHub. For code, including things like Terraform, I'd use a linter, or CircleCI has a validator tool you can run locally.
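A local pre-push check along these lines could be sketched in Python. This is not an official Video Puppet tool, just a rough idea: it assumes assets are referenced with the `![](file)` image syntax and the `(audio: file)` stage direction that appear elsewhere in this thread, and `lint_script` is a made-up helper name.

```python
import re
from pathlib import Path

# Assumed asset reference patterns, taken from examples in this thread:
#   ![](background.jpg)   and   (audio: file.wav)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")
AUDIO_REF = re.compile(r"\(audio:\s*([^)\s]+)\)")


def lint_script(text, base_dir="."):
    """Return (line_number, message) pairs for referenced local files that are missing."""
    problems = []
    base = Path(base_dir)
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern in (IMAGE_REF, AUDIO_REF):
            for name in pattern.findall(line):
                if "://" in name:
                    continue  # remote URLs are not checked in this sketch
                if not (base / name).is_file():
                    problems.append((number, f"missing asset: {name}"))
    return problems
```

Running something like this from a git pre-commit hook would catch missing files before a push, though it knows nothing about the rest of the script syntax.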
The other place I'm going to start using it is for describing defects in a product I am coaching a team on. Previously I would do a screen cap and then upload that to frame.io. Now I can do the screen cap, describe the problem and stick the whole lot into version control with a bunch of github actions to point the team to the resulting video.
I will be following this product closely and actively using it.
I'm building the reverse, video to markdown. Paircast combines screen recording, voice transcriptions, and code changes into a markdown guide.
http://paircast.io
Wow, that is FANTASTIC. I've not tried it yet, but it looks like a very approachable execution of a brilliant idea. I'm a DevRel who's fascinated by DX and I WANT THIS.
It's a shame it doesn't also capture the code's output and, ideally, the state of the interpreter. For example: at 4:45 in the demo video, he tries to run his code and it fails with an error. It's important for both coding tutorials and DX analysis to capture the text of the output/error.
What would be even better would be capturing the error _and_ the detailed stack trace, ideally with the state of each stack frame. My employer produces SDKs for different languages, so it'd be invaluable for debugging.
I can imagine a couple of different ways of doing this which might not be horrifically complicated to add to the Paircast recorder, though I suspect you're already going down this road. If you'd like to chat more, yell!
Just completed full support for scripting videos as Markdown files using Video Puppet. Check out the post for some basic info. For more examples, see https://github.com/videopuppet/examples
sure. the video conversion is running on AWS Fargate, with bits and pieces running on AWS Lambda. The speech synthesis is either Amazon Polly (neural voices) or Google Cloud Text to Speech (Wavenet).
Under the hood, the conversion system is using Chrome headless to generate slides, render markdown and provide syntax highlighting. Most of the video and audio processing is with FFmpeg and SoX.
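For readers curious what an FFmpeg step for a single scene might look like, here is a rough sketch, not the actual Video Puppet pipeline. It assumes a slide has already been rendered to a PNG and the narration synthesized to a WAV; the file names and the `scene_command` helper are made up for illustration.

```python
def scene_command(slide_png, narration_wav, output_mp4):
    """Build an ffmpeg command that shows a still slide for the length of the narration."""
    return [
        "ffmpeg", "-y",                  # overwrite the output file if it exists
        "-loop", "1", "-i", slide_png,   # repeat the still image as video frames
        "-i", narration_wav,             # narration audio track
        "-c:v", "libx264", "-tune", "stillimage",
        "-pix_fmt", "yuv420p",           # widely compatible pixel format
        "-c:a", "aac",
        "-shortest",                     # stop when the audio ends
        output_mp4,
    ]
```

Passing the returned list to `subprocess.run(..., check=True)` would run it, provided ffmpeg is installed and both input files exist.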
I love this. I've been messing around with Premiere Pro and Audacity for the past couple of days trying to get more into making video. Video Puppet looks way easier to debug and collaborate on, since scrolling back and forth in your video looking for stuff gets very tedious very quickly.
Is there any way I can add my own voice and then still write the words that I want my voice to say?
VideoPuppet is excellent. I am using it to create videos for the Five Minutes Serverless Youtube channel, and so far, results are outstanding. I can create a video from the markdown file really fast.
It is an application/shared library for Linux, released as free software. It has a GUI program for live narration and one, "Vox", for creating video from PDF or still images using speech synthesis (Festival).
The Kinetophone shared library could be used as a plug-in for presentation software. Kinetophone's file format is XML. I haven't updated it for years, and it does require occasional patches to support the latest FFmpeg. It was originally a commercial application for OS X called Ishmael, back in about '07, which I ported to Linux after my company went out of business.
I think this would be great for all the professors/teachers who suddenly have to teach courses online. If the lecture can be made beforehand, then the teacher can just focus on addressing questions or problems on Zoom/Skype (or whatever platform is used for teaching online).
I'm trying to imagine all the useful things you could do with code generated videos.
I'm imagining a daily routine of airplaying the video to your TV with an annotated dashboard of quantified self metrics, weather forecast, plotted local Covid-19 cases, health advisories, etc.
I don't know about that. If I'm making breakfast and can see a screen, both narration and visuals are useful. I think that's why Google and Amazon came out with device assistants that have screens
If you are making breakfast and have a system that rotates dashboards (such as Grafana for example), that is exactly the same as having a video from the viewing angle, but otherwise better, since you can still come from the room and do some interactive stuff not possible if it were only on video.
So, video is only limiting you, nothing else.
I can imagine this being used like PDF in specific contexts: if you need a 100% guarantee that local devices/viewers/etc. won't change any output detail.
I am a tech writer and I write tasks and procedures using DITA-XML. I was thinking about transforming my .dita files to .mlt to use in Shotcut/melt, but I think I'm going to use this instead.
Video Puppet can also process YAML and JSON files, so if you are running an automated conversion from XML, it might be easier to output JSON instead of Markdown; in any case, should you need help, drop me an email at gojko@neuri.co.uk
Yes yes yes! I was literally thinking of implementing this myself, but didn't have time. It's a shame it doesn't appear to be open source though - I might still end up creating one.
I could see this being really useful for creating product onboarding video tutorials - wondering if there's an ability to preview and edit/adjust before exporting the final video?
Building a full video is fairly quick compared to traditional editing tools, so I haven't built any faster preview yet. I usually just build the whole thing and look at it, then tweak the script and build it again.
You can easily upload just the script file into an existing project and re-build the video as many times as you like, then download the version you are happy with at the end.
I'm planning to build a visual interface that would allow users to preview individual scenes. Meanwhile, if the main video build is too long because you have lots of scenes and want to check out how one would look with different parameters, just create a different script file with that scene only and build it. that's the beauty of text file editing, you can easily copy and paste and experiment.
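The "script file with that scene only" trick is easy to automate. A minimal sketch, assuming scenes are separated by a line containing only `---` (as in the snippet further down this thread); `split_scenes` and `single_scene_script` are hypothetical helpers, and any script-wide header a real script might carry is not handled here.

```python
def split_scenes(script_text):
    """Split a script into scenes, assuming a line containing only '---' separates them."""
    scenes, current = [], []
    for line in script_text.splitlines():
        if line.strip() == "---":
            if any(part.strip() for part in current):
                scenes.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        scenes.append("\n".join(current).strip())
    return scenes


def single_scene_script(script_text, index):
    """Return a new script containing only the scene at the given zero-based index."""
    return split_scenes(script_text)[index] + "\n"
```

Note that a `---` line inside a fenced code block would also be treated as a separator, so this is only good enough for quick experiments.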
This looks cool. Something I have looked for on and off without great success would be a fully scriptable NLE.
Something like this that would support simple fades, transitions, and maybe animation. The kind of stuff you can do fairly easily in a video editor, but with lots of fiddly clicks and zooming in and out of timelines.
I'd like to have a script that lets me specify when different source media start, when to apply effects, etc. All written as a basic text file.
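To make the idea concrete, here is one way such a text file could be read, purely as an illustration. The line format (`<start seconds> <action> <argument>`), the action names and the `Event` / `parse_timeline` names are all invented for this sketch and do not correspond to any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Event:
    start: float   # seconds from the beginning of the timeline
    action: str    # e.g. "clip" or "fade" (hypothetical action names)
    argument: str  # file name or effect parameter


def parse_timeline(text):
    """Parse lines of the form '<start> <action> <argument>' into events sorted by start time."""
    events = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow comments and blank lines
        if not line:
            continue
        start, action, argument = line.split(maxsplit=2)
        events.append(Event(float(start), action, argument))
    return sorted(events, key=lambda event: event.start)
```

A file containing `0 clip intro.mp4`, `4.5 fade in` and `5.0 clip demo.mp4` on separate lines would parse into three events in time order, which a renderer could then walk through.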
Wow, I may be biased because this fills a particular niche usecase for me, but this is truly incredible.
I can't stand hearing the sound of my own voice, but do a lot of tutorial content production in Markdown for guides for learning material.
This would allow me to re-use all of the existing material I have, which already includes detailed step-by-step screenshots and text instructions, to make voice-over videos with slides and publish to Youtube. Amazing!
I'm not sure I see why you would want to base this on Markdown. Markdown is designed for a very specific niche, and this falls far outside that niche.
It seems it would make a lot more sense to just design the language from scratch, rather than try to bend Markdown to do something it was not at all meant for.
For instance, why would you WANT to have an example like this:
I totally disagree, and I have the exact opposite reaction. Markdown is something tons of people know already. I literally just glanced over the article and felt I could generate a "narrated PowerPoint", which seems like the main purpose of this, extremely quickly. Why would I want to learn a completely new language because there are some trivially minor syntax oddities with using Markdown?
btw, for narrated powerpoint, you can actually use Video Puppet directly with Powerpoint files - just put narration into speaker notes. Here's more info on that: https://videopuppet.com/docs/powerpoint/
Because you already have to learn a bunch of new stuff, since Markdown does not support this use case.
You could easily borrow some common things from Markdown to make things easier, but this seems to try to force following the Markdown syntax as much as possible, even when that syntax makes no sense in context.
It is much better to invent new things for the cases that are completely new, than try to force a square peg into a round hole.
you can use JSON or YAML if you like more structure. Markdown has good editor support, so using it as source for videos means video sources render nicely in GitHub, for example.
Also, if you don't like ![](), you can just use stage directions with brackets. The equivalent script will be:
Why have two ways to do the same thing, where one is awkward and the other is not? Just commit to doing things the less awkward way, and throw out the idea that you need to be backwards compatible with something designed for a completely different purpose.
Readable but has some intelligence and decoration that is not distracting. See it enough and the syntax will become invisible over time like using punctuation.
> This application is currently in beta version. While in beta, the application is free, and allows anyone to upload assets up to 25 MB. We will announce commercial pricing later, when the full version becomes ready. For now, experiment as much as you like!
I don't know if technically possible, but live preview would be really cool. Maybe javascript rendering to canvas without the roundtrip to the server for encoding to mp4.
Easier bulk upload / upload with curl / python requests is needed IMO.
thanks. that's one of my use cases as well, so as an extra tip, you can generate code snippets over video or images just by adding a fenced code block. eg
---
![](background.jpg)
```js
// anything here will be rendered as a slide on top of the background, with JavaScript highlighting
```
Another way to make it seem more real would be to render a virtual webcam overlay of a talking head, using UE4 or something. Maybe with an office-style background.
Nice and simple but there are videos that are easily made with hand-writing effect using https://www.videoscribe.co/.
For example, https://www.youtube.com/watch?v=MiybniIIvx0 is entirely made in software. These are conceptually similar, but obviously yours is more text-oriented and minimalistic.