Could isolate the text by compositing neighboring frames and keeping the pixels that don't differ. The background is moving, the text is not.
This would also let you extract the subs without having to OCR them into characters. You could just erase all static artifacts (including subs, but also things like watermarks).
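Something like this would get you started (a toy numpy sketch; the frames, the 8-level threshold, and treating "barely changed" as "static" are all assumptions):

```python
import numpy as np

def static_mask(f1, f2, thresh=8):
    """Pixels that barely change between neighboring frames are
    likely static overlays (subtitles, watermarks, logos)."""
    diff = np.abs(f1.astype(np.int16) - f2.astype(np.int16))
    return diff < thresh

# Toy frames: random "moving" background, one fixed bright "text" pixel.
rng = np.random.default_rng(0)
f1 = rng.integers(0, 200, (4, 4)).astype(np.uint8)
f2 = rng.integers(0, 200, (4, 4)).astype(np.uint8)
f1[0, 0] = f2[0, 0] = 255  # the "subtitle" pixel stays put
mask = static_mask(f1, f2)
```

In practice you'd composite the mask over many frames, not just two, since the background can coincidentally match between a single pair.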
An approach I really want to try is taking a stream of the video without subs (can easily be found online) and subtracting the two. You'd have to deal with differences in resolution and compression between the two, and also handle cases where the background is either white or black, but in theory it should work very well. I haven't had time to dig into this.
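Once the two streams are scaled and aligned, the core of that subtraction idea is tiny (a sketch; the 40-level threshold is made up, and real footage would need per-frame alignment and compression-noise handling first):

```python
import numpy as np

def sub_mask(subbed, clean, thresh=40):
    """Pixels that differ strongly between the subbed and clean
    streams are (mostly) the burned-in subtitles. Assumes the two
    frames are already scaled/aligned to the same resolution."""
    diff = np.abs(subbed.astype(np.int16) - clean.astype(np.int16))
    return diff > thresh

# Toy frames: identical except one white "subtitle" pixel.
clean = np.full((4, 4), 120, dtype=np.uint8)
subbed = clean.copy()
subbed[3, 1] = 255
mask = sub_mask(subbed, clean)
```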
In order to have access to vocabulary words. From the article:
> I wanted to get a transcript of the episode’s dialog so I could study the unfamiliar vocabulary. Unfortunately, the video files I have only have hard subtitles
One possible extension of this is to erase the original subtitles using something like gimp-resynthesizer's Heal Selection, and then replace them with translated ones, all automatically. (I've redrawn the video frame from the post[1] so you can see what I mean.)
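A very crude stand-in for that erase step, just to show the shape of it (this naive row-copy is my own placeholder, not what Heal Selection does; a real inpainter blends surrounding texture instead):

```python
import numpy as np

def fill_band(img, top, bottom):
    """Crudely erase a horizontal subtitle band by repeating the
    row just above it. A real inpainter (gimp-resynthesizer's
    Heal Selection, OpenCV's inpaint) does far better."""
    out = img.copy()
    out[top:bottom] = img[top - 1]
    return out

frame = np.arange(100, dtype=np.uint8).reshape(10, 10)
wiped = fill_band(frame, 7, 10)  # wipe the bottom three rows
```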
Using a 2D image resynthesis algorithm is sub-optimal for video. There is basically no chance of it picking the same result every single frame; you will see a flickering shadow of weirdness where the old characters were.
You have lots of information about what image should be there: motion vectors signal when an object or background has moved into (or out of) a blocked area. Or maybe the motion vectors indicate the object/background is stationary, and you can use information from before/after the subtitles to fill it in.
I'm not aware of such a 3D video resynthesis algorithm existing, but it should be possible.
Lately I am using the Copyfish Chrome extension for help with Chinese subtitles/images. The very nice thing about it is that it plays nice with the Zhongwen dictionary, which is another essential Chrome extension for Chinese learners.
Can anyone who learned Chinese comment on the timeframe it takes to even get a basic understanding of what the subtitles say?
I used to learn English this way (watching US TV shows in English with English subs and very limited vocabulary). Eventually disabled the subs. Now watching everything in English. Love it.
edit: any tips on getting started with Chinese very welcome as well (apart from the standard stuff I find myself through Google or language courses)
After around 3000 hours of mostly self study I can understand 95+% of subtitles for TV shows and for a movie can follow enough dialog just through subtitles to understand what's going on at full speed. Study success scales pretty linearly with time spent as far as I can tell, assuming you have a fairly sensible study method. For 1000 hours study I'd imagine you'd get the basics of a lot of subtitles but probably not quickly enough to follow along at full speed.
Teachers can be helpful to point out some mistakes you'll miss through self study but unless you have a fair amount of cash to throw at the problem self study will likely be a better approach. For full fluency you probably want to target around 20k vocab, so it's a bit of a numbers game in terms of finding a quick way to improve your vocabulary. I use Skritter but I guess any SRS software should help a lot here.
I'm not entirely convinced that immersion is necessary; I learned plenty of Chinese without it. It's probably good for day-to-day vocab and motivation, but unless your level is already good you will still struggle to get into conversations where you can use it, depending on how outgoing you are.
To be honest, learning Chinese has been a long and somewhat painful process. I'm about 6.5 years in at this point (2.5 years of school + 4 years of self study) and am starting to be able to pick up a newspaper or watch a show and get the gist of what it's about. I don't have a natural knack for languages and I was never devoting 100% of my time to it (I did software in school and now for work), so your experience may differ. Also I never had a chance for immersion—people I know that have spent 6 months in Taiwan or China are often above my level even though they've been studying for less total time, so immersion is really helpful.
I've found that SRS apps like Anki or Memrise and podcasts like Popup Chinese or ChinesePod have been really helpful. I'm also developing an approach where I extract vocabulary words from videos and newspaper articles and pre-study them before watching/reading—I'm hoping to blog about this at some point.
The author Khatzumoto taught himself Japanese in 18 months "by having fun", and then got a job in Japan (at Sony iirc). He then went on to use the same method to learn Chinese.
If you can get through the website's navigation, and scroll through the occasional incoherent rambling, there is solid gold advice throughout :) Good luck!
ps. While the site focuses on Japanese, the advice translates to learning any language. Plus there is advice and resources specifically for Chinese :)
>it takes to even get a basic understanding of what the subtitles say?
The main challenge with Chinese is that, unlike Western languages, you cannot guess the characters you do not know. You either know a character or you don't; only the context might offer some clues.
Getting started with Chinese: it depends on your talent. For me (untalented), taking a "real" class was essential for a good start. Self-learning doesn't work for me for languages.
> you can not guess the characters you do not know
That's not completely true.
Chinese characters are composed of parts, some of which are referred to as radicals. If you know a part/block, you can sometimes guess its meaning or sound. Also, new Chinese words are made of existing characters, so you can often guess multi-character words by knowing the individual characters. For example, the word for computer is "electric brain". You might be able to guess that if you knew that 電 means electric and 腦 means brain.
Yeah, I think this is one of those things where if you get yourself into an "Asian languages are hard" mindset, you unnecessarily complicate things. I don't speak Chinese, but I speak Japanese, which I learned primarily from free reading (mostly manga, which is great because it is almost completely conversational Japanese).
While it is true that you can't guess the meaning of characters you don't know without any context, you definitely can learn the meaning of characters you don't know with context. In the same way, while I might understand the Greek and Latin roots of words in English, if I see a word by itself without any context, I'm not going to be able to understand it. With context, I can usually puzzle it out.
However, it is a huge mistake to ignore the benefits of learning Chinese characters. It is the reverse that is most beneficial. In Chinese the 3600 most common characters cover some insane percentage of the common words in use (only 2200 for Japanese!). These characters form a powerful mnemonic for learning vocabulary. I can easily learn to read the characters for a word and the word faster than I can learn the word by itself. As you learn more characters, it only accelerates your progress.
2-3 thousand characters seems like a daunting task, but as I always point out in these threads, adult level proficiency in a language requires somewhere around 15,000 word families (1 family = a word, all its inflections/conjugations/etc, and all related compound words). People telling you that you can be "fluent" with 2000 words are selling you something (2000 word families is the level of a toddler -- and then people are disappointed that they can only speak like a toddler...).
The absolute best thing you can do for learning a language that has Chinese characters in its writing is to learn those characters. I recommend learning them at the same time as the vocabulary. And to get back to the original comment: even now if I hear an unfamiliar word, I will usually trace what I think the characters are on my hand. Often people will correct me and once I get the correct characters the meaning is almost always obvious.
Haha, well, I feel that one's easy too if you know that sometimes characters are used as phonetics for words in other languages. Just like English! Although you might not expect that at first from a non-phonetically written language.
Every language has exceptions to its "rules". In fact, there are few hard and fast rules for any language, since every language is a mix of others.
But, once you get comfortable with a language, you can appreciate the differences and make interesting guesses about the exceptions' origins. It can be fun.
I like http://www.hanzicraft.com/ because it breaks down the characters into parts you can click on to get their definitions or origins. Hacking chinese also has a cool resources section here [1], categorized, so you can browse through dictionaries, listening tools, practice tests, browser add-ons, etc. etc.
Hanzicraft looks cool, though I feel like I should learn to speak Mandarin a lot better before worrying about reading it. In fact, I'd be fairly happy to be a fluent speaker and illiterate.
Okay, people replying to you so far seem to understand what's going on.
For those that don't get it: Why are these characters used in your particular case?
I only know about Japanese but in the past characters could be chosen based on their pronunciation and not just their meaning e.g. 仏蘭西 (Buddha, Orchid, West, fu ran su, France). This is called ateji:
The ateji for Malaysia is 馬来, Horse, come, ma rai.
Nowadays loanwords are usually written phonetically in katakana instead of ateji: フランス and マレーシア. The old way is still used for abbreviations; 仏 is the equivalent of writing "Fr" in English.
An interesting case is America (亜米利加), where the abbreviation is 米 (pronounced "bei"), not 亜 (pronounced "a"), because A is for Asia! (亜細亜).
As any one reading HN should know, naming things is hard...
The first character of 馬來 is "horse" and the second is "come/arrive". So I was sitting there puzzling it out until I realised that if I pronounced it in Mandarin, it sounded a lot like "Malay" - which is in fact what it means.
Or you could just memorize diànnǎo as "computer", because that is easier.
Trying to connect "electric brain" to computer is similar to trying to connect "breakfast" to the morning meal. It's interesting to know the compound words, but breakfast is always the meal in the morning and not always the meal after a fast.
I use an app called Memrise and a website called ChinesePod. There's a really good dictionary app called Pleco.
Going beyond the standard stuff I'd recommend a professional teacher since you really need to learn Chinese from multiple angles: learning to recognize and write the characters, reading pinyin, understanding spoken Mandarin and being able to speak.
Thanks for this! I'm looking forward to part 2 also.
A bit off-topic, but does anyone have Chinese TV show suggestions? I watched a few episodes of this show (他来了) already, but I didn't like it very much.
I was actually thinking of doing something like this using Amazon Mechanical Turk, maybe not to subtitle the entire show but just to get a much bigger test set than I have the patience to label myself. I'll check them out, thanks!!
No worries. Prices I've seen so far range from RMB 10 to RMB 100 per episode - probably depending on how popular the show is and whether they already have a transcript available because someone else wanted it or whether they'll need to transcribe it for you.
16 year Chinese learner here (not that it's relevant). I would try to hack a solution via the following approach.
1. Determine the area (if any) near the bottom with black-or-whiteness zones of a constant height (these are likely to be subtitles) by randomly selecting 10 frames from the middle of the movie. Extra points if you have it detect the sub color.
2. For each frame with unique subs, isolate the zones vertically (handles multi-line subs).
3. Determine black-or-whiteness of each vertical column in the text area. Moving inward from the left or right edge, crop everything until the black-or-whiteness within the constant height drops below a certain threshold. In the example shown, this would deal with ′…′二′′′'′ and would look like 0,0,0.01,0,0,0.01,3
4. Crop viciously within the assumed vertical height. This should remove issues like 逯 which should be 这.
5. There is probably a clustering-based approach you can use to remove background noise, either spatially or temporally, eg. temporally via imagemagick[0]: compare frame1.jpg frame2.jpg -compose src -fuzz 10% -highlight-color white -lowlight-color black output.jpg ... alternatively, if there is a surrounding color such as in the example, you could remove any pixel-groups that don't have it.
6. In terms of detecting frames in which you have a new set of subs, just compare the last black-white-extraction of the central maybe-has-subs area (ie. most commonly used portion thereof) with a delta of the last one, remembering no subs is also an option. In many cases this may align with keyframes.
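Step 3 above might look something like this (a numpy sketch; the near-white cutoff of 200 and the 0.5 column threshold are guesses you'd have to tune per show):

```python
import numpy as np

def crop_sub_columns(band, thresh=0.5):
    """Given the grayscale subtitle band, return the column range
    that plausibly holds text, cropping inward past columns whose
    share of near-white pixels falls below the threshold."""
    whiteness = (band > 200).mean(axis=0)  # per-column score
    cols = np.where(whiteness >= thresh)[0]
    if cols.size == 0:
        return None  # no subs in this frame
    return int(cols[0]), int(cols[-1]) + 1

# Toy band: white "text" only in columns 3-6.
band = np.zeros((8, 10), dtype=np.uint8)
band[:, 3:7] = 255
span = crop_sub_columns(band)
```

For white-on-dark subs with an outline, you'd probably want a smarter score than raw whiteness, but the inward-cropping logic stays the same.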
If you like to solve image processing problems we're hiring - http://8-food.com/ - email in profile.
Actually very useful even for other things, thanks for sharing! For example, ripping DVD subtitles to SRT, or (I'm using my imagination) maybe in the future removing hard-coded subtitles with content-aware fill and replacing them with filler space?
That should actually be possible with today's technology. Take an image and draw subtitles on it: the subtitled image is the training input to the NN, while the original image is the training output. Even better, use a video stream directly... Not easy, but not impossible either.
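Generating those training pairs is the easy part, e.g. with Pillow (a sketch; the text position and default bitmap font are placeholders, and a real pipeline would randomize font, size, position and outline to match real subs):

```python
from PIL import Image, ImageDraw

def make_pair(clean, text):
    """Return (subtitled, clean): a denoising-style training pair.
    The network learns to map the first image back to the second."""
    subbed = clean.copy()
    draw = ImageDraw.Draw(subbed)
    w, h = subbed.size
    # Pillow's default bitmap font, drawn near the bottom of the frame.
    draw.text((w // 4, int(h * 0.8)), text, fill="white")
    return subbed, clean

frame = Image.new("RGB", (320, 180), (30, 60, 90))
subbed, clean = make_pair(frame, "hello")
```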
DVD subtitles are already a separate layer from the movie stream, but that layer is a bitmap. Because it's a separate layer, OCR-ing it should be easy.
And if you ask why it's a bitmap: bitmaps support more than just plain text, color and typefaces to name two things. Imagine if DVD players had to implement text decoding ("Is this subtitle stream in UTF-8 or maybe some Cyrillic code page?") and rendering (color, placement, font files, etc...)
I totally thought the state-of-the-art for OCR would be ConvNets, but apparently it isn't? Or are there just not any easily available/usable libraries that do OCR with ConvNets? Or is the benefit of ConvNets marginal enough to not be useful?
I've always thought a great feature for shows/movies with subtitles would be the ability to display multiple at the same time.
It would really be useful for learning assuming the subtitles are translated well. Right now I've been able to do something with my own content by merging two subtitle files.
Maybe you could also add a transliteration of the characters if it's a language like Chinese.
I'm also learning Mandarin and was wondering if this was possible (for a different show) just the other week! Thanks for the article, will be looking forward to Part 2 and 3. Also, is there an easy way to extract all the frames with unique subtitles?
Once you have the text corresponding to each frame, you can de-dupe it with its neighbors based on Levenshtein distance (can't use exact-match because of recognition errors). I found that for this show subtitles generally hang on-screen for 1-3 seconds, so you wouldn't have to do many comparisons.
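In pure Python the de-dupe is short (the max_dist of 2 edits is an assumption; tune it to your OCR's error rate):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def dedupe(ocr_lines, max_dist=2):
    """Collapse consecutive OCR results within max_dist edits of
    each other: the same subtitle read off several frames."""
    out = []
    for line in ocr_lines:
        if not out or levenshtein(out[-1], line) > max_dist:
            out.append(line)
    return out

unique = dedupe(["你好吗", "你好嗎", "再见", "再见"])
```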
I already made a feature that lists all the unique words in a movie, sorts them by frequency, and makes a study sheet. I also made a bash script generator that uses ffmpeg to cut the movie at the subtitle times.
All I need to do now is recombine the subtitles based on the words, to make videos with lots of example sentences.
It's much easier to study with a real English translation though, instead of a literal word-for-word transcription. If you could help me get more input data (names of movies or songs, srt files), that would be wonderful!
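The frequency-sheet and ffmpeg-cut pieces described above can be sketched like this (a minimal version: real SRT files have more edge cases, and naive whitespace splitting won't segment Chinese — you'd want something like jieba for that):

```python
import re
from collections import Counter

# Matches one SRT cue: start --> end, then the text until a blank line.
SRT_CUE = re.compile(
    r"(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)\n(.+?)(?:\n\n|\Z)",
    re.S)

def frequency_sheet(srt_text):
    """Token frequency across every subtitle line, most common first."""
    counts = Counter()
    for _, _, text in SRT_CUE.findall(srt_text):
        counts.update(text.split())
    return counts.most_common()

def ffmpeg_cut(infile, start, end, outfile):
    """One cut command per subtitle; ffmpeg takes HH:MM:SS.mmm stamps."""
    s, e = start.replace(",", "."), end.replace(",", ".")
    return f"ffmpeg -ss {s} -to {e} -i {infile} -c copy {outfile}"

srt = ("1\n00:00:01,000 --> 00:00:03,000\nhello world\n\n"
       "2\n00:00:04,000 --> 00:00:06,000\nhello again\n")
```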
It'd be very cool to use this to take a video with hard subs, inpaint the hard subs (even naïvely, or maybe with a motion compensator), and replace them with SRT/ASS subs.