Burnt-in subtitle extractor
This project provides a set of basic extraction tools for burnt-in subtitles, i.e. subtitles that are part of the picture itself.
Tools
The burnt-in subtitle extractor consists of four different tools:
- locsubtitles: locates subtitles within the picture, creates mask files, and writes their names to an initial textual subtitle file along with timing information
- getsubtitles: applies those masks to the picture and writes the averaged-out graphical subtitles to a new set of image files
- remsubtitles: deletes the masked-out areas from the input frames and blends in the surrounding colors
- ocrsubtitles: runs the extracted graphical subtitles through an external OCR and outputs the final textual subtitle file
Examples
The following examples rely in part on default parameters.
You will typically have to provide parameters tuned to your input file.
Run e.g. getsubtitles --help for details.
> ffmpeg -i test.flv -f yuv4mpegpipe - | ./locsubtitles > test.sub
> ffmpeg -i test.flv -f yuv4mpegpipe - | ./remsubtitles -s test.sub | ffplay -
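In both commands the tools read a Y4M (YUV4MPEG2) stream from standard input. As a rough illustration of how timing information can be derived from such a stream, the following Python sketch parses the stream header and converts a frame index to seconds. It is not the project's code: the names `parse_y4m_header` and `frame_time` are hypothetical, and only the width, height, and frame rate fields are read.

```python
from fractions import Fraction


def parse_y4m_header(line: bytes) -> dict:
    """Parse the first line of a YUV4MPEG2 stream, e.g.
    b"YUV4MPEG2 W720 H576 F25:1 Ip A1:1 C420\n".
    Only W (width), H (height) and F (frame rate) are extracted."""
    fields = line.decode("ascii").split()
    if not fields or fields[0] != "YUV4MPEG2":
        raise ValueError("not a YUV4MPEG2 stream header")
    info = {}
    for field in fields[1:]:
        tag, value = field[0], field[1:]
        if tag == "W":
            info["width"] = int(value)
        elif tag == "H":
            info["height"] = int(value)
        elif tag == "F":
            # The frame rate is written as a ratio, numerator:denominator.
            numerator, denominator = value.split(":")
            info["fps"] = Fraction(int(numerator), int(denominator))
    return info


def frame_time(index: int, fps: Fraction) -> Fraction:
    """Time in seconds at which frame number `index` (0-based) starts."""
    return Fraction(index) / fps
```

With a header reporting 25 frames per second, `frame_time(50, info["fps"])` evaluates to 2 seconds, which is the kind of value a start or end time in the subtitle file could be based on.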
How it works
This program relies on several patterns in how subtitles are typically added to videos:
- Subtitles typically have borders and a defined text and border color
- Subtitles are usually located in the same part of the picture and do not move
- Analysis can be limited to a specified region
- This program further relies on the presence of a frame without subtitles between frames with different subtitles
The algorithm then does the following:
- For every frame
  - Create a text mask that includes all pixels that have the text color
  - Expand that text mask in all four directions by the stroke width of the text border
  - Create a border mask that includes all pixels that have the border color
  - Expand that border mask in all four directions by the stroke width of the text
  - Calculate the intersection of both masks
  - If the number of pixels in the resulting mask is above a specified threshold (a subtitle has been found)
    - If the previous frame did not have subtitles: mark the start position
    - Otherwise: set the subtitle mask to the intersection of both frames' subtitle masks
  - Otherwise (no subtitle)
    - If the previous frame had subtitles
      - Write the previous (i.e. the non-empty) subtitle mask to an image file
      - Write start and end time information and the name of the mask file to the subtitle file
- When the end of the video has been reached
  - If the previous frame had subtitles: output any pending information
  - Exit
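The steps above can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions, not the project's C++ implementation: masks are represented as sets of (x, y) pixel coordinates, colors are matched exactly rather than with a tolerance, the expansion is cross-shaped, and the names `dilate`, `subtitle_mask`, and `locate` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Pixel = Tuple[int, int]  # (x, y)
Mask = Set[Pixel]


def dilate(mask: Mask, radius: int) -> Mask:
    """Expand a mask by `radius` pixels to the left, right, top and bottom."""
    grown = set(mask)
    for x, y in mask:
        for d in range(1, radius + 1):
            grown.update({(x - d, y), (x + d, y), (x, y - d), (x, y + d)})
    return grown


def subtitle_mask(frame: Dict[Pixel, Hashable], text_color: Hashable,
                  border_color: Hashable, text_width: int,
                  border_width: int) -> Mask:
    """Intersect the expanded text mask with the expanded border mask.
    `frame` maps pixel coordinates to colors (assumption of this sketch)."""
    text = {p for p, color in frame.items() if color == text_color}
    border = {p for p, color in frame.items() if color == border_color}
    # Text grows by the border width, border grows by the text width.
    both = dilate(text, border_width) & dilate(border, text_width)
    return both & set(frame)  # drop coordinates outside the picture


def locate(masks: Iterable[Mask], threshold: int) -> List[Tuple[int, int, Mask]]:
    """Return (start_frame, end_frame, mask) for every run of subtitle frames.
    `end_frame` is the index of the first frame without the subtitle."""
    found = []
    start = None          # start frame of the subtitle currently shown
    current: Mask = set()
    index = -1
    for index, mask in enumerate(masks):
        if len(mask) > threshold:          # a subtitle has been found
            if start is None:
                start, current = index, set(mask)
            else:
                current &= mask            # keep pixels common to all frames
        elif start is not None:            # subtitle just disappeared
            found.append((start, index, current))
            start = None
    if start is not None:                  # end of video: flush pending entry
        found.append((start, index + 1, current))
    return found
```

Intersecting the masks of consecutive frames keeps only the pixels that stay put while the subtitle is shown, which is why the approach depends on a subtitle-free frame separating two different subtitles.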
The OCR component uses the Tesseract engine, which has to be installed on the system. The target language is currently hard-coded to Dutch. This can be changed in the wrapper script.
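For orientation, this is roughly what a call to the Tesseract command-line tool with Dutch as the target language looks like, written as a Python sketch. The project's wrapper is a Bash script and may invoke Tesseract differently; the function names and the example file name below are hypothetical. `nld` is Tesseract's language code for Dutch, and `stdout` as the output base makes Tesseract print the recognized text instead of writing a file.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, language: str = "nld") -> List[str]:
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, "stdout", "-l", language]


def ocr_image(image_path: str, language: str = "nld") -> str:
    """Run Tesseract on one subtitle image and return the recognized text.
    Requires the tesseract binary and the language data to be installed."""
    result = subprocess.run(
        tesseract_command(image_path, language),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Switching the target language then amounts to passing a different code, e.g. `ocr_image("subtitle.png", language="eng")` for English.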
Prerequisites
The program has only been tested with the following setup:
- GNU/Linux operating system
- GNU C++ compiler + Boost libraries
- Bash shell + Tesseract OCR
- Y4M video streams produced by ffmpeg
Other setups might work.