DaDaDa 2016 - Finding Your Next Series to Watch Through Data Analysis
Video content recommendation using Yahoo's Open NSFW deep learning model for frame analysis, combined with subtitle text processing and audio analysis. Elasticsearch's More Like This feature generates the recommendations: given Dexter's pilot, it suggests Hannibal. Multi-modal analysis beats single-source.
Abstract
This presentation details the development of a content-based recommendation system for video content, building upon earlier text-based subtitle analysis by incorporating video and audio processing. Moving beyond basic text features (word counts, word clouds, SMOG grade complexity metrics), the speaker implements Yahoo's Open NSFW model—a pre-trained deep learning classifier using the Caffe framework—to analyze video content frame-by-frame at one-second intervals. Demonstrations on pilots from Desperate Housewives, Dexter, Game of Thrones, and Californication validate the model's accuracy in detecting inappropriate content across episodes, with visualizations showing temporal NSFW score distributions and sample frames confirming correct identification even in ambiguous situations. The analysis extends beyond content detection to encompass extractable video features including colors, brightness, length, scene rhythm, and overall structure, which combine with subtitle analysis for comprehensive series profiling. Inspired by IBM Watson's automated trailer generation, the presentation explores practical video summarization approaches accessible without supercomputer resources: text summarization algorithms applied to subtitles, scene detection in video, and audio level analysis to identify significant events (explosions, crowd reactions). A demonstration condenses a 12-minute short film using this combined methodology, proving feasibility despite rough results. The recommendation engine implementation leverages Elasticsearch's More Like This (MLT) feature as an out-of-the-box solution, functioning as a personal search engine similar to Solr. The process involves creating document models containing all calculated features, indexing documents into Elasticsearch, and querying with MLT on specified similarity fields. Testing with Dexter's pilot episode returns three recommendations including Hannibal, validated by audience confirmation as a quality match. The presentation emphasizes practical lessons: testing machine learning methods requires minimal theoretical understanding initially; creative data usage matters more than analyzing massive log files; mixing different data types (text, video, audio) reveals insights invisible to single-source analysis; and surprisingly effective out-of-the-box solutions like Elasticsearch merit adoption in production environments. The work represents a proof-of-concept demonstrating methodology potential rather than production-ready implementation, with part three promising deeper analysis as complexity increases.
About the speaker
Karim Jedda is a data scientist at ProSieben, a major German media company, working on content analysis and recommendation systems for video entertainment. Their work focuses on practical applications of machine learning and deep learning to understand and categorize television series and movies through multi-modal analysis combining text (subtitles), video (frame analysis), and audio processing. They advocate accessible approaches to machine learning that prioritize creative problem-solving and out-of-the-box solutions over purely theoretical understanding, and have implemented systems that were adopted for production use at ProSieben. They are also actively involved in recruiting for similar data science positions at the company.
Transcript summary
Recap of Part One: Text-Based Analysis
The initial approach involved analyzing subtitles from series and movies. Key features extracted included word counts relative to episode length, word clouds to identify prevalent topics, and the SMOG grade metric. The SMOG grade measures text complexity by indicating the age level required to understand the vocabulary used in the subtitles. This metric was calculated using an existing algorithm available through a Python library.
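For illustration, a SMOG grade can be computed with the textstat Python package; the talk does not name the specific library used, so this choice, along with the file name, is an assumption:

```python
# Minimal sketch: SMOG grade of a subtitle file using textstat.
# The .srt path is a placeholder; in practice the timestamps and cue
# numbers would be stripped out before scoring.
import textstat

with open("dexter_s01e01.srt", encoding="utf-8") as f:
    subtitle_text = f.read()

# SMOG estimates the years of education needed to understand the text.
print(f"SMOG grade: {textstat.smog_index(subtitle_text):.1f}")
```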
The Need for Video and Audio Analysis
While text analysis provided valuable insights, the speaker recognized that video and audio contain the most important information since viewers watch and hear content rather than read it. This realization led to focusing on video analysis for this part of the project.
Understanding Video Structure
Videos consist of sequential frames (images) in various formats that are easily processable with Python. The challenge lies in determining what specific features to extract from video content. The speaker engaged the audience to identify potential features, with suggestions including colors, facial recognition, movement, camera angles, and scene detection.
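To make the frame-by-frame view concrete, here is a minimal sketch of iterating over a video's frames with OpenCV; the library choice and the file name are assumptions, since the talk does not specify them:

```python
# Iterate over every frame of a video with OpenCV.
import cv2

cap = cv2.VideoCapture("pilot.mp4")   # placeholder path
while True:
    ok, frame = cap.read()            # frame: HxWx3 BGR numpy array
    if not ok:
        break                         # end of video (or read error)
    # ... compute features from `frame` here ...
cap.release()
```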
Yahoo's Open NSFW Model
The speaker introduced Yahoo's recently published Open NSFW (Not Suitable for Work) model, which uses the Caffe deep learning framework. This pre-trained model can identify inappropriate content in images and, by extension, videos. The model is open-source with full documentation on GitHub, including training methodology and implementation details. The speaker noted a minor installation issue requiring a quick fix and humorously mentioned controversy around the model's publication.
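Scoring a single image with the model follows the usual Caffe pattern. The sketch below mirrors the layout of the open_nsfw GitHub repository (file names and preprocessing constants are taken from it), but treat the details as assumptions rather than the speaker's exact code:

```python
# Score one image with Yahoo's Open NSFW Caffe model.
import caffe
import numpy as np

net = caffe.Net("nsfw_model/deploy.prototxt",
                "nsfw_model/resnet_50_1by2_nsfw.caffemodel",
                caffe.TEST)

# Standard Caffe preprocessing: HWC -> CHW, mean subtraction, RGB -> BGR.
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_mean("data", np.array([104, 117, 123]))
transformer.set_raw_scale("data", 255)
transformer.set_channel_swap("data", (2, 1, 0))

image = caffe.io.load_image("frame_00042.jpg")   # placeholder path
net.blobs["data"].data[...] = transformer.preprocess("data", image)
prob = net.forward()["prob"][0]                  # [SFW, NSFW]
print(f"NSFW score: {prob[1]:.3f}")
```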
Implementation Process
The implementation involves extracting individual frames from videos at one-second intervals, processing each frame through the NSFW classifier, and collecting scores for each frame. The code required is relatively compact, approximately 20 lines for video splitting. The speaker ran the analysis on an external machine due to computational requirements.
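A sketch of the splitting step, using ffmpeg to write one frame per second and then scoring each frame in order; nsfw_score is a hypothetical wrapper around the Caffe snippet above, and all paths are placeholders:

```python
# Split a video into one JPEG per second with ffmpeg, then score
# each frame with the NSFW classifier.
import glob
import os
import subprocess

os.makedirs("frames", exist_ok=True)
subprocess.run([
    "ffmpeg", "-i", "pilot.mp4",
    "-vf", "fps=1",                  # sample one frame per second
    "frames/frame_%05d.jpg",
], check=True)

# One NSFW score per second of video, in chronological order.
scores = [nsfw_score(path) for path in sorted(glob.glob("frames/*.jpg"))]
```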
Analysis Results and Examples
The speaker demonstrated results using pilots from several series:
- Desperate Housewives: Showed varying levels of detected content throughout the episode
- Dexter: Displayed less dense but notable peaks, correctly identifying scenes with intimate content
- Game of Thrones: The first episode showed moderate detection, accurately identifying a specific beach scene
- Californication: Demonstrated high detection levels, validating the series' reputation
The visualizations plotted NSFW scores over time, with sample frames displayed to verify the model's accuracy. The model performed impressively well, even correctly identifying ambiguous situations like partial nudity or suggestive angles.
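Reproducing that kind of plot takes a few lines of matplotlib; `scores` is the per-second list from the splitting step above, and the title is illustrative:

```python
# Plot the per-second NSFW scores across an episode.
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 3))
plt.plot(scores)
plt.xlabel("time (seconds)")
plt.ylabel("NSFW score")
plt.title("Dexter S01E01")   # illustrative title
plt.tight_layout()
plt.show()
```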
Additional Video Features
Beyond NSFW content detection, the speaker outlined other extractable video features including colors, brightness, length, scene rhythm, and overall structure. These features can be combined with subtitle analysis to create a comprehensive understanding of series structure and content.
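Two of those features, brightness and average color, reduce to simple array operations on each frame. This is a sketch under that reading, not the speaker's actual feature code:

```python
# Per-frame brightness and mean color with OpenCV + numpy.
import cv2

def frame_features(frame):
    """Return (brightness, mean BGR color) for one BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    brightness = float(gray.mean())                  # 0 (dark) .. 255 (bright)
    mean_color = frame.reshape(-1, 3).mean(axis=0)   # average B, G, R
    return brightness, mean_color
```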
Video Summarization: Bridging Video and Text
The speaker explored combining video and text analysis through automated video summarization. IBM Watson's automatically generated trailer for a feature film served as inspiration: the AI system created a complete trailer, music selection included, without human editing.
For those without supercomputer resources, the speaker proposed a practical approach:
- Summarize subtitles using text summarization algorithms
- Detect scenes in the video
- Analyze audio levels to identify significant events such as explosions or crowd reactions (a sketch of this step follows the list)
- Combine these elements to generate video summaries
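The audio-level step might look like the following. It assumes the audio track was first exported to a 16-bit mono WAV file (for example with ffmpeg) and flags seconds that are much louder than the episode average:

```python
# Find loud one-second windows (explosions, crowd reactions) in a
# 16-bit mono WAV file. File name and threshold are illustrative.
import wave
import numpy as np

with wave.open("short_film.wav", "rb") as w:
    rate = w.getframerate()
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# RMS loudness per one-second window.
n = len(samples) // rate
windows = samples[:n * rate].astype(np.float64).reshape(n, rate)
rms = np.sqrt((windows ** 2).mean(axis=1))

# Seconds well above the mean loudness are candidates for the summary.
loud_seconds = np.where(rms > rms.mean() + 2 * rms.std())[0]
print(loud_seconds)
```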
A demonstration showed a 12-minute short film condensed to a summary using this approach. While the result was rough, it demonstrated the feasibility of automated summarization using simple algorithms and combined data types.
Building the Recommendation Engine with Elasticsearch
To create a functional recommendation system quickly, the speaker chose Elasticsearch as an out-of-the-box solution. Elasticsearch is a search engine, similar to Solr, that can easily serve as a personal search engine for a project like this. The implementation uses Elasticsearch's More Like This (MLT) feature, which finds content similar to a selected document.
The process involves the following steps, sketched in code after the list:
- Creating document models containing all calculated features
- Indexing documents into Elasticsearch
- Querying using the MLT function with specified fields for similarity calculation
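A minimal sketch with the official Python client, assuming an Elasticsearch version of that era; the index name, field names, and values are illustrative, not the speaker's actual schema:

```python
# Index an episode document, then ask More Like This for similar ones.
from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = {
    "title": "Dexter S01E01",
    "subtitles": "...",        # full subtitle text (truncated here)
    "smog_grade": 9.1,         # illustrative values
    "nsfw_mean": 0.12,
}
es.index(index="series", doc_type="episode", id="dexter-s01e01", body=doc)

# MLT compares the chosen text fields of the given document against
# every other document in the index.
query = {
    "query": {
        "more_like_this": {
            "fields": ["subtitles"],
            "like": [{"_index": "series", "_type": "episode",
                      "_id": "dexter-s01e01"}],
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}
for hit in es.search(index="series", body=query)["hits"]["hits"][:3]:
    print(hit["_source"]["title"], hit["_score"])
```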
The speaker tested the system by inputting Dexter's pilot episode, which returned three recommended series. One recommendation, Hannibal, was confirmed by an audience member as a quality match.
Key Takeaways and Future Work
The speaker emphasized several important lessons:
- Testing machine learning methods doesn't require deep theoretical understanding initially
- Using data creatively matters more than simply analyzing massive log files
- Mixing different data types reveals insights not apparent from single sources
- Out-of-the-box solutions like Elasticsearch can be surprisingly effective and have since been adopted at the speaker's company
Part three of the project is in development, promising deeper analysis as the work becomes more complex and interesting. The speaker also mentioned open positions at ProSieben for those interested in similar work.
Technical Notes
The demonstration acknowledged this was a quick proof-of-concept rather than a production-ready solution. The recommendation results may not be fully representative of a final implementation, but the approach validates the methodology's potential.