DaDaDa 2016 - Finding Your Next Series to Watch Through Data Analysis
Video content recommendation using Yahoo's Open NSFW deep learning model for frame analysis, combined with subtitle text processing and audio analysis. Elasticsearch's More Like This feature generates the recommendations: given Dexter's pilot, it suggests Hannibal. Multi-modal analysis beats single-source.
Abstract
This presentation details the development of a content-based recommendation system for video content, building upon earlier text-based subtitle analysis by incorporating video and audio processing. Moving beyond basic text features (word counts, word clouds, SMOG grade complexity metrics), the speaker implements Yahoo's Open NSFW model—a pre-trained deep learning classifier using the Caffe framework—to analyze video content frame-by-frame at one-second intervals. Demonstrations on pilots from Desperate Housewives, Dexter, Game of Thrones, and Californication validate the model's accuracy in detecting inappropriate content across episodes, with visualizations showing temporal NSFW score distributions and sample frames confirming correct identification even in ambiguous situations. The analysis extends beyond content detection to encompass extractable video features including colors, brightness, length, scene rhythm, and overall structure, which combine with subtitle analysis for comprehensive series profiling. Inspired by IBM Watson's automated trailer generation, the presentation explores practical video summarization approaches accessible without supercomputer resources: text summarization algorithms applied to subtitles, scene detection in video, and audio level analysis to identify significant events (explosions, crowd reactions). A demonstration condenses a 12-minute short film using this combined methodology, proving feasibility despite rough results. The recommendation engine implementation leverages Elasticsearch's More Like This (MLT) feature as an out-of-the-box solution, functioning as a personal search engine similar to Solr. The process involves creating document models containing all calculated features, indexing documents into Elasticsearch, and querying with MLT on specified similarity fields. Testing with Dexter's pilot episode returns three recommendations including Hannibal, validated by audience confirmation as a quality match. The presentation emphasizes practical lessons: testing machine learning methods requires minimal theoretical understanding initially; creative data usage matters more than analyzing massive log files; mixing different data types (text, video, audio) reveals insights invisible to single-source analysis; and surprisingly effective out-of-the-box solutions like Elasticsearch merit adoption in production environments. The work represents a proof-of-concept demonstrating methodology potential rather than production-ready implementation, with part three promising deeper analysis as complexity increases.
About the speaker
Karim Jedda is a data scientist at ProSieben, a major German media company, working on content analysis and recommendation systems for video entertainment. Their work focuses on practical applications of machine learning and deep learning to understand and categorize television series and movies through multi-modal analysis combining text (subtitles), video (frame analysis), and audio processing. They advocate accessible approaches to machine learning that prioritize creative problem-solving and out-of-the-box solutions over purely theoretical understanding, and have implemented systems that were adopted for production use at ProSieben. They are also actively involved in recruiting for similar data science positions at the company.
Transcript summary
Recap of Part One: Text-Based Analysis
The initial approach involved analyzing subtitles from series and movies. Key features extracted included word counts relative to episode length, word clouds to identify prevalent topics, and the SMOG grade metric. The SMOG grade measures text complexity by indicating the age level required to understand the vocabulary used in the subtitles. This metric was calculated using an existing algorithm available through a Python library.
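For illustration, a SMOG grade can be computed with the textstat Python package; the talk does not name the specific library used, so this choice, along with the file name, is an assumption:

```python
# Minimal sketch: SMOG grade of a subtitle file using textstat.
# The .srt path is a placeholder; in practice the timestamps and cue
# numbers would be stripped out before scoring.
import textstat

with open("dexter_s01e01.srt", encoding="utf-8") as f:
    subtitle_text = f.read()

# SMOG estimates the years of education needed to understand the text.
print(f"SMOG grade: {textstat.smog_index(subtitle_text):.1f}")
```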
The Need for Video and Audio Analysis
While text analysis provided valuable insights, the speaker recognized that video and audio contain the most important information since viewers watch and hear content rather than read it. This realization led to focusing on video analysis for this part of the project.
Understanding Video Structure
Videos consist of sequential frames (images) in various formats that are easily processable with Python. The challenge lies in determining what specific features to extract from video content. The speaker engaged the audience to identify potential features, with suggestions including colors, facial recognition, movement, camera angles, and scene detection.
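To make the frame-by-frame view concrete, here is a minimal sketch of iterating over a video's frames with OpenCV; the library choice and the file name are assumptions, since the talk does not specify them:

```python
# Iterate over every frame of a video with OpenCV.
import cv2

cap = cv2.VideoCapture("pilot.mp4")   # placeholder path
while True:
    ok, frame = cap.read()            # frame: HxWx3 BGR numpy array
    if not ok:
        break                         # end of video (or read error)
    # ... compute features from `frame` here ...
cap.release()
```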
Yahoo's Open NSFW Model
The speaker introduced Yahoo's recently published Open NSFW (Not Suitable for Work) model, which uses the Caffe deep learning framework. This pre-trained model can identify inappropriate content in images and, by extension, videos. The model is open-source with full documentation on GitHub, including training methodology and implementation details. The speaker noted a minor installation issue requiring a quick fix and humorously mentioned controversy around the model's publication.
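Scoring a single image with the model follows the usual Caffe pattern. The sketch below mirrors the layout of the open_nsfw GitHub repository (file names and preprocessing constants are taken from it), but treat the details as assumptions rather than the speaker's exact code:

```python
# Score one image with Yahoo's Open NSFW Caffe model.
import caffe
import numpy as np

net = caffe.Net("nsfw_model/deploy.prototxt",
                "nsfw_model/resnet_50_1by2_nsfw.caffemodel",
                caffe.TEST)

# Standard Caffe preprocessing: HWC -> CHW, mean subtraction, RGB -> BGR.
transformer = caffe.io.Transformer({"data": net.blobs["data"].data.shape})
transformer.set_transpose("data", (2, 0, 1))
transformer.set_mean("data", np.array([104, 117, 123]))
transformer.set_raw_scale("data", 255)
transformer.set_channel_swap("data", (2, 1, 0))

image = caffe.io.load_image("frame_00042.jpg")   # placeholder path
net.blobs["data"].data[...] = transformer.preprocess("data", image)
prob = net.forward()["prob"][0]                  # [SFW, NSFW]
print(f"NSFW score: {prob[1]:.3f}")
```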
Implementation Process
The implementation involves extracting individual frames from videos at one-second intervals, processing each frame through the NSFW classifier, and collecting scores for each frame. The code required is relatively compact, approximately 20 lines for video splitting. The speaker ran the analysis on an external machine due to computational requirements.
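A sketch of the splitting step, using ffmpeg to write one frame per second and then scoring each frame in order; nsfw_score is a hypothetical wrapper around the Caffe snippet above, and all paths are placeholders:

```python
# Split a video into one JPEG per second with ffmpeg, then score
# each frame with the NSFW classifier.
import glob
import os
import subprocess

os.makedirs("frames", exist_ok=True)
subprocess.run([
    "ffmpeg", "-i", "pilot.mp4",
    "-vf", "fps=1",                  # sample one frame per second
    "frames/frame_%05d.jpg",
], check=True)

# One NSFW score per second of video, in chronological order.
scores = [nsfw_score(path) for path in sorted(glob.glob("frames/*.jpg"))]
```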
Analysis Results and Examples
The speaker demonstrated results using pilots from several series:
- Desperate Housewives: Showed varying levels of detected content throughout the episode
- Dexter: Displayed less dense but notable peaks, correctly identifying scenes with intimate content
- Game of Thrones: The first episode showed moderate detection, accurately identifying a specific beach scene
- Californication: Demonstrated high detection levels, validating the series' reputation
The visualizations plotted NSFW scores over time, with sample frames displayed to verify the model's accuracy. The model performed impressively well, even correctly identifying ambiguous situations like partial nudity or suggestive angles.
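Reproducing that kind of plot takes a few lines of matplotlib; `scores` is the per-second list from the splitting step above, and the title is illustrative:

```python
# Plot the per-second NSFW scores across an episode.
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 3))
plt.plot(scores)
plt.xlabel("time (seconds)")
plt.ylabel("NSFW score")
plt.title("Dexter S01E01")   # illustrative title
plt.tight_layout()
plt.show()
```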
Additional Video Features
Beyond NSFW content detection, the speaker outlined other extractable video features including colors, brightness, length, scene rhythm, and overall structure. These features can be combined with subtitle analysis to create a comprehensive understanding of series structure and content.
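Two of those features, brightness and average color, reduce to simple array operations on each frame. This is a sketch under that reading, not the speaker's actual feature code:

```python
# Per-frame brightness and mean color with OpenCV + numpy.
import cv2

def frame_features(frame):
    """Return (brightness, mean BGR color) for one BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    brightness = float(gray.mean())                  # 0 (dark) .. 255 (bright)
    mean_color = frame.reshape(-1, 3).mean(axis=0)   # average B, G, R
    return brightness, mean_color
```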
Video Summarization: Bridging Video and Text
The speaker explored combining video and text analysis through automated video summarization. IBM Watson's automatically generated trailer for a feature film served as inspiration: the AI system created a complete trailer, music selection included, without human editing.
For those without supercomputer resources, the speaker proposed a practical approach:
- Summarize subtitles using text summarization algorithms
- Detect scenes in the video
- Analyze audio levels to identify significant events such as explosions or crowd reactions (a sketch of this step follows the list)
- Combine these elements to generate video summaries
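The audio-level step might look like the following. It assumes the audio track was first exported to a 16-bit mono WAV file (for example with ffmpeg) and flags seconds that are much louder than the episode average:

```python
# Find loud one-second windows (explosions, crowd reactions) in a
# 16-bit mono WAV file. File name and threshold are illustrative.
import wave
import numpy as np

with wave.open("short_film.wav", "rb") as w:
    rate = w.getframerate()
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# RMS loudness per one-second window.
n = len(samples) // rate
windows = samples[:n * rate].astype(np.float64).reshape(n, rate)
rms = np.sqrt((windows ** 2).mean(axis=1))

# Seconds well above the mean loudness are candidates for the summary.
loud_seconds = np.where(rms > rms.mean() + 2 * rms.std())[0]
print(loud_seconds)
```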
A demonstration showed a 12-minute short film condensed to a summary using this approach. While the result was rough, it demonstrated the feasibility of automated summarization using simple algorithms and combined data types.
Building the Recommendation Engine with Elasticsearch
To create a functional recommendation system quickly, the speaker chose Elasticsearch as an out-of-the-box solution. Elasticsearch is a search engine, similar to Solr, that can easily serve as a personal search engine for a project like this. The implementation uses Elasticsearch's More Like This (MLT) feature, which finds content similar to a selected document.
The process involves the following steps, sketched in code after the list:
- Creating document models containing all calculated features
- Indexing documents into Elasticsearch
- Querying using the MLT function with specified fields for similarity calculation
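A minimal sketch with the official Python client, assuming an Elasticsearch version of that era; the index name, field names, and values are illustrative, not the speaker's actual schema:

```python
# Index an episode document, then ask More Like This for similar ones.
from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = {
    "title": "Dexter S01E01",
    "subtitles": "...",        # full subtitle text (truncated here)
    "smog_grade": 9.1,         # illustrative values
    "nsfw_mean": 0.12,
}
es.index(index="series", doc_type="episode", id="dexter-s01e01", body=doc)

# MLT compares the chosen text fields of the given document against
# every other document in the index.
query = {
    "query": {
        "more_like_this": {
            "fields": ["subtitles"],
            "like": [{"_index": "series", "_type": "episode",
                      "_id": "dexter-s01e01"}],
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}
for hit in es.search(index="series", body=query)["hits"]["hits"][:3]:
    print(hit["_source"]["title"], hit["_score"])
```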
The speaker tested the system by inputting Dexter's pilot episode, which returned three recommended series. One recommendation, Hannibal, was confirmed by an audience member as a quality match.
Key Takeaways and Future Work
The speaker emphasized several important lessons:
- Testing machine learning methods doesn't require deep theoretical understanding initially
- Using data creatively matters more than simply analyzing massive log files
- Mixing different data types reveals insights not apparent from single sources
- Out-of-the-box solutions like Elasticsearch can be surprisingly effective and have since been adopted at the speaker's company
Part three of the project is in development, promising deeper analysis as the work becomes more complex and interesting. The speaker also mentioned open positions at ProSieben for those interested in similar work.
Technical Notes
The demonstration acknowledged this was a quick proof-of-concept rather than a production-ready solution. The recommendation results may not be fully representative of a final implementation, but the approach validates the methodology's potential.