DaDaDa 2017 - Deep Learning and the Industries - Part 1 of 3
Transcript summary
Speaker Background and Context
The speaker, Abhishek Thakur, works as a data scientist at Boost.ai, a Norway-based company building chatbots using deep learning and AI. He has experience competing in Kaggle competitions and is interested in automatic machine learning and large-scale text classification.
Introduction to Deep Learning
Deep learning is characterized as neural networks with more than three layers, distinguishing them from shallow networks. The key advantage is the removal of manual feature extraction steps, as each layer learns progressively complex features automatically. While often considered a black box, deep learning does provide interpretability through feature importance, particularly in image analysis where it's possible to understand which pixels or features are being learned.
Evolution of Neural Network Architectures
Neural network architectures have evolved significantly over time, progressing from basic convolutional and subsampling layers to more sophisticated designs. Max pooling was introduced as an improvement, and architectures grew increasingly complex. GoogleNet represents a milestone in this evolution: it has so many convolutional layers that the architecture diagram barely fits on a presentation slide. The general trend has been toward deeper and more intricate network designs.
Applications of Deep Learning
Deep learning has found widespread applications across multiple domains. In computer vision, convolutional neural networks excel at identifying objects like butterflies, dogs, and cats. R-CNNs can identify multiple objects within a single image. The technology has advanced to the point of solving challenging problems, such as distinguishing between visually similar items like dogs and muffins, or telling pugs apart from bread loaves.
Natural language processing represents another major application area, powering chatbots and enabling sophisticated text understanding. Speech processing leverages deep learning by converting audio into spectrograms, which can then be analyzed using CNNs or time-distributed layers.
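As a rough sketch of that audio pipeline (not code from the talk), the snippet below turns a waveform into a spectrogram with SciPy and feeds it to a small Keras CNN; the sample rate, window length, and class count are placeholder assumptions.

    # Sketch: audio -> spectrogram -> CNN (illustrative only; sizes are assumptions)
    import numpy as np
    from scipy.signal import spectrogram
    from tensorflow.keras import layers, models

    fs = 16000                                  # assumed sample rate
    audio = np.random.randn(fs * 2)             # stand-in for a 2-second clip

    # Convert the waveform into a time-frequency "image"
    freqs, times, spec = spectrogram(audio, fs=fs, nperseg=256, noverlap=128)
    spec = np.log(spec + 1e-10)                 # log scale, as is common for spectrograms
    x = spec[np.newaxis, :, :, np.newaxis]      # shape: (batch, freq_bins, time_steps, 1)

    # A small CNN over the spectrogram
    model = models.Sequential([
        layers.Input(shape=spec.shape + (1,)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"), # assumed 10 output classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    print(model.predict(x).shape)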
Implementation Approaches
There are several approaches to implementing deep neural networks. Building from scratch involves coding all backpropagation and forward propagation manually, which is time-consuming and complex. A more modular approach decomposes networks into smaller components, implementing layers separately before assembling them. Fine-tuning offers an efficient alternative for specific tasks, particularly in image recognition, by adapting pre-trained models to new problems.
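To make the modular idea concrete, here is a minimal from-scratch sketch (illustrative only, not the speaker's code) in which each layer object implements its own forward and backward pass and the network is simply a list of such layers:

    # Sketch: a network assembled from self-contained layer modules (illustrative only)
    import numpy as np

    class Dense:
        """Fully connected layer with its own forward and backward pass."""
        def __init__(self, n_in, n_out, lr=0.01):
            self.w = np.random.randn(n_in, n_out) * 0.01
            self.b = np.zeros(n_out)
            self.lr = lr

        def forward(self, x):
            self.x = x                            # cache input for the backward pass
            return x @ self.w + self.b

        def backward(self, grad):
            grad_input = grad @ self.w.T          # gradient w.r.t. the layer input
            self.w -= self.lr * self.x.T @ grad   # gradient step on the weights
            self.b -= self.lr * grad.sum(axis=0)
            return grad_input

    class ReLU:
        def forward(self, x):
            self.mask = x > 0
            return x * self.mask

        def backward(self, grad):
            return grad * self.mask

    # Assemble the layers and run one training step on toy data
    net = [Dense(4, 8), ReLU(), Dense(8, 1)]
    x, y = np.random.randn(16, 4), np.random.randn(16, 1)

    out = x
    for layer in net:
        out = layer.forward(out)

    grad = 2 * (out - y) / len(y)                 # gradient of mean squared error
    for layer in reversed(net):
        grad = layer.backward(grad)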
Working with Caffe
Caffe is a deep learning framework that offers several advantages: high speed, open-source availability, modular architecture, and an expression-based approach that doesn't require extensive coding knowledge. The framework has strong community support.
The Caffe workflow involves several steps. Data must first be converted into a suitable format, either an LMDB database or raw image files. The network architecture is defined in a prototxt file, which specifies the layers and their connections. A solver file configures training parameters, including the learning rate, iteration and snapshot intervals, and the location for saving model weights. The framework makes it easy to switch between CPU and GPU processing.
Despite the setup complexity, training a network in Caffe ultimately reduces to a single command-line instruction that points to the solver configuration file.
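As an illustration of that setup (all values below are placeholders rather than settings from the talk), a minimal solver file and the corresponding training command might look like:

    # solver.prototxt (values are illustrative)
    net: "train_val.prototxt"          # network definition with layers and connections
    base_lr: 0.001                     # starting learning rate
    lr_policy: "step"
    gamma: 0.1
    stepsize: 10000                    # drop the learning rate every 10k iterations
    max_iter: 40000
    test_interval: 1000                # how often to run the test phase
    test_iter: 100
    snapshot: 5000                     # save model weights every 5k iterations
    snapshot_prefix: "snapshots/model"
    solver_mode: GPU                   # switch to CPU by changing this single line

    # Training then reduces to one command:
    #   caffe train --solver=solver.prototxt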
Fine-Tuning Strategy
Fine-tuning involves starting with a pre-trained model rather than training from scratch, which typically yields better results and saves significant time. The Caffe Model Zoo provides access to various pre-trained models, including GoogleNet. These models can be adapted to new tasks by modifying their final layers and adjusting parameters for the specific problem at hand.
Use Case: Honey Bee vs Bumblebee Classification
This classification challenge, hosted on drivendata.org by Metis, involved distinguishing between honey bees and bumblebees, a task more complex than typical dog vs cat classification. The dataset contained 80% positive samples and 20% negative samples, evaluated using the area under the ROC curve.
The initial approach used a basic custom network with three convolutional layers and max pooling, outputting two classes. This simple architecture achieved poor results, with the loss stuck between 0.43 and 0.44 and an AUC around 0.85.
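A rough Keras equivalent of such a baseline might look like the following (the talk implemented it in Caffe; filter counts and input size here are assumptions):

    # Sketch of a small three-conv-layer baseline (sizes are assumptions)
    from tensorflow.keras import layers, models

    baseline = models.Sequential([
        layers.Input(shape=(128, 128, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(2, activation="softmax"),   # two classes: honey bee vs bumblebee
    ])
    baseline.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])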
Switching to fine-tuning with GoogleNet required creating training and test files with image paths and labels, downloading and modifying prototxt files from the Model Zoo, and adapting them to the specific dataset. Key modifications included changing from LMDB format to image data format, adjusting batch sizes and image dimensions, renaming the final layer for fine-tuning purposes, modifying dropout ratios, and changing the output from 1000 to 2 classes.
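In prototxt terms, the edits might look roughly like this (layer names follow the public GoogleNet definition; batch size, crop size, and learning-rate multipliers are placeholders):

    # Data layer switched from LMDB to ImageData, reading "path label" lines (values illustrative)
    layer {
      name: "data"
      type: "ImageData"
      top: "data"
      top: "label"
      include { phase: TRAIN }
      transform_param { crop_size: 224 mirror: true }
      image_data_param {
        source: "train.txt"      # one "image_path label" pair per line
        batch_size: 32
        new_height: 256
        new_width: 256
      }
    }

    # Final classifier renamed so Caffe re-initialises it instead of copying
    # the pre-trained 1000-class weights; output reduced to 2 classes
    layer {
      name: "loss3/classifier_bees"
      type: "InnerProduct"
      bottom: "pool5/7x7_s1"
      top: "loss3/classifier_bees"
      param { lr_mult: 10 }      # learn the new layer faster than the rest
      param { lr_mult: 20 }
      inner_product_param {
        num_output: 2
        weight_filler { type: "xavier" }
      }
    }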
To verify the network was learning meaningful features, the model was first initialized with random weights, showing mixed clusters of both bee types. Using pre-trained GoogleNet weights showed some clustering improvement. After fine-tuning, the visualization demonstrated clear separation between the two bee classes. Analysis of accuracy across different inception layers showed performance exceeding 90% in the final fine-tuned layer, demonstrating the effectiveness of transfer learning.
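The talk does not spell out how the cluster plots were produced, but a common way to make such a visualization is to extract activations from a chosen layer and project them to two dimensions, for example with t-SNE:

    # Sketch: projecting layer activations to 2-D to check class separation
    # (illustrative; "features" stands in for activations extracted from some layer)
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    features = np.random.randn(500, 1024)        # placeholder for extracted activations
    labels = np.random.randint(0, 2, size=500)   # 0 = honey bee, 1 = bumblebee (placeholder)

    embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)

    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm", s=8)
    plt.title("2-D projection of layer activations")
    plt.show()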
Transition to Python and Keras
While Caffe offers a Python interface, the speaker prefers working with Keras due to its simplicity and ease of use. Other available frameworks include TensorFlow, Theano, Lasagne, and Neon from Nervana Systems, but Keras stands out as the most straightforward option for implementation.
Use Case: Search Query Classification
This project addressed the classification of search queries without user logs, dealing with hundreds of millions of unlabeled keywords. The business motivation centers on understanding user intention, tracking keywords according to buying cycles, and optimizing website content. Companies want to ensure their websites attract relevant traffic matching their offerings.
The approach adapted word2vec concepts to search queries, creating sentence-to-vector representations. When searching for a location like Berlin, the system should return relevant results like weather forecasts without irrelevant information. The implementation used LSTM networks, where each word in a query is represented as a vector, creating a sequence fed into the LSTM model.
The Keras implementation proved remarkably concise, requiring only about ten lines of code with LSTM layers, dropout, and sigmoid activation. However, LSTMs demonstrated severe performance limitations when scaling to 100 million samples, with prohibitively slow training times that made the approach impractical for this scale.
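A model along those lines might look roughly like this (layer sizes, vocabulary size, and sequence length are assumptions; an Embedding layer stands in for the pre-computed word vectors):

    # Sketch of a compact Keras LSTM classifier for queries (sizes are assumptions)
    from tensorflow.keras import layers, models

    max_len, vocab_size = 20, 50000              # assumed sequence length and vocabulary size

    model = models.Sequential([
        layers.Input(shape=(max_len,), dtype="int32"),
        layers.Embedding(vocab_size, 300),       # word vectors, e.g. word2vec-sized
        layers.LSTM(128),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # binary intent / relevance output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])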
Search Query Categorization
Search queries can be categorized into three types: navigational queries seeking specific websites, transactional queries indicating purchase intent, and informational queries seeking knowledge like things to do in a city.
An alternative approach represented search queries as images created from word2vec vectors. For each Google search, the system analyzed the top ten results, creating word2vec representations for each title, which were then fed into convolutional neural networks. This allowed the CNN to distinguish between visually similar concepts that might confuse traditional methods.
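A heavily simplified sketch of that idea (shapes and layer sizes are assumptions, not the actual setup) could stack the per-title word2vec vectors into a two-dimensional input for a small CNN:

    # Sketch: treating a query's top search-result titles as a word2vec "image"
    # (illustrative; 10 result titles x 300-dimensional vectors, shapes are assumptions)
    import numpy as np
    from tensorflow.keras import layers, models

    n_titles, vec_dim = 10, 300
    query_image = np.random.randn(1, n_titles, vec_dim, 1)  # placeholder for stacked title vectors

    model = models.Sequential([
        layers.Input(shape=(n_titles, vec_dim, 1)),
        layers.Conv2D(32, (3, 5), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(3, activation="softmax"),   # navigational / transactional / informational
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    print(model.predict(query_image).shape)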
The speaker experimented with various machine learning models, including boosted trees with word2vec embeddings for classification. The final solution combined TF-IDF features with XGBoost, which proved most effective. For CNN implementation, the approach used either direct images or random crops averaged together, implemented through simple sequential models with dense layers in Keras, with similar architecture adaptable to convolutional models.
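A bare-bones version of the TF-IDF plus XGBoost combination might look like this (parameters and example queries are illustrative, not the tuned values from the project):

    # Sketch: TF-IDF features on query text fed into XGBoost (parameters illustrative)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from xgboost import XGBClassifier

    queries = ["cheap flights to berlin", "berlin weather forecast", "login facebook"]
    labels = [1, 2, 0]   # e.g. 0 = navigational, 1 = transactional, 2 = informational

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
    X = vectorizer.fit_transform(queries)

    clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    clf.fit(X, labels)
    print(clf.predict(vectorizer.transform(["things to do in berlin"])))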
General Machine Learning Framework
The speaker developed a systematic framework for approaching machine learning problems during their PhD studies. The framework begins with dataset splitting into validation, training, and test sets. Data types are categorized as numerical, categorical, or text data. Categorical data undergoes label encoding, binarization, or one-hot encoding. The approach involves stacking different models and applying decomposition techniques like PCA or LDA. All combinations are evaluated systematically to select the best performing model.
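A toy scikit-learn version of that "split, encode, decompose, compare" loop (purely illustrative, not the speaker's unreleased framework) could look like:

    # Sketch: split data, encode categoricals, optionally decompose, compare models
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Toy dataset with one numerical and one categorical column
    df = pd.DataFrame({
        "num": np.random.randn(200),
        "cat": np.random.choice(["a", "b", "c"], size=200),
        "target": np.random.randint(0, 2, size=200),
    })
    X_train, X_test, y_train, y_test = train_test_split(
        df[["num", "cat"]], df["target"], test_size=0.2, random_state=0)

    # Numerical columns are scaled, categorical columns are one-hot encoded
    encode = ColumnTransformer([
        ("num", StandardScaler(), ["num"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ], sparse_threshold=0.0)            # keep output dense so PCA can be applied

    # Evaluate every combination of decomposition and model, keep the best
    best = None
    for reducer in [None, PCA(n_components=2)]:
        for model in [LogisticRegression(max_iter=1000), RandomForestClassifier()]:
            steps = [("encode", encode)]
            if reducer is not None:
                steps.append(("reduce", reducer))
            steps.append(("model", model))
            score = Pipeline(steps).fit(X_train, y_train).score(X_test, y_test)
            if best is None or score > best[0]:
                best = (score, model.__class__.__name__)

    print("best:", best)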
This framework was adapted for neural network optimization, following similar systematic principles for architecture selection and hyperparameter tuning. However, selecting optimal neural network architecture remains a non-trivial challenge requiring careful experimentation and validation. The speaker mentioned this framework was never released publicly but expressed openness to collaboration on reimplementing it from scratch.