Dadada 2017 - Deep Learning and the Industries
Practical deep learning applications in industry: fine-tuning GoogLeNet for image classification, processing hundreds of millions of search queries with LSTMs and CNNs, optimizing neural architectures, and building multilingual chatbots.
This video recording was unfortunately split into 3 parts.
Abstract
This talk presents practical applications of deep learning in industry settings, focusing on real-world implementations rather than theoretical concepts. The speaker covers the fundamentals of deep neural networks and their advantages over traditional machine learning approaches, particularly the automatic feature extraction capabilities that eliminate manual feature engineering. The presentation demonstrates three main use cases: distinguishing between honeybees and bumblebees using fine-tuned convolutional neural networks, classifying search queries at scale for understanding user intent, and optimizing neural network architectures. The talk emphasizes practical tools like Caffe and Keras for implementing deep learning solutions, with detailed discussions on fine-tuning pre-trained models like GoogLeNet to achieve better performance with less training time. Additional topics include working with time series data using LSTMs and one-dimensional CNNs, a general framework for approaching machine learning problems, and building multilingual chatbots that handle intent classification and synonym recognition.
About the speaker
Abhishek Thakur is a data scientist at Boost.ai, a Norway-based company that builds chatbots using deep learning and artificial intelligence. He has participated in Kaggle competitions and maintains a strong interest in automatic machine learning and large-scale text classification. His professional experience spans multiple positions where he has successfully deployed machine learning models into production environments. The speaker has worked on various deep learning projects across different domains, from computer vision tasks to natural language processing applications, and has contributed to research presented at ICML workshops. He emphasizes the importance of productionizing models and shares insights from his practical experience implementing deep learning solutions in business contexts.
Transcript summary
Introduction to Deep Learning
Deep learning has evolved from a technical innovation to an essential skill for data scientists, though the speaker notes with humor that many claim expertise without genuine understanding. Deep learning refers specifically to neural networks with more than three layers, distinguishing them from shallow networks. The key advantage of deep learning lies in its ability to automatically extract features from raw data, with each layer learning increasingly complex representations. Contrary to popular belief about deep learning being a black box, these models provide interpretability through feature importance, particularly in image processing where pixel-level contributions can be analyzed.
Evolution of Neural Network Architectures
The history of convolutional neural networks shows continuous advancement in complexity and capability. Early architectures featured basic convolution layers and subsampling layers, with max pooling introduced in later iterations. GoogLeNet represented a significant leap forward with its extensive convolution layers, though its complexity was such that the architecture diagram could not fit properly on a standard screen. This progression toward deeper networks reflects the field's understanding that additional depth enables more sophisticated pattern recognition.
Capabilities of Deep Learning
Convolutional neural networks excel at image classification tasks, identifying objects like butterflies, dogs, and cats. Region-based CNNs (R-CNNs) advance this capability by detecting and localizing multiple objects within a single image. The speaker humorously illustrates the challenge of distinguishing visually similar items, such as Chihuahuas from blueberry muffins or dogs from bread loaves, demonstrating both the difficulty of these tasks and deep learning's ability to solve them. Beyond computer vision, deep learning applications extend to natural language processing for chatbots and speech processing, where spectrograms can be treated as images for CNN analysis or processed with time-distributed layers.
Implementation Approaches
Implementing deep neural networks can follow several paths, each with different complexity levels. Building networks from scratch requires understanding and coding backpropagation and forward propagation, which can become complicated. Alternatively, networks can be decomposed into smaller components, with individual layers implemented separately before being assembled into the complete training process. The most efficient approach for many applications involves fine-tuning pre-trained models, which saves significant development time and often produces superior results.
Working with Caffe
Caffe was the speaker's tool of choice for earlier projects, though he later transitioned to Keras. Caffe offers several advantages: exceptional speed, open-source accessibility, modularity, and an expressive design in which models are defined in configuration files rather than code, so little programming is needed. The framework has a substantial community providing support and resources. Implementation in Caffe requires supplying the data either as an LMDB database or as raw images, defining the network architecture in prototxt files, specifying solver parameters, and training with optional pre-trained weights.
Solver and Network Definition
The solver configuration file specifies critical training parameters including the network name referencing the training prototxt file, number of test iterations and intervals, verbosity levels, learning rate policies, and snapshot prefixes indicating where model weights should be saved. The system allows easy switching between CPU and GPU modes. Network architecture is defined in separate prototxt files, exemplified by LeNet, where input dimensions are specified according to standard convolutional network requirements. Layers are defined as separate modules including convolutional layers, pooling layers (such as max pooling), and output layers using softmax for probability values.
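As a rough illustration of the solver fields listed above, the following Python sketch renders a solver configuration as prototxt text. The field names are Caffe solver parameters; every value and file path is a placeholder assumption, not a setting taken from the talk.

```python
# Hypothetical solver settings: all values and paths are placeholders.
solver_settings = {
    "net": "train_val.prototxt",         # network definition to train
    "test_iter": 100,                    # batches evaluated per test pass
    "test_interval": 500,                # run a test pass every N iterations
    "display": 20,                       # log progress every N iterations
    "base_lr": 0.001,                    # starting learning rate
    "lr_policy": "step",                 # drop the rate by gamma every stepsize
    "gamma": 0.1,
    "stepsize": 2000,
    "max_iter": 10000,
    "snapshot": 1000,                    # save weights every N iterations
    "snapshot_prefix": "snapshots/bees", # where snapshots are written
    "solver_mode": "GPU",                # enum value, written without quotes
}

ENUM_FIELDS = {"solver_mode"}


def render_solver(settings):
    """Render a dict of solver settings as prototxt 'key: value' lines."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, str) and key not in ENUM_FIELDS:
            value = f'"{value}"'  # string fields are quoted in prototxt
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


solver_text = render_solver(solver_settings)
print(solver_text)
```

Switching between CPU and GPU in this sketch means changing only the `solver_mode` entry.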
Use Case: Honeybee vs Bumblebee Classification
The classification challenge from Metis, hosted on DrivenData, required distinguishing between honeybees and bumblebees, a more difficult task than typical cat versus dog classification. The dataset was imbalanced, with 80% positive samples and 20% negative samples, and submissions were evaluated using area under the ROC curve. An initial approach using a basic network with three convolutional layers and max pooling produced high loss values around 0.43-0.44 and poor AUC scores near 0.85. Fine-tuning GoogLeNet dramatically improved performance. The process involved creating training and test files, modifying prototxt files from the Caffe Model Zoo to match the data format, and running the Caffe solver with pre-trained weights.
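To see why area under the ROC curve suits an 80/20 class split, the sketch below computes AUC directly as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The labels and scores are made-up toy values, not competition data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data with an 80/20 imbalance: 8 positives, 2 negatives.
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_positive = [1.0] * 10  # a "model" that ignores its input
model_scores = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.75, 0.3, 0.2, 0.4]

# Accuracy of the trivial model at a 0.5 threshold.
baseline_accuracy = sum(
    1 for y, s in zip(labels, always_positive) if (s >= 0.5) == (y == 1)
) / len(labels)

print(baseline_accuracy)                 # 0.8
print(roc_auc(labels, always_positive))  # 0.5
print(roc_auc(labels, model_scores))     # 0.9375
```

The trivial model reaches 80% accuracy just by predicting the majority class, yet its AUC stays at chance level, which is why the metric is harder to game on imbalanced data.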
Fine-Tuning Process and Results
The fine-tuning implementation required several modifications to the prototxt files. The data source was changed from LMDB format to image data, batch sizes were adjusted, and input dimensions were specified for height and width. The final layer was renamed so that Caffe, which copies pre-trained weights by layer name, would initialize it from scratch; dropout ratios were adjusted, and the number of output classes was reduced from GoogLeNet's original 1000 to 2 for the binary classification task. To verify that the network was learning meaningful patterns, the speaker compared the feature space at three stages. With randomly initialized weights, honeybees and bumblebees were mixed together. With pre-trained GoogLeNet weights, initial clusters began to form, and after fine-tuning the two classes were clearly separated.
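The change to the final layer can be sketched as a small text edit. The fragment below is a simplified stand-in written in the style of the GoogLeNet definition from the Model Zoo, and the new layer name is a hypothetical choice. Caffe copies pre-trained weights by layer name, so the renamed layer is initialized afresh while all other layers keep their pre-trained weights.

```python
import re

# Simplified single-layer fragment; the real file has many more layers and fields.
ORIGINAL = '''layer {
  name: "loss3/classifier"
  type: "InnerProduct"
  bottom: "pool5/7x7_s1"
  top: "loss3/classifier"
  inner_product_param {
    num_output: 1000
  }
}'''


def adapt_classifier(fragment, old_name, new_name, num_classes):
    """Rename a classifier layer and set its number of outputs.

    Intended for a fragment containing only the classifier layer: the
    num_output substitution would hit every layer in a full file.
    """
    # Rename both the layer and its output blob.
    renamed = fragment.replace(f'"{old_name}"', f'"{new_name}"')
    return re.sub(r"num_output:\s*\d+", f"num_output: {num_classes}", renamed)


# "loss3/classifier_bees" is a made-up name for the two-class layer.
adapted = adapt_classifier(ORIGINAL, "loss3/classifier", "loss3/classifier_bees", 2)
print(adapted)
```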
Analysis of Inception Layers
Breaking down the inception layers revealed how accuracy improved through the network depth. By extracting the activations of each feature layer and fitting a logistic regression on top, the speaker measured classification accuracy at different network depths. The results showed accuracy exceeding 90% for the final fine-tuned layers, demonstrating that deeper layers captured more discriminative features. This analysis supports the case for fine-tuning: it provides very fast implementation, typically produces better results than training from scratch, and leverages existing knowledge without reinventing established architectures.
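A layer-wise probe of this kind can be sketched with a tiny logistic regression. The two feature sets below are invented toy values standing in for features taken from an early and a late layer; the training loop is plain stochastic gradient descent and is an assumption for illustration, not the speaker's setup.

```python
import math


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression with per-sample gradient descent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def probe_accuracy(model, features, labels):
    """Fraction of samples on the correct side of the decision boundary."""
    weights, bias = model
    correct = 0
    for x, y in zip(features, labels):
        z = bias + sum(w * v for w, v in zip(weights, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


labels = [0, 0, 0, 1, 1, 1]
# Toy "early layer" features: identical for both classes, so uninformative.
early = [[0.5, 0.5], [0.4, 0.6], [0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.6, 0.4]]
# Toy "late layer" features: the two classes are well separated.
late = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.9, 0.8], [0.8, 0.9], [1.0, 0.7]]

# Evaluated on the training data for brevity; a real probe needs held-out data.
early_acc = probe_accuracy(train_logistic(early, labels), early, labels)
late_acc = probe_accuracy(train_logistic(late, labels), late, labels)
print(early_acc, late_acc)
```

Running the same probe on each layer in turn gives an accuracy-versus-depth curve like the one described above.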
Python Implementation with Keras
While Caffe offers a Python interface, the speaker primarily uses Keras for its simplicity and ease of use. Other frameworks available include TensorFlow, Theano, Lasagne, and Neon from Nervana Systems. Keras stands out for its user-friendly API that makes deep learning accessible. The framework allows rapid prototyping and experimentation, making it the preferred choice for many practitioners developing production systems.
Search Query Classification
One of the speaker's deployed production systems involved classifying search queries to understand user intent. When users search for products like iPhones or activities like things to do in Munich, businesses want to determine the underlying purpose. This classification helps websites optimize content and avoid showing irrelevant products to users. For example, an electronics retailer does not want visitors interested in furniture. The system tracks keywords according to the buying cycle, distinguishing between awareness, consideration, and purchase stages, all without requiring user logs. The challenge involved handling hundreds of millions of unlabeled keywords.
Word2vec for Search Queries
The approach adapted word2vec concepts to search queries, similar to how traditional word2vec handles semantic relationships like Berlin minus Germany plus France equals Paris. For search queries, the system produces relevant results, such as returning Berlin weather forecasts when asked about weather in Berlin. The implementation uses sentence-to-vector representations rather than individual word vectors. The architecture employed LSTM networks, where each word in a query like "the white house" is represented as a vector, creating a sequence fed into the LSTM model. This can be implemented in approximately ten lines of Keras code using LSTM layers, dropout for regularization, and sigmoid activation for binary classification.
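The vector arithmetic behind such analogies can be shown with hand-made toy vectors. The three-dimensional embeddings below are invented for illustration (real word2vec vectors are learned and have hundreds of dimensions); the lookup returns the word whose vector is closest, by cosine similarity, to the combined vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy axes: (is a capital, relates to Germany, relates to France).
toy_vectors = {
    "germany": [0.0, 1.0, 0.0],
    "berlin": [1.0, 1.0, 0.0],
    "munich": [0.3, 1.0, 0.0],
    "france": [0.0, 0.0, 1.0],
    "paris": [1.0, 0.0, 1.0],
}

print(analogy(toy_vectors, "berlin", "germany", "france"))  # paris
```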
Performance Challenges with LSTMs
Despite achieving reasonable results, with outputs like United States, President, and politician for "the white house," LSTMs proved extremely slow when processing 100 million samples. The training time per epoch was prohibitively long, and the speaker did not have the patience to wait for the error to come down. This performance limitation led to exploring alternative approaches for large-scale query classification.
Query Image Representation
Search queries can be categorized into navigational (like searching for Data Geeks event), transactional (like searching for iPhones), and informational (like things to do in Munich). The speaker developed a novel approach by representing search queries as images created from word2vec vectors. When searching on Google returns ten results, word2vec representations were created for each title in the results. These representations were then fed into a convolutional neural network, enabling the network to distinguish between visually similar concepts like Guild Wars (a game) and apple juice, which humans might struggle to differentiate.
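One way to picture the query "image" is as a matrix with one row per search result title. In the sketch below each title vector is the average of its word vectors, which is an assumption, since the summary does not say how title representations were built; the word vectors themselves are toy values.

```python
def title_vector(title, word_vectors, dim):
    """Average the vectors of known words in a title (zeros if none are known)."""
    known = [word_vectors[w] for w in title.lower().split() if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def query_image(titles, word_vectors, dim, rows=10):
    """Stack title vectors into a rows x dim matrix, zero-padding short lists."""
    image = [title_vector(t, word_vectors, dim) for t in titles[:rows]]
    image += [[0.0] * dim for _ in range(rows - len(image))]
    return image


# Toy three-dimensional word vectors, invented for illustration.
toy_word_vectors = {
    "guild": [1.0, 0.0, 0.0],
    "wars": [0.0, 1.0, 0.0],
    "game": [0.0, 0.0, 1.0],
}
result_titles = ["Guild Wars", "Guild Wars game", "unrelated words"]

image = query_image(result_titles, toy_word_vectors, dim=3)
print(len(image), len(image[0]))  # 10 3
```

The resulting fixed-size matrix can then be passed to a convolutional network in the same way as a single-channel image.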
Alternative Machine Learning Approaches
The speaker experimented with various machine learning models for query classification, starting with gradient boosted trees combined with word2vec embeddings to categorize queries. The final production solution used TF-IDF features combined with XGBoost for optimal performance at scale. For feeding features into CNNs, the approach involved either using images directly or employing random crops, with results combined through averaging. The neural networks used simple sequential models with dense layers in Keras, and similar architectures applied to convolutional models.
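The TF-IDF half of that pipeline can be sketched in a few lines. This uses the textbook weighting, term frequency times log(N / document frequency); library implementations often apply smoothing and normalization on top, and the gradient boosting step is left out here. The queries are made-up examples.

```python
import math
from collections import Counter


def tfidf(queries):
    """Return one {term: weight} dict per query using tf * log(N / df)."""
    docs = [q.lower().split() for q in queries]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return rows


queries = ["buy iphone", "iphone price", "things to do in munich"]
rows = tfidf(queries)
print(rows[0])
```

In the first query, "buy" receives a higher weight than "iphone" because "iphone" also appears in another query and is therefore less distinctive.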
Framework for Machine Learning Problems
Two years ago, while working on a PhD that was later discontinued, the speaker developed a systematic framework for approaching any machine learning problem. The framework begins with dataset splitting into training, validation, and test sets. Next, data type is identified as numerical, categorical, or text. Categorical data undergoes conversion to labels or binarization through one-hot encoding. The approach involves stacking different models and applying decomposition techniques like PCA or LDA. Every combination is evaluated to select the best performing model. While detailed in a published paper, the speaker has not released the framework publicly and welcomes collaboration for reimplementing it from scratch.
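The first steps of the framework, splitting the data and encoding categorical values, can be sketched as follows. The split fractions, the random seed, and the helper names are illustrative assumptions rather than details from the paper.

```python
import random


def split_dataset(rows, valid_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_valid = int(len(rows) * valid_frac)
    test = rows[:n_test]
    valid = rows[n_test:n_test + n_valid]
    train = rows[n_test + n_valid:]
    return train, valid, test


def label_encode(values):
    """Map each category to an integer code (categories sorted alphabetically)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot(codes, n_classes):
    """Binarize integer codes into one-hot rows."""
    return [[1 if code == k else 0 for k in range(n_classes)] for code in codes]


train, valid, test = split_dataset(range(10))
codes, mapping = label_encode(["red", "blue", "red"])
encoded = one_hot(codes, len(mapping))
print(len(train), len(valid), len(test))  # 6 2 2
print(encoded)                            # [[0, 1], [1, 0], [0, 1]]
```

In practice the encoder mapping would be built from the training split only and then reused for validation and test data.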
Neural Network Optimization
Building on the general machine learning framework, the speaker developed a similar approach specifically for optimizing neural networks, which was presented at an ICML workshop. Selecting appropriate neural network architecture is not trivial and requires systematic evaluation. The framework provides a structured method for architecture search and hyperparameter tuning.
Time Series with LSTMs and 1D CNNs
LSTMs, introduced by Hochreiter and Schmidhuber in 1997, provide an effective way to feed time series data into neural networks. Two approaches exist: creating a separate network for each time step, or using a windowed approach. The windowed implementation uses two LSTM layers. The speaker also incorporated one-dimensional CNNs using temporal convolution, where the input signal X is convolved with kernel H to produce output Y. This implementation is straightforward in modern frameworks.
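Both ideas are small enough to write out directly. The sketch below builds sliding windows from a series (each window paired with the value that follows it) and computes a "valid" one-dimensional convolution of a signal x with a kernel h. The helper names and the toy numbers are assumptions for illustration.

```python
def sliding_windows(series, size):
    """Pair each window of `size` values with the value that follows it."""
    return [
        (series[i:i + size], series[i + size])
        for i in range(len(series) - size)
    ]


def conv1d_valid(x, h):
    """Discrete convolution of x with h, keeping only fully overlapping outputs.

    y[n] = sum_k h[k] * x[n + K - 1 - k], where K = len(h).
    """
    k_len = len(h)
    return [
        sum(h[k] * x[n + k_len - 1 - k] for k in range(k_len))
        for n in range(len(x) - k_len + 1)
    ]


windows = sliding_windows([1, 2, 3, 4, 5], size=3)
y = conv1d_valid([1, 2, 3, 4], [1, 0, -1])
print(windows)  # [([1, 2, 3], 4), ([2, 3, 4], 5)]
print(y)        # [2, 2]
```

Note that convolution layers in deep learning frameworks typically apply the kernel without flipping it (cross-correlation); since the kernel is learned, the distinction does not matter in practice.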
Embeddings and Time-Distributed Layers
Embedding layers convert word indices into dense vector representations and appear as the first layer of the network. The speaker used the time-distributed dense layer available in Keras, which applies the same dense layer to every temporal slice of the input. The base model was an SNLI model, utilizing pre-trained GloVe embeddings for initialization. This architecture provided a foundation for building more complex models tailored to specific tasks.
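Conceptually, an embedding layer is a table lookup and a time-distributed dense layer is the same dense layer applied at every time step. The plain-Python sketch below shows both with toy weights; it illustrates the idea and is not the Keras implementation.

```python
def embed(indices, embedding_matrix):
    """Look up one embedding row per word index."""
    return [embedding_matrix[i] for i in indices]


def dense(vector, weights, bias):
    """Linear dense layer: one weight row and one bias value per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer (shared weights) to every time step."""
    return [dense(step, weights, bias) for step in sequence]


# Toy values: a vocabulary of 3 indices with 2-dimensional embeddings.
embedding_matrix = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
weights, bias = [[1.0, 1.0]], [0.5]  # a single output unit

sequence = embed([1, 2, 1], embedding_matrix)
outputs = time_distributed_dense(sequence, weights, bias)
print(outputs)  # [[3.5], [7.5], [3.5]]
```

The first and third outputs are equal because the same index passes through the same shared weights at both positions.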
Text Preprocessing for Deep Networks
Before training deep networks on text, the text must be tokenized and converted to sequences of word indices. The speaker loaded the GloVe embedding file into a dictionary, a simple procedure, and then built the embedding matrix that Keras requires. The final model architecture is substantial and is best understood by examining its components. Models 1 and 2 share similar structures with translation layers using non-trainable weights initialized from the GloVe embedding matrix, replacing the averaging used in SNLI with a sum in lambda layers for dimensional consistency.
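The preprocessing steps can be sketched without any framework. The version below assumes whitespace tokenization, left padding with index 0, and GloVe's plain-text format of one word followed by its vector values per line. The helper names are hypothetical, and the two "GloVe" lines are toy data.

```python
def build_vocab(texts):
    """Assign each token an index starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def to_padded_sequences(texts, vocab, max_len):
    """Convert texts to index sequences, left-padded with zeros to max_len."""
    sequences = []
    for text in texts:
        ids = [vocab[t] for t in text.lower().split() if t in vocab][:max_len]
        sequences.append([0] * (max_len - len(ids)) + ids)
    return sequences


def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


def build_embedding_matrix(vocab, embeddings, dim):
    """Row i holds the vector for word index i; unknown words stay at zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, index in vocab.items():
        if word in embeddings:
            matrix[index] = embeddings[word]
    return matrix


texts = ["the white house", "white house tours"]
vocab = build_vocab(texts)
sequences = to_padded_sequences(texts, vocab, max_len=4)
glove = load_glove(["the 0.1 0.2", "house 0.3 0.4"])  # toy stand-in for a file
matrix = build_embedding_matrix(vocab, glove, dim=2)
print(sequences)  # [[0, 1, 2, 3], [0, 2, 3, 4]]
```

Because `load_glove` accepts any iterable of lines, an open file handle for a real embedding file can be passed in directly.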
Chatbot Implementation
The final use case demonstrates a multilingual chatbot built with a single model supporting multiple languages. The system includes misspelling correction and synonym handling. It processes four intents, determining which intent category applies to a given question and selecting an appropriate random response message to show the user. The chatbot integrates with APIs for external information retrieval. An introductory video was created but not shown due to time constraints. The implementation showcases how deep learning enables sophisticated natural language understanding across language barriers with unified architectures.
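The control flow of such a bot, normalizing the input, picking an intent, and replying with a random message for that intent, can be sketched as below. The intents, synonyms, and responses are invented, and the keyword-overlap classifier is a deliberately simple stand-in for the deep learning model described in the talk.

```python
import random

# Hypothetical synonym table, intents, and canned responses.
SYNONYMS = {"hi": "hello", "hey": "hello", "cost": "price", "bye": "goodbye"}
INTENT_KEYWORDS = {
    "greeting": {"hello"},
    "pricing": {"price"},
    "opening_hours": {"open", "hours"},
    "farewell": {"goodbye"},
}
RESPONSES = {
    "greeting": ["Hello!", "Hi, how can I help?"],
    "pricing": ["Our prices are listed on the website."],
    "opening_hours": ["We are open 9 to 5.", "Opening hours are 9 to 5."],
    "farewell": ["Goodbye!", "See you soon."],
    "fallback": ["Sorry, I did not understand that."],
}


def normalize(text):
    """Lowercase, strip basic punctuation, and map synonyms to canonical words."""
    tokens = (t.strip("?!.,") for t in text.lower().split())
    return [SYNONYMS.get(t, t) for t in tokens]


def classify_intent(text):
    """Stand-in classifier: pick the intent with the most keyword overlap."""
    tokens = set(normalize(text))
    scores = {name: len(tokens & kws) for name, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "fallback"


def respond(text, rng=random):
    """Return a random response for the detected intent."""
    return rng.choice(RESPONSES[classify_intent(text)])


print(classify_intent("Hey there"))            # greeting
print(classify_intent("What does it cost?"))   # pricing
print(respond("bye", random.Random(0)))
```

In a real system only `classify_intent` would change, with the trained model taking the place of the keyword lookup.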