Dadada 2017 - Word Embeddings
Exploring word embeddings on German car data reveals both promise and limitations: data volume and preprocessing significantly affect quality, and the embeddings capture some manufacturer relationships and model similarities, but unexplained clustering failures persist for premium brands.
Abstract
This talk explores the application of word embeddings to German automotive data, examining whether raw word embeddings can effectively capture complex semantic relationships in car-related natural language. The speaker from Die ProduktMacher presents experimental results testing word2vec on multiple German car review datasets, analyzing how well embeddings capture similarities between car manufacturers and models. The research addresses fundamental questions about data preprocessing, hyperparameter optimization, and quality assessment when working with domain-specific German language data. Key findings reveal that data volume significantly impacts embedding quality, preprocessing steps like stop word removal improve results, and while embeddings successfully capture some manufacturer relationships and model similarities, certain premium brands and specific models show unexpected clustering behavior. The experiments demonstrate both the potential and limitations of word embeddings for automotive domain applications, highlighting the importance of careful data preparation even when using modern deep learning approaches.
About the Speaker
Fabian Dill represents Die ProduktMacher, a company passionate about creating innovative digital products. The team strongly believes that building great digital products in the near future requires incorporating data-driven approaches, machine learning, and artificial intelligence. This belief has driven their intensified research in these areas. Rather than focusing on deep learning applications directly, the speaker concentrates on establishing solid fundamentals for data preparation and feature engineering, exploring how to optimally feed data into neural networks before the actual deep learning process begins.
Transcript Summary
Motivation for Word Embeddings Research
Word embeddings have become the standard and fundamental basis for working with natural language data in data-driven products. While deep learning receives significant attention, the quality of input data and proper feature representation remain critical. The research focuses on understanding what can be extracted from raw word embeddings, assessing their inherent power, and determining whether additional preprocessing steps can improve their quality before feeding them into downstream applications.
Understanding Word Embeddings
Traditional text representation used one-hot encoding, where each word in the vocabulary becomes a vector filled with zeros except for a single position marking that word. This approach creates sparse data where every word maintains equal distance from every other word, failing to capture semantic relationships between words.
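A minimal numpy sketch of this one-hot scheme (the five-word toy vocabulary is invented for illustration):

```python
import numpy as np

# Hypothetical five-word vocabulary; the index assignment is arbitrary.
vocab = {"bmw": 0, "audi": 1, "golf": 2, "limousine": 3, "motor": 4}

def one_hot(word: str) -> np.ndarray:
    """Encode a word as a vector of zeros with a single 1 at its index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

# Every pair of distinct words sits at the same Euclidean distance (sqrt(2)),
# so one-hot vectors carry no notion of semantic similarity.
d1 = np.linalg.norm(one_hot("bmw") - one_hot("audi"))
d2 = np.linalg.norm(one_hot("bmw") - one_hot("motor"))
print(d1, d2)  # identical
```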
Word embeddings solve this problem by capturing semantic meaning. Two primary training methods exist:
- CBOW (Continuous Bag of Words): Uses the surrounding context words to predict the probability of the center word
- Skip-gram: Selects a focus word and predicts the probability of the surrounding words in its neighborhood
Both methods utilize neural networks where the hidden layer serves as the word embedding. This hidden layer is typically much smaller than the vocabulary size, creating a compressed representation that captures semantic relationships. The famous example demonstrates how embeddings can derive meaning dimensions - for instance, royalty as a derived concept where king, queen, and princess score high values while woman scores lower.
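The vector arithmetic behind that analogy can be illustrated with hand-crafted toy vectors (the two dimensions, roughly "royalty" and "femaleness", and their values are invented for illustration; real embeddings have hundreds of individually uninterpretable dimensions):

```python
import numpy as np

# Toy 2-d embeddings: dimension 0 ≈ "royalty", dimension 1 ≈ "femaleness".
emb = {
    "king":  np.array([0.95, 0.10]),
    "queen": np.array([0.95, 0.90]),
    "man":   np.array([0.05, 0.10]),
    "woman": np.array([0.05, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```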
Car Data as Complex Domain
Automotive data presents particular complexity for natural language processing due to multiple semantic layers:
- Emotional layer: Brand perception, status associations, and mental imagery
- Technical layer: Model generations, facelifts, functionality, and body types
- Detailed specifications: Equipment variants and model lines
For example, Audi represents the make, Audi A4 represents the model, and Audi A4 Avant specifies the station wagon variant. The hypothesis suggests word embeddings should capture these complex hierarchical relationships.
Dataset and Methodology
The research uses German language data from car news and reviews published before May 2018. Three different data sources (datasets A, B, and C) were collected to represent varied review styles and perspectives, with dataset D combining all three sources. The speaker trained custom German word embeddings using Gensim's word2vec implementation to test domain-specific performance.
To assess quality, the speaker extracted word vectors for the 57 most frequent car manufacturers and created PCA visualizations for initial analysis.
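The 2-d projection step might look like this sketch (random vectors stand in for the real manufacturer embeddings, and only a subset of the 57 makes is shown):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
makes = ["bmw", "audi", "mercedes", "skoda", "peugeot"]  # subset of the 57
vectors = rng.normal(size=(len(makes), 100))  # stand-in for model.wv[make]

# Project the 100-d vectors onto their first two principal components.
coords = PCA(n_components=2).fit_transform(vectors)
for make, (x, y) in zip(makes, coords):
    print(f"{make}: ({x:.2f}, {y:.2f})")
```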
Initial Results and Observations
Dataset A showed extreme outliers for Volkswagen and BMW, with luxury brands like Mercedes, Maserati, and Bentley clustering reasonably well, but unexpected placements like Alfa Romeo appearing among luxury brands despite being Italian mid-range.
Dataset B displayed extreme BMW outliers again, with most manufacturers clustering at a single point, though micro-level relationships showed some reasonable patterns like decreasing price from Lexus to Mini, Honda, and Nissan.
Dataset C presented less extreme outliers but problematic placements, such as Vauxhall positioned near Ferrari and Mercedes-Benz. However, Maserati and Aston Martin clustered appropriately.
Dataset D (combined data) showed the largest BMW outlier. Some clusters appeared reasonable, but expected relationships were missing - for example, Peugeot and Renault should logically position near Citroën, but Citroën appeared elsewhere.
The persistent BMW outlier phenomenon remained unexplained despite various hypotheses.
Quality Assessment Challenge
A fundamental challenge emerged: car manufacturer similarity lacks objective training data. Whether one make resembles another relies primarily on intuition rather than measurable metrics. To address this, the speaker manually created a similarity matrix for all 57 manufacturers based on expert judgment, providing a benchmark for comparison.
Extracting the 57 manufacturer vectors from each dataset's word embeddings and computing similarity matrices revealed significant differences between datasets. Word embeddings generally assigned higher similarity values than the manual benchmark, and combining all data (dataset D) produced results closest to the benchmark.
Computing the sum of squared errors between embeddings and the benchmark matrix confirmed that data volume directly correlates with result quality - larger datasets with more unique words produced better embeddings.
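The comparison can be sketched in numpy: build a cosine-similarity matrix from the 57 manufacturer vectors and take the sum of squared errors against the manual benchmark (random data stands in for both the embeddings and the expert matrix here):

```python
import numpy as np

rng = np.random.default_rng(1)
n_makes = 57
vectors = rng.normal(size=(n_makes, 100))  # stand-in manufacturer embeddings

# Cosine-similarity matrix of the embedding vectors.
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = normed @ normed.T

# Stand-in for the manually created expert benchmark matrix.
benchmark = rng.uniform(0.0, 1.0, size=(n_makes, n_makes))
benchmark = (benchmark + benchmark.T) / 2  # similarity is symmetric
np.fill_diagonal(benchmark, 1.0)

sse = float(np.sum((sim - benchmark) ** 2))  # lower = closer to benchmark
print(sse)
```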
Impact of Data Preprocessing
Testing preprocessing effects on the two best-performing datasets showed that removing stop words using NLTK's German stop word list consistently improved quality compared to the manual benchmark. This finding challenges the notion that raw data can simply be fed into deep learning networks without preprocessing, demonstrating that understanding and preparing data remains essential even with modern approaches.
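The filtering step is straightforward; this sketch uses a tiny hand-picked subset of German stop words as a stand-in for NLTK's full list (`nltk.corpus.stopwords.words("german")`, which requires a one-time `nltk.download("stopwords")`):

```python
# Tiny stand-in for NLTK's German stop word list.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "mit"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t.lower() not in GERMAN_STOPWORDS]

tokens = ["der", "bmw", "x3", "ist", "ein", "suv", "mit", "allradantrieb"]
print(remove_stopwords(tokens))  # ['bmw', 'x3', 'suv', 'allradantrieb']
```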
Hyperparameter Optimization
Experiments with vector dimensionality revealed that optimal results occurred with hidden layer dimensions between 100 and 200 for this specific task and dataset, though the speaker could not explain why this range performed best.
Manufacturer Similarity Findings
Examining specific manufacturer similarities revealed mixed results:
Škoda showed reasonable similar manufacturers: SEAT (correctly identified as fellow Volkswagen Group brand), plus various mid-range international brands, indicating appropriate clustering.
Peugeot successfully identified the French cluster with Renault and Citroën as most similar, followed by German-perspective alternatives and global budget players like Mitsubishi and Dodge.
Mitsubishi revealed an unexpected learning: high similarity to Citroën initially seemed wrong, but research uncovered that Mitsubishi and Citroën maintain close cooperation with a joint manufacturing plant in Russia - a relationship the embeddings captured that the speaker hadn't known.
Supercars clustered appropriately: Lamborghini showed high similarity to Ferrari, Koenigsegg, Maserati, Lotus, and Bugatti.
BMW (German premium brand) presented problematic results with the highest similarity of only 0.4 to Jaguar - a questionable match. Critically, the embeddings completely missed Audi and Mercedes-Benz as similar premium German brands, with no explanation for this failure.
Smart produced the most puzzling results, showing high similarity to Buick, GMC, and Chevrolet - associations that appear essentially random and may indicate insufficient training data.
Model-Level Analysis
Moving to the model level (e.g., BMW X3, Ford Fiesta) using raw data primarily captured manufacturer clusters rather than model similarities. This occurred because word2vec learns from context words - BMW typically appears near X3 in text, making the manufacturer the strongest similarity signal.
Preprocessing addressed this by replacing standalone manufacturer names with a combined manufacturer-model token while removing redundant manufacturer mentions. This preprocessing significantly reduced manufacturer clustering effects.
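A sketch of such preprocessing, assuming simple regex rules (the make/model lists and the fused token format are illustrative, not the speaker's exact implementation):

```python
import re

# Hypothetical make -> model lists; the real pipeline would cover all makes.
MODELS = {"bmw": ["x3", "x5"], "ford": ["fiesta", "focus"]}

def merge_make_model(text: str) -> str:
    """Fuse 'bmw x3' into 'bmw_x3', then drop standalone make mentions."""
    for make, models in MODELS.items():
        for model in models:
            text = re.sub(rf"\b{make}\s+{model}\b", f"{make}_{model}", text)
        text = re.sub(rf"\b{make}\b", "", text)  # remove leftover make tokens
    return re.sub(r"\s+", " ", text).strip()

print(merge_make_model("der bmw x3 ist wie der bmw x5 ein bmw suv"))
# der bmw_x3 ist wie der bmw_x5 ein suv
```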
Results improved substantially:
- Renault Clio correctly matched with Škoda Fabia and Ford Fiesta as similar small cars
- Ford Fiesta identified Renault Clio, Škoda Fabia, and Ford Focus as appropriate alternatives
- Nissan Qashqai (SUV) matched with Kia Sportage and Ford Kuga as similar mid-range SUVs
The analysis focused only on single-word model names to avoid additional preprocessing complexity for models with alphanumeric designations like Mercedes models.
Key Takeaways
Word embeddings can capture meaningful relationships in automotive domain data, but success requires:
- Sufficient data volume (more unique words improve results)
- Thoughtful preprocessing (stop word removal helps)
- Domain-specific adjustments (manufacturer-model tokenization)
- Careful hyperparameter tuning (optimal dimensionality varies by task)
- Realistic expectations (some relationships remain unexplained or incorrect)
The research demonstrates that while word embeddings show promise for automotive applications, they work well for some manufacturers and models while failing for others, suggesting additional preprocessing or data augmentation may be necessary for production use.