Topic Modeling Amazon Reviews

Adapted from Biel 2011

I found Professor Julian McAuley’s work at UCSD when I was searching for academic work identifying the ontology and utility of products on Amazon. Professor McAuley and his students have accomplished impressive work inferring networks of substitutable and complementary items. They constructed a browseable product graph of related products and discovered topics or ‘microcategories’ that are associated with product relationships to infer networks of substitutable and complementary products. Much of this work utilizes topic modeling, and as I’ve never applied it in academia or work, this blog will be a practical intro to Latent Dirichlet Allocation (LDA) through code.

More broadly what can we do with and what do we need to know about LDA?

  • It is an Unsupervised Learning Technique that assumes documents are produced from a mixture of topics
  • LDA extracts key topics and themes from a large corpus of text
  • Each topic is a ordered list of representative words (Order is based on importance of word to a Topic)
  • LDA describes each document in the corpus based on allocation to the extracted topics
  • Many domain specific methods to create training datasets
  • It is easy to use for exploratory analysis

We’ll be using a subset (reviews_Automotive_5.json.gz) of the 142.8 million reviews spanning May 1996 – July 2014 that Julian and his team have compiled and provided in a very convenient manner on their site.