Following Friday’s news of yhat’s ggplot port (which I hope they promptly rename to avoid search-engine conflation with the many other ggplot variants), I thought it’d be fun to explore the large Stack Overflow dataset Facebook provided (9.7 GB) for their latest Kaggle competition. I found that the ggplot port is off to a great start and will only get better as the missing core features (e.g. ..count.. and ..density..) are filled in. For the specific visualization I wanted, though, I skipped it and used matplotlib and R’s ggplot2 via rmagic/rpy2.
Below is work I produced in IPython with Pandas, NumPy, and R. For data this large, I’ve either used Pandas’ HDF5 capabilities or imported the data into a local MongoDB instance and queried it from Python. With that in place, I can bring in NLTK features such as sentence and word tokenization, part-of-speech tagging, and text classification. NLTK supports classification algorithms including Naive Bayes, Maximum Entropy, Logistic Regression, Decision Tree, and SVM. So far I’ve only used NLTK on one project involving a client’s presence on Twitter (substantially smaller data), and I’d love to see how it handles larger datasets.
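To make the HDF5 workflow concrete, here’s a minimal sketch (assuming PyTables is installed; the file name, key, and columns are made-up stand-ins for the competition data). Storing in "table" format lets you read the file back in chunks instead of loading everything into memory at once:

```python
import pandas as pd

# Tiny stand-in for the 9.7 GB Stack Overflow dump.
posts = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["How do I sort a dict?", "Segfault in C", "ggplot2 facets"],
})

# "table" format supports chunked reads and where-queries later on.
posts.to_hdf("posts.h5", key="posts", format="table", mode="w")

# Iterate over the store in chunks rather than pulling it all into RAM.
with pd.HDFStore("posts.h5") as store:
    n_rows = sum(len(chunk) for chunk in store.select("posts", chunksize=2))

print(n_rows)  # 3
```

In a real run you’d convert the competition CSV to HDF5 once, then do all subsequent passes with chunked selects like the one above.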
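As a sketch of the NLTK classification step, here’s a toy Naive Bayes classifier over bag-of-words features (assuming nltk is installed; the training texts and pos/neg labels are hypothetical stand-ins, and I use a plain split() instead of nltk.word_tokenize to avoid the punkt model download):

```python
import nltk

def features(text):
    # Bag-of-words presence features; nltk.word_tokenize would be the
    # usual choice but requires downloading the punkt tokenizer model.
    return {word: True for word in text.lower().split()}

# Hypothetical labeled examples standing in for real post text.
train_set = [
    (features("great answer very helpful"), "pos"),
    (features("excellent clear explanation"), "pos"),
    (features("terrible answer totally wrong"), "neg"),
    (features("bad unclear and wrong"), "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
label = classifier.classify(features("great helpful explanation"))
print(label)  # pos
```

Swapping in MaxentClassifier or the scikit-learn wrappers follows the same train/classify pattern, which is what makes it tempting to try at Stack Overflow scale.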
On the topic of sentiment analysis, check out Stanford’s incredible new API: