5 Takeaways from Max Kuhn’s “R for Machine Learning” Workshop

I recently had the opportunity to attend “R for Machine Learning,” one of the pre-conference workshops held in conjunction with the Predictive Analytics World conference in New York City. The instructor for this excellent hands-on workshop was none other than Max Kuhn, Data Scientist at RStudio and author of the “caret” R package (among others) as well as the book Applied Predictive Modeling.

During this day-long session, participants were exposed to a wide variety of material covering everything from data manipulation to modeling techniques to making R run more efficiently. Below is a quick snapshot of my 5 key takeaways from this excellent workshop:

  1. R as “Quarterback”: R’s open source architecture has allowed companies like RStudio to build an environment around the core software that ties into & takes advantage of other great products in order to create a more unified analytics platform to work from. That’s on top of the excellent features found in RStudio, the user-friendly integrated development environment (IDE) for R.
    • R integrates nicely with SPSS and Python software platforms, among others.
    • RStudio has positioned itself as a home base within this environment: they’ve hired several well-known developers (notably Kuhn, Winston Chang, and Hadley Wickham) who have built tools to make R friendlier, easier, and more consistent to use. In particular, packages that adhere to the “Tidy Data” philosophy championed by Hadley Wickham, now affectionately known as the “tidyverse,” including “dplyr,” “ggplot2,” and “tidyr” (newer packages such as “recipes” follow the same design principles, while “caret” predates the tidyverse).
    • RStudio has also integrated several helpful tools for sharing and publishing work (R Markdown, Sweave), building interactive web applications (Shiny), and simplifying project management through their “Projects” menu.
  2. Max’s R package, “caret,” is an exceptionally powerful tool which makes model building in R faster, more consistent & generally more awesome (see the caret documentation for details):
    • Caret (short for Classification And REgression Training) provides a consistent syntax for running over 230 different types of models (previously, users had to write code specific to the syntax for each different model). It also provides the ability to run, re-run, and compare several models quickly & efficiently.
    • It just “knows” what type of model you’re working with & adjusts commands and output appropriately.
    • There are also a number of very good caret functions which simplify & speed up the modeling process. Examples include “createDataPartition,” which splits data using stratified random sampling – thereby preserving original class proportions within your file; “preProcess,” which allows users to perform several data manipulation tasks in-line (see below for more); and the ability to optimize model tuning parameters using a grid search, random hyperparameter searching, or adaptive resampling. And that’s just the beginning of caret’s core functionality!
  3. Parallel Processing: caret also allows processing tasks to be distributed across multiple CPU cores in parallel. It does this by linking up with the “doMC” package (Unix and macOS) or the “doParallel” package (Windows). This helpful feature can greatly improve processing times for computationally intensive modeling tasks.
  4. Working with Data: The workshop also highlighted several techniques for working with imbalanced data and pre-processing raw data prior to modeling:
    • Imbalanced:
      1. Down-sampling within caret’s “trainControl” function; SMOTE (a hybrid of up- and down-sampling that can yield better results than either technique alone)
      2. Using ROC curves to inform cut-off thresholds for classification problems (as opposed to defaulting to 0.5 all the time)
      3. An entire chapter of Kuhn’s book is devoted to the subject of imbalanced data – which we deal with constantly in fundraising analytics – so there’s a lot to unpack here. Max also hinted at future publications devoted to this topic, so stay tuned!
    • Pre-processing:
      1. Caret includes a single function (“preProcess”) from which users can call several different techniques for working with raw data to improve model performance. These are a huge time-saver & caret is smart enough to automatically handle things like performing these transformations in the correct order, skipping non-numeric data, etc. Some of my favorite pre-processing functions included in caret are:
        1. Creating dummy variables
        2. Centering & Scaling variables to normalize them
        3. “Power” transformations to normalize variables
        4. Finding & removing variables with low variance, or with high correlation, as well as outliers
        5. Several techniques to impute missing values (k-nearest neighbors, bagging, median, etc.)
  5. “xgboost” Package: This specific package for gradient boosted models was mentioned several times. I wasn’t familiar with it previously, but it was designed to be laser-focused on two things only: processing speed and model performance. After a bit of research, it’s clear xgboost has an excellent track record in the data science community, particularly in Kaggle competitions. I look forward to running some comparisons myself using xgboost in the coming weeks.
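To make the caret points above concrete, here is a minimal sketch of a typical workflow: a stratified split with “createDataPartition,” a cross-validated “train” call, and prediction on the hold-out set. The built-in iris data and the choice of a CART model (method = "rpart") are purely illustrative.

```r
library(caret)

# Stratified split: preserves the original class proportions of iris$Species
set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[in_train, ]
test_df  <- iris[-in_train, ]

# The same train() syntax works across caret's 230+ model types;
# swapping method = "rpart" for, say, "rf" is the only change needed.
fit <- train(Species ~ ., data = train_df,
             method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# predict() dispatches correctly because caret "knows" the model type
preds <- predict(fit, newdata = test_df)
confusionMatrix(preds, test_df$Species)
```

Re-running the comparison with a different method string is how caret makes trying several model families quick and consistent.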
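The parallel-processing point boils down to registering a backend before calling “train”; here is a minimal sketch using “doParallel” (the “doMC” setup on Unix/macOS is analogous, via registerDoMC()). The worker count is just an illustrative choice.

```r
library(doParallel)

# Register a cluster; caret's train() will then run its resampling
# loops in parallel with no change to the modeling code itself.
cl <- makeCluster(max(1, detectCores() - 1))
registerDoParallel(cl)

# ... run train() calls here ...

stopCluster(cl)   # release the workers when done
registerDoSEQ()   # fall back to sequential execution
```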
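Down-sampling inside resampling, as mentioned above, is a one-argument change in “trainControl.” A minimal sketch, using caret’s “twoClassSim” helper to simulate data (the intercept value is an arbitrary choice to make the minority class rare, and logistic regression stands in for any classifier):

```r
library(caret)

# Simulated imbalanced two-class data; a more negative intercept
# makes the first class rarer (illustrative value only)
set.seed(1)
df <- twoClassSim(1000, intercept = -13)
table(df$Class)

ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down",            # down-sample the majority class
                     classProbs = TRUE,            # needed for ROC-based metrics
                     summaryFunction = twoClassSummary)

fit <- train(Class ~ ., data = df,
             method = "glm", metric = "ROC", trControl = ctrl)
```

Because down-sampling happens inside each resample, the reported ROC estimates remain honest; the predicted class probabilities can then inform a cut-off other than 0.5.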
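The pre-processing techniques listed above can be stacked in a single “preProcess” call; caret applies them in the correct order and skips non-numeric columns automatically. A sketch on the numeric columns of iris:

```r
library(caret)

# One call estimates all requested transformations from the training data:
# Yeo-Johnson power transform, centering, scaling, and removal of
# near-zero-variance ("nzv") and highly correlated ("corr") predictors.
pp <- preProcess(iris[, 1:4],
                 method = c("YeoJohnson", "center", "scale", "nzv", "corr"))

# The fitted object is then applied to training and new data alike,
# guaranteeing identical transformations everywhere.
train_proc <- predict(pp, iris[, 1:4])
```

Imputation works the same way, e.g. adding "knnImpute", "bagImpute", or "medianImpute" to the method vector.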
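Conveniently, xgboost plugs straight into the caret workflow as method = "xgbTree", so the comparisons mentioned above need no new syntax. A sketch (tuneLength and the iris data are illustrative choices; tuneLength asks caret to build its own small grid over xgboost’s tuning parameters):

```r
library(caret)

set.seed(7)
fit_xgb <- train(Species ~ ., data = iris,
                 method = "xgbTree",    # caret's wrapper around xgboost
                 tuneLength = 2,        # small automatic tuning grid
                 trControl = trainControl(method = "cv", number = 3))

fit_xgb$bestTune    # the winning hyperparameter combination
```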

Overall, this workshop was a great learning opportunity and built on many key analytics concepts I was already familiar with to one degree or another. Even though I’ve had a copy of Kuhn’s book on my desk since it came out, there is still so much to learn about predictive modeling and I found a lot of value in this workshop. I highly recommend this kind of hands-on workshop to anyone – regardless of industry or prior experience in predictive modeling.

If you’d like to learn more about fundraising analytics, the R statistical software, or what predictive modeling can do for your organization, check out the BWF Insight website or contact us today.
