NBA Playoff Predictions Update 4 (5-3)

This is update 4 to my original post about predicting the NBA playoffs with R. With the Thunder beating the Spurs and the Heat losing to the Celtics, the algorithm went 1-1 on its predictions, making it 5-3 so far. I have been posting for some time about incorporating more data into the models, and I finally got around to it. It is a common truism in data science that more (high-quality) data... »
Vik Paruchuri on r, nba, playoffs, and machine learning

NBA Playoff Predictions Update 3 (4-2)

This is my third update to my original post on predicting the NBA playoffs with an algorithm. Here are updates 1 and 2. The algorithm correctly predicted a Boston win, but missed on the Spurs/Thunder game, so it is currently 4-2. I haven’t had any time to update the algorithm yet, so unfortunately I can only give you predictions for the next games: a Miami win and an Oklahoma City win. »
Vik Paruchuri on r, basketball, playoffs, nba, miami, heat, thunder, oklahoma city, boston, and celtics

NBA Playoff Predictions Update 2 and Results (3-1)

This is my second follow-up to my two previous posts on predicting NBA games with an algorithm, and my first update to the algorithm. The algorithm’s record is now 3-1, as it correctly predicted Boston and Oklahoma City as winners of their past games. Upcoming things to do: sadly, I have been a bit busy and have not been able to do any work on the algorithm in the past couple of days.... »
Vik Paruchuri on thunder, algorithm, oklahoma city, celtics, boston, heat, r, basketball, miami, nba, spurs, san antonio, predicting, and predictive

Predicting NBA Playoff Games - Results and Update 1

Game Results I recently made a post about developing an algorithm to predict the NBA playoffs, and I concluded with two predictions. Miami beat the Celtics to make my algorithm 1-0, but it fell to 1-1 when the Thunder beat the Spurs. So, we are now at .500. Considering that the algorithm was about 61.5% accurate over the whole season, this is to be expected. I made some improvements to... »
Vik Paruchuri on machine learning, basketball, predictive analytics, predictions, nba, ggplot, and r

Predicting the NBA Finals with R

This is the initial post about the algorithm. See updates 1, 2, and 3 for more. The algorithm is currently 4-2 in the playoffs! Overview I was struck by Martin O’Leary’s recent post on predicting the Eurovision finals, which led me to decide to try predicting NBA games with mathematical models. As the finals are ongoing, this is quite a timely decision! You can read through everything or scroll to the end... »
Vik Paruchuri on predict, basketball, predictions, nba, data, analysis, finals, ggplot, statistics, regression, and r

Mapping US Radiation Levels in R

I have posted previously about the open data available on Socrata, and I was looking at the site again today when I stumbled upon a listing of levels of various radioactive isotopes by US city and state. The data is available at /w9fb-tgv6. You will need to click export, and then download it as a csv. I was struck by how nicely formatted the data was for analysis. I... »
Vik Paruchuri on zips, zip codes, coordinates, mapping, radiation, r, and plotting

Loading and/or Installing Packages Programmatically

In R, the traditional way of loading packages can require several lines of code just to get everything loaded. These lines error out if a package is not installed, and can be hard to maintain, particularly during deployment. Fortunately, we can write a function in R that will automatically load our packages for us. In this post, I will walk you through conceiving... »
Vik Paruchuri on packages, deployment, production, install, load, and r
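As a rough sketch of the pattern this post describes, a loader function can check whether each package is installed, install it if not, and then attach it. The function name `load_or_install` is my own placeholder, not the name used in the post:

```r
# Hypothetical helper: install any missing packages, then attach them all.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() checks availability without attaching the package
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() take a package name as a string
    library(pkg, character.only = TRUE)
  }
}

# Base packages are always installed, so this call only attaches them.
load_or_install(c("stats", "utils"))
```

A single call like this replaces a stack of `library()` lines and will not fail on a fresh machine where some packages are missing.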

Monitoring Progress Inside a Foreach Loop

The foreach package for R is excellent, and makes it easy to run code in parallel. One problem with foreach is that it creates new Rscript instances for each iteration of the loop, which prevents status messages from being logged to the console output. This is particularly frustrating during long-running tasks, when we are often unsure how much longer we need to wait, or even whether the code is doing what it is... »
Vik Paruchuri on foreach, randomforest, dosnow, and r
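One way to get progress back from the workers, as this post's tags suggest, is the doSNOW backend, which accepts a progress callback via `.options.snow`. A minimal sketch (the loop body here is just a stand-in for real work):

```r
library(foreach)
library(doSNOW)

cl <- makeCluster(2)
registerDoSNOW(cl)

# A text progress bar updated by doSNOW's progress hook after each iteration.
pb <- txtProgressBar(max = 10, style = 3)
opts <- list(progress = function(n) setTxtProgressBar(pb, n))

results <- foreach(i = 1:10, .combine = c,
                   .options.snow = opts) %dopar% {
  Sys.sleep(0.1)  # stand-in for a long-running computation
  i^2
}

close(pb)
stopCluster(cl)
```

The progress function runs in the master session, so the bar updates on the console even though each iteration executes in a separate worker process.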

Using LaTeX, R, and Sweave to Create Reports in Windows

LaTeX is a typesetting system that can easily be used to create reports and scientific articles, and has excellent formatting options for displaying code and mathematical formulas. Sweave is a package in base R that can execute R code embedded in LaTeX files and display the output. This can be used to generate reports and quickly fix errors when needed. There are some barriers to entry with LaTeX that seem much steeper than they actually... »
Vik Paruchuri on windows, latex, sweave, and r
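A minimal sketch of the workflow: write a tiny `.Rnw` file containing a LaTeX document with an embedded R code chunk, then run `Sweave()` (which ships with base R's utils package) to produce a `.tex` file. The file name `report.Rnw` is just an example:

```r
# Write a minimal Sweave source file: LaTeX with one R code chunk
# delimited by <<>>= and @.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<summary, echo=TRUE>>=",
  "summary(rnorm(100))",
  "@",
  "\\end{document}"
)
writeLines(rnw, "report.Rnw")

# Execute the embedded R code and weave the output into report.tex.
Sweave("report.Rnw")
# On Windows, compile report.tex to PDF with a LaTeX distribution
# such as MiKTeX:  pdflatex report.tex
```

Because the R code is re-executed every time the report is built, fixing an error is just a matter of editing the chunk and re-running `Sweave()`.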

Parallel R Model Prediction Building and Analytics

Modifying R code to run in parallel can lead to huge performance gains. Although a significant amount of code can easily be run in parallel, there are some learning techniques, such as the Support Vector Machine, that cannot be easily parallelized. However, there is an often overlooked way to speed up these and other models. It involves executing the code that generates predictions and other analytics in parallel, instead of executing the model building phase... »
Vik Paruchuri on foreach, loop, parallel, and r
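A sketch of the idea: fit the model once (serially), then split the rows to be scored into chunks and run `predict()` on each chunk in parallel. The `mtcars` data and the `lm` formula are only placeholder choices:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Model building stays serial; only prediction is parallelized.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Split the rows to score into one chunk per worker.
chunks <- split(mtcars, rep(1:2, length.out = nrow(mtcars)))

# Score each chunk in parallel and combine the predictions.
preds <- foreach(ch = chunks, .combine = c) %dopar% predict(fit, newdata = ch)

stopCluster(cl)
```

This works for models like SVMs whose training phase is hard to parallelize, because prediction on disjoint chunks of rows is embarrassingly parallel.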

Analyzing US Government Contract Awards in R

As I was exploring open data sources, I came across USAspending.gov. This site contains information on US government contract awards and other disbursements, such as grants and loans. In this post, we will look at data on contracts awarded in the state of Maryland in fiscal year 2011, which is available by selecting “Maryland” as the state where the contract was received and awarded here. I will use Maryland as a proxy for... »
Vik Paruchuri on data analysis, government, spending, and r

R Regression Diagnostics Part 1

Linear regression can be a fast and powerful tool to model complex phenomena. However, it makes several assumptions about your data, and it quickly fails when these assumptions, such as a linear relationship between the predictors and the dependent variable, are violated. In this post, I will introduce some diagnostics that you can perform to ensure that your regression does not violate these basic assumptions. To begin with, I highly suggest... »
Vik Paruchuri on regression, diagnostics, and r
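As a quick sketch of standard linear regression diagnostics, base R's `plot()` method for `lm` objects produces the usual four diagnostic plots; the built-in `cars` data is a stand-in for real data here:

```r
# Fit a simple linear model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)

# The four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit)

# A numeric check to pair with the Q-Q plot: a rough test of the
# normality-of-residuals assumption.
res <- residuals(fit)
shapiro.test(res)
```

Curvature in the residuals-vs-fitted panel is a common sign that the linearity assumption mentioned above is being violated.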

Analyzing Federal Bailout Recipients in R

I was searching for open data recently, and stumbled on Socrata. Socrata has a lot of interesting data sets, and while I was browsing around, I found a data set on federal bailout recipients. Here is the data set. However, data sets on Socrata are not always the most recent versions, so I followed a link to the data source at Propublica, where I was able to find a data set that was last updated... »
Vik Paruchuri on finance, bailout, banking, and r

Intro to Ensemble Learning in R

Introduction This post incorporates parts of yesterday’s post about bagging. If you are unfamiliar with bagging, I suggest that you read it before continuing with this article. I would like to give a basic overview of ensemble learning. Ensemble learning involves combining multiple predictions derived by different techniques in order to create a stronger overall prediction. For example, the predictions of a random forest, a support vector machine, and a simple linear model may be... »
Vik Paruchuri on machine learning and r
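The random forest / SVM / linear model combination mentioned in the post can be sketched as a simple averaging ensemble. This assumes the randomForest and e1071 packages are installed; `mtcars` and the train/test split are only convenient stand-ins:

```r
library(randomForest)
library(e1071)

train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Three models fit with different techniques.
rf_fit  <- randomForest(mpg ~ ., data = train)
svm_fit <- svm(mpg ~ ., data = train)
lm_fit  <- lm(mpg ~ wt + hp, data = train)

# Combine the three sets of predictions into one ensemble prediction
# by simple averaging.
ensemble <- (predict(rf_fit, test) +
             predict(svm_fit, test) +
             predict(lm_fit, test)) / 3
```

Equal weighting is the simplest choice; weighting each model by its validation accuracy is a natural refinement.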

Improve Predictive Performance in R with Bagging

Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples (with replacement) from your training data set, and using each of these samples to construct a separate model and a separate set of predictions for your test set. These predictions are then averaged to create a (hopefully more accurate) final prediction value. One can quickly intuit that this technique will be more useful when the predictors... »
Vik Paruchuri on r
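The sample-with-replacement, fit, and average steps described above can be sketched in a few lines of base R; `mtcars`, the `lm` formula, and the bag count are placeholder choices:

```r
set.seed(1)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

n_bags <- 25
# Each column of preds holds one bagged model's predictions for the test set.
preds <- replicate(n_bags, {
  # Bootstrap sample: draw training rows with replacement.
  idx <- sample(nrow(train), replace = TRUE)
  fit <- lm(mpg ~ wt + hp, data = train[idx, ])
  predict(fit, newdata = test)
})

# Average across the bagged models for the final prediction per test row.
bagged <- rowMeans(preds)
```

Averaging over many bootstrap fits reduces the variance of the final prediction, which is why bagging helps most with unstable, high-variance models.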