This is update 4 to my original post about predicting the NBA playoffs with R. With the Thunder beating the Spurs and the Heat losing to the Celtics, the algorithm went 1-1 on its predictions, making it 5-3 so far. Making some improvements: I have been posting for some time about incorporating more data into the models, and I finally got around to it. It is a truism in data science that more (high-quality) data... »

This is my third update to my original post on predicting the NBA playoffs with an algorithm. Here are updates 1 and 2. The algorithm correctly predicted a Boston win but missed on the Spurs/Thunder game, so it is currently 4-2. I haven't had any time to update the algorithm yet, so unfortunately I can only give you predictions for the next games: a Miami win and an Oklahoma City win. »

This is my second follow-up to my two previous posts: the original post about predicting NBA games with an algorithm, and my first update to the algorithm. The algorithm's record is now 3-1, as it correctly predicted Boston and Oklahoma City as winners of their past games. Upcoming things to do: Sadly, I have been a bit busy, and I have not been able to do any work on the algorithm over the past couple of days.... »

Game results: I recently made a post about developing an algorithm to predict the NBA playoffs, and I concluded with two predictions. Miami beat the Celtics to put my algorithm at 1-0, but it fell to 1-1 when the Thunder beat the Spurs. So we are now at .500. Considering that the algorithm was about 61.5% accurate over the whole season, this is to be expected. I made some improvements to... »

This is the initial post about the algorithm. See updates 1, 2, and 3 for more. The algorithm is currently 4-2 in the playoffs! Overview: I was struck by Martin O'Leary's recent post on predicting the Eurovision finals, and decided to try to predict NBA games with mathematical models myself. With the finals ongoing, this is quite a timely project! You can read through everything or scroll to the end... »

I have posted previously about the open data available on Socrata (https://opendata.socrata.com/), and I was looking at the site again today when I stumbled upon a listing of levels of various radioactive isotopes by US city and state. The data is available at https://opendata.socrata.com/Government/Sorted-RadNet-Laboratory-Analysis/w9fb-tgv6. You will need to click Export, then download it as a CSV. I was struck by how nicely formatted the data was for analysis. I... »
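Loading an export like this is a one-liner with read.csv. The sketch below uses an inline stand-in for the file, since the column names in the real RadNet export are an assumption on my part; in practice you would point read.csv at the downloaded file instead.

```r
# Toy stand-in for the exported RadNet CSV; in practice, pass the
# path of the file downloaded via Socrata's Export button.
# The column names here are illustrative assumptions, not the real schema.
csv_text <- "City,State,Isotope,Level
Denver,CO,Cs-137,0.12
Boise,ID,I-131,0.05"

rad <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(rad)   # inspect column names and types before analysis
```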

In R, the traditional way to load packages often means writing several lines of code just to get them loaded. These lines throw errors if the packages are not installed, and they can be hard to maintain, particularly during deployment. Fortunately, we can write a function in R that will load our packages for us automatically. In this post, I will walk you through conceiving... »
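One common shape for such a helper is sketched below; the name load_packages is mine, and the post's own version may well differ. The install step assumes a CRAN mirror is reachable.

```r
# Load a vector of packages, installing any that are missing first.
# Hypothetical helper: the name and defaults are illustrative.
load_packages <- function(pkgs, repos = "https://cloud.r-project.org") {
  for (p in pkgs) {
    if (!requireNamespace(p, quietly = TRUE)) {
      install.packages(p, repos = repos)   # install only when absent
    }
    library(p, character.only = TRUE)      # then attach it
  }
}

# One call replaces a stack of library() lines
load_packages(c("stats", "utils"))
```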

The foreach package for R is excellent, and allows code to be run in parallel easily. One problem with foreach is that it spawns separate R processes for the loop iterations, which prevents status messages from being logged to the console output. This is particularly frustrating during long-running tasks, when we are often unsure how much longer we need to wait, or even whether the code is doing what it is... »
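One workaround, sketched here assuming the doParallel backend, is to have each worker append progress lines to a shared log file, which you can tail from another terminal while the loop runs:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

log_file <- tempfile(fileext = ".log")

res <- foreach(i = 1:4, .combine = c) %dopar% {
  # Workers cannot print to the master console, but they can
  # append status lines to a file on the shared filesystem
  cat(sprintf("iteration %d done\n", i), file = log_file, append = TRUE)
  i^2
}

stopCluster(cl)
readLines(log_file)   # the accumulated status messages
```

Another option with a PSOCK cluster is makeCluster(2, outfile = ""), which redirects worker output back to the launching console.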

LaTeX is a typesetting system that can easily be used to create reports and scientific articles, and it has excellent formatting options for displaying code and mathematical formulas. Sweave is a tool included with base R that can execute R code embedded in LaTeX files and display the output. This can be used to generate reports and quickly fix errors when needed. There are some barriers to entry with LaTeX that seem much steeper than they actually... »
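A minimal Sweave document is just LaTeX with R chunks delimited by <<>>= and @; a sketch:

```latex
\documentclass{article}
\begin{document}

The coefficients below are computed when the report is built, so
rerunning Sweave after a data fix updates the numbers automatically.

<<echo=TRUE>>=
fit <- lm(dist ~ speed, data = cars)
coef(fit)
@

\end{document}
```

Saving this as, say, report.Rnw and running `R CMD Sweave report.Rnw` produces a .tex file ready for pdflatex.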

Modifying R code to run in parallel can lead to huge performance gains. Although a significant amount of code can easily be run in parallel, some learning techniques, such as the support vector machine, cannot be easily parallelized. However, there is an often-overlooked way to speed up these and other models: executing the code that generates predictions and other analytics in parallel, instead of the model-building phase... »
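A sketch of the idea using the base parallel package. Here lm stands in for a model like an SVM whose fitting is hard to parallelize; the fit stays serial, while scoring is split into chunks and run on separate cores (mclapply falls back to one core on Windows).

```r
library(parallel)

# Model building stays serial; lm is a stand-in for a model
# whose training cannot easily be parallelized
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Split the scoring data into chunks and predict each chunk on its own core
chunks <- split(mtcars, rep(1:2, length.out = nrow(mtcars)))
cores  <- if (.Platform$OS.type == "windows") 1 else 2
preds  <- mclapply(chunks, function(ch) predict(fit, newdata = ch),
                   mc.cores = cores)
preds  <- unname(unlist(preds))   # recombine the chunked predictions
```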

As I was exploring open data sources, I came across USA spending. This site contains information on US government contract awards and other disbursements, such as grants and loans. In this post, we will look at data on contracts awarded in the state of Maryland in the fiscal year 2011, which is available by selecting “Maryland” as the state where the contract was received and awarded here. I will use Maryland as a proxy for... »

Linear regression can be a fast and powerful tool to model complex phenomena. However, it makes several assumptions about your data, and it quickly breaks down when those assumptions are violated, such as the assumption that a linear relationship exists between the predictors and the dependent variable. In this post, I will introduce some diagnostics that you can perform to ensure that your regression does not violate these basic assumptions. To begin with, I highly suggest... »
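Base R already ships the standard checks for these assumptions; a quick sketch on the bundled cars data:

```r
# Fit a simple regression on R's bundled cars data
fit <- lm(dist ~ speed, data = cars)

# Residuals vs. fitted: curvature suggests a nonlinear relationship,
# a funnel shape suggests non-constant variance
plot(fitted(fit), resid(fit))

# Normal Q-Q plot: points far from the line suggest non-normal errors
qqnorm(resid(fit))
qqline(resid(fit))

# Or let R draw its standard four-panel diagnostic display
par(mfrow = c(2, 2))
plot(fit)
```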

I was searching for open data recently, and stumbled on Socrata. Socrata has a lot of interesting data sets, and while I was browsing around, I found a data set on federal bailout recipients. Here is the data set. However, data sets on Socrata are not always the most recent versions, so I followed a link to the data source at Propublica, where I was able to find a data set that was last updated... »

Introduction: This post incorporates parts of yesterday's post about bagging. If you are unfamiliar with bagging, I suggest that you read it before continuing with this article. I would like to give a basic overview of ensemble learning. Ensemble learning involves combining multiple predictions produced by different techniques in order to create a stronger overall prediction. For example, the predictions of a random forest, a support vector machine, and a simple linear model may be... »
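As a toy sketch of the idea, the block below averages the predictions of two different model types; lm and loess are stand-ins for the random forest / SVM / linear trio mentioned above.

```r
# Two models of different types fit to the same data
fit_lm    <- lm(mpg ~ wt + hp, data = mtcars)
fit_loess <- loess(mpg ~ wt, data = mtcars)

p_lm    <- predict(fit_lm, newdata = mtcars)
p_loess <- predict(fit_loess, newdata = mtcars)

# The simplest possible ensemble: an unweighted average of the predictions
ensemble <- (p_lm + p_loess) / 2
```

Weighted averages, or a second-stage model trained on the component predictions, are natural next steps.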

Bagging, a.k.a. bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model: take multiple random samples (with replacement) from your training data set, and use each of these samples to construct a separate model and a separate set of predictions for your test set. These predictions are then averaged to create a final prediction that is, hopefully, more accurate. One can quickly intuit that this technique will be more useful when the predictors... »
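A minimal sketch of the procedure, using lm as the base model on R's bundled mtcars data; the train/test split here is arbitrary and purely illustrative.

```r
set.seed(1)
train <- mtcars[1:25, ]
test  <- mtcars[26:32, ]

# B bootstrap resamples of the training set, one model per resample
B <- 25
pred_matrix <- replicate(B, {
  idx <- sample(nrow(train), replace = TRUE)   # sample WITH replacement
  fit <- lm(mpg ~ wt + hp, data = train[idx, ])
  predict(fit, newdata = test)
})

# One column per bootstrap model; the bagged prediction is the row-wise mean
bagged <- rowMeans(pred_matrix)
```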