Update: you can find the next post in this series here. In a previous post, I looked at transcripts of Simpsons episodes and tried to figure out which character was speaking which line. This worked decently, but it wasn’t great. It gave us memorable scenes like this one: Homer : D'oh! A deer! A female deer. Marge : Son, you're okay! Bart : Dad, I can't let you sell him. Stampy and I are friends.... »
Update: you can find the next post in this series here. You probably have a favorite Simpsons character. Maybe you hope to someday block out the sun, Mr. Burns style, maybe you enjoy Homer’s skill in averting meltdowns, or maybe you identify with Lisa’s struggles for acceptance. Through its characters, the Simpsons made a huge impact on a generation, and although the show is still running, my best memories will be of the early seasons.... »
The determinant is a number associated with a square (n×n) matrix. The determinant can tell us if the columns are linearly dependent, if a system has any nonzero solutions, and if a matrix is invertible. See the Wikipedia entry for more details on this. Computing a determinant is key to a lot of linear algebra, and by extension, to a lot of machine learning. It is easy to calculate the determinant for a... »
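As a rough illustration of the invertibility point above, here is a minimal sketch in plain Python. The function name `determinant` and the sample matrices are assumptions made for this example, not code from the post, and cofactor expansion is only one of several ways to compute a determinant (it becomes slow for large matrices).

```python
def determinant(matrix):
    """Determinant of a square matrix (list of lists) via cofactor
    expansion along the first row. Hypothetical helper for illustration."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    if n == 1:
        return matrix[0][0]
    if n == 2:
        return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]
    total = 0
    for col in range(n):
        # Minor: drop the first row and the current column.
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        sign = -1 if col % 2 else 1
        total += sign * matrix[0][col] * determinant(minor)
    return total


# Made-up example matrices.
independent = [[2, 1], [1, 3]]  # columns are linearly independent
dependent = [[1, 2], [2, 4]]    # second column is twice the first

print(determinant(independent))  # nonzero, so the matrix is invertible
print(determinant(dependent))    # zero, so the matrix is not invertible
```

A zero determinant signals that the columns are linearly dependent, which is the same condition under which the matrix has no inverse.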
Linear regression is a very basic technique that we use a lot in machine learning. In a lot of cases (and I have been guilty of this), we just use it without much thought as to how the internals actually work. In a 2-D coordinate system, we can plot observations (e.g., a child’s age is 1) and associated dependent variables (e.g., the child has 1 friend) on an x/y axis, like the one below:... »
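To make the age/friends setup concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The `fit_line` helper and the data points are invented for illustration; only the first point (age 1, 1 friend) comes from the text, and this is not necessarily the approach the full post takes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.
    Hypothetical helper for illustration."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations: a child's age (x) and number of friends (y).
ages = [1, 2, 3, 4]
friends = [1, 3, 5, 7]

slope, intercept = fit_line(ages, friends)
print(slope, intercept)
```

The slope is the expected change in the dependent variable for each one-unit change in the observation, and the intercept is the predicted value when the observation is zero.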
Introduction I had my natural predilection towards math crushed out of me at some point in school, and after that point, Math (yes, we are referring to the higher power of math) and I had a wary understanding. I dabbled quietly, and Math turned a blind eye to me ignoring some of its deeper theory. When I struggled loudly, Math did its best to hide its smirks. I generally refrained from throwing textbooks. Ever since... »
This is the second, technical, part of this series. See the first part for the overview. Introduction This post will introduce the technical details behind the NFL season record prediction that was introduced in part one. After selecting the error metric and defining an acceptable baseline, both of which were set up in part one, the next step is to develop a plan of attack. In order to create and develop this plan, we will use the percept... »
This is the first, non-technical, part of this series. See the second part for more detail. Introduction I was recently looking for a good machine learning task to try out, and I thought that doing something NFL-related would be interesting, because the NFL season is about to start (finally!). Why was I looking for a good machine learning task to try out? I have mostly done my data analysis work in R, but recently, I... »
Introduction This will serve as an introduction to natural language processing. I adapted it from slides for a recent talk at Boston Python. We will go from tokenization to feature extraction to creating a model using a machine learning algorithm. The goal is to provide a reasonable baseline on top of which more complex natural language processing can be done, and provide a good introduction to the material. The examples in this code are done... »
I just gave a talk at Boston Python about natural language processing in general, and edX ease and discern in particular.
You can find the presentation source here, and the web version of it here.
There is a video of it here.
Nelle Varoquaux and Michael Selik also gave interesting talks in the same video above; I recommend checking them out.
Intro I recently had to create some sites quickly. After evaluating a few options, I decided that a WordPress multisite was a good choice. In order to make this change, I set up a WordPress multisite installation with domain mapping. A multisite installation is when one WordPress install lets you run multiple websites. I like multisite because it enables me to flexibly manage multiple websites with less duplication of effort than a single WordPress installation for... »
How Many Data Scientists Are There? I’ve seen a lot of articles lately about “Big Data” and the looming “talent gap.” This article from the Wall Street Journal is a good example. It cites a McKinsey estimate that states that we will need 1.5 million more managers and analysts who are conversant with “big data.” Of course, some of this is the media latching on to the next “big thing” (data), but some of it... »
Introduction I recently posted about using the Wikileaks cable corpus to find word use patterns, both over time, and in secret cables vs unclassified cables. I received a lot of good suggestions for further topics to pursue with the corpus, and probably the most interesting was the idea to do sentiment analysis over time on a variety of named entities. Sentiment analysis is the process of discovering whether a writer feels negatively or positively about... »
6/18: A follow-up to this post is now available here. Recent Discoveries When I was a diplomat, I was always interested in the Wikileaks cables and what could be done with them. Unfortunately, I never got a chance to look at the site in depth, due to security policies. Now that the ex- is firmly prepended to diplomat in my resume, I think that I am finally ready to take that step. I recently realized... »
This is the sixth post in my series on predicting the NBA playoffs with an algorithm. After the Boston loss in their last game, the algorithm is now 5-4 in the playoffs. Hopefully it is correct tonight! Open Sourcing the Code I have had a couple of requests to open source the code, which I had planned to do at the end of this series of posts. However, there is one stumbling block in that... »