
Posts

Evaluating our Heart Disease Classifier

Let's continue our descent by evaluating how good our heart disease classifier is. We can do this by generating predictions on the test set and seeing how those predictions compare to the test set's ground truth labels. That can be done with the following lines of code: At the bottom we can see that the classifier predicted 24 true negatives, 9 false positives, 8 false negatives, and 19 true positives. That's pretty okay. There is obviously some inaccuracy in the predictions, but let's calculate the accuracy anyways. (24+19)/(24+19+8+9) ≈ 0.717, or about 71.7%. So the test accuracy was roughly 71.7%, while, if you recall from the last post, the training accuracy was nearing 90%. This disparity between training and testing accuracy is a result of overfitting. Essentially, 50,000 training loops was too much training for this little data. The resulting network overfit to the noise inherent in the training data and, as a result, failed to generalize as well on the test set. Therefore, the testing...
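The post's code is cut off in this excerpt, but a minimal sketch of the evaluation step might look like the following, assuming `model` is the trained MXNet network and `test.x` / `test.y` are the test features and 0/1 labels from the partitioning post (those names are placeholders, not necessarily the post's own):

```r
library(mxnet)

preds <- predict(model, test.x, array.layout = "rowmajor")  # class probabilities, classes x samples
pred.label <- max.col(t(preds)) - 1                         # most probable class (0 or 1) per sample

conf.mat <- table(predicted = pred.label, actual = test.y)  # confusion matrix
print(conf.mat)

accuracy <- sum(diag(conf.mat)) / sum(conf.mat)             # e.g. (24 + 19) / 60 ≈ 0.717
print(accuracy)
```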
Recent posts

Training a Heart Disease Classifier

As I begin writing this post I realize that we're not training a heart disease predictor, per se, as the classification task at hand isn't really to divine the onset of heart disease in the distant future. Rather, we're really training a heart disease classifier that identifies heart disease given test results, vital measurements, and demographic information. While much less cool than a predictor, this is still a pretty difficult task, and I'm certain the definition of "heart disease" presents some ambiguity and subjectivity among cardiologists. To illustrate the difficulty of this task, you might ask yourself: if you were given the sex, angina status, heart rate measurements, cholesterol measurements, and EKG results of a patient, would you be able to diagnose that patient as having heart disease? Anyways, let's go ahead and train this network. That can be done with these lines of code. Execution of these lines of code takes about 20 m...
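Here's a minimal sketch of what that training call might look like, using MXNet's mx.mlp convenience wrapper; the post's actual architecture and hyperparameters may differ, and `train.x` / `train.y` are assumed to be the feature matrix and 0/1 labels produced in the partitioning post:

```r
library(mxnet)

mx.set.seed(0)                                   # reproducible weight initialization
model <- mx.mlp(train.x, train.y,
                hidden_node = 10,                # one small hidden layer (assumed size)
                out_node = 2,                    # two classes: disease / no disease
                out_activation = "softmax",
                num.round = 50000,               # the post cites ~50,000 training loops
                learning.rate = 0.07,            # assumed optimizer settings
                momentum = 0.9,
                eval.metric = mx.metric.accuracy,
                array.layout = "rowmajor")       # samples are the rows of train.x
```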

Partitioning Heart Data

We'll continue our descent by loading, pre-processing, and partitioning the heart disease data. This process will be similar to our previous experiment, but it'll use the heart disease dataset instead of the iris dataset. First, we'll load the heart disease data into a dataframe. That can be done with the following lines of code. We'll also rename the columns to reflect the dataset's documentation. The "ca" and "thal" columns have some null values, so let's just remove those rows from the dataframe. There are ways to impute the data rather than just getting rid of the records, but those methods go beyond the scope of this blog, so we'll take the easy way out. Afterwards, we have a couple of categorical variables, if you recall from the previous post. As they're read in, they're represented by integer values. I'm actually not sure how MXNet deals with these. Either they can be considered categoric...
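The excerpt cuts off the code, but a minimal sketch of the load-and-clean step might look like this, assuming the UCI Cleveland file processed.cleveland.data; the column names follow the dataset's documentation, while the 0/1 label collapse and the split proportion are my own assumptions:

```r
# Missing values appear as "?" in the raw file; read them in as NA.
heart <- read.csv("processed.cleveland.data", header = FALSE, na.strings = "?")
colnames(heart) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                     "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# The NAs sit in the "ca" and "thal" columns; drop those rows rather than impute.
heart <- heart[complete.cases(heart), ]

# The "num" column records severity 0-4; collapsing it to a 0/1 label is a
# common step for a binary classifier (assumed here).
heart$num <- as.integer(heart$num > 0)

# A simple random 80/20 split into training and test sets (proportion assumed).
set.seed(0)
train.idx <- sample(nrow(heart), size = round(0.8 * nrow(heart)))
train <- heart[train.idx, ]
test  <- heart[-train.idx, ]
```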

A New Problem: Heart Disease

Next, we'll see if we can tackle a problem that may have real-world implications for deep learning. While I jest about the awesome power of neural networks for classifying flowers, there are some cool ways to show that the long arm of deep learning extends beyond the casual gardener's domain. Specifically, I'd like to show that deep learning can be used in a healthcare setting. There are plenty of ways I might be able to show this, but perhaps the message would be most effective if I could tackle something relevant. So, we'll be tackling a prediction task related to heart disease, which, according to this image, is rather important: Heart disease affects most people, whether directly or through the affliction of a loved one. It is the cause of death for 25% of Americans and, if your household is as pork-chop-laden as mine, probably has affected you as well. So, what then can we do to show that deep learning is relevant to heart disease? Well, it...

Classifying Flowers Part 2: Training and Evaluation

In the previous post, we fiddled with data and set out on the ambitious task of classifying plants. We ended that post after having downloaded, prepared, and partitioned our data, resulting in two non-overlapping sets of data: a test set and a training set. We'll continue now by building, training, and evaluating a neural network to classify these flowers. First, we'll need to separate the values that we'll use as input to the network from the values we intend to use as the output of the network. In other words, we need to place all the sepal and petal widths and lengths into a bag that the network can read, and all the species corresponding to those individual sets of sepal and petal widths and lengths into another bag. That way, when the network trains, it only takes the 4 inputs into account. When it produces an output, we can use the other bag (the one with species inside) to see if the network has predicted the species correctly. Before doing that, we'...
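As a rough sketch of that input/output separation, assuming `train` and `test` are the data frames from Part 1 with the four measurements in columns 1 through 4 and the species in column 5 (these names are placeholders, not necessarily the post's own):

```r
train.x <- data.matrix(train[, 1:4])              # sepal/petal lengths and widths
train.y <- as.integer(as.factor(train[, 5])) - 1  # species encoded as 0, 1, 2
test.x  <- data.matrix(test[, 1:4])
test.y  <- as.integer(as.factor(test[, 5])) - 1
```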

Classifying Flowers Part 1: Data

Everybody loves plants. So we'll continue our descent by building a neural network that can do something with plants. This construct will take characteristics that describe a plant and, if we're good enough, hopefully tell us with accuracy what type of plant it is. To do this, we'll use the Iris flowers dataset, which contains 150 samples of 3 different species (50 samples each). The species in this dataset are Iris setosa, Iris versicolor, and Iris virginica. The dataset can be found at the following links. Data: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data ReadMe: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names Here's a screenshot of the data we'll be working with: Initially, this data doesn't look too exciting, and if it at all confuses you, you're perfectly normal. But we can glean some immediate information just from looking at the format ...
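A minimal sketch of pulling that raw data into R, using the URL above; the file itself has no header row, so the column names below are taken from the ReadMe rather than the data:

```r
iris.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
iris.df  <- read.csv(iris.url, header = FALSE,
                     col.names = c("sepal.length", "sepal.width",
                                   "petal.length", "petal.width", "species"))
head(iris.df)  # peek at the first few rows
```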

Installing the Tools

We'll continue our descent by installing the tools necessary to conduct deep learning. The tools will include R, MXNet (a framework for building neural nets in R), Python, and TensorFlow (a framework for building neural nets in Python). You might ask why I'm using 2 different languages and 2 different frameworks. Truth be told, I like the way MXNet does classic, feed-forward neural network classifiers. You'll see that the syntax is concise and doesn't require as much fiddling with the formatting of data. Unfortunately, MXNet doesn't exhibit the same elegance for the more complex network architectures that we'll encounter later in this blog, so we'll use TensorFlow for CNNs (image classification) and RNNs (time-series classification). So let's install R. First, we'll go to https://cran.r-project.org/bin/windows/base/ to download the latest version of R on Windows: We'll click the "Download R 3.4.2 for Windows" link to download R (...
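Once R itself is installed, a minimal sketch of adding the MXNet R package looks like the following; the DMLC repository URL is taken from the MXNet install docs of that era and may have changed since, so treat it as an assumption:

```r
# Point install.packages at the DMLC repository that hosts the prebuilt mxnet
# package for R on Windows (assumed URL).
cran <- getOption("repos")
cran["dmlc"] <- "https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/R/CRAN/"
options(repos = cran)
install.packages("mxnet")

library(mxnet)  # quick sanity check that the package loads
```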