Deep Learning Notes: Skin Cancer Classification using DenseNets and ResNets

I recently came across the HAM10000 dataset, which consists of over 10,000 images of pigmented skin lesions. The dataset includes the .jpg images as well as metadata, and you can download it from Kaggle. Here is one that is not entirely gross:

The lesions can be divided into 7 distinct classes: actinic keratoses and intraepithelial carcinoma / Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).
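To make the seven classes concrete, here is a minimal sketch of how the short codes could be turned into training targets. It assumes the metadata labels use exactly the short codes listed above; the names CLASS_NAMES, CLASS_TO_INDEX and one_hot are hypothetical helpers for illustration, not part of the dataset or of my script.

```python
# Short diagnosis codes for the seven lesion classes (descriptions abbreviated).
CLASS_NAMES = {
    "akiec": "actinic keratoses and intraepithelial carcinoma / Bowen's disease",
    "bcc": "basal cell carcinoma",
    "bkl": "benign keratosis-like lesions",
    "df": "dermatofibroma",
    "mel": "melanoma",
    "nv": "melanocytic nevi",
    "vasc": "vascular lesions",
}

# Assumption: alphabetical order of the codes defines the class index.
CLASS_TO_INDEX = {code: i for i, code in enumerate(sorted(CLASS_NAMES))}


def one_hot(code):
    """Return a one-hot target vector for a diagnosis code such as 'mel'."""
    vec = [0.0] * len(CLASS_TO_INDEX)
    vec[CLASS_TO_INDEX[code]] = 1.0
    return vec
```

A one-hot target like this is the shape that categorical cross-entropy expects, so one_hot("mel") puts its single 1.0 at index CLASS_TO_INDEX["mel"].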

At this stage, I have just been testing variants of the ResNet and DenseNet architectures. You can see a rough version of my script here. For a good explanation of what these networks are, how they relate to each other, and where they sit among the other models in computer vision, check out this explanatory post by Manish Chablani. If you want a deep dive into the DenseNet domain (see what I did there?), I would suggest this post by Pablo Ruiz Ruiz. Anyway, I mostly had a simple task ahead of me: take some of these architectures, add some fully connected (FC) and/or classification layers, and see how well they work on a dataset like this.

First Attempts: ResNets 50 and 152

I use Keras with TensorFlow for the most part, and its model zoo has a bunch of well-known architectures. However, when I was working on this, Keras had ResNet50 but not ResNet-152, so I used a Keras-friendly implementation of the latter from Adam Casson. I wasn’t going to try too many models at first. I had my heart set on just using DenseNets, because a DenseNet was at the heart of the CheXNet model that was performing on par with board-certified radiologists in diagnosing from chest X-rays. However, I came across a paper by Victor Cheung on ensembling DenseNets and ResNets, which made me think I should expand my model horizons a bit.

TL;DR: ResNet50 didn’t perform well, with accuracy on both the training and validation sets in the mid-70%s. ResNet-152, on the other hand, did better. I trained both for 100 epochs with a bit of hyperparameter tuning, mostly regularization of the weights and gradients. ResNet-152 could reach the low 90%s for accuracy on the training set, but only the mid-70%s on the validation set. I haven’t yet found the time to work on the overfitting and generalization problems; it was more important for me to see first-hand how the different architectures perform on what I think is a very challenging dataset.
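For readers wondering what "regularization of the weights and gradients" can look like, here is a plain-Python sketch of two common techniques: an L2 penalty on the weights and gradient clipping by norm. That these two are the right reading is an assumption, as are the function names and default values, which are illustrative only. In Keras the rough equivalents would be the kernel_regularizer argument on a layer and the clipnorm argument on an optimizer.

```python
import math


def l2_penalty(weights, strength=1e-4):
    """L2 weight penalty added to the loss: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


def clip_by_norm(gradient, max_norm=1.0):
    """Rescale a gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        # Already small enough: return an unchanged copy.
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]
```

The penalty discourages large weights, while clipping caps the size of any single update. Both are cheap to try before reaching for heavier tools.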


Thankfully, Keras’ applications module includes several DenseNets: the 121, 169, and 201 versions. I experimented with all three and, although I have not tested them thoroughly or grid searched the heck out of them, I would still put my money on the DenseNet-169 model. It could overfit the training set and reach accuracy above 90%, though the validation accuracy was not much over 80%. Still, it was better by far than the 121 or 201 versions. I used between 1 and 3 fully connected layers on top of the base model, which was initialized with the ImageNet weights. At first, I wanted to freeze most of the layers except the top, but subpar results led me to eventually allow training on all layers, which generally gave much better results. I also used 1e-3 as the learning rate, since I was getting bad results with 1e-2.
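A minimal sketch of the setup described above, not my actual script: a DenseNet-169 base from the Keras applications module with ImageNet weights, 1 to 3 fully connected layers on top, all layers trainable, and a learning rate of 1e-3. The Adam optimizer, the 224x224 input size, the global average pooling, the hidden layer width of 256 and the function name are all assumptions made for illustration.

```python
NUM_CLASSES = 7  # the seven HAM10000 lesion classes


def build_densenet169_classifier(hidden_units=(256,), learning_rate=1e-3,
                                 weights="imagenet"):
    """Build a DenseNet-169 classifier with 1 to 3 fully connected layers on top."""
    if not 1 <= len(hidden_units) <= 3:
        raise ValueError("expected between 1 and 3 fully connected layers")

    # Imported here so the argument check above works without TensorFlow installed.
    import tensorflow as tf

    base = tf.keras.applications.DenseNet169(
        include_top=False,          # drop the original 1000-class ImageNet head
        weights=weights,            # "imagenet" or None for random initialization
        input_shape=(224, 224, 3),  # assumed input size
        pooling="avg",              # global average pooling gives a flat vector
    )
    base.trainable = True  # fine-tune every layer rather than freezing the base

    x = base.output
    for units in hidden_units:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Swapping DenseNet169 for DenseNet121 or DenseNet201 in the same module gives the other two variants. To try the frozen-base approach instead, set base.trainable = False before compiling.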

Also, I was using categorical cross-entropy as the loss function, though I was interested in trying out multi-class focal loss as well; I didn’t manage to get the implementation I found working right, but honestly, that’s something I’d want to tweak towards the end, once I had identified what I considered the best model.
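To show how the two losses relate, here is a plain-Python sketch for a single example. This is the commonly used multi-class form of focal loss, not the implementation I was trying to get working; the defaults gamma=2.0 and alpha=1.0 and the function names are assumptions for illustration.

```python
import math


def _clip(p, eps=1e-7):
    """Keep a probability away from 0 and 1 so the logarithm stays finite."""
    return min(max(p, eps), 1.0 - eps)


def categorical_cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(_clip(p)) for t, p in zip(y_true, y_pred))


def focal_loss(y_true, y_pred, gamma=2.0, alpha=1.0):
    """Cross-entropy with each term scaled by alpha * (1 - p) ** gamma."""
    loss = 0.0
    for t, p in zip(y_true, y_pred):
        p = _clip(p)
        # Confident correct predictions (p near 1) are down-weighted.
        loss += -alpha * t * (1.0 - p) ** gamma * math.log(p)
    return loss
```

With gamma=0 and alpha=1 the focal loss reduces to plain categorical cross-entropy. Larger values of gamma shrink the contribution of examples the model already classifies confidently.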


DenseNets are definitely a good choice, especially the 169-version; ResNet-152 seems to be in the same league, roughly speaking. However, I am far from satisfied with the classification performance. I feel like this recipe is missing some key ingredient.

In particular, I have to find some way of dealing with the overfitting and poor generalizability of the model. Perhaps I should consider weighted model averaging? Or stacked generalization ensembles? Or should I use metadata like the patient’s age and gender and the location of the lesion (using embeddings for the categorical variables, of course) to help with the diagnosis?
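As a rough illustration of the first idea, here is a minimal sketch of weighted model averaging over the class probabilities that several models predict for one image. The function name is hypothetical, and how the weights get chosen (for example from validation accuracy) is left open as an assumption.

```python
def weighted_average(predictions, weights):
    """Combine per-model class probabilities into one weighted average.

    predictions: one probability vector per model, all of the same length.
    weights: one non-negative weight per model; they are normalized to sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    averaged = [0.0] * len(predictions[0])
    for probs, weight in zip(predictions, weights):
        for c, p in enumerate(probs):
            averaged[c] += (weight / total) * p
    return averaged
```

For example, giving a DenseNet-169 three times the weight of a ResNet-152 would be weighted_average([resnet_probs, densenet_probs], [1, 3]). A stacked ensemble would replace these fixed weights with a small model trained on the base models' outputs.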

Another thing I have been experimenting with is Google Cloud’s AutoML Vision beta, one of their AutoML cloud services. I’m uploading my dataset, all 10k pictures, to the cloud and will train with whatever network they use there. It will be interesting to see how it does compared to the models above.

Please let me know if you have any suggestions for me! Thanks for reading.
