Introduction

For the 9th Workshop in Biomedical Engineering (http://wbme.fc.ul.pt/) a specific workshop was prepared to showcase how some of IBM Watson's APIs could be used; given the target audience and the overall themes of the WBME I had the following requirements:

  • Use one or more Watson APIs
  • Build a use-case which makes sense within the context of the WBME
  • Make the workshop interactive but also useful for those just watching
  • Have as little software requirements as possible
  • Make it useful even after the workshop, which includes being able to distribute it in some way

After some days experimenting with different options I settled on:

  • IBM Watson Visual Recognition API
  • A dermoscopic database of images showing skin lesions, both benign and malignant: the PH(2) database from the ADDI project (https://www.fc.up.pt/addi/ph2%20database.html).
  • IBM Data Science Experience to create and share the notebook, enabling everyone to work on their own copies with no software requirements, by making use of its Jupyter notebook support and enhanced user interface.

The initial development was done using the Rmd format and only slight changes were required; one of them was the use of a specific zip file for the PH(2) dataset, since the original one was a RAR file and that compression format isn't supported directly in R, which meant I was decompressing it with a system command, something not feasible when using a Spark execution environment. Repackaging was the only change to the original PH(2) archive (the contents themselves are untouched).

I would like to thank the 9th WBME organisers for the opportunity, and in particular all of those who attended the workshop and actively participated in it; a special thank you to Sara Lobo (Biomedical Engineering, FCT/NOVA), on whose laptop we ended up working together for the final part of the workshop, which concluded (successfully!) in the Students' Room of the Faculty of Sciences of the University of Lisbon, more in the style of a group debugging session than a formal workshop - much to the advantage of all involved.

Concerning the topic

The use of visual recognition for the detection of malignant skin lesions has been studied intensively and is the topic of many specialised papers and studies. While I chose this topic because it aligns with a biomedical engineering event, the goal of the Watson API workshop is to explain how to use the Visual Recognition API, not to build a clinically useful model for something with implications as serious as melanoma detection.

For those who want to explore the matter in more depth, I've added some relevant bibliography that clearly shows the complexity of the problem and the possible approaches, including image pre-processing, the importance of masking, different classification strategies and more.

Setting up the environment

Throughout this notebook we will need to install and load some R packages; given the interactive nature of the environment it is better to gather all of those initial requirements in a single code cell, so we can run it once and be done with it.

In [224]:
## Load the `caret` library, which includes the partitioning function we will use
library(caret)
## Package e1071 is used by `caret` for the confusion matrix
install.packages('e1071', dependencies=TRUE)
# Install needed packages for image conversion
install.packages("bmp")
install.packages("jpeg")
install.packages("pixmap")
## Load the image-related libraries
library(bmp)
library(jpeg)
library(pixmap)
library(grid)
## For the REST API
library(jsonlite)
library(httr)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s238-e7c61469e53a3d-d61b00c8d317/R/libs’
(as ‘lib’ is unspecified)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s238-e7c61469e53a3d-d61b00c8d317/R/libs’
(as ‘lib’ is unspecified)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s238-e7c61469e53a3d-d61b00c8d317/R/libs’
(as ‘lib’ is unspecified)
Installing package into ‘/gpfs/global_fs01/sym_shared/YPProdSpark/user/s238-e7c61469e53a3d-d61b00c8d317/R/libs’
(as ‘lib’ is unspecified)

An IBM Bluemix (https://www.ibm.com/cloud-computing/bluemix/) account is needed; during the workshop this was the first step done but the process is simple: sign up and create a free account and get access to the entire catalogue, including Watson APIs.
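The jsonlite and httr libraries loaded in the setup cell will handle this REST interaction later on. As a preview, a classification request could look roughly like the sketch below; the endpoint URL, the api_key query parameter and the version date are assumptions based on the v3 Visual Recognition REST API, and should be checked against the current service documentation.

```r
## Hedged sketch of a Visual Recognition classify call; the endpoint,
## api_key and version values are assumptions, not tested against the
## live service. classify_image is a hypothetical helper name.
classify_image <- function(api_key, image_file) {
  httr::POST("https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify",
             query = list(api_key = api_key, version = "2016-05-20"),
             body  = list(images_file = httr::upload_file(image_file)))
}
```

The response body would then be parsed with jsonlite::fromJSON.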

Obtaining and preparing the data

The first step we are going to take is downloading the PH(2) database and preparing the data; this step is fundamental in Data Science and we will spend some time going through it step by step to make sure all the needed actions are properly automated.

The focus on having everything done "in code" is important since it enables repeatability, which is crucial.

Cleaning up

This seems strange: why clean up when we are just starting? Given that we are in an interactive (and iterative) environment which keeps data between runs, it is assumed we will want to run the code several times, so this first step removes all the files we will create throughout this document. If nothing has been created yet that's not a problem: the unlink function deletes what exists and silently does nothing otherwise.

There is a sole exception: we will not delete the compressed archive, since it doesn't change and keeping it avoids downloading the same content several times.

In [225]:
## Delete previous runs
zip_dir <-"ph2-training-files"
## If present remove existing archives from previous runs
unlink(zip_dir, recursive=T,force=T)
unlink("training-set-positive.zip", force=T)
unlink("training-set-negative.zip", force=T)
unlink("parameters.json", force=T)
unlink("*.csv", force=T)
unlink("PH2Dataset", force=T, recursive=T)
## Uncomment to delete the database zip as well
## unlink("PH2Dataset.zip", force=T)

Getting the dataset

The first action is to download the PH(2) database, available as a compressed archive; we set up some variables - the URL where the file is and the name of the zip file - and check for the zip file locally, bypassing the download if it already exists.

In [226]:
## New dataset
ph2_url  <- "https://www.dropbox.com/s/9a962jfcrs5x4iq/PH2Dataset.zip?dl=0"
ph2_archive <- "PH2Dataset.zip"

### Check for zipped dataset file: if it doesn't exist then download it
if (!file.exists(ph2_archive)) {
    print("Downloading PH(2) archive")
    download.file(ph2_url, destfile=ph2_archive, method="wget")
} else {
    print("File already exists, skipping download")
}
[1] "File already exists, skipping download"

We should have the PH(2) zip file (and just that file) in our working directory.

In [227]:
dir()
'PH2Dataset.zip'

Now that we have the archive we unzip it.

In [228]:
dir(".")
unzip(ph2_archive)
'PH2Dataset.zip'

Initial cleanup

The structure of the newly created directory

In [229]:
dir("PH2Dataset")
  1. 'PH2 Dataset images'
  2. 'PH2_dataset.txt'
  3. 'PH2_dataset.xlsx'
  4. 'Readme.txt'

Of particular interest to us is the PH2_dataset.txt file that contains a text-based description of the observations. Let's read the file into a variable and examine some of the content.

In [230]:
ph2_txt <- readLines("PH2Dataset/PH2_dataset.txt")
head(ph2_txt)
Warning message in readLines("PH2Dataset/PH2_dataset.txt"):
“incomplete final line found on 'PH2Dataset/PH2_dataset.txt'”
  1. '|| Name || Histological Diagnosis || Clinical Diagnosis || Asymmetry | Pigment Network | Dots/Globules | Streaks | Regression Areas | Blue-Whitish Veil || Colors ||'
  2. '|| IMD003 || || 0 || 0 | T | A | A | A | A || 4 ||'
  3. '|| IMD009 || || 0 || 0 | T | A | A | A | A || 3 ||'
  4. '|| IMD016 || || 0 || 0 | T | T | A | A | A || 3 4 ||'
  5. '|| IMD022 || || 0 || 0 | T | A | A | A | A || 3 ||'
  6. '|| IMD024 || || 0 || 0 | T | A | A | A | A || 3 4 ||'

It's a file best read using a monospaced font, but one can clearly see that it has a header followed by the observations, with columns divided by vertical bars; this is easy for us to understand but not ideal to use programmatically, so we will do some transformations to separate the different fields with commas - a CSV file which can be directly imported into R, and which we will write to disk (optional, since we could use the result of the transformations directly).

In [231]:
## Convert the txt to a csv file; due to the format this requires
## several operations.
ph2_new <- gsub ("^\\|\\|", "", fixed=FALSE,ph2_txt[1:201]) # Delete the || at the beginning of each line
ph2_new <- gsub ("\\|\\|$", "", fixed=FALSE,ph2_new)        # Delete the || at the end of each line
ph2_new <- gsub ("||", ",", fixed=T,ph2_new)                # Replace all the remaining || with a comma
ph2_new <- gsub ("|", ",", fixed=T,ph2_new)                 # Replace all the | with a comma

## Save the result to a file
writeLines(ph2_new, "ph2_dataset.csv")
## ... and read that file into a variable
ph2_table <- read.csv("ph2_dataset.csv", header=T)

We now have a CSV file in our working directory,

In [232]:
dir()
  1. 'PH2Dataset'
  2. 'ph2_dataset.csv'
  3. 'PH2Dataset.zip'

... and a ph2_table variable which is an R data frame resulting from the import of the CSV file.

In [233]:
str(ph2_table)
head(ph2_table)
'data.frame':	200 obs. of  10 variables:
 $ Name                  : Factor w/ 200 levels " IMD002 "," IMD003 ",..: 2 6 11 17 19 20 26 29 33 35 ...
 $ Histological.Diagnosis: Factor w/ 7 levels "                        ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Clinical.Diagnosis    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Asymmetry             : num  0 0 0 0 0 0 2 0 0 0 ...
 $ Pigment.Network       : Factor w/ 2 levels "              AT ",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Dots.Globules         : Factor w/ 3 levels "             A ",..: 1 1 3 1 1 3 1 3 3 3 ...
 $ Streaks               : Factor w/ 2 levels "       A ","       P ": 1 1 1 1 1 1 1 1 1 1 ...
 $ Regression.Areas      : Factor w/ 2 levels "                A ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Blue.Whitish.Veil     : Factor w/ 2 levels "                 A ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Colors                : Factor w/ 28 levels "       1  2  4  5 ",..: 22 15 16 15 16 15 10 25 16 23 ...
NameHistological.DiagnosisClinical.DiagnosisAsymmetryPigment.NetworkDots.GlobulesStreaksRegression.AreasBlue.Whitish.VeilColors
IMD003 0 0 T A A A A 4
IMD009 0 0 T A A A A 3
IMD016 0 0 T T A A A 3 4
IMD022 0 0 T A A A A 3
IMD024 0 0 T A A A A 3 4
IMD025 0 0 T T A A A 3

Creating the training and testing sets

Some columns are interpreted as numeric although they are factors, so we convert them; additionally we remove the "atypical nevus" images, since for our purposes a smaller dataset is desirable: it keeps the training time of the model short and avoids reaching the daily allowance of images. This also removes a group of ambiguous images and will certainly result in a simpler model, in the sense that it has been trained with fewer corner cases.

In [234]:
ph2_table$Clinical.Diagnosis <- as.factor(ph2_table$Clinical.Diagnosis)
ph2_table$Asymmetry <- as.factor(ph2_table$Asymmetry)
ph2 <- ph2_table[!ph2_table$Clinical.Diagnosis == "1", ]
ph2$Clinical.Diagnosis <- droplevels(ph2$Clinical.Diagnosis)
In [235]:
## We only have two factors now
print(ph2$Clinical.Diagnosis)
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2
Levels: 0 2

We now split the pruned data into two different sets, training (70%) and testing (30%), using the Clinical Diagnosis as the outcome.

In [236]:
inTrain <- createDataPartition(y=ph2$Clinical.Diagnosis, p=0.7, list=FALSE)
training_set <- ph2[inTrain, ]
testing_set <-  ph2[-inTrain, ]
In [237]:
## See what we have so far
head(training_set)
head(testing_set)
NameHistological.DiagnosisClinical.DiagnosisAsymmetryPigment.NetworkDots.GlobulesStreaksRegression.AreasBlue.Whitish.VeilColors
1 IMD003 0 0 T A A A A 4
2 IMD009 0 0 T A A A A 3
3 IMD016 0 0 T T A A A 3 4
5 IMD024 0 0 T A A A A 3 4
7 IMD035 0 2 T A A A A 2 3
8 IMD038 0 0 T T A A A 4 6
NameHistological.DiagnosisClinical.DiagnosisAsymmetryPigment.NetworkDots.GlobulesStreaksRegression.AreasBlue.Whitish.VeilColors
4 IMD022 0 0 T A A A A 3
6 IMD025 0 0 T T A A A 3
9 IMD042 0 0 T T A A A 3 4
12 IMD050 0 0 T T A A A 3
14 IMD101 0 0 T A A A A 3
30 IMD162 0 0 T T A A A 3 4

Our approach is straightforward and does not consider the need for oversampling/bootstrapping: we will use the same class proportions present in the pruned dataset.

In [238]:
print(sprintf("Common Nevus: %s%%", round(nrow(ph2[ph2$Clinical.Diagnosis == "0", ]) / nrow(ph2) *100, 0)))
print(sprintf("Melanoma: %s%%",     round(nrow(ph2[ph2$Clinical.Diagnosis == "2", ]) / nrow(ph2) *100, 0)))
## Plot
barplot(table(ph2$Clinical.Diagnosis))
[1] "Common Nevus: 67%"
[1] "Melanoma: 33%"

Positive and negative

To train a custom classifier with IBM Watson's Visual Recognition we need both positive and negative example sets; we will divide our training set into a positive set containing the melanoma diagnoses and a negative one containing the rest.

In [239]:
## Total number of cases
sprintf("Total number of cases in training set: %d",nrow(training_set))
## Benign and malignant cases in the training set

sprintf("Total number of benign cases in training set: %d", nrow(training_set[training_set$Clinical.Diagnosis == "0",]))
sprintf("Total number of malignant cases in training set: %d", nrow(training_set[training_set$Clinical.Diagnosis == "2",]))

## Calculate the percentages...
print(sprintf("Common Nevus: %s%%", round(nrow(training_set[training_set$Clinical.Diagnosis == "0", ]) / nrow(training_set) *100, 0)))
print(sprintf("Melanoma: %s%%",     round(nrow(training_set[training_set$Clinical.Diagnosis == "2", ]) / nrow(training_set) *100, 0)))

## ... and plot them
barplot(table(training_set$Clinical.Diagnosis))

## Create variables for both sets to make it clearer
training_positive <- (training_set[training_set$Clinical.Diagnosis == "2",])
training_negative <- (training_set[training_set$Clinical.Diagnosis == "0",])
'Total number of cases in training set: 84'
'Total number of benign cases in training set: 56'
'Total number of malignant cases in training set: 28'
[1] "Common Nevus: 67%"
[1] "Melanoma: 33%"

Files and paths

The Visual Recognition API accepts the images as zip files, so we will use the image names to build the appropriate archives; but first, let's explore the database's directory structure in a bit more detail.

As we saw we have at the toplevel some files and a directory:

In [240]:
dir("PH2Dataset")
  1. 'PH2 Dataset images'
  2. 'PH2_dataset.txt'
  3. 'PH2_dataset.xlsx'
  4. 'Readme.txt'

Inside the images directory there are further directories, one for each image. We will use an option to dir to obtain the full paths of the files and directories, to make things clearer.

In [241]:
dir("PH2Dataset/PH2 Dataset images", full.names=T)
  1. 'PH2Dataset/PH2 Dataset images/IMD002'
  2. 'PH2Dataset/PH2 Dataset images/IMD003'
  3. 'PH2Dataset/PH2 Dataset images/IMD004'
  4. 'PH2Dataset/PH2 Dataset images/IMD006'
  5. 'PH2Dataset/PH2 Dataset images/IMD008'
  6. 'PH2Dataset/PH2 Dataset images/IMD009'
  7. 'PH2Dataset/PH2 Dataset images/IMD010'
  8. 'PH2Dataset/PH2 Dataset images/IMD013'
  9. 'PH2Dataset/PH2 Dataset images/IMD014'
  10. 'PH2Dataset/PH2 Dataset images/IMD015'
  11. 'PH2Dataset/PH2 Dataset images/IMD016'
  12. 'PH2Dataset/PH2 Dataset images/IMD017'
  13. 'PH2Dataset/PH2 Dataset images/IMD018'
  14. 'PH2Dataset/PH2 Dataset images/IMD019'
  15. 'PH2Dataset/PH2 Dataset images/IMD020'
  16. 'PH2Dataset/PH2 Dataset images/IMD021'
  17. 'PH2Dataset/PH2 Dataset images/IMD022'
  18. 'PH2Dataset/PH2 Dataset images/IMD023'
  19. 'PH2Dataset/PH2 Dataset images/IMD024'
  20. 'PH2Dataset/PH2 Dataset images/IMD025'
  21. 'PH2Dataset/PH2 Dataset images/IMD027'
  22. 'PH2Dataset/PH2 Dataset images/IMD030'
  23. 'PH2Dataset/PH2 Dataset images/IMD031'
  24. 'PH2Dataset/PH2 Dataset images/IMD032'
  25. 'PH2Dataset/PH2 Dataset images/IMD033'
  26. 'PH2Dataset/PH2 Dataset images/IMD035'
  27. 'PH2Dataset/PH2 Dataset images/IMD036'
  28. 'PH2Dataset/PH2 Dataset images/IMD037'
  29. 'PH2Dataset/PH2 Dataset images/IMD038'
  30. 'PH2Dataset/PH2 Dataset images/IMD039'
  31. 'PH2Dataset/PH2 Dataset images/IMD040'
  32. 'PH2Dataset/PH2 Dataset images/IMD041'
  33. 'PH2Dataset/PH2 Dataset images/IMD042'
  34. 'PH2Dataset/PH2 Dataset images/IMD043'
  35. 'PH2Dataset/PH2 Dataset images/IMD044'
  36. 'PH2Dataset/PH2 Dataset images/IMD045'
  37. 'PH2Dataset/PH2 Dataset images/IMD047'
  38. 'PH2Dataset/PH2 Dataset images/IMD048'
  39. 'PH2Dataset/PH2 Dataset images/IMD049'
  40. 'PH2Dataset/PH2 Dataset images/IMD050'
  41. 'PH2Dataset/PH2 Dataset images/IMD057'
  42. 'PH2Dataset/PH2 Dataset images/IMD058'
  43. 'PH2Dataset/PH2 Dataset images/IMD061'
  44. 'PH2Dataset/PH2 Dataset images/IMD063'
  45. 'PH2Dataset/PH2 Dataset images/IMD064'
  46. 'PH2Dataset/PH2 Dataset images/IMD065'
  47. 'PH2Dataset/PH2 Dataset images/IMD075'
  48. 'PH2Dataset/PH2 Dataset images/IMD076'
  49. 'PH2Dataset/PH2 Dataset images/IMD078'
  50. 'PH2Dataset/PH2 Dataset images/IMD080'
  51. 'PH2Dataset/PH2 Dataset images/IMD085'
  52. 'PH2Dataset/PH2 Dataset images/IMD088'
  53. 'PH2Dataset/PH2 Dataset images/IMD090'
  54. 'PH2Dataset/PH2 Dataset images/IMD091'
  55. 'PH2Dataset/PH2 Dataset images/IMD092'
  56. 'PH2Dataset/PH2 Dataset images/IMD101'
  57. 'PH2Dataset/PH2 Dataset images/IMD103'
  58. 'PH2Dataset/PH2 Dataset images/IMD105'
  59. 'PH2Dataset/PH2 Dataset images/IMD107'
  60. 'PH2Dataset/PH2 Dataset images/IMD108'
  61. 'PH2Dataset/PH2 Dataset images/IMD112'
  62. 'PH2Dataset/PH2 Dataset images/IMD118'
  63. 'PH2Dataset/PH2 Dataset images/IMD120'
  64. 'PH2Dataset/PH2 Dataset images/IMD125'
  65. 'PH2Dataset/PH2 Dataset images/IMD126'
  66. 'PH2Dataset/PH2 Dataset images/IMD132'
  67. 'PH2Dataset/PH2 Dataset images/IMD133'
  68. 'PH2Dataset/PH2 Dataset images/IMD134'
  69. 'PH2Dataset/PH2 Dataset images/IMD135'
  70. 'PH2Dataset/PH2 Dataset images/IMD137'
  71. 'PH2Dataset/PH2 Dataset images/IMD138'
  72. 'PH2Dataset/PH2 Dataset images/IMD139'
  73. 'PH2Dataset/PH2 Dataset images/IMD140'
  74. 'PH2Dataset/PH2 Dataset images/IMD142'
  75. 'PH2Dataset/PH2 Dataset images/IMD143'
  76. 'PH2Dataset/PH2 Dataset images/IMD144'
  77. 'PH2Dataset/PH2 Dataset images/IMD146'
  78. 'PH2Dataset/PH2 Dataset images/IMD147'
  79. 'PH2Dataset/PH2 Dataset images/IMD149'
  80. 'PH2Dataset/PH2 Dataset images/IMD150'
  81. 'PH2Dataset/PH2 Dataset images/IMD152'
  82. 'PH2Dataset/PH2 Dataset images/IMD153'
  83. 'PH2Dataset/PH2 Dataset images/IMD154'
  84. 'PH2Dataset/PH2 Dataset images/IMD155'
  85. 'PH2Dataset/PH2 Dataset images/IMD156'
  86. 'PH2Dataset/PH2 Dataset images/IMD157'
  87. 'PH2Dataset/PH2 Dataset images/IMD159'
  88. 'PH2Dataset/PH2 Dataset images/IMD160'
  89. 'PH2Dataset/PH2 Dataset images/IMD161'
  90. 'PH2Dataset/PH2 Dataset images/IMD162'
  91. 'PH2Dataset/PH2 Dataset images/IMD164'
  92. 'PH2Dataset/PH2 Dataset images/IMD166'
  93. 'PH2Dataset/PH2 Dataset images/IMD168'
  94. 'PH2Dataset/PH2 Dataset images/IMD169'
  95. 'PH2Dataset/PH2 Dataset images/IMD170'
  96. 'PH2Dataset/PH2 Dataset images/IMD171'
  97. 'PH2Dataset/PH2 Dataset images/IMD173'
  98. 'PH2Dataset/PH2 Dataset images/IMD175'
  99. 'PH2Dataset/PH2 Dataset images/IMD176'
  100. 'PH2Dataset/PH2 Dataset images/IMD177'
  101. 'PH2Dataset/PH2 Dataset images/IMD182'
  102. 'PH2Dataset/PH2 Dataset images/IMD196'
  103. 'PH2Dataset/PH2 Dataset images/IMD197'
  104. 'PH2Dataset/PH2 Dataset images/IMD198'
  105. 'PH2Dataset/PH2 Dataset images/IMD199'
  106. 'PH2Dataset/PH2 Dataset images/IMD200'
  107. 'PH2Dataset/PH2 Dataset images/IMD203'
  108. 'PH2Dataset/PH2 Dataset images/IMD204'
  109. 'PH2Dataset/PH2 Dataset images/IMD206'
  110. 'PH2Dataset/PH2 Dataset images/IMD207'
  111. 'PH2Dataset/PH2 Dataset images/IMD208'
  112. 'PH2Dataset/PH2 Dataset images/IMD210'
  113. 'PH2Dataset/PH2 Dataset images/IMD211'
  114. 'PH2Dataset/PH2 Dataset images/IMD219'
  115. 'PH2Dataset/PH2 Dataset images/IMD226'
  116. 'PH2Dataset/PH2 Dataset images/IMD240'
  117. 'PH2Dataset/PH2 Dataset images/IMD242'
  118. 'PH2Dataset/PH2 Dataset images/IMD243'
  119. 'PH2Dataset/PH2 Dataset images/IMD251'
  120. 'PH2Dataset/PH2 Dataset images/IMD254'
  121. 'PH2Dataset/PH2 Dataset images/IMD256'
  122. 'PH2Dataset/PH2 Dataset images/IMD278'
  123. 'PH2Dataset/PH2 Dataset images/IMD279'
  124. 'PH2Dataset/PH2 Dataset images/IMD280'
  125. 'PH2Dataset/PH2 Dataset images/IMD284'
  126. 'PH2Dataset/PH2 Dataset images/IMD285'
  127. 'PH2Dataset/PH2 Dataset images/IMD304'
  128. 'PH2Dataset/PH2 Dataset images/IMD305'
  129. 'PH2Dataset/PH2 Dataset images/IMD306'
  130. 'PH2Dataset/PH2 Dataset images/IMD312'
  131. 'PH2Dataset/PH2 Dataset images/IMD328'
  132. 'PH2Dataset/PH2 Dataset images/IMD331'
  133. 'PH2Dataset/PH2 Dataset images/IMD339'
  134. 'PH2Dataset/PH2 Dataset images/IMD347'
  135. 'PH2Dataset/PH2 Dataset images/IMD348'
  136. 'PH2Dataset/PH2 Dataset images/IMD349'
  137. 'PH2Dataset/PH2 Dataset images/IMD356'
  138. 'PH2Dataset/PH2 Dataset images/IMD360'
  139. 'PH2Dataset/PH2 Dataset images/IMD364'
  140. 'PH2Dataset/PH2 Dataset images/IMD365'
  141. 'PH2Dataset/PH2 Dataset images/IMD367'
  142. 'PH2Dataset/PH2 Dataset images/IMD368'
  143. 'PH2Dataset/PH2 Dataset images/IMD369'
  144. 'PH2Dataset/PH2 Dataset images/IMD370'
  145. 'PH2Dataset/PH2 Dataset images/IMD371'
  146. 'PH2Dataset/PH2 Dataset images/IMD372'
  147. 'PH2Dataset/PH2 Dataset images/IMD374'
  148. 'PH2Dataset/PH2 Dataset images/IMD375'
  149. 'PH2Dataset/PH2 Dataset images/IMD376'
  150. 'PH2Dataset/PH2 Dataset images/IMD378'
  151. 'PH2Dataset/PH2 Dataset images/IMD379'
  152. 'PH2Dataset/PH2 Dataset images/IMD380'
  153. 'PH2Dataset/PH2 Dataset images/IMD381'
  154. 'PH2Dataset/PH2 Dataset images/IMD382'
  155. 'PH2Dataset/PH2 Dataset images/IMD383'
  156. 'PH2Dataset/PH2 Dataset images/IMD384'
  157. 'PH2Dataset/PH2 Dataset images/IMD385'
  158. 'PH2Dataset/PH2 Dataset images/IMD386'
  159. 'PH2Dataset/PH2 Dataset images/IMD388'
  160. 'PH2Dataset/PH2 Dataset images/IMD389'
  161. 'PH2Dataset/PH2 Dataset images/IMD390'
  162. 'PH2Dataset/PH2 Dataset images/IMD392'
  163. 'PH2Dataset/PH2 Dataset images/IMD393'
  164. 'PH2Dataset/PH2 Dataset images/IMD394'
  165. 'PH2Dataset/PH2 Dataset images/IMD395'
  166. 'PH2Dataset/PH2 Dataset images/IMD396'
  167. 'PH2Dataset/PH2 Dataset images/IMD397'
  168. 'PH2Dataset/PH2 Dataset images/IMD398'
  169. 'PH2Dataset/PH2 Dataset images/IMD399'
  170. 'PH2Dataset/PH2 Dataset images/IMD400'
  171. 'PH2Dataset/PH2 Dataset images/IMD402'
  172. 'PH2Dataset/PH2 Dataset images/IMD403'
  173. 'PH2Dataset/PH2 Dataset images/IMD404'
  174. 'PH2Dataset/PH2 Dataset images/IMD405'
  175. 'PH2Dataset/PH2 Dataset images/IMD406'
  176. 'PH2Dataset/PH2 Dataset images/IMD407'
  177. 'PH2Dataset/PH2 Dataset images/IMD408'
  178. 'PH2Dataset/PH2 Dataset images/IMD409'
  179. 'PH2Dataset/PH2 Dataset images/IMD410'
  180. 'PH2Dataset/PH2 Dataset images/IMD411'
  181. 'PH2Dataset/PH2 Dataset images/IMD413'
  182. 'PH2Dataset/PH2 Dataset images/IMD417'
  183. 'PH2Dataset/PH2 Dataset images/IMD418'
  184. 'PH2Dataset/PH2 Dataset images/IMD419'
  185. 'PH2Dataset/PH2 Dataset images/IMD420'
  186. 'PH2Dataset/PH2 Dataset images/IMD421'
  187. 'PH2Dataset/PH2 Dataset images/IMD423'
  188. 'PH2Dataset/PH2 Dataset images/IMD424'
  189. 'PH2Dataset/PH2 Dataset images/IMD425'
  190. 'PH2Dataset/PH2 Dataset images/IMD426'
  191. 'PH2Dataset/PH2 Dataset images/IMD427'
  192. 'PH2Dataset/PH2 Dataset images/IMD429'
  193. 'PH2Dataset/PH2 Dataset images/IMD430'
  194. 'PH2Dataset/PH2 Dataset images/IMD431'
  195. 'PH2Dataset/PH2 Dataset images/IMD432'
  196. 'PH2Dataset/PH2 Dataset images/IMD433'
  197. 'PH2Dataset/PH2 Dataset images/IMD434'
  198. 'PH2Dataset/PH2 Dataset images/IMD435'
  199. 'PH2Dataset/PH2 Dataset images/IMD436'
  200. 'PH2Dataset/PH2 Dataset images/IMD437'

Each of these directories has a similar structure; we will pick a random one and list its contents.

In [242]:
name_sample <- training_set[sample.int(nrow(training_set),1), ]
name_sample$Name
IMD150

To get there we just need to build the directory path - this will be very important in the next step - by simply appending the image name to the images directory.

In [243]:
dir(sprintf("PH2Dataset/PH2 Dataset images/%s", name_sample$Name), full.names=T, recursive=T)

Is that directory empty? We actually know it isn't (you can download the archive and check it yourself), so what happened? The answer lies in the Name field: if we look at it more closely we can see that it contains extra characters.

In [244]:
sprintf("PH2Dataset/PH2 Dataset images/%s", name_sample$Name)
'PH2Dataset/PH2 Dataset images/ IMD150 '

Since we didn't clean the Name column properly it contains extra whitespace, which ends up in the path passed to dir. The solution is to trim all the names, removing any surrounding whitespace.

Note that this, like many other actions we will perform, would be better done on the initial ph2_table data frame, which contains the complete set of observations and from which we would then extract the needed datasets; but since we're learning as we go, we will correct the derived datasets separately.
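As a sketch of that cleaner alternative (for reference only, not run here), base R's trimws, available since R 3.2.0, would do the trimming once at the source:

```r
## Sketch of the cleaner route: trim the Name column once in ph2_table,
## before any training/testing subsets are taken from it
ph2_table$Name <- trimws(ph2_table$Name)  # base R, since 3.2.0
```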

In [245]:
## Trim function: remove unneeded whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
training_set$Name <- trim(training_set$Name)
testing_set$Name <- trim(testing_set$Name)

With this fixed let's see how it works.

In [246]:
## Get a new sample
name_sample <- training_set[sample.int(nrow(training_set),1), ]
## Check the file path
sprintf("PH2Dataset/PH2 Dataset images/%s", name_sample$Name)
'PH2Dataset/PH2 Dataset images/IMD080'

Looks correct, so we can try to obtain the listing of the image directory again, and hopefully this time it will work.

In [247]:
dir(sprintf("PH2Dataset/PH2 Dataset images/%s", name_sample$Name), full.names=T, recursive=T)
  1. 'PH2Dataset/PH2 Dataset images/IMD080/IMD080_Dermoscopic_Image/IMD080.bmp'
  2. 'PH2Dataset/PH2 Dataset images/IMD080/IMD080_lesion/IMD080_lesion.bmp'

Great, it worked.

The image we are going to use is the one in the Dermoscopic_Image directory; we will not use the additional images made available (one always present, the other optional), which are masks that separate the lesion itself from the surrounding tissue. This is explained in the Readme.txt file, which should really have been the very first thing we read in order to understand the database, so let's take a look now.

In [248]:
readLines("PH2Dataset/Readme.txt")
Warning message in readLines("PH2Dataset/Readme.txt"):
“incomplete final line found on 'PH2Dataset/Readme.txt'”
 [1] "###############################################################################################################"                                                                   
 [2] "\t\t\t\t\t\t \tReadme"                                                                                                                                                             
 [3] "###############################################################################################################"                                                                   
 [4] "PH\xb2 Dataset contents:"                                                                                                                                                          
 [5] ""                                                                                                                                                                                  
 [6] "PH\xb2 Dataset images folder: Inside this folder there is a dedicated folder for every image of the database, which contains the original dermoscopic image, "                     
 [7] "the binary mask of the segmented lesion as well as the binary masks of the color classes presented in the skin lesion."                                                            
 [8] ""                                                                                                                                                                                  
 [9] "PH\xb2 Dataset.xlsx file: This file contains the classification of all images in a \".xlsx\" file according to the dermoscopic criteria that are evaluated in the PH\xb2 database."
[10] ""                                                                                                                                                                                  
[11] "PH\xb2 Dataset.txt file: This file contains the classification of all images in a \".txt\" file according to the dermoscopic criteria that are evaluated in the PH\xb2 database."  
[12] ""                                                                                                                                                                                  
[13] ""                                                                                                                                                                                  
[14] "###############################################################################################################"                                                                   

We are now able to build a correct image file path for each observation, so we add a new column - File - containing that information.

In [249]:
## Trim the entries in the positive and negative sets
training_positive$Name <- trim(training_positive$Name)
training_negative$Name <- trim(training_negative$Name)
## Add the file image path to each observation
training_positive$File <- sprintf("PH2Dataset/PH2 Dataset images/%s/%s_Dermoscopic_Image/%s.bmp", training_positive$Name, training_positive$Name, training_positive$Name)
training_negative$File <- sprintf("PH2Dataset/PH2 Dataset images/%s/%s_Dermoscopic_Image/%s.bmp", training_negative$Name, training_negative$Name, training_negative$Name)
## Since we could forget about this later we will do the same to the testing set right now
testing_set$File <- sprintf("PH2Dataset/PH2 Dataset images/%s/%s_Dermoscopic_Image/%s.bmp", testing_set$Name, testing_set$Name, testing_set$Name)

We can now see that the file path appears in an additional column:

In [250]:
head(training_positive)
head(training_negative)
head(testing_set)
NameHistological.DiagnosisClinical.DiagnosisAsymmetryPigment.NetworkDots.GlobulesStreaksRegression.AreasBlue.Whitish.VeilColorsFile
162IMD061 2 2 AT A A P P 3 5 PH2Dataset/PH2 Dataset images/IMD061/IMD061_Dermoscopic_Image/IMD061.bmp
163IMD063 Melanoma 2 2 AT AT A A P 3 4 PH2Dataset/PH2 Dataset images/IMD063/IMD063_Dermoscopic_Image/IMD063.bmp
165IMD065 2 2 AT A A P P 4 6 PH2Dataset/PH2 Dataset images/IMD065/IMD065_Dermoscopic_Image/IMD065.bmp
166IMD080 Melanoma 2 2 AT A P P P 2 4 6 PH2Dataset/PH2 Dataset images/IMD080/IMD080_Dermoscopic_Image/IMD080.bmp
167IMD085 2 2 AT A A A P 5 6 PH2Dataset/PH2 Dataset images/IMD085/IMD085_Dermoscopic_Image/IMD085.bmp
168IMD088 Melanoma 2 2 AT AT P P P 1 4 5 6 PH2Dataset/PH2 Dataset images/IMD088/IMD088_Dermoscopic_Image/IMD088.bmp
NameHistological.DiagnosisClinical.DiagnosisAsymmetryPigment.NetworkDots.GlobulesStreaksRegression.AreasBlue.Whitish.VeilColorsFile
1IMD003 0 0 T A A A A 4 PH2Dataset/PH2 Dataset images/IMD003/IMD003_Dermoscopic_Image/IMD003.bmp
2IMD009 0 0 T A A A A 3 PH2Dataset/PH2 Dataset images/IMD009/IMD009_Dermoscopic_Image/IMD009.bmp
3IMD016 0 0 T T A A A 3 4 PH2Dataset/PH2 Dataset images/IMD016/IMD016_Dermoscopic_Image/IMD016.bmp
5IMD024 0 0 T A A A A 3 4 PH2Dataset/PH2 Dataset images/IMD024/IMD024_Dermoscopic_Image/IMD024.bmp
7IMD035 0 2 T A A A A 2 3 PH2Dataset/PH2 Dataset images/IMD035/IMD035_Dermoscopic_Image/IMD035.bmp
8IMD038 0 0 T T A A A 4 6 PH2Dataset/PH2 Dataset images/IMD038/IMD038_Dermoscopic_Image/IMD038.bmp
NameHistological.DiagnosisClinical.DiagnosisAsymmetryPigment.NetworkDots.GlobulesStreaksRegression.AreasBlue.Whitish.VeilColorsFile
4IMD022 0 0 T A A A A 3 PH2Dataset/PH2 Dataset images/IMD022/IMD022_Dermoscopic_Image/IMD022.bmp
6IMD025 0 0 T T A A A 3 PH2Dataset/PH2 Dataset images/IMD025/IMD025_Dermoscopic_Image/IMD025.bmp
9IMD042 0 0 T T A A A 3 4 PH2Dataset/PH2 Dataset images/IMD042/IMD042_Dermoscopic_Image/IMD042.bmp
12IMD050 0 0 T T A A A 3 PH2Dataset/PH2 Dataset images/IMD050/IMD050_Dermoscopic_Image/IMD050.bmp
14IMD101 0 0 T A A A A 3 PH2Dataset/PH2 Dataset images/IMD101/IMD101_Dermoscopic_Image/IMD101.bmp
30IMD162 0 0 T T A A A 3 4 PH2Dataset/PH2 Dataset images/IMD162/IMD162_Dermoscopic_Image/IMD162.bmp
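With the File column in place we have everything needed to build the archives the API expects. The actual creation happens after the images are converted in the next section, but as a hedged sketch, using utils::zip, which relies on an external zip binary being available (make_archive is a hypothetical helper name):

```r
## Sketch: package a set of image files into one archive; the workshop
## builds the real archives only after the format conversion below
make_archive <- function(zip_name, files) {
  unlink(zip_name, force = TRUE)   # remove any archive from a previous run
  utils::zip(zip_name, files = files)
  file.exists(zip_name)            # TRUE when the archive was written
}
## e.g. make_archive("training-set-positive.zip", training_positive$File)
```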

Image format conversion

Everything looks fine... but there's nothing like making sure things actually work, so we will load and display one image from each set, using two random samples.

In [251]:
## Pick a random sample from the training set
negative_sample <- training_negative[sample.int(nrow(training_negative),1), ]
positive_sample <- training_positive[sample.int(nrow(training_positive),1), ]

## Use read.bmp to read the image and then create a pixmapRGB object that can be "plotted"
negative_image <- pixmapRGB(read.bmp(negative_sample$File))
positive_image <- pixmapRGB(read.bmp(positive_sample$File))

## We use plot to display the image, and par to display them in a single row, side by side
##par(mfrow=c(1,2))
plot(negative_image, sub = negative_sample$Name)
plot(positive_image, sub = positive_sample$Name)
Warning message in rep(cellres, length = 2):
“'x' is NULL so the result will be NULL”Warning message in rep(cellres, length = 2):
“'x' is NULL so the result will be NULL”
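The conversion this section is named for re-encodes each BMP as a JPEG, since the Visual Recognition service expects JPEG/PNG input; a minimal sketch, assuming the bmp and jpeg packages loaded in the setup cell (bmp_to_jpeg is a hypothetical helper name):

```r
## Sketch: convert one BMP to a JPEG in the working directory.
## read.bmp returns 0-255 integer values, while writeJPEG expects
## values in [0, 1], hence the division by 255.
bmp_to_jpeg <- function(bmp_path, quality = 0.95) {
  img <- bmp::read.bmp(bmp_path) / 255
  jpeg_path <- sub("\\.bmp$", ".jpg", basename(bmp_path))
  jpeg::writeJPEG(img, target = jpeg_path, quality = quality)
  jpeg_path
}
## e.g. jpegs <- sapply(training_positive$File, bmp_to_jpeg)
```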