2  Data Wrangling

This section deals with the nitty gritty of data analysis. There’s no nice plots like the previous chapter. In fact, this data wrangling is the major aspect of data science.

2.1 Data

This section on atomic vector, arrays, matrix and list is often considered boring and ignored.

2.1.1 Vector, Arrays, Matrix

2.1.1.1 Vector and list

All elements of an atomic vector and arrays are the same. A vector is an array with one dimension . List can contain different types of data. A complex example of a structured list is the json format shown below. In base R, c is a function for creating a vector or list. The function list can also be used to create list. It is best to avoid using c when assigning a name to a dataframe or vector (Wickham 2019).

a<-c(1,2,3)
is.vector(a)
[1] TRUE
class(a) #numeric vector
[1] "numeric"
b<-c("apple","orange","banana")
is.vector(b) #character vector
[1] TRUE
class(b)
[1] "character"
d<-c(1,2,"banana")
is.list(d) #character vector
[1] FALSE
class(d) #FALSE
[1] "character"
e<-list(a,b,c(TRUE,FALSE,FALSE))
is.list(e) #TRUE
[1] TRUE

2.1.1.2 Matrix and Arrays

In R, an array is a vector organised with attributes such as dimensions. It is of a single data type. It contains a description of the number of dimension. Array can also be accessed by its indexing.

Later in this chapter, we will illustrate the importance of arrays for manipulating MRI scans. A volumetric mri scan, there are 3 dimenions [,,]. The first column is the sagittal sequence, second column is the coronal sequence and the third column is the axial sequence. In the example below, this knowledge of arrays can be used to reverse the ordering of information on MRI scan (flip on it axis between left and right).

#vector
vector1<-c(1,2,3)
vector2<-c(4,5,6,7,8,9)

# 2 dimensions
array1<-array(c(vector1,vector2),dim = c(3,3,2))
array1
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
# 3 dimensions
array2<-array(c(vector1,vector2),dim = c(2,2,3))
array2
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9    2
[2,]    1    3
#check if array or matrix
is.matrix(array1)
[1] FALSE
is.array(array1)
[1] TRUE

Arrays can be accessed by indexing its structure.

array1[,,1]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

The second array of array1

array1[,,2]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

First row of array1

array1[1,,]
     [,1] [,2]
[1,]    1    1
[2,]    4    4
[3,]    7    7

First column of array1

array1[,1,]
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3

A matrix is a rectangular array of a single data type arranged in rows and columns or a 2 dimensional array. Data type can be number or character. Matrix can be considered a linear combination of vector into a linear map. In R, the function is.matrix returns true only if there is a dimension argument in the data. This definition also means that a dat frame is not a matrix.

M1<-matrix(c(1,2,3, 4,5,6),nrow=2, ncol=3, byrow = T,
           dimnames = list(c("row1", "row2"),
                        c("Col1", "Col2", "Col3")))
M1
     Col1 Col2 Col3
row1    1    2    3
row2    4    5    6

A tensor is multidimensional array. Using the analogy before about a matrix being a linear map, a tensor can be viewed as a multilinear map. More on tensor later.

2.1.1.3 array operations

Array operations are performed element wise.

arrayA<-array(c(1,2,3),dim=c(1,3)) #1 row 3 columns
arrayA
     [,1] [,2] [,3]
[1,]    1    2    3
arrayB<-array(c(4,5,6),dim=c(1,3))
arrayB
     [,1] [,2] [,3]
[1,]    4    5    6
#addition
arrayC=arrayA+arrayB
arrayC
     [,1] [,2] [,3]
[1,]    5    7    9

Now we illustrate multiplication of arrays

arrayD=arrayA*arrayB
arrayD
     [,1] [,2] [,3]
[1,]    4   10   18

Division of arrays

arrayE<-arrayA/arrayB
arrayE
     [,1] [,2] [,3]
[1,] 0.25  0.4  0.5

2.1.2 Operators

2.1.2.0.1 Relation operators

We test if variable A is same as B by using == operator while we test if A is not equal to B by != operator. A is greater than B by > operator and A is lesser than B by < operator.

2.1.2.1 Logical operators

The | operator signifies elementwise or relationship and || signifies or relationship. The & operator signifies elementwise and relationship and && signifies and relationship.

2.1.2.2 %in%

The %in% operator helps to evaluate if an element belong to a vector. The package Hmisc contains an operator %nin% which is the opposite of %in%.

2.1.2.3 %>%

The %>% operator or pipe function, from magrittr package, is used to pass information to the next function.

DF<-data.frame(Disability=c("Good","Moderate","Good","Poor"),NIHSS=c(2,10,1,20),Sex=c("M","F","M","M" ))

#create new variable outcome
DF$outcome=ifelse(DF$Disability %in% c("Good","Moderate"), 0,1)
DF
  Disability NIHSS Sex outcome
1       Good     2   M       0
2   Moderate    10   F       0
3       Good     1   M       0
4       Poor    20   M       1

2.1.3 Simple function

a function is written by creating name of function, calling on function (x) and brackets to define function.

# function sun
ArithSum<-function (x) {
  sum(x)
}

vector1<-c(1,2,3)
ArithSum(vector1)
[1] 6

2.1.4 for loop

vector2<-c(4,5,6)

for (i in vector2){
  print(i)
}
[1] 4
[1] 5
[1] 6

Perform math operation using for loop

for (i in vector2){
  m= sum(vector2)/length(vector2)
}

m
[1] 5

The next command can be used to modify the loop. This simple example removes even number

for(i in vector2){
  if(!i %% 2){
    next
  }
  print(i)
}
[1] 5

Perform a math operation along the columns.

A=c(1,2,3)
B=c(4,5,6)
df=data.frame(A,B)

output_col <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output_col[[i]] <- mean(df[[i]])      # 3. body
}
output_col
[1] 2 5

Perform a math operation along the rows.

output_row <- vector("double", nrow(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output_row[[i]] <- mean(df[[i]])      # 3. body
}
output_row
[1] 2 5 0

2.1.5 apply, lapply, sapply

2.1.5.1 apply

The apply function works on data in array format.

#sum data across rows ie c(1)
apply(array1,c(1),sum)
[1] 24 30 36
#sum data across columns(2)
apply(array1,c(2),sum)
[1] 12 30 48
#sum data across rows and columns c(1,2)
apply(array1,c(1,2),sum)
     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18

2.1.5.2 lapply

The call lapply applies a function to a list. In the section below of medical images the lapply function will be used to creates a list of nifti files and which can be opened painlessly with additional call to readNIfTI. Here a more simple example is used.

a<-c(1,2,3,4,5,6,7,8)
lapply(a, function(x) x^3)
[[1]]
[1] 1

[[2]]
[1] 8

[[3]]
[1] 27

[[4]]
[1] 64

[[5]]
[1] 125

[[6]]
[1] 216

[[7]]
[1] 343

[[8]]
[1] 512

2.1.5.3 sapply

The call sapply applies a function to a vector, matrix or list. It returns the results in the form of a matrix.

a<-c(1,2,3,4)
sapply(a, function(x) x^2)
[1]  1  4  9 16

2.1.5.4 tapply

The tapply function applies a function to a subset of the data.

dat.array1<-as.data.frame.array(array1)
dat.array1
  V1 V2 V3 V4 V5 V6
1  1  4  7  1  4  7
2  2  5  8  2  5  8
3  3  6  9  3  6  9
dat.array1$V6<- dat.array1$V6 %% 2
dat.array1$V6
[1] 1 0 1
tapply(dat.array1$V1, dat.array1$V6, sum) 
0 1 
2 4 

The rapply function or recursive apply is applied to a list, contingent on the second argument. In this example, the first function is to multiple elements of the list by 2 contingent on the element being numeric.

a<-c(1,2,3.4,D,5, "NA")

rapply(a, function(x) x*2, class="numeric")
[1]  2.0  4.0  6.8 10.0

2.1.6 Functional

A functional is a function embedded in a function. Later, we will revisit functional to perform analysis on a list of nifti files.

Fou<-lapply(df, function(A) mean(A)/sd(A))

#returns a list
Fou
$A
[1] 2

$B
[1] 5

The unlist function returns a matrix.

Fou2<-unlist(lapply(df, function(A) mean(A)/sd(A)))

#returns a matrix
Fou2
A B 
2 5 

2.1.6.1 Mapply

Mapply applies the function over elements of the data.

MeanFou<-lapply(df, mean)

MeanFou[]<-Map('/', df, MeanFou)

The equivalent code with lapply is provided below.

MeanFou2<-lapply(df, function(M) M/mean(M))

MeanFou2
$A
[1] 0.5 1.0 1.5

$B
[1] 0.8 1.0 1.2

2.1.7 Iteration

This is an extension of the discussion on functional programming. Here we will be using purrr library and map function to apply function to multiple input. First, we use split function to partition data.

library(purrr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
ss<-read.csv("./Data-Use/ss150718.csv") 
ss%>% 
  select(duplicate, PubYear, TP,FP,FN,TN) |> 
  split(ss$duplicate) %>% head()
$no
   duplicate PubYear TP FP  FN  TN
1         no    2007 10  3   1  25
4         no    2009 49 22   6 290
5         no    2010  7 10   1  41
6         no    2010 10  9   6  85
7         no    2010 11  0   4  12
8         no    2011 23  7   9 100
9         no    2011 60 16  17 219
10        no    2011  5 12   8  64
13        no    2013  5  6   3  57
14        no    2013  6 13   2  44
15        no    2013 16 11  16  58
19        no    2014  7  8   4  55
20        no    2014 12  1   3  37
21        no    2014 15 25  29 174
22        no    2014 25 15  53 194
24        no    2014 26 21  31 238
25        no    2014 33 41 123 620
26        no    2014 35 26  12 114
30        no    2016 10  7   6 100
32        no    2016 20  5  12  35
35        no    2017 13 11  40  69

$yes
   duplicate PubYear TP FP FN  TN
2        yes    2007 13 45  1  45
3        yes    2009 14  7  4  36
11       yes    2012 33 41 38 279
12       yes    2012 37 24 36 131
16       yes    2013 16 15  9  91
17       yes    2013 17  7 11  77
18       yes    2013  7  3  3   8
23       yes    2014 25 19  2  37
27       yes    2014 37 24 36 131
28       yes    2014 44 43 45 188
29       yes    2016  6 10 12  52
31       yes    2017 16 12  9  78
33       yes    2016 32 37 20 103
34       yes    2017 55 66 67 521
36       yes    2017 19 13 11  86
37       yes    2017 10 14  4 112
38       yes    2016 75 95 81 531
39       yes    2017 NA NA NA  NA

Following on from the splitting of the data, we perform the regression on the split data with map.

ss %>% select(duplicate, PubYear, TP, FP, FN, TN) %>% mutate(Sensitivity=TP/(TP+FN)) |> 
  split(ss$duplicate) |>
  map(\(df) lm(Sensitivity ~ PubYear, data = df))|> 
  map(summary) 
$no

Call:
lm(formula = Sensitivity ~ PubYear, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32962 -0.11367  0.02946  0.13633  0.25884 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 99.98638   31.22820   3.202  0.00470 **
PubYear     -0.04938    0.01552  -3.182  0.00491 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1762 on 19 degrees of freedom
Multiple R-squared:  0.3477,    Adjusted R-squared:  0.3134 
F-statistic: 10.13 on 1 and 19 DF,  p-value: 0.004905


$yes

Call:
lm(formula = Sensitivity ~ PubYear, data = df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.225540 -0.104559  0.002324  0.100728  0.314517 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 53.51463   26.03472   2.056   0.0577 .
PubYear     -0.02627    0.01293  -2.032   0.0603 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1491 on 15 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2158,    Adjusted R-squared:  0.1636 
F-statistic: 4.129 on 1 and 15 DF,  p-value: 0.06026

We can extend this function to each element of a vector to get the output for the linear regression on the partition data.

ss %>% select(duplicate, PubYear, TP, FP, FN, TN) %>% mutate(Sensitivity=TP/(TP+FN)) |> 
  split(ss$duplicate) |>
  map(\(df) lm(Sensitivity ~ PubYear, data = df))|> 
  map(summary) %>%
  map_dbl("r.squared")
       no       yes 
0.3476832 0.2158460 

2.1.7.1 map characters

The map function can be applied to characters. This can be illustrated using the spot sign data

map2_chr(ss$Authors, ss$PubYear, ~paste(.x, .y, sep=": ")) %>% head()
[1] "Wada: 2007"           "Goldstein: 2007"      "Delgado Almand: 2009"
[4] "Ederies: 2009"        "Hallevi: 2010"        "Park: 2010"          

The characters can be mapped to upper case.

library(stringr)
map2_chr(ss$Authors, ss$PubYear, ~paste(.x, .y, sep=" - ")) %>% 
  map_chr(~str_to_upper(.)) %>%
  head()
[1] "WADA - 2007"           "GOLDSTEIN - 2007"      "DELGADO ALMAND - 2009"
[4] "EDERIES - 2009"        "HALLEVI - 2010"        "PARK - 2010"          

The purrr library has function to pluck items from the list or data frame.

map2_chr(ss$Authors, ss$PubYear, ~paste(.x, .y, sep=" - ")) %>% 
  pluck(1) 
[1] "Wada - 2007"

2.2 Data storage

Often one assumes that opening Rstudio is sufficient to locate the file and run the analysis. One way of doing this at the console is to click on Session tab, then Set Working Directory to location of file. Another way of doing this seemlessly is to use the library here. It is easy to find the files in your directory using the list.files() call.

To list only some files use pattern matching.

We can increase the complexity of pattern matching.

#list files matching pattern
list.files(pattern=".Rmd|*.stan")
[1] "lm.stan"

2.2.1 Data frame

Data frame is a convenient way of formatting data in table format. It is worth checking the structure of data. Some libraries prefer to work with data in data frame while others prefer matrix or array structure.

a<-c(1,2,3)
b<-c("apple","orange","banana")
e<-data.frame(a,b)
rownames(e)
[1] "1" "2" "3"

2.2.2 Excel data

Excel data are stored as csv, xls and xlsx. Csv files can be open in base R using read.csv function or using readr library and read_csv function. I would urge you to get used to manipulating data in R as the codes serve as a way to keep track with the change in data. The original xcel data should not be touched outside of R. A problem with excel data is that its autocorrect function change provenance of genomic data eg SEPT1, MARCH1. SEPT1 is now relabeled as SEPTIN1.

A<-read.csv("File.csv")

B<-readr::read_csv ("File.csv")

The readxl library can be used to open files ending in xls and xlsx.

C<-readxl::read_xlsx("File.xlsx",skip=1) #skip first row 

D<-readxl::read_xlsx("File.xlsx",sheet=2) #read data from sheet 2

2.2.3 Date and time

Date and time can be handle in base R.

#character
DateofEvent=c("12/03/2005","12/04/2006")
class(DateofEvent)
[1] "character"

The as.Date function converts character to Date object

#convert to Date object
DateofEvent=as.Date(DateofEvent)
class(DateofEvent)
[1] "Date"

The library lubridate is useful for parsing date data. It is possible to get an overview of the functions of the library by typing help(package=lubridate). Errors with parsing can occur if there are characters in the column containing date data.

library(dplyr)

dfdate<-data.frame("DateofEvent"=c("12/03/2005","12/04/2006",NA),
                   "Result"=c(4,5,6))
class(dfdate$DateofEvent)
[1] "character"
dfdate$DateofEvent
[1] "12/03/2005" "12/04/2006" NA          

The date column appears to be listed as character because of NA. This is easily fixed by filtering NA.

dfdate$`Date.of.Event`<-dfdate$DateofEvent
#remove NA using filter
dfdate %>% filter(!is.na(DateofEvent))
  DateofEvent Result Date.of.Event
1  12/03/2005      4    12/03/2005
2  12/04/2006      5    12/04/2006
#re assigning date data type
dfdate$DateofEvent2<-as.POSIXct(dfdate$DateofEvent)
dfdate$DateofEvent3<-as.POSIXct(dfdate$`Date.of.Event`)
class(dfdate$DateofEvent2)
[1] "POSIXct" "POSIXt" 
dfdate$DateofEvent2
[1] "0012-03-20 LMT" "0012-04-20 LMT" NA              

Problem can occur when date and time are located in separate columns. The first issue is that time data is assigned a default date 1899-12-30. This error occurs as the base date in MS Office is 1899-12-30. This issue can be compounded when there are 2 time data eg a patient has stroke onset at 22:00 and is scanned at 02:00. The data become 1899-12-31 22:00 and 1899-12-31 02:00. In this case it implies that scanning occurs before stroke onset. This could have been solved at the data collection stage by having 2 separate date colums. There are several solutions inclusing ifelse but care must be taken with this argument. Note that ifelse argument convert date time data to numeric class. This can be resolved by embedding ifelse statement within as.Date argument. This argument requires origin argument. The logic argument may fail when the default date differ eg 1904-08 02:00. In this case 1904-02-08 02:00 is greater than 1899-12-31 22:00.

library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0     ✔ tibble  3.2.1
✔ ggplot2 3.5.1     ✔ tidyr   1.3.0
✔ readr   2.1.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df<-data.frame("Onset_date"=c("2014-03-06","2013-06-09"), "Onset_Time"=c("1899-12-31 08:03:00","1899-12-31 22:00:00"), "Scan_Time"=c("1904-02-08 10:00:00","1899-12-31 02:00:00")) %>% 
  mutate(
  #use update argument to reassign year, month and day
  Scantime=update(ymd_hms(Scan_Time), year=year(ymd(Onset_date)), month=month(ymd(Onset_date)), mday=day(ymd(Onset_date))),
 
  Onsettime=update(ymd_hms(Onset_Time), year=year(ymd(Onset_date)), month=month(ymd(Onset_date)), mday=day(ymd(Onset_date))),
  
  Scan_date=ifelse(Onsettime>Scantime,1,0),
  Scantime=Scantime+Scan_date*hms("24:00:00"),
  
 DiffHour=Scantime-Onsettime #minutes
)

df %>% select(-Scan_date) %>% gt::gt()
Onset_date Onset_Time Scan_Time Scantime Onsettime DiffHour
2014-03-06 1899-12-31 08:03:00 1904-02-08 10:00:00 2014-03-06 10:00:00 2014-03-06 08:03:00 1.95
2013-06-09 1899-12-31 22:00:00 1899-12-31 02:00:00 2013-06-10 02:00:00 2013-06-09 22:00:00 4

In some case, the origin is 1970-01-01. In this case 1 is 1970-01-02 and 2 is 1970-01-03 and so on. This example was provided in the creation of Gantt chart.

One way of storing data in R format is to save the file as .Rda. This format will ensure that no one can accidentally rewrite or delete a number. For very large data, it’s quicker to save as .Rda file than as csv file.

2.2.4 Foreign data

The foreign library is traditionally use to handle data from SPSS (.sav), Stata (.dta) and SAS (.sas). One should look at the ending in the file to determine the necessary library.

library(foreign)
#write and read Stata
write.dta(dfdate,file="./Data-Use/dfdate_temp.dta")
a<-read.dta("./Data-Use/dfdate_temp.dta")
a

The foreign library can handle older SAS files but not the current version. The current version of SAS file requires sas7bdat. The clue is the file ends in .sas7bdat.

2.2.5 json format

Json is short for JavaScript object Notification. These files have a hierarchical structured format. The json file is in text format amd can also be examined using Notepad. These files can be read using the RJSONIO or rjson libraries in R. Geojson is a json format with geographical data.

library(RJSONIO)
j<-fromJSON("./Data-Use/0411.geojson") #Christionso Island
j<-lapply(j, function(x) {
  x[sapply(x,is.null)]<-NA
  unlist(x)
})
k<-as.data.frame(do.call("cbind",j)) #list to data frame
head(k)
                                   type       crs
type                  FeatureCollection      name
geometry.type         FeatureCollection EPSG:4326
geometry.coordinates1 FeatureCollection      name
geometry.coordinates2 FeatureCollection EPSG:4326
properties.id         FeatureCollection      name
properties.status     FeatureCollection EPSG:4326
                                                  features
type                                               Feature
geometry.type                                        Point
geometry.coordinates1                          15.18657358
geometry.coordinates2                           55.3200168
properties.id         651a5746-735a-4312-b236-3a008e173de9
properties.status                                        1

The geojson file can be converted to sf using geojson_sf function from geojsonsf library.

library(geojsonsf)

#Christianso Island
geojson_sf("./Data-Use/0411.geojson")  %>% mapview::mapview()
The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)

2.3 Tidy data

Attention to collection of data is important as it shows the way for performing analysis. In general each row represents on variable and each column represents an attribute of that variables. Sometimes there is a temptation to embed 2 types of attributes into a single column. However, when a column has more than 2 attributes it becomes harder to analyse.

This is an example of non-tidy data.

df2<-data.frame(Sex=c("Male","Female"), 
                Test=c("positive 5 negative 5",
                        " negative 0 negative 10"))
df2
     Sex                    Test
1   Male   positive 5 negative 5
2 Female  negative 0 negative 10

The above example should be entered this way. This change allows one to group variables by Test status: ‘positive’ or ‘negative’. One can easily perform a t-test here (not recommend in this case as the data contains only 2 rows).

df2<-data.frame(Sex=c("Male","Female"), 
                `Test Positive` =c(5,0), 
                `Test Negative`=c(5, 10))
df2
     Sex Test.Positive Test.Negative
1   Male             5             5
2 Female             0            10

The below example is illustrate how to collapse columns when using base R.

dfa<-data.frame(City=c("Melbourne","Sydney","Adelaide"),
                State=c("Victoria","NSW","South Australia"))
dfa
       City           State
1 Melbourne        Victoria
2    Sydney             NSW
3  Adelaide South Australia

The data can be collapse using _paste0_ function.

#collapsing City and State columns and generate new column address
dfa$addresses<-paste0(dfa$City,",", dfa$State) #separate by comma

dfa$addresses2<-paste0(dfa$City,",", dfa$State,", Australia")

dfa
       City           State                addresses
1 Melbourne        Victoria       Melbourne,Victoria
2    Sydney             NSW               Sydney,NSW
3  Adelaide South Australia Adelaide,South Australia
                           addresses2
1       Melbourne,Victoria, Australia
2               Sydney,NSW, Australia
3 Adelaide,South Australia, Australia

This example is same as above but uses verbs from tidyr. This is useful for collapsing addresses for geocoding.

library(tidyr)
dfa1<-dfa %>% unite ("new_address",City:State,sep = ",")
dfa1
               new_address                addresses
1       Melbourne,Victoria       Melbourne,Victoria
2               Sydney,NSW               Sydney,NSW
3 Adelaide,South Australia Adelaide,South Australia
                           addresses2
1       Melbourne,Victoria, Australia
2               Sydney,NSW, Australia
3 Adelaide,South Australia, Australia

Using the data above, let’s split the column address

library(tidyr)
dfa2<-dfa1 %>% separate(addresses, c("City2", "State2"))
Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
dfa2
               new_address     City2   State2
1       Melbourne,Victoria Melbourne Victoria
2               Sydney,NSW    Sydney      NSW
3 Adelaide,South Australia  Adelaide    South
                           addresses2
1       Melbourne,Victoria, Australia
2               Sydney,NSW, Australia
3 Adelaide,South Australia, Australia

2.3.1 Factors

There are several types of factors in R: ordered and not ordered. It is important to pay attention to how factors are coded. Sometimes, male is represented as 1 and female as 0. Sometimes, female is represented as 2 as integer encoding. Integer encoding assumes a rank ordering. This discussion may seems trivial but several papers have been retracted in high impact factor journal Jama because of miscoding of the trial assignment 1 and 2 rather than the assignment of 0 and 1. This error led to reversing the results with logistic regression when 2 is exchanged for 0 (Aboumatar and Wise 2019). This error led to report that an outpatient management program for chronic obstructive pulmonary disease resulted in fewer admissions. Below is an example which can occur when data is transformed into factor and back to number. Note that the coding goes from 0 and 1 to 2 and 1. It has been suggested that the move away from coding data as 1 and 0 was historical and due to the fear that coding with punch card would treat 0 as a missing number and hence the move to coding binary variable as 1 and 2.

In certain analyses, the libraries prefer to use the dependent or outcome variable as binary coding in numeric format such as 1 and 0 eg logistic regression and random forest. The library e1071 for performing support vector machine prefers the outcome variable as factor.

library(Stat2Data)
data("Leukemia") #treatment of leukemia

Leukemia %>% dplyr::glimpse()
Rows: 51
Columns: 9
$ Age    <int> 20, 25, 26, 26, 27, 27, 28, 28, 31, 33, 33, 33, 34, 36, 37, 40,…
$ Smear  <int> 78, 64, 61, 64, 95, 80, 88, 70, 72, 58, 92, 42, 26, 55, 71, 91,…
$ Infil  <int> 39, 61, 55, 64, 95, 64, 88, 70, 72, 58, 92, 38, 26, 55, 71, 91,…
$ Index  <int> 7, 16, 12, 16, 6, 8, 20, 14, 5, 7, 5, 12, 7, 14, 15, 9, 12, 4, …
$ Blasts <dbl> 0.6, 35.0, 7.5, 21.0, 7.5, 0.6, 4.8, 10.0, 2.3, 5.7, 2.6, 2.5, …
$ Temp   <int> 990, 1030, 982, 1000, 980, 1010, 986, 1010, 988, 986, 980, 984,…
$ Resp   <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, …
$ Time   <int> 18, 31, 31, 31, 36, 1, 9, 39, 20, 4, 45, 36, 12, 8, 1, 15, 24, …
$ Status <int> 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …

The variable Resp is now a factor with levels 0 and 1

Leukemia$Resp
 [1] 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0
[39] 0 1 0 0 0 1 0 0 1 0 1 1 0
Leukemia$Response.factor<-as.factor(Leukemia$Resp)
head(Leukemia$Response.factor)
[1] 1 1 1 1 1 0
Levels: 0 1

Note in the conversion back to numeric ‘dummy’ values, the data takes the form 1 and 2. This has changed the dummy values of 0 and 1. It is important to examine the data before running analysis.

Leukemia$Response.numeric<-as.numeric(Leukemia$Response.factor)
Leukemia$Response.numeric
 [1] 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 1 1
[39] 1 2 1 1 1 2 1 1 2 1 2 2 1

For variables which are characters but considered as factors, it is necessary to convert to class character before converting to dummy values.

data("BreastCancer", package="mlbench")
BreastCancer %>% glimpse()
Rows: 699
Columns: 11
$ Id              <chr> "1000025", "1002945", "1015425", "1016277", "1017023",…
$ Cl.thickness    <ord> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, 4, …
$ Cell.size       <ord> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1, 1,…
$ Cell.shape      <ord> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1, 1,…
$ Marg.adhesion   <ord> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1, 1,…
$ Epith.c.size    <ord> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, 2, …
$ Bare.nuclei     <fct> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1, 1, …
$ Bl.cromatin     <fct> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2, 3, …
$ Normal.nucleoli <fct> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1, 1, …
$ Mitoses         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1, 1, …
$ Class           <fct> benign, benign, benign, benign, benign, malignant, ben…

The steps for conversion are illustrated below. Conversion of multiple columns of factors and ordered factors can be done in one step using lapply function. This will be described much further below.

BreastCancer$Class<-as.character(BreastCancer$Class)
BreastCancer$Class[BreastCancer$Class=="benign"]<-0
BreastCancer$Class[BreastCancer$Class=="malignant"]<-1
BreastCancer$Class<-as.numeric(BreastCancer$Class)
head(BreastCancer$Class)
[1] 0 0 0 0 0 1

This illustration describes conversion of a continuous variable into orderly factors.

library(Stat2Data)
data("Leukemia") #treatment of leukemia
#partition Age into 8 ordered factors
Leukemia$AgeCat<-ggplot2::cut_interval(Leukemia$Age, n=8, ordered_result=TRUE)
class(Leukemia$AgeCat)
[1] "ordered" "factor" 

2.3.2 Multiple files

Merging of files can be done using dplyr to perform inner_join, outer_join, left_join and right_join. Note that this can also be done in base R or using syntax of data.table. These files can be joined using %>% operator.

DF1<-data.frame(ID=c(1,2,3),Age=c(20,30,40))
DF2<-data.frame(ID=c(1,2,3),Sex=c(1,0,0))
DF3<-data.frame(ID=c(1,2,3), Diabetes=c(1,1,0))

DF <-DF1 %>% left_join(DF2, by="ID") %>% left_join(DF3, by="ID")

2.3.3 Tidy evaluation

Data in a dataframe can be summarised with the help of group_by function in dplyr. The summarise function returns 1 row of data per group. The number of observations in each group can be counted using n() function.

# spot sign dataset
ss<-read.csv("./Data-Use/ss150718.csv")

ss %>% group_by(Country) %>% summarise (Number=n())
# A tibble: 11 × 2
   Country  Number
   <chr>     <int>
 1 Brazil        1
 2 Canada        3
 3 China         6
 4 China/US      1
 5 Denmark       1
 6 Germany       1
 7 Japan         5
 8 Korea         4
 9 Multiple      5
10 Spain         1
11 US           11

We can add more arguments to the group_by function

ss %>% group_by(Country, Study.type) %>% summarise (Number=n())
`summarise()` has grouped output by 'Country'. You can override using the
`.groups` argument.
# A tibble: 16 × 3
# Groups:   Country [11]
   Country  Study.type  Number
   <chr>    <chr>        <int>
 1 Brazil   Prospective      1
 2 Canada   Prospective      1
 3 Canada   Retro            2
 4 China    Prospective      3
 5 China    Retro            3
 6 China/US Retro            1
 7 Denmark  Prospective      1
 8 Germany  Prospective      1
 9 Japan    Prospective      1
10 Japan    Retro            4
11 Korea    Prospective      1
12 Korea    Retro            3
13 Multiple Prospective      5
14 Spain    Prospective      1
15 US       Prospective      4
16 US       Retro            7

We can also apply a function across multiple columns with across. Here we use reframe argument to unlock restriction impose by group_by. Here we use tibble function to return several output columns.

DR<- function (w,x,y,z){
  tibble(
   Sensitivity=round(w/(w+y),2),
   Specificity=round(z/(z+x),2))
}

ss %>% group_by(Country, Study.type) %>% 
  #the output is limted to 6 rows by head argument
  reframe (across(TP:TN), DR(TP,FP,FN,TN)) %>% head() 
# A tibble: 6 × 8
  Country Study.type     TP    FP    FN    TN Sensitivity Specificity
  <chr>   <chr>       <int> <int> <int> <int>       <dbl>       <dbl>
1 Brazil  Prospective     5     6     3    57        0.62        0.9 
2 Canada  Prospective    10     3     1    25        0.91        0.89
3 Canada  Retro          49    22     6   290        0.89        0.93
4 Canada  Retro          11     0     4    12        0.73        1   
5 China   Prospective    60    16    17   219        0.78        0.93
6 China   Prospective    17     7    11    77        0.61        0.92

2.3.4 Pivot wide and long

A variety of different expressions are used to describe data format such as wide and long formats. In some case the distinction between such formats is not clear. The verbs for performing these operations are pivot_wide, pivot_long. Again data.table uses different verbs such as cast and melt. In general, most regression analyses are performed with data in wide format. In this case each row represents a unique ID. Longitudinal analyses are performed with data in long format. In this format, there are several rows with the same ID. In the next Chapter on Statistics, an example of data generated in wide format and coverted to long format using plyr. Here we will demonstrate the use of tidyr to pivot loner or wider.

The best way to think about how data should be presented is that data is analyzed according to columns not rows. The data below is extracted from CDC COVID website. Details are given below under Web scraping on how this task was performed.

library(dplyr)
library(tidyr)
library(stringr)
usa<-read.csv("./Data-Use/Covid_bystate_Table130420.csv")
# for demonstration we will select 3 columns of interest
usa_long <-usa %>% 
  select(Jurisdiction,NumberCases31.03.20,NumberCases07.04.20) %>% pivot_longer(-Jurisdiction,names_to = "Date",values_to = "Number.Cases")  
usa_long$Date <- str_replace(usa_long$Date,"NumberCases","")

#data in wide format
head(usa %>%select(Jurisdiction,NumberCases31.03.20,NumberCases07.04.20),6) 
  Jurisdiction NumberCases31.03.20 NumberCases07.04.20
1      Alabama                 999                2197
2       Alaska                 133                 213
3      Arizona                1289                2575
4     Arkansas                 560                 993
5   California                8131               15865
6     Colorado                2966                5429
#data in long format
head(usa_long,6) 
# A tibble: 6 × 3
  Jurisdiction Date     Number.Cases
  <chr>        <chr>           <int>
1 Alabama      31.03.20          999
2 Alabama      07.04.20         2197
3 Alaska       31.03.20          133
4 Alaska       07.04.20          213
5 Arizona      31.03.20         1289
6 Arizona      07.04.20         2575

2.4 Regular Expressions in Strings

Here is a short tutorial on handling of regular expressions found in strings. We will begin using base R. This section is based on experience trying to clean a data frame containing many words used to describe one disease or one drug.

2.4.1 Base R

#create example dataframe
df<-data.frame(
  drug=c("valium 1mg","verapamil sr","betaloc zoc","tramadol","valium (diazepam)"),
  infection=c("pneumonia","aspiration pneumonia","tracheobronchitis","respiratory tract infection","respiratory.tract.infection"))
df
               drug                   infection
1        valium 1mg                   pneumonia
2      verapamil sr        aspiration pneumonia
3       betaloc zoc           tracheobronchitis
4          tramadol respiratory tract infection
5 valium (diazepam) respiratory.tract.infection

Now that we have a data frame, we can use pattern matching to replace part of phrase. This step can be done simply using gsub command. First create a list so that the computer searches the phrases in the list.

#create list to remove phrase
redun=c("1mg", "zoc", "sr")
pat=paste0("\\b(",paste0(redun,collapse = "|"),")\\b")
df$drug1<-gsub(pat,"",df$drug)
df$drug1
[1] "valium "           "verapamil "        "betaloc "         
[4] "tramadol"          "valium (diazepam)"
#create list to remove phrase
redunc1=c("respiratory tract infection", "tracheobronchitis", "aspiration")
pat=paste0("\\b(",paste0(redunc1,collapse = "|"),")\\b")
df$infection1<-gsub(pat,"",df$infection)
df$infection1
[1] "pneumonia"                   " pneumonia"                 
[3] ""                            ""                           
[5] "respiratory.tract.infection"

2.4.2 Metacharacters

This section deals with meta-characterers. Examples of meta-characters include $ . + * ? ^ () {} []. These meta-characters requires the double back slashes \.

#create list to remove phrase
redun=c("1mg", "zoc", "sr")
pat=paste0("\\b(",paste0(redun, collapse = "|"),")\\b")   
df$drug2<-gsub(pat,"",df$drug)

#[a-z] indicates any letter
#[a-z]+ indicates any letter and those that follow the intial letter
df$drug2<-gsub("\\(|[a-z]+\\)","",df$drug2)
df$drug2
[1] "valium "    "verapamil " "betaloc "   "tramadol"   "valium "   

Back to our data frame df, we want to remove or change the different words accounting for pneumonia.

redunc=c("\\.")
redunc1=c("respiratory tract infection", "tracheobronchitis", "aspiration")
pat=paste0("\\b(",paste0(redunc,collapse = "|"),")\\b")
df$infection2<-gsub(pat," ",df$infection)
pat=paste0("\\b(",paste0(redunc1,collapse = "|"),")\\b")
df$infection2<-gsub(pat," ",df$infection2)
df$infection2
[1] "pneumonia"   "  pneumonia" " "           " "           " "          

2.4.3 stringr

The following examples are taken from excel after conversion from pdf. In the process of conversion errors were introduced in the conversion from pdf to excel. A full list of the options available can be found at https://stringr.tidyverse.org/articles/regular-expressions.html

library(stringr)
#error introduced by double space 
a<-c("8396 (7890 to 8920)","6 301 113(6 085 757 to 6 517 308)",
     "4 841 208 (4 533 619 to 5 141 654)",
     "1 407 701 (127 445 922 to 138 273 863)",
     "4 841 208\n(4 533 619 to\n5 141 654)")

b<-str_replace (a, "\\(c.*\\)","")

#this is a complex example to clean and requires several steps. Note that the original data in the list a is now assigned to b. 
b<-str_replace(a,"\n","") %>% 
  #remove (
  str_replace("\\(.*","") %>%
  str_replace("\n.*","") %>%
  #remove )
  str_replace("\\)","") %>%
  #remove empty space
  str_replace("\\s","") %>%
  str_replace("\\s","")%>% as.numeric()
b
[1]    8396 6301113 4841208 1407701 4841208

Another example. This time the 2 numbers in the column are separated by a slash sign. Supposed you want to keep the denominator. The first remove the number before the slash sign. The _*_ metacharacter denotes the action occurs at the end.

df.d<-data.frame(seizure.rate=c("59/90", "90/100", "3/23"))
df.d$seizure.number<-str_replace(df.d$seizure.rate,"[0-9]*","") 
df.d$seizure.number
[1] "/90"  "/100" "/23" 

Now combine with the next step to remove the slash sign.

#We used [0-9] to denote any number from 0 to 9. For text, one can use [A-Z].
df.d$seizure.number<-str_replace(df.d$seizure.rate,"^[0-9]*","")%>%
  str_replace("/","\\")
df.d$seizure.number
[1] "90"  "100" "23" 

Removing the denominator requires a different approach. First remove the last number then the slash sign.

df.d$case<-str_replace(df.d$seizure.rate,"/[0-9]*"," ")
df.d$case
[1] "59 " "90 " "3 " 

The example below has several words mixed in numeric vector columns. The words are a mixture of upper and lower cases. Note that “NA” is treated as a word character while NA is treated as Not Available by R. This recognition is important as removing them requires different actions. Character “NA” can be removed by str_replace while NA requires is.na operator.

A<-c(1,2,3,"NA",4,"no COW now")
B<-c(1,2,NA,4,"NA","check")
C<-c(1,"not now",2,3, NA ,5)

D<-data.frame(A,B,C) 

#str_replace on one column
D$A1<-str_replace(D$A,"[A-Z]+","") %>% str_replace("[a-z]+","")

#change to lower case
D$A2<-str_to_lower(D$A) %>% str_replace("[a-z]+","")

#remove space before replacement
D$A3<-str_to_lower(D$A) %>% str_replace("\\s+","") %>% str_replace("[a-z]+","")

#note that this action does not remove the third word
D$A4<-str_to_lower(D$A) %>% str_replace("\\s","") %>% str_replace("[a-z]+","")

#repeat removal of empty space
D$A5<-str_to_lower(D$A) %>% str_replace("\\s","") %>% 
  str_replace("\\s","") %>%  str_replace("[a-z]+","")

#apply str_replace_all rather than repeat
D$A6<-str_to_lower(D$A) %>% str_replace_all("\\s","") %>% 
    str_replace("[a-z]+","")

#now combine into vector. Note the use of c to combine the vector and replace 
#the comma with equal sign
D$A7<-str_to_lower(D$A) %>%
 str_replace_all(c("\\s"="","[a-z]+"=""))

D
           A     B       C    A1       A2   A3   A4 A5 A6 A7
1          1     1       1     1        1    1    1  1  1  1
2          2     2 not now     2        2    2    2  2  2  2
3          3  <NA>       2     3        3    3    3  3  3  3
4         NA     4       3                                  
5          4    NA    <NA>     4        4    4    4  4  4  4
6 no COW now check       5   now  cow now  now  now         

The lessons from above can be combine in when creating data frame. The mutate_if function enable multiple columns to be changed. One problem to handle adding multiple columns which contain NA is the use of rowSums and dplyr::select. These examples are illustrated below.

#use the mutate function 

E<-data.frame(A,B,C) %>%
  mutate (A=str_to_lower(A) %>% str_replace_all(c("\\s"="","[a-z]+"="")),
          B=str_to_lower(B) %>%str_replace_all(c("\\s"="","[a-z]+"="")),
          C=str_to_lower(C) %>%str_replace_all(c("\\s"="","[a-z]+"="")))%>%
  
  #change character columns to numeric
  mutate_if(is.character, as.numeric)%>%
  
  #add across columns and avoid NA
  mutate(ABC=rowSums(dplyr::select(.,A:C),na.rm = T))

Another example of NA creating issues with sum in mutate is provided below. Here, the function rowwise from dplyr can be used to emphasise the operation across rows.

df<-data.frame(Mon_SBP=c(160,NA,180),Tues_SBP=c(0,0,150),
               Wed_SBP=c(NA,130,125)) %>%
  #error from NA
  rowwise %>%
  mutate(SBP=ifelse(sum(Mon_SBP>140 & Tues_SBP>120, 
                        Tues_SBP>100 & Wed_SBP>120, 
                        Mon_SBP>100 & Wed_SBP>115, na.rm=T)>1.5,1,0))


df
# A tibble: 3 × 4
# Rowwise: 
  Mon_SBP Tues_SBP Wed_SBP   SBP
    <dbl>    <dbl>   <dbl> <dbl>
1     160        0      NA     0
2      NA        0     130     0
3     180      150     125     1

Sometimes, you may only want to keep the number and ignore the words in the column. This can be done using the str_extract function.

df.e<-data.frame(disability=c("1 - No significant disability despite symptoms; able to carry out all usual duties and activities","5 - Severe disability, bedridden, incontinent and requiring constant nursing care and attention","
1 - No significant disability despite symptoms; able to carry out all usual 
duties and activities"))
df.e$disability2<-str_extract(df.e$disability,"\\w") #extract number

2.5 PDF to xcel

Sometimes data from public government sites come in PDF form instead of excel. Conversion from pdf to excel or text can be difficult especially with special character eg Danish. There are several libraries for doing this: pdftables (require API key) and pdftools. The example below uses pdftools. available at https://docs.ropensci.org/pdftools/. The document is the 2018 Danish Stroke Registry report. The tabulizer package is excellent for converting table data. However, tabulizer package depends on rJava and requires deft handling. The PDE libray has user interface for performing data extraction from pdf.

library(pdftools)
Using poppler version 22.04.0
#Danish stroke registry
txt<-pdf_text("./Data-Use/4669_dap_aarsrapport-2018_24062019final.pdf")

#browse data page 13+4 filler pages
cat(txt[17]) 
3. Indikatorresultater på lands-, regions- og afdelingsniveau
Indikator 1a: Andel af patienter med akut apopleksi som indlægges inden for 3 timer
efter symptomdebut. Standard: ≥ 30%

Indikator 1b: Andel af patienter med akut apopleksi som indlægges inden for 4,5
timer efter symptomdebut. Standard: ≥ 40%

                                        Inden for 3 timer

                                                      Uoplyst    Aktuelle år            Tidligere år
                            Standard     Tæller/       antal         2018          2017            2016
                             opfyldt     nævner        (%)      %     95% CI     % (95% CI)     % (95% CI)

 Danmark                       ja      4730 / 11794    49 (0)   40   (39 - 41)   39 (38-40)     37 (36-38)
 Hovedstaden                   ja       1502 / 3439    49 (1)   44   (42 - 45)   40 (38-42)     40 (39-42)
 Sjælland                      ja        760 / 1917     0 (0)   40   (37 - 42)   39 (36-41)     40 (38-43)
 Syddanmark                    ja        942 / 2433     0 (0)   39   (37 - 41)   39 (37-41)     35 (33-37)
 Midtjylland                   ja        918 / 2590     0 (0)   35   (34 - 37)   36 (34-38)     35 (33-37)
 Nordjylland                   ja        577 / 1341     0 (0)   43   (40 - 46)   41 (39-44)     35 (32-37)
 Bopæl uden for Danmark        ja           31 / 74     0 (0)   42   (31 - 54)   51 (36-66)     39 (26-53)

 Hovedstaden                   ja       1502 / 3439    49 (1)   44   (42 - 45)   40 (38-42)     40 (39-42)
 Albertslund                   ja           17 / 52     0 (0)   33   (20 - 47)   30 (18-44)     43 (27-59)
 Allerød                       ja           22 / 53     0 (0)   42   (28 - 56)   32 (20-46)     35 (22-50)
 Ballerup                      ja          43 / 106     0 (0)   41   (31 - 51)   48 (38-58)     40 (31-51)
 Bornholms Regionskommune      ja           38 / 98     0 (0)   39   (29 - 49)   28 (19-38)     32 (22-43)
 Brøndby                       ja          45 / 113     0 (0)   40   (31 - 49)   31 (22-42)     34 (23-47)
 Dragør                        ja           18 / 43     0 (0)   42   (27 - 58)   47 (30-65)     33 (17-54)
 Egedal                        ja           45 / 95     0 (0)   47   (37 - 58)   40 (30-52)     48 (37-59)
 Fredensborg                   ja           36 / 99     0 (0)   36   (27 - 47)   37 (27-47)     43 (33-53)
 Frederiksberg                 ja          74 / 141    13 (8)   52   (44 - 61)   42 (35-50)     52 (44-61)
 Frederikssund                 ja          53 / 141     0 (0)   38   (30 - 46)   39 (31-48)     42 (33-51)
 Furesø                        ja          55 / 110     0 (0)   50   (40 - 60)   56 (44-67)     50 (38-62)
 Gentofte                      ja          65 / 123     1 (1)   53   (44 - 62)   46 (37-56)     38 (30-47)
 Gladsaxe                      ja          68 / 129     0 (0)   53   (44 - 62)   48 (38-57)     36 (27-44)
 Glostrup                      ja           38 / 72     0 (0)   53   (41 - 65)   40 (27-54)     48 (33-63)
 Gribskov                      ja          46 / 129     0 (0)   36   (27 - 45)   35 (26-44)     36 (28-46)
 Halsnæs                       ja           45 / 92     0 (0)   49   (38 - 60)   34 (25-45)     34 (25-44)
 Helsingør                     ja          50 / 129     0 (0)   39   (30 - 48)   32 (24-40)     38 (30-45)
 Herlev                        ja           25 / 50     0 (0)   50   (36 - 64)   52 (38-65)     33 (22-46)
 Hillerød                      ja          47 / 121     0 (0)   39   (30 - 48)   39 (30-49)     40 (30-50)
 Hvidovre                      ja          57 / 135     0 (0)   42   (34 - 51)   38 (29-48)     47 (37-57)

                                                                                                             13
screenshot13<-
  pdf_render_page("./Data-Use/4669_dap_aarsrapport-2018_24062019final.pdf", 
                  page =17)

png::writePNG(screenshot13, "./Data-Use/Danish-Stroke-page13.png")

knitr::include_graphics("./Data-Use/Danish-Stroke-page13.png")

2.5.1 Scanned text or picture

Importing data from scanned text will require use of Optical Character Recognition (OCR). The tesseract library provides an R interface for OCR. In the example below, a picture is taken from same CDC website containing mortality data (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/04102020/ nchs-data.html). The screenshot of this website was then cleaned in paint. The data is available in the Data-Use folder.

library(tesseract)
eng <- tesseract("eng") #english
text <- tesseract::ocr("./Data-Use/Covid_PNG100420.png", engine = eng)
cat(text)
NCHS Mortality Surveillance Data
Data as of April 9, 2020
For the Week Ending April 4, 2020 (Week 14)

COVID-19 Deaths Pneumonia Deaths* Influenza Deaths
Year Week TotalDeaths Number %ofTotal Number %ofTotal Number %of Total
2019 40 52,452 0 0 2,703 5.15 16 0.03
2019 4l 52,860 0 0 2,770 5.24 16 0.03
2019 42 54,129 0 0 2,977 5.50 18 0.03
2019 43 53,914 0 0 2,985 5.54 30 0.06
2019 44 53,980 0 0 2,908 5.39 31 0.06
2019 4S 55,468 0 0 3,063 5.52 31 0.06
2019 46 55,684 0 0 3,096 5.56 39 0.07
2019 47 55,986 0 0 2,993 5.35 50 0.09
2019 48 55,238 0 0 2,976 5.38 65 0.12
2019 49 56,990 0 0 3,305 5.80 99 0.17
2019 50 57,276 0 0 3,448 6.02 111 0.19
2019 51 56,999 0 0 3,345 5.87 125 0.22
2019 52 57,956 0 0 3,478 5.99 198 0.34
2020 4 58,961 0 0 3,998 6.77 416 0.71
2020 2 58,962 0 0 3,995 6.76 450 0.76
2020 3 57,371 0 0 3,903 6.78 44) 0.77
2020 4 56,666 0 0 3,742 6.56 468 0.83
2020 5 56,381 0 0 3,617 6.42 452 0.80
2020 6 56,713 0 0 3,599 6.35 482 0.85
2020 7 55,237 0 0 3,577 6.48 487 0.88

2.6 Web scraping

The readers may ask why web scraping for healthcare. A pertinent example related to COVID-19 data is provided below. The library rvest is helpful at scraping data from an internet page. The rvest library assumes that web contents have xml document-tree representation. The different options available for web scraping with rvest are available at the website https://rvest.tidyverse.org/reference/. The user can use CSS selectors to scrape content. The library Rselenium is also useful for web scraping. For dynamic web page, the library CasperJS library does a better job especially if the data contain embedded java script.

The library cdccovidview provides access to the CDC website on COVID-19. In the example below, we will try to this manually. Data from CDC website on COVID-19 is downloaded, cleaned and saved in csv format. It is important to pay attention to the data. The first row contains header and is removed. There are several columns with commas. These commas can be removed using the exercises above. Further the data is updated on weekly basis. As such the data needs to be converted into a date time format using lubridate.

library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
library(tidyverse)
#assign handle to web page accessed 12/4/20
#cdc<-read_html("https://www.cdc.gov/coronavirus/2019-ncov/covid-data/
#               covidview/04102020/nchs-data.html")
# scrape all div tags
#html_tag <- cdc %>% html_nodes("div")
# scrape header h1 tags
#html_list<-html_tag %>% html_nodes("h1") %>% html_text()
#there is only one table on this web page
#Table1<- cdc %>% html_node("table") %>% html_table(fill = TRUE)
#Table1 has a header row
#Table1<-Table1[-1,]
#The data in the Total Deaths column has a comma 
#Table1$Total.Deaths<-as.numeric(gsub(",","",Table1$`Total Deaths`))
#now combine the year and week column to Date
#Table1$Date<-lubridate::parse_date_time(paste(Table1$Year, Table1$Week, 'Mon', sep="/"),'Y/W/a')
#there are still commas remaining in some columns. This is a useful exercise for the reader. A solution is provided in the next example.  
#write.csv(Table1,file="./Data-Use/Covid_Table100420.csv")

The next example is from the CDC COVID-19 website. It poses a different challenges as there are several columns with the same names. In this case we will rename the column by index. There are several columns containing commas. Rather than removing column by column we will write a function with lapply to do it over the table. the apply function returns a matrix whereas lapply returns a dataframe. There is one column containing percentage enclosed in a bracket. This can be removed using the example above on metacharacter ie using doule back slash in front of bracket and again at close of bracket.

library(rvest)
library(tidyverse)
cdc<-
read_html("https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e4.htm?s_cid=mm6915e4_w")

# scrape all div tags
html_tag <- cdc %>% html_nodes("div")
# scrape header h1 tags
html_list<-html_tag %>% html_nodes("h1") %>% html_text()
#there is only one table on this web page
Table2<- cdc %>% html_node("table") %>% html_table(fill = TRUE)
#first row is header
names(Table2) <- as.matrix(Table2[1, ])
Table2<-Table2[-c(1:2,55),]#rows 1 and 2 are redundant
#rename the columns by index 
names(Table2)[2] <-"NumberCases31.03.20"
names(Table2)[3]<-"CumulativeIncidence31.03.20"
names(Table2)[4]<-"NumberCases07.04.20"
names(Table2)[5]<-"NumberDeath07.04.20"
names(Table2)[6]<-"CumulativeIncidence07.04.20"
#rather than removing column by column we will write a function with lapply to remove commas over the table. the apply function returns a matrix whereas lapply returns a dataframe.
Table2<-as.data.frame(lapply(Table2, function(y) gsub(",", "", y))) 
Table2<-as.data.frame(lapply(Table2, function(x)
  gsub("\\(|[0-9]+\\)","",x)))
#write.csv(Table2,file="./Data-Use/Covid_bystate_Table130420.csv")

3 Data simulation

Data simulation is an important aspects of data science. The example below is taken from our experience trying to simulate data from recent clot retrieval trials in stroke (Berkhemer et al. 2015) (Campbell et al. 2015). Simulation is performed using simstudy library.

library (simstudy) 
library(tidyverse)  

#T is Trial 
def <- defData(varname = "T", dist = "binary", formula = 0.5)  

#early neurological improvement (ENI) .37 in TPA and .8 in ECR #baseline NIHSS 13 in TPA and 17 in ECR  

def <- defData(def, varname = "ENI", dist = "normal", formula = .8-.53*T, variance = .1)  

#baseline NIHSS 13 in TPA and 17 in ECR 

def <- defData(def, varname = "Y1", dist = "normal", formula=13, variance = 1)  

def <- defData(def, varname = "Y2", dist = "normal", formula = "Y1*ENI - 5 * T >5",variance = 1)  

def <- defData(def, varname = "Y3", dist = "normal", formula = "Y2- 4*T>2",variance = 1)  

def <- defData(def, varname = "Y4", dist = "normal", formula = "Y3- 2*T>0", variance = 1)  

#male 
def <- defData(def,varname = "Male", dist = "binary", formula = 0.49*T)  #diabetes .23 in TPA and .06 in ECR 

def <- defData(def,varname = "Diabetes", dist = "binary", formula = .23-.17*T)  

#HT .66 TPA vs .6 ECR 
def <- defData(def,varname = "HT", dist = "binary", formula = .66-.06*T)  

#generate data frame 
dtTrial <- genData(500, def)  
#define parameter for mRS   #Add conditional column with field name "mRS"    

dtTime <- addPeriods(dtTrial, nPeriods = 4, idvars = "id",              timevars = c("Y1", "Y2", "Y3","Y4"), timevarName = "Y") 

dtTime  #check  that 2 groups are similar at start but not at finish 
       id period T        ENI Male Diabetes HT          Y timeID
   1:   1      0 0 -0.1019761    1        0  1 12.8545850      1
   2:   1      1 0 -0.1019761    1        0  1 -1.9119528      2
   3:   1      2 0 -0.1019761    1        0  1  0.2061292      3
   4:   1      3 0 -0.1019761    1        0  1  1.7379323      4
   5:   2      0 1  0.8182255    0        0  1 11.6832425      5
  ---                                                           
1996: 499      3 0 -0.3552098    1        0  1  1.5248974   1996
1997: 500      0 0  0.7536842    0        0  1 12.5616525   1997
1998: 500      1 0  0.7536842    0        0  1  1.0239354   1998
1999: 500      2 0  0.7536842    0        0  1  0.0359603   1999
2000: 500      3 0  0.7536842    0        0  1  1.4739410   2000
t.test(Y1~T,data=dtTrial) 

    Welch Two Sample t-test

data:  Y1 by T
t = -0.60193, df = 497.83, p-value = 0.5475
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.2202427  0.1169406
sample estimates:
mean in group 0 mean in group 1 
       12.98403        13.03568 
t.test(Y4~T,data=dtTrial) 

    Welch Two Sample t-test

data:  Y4 by T
t = 6.3746, df = 493.96, p-value = 4.214e-10
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.4156189 0.7859735
sample estimates:
mean in group 0 mean in group 1 
      0.6282369       0.0274407 
t.test(Male~T,data=dtTrial)  

    Welch Two Sample t-test

data:  Male by T
t = -0.44869, df = 497.98, p-value = 0.6539
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.10809561  0.06790297
sample estimates:
mean in group 0 mean in group 1 
      0.5019920       0.5220884 
#putting the 4 time periods together - long format 

dtTime <- addPeriods(dtTrial, nPeriods = 3, idvars = "id", timevars = c("Y1", "Y2", "Y3","Y4"), timevarName = "Y")  

#summarise data using group_by 

dtTime2<-dtTime %>%   group_by(period, T) %>%   
  summarise(meanY=mean(Y),             
            sdY=sd(Y),             
            upperY=meanY+sdY,             
            lowerY=meanY-sdY)    
`summarise()` has grouped output by 'period'. You can override using the
`.groups` argument.
#write.csv(dtTime, file="./Data-Use/dtTime_simulated.csv") 
#write.csv(dtTrial, file="./Data-Use/dtTrial_simulated.csv")}