

Thanh G Phan


December 8, 2024


This book describes the use of R in Healthcare. It is aimed at beginners in the R language. Along the way, I have added some Python code, as some members of the Lab use Python and the RStudio IDE allows easy switching between R and Python. The majority of the examples are taken from work in Neurology. Where possible, data from other areas such as heart disease, cancer and vaccines are used. The ideas and principles can be applied to other aspects of Healthcare.

The book is written in the hope that clinicians and junior doctors will take a new approach towards understanding medicine and data. One of the key aspects of the journey to becoming a doctor is learning about medical diagnosis. This is a complex process and should appropriately take into account the history and corroborating history, examination and investigations (Centor, Geha, and Manesh 2019). However, early on in medical school, students are taught lists of associations and the frequency of signs and symptoms related to a disease, rather than the recognition of patterns. Medical students later become junior doctors and pride themselves on their ability to generate these lists. An analogy to these lists is performing univariable regression on each sign or symptom to explore its occurrence in the condition compared with an alternative diagnosis. However, the findings from univariable regression convey nothing about the relationships among the variables. Similarly, the teaching that vasculitis is a cause of stroke has led many junior doctors to search for rare associations such as vasculitis among young patients with stroke. In an audit over 10 years at Monash Health, there was only one case of stroke as the presenting feature of vasculitis (Kempster, McLean, and Phan 2016). The tempo of a case, that is, the order in which its features appear, is not well taught because the focus has been on generating a list. Another example is the search for temporal arteritis among elderly patients presenting with stroke. This search is performed without realising that few patients with arteritis have stroke and few patients with stroke have arteritis. Often patients develop temporal arteritis first, with its associated symptoms including constitutional symptoms and myalgias. Stroke may present later in the course of arteritis, sometimes even after treatment for arteritis.
By contrast, a senior clinician would discuss why certain diagnoses and not others were considered in the differential diagnosis, given the occurrence of selected symptoms and signs. This issue comes to the fore when evaluating features of seizure and pseudoseizure. Signs such as eye closure, lip smacking and asynchronous limb movement are treated in publications as isolated signs. However, experienced neurologists do not interpret signs in isolation but as an ensemble of signs, or a pattern. In statistical analysis, patterns can be generated using a variety of multivariate methods (Hastie, Tibshirani, and Friedman 2009). These methods are not taught except in advanced statistics or machine learning courses. Consequently, there is a lack of appreciation of these methods and how they can be applied in Healthcare. In this book, these multivariate methods will be briefly demonstrated to give clinicians a glimpse of the possibilities.

Statistics is taught at a rudimentary level in school and during specialty training. Given this lack of emphasis on an important subject, it is not surprising that students and junior doctors do not embrace statistics. When statistical analysis is done, it is through commercial statistical software such as SPSS, which has a graphical user interface (GUI) and encourages the user to perform analysis by clicking. R is a statistical program that takes the opposite approach. It requires the user to be able to code and to understand why a task should be done. This drawback means that many students, junior doctors and clinicians do not appreciate the advantages of R: it is free, open-source software with a large online community whose members are willing to share their code, and it gives direct access to the statisticians and bioinformaticians who write the software used in statistical analyses. Many of the new ideas in statistics have libraries available in R. By contrast, it may take several years for commercial software to catch up. R can be used to scrape data from the internet or to interface with data platforms such as the Google Maps application programming interface (API), YouTube and Twitter. RStudio, the integrated development environment (IDE) for R, provides a platform for Shiny app development, the creation of web documents and the writing of books (such as this one).

This book takes a non-traditional approach to teaching R. It emphasises learning R by example. Data science courses often spend time explaining how R treats data as vectors and manipulates data symbolically. Data manipulation is the foundation of data science but can bore those new to R. That aspect is left to the next chapter on data wrangling. This chapter is an introduction to ggplot2. Another aspect of learning R is that the libraries come with many free datasets. Clinicians do not find the diamond, car or gapminder datasets useful as they are not related to medicine. On the other hand, the actual Titanic data, with the passenger list and their fate, may be of interest. In this book we will try to use datasets that are directly related to medicine or to topics of high interest such as the COVID-19 data. Some of the data provided here come from publications by the Neurology Department, some are simulated, some come from the internet (COVID-19) and some are provided by R (e.g. fertility, cancer (lung, breast), leukemia, lymphoma, coronary artery disease, diabetes, hepatitis and microbiome). These datasets are available in R in the following packages: datasets, Stat2Data and mlbench. Additional datasets are available from external websites such as Kaggle and UCI Machine Learning Databases. For example, the heart disease data are available from Readers are encouraged to visit these websites to obtain data for learning R. Datasets from these websites are labelled in this book as coming from the ExtData folder. Unless indicated otherwise, the data used in this book can be found in the Data-Use folder. Researchers working on animal data may find the principles of analysis described here useful for animal research.
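
As a small illustration of working with a dataset that ships with R, the sketch below loads the `Titanic` table from the `datasets` package and tabulates survival by passenger class. Note that this built-in version is an aggregated contingency table rather than the passenger-level list mentioned above. The variable names `titanic_df` and `survival_by_class` are chosen here for illustration only, and the final step assumes that `mlbench` may or may not be installed on the reader's machine.

```r
# Titanic is bundled with base R (package 'datasets') as a 4-way
# contingency table: Class x Sex x Age x Survived
data("Titanic", package = "datasets")

# Convert the table to a data frame with one row per combination;
# the count for each combination is stored in the column 'Freq'
titanic_df <- as.data.frame(Titanic)
str(titanic_df)

# Cross-tabulate the counts: passenger class against survival
survival_by_class <- xtabs(Freq ~ Class + Survived, data = titanic_df)
print(survival_by_class)

# List the datasets provided by another package, but only if that
# package is installed (mlbench is not part of base R)
if (requireNamespace("mlbench", quietly = TRUE)) {
  print(data(package = "mlbench")$results[, "Item"])
}
```

The same pattern, `data("name", package = "pkg")`, should apply to the other packages named above once they are installed.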

This book is live and some sections are still under construction.