1 Heart Disease Dataset

This dataset was published in the UCI Machine Learning Repository. It has 303 rows and 16 columns. I am going to check for outliers and missing values, and explore the trends and relationships among different features with R. The original dataset is available from the UCI Machine Learning Repository.

1.1 Features (metadata):

  1. age: age in years
  2. sex: sex (1 = male; 0 = female)
  3. cp: chest pain type -- Value 1: typical angina -- Value 2: atypical angina -- Value 3: non-anginal pain -- Value 4: asymptomatic
  4. rest_bp: resting blood pressure (in mm Hg on admission to the hospital)
  5. chol: serum cholesterol in mg/dl
  6. restecg: resting electrocardiographic results -- Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions)
  7. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. slope: the slope of the peak exercise ST segment -- Value 1: upsloping -- Value 2: flat -- Value 3: downsloping
  11. ca: number of major vessels (0-3) colored by fluoroscopy
  12. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
  13. target: diagnosis of heart disease (angiographic disease status) -- Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing
  14. oldpeak: ST depression induced by exercise relative to rest

1.2 R code

This is the code used in the data wrangling step before visualising the data.

# Load required packages.
library(janitor)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
library(dplyr)
library(tibble)
library(exploratory)

# Set working directory so that the script can read saved data file.
setwd("data path"); jsonlite::toJSON(TRUE)
# Steps to produce the output
exploratory::read_delim_file("https://raw.githubusercontent.com/PacktWorkshops/The-Data-Analysis-Workshop/master/Chapter07/Dataset/heart.csv", delim = NULL, quote = "\"" , col_names = TRUE , na = c('') , locale=readr::locale(encoding = "UTF-8", decimal_mark = ".", tz = "Africa/Cairo", grouping_mark = "," ), trim_ws = TRUE , progress = FALSE) %>%
  readr::type_convert() %>%
  exploratory::clean_data_frame() %>%
  # Rename the original columns to more descriptive names.
  rename(chest_pain = cp, rest_bp = trestbps, fast_bld_sugar = fbs, rest_ecg = restecg, st_deper = oldpeak, max_hr = thalach, ex_angina = exang, colored_vessels = ca, thalassemia = thal, cholestrol=chol)
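
The introduction mentions checking for missing values, but the report does not show that step. A minimal sketch in base R of how it could be done is given here. The data frame `heart_toy` and its values are made up for illustration and stand in for the wrangled data; they are not taken from the real dataset.

```r
# Toy stand-in for the wrangled data frame; values are made up for illustration.
heart_toy <- data.frame(
  age        = c(63, 37, 41, NA, 57),
  cholestrol = c(233, 250, NA, 236, 354),
  target     = c(1, 1, 1, 0, 0)
)

# Count missing values per column.
na_per_column <- colSums(is.na(heart_toy))
print(na_per_column)

# Number of rows with no missing value in any column.
n_complete <- sum(complete.cases(heart_toy))
print(n_complete)
```

If every entry of `na_per_column` is zero for the real data, no imputation step is needed before plotting.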

2 Using box plots to detect outliers

2.0.1 Checking the outliers for cholesterol

2.1 cholestrol BoxPlot


In the previous chart, there are a few outliers above 370 mg/dl.
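
The points a box plot draws as outliers can also be listed numerically. The sketch below assumes the chart uses the conventional rule of 1.5 times the interquartile range beyond the box, which is what base R's `boxplot.stats()` applies (using hinges as the quartiles). The cholesterol values are made up for illustration.

```r
# Made-up cholesterol values (mg/dl), for illustration only.
chol <- c(200, 210, 220, 230, 240, 250, 260, 270, 417, 564)

# boxplot.stats() uses the same whisker rule as boxplot():
# points further than coef * (hinge spread) from the box are outliers.
stats <- boxplot.stats(chol, coef = 1.5)

chol_outliers <- stats$out             # values beyond the whiskers
whisker_range <- stats$stats[c(1, 5)]  # lower and upper whisker ends

print(chol_outliers)
print(whisker_range)
```

The same call could be applied to each numeric column in turn to get the outlier values behind every box plot in this section.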

2.1.1 Checking the outliers for st_depr

2.2 st_depr


2.2.1 BoxPlot for colored_vessels

2.3 colored_vessels


2.3.1 Checking outliers for thalassemia

2.4 thalassemia


Note: the preceding box plots show some outliers. However, they will not be removed or imputed, as the dataset is small to begin with; they are only shown here.


3 Showing the distributions of and relationships between certain features

3.0.1 Age distributions

3.1 Age


We can observe that the youngest patient was 29 years old, while the oldest was 77, and the majority of patients were in their 50s and 60s. The most common age is 58 years old.
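
The figures read off the chart (minimum, maximum, most common age) can be computed directly. A minimal sketch in base R, using a made-up `ages` vector for illustration rather than the real column:

```r
# Made-up ages, for illustration only.
ages <- c(29, 45, 52, 54, 58, 58, 58, 61, 64, 77)

# Youngest and oldest patient.
age_range <- range(ages)

# Most common age: tabulate the values and take the most frequent one.
age_counts <- table(ages)
most_common_age <- as.numeric(names(age_counts)[which.max(age_counts)])

print(age_range)
print(most_common_age)
```

Note that `which.max()` returns only the first maximum, so ties between equally common ages would need separate handling.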

3.1.1 Getting the number of patients who have heart disease and those who have not

3.2 target count

In the preceding table, 1 means the patient has the disease, while 0 means the patient does not. Next, I am going to represent these counts visually with a count plot.

3.3 target count plot


3.3.1 Count plot of the presence of heart disease by sex

3.4 sex summary count


3.5 target gpby present absent


From the previous graphs, we can conclude that 72 out of 96 female patients have been diagnosed with heart disease. The situation is the opposite for male patients: most of them have not been diagnosed with heart disease.
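
Counts and within-group shares like these can be obtained with a cross-tabulation. The sketch below uses base R and a small made-up `patients` data frame; the 0/1 coding follows the metadata in section 1.1, but the rows are invented for illustration.

```r
# Made-up rows, for illustration only (sex: 1 = male, 0 = female).
patients <- data.frame(
  sex    = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  target = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)
)

# Turn the 0/1 codes into readable labels.
patients$sex <- factor(patients$sex, levels = c(0, 1),
                       labels = c("female", "male"))
patients$target <- factor(patients$target, levels = c(0, 1),
                          labels = c("absent", "present"))

# Cross-tabulate sex against disease status.
counts <- table(sex = patients$sex, disease = patients$target)

# Share of each disease status within each sex (rows sum to 1).
row_share <- prop.table(counts, margin = 1)

print(counts)
print(round(row_share, 2))
```

Row-wise shares make the two groups comparable even though the dataset contains more male than female patients.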


4 The distributions and relationships among columns with respect to the target column

4.1 chest pain count


4.2 Presence of Heart Disease by Chest Pain Type


This chart shows that most patients have typical angina, followed by non-anginal pain and then atypical angina. Most patients who had typical angina were not diagnosed with heart disease. The largest group of patients diagnosed with heart disease had non-anginal pain.


5 Colored Vessel

5.1 colored vessels summary


5.2 colored vessels by target


Most patients who have 0 colored vessels have been diagnosed with heart disease, which suggests a negative correlation between the number of colored vessels and heart disease.


6 Slope

6.1 The total number of each type of slope

6.2 slope summary

Its distribution

6.3 heart disease by slope

From the above chart, most patients with a downsloping ST segment at peak exercise have been diagnosed with heart disease, so there might be a correlation between the two. In addition, most of the patients with a flat slope have not been diagnosed with heart disease.


7 The relationship between the presence of heart disease and maximum recorded heart rate

7.1 Creating a scatter plot for detecting the relationship

7.2 Heart Disease based on Age and Maximum Heart Rate gpby target

We can state that although there is no distinct pattern, as the dots for absent and present are mixed together, quite a few patients with a higher heart rate have been diagnosed with heart disease regardless of age. Also, some patients with heart disease are on the younger side but have a high heart rate.
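
When the dots of a scatter plot overlap, a per-group summary can complement the visual impression. A minimal sketch in base R, comparing the median maximum heart rate between the two target groups; the `obs` data frame is made up purely to show the mechanics and says nothing about the real data.

```r
# Made-up observations, for illustration only.
obs <- data.frame(
  max_hr = c(120, 130, 140, 150, 160, 165, 170, 180),
  target = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Median maximum heart rate within each target group.
# The result is a named vector with names "0" and "1".
median_hr <- tapply(obs$max_hr, obs$target, median)

print(median_hr)
```

The median is used here rather than the mean so that a few extreme heart rates do not dominate the comparison.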


8 What next?

8.0.1 The best course of action is to take a closer look at the relationship between age and the presence or absence of heart disease

We can do that by categorizing age into bins and then plotting each age group against the presence and absence of heart disease.
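
Binning a numeric column can be sketched with base R's `cut()`. The bin edges and labels below are assumptions chosen for illustration (the report does not state which bins were used), and the `ages` and `target` vectors are made up.

```r
# Made-up values, for illustration only.
ages   <- c(29, 35, 41, 48, 52, 57, 58, 63, 69, 77)
target <- c(1, 1, 1, 0, 1, 0, 0, 0, 1, 0)

# Assumed bin edges; right = FALSE makes each bin include its lower edge,
# so 40 falls into "40-49" rather than "20-39".
age_group <- cut(ages,
                 breaks = c(20, 40, 50, 60, 80),
                 labels = c("20-39", "40-49", "50-59", "60-79"),
                 right = FALSE)

# Count patients per age group and disease status.
group_counts <- table(age_group, target)

print(group_counts)
```

The resulting table is what a grouped bar chart of disease status per age group would display.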

8.1 presence of heart disease per age group


The previous observation has been confirmed: among the youngest patients, those diagnosed with heart disease stand out compared with those who have not been diagnosed.


9 Investigating the relationship between the presence of heart disease and the cholesterol column

A scatter plot is the workhorse for this step.

9.1 Presence of Heart Disease based on Age and Cholesterol


Again, categorizing cholesterol into groups and plotting them against the presence and absence of heart disease:

9.2 The presence of heart disease per cholesterol group

There is not much to tell. Some patients who have heart disease do have high cholesterol, but overall cholesterol does not show a strong relationship with heart disease or age.


10 Finally, getting correlations

10.1 Visualizing the correlation - Correlations - Heatmap


10.2 Visualizing the correlation - Significance


10.3 Visualizing the correlation - Scatter Matrix


From the previous correlations, slope, maximum heart rate and chest pain have a positive correlation with the target column.
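
The numbers behind a correlation heatmap and a significance view can be sketched with base R's `cor()` and `cor.test()`. This assumes Pearson correlation, which the report does not state explicitly, and uses a made-up `heart_toy` data frame with a few of the renamed columns.

```r
# Made-up values, for illustration only.
heart_toy <- data.frame(
  max_hr = c(150, 187, 172, 178, 163, 108, 129, 160, 147, 125),
  slope  = c(0, 0, 2, 2, 2, 1, 1, 2, 1, 1),
  target = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Pairwise Pearson correlations between all columns (the heatmap values).
cor_matrix <- cor(heart_toy, method = "pearson")

# Correlation of every column with the target column.
target_cor <- cor_matrix[, "target"]

# Significance test for a single pair of columns.
hr_test <- cor.test(heart_toy$max_hr, heart_toy$target)

print(round(cor_matrix, 2))
print(hr_test$p.value)
```

Because `target` is binary, the Pearson coefficient here is a point-biserial correlation; a rank-based method such as `method = "spearman"` could be a reasonable alternative for ordinal columns like `slope`.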

11 Conclusion

This dataset contains medical data of 303 patients. This report has checked for outliers, distributions, and some correlations with the presence and absence of heart disease.