Who are the most vulnerable in society?

By: Evan Owan

Summary of research questions and results

RQ1: How does one's health condition affect their chances of hospitalization or mortality?

This analysis seeks to determine which health conditions are more likely to lead to a hospitalization or mortality. Knowing this information would allow us to see which demographics have the highest chances of hospitalization or mortality.

The analyses showed that heart failure and chronic obstructive pulmonary disease cause the most hospitalizations and that heart failure and acute myocardial infarction cause the most mortalities.

RQ2: How does age affect the chances of hospitalization and mortality from each health condition?

This analysis seeks to analyze the relationship between the various age ranges and their chances of hospitalization and mortality from each health condition. This heatmap can reveal to us at what age doctors should be most on the lookout for health conditions.

The analyses highlighted that hospitalizations are most common among people aged 65-74 and mortalities among people aged 75-84.

RQ3: How does race affect the chances of hospitalization from each health condition?

This analysis aims to identify which race suffers more hospitalizations from each health condition. Knowing the answer to this question could potentially shed light on inequalities in access to health care.

The analyses showed that the rates of hospitalizations between Whites and Blacks were fairly similar for most health conditions aside from heart failure and acute myocardial infarction.

RQ4: How does sex affect the chances of hospitalization from health conditions?

This question seeks to address whether one sex has a greater risk of hospitalization from each health condition. If we knew the answer to this question, we could put a greater emphasis on testing for these chronic conditions in the respective sex so that diagnosis and treatment could begin earlier.

The analyses showed an elevated chance for men to be hospitalized from heart failure and acute myocardial infarction and for women from chronic obstructive pulmonary disease.

RQ5: How does geographic area affect the likelihood of hospitalization from health conditions?

This analysis seeks to identify a relationship between urban or rural living and a heightened risk of hospitalization from each health condition. The results of this question could potentially reveal a widespread gap in health care quality in rural versus urban living.

The analyses showed that people living in an urban area were more likely to be hospitalized and significantly more likely to die.

RQ6: How does one's state affect their chances of having each health condition?

This analysis seeks to see which states have a relationship with the most diagnoses of each health condition. Knowing the answer to this question could allow us to conduct further analysis to see if these states have any underlying issues which lead to the high rate of cases.

The analyses showed that people living in Texas, Georgia, Virginia, North Carolina, and Florida are most likely to be diagnosed with a chronic condition.

RQ7: Is there a relationship between older ages and certain states/regions?

This analysis seeks to identify the correlation between older ages and their population in certain states/regions. There is this idea of becoming more vulnerable from a health perspective as you grow older, so it would be helpful if we could see which states from the data set represent the most elderly people.

The analyses showed that the largest populations of elderly people were found in Texas, Georgia, Virginia, North Carolina, and Mississippi.

RQ8: Does one's race make them more likely to live in a particular state or region?

This analysis seeks to identify the correlation between race and particular states. In question three, we identified the relationship between race and hospitalizations from each condition. If we know which states each race tends to live in, we could then direct more funds to those states for health care procedures and treatment.

The analyses showed that both Whites and Blacks are most commonly found in the South, specifically in Georgia, Virginia, and Texas.

RQ9: Is there a regional relationship between the states with the most hospitalizations and mortalities?

This analysis seeks to examine whether the states with the most hospitalizations and mortalities are all in the same region of the U.S. Knowing the answer to this question would allow us to pose further questions, such as whether regional lifestyles or climate affect the rate of hospitalizations and mortalities.

The analyses showed that Southern states produced the most hospitalizations and mortalities.

RQ10: Is there a relationship between high state population and representation in the data set?

This question seeks to see if states with a higher population are represented more in the data set than states with a low population. This is important to consider moving forward because we want to know if our data is indicative of the proportional population breakdown of the U.S. according to each state.

The analyses showed that the data does not follow state population; rather, most of the data comes from Southern and Midwestern states.

Motivation and background

These three data sets refer to rates of hospitalizations and mortalities in the U.S. from chronic conditions in 2020. While ideally we could provide treatment and care to everyone with these chronic conditions, the reality is that we do not have the resources and workers to do this. In 2021, we have seen the effects of widespread supply and worker shortages complicate all aspects of our country's operations. With limited resources, it becomes even more crucial to know who the most vulnerable people are. That way health care workers and resources can be devoted to treating those who need assistance the most. My research questions use the data sets to identify the demographics most plagued by death from chronic conditions. By identifying these groups, we can properly allocate our health care resources in order to minimize the loss of life among the demographics that suffer the most from chronic conditions.

Datasets (with an s!)

I used the CMS Excel spreadsheet which contained the three sheets: county geographies, hospitalizations, and mortalities. The datasets describe the relationship between U.S. geography and general demographic factors on hospitalizations and mortalities from chronic conditions.

In order to upload the dataset and its three sheets into Exploratory, I first downloaded the CMS Excel spreadsheet. Next I created a "New Project" and then clicked the plus button to the right of "Data Frames" and clicked on "File Data" and then "Excel File". I then clicked the box for the CMS Excel spreadsheet and hit "import" and then "save". I repeated the process an additional two times, changing the sheet name each time before hitting "save" because you can only upload one sheet at a time. Therefore, the dataset uploading process needs to be completed three times to upload the complete dataset.

Ethical considerations of data

One worry regarding the data collection methods is that they only reflect the number of mortalities among those with chronic diseases who were previously hospitalized. Some people probably went undiagnosed and then passed away, which would skew the data sets. Another potential issue with the data set is that it only draws from the White and Black racial demographics. The dataset fails to include the Hispanic, Asian, Native American, or Pacific Islander races. Also, there are significantly more White people than Black people in the data set.

Methodology (analysis)

The scope of my project was to identify which demographic factors and variables from this dataset led to the highest rates of hospitalizations and mortalities from chronic conditions. Knowing this information will then allow health care fields to redirect more funding and personnel to serve these vulnerable demographics. By knowing which conditions are most common for each demographic, we can begin screening for them earlier and thus give people earlier diagnoses. As a result, we can begin treatment for the conditions earlier, which could prove to be life saving for some.

In order to replicate my analyses, you need to think of demographic questions that can be tested through different types of charts. I tested the available demographics/variables of state, health condition, sex, race, age, geographic area, and hospitalization and mortality numbers. I typically would test the relationship of 2-3 variables in each chart. Once I made that chart, I would look for the most significant relationships and/or representations. Using this methodology, I found that for chronic health conditions the most vulnerable states are Texas, Virginia, Georgia, and North Carolina. Men are more vulnerable, along with those in urban areas and those aged 65-84. Lastly, the condition that makes people most vulnerable is heart failure by a wide margin. People with these characteristics should get the most medical attention.

Analytics 1 - Correlations

Those who analyze Medicare-level measures may find it of note that there is virtually no difference in the correlation between the analysis value and fips between those living in rural and urban areas.

Results

Q1

The results in the bar chart showed that heart failure and chronic obstructive pulmonary disease cause the most hospitalizations and heart failure and acute myocardial infarction cause the most mortalities. Honestly, I had no prior knowledge of the severity of these conditions, so I had no idea what to expect the results of this chart to be. It is interesting to see how there are significantly more hospitalizations and mortalities from heart failure. Now that we have identified the most severe conditions, we can examine which demographics are most likely to acquire them.

Q2

The results in the heatmap showed that hospitalizations are most common among people aged 65-74 and mortalities among people aged 75-84. I feel that this makes sense because it probably means that these conditions do not cause immediate death but instead drag on for multiple years. People are probably first diagnosed at 65-74. Those who are unable to overcome their conditions probably fight them for years and then die between 75 and 84.
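As a rough illustration outside of Exploratory, the heatmap's "Color By=(Number of Rows)" setting can be thought of as counting records per (age range, condition) cell. The sketch below is in Python rather than Exploratory, and the sample rows are hypothetical; only the column names `condition` and `primary_age` come from the dataset.

```python
from collections import Counter

# Hypothetical sample rows; only the column names come from the dataset.
rows = [
    {"condition": "Heart failure", "primary_age": "65-74"},
    {"condition": "Heart failure", "primary_age": "65-74"},
    {"condition": "Heart failure", "primary_age": "75-84"},
    {"condition": "COPD", "primary_age": "65-74"},
]

# Each heatmap cell is the number of records with that age range and condition.
cells = Counter((r["primary_age"], r["condition"]) for r in rows)

for (age, condition), n_rows in sorted(cells.items()):
    print(f"{age:>6} | {condition:<14} | {n_rows}")
```

Note that a cell built this way counts records, not summed `analysis_value`, so a dark cell means many rows in the dataset for that group rather than necessarily more hospitalizations.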

Q3

The results in the line chart showed that the rates of hospitalizations between Whites and Blacks were fairly similar for most health conditions aside from heart failure and acute myocardial infarction. This was surprising to me at first because I thought the rate of hospitalizations would be higher for Blacks due to income inequality increasing their exposure to poor health conditions more so than Whites. However, then I realized that there were 131,265 Whites and only 56,025 Blacks in the dataset. This means that the dataset is not equally representing Whites and Blacks, thus somewhat skewing the results. The fact that the numbers of hospitalizations for Alzheimer's Disease and Cancer are so close in the chart, given that there are significantly fewer Blacks in the dataset, might suggest that Blacks are more vulnerable to acquiring those conditions.
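The imbalance can be made concrete with a little arithmetic. The sketch below, in Python, uses the two counts reported above and treats them as record counts (an assumption). The hospitalization totals are hypothetical placeholders, included only to show how dividing by group size can change a comparison.

```python
WHITE_ROWS = 131_265  # count reported above
BLACK_ROWS = 56_025   # count reported above

# How lopsided is the dataset?
ratio = WHITE_ROWS / BLACK_ROWS
white_share = WHITE_ROWS / (WHITE_ROWS + BLACK_ROWS)
print(f"Whites per Black record: {ratio:.2f}")
print(f"White share of records: {white_share:.1%}")


def per_record(total, n_records):
    """Scale a summed value by the number of records in the group."""
    return total / n_records


# Hypothetical totals for one condition, NOT taken from the dataset.
white_total = 1_000_000
black_total = 900_000

# Raw totals favor the larger group; per-record values can reverse the order.
print(per_record(white_total, WHITE_ROWS))
print(per_record(black_total, BLACK_ROWS))
```

With totals this close and groups this different in size, the smaller group ends up with the higher per-record value, which is the reasoning behind the last sentence of the paragraph above.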

Q4

The results in the boxplot highlighted an elevated chance for men to be hospitalized from heart failure and acute myocardial infarction and for women from chronic obstructive pulmonary disease. It makes sense to me that men and women suffer from these conditions the most because the results in Q1 showed this. Now we can emphasize which conditions to be most on the lookout for by sex. However, I think we should be somewhat alarmed that men are more likely to be diagnosed with the two conditions that cause the most mortalities.

Q5

The results of the pivot table showed that people living in an urban area were more likely to be hospitalized and significantly more likely to die. I thought that this was especially frightening given that there are approximately 20,000 fewer urban people in the dataset. I had always thought that there would be fewer mortalities in urban areas because they have better doctors and health care facilities. Moving forward, we may want to examine the effectiveness of health care in urban areas due to the results shown in the pivot table.

Q6

The results of the line chart showed that people living in Texas, Georgia, Virginia, North Carolina, and Florida are most likely to be diagnosed with a chronic condition. Honestly, I had no prior knowledge that would lead me to expect certain states to have more diagnoses of chronic conditions. However, moving forward we should try to encourage more medical school graduates to work in these states in order to increase the number of health care resources available.

Q7

The results of the heatmap showed that the largest populations of elderly people were found in Texas, Georgia, Virginia, North Carolina, and Mississippi. This makes sense because the previous chart showed that the most diagnoses of chronic conditions occurred in all these states except for Mississippi. Since older people are probably more prone to these conditions, it makes sense that the states they have the highest representation in would also have the highest number of cases. Moving forward, we should focus health care funding on the 65-84 age range in Texas, Georgia, Virginia, and North Carolina.

Q8

The results of the bar chart showed that both Whites and Blacks are most commonly found in the South, specifically in Georgia, Virginia, Texas, and North Carolina. This does not surprise me as Whites and Blacks have historically been the main two races in the South. The influx of Asians and Hispanics into the region did not occur until recently. I would focus health care research on Blacks in those four states as Alzheimer's Disease and Cancer might be more common for them than for Whites.

Q9

The results of the bar chart showed that Southern states produced the most hospitalizations and mortalities. This then raises the question of why people in the South specifically are suffering from more chronic conditions. Perhaps the health care system there is inherently worse than in other regions of the U.S., or maybe regional factors encourage cultural behaviors that promote these chronic diseases. Either way, the South's relationship with chronic conditions should be investigated further as it is the hot spot for chronic conditions.

Q10

The results of the U.S. States map showed that the data does not follow state population; rather, most of the data comes from Southern and Midwestern states. This is important to note because part of the reason why the South may have the most chronic conditions in the dataset is that it is the most represented region in it. The dataset lacks a significant amount of data from high-population states such as California and New York. It would be helpful to remake this dataset with proportional representation according to each state's actual population. Moving forward for now, I still feel that it would be helpful to redirect more health care funding and personnel to the South to assist in further research and treatment.
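One way to check for proportional representation is to compare each state's share of the dataset with its share of the population. The Python sketch below is only an illustration: the function name `representation_ratio`, the state labels "A", "B", and "C", and all of the numbers are hypothetical and do not come from the CMS data.

```python
def representation_ratio(rows_by_state, population_by_state):
    """Return (share of dataset rows) / (share of population) per state.

    A value near 1.0 means the state is represented proportionally,
    above 1.0 means over-represented, below 1.0 means under-represented.
    """
    total_rows = sum(rows_by_state.values())
    total_pop = sum(population_by_state.values())
    return {
        state: (rows_by_state.get(state, 0) * total_pop) / (total_rows * pop)
        for state, pop in population_by_state.items()
    }


# Hypothetical toy numbers, NOT real state data.
rows_by_state = {"A": 600, "B": 300, "C": 100}
population_by_state = {"A": 2_000_000, "B": 3_000_000, "C": 5_000_000}

ratios = representation_ratio(rows_by_state, population_by_state)
for state, value in ratios.items():
    print(state, round(value, 2))
```

In this toy example, state A supplies 60% of the rows but only 20% of the population, so it is over-represented by a factor of three, while state C is heavily under-represented.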

Reproducing your results

Importing data

See Datasets (with an s!) section

Cleaning the data

  1. Go to the county_geographies tab and click on the Summary tab and the down button on the "State" section.
  2. Hover over "Replace Values" and then click on "With New Values". Make sure "Overwrite Existing Column" is checked.
  3. Write the correct spelling to the right of the states that are spelled and capitalized incorrectly and then hit "Run".
  4. Hover over "Replace Values" on the "State" section and then click on "Convert US State Name". Make sure "Overwrite Existing Column" is checked. Have the Convert Type= "US State" and the Output Type= "US State Names" and then hit run.
  5. Go to the "mortalities" tab and click on the Summary tab and the down button on the "State" section.
  6. Hover over "Replace Values" and then click on "With New Values". Make sure "Overwrite Existing Column" is checked.
  7. Write the correct spelling to the right of the states that are spelled and capitalized incorrectly and then hit "Run".
  8. Repeat this renaming process for the "geography" and "primary_race" sections.
  9. Click the down button on the "primary_age" section.
  10. Hover over "Replace Values" and then click on "With New Values". Make sure "Overwrite Existing Column" is checked.
  11. Fix the duplicate age ranges and then hit "Run".
  12. In the steps section, click on the three lines option for step 2. Click on "Copy Step" and then paste the step into the "hospitalizations" tab. Repeat this copy and paste process for steps 3-5.
  13. Stay on the "hospitalizations" tab.
  14. Click the down button on the "condition" section.
  15. Hover over "Replace Values" and then click on "With New Values". Make sure "Overwrite Existing Column" is checked.
  16. Fix the extra space in "Acute myocardial infarction" and hit run.
  17. Click the + button to add a step and then click on "Merge (Add Rows and Unions)". Make mortalities the data frame and hit run.
  18. Click the down button on the "county" section and then "Create Calculation (Mutate)". Make sure "Overwrite Existing Column" is checked.
  19. Enter "str_remove_all(county, " County")" into the Calculation Editor and hit run. Repeat 2 separate times, changing the word in quotations to " Parish" and " Borough".
  20. Lastly, complete a Left Join with county_geographies on state and county.
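For readers who want to see the logic of the cleaning steps above outside of Exploratory, here is a minimal sketch in Python. It is not the Exploratory implementation. The sample rows are hypothetical, and the `clean` helper and the meaning of the `ID` column (assumed here to record which sheet a row came from after the merge) are assumptions made for illustration.

```python
import re

# Hypothetical sample rows; the real sheets have more columns and rows.
hospitalizations = [
    {"state": "texas", "county": "Harris County",
     "condition": "Heart failure", "analysis_value": 10},
    {"state": "LOUISIANA", "county": "Orleans Parish",
     "condition": "Acute  myocardial infarction", "analysis_value": 4},
]
mortalities = [
    {"state": "Alaska", "county": "Kodiak Island Borough",
     "condition": "Heart failure", "analysis_value": 2},
]
county_geographies = [
    {"state": "Texas", "county": "Harris", "Geo_Area": "Urban"},
    {"state": "Louisiana", "county": "Orleans", "Geo_Area": "Urban"},
]

# Mirrors the three str_remove_all calls in step 19.
SUFFIX = re.compile(r" (County|Parish|Borough)")


def clean(row, source):
    """Return a cleaned copy of one row, tagged with its source sheet."""
    out = dict(row)
    # Title-casing only fixes capitalization; misspelled states still
    # need a manual replacement, as in steps 3 and 7.
    out["state"] = row["state"].strip().title()
    out["county"] = SUFFIX.sub("", row["county"])
    # Collapse repeated or trailing spaces (step 16).
    out["condition"] = " ".join(row["condition"].split())
    out["ID"] = source  # assumed meaning of the ID column
    return out


# Step 17: stack the two sheets (a union of rows).
rows = [clean(r, "hospitalizations") for r in hospitalizations]
rows += [clean(r, "mortalities") for r in mortalities]

# Step 20: left join on (state, county); unmatched rows keep Geo_Area=None.
geo_by_key = {(g["state"], g["county"]): g for g in county_geographies}
joined = []
for r in rows:
    match = geo_by_key.get((r["state"], r["county"]), {})
    joined.append({**r, "Geo_Area": match.get("Geo_Area")})

for r in joined:
    print(r["state"], r["county"], r["condition"], r["ID"], r["Geo_Area"])
```

The left join keeps every hospitalization and mortality row even when no matching county is found, which is why the county suffixes have to be stripped first: otherwise "Harris County" would never match "Harris" in county_geographies.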

Running the Analysis

  1. Go to Chart in the hospitalizations tab.
  2. To create Q1, make the Type=Bar, X Axis=condition, Y Axis=analysis_value (make sure it says "SUM" to the right), Color (Group By)=ID, Sort By=Y1 Axis (DESC)
  3. To create Q2, make the Type=Heatmap, X Axis=condition, Y Axis=primary_age, Color By=(Number of Rows), Repeat By=ID
  4. To create Q3, make the Type=Line, X Axis=condition, Y Axis=analysis_value (make sure it says "SUM" to the right), Color (Group By)=primary_race
  5. To create Q4, make the Type=BoxPlot, X Axis=condition, Y Axis=analysis_value (make sure it says "NUM" to the right), Repeat By=primary_sex
  6. To create Q5, make the Type=Pivot Table, Row=Geo_Area and ID, Value=analysis_value (make sure it says "SUM" to the right)
  7. To create Q6, make the Type=Line, X Axis=condition, Y Axis=analysis_value (make sure it says "SUM" to the right), Color (Group By)=state (click the three lines to left and then limit and then the type "top", number of results "5" and then click apply)
  8. To create Q7, make the Type=Heatmap, X Axis=state, Y Axis=primary_age, Color By=(Number of Rows)
  9. To create Q8, make the Type=Bar, X Axis=state (click the three lines to left and then limit and then the type "top", number of results "10" and then click apply), Y Axis=analysis_value (make sure it says "SUM" to the right), Sort By=Y1 Axis (DESC), Repeat By="primary_race"
  10. To create Q9, make the Type=Bar, X Axis=state (click the three lines to left and then limit and then the type "top", number of results "10" and then click apply), Y Axis=analysis_value (make sure it says "SUM" to the right), Color (Group By)=ID, Sort By=Y1 Axis (DESC)
  11. To create Q10, make the Type=Map - Standard, Map=US States, Type=Circle, State=state, Color By=analysis_value (make sure it says "SUM" to the right, also click the three lines and then color settings and change the color palette to Red-Blue), Size=analysis_value (make sure it says "SUM" to the right)
  12. All of the axes of Q1-10 were renamed by clicking on the gear icon to the right of Type for each chart and then typing the new names for the x and y axes in the corresponding sections.
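Most of the charts above share two operations: summing analysis_value within groups, and limiting the result to the top N groups. The Python sketch below shows that logic under stated assumptions. The sample rows and the helper names `sum_by` and `top_n` are hypothetical, and Exploratory's own "top" limit may break ties or order results differently.

```python
from collections import defaultdict

# Hypothetical rows standing in for the merged data frame.
rows = [
    {"state": "Texas", "condition": "Heart failure",
     "ID": "hospitalizations", "analysis_value": 30},
    {"state": "Texas", "condition": "Heart failure",
     "ID": "mortalities", "analysis_value": 5},
    {"state": "Georgia", "condition": "Heart failure",
     "ID": "hospitalizations", "analysis_value": 20},
    {"state": "Georgia", "condition": "COPD",
     "ID": "hospitalizations", "analysis_value": 12},
    {"state": "Ohio", "condition": "COPD",
     "ID": "hospitalizations", "analysis_value": 8},
]


def sum_by(rows, *keys):
    """Sum analysis_value for each distinct combination of the given columns."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r["analysis_value"]
    return dict(totals)


def top_n(totals, n):
    """Keep the n groups with the largest totals, sorted descending."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Q1-style bar chart: SUM of analysis_value by condition, colored by ID.
q1 = sum_by(rows, "condition", "ID")

# Q8/Q9-style limit: keep only the top states by summed analysis_value.
top_states = top_n(sum_by(rows, "state"), 2)

print(q1)
print(top_states)
```

Interpreting a chart then amounts to reading off the largest totals, which matches the approach described under Interpreting the Results.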

Interpreting the Results

In order to interpret my visuals, I simply looked for the variables with the highest representation.

Collaboration/bibliography

I asked Jae Jung a few questions regarding general help with the data wrangling procedure.

Reflection

This final project was the perfect way for me to put my data science skills to the test. One of the main things that this project highlighted to me was the importance of having a game plan. You had to be very organized in order to complete this project. This means that you cannot just rush right in and start working. I had to be very deliberate and careful in the data wrangling process. Additionally, I took a significant amount of time to think of questions that would support my project topic of identifying the most vulnerable in the U.S. to chronic health conditions.

This assignment also taught me to focus on making a project and presentation interesting. I made sure to use a wide variety of charts in order to not bore those who analyze my research. I wish I had known prior to beginning my analysis the state composition of the dataset. I feel that the South's high representation in the dataset makes the results a little off from what they should be. While I still feel like more health care attention for chronic conditions should be directed towards the South, if I could redo this project I would do it with a dataset that proportionally represents the states by population. My advice to future students would be to take your time and embrace the process. It feels rewarding when your question gets answered in the end.