Data Set Observations:
In the CDC 2018 diabetes data, we are working with a dataset containing information on three variables: %diabetes, %obesity, and %inactivity.
There are 354 rows of data that contain information on all three variables: %diabetes, %obesity, and %inactivity.
The intention is to build a model to predict %diabetes using %inactivity and %obesity as factors. However, only 354 data points are available for this analysis.
We are also exploring the relationship between %diabetes and %inactivity on its own. There are 1370 data points for %inactivity in the dataset, and all 1370 of them also have a value for %diabetes.
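Counts like the 354 and 1370 above come from checking which rows have non-missing values in the columns of interest. The following is a minimal sketch of that check, assuming the data is loaded into a pandas DataFrame. The column names (pct_diabetes, pct_inactivity, pct_obesity) and the small table are made up for illustration and are not the CDC data.

```python
import numpy as np
import pandas as pd


def count_complete(df, columns):
    """Return the number of rows with a non-missing value in every listed column."""
    return int(df[columns].notna().all(axis=1).sum())


# Tiny made-up table for illustration only (NOT the CDC data).
toy = pd.DataFrame({
    "pct_diabetes":   [8.1, 9.4, 7.7, 10.2, 8.8],
    "pct_inactivity": [15.0, 17.2, np.nan, 16.1, 14.9],
    "pct_obesity":    [np.nan, 19.0, np.nan, 18.7, np.nan],
})

# Rows usable for a model with both predictors.
n_all_three = count_complete(toy, ["pct_diabetes", "pct_obesity", "pct_inactivity"])

# Rows usable for a model with %inactivity as the only predictor.
n_diabetes_inactivity = count_complete(toy, ["pct_diabetes", "pct_inactivity"])

print(n_all_three, n_diabetes_inactivity)
```

Requiring more columns can only keep the count the same or shrink it, which is why the three-variable model has fewer usable rows than the two-variable one.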
We observed that the %diabetes data is slightly skewed and that its kurtosis is higher than 3, the value expected for a normal distribution. Additionally, the quantile-quantile plot shows substantial deviation from normality, suggesting that the data may not follow a normal distribution.
Descriptive statistics for the %inactivity data are still to be generated; those calculations are not included here.
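To make the skewness and kurtosis comparison concrete, here is a minimal sketch that computes both from sample moments with NumPy. It uses the non-excess (Pearson) definition of kurtosis, under which a normal distribution has a value of 3. The samples are randomly generated for illustration and are not the %diabetes data.

```python
import numpy as np


def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) computed from central sample moments.

    Kurtosis is the non-excess (Pearson) version: about 3 for normal data.
    """
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Synthetic samples for illustration only (NOT the CDC data).
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
skewed_sample = rng.exponential(size=100_000)

normal_skew, normal_kurt = skewness_and_kurtosis(normal_sample)
skewed_skew, skewed_kurt = skewness_and_kurtosis(skewed_sample)

print("normal:     ", normal_skew, normal_kurt)
print("exponential:", skewed_skew, skewed_kurt)
```

The normal sample should give a skewness near 0 and a kurtosis near 3, while the exponential sample should give clearly larger values for both. Library routines may default to excess kurtosis (normal equals 0), so it is worth checking which definition is in use before comparing against 3.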
Linear Regression:
Linear regression is a statistical method used to model and understand the relationship between a dependent variable (the one you want to predict) and one or more independent variables (the predictors). It does this by fitting a straight line (hence “linear”) to observed data points. This line is called the regression line and is characterized by two parameters: slope (how much the dependent variable changes with a change in the independent variable) and intercept (the value of the dependent variable when the independent variable is zero).
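As a small illustration of slope and intercept, the sketch below fits a line to one predictor using the standard least-squares formulas. The function name fit_line and the numbers are hypothetical, chosen so that the points lie exactly on the line y = 0.5x + 3; they are not taken from the CDC data.

```python
import numpy as np


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # Slope: covariance of x and y divided by the variance of x.
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    # Intercept: the fitted line passes through the point of means.
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


# Made-up points lying exactly on y = 0.5 * x + 3 (illustration only).
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y = 0.5 * x + 3.0

slope, intercept = fit_line(x, y)
print(slope, intercept)

# Predicted value of the dependent variable at x = 20.
prediction = slope * 20.0 + intercept
print(prediction)
```

With real data the points will not fall exactly on a line, and the fitted slope describes the average change in the dependent variable per unit change in the predictor.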
The primary goal of linear regression is to find the best-fitting line, typically the one that minimizes the sum of squared differences between the values predicted by the line and the actual observed data (ordinary least squares). This technique is foundational in data analysis, providing insights into relationships between variables and enabling predictions based on historical data. Linear regression can be extended to multiple linear regression when there are multiple independent variables involved. It relies on certain assumptions, such as linearity, normality of residuals, and constant variance, which should be assessed to ensure the validity of the model.
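A multiple linear regression with two predictors can be sketched with NumPy's least-squares solver, as below. The data is synthetic, generated from assumed coefficients chosen only for illustration, and the function name fit_multiple is hypothetical. The returned residuals are what would be inspected when assessing the normality and constant-variance assumptions.

```python
import numpy as np


def fit_multiple(predictors, response):
    """Fit response = b0 + b1*x1 + b2*x2 + ... by ordinary least squares.

    Returns (coefficients, residuals); coefficients[0] is the intercept.
    """
    predictors = np.asarray(predictors, dtype=float)
    response = np.asarray(response, dtype=float)
    # Prepend a column of ones so the model includes an intercept.
    design = np.column_stack([np.ones(len(predictors)), predictors])
    coefficients, *_ = np.linalg.lstsq(design, response, rcond=None)
    residuals = response - design @ coefficients
    return coefficients, residuals


# Synthetic data for illustration only (NOT the CDC data); the "true"
# coefficients 1.0, 0.2 and 0.3 are arbitrary assumptions.
rng = np.random.default_rng(1)
inactivity = rng.uniform(10, 25, size=200)
obesity = rng.uniform(12, 22, size=200)
noise = rng.normal(scale=0.5, size=200)
diabetes = 1.0 + 0.2 * inactivity + 0.3 * obesity + noise

coefficients, residuals = fit_multiple(
    np.column_stack([inactivity, obesity]), diabetes
)
print("intercept, b_inactivity, b_obesity:", coefficients)
print("residual mean:", residuals.mean())
print("residual std: ", residuals.std())
```

Because the model includes an intercept, the residuals average to zero by construction, so checking the assumptions means looking at their shape and spread (for example with a histogram, a quantile-quantile plot, or a plot of residuals against fitted values) rather than at their mean.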