Information Gain and its role in decision trees

Information Gain in Decision Trees: 

1. Purpose of Decision Trees: Decision trees help make decisions by breaking down a problem into smaller, manageable steps, like a flowchart.

2. Entropy: Entropy is a measure of confusion or disorder. In decision trees, it gauges how mixed up our data is in terms of categories.

3. Information Gain: Information Gain is like a guide for decision trees. It helps decide which question (feature) to ask first to make our dataset less confusing.

4. How it Works: At each step, the tree looks at different questions (features) and picks the one that reduces confusion the most—this is high Information Gain.

5. Goal: The goal is to keep asking the best questions (features) to split our data until we reach clear and tidy groups.
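
To make the idea concrete, here is a minimal sketch of how entropy and Information Gain could be computed. The toy dataset and the feature names ("outlook", "windy") are made up for illustration and are not from any dataset used in this blog.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy before the split minus the weighted entropy after splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: do we play outside?
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

# The tree would ask about the feature with the highest gain first.
gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

In this toy example "outlook" separates the labels perfectly, so it removes all of the confusion, while "windy" removes none.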

Statistical method called regression modeling

I used a statistical method called regression modeling to take a closer look at two important things: the stats about Logan International Airport and how full hotels were. These factors played the lead roles in our study, helping us understand how they influence economic indicators.

By digging into the numbers, I didn’t just learn about the direct impact of transportation and hotels on the economy. I also uncovered hidden connections that, when pieced together, paint a complete picture of how these factors shape the larger economic scenario. This deep dive into the data gave us valuable insights and helped me grasp the complex forces that drive economic trends. It was a reminder of how crucial it is to consider various factors when thoroughly analyzing data.



Introduction to Regression Modeling

Over the past few days, I delved into the realm of advanced statistical analyses, with a primary focus on regression modeling. This sophisticated technique empowered us to systematically quantify relationships between crucial variables, injecting a quantitative dimension into our previously qualitative observations. This analytical step marked a pivotal moment as we sought to unravel the intricate web of connections within our data.

OLS Regression Results                            
Dep. Variable:      med_housing_price   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.683
Method:                 Least Squares   F-statistic:                     90.56
Date:                Thu, 30 Nov 2023   Prob (F-statistic):           2.21e-21
Time:                        21:09:15   Log-Likelihood:                -1103.1
No. Observations:                  84   AIC:                             2212.
Df Residuals:                      81   BIC:                             2219.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const       1.931e+05   5.58e+05      0.346      0.730   -9.18e+05     1.3e+06
unemp_rate   1.23e+07   2.19e+06      5.625      0.000    7.95e+06    1.66e+07
total_jobs    -1.4689      1.347     -1.091      0.279      -4.149       1.211
Omnibus:                        3.255   Durbin-Watson:                   0.296
Prob(Omnibus):                  0.196   Jarque-Bera (JB):                2.849
Skew:                           0.354   Prob(JB):                        0.241
Kurtosis:                       2.441   Cond. No.                     5.90e+07
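
As a quick sanity check, the adjusted R-squared and the F-statistic in the summary can be recomputed from the R-squared, the number of observations, and the number of predictors. This is only a sketch using the rounded values printed above, so small rounding differences are expected.

```python
# Values copied from the OLS summary above.
r_squared = 0.691
n_obs = 84   # No. Observations
k = 2        # Df Model (predictors, not counting the constant)

df_resid = n_obs - k - 1   # should match "Df Residuals"

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per predictor over unexplained variance per residual df.
f_statistic = (r_squared / k) / ((1 - r_squared) / df_resid)

print(f"Df Residuals: {df_resid}")
print(f"Adj. R-squared: {adj_r_squared:.3f}")
print(f"F-statistic: {f_statistic:.2f}")
```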

Preparing for In-Depth Analysis

After learning a bunch from our first look at the data, I’m gearing up to do some fancier analyses in the coming weeks. We want to figure out even more about how different things are connected by using some advanced statistical methods like regression modeling and hypothesis testing.

This way, I can discover more detailed insights and get a better handle on how Boston’s economy really works. It’s like moving from the basics to the next level of understanding, using fancier tools to uncover more details in how key factors are linked.

Challenges in Predicting Future Trends Over Time:

Forecasting future trends over time comes with various challenges that can impact the accuracy and trustworthiness of predictions. One common hurdle is non-stationarity, where the statistical characteristics of the data change as time progresses. Dealing with dynamic environments, pinpointing outliers, and managing anomalies are critical challenges. Moreover, selecting suitable models that effectively capture intricate temporal patterns and accommodating irregularities in data distribution remains an ongoing issue. Successfully addressing these challenges emphasizes the need for robust techniques and meticulous preprocessing in applications involving the forecasting of future trends over time.

Applications of Predicting Future Trends Over Time:

The practice of predicting future trends over time has broad applications across diverse fields. In finance, it plays a crucial role in anticipating stock prices and currency exchange rates. Demand forecasting utilizes models that analyze time series data to estimate future product demand, facilitating efficient inventory management. Within the energy sector, forecasting is essential for predicting electricity consumption and optimizing the allocation of resources. Weather forecasting heavily relies on time series analysis to predict variables such as temperature and precipitation. These applications underscore the versatility of predicting future trends over time, providing valuable insights for decision-making across industries, from finance to logistics and beyond.

Auto Regressive Integrated Moving Average


ARIMA, which stands for Auto Regressive Integrated Moving Average, is like a smart tool for predicting future values in a timeline, such as stock prices or weather patterns.

Auto Regressive (AR) Component: Think of this as looking at how today’s situation is connected to what happened in the past. If we say it’s AR(3), it means we’re looking at the connection between today and the past three days.

Integrated (I) Component: This part is about making sure our data is easy to work with. We do this by checking the difference between consecutive days. If we do it twice, it’s like looking at the change in change.

Moving Average (MA) Component: Here, we consider how today’s situation relates to any mistakes we made in predicting the past. If it’s MA(2), it means we’re looking at the connection between today and the errors we made in predicting the two previous days.

The ARIMA model is just a combination of these three, written as ARIMA(p, d, q), where:

p is about looking back in time (how many past values we use).

d is about making the data easy to work with (how many times we take differences).

q is about learning from our past mistakes (how many past prediction errors we use).
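
The Integrated part is the easiest to show in code. Below is a small sketch of differencing; the helper name `difference` and the example numbers are made up for illustration.

```python
def difference(series, d=1):
    """Take differences between consecutive values, repeated d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# Hypothetical series with a curved upward trend.
values = [1, 4, 9, 16, 25, 36]

first = difference(values, d=1)    # change from one day to the next
second = difference(values, d=2)   # change in the change
print(first)
print(second)
```

After differencing twice, the example series becomes flat, which is the kind of "easy to work with" data the model wants.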

Steps in Building an ARIMA Model:

Inspecting Data: Look at how things have been changing over time.

Making Things Simple: If things are too complicated, we simplify them by looking at differences between days.

Choosing Settings: Figure out how much we need to look back (p), how many times to simplify (d), and how much to learn from past mistakes (q).

Putting it All Together: Use these settings to build a smart model that learns from the past.

Checking How Well It Works: See how well our model predicts the future by comparing it to new data.

Making Predictions: Once it’s working well, we can use our smart model to make predictions about what might happen next.

ARIMA models are great for making predictions when things follow a clear pattern. However, for more complicated situations, there are other tools that might work even better.
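
To see the steps in one place, here is a deliberately simplified sketch: an ARIMA(1, 1, 0) model (look back one step, difference once, no moving-average part) fitted with ordinary least squares. This is not how a real library does it (packages such as statsmodels typically estimate the full model by maximum likelihood), and the function names and numbers are made up for illustration.

```python
def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = x[:-1], x[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    var = sum((a - mean_prev) ** 2 for a in prev)
    phi = cov / var
    return mean_curr - phi * mean_prev, phi

def forecast_arima_110(series, steps):
    """Difference once, fit AR(1) on the differences, then undo the differencing."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    c, phi = fit_ar1(diffs)
    level, last_diff, forecast = series[-1], diffs[-1], []
    for _ in range(steps):
        last_diff = c + phi * last_diff   # predict the next change
        level += last_diff                # add it back to get the next value
        forecast.append(level)
    return forecast

# Hypothetical series, built so that its changes follow an AR(1) pattern exactly.
series = [10, 14, 17, 19.5, 21.75, 23.875, 25.9375, 27.96875, 29.984375]
train, test = series[:-2], series[-2:]   # hold out the last two points as "new data"

forecast = forecast_arima_110(train, steps=len(test))
mae = sum(abs(a - f) for a, f in zip(test, forecast)) / len(test)
print(forecast, mae)
```

Because the example series was constructed to match the model, the error on the held-out points is essentially zero. Real data would not be this tidy.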

Time series forecasting

Time series forecasting is like predicting the future based on how things have changed in the past. Imagine you have a timeline of events, and you want to figure out what might happen next. This is used in many areas, like predicting stock prices, estimating how much of a product people will want to buy, forecasting energy usage, or even predicting the weather.

Stationarity: Many forecasting methods assume the series is stationary, meaning its statistical properties (like its average and how much it varies) stay the same over time. This makes it easier to make predictions.

Components of Time Series: Imagine the data as having three parts – a long-term trend (like a steady increase or decrease), repeating patterns (like seasons changing), and random ups and downs that we can’t explain (we call this noise).

Common Models: There are different ways to make predictions. Some look at the past data’s trends and patterns, like ARIMA. Others, like LSTM and GRU, are like smart computer programs that learn from the past and make predictions.

Evaluation Metrics: We use tools to check if our predictions are accurate. It’s like making sure our guesses about the future match up with what actually happens.

Challenges: Sometimes things change in unexpected ways, or there are unusual events that we didn’t predict. Adapting to these changes is one of the tricky parts.

Applications: This forecasting tool is handy in many fields. For example, it can help predict how much money a company might make, how many products they need to produce, or how much energy a city might use. In a nutshell, time series forecasting helps us plan for the future by learning from the past. It’s like looking at patterns in a timeline to make smart guesses about what comes next.


Today’s analysis involves looking at patterns—like trends or recurring behaviors—and figuring out how they unfold over different time periods. Imagine you’re tracking data points at different moments.

The first step is to organize and visually represent this data, often with timestamps showing when each piece was recorded. This helps us see the patterns more clearly. Next, we use various methods to break down the data, like splitting it into different parts or figuring out how one data point relates to another. We might also choose specific models (like mathematical formulas) to better understand what’s happening in the data.

To make sure our understanding is accurate, we use measures like Mean Squared Error or Mean Absolute Error to check how well our models predict or represent the changing patterns. Once we’ve built and confirmed our model, we can use it to make predictions about future values or to understand what might happen next in those patterns. It’s crucial to keep an eye on things over time, updating our analysis as we get more data. This way, we can be sure our understanding stays relevant and reflects any new developments. The specific tools we use for this depend on what we’re studying and what we want to find out. There are Python libraries, like pandas and statsmodels, that make these analyses easier for us to do.
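
The two error measures mentioned above are simple enough to write by hand. This is a small sketch with made-up actual and predicted values.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared errors; large misses are penalized more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the errors, in the same units as the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observed values and model predictions.
actual = [100, 110, 120, 130]
predicted = [102, 108, 123, 126]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(mse, mae)
```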


In today’s class, we calculated Cohen’s d, providing us with insights into the practical significance of the observed difference in average ages between two distinct racial groups, namely the White and Black populations. Upon examination, I determined that the resulting Cohen’s d value was approximately 0.57. This value lets us gauge the effect size of the observed age difference. According to established guidelines (roughly 0.2 for a small effect, 0.5 for medium, and 0.8 for large), a Cohen’s d value of this magnitude falls into the category of a medium effect size.

What this essentially signifies is that the approximately 7-year difference in average ages between the White and Black racial groups carries meaningful weight. While it may not reach the magnitude of a large effect, it is nonetheless a noteworthy and discernible difference that merits our attention and consideration. In practical terms, this medium effect size implies that the disparity in average ages between these two racial groups is of moderate importance and relevance. It suggests that the age difference, while not overwhelmingly substantial, is practically meaningful and should be taken into account when making relevant decisions or drawing conclusions in contexts where age plays a role.
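
For reference, here is a sketch of how Cohen’s d can be computed from group summaries. It uses the means, standard deviations, and counts reported in the "Age Distribution by Race" post and assumes the pooled-standard-deviation version of the formula, which may not be exactly what was used in class.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Summary statistics from the "Age Distribution by Race" post (White, then Black).
d = cohens_d(40.13, 13.16, 3244, 32.93, 11.39, 1725)
print(round(d, 2))
```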

Assessing Age Disparities Between White and Black Individuals

Statistical Analysis: In today’s analysis, I applied two distinct statistical approaches, a two-sample t-test and a Monte Carlo simulation, to assess potential age disparities between two groups, one represented by “AgesWhite” and the other by “AgesBlack.”

Two-Sample T-Test: The two-sample t-test, a widely recognized statistical method, was used to determine whether there is a statistically significant difference in means between the two groups. The results of the t-test are as follows:

T-statistic: 19.21
P-value: 2.28e-79
Negative Log (base 2) of p-value: 261.24

The t-statistic of 19.21 signifies a substantial difference in means between the ‘Black’ and ‘White’ racial groups. The remarkably small p-value (2.28e-79) provides strong evidence against the null hypothesis of no difference. The negative log (base 2) of the p-value puts this in perspective: under the null hypothesis, a result at least this extreme is about as likely as getting 261 consecutive tails when flipping a fair coin.
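
The t-statistic can be approximately reproduced from the group summaries reported in the "Age Distribution by Race" post. The sketch below assumes the equal-variance (pooled) form of the two-sample t-test; the p-value is copied from the result above rather than recomputed.

```python
import math

# Group summaries from the "Age Distribution by Race" post.
n_white, mean_white, sd_white = 3244, 40.13, 13.16
n_black, mean_black, sd_black = 1725, 32.93, 11.39

# Pooled variance, assuming both groups share the same underlying variance.
pooled_var = ((n_white - 1) * sd_white**2 + (n_black - 1) * sd_black**2) / (n_white + n_black - 2)
std_error = math.sqrt(pooled_var * (1 / n_white + 1 / n_black))
t_stat = (mean_white - mean_black) / std_error

p_value = 2.28e-79                # copied from the reported result
neg_log2_p = -math.log2(p_value)  # "number of consecutive coin flips" scale

print(f"t-statistic: {t_stat:.2f}")
print(f"-log2(p): {neg_log2_p:.2f}")
```

Small differences from the reported t-statistic are expected because the inputs here are rounded summaries rather than the raw ages.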

Monte Carlo Simulation: Remarkably, none of the 2,000,000 random samples generated in the Monte Carlo simulation produced a difference in means greater than the observed 7.2-year disparity between White and Black individuals. This outcome aligns with the t-test results, providing strong evidence that such a substantial age difference is exceedingly unlikely to occur by random chance if the null hypothesis (no difference in means) were true.

Combined Conclusion: Both the two-sample t-test and the Monte Carlo simulation converge on a consistent conclusion. The age difference of 7.2 years between White and Black individuals is highly statistically significant. The t-test presents strong evidence against the null hypothesis, and the Monte Carlo simulation reinforces this by demonstrating the extreme unlikelihood of observing such a large age difference by random chance. This collective statistical analysis firmly underscores the presence of a genuine and substantial disparity in mean ages between these two demographic groups.
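
One common way to run this kind of Monte Carlo check is a permutation test: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The sketch below assumes that scheme (the exact sampling procedure used in the analysis is not spelled out here) and runs on small made-up age lists, not the real data.

```python
import random

def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """One-sided Monte Carlo p-value for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a, n_b, total = len(group_a), len(group_b), sum(pooled)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                  # reassign group labels at random
        sum_a = sum(pooled[:n_a])
        diff = sum_a / n_a - (total - sum_a) / n_b
        if diff >= observed:
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical ages, chosen only to illustrate the mechanics.
ages_a = list(range(30, 50))
ages_b = list(range(23, 43))

observed, p_value = permutation_test(ages_a, ages_b)
print(observed, p_value)
```

With more resamples the estimated p-value becomes more precise, which is why a very large number of samples is useful when the true p-value is tiny.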

Two-Sample T-Test Results:
T-statistic: 7.637258554192298
P-value: 4.715270331488705e-07

Monte Carlo Simulation Results:
Observed Age Difference: 9.2
P-value (Monte Carlo): 5e-06



In today’s analysis, I delved into examining the age distribution from various angles, aiming to understand its deviation from a standard normal distribution, particularly in terms of extremeness and behavior.

The initial phase of the analysis involved calculating the mean age and standard deviation, providing valuable insights into the distribution’s central tendency and variability. These statistical measures set the stage for further exploration. A key aspect of the analysis was the determination of what proportion of the right tail of the age distribution extended beyond 2 standard deviations from the mean, for both the Black and White races.

To accomplish this, a threshold was established by adding 2 times the standard deviation to the mean. This threshold served as a demarcation line, identifying outliers in the right tail of the distribution. Subsequently, I quantified the percentage of data points within the dataset that exceeded this threshold. This calculation shed light on the rarity of values in the right tail and provided a deeper understanding of the distribution’s characteristics. Moreover, as a point of reference, I utilized the standard normal distribution, allowing for a comparative analysis between our data and the theoretical normal distribution.

This comparison, particularly in the tail region extending beyond 2 standard deviations, facilitated an assessment of how the age distribution deviated from the idealized normal distribution.
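
The threshold calculation described above can be sketched as follows. The ages are made up for illustration, and the sample standard deviation is assumed; the normal-distribution reference value comes from Python’s built-in `statistics.NormalDist`.

```python
from statistics import NormalDist, mean, stdev

def right_tail_share(values, n_sd=2.0):
    """Threshold at mean + n_sd * SD, and the share of values above it."""
    threshold = mean(values) + n_sd * stdev(values)
    share = sum(v > threshold for v in values) / len(values)
    return threshold, share

# Hypothetical right-skewed ages (not the real dataset).
ages = [18, 20, 22, 25, 25, 28, 30, 32, 35, 40, 48, 60, 85]

threshold, share = right_tail_share(ages)

# Share of a normal distribution lying more than 2 SD above the mean.
normal_share = 1 - NormalDist().cdf(2.0)

print(f"threshold: {threshold:.1f}")
print(f"share above threshold: {share:.3f}")
print(f"normal reference: {normal_share:.4f}")
```

If the observed share is clearly larger than the normal reference (about 2.3%), the right tail is heavier than a normal distribution would predict.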

Exploring Anomalies and Crafting Informed Hypotheses

As we delved into our dataset, we kept an eye out for things that seemed a bit unusual or didn’t quite fit the usual patterns. This led us to tweak our dataset to make sure it was ready for a closer look.

At the same time, I started making some educated guesses—basically, forming ideas that would help me figure out what to look at next. One interesting thought we had was whether how full hotels are might have something to do with how many people are working. It’s like a guess that guides us to focus on specific parts of the economic indicators I was checking out.

Age Distribution by Race

In the course of today’s analysis, I examined the age distribution of the White and Black racial groups. I’ve included a combined kernel density plot below for the ‘age’ variable in both the ‘Black’ and ‘White’ races. The kernel density plot representing the ‘age’ variable for the ‘Black’ race, shown in red, exhibits a positive skewness and a moderate level of peakedness. According to the statistical summary, the dataset comprises 1,725 individuals with an average age of approximately 32.93 years and a standard deviation of roughly 11.39.

The age values in this dataset span from 13 to 88 years, and the quartile values offer insights into the distribution’s overall spread. On the other hand, the kernel density plot for the ‘age’ variable within the ‘White’ race, shown in blue, exhibits a slight negative skewness and a relatively flat distribution.

The statistical summary for this dataset reveals a total of 3,244 individuals with an average age of approximately 40.13 years and a standard deviation of around 13.16. The age values in this dataset range from 6 to 91 years, and the quartile values provide valuable information about the distribution’s variability.



October 30, 2023

This dataset contains various variables, including age and race, which provide insights into the demographics of individuals involved in these incidents.

Age Data: The ‘age’ variable in the dataset represents the age of individuals who were fatally shot by the police. This variable is crucial for understanding the age distribution of the victims. You can analyze this data to determine the average age, age ranges, and other statistics related to the ages of the individuals involved in these incidents.

Race Data: The ‘race’ variable in the dataset categorizes the race or ethnicity of the individuals involved in these incidents. It provides information about the racial composition of those who were fatally shot. Analyzing race data can help identify patterns, disparities, and trends in police shootings with respect to different racial or ethnic groups.

To conduct a more in-depth analysis of age and race data within this dataset, you can perform various statistical and visualization techniques, such as creating histograms, bar charts, or other relevant visualizations to gain insights into these demographic aspects of police shootings.

October 27, 2023

During our class, we discussed the potential instability of DBSCAN in comparison to K-Means clustering. Here, we’ll outline various scenarios that illustrate the instability of DBSCAN:

Sensitivity to Density Variations:

DBSCAN’s stability can be affected by variations in data point density. If the dataset exhibits significant differences in data density across various segments, it can lead to the formation of clusters with varying sizes and shapes. Consequently, selecting appropriate parameters (such as the maximum distance ε and the minimum point thresholds) to define clusters effectively becomes a challenging task.

In contrast, K-Means assumes spherical and uniformly sized clusters, potentially performing more effectively when clusters share similar densities and shapes.

Sensitivity to Parameter Choices:

DBSCAN requires the configuration of hyperparameters, including ε (representing the maximum distance defining a data point’s neighborhood) and the minimum number of data points needed to establish a dense region. These parameter choices have a significant impact on the resulting clusters.

K-Means, while also requiring a parameter (the number of clusters, K), is generally more straightforward to determine, as it directly reflects the desired number of clusters. In contrast, DBSCAN’s parameters are more abstract, introducing sensitivity to the selection of parameter values.

Boundary Points and Noise:

DBSCAN explicitly identifies noise points, which are data points that don’t belong to any cluster, and it handles outliers well. However, the classification of boundary points (those located on the periphery of a cluster) within DBSCAN can sometimes appear arbitrary.

In K-Means, data points on the boundaries of clusters may be assigned to one of the neighboring clusters, potentially leading to instability when a data point is close to the shared boundary of two clusters.

Varying Cluster Shapes:

DBSCAN excels in its ability to accommodate clusters with arbitrary shapes and detect clusters with irregular boundaries. This is in contrast to K-Means, which assumes roughly spherical clusters and therefore demonstrates greater stability when data adheres to this assumption.

The choice between DBSCAN and K-Means should consider the specific characteristics of the dataset, as well as the objectives of the analysis, as these algorithms have different strengths and limitations.

October 23, 2023

In our recent class, our professor introduced several technical concepts, including the K-Medoids clustering technique, and delved into the concept of hierarchical clustering. Here’s a summary of what was covered:

K-Medoids is a partitioning clustering algorithm that distinguishes itself by its enhanced robustness, especially in handling outliers. In contrast to K-Means, which uses the mean (average) as the cluster center, K-Medoids uses an actual data point (the medoid) as the center of each cluster. The medoid is the data point that minimizes the sum of distances to all other points in the same cluster. This approach makes K-Medoids less sensitive to outliers and particularly suitable for clusters with non-Gaussian shapes.
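
The difference between a mean and a medoid is easy to see on a tiny example. This sketch uses made-up one-dimensional points with a single outlier; it shows only how a medoid is chosen for one cluster, not the full K-Medoids algorithm.

```python
def medoid(points):
    """The point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

# Hypothetical cluster with one extreme outlier.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = sum(cluster) / len(cluster)   # pulled toward the outlier
center_medoid = medoid(cluster)             # stays among the typical points
print(center_mean, center_medoid)
```

The mean lands at 22.0, far from most of the points, while the medoid is 3.0, which is itself one of the data points.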

Hierarchical Clustering:
Hierarchical clustering is a clustering method characterized by the construction of a tree-like structure of clusters, establishing a hierarchical relationship between data points. There are two primary approaches to hierarchical clustering: agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and iteratively merges the closest neighboring clusters, creating a dendrogram as a visual representation. In contrast, divisive clustering begins with all data points in a single cluster and then recursively divides them into smaller clusters. One notable advantage of hierarchical clustering is that it doesn’t require specifying the number of clusters in advance, and it provides a visual representation of the inherent grouping of the data.

A dendrogram is a graphical representation in the form of a tree-like diagram, employed to visualize the hierarchical structure of clusters within hierarchical clustering. This visual tool displays the sequence of merges or splits, along with the respective distances at which these actions occur. The height of the vertical lines within the dendrogram signifies the dissimilarity or distance between clusters. By choosing a specific height to cut the dendrogram, you can obtain a desired number of clusters, making dendrograms a valuable aid in cluster selection.
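
Here is a small sketch of agglomerative clustering that records the merge heights a dendrogram would display and shows what "cutting" at a given height means. It assumes single linkage (distance between the closest members) and made-up one-dimensional points; other linkage rules would give different heights.

```python
def agglomerate(points, cut_height=float("inf")):
    """Merge the two closest clusters repeatedly; stop once the next merge exceeds cut_height."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = distance between their closest members.
        dist, i, j = min(
            (min(abs(a - b) for a in clusters[i] for b in clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > cut_height:
            break
        heights.append(dist)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, heights

# Hypothetical data points.
points = [1.0, 2.0, 2.5, 10.0, 11.0, 30.0]

_, merge_heights = agglomerate(points)                  # full tree: every merge height
clusters_at_5, _ = agglomerate(points, cut_height=5.0)  # cut the tree at height 5
print(merge_heights)
print(clusters_at_5)
```

Cutting at height 5 leaves three groups here, because the remaining merges would only happen at larger distances.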

These concepts offer a comprehensive toolbox for exploring and understanding the underlying structures and relationships within datasets, catering to a wide range of data types and shapes.

October 20, 2023

In today’s analysis, I ventured into the realm of geospatial calculations, specifically calculating the geodesic distance between two geographic coordinates: Seattle, WA, and Miami, FL. To accomplish this, I harnessed the geodesic function from the geopy library, a reliable tool for accurately computing distances on the Earth’s curved surface. The outcome provided the distance between these two locations in both miles and kilometers, delivering valuable information for geospatial analysis.
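
As a rough cross-check that does not need any extra library, the distance can be approximated with the haversine formula. Note that this is a different technique from geopy’s geodesic function: haversine treats the Earth as a sphere, while a geodesic calculation uses an ellipsoid, so the two results differ slightly. The city coordinates below are approximate values I am assuming for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; treating the Earth as a sphere is an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate city-center coordinates (assumed for illustration).
seattle = (47.6062, -122.3321)
miami = (25.7617, -80.1918)

distance_km = haversine_km(*seattle, *miami)
distance_miles = distance_km / 1.609344
print(f"{distance_km:.0f} km, {distance_miles:.0f} miles")
```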

For my next analytical step, I’m gearing up to execute clustering based on geographic locations within the state of California. In pursuit of this objective, I explored two clustering algorithms: K-Means and DBSCAN.

K-Means is a widely adopted clustering algorithm renowned for partitioning a dataset into ‘K’ distinct clusters. Its operation involves iteratively assigning data points to the nearest cluster center (centroid) and recalculating these centroids until the algorithm converges. K-Means is prized for its simplicity in implementation and computational efficiency, making it a go-to choice for a range of applications, including image segmentation and customer segmentation.
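
The assign-and-recompute loop described above can be sketched in a few lines. This is a bare-bones illustration on made-up 2-D points with plain Euclidean distance (for real latitude/longitude data, Euclidean distance is only an approximation), not a replacement for a library implementation.

```python
import math
import random

def k_means(points, k, n_iter=100, seed=0):
    """Basic K-Means: assign each point to its nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k randomly chosen points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

centroids, clusters = k_means(points, k=2)
print(centroids)
```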

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN takes a different approach, relying on density-based criteria to cluster data points. Unlike K-Means, it doesn’t require a predetermined number of clusters. DBSCAN identifies core points, which have a sufficient number of data points in their neighborhood, and border points, which are in proximity to core points but lack a critical mass of neighbors to qualify as core points. Noise points, on the other hand, don’t align with any cluster. DBSCAN’s resilience to noise and its ability to unveil clusters of various sizes and shapes make it particularly well-suited for datasets with intricate structures.
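
To show how core, border, and noise points fall out of the two parameters, here is a compact sketch of the DBSCAN idea. The parameter names `eps` and `min_pts` and the sample points are my own choices for illustration, and a point counts as its own neighbor here; a library implementation would be much faster on real data.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1             # noise for now; may turn out to be a border point
            continue
        cluster_id += 1                # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id         # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:         # j is also a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

# Hypothetical points: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (10, 10), (10, 11), (11, 10), (11, 11)]

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point gets the label -1 (noise), and the number of clusters is never specified up front.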

These clustering techniques have the potential to reveal valuable patterns and structures within the geographic data, advancing our understanding of the distribution of police shootings in the state of California.

October 18, 2023

In the subsequent stage of my analysis, I harnessed geospatial data and related libraries to visually represent instances of police shootings across the United States. To achieve this, I extracted the ‘latitude’ and ‘longitude’ attributes from the dataset and meticulously filtered out any null values within these columns. The following steps involved the creation of an accurate geographical map of the United States, supplemented with the inclusion of distinctive red markers, ultimately forming a Geospatial Scatter Plot. The resulting visualization delivers a geographically precise depiction of where these incidents transpired, thus providing valuable insights into their distribution throughout the country. By plotting these incidents on the map, it becomes readily apparent where concentrations of police shootings occur, facilitating a deeper comprehension of regional trends and patterns.

The capability to visualize the scatter plot for individual states of the United States empowers policymakers, researchers, and the public to gain profound insights into the geographical dimensions of police shootings. This, in turn, has the potential to foster more informed discussions and drive actions aimed at addressing this significant issue. As a case in point, I’ve crafted a similar Geospatial Scatter Plot for the state of Massachusetts, a visual representation of which is included below.

Moving forward in my analysis, my agenda includes delving into the realm of GeoHistograms and exploring the application of clustering algorithms, as per our professor’s guidance in the previous class. Specifically, our professor introduced two distinct clustering techniques: K-Means and DBSCAN. It’s worth noting that K-Means entails the need to predefine the value of K, which can be a limiting factor. My goal is to implement both of these algorithms in Python and evaluate whether they yield meaningful clusters when applied to the geographic locations of the shooting data. This phase promises to uncover additional layers of insights and patterns within the dataset, contributing to a more comprehensive understanding of this critical issue.

October 16, 2023

Today, my primary focus was on analyzing the distribution of the ‘age’ variable. As depicted in the density plot of the ‘age’ variable, it’s evident that this column, representing the ages of 7,499 individuals (non-null values), displays a positive skew. This skewness indicates that the majority of individuals in the dataset tend to be on the younger side, resulting in a right-tailed distribution.

The average age stands at approximately 37.21 years, with a moderate level of variability around this mean, as evidenced by a standard deviation of 12.98. The ages range from 2 to 92 years, encompassing a diverse age demographic. With an excess kurtosis of 0.234, slightly above the value of 0 for a normal distribution, the distribution is marginally more peaked and heavier-tailed than a normal distribution. Additionally, the median age, which falls at 35 years, serves as the midpoint of the dataset and sits below the mean, which is consistent with the positive skew.
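
For reference, skewness and excess kurtosis can be computed directly from their definitions. The sketch below uses the simple population formulas on made-up ages; libraries such as pandas apply a small-sample correction, so their numbers differ slightly.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Average cubed z-score: positive when the right tail is longer."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Average fourth-power z-score minus 3, so a normal distribution scores 0."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3

# Hypothetical right-skewed ages (not the real dataset).
ages = [18, 20, 22, 25, 25, 28, 30, 32, 35, 40, 48, 60, 85]

age_mean, age_median = mean(ages), median(ages)
age_skew, age_kurt = skewness(ages), excess_kurtosis(ages)
print(age_mean, age_median, age_skew, age_kurt)
```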

Moving on to the box plot representation of the ‘age’ variable, it’s apparent that outliers beyond the upper whisker are present. In a box plot, the ‘whiskers’ typically indicate the range within which most of the data points fall. Any data point lying beyond these whiskers is considered an outlier, signifying that it deviates significantly from the typical range of values.

In this specific case, the upper whisker of the box plot extends to the largest observation that lies within 1.5 times the interquartile range (IQR) above the third quartile (Q3). Data points beyond this threshold are identified as outliers. The presence of outliers beyond the upper whisker in the ‘age’ variable suggests that there are individuals in the dataset whose ages significantly exceed the typical age range in the data.
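
The outlier rule can be written out in a few lines. This sketch uses made-up ages and assumes quartiles computed by linear interpolation (the "inclusive" method in Python’s `statistics` module); other quartile conventions shift the threshold slightly.

```python
from statistics import quantiles

def upper_outliers(values):
    """Upper threshold Q3 + 1.5 * IQR, and the values that lie above it."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return fence, [v for v in values if v > fence]

# Hypothetical ages (not the real dataset).
ages = [18, 20, 22, 25, 25, 28, 30, 32, 35, 40, 48, 60, 85]

fence, outliers = upper_outliers(ages)
print(fence, outliers)
```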

October 13, 2023

Commencing my analysis of the ‘fatal-police-shootings-data’ dataset in Python, I’ve loaded the data to scrutinize its various variables and their respective distributions. Notably, among these variables, ‘age’ stands out as a numerical column, offering insights into the ages of individuals tragically shot by law enforcement. Additionally, the dataset contains latitude and longitude values, pinpointing the precise geographical locations of these incidents.

During this preliminary assessment, I’ve identified an ‘id’ column, which appears to hold limited significance for our analysis. Consequently, I’m considering its exclusion from our further examination. Delving deeper, I’ve scrutinized the dataset for missing values, revealing that several variables exhibit null or missing data, including ‘name,’ ‘armed,’ ‘age,’ ‘gender,’ ‘race,’ ‘flee,’ ‘longitude,’ and ‘latitude.’ Furthermore, I’ve undertaken an investigation into potential duplicate records. This examination has uncovered just a single duplicate entry within the entire dataset, notable for its absence of a ‘name’ value. For the subsequent phase of my analysis, I intend to shift our focus towards exploring the distribution of the ‘age’ variable, a critical step in unraveling insights from this dataset.

In today’s classroom session, we gained essential knowledge on computing geospatial distances using location information. This newfound expertise equips us to create GeoHistograms, a valuable tool for visualizing and analyzing geographical data. GeoHistograms serve as a powerful instrument for pinpointing spatial trends, identifying hotspots, and uncovering clusters within datasets associated with geographic locations. This, in turn, enhances our comprehension of the underlying phenomena embedded within the data.

October 11, 2023

Today, we commenced our work with a recently acquired dataset from ‘The Washington Post’ website. This dataset sheds light on a troubling statistic: police shootings in the United States result in an average of over 1,000 deaths each year, as revealed by an ongoing investigation by ‘The Washington Post.’

A pivotal moment in this ongoing inquiry was the tragic 2014 killing of Michael Brown, an unarmed Black man, by police in Ferguson, Missouri. This incident exposed a significant issue: fatal police shootings were significantly undercounted in the data reported to the FBI, with more than half of such incidents going unreported. This problem has continued to worsen, with only a third of fatal shootings being accurately reflected in the FBI’s database by 2021. The primary reason for this underreporting is that local police departments are not required to report these incidents to the federal government. Additionally, complications arise from an updated FBI system for data reporting and confusion among local law enforcement agencies about their reporting responsibilities.

In response, ‘The Washington Post’ initiated its own comprehensive investigation in 2015, meticulously documenting every instance in which an on-duty police officer in the United States shot and killed someone. Over the years, their reporters have compiled a substantial dataset, which now comprises 8,770 records. This dataset includes various variables, such as date, name, age, gender, whether the person was armed, their race, the city and state where the incident occurred, whether they were attempting to flee, if body cameras were in use, signs of mental illness, and importantly, the involved police departments.

It’s worth noting that this dataset covers incidents from 2015 onwards, and ‘The Post’ recently updated it in 2022 to include the names of the police agencies connected to each shooting, providing a means to better assess accountability at the department level.

In our class today, we delved into some initial questions about this dataset. Notably, we discovered that there are multiple versions of the dataset available. The one accessible on GitHub provides information on police shootings by agencies, but the version obtained directly from the ‘Washington Post’ website includes a variable called ‘police_departments_involved.’ This means there’s no need to join against an external table to discover which police departments were involved in these shootings.

As the next step in my analysis, I plan to conduct a more detailed examination of the dataset and its variables to uncover further insights.


We are currently in the process of creating a concise and impactful report that summarizes the results of our study on the CDC Diabetes dataset.

In our examination of the CDC Diabetes dataset, we utilized a wide range of statistical techniques. These techniques included exploratory data analysis, correlation analysis, both simple and multiple linear regression, as well as the implementation of the Breusch-Pagan test to assess constant variance. Additionally, we introduced interaction terms, explored higher-order relationships through polynomial regression, and made use of cross-validation to evaluate the performance and generalization of our models.

Our findings offer some insight into the predictive power of our models. With an interaction term added to the linear model, the explanatory power (R-squared) was 36.5%. It rose to 38.5% with a multiple quadratic regression model for predicting diabetes that incorporated both ‘% INACTIVE’ and ‘% OBESE.’ When we applied Support Vector Regression, the explanatory power decreased to 30.1%.

While it is evident that ‘% INACTIVE’ and ‘% OBESE’ play a significant role in diabetes prediction, they may not fully capture the complex dynamics involved. This highlights the need for a more comprehensive analysis that considers a broader array of influencing factors. Therefore, incorporating additional variables is essential for gaining a deeper and more holistic understanding of diabetes prediction.


Continuing our analysis, we implemented Support Vector Regression (SVR) on the CDC dataset, incorporating quadratic and interaction terms. SVR is a machine learning algorithm designed for regression tasks, especially useful for high-dimensional data and scenarios where traditional linear regression models struggle to capture complex relationships between variables.

The SVR model was initialized with specific parameters, including the RBF (Radial Basis Function) kernel, the regularization parameter (C), and the tolerance for errors (epsilon). The RBF kernel is a mathematical function used in SVR to capture non-linear relationships, ‘C’ controls the trade-off between fitting the training data and preventing overfitting, and ‘epsilon’ specifies the margin within which errors are not penalized.

In this analysis, we used ‘INACTIVE’ and ‘OBESE’ as features, along with their squared values (‘INACTIVE_sq’ and ‘OBESE_sq’) and an interaction term (‘OBESE*INACTIVE’). We employed a K-Fold cross-validator with 5 folds to split the data into training and testing sets for cross-validation. A new SVR model was created, fitted to the training data, and used for predictions on the testing data. The ‘RBF’ kernel was applied to enable the model to learn the relationships between the input features and the target variable.

The performance of the SVR model was evaluated using the R-squared (R2) score, which returned a value of 0.30. This score is lower than the R-squared from our quadratic model, suggesting that the SVR model may not capture the data’s relationships as effectively.
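
A minimal sketch of this workflow with scikit-learn is shown below. It runs on synthetic data, not the CDC dataset, and the values of C and epsilon are placeholders, since the entry does not record the ones actually used.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# Synthetic stand-in for the CDC data (assumption: the real values are not reproduced here).
rng = np.random.default_rng(0)
n = 300
inactive = rng.uniform(10, 30, n)
obese = rng.uniform(10, 30, n)
diabetes = 2 + 0.2 * inactive + 0.1 * obese + rng.normal(0, 1, n)

# Features: the two predictors, their squares, and the interaction term.
X = np.column_stack([inactive, obese, inactive**2, obese**2, obese * inactive])
y = diabetes

# 5-fold cross-validation: each observation is predicted by a model that never saw it.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = np.empty_like(y)
for train_idx, test_idx in kf.split(X):
    model = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # placeholder hyperparameters
    model.fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

r2 = r2_score(y, y_pred)
print(f"out-of-fold R-squared: {r2:.3f}")
```

RBF kernels are sensitive to feature scale, so standardizing the features before fitting is often worth trying; that step is left out here because the entry does not mention it.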


Principal Component Analysis (PCA):

It is a dimensionality reduction technique widely used in data analysis, machine learning, and statistics. It helps to simplify complex data by transforming it into a lower-dimensional form while retaining the essential information. PCA achieves this by identifying and extracting the principal components from the original data.

Here’s how PCA works:

Standardization: The first step in PCA is often to standardize or normalize the data to have a mean of 0 and a standard deviation of 1. This is important because PCA is sensitive to the scale of the variables.

Covariance Matrix: PCA computes the covariance matrix of the standardized data. The covariance matrix summarizes the relationships between variables and provides information about how variables change together. It is crucial for finding the principal components.

Eigenvalue Decomposition: The next step involves finding the eigenvalues and eigenvectors of the covariance matrix. Each eigenvector represents a principal component, and its corresponding eigenvalue is the variance of the data along that component. Dividing an eigenvalue by the sum of all eigenvalues gives the proportion of the total variance explained by that component.

Selecting Principal Components: The eigenvectors are ranked by their corresponding eigenvalues in descending order. The eigenvector with the highest eigenvalue is the first principal component (PC1), the second highest eigenvalue corresponds to PC2, and so on. These principal components are orthogonal to each other, so the data projected onto them are uncorrelated.

Dimensionality Reduction: To reduce the dimensionality of the data, you can choose to keep only the top ‘k’ principal components that explain most of the variance in the data. By selecting fewer principal components, you represent the data in a lower-dimensional space while retaining as much information as possible.

Data Transformation: The original data can be projected onto the new basis formed by the selected principal components. This transformation results in a reduced-dimension representation of the data.
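
The steps above can be sketched directly in NumPy. The data here are synthetic and the choice of k = 2 is arbitrary, purely for illustration.

```python
import numpy as np

# Synthetic data: three features, two of which are strongly related.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    2 * base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
])

# 1. Standardization: mean 0 and standard deviation 1 for every feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)

# 3. Eigenvalue decomposition (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Rank components by eigenvalue, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

# 5. Keep the top k components and 6. project the data onto them.
k = 2
W = eigvecs[:, :k]
scores = Z @ W

print("explained variance ratio:", explained_ratio)
print("reduced shape:", scores.shape)
```

In practice a library routine such as scikit-learn's PCA class does the same job; writing the steps out by hand just makes each stage visible.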

PCA has several practical applications, including:

Dimensionality Reduction: It’s useful for reducing the number of features or variables in a dataset, which can improve computational efficiency and help avoid overfitting in machine learning models.

Data Visualization: PCA is often used to visualize data in a lower-dimensional space (e.g., 2D or 3D) to explore the structure of the data and identify patterns.

Noise Reduction: By focusing on the principal components that explain most of the variance, PCA can help reduce noise in the data.

Feature Engineering: PCA can be used to create new features or variables that capture the most important information in the data, which can improve model performance.

It’s important to note that while PCA can be a powerful tool, it also has limitations. For example, it captures only linear structure and treats variance as the measure of importance, so it may not be suitable for data with nonlinear relationships. Additionally, interpreting the meaning of the principal components may not always be straightforward, especially in high-dimensional spaces.



Cross-validation is a crucial technique used in machine learning and statistics to assess the performance of a predictive model and to check that the model generalizes well to new, unseen data. It is a valuable tool for estimating a model’s accuracy and for detecting issues like overfitting, where a model performs well on the training data but poorly on new data.

The core idea of cross-validation is to divide the available data into multiple subsets or “folds.” The model is trained and evaluated multiple times, with each iteration using a different subset as the validation set and the remaining subsets as the training data. This helps obtain a more robust estimate of a model’s performance because it assesses how well the model performs on different portions of the data.

Here are the key steps involved in cross-validation:

Data Splitting: The dataset is divided into ‘k’ roughly equal-sized folds. The choice of ‘k’ is typically determined by the practitioner, with common values being 5 or 10. Each fold represents a distinct subset of the data.

Model Training and Evaluation: The model is trained ‘k’ times. In each iteration, one of the folds is used as the validation set, while the remaining ‘k-1’ folds are used for training the model. This process ensures that each fold gets a chance to be the validation set, and the model is trained on all other folds.

Performance Metric Calculation: The model’s performance metric, such as accuracy, mean squared error, or another appropriate measure, is computed for each iteration on the validation set. These metrics are often averaged to obtain an overall estimate of the model’s performance.

Final Model Training: After completing the ‘k’ iterations, a final model is often trained on the entire dataset, including all ‘k’ folds combined, for deployment.
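
These four steps can be written out by hand. The sketch below uses synthetic data and a straight-line fit as the model; both are assumptions chosen only to keep the example small.

```python
import numpy as np

# Synthetic data: a noisy straight line.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)

# Data splitting: shuffle the indices and cut them into k roughly equal folds.
k = 5
indices = rng.permutation(len(x))
folds = np.array_split(indices, k)

# Model training and evaluation: each fold serves once as the validation set.
fold_mse = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = intercept + slope * x[val_idx]
    fold_mse.append(float(np.mean((y[val_idx] - pred) ** 2)))

# Performance metric calculation: average the per-fold scores.
mean_mse = float(np.mean(fold_mse))

# Final model training: refit on the entire dataset.
final_slope, final_intercept = np.polyfit(x, y, deg=1)

print("per-fold MSE:", np.round(fold_mse, 3))
print("mean MSE:", round(mean_mse, 3))
```

The averaged score estimates how the final model might perform on new data; the final model itself is never scored on held-out data in this scheme.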

Cross-validation helps in several ways:

It provides a more robust estimate of a model’s performance, as it evaluates the model on different subsets of the data.

It helps identify issues like overfitting. If a model performs significantly better on the training data compared to the validation data, it may be overfitting.

Common variations of k-fold cross-validation include stratified k-fold cross-validation (for imbalanced datasets), leave-one-out cross-validation (where ‘k’ equals the number of data points), and time series cross-validation (for time-ordered data).

In summary, cross-validation is a valuable tool in the machine learning and model evaluation process, providing a more accurate assessment of a model’s performance and its ability to generalize to new, unseen data.

September 25, 2023

K-Fold Cross-Validation and Monte Carlo Cross-Validation are both techniques used for assessing and validating the performance of machine learning models, but they have different methodologies and use cases. Here’s an explanation of each:

K-Fold Cross-Validation:

K-Fold Cross-Validation is a common technique for model evaluation and hyperparameter tuning. It is particularly useful when you have a limited amount of data and you want to maximize the use of that data for both training and validation. The key idea is to split the data into ‘k’ roughly equal-sized folds or partitions, where ‘k’ is a positive integer (e.g., 5 or 10). The process involves the following steps:

The dataset is divided into ‘k’ subsets or folds.

The model is trained and evaluated ‘k’ times, each time using a different fold as the validation set and the remaining ‘k-1’ folds as the training set.

The performance metric (e.g., accuracy, mean squared error) is calculated for each of the ‘k’ iterations, and the results are typically averaged to obtain an overall estimate of the model’s performance.

Finally, the model can be trained on the entire dataset for deployment.

K-Fold Cross-Validation provides a robust estimate of a model’s performance and helps identify issues like overfitting. It’s widely used in the machine learning community.

Monte Carlo Cross-Validation:

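Monte Carlo Cross-Validation is a more flexible and stochastic approach to model evaluation. Unlike K-Fold Cross-Validation, it doesn’t partition the data into a fixed set of folds. Instead, it randomly splits the dataset into training and validation sets multiple times. Each split is drawn independently, so an observation may land in the validation set in several iterations or in none. The key steps are as follows:

Randomly split the dataset into a training set and a validation set.

Train the model on the training set and evaluate it on the validation set.

Repeat for the chosen number of iterations, calculating the performance metric for each iteration.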

Average the performance metrics over all iterations to obtain an overall estimate of the model’s performance.

Monte Carlo Cross-Validation is useful when you want to assess a model’s stability and performance variance over different random data splits. It’s especially helpful when you suspect that certain data splits could lead to significantly different model performance. It’s also useful for situations where a strict division into ‘k’ folds may not be suitable.
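
A small sketch of repeated random splitting follows. The synthetic data, the straight-line model, the 20% validation fraction, and the 200 iterations are all illustrative assumptions.

```python
import numpy as np

# Synthetic data: a noisy straight line.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)

n_iterations = 200
val_fraction = 0.2
n_val = int(len(x) * val_fraction)

scores = []
for _ in range(n_iterations):
    # A fresh random split every iteration; splits are independent of each other.
    shuffled = rng.permutation(len(x))
    val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = intercept + slope * x[val_idx]
    scores.append(float(np.mean((y[val_idx] - pred) ** 2)))

mean_mse = float(np.mean(scores))
std_mse = float(np.std(scores, ddof=1))
print(f"MSE over {n_iterations} random splits: {mean_mse:.3f} +/- {std_mse:.3f}")
```

The spread of the scores (std_mse) is the part that speaks to stability: a large spread means performance depends heavily on which rows end up in the validation set.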

In summary, K-Fold Cross-Validation uses a fixed number of non-overlapping folds, so every observation is validated exactly once, while Monte Carlo Cross-Validation draws a new random split each time, making it well-suited for assessing the stability and performance variance of a model. The choice between these techniques depends on the specific goals and characteristics of your machine learning project.

September 22, 2023

My approach to comparing pre-molt and post-molt crab sizes uses a Monte Carlo permutation test, which is a sound way to address the potential non-normality of the data. Here’s a step-by-step summary of the method:

Data Collection: The data consist of pre-molt and post-molt crab sizes.

Kurtosis Assessment: The kurtosis values for both groups are relatively high, which suggests that the data distributions have heavier tails than a normal distribution.

Hypothesis Testing Challenge: Given the non-normality of the data, a traditional t-test may not be appropriate, as it assumes normality. Hence, I’m considering an alternative approach.

Monte Carlo Test:

Data Pooling: Combine the data from both pre-molt and post-molt groups into one dataset.
Random Sampling: Randomly split this combined dataset into two groups of equal size many times (10 million times in this case).
Calculate Mean Differences: For each split, calculate the mean difference between the two groups.
Distribution of Mean Differences: After all iterations, this gives a distribution of mean differences, which represents what to expect under the null hypothesis (i.e., no real difference between pre-molt and post-molt crab sizes).
Compare Observed Difference: Compare the observed mean difference in the actual data to the distribution of permuted mean differences.
Calculate p-value: The p-value is the proportion of permuted mean differences that are as extreme as or more extreme than the observed mean difference. A low p-value suggests that the observed difference is unlikely to have occurred by chance, supporting the rejection of the null hypothesis.

This Monte Carlo permutation test approach assesses the significance of the observed mean difference while accounting for the non-normality of the data. It’s a robust method for hypothesis testing when the assumptions of traditional parametric tests like the t-test are not met. If the calculated p-value is below the chosen significance level (e.g., 0.05), we can conclude that there is a significant difference between pre-molt and post-molt crab sizes.
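
The procedure can be sketched as follows. The crab measurements are replaced by synthetic samples, and the number of permutations is cut to 10,000 so the example runs quickly; both are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-ins for the two groups (assumption: not the real crab data).
rng = np.random.default_rng(0)
pre_molt = rng.normal(129, 15, 100)
post_molt = rng.normal(144, 15, 100)

observed_diff = post_molt.mean() - pre_molt.mean()

# Data pooling: combine both groups into one array.
pooled = np.concatenate([pre_molt, post_molt])
n_pre = len(pre_molt)
n_permutations = 10_000

# Random sampling: reshuffle the pooled data and recompute the mean difference.
perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[n_pre:].mean() - shuffled[:n_pre].mean()

# Two-sided p-value: share of permuted differences at least as extreme as the observed one.
p_value = float(np.mean(np.abs(perm_diffs) >= abs(observed_diff)))

print(f"observed difference: {observed_diff:.2f}")
print(f"p-value: {p_value:.4f}")
```

With a finite number of permutations the p-value can come out as exactly 0, which should be read as "smaller than 1 / n_permutations" rather than literally zero.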

September 20, 2023

In today’s class, I learned about the following analysis of the crab molt data:

Dataset: The dataset consists of pairs of values, where “post-molt” represents the size of a crab’s shell after molting, and “pre-molt” represents the size of a crab’s shell before molting.

Linear Model Fitting: I created a linear model to predict pre-molt size from post-molt size using the Linear Model Fit function. The linear model equation is $y = -25.2137 + 1.07316x$, where $x$ is the post-molt size and $y$ is the predicted pre-molt size.

Pearson’s r-squared: The Pearson’s r-squared value for this linear model is 0.980833. This indicates a very high correlation between post-molt and pre-molt sizes.

Descriptive Statistics for Post-Molt Data:

Median: 147.4
Mean: 143.898
Standard Deviation: 14.6406
Variance: 214.347
Skewness: -2.3469
Kurtosis: 13.116

Descriptive Statistics for Pre-Molt Data:

Median: 132.8
Mean: 129.212
Standard Deviation: 15.8645
Variance: 251.683
Skewness: -2.00349
Kurtosis: 9.76632

Histograms and Quantile Plots: I created histograms and quantile plots to visualize the distributions of post-molt and pre-molt data. Both distributions appear to be negatively skewed and have high kurtosis, indicating non-normality.

This analysis suggests a strong linear relationship between post-molt and pre-molt crab shell sizes. The descriptive statistics and visualizations also highlight the non-normality and skewness in the data distributions.
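
For reference, here is one way such a fit and the descriptive statistics could be computed in Python. The data are synthetic, so the numbers will not match those reported above, and the helper name describe is hypothetical. Kurtosis is computed in the convention where a normal distribution scores 3.

```python
import numpy as np


def describe(values):
    """Summary statistics; kurtosis is the non-excess kind (normal distribution = 3)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered**2)
    m3 = np.mean(centered**3)
    m4 = np.mean(centered**4)
    return {
        "median": float(np.median(values)),
        "mean": float(values.mean()),
        "std": float(values.std(ddof=1)),
        "variance": float(values.var(ddof=1)),
        "skewness": float(m3 / m2**1.5),
        "kurtosis": float(m4 / m2**2),
    }


# Synthetic stand-in for the crab measurements (assumption: not the real data).
rng = np.random.default_rng(1)
post_molt = rng.normal(144, 15, 400)
pre_molt = -25 + 1.07 * post_molt + rng.normal(0, 2, 400)

# Straight-line fit predicting pre-molt size from post-molt size.
slope, intercept = np.polyfit(post_molt, pre_molt, deg=1)
r_squared = float(np.corrcoef(post_molt, pre_molt)[0, 1] ** 2)

stats_post = describe(post_molt)
print(f"fit: y = {intercept:.3f} + {slope:.3f} x, r-squared = {r_squared:.3f}")
print(stats_post)
```

Some libraries report excess kurtosis (normal = 0) by default, so it is worth checking the convention before comparing values across tools.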


A t-test is a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups. It’s often used when you want to compare the means of two samples to determine if they are statistically different from each other.

Independent Samples T-Test:

This test is used when you have two independent groups or samples, and you want to determine if there’s a significant difference between the means of these two groups.

Paired Samples T-Test:

This test is used when you have one group of subjects and you measure them twice (before and after some intervention) to determine if there is a significant difference in the means of the paired measurements.

One-Sample T-Test:

This test is used when you have one sample group, and you want to determine if its mean differs significantly from a known or hypothesized value.
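
The three variants map onto three SciPy functions. The sketch below uses made-up samples purely to show the calls.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Independent samples: two unrelated groups.
group_a = rng.normal(50, 5, 40)
group_b = rng.normal(53, 5, 40)
independent = stats.ttest_ind(group_a, group_b)

# Paired samples: the same subjects measured before and after.
before = rng.normal(100, 10, 30)
after = before + rng.normal(2, 1, 30)
paired = stats.ttest_rel(before, after)

# One sample: compare a group mean against a hypothesized value.
one_sample = stats.ttest_1samp(group_a, popmean=50)

for label, result in [("independent", independent), ("paired", paired), ("one-sample", one_sample)]:
    print(f"{label}: t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

By default ttest_ind assumes equal variances in the two groups; passing equal_var=False switches to Welch's version, which drops that assumption.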

September 18, 2023

In today’s topic, I went through:

Linear regression between two variables:

Linear regression between two variables is known as simple linear regression. In simple linear regression, we seek to establish a linear relationship between two variables: a dependent variable (the one we want to predict) and an independent variable (the one we use to make predictions). The goal is to find a linear equation that best describes the relationship between these two variables.

The general form of a simple linear regression equation is:

$Y = a + bX + \varepsilon$

Y is the dependent variable (the variable you want to predict).
X is the independent variable (the predictor variable).
a is the intercept (the value of Y when X is 0).
b is the slope (the change in Y for a one-unit change in X).
ε represents the error term (the part of Y that is not explained by the linear relationship with X).

The goal in simple linear regression is to estimate the values of a and b that best fit the data. This is typically done using a method called least squares regression, where the values of a and b are chosen to minimize the sum of the squared differences between the observed values of Y and the values predicted by the equation.

Once you have estimated the values of a and b, you can use the equation to make predictions for Y based on values of X that were not in the original dataset.
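
For a single predictor, the least squares estimates have a closed form: b is the sum of cross-deviations divided by the sum of squared deviations of X, and a is the mean of Y minus b times the mean of X. A minimal sketch, with a hypothetical function name and made-up data points:

```python
def fit_simple_linear_regression(x, y):
    """Return the least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


# Made-up observations, roughly following y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_simple_linear_regression(x, y)

# Predict Y for a value of X that was not in the data.
prediction = a + b * 6
print(a, b, prediction)
```

For these five points the slope comes out at about 1.99 and the intercept at about 0.05, so the prediction at X = 6 is close to 11.99.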

Linear model:

A linear model is a statistical model that assumes a linear relationship between one or more independent variables and a dependent variable. It’s used for prediction and modeling. Examples include simple linear regression (one predictor) and multiple linear regression (multiple predictors). Related generalized linear models extend the idea: logistic regression is used for binary outcomes, while Poisson regression is used for count data. These models are widely used in various fields for data analysis and prediction.

Interactions and correlation between predictor variables:

Interactions between predictor variables mean that the effect of one predictor on the outcome depends on the value of another predictor. Correlation between predictor variables measures how strongly they move together. High correlations (multicollinearity) make individual coefficient estimates unstable and harder to interpret. Handling interactions and addressing high correlations are essential for accurate modeling.

September 13, 2023

In today’s class, we went through P-values and hypothesis testing.


The P-value, or probability value, is the probability of obtaining a result at least as extreme as the one actually observed, assuming the null hypothesis is true.

In hypothesis testing, the P-value is also known as the level of marginal significance: it represents how likely the observed outcome would be if the null hypothesis held.

The P-value is used as an alternative to the rejection point: it gives the smallest significance level at which the null hypothesis would be rejected. If the P-value is small, then there is stronger evidence in favour of the alternative hypothesis.

Hypothesis Testing:

Usually, we work with sample datasets, performing data analysis and visualization to find insights.

The P-value method is used in hypothesis testing to assess the evidence against the null hypothesis. The decision to reject it or not is based on a specified significance level.

Based on that probability and the significance level, we either reject or fail to reject the null hypothesis.

Generally, the lower the P-value, the stronger the evidence for rejecting the null hypothesis.

September 11, 2023

Data Set Observations:

In the CDC 2018 diabetes data, we are working with a dataset containing information on three variables: %diabetes, %obesity, and %inactivity. Here’s a breakdown of our observations.

There are 354 rows of data that contain information on all three variables: %diabetes, %obesity, and %inactivity.

The intention is to build a model to predict %diabetes using %inactivity and %obesity as factors. However, only 354 data points are available for this analysis.

We are exploring the relationship between %diabetes and %inactivity. There are 1370 data points for %inactivity in the dataset, and all 1370 of them also have information for %diabetes.

We’ve observed that the %diabetes data is slightly skewed, with a kurtosis value higher than the expected value for a normal distribution (which is 3). Additionally, the quantile-quantile plot shows substantial deviation from normality, suggesting that the data may not follow a normal distribution.

The next step is to generate descriptive statistics for the %inactivity data; those calculations are not included here.

Linear Regression:

Linear regression is a statistical method used to model and understand the relationship between a dependent variable (the one you want to predict) and one or more independent variables (the predictors). It does this by fitting a straight line (hence “linear”) to observed data points. This line is called the regression line and is characterized by two parameters: slope (how much the dependent variable changes with a change in the independent variable) and intercept (the value of the dependent variable when the independent variable is zero).

The primary goal of linear regression is to find the best-fitting line that minimizes the differences between the predicted values from the line and the actual observed data. This technique is foundational in data analysis, providing insights into relationships between variables and enabling predictions based on historical data. Linear regression can be extended to multiple linear regression when there are multiple independent variables involved. It relies on certain assumptions, such as linearity, normality of residuals, and constant variance, which should be assessed to ensure the validity of the model.