October 30, 2023

This dataset contains various variables, including age and race, which provide insights into the demographics of individuals involved in these incidents.

Age Data: The ‘age’ variable in the dataset represents the age of individuals who were fatally shot by the police. This variable is crucial for understanding the age distribution of the victims. You can analyze this data to determine the average age, age ranges, and other statistics related to the ages of the individuals involved in these incidents.

Race Data: The ‘race’ variable in the dataset categorizes the race or ethnicity of the individuals involved in these incidents. It provides information about the racial composition of those who were fatally shot. Analyzing race data can help identify patterns, disparities, and trends in police shootings with respect to different racial or ethnic groups.

To conduct a more in-depth analysis of age and race data within this dataset, you can perform various statistical and visualization techniques, such as creating histograms, bar charts, or other relevant visualizations to gain insights into these demographic aspects of police shootings.
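As a rough sketch of the kind of exploration described above, the snippet below (assuming the data is loaded into a pandas DataFrame named df; the file name is an assumption based on the dataset's title) produces basic age statistics, a histogram of ages, and a bar chart of incident counts by race:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (file name is an assumption).
df = pd.read_csv("fatal-police-shootings-data.csv")

# Summary statistics for the 'age' variable.
print(df["age"].describe())

# Histogram of victim ages.
df["age"].dropna().plot(kind="hist", bins=30, title="Age distribution")
plt.xlabel("Age")
plt.show()

# Bar chart of incident counts by race category (including missing values).
df["race"].value_counts(dropna=False).plot(kind="bar", title="Incidents by race")
plt.ylabel("Count")
plt.show()
```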

October 27, 2023

During our class, we discussed the potential instability of DBSCAN in comparison to K-Means clustering. Here, we’ll outline various scenarios that illustrate the instability of DBSCAN:

Sensitivity to Density Variations:

DBSCAN’s stability can be affected by variations in data point density. If the dataset exhibits significant differences in data density across various segments, it can lead to the formation of clusters with varying sizes and shapes. Consequently, selecting appropriate parameters (such as the maximum distance ε and the minimum point thresholds) to define clusters effectively becomes a challenging task.

In contrast, K-Means assumes spherical and uniformly sized clusters, potentially performing more effectively when clusters share similar densities and shapes.

Sensitivity to Parameter Choices:

DBSCAN requires the configuration of hyperparameters, including ε (representing the maximum distance defining a data point’s neighborhood) and the minimum number of data points needed to establish a dense region. These parameter choices have a significant impact on the resulting clusters.

K-Means, while also requiring a parameter (the number of clusters, K), is generally more straightforward to determine, as it directly reflects the desired number of clusters. In contrast, DBSCAN’s parameters are more abstract, introducing sensitivity to the selection of parameter values.
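To make the contrast concrete, here is a small scikit-learn sketch of the parameters each algorithm asks for; the numeric values are placeholders, not recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans

# K-Means: the single structural choice is K, the desired number of clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

# DBSCAN: eps (neighborhood radius) and min_samples (points required to form a
# dense region) jointly define what counts as a cluster; small changes in
# either can merge, split, or discard clusters entirely.
dbscan = DBSCAN(eps=0.5, min_samples=5)
```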

Boundary Points and Noise:

DBSCAN explicitly identifies noise points, which are data points that don’t belong to any cluster, and it handles outliers well. However, the classification of boundary points (those located on the periphery of a cluster) within DBSCAN can sometimes appear arbitrary.

In K-Means, data points on the boundaries of clusters may be assigned to one of the neighboring clusters, potentially leading to instability when a data point is close to the shared boundary of two clusters.

Varying Cluster Shapes:

DBSCAN excels in its ability to accommodate clusters with arbitrary shapes and detect clusters with irregular boundaries. This is in contrast to K-Means, which assumes roughly spherical clusters and therefore demonstrates greater stability when data adheres to this assumption.
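A quick way to see this difference is the classic two-moons toy example. The sketch below, using scikit-learn with illustrative parameter values, typically shows K-Means cutting across the crescents while DBSCAN recovers both of them.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moon clusters with a little noise.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means assumes compact, roughly spherical clusters and tends to split
# the data across the crescents.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density and usually recovers each crescent as one cluster.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means cluster labels:", set(kmeans_labels))
print("DBSCAN cluster labels (-1 = noise):", set(dbscan_labels))
```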

The choice between DBSCAN and K-Means should consider the specific characteristics of the dataset, as well as the objectives of the analysis, as these algorithms have different strengths and limitations.

October 23, 2023

In our recent class, our professor introduced several technical concepts, including the K-Medoids clustering technique, and delved into the concept of hierarchical clustering. Here’s a summary of what was covered:

K-Medoids:
K-Medoids is a partitioning clustering algorithm that distinguishes itself by its enhanced robustness, especially in handling outliers. In contrast to K-Means, which uses the mean (average) as the cluster center, K-Medoids selects the actual data point (medoid) within a cluster. The medoid is the data point that minimizes the sum of distances to all other points in the same cluster. This unique approach makes K-Medoids less sensitive to outliers and particularly suitable for clusters with non-Gaussian shapes.
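A minimal K-Medoids sketch is shown below. It assumes the optional scikit-learn-extra package is installed, and the toy data is purely illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids  # requires scikit-learn-extra

# Toy data: three Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Unlike K-Means centroids, the medoids are actual data points.
kmedoids = KMedoids(n_clusters=3, metric="euclidean", random_state=0).fit(X)
print("Medoids (actual data points chosen as cluster centers):")
print(kmedoids.cluster_centers_)
```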

Hierarchical Clustering:
Hierarchical clustering is a clustering method characterized by the construction of a tree-like structure of clusters, establishing a hierarchical relationship between data points. There are two primary approaches to hierarchical clustering: agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as its own cluster and iteratively merges the closest neighboring clusters, creating a dendrogram as a visual representation. In contrast, divisive clustering begins with all data points in a single cluster and then recursively divides them into smaller clusters. One notable advantage of hierarchical clustering is that it doesn’t require specifying the number of clusters in advance, and it provides a visual representation of the inherent grouping of the data.
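For the agglomerative (bottom-up) variant, a short scikit-learn sketch on toy data might look like the following; the choice of three clusters and Ward linkage is illustrative, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Toy data: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Merge points bottom-up until three clusters remain, using Ward linkage.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("Cluster sizes:", np.bincount(labels))
```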

Dendrograms:
A dendrogram is a graphical representation in the form of a tree-like diagram, employed to visualize the hierarchical structure of clusters within hierarchical clustering. This visual tool displays the sequence of merges or splits, along with the respective distances at which these actions occur. The height of the vertical lines within the dendrogram signifies the dissimilarity or distance between clusters. By choosing a specific height to cut the dendrogram, you can obtain a desired number of clusters, making dendrograms a valuable aid in cluster selection.
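Here is a sketch of building and cutting a dendrogram with SciPy on the same kind of toy data; the cut height of 10 is an arbitrary example value.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

Z = linkage(X, method="ward")      # record the sequence of merges
dendrogram(Z)                      # bar heights show merge distances
plt.title("Dendrogram (Ward linkage)")
plt.show()

# Cutting the tree at a chosen height yields a flat clustering.
labels = fcluster(Z, t=10, criterion="distance")
print("Clusters at this cut:", len(set(labels)))
```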

These concepts offer a comprehensive toolbox for exploring and understanding the underlying structures and relationships within datasets, catering to a wide range of data types and shapes.

October 20, 2023

In today’s analysis, I ventured into the realm of geospatial calculations, specifically calculating the geodesic distance between two geographic coordinates: Seattle, WA, and Miami, FL. To accomplish this, I harnessed the geodesic function from the geopy library, a reliable tool for accurately computing distances on the Earth’s curved surface. The outcome provided the distance between these two locations in both miles and kilometers, delivering valuable information for geospatial analysis.
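A minimal version of that calculation with geopy looks like the sketch below; the coordinates are approximate city-center values used as assumptions.

```python
from geopy.distance import geodesic

seattle = (47.6062, -122.3321)   # Seattle, WA (lat, lon), approximate
miami = (25.7617, -80.1918)      # Miami, FL (lat, lon), approximate

distance = geodesic(seattle, miami)
print(f"{distance.miles:.1f} miles / {distance.kilometers:.1f} km")
```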

For my next analytical step, I’m gearing up to execute clustering based on geographic locations within the state of California. In pursuit of this objective, I explored two clustering algorithms: K-Means and DBSCAN.

K-Means:
K-Means is a widely adopted clustering algorithm renowned for partitioning a dataset into ‘K’ distinct clusters. Its operation involves iteratively assigning data points to the nearest cluster center (centroid) and recalculating these centroids until the algorithm converges. K-Means is prized for its simplicity in implementation and computational efficiency, making it a go-to choice for a range of applications, including image segmentation and customer segmentation.
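Looking ahead to the California clustering, a minimal K-Means sketch follows. It assumes the DataFrame df from the earlier sketch, a ‘state’ column coded as ‘CA’, and an arbitrary K of 5; none of these are tuned choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Keep California incidents with valid coordinates (column names are assumptions).
ca = df.loc[df["state"] == "CA", ["latitude", "longitude"]].dropna()

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)  # K=5 is a placeholder
labels = kmeans.fit_predict(ca)
print("Points per cluster:", np.bincount(labels))
```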

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN takes a different approach, relying on density-based criteria to cluster data points. Unlike K-Means, it doesn’t require a predetermined number of clusters. DBSCAN identifies core points, which have a sufficient number of data points in their neighborhood, and border points, which are in proximity to core points but lack a critical mass of neighbors to qualify as core points. Noise points, on the other hand, don’t align with any cluster. DBSCAN’s resilience to noise and its ability to unveil clusters of various sizes and shapes make it particularly well-suited for datasets with intricate structures.
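A corresponding DBSCAN sketch, reusing the California coordinates from the K-Means sketch above, is shown below. Using the haversine metric on radian coordinates lets eps be expressed in kilometers via Earth's radius (about 6371 km); the 20 km radius and 10-point minimum are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Haversine expects (lat, lon) in radians; eps is a fraction of Earth's radius.
coords = np.radians(ca[["latitude", "longitude"]].to_numpy())
eps_km = 20.0

db = DBSCAN(eps=eps_km / 6371.0, min_samples=10, metric="haversine")
labels = db.fit_predict(coords)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```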

These clustering techniques have the potential to reveal valuable patterns and structures within the geographic data, advancing our understanding of the distribution of police shootings in the state of California.

October 18, 2023

In the subsequent stage of my analysis, I harnessed geospatial data and related libraries to visually represent instances of police shootings across the United States. To achieve this, I extracted the ‘latitude’ and ‘longitude’ attributes from the dataset and meticulously filtered out any null values within these columns. The following steps involved the creation of an accurate geographical map of the United States, supplemented with the inclusion of distinctive red markers, ultimately forming a Geospatial Scatter Plot. The resulting visualization delivers a geographically precise depiction of where these incidents transpired, thus providing valuable insights into their distribution throughout the country. By plotting these incidents on the map, it becomes readily apparent where concentrations of police shootings occur, facilitating a deeper comprehension of regional trends and patterns.
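One plausible way to produce such a plot is sketched below with plotly express; the choice of library is an assumption (the original may have used a different mapping tool), and the latitude/longitude column names follow the dataset description above.

```python
import plotly.express as px

# Drop rows with missing coordinates before plotting.
geo = df.dropna(subset=["latitude", "longitude"])

fig = px.scatter_geo(
    geo,
    lat="latitude",
    lon="longitude",
    scope="usa",                       # restrict the basemap to the United States
    color_discrete_sequence=["red"],   # red markers, as in the description
    opacity=0.4,
    title="Fatal police shootings in the United States",
)
fig.show()
```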

The capability to visualize the scatter plot for individual states of the United States empowers policymakers, researchers, and the public to gain profound insights into the geographical dimensions of police shootings. This, in turn, has the potential to foster more informed discussions and drive actions aimed at addressing this significant issue. As a case in point, I’ve crafted a similar Geospatial Scatter Plot for the state of Massachusetts, a visual representation of which is included below.

Moving forward in my analysis, my agenda includes delving into the realm of GeoHistograms and exploring the application of clustering algorithms, as per our professor’s guidance in the previous class. Specifically, our professor introduced two distinct clustering techniques: K-Means and DBSCAN. It’s worth noting that K-Means entails the need to predefine the value of K, which can be a limiting factor. My goal is to implement both of these algorithms in Python and evaluate whether they yield meaningful clusters when applied to the geographic locations of the shooting data. This phase promises to uncover additional layers of insights and patterns within the dataset, contributing to a more comprehensive understanding of this critical issue.

October 16, 2023

Today, my primary focus was on analyzing the distribution of the ‘age’ variable. As depicted in the density plot of the ‘age’ variable, it’s evident that this column, representing the ages of 7,499 individuals (non-null values), displays a positive skew. This skewness indicates that the majority of individuals in the dataset tend to be on the younger side, resulting in a right-tailed distribution.

The average age stands at approximately 37.21 years, with a moderate level of variability around this mean, as evidenced by a standard deviation of 12.98. The age range spans from 2 to 92 years, encompassing a diverse age demographic: the youngest individual in the dataset is 2 years old, while the oldest is 92. With a kurtosis value of 0.234, the distribution departs only mildly from a normal distribution in its peakedness, consistent with ages being spread out rather than tightly clustered around the mean. Additionally, the median age, which falls at 35 years, serves as the midpoint of the dataset.
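These summary figures can be reproduced with pandas along the following lines, assuming the DataFrame df from the earlier sketch:

```python
ages = df["age"].dropna()

print("count:   ", ages.count())
print("mean:    ", round(ages.mean(), 2))
print("std:     ", round(ages.std(), 2))
print("median:  ", ages.median())
print("min/max: ", ages.min(), "/", ages.max())
print("skew:    ", round(ages.skew(), 3))
print("kurtosis:", round(ages.kurtosis(), 3))  # pandas reports excess kurtosis
```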

Moving on to the box plot representation of the ‘age’ variable, it’s apparent that outliers beyond the upper whisker are present. In a box plot, the ‘whiskers’ typically indicate the range within which most of the data points fall. Any data point lying beyond these whiskers is considered an outlier, signifying that it deviates significantly from the typical range of values.

In this specific case, the upper whisker of the box plot extends to a threshold typically defined as 1.5 times the interquartile range (IQR) above the third quartile (Q3). Data points beyond this threshold are identified as outliers. The presence of outliers beyond the upper whisker in the ‘age’ variable suggests that there are individuals in the dataset whose ages significantly exceed the upper age range found within the ‘typical’ or ‘normal’ population.
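The 1.5 × IQR rule behind the upper whisker can be applied directly, reusing the ages series from the sketch above:

```python
# Compute the quartiles and the upper fence used by the box plot.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Points above the fence are the outliers visible beyond the upper whisker.
outliers = ages[ages > upper_fence]
print(f"Upper fence: {upper_fence:.1f} years; points above it: {len(outliers)}")
```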

October 13, 2023

Commencing my analysis of the ‘fatal-police-shootings-data’ dataset in Python, I’ve loaded the data to scrutinize its various variables and their respective distributions. Notably, among these variables, ‘age’ stands out as a numerical column, offering insights into the ages of individuals tragically shot by law enforcement. Additionally, the dataset contains latitude and longitude values, pinpointing the precise geographical locations of these incidents.

During this preliminary assessment, I’ve identified an ‘id’ column, which appears to hold limited significance for our analysis. Consequently, I’m considering its exclusion from our further examination. Delving deeper, I’ve scrutinized the dataset for missing values, revealing that several variables exhibit null or missing data, including ‘name,’ ‘armed,’ ‘age,’ ‘gender,’ ‘race,’ ‘flee,’ ‘longitude,’ and ‘latitude.’ Furthermore, I’ve undertaken an investigation into potential duplicate records. This examination has uncovered just a single duplicate entry within the entire dataset, notable for its absence of a ‘name’ value. For the subsequent phase of my analysis, I intend to shift our focus towards exploring the distribution of the ‘age’ variable, a critical step in unraveling insights from this dataset.
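The checks described above correspond to a few standard pandas calls, sketched here (the file name is an assumption based on the dataset's title):

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")

print(df.isnull().sum())                      # null counts per column
print("Duplicate rows:", df.duplicated().sum())

df = df.drop(columns=["id"])                  # drop the low-value identifier column
```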

In today’s classroom session, we gained essential knowledge on computing geospatial distances from location information. This newfound expertise equips us to create GeoHistograms, a valuable tool for visualizing and analyzing geographical data. GeoHistograms serve as a powerful instrument for pinpointing spatial trends, identifying hotspots, and uncovering clusters within datasets associated with geographic locations. This, in turn, enhances our comprehension of the underlying phenomena embedded within the data.

October 11, 2023

Today, we commenced our work with a recently acquired dataset from ‘The Washington Post’ website. This dataset sheds light on a troubling statistic: police shootings in the United States result in an average of over 1,000 deaths each year, as revealed by an ongoing investigation by ‘The Washington Post.’

A pivotal moment in this ongoing inquiry was the tragic 2014 killing of Michael Brown, an unarmed Black man, by police in Ferguson, Missouri. This incident exposed a significant issue – the data reported to the FBI regarding fatal police shootings was significantly underestimated, with more than half of such incidents going unreported. This problem has continued to worsen, with only a third of fatal shootings being accurately reflected in the FBI’s database by 2021. The primary reason for this underreporting is that local police departments are not required to report these incidents to the federal government. Additionally, complications arise from an updated FBI system for data reporting and confusion among local law enforcement agencies about their reporting responsibilities.

In response, ‘The Washington Post’ initiated its own comprehensive investigation in 2015, meticulously documenting every instance in which an on-duty police officer in the United States shot and killed someone. Over the years, their reporters have compiled a substantial dataset, which now comprises 8,770 records. This dataset includes various variables, such as date, name, age, gender, whether the person was armed, their race, the city and state where the incident occurred, whether they were attempting to flee, if body cameras were in use, signs of mental illness, and importantly, the involved police departments.

It’s worth noting that this dataset covers incidents from 2015 onwards, and ‘The Post’ recently updated it in 2022 to include the names of the police agencies connected to each shooting, providing a means to better assess accountability at the department level.

In our class today, we delved into some initial questions about this dataset. Notably, we discovered that there are multiple versions of the dataset available. The one accessible on GitHub provides information on police shootings by agencies, but the version directly obtained from ‘The Washington Post’s’ website includes a variable called ‘police_departments_involved.’ This means there’s no need to join against an external table to identify which police departments were involved in these shootings.

As the next step in my analysis, I plan to conduct a more detailed examination of the dataset and its variables to uncover further insights.

October 4, 2023

We are currently in the process of creating a concise and impactful report that summarizes the results of our study on the CDC Diabetes dataset.

In our examination of the CDC Diabetes dataset, we utilized a wide range of statistical techniques. These techniques included exploratory data analysis, correlation analysis, both simple and multiple linear regression, as well as the implementation of the Breusch-Pagan test to assess constant variance. Additionally, we introduced interaction terms, explored higher-order relationships through polynomial regression, and made use of cross-validation to evaluate the performance and generalization of our models.

Our findings have uncovered some interesting insights into the predictive power of our models. When we introduced an interaction term into the simple linear model, it explained 36.5% of the variance in the outcome. That figure rose to 38.5% with a quadratic multiple regression model for predicting diabetes that incorporated both ‘% INACTIVE’ and ‘% OBESE.’ Interestingly, when we applied Support Vector Regression, the explained variance dropped to 30.1%.

While it is evident that ‘% INACTIVE’ and ‘% OBESE’ play a significant role in diabetes prediction, they may not fully capture the complex dynamics involved. This highlights the need for a more comprehensive analysis that considers a broader array of influencing factors. Therefore, incorporating additional variables is essential for gaining a deeper and more holistic understanding of diabetes prediction.

October 2, 2023

Continuing our analysis, we implemented Support Vector Regression (SVR) on the CDC dataset, incorporating quadratic and interaction terms. SVR is a machine learning algorithm designed for regression tasks, especially useful for high-dimensional data and scenarios where traditional linear regression models struggle to capture complex relationships between variables.

The SVR model was initialized with specific parameters, including the ‘RBF’ (Radial Basis Function) kernel, the regularization parameter ‘C’, and the error tolerance ‘epsilon’. The RBF kernel is a mathematical function utilized in SVR to capture non-linear relationships, ‘C’ controls the trade-off between fitting the training data and preventing overfitting, and ‘epsilon’ specifies the margin within which errors are acceptable.

In this analysis, we used ‘INACTIVE’ and ‘OBESE’ as features, along with their squared values (‘INACTIVE_sq’ and ‘OBESE_sq’) and an interaction term (‘OBESE*INACTIVE’). We employed a K-Fold cross-validator with 5 folds to split the data into training and testing sets for cross-validation. A new SVR model was created, fitted to the training data, and used for predictions on the testing data. The ‘RBF’ kernel was applied to enable the model to learn the relationships between the input features and the target variable.
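A sketch of this workflow is shown below. The DataFrame name cdc, the exact column names (including the ‘DIABETIC’ target), and the C/epsilon values are assumptions for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Engineer the squared and interaction features (names are assumptions).
cdc["INACTIVE_sq"] = cdc["INACTIVE"] ** 2
cdc["OBESE_sq"] = cdc["OBESE"] ** 2
cdc["OBESE*INACTIVE"] = cdc["OBESE"] * cdc["INACTIVE"]

features = ["INACTIVE", "OBESE", "INACTIVE_sq", "OBESE_sq", "OBESE*INACTIVE"]
X = cdc[features].to_numpy()
y = cdc["DIABETIC"].to_numpy()   # target column name is an assumption

# 5-fold cross-validation with an RBF-kernel SVR; C and epsilon are
# placeholder values, not tuned settings.
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print("Mean cross-validated R^2:", round(float(np.mean(scores)), 3))
```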

The performance of the SVR model was evaluated using the R-squared (R2) score, which returned a value of 0.30. This score is lower than the R-squared from our quadratic model, suggesting that the SVR model may not capture the data’s relationships as effectively.