MTH 522 – 10/30/2023

K-means and DBSCAN are fundamentally distinct clustering techniques, each with its own advantages and disadvantages. Because K-means relies on centroid-based partitioning and requires the user to predefine the number of clusters, it works best when the number of clusters is known and the cluster shapes are roughly spherical. It is, however, sensitive to initialization and handles irregularly shaped clusters poorly. DBSCAN, by contrast, is a density-based technique that finds clusters of arbitrary shape without a predetermined cluster count, and it handles noise and outliers well by leaving them unassigned. This makes it a good option for more intricate and varied clustering tasks. The particulars of the data and the intended clustering objectives should guide the choice between K-means and DBSCAN.
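The contrast above can be seen on crescent-shaped data, where density-based clustering has the advantage. This is a minimal sketch assuming scikit-learn is available; the `eps` and `min_samples` values are illustrative choices, not tuned settings:

```python
# Compare K-means and DBSCAN on two interleaved half-moons.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-means needs the cluster count up front and assumes compact,
# roughly spherical groups, so it splits the moons along a line.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN infers the number of clusters from density; points in
# low-density regions are labeled -1 (noise / unassigned).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(set(kmeans_labels))   # {0, 1}
print(set(dbscan_labels))   # two clusters, possibly with -1 for noise
```

Plotting the two label sets makes the difference obvious: DBSCAN traces each moon, while K-means cuts straight across them.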

MTH 522 – 10/23/2023

Logistic regression is a statistical method employed to analyze datasets with one or more independent variables that influence a binary or dichotomous outcome. It is particularly well-suited for situations where the result falls into one of two categories, such as ‘Yes’ or ‘No.’ The key elements of logistic regression include modeling the relationship between independent variables and the log-odds of the binary outcome, utilizing a sigmoid function to transform these log-odds into probabilities between 0 and 1, and estimating coefficients for independent variables to determine the strength and direction of their impact. Additionally, logistic regression calculates odds ratios to quantify how changes in independent variables affect the odds of the binary outcome. This method finds applications in diverse fields, from medical research to marketing and credit scoring, providing valuable insights into the likelihood of specific events occurring based on a set of relevant factors.
In practice, logistic regression enables researchers and analysts to model the relationship between independent variables and the probability of specific events, with applications in medical prognosis, customer behavior prediction, credit risk assessment, and more. Whether predicting the likelihood of a patient developing a medical condition or forecasting customer purchase decisions, it supports informed decisions about binary outcomes.
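The pieces described above (log-odds coefficients, the sigmoid transform, and odds ratios) can be sketched with scikit-learn. The data here is made up purely for illustration:

```python
# Toy logistic regression: one predictor, one binary outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One independent variable (e.g. hours of exposure) and a yes/no outcome.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The fitted coefficient acts on the log-odds scale; exponentiating it
# gives the odds ratio for a one-unit increase in the predictor.
log_odds_slope = model.coef_[0][0]
odds_ratio = np.exp(log_odds_slope)

# The sigmoid transform maps log-odds to a probability between 0 and 1.
proba = model.predict_proba([[4.5]])[0, 1]
print(odds_ratio, proba)
```

Since the outcome flips from 0 to 1 as the predictor grows, the odds ratio comes out above 1, and the predicted probability at the midpoint sits near 0.5.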

MTH 522 – 10/27/2023

There are three primary types of logistic regression: binary, ordinal, and multinomial. Binary logistic regression fits scenarios with only two possible outcomes, like loan approval, cancer risk assessment, or sports match predictions. Ordinal logistic regression is used when the dependent variable has ordered categories with potentially unequal intervals between them, such as shirt sizes (S, M, L) or graded ratings. Multinomial logistic regression applies when the dependent variable is nominal, meaning it has more than two levels and no particular order, such as pet food types or survey responses with unranked options.
Key practices for the effective application of logistic regression include understanding the model's technical requirements, defining the dependent variable carefully and consistently, estimating the model accurately, interpreting the results meaningfully, and validating thoroughly to ensure the model is accurate and reliable.
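Of the three variants, the multinomial case is easy to sketch with scikit-learn; the iris data below simply stands in for any nominal multi-class outcome:

```python
# Multinomial logistic regression: one model over all three classes.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # y has three unordered class labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))  # the first samples in iris belong to class 0
```

The binary case is the same call with a two-class target. Ordinal logistic regression is not part of scikit-learn; a separate package such as statsmodels would be needed for ordered categories.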

MTH 522 – 10/20/2023

Generalized linear mixed models (GLMMs) combine generalized linear models (GLMs) with mixed effects models. They are helpful for data that is not normally distributed and that exhibits correlation or hierarchical structure. Because they account for both fixed and random effects, GLMMs work well with hierarchical data; they use a link function to relate the linear predictor to the response and estimate parameters by maximum likelihood.
In the context of fatal police shootings, GLMMs can detect regional clusters, temporal trends, and demographic disparities. They help identify risk factors and evaluate how policy changes may affect the number of fatal shootings.
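The structure a logistic GLMM assumes can be made concrete by simulating its data-generating process: a fixed effect shared by everyone, a random intercept per group (e.g. per region), and a logit link. This is a pure-NumPy sketch with illustrative numbers, not an analysis of real data:

```python
# Simulate data from a logistic GLMM with a random intercept per group.
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group = 20, 50

beta0, beta1 = -1.0, 0.8           # fixed effects (intercept, slope)
u = rng.normal(0, 0.5, n_groups)   # random intercepts, one per group

group = np.repeat(np.arange(n_groups), n_per_group)
x = rng.normal(size=n_groups * n_per_group)  # one covariate

# Linear predictor combines fixed and random effects; the inverse
# logit link converts it to a probability for each observation.
eta = beta0 + beta1 * x + u[group]
p = 1 / (1 + np.exp(-eta))
y = rng.binomial(1, p)

print(y.mean())  # overall event rate implied by the model
```

Fitting would run this in reverse, recovering the fixed effects and the random-intercept variance from observed `y`, `x`, and `group`.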

MTH 522 – 10/18/2023

I’ve started my examination of the ‘fatal-police-shootings-data’ dataset in Python. I’ve initiated the process of loading the data to examine its different variables and their respective distributions. Notably, the ‘age’ variable, which is a numerical column, stands out as it provides insights into the ages of individuals who tragically lost their lives in police shootings. Additionally, the dataset includes latitude and longitude values, allowing us to precisely determine the geographical locations of these incidents.
During this initial assessment, I’ve come across an ‘id’ column, which appears to have limited relevance for our analysis. Consequently, I’m considering excluding it from our further investigation. Going deeper, I’ve scanned the dataset for missing values, revealing that several variables contain null or missing data, including ‘name,’ ‘armed,’ ‘age,’ ‘gender,’ ‘race,’ ‘flee,’ ‘longitude,’ and ‘latitude.’ Furthermore, I’ve checked the dataset for potential duplicate records, and I found only a single duplicate entry, notable for its absence of a ‘name’ value. As we move on to the next phase of this analysis, our focus will shift to exploring the distribution of the ‘age’ variable, a crucial step in gaining insights from this dataset.
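The checks described above amount to a few lines of pandas. Since the real CSV is not reproduced here, this sketch runs on a tiny stand-in frame; the actual dataset would be loaded with the commented `read_csv` call instead:

```python
# Drop the low-relevance id column, then count nulls and duplicates.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "name": ["A", None, "C", "C"],
    "age": [34, 51, None, None],
})
# df = pd.read_csv("fatal-police-shootings-data.csv")  # actual load step

df = df.drop(columns=["id"])
missing = df.isnull().sum()    # null count per column
dupes = df.duplicated().sum()  # fully duplicated rows (first kept)
print(missing, dupes)
```

On this stand-in, `dupes` is 1 for the same reason as in the real data: two rows agree on every remaining column once the identifier is dropped.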
In our recent classroom session, we acquired essential knowledge about computing geospatial distances using location information. This newfound expertise enables us to create GeoHistograms, a valuable tool for visualizing and analyzing geographic data. GeoHistograms serve as a powerful instrument for identifying spatial patterns, pinpointing hotspots, and discovering clusters within datasets related to geographic locations. As a result, our understanding of the underlying patterns in the data is significantly enhanced.
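A common way to compute the geospatial distances mentioned above from latitude/longitude pairs is the haversine formula. This standalone sketch assumes a spherical Earth with radius about 6371 km, which is accurate to within roughly half a percent:

```python
# Great-circle distance between two latitude/longitude points.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Boston to New York is roughly 300 km as the crow flies.
print(round(haversine_km(42.36, -71.06, 40.71, -74.01)))
```

Binning such pairwise distances (or the coordinates themselves) is what produces a GeoHistogram of incident locations.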

MTH 522 – 10/16/2023

In machine learning and data analysis, cluster analysis is a potent technique that groups related objects or data points based on their shared properties. Its basic goal is to find underlying patterns in complicated datasets, making decision-making easier and better informed. Because it does not require labeled data for training, it is used extensively in fields such as image segmentation, anomaly detection, and customer segmentation. Well-known clustering algorithms include K-Means, Hierarchical Clustering, and DBSCAN. K-Means, for example, divides a dataset into K separate clusters by repeatedly assigning data points to the closest cluster centroid and updating the centroids, with the goal of minimizing the sum of squared distances between each data point and its centroid. K-Means is an effective and efficient technique, although it requires the number of clusters to be specified in advance, which is an important analysis parameter.
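The assign-then-update loop that K-Means uses to minimize the sum of squared distances can be written from scratch in a few lines. This is a bare-bones sketch in NumPy (no empty-cluster handling or smart initialization), run on two well-separated synthetic blobs:

```python
# Minimal K-means: assign points to the nearest centroid, then move
# each centroid to the mean of its points, until assignments stabilize.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: centroid moves to the mean of its members.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; K-means recovers them as two clusters.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (50, 2)),
               np.random.default_rng(2).normal(5, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(set(labels))
```

Note that `k=2` is supplied by hand, which is exactly the predetermined-cluster-count requirement the text points out.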

MTH 522 – 10/13/2023

In my initial steps of working with the two CSV files, ‘fatal-police-shootings-data’ and ‘fatal-police-shootings-agencies,’ my process began with loading them into Jupyter Notebook. Here’s an overview of the actions I took and the obstacles I encountered:
1. Data Loading: I initiated the process by importing both CSV files into Jupyter Notebook. The ‘fatal-police-shootings-data’ dataset consists of 8,770 entries with 19 attributes, whereas the ‘fatal-police-shootings-agencies’ dataset comprises 3,322 entries with 5 attributes.
2. Column Correspondence: Upon reviewing the column descriptions available on GitHub, I realized that the ‘ids’ column in the ‘fatal-police-shootings-agencies’ dataset corresponds to the ‘agency_ids’ in the ‘fatal-police-shootings-data’ dataset. To streamline the merging process, I renamed the column in the ‘fatal-police-shootings-agencies’ dataset from ‘ids’ to ‘agency_ids.’
3. Data Type Inconsistency: When I attempted to merge the two datasets using the ‘agency_ids,’ I encountered an error indicating an inability to merge on a column with mismatched data types. Upon inspecting the data types using the ‘.info()’ function, I found that one dataset had the ‘agency_ids’ column as an object type, while the other had it as an int64 type. To address this, I utilized the ‘pd.to_numeric()’ function to ensure both columns were of type ‘int64.’
4. Data Fragmentation: A new challenge surfaced in the ‘fatal-police-shootings-data’ dataset: the ‘agency_ids’ column contained multiple IDs within a single cell. To overcome this, I am currently in the process of splitting these cells into multiple rows.
Once I successfully split the cells in the ‘fatal-police-shootings-data’ dataset into multiple rows, my next steps will involve a deeper dive into data exploration and the start of data preprocessing, encompassing tasks like data cleaning, handling missing data, and preparing the data for analysis or modeling.
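Steps 2 through 4 above can be sketched in pandas. The frames below are tiny stand-ins for the real CSVs, and the semicolon separator for multi-id cells is an assumption for illustration; only the column names follow the description above:

```python
# Rename the key, split multi-id cells into rows, align dtypes, merge.
import pandas as pd

shootings = pd.DataFrame({
    "id": [1, 2],
    "agency_ids": ["10;20", "30"],   # multiple agency ids in one cell
})
agencies = pd.DataFrame({
    "ids": ["10", "20", "30"],
    "name": ["Agency A", "Agency B", "Agency C"],
})

# Step 2: rename so the merge key matches across both frames.
agencies = agencies.rename(columns={"ids": "agency_ids"})

# Step 4: split multi-id cells into one row per agency id.
shootings["agency_ids"] = shootings["agency_ids"].str.split(";")
shootings = shootings.explode("agency_ids")

# Step 3: align dtypes before merging (object vs int64 mismatch).
shootings["agency_ids"] = pd.to_numeric(shootings["agency_ids"])
agencies["agency_ids"] = pd.to_numeric(agencies["agency_ids"])

merged = shootings.merge(agencies, on="agency_ids", how="left")
print(len(merged))  # 3 rows after exploding and merging
```

`explode` is one convenient way to turn a list-valued cell into multiple rows, which matches the splitting step described in point 4.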

MTH 522 – 10/11/2023

The Washington Post maintains the “Fatal Force Database,” a comprehensive project that painstakingly tracks and documents instances in which on-duty American police officers shoot and fatally injure civilians. The database records important details such as the victim’s race, the circumstances of the shooting, whether the victim was armed, and whether the victim was experiencing a mental health crisis. Data is gathered from a number of sources, including social media, law enforcement websites, independent databases like Fatal Encounters, and local news reports.

Importantly, the database was updated in 2022 to improve accountability and transparency within departments by standardizing and publicly releasing the identities of the police agencies involved. Since 2015, this dataset has consistently documented more than twice as many fatal police shootings as official sources like the FBI and CDC, revealing a large data gap and underscoring the need for thorough tracking. Continuously updated, it remains an invaluable resource for researchers, policymakers, and the general public, providing insight into police-involved shootings, promoting transparency, and contributing to ongoing conversations about police accountability and reform.

MTH 522 – 10/02/2023

I have worked through the fundamental ideas and guidelines presented in the provided instructions for writing data science reports. It has been fascinating to see the importance of crafting informative titles, summarizing key points in a straightforward, non-technical way, presenting findings fully, and explaining their implications. Additionally, the methodical approach to organizing appendices, describing results, and ensuring correct referencing has given me valuable new perspectives on report writing.
Along with studying these fundamental concepts, I also took the initiative to skim the website’s “Punchline reports” PDF. This additional resource should deepen my understanding of data science report writing and its particular complexities.

I am excited to put this information to use in my upcoming projects and keep honing my ability to effectively deliver data-driven insights.