The attachments provided are the Word documents created while building this project model. Review each attached document and provide another Word document summarizing the following:
- Any refinements made
- Metrics of those refinements
Firas Yacoub
Css 300
08/19/23
Module 2
The following key steps were undertaken during the pre-processing phase:
1. Data Cleaning and Validation:
o Checked for missing values, outliers, and anomalies in each column of the dataset.
o Utilized appropriate techniques to handle missing values, including imputation, removal, or substitution based on the specific context of each attribute.
o Identified and resolved any discrepancies or inconsistencies in data entries by cross-referencing against reliable sources.
2. Categorical Variable Handling:
o Converted categorical variables like departments, positions, and employment status into numerical format using techniques such as one-hot encoding.
o Ensured that categorical labels were consistent across all instances to prevent ambiguity in the analysis.
3. Outlier Detection and Treatment:
o Conducted thorough outlier analysis to identify potential extreme values that could skew analysis results.
o Employed statistical methods to determine whether outliers were due to data entry errors or legitimate observations, addressing them accordingly.
4. Feature Scaling:
o Scaled numerical features as needed to ensure all attributes were on similar scales, enhancing the stability and performance of subsequent modeling techniques.
5. Data Imputation:
o Employed appropriate imputation strategies to fill missing values in a manner that minimized bias and retained the integrity of the dataset.
6. Data Transformation:
o Transformed certain attributes, such as annual salaries, into more suitable forms (e.g., logarithmic transformation) to meet assumptions of linearity in the modeling process.
7. Quality Assurance:
o Conducted thorough checks at each step to ensure that data modifications were consistent with the context of the dataset.
o Documented all pre-processing steps, reasoning, and any assumptions made during the process.
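The pre-processing steps listed above can be sketched in Python with pandas and NumPy. This is a minimal illustration, not the code used in the project: the records are invented, the column names ("Department", "Full or Part-Time", "Annual Salary") are assumptions about the dataset's layout, and median imputation, the 1.5 x IQR rule, and z-score scaling are only one reasonable choice for each step.

```python
import numpy as np
import pandas as pd

# Invented toy records; column names are assumptions, not confirmed by the report.
df = pd.DataFrame({
    "Department": ["POLICE", "FIRE", "police ", "AVIATION", "FIRE", "POLICE"],
    "Full or Part-Time": ["F", "F", "F", "P", "F", "F"],
    "Annual Salary": [90000.0, 95000.0, np.nan, 60000.0, 1_000_000.0, 88000.0],
})

# Cleaning: make categorical labels consistent ("police " becomes "POLICE").
df["Department"] = df["Department"].str.strip().str.upper()

# Imputation: fill missing salaries with the median (an assumed strategy).
median_salary = df["Annual Salary"].median()
df["Annual Salary"] = df["Annual Salary"].fillna(median_salary)

# Outlier detection: flag values outside 1.5 * IQR of the quartiles.
# Rows are only flagged here, so they can be reviewed before any removal.
q1, q3 = df["Annual Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["Annual Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df["is_outlier"] = ~in_range

# Transformation: natural log of salary to reduce right skew.
df["log_salary"] = np.log(df["Annual Salary"])

# Scaling: standardize the transformed salary to mean 0 and unit variance.
df["log_salary_z"] = (
    df["log_salary"] - df["log_salary"].mean()
) / df["log_salary"].std()

# Encoding: one-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=["Department", "Full or Part-Time"])
print(encoded.head())
```

Flagging outliers in a separate column, rather than dropping them immediately, keeps the decision between "data entry error" and "legitimate observation" open for review.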
Exploratory Data Analysis
Descriptive Statistics
The dataset contains each employee’s name, job title, department, and pay as captured in
recent periods. A total of 31,064 employees were included in the data. Of these, police
officers constituted 29%, the fire department 4%, and all others the remaining 67%. Most
employees are salaried, and their pay is recorded in the “Annual Salary” column. The rest
are paid at hourly rates, and their pay is recorded through a combination of “typical
hours” and “hourly rates.” The data gives insight into how many people are employed in the
various city agencies and departments.
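Department shares like the ones quoted above can be computed from a department column. The following is a small sketch under stated assumptions: the data is invented, "Department" is an assumed column name, and the grouping into police, fire, and "OTHER" simply mirrors the breakdown used in the text.

```python
import pandas as pd

# Invented toy data; "Department" is an assumed column name.
departments = pd.Series(
    ["POLICE"] * 3 + ["FIRE"] * 1 + ["AVIATION"] * 2 + ["WATER"] * 4,
    name="Department",
)

# Percentage of employees in each department.
share = departments.value_counts(normalize=True) * 100

# Keep the two departments discussed in the text and pool the rest.
summary = share.reindex(["POLICE", "FIRE"])
summary["OTHER"] = 100 - summary.sum()
summary = summary.round(1)
print(summary)
```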
Histograms
After filtering the data to include only employees with a “Salary” pay frequency, the
histogram of annual salaries is shown below.
Fig 1: Histogram for Annual Salary Rates
After filtering the data to include only employees with an “Hourly” pay frequency, the
histogram of hourly rates is shown below.
Fig 2: Histogram of Hourly Salary Rates
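The filtering and binning behind the two histograms can be sketched as follows. The records, the column names ("Salary or Hourly", "Annual Salary", "Hourly Rate"), and the bin edges are all assumptions made for illustration; the report does not state which bins were used.

```python
import numpy as np
import pandas as pd

# Invented toy records; column names are assumptions.
df = pd.DataFrame({
    "Salary or Hourly": ["Salary", "Hourly", "Salary", "Salary", "Hourly", "Salary"],
    "Annual Salary": [52000.0, np.nan, 78000.0, 91000.0, np.nan, 120000.0],
    "Hourly Rate": [np.nan, 21.5, np.nan, np.nan, 48.0, np.nan],
})

# Filter by pay frequency before looking at each pay column.
salaried = df.loc[df["Salary or Hourly"] == "Salary", "Annual Salary"]
hourly = df.loc[df["Salary or Hourly"] == "Hourly", "Hourly Rate"]

# Count how many values fall in each bin; these counts are the bar heights.
salary_counts, salary_edges = np.histogram(
    salaried.to_numpy(), bins=[50000, 75000, 100000, 125000]
)
hourly_counts, hourly_edges = np.histogram(hourly.to_numpy(), bins=[0, 25, 50])

print(salary_counts, hourly_counts)
```

The same filtered series can be passed to a plotting library to draw the figures; only the counting step is shown here.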
Bar Plot
The bar plot below shows the number of employees by “Full” or “Part” time status.
The number of full-time employees greatly exceeds the number of part-time employees.
Fig 3: Employees’ Status of Employment
Pie Chart
The pie chart below shows the split between salaried and hourly employees.
The salary category (78.5%) has far more employees than the hourly category (21.5%).
Fig 4: Salary and Hourly Distribution
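The counts behind the bar plot and the percentages behind the pie chart come from the same kind of tally. Below is a minimal sketch; `category_shares` is a hypothetical helper, and the sample values are invented, so the output does not reproduce the 78.5% and 21.5% reported above.

```python
import pandas as pd


def category_shares(values: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each category."""
    counts = values.value_counts()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})


# Invented toy data; "Salary or Hourly" is an assumed column name.
pay_type = pd.Series(["Salary"] * 4 + ["Hourly"] * 1, name="Salary or Hourly")
shares = category_shares(pay_type)
print(shares)
```

Applying the same helper to a full-time or part-time column would give the bar heights for the employment status plot.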
Bar Chart
The bar chart below shows the top 10 annual salaries among employees in the “Salary”
category only. At the top is the Commissioner of Aviation, with over $250,000.
Fig 5: Top 10 annual salaries
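A ranking like this can be produced by filtering to salaried employees and taking the largest values. The sketch below uses invented records with placeholder job titles and assumed column names; it takes the top 3 so the toy data stays small, and `n` would be 10 for the chart above.

```python
import numpy as np
import pandas as pd

# Invented toy records with placeholder titles; column names are assumptions.
df = pd.DataFrame({
    "Job Titles": ["TITLE A", "TITLE B", "TITLE C", "TITLE D", "TITLE E"],
    "Salary or Hourly": ["Salary", "Salary", "Hourly", "Salary", "Salary"],
    "Annual Salary": [100000.0, 150000.0, np.nan, 120000.0, 90000.0],
})

# Keep salaried employees only, then take the n highest annual salaries.
salaried = df[df["Salary or Hourly"] == "Salary"]
top = salaried.nlargest(3, "Annual Salary")[["Job Titles", "Annual Salary"]]
print(top)
```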
Box Plot
The box plot below shows how annual salary is distributed within each department,
allowing the departments to be compared.
Fig 6: Box Plot
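The quantities a box plot draws for each department (first quartile, median, third quartile) can also be tabulated directly. This is a sketch on invented data with assumed column names, not the project's own figures.

```python
import pandas as pd

# Invented toy records; column names are assumptions.
df = pd.DataFrame({
    "Department": ["POLICE"] * 4 + ["FIRE"] * 4,
    "Annual Salary": [80000, 90000, 100000, 110000, 70000, 85000, 95000, 130000],
})

# Quartiles of annual salary per department, one row per department.
box_stats = (
    df.groupby("Department")["Annual Salary"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ["q1", "median", "q3"]

# The interquartile range is the height of each box.
box_stats["iqr"] = box_stats["q3"] - box_stats["q1"]
print(box_stats)
```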