For more information, please refer to the attachment
Dataset
https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022?select=Base.csv
Q1: Which factor affects fraud the most? (Logistic Regression Model)
The box plot above shows that income, credit limit, age, and name_email_similarity are widely dispersed, which suggests that these factors could be important determinants of fraud. The box plot is only an initial visual observation, however. To explore the relationship between these factors and fraud further, we use Logistic Regression.
#2. Split the data into a train and test set.
70% of the data is used for training and the remaining 30% is held out for testing. These two parts are the train and test datasets, respectively.
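The 70/30 split described above can be sketched as follows. This is a minimal illustration in Python rather than the R used for the analysis; the function name `train_test_split`, the seed, and the toy `rows` list are assumptions made for the example and do not come from the original script.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the rows of Base.csv.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is ordered in some way (for example by month), because otherwise the test set would not resemble the training set.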
#3. Find the fitted model.
We use the glm() function to fit the first model, m1, which regresses fraud_bool on all other variables, with data set to Analy_train, family set to binomial, and the logit link (that is, logistic regression). Its AIC value is 4090.7. However, we did not believe this was the best model, so we tried different combinations of variables repeatedly and finally arrived at the best model, m3.
Here is the summary of m3:
In m3, the AIC becomes 4077.1. At the same time, we noticed that some payment types, employment statuses, housing statuses, and devices have been eliminated from the model. We suppose that these do not affect fraud. However, the m3 model cannot directly tell us which factor affects fraud the most, so we look at the p-value of each coefficient. Among all the p-values, the coefficient for a Windows device has the smallest one, meaning it shows the strongest statistical evidence of an association with fraud. This suggests that whether the device runs Windows affects fraud the most.
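The AIC comparison between m1 and m3 can be sketched as follows, in Python rather than R. The two AIC values are the ones reported in this section; the helper `aic` and the dictionary name `reported_aic` are illustrative assumptions, and the log-likelihoods of the fitted models are not available here.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# AIC values reported in the text for the two models.
reported_aic = {"m1": 4090.7, "m3": 4077.1}

# The preferred model is the one with the smallest AIC.
best = min(reported_aic, key=reported_aic.get)
delta = reported_aic["m1"] - reported_aic["m3"]
print(best, round(delta, 1))
```

Because AIC penalizes each extra parameter, a model that drops uninformative levels can score lower even when its likelihood is slightly worse.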
Q2. What is the relationship between fraud and income? (Linear Regression Model)
According to the screenshot, fraud and income do not have a strong relationship in the linear regression model: the R-squared is very small, which means that income explains almost none of the variation in fraud. As a result, we can only use the coefficient from the logistic model:
Fraud = -9.03 + 1.354 * Income
However, in the Logistic Regression Model, this equation is not a direct relationship: the right-hand side is the log-odds of fraud, which the logistic function converts into a predicted probability. When that probability is greater than 0.5, we predict fraud. When it is smaller than 0.5, we predict no fraud.
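As a rough illustration of how the equation turns into a prediction, the sketch below applies the logistic function to the intercept and income coefficient quoted above. It is written in Python rather than R, and it assumes that every other predictor in the model sits at its baseline (zero) level. The function names and the income values 0.1 and 0.9 are hypothetical.

```python
import math

INTERCEPT = -9.03     # intercept quoted in the text
INCOME_COEF = 1.354   # income coefficient quoted in the text

def fraud_probability(income):
    """Convert the log-odds from the equation into a probability."""
    log_odds = INTERCEPT + INCOME_COEF * income
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_fraud(income, threshold=0.5):
    """Predict fraud when the probability exceeds the threshold."""
    return fraud_probability(income) > threshold

# exp(coefficient) is the multiplicative change in the odds of fraud
# for a one-unit increase in income.
odds_ratio = math.exp(INCOME_COEF)

for income in (0.1, 0.9):  # hypothetical income values
    print(income, fraud_probability(income), predict_fraud(income))
```

Under these assumptions the positive coefficient raises the probability as income increases, but with an intercept this negative the probability stays far below the 0.5 threshold.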
Q3. What is the relationship between fraud and age? (Linear Regression Model)
The screenshot shows a similar situation: the R-squared is only 0.0006527, which is too small, so we can only give the equation from the Logistic Regression:
Fraud = -9.03 + 0.02081 * Age
Q4. For the people who are not fraudulent, is there any relationship between age and income? (Linear Regression Model)
As shown above, there is a positive correlation between age and income among people who are not fraudulent. This is likely because income generally tends to increase with age as people gain experience and advance in their careers. Higher-income people may also have more knowledge about financial matters, which could make them better able to identify fraud.
The Linear Regression Line is:
Income = 0.48793084 + 0.00288834 * Age
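For reference, a regression line of this form and its R-squared can be computed by ordinary least squares as in the sketch below. It is written in Python rather than R. The toy `ages` and `incomes` lists are hypothetical and are not taken from the dataset; only the coefficients inside `predicted_income` come from the line above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the line.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, 1 - ss_res / ss_tot

def predicted_income(age):
    """Evaluate the regression line reported in the text."""
    return 0.48793084 + 0.00288834 * age

# Hypothetical toy data, only to exercise fit_line.
ages = [20, 30, 40, 50]
incomes = [0.5, 0.6, 0.6, 0.7]
intercept, slope, r_squared = fit_line(ages, incomes)
print(intercept, slope, r_squared)
print(predicted_income(30))
```

An R-squared close to 0, as reported for fraud against income and fraud against age, means the fitted line explains almost none of the variation in the response.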
Q5. Do Velocity 6h, Velocity 24h, and Velocity 4w have the same mean? (ANOVA)
Step 1:
Null Hypothesis: Velocity 6h, Velocity 24h, and Velocity 4w have the same mean.
Mean (Velocity 6h) = Mean (Velocity 24h) = Mean (Velocity 4w)
Alternative Hypothesis: at least one of the three means is different.
Step 2:
Step 3:
According to the ANOVA output, the p-value is smaller than 0.05, so we reject the Null Hypothesis in favor of the Alternative Hypothesis.
Step 4:
As a result, we can say that the means of Velocity 6h, Velocity 24h, and Velocity 4w are not all the same.
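The test statistic behind the steps above can be sketched as a one-way ANOVA F computation, written here in Python rather than R. The function name `one_way_anova_f` and the three small groups are hypothetical stand-ins for the velocity columns. The p-value itself comes from the F distribution with the returned degrees of freedom and is not computed here.

```python
def one_way_anova_f(*groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy samples standing in for the three velocity columns.
velocity_6h = [1.0, 2.0, 3.0]
velocity_24h = [2.0, 3.0, 4.0]
velocity_4w = [5.0, 6.0, 7.0]
f_stat, df_between, df_within = one_way_anova_f(
    velocity_6h, velocity_24h, velocity_4w)
print(f_stat, df_between, df_within)
```

A large F means the group means differ by more than the spread within each group would explain, which is what a p-value below 0.05 reflects.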