Code
# Import any relevant packages or libraries
import pandas as pd
from scipy import stats
This notebook demonstrates how to conduct a two-sample hypothesis test.
We will be applying descriptive statistics and hypothesis testing.
We will do the following:
Research question:
Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?
Question:
Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help you learn more about your data in this stage of your analysis?
Answer:
In general, descriptive statistics are useful because they let you quickly explore and understand large amounts of data. In this case, computing descriptive statistics helps you quickly compare the average number of drives by device type.
Use descriptive statistics to conduct exploratory data analysis (EDA).
Note: In the dataset, device is a categorical variable with the labels iPhone and Android. In order to perform this analysis, you must turn each label into an integer. The following code assigns a 1 for an iPhone user and a 2 for an Android user. It assigns this label to the new variable device_type.
Note: Creating a new variable is ideal so that you don’t overwrite original data.
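The recoding step described above can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the five-row DataFrame and the name map_dictionary are hypothetical stand-ins, and only the column names device and device_type come from the text.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (assumption).
df = pd.DataFrame(
    {"device": ["Android", "iPhone", "Android", "iPhone", "Android"]}
)

# Map each label to an integer: 1 for iPhone, 2 for Android.
map_dictionary = {"iPhone": 1, "Android": 2}

# Write the result to a new column so the original labels are preserved.
df["device_type"] = df["device"].map(map_dictionary)

print(df["device_type"].head())
```

Because the mapping is written to a new column, the original device labels remain available for later grouping and for checking the recoding.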
0 2
1 1
2 2
3 1
4 2
Name: device_type, dtype: int64
You are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.
device_type
1 67.859078
2 66.231838
Name: drives, dtype: float64
device
Android 66.231838
iPhone 67.859078
Name: drives, dtype: float64
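Group averages like the ones above can be computed with a groupby. The sketch below uses a small hypothetical DataFrame, so its numbers will not match the averages reported from the full dataset. Only the column names device, device_type, and drives are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the real dataset is much larger (assumption).
df = pd.DataFrame(
    {
        "device": ["iPhone", "Android", "iPhone", "Android"],
        "device_type": [1, 2, 1, 2],
        "drives": [70, 62, 90, 48],
    }
)

# Average number of drives per integer-coded device type.
mean_by_type = df.groupby("device_type")["drives"].mean()

# The same averages, grouped by the original text labels.
mean_by_device = df.groupby("device")["drives"].mean()

print(mean_by_type)
print(mean_by_device)
```

Grouping by either column gives the same averages. The integer code only changes the index labels of the result.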
Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, you can conduct a hypothesis test.
Your goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:
Note: This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).
Recall the difference between the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)).
Question: What are your hypotheses for this data project?
Hypotheses:
\(H_0\): There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
\(H_A\): There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
Next, choose 5% as the significance level and proceed with a two-sample t-test.
You can use the stats.ttest_ind() function to perform the test.
Technical note: The default for the argument equal_var in stats.ttest_ind() is True, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); you can relax this assumption by setting equal_var to False, and stats.ttest_ind() will perform the unequal variances \(t\)-test (known as Welch's \(t\)-test). Refer to the scipy t-test documentation for more information.
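For reference, the unequal variances test mentioned above uses the following statistic and approximate degrees of freedom (the Welch-Satterthwaite approximation). The symbols are introduced here for illustration and do not appear in the original notebook: \(\bar{x}_i\), \(s_i^2\), and \(n_i\) are the sample mean, sample variance, and sample size of group \(i\).

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}
\]

Unlike the equal variance version, this form does not pool the two sample variances, which is why it remains valid when the groups differ in spread.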
Select the drives column for iPhone users.
Select the drives column for Android users.
Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)
Ttest_indResult(statistic=-1.4635232068852353, pvalue=0.1433519726802059)
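The test itself can be sketched as follows. This is an illustration on hypothetical toy data, not the notebook's original cell, so its statistic and p-value will differ from the results shown above. The variable names iphone_drives and android_drives are assumptions. stats.ttest_ind() and its equal_var argument are the ones described in the technical note.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the real dataset (assumption).
df = pd.DataFrame(
    {
        "device": ["iPhone", "Android"] * 4,
        "drives": [70, 62, 55, 81, 90, 47, 66, 73],
    }
)

# Select the drives column for each group.
iphone_drives = df.loc[df["device"] == "iPhone", "drives"]
android_drives = df.loc[df["device"] == "Android", "drives"]

# Welch's t-test: equal_var=False drops the equal variance assumption.
result = stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)

# Swapping the sample order flips the sign of the statistic only.
swapped = stats.ttest_ind(a=android_drives, b=iphone_drives, equal_var=False)

print(result)
print(swapped)
```

The sign of the statistic depends on the order in which the two samples are passed. For the default two-sided test, the p-value is the same either way.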
Question: Based on the p-value you got above, do you reject or fail to reject the null hypothesis?
Since the p-value is larger than the chosen significance level (5%), you fail to reject the null hypothesis. You conclude that there is not a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.
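The decision rule above amounts to a single comparison. The sketch below uses the p-value reported in this notebook and the 5% significance level chosen earlier. The variable names are hypothetical.

```python
# Significance level chosen earlier in the notebook (5%).
alpha = 0.05

# p-value reported by the two-sample t-test above.
p_value = 0.1433519726802059

# Reject the null hypothesis only when the p-value is below alpha.
reject_null = p_value < alpha

decision = "reject" if reject_null else "fail to reject"
print(f"p-value = {p_value:.4f}, alpha = {alpha}: {decision} the null hypothesis")
```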
What business insight(s) can you draw from the result of your hypothesis test?
The key business insight is that drivers who use iPhone devices on average have a similar number of drives as those who use Androids.
One potential next step is to explore what other factors influence the variation in the number of drives, and run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or user interface for the Waze app may provide more data to investigate churn.
Google Advanced Data Analytics (Coursera)