Data exploration and hypothesis testing

This notebook demonstrates how to conduct a two-sample hypothesis test.

We will be applying descriptive statistics and hypothesis testing.

We will do the following:

  1. Conduct hypothesis testing
  • How did computing descriptive statistics help you analyze your data?
  • How did you formulate your null hypothesis and alternative hypothesis?
  2. Communicate insights with stakeholders
  • What key business insight(s) emerged from your hypothesis test?
  • What business recommendations do you propose based on your results?

Research question:

Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?

Code
# Import any relevant packages or libraries
import pandas as pd
from scipy import stats
Code
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

Question:

Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help you learn more about your data in this stage of your analysis?

Answer:

In general, descriptive statistics are useful because they let you quickly explore and understand large amounts of data. In this case, computing descriptive statistics helps you quickly compare the average number of drives by device type.

1. Data exploration

Use descriptive statistics to conduct exploratory data analysis (EDA).

Note: In the dataset, device is a categorical variable with the labels iPhone and Android.

One way to prepare for this analysis is to turn each label into an integer. The following code maps iPhone to 1 and Android to 2 and stores the result in a new column, device_type.

Note: Creating a new variable is ideal so that you don’t overwrite original data.

Code
# 1. Create `map_dictionary`
map_dictionary = {'Android': 2, 'iPhone': 1}

# 2. Create new `device_type` column
df['device_type'] = df['device']

# 3. Map the new column to the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)

df['device_type'].head()
0    2
1    1
2    2
3    1
4    2
Name: device_type, dtype: int64
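
One thing worth checking after a dictionary mapping: pandas `Series.map` returns NaN for any label that is missing from the dictionary, so a typo or an unexpected category would silently become a missing value. The sketch below illustrates this on a small made-up Series (the `'Windows'` label is hypothetical and is not claimed to appear in the dataset); the same check could be applied to `df['device']`.

```python
import pandas as pd

# Hypothetical labels for illustration, including one the dictionary does not cover.
labels = pd.Series(['iPhone', 'Android', 'iPhone', 'Windows'])
map_dictionary = {'Android': 2, 'iPhone': 1}

# Labels without a dictionary entry become NaN.
mapped = labels.map(map_dictionary)

# List the original labels that failed to map.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was mapped.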

You are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

Code
df.groupby('device_type')['drives'].mean()
device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64
Code
df.groupby('device')['drives'].mean()
device
Android    66.231838
iPhone     67.859078
Name: drives, dtype: float64

Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, you can conduct a hypothesis test.
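
To judge informally whether a gap between two means could be sampling noise, it can help to look at group sizes and spreads alongside the averages. The following is a minimal sketch: `summarize_by_group` is a hypothetical helper (not part of the original notebook), and the small frame is made-up data used only so the example runs on its own.

```python
import pandas as pd

def summarize_by_group(frame, group_col, value_col):
    """Return count, mean, sample standard deviation, and standard error per group."""
    summary = frame.groupby(group_col)[value_col].agg(['count', 'mean', 'std'])
    # Standard error of the mean: sample standard deviation divided by sqrt(n).
    summary['sem'] = summary['std'] / summary['count'] ** 0.5
    return summary

# Tiny made-up frame for illustration only; this is not the Waze data.
toy = pd.DataFrame({
    'device': ['iPhone', 'Android', 'iPhone', 'Android', 'iPhone', 'Android'],
    'drives': [10, 20, 30, 40, 50, 60],
})
toy_summary = summarize_by_group(toy, 'device', 'drives')
print(toy_summary)
```

On the notebook's data, the equivalent call would be `summarize_by_group(df, 'device', 'drives')`. A difference in means that is small relative to the standard errors is a hint that the hypothesis test may not find it significant.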

2. Hypothesis testing

Your goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:

  1. State the null hypothesis and the alternative hypothesis
  2. Choose a significance level
  3. Find the p-value
  4. Reject or fail to reject the null hypothesis

Note: This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).

Recall the difference between the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)).

Question: What are your hypotheses for this data project?

Hypotheses:

\(H_0\): There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

\(H_A\): There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

Next, choose 5% as the significance level and proceed with a two-sample t-test.

You can use the stats.ttest_ind() function to perform the test.

Technical note: The default for the argument equal_var in stats.ttest_ind() is True, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); you can relax this assumption by setting equal_var to False, and stats.ttest_ind() will perform the unequal variances \(t\)-test (known as Welch’s t-test). Refer to the scipy t-test documentation for more information.
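
To make the unequal-variances option concrete, the sketch below computes Welch's statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `stats.ttest_ind(..., equal_var=False)`. The function `welch_t_test` is a hypothetical helper, and the two samples are synthetic draws chosen only for illustration; they are not the Waze data.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Two-sided Welch t-test computed from first principles."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    # Two-sided p-value from the t distribution's survival function.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

# Synthetic samples with different spreads (assumed values, illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=68, scale=65, size=200)
group_b = rng.normal(loc=66, scale=40, size=120)

manual_t, manual_dof, manual_p = welch_t_test(group_a, group_b)
scipy_result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

print(manual_t, manual_p)
print(scipy_result.statistic, scipy_result.pvalue)
```

The two printed lines should agree up to floating-point error, which shows that `equal_var=False` estimates each group's variance separately instead of pooling them.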

  1. Isolate the drives column for iPhone users.
  2. Isolate the drives column for Android users.
  3. Perform the t-test
Code
# 1. Isolate the `drives` column for iPhone users.
iPhone = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
Android = df[df['device_type'] == 2]['drives']

# 3. Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)
Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)
Code
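# Repeat the test with the sample order swapped (a=Android, b=iPhone).
# The statistic changes sign; the two-sided p-value is unchanged.
# 1. Isolate the `drives` column for iPhone users.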
iPhone = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
Android = df[df['device_type'] == 2]['drives']

# 3. Perform the t-test
stats.ttest_ind(b=iPhone, a=Android, equal_var=False)
Ttest_indResult(statistic=-1.4635232068852353, pvalue=0.1433519726802059)
Code
iPhone.head()
1    107
3     40
5    103
6      2
7     35
Name: drives, dtype: int64
Code
# Same test, filtering on the original `device` labels instead of `device_type`.
# 1. Isolate the `drives` column for iPhone users.
iPhone = df[df['device'] == 'iPhone']['drives']

# 2. Isolate the `drives` column for Android users.
Android = df[df['device'] == 'Android']['drives']

# 3. Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)
Ttest_indResult(statistic=1.4635232068852353, pvalue=0.1433519726802059)
Code
# Confirm that filtering on `device` selects the same rows as before
iPhone.head()
1    107
3     40
5    103
6      2
7     35
Name: drives, dtype: int64

Question: Based on the p-value you got above, do you reject or fail to reject the null hypothesis?

Since the p-value (approximately 0.143) is larger than the chosen significance level (0.05), you fail to reject the null hypothesis. The data do not show a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.
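
The decision rule can be written out explicitly. This short sketch uses the p-value reported by `stats.ttest_ind()` above and the 5% significance level; `decide` is a hypothetical helper name introduced here for illustration.

```python
# Values taken from the notebook: the reported two-sided p-value and the chosen level.
p_value = 0.1433519726802059
alpha = 0.05

def decide(p_value, alpha):
    """Return the hypothesis-test decision for a p-value and significance level."""
    return 'reject H0' if p_value < alpha else 'fail to reject H0'

decision = decide(p_value, alpha)
print(decision)  # fail to reject H0
```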

Conclusion

  • What business insight(s) can you draw from the result of your hypothesis test?

    The key business insight is that, on average, drivers who use iPhone devices have a similar number of drives to those who use Android devices; the observed difference is not statistically significant.

    One potential next step is to explore what other factors influence the variation in the number of drives, and run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or user interface for the Waze app may provide more data to investigate churn.

References

Google Advanced Data Analytics (Coursera)