{{Template:Computer engineering}}
{{Template:Shortcut}}
<div style="float: right; margin-left: 12px">__TOC__</div>


<!--PLEASE DO NOT EDIT THE OPENING SENTENCE WITHOUT FIRST PROPOSING YOUR CHANGE AT THE TALK PAGE.-->
'''Statistics''' is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data.<ref name=ox>{{cite web | title=Oxford Reference|url=https://www.oxfordreference.com/view/10.1093/acref/9780199541454.001.0001/acref-9780199541454-e-1566?rskey=nxhBLl&result=1979}}</ref><ref>{{cite encyclopedia |first=Jan-Willem |last=Romijn |year=2014 |title=Philosophy of statistics |encyclopedia=Stanford Encyclopedia of Philosophy |url=http://plato.stanford.edu/entries/statistics/}}</ref><ref>{{cite web | title=Cambridge Dictionary | url=https://dictionary.cambridge.org/dictionary/english/statistics}}</ref> In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a [[statistical population]] or a [[statistical model]] to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of [[statistical survey|surveys]] and [[experimental design|experiments]].<ref name=Dodge>Dodge, Y. (2006) ''The Oxford Dictionary of Statistical Terms'', Oxford University Press. {{isbn|0-19-920613-9}}</ref> See [[glossary of probability and statistics]].
You can use [[LibreOffice Calc]], [[Microsoft Excel]], [[SPSS]], [[Python]], and [[R]] to compute statistics on data.
* Run R code online: https://rdrr.io/snippets/
* Compile R online: https://rextester.com/l/r_online_compiler


== Python ==








=== Prerequisites ===
First, run the following code to check which of the required libraries are installed:
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
 
Execute the above [[source code]] to confirm whether [[scipy]], [[numpy]], [[matplotlib]], and [[pandas]] are installed.
 
If they are not installed, you can install [[SciPy]], [[NumPy]], [[Matplotlib]], [[pandas]] with the below [[command]]s.
 
pip3 install pandas
pip3 install scipy
pip3 install matplotlib
 
"numpy" will be installed altogether when "pandas" is installed.
 
 
=== Parametric statistics ===
[[Parametric statistics]]
 
==== Z-test ====
[[Z-test]]
 
 
The Z-test is a test for proportions: a statistical test that helps us evaluate our beliefs about a proportion in the population based on the sample at hand.
 
 
This can help us answer questions like:
* Is the proportion of female students at SKEMA equal to 0.5?
* Is the proportion of smokers in France equal to 0.15?
 
 
Conducting a Z-test does not require many calculations on your sample data. The only thing you need to know is the number of observations that qualify as belonging to the sub-sample you are interested in (e.g. a “female SKEMA student” or a “French smoker” in the examples above).
 
 
We will use the mtcars dataset on cars in the US for learning purposes. It contains a list of 32 cars and their characteristics.
 
 
In the simplest example involving the data at hand, we can ask whether the share of cars whose “am” variable equals 0 is 50%.
 
 
The function used here is "scipy.stats.binom_test", which performs an exact binomial test (the exact counterpart of the one-proportion z-test). It requires three arguments:
* x - the number of qualified observations in our data (19 in our case)
* n - the total number of observations (32 in our case)
* p - the share of qualified data under the null hypothesis (0.5 in our case)
 
 
Some notes about the output of the test:
* By default the test is two-sided, so the alternative hypothesis is that the share is not equal to the proportion specified in the null hypothesis (we will see how to adjust this in the sketch below).
* The function returns the most important piece of information: the p-value of the test.
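
As a preview, the direction of the test can be changed through the alternative argument; a minimal sketch, assuming a SciPy version where binom_test accepts that argument (newer releases rename the function to scipy.stats.binomtest). The values x=19 and n=32 come from the example below.

from scipy.stats import binom_test
# Two-sided test (the default): H1 is that the share is not 0.5
print(binom_test(x=19, n=32, p=0.5))
# One-sided test: H1 is that the share is greater than 0.5
print(binom_test(x=19, n=32, p=0.5, alternative='greater'))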
 
 
The p-value is the probability, assuming the null hypothesis is true, of observing data at least as extreme as ours. In this case it is 37.7%, which is very high (anything above 10% is high), so we conclude that we do not have enough statistical evidence to claim that the share of cars with am=0 differs from 50% in the population.
 
 
 
* Use "value_counts()" function to display th frequency distribution of the desired variable
* Test whether the share of cars with the number of cylinders less than 6 is equal to 0.6
* What is your conclusion about the test on number of cylinders?
 
 
 
# Load the libraries
import pandas as pd
from scipy.stats import binom_test  # in newer SciPy versions binom_test is replaced by scipy.stats.binomtest
# Load the dataset
df = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
# In order to conduct the test we need our three inputs.
# The null hypothesis is that the share of cars with am=0 is equal to 50%. This share (0.5) is our first input (p).
# The two other inputs are the total number of observations in our dataset (n)
# and the number of observations satisfying the condition we are testing for, i.e. am=0 (x).
# We can get both numbers by displaying (not plotting!) a simple frequency distribution
# of the am variable, using the .value_counts() function.
print(df.am.value_counts())
print('\n')
# This shows that x=19.
# Adding up all frequencies gives the total number of observations in the dataset (n=32).
# Now we are ready to run the test (i.e. to calculate the p-value):
print(binom_test(x = 19, n = 32, p = 0.5))
print("\n")
# [DIY] Calculate how many cars in the dataset have fewer than 6 cylinders
print(df.cyl.value_counts())
print('\n')
# [DIY] Test (calculate the p-value) whether we have enough statistical evidence to claim
# that the share of cars with fewer than 6 cylinders is not 60%
print(binom_test(x = 11, n = 32, p = 0.6))
 
 
 
Result:
 
0    19
1    13
Name: am, dtype: int64
0.37708558747544885
8    14
4    11
6    7
Name: cyl, dtype: int64
0.003708023361985829
 
==== t-test ====
[[Student's t-test]] ([[t-test]])
 
 
 
 
 
 
There are multiple statistical hypothesis tests. Each test aims to find whether there is a difference in one of several statistical properties, such as the standard deviation, '''average''', or variance. The T-Test is used to determine whether the means (averages) of two groups are truly different.
 
It is also called the Student's T-Test, not because it is used in college, but because its inventor, William Sealy Gosset, published under the pseudonym ''Student''.
 
=====When to use the T-Test?=====
You use the T-Test when you are comparing the '''means''' of two samples. If you have more than two samples, you will have to run pairwise T-Tests on all pairs of samples or use another statistical hypothesis method called ANOVA.
 
You also use it when you don’t know the population’s mean and standard deviation: in the T-Test, you are comparing two samples drawn from an unknown population. A sample is a randomly chosen set of data points from a population. If you do know the population’s mean and standard deviation, you would run a Z-Test instead.
 
Finally, you use it when you have a small number of observations. The T-Test is commonly used when there are fewer than 30 observations in each of the groups being compared.
 
With fewer than 30 observations per group, you can run the T-Test provided you can assume the population follows a normal distribution. For larger samples, the [[Central Limit Theorem]] makes the test robust to departures from normality.
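
A quick illustrative sketch (not from the original text) of the Central Limit Theorem: means of repeated samples from a strongly skewed exponential distribution are themselves approximately normally distributed, with variance close to the theoretical sigma^2/n.

import random
random.seed(1)
# The population is exponential with mean 1.0 (a strongly skewed distribution).
# Draw 1000 samples of size n=30 and record each sample's mean.
n, reps = 30, 1000
means = [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
avg = sum(means) / reps
var = sum((m - avg) ** 2 for m in means) / (reps - 1)
print("mean of sample means:", avg)      # close to the population mean 1.0
print("variance of sample means:", var)  # close to sigma^2 / n = 1 / 30, about 0.033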
 
=====Types of T-Test=====
There are three types of T-Tests that you can run.
 
 
The first is the '''Independent Sample T-Test'''. In this type of test, you compare the averages of two independent, unrelated groups; that is, you compare samples from two different populations and test whether they have different means.
 
 
You can also run a '''Paired Sample T-Test'''. In this test, you compare the averages of two samples taken from the same population but at different points in time. A simple example is testing the means of before-and-after observations taken from the same target.
 
 
Lastly, you can run a '''One-Sample T-Test''', where you test whether the average of a single group differs from a known or hypothesized average.
 
=====Python Independent Sample T-Test=====
To run an Independent Sample T-Test using Python, let us first generate two samples of 50 observations each. Sample '''A''' is taken from a population with mean 55 and standard deviation 20; sample '''B''' is taken from a population with mean 50 and standard deviation 15.
 
 
The [[seaborn]] Python library can be used to plot the distributions of the two samples (see the kdeplot calls in the code below).
 
Install [[seaborn]].
 
pip3 install seaborn
 
 
We are ready to test statistically whether these two samples have different means using the T-Test. To do so, we first have to define our '''Null and Alternate Hypothesis'''.
 
* Null Hypothesis: µ<sub>a</sub> = µ<sub>b</sub> (the means of both populations are equal)
* Alternate Hypothesis: µ<sub>a</sub> ≠ µ<sub>b</sub> (the means of both populations are not equal)
 
 
Python has a popular statistical package called [[scipy]], which implements the T-Test in its stats module. We can run a Python Independent Sample T-Test as below.
 
Note that we specify that the populations do not have equal variances by passing equal_var=False. We know this because the samples were taken from populations with different standard deviations. Normally you would not know this and would first have to run Levene's test for equal variances.
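
For completeness, Levene's test is available as scipy.stats.levene; here is a minimal self-contained sketch (the samples are generated here for illustration and are not the a and b used below).

import random
from scipy.stats import levene
random.seed(20)
x = [random.gauss(55, 20) for _ in range(50)]  # sample from a population with sd 20
y = [random.gauss(50, 15) for _ in range(50)]  # sample from a population with sd 15
stat, p = levene(x, y)  # H0: the two samples come from populations with equal variances
print("Levene statistic: {0} p-value: {1}".format(stat, p))
# A small p-value suggests unequal variances, supporting equal_var=False in ttest_ind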
 
 
import random
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

random.seed(20)  # for results to be recreated
N = 50  # number of samples to take from each population
a = [random.gauss(55, 20) for x in range(N)]  # take N samples from population A
b = [random.gauss(50, 15) for x in range(N)]  # take N samples from population B
tStat, pValue = stats.ttest_ind(a, b, equal_var=False)  # run independent sample T-Test
print("P-Value:{0} T-Statistic:{1}".format(pValue, tStat))  # print the P-Value and the T-Statistic
sns.kdeplot(a, shade=True)
sns.kdeplot(b, shade=True)
plt.title("Independent Sample T-Test")
plt.show()
 
 
Result:
P-Value:0.017485741540118758 T-Statistic:2.421942924642376
 
 
The stats '''ttest_ind''' function runs the independent sample T-Test and outputs a P-Value and the Test-Statistic. In this example, there is enough evidence to reject the Null Hypothesis, as the P-Value is low (typically ≤ 0.05).
 
=====Python Paired Sample T-Test=====
In a Paired Sample T-Test, we will test whether the averages of two samples taken from the same population are different or not.
 
 
Taking two sets of observations from the same population generates a pair of samples, which is why this is called the Paired Sample T-Test.
 
 
For instance, let’s imagine you have a target process you are experimenting with. At time t, you take 30 samples. Next, you implement a process change hoping to increase the average. The code below generates the data needed for this experiment.
 
 
With the data in hand, we will now run the Paired Sample T-Test. The hypotheses are:
 
* Null Hypothesis: µ<sub>d</sub> = 0 (the mean difference (d) between both samples is equal to zero)
* Alternate Hypothesis: µ<sub>d</sub> ≠ 0 (the mean difference (d) between both samples is not equal to zero)
 
 
The python package [[Scipy]] has implemented the Paired Sample T-Test in its '''ttest_rel''' function. Let’s run this below.
 
 
import random
from scipy import stats

random.seed(20)  # for results to be recreated
N = 30  # number of samples to take from each population
a = [random.gauss(50, 15) for x in range(N)]  # take N samples from population A at time T
b = [random.gauss(60, 15) for x in range(N)]  # take N samples from population A at time T+x
tStat, pValue = stats.ttest_rel(a, b)  # run paired sample T-Test
print("P-Value:{0} T-Statistic:{1}".format(pValue, tStat))  # print the P-Value and the T-Statistic
 
 
Result:
 
P-Value:0.007834002687720412 T-Statistic:-2.856841146891359
 
 
As expected, since we generated the data ourselves, we can reject the null hypothesis in favor of the alternative: the mean difference between both samples is not equal to zero.
 
 
=====Python One-Sample T-Test=====
In the One-Sample T-Test, we test the hypothesis of whether the population average is equal to a specified average. The null and alternative hypotheses are stated below.
 
* Null Hypothesis: µ<sub>a</sub> = X (the population mean is equal to X)
* Alternate Hypothesis: µ<sub>a</sub> ≠ X (the population mean is not equal to X)
 
 
As has been usual in this tutorial, let us generate some data to utilize. We take 30 samples from a population with mean 50 and standard deviation 15. We will test whether the population mean is equal to 50.5. The variable ''popmean'' holds this value.
 
 
The Python [[Scipy]] package makes conducting a One-Sample T-Test easy through the '''ttest_1samp''' function of its stats module.
 
 
 
import random
from scipy import stats

random.seed(20)  # for results to be recreated
N = 30  # number of samples to take from the population
a = [random.gauss(50, 15) for x in range(N)]  # take N samples from population A
popmean = 50.5  # hypothesized population mean
tStat, pValue = stats.ttest_1samp(a, popmean, axis=0)
print("P-Value:{0} T-Statistic:{1}".format(pValue, tStat))  # print the P-Value and the T-Statistic
 
 
Result:
 
P-Value:0.5340949682112062 T-Statistic:0.6292755379958038
 
 
What do you think? Do we reject or fail to reject the null hypothesis?
 
Since the P-Value is not low (about 0.53 in this case), we fail to reject the Null Hypothesis. Statistically speaking, there is not enough evidence to conclude that the population average (mean) is not equal to 50.5.
 
==== Chi-squared test ====
[[Chi-squared test]]
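
This section is currently a stub; a minimal sketch of a chi-squared test of independence using scipy.stats.chi2_contingency follows. The 2x2 contingency table below is hypothetical, made up for illustration.

from scipy.stats import chi2_contingency
# Hypothetical contingency table: rows = two groups, columns = outcome yes / no
observed = [[20, 30],
            [25, 25]]
chi2, p, dof, expected = chi2_contingency(observed)  # H0: rows and columns are independent
print("chi2:", chi2)
print("p-value:", p)
print("degrees of freedom:", dof)
print("expected frequencies:", expected)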
 
==== Regression analysis ====
[[Regression analysis]]
 
[[Linear regression]]
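
This section is currently a stub; a minimal simple linear regression sketch using scipy.stats.linregress follows. The hours/score data below is hypothetical.

from scipy.stats import linregress
# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 64, 70, 72, 78, 83]
slope, intercept, r, p, stderr = linregress(hours, score)
print("slope:", slope)
print("intercept:", intercept)
print("r-squared:", r ** 2)
print("p-value:", p)  # H0: the slope is zero (no linear relationship)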
 
==== Analysis of variance ====
[[Analysis of variance]] ([[ANOVA]])
 
One-way (one factor) ANOVA
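
A minimal one-way ANOVA sketch using scipy.stats.f_oneway; the three groups below are hypothetical, made up for illustration.

from scipy.stats import f_oneway
# Hypothetical measurements from three groups (e.g., three treatments)
group1 = [23, 25, 21, 27, 24]
group2 = [30, 28, 32, 29, 31]
group3 = [22, 24, 23, 26, 25]
fStat, pValue = f_oneway(group1, group2, group3)  # H0: all group means are equal
print("F-statistic: {0} p-value: {1}".format(fStat, pValue))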


==== Pearson correlation coefficient ====
[[Pearson correlation coefficient]]
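
A minimal sketch using scipy.stats.pearsonr on hypothetical data:

from scipy.stats import pearsonr
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 5, 7]
r, p = pearsonr(x, y)  # r is in [-1, 1]; H0: no linear correlation
print("Pearson r: {0} p-value: {1}".format(r, p))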


=== Nonparametric statistics ===
[[Nonparametric statistics]]






[[Rank test]]


==== Wilcoxon signed-rank test ====
[[Wilcoxon signed-rank test]]
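
A minimal sketch using scipy.stats.wilcoxon on hypothetical paired (before/after) data:

from scipy.stats import wilcoxon
# Hypothetical paired measurements (e.g., before and after a treatment)
before = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
after  = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
stat, p = wilcoxon(before, after)  # H0: the paired differences are symmetric around zero
print("statistic: {0} p-value: {1}".format(stat, p))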


==== Kruskal–Wallis one-way analysis of variance ====
[[Kruskal–Wallis one-way analysis of variance]]
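
A minimal sketch using scipy.stats.kruskal on hypothetical data from three groups:

from scipy.stats import kruskal
group1 = [2.9, 3.0, 2.5, 2.6, 3.2]
group2 = [3.8, 2.7, 4.0, 2.4]
group3 = [2.8, 3.4, 3.7, 2.2, 2.0]
stat, p = kruskal(group1, group2, group3)  # H0: all groups come from the same distribution
print("H-statistic: {0} p-value: {1}".format(stat, p))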


==== Mann-Whitney U test ====
[[Mann–Whitney U test]]


[[Mann-Whitney rank test]]
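
A minimal sketch using scipy.stats.mannwhitneyu on hypothetical data from two independent groups:

from scipy.stats import mannwhitneyu
group_a = [19, 22, 16, 29, 24]
group_b = [20, 11, 17, 12]
stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')  # H0: equal distributions
print("U-statistic: {0} p-value: {1}".format(stat, p))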


==== Spearman's rank correlation coefficient ====
[[Spearman's rank correlation coefficient]]
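
A minimal sketch using scipy.stats.spearmanr on hypothetical data:

from scipy.stats import spearmanr
x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]
rho, p = spearmanr(x, y)  # rank-based correlation; robust to monotonic nonlinearity
print("Spearman rho: {0} p-value: {1}".format(rho, p))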


== Introduction ==