Introduction to Statistical Analysis and Machine Learning with Python

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of data. Fundamentally, it provides methods for making informed decisions in the presence of uncertainty. Statistics is divided into two main areas: descriptive statistics, which seeks to summarize and describe the features of a data set, and inferential statistics, which focuses on making predictions or inferences about a population based on a sample.
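The two areas can be illustrated with a short sketch (the sample data is hypothetical, and the 1.96 factor assumes a normal approximation for a 95% confidence interval):

```python
import math
import statistics

# Hypothetical sample of 10 measurements (made-up data for illustration)
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

# Descriptive statistics: summarize the sample itself
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential statistics: estimate the unknown population mean with a
# 95% confidence interval (normal approximation, z = 1.96)
margin = 1.96 * sample_sd / math.sqrt(len(sample))
low, high = sample_mean - margin, sample_mean + margin
print(f"Sample mean: {sample_mean:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```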

In today’s world, the importance of statistics transcends disciplines and sectors. In Administration, it aids in strategic decision-making and resource planning. In Finance, it is crucial for risk analysis and market trend forecasting. The Industry benefits from statistics in quality control and process optimization. In Data Science, statistics is foundational for understanding data, extracting insights, and validating hypotheses. However, it is in Machine Learning where statistics reveals its most transformative potential, enabling machines to learn from data and improve their performance on specific tasks without being explicitly programmed to do so.

Python has established itself as the de facto programming language for statistical analysis and machine learning, thanks to its simplicity and its rich ecosystem of specialized libraries. Among these, NumPy is one of the most essential, offering efficient data structures and mathematical operations for handling and processing large sets of numerical data. Below are basic examples of statistical functions in NumPy that are pillars in data analysis and machine learning:

  • Mean (Average): The central measure that provides a reference point for the data set.
import numpy as np
data = np.array([1, 2, 3, 4, 5])
mean = np.mean(data)
print("Mean:", mean)

  • Standard Deviation: Indicates how much the values tend to deviate from the mean, i.e., the dispersion of the data.

standard_deviation = np.std(data)
print("Standard Deviation:", standard_deviation)
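
Note that np.std computes the population standard deviation by default; for the sample standard deviation (dividing by n - 1 instead of n), pass ddof=1. A quick illustration reusing the same data:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population standard deviation (divides by n), the NumPy default
pop_std = np.std(data)

# Sample standard deviation (divides by n - 1)
sample_std = np.std(data, ddof=1)

print("Population std:", pop_std)   # sqrt(2)
print("Sample std:", sample_std)    # sqrt(2.5)
```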

  • Median: The middle value in an ordered set of data, useful when the data contains outliers.

median = np.median(data)
print("Median:", median)
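
The median's robustness to outliers can be illustrated with a small made-up example:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])
with_outlier = np.array([1, 2, 3, 4, 500])  # one extreme value

# The outlier drags the mean upward but leaves the median unchanged
print("Mean:", np.mean(values), "->", np.mean(with_outlier))
print("Median:", np.median(values), "->", np.median(with_outlier))
```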

  • Mode: The value or values that appear most frequently in the data set, calculable with the scipy library.

from scipy import stats
mode = stats.mode(data)   # returns a ModeResult with .mode and .count fields
print("Mode:", mode.mode)

Through these and many other functions, Python and its libraries like NumPy and SciPy become powerful tools for statistical analysis, unlocking deep insights into data and enabling the development of more accurate and effective machine learning models. This ecosystem of tools not only democratizes access to advanced statistical techniques but also catalyzes innovation across diverse fields such as artificial intelligence, bioinformatics, engineering, and beyond.

Code repository:

Statistical Analysis with Python: An Updated Guide on Functions and Applications

Here we present a reference to functionality available since Python 3.4 for performing basic statistical analysis with the statistics library (several of the functions listed below were added in later releases). This module offers functions to calculate central statistical measures, such as the mean and median; particular care is needed with data that includes non-numeric or undefined (NaN) values, which are common in real data sets.

The example below illustrates that NaN values break both sorting and the median function of the statistics module: comparisons involving NaN are undefined, so sorted() produces a misordered list and median() returns a misleading result. It then shows how to strip NaN values using filterfalse from itertools and isnan from math before performing statistical operations, a common practice in data analysis to ensure meaningful results.

The functions are median(), median_low(), median_high(), median_grouped(), mode(), multimode(), and quantiles().

from statistics import median
from math import isnan
from itertools import filterfalse

data = [20.7, float('NaN'), 19.2, 18.3, float('NaN'), 14.4]

sorted(data)  # surprising behavior: NaN does not compare, so the order is wrong
# [20.7, nan, 14.4, 18.3, 19.2, nan]
median(data)  # unexpected result, computed from the misordered data
# 16.35

sum(map(isnan, data))  # number of missing values
# 2
clean = list(filterfalse(isnan, data))  # strip NaN values
# clean is [20.7, 19.2, 18.3, 14.4]
sorted(clean)  # sorting now works as expected
# [14.4, 18.3, 19.2, 20.7]
median(clean)  # this result is now well defined
# 18.75

Averages and measures of central location

These functions calculate an average or typical value from a population or sample.

mean(): Arithmetic mean ("average") of data.
fmean(): Fast, floating-point arithmetic mean, with optional weighting.
geometric_mean(): Geometric mean of data.
harmonic_mean(): Harmonic mean of data.
median(): Median (middle value) of data.
median_low(): Low median of data.
median_high(): High median of data.
median_grouped(): Median, or 50th percentile, of grouped data.
mode(): Single mode (most common value) of discrete or nominal data.
multimode(): List of modes (most common values) of discrete or nominal data.
quantiles(): Divide data into intervals with equal probability.
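
A minimal sketch of a few of these functions (fmean, geometric_mean, multimode, and quantiles require Python 3.8 or later; the data is made up for illustration):

```python
from statistics import fmean, geometric_mean, harmonic_mean, multimode, quantiles

data = [1, 2, 2, 3, 3, 4]

print(fmean(data))                 # 2.5, computed in fast float arithmetic
print(geometric_mean([1, 4, 16]))  # 4.0 (cube root of 64)
print(harmonic_mean([40, 60]))     # 48.0
print(multimode(data))             # [2, 3]: both values appear twice
print(quantiles(data, n=4))        # the three quartile cut points
```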
Measures of spread

These functions calculate a measure of how much the population or sample tends to deviate from the typical or average values.

pstdev(): Population standard deviation of data.
pvariance(): Population variance of data.
stdev(): Sample standard deviation of data.
variance(): Sample variance of data.
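
The p-prefixed functions treat the data as the entire population (dividing by n), while stdev() and variance() treat it as a sample (dividing by n - 1). A quick illustration with made-up data:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 3, 4, 5]

print(pvariance(data))  # 2 (divides by n = 5)
print(variance(data))   # 2.5 (divides by n - 1 = 4)
print(pstdev(data))     # sqrt(2)
print(stdev(data))      # sqrt(2.5)
```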
Statistics for relations between two inputs

These functions calculate statistics regarding relations between two inputs.

covariance(): Sample covariance for two variables.
correlation(): Pearson's and Spearman's correlation coefficients.
linear_regression(): Slope and intercept for simple linear regression.
Function details

For the details of these functions, we recommend the official documentation: statistics – Mathematical statistics functions.

Recommended: American Statistical Association blog page
