Data Analysis using Python

Sep 1, 2021

Why Python for Data Analysis?

For many people, the Python programming language has strong appeal. Since its first appearance in 1991, Python has become one of the most popular interpreted programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular since 2005 or so for building websites using their numerous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages, as they can be used to quickly write small programs, or scripts to automate other tasks.

Python has developed a large and active scientific computing and data analysis community. In the last 10 years, Python has gone from a bleeding-edge or “at your own risk” scientific computing language to one of the most important languages for data science, machine learning, and general software development in academia and industry.

For data analysis, interactive computing, and data visualization, Python will inevitably draw comparisons with other open source and commercial programming languages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent years, Python’s improved support for libraries (such as pandas and scikit-learn) has made it a popular choice for data analysis tasks. Combined with Python’s overall strength for general-purpose software engineering, it is an excellent option as a primary language for building data applications.

Essential Python Libraries:

NumPy
pandas
Matplotlib
IPython and Jupyter
SciPy
scikit-learn
statsmodels

Installation and Setup:

I recommend the Anaconda distribution with Python 3.6.

Windows:

To get started on Windows, download the Anaconda installer. I recommend following the installation instructions for Windows available on the Anaconda download page.

Let’s verify that things are configured correctly. To open the Command Prompt application (also known as cmd.exe), right-click the Start menu and select Command Prompt. Try starting the Python interpreter by typing python. You should see a message that matches the version of Anaconda you installed:

C:\Users\wesm>python
Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul 5 2016, 11:41:13)
[MSC v.1900 64 bit (AMD64)] on win32
>>>

To exit the shell, press Ctrl-D (on Linux or macOS), Ctrl-Z followed by Enter (on Windows), or type the command exit() and press Enter.

Installing or Updating Python Packages:

At some point while reading, you may wish to install additional Python packages that are not included in the Anaconda distribution. In general, these can be installed with the following command:

conda install package_name

If this does not work, you may also be able to install the package using the pip package management tool:

pip install package_name

You can update packages by using the conda update command:

conda update package_name

pip also supports upgrades using the --upgrade flag:

pip install --upgrade package_name
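After installing, it can be handy to confirm which of the libraries covered in this article are actually available in your environment. The snippet below is a minimal sketch using only the standard library; the helper name find_missing is made up for this example, and the list holds the import names (note that scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names of the libraries covered in this article.
# scikit-learn is imported as "sklearn"; IPython is capitalized.
LIBRARIES = ["numpy", "pandas", "matplotlib", "IPython",
             "scipy", "sklearn", "statsmodels"]


def find_missing(names):
    """Return the names that cannot be found in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(LIBRARIES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All libraries are installed.")
```

Anything reported as not installed can then be added with conda install or pip install as described above.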

OK. Let’s talk about the essential Python libraries for data analysis:

NumPy:

NumPy, short for Numerical Python, has long been a cornerstone of numerical computing in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. NumPy contains, among other things:

A fast and efficient multidimensional array object ndarray
Functions for performing element-wise computations with arrays or mathematical operations between arrays
Tools for reading and writing array-based datasets to disk
Linear algebra operations, Fourier transform, and random number generation
A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities

Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary uses in data analysis is as a container for data to be passed between algorithms and libraries. For numerical data, NumPy arrays are more efficient for storing and manipulating data than the other built-in Python data structures.
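To make the element-wise idea concrete, here is a small sketch comparing a plain Python list with an ndarray. The variable names and values are made up for illustration only.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
prices = [1.5, 2.0, 3.25]

# With a plain list, scaling each value needs an explicit loop or comprehension.
doubled_list = [p * 2 for p in prices]

# With an ndarray, the operation is written once and applied to every element.
arr = np.array(prices)
doubled_arr = arr * 2

# Arithmetic between two arrays of the same shape is also element-wise.
summed = arr + arr

print(doubled_list)
print(doubled_arr)
print(summed)
```

Both approaches give the same numbers here; the array version avoids the Python-level loop, which is where the speed advantage on large datasets comes from.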

Example for NumPy:

Before you start writing code, check whether the NumPy library is installed, or simply install it using pip install numpy.

In [1]: import numpy as np
In [2]: data = np.random.randn(2, 3)

In [3]: data

Output:

Out[3]: array([[-0.2047, 0.4789, -0.5194], 
                [-0.5557, 1.9658, 1.3934]])

You can then apply mathematical operations to data. Here every element is multiplied by 10:

In [4]: data * 10

Out[4]: array([[ -2.0471, 4.7894, -5.1944],
               [ -5.5573, 19.6578, 13.9341]])
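Because the array above is filled with random numbers, your values will differ from the ones shown. If you want repeatable numbers, one option is a seeded generator. The sketch below assumes a reasonably recent NumPy (np.random.default_rng was added in version 1.17) and also shows two attributes every ndarray carries, shape and dtype. The seed value is arbitrary.

```python
import numpy as np

# A seeded generator gives the same "random" numbers on every run.
rng = np.random.default_rng(seed=0)  # the seed value 0 is arbitrary
data = rng.standard_normal((2, 3))   # 2 rows, 3 columns, standard normal

# Adding an array to itself is element-wise, like the multiplication above.
doubled = data + data

print(doubled)
print(data.shape)   # size of each dimension
print(data.dtype)   # data type shared by all elements
```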

pandas:

pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

Importing the pandas library:

import pandas as pd

To get started with pandas, you will need to get comfortable with its two data structures: Series and DataFrame.

Series:

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [1]: obj = pd.Series([4, 7, -5, 3])
In [2]: obj

Out[2]:
0    4
1    7
2   -5
3    3
dtype: int64
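The index mentioned above does not have to be the default 0, 1, 2, and so on. As a short sketch, the example below inspects the default index and then builds a second Series with string labels. The labels "d", "b", "a", "c" are arbitrary and chosen only for illustration.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
print(obj.index)        # default integer index
print(obj.to_numpy())   # the underlying values as a NumPy array

# The same values with explicit, made-up string labels.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Select a single value by its label.
print(obj2["a"])

# Boolean filtering keeps the labels attached to the values.
positive = obj2[obj2 > 0]
print(positive)
```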

DataFrame:

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

# creating dataframe
In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
                'year': [2000, 2001, 2002, 2001, 2002, 2003], 
                'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]} 
In [4]: frame = pd.DataFrame(data)

In [5]: frame
Out[5]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
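The "dictionary of Series" description can be checked directly. The sketch below rebuilds the same DataFrame, pulls out one column as a Series, and filters rows with a boolean condition. Note that the column is accessed as frame["pop"] rather than frame.pop, because pop is also the name of a DataFrame method.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# Selecting one column returns a Series that shares the frame's row index.
years = frame["year"]
print(type(years).__name__)

# Boolean filtering keeps only the rows where the condition is True.
nevada = frame[frame["state"] == "Nevada"]
print(nevada)

# Column-wise computations work without an explicit loop.
mean_pop = frame["pop"].mean()
print(mean_pop)
```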

Matplotlib:

Data visualization is an important part of any analysis. Matplotlib is a Python plotting library that can be used to plot a pandas DataFrame. A plot can be generated in various ways, depending on the requirement.

# importing pandas library
import pandas as pd
# importing matplotlib library
import matplotlib.pyplot as plt

# creating dataframe
df = pd.DataFrame({
 'Name': ['John', 'Sammy', 'Joe'],
 'Age': [45, 38, 90]
})

# plotting a bar graph
df.plot(x="Name", y="Age", kind="bar")

# displaying the figure (needed when running as a script)
plt.show()

Output: (bar chart showing Age for each Name)
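As one example of generating the plot a different way, the sketch below draws the same data as a horizontal bar chart on an explicit Axes object and saves it to a PNG file instead of opening a window. The "Agg" backend choice, the file name age_by_name.png, and the title text are assumptions made for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sammy', 'Joe'],
    'Age': [45, 38, 90]
})

# Create the figure and axes explicitly so they can be customized.
fig, ax = plt.subplots()

# kind="barh" draws horizontal bars; ax=ax tells pandas where to draw.
df.plot(x="Name", y="Age", kind="barh", ax=ax, legend=False)
ax.set_xlabel("Age")
ax.set_title("Age by name")

# Hypothetical output location: a PNG in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "age_by_name.png")
fig.savefig(out_path)
plt.close(fig)
print("Saved plot to", out_path)
```

Saving to a file is useful when the code runs on a server or in a scheduled job where no plot window can be shown.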