NumPy Basics
NumPy has become so fundamental that it is hard to imagine research and data science without it. It has been around since 2005, and if you have ever worked with data in Python, you have almost certainly used it, one way or another.
What is NumPy?
So what is NumPy? According to the official website, NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Features
- It combines C and Python.
- Multidimensional homogeneous arrays: the ndarray is an n-dimensional array.
- Various functions for arrays.
- Reshaping of arrays.
- Python with NumPy can be used as an alternative to MATLAB.
One trade-off of using Python is its computing speed, whereas C is known for its high speed. The developers therefore wrote a package of numerical functions in C that you can call from Python. So, without having to learn C, you can use its power in Python.
The biggest advantage of NumPy is its ability to handle numerical arrays. For example, if you have a list of values and you want to square each of them, the code in base Python will look like:
a = [1, 2, 3, 4, 5]
b = []
for i in a:
    b.append(i**2)
and you will get [1, 4, 9, 16, 25] for b. Now, if you want to do the same with a 2-dimensional array, the base Python to do this is:
a = [[1, 2], [3, 4]]
b = [[],[]]
for i in range(len(a)):
    for j in range(len(a[i])):
        b[i].append(a[i][j]**2)
This would give you b equal to [[1, 4], [9, 16]]. To do the same with a 3D array you would need 3 nested loops and to do it in 4D would require 4 nested loops. However, with NumPy you can take the square of an array of any dimensions using the same line of code and no loops:
import numpy as np
b = np.array(a)**2
Using NumPy is much faster than the base Python version! It is faster to run, saving you computing time, and faster to write, saving you time writing your code, so you can do more science in less time. Not only that, if your friend has a look at your code, they will understand in an instant that you want the square of the array, without having to decipher what the for loop is trying to do.
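To make the "same line of code for any dimension" point concrete, here is a small sketch. The nested lists a1, a2 and a3 are made-up examples, not data from this post:

```python
import numpy as np

# Nested lists of depth 1, 2 and 3 (arbitrary example values).
a1 = [1, 2, 3]
a2 = [[1, 2], [3, 4]]
a3 = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

# The same expression squares every element, whatever the number of dimensions.
b1 = np.array(a1) ** 2
b2 = np.array(a2) ** 2
b3 = np.array(a3) ** 2

print(b1.shape, b2.shape, b3.shape)  # (3,) (2, 2) (2, 2, 2)
```

No loop appears anywhere, and the result always has the same shape as the input.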
NumPy serves as the basis of most scientific packages in Python, including pandas, matplotlib, scipy, etc. Hence, it would be a good idea to explore the basics of data handling in Python with NumPy.
Installation requirements
Let’s take a look at the various requirements we need to set up before we proceed.
The code is based on the Python 3.4/2.7-compatible version and NumPy version 1.9. The easiest way to install these requirements (and more) is to install a complete Python distribution, such as Enthought Canopy, EPD, Anaconda, or Python(x,y). Once you have installed any one of these, you can safely skip the remainder of this section and should be ready to begin.
Using Python package managers
You can also use Python package managers, such as enpkg, Conda, pip or easy_install, to install the requirements using one of the following commands; replace numpy with any other package name you'd like to install, for example, ipython, matplotlib and so on:
$ pip install numpy
$ easy_install numpy
$ enpkg numpy # for Canopy users
$ conda install numpy # for Anaconda users
Using native package managers
If the Python interpreter you want to use comes with the OS and is not a third-party installation, you may prefer using OS-specific package managers such as aptitude, yum, or Homebrew. The following table illustrates the package managers and the respective commands used to install NumPy:
Package manager    Command
Aptitude           $ sudo apt-get install python-numpy
Yum                $ yum install python-numpy
Homebrew           $ brew install numpy
Note that, when installing NumPy (or any other Python modules) on OS X systems with Homebrew, Python should have been originally installed with Homebrew.
Detailed installation instructions are available on the respective websites of NumPy, IPython, and matplotlib. As a precaution, to check whether NumPy was installed properly, open an IPython terminal and type the following command:
In [1]: import numpy as np
If the statement looks like it does nothing, this is a good sign: executing without any output or error means that NumPy was installed and has been imported properly into your Python session.
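As an optional extra check, you can also print the installed version. This is only a sketch; the version string you see depends on your installation and need not match the 1.9 mentioned above:

```python
import numpy as np

# np.__version__ is a string; its exact value depends on your installation.
version = np.__version__
print("NumPy version:", version)
```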
Congratulations! We are now ready to begin.
Why should we use NumPy?
We use a NumPy array instead of a Python list for three reasons:
- Less memory usage
- Fast performance
- Convenient to work with
The first reason to prefer NumPy arrays is that they take less memory than a Python list. They are also fast in terms of execution and, at the same time, convenient and easy to work with.
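The memory claim can be checked with a rough sketch like the one below. It assumes CPython, where sys.getsizeof reports per-object sizes, and the element count n = 1000 is an arbitrary choice; exact byte counts vary by Python version:

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count the list and its items.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# nbytes counts only the array's data buffer: n elements of 8 bytes each.
array_bytes = np_arr.nbytes

print("list:", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
```

The array needs 8 bytes per element, while each list entry costs a pointer plus a full Python integer object.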
What can we do with NumPy?
Python has no built-in support for multidimensional arrays, but we can use Python lists as arrays.
arrayA = ['Hello', 'world']
print(arrayA)
But it’s still a Python list, not an array.
This is where NumPy comes in: we can use it to create 2D, 3D and, in general, multidimensional arrays, and to do computations on them.
import numpy as num
arr = num.array([1,2,3,4,5,6])
print(arr)
Creates array arr.
Then, for 2D and 3D arrays,
import numpy as num
arr = num.array([(1,2,3,4,5),(6,7,8,9,10)])
print(arr)
- If you want to know the number of dimensions of your array, you can use the ndim attribute.
print(arr.ndim)
- If you want to find out the size of an array (its total number of elements), you can use the size attribute.
print(arr.size)
- To find out the shape of an array, you can use the shape attribute.
print(arr.shape)
For a 2D array, it will tell you the number of (rows, columns).
You can also use slicing, reshaping and many more methods with numpy arrays.
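Putting these attributes together with a first taste of slicing and reshaping, a minimal sketch might look like this. The 2x5 array is an illustrative example, and the names part and reshaped are arbitrary:

```python
import numpy as np

arr = np.array([(1, 2, 3, 4, 5), (6, 7, 8, 9, 10)])

print(arr.ndim)   # 2 -> two dimensions
print(arr.size)   # 10 -> total number of elements
print(arr.shape)  # (2, 5) -> 2 rows, 5 columns

# Slicing: first row, columns 1 to 3.
part = arr[0, 1:4]
# Reshaping: the same 10 elements arranged as 5 rows of 2.
reshaped = arr.reshape(5, 2)
print(part)            # [2 3 4]
print(reshaped.shape)  # (5, 2)
```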
Why do we need NumPy?
NumPy is needed to perform logical and mathematical computations on arrays and matrices. It performs these operations far more efficiently and faster than Python lists.
NumPy Ndarray
Ndarray is one of the most important classes in the NumPy Python library. It is basically a multidimensional, or n-dimensional, array of fixed size with homogeneous elements (i.e. the data type of all the elements in the array is the same). A multidimensional array looks something like this:
In NumPy, the number of dimensions of an array is called its rank. In the above example, the ranks of the 1D, 2D, and 3D arrays are 1, 2 and 3 respectively.
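As a rough illustration of rank, the sketch below builds one array of each kind and prints its number of dimensions. The values are arbitrary examples:

```python
import numpy as np

one_d = np.array([1, 2, 3])
two_d = np.array([[1, 2, 3], [4, 5, 6]])
three_d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# ndim is the rank; shape lists the length along each dimension.
for arr in (one_d, two_d, three_d):
    print(arr.ndim, arr.shape)
```

The rank is simply the length of the shape tuple.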
Syntax:
np.ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None)
Here, the shape parameter gives the size of the array along each dimension, and therefore the number of elements it holds. The dtype parameter gives the data type of the elements. The buffer parameter is an object exposing the buffer interface, and offset is the offset of the array data in that buffer. The strides parameter specifies the number of bytes to step in memory between the starts of successive array elements along each dimension.
For a contiguous array, the smallest stride equals the size of the data type of the elements. Finally, the order parameter specifies whether we want row-major or column-major order. Of all these parameters, only shape is required; dtype defaults to float, and the others are optional and can be specified as needed.
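Strides and order are easier to grasp with numbers. The sketch below assumes an illustrative 3x4 array of 8-byte integers and compares the two memory orders:

```python
import numpy as np

c_arr = np.zeros((3, 4), dtype=np.int64)             # row-major (C order), the default
f_arr = np.zeros((3, 4), dtype=np.int64, order="F")  # column-major (Fortran order)

print(c_arr.itemsize)  # 8 bytes per int64 element
print(c_arr.strides)   # (32, 8): 32 bytes to the next row, 8 to the next column
print(f_arr.strides)   # (8, 24): 8 bytes to the next row, 24 to the next column
```

In row-major order the elements of one row sit next to each other, so moving along a row costs one item size; in column-major order the same holds for columns.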
Working with Ndarray
An array can be created using the following functions:
- np.ndarray(shape, dtype): Creates an array of the given shape without initializing it, so its contents are arbitrary.
- np.array(array_object): Creates an array from a list or tuple.
- np.zeros(shape): Creates an array of the given shape with all zeros.
- np.ones(shape): Creates an array of the given shape with all ones.
- np.full(shape, fill_value, dtype): Creates an array of the given shape filled with the given value.
- np.arange(start, stop, step): Creates an array with evenly spaced values in the specified range.
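The creation functions listed above can be tried side by side. This is only a sketch; the shapes, fill value and variable names are arbitrary choices:

```python
import numpy as np

from_list = np.array([[1, 2], [3, 4]])   # from a nested list
zeros = np.zeros((2, 3))                 # 2x3 array of 0.0
ones = np.ones((2, 3))                   # 2x3 array of 1.0
sevens = np.full((2, 2), 7, dtype=int)   # 2x2 array filled with 7
evens = np.arange(0, 10, 2)              # start, stop (exclusive), step
raw = np.ndarray((2, 2), dtype=float)    # uninitialized: contents are arbitrary

print(sevens)
print(evens)
print(raw.shape)  # only the shape of raw is predictable, not its values
```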
Examples of Ndarray
Given below are the examples of Ndarray:
Example #1: Attributes of a multidimensional array(ndarray)
import numpy as np
#creating an array to understand its attributes
A = np.array([[1,2,3],[1,2,3],[1,2,3]])
print("Array A is:\n",A)
#type of array
print("Type:", type(A))
#Shape of array
print("Shape:", A.shape)
#no. of dimensions
print("Rank:", A.ndim)
#size of array
print("Size:", A.size)
#type of each element in the array
print("Element type:", A.dtype)
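Output (under Python 3; the element type may be int32 on some platforms):
Array A is:
 [[1 2 3]
 [1 2 3]
 [1 2 3]]
Type: <class 'numpy.ndarray'>
Shape: (3, 3)
Rank: 2
Size: 9
Element type: int64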
Indexing & Slicing
The contents of an ndarray object can be accessed and modified by indexing or slicing, just like Python’s built-in container objects.
Items in an ndarray object follow zero-based indexing. Three types of indexing methods are available: field access, basic slicing and advanced indexing.
Basic slicing is an extension of Python’s basic concept of slicing to n dimensions. A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is passed to the array to extract a part of the array.
Example #1
import numpy as np
a = np.arange(10)
s = slice(2,7,2)
print(a[s])
Its output is as follows −
[2 4 6]
In the above example, an ndarray object is prepared by the arange() function. Then a slice object is defined with start, stop, and step values 2, 7, and 2 respectively. When this slice object is passed to the ndarray, the part starting at index 2 and stopping before index 7, with a step of 2, is sliced.
The same result can also be obtained by giving the slicing parameters separated by a colon : (start:stop:step) directly to the ndarray object.
Example #2
import numpy as np
a = np.arange(10)
b = a[2:7:2]
print(b)
Here, we will get the same output −
[2 4 6]
If only one parameter is given, the single item at that index is returned. If a : follows it, all items from that index onwards are extracted. If two parameters (with : between them) are used, items between the two indexes (not including the stop index) are sliced with a default step of one.
Example #3
# slice single item
import numpy as np
a = np.arange(10)
b = a[5]
print(b)
Its output is as follows −
5
Example #4
# slice items starting from index
import numpy as np
a = np.arange(10)
print(a[2:])
Now, the output would be −
[2 3 4 5 6 7 8 9]
Example #5
# slice items between indexes
import numpy as np
a = np.arange(10)
print(a[2:5])
Here, the output would be −
[2 3 4]
The above description applies to multi-dimensional ndarray too.
Example #6
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)
# slice items starting from index
print('Now we will slice the array from the index a[1:]')
print(a[1:])
The output is as follows −
[[1 2 3]
 [3 4 5]
 [4 5 6]]
Now we will slice the array from the index a[1:]
[[3 4 5]
 [4 5 6]]
Slicing can also include an ellipsis (...) to make a selection tuple of the same length as the dimension of an array. If the ellipsis is used at the row position, it returns an ndarray comprising the items in rows.
Example #7
# array to begin with
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print('Our array is:')
print(a)
print('')
# this returns array of items in the second column
print('The items in the second column are:')
print(a[...,1])
print('')
# Now we will slice all items from the second row
print('The items in the second row are:')
print(a[1,...])
print('')
# Now we will slice all items from column 1 onwards
print('The items column 1 onwards are:')
print(a[...,1:])
The output of this program is as follows −
Our array is:
[[1 2 3]
 [3 4 5]
 [4 5 6]]

The items in the second column are:
[2 4 5]

The items in the second row are:
[3 4 5]

The items column 1 onwards are:
[[2 3]
 [4 5]
 [5 6]]
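The ellipsis becomes more useful as the number of dimensions grows. As a sketch, assuming an illustrative 3D array of shape (2, 3, 4):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# The ellipsis expands to as many ':' as are needed to fill the selection tuple.
last_axis_first = a[..., 0]   # same as a[:, :, 0]
first_block = a[0, ...]       # same as a[0, :, :]

print(last_axis_first.shape)  # (2, 3)
print(first_block.shape)      # (3, 4)
print(np.array_equal(last_axis_first, a[:, :, 0]))  # True
```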
Copies & Views
Some functions return a copy of the input array, while others return a view. When the contents are physically stored in another location, it is called a copy. If, on the other hand, a different view of the same memory content is provided, we call it a view.
No Copy
Simple assignment does not make a copy of the array object. Instead, the new name refers to the same object, so id() returns the same value for both. The id() function returns a unique identifier of a Python object, similar to a pointer in C.
Furthermore, any change in either is reflected in the other. For example, changing the shape of one changes the shape of the other too.
Example
import numpy as np
a = np.arange(6)
print('Our array is:')
print(a)
print('Applying id() function:')
print(id(a))
print('a is assigned to b:')
b = a
print(b)
print('b has same id():')
print(id(b))
print('Change shape of b:')
b.shape = 3,2
print(b)
print('Shape of a also gets changed:')
print(a)
It will produce the following output −
Our array is:
[0 1 2 3 4 5]
Applying id() function:
139747815479536
a is assigned to b:
[0 1 2 3 4 5]
b has same id():
139747815479536
Change shape of b:
[[0 1]
 [2 3]
 [4 5]]
Shape of a also gets changed:
[[0 1]
 [2 3]
 [4 5]]
View or Shallow Copy
NumPy has the ndarray.view() method, which returns a new array object that looks at the same data as the original array. Unlike the earlier case, a change in the dimensions of the new array doesn’t change the dimensions of the original.
Example
import numpy as np
# To begin with, a is 3X2 array
a = np.arange(6).reshape(3,2)
print('Array a:')
print(a)
print('Create view of a:')
b = a.view()
print(b)
print('id() for both the arrays are different:')
print('id() of a:')
print(id(a))
print('id() of b:')
print(id(b))
# Change the shape of b. It does not change the shape of a
b.shape = 2,3
print('Shape of b:')
print(b)
print('Shape of a:')
print(a)
It will produce the following output −
Array a:
[[0 1]
 [2 3]
 [4 5]]
Create view of a:
[[0 1]
 [2 3]
 [4 5]]
id() for both the arrays are different:
id() of a:
140424307227264
id() of b:
140424151696288
Shape of b:
[[0 1 2]
 [3 4 5]]
Shape of a:
[[0 1]
 [2 3]
 [4 5]]
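A view has its own shape, but it still shares the underlying data with the original. The sketch below illustrates this; it assumes a reasonably recent NumPy, since np.shares_memory is not available in very old releases, and the value 99 is arbitrary:

```python
import numpy as np

a = np.arange(6).reshape(3, 2)
b = a.view()

b.shape = (2, 3)   # reshaping the view leaves a's shape alone
b[0, 0] = 99       # but writing through the view changes a's data

print(a.shape)                 # (3, 2)
print(a[0, 0])                 # 99
print(np.shares_memory(a, b))  # True
```

This is the practical difference between a view and a deep copy: the metadata is separate, the data is not.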
A slice of an array creates a view.
Example
import numpy as np
a = np.array([[10,10], [2,3], [4,5]])
print('Our array is:')
print(a)
print('Create a slice:')
s = a[:, :2]
print(s)
It will produce the following output −
Our array is:
[[10 10]
 [ 2  3]
 [ 4  5]]
Create a slice:
[[10 10]
 [ 2  3]
 [ 4  5]]
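Because a slice is a view, writing to it changes the original array. A minimal sketch, where the one-column slice and the value -1 are arbitrary choices:

```python
import numpy as np

a = np.array([[10, 10], [2, 3], [4, 5]])
s = a[:, :1]     # first column, taken as a slice

s[0, 0] = -1     # write through the slice

print(a[0, 0])      # -1: the original array changed too
print(s.base is a)  # True: s borrows a's memory
```

If you need an independent piece of an array, call .copy() on the slice.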
Deep Copy
The ndarray.copy() function creates a deep copy. It is a complete copy of the array and its data, and it does not share memory with the original array.
Example
import numpy as np
a = np.array([[10,10], [2,3], [4,5]])
print('Array a is:')
print(a)
print('Create a deep copy of a:')
b = a.copy()
print('Array b is:')
print(b)
# b does not share any memory of a
print('Can we write b is a')
print(b is a)
print('Change the contents of b:')
b[0,0] = 100
print('Modified array b:')
print(b)
print('a remains unchanged:')
print(a)
It will produce the following output −
Array a is:
[[10 10]
 [ 2  3]
 [ 4  5]]
Create a deep copy of a:
Array b is:
[[10 10]
 [ 2  3]
 [ 4  5]]
Can we write b is a
False
Change the contents of b:
Modified array b:
[[100  10]
 [  2   3]
 [  4   5]]
a remains unchanged:
[[10 10]
 [ 2  3]
 [ 4  5]]
Universal Functions: Fast Element-wise Array Functions
A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
Many ufuncs are simple elementwise transformations, like sqrt or exp:
In [120]: arr = np.arange(10)

In [121]: np.sqrt(arr)
Out[121]:
array([ 0.    ,  1.    ,  1.4142,  1.7321,  2.    ,  2.2361,  2.4495,
        2.6458,  2.8284,  3.    ])

In [122]: np.exp(arr)
Out[122]:
array([    1.    ,     2.7183,     7.3891,    20.0855,    54.5982,
         148.4132,   403.4288,  1096.6332,  2980.958 ,  8103.0839])
These are referred to as unary ufuncs. Others, such as add or maximum, take 2 arrays (thus, binary ufuncs) and return a single array as the result:
In [123]: x = np.random.randn(8)

In [124]: y = np.random.randn(8)

In [125]: x
Out[125]:
array([ 0.0749,  0.0974,  0.2002, -0.2551,  0.4655,  0.9222,  0.446 ,
       -0.9337])

In [126]: y
Out[126]:
array([ 0.267 , -1.1131, -0.3361,  0.6117, -1.2323,  0.4788,  0.4315,
       -0.7147])

In [127]: np.maximum(x, y) # element-wise maximum
Out[127]:
array([ 0.267 ,  0.0974,  0.2002,  0.6117,  0.4655,  0.9222,  0.446 ,
       -0.7147])
While not common, a ufunc can return multiple arrays. modf is one example, a vectorized version of the built-in Python divmod: it returns the fractional and integral parts of a floating point array:
In [128]: arr = np.random.randn(7) * 5

In [129]: np.modf(arr)
Out[129]:
(array([-0.6808,  0.0636, -0.386 ,  0.1393, -0.8806,  0.9363, -0.883 ]),
 array([-2.,  4., -3.,  5., -3.,  3., -6.]))
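The sessions above use random numbers, so their output changes on every run. For a repeatable sketch of unary, binary and multi-output ufuncs, here is a version with fixed, made-up inputs:

```python
import numpy as np

arr = np.array([0.0, 1.0, 4.0, 9.0])
roots = np.sqrt(arr)            # unary ufunc: one array in, one array out

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 4.0, -1.0])
largest = np.maximum(x, y)      # binary ufunc: elementwise maximum
total = np.add(x, y)            # binary ufunc: elementwise sum

# modf returns two arrays: the fractional parts and the integral parts.
frac, whole = np.modf(np.array([1.5, -2.25]))

print(roots, largest, total)
print(frac, whole)
```

Note that both parts returned by modf keep the sign of the input, so -2.25 splits into -0.25 and -2.0.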
Advantages of NumPy
The points below explain the advantages of NumPy:
- The core of NumPy is its arrays. One of the main advantages of using NumPy arrays is that they take less memory space and provide better runtime speed when compared with similar data structures in Python (lists and tuples).
- NumPy supports specific scientific functions such as linear algebra, which help us in solving linear equations.
- NumPy supports vectorized operations, like elementwise addition and multiplication, computing the Kronecker product, etc. Python lists do not support these features.
- It is a very good substitute for MATLAB, Octave, etc., as it provides similar functionality and supports faster development with less mental overhead (as Python is easy to write and comprehend).
- NumPy is very good for data analysis.
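The vectorized operations mentioned in the list can be sketched as follows. The vectors u and v are arbitrary examples:

```python
import numpy as np

u = np.array([1, 2])
v = np.array([3, 4])

added = u + v             # elementwise addition
multiplied = u * v        # elementwise multiplication
kron = np.kron(u, v)      # Kronecker product

# With plain lists, + concatenates instead of adding element by element.
joined = [1, 2] + [3, 4]

print(added, multiplied, kron)  # [4 6] [3 8] [3 4 6 8]
print(joined)                   # [1, 2, 3, 4]
```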
Disadvantages of NumPy
The points below explain the disadvantages of NumPy:
- Using “nan” in NumPy: “nan” stands for “not a number” and is used to represent missing values. NumPy itself supports “nan”, but plain Python offers little support for it, which makes it awkward to work with. For example, nan never compares equal to anything, including itself, so we may face problems when comparing values within the Python interpreter.
- Requires a contiguous allocation of memory: insertion and deletion operations become costly because the data is stored in contiguous memory locations, so adding or removing an element means shifting or copying the rest of the array.
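Both caveats are easy to observe. In this sketch the array values are made up for illustration:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# nan never compares equal, even to itself, so == cannot find missing values.
nan_equal = (np.nan == np.nan)
# np.isnan is the reliable test.
mask = np.isnan(values)
print(nan_equal)  # False
print(mask)       # [False  True False]

# np.insert does not modify values in place; it allocates and returns a new array.
longer = np.insert(values, 1, 2.0)
print(longer.size, values.size)  # 4 3
```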
Linear Algebra with NumPy
The NumPy ndarray class is used to represent both matrices and vectors. To construct a matrix in NumPy we list the rows of the matrix in a list and pass that list to the NumPy array constructor.
For example, to construct a NumPy array that corresponds to the matrix
$A = \begin{bmatrix} 1 & -1 & 2 \\ 3 & 2 & 0 \end{bmatrix}$
we would do
A = np.array([[1,-1,2],[3,2,0]])
Vectors are just arrays with a single column. For example, to construct a vector
$v = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}$
we would do
v = np.array([[2],[1],[3]])
A more convenient approach is to transpose the corresponding row vector. For example, to make the vector above we could instead transpose the row vector
$\begin{bmatrix} 2 & 1 & 3 \end{bmatrix}$
The code for this is
v = np.transpose(np.array([[2,1,3]]))
NumPy overloads the array index and slicing notations to access parts of a matrix. For example, to print the bottom right entry in the matrix A we would do
print(A[1,2])
To slice out the second column in the A matrix we would do
col = A[:,1:2]
The first slice selects all rows in A, while the second slice selects just the middle entry in each row.
To do a matrix multiplication or a matrix-vector multiplication we use the np.dot() function.
w = np.dot(A,v)
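Collected into one runnable sketch, the steps of this section look like this:

```python
import numpy as np

A = np.array([[1, -1, 2], [3, 2, 0]])
v = np.array([[2], [1], [3]])

print(A[1, 2])       # 0, the bottom right entry
col = A[:, 1:2]      # second column, kept as a 2x1 array
print(col.shape)     # (2, 1)

w = np.dot(A, v)     # a 2x3 matrix times a 3x1 vector gives a 2x1 vector
print(w)             # the two entries are 7 and 8
```

Slicing with 1:2 rather than indexing with 1 keeps the result two-dimensional, which is why col is a column rather than a flat array.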
Solving systems of equations with numpy
One of the more common problems in linear algebra is solving a matrix-vector equation. Here is an example. We seek the vector x that solves the equation
A x = b
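where
$A = \begin{bmatrix} 2 & 1 & -2 \\ 3 & 0 & 1 \\ 1 & 1 & -1 \end{bmatrix}, \quad b = \begin{bmatrix} -3 \\ 5 \\ -2 \end{bmatrix}$
We start by constructing the arrays for A and b.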
A = np.array([[2,1,-2],[3,0,1],[1,1,-1]])
b = np.transpose(np.array([[-3,5,-2]]))
To solve the system we do
x = np.linalg.solve(A,b)
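It is worth checking a computed solution by substituting it back into the equation. A sketch of the full calculation with that check added:

```python
import numpy as np

A = np.array([[2, 1, -2], [3, 0, 1], [1, 1, -1]])
b = np.transpose(np.array([[-3, 5, -2]]))

x = np.linalg.solve(A, b)
print(x)  # approximately 1, -1 and 2, as a column

# Substitute back: A x should reproduce b up to rounding error.
residual_ok = np.allclose(np.dot(A, x), b)
print(residual_ok)  # True
```

np.allclose is used instead of == because floating point results are rarely exact.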
Application: multiple linear regression
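In a multiple regression problem we seek a function that can map input data points to outcome values. Each data point is a feature vector (x1 , x2 , …, xm) composed of two or more data values that capture various features of the input. To represent all of the input data along with the vector of output values we set up an input matrix X and an output vector y:
$X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,m} \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$
Each row of X holds one data point, with a leading 1 for the intercept term, matching the way the example program below builds its rows.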
In a simple least-squares linear regression model we seek a vector β such that the product Xβ most closely approximates the outcome vector y.
Once we have constructed the β vector we can use it to map input data to predicted outcomes. Given an input vector in the form
$(1, x_1, x_2, \ldots, x_m)$
we can compute a predicted outcome value
$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$
The formula to compute the β vector is
$\beta = (X^T X)^{-1} X^T y$
In our next example program I will use numpy to construct the appropriate matrices and vectors and solve for the β vector. Once we have solved for β we will use it to make predictions for some test data points that we initially left out of our input data set.
Assuming we have constructed the input matrix X and the outcomes vector y in numpy, the following code will compute the β vector:
Xt = np.transpose(X)
XtX = np.dot(Xt,X)
Xty = np.dot(Xt,y)
beta = np.linalg.solve(XtX,Xty)
The last line uses np.linalg.solve to compute β, since the equation
$\beta = (X^T X)^{-1} X^T y$
is mathematically equivalent to the system of equations
$(X^T X) \beta = X^T y$
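Before moving to real data, the normal-equations approach can be sanity-checked on synthetic data where the answer is known. Everything in this sketch is made up for illustration: the seed, the 50 data points, and the coefficients in true_beta. The cross-check uses np.linalg.lstsq, which assumes a NumPy version that accepts rcond=None:

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable

# Synthetic data: 50 points, 2 features, plus a leading column of ones.
features = rng.rand(50, 2)
X = np.hstack([np.ones((50, 1)), features])
true_beta = np.array([[1.0], [2.0], [-3.0]])   # made-up coefficients
y = np.dot(X, true_beta)                       # noise-free outcomes

# Normal equations: solve (X^T X) beta = X^T y.
Xt = np.transpose(X)
beta = np.linalg.solve(np.dot(Xt, X), np.dot(Xt, y))

# Cross-check against NumPy's least-squares routine.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta, true_beta))   # True
print(np.allclose(beta, beta_lstsq))  # True
```

With noise-free data the recovered coefficients match the ones used to generate y; on real data, such as the housing example, the fit is only approximate.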
The data set I will use for this example is the Windsor house price data set, which contains information about home sales in the Windsor, Ontario area. The input variables cover a range of factors that may potentially have an impact on house prices, such as lot size, number of bedrooms, and the presence of various amenities. A CSV file with the full data set is available here. I downloaded the data set from this site, which offers a large number of data sets covering a large range of topics.
Here now is the source code for the example program.
import csv
import numpy as np

def readData():
    X = []
    y = []
    with open('Housing.csv') as f:
        rdr = csv.reader(f)
        # Skip the header row
        next(rdr)
        # Read X and y
        for line in rdr:
            xline = [1.0]
            for s in line[:-1]:
                xline.append(float(s))
            X.append(xline)
            y.append(float(line[-1]))
    return (X,y)

X0,y0 = readData()
# Convert all but the last 10 rows of the raw data to numpy arrays
d = len(X0)-10
X = np.array(X0[:d])
y = np.transpose(np.array([y0[:d]]))

# Compute beta
Xt = np.transpose(X)
XtX = np.dot(Xt,X)
Xty = np.dot(Xt,y)
beta = np.linalg.solve(XtX,Xty)
print(beta)

# Make predictions for the last 10 rows in the data set
for data,actual in zip(X0[d:],y0[d:]):
    x = np.array([data])
    prediction = np.dot(x,beta)
    print('prediction = '+str(prediction[0,0])+' actual = '+str(actual))
The original data set consists of over 500 entries. To test the accuracy of the predictions made by the linear regression model we use all but the last 10 data entries to build the regression model and compute β. Once we have constructed the β vector we use it to make predictions for the last 10 input values and then compare the predicted home prices against the actual home prices from the data set.
Here are the outputs produced by the program:
[[ -4.14106096e+03]
[ 3.55197583e+00]
[ 1.66328263e+03]
[ 1.45465644e+04]
[ 6.77755381e+03]
[ 6.58750520e+03]
[ 4.44683380e+03]
[ 5.60834856e+03]
[ 1.27979572e+04]
[ 1.24091640e+04]
[ 4.19931185e+03]
[ 9.42215457e+03]]
prediction = 97360.6550969 actual = 82500.0
prediction = 71774.1659014 actual = 83000.0
prediction = 92359.0891976 actual = 84000.0
prediction = 77748.2742379 actual = 85000.0
prediction = 91015.5903066 actual = 85000.0
prediction = 97545.1179047 actual = 91500.0
prediction = 97360.6550969 actual = 94000.0
prediction = 106006.800756 actual = 103000.0
prediction = 92451.6931269 actual = 105000.0
prediction = 73458.2949381 actual = 105000.0