January 25, 2024
When I began my journey into data science and analysis, Python and R were the two programming languages I was presented with. As a beginner, it can be challenging to know which to choose. After consulting industry experts and a few friends, I quickly learned that while R is versatile and very effective, Python is widely used across companies and has emerged as a linchpin, primarily due to its simplicity and the powerful libraries it offers.
Working with Python libraries has been eye-opening. I discovered that two of the most pivotal libraries for data manipulation, exploration, analysis, and machine learning are Pandas and NumPy. They significantly simplify data manipulation, making it accessible to beginners and invaluable to seasoned professionals. My aim in this post is to introduce the basics of data manipulation with Pandas and NumPy, providing a foundation from which you can glimpse the breadth of data analysis in Python.
Overview of Pandas
Pandas is an open-source data analysis and manipulation library for Python. Wes McKinney began developing it in 2008, and it was released as open source in 2009; the name is derived from “panel data” and is also a play on “Python data analysis”. Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
The core of Pandas is its DataFrame: a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure (its columns can hold numerical, ordinal, or categorical data) with labeled axes. axis = 0 refers to the rows (the index) and axis = 1 refers to the columns, so an operation applied along axis = 0 runs down the rows and returns one result per column, while an operation along axis = 1 runs across the columns and returns one result per row. Think of a DataFrame as a spreadsheet or SQL table in Python. Pandas provides an arsenal of functions to perform operations such as reading, writing, cleaning, slicing, aggregating, merging, and much more on a DataFrame.
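The axis convention is easier to see on a tiny table. The sketch below uses a made-up `scores` DataFrame (the names and numbers are only for illustration, not from any real dataset) and sums it along each axis:

```python
import pandas as pd

# Hypothetical scores table, used only to illustrate the axis argument.
scores = pd.DataFrame(
    {"math": [90, 80, 70], "physics": [60, 50, 40]},
    index=["a", "b", "c"],
)

# axis=0 collapses the rows: one result per column.
column_totals = scores.sum(axis=0)

# axis=1 collapses the columns: one result per row.
row_totals = scores.sum(axis=1)

print(column_totals)
print(row_totals)
```

Here `column_totals` is labeled by column name (`math`, `physics`) and `row_totals` is labeled by the row index (`a`, `b`, `c`).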
What is NumPy?
NumPy, short for Numerical Python, is a foundational package for numerical computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
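To make these terms concrete, here is a small sketch with an illustrative 2 x 3 array (the variable name `matrix` is my own). NumPy exposes the number of dimensions as `ndim` and the shape as `shape`:

```python
import numpy as np

# A two-dimensional array: 2 rows, 3 columns, all values of one type.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.ndim)   # number of dimensions
print(matrix.shape)  # size along each dimension, as a tuple
print(matrix.dtype)  # the single element type shared by every value

# Indexing with a tuple of nonnegative integers: row 1, column 2.
print(matrix[1, 2])
```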
NumPy arrays facilitate advanced mathematical and other types of operations on large amounts of data. Typically, such operations execute more efficiently and require less code than is possible using Python’s built-in sequences.
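As a rough illustration of the "less code" point, the sketch below computes the same sum of squares twice: once with a plain Python list and once with a NumPy array. The variable names are my own, and I have not timed anything here; the example only shows that the two approaches agree.

```python
import numpy as np

# Plain Python: build a list of squares, then add them up.
values = list(range(1, 1001))
total_list = sum(v * v for v in values)

# NumPy: square every element at once, then sum the array.
# dtype is set explicitly so the result does not depend on the platform default.
arr_values = np.arange(1, 1001, dtype=np.int64)
total_np = int((arr_values ** 2).sum())

print(total_list, total_np)
```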
Getting Started with Pandas and NumPy
To begin working with Pandas and NumPy, you first need to install them using pip:
pip install pandas numpy
Once installed, you can import these libraries in your Python script as follows:
import numpy as np
import pandas as pd
Basic Operations with Pandas
Let’s create a simple DataFrame to demonstrate some basic Pandas operations:
# Creating a DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
With this DataFrame, you can perform a variety of operations. For example, to select a single column, you would use:
# Selecting a single column
print(df['Name'])
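Selection goes beyond a single column. The sketch below rebuilds the same DataFrame so it can run on its own, then shows a few other common ways to pick out data; the variable names `subset`, `over_30`, and `first_row` are my own:

```python
import pandas as pd

# Same example data as above, repeated so this snippet is self-contained.
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

# Selecting several columns: pass a list of column names.
subset = df[['Name', 'City']]

# Filtering rows: keep only the rows where the condition is True.
over_30 = df[df['Age'] > 30]

# Selecting one row by its integer position.
first_row = df.iloc[0]

print(subset)
print(over_30)
print(first_row)
```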
To add a new column:
# Adding a new column
df['Salary'] = [70000, 80000, 75000, 65000]
print(df)
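A new column can also be computed from existing ones. This is a minimal sketch that rebuilds the DataFrame with the same salary figures; the `Monthly_Salary` column name is just something I made up for the example:

```python
import pandas as pd

# Same example data as above, repeated so this snippet is self-contained.
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London'],
                   'Salary': [70000, 80000, 75000, 65000]})

# Derive a column from an existing one (hypothetical column name).
df['Monthly_Salary'] = df['Salary'] / 12

# A simple aggregation over one column.
average_salary = df['Salary'].mean()

print(df)
print(average_salary)
```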
Basic Operations with NumPy
NumPy’s main object is the multidimensional array. To create a NumPy array, you can use the following syntax:
# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
You can perform a wide range of mathematical operations on arrays, each applied element by element:
# Basic operations
print(arr + 1)
print(arr * 2)
print(arr ** 2)
NumPy arrays also support element-wise comparisons, which return a Boolean array:
# Conditional operations
print(arr > 2)
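The Boolean array produced by a comparison can itself be used to pick out elements. A short sketch, using the same `arr` values as above (the names `mask`, `selected`, and `replaced` are my own):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# The comparison yields one True/False value per element.
mask = arr > 2

# Boolean indexing: keep only the elements where the mask is True.
selected = arr[mask]

# np.where chooses between two values based on the condition.
replaced = np.where(arr > 2, arr, 0)

print(selected)
print(replaced)
```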
Combining Pandas and NumPy
Pandas and NumPy are not mutually exclusive and can be used together to perform more complex data manipulation tasks. For instance, you can apply NumPy functions directly on Pandas Series or DataFrame objects:
# Applying a NumPy function to a Pandas Series (a single DataFrame column)
df['Age_Squared'] = np.square(df['Age'])
print(df)
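Two other ways the libraries fit together, sketched on the same example data: `np.where` can build a new column from a condition, and `to_numpy()` hands a column back as a plain NumPy array. The `Age_Group` column and its labels are hypothetical names I chose for the illustration.

```python
import numpy as np
import pandas as pd

# Same example data as above, repeated so this snippet is self-contained.
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 34, 29, 32]})

# Build a label column from a condition (hypothetical labels).
df['Age_Group'] = np.where(df['Age'] >= 30, '30+', 'under 30')

# Convert a column to a NumPy array and use NumPy functions on it.
ages = df['Age'].to_numpy()
mean_age = np.mean(ages)

print(df)
print(mean_age)
```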
Conclusion
Pandas and NumPy are powerful tools for anyone looking to dive into data analysis with Python. They offer an extensive range of functionalities that can handle most of your data manipulation needs. The best way to learn is by doing, so I encourage you to experiment with these libraries and explore their vast potential.
Whether you’re a novice looking to understand the basics of data manipulation or an experienced data scientist seeking to refine your skills, Pandas and NumPy provide the foundation you need. I hope this introduction sparks your interest in data analysis with Python and leads you to further explore what you can accomplish with these libraries.
I welcome feedback, questions, or discussions on this topic. Share your thoughts in the comments below or on social media. Let’s dive into