Data Transformation
Last updated on 2024-12-09
Estimated time: 120 minutes
Overview
Questions
- How can I process tabular data files in Python?
Objectives
- Explain what a library is and what libraries are used for.
- Import a Python library and use the functions it contains.
- Read tabular data from a file into a program.
- Clean and prepare data.
- Merge and reshape data.
- Handle missing values.
- Aggregate and summarize data.
Introduction
What is a Package/Library in Python?
In Python, a library (or package) is a collection of pre-written code that you can use to perform common tasks without needing to write the code from scratch. It’s a toolbox that provides tools (functions, classes, and modules) to help you solve specific problems.
Python packages save time and effort by providing solutions to common programming challenges. Instead of reinventing the wheel, you can import these tools into your scripts and use them to complete tasks more easily.
A Python package for data manipulations: pandas
In this lesson, we will focus on using the pandas library in Python to perform common data wrangling tasks.
pandas is an open-source Python library for data manipulation and analysis. It provides data structures like DataFrame and Series that make it easy to handle and analyze data.
Series
A Series is a one-dimensional labeled array, similar to a list. It can hold any data type, such as integers, strings, or even more complex data types.
Key Features:
- It’s one-dimensional, so it holds data in a single column.
- It has an index (labels) that you can use to access specific elements.
- Each element in the Series has a label (index) and a value.
Example of a Series:
PYTHON
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
# Print the Series
print(series)
Output:
OUTPUT
0 10
1 20
2 30
3 40
4 50
dtype: int64
The index is 0, 1, 2, 3, 4, and the values are 10, 20, 30, 40, 50. pandas automatically creates an index for you (starting from 0), but you can also specify a custom index.
DataFrame
A DataFrame is a two-dimensional, table-like structure (similar to a spreadsheet or SQL table) that can hold multiple Series. It is the most commonly used pandas object.
A DataFrame consists of:
- Rows (with an index, just like a Series),
- Columns (which are each Series).
You can think of a DataFrame as a collection of Series that share the same index.
Key Features:
- It’s two-dimensional (i.e., it has rows and columns).
- Each column is a Series.
- It has both row and column labels (indexes and column names).
- It can hold multiple data types (integers, strings, floats, etc.).
Example of a DataFrame:
PYTHON
import pandas as pd
# Create a DataFrame using a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Print the DataFrame
print(df)
Output:
OUTPUT
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
Here:
- The rows are indexed from 0, 1, 2 (default index).
- The columns are Name, Age, and City.
- Each column is a Series, so the Name column is a Series, the Age column is another Series, etc.
Import methods
In Python, libraries (or modules) can be imported into your code using the import statement. This allows you to access the functions, classes, and methods defined in that library. There are several ways to do it:
- Full import: import pandas. Use with pandas.DataFrame(), pandas.Series(), etc.
- Import with alias: import pandas as pd. Use with pd.DataFrame(), pd.Series(), etc.
- Import specific functions or classes: from pandas import DataFrame. Use directly as DataFrame().
- Import multiple specific elements: from pandas import DataFrame, Series.
In general, we use option 2 (import with an alias) for pandas.
Loading data
Loading CSV data
You can load a CSV file into a pandas DataFrame using the read_csv() function:
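PYTHON
import pandas as pd
# Read the CSV file into a DataFrame named df
df = pd.read_csv('path/to/file.csv')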
- read_csv() reads the CSV file located at 'path/to/file.csv' and loads it into a pandas DataFrame (df).
- By default, it assumes that the file has a header row (i.e., column names) in the first row.
If the file does not have a header, you can use the header=None parameter to let pandas generate default column names.
You can pass arguments like sep if the file uses a different delimiter (e.g., tab-separated values).
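For instance, a sketch for a tab-separated file without a header row (the path is a placeholder):
PYTHON
# Tab-separated file with no header row; pandas assigns default column names (0, 1, 2, ...)
df = pd.read_csv('path/to/file.tsv', sep='\t', header=None)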
Data exploration
Once you’ve loaded your data, it’s important to understand its structure and contents.
Viewing the first few rows
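Assuming df is the DataFrame loaded above, head() shows the first five rows by default:
PYTHON
# Display the first five rows
print(df.head())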
If you want to see more (or fewer) rows, you can pass a number to head(), such as df.head(10) to view the first 10 rows.
Similarly, you can use the tail() method to view the last few rows of the DataFrame.
Unique values in columns
To get a sense of the distinct values in a column, the unique() and value_counts() methods are useful.
The unique() method shows all the unique values in a column.
The value_counts() method returns the count of unique values in the column, sorted in descending order. This is particularly useful for categorical data.
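A minimal sketch, assuming df has a column named 'City':
PYTHON
# All distinct values in the 'City' column
print(df['City'].unique())
# How often each value appears, sorted in descending order
print(df['City'].value_counts())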
Checking for Missing Values
The isnull() method returns a DataFrame of the same shape as df, where each element is a boolean (True for missing values and False for non-missing values).
To get the total number of missing values in each column, you can chain sum() to isnull().
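PYTHON
# Boolean DataFrame marking missing values
print(df.isnull())
# Total number of missing values per column
print(df.isnull().sum())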
This gives you a count of how many missing values are present in each column.
Summary statistics
To get a quick overview of the numerical data, you can use:
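PYTHON
# Summary statistics for all numeric columns
print(df.describe())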
The describe() method provides summary statistics for all numeric columns, including:
- count: the number of non-null entries
- mean: the average value
- std: the standard deviation
- min / max: the minimum and maximum values
- 25%, 50%, 75%: the percentiles
Checking the data types
To understand the types of data in each column:
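PYTHON
# Data type of each column
print(df.dtypes)
# df.info() prints a fuller overview: column names, non-null counts, and dtypes
df.info()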
Exploring an SDMX dataset
Using the education.csv dataset in the materials for this episode, write the lines of code to:
- Import pandas
- Load the dataset into a pandas DataFrame
- Print the list of columns in this dataset
- Print the unique values of the REF_AREA column
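A possible solution, assuming education.csv sits in the working directory:
PYTHON
import pandas as pd
# Load the dataset
df = pd.read_csv('education.csv')
# List of columns
print(df.columns)
# Unique values of the REF_AREA column
print(df['REF_AREA'].unique())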
Cleaning data
Renaming columns
You may want to rename columns for clarity:
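For instance, a sketch using the placeholder column names old_name and new_name:
PYTHON
# Rename a column in place
df.rename(columns={'old_name': 'new_name'}, inplace=True)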
Callout
The inplace=True parameter means that we are modifying the original DataFrame df directly.
By default, inplace=False, which means the following line won't rename the old_name column:
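PYTHON
# Without inplace=True, this returns a new DataFrame and leaves df unchanged
df.rename(columns={'old_name': 'new_name'})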
An alternative is to create a new DataFrame with the renamed columns and assign it back to df:
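PYTHON
# Assign the result back to df instead of using inplace=True
df = df.rename(columns={'old_name': 'new_name'})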
The inplace parameter is present in many other pandas methods.
Dropping columns
If you no longer need a column, you can drop it:
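For example (the column name is a placeholder):
PYTHON
# Drop a column that is no longer needed
df = df.drop(columns=['column_to_drop'])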
You can also select specific columns from a DataFrame by passing a list of column names to the DataFrame inside double square brackets.
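A sketch with hypothetical column names 'A' and 'B':
PYTHON
# Keep only columns 'A' and 'B'
df_selected = df[['A', 'B']]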
Removing duplicates
To remove duplicate rows, use the drop_duplicates()
method, which removes all duplicate rows by default.
You can also specify which columns to check for duplicates by passing a subset to the subset parameter:
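PYTHON
# Remove rows that are exact duplicates of another row
df = df.drop_duplicates()
# Remove rows that are duplicated in column 'A' only
df = df.drop_duplicates(subset=['A'])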
This will remove duplicates based only on column ‘A’.
Handling missing data
Missing data is common in real-world datasets. You can handle it in various ways:
- Dropping rows with missing values:
- Filling missing values with a default value:
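Both approaches are sketched below (the fill value 0 is just an example):
PYTHON
# Drop rows that contain at least one missing value
df = df.dropna()
# Or fill missing values with a default value instead
df = df.fillna(0)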
First cleaning steps
Using the education.csv dataset in the materials for this episode (continuing on the script of the previous exercise), write the lines of code to:
- Keep only the following columns: REF_AREA, AGE, SUBJ_TYPE, OBS_VALUE, REF_PERIOD
- Rename them with simpler names: iso3, age, subject, value, year
- Drop rows with missing data
PYTHON
# Keeping only the necessary columns
df = df[["REF_AREA", "AGE", "SUBJ_TYPE", "OBS_VALUE", "REF_PERIOD"]]
# Rename them
df.rename(columns={
"REF_AREA": "iso3",
"AGE": "age",
"SUBJ_TYPE": "subject",
"OBS_VALUE": "value",
"REF_PERIOD": "year"},
inplace=True)
# Drop rows with missing data
df.dropna(inplace=True)
Transforming data
Filtering rows
You can filter rows based on certain conditions. For example, to filter for rows where the column age is greater than 30:
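A sketch using boolean indexing, assuming df has an age column:
PYTHON
# Keep only the rows where age is greater than 30
filtered_df = df[df['age'] > 30]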
Another way to do this is to use loc (label-based indexing):
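PYTHON
# Equivalent filter using loc
filtered_df = df.loc[df['age'] > 30]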
Replacing values based on condition
loc can also be used to replace values in a DataFrame based on conditions. Let's assume we have the following DataFrame, and we want to update certain values based on specific conditions.
PYTHON
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
OUTPUT
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 San Francisco
Suppose we want to replace the City for anyone over the age of 30 with Seattle.
PYTHON
# Replace 'City' with 'Seattle' where 'Age' is greater than 30
df.loc[df['Age'] > 30, 'City'] = 'Seattle'
print("\nUpdated DataFrame:")
print(df)
OUTPUT
Updated DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Seattle
3 David 40 Seattle
- df['Age'] > 30: This is the condition used to filter rows where the Age is greater than 30.
- df.loc[df['Age'] > 30, 'City']: This selects the City column for those rows where the condition is true.
- = 'Seattle': This replaces the value in the City column with 'Seattle' for those rows.
In the above example, the cities for Charlie and David were changed to Seattle because their ages are greater than 30.
Replace values using map
The map() function in pandas allows you to apply a mapping or a function to each element in a Series. You can use map() with a dictionary to replace values in a Series according to the mapping provided.
PYTHON
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'City': ['NY', 'LA', 'CHI', 'SF']
}
df = pd.DataFrame(data)
# Create a dictionary for mapping
city_map = {
'NY': 'New York',
'LA': 'Los Angeles',
'CHI': 'Chicago',
'SF': 'San Francisco'
}
# Apply the map function to replace city abbreviations
df['City'] = df['City'].map(city_map)
print(df)
OUTPUT
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie Chicago
3 David San Francisco
- city_map is a dictionary where the keys are the city abbreviations and the values are the full city names.
- df['City'].map(city_map): This replaces the city abbreviations with the corresponding full city names from the city_map dictionary.
Transforming the dataset for easier analysis
We are still using the education.csv dataset in the materials for this episode (continuing on the script of the previous exercise).
- Now, we would like to focus on a subset of education subjects, instead of using the full list (17 subjects). Write the lines of code to select only the 8 subjects listed below. You can use the isin() method in pandas to filter rows.
- You may have noticed that the column for subject labels in the raw data was filled with missing values. For better readability, we will transform subject codes into labels, using the following mapping:
  - READ: "Reading, writing and literature"
  - MATH: "Mathematics"
  - NSCI: "Natural sciences"
  - SSCI: "Social sciences"
  - SLAN: "Second language"
  - OLAN: "Other languages"
  - PHED: "Physical education and health"
  - ARTS: "Arts"
PYTHON
# Selecting only the 8 main subjects and assign to a new dataframe
df_subset = df.loc[df['subject'].isin(["READ", "MATH", "NSCI", "SSCI", "SLAN", "OLAN", "PHED", "ARTS"])].copy()  # .copy() avoids a SettingWithCopyWarning when adding the label column below
# Adding labels
df_subset["subject_label"] = df_subset["subject"].map({
"READ": "Reading, writing and literature",
"MATH": "Mathematics",
"NSCI": "Natural sciences",
"SSCI": "Social sciences",
"SLAN": "Second language",
"OLAN": "Other languages",
"PHED": "Physical education and health",
"ARTS": "Arts"})
Pivoting data
Pivoting and melting are two important operations for reshaping data in pandas. They are used to transform a DataFrame from “long” format to “wide” format, and vice versa. These operations allow you to better organize your data depending on your analysis needs.
Pivot
The pivot() method reshapes the data by turning unique values from one column into new columns. It's useful when you want to convert a "long" format DataFrame (where each row represents a single observation) into a "wide" format (where each unique value becomes a column).
- index: Column to use as the new index of the DataFrame.
- columns: Column to use as the new columns.
- values: Column to use for populating the new DataFrame. If not specified, all remaining columns are used.
PYTHON
data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
'Temperature': [30, 75, 32, 77],
}
df = pd.DataFrame(data)
# Pivoting the data
pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)
OUTPUT
City Los Angeles New York
Date
2021-01-01 75 30
2021-01-02 77 32
- The Date column is used as the index.
- The City column values are turned into new columns.
- The Temperature column is used to populate the new DataFrame.
Melt
The melt() function is the opposite of pivot(). It transforms a DataFrame from wide format to long format.
- id_vars: Columns that should remain in the "long" format (i.e., columns that will not be unpivoted).
- value_vars: Columns to unpivot. If not specified, all columns not in id_vars will be unpivoted.
- var_name: Name for the new column that will hold the variable names (the original column names).
- value_name: Name for the new column that will hold the values.
PYTHON
data = {
'Date': ['2021-01-01', '2021-01-02'],
'New York': [30, 32],
'Los Angeles': [75, 77],
}
df = pd.DataFrame(data)
# Melting the data
melted_df = df.melt(id_vars=['Date'], var_name='City', value_name='Temperature')
print(melted_df)
OUTPUT
Date City Temperature
0 2021-01-01 New York 30
1 2021-01-02 New York 32
2 2021-01-01 Los Angeles 75
3 2021-01-02 Los Angeles 77
- The Date column remains fixed (as id_vars).
- The New York and Los Angeles columns are melted into a single City column, with corresponding values in the Temperature column.
Pivot method
You are given a dataset containing sales information for different products over a few months.
PYTHON
import pandas as pd
# Create the DataFrame
data = {
'Product': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Month': ['January', 'February', 'March', 'January', 'February', 'March', 'January', 'February', 'March'],
'Sales': [100, 150, 200, 80, 120, 160, 130, 170, 220]
}
df = pd.DataFrame(data)
Use the pivot()
method to rearrange this DataFrame so
that the months become columns, and each product’s sales data for each
month appears under its respective column.
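A possible solution:
PYTHON
# Products become the index, months become columns, sales fill the cells
pivot_df = df.pivot(index='Product', columns='Month', values='Sales')
print(pivot_df)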
Merging and joining data
Merging DataFrames
The merge() function in pandas is used to combine two DataFrames based on one or more common columns. It's similar to SQL joins and can be used for various purposes such as combining datasets or performing lookups.
The basic syntax for merging two DataFrames is:
PYTHON
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False)
- left: The first DataFrame.
- right: The second DataFrame.
- how: The type of merge to perform. Options include:
  - left: Use only keys from the left DataFrame (like a left join in SQL).
  - right: Use only keys from the right DataFrame (like a right join in SQL).
  - outer: Use keys from both DataFrames, filling in missing values with NaN (like a full outer join in SQL).
  - inner: Use only the common keys (like an inner join in SQL, default option).
- on: The column or index level names to join on. If not specified, it will join on columns with the same name in both DataFrames.
- left_on and right_on: Specify columns from the left and right DataFrames to merge on if the column names are different.
- left_index and right_index: If True, use the index of the DataFrames for merging instead of columns.
In the following example, we merge DataFrames on multiple columns by passing a list to the on parameter.
PYTHON
df1 = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter'],
'City': ['NY', 'LA', 'SF'],
'Age': [22, 25, 28]
})
df2 = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter'],
'City': ['NY', 'LA', 'DC'],
'Salary': [50000, 60000, 70000]
})
# Merge on multiple columns
merged_df = pd.merge(df1, df2, how='inner', on=['Name', 'City'])
print(merged_df)
OUTPUT
Name City Age Salary
0 John NY 22 50000
1 Anna LA 25 60000
Peter does not appear in the result: his City differs between the two DataFrames ('SF' vs 'DC'), so the inner merge finds no matching key for him.
Concatenating DataFrames
In addition to merging DataFrames, pandas provides the concat() function, which is useful for combining DataFrames along a particular axis (either rows or columns). While merge() is typically used for combining DataFrames based on a shared key or index, concat() is more straightforward and is generally used when you want to append or stack DataFrames together.
The basic syntax for concat() is:
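PYTHON
# Default values shown for axis, ignore_index and join
pd.concat([df1, df2], axis=0, ignore_index=False, join='outer')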
- [df1, df2]: A list of DataFrames to concatenate.
- axis: The axis along which to concatenate:
  - axis=0: Concatenate along rows (default behavior). This stacks DataFrames on top of each other.
  - axis=1: Concatenate along columns, aligning DataFrames side-by-side.
- ignore_index: If True, the index will be reset (i.e., it will generate a new index). If False, the original indices of the DataFrames are preserved.
- join: Determines how to handle indices (or columns when axis=1):
  - outer: Takes the union of the indices (or columns) from both DataFrames (default).
  - inner: Takes the intersection of the indices (or columns), excluding any non-overlapping indices (or columns).
When concatenating along rows (which is the default behavior), the DataFrames are stacked on top of each other, and the rows are added to the end of the previous DataFrame. This is commonly used to combine datasets with the same structure but with different data.
Here is an example for concatenating DataFrames with the same columns:
PYTHON
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['John', 'Anna', 'Peter']
})
df2 = pd.DataFrame({
'ID': [4, 5],
'Name': ['Linda', 'James']
})
# Concatenate along rows (stack vertically)
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated_df)
OUTPUT
ID Name
0 1 John
1 2 Anna
2 3 Peter
3 4 Linda
4 5 James
In this case:
- The two DataFrames df1 and df2 are stacked vertically.
- The ignore_index=True parameter ensures that the index is reset to a default integer index (0 to 4).
- If you didn't set ignore_index=True, the original indices from df1 and df2 would be preserved.
Merge method
We are given two datasets: one containing employee details and the other containing their department information. We want to merge these two datasets on the common column Employee_ID to create a single DataFrame that contains employee details along with their department names, while making sure we won't drop any observation.
PYTHON
import pandas as pd
# Create the Employee DataFrame
employee_data = {
'Employee_ID': [101, 102, 103, 104, 105],
'Employee_Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45]
}
employee_df = pd.DataFrame(employee_data)
# Create the Department DataFrame
department_data = {
'Employee_ID': [101, 102, 103, 106],
'Department': ['HR', 'Finance', 'IT', 'Marketing']
}
department_df = pd.DataFrame(department_data)
# Display both DataFrames
print("Employee DataFrame:")
print(employee_df)
print("\nDepartment DataFrame:")
print(department_df)
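A possible solution: an outer merge keeps every observation from both DataFrames, filling the gaps with NaN.
PYTHON
# Outer merge so that no employee or department row is dropped
merged_df = pd.merge(employee_df, department_df, how='outer', on='Employee_ID')
print(merged_df)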
Aggregating data
Aggregation is often used to summarize data by applying functions like sum, mean, etc., to groups of rows.
Grouping data
The groupby()
method in pandas is used to group data by
one or more columns. Once the data is grouped, you can apply an
aggregation function to each group.
Basic Grouping
Let’s assume we have a dataset of sales data that includes the following columns: store, product, and sales.
PYTHON
import pandas as pd
data = {
'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B'],
'product': ['apple', 'banana', 'apple', 'banana', 'apple', 'banana', 'banana', 'apple'],
'sales': [10, 20, 30, 40, 50, 60, 70, 80]
}
df = pd.DataFrame(data)
# Group by 'store' and calculate the total sales for each store
grouped = df.groupby('store')['sales'].sum()
print(grouped)
Output:
OUTPUT
store
A 100
B 150
C 110
Name: sales, dtype: int64
In this example, we grouped the data by store and calculated the total sales (sum) for each store.
Grouping by Multiple Columns
You can also group by multiple columns. Let’s say we want to calculate the total sales per store and per product:
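PYTHON
# Group by both store and product, then sum the sales within each group
grouped_multi = df.groupby(['store', 'product'])['sales'].sum()
print(grouped_multi)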
Output:
OUTPUT
store product
A apple 10
banana 90
B apple 110
banana 40
C apple 50
banana 60
Name: sales, dtype: int64
This shows the total sales for each combination of store and product.
Aggregation Functions
Once you've grouped the data, you can apply different aggregation functions. The most common ones include sum(), mean(), min(), max(), and count(). These can be used to summarize the data in various ways.
Calculating the Mean
To calculate the average sales per store, you can use the mean() function:
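PYTHON
# Average sales per store
mean_sales = df.groupby('store')['sales'].mean()
print(mean_sales)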
Output:
OUTPUT
store
A 33.333333
B 50.000000
C 55.000000
Name: sales, dtype: float64
Calculating the Count
You can also count how many rows there are in each group. This is useful when you want to know how many entries exist for each group:
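PYTHON
# Number of entries in each group
count_sales = df.groupby('store')['sales'].count()
print(count_sales)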
Output:
OUTPUT
store
A 3
B 3
C 2
Name: sales, dtype: int64
Using custom aggregations
You can also apply custom aggregation functions to your grouped data. For example, let’s say you want to compute the range (difference between the maximum and minimum) of sales for each store:
PYTHON
range_sales = df.groupby('store')['sales'].agg(lambda x: x.max() - x.min())
print(range_sales)
Output:
OUTPUT
store
A 60
B 50
C 10
Name: sales, dtype: int64
In this example, we used a lambda function to compute the range of sales for each store.
Aggregating data using groupby()
Let's now go back to our script for transforming the education dataset.
The df_subset dataframe provides, for each country and age, the share of instruction time spent on each of the 8 selected subjects.
Now, we would like to compute the average share of instruction time of each selected subject by country.
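A possible solution; the grouping columns below (iso3 and subject_label) are one reasonable choice given the renaming done earlier:
PYTHON
# Average share of instruction time per country and subject
df_average = df_subset.groupby(['iso3', 'subject_label'])['value'].mean().reset_index()
print(df_average.head())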
Handling missing values during aggregation
When aggregating data, missing values (NaN) are typically ignored by default. However, if you need to change this behavior, you can control how pandas handles them using the skipna argument.
PYTHON
data = {
'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B'],
'product': ['apple', 'banana', 'apple', 'banana', 'apple', 'banana', 'banana', 'apple'],
'sales': [10, 20, 30, 40, 50, None, 70, 80]
}
df = pd.DataFrame(data)
# Group by store and calculate the sum, including missing values
agg_sales_with_na = df.groupby('store')['sales'].sum(skipna=False)
print(agg_sales_with_na)
Output:
OUTPUT
store
A 100.0
B 150.0
C NaN
Name: sales, dtype: float64
Notice that the sum for store C is NaN because store C contains a missing sales value in df.
Aggregating while preserving the data structure
The transform() function in pandas allows you to perform transformations on a group of data while preserving the original structure. Unlike aggregation (which reduces data), transform() returns a DataFrame or Series with the same index as the original.
If you want to rank the sales data within each store, you can use the rank() function inside transform():
PYTHON
# Rank the sales within each store
df['sales_rank'] = df.groupby('store')['sales'].transform('rank')
print(df)
Output:
OUTPUT
store product sales sales_rank
0 A apple 10 1.0
1 A banana 20 2.0
2 B apple 30 1.0
3 B banana 40 2.0
4 C apple 50 1.0
5 C banana 60 2.0
6 A banana 70 3.0
7 B apple 80 3.0
Exporting data
Once you’ve wrangled your data, you may want to export it to a file.
Exporting to CSV
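A minimal sketch (the output filename is illustrative):
PYTHON
# Write the DataFrame to a CSV file, without the row index
df.to_csv('output_data.csv', index=False)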
index=False prevents the row index from being saved in the file.
You can also specify other options, such as the separator (sep), the columns to export, or how missing values are written with the na_rep parameter.
Exporting to Excel
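A basic export (the filename is illustrative; writing .xlsx files typically requires the openpyxl package to be installed):
PYTHON
# Write the DataFrame to an Excel file
df.to_excel('output_data.xlsx', index=False)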
You can also specify which sheet name to use with the sheet_name parameter:
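For example (the sheet name is arbitrary):
PYTHON
# Write to a specific sheet of the Excel file
df.to_excel('output_data.xlsx', sheet_name='Education', index=False)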
If you’re dealing with multiple DataFrames and need to save them in different sheets of the same Excel file, you can use an ExcelWriter:
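A sketch with two hypothetical DataFrames written to separate sheets of the same file:
PYTHON
# Write several DataFrames to different sheets of one Excel file
with pd.ExcelWriter('output_data.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)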
Other supported export formats
Format | Method | Example Usage
---|---|---
CSV | DataFrame.to_csv() | df.to_csv('output_data.csv')
Excel | DataFrame.to_excel() | df.to_excel('output_data.xlsx')
JSON | DataFrame.to_json() | df.to_json('output_data.json')
SQL | DataFrame.to_sql() | df.to_sql('my_table', conn)
HDF5 | DataFrame.to_hdf() | df.to_hdf('output_data.h5', key='df')
Parquet | DataFrame.to_parquet() | df.to_parquet('output_data.parquet')
Feather | DataFrame.to_feather() | df.to_feather('output_data.feather')
Pickle | DataFrame.to_pickle() | df.to_pickle('output')
Each of these export functions has additional parameters for customizing how the data is saved (e.g., file paths, indexes, column selections). You can refer to the pandas documentation for more advanced options for each method.
Exporting data
Let's now export our transformed data, which we will use in the next episode focusing on data visualisations.
We want to export the following dataframes from our script on education data:
- df, to be saved in an education_clean.csv file
- df_subset, to be saved in an education_subset.csv file
- df_average, to be saved in an education_average.csv file
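A possible solution:
PYTHON
# Export the three DataFrames to CSV files, without the row index
df.to_csv('education_clean.csv', index=False)
df_subset.to_csv('education_subset.csv', index=False)
df_average.to_csv('education_average.csv', index=False)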