Data Transformation
Last updated on 2024-12-11
Estimated time: 120 minutes
Overview
Questions
- How can I process tabular data files in Python?
Objectives
- Explain what a library is and what libraries are used for.
- Import a Python library and use the functions it contains.
- Read tabular data from a file into a program.
- Clean and prepare data.
- Merge and reshape data.
- Handle missing values.
- Aggregate and summarize data.
Introduction
What is a Package/Library in Python?
In Python, a library (or package) is a collection of pre-written code that you can use to perform common tasks without needing to write the code from scratch. It’s a toolbox that provides tools (functions, classes, and modules) to help you solve specific problems.
Python packages save time and effort by providing solutions to common programming challenges. Instead of reinventing the wheel, you can import these tools into your scripts and use them to complete tasks more easily.
A Python package for data manipulations: pandas
In this lesson, we will focus on using the pandas library in Python to perform common data wrangling tasks.
pandas is an open-source Python library for data manipulation and analysis. It provides data structures like DataFrame and Series that make it easy to handle and analyze data.
Series
A Series is a one-dimensional labeled array, similar to a list. It can hold any data type, such as integers, strings, or even more complex data types.
Key Features:
- It’s one-dimensional, so it holds data in a single column.
- It has an index (labels) that you can use to access specific elements.
- Each element in the Series has a label (index) and a value.
Example of a Series:
PYTHON
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
# Print the Series
print(series)
Output:
OUTPUT
0 10
1 20
2 30
3 40
4 50
dtype: int64
The index is 0, 1, 2, 3, 4, and the values are 10, 20, 30, 40, 50.
pandas automatically creates an index for you (starting from 0), but you can also specify a custom index.
DataFrame
A DataFrame is a two-dimensional, table-like structure (similar to a spreadsheet or SQL table) that can hold multiple Series. It is the most commonly used pandas object.
A DataFrame consists of:
- Rows (with an index, just like a Series)
- Columns (each of which is a Series)
You can think of a DataFrame as a collection of Series that share the same index.
Key Features:
- It’s two-dimensional.
- Each column is a Series.
- It has both row and column labels (indexes and column names).
- It can hold multiple data types (integers, strings, floats, etc.).
Example of a DataFrame:
PYTHON
import pandas as pd
# Create a DataFrame using a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Paris', 'Seoul']
}
df = pd.DataFrame(data)
# Print the DataFrame
print(df)
Output:
OUTPUT
      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35     Seoul
- The rows are indexed 0, 1, 2 (the default index).
- The columns are Name, Age, and City.
- Each column is a Series, so the Name column is a Series, the Age column is another Series, etc.
Import methods
In Python, libraries (or modules) can be imported into your code
using the import
statement. This allows you to access the
functions, classes, and methods defined in that library. There are
several ways to do it:
- Full import: import pandas
  - Use with pandas.DataFrame(), pandas.Series(), etc.
- Import with alias: import pandas as pd
  - Use with pd.DataFrame(), pd.Series(), etc.
- Import specific functions or classes: from pandas import DataFrame
  - Use directly as DataFrame().
- Import multiple specific elements: from pandas import DataFrame, Series
In general, we use the second option (importing with an alias) for pandas.
Loading data
Loading CSV data
You can load a CSV file into a pandas
DataFrame
using the read_csv()
function:
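For example (a minimal sketch; the file path is a placeholder):
PYTHON
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv(r"path\to\file.csv")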
- read_csv() reads the CSV file located at path\to\file.csv and loads it into a pandas DataFrame (df).
- By default, it assumes that the file has a header row (i.e., column names) in the first row.
If the file does not have a header, you can use the header=None parameter to let pandas generate default column names.
You can pass arguments like sep if the file uses a different delimiter (e.g., tab-separated \t).
Raw string literal
In Python, the r prefix before a string is used to create a raw string literal. This tells Python to treat the string exactly as it is, without interpreting backslashes (\) as escape characters.
In regular strings, backslashes are used as escape characters. For example, \n represents a new line.
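A small illustration (the path is made up):
PYTHON
# Without the r prefix, the "\n" in this path would be read as a newline
path = r"C:\new_folder\file.csv"
print(path)  # C:\new_folder\file.csv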
Data exploration
Viewing the first few rows
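The head() method shows the first rows of a DataFrame (five by default). A minimal sketch, assuming the DataFrame is named df:
PYTHON
# Show the first five rows
print(df.head())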
If you want to see more (or fewer) rows, you can pass a number to head(), such as df.head(10) to view the first 10 rows.
Similarly, you can use the tail() method to view the last few rows of the DataFrame.
Unique values in columns
To get a sense of the distinct values in a column, the unique() and value_counts() methods are useful.
The unique() method shows all the unique values in a column.
The value_counts() method returns the count of unique values in the column, sorted in descending order. This is particularly useful for categorical data.
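A minimal sketch, assuming a DataFrame df with a City column:
PYTHON
# All distinct values in the column
print(df['City'].unique())

# Count of each distinct value, sorted in descending order
print(df['City'].value_counts())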
Checking for missing values
The isnull() method returns a DataFrame of the same shape as df, where each element is a boolean (True for missing values and False for non-missing values).
To get the total number of missing values in each column, you can chain sum() to isnull().
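A minimal sketch, assuming the DataFrame is named df:
PYTHON
# Boolean DataFrame: True where a value is missing
print(df.isnull())

# Number of missing values in each column
print(df.isnull().sum())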
This gives you a count of how many missing values are present in each column.
Summary statistics
To get a quick overview of the numerical data, you can use:
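For example (assuming the DataFrame is named df):
PYTHON
# Summary statistics for all numeric columns
print(df.describe())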
The describe() method provides summary statistics for all numeric columns, including:
- count: the number of non-null entries
- mean: the average value
- std: the standard deviation
- min / max: the minimum and maximum values
- 25%, 50%, 75%: the percentiles
Checking the data types
To understand the types of data in each column:
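A minimal sketch (assuming the DataFrame is named df):
PYTHON
# Data type of each column
print(df.dtypes)

# Column types, non-null counts, and memory usage in one call
df.info()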
Exploring an SDMX dataset
Using the education.csv
dataset in the materials for
this episode, write the lines of code to:
- Import pandas
- Load the dataset into a pandas DataFrame
- Print the list of columns in this dataset
- Print the unique values of the REF_AREA column
PYTHON
import pandas as pd
df = pd.read_csv(r"education.csv")
print(df.columns)
print(df["REF_AREA"].unique())
OUTPUT
Index(['STRUCTURE', 'STRUCTURE_ID', 'STRUCTURE_NAME', 'ACTION', 'REF_AREA',
'Reference area', 'MEASURE', 'Measure', 'UNIT_MEASURE',
'Unit of measure', 'INST_TYPE_EDU', 'Type of educational institution',
'EDUCATION_LEV', 'Education level', 'AGE', 'Age', 'SUBJ_TYPE',
'Subject', 'OBS_VALUE', 'Observation value', 'OBS_STATUS',
'Observation status', 'UNIT_MULT', 'Unit multiplier',
'STATISTICAL_OPERATION', 'Statistical operation', 'REF_PERIOD',
'Reference period', 'DECIMALS', 'Decimals'],
dtype='object')
['BFR' 'CHN' 'ESP' 'ISL' 'MEX' 'CHE' 'DEU' 'LTU' 'SAU' 'AUS' 'CAN' 'NLD'
'CHL' 'FRA' 'KOR' 'GRC' 'LUX' 'ROU' 'FIN' 'IND' 'IRL' 'ISR' 'CZE' 'SVK'
'TUR' 'USA' 'SVN' 'BGR' 'SWE' 'ZAF' 'UKM' 'NZL' 'OECD' 'BFL' 'IDN' 'NOR'
'DNK' 'HRV' 'HUN' 'ARG' 'CRI' 'EST' 'COL' 'PER' 'POL' 'PRT' 'UKB' 'ITA'
'BRA' 'JPN' 'LVA' 'EU25' 'G20' 'AUT']
Cleaning data
Renaming columns
You may want to rename columns for clarity:
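A minimal sketch (old_name and new_name are placeholder column names):
PYTHON
# Rename a column in place
df.rename(columns={'old_name': 'new_name'}, inplace=True)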
Callout
The inplace=True parameter means that we are modifying the original DataFrame df directly.
By default, inplace=False, so calling rename() without inplace=True (and without assigning the result back) won't actually rename the old_name column. An alternative is to create a new DataFrame with the renamed columns and assign it back to df:
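A minimal sketch of both variants (old_name and new_name are placeholder column names):
PYTHON
# Without inplace=True and without assignment, df is left unchanged
df.rename(columns={'old_name': 'new_name'})

# Assign the result back to df to keep the change
df = df.rename(columns={'old_name': 'new_name'})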
The inplace parameter is present in many other pandas methods.
Dropping columns
If you no longer need a column, you can drop it:
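A minimal sketch (column_name is a placeholder):
PYTHON
# Drop a column; axis=1 means "columns"
df.drop('column_name', axis=1, inplace=True)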
You can also select specific columns from a DataFrame by passing a list of column names inside double square brackets.
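For example (the column names A and B are placeholders):
PYTHON
# Keep only columns 'A' and 'B'
df = df[['A', 'B']]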
Removing duplicates
To remove duplicate rows, use the drop_duplicates()
method, which removes all duplicate rows by default.
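A minimal sketch:
PYTHON
# Remove rows that are exact duplicates of an earlier row
df = df.drop_duplicates()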
You can also specify which columns to check for duplicates by passing a subset to the subset parameter:
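For example:
PYTHON
# Consider only column 'A' when looking for duplicates
df = df.drop_duplicates(subset=['A'])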
This will remove duplicates based only on column A.
Handling missing data
You can handle missing data in various ways, for example by:
- Dropping rows with missing values
- Filling missing values with a default value
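A minimal sketch of both approaches (the fill value 0 is only an example):
PYTHON
# Drop rows that contain at least one missing value
df = df.dropna()

# Or fill missing values with a default value instead
df = df.fillna(0)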
First cleaning steps
Using the education.csv
dataset in the materials for
this episode (continuing on the script of the previous exercise), write
the lines of code to:
- Keep only the following columns: REF_AREA, AGE, SUBJ_TYPE, OBS_VALUE, REF_PERIOD
- Rename them with simpler names: iso3, age, subject, value, year
- Drop rows with missing data
PYTHON
# Keeping only the necessary columns
df = df[["REF_AREA", "AGE", "SUBJ_TYPE", "OBS_VALUE", "REF_PERIOD"]]
# Rename them
df.rename(columns={
"REF_AREA": "iso3",
"AGE": "age",
"SUBJ_TYPE": "subject",
"OBS_VALUE": "value",
"REF_PERIOD": "year"},
inplace=True)
# Drop rows with missing data
df.dropna(inplace=True)
Transforming data
Filtering rows
You can filter rows based on certain conditions. For example, to
filter for rows where the column age
is greater than
30:
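For example (a sketch, assuming df has a numeric age column):
PYTHON
# Keep only the rows where 'age' is greater than 30
filtered_df = df[df['age'] > 30]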
Another way to do this is to use loc
(Label Based
Indexing):
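The equivalent loc expression (sketch):
PYTHON
# Same filter expressed with label-based indexing
filtered_df = df.loc[df['age'] > 30]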
Replacing values based on condition
loc can also be used to replace values in a DataFrame based on conditions. Let's assume we have the following DataFrame, and we want to update certain values based on specific conditions.
PYTHON
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
OUTPUT
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 San Francisco
Suppose we want to replace the city for anyone over the age of 30 with Seattle.
PYTHON
# Replace 'City' with 'Seattle' where 'Age' is greater than 30
df.loc[df['Age'] > 30, 'City'] = 'Seattle'
print("\nUpdated DataFrame:")
print(df)
OUTPUT
Updated DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Seattle
3 David 40 Seattle
- df['Age'] > 30: This is the condition used to filter rows where the Age is greater than 30.
- df.loc[df['Age'] > 30, 'City']: This selects the City column for those rows where the condition is true.
- = 'Seattle': This replaces the value in the City column with 'Seattle' for those rows.
Replace values using map
The map() method in pandas allows you to apply a mapping or a function to each element in the Series. You can use map() with a dictionary to replace values in a Series according to the mapping provided.
PYTHON
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'City': ['NY', 'LA', 'CHI', 'SF']
}
df = pd.DataFrame(data)
# Create a dictionary for mapping
city_map = {
'NY': 'New York',
'LA': 'Los Angeles',
'CHI': 'Chicago',
'SF': 'San Francisco'
}
# Apply the map function to replace city abbreviations
df['City'] = df['City'].map(city_map)
print(df)
OUTPUT
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie Chicago
3 David San Francisco
- city_map is a dictionary where the keys are the city abbreviations and the values are the full city names.
- df['City'].map(city_map) replaces the city abbreviations with the corresponding full city names from the city_map dictionary.
Selection and mapping
We are still using the education.csv
dataset in the
materials for this episode (continuing on the script of the previous
exercise).
- Now, we would like to focus on a subset of education subjects, instead of using the full list (17 subjects). Write the lines of code to select only the 8 subjects listed below.
You can use the isin() method in pandas to filter rows.
- You may have noticed that the column for subject labels in the raw data was filled with missing values. For better readability, we will transform subject codes into labels, using the following mapping:
- READ: “Reading, writing and literature”
- MATH: “Mathematics”
- NSCI: “Natural sciences”
- SSCI: “Social sciences”
- SLAN: “Second language”
- OLAN: “Other languages”
- PHED: “Physical education and health”
- ARTS: “Arts”
PYTHON
# Selecting only the 8 main subjects and assign to a new dataframe
# (.copy() avoids a SettingWithCopyWarning when adding the label column below)
df_subset = df.loc[df['subject'].isin(["READ", "MATH", "NSCI", "SSCI", "SLAN", "OLAN", "PHED", "ARTS"])].copy()
# Adding labels
df_subset["subject_label"] = df_subset["subject"].map({
"READ": "Reading, writing and literature",
"MATH": "Mathematics",
"NSCI": "Natural sciences",
"SSCI": "Social sciences",
"SLAN": "Second language",
"OLAN": "Other languages",
"PHED": "Physical education and health",
"ARTS": "Arts"})
Pivoting data
Pivoting and melting are two important operations for reshaping data
in pandas
. They are used to transform a
DataFrame
from “long” format to “wide” format, and vice
versa.
Pivot
The pivot()
method reshapes the data by turning unique
values from one column into new columns. It’s useful when you want to
convert a “long” format DataFrame
(where each row
represents a single observation) into a “wide” format (where each unique
value becomes a column).
PYTHON
data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
'Temperature': [30, 75, 32, 77],
}
df = pd.DataFrame(data)
# Pivoting the data
pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)
OUTPUT
City Los Angeles New York
Date
2021-01-01 75 30
2021-01-02 77 32
- The Date column is used as the index.
- The City column values are turned into new columns.
- The Temperature column is used to populate the new DataFrame.
Melt
The melt()
function is the opposite of
pivot()
. It transforms a DataFrame from wide format to long
format.
PYTHON
data = {
'Date': ['2021-01-01', '2021-01-02'],
'New York': [30, 32],
'Los Angeles': [75, 77],
}
df = pd.DataFrame(data)
# Melting the data
melted_df = df.melt(id_vars=['Date'], var_name='City', value_name='Temperature')
print(melted_df)
OUTPUT
Date City Temperature
0 2021-01-01 New York 30
1 2021-01-02 New York 32
2 2021-01-01 Los Angeles 75
3 2021-01-02 Los Angeles 77
- The Date column remains fixed (as id_vars).
- The New York and Los Angeles columns are melted into a single City column, with corresponding values in the Temperature column.
Pivot method
You are given a dataset containing sales information for different products over a few months.
PYTHON
import pandas as pd
# Create the DataFrame
data = {
'Product': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'Month': ['January', 'February', 'March', 'January', 'February', 'March', 'January', 'February', 'March'],
'Sales': [100, 150, 200, 80, 120, 160, 130, 170, 220]
}
df = pd.DataFrame(data)
Use the pivot()
method to rearrange this DataFrame so
that the months become columns, and each product’s sales data for each
month appears under its respective column.
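One possible solution sketch:
PYTHON
# One row per product, one column per month
pivot_df = df.pivot(index='Product', columns='Month', values='Sales')
print(pivot_df)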
Merging and joining data
Merging DataFrames
The merge()
function in pandas
is used to
combine two DataFrames
based on one or more common columns.
It’s similar to SQL joins.
The basic syntax for merging two DataFrames
is:
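In outline (a sketch; df1, df2, and key_column are placeholders for your own DataFrames and join column):
PYTHON
merged_df = pd.merge(left=df1, right=df2, how='inner', on='key_column')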
- left: The first DataFrame.
- right: The second DataFrame.
- how: The type of merge to perform. Options include:
  - left: Use only keys from the left DataFrame (like a left join in SQL).
  - right: Use only keys from the right DataFrame (like a right join in SQL).
  - outer: Use keys from both DataFrames, filling in missing values with NaN (like a full outer join in SQL).
  - inner: Use only the common keys (like an inner join in SQL, the default option).
- on: The column or index level names to join on. If not specified, it will join on columns with the same name in both DataFrames.
- left_on and right_on: Specify columns from the left and right DataFrames to merge on if the column names are different.
In the following example, we merge DataFrames
on
multiple columns by passing a list to the on
parameter.
PYTHON
df1 = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter'],
'City': ['NY', 'LA', 'SF'],
'Age': [22, 25, 28]
})
df2 = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter'],
'City': ['NY', 'LA', 'DC'],
'Salary': [50000, 60000, 70000]
})
# Merge on multiple columns
merged_df = pd.merge(df1, df2, how='inner', on=['Name', 'City'])
print(merged_df)
OUTPUT
Name City Age Salary
0 John NY 22 50000
1 Anna LA 25 60000
Concatenating DataFrames
In addition to merging DataFrames, pandas provides the concat() function, which is useful for combining DataFrames along a particular axis (either rows or columns). While merge() is typically used for combining DataFrames based on a shared key or index, concat() is more straightforward and is generally used when you want to append or stack DataFrames together.
The basic syntax for concat()
is:
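In outline (df1 and df2 stand for the DataFrames you want to combine):
PYTHON
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)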
- [df1, df2]: A list of DataFrames to concatenate.
- axis: The axis along which to concatenate:
  - axis=0: Concatenate along rows (default behavior). This stacks DataFrames on top of each other.
  - axis=1: Concatenate along columns, aligning DataFrames side-by-side.
- ignore_index: If True, the index will be reset (i.e., it will generate a new index). If False, the original indices of the DataFrames are preserved.
- join: Determines how to handle indices (or columns when axis=1):
  - outer: Takes the union of the indices (or columns) from both DataFrames (default).
  - inner: Takes the intersection of the indices (or columns), excluding any non-overlapping indices (or columns).
When concatenating along rows (which is the default behavior), the
DataFrames
are stacked on top of each other, and the rows
are added to the end of the previous DataFrame
. This is
commonly used to combine datasets with the same structure but with
different data.
Here is an example for concatenating DataFrames
with the
same columns:
PYTHON
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['John', 'Anna', 'Peter']
})
df2 = pd.DataFrame({
'ID': [4, 5],
'Name': ['Linda', 'James']
})
# Concatenate along rows (stack vertically)
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated_df)
OUTPUT
ID Name
0 1 John
1 2 Anna
2 3 Peter
3 4 Linda
4 5 James
In this case:
- The two DataFrames df1 and df2 are stacked vertically.
- The ignore_index=True parameter ensures that the index is reset to a default integer index (0 to 4).
- If you didn't set ignore_index=True, the original indices from df1 and df2 would be preserved.
Merging datasets
We are given two datasets: one containing employee details and the
other containing their department information. We want to merge these
two datasets on the common column Employee_ID
to create a
single DataFrame
that contains employee details with their
department names, while making sure we won’t drop any observation.
PYTHON
import pandas as pd
# Create the Employee DataFrame
employee_data = {
'Employee_ID': [101, 102, 103, 104, 105],
'Employee_Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45]
}
employee_df = pd.DataFrame(employee_data)
# Create the Department DataFrame
department_data = {
'Employee_ID': [101, 102, 103, 106],
'Department': ['HR', 'Finance', 'IT', 'Marketing']
}
department_df = pd.DataFrame(department_data)
# Display both DataFrames
print("Employee DataFrame:")
print(employee_df)
print("\nDepartment DataFrame:")
print(department_df)
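One possible solution sketch (an outer join keeps every row from both DataFrames, so no observation is dropped):
PYTHON
# Outer merge: employees without a department and departments without a
# matching employee are kept, with NaN in the missing columns
merged_df = pd.merge(employee_df, department_df, how='outer', on='Employee_ID')
print(merged_df)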
Aggregating data
Aggregation is often used to summarize data by applying functions like sum, mean, etc., to groups of rows.
Grouping data
The groupby()
method in pandas is used to group data by
one or more columns. Once the data is grouped, you can apply an
aggregation function to each group.
Basic Grouping
Let’s assume we have a dataset of sales data that includes the
following columns: store
, product
, and
sales
.
PYTHON
import pandas as pd
data = {
'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B'],
'product': ['apple', 'banana', 'apple', 'banana', 'apple', 'banana', 'banana', 'apple'],
'sales': [10, 20, 30, 40, 50, 60, 70, 80]
}
df = pd.DataFrame(data)
# Group by 'store' and calculate the total sales for each store
grouped = df.groupby('store')['sales'].sum()
print(grouped)
Output:
OUTPUT
store
A 100
B 150
C 110
Name: sales, dtype: int64
In this example, we grouped the data by store
and
calculated the total sales (sum
) for each store.
Grouping by Multiple Columns
You can also group by multiple columns. Let’s say we want to calculate the total sales per store and per product:
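A minimal sketch:
PYTHON
# Group by both 'store' and 'product', then sum the sales within each group
grouped_multi = df.groupby(['store', 'product'])['sales'].sum()
print(grouped_multi)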
Output:
OUTPUT
store  product
A      apple       10
       banana      90
B      apple      110
       banana      40
C      apple       50
       banana      60
Name: sales, dtype: int64
This shows the total sales
for each combination of
store
and product
.
Aggregation Functions
Once you’ve grouped the data, you can apply different aggregation
functions. The most common ones include sum()
,
mean()
, min()
, max()
, and
count()
. These can be used to summarize the data in various
ways.
Calculating the mean
To calculate the average sales per store, you can use the
mean()
function:
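For example:
PYTHON
# Average sales per store
mean_sales = df.groupby('store')['sales'].mean()
print(mean_sales)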
Output:
OUTPUT
store
A 33.333333
B 50.000000
C 55.000000
Name: sales, dtype: float64
Calculating the count
You can also count how many rows there are in each group. This is useful when you want to know how many entries exist for each group:
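For example:
PYTHON
# Number of rows in each group
count_sales = df.groupby('store')['sales'].count()
print(count_sales)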
Output:
OUTPUT
store
A 3
B 3
C 2
Name: sales, dtype: int64
Using custom aggregations
You can also apply custom aggregation functions to your grouped data. For example, let’s say you want to compute the range (difference between the maximum and minimum) of sales for each store:
PYTHON
range_sales = df.groupby('store')['sales'].agg(lambda x: x.max() - x.min())
print(range_sales)
Output:
OUTPUT
store
A 60
B 50
C 10
Name: sales, dtype: int64
Aggregating data using groupby()
Let’s now go back to our script for transforming the education dataset.
The df_subset DataFrame provides, for each country and age, the share of instruction time spent on each of the 8 selected subjects.
Now, we would like to compute the average share of instruction time of each selected subject and country.
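One possible solution sketch, grouping by country and subject label and averaging over the remaining dimensions (age and year):
PYTHON
# Average share of instruction time per country and subject
avg_share = df_subset.groupby(['iso3', 'subject_label'])['value'].mean()
print(avg_share)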
Handling missing values during aggregation
When aggregating data, missing values (NaN
) are
typically ignored by default. However, if you need to change this
behavior, you can control how pandas
handles them using the
skipna
argument.
For example, if you want to include missing values in your aggregation, you can do the following:
PYTHON
data = {
'store': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B'],
'product': ['apple', 'banana', 'apple', 'banana', 'apple', 'banana', 'banana', 'apple'],
'sales': [10, 20, 30, 40, 50, None, 70, 80]
}
df = pd.DataFrame(data)
# Group by store and calculate the sum, including missing values
agg_sales_with_na = df.groupby('store')['sales'].sum(skipna=False)
print(agg_sales_with_na)
Output:
OUTPUT
store
A 100.0
B 150.0
C NaN
Name: sales, dtype: float64
Notice that the sum for store C is NaN
because the
df
DataFrame
contains a missing value.
Aggregating while preserving the data structure
The transform()
function in pandas
allows
you to perform transformations on a group of data while preserving the
original structure. Unlike aggregation (which reduces data),
transform()
returns a DataFrame
or
Series
with the same index as the original.
If you want to rank the sales data within each store, you can use the
rank()
function inside transform()
:
PYTHON
# Rank the sales within each store
df['sales_rank'] = df.groupby('store')['sales'].transform('rank')
print(df)
Output:
OUTPUT
store product sales sales_rank
0 A apple 10 1.0
1 A banana 20 2.0
2 B apple 30 1.0
3 B banana 40 2.0
4 C apple 50 1.0
5 C banana 60 2.0
6 A banana 70 3.0
7 B apple 80 3.0
Exporting data
Once you’ve wrangled your data, you may want to export it to a file.
Exporting to CSV
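A minimal sketch (the output filename is a placeholder):
PYTHON
# Write the DataFrame to a CSV file, without the row index
df.to_csv('output_data.csv', index=False)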
index=False
prevents the row index from being saved in the file.
You can also specify other options, like the separator (sep):
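For example, writing a tab-separated file (the filename is a placeholder):
PYTHON
# Use a tab as the field separator
df.to_csv('output_data.tsv', sep='\t', index=False)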
Exporting to Excel
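A minimal sketch (the filename is a placeholder; writing Excel files requires an engine such as openpyxl to be installed):
PYTHON
# Write the DataFrame to an Excel file
df.to_excel('output_data.xlsx', index=False)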
You can also specify which sheet name to use with the
sheet_name
parameter:
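For example (sketch):
PYTHON
# Name the worksheet explicitly
df.to_excel('output_data.xlsx', sheet_name='Sheet1', index=False)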
If you’re dealing with multiple DataFrames
and need to
save them in different sheets of the same Excel file, you can use
ExcelWriter
:
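A minimal sketch, assuming two DataFrames df1 and df2:
PYTHON
# Write several DataFrames to different sheets of the same workbook
with pd.ExcelWriter('output_data.xlsx') as writer:
    df1.to_excel(writer, sheet_name='First', index=False)
    df2.to_excel(writer, sheet_name='Second', index=False)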
Other supported export formats
Other supported export formats include:
Format | Method | Example Usage |
---|---|---|
CSV | DataFrame.to_csv() | df.to_csv('output_data.csv') |
Excel | DataFrame.to_excel() | df.to_excel('output_data.xlsx') |
JSON | DataFrame.to_json() | df.to_json('output_data.json') |
SQL | DataFrame.to_sql() | df.to_sql('my_table', conn) |
HDF5 | DataFrame.to_hdf() | df.to_hdf('output_data.h5', key='df') |
Parquet | DataFrame.to_parquet() | df.to_parquet('output_data.parquet') |
Feather | DataFrame.to_feather() | df.to_feather('output_data.feather') |
Pickle | DataFrame.to_pickle() | df.to_pickle('output') |
Each of these export functions has additional parameters for customizing how the data is saved (e.g., file paths, indexes, column selections). You can refer to the pandas documentation for more advanced options for each method.