pandas: Python data analysis and manipulation library
Labeled data structures for tabular data analysis.
Learn more about pandas
pandas is a Python library that implements labeled data structures, primarily the DataFrame and Series objects, for organizing and manipulating tabular and time series data. It is built on top of NumPy and integrates with the broader Python scientific computing ecosystem. The library aligns data automatically through index-based operations, lets each column of a DataFrame hold its own data type, and includes functionality for reading and writing data in formats such as CSV, Excel, and HDF5, as well as SQL databases. Common applications include exploratory data analysis, data cleaning, time series analysis, and preparing datasets for statistical modeling or machine learning workflows.
Index-based alignment
Data structures use labeled axes (the row index and the column labels), so arithmetic and other binary operations match values by label rather than by position. Datasets with different orderings align without manual intervention, and a label present in only one operand yields a missing value in the result.
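As a minimal sketch of label-based alignment, the example below adds two Series whose labels overlap only partly and appear in a different order. The Series names, labels, and values are made up for illustration.

```python
import pandas as pd

# Two Series sharing labels "y" and "z" (in different order),
# plus one label unique to each ("x" in a, "w" in b).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Addition matches values by label, not by position:
# y -> 2 + 20 = 22.0, z -> 3 + 10 = 13.0; "x" and "w" become NaN
# because each exists in only one operand.
total = a + b
print(total)

# Series.add with fill_value treats a label missing from one side as 0,
# so "x" -> 1.0 and "w" -> 30.0 instead of NaN.
total_filled = a.add(b, fill_value=0)
print(total_filled)
```

Whether an unmatched label should become NaN or fall back to a default depends on the analysis; `fill_value=0` is only one reasonable choice.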
Flexible missing data handling
Represents missing values as NaN in floating-point data, NaT in datetime-like data, and NA in the nullable extension types (such as nullable integer, boolean, and string). Element-wise operations propagate missing values, while aggregations such as sum and mean skip them by default; this behavior is configurable, for example through the skipna parameter.
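The short sketch below illustrates the three missing-value markers and the difference between propagating and skipping. The variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# NaN in a float64 Series
s_float = pd.Series([1.0, np.nan, 3.0])
# pd.NA in a nullable integer ("Int64") Series
s_int = pd.Series([1, None, 3], dtype="Int64")
# NaT in a datetime64 Series
s_time = pd.Series(pd.to_datetime(["2024-01-01", None]))

# Element-wise arithmetic propagates the missing value.
shifted = s_float + 1

# Aggregations skip missing values by default (skipna=True) ...
total_skip = s_float.sum()
# ... and propagate them when skipna=False.
total_strict = s_float.sum(skipna=False)

# fillna replaces missing entries with a chosen value.
filled = s_float.fillna(0.0)

print(shifted.isna().tolist())   # which positions are still missing
print(total_skip, total_strict)
print(s_int.sum(), s_time.isna().tolist())
```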
Integrated I/O and reshaping
Provides native readers and writers for multiple data formats (CSV, Excel, HDF5, SQL) and includes built-in operations for reshaping, pivoting, merging, and grouping data. This reduces the need for external tools or multiple library dependencies when working with diverse data sources.
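A small sketch of how reading, reshaping, merging, grouping, and writing can chain together in one script. The CSV text, column names, and the managers lookup table are hypothetical; an in-memory buffer stands in for a real file so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """region,quarter,sales
north,Q1,100
north,Q2,150
south,Q1,80
south,Q2,120
"""
sales = pd.read_csv(io.StringIO(csv_text))

# Reshape from long to wide: one row per region, one column per quarter.
wide = sales.pivot(index="region", columns="quarter", values="sales")

# Merge with a (hypothetical) lookup table on the shared "region" column.
managers = pd.DataFrame({"region": ["north", "south"],
                         "manager": ["Ana", "Ben"]})
merged = sales.merge(managers, on="region", how="left")

# Group by manager and total the sales.
totals = merged.groupby("manager")["sales"].sum()

# With no path argument, to_csv returns the CSV text as a string.
csv_out = totals.to_csv()

print(wide)
print(totals)
print(csv_out)
```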
import pandas as pd
# Read data from CSV file
df = pd.read_csv('sales_data.csv')
# Display basic information about the dataset
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())
# Check for missing values
print(df.isnull().sum())
# Filter data based on conditions
high_sales = df[df['sales'] > 1000]
print(f"Records with sales > 1000: {len(high_sales)}")
# Group by category and calculate mean sales
category_stats = df.groupby('category')['sales'].agg(['mean', 'sum', 'count'])
print(category_stats)
Related Repositories
Discover similar tools and frameworks used by developers
Fiona
Python library for reading and writing geographic data files like GeoPackage and Shapefile.
Patroni
Automates PostgreSQL failover using distributed consensus systems.
PostHog
Event tracking, analytics, and experimentation platform.
pdfplumber
Python library for extracting PDF text and tables.
n8n
Node-based automation platform with JavaScript and Python scripting.