Pandas - Syntax Basics Pt.3

Posted on 2022-07-31, updated on 2024-03-30, filed under pandas

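These notes cover the two main Pandas data types, Series and DataFrame.
Pt.1 covers basic operations on the two types: creation, indexing, and modification.
Pt.2 covers Pandas indexing in detail.
Pt.3 covers Pandas arithmetic and functions.
Pt.4 covers the Pandas functions for statistical computation in detail.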

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

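Arithmetic and Data Alignment

When two objects are added and their index labels differ, the index of the result is the union of the two indexes.

This automatic alignment introduces NaN at every label that the two objects do not share.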

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['c', 'd', 'e', 'f', 'g'])
s1, s2

(a    7.3
 b   -2.5
 c    3.4
 d    1.5
 dtype: float64,
 c   -2.1
 d    3.6
 e   -1.5
 f    4.0
 g    3.1
 dtype: float64)

s1+s2

a    NaN
b    NaN
c    1.3
d    5.1
e    NaN
f    NaN
g    NaN
dtype: float64
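As a rough check of the alignment rule, the sketch below rebuilds the two Series from above and compares the result's index with the union and intersection of the input indexes. The names `total`, `expected_index`, and `shared` are illustrative only.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['c', 'd', 'e', 'f', 'g'])

total = s1 + s2

# The result is indexed by the union of both indexes.
expected_index = s1.index.union(s2.index)
print(total.index.equals(expected_index))

# Only labels present in both operands carry a value; the rest are NaN.
shared = s1.index.intersection(s2.index)
print(total.dropna().index.equals(shared))
```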

For a DataFrame, alignment happens on both the rows and the columns.

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('cde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1, df2

(            b    c    d
 Ohio      0.0  1.0  2.0
 Texas     3.0  4.0  5.0
 Colorado  6.0  7.0  8.0,
           c     d     e
 Utah    0.0   1.0   2.0
 Ohio    3.0   4.0   5.0
 Texas   6.0   7.0   8.0
 Oregon  9.0  10.0  11.0)

df1+df2

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	NaN	4.0	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	NaN	10.0	12.0	NaN
Utah	NaN	NaN	NaN	NaN

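Arithmetic methods such as df1.add(df2, fill_value)

Missing values in arithmetic: fill_value supplies a substitute value when a position found in one object is missing from the other. Positions missing from both objects remain NaN.

df1.add(df2, fill_value=0)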

	b	c	d	e
Colorado	6.0	7.0	8.0	NaN
Ohio	0.0	4.0	6.0	5.0
Oregon	NaN	9.0	10.0	11.0
Texas	3.0	10.0	12.0	8.0
Utah	NaN	0.0	1.0	2.0
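To make the effect of fill_value easier to see, the following sketch counts the NaN cells with and without it, reusing the two frames from above. The variable names `plain`, `filled`, and the two counters are made up for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('cde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

plain = df1 + df2                     # NaN wherever either side is missing
filled = df1.add(df2, fill_value=0)   # the missing side is treated as 0

nan_plain = int(plain.isna().sum().sum())
nan_filled = int(filled.isna().sum().sum())
print(nan_plain, nan_filled)

# Cells that are absent from both frames are still NaN after filling.
print(filled[filled.isna().any(axis='columns')])
```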

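Arithmetic methods

Method	Description
add, radd	Addition (+)
sub, rsub	Subtraction (-)
div, rdiv	Division (/)
floordiv, rfloordiv	Floor division (//)
mod, rmod	Modulo (%)
divmod, rdivmod	Floor division and modulo together (// and %)
mul, rmul	Multiplication (*)
pow, rpow	Exponentiation (**)

Difference between div and rdiv

df1.div(df2) computes df1 / df2; df1.rdiv(df2) computes df2 / df1. The other r-prefixed methods swap their operands in the same way.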

s1 = pd.Series(list(range(1,6)), index=list('abcde'))
s2 = pd.Series(list(range(4,9)), index=list('cdefg'))
s1, s2

(a    1
 b    2
 c    3
 d    4
 e    5
 dtype: int64,
 c    4
 d    5
 e    6
 f    7
 g    8
 dtype: int64)

s1.mod(s2), s1.rmod(s2) # s1 % s2, s2 % s1

(a    NaN
 b    NaN
 c    3.0
 d    4.0
 e    5.0
 f    NaN
 g    NaN
 dtype: float64,
 a    NaN
 b    NaN
 c    1.0
 d    1.0
 e    1.0
 f    NaN
 g    NaN
 dtype: float64)
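The output above exercises mod and rmod. As a small sketch of the same reflected relationship for div and rdiv, using the same two Series (the names `quotient` and `reflected` are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
s2 = pd.Series([4, 5, 6, 7, 8], index=list('cdefg'))

quotient = s1.div(s2)     # s1 / s2
reflected = s1.rdiv(s2)   # s2 / s1, operands swapped

# Compare against the operator forms on the shared labels.
print(quotient.dropna())
print((s2 / s1).dropna())
print(reflected.dropna())
```

Labels that appear in only one Series are NaN in both results, exactly as with the operator forms.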

Operations between DataFrame and Series

Broadcasting

Match on the column labels (columns) and broadcast down the rows

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame, series

(          b     d     e
 Utah    0.0   1.0   2.0
 Ohio    3.0   4.0   5.0
 Texas   6.0   7.0   8.0
 Oregon  9.0  10.0  11.0,
 b    0.0
 d    1.0
 e    2.0
 Name: Utah, dtype: float64)

frame-series

	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

When the labels of the two operands do not fully overlap, the result is indexed by their union

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2

	b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN

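Match on the row labels (index) and broadcast across the columns

frame.sub(series, axis): matches on the given axis and broadcasts along the other; the remaining arithmetic methods work the same way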

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series3 = frame['d']
frame, series3

(          b     d     e
 Utah    0.0   1.0   2.0
 Ohio    3.0   4.0   5.0
 Texas   6.0   7.0   8.0
 Oregon  9.0  10.0  11.0,
 Utah       1.0
 Ohio       4.0
 Texas      7.0
 Oregon    10.0
 Name: d, dtype: float64)

frame.sub(series3, axis='index')

	b	d	e
Utah	-1.0	0.0	1.0
Ohio	-1.0	0.0	1.0
Texas	-1.0	0.0	1.0
Oregon	-1.0	0.0	1.0
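A common use of axis='index' is subtracting a per-row statistic. The sketch below, with the illustrative names `row_means`, `centered`, and `mismatched`, also shows what happens when the axis argument is left out: the Series labels are matched against the column labels instead, nothing lines up, and every cell becomes NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = frame.mean(axis='columns')          # one value per row label

# Match on the row labels and broadcast across the columns.
centered = frame.sub(row_means, axis='index')
print(centered)

# Without axis='index' the Series is matched against the column labels.
mismatched = frame - row_means
print(mismatched.shape, mismatched.isna().all().all())
```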

Function Application and Mapping

Applying NumPy functions

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	1.014427	-0.171110	-0.209578
Ohio	0.384411	0.098477	0.425979
Texas	0.618812	-0.087757	0.498332
Oregon	0.329589	-1.004679	0.205972

np.abs(frame)

	b	d	e
Utah	1.014427	0.171110	0.209578
Ohio	0.384411	0.098477	0.425979
Texas	0.618812	0.087757	0.498332
Oregon	0.329589	1.004679	0.205972

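frame.apply(func, axis): applies a function to the one-dimensional arrays formed by each column or each row

By default (axis='index') func receives each column, so it works down the index; with axis='columns' func receives each row, so it works across the columns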

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	-0.504649	-0.773222	3.765686
Ohio	-1.138202	0.029167	0.841352
Texas	-0.649582	0.596585	-0.854243
Oregon	-0.353114	1.173491	-1.507888

f = lambda x: x.max() - x.min() # func_name = lambda vars_in: vars_out
frame.apply(f), frame.apply(f, axis='columns') # the first result equals f(frame)

(b    0.785088
 d    1.946713
 e    5.273573
 dtype: float64,
 Utah      4.538908
 Ohio      1.979554
 Texas     1.450828
 Oregon    2.681378
 dtype: float64)

def f(x):
    return x.max()-x.min()
frame.apply(f)

b    0.785088
d    1.946713
e    5.273573
dtype: float64

Computing the minimum and maximum of each column or row
> The function applied to each row or column may also return a Series made up of several values

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

	b	d	e
min	-1.138202	-0.773222	-1.507888
max	-0.353114	1.173491	3.765686
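The example above returns a Series for each column. As a sketch of the row-wise counterpart, the code below uses a small deterministic frame (an assumption made so the numbers are reproducible, unlike the random data above) and a helper named `min_max`, which is illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def min_max(x):
    # x is one column (default) or one row (axis='columns')
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

per_column = frame.apply(min_max)                  # rows: min/max, columns: b/d/e
per_row = frame.apply(min_max, axis='columns')     # rows: Utah..Oregon, columns: min/max

print(per_column)
print(per_row)
```

In both cases the labels of the returned Series become one axis of the result, and the labels that were iterated over become the other.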

frame.applymap(func): maps a function onto every element of the object (since pandas 2.1 applymap is deprecated in favor of DataFrame.map)

frame

	b	d	e
Utah	-0.504649	-0.773222	3.765686
Ohio	-1.138202	0.029167	0.841352
Texas	-0.649582	0.596585	-0.854243
Oregon	-0.353114	1.173491	-1.507888

fmap = lambda x: '%.2f' % (x*10) # scale by 10 and keep two decimal places
frame.applymap(fmap)

	b	d	e
Utah	-5.05	-7.73	37.66
Ohio	-11.38	0.29	8.41
Texas	-6.50	5.97	-8.54
Oregon	-3.53	11.73	-15.08
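Element-wise mapping also works on a single column through Series.map. The sketch below reuses a few of the values shown above. The helper `fmt` and the `elementwise` fallback are assumptions added for illustration: the fallback picks DataFrame.map where it exists and applymap on older pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({'b': [-0.504649, -1.138202],
                      'e': [3.765686, 0.841352]},
                     index=['Utah', 'Ohio'])

def fmt(x):
    # scale by 10 and format with two decimal places
    return '%.2f' % (x * 10)

# Series.map applies the function to each element of one column.
e_formatted = frame['e'].map(fmt)

# Whole-frame version: prefer DataFrame.map, fall back to applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
all_formatted = elementwise(fmt)

print(e_formatted)
print(all_formatted)
```

Note that the results hold strings, not floats, so they are suited to display rather than further arithmetic.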

Sorting

obj.sort_index(axis, ascending): sort by index labels

ascending: whether to sort in ascending order (the default)

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

	d	a	b	c
three	0	1	2	3
one	4	5	6	7

frame.sort_index()

	d	a	b	c
one	4	5	6	7
three	0	1	2	3

frame.sort_index(axis=1)

	a	b	c	d
three	1	2	3	0
one	5	6	7	4

Descending order

frame.sort_index(axis=1, ascending=False)

	d	c	b	a
three	0	3	2	1
one	4	7	6	5

obj.sort_values(by): sort by values

obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

When sorting, missing values are placed at the end by default

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
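The placement of missing values can be changed with the na_position parameter of sort_values. A minimal sketch on the same Series, with illustrative variable names:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

nan_last = obj.sort_values()                       # default: na_position='last'
nan_first = obj.sort_values(na_position='first')   # NaN moved to the front
descending = obj.sort_values(ascending=False)      # NaN still goes last

print(nan_first)
print(descending)
```

Reversing the sort direction does not move the NaN entries; only na_position does.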

When sorting a DataFrame, pass one or more column names to by to sort by the values of those columns

frame = pd.DataFrame({'b': [4, 7, 2, 2], 'a': [0, 1, 0, 1], 'c': [4, 3, 2, 1]})
frame

	b	a	c
0	4	0	4
1	7	1	3
2	2	0	2
3	2	1	1

frame.sort_values(by='b')

	b	a	c
2	2	0	2
3	2	1	1
0	4	0	4
1	7	1	3

frame.sort_values(by=['b', 'c'])

	b	a	c
3	2	1	1
2	2	0	2
0	4	0	4
1	7	1	3
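When sorting by several columns, ascending can also be given as a list with one flag per column. The sketch below sorts b ascending and breaks ties on c in descending order; the name `mixed` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, 2, 2], 'a': [0, 1, 0, 1], 'c': [4, 3, 2, 1]})

# b ascending; rows with equal b are ordered by c descending
mixed = frame.sort_values(by=['b', 'c'], ascending=[True, False])
print(mixed)
```

Compared with by=['b', 'c'] alone, only the order of the two rows with b equal to 2 changes.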

obj.rank(axis, ascending, method): assigns a rank to each value; by default, tied values receive the average rank of their group

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

method='first': ranks tied values in the order they appear in the original data

obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Ranking in descending order

obj.rank(ascending=False, method='min') # tied for first, tied for third, ...

0    1.0
1    7.0
2    1.0
3    3.0
4    5.0
5    6.0
6    3.0
dtype: float64

frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

	b	a	c
0	4.3	0	-2.0
1	7.0	1	5.0
2	-3.0	0	8.0
3	2.0	1	-2.5

frame.rank(axis='columns')

	b	a	c
0	3.0	2.0	1.0
1	3.0	1.0	2.0
2	1.0	2.0	3.0
3	3.0	2.0	1.0

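method options

Method	Description
average	Assign the average rank to each value in a tied group (default)
min	Assign the lowest rank in the group to every tied value
max	Assign the highest rank in the group to every tied value
first	Rank tied values in the order they appear in the data
dense	Like min, but ranks always increase by 1 from one group to the next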
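To compare the options side by side, the following sketch ranks the Series from above once per method and collects the results in one DataFrame. The names `methods` and `comparison` are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-breaking method, same row labels as obj
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(comparison)
```

The rows for the tied values (7 at positions 0 and 2, 4 at positions 3 and 6) are the only ones where the columns differ.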