Numerical Computation and Statistical Basics
Common mathematical and statistical methods
Basic parameters: axis, skipna
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':[4,5,3,np.nan,2],
                   'key2':[1,2,np.nan,4,5],
                   'key3':[1,2,3,'j','k']},
                  index = ['a','b','c','d','e'])
print(df)
print(df['key1'].dtype, df['key2'].dtype, df['key3'].dtype)
print('-----')
==============================================
  key1  key2 key3
a  4.0   1.0    1
b  5.0   2.0    2
c  3.0   NaN    3
d  NaN   4.0    j
e  2.0   5.0    k
float64 float64 object
m1 = df.mean()
print(m1, type(m1))
print('mean of a single column:', df['key2'].mean())
print('-----')
=================================
key1    3.5
key2    3.0
dtype: float64 <class 'pandas.core.series.Series'>
mean of a single column: 3.0
-----
m2 = df.mean(axis=1)
print(m2)
print('-----')

m3 = df.mean(skipna=False)
print(m3)
print('-----')
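Note on newer pandas versions: recent releases (roughly 2.0 and later) no longer silently drop non-numeric columns such as key3 when computing df.mean(), and instead raise an error; passing numeric_only=True restores the behaviour shown here. A minimal hedged sketch:

# Hedged sketch: on recent pandas, restrict statistics to numeric columns explicitly
m4 = df.mean(numeric_only=True)   # key3 (object dtype) is excluded from the calculation
print(m4)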
Main mathematical methods, applicable to Series and DataFrame (1)
df = pd.DataFrame({'key1':np.arange(10),
                   'key2':np.random.rand(10)*10})
print(df)
print('-----')
==================================
key1 key2
0 0 4.667989
1 1 4.336625
2 2 0.746852
3 3 9.670919
4 4 8.732045
5 5 0.013751
6 6 8.963752
7 7 0.279303
8 8 8.586821
9 9 8.899657
-----
print(df.count(),'→ count: number of non-NaN values\n')
print(df.min(),'→ min: minimum\n', df['key2'].max(),'→ max: maximum\n')
print(df.quantile(q=0.75),'→ quantile: quantile, q sets the position\n')
print(df.sum(),'→ sum: total\n')
print(df.mean(),'→ mean: average\n')
print(df.median(),'→ median: the 50% quantile\n')
print(df.std(),'\n', df.var(),'→ std, var: standard deviation and variance\n')
print(df.skew(),'→ skew: sample skewness\n')
print(df.kurt(),'→ kurt: sample kurtosis\n')
===============================
key1    10
key2    10
dtype: int64 → count: number of non-NaN values

key1    0.000000
key2    0.013751
dtype: float64 → min: minimum
 9.67091932107 → max: maximum

key1    6.750000
key2    8.857754
dtype: float64 → quantile: quantile, q sets the position

key1    45.000000
key2    54.897714
dtype: float64 → sum: total

key1    4.500000
key2    5.489771
dtype: float64 → mean: average

key1    4.500000
key2    6.627405
dtype: float64 → median: the 50% quantile

key1    3.027650
key2    3.984945
dtype: float64
 key1     9.166667
key2    15.879783
dtype: float64 → std, var: standard deviation and variance

key1    0.000000
key2   -0.430166
dtype: float64 → skew: sample skewness

key1   -1.200000
key2   -1.800296
dtype: float64 → kurt: sample kurtosis
Main mathematical methods, applicable to Series and DataFrame (2)
df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum: cumulative sum\n')

df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod: cumulative product\n')

print(df.cummax(),'\n', df.cummin(),'→ cummax, cummin: cumulative maximum and minimum\n')
========================================
key1 key2 key1_s key2_s
0 0 4.667989 0 4.667989
1 1 4.336625 1 9.004614
2 2 0.746852 3 9.751466
3 3 9.670919 6 19.422386
4 4 8.732045 10 28.154431
5 5 0.013751 15 28.168182
6 6 8.963752 21 37.131934
7 7 0.279303 28 37.411236
8 8 8.586821 36 45.998057
9 9 8.899657 45 54.897714 → cumsum: cumulative sum
key1 key2 key1_s key2_s key1_p key2_p
0 0 4.667989 0 4.667989 0 4.667989
1 1 4.336625 1 9.004614 0 20.243318
2 2 0.746852 3 9.751466 0 15.118767
3 3 9.670919 6 19.422386 0 146.212377
4 4 8.732045 10 28.154431 0 1276.733069
5 5 0.013751 15 28.168182 0 17.556729
6 6 8.963752 21 37.131934 0 157.374157
7 7 0.279303 28 37.411236 0 43.955024
8 8 8.586821 36 45.998057 0 377.433921
9 9 8.899657 45 54.897714 0 3359.032396 → cumprod: cumulative product
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 4.667989 0.0 4.667989 0.0 4.667989
1 1.0 4.667989 1.0 9.004614 0.0 20.243318
2 2.0 4.667989 3.0 9.751466 0.0 20.243318
3 3.0 9.670919 6.0 19.422386 0.0 146.212377
4 4.0 9.670919 10.0 28.154431 0.0 1276.733069
5 5.0 9.670919 15.0 28.168182 0.0 1276.733069
6 6.0 9.670919 21.0 37.131934 0.0 1276.733069
7 7.0 9.670919 28.0 37.411236 0.0 1276.733069
8 8.0 9.670919 36.0 45.998057 0.0 1276.733069
9 9.0 9.670919 45.0 54.897714 0.0 3359.032396
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 4.667989 0.0 4.667989 0.0 4.667989
1 0.0 4.336625 0.0 4.667989 0.0 4.667989
2 0.0 0.746852 0.0 4.667989 0.0 4.667989
3 0.0 0.746852 0.0 4.667989 0.0 4.667989
4 0.0 0.746852 0.0 4.667989 0.0 4.667989
5 0.0 0.013751 0.0 4.667989 0.0 4.667989
6 0.0 0.013751 0.0 4.667989 0.0 4.667989
7 0.0 0.013751 0.0 4.667989 0.0 4.667989
8 0.0 0.013751 0.0 4.667989 0.0 4.667989
9 0.0 0.013751 0.0 4.667989 0.0 4.667989 → cummax, cummin: cumulative maximum and minimum
Unique values: .unique()
s = pd.Series(list('asdvasdcfgg'))
sq = s.unique()
print(s)
print(sq, type(sq))
print(pd.Series(sq))

sq.sort()
print(sq)
======================================
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
['a' 's' 'd' 'v' 'c' 'f' 'g'] <class 'numpy.ndarray'>
0 a
1 s
2 d
3 v
4 c
5 f
6 g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']
Value counts: .value_counts()
sc = s.value_counts(sort = False)
print(sc)
===============================================
s    2
d    2
v    1
c    1
a    2
g    2
f    1
dtype: int64
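.value_counts() also accepts a few other options worth knowing; a hedged sketch (parameters as documented in current pandas), reusing the s defined above:

# Hedged sketch: sorted counts and relative frequencies
print(s.value_counts())                # sort=True by default: most frequent first
print(s.value_counts(normalize=True))  # relative frequencies instead of raw counts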
Membership: .isin()
# Membership test: .isin()
s = pd.Series(np.arange(10,15))
df = pd.DataFrame({'key1':list('asdcbvasd'),
'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')
print(s.isin([5,14]))
print(df.isin(['a','bc','10',8]))
# pass the values to test as a list
# returns a boolean Series or DataFrame
================================
0 10
1 11
2 12
3 13
4 14
dtype: int32
key1 key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
6 a 10
7 s 11
8 d 12
-----
0 False
1 False
2 False
3 False
4 True
dtype: bool
key1 key2
0 True False
1 False False
2 False False
3 False False
4 False True
5 False False
6 True False
7 False False
8 False False
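The boolean result is typically used as a mask for filtering; a minimal hedged sketch reusing the s and df defined above:

# Hedged sketch: filter rows with the boolean mask returned by .isin()
print(s[s.isin([5, 14])])                # keeps only the matching elements
print(df[df['key1'].isin(['a', 'd'])])   # keeps rows whose key1 is 'a' or 'd'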
Text data
Pandas provides a set of string methods that make it easy to operate on every element of an array
They are accessed through .str, and missing/NA values are automatically excluded
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')

print(s.str.count('b'))
print(df['key2'].str.upper())
print('-----')

df.columns = df.columns.str.upper()
print(df)
===================================================
0 A
1 b
2 C
3 bbhello
4 123
5 NaN
6 hj
dtype: object
key1 key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
-----
0 0.0
1 1.0
2 0.0
3 2.0
4 0.0
5 NaN
6 0.0
dtype: float64
0 HEE
1 FV
2 W
3 HIJA
4 123
5 NaN
Name: key2, dtype: object
-----
KEY1 KEY2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
Common string methods (1) - lower, upper, len, startswith, endswith
s = pd.Series(['A','b','bbhello','123',np.nan])

print(s.str.lower(),'→ lower: lowercase\n')
print(s.str.upper(),'→ upper: uppercase\n')
print(s.str.len(),'→ len: string length\n')
print(s.str.startswith('b'),'→ whether each string starts with b\n')
print(s.str.endswith('3'),'→ whether each string ends with 3\n')
====================================
0 a
1 b
2 bbhello
3 123
4 NaN
dtype: object → lower: lowercase
0 A
1 B
2 BBHELLO
3 123
4 NaN
dtype: object → upper: uppercase
0 1.0
1 1.0
2 7.0
3 3.0
4 NaN
dtype: float64 → len: string length
0 False
1 True
2 True
3 False
4 NaN
dtype: object → whether each string starts with b
0 False
1 False
2 False
3 True
4 NaN
dtype: object → whether each string ends with 3
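A closely related method is .str.contains(), which likewise returns a boolean Series with NaN for missing values; a hedged sketch on the same s:

# Hedged sketch: substring / pattern matching with .str.contains()
print(s.str.contains('b'))                # True where 'b' occurs anywhere in the string
print(s.str.contains('^1', regex=True))   # regular expressions are supported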
Common string methods (2) - strip
s = pd.Series([' jack', 'jill ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
print(s)
print(df)
print('-----')

print(s.str.strip())
print(s.str.lstrip())
print(s.str.rstrip())

df.columns = df.columns.str.strip()
print(df)
================================================
0 jack
1 jill
2 jesse
3 frank
dtype: object
Column A Column B
0 0.647766 0.094747
1 0.342940 -0.660643
2 1.183315 -0.143729
-----
0 jack
1 jill
2 jesse
3 frank
dtype: object
0 jack
1 jill
2 jesse
3 frank
dtype: object
0 jack
1 jill
2 jesse
3 frank
dtype: object
Column A Column B
0 0.647766 0.094747
1 0.342940 -0.660643
2 1.183315 -0.143729
Common string methods (3) - replace
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)

df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
=========================================
  -Column-A-  -Column-B-
0 1.855227 -0.519479
1 -0.400376 -0.421383
2 -0.293797 -0.432481
  heheColumn-A-  heheColumn-B-
0 1.855227 -0.519479
1 -0.400376 -0.421383
2 -0.293797 -0.432481
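Note that whether the pattern is treated as a regular expression depends on the pandas version (older releases defaulted to regex=True, pandas 2.x defaults to regex=False); passing regex explicitly avoids the ambiguity. A hedged sketch:

# Hedged sketch: be explicit about regex handling in str.replace
df.columns = df.columns.str.replace('-', ' ', regex=False)   # literal, non-regex replacement
print(df)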
Common string methods (4) - split, rsplit
s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])

print(s.str.split(','))
print('-----')

print(s.str.split(',')[0])
print('-----')

print(s.str.split(',').str[0])
print(s.str.split(',').str.get(1))
print('-----')

print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
print('-----')

df = pd.DataFrame({'key1':['a,b,c','1,2,3',[':,., ']],
                   'key2':['a-b-c','1-2-3',[':-.- ']]})
print(df['key2'].str.split('-'))
================================
0    [a, b, c]
1    [1, 2, 3]
2          NaN
3          NaN
dtype: object
-----
['a', 'b', 'c']
-----
0      a
1      1
2    NaN
3    NaN
dtype: object
0      b
1      2
2    NaN
3    NaN
dtype: object
-----
     0     1     2
0    a     b     c
1    1     2     3
2  NaN  None  None
3  NaN  None  None
     0     1
0    a   b,c
1    1   2,3
2  NaN  None
3  NaN  None
     0     1
0  a,b     c
1  1,2     3
2  NaN  None
3  NaN  None
-----
0    [a, b, c]
1    [1, 2, 3]
2          NaN
Name: key2, dtype: object
String indexing
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})

print(s.str[0])
print(s.str[:2])
print(df['key2'].str[0])
======================================
0 A
1 b
2 C
3 b
4 1
5 NaN
6 h
dtype: object
0 A
1 b
2 C
3 bb
4 12
5 NaN
6 hj
dtype: object
0 h
1 f
2 w
3 h
4 1
5 NaN
Name: key2, dtype: object
Merging: merge, join
Pandas offers full-featured, high-performance in-memory join operations, very similar to those of relational databases such as SQL
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
merge → similar to Excel's VLOOKUP
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(df1, df2, on='key'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2']))
======================================================
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
Parameter how → the join type
print(pd.merge(df3, df4, on=['key1','key2'], how = 'inner'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2'], how = 'outer'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2'], how = 'left'))
print('------')
print(pd.merge(df3, df4, on=['key1','key2'], how = 'right'))
===============================================
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
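The indicator parameter listed in the merge signature above records where each row came from, which is handy for debugging outer joins; a hedged sketch with the same df3 and df4:

# Hedged sketch: indicator=True adds a _merge column (left_only / right_only / both)
print(pd.merge(df3, df4, on=['key1','key2'], how='outer', indicator=True))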
Parameters left_on, right_on, left_index, right_index → when the key is not a shared column, the left and right keys can be set separately
df1 = pd.DataFrame({'lkey':list('bbacaab'),
                    'data1':range(7)})
df2 = pd.DataFrame({'rkey':list('abd'),
                    'date2':range(3)})
print(pd.merge(df1, df2, left_on='lkey', right_on='rkey'))
print('------')

df1 = pd.DataFrame({'key':list('abcdfeg'),
                    'data1':range(7)})
df2 = pd.DataFrame({'date2':range(100,105)},
                   index = list('abcde'))
print(pd.merge(df1, df2, left_on='key', right_index=True))
=====================================================
   data1 lkey  date2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a
------
   data1 key  date2
0      0   a    100
1      1   b    101
2      2   c    102
3      3   d    103
5      5   e    104
Parameter sort
df1 = pd.DataFrame({'key':list('bbacaab'),
                    'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({'key':list('abd'),
                    'date2':[11,2,33]})
x1 = pd.merge(df1, df2, on = 'key', how = 'outer')
x2 = pd.merge(df1, df2, on = 'key', sort=True, how = 'outer')
print(x1)
print(x2)
print('------')

print(x2.sort_values('data1'))
================================================================
   data1 key  date2
0    1.0   b    2.0
1    3.0   b    2.0
2    7.0   b    2.0
3    2.0   a   11.0
4    5.0   a   11.0
5    9.0   a   11.0
6    4.0   c    NaN
7    NaN   d   33.0
   data1 key  date2
0    2.0   a   11.0
1    5.0   a   11.0
2    9.0   a   11.0
3    1.0   b    2.0
4    3.0   b    2.0
5    7.0   b    2.0
6    4.0   c    NaN
7    NaN   d   33.0
------
   data1 key  date2
3    1.0   b    2.0
0    2.0   a   11.0
4    3.0   b    2.0
6    4.0   c    NaN
1    5.0   a   11.0
5    7.0   b    2.0
2    9.0   a   11.0
7    NaN   d   33.0
df.join() → join directly on the index
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
print(left)
print(right)
print(left.join(right))
print(left.join(right, how='outer'))
print('-----')

df1 = pd.DataFrame({'key':list('bbacaab'),
                    'data1':[1,3,2,4,5,9,7]})
df2 = pd.DataFrame({'key':list('abd'),
                    'date2':[11,2,33]})
print(df1)
print(df2)
print(pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_1', '_2')))
print(df1.join(df2['date2']))
print('-----')

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])
print(left)
print(right)
print(left.join(right, on = 'key'))
Concatenation and patching: concat, combine_first
Concatenation - joining objects along an axis
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
Concatenation: concat
s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
s3 = pd.Series([1,2,3],index = ['a','c','h'])
s4 = pd.Series([2,3,4],index = ['b','e','d'])
print(pd.concat([s1,s2]))
print(pd.concat([s3,s4]).sort_index())
print('-----')

print(pd.concat([s3,s4], axis=1))
print('-----')
==================================================
0 1
1 2
2 3
0 2
1 3
2 4
dtype: int64
a    1
b    2
c    2
d    4
e    3
h    3
dtype: int64
-----
     0    1
a  1.0  NaN
b  NaN  2.0
c  2.0  NaN
d  NaN  4.0
e  NaN  3.0
h  3.0  NaN
-----
Join method: join, join_axes
s5 = pd.Series([1,2,3],index = ['a','b','c'])
s6 = pd.Series([2,3,4],index = ['b','c','d'])
print(pd.concat([s5,s6], axis = 1))
print(pd.concat([s5,s6], axis = 1, join='inner'))
print(pd.concat([s5,s6], axis = 1, join_axes=[['a','b','d']]))
==================================================
     0    1
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   0  1
b  2  2
c  3  3
     0    1
a  1.0  NaN
b  2.0  2.0
d  NaN  4.0
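Note that join_axes was deprecated and removed in pandas 1.0; on recent versions the same result can be obtained by reindexing the concatenated result. A hedged sketch:

# Hedged sketch: join_axes replacement on pandas >= 1.0
print(pd.concat([s5, s6], axis=1).reindex(['a', 'b', 'd']))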
Overriding column names with keys
sre = pd.concat([s5,s6], keys = ['one','two'])
print(sre, type(sre))
print(sre.index)
print('-----')

sre = pd.concat([s5,s6], axis=1, keys = ['one','two'])
print(sre, type(sre))
================================================
one  a    1
     b    2
     c    3
two  b    2
     c    3
     d    4
dtype: int64 <class 'pandas.core.series.Series'>
MultiIndex(levels=[['one', 'two'], ['a', 'b', 'c', 'd']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 1, 2, 3]])
-----
   one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0 <class 'pandas.core.frame.DataFrame'>
Patching: combine_first()
df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan],[np.nan, 7., np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]], index=[1, 2])
print(df1)
print(df2)
print(df1.combine_first(df2))
print('-----')

df1.update(df2)
print(df1)
===========================================
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN  NaN
2  NaN  7.0  NaN
      0    1    2
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
     0    1    2
0  NaN  3.0  5.0
1 -4.6  NaN -8.2
2 -5.0  7.0  4.0
-----
      0    1    2
0   NaN  3.0  5.0
1 -42.6  NaN -8.2
2  -5.0  1.6  4.0
Deduplication and replacement
.duplicated / .replace
Deduplication: .duplicated
s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
print(s[s.duplicated() == False])
print('-----')

s_re = s.drop_duplicates()
print(s_re)
print('-----')

df = pd.DataFrame({'key1':['a','a',3,4,5],
                   'key2':['a','a','b','b','c']})
print(df.duplicated())
print(df['key2'].duplicated())
=========================================
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
dtype: bool
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 False
1 True
2 False
3 False
4 False
dtype: bool
0 False
1 True
2 False
3 True
4 False
Name: key2, dtype: bool
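For DataFrames, drop_duplicates also takes subset and keep arguments to control which columns are compared and which occurrence survives; a hedged sketch on the df above:

# Hedged sketch: deduplicate on selected columns, keeping the last occurrence
print(df.drop_duplicates(subset=['key2'], keep='last'))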
Replacement: .replace
s = pd.Series(list('ascaazsd'))
print(s.replace('a', np.nan))
print(s.replace(['a','s'], np.nan))
print(s.replace({'a':'hello world!', 's':123}))
=============================
0 NaN
1 s
2 c
3 NaN
4 NaN
5 z
6 s
7 d
dtype: object
0 NaN
1 NaN
2 c
3 NaN
4 NaN
5 z
6 NaN
7 d
dtype: object
0 hello world!
1 123
2 c
3 hello world!
4 hello world!
5 z
6 123
7 d
dtype: object
Grouping data
Grouped statistics - the groupby functionality
① Split the data into groups based on some criteria ② Apply a function to each group independently ③ Combine the results into a data structure
A DataFrame can be grouped along its rows (axis=0) or its columns (axis=1); a function is applied to each group to produce a new value, and the results are then combined into the final result object.
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Grouping
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print('------')

print(df.groupby('A'), type(df.groupby('A')))
print('------')

a = df.groupby('A').mean()
b = df.groupby(['A','B']).mean()
c = df.groupby(['A'])['D'].mean()
print(a, type(a), '\n', a.columns)
print(b, type(b), '\n', b.columns)
print(c, type(c))
Groups as an iterable object
df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')

print(list(df.groupby('X')), '→ iterable, can be turned into a list directly\n')
print(list(df.groupby('X'))[0], '→ each group is shown as a tuple\n')
for n, g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')

print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')

grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A'])
print('-----')

sz = grouped.size()
print(sz, type(sz))
print('-----')

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
==================================================
Grouping on other axes
df = pd.DataFrame({'data1':np.random.rand(2),
                   'data2':np.random.rand(2),
                   'key1':['a','b'],
                   'key2':['one','two']})
print(df)
print(df.dtypes)
print('-----')

for n, p in df.groupby(df.dtypes, axis=1):
    print(n)
    print(p)
    print('##')
============================================
data1 data2 key1 key2
0 0.454580 0.692637 a one
1 0.496928 0.214309 b two
data1 float64
data2 float64
key1      object
key2      object
dtype: object
-----
float64
data1 data2
0 0.454580 0.692637
1 0.496928 0.214309
object
key1 key2
0 a one
1 b two
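Note that grouping with axis=1 is deprecated in pandas 2.x; selecting columns by dtype directly gives the same split. A hedged sketch:

# Hedged sketch: dtype-based column selection without an axis=1 groupby
print(df.select_dtypes(include='float64'))
print(df.select_dtypes(include='object'))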
Grouping with a dict or a Series
df = pd.DataFrame(np.arange(16).reshape(4,4),
                  columns = ['a','b','c','d'])
print(df)
print('-----')

mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
print('-----')

s = pd.Series(mapping)
print(s,'\n')
print(s.groupby(s).count())
Grouping with a function
df = pd.DataFrame(np.arange(16).reshape(4,4),
                  columns = ['a','b','c','d'],
                  index = ['abc','bcd','aa','b'])
print(df,'\n')
print(df.groupby(len).sum())
==================================
      a   b   c   d
abc   0   1   2   3
bcd   4   5   6   7
aa    8   9  10  11
b    12  13  14  15
    a   b   c   d
1  12  13  14  15
2   8   9  10  11
3   4   6   8  10
Group computation methods
s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)
print(grouped)
print(grouped.first(),'→ first: first non-NaN value\n')
print(grouped.last(),'→ last: last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of non-NaN values\n')
print(grouped.mean(),'→ mean: mean of non-NaN values\n')
print(grouped.median(),'→ median: median of non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min, max: minimum and maximum of non-NaN values\n')
print(grouped.std(),'→ std, var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(),'→ prod: product of non-NaN values\n')
====================================
Computing multiple functions at once: agg()
df = pd.DataFrame({'a':[1,1,2,2],
                   'b':np.random.rand(4),
                   'c':np.random.rand(4),
                   'd':np.random.rand(4),})
print(df)
print(df.groupby('a').agg(['mean',np.sum]))
print(df.groupby('a')['b'].agg({'result1':np.mean,
                                'result2':np.sum}))
=========================================================
a b c d
0 1 0.357911 0.318324 0.627797
1 1 0.964829 0.500017 0.570063
2 2 0.116608 0.194164 0.049509
3 2 0.933123 0.542615 0.718640
          b                   c                   d
       mean       sum      mean       sum      mean      sum
a
1  0.661370  1.322739  0.409171  0.818341  0.598930  1.19786
2  0.524865  1.049730  0.368390  0.736780  0.384075  0.76815
result2 result1
a
1 1.322739 0.661370
2 1.049730 0.524865
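Passing a dict of output-name → function to a SeriesGroupBy's agg (as in the last call above) was deprecated and removed in pandas 1.0; named aggregation with keyword arguments is the current equivalent. A hedged sketch:

# Hedged sketch: named aggregation on recent pandas
print(df.groupby('a')['b'].agg(result1='mean', result2='sum'))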
Group transformation and the general "split-apply-combine"
transform / apply
Group-wise transformation: transform
df = pd.DataFrame({'data1':np.random.rand(5),
                   'data2':np.random.rand(5),
                   'key1':list('aabba'),
                   'key2':['one','two','one','two','one']})
k_mean = df.groupby('key1').mean()
print(df)
print(k_mean)
print(pd.merge(df, k_mean, left_on='key1', right_index=True).add_prefix('mean_'))
print('-----')

print(df.groupby('key2').mean())
print(df.groupby('key2').transform(np.mean))
=============================================
data1 data2 key1 key2
0 0.003727 0.390301 a one
1 0.744777 0.130300 a two
2 0.887207 0.679309 b one
3 0.448585 0.169208 b two
4 0.448045 0.993775 a one
data1 data2
key1
a     0.398850  0.504792
b     0.667896  0.424258
mean_data1_x mean_data2_x mean_key1 mean_key2 mean_data1_y mean_data2_y
0 0.003727 0.390301 a one 0.398850 0.504792
1 0.744777 0.130300 a two 0.398850 0.504792
4 0.448045 0.993775 a one 0.398850 0.504792
2 0.887207 0.679309 b one 0.667896 0.424258
3 0.448585 0.169208 b two 0.667896 0.424258
-----
data1 data2
key2
one   0.446326  0.687795
two   0.596681  0.149754
data1 data2
0 0.446326 0.687795
1 0.596681 0.149754
2 0.446326 0.687795
3 0.596681 0.149754
4 0.446326 0.687795
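The typical use of transform is to broadcast a group statistic back onto the original rows, for example to attach a group mean as a new column; a hedged sketch on the df above (the column name is made up for illustration):

# Hedged sketch: attach the per-group mean of data1 as a new column
df['data1_grpmean'] = df.groupby('key2')['data1'].transform('mean')
print(df)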
The general groupby method: apply
df = pd.DataFrame({'data1':np.random.rand(5),
                   'data2':np.random.rand(5),
                   'key1':list('aabba'),
                   'key2':['one','two','one','two','one']})
print(df.groupby('key1').apply(lambda x: x.describe()))

def f_df1(d, n):
    return(d.sort_index()[:n])
def f_df2(d, k1):
    return(d[k1])
print(df.groupby('key1').apply(f_df1, 2),'\n')
print(df.groupby('key1').apply(f_df2, 'data2'))
print(type(df.groupby('key1').apply(f_df2, 'data2')))
=============================================
data1 data2
key1
a count 3.000000 3.000000
mean 0.561754 0.233470
std 0.313439 0.337209
min 0.325604 0.026906
25% 0.383953 0.038906
50% 0.442303 0.050906
75% 0.679829 0.336753
max 0.917355 0.622599
b count 2.000000 2.000000
mean 0.881906 0.547206
std 0.079357 0.254051
min 0.825791 0.367564
25% 0.853849 0.457385
50% 0.881906 0.547206
75% 0.909963 0.637026
max 0.938020 0.726847
data1 data2 key1 key2
key1
a 0 0.325604 0.050906 a one
  1 0.917355 0.622599 a two
b 2 0.825791 0.726847 b one
  3 0.938020 0.367564 b two
key1
a 0 0.050906
  1 0.622599
  4 0.026906
b 2 0.726847
  3 0.367564
Name: data2, dtype: float64
<class 'pandas.core.series.Series'>
Pivot tables and crosstabs
Similar to Excel pivot tables - pivot_table / crosstab
Pivot table: pivot_table
date = ['2017-5-1','2017-5-2','2017-5-3']*3
rng = pd.to_datetime(date)
df = pd.DataFrame({'date':rng,
                   'key':list('abcdabcda'),
                   'values':np.random.rand(9)*10})
print(df)
print('-----')

print(pd.pivot_table(df, values = 'values', index = 'date', columns = 'key', aggfunc=np.sum))
print('-----')

print(pd.pivot_table(df, values = 'values', index = ['date','key'], aggfunc=len))
print('-----')
=====================================
date key values
0 2017-05-01 a 5.886424
1 2017-05-02 b 9.906472
2 2017-05-03 c 8.617297
3 2017-05-01 d 8.972318
4 2017-05-02 a 7.990905
5 2017-05-03 b 8.131856
6 2017-05-01 c 2.823731
7 2017-05-02 d 2.394605
8 2017-05-03 a 0.667917
-----
key                a         b         c         d
date
2017-05-01  5.886424       NaN  2.823731  8.972318
2017-05-02  7.990905  9.906472       NaN  2.394605
2017-05-03  0.667917  8.131856  8.617297       NaN
-----
date        key
2017-05-01  a      1.0
            c      1.0
            d      1.0
2017-05-02  a      1.0
            b      1.0
            d      1.0
2017-05-03  a      1.0
            b      1.0
            c      1.0
Name: values, dtype: float64
-----
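pivot_table also supports fill_value and margins for a more Excel-like result; a hedged sketch on the same df:

# Hedged sketch: fill empty cells and add row/column totals
print(pd.pivot_table(df, values='values', index='date', columns='key',
                     aggfunc=np.sum, fill_value=0, margins=True))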
Crosstab: crosstab
df = pd.DataFrame({'A': [1, 2, 2, 2, 2],
                   'B': [3, 3, 4, 4, 4],
                   'C': [1, 1, np.nan, 1, 1]})
print(df)
print('-----')

print(pd.crosstab(df['A'], df['B']))
print('-----')
print(pd.crosstab(df['A'], df['B'], normalize=True))
print('-----')
print(pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum))
print('-----')
print(pd.crosstab(df['A'], df['B'], values=df['C'], aggfunc=np.sum, margins=True))
print('-----')
================================
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
-----
B  3  4
A
1 1 0
2 1 3
-----
B    3    4
A
1 0.2 0.0
2 0.2 0.6
-----
B    3    4
A
1 1.0 NaN
2 1.0 2.0
-----
B      3    4  All
A
1    1.0  NaN  1.0
2 1.0 2.0 3.0
All  2.0  2.0  4.0
-----
Reading data
Core functions: read_table, read_csv, read_excel
Reading delimited text data: read_table
import os
os.chdir('C:/Users/Hjx/Desktop/')

data1 = pd.read_table('data1.txt', delimiter=',', header = 0, index_col=1)
print(data1)
=================================================
va1 va3 va4
va2
2 1 3 4
3 2 4 5
4 3 5 6
5 4 6 7
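read_table is essentially read_csv with a configurable separator, and recent pandas documentation recommends read_csv directly; a hedged sketch reading the same (hypothetical) data1.txt:

# Hedged sketch: equivalent call through read_csv
data1 = pd.read_csv('data1.txt', sep=',', header=0, index_col=1)
print(data1)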
Reading CSV data: read_csv
data2 = pd.read_csv('data2.csv', engine = 'python')
print(data2.head())
================================
省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \
0  130000  河北省   130100  石家庄市  2000  陈来立  NaN  NaN  NaN  NaN
1  130000  河北省   130100  石家庄市  2001  吴振华  NaN  NaN  NaN  NaN
2  130000  河北省   130100  石家庄市  2002  吴振华  NaN  NaN  NaN  NaN
3  130000  河北省   130100  石家庄市  2003  吴振华  NaN  NaN  NaN  NaN
4  130000  河北省   130100  石家庄市  2004  吴振华  NaN  NaN  NaN  NaN

  ...  民族  教育  是否是党校教育(是=1,否=0)  专业:人文  专业:社科  专业:理工  专业:农科  专业:医科  入党年份  工作年份
0 ...  NaN  硕士  1.0               NaN    NaN    NaN    NaN    NaN    NaN   NaN
1 ...  NaN  本科  0.0               0.0    0.0    1.0    0.0    0.0    NaN   NaN
2 ...  NaN  本科  0.0               0.0    0.0    1.0    0.0    0.0    NaN   NaN
3 ...  NaN  本科  0.0               0.0    0.0    1.0    0.0    0.0    NaN   NaN
4 ...  NaN  本科  0.0               0.0    0.0    1.0    0.0    0.0    NaN   NaN

[5 rows x 23 columns]
Reading Excel data: read_excel
data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx',
                      sheetname='中国人民共和国地市级党委书记数据库(2000-10)',
                      header=0)
print(data3)
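Note that the sheetname argument used above is the old spelling; recent pandas versions expect sheet_name (and need an engine such as openpyxl installed for .xlsx files). A hedged sketch:

# Hedged sketch: read_excel on recent pandas (sheet_name instead of sheetname)
data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx',
                      sheet_name='中国人民共和国地市级党委书记数据库(2000-10)',
                      header=0)
print(data3.head())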