此文主要记录的是实验楼的百题闯关100题,特此记录,并加入输出和部分内容的补充。
Pandas 百题大冲关分为基础篇和进阶篇,每部分各有 50 道练习题。基础部分的练习题在于熟悉 Pandas 常用方法的使用,而进阶部分则侧重于 Pandas 方法的组合应用。
Pandas 的数据结构:Pandas 主要有Series(一维数组),DataFrame(二维数组),Panel(三维数组),Panel4D(四维数组),PanelND(更多维数组)等数据结构。其中 Series 和 DataFrame 应用的最为广泛。
Series 是一维带标签的数组,它可以包含任何数据类型。包括整数,字符串,浮点数,Python 对象等。Series 可以通过标签来定位。DataFrame 是二维的带标签的数据结构。我们可以通过标签来定位数据。这是 NumPy 所没有的。Pandas 中,Series 可以被看作由 1 列数据组成的数据集。
创建 Series 语法:s = pd.Series(data, index=index),可以通过多种方式进行创建,以下介绍了 3 个常用方法。
out:
0 0 1 1 2 2 3 3 4 4 dtype: int64提示:前面的 0, 1, 2, 3, 4 为当前 Series 的索引,后面的 0, 1, 2, 3, 4 为 Series 的值。
out:
a -0.419618 b -0.709257 c 0.288306 d -0.203162 e 1.754528 dtype: float64out:
a 1 b 2 c 3 d 4 e 5 dtype: int64out:
a 1 b 2 c 3 d 4 e 5 A 0 B 1 C 2 D 3 E 4 dtype: int64out:
a 1 b 2 c 3 d 4 e 5 A 0 B 1 C 2 D 3 E 4 dtype: int64 a 1 b 2 c 3 d 4 A 0 B 1 C 2 D 3 E 4 dtype: int64out:
a 1 b 2 c 3 d 4 A 6 B 1 C 2 D 3 E 4 dtype: int64out:
1例如对s4的前3个数据访问
s4[:3]out:
a 1 b 2 c 3 dtype: int64Series 的加法运算是按照索引计算,如果索引不同则填充为 NaN(空值)。
s4.add(s3)out:
A NaN B NaN C NaN D NaN E NaN a 2.0 b 4.0 c 6.0 d 8.0 e NaN dtype: float64Series的减法运算是按照索引对应计算,如果不同则填充为 NaN(空值)。
s4.sub(s3)out:
A NaN B NaN C NaN D NaN E NaN a 0.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64Series 的乘法运算是按照索引对应计算,如果索引不同则填充为 NaN(空值)。
s4.mul(s3)out:
A NaN B NaN C NaN D NaN E NaN a 1.0 b 4.0 c 9.0 d 16.0 e NaN dtype: float64Series 的除法运算是按照索引对应计算,如果索引不同则填充为 NaN(空值)。
s4.div(s3)out:
A NaN B NaN C NaN D NaN E NaN a 1.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64out:
3.0out:
26out:
6out:
1与 Sereis 不同,DataFrame 可以存在多列数据。一般情况下,DataFrame 也更加常用。
out:
2020-06-27 08:37:36.648563 0.136455 1.372062 -1.896225 1.005024 2020-06-28 08:37:36.648563 -1.208118 -0.806961 1.382154 1.417238 2020-06-29 08:37:36.648563 -1.800235 -0.272469 -0.966839 -0.188984 2020-06-30 08:37:36.648563 0.081191 -0.468042 0.551959 -0.441269 2020-07-01 08:37:36.648563 0.301758 -0.147157 0.632281 -0.622362 2020-07-02 08:37:36.648563 -0.762515 -1.773605 -0.990699 0.493300out:
animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 no f cat 2.0 3 no g snake 4.5 1 no h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 no当然如果希望每个键的值为一行,将DataFrame转置即可:
df2.Tout:
a b c d e f g h i j animal cat cat snake dog dog cat snake cat dog dog age 2.5 3 0.5 NaN 5 2 4.5 NaN 7 3 visits 1 3 2 3 2 3 1 1 2 1 priority yes yes no yes no no no yes no no此方法对快速了解陌生数据集结构十分有用。
df2.head() # 默认为显示 5 行,可根据需要在括号中填入希望预览的行数out:
animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 noout:
animal age visits priority h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 noout:
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')out:
Index(['animal', 'age', 'visits', 'priority'], dtype='object')out:
array([['cat', 2.5, 1, 'yes'], ['cat', 3.0, 3, 'yes'], ['snake', 0.5, 2, 'no'], ['dog', nan, 3, 'yes'], ['dog', 5.0, 2, 'no'], ['cat', 2.0, 3, 'no'], ['snake', 4.5, 1, 'no'], ['cat', nan, 1, 'yes'], ['dog', 7.0, 2, 'no'], ['dog', 3.0, 1, 'no']], dtype=object)out:
age visits count 8.000000 10.000000 mean 3.437500 1.900000 std 2.007797 0.875595 min 0.500000 1.000000 25% 2.375000 1.000000 50% 3.000000 2.000000 75% 4.625000 2.750000 max 7.000000 3.000000out:
a b c d e f g h i j animal cat cat snake dog dog cat snake cat dog dog age 2.5 3 0.5 NaN 5 2 4.5 NaN 7 3 visits 1 3 2 3 2 3 1 1 2 1 priority yes yes no yes no no no yes no noout:
animal age visits priority i dog 7.0 2 no e dog 5.0 2 no g snake 4.5 1 no b cat 3.0 3 yes j dog 3.0 1 no a cat 2.5 1 yes f cat 2.0 3 no c snake 0.5 2 no d dog NaN 3 yes h cat NaN 1 yesout:
animal age visits priority b cat 3.0 3 yes c snake 0.5 2 noout:
a 2.5 b 3.0 c 0.5 d NaN e 5.0 f 2.0 g 4.5 h NaN i 7.0 j 3.0 Name: age, dtype: float64等价于 df2.age
df2.age # 等价于 df2['age']out:
a 2.5 b 3.0 c 0.5 d NaN e 5.0 f 2.0 g 4.5 h NaN i 7.0 j 3.0 Name: age, dtype: float64out:
age animal a 2.5 cat b 3.0 cat c 0.5 snake d NaN dog e 5.0 dog f 2.0 cat g 4.5 snake h NaN cat i 7.0 dog j 3.0 dog.iloc[] 里面的第一参数为行,第二个参数为列
方法一:
df2.iloc[1:3] # 查询 2,3 行out:
animal age visits priority b cat 3.0 3 yes c snake 0.5 2 no方法二:
df2.iloc[1:3, 1] # 查询 2,3 行的第1列out:
b 3.0 c 0.5 Name: age, dtype: float64方法三:
df2.iloc[[0,2,3], [0,2]] # 查询 0,2,3 行的第0,1列out:
animal visits a cat 1 c snake 2 d dog 3out:
animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 no f cat 2.0 3 no g snake 4.5 1 no h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 noout:
animal age visits priority a False False False False b False False False False c False False False False d False True False False e False False False False f False False False False g False False False False h False True False False i False False False False j False False False False判断指定列是否为空:
df3['age'].isnull() # 如果age为空则返回为 Trueout:
a False b False c False d True e False f False g False h True i False j False Name: age, dtype: bool1)添加一个列 series数据:
num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index) df3['No.'] = num # 添加以 'No.' 为列名的新数据列 df3out:
animal age visits priority No. a cat 2.5 1 yes 0 b cat 3.0 3 yes 1 c snake 0.5 2 no 2 d dog NaN 3 yes 3 e dog 5.0 2 no 4 f cat 2.0 3 no 5 g snake 4.5 1 no 6 h cat NaN 1 yes 7 i dog 7.0 2 no 8 j dog 3.0 1 no 92)直接添加一个list数据:
num = list(range(1, 11)) # [1, 2, 3,....9, 10] df3['No.'] = num # 添加以 'No.' 为列名的新数据列 df3out:
animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 d dog NaN 3 yes 4 e dog 5.0 2 no 5 f cat 2.0 3 no 6 g snake 4.5 1 no 7 h cat NaN 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 101)通过 iat修改:
# 修改第 2 行与第 2 列对应的值 3.0 → 2.0 df3.iat[1, 1] = 2 # 索引序号从 0 开始,这里为 1, 1 df3out:
animal age visits priority No. a cat 2.5 1 yes 1 b cat 2.0 3 yes 2 c snake 0.5 2 no 3 d dog NaN 3 yes 4 e dog 5.0 2 no 5 f cat 2.0 3 no 6 g snake 4.5 1 no 7 h cat NaN 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 102)通过iloc修改:
# 修改第 2 行与第 2 列对应的值 2.0 → 3.0 df3.iloc[1, 1] = 3.0 # 索引序号从 0 开始,这里为 1, 1 df3out:
animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 d dog NaN 3 yes 4 e dog 5.0 2 no 5 f cat 2.0 3 no 6 g snake 4.5 1 no 7 h cat NaN 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 10out:
animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 d dog NaN 3 yes 4 e dog 5.0 2 no 5 f cat 1.5 3 no 6 g snake 4.5 1 no 7 h cat NaN 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 10out:
age 3.375 visits 1.900 No. 5.500 dtype: float641)对指定列求和:
df3['visits'].sum()out:
192)默认对所有列求和:
df3.sum()out:
animal catcatsnakedogdogcatsnakecatdogdog age 27 visits 19 priority yesyesnoyesnononoyesnono No. 55 dtype: objectout:
0 A 1 B 2 C 3 Aaba 4 Baca 5 NaN 6 CABA 7 dog 8 cat dtype: object 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: objectout:
0 A 1 B 2 C 3 AABA 4 BACA 5 NaN 6 CABA 7 DOG 8 CAT dtype: objectout:
animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 d dog NaN 3 yes 4 e dog 5.0 2 no 5 f cat 1.5 3 no 6 g snake 4.5 1 no 7 h cat NaN 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 10 animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 d dog 3.0 3 yes 4 e dog 5.0 2 no 5 f cat 1.5 3 no 6 g snake 4.5 1 no 7 h cat 3.0 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 10out:
animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 d dog NaN 3 yes 4 e dog 5.0 2 no 5 f cat 1.5 3 no 6 g snake 4.5 1 no 7 h cat NaN 1 yes 8 i dog 7.0 2 no 9 j dog 3.0 1 no 10 animal age visits priority No. a cat 2.5 1 yes 1 b cat 3.0 3 yes 2 c snake 0.5 2 no 3 e dog 5.0 2 no 5 f cat 1.5 3 no 6 g snake 4.5 1 no 7 i dog 7.0 2 no 9 j dog 3.0 1 no 10out:
key one 0 foo1 1 1 foo2 2 key two 0 foo2 4 1 foo3 5 key one two 0 foo2 2 4out:
写入成功.out:
cat 3.0 3 yes 2 0 snake 0.5 2 no 3 1 dog NaN 3 yes 4 2 dog 5.0 2 no 5 3 cat 1.5 3 no 6 4 snake 4.5 1 no 7 5 cat NaN 1 yes 8 6 dog 7.0 2 no 9 7 dog 3.0 1 no 10常用参数解析
-excel_writer : string or ExcelWriter object File path or existing ExcelWriter目标路径- sheet_name : string, default ‘Sheet1’ Name of sheet which will contain DataFrame,填充excel的第几页- na_rep : string, default ”,Missing data representation 缺失值填充- float_format : string, default None Format string for floating point numbers- columns : sequence, optional,Columns to write 选择输出的的列。- header : boolean or list of string, default True Write out column names. If a list of string is given it is assumed to be aliases for the column names- index : boolean, default True,Write row names (index)- index_label : string or sequence, default None, Column label for index column(s) if desired. If None is given, andheader and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.- startrow :upper left cell row to dump data frame- startcol :upper left cell column to dump data frame- engine : string, default None ,write engine to use - you can also set this via the options,io.excel.xlsx.writer, io.excel.xls.writer, andio.excel.xlsm.writer.- merge_cells : boolean, default True Write MultiIndex and Hierarchical Rows as merged cells.- encoding: string, default None encoding of the resulting excel file. Only necessary for xlwt,other writers support unicode natively.- inf_rep : string, default ‘inf’ Representation for infinity (there is no native representation for infinity in Excel)- freeze_panes : tuple of integer (length 2), default None Specifies the one-based bottommost row and rightmost column that is to be frozen df3.to_excel('animal.xlsx', sheet_name='Sheet1' ) print("写入成功.")out:
写入成功.常用参数解析:
io : string, path object ; excel 路径。
sheetname : string, int, mixed list of strings/ints, or None, default 0 返回多表使用sheetname=[0,1],若sheetname=None是返回全表 注意:int/string 返回的是dataframe,而none和list返回的是dict of dataframe
header : int, list of ints, default 0 指定列名行,默认0,即取第一行,数据为列名行以下的数据 若数据不含列名,则设定 header = None
skiprows : list-like,Rows to skip at the beginning,省略指定行数的数据
skip_footer : int,default 0, 省略从尾部数的int行数据
index_col: int, list of ints, default None指定列为索引列,也可以使用u”strings”
names : array-like, default None, 指定列的名字。
pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])out:
Unnamed: 0 animal age visits priority No. 0 a cat 2.5 1 yes 1 1 b cat 3.0 3 yes 2 2 c snake 0.5 2 no 3 3 d dog NaN 3 yes 4 4 e dog 5.0 2 no 5 5 f cat 1.5 3 no 6 6 g snake 4.5 1 no 7 7 h cat NaN 1 yes 8 8 i dog 7.0 2 no 9 9 j dog 3.0 1 no 10out:
2018-01-01 0.747769 2018-01-02 0.936370 2018-01-03 0.698550 2018-01-04 0.169079 2018-01-05 0.408279 ... 2018-12-27 0.854535 2018-12-28 0.343373 2018-12-29 0.367045 2018-12-30 0.129674 2018-12-31 0.020034 Freq: D, Length: 365, dtype: float64out:
27.895681550217724提取其中日期为周三的数据:
s[s.index.weekday == 2] # 周一为0,周三为2out:
2018-01-03 0.698550 2018-01-10 0.385514 2018-01-17 0.379610 2018-01-24 0.206736 2018-01-31 0.877809 2018-02-07 0.379703 2018-02-14 0.947930 2018-02-21 0.223449 2018-02-28 0.008998 2018-03-07 0.326007 2018-03-14 0.018386 2018-03-21 0.830026 2018-03-28 0.078421 2018-04-04 0.754530 2018-04-11 0.856349 2018-04-18 0.697045 2018-04-25 0.849338 2018-05-02 0.767433 2018-05-09 0.774195 2018-05-16 0.566739 2018-05-23 0.691705 2018-05-30 0.463924 2018-06-06 0.962409 2018-06-13 0.392350 2018-06-20 0.569404 2018-06-27 0.333270 2018-07-04 0.655454 2018-07-11 0.333805 2018-07-18 0.588172 2018-07-25 0.496672 2018-08-01 0.438350 2018-08-08 0.065597 2018-08-15 0.640373 2018-08-22 0.639175 2018-08-29 0.233980 2018-09-05 0.747509 2018-09-12 0.765390 2018-09-19 0.317519 2018-09-26 0.467703 2018-10-03 0.329899 2018-10-10 0.963069 2018-10-17 0.566592 2018-10-24 0.148371 2018-10-31 0.406007 2018-11-07 0.876346 2018-11-14 0.869522 2018-11-21 0.553947 2018-11-28 0.126612 2018-12-05 0.878116 2018-12-12 0.856469 2018-12-19 0.138991 2018-12-26 0.752213 dtype: float64out:
2018-01-31 0.483828 2018-02-28 0.552954 2018-03-31 0.419301 2018-04-30 0.523942 2018-05-31 0.547059 2018-06-30 0.460998 2018-07-31 0.482454 2018-08-31 0.462668 2018-09-30 0.557952 2018-10-31 0.461560 2018-11-30 0.499936 2018-12-31 0.492086 Freq: M, dtype: float64out:
2020-06-30 07:52:00 6130 2020-06-30 07:53:00 16115 2020-06-30 07:54:00 3366 Freq: T, dtype: int64out:
2020-06-30 08:01:12.838843+00:00 -0.361407 Freq: D, dtype: float64out:
2020-06-30 16:01:12.838843+08:00 -0.361407 Freq: D, dtype: float64看一看你当前的时间,是不是一致?与UTC差8个小时。
out:
2018-01-31 -0.293837 2018-02-28 -0.692926 2018-03-31 1.204096 2018-04-30 2.485244 2018-05-31 0.019893 Freq: M, dtype: float64 2018-01 -0.293837 2018-02 -0.692926 2018-03 1.204096 2018-04 2.485244 2018-05 0.019893 Freq: M, dtype: float64 2018-01-01 -0.293837 2018-02-01 -0.692926 2018-03-01 1.204096 2018-04-01 2.485244 2018-05-01 0.019893 Freq: MS, dtype: float64构建一个 letters = ['A', 'B', 'C'] 和 numbers = list(range(10))为索引,值为随机数的多重索引 Series。
letters = ['A', 'B', 'C'] numbers = list(range(10)) mi = pd.MultiIndex.from_product([letters, numbers]) # 设置多重索引 s = pd.Series(np.random.rand(30), index=mi) # 随机数 sout:
A 0 0.744729 1 0.351805 2 0.529587 3 0.043310 4 0.292182 5 0.740887 6 0.428499 7 0.653610 8 0.107801 9 0.590899 B 0 0.542375 1 0.231597 2 0.410738 3 0.634838 4 0.072990 5 0.188618 6 0.821767 7 0.624321 8 0.514436 9 0.695192 C 0 0.160558 1 0.631878 2 0.663879 3 0.667969 4 0.139756 5 0.878765 6 0.129184 7 0.449790 8 0.835275 9 0.965602 dtype: float64取值:
s['A'][0] # 取一级索引A里面的二级索引对应的值out:
0.7447294320037277查询索引为 1,3,6 的值:
# 查询索引为 1,3,6 的值 s.loc[:, [1, 3, 6]] # 可以理解为查询所有二级索引为 1,3,6的数据out:
A 1 0.351805 3 0.043310 6 0.428499 B 1 0.231597 3 0.634838 6 0.821767 C 1 0.631878 3 0.667969 6 0.129184 dtype: float64out:
A 5 0.740887 6 0.428499 7 0.653610 8 0.107801 9 0.590899 B 5 0.188618 6 0.821767 7 0.624321 8 0.514436 9 0.695192 dtype: float64创建一个以 letters = ['A', 'B'] 和 numbers = list(range(6))为索引,值为随机数据的多重索引 DataFrame。
frame = pd.DataFrame(np.arange(12).reshape(6, 2), index=[list('AAABBB'), list('123123')], columns=['hello', 'shiyanlou']) frameout:
hello shiyanlou A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11out:
hello shiyanlou first second A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11out:
hello shiyanlou first A 6 9 B 24 27out:
hello shiyanlou first second A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11 first second A 1 hello 0 shiyanlou 1 2 hello 2 shiyanlou 3 3 hello 4 shiyanlou 5 B 1 hello 6 shiyanlou 7 2 hello 8 shiyanlou 9 3 hello 10 shiyanlou 11 dtype: int64out:
hello shiyanlou first second A 1 0 1 2 2 3 3 4 5 B 1 6 7 2 8 9 3 10 11 hello shiyanlou second 1 2 3 1 2 3 first A 0 2 4 1 3 5 B 6 8 10 7 9 11可见,它将二级索引变成了列索引。
out:
animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes c snake 0.5 2 no d dog NaN 3 yes e dog 5.0 2 no f cat 2.0 3 no g snake 4.5 1 no h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 no查找age大于3的全部信息
df[df['age'] > 3]out:
animal age visits priority e dog 5.0 2 no g snake 4.5 1 no i dog 7.0 2 noout:
age visits c 0.5 2 d NaN 32)行和列坐标:
df.iloc[1, 2]out:
3查找 age<3 且为 cat 的全部数据。
df = pd.DataFrame(data, index=labels) df[(df['animal'] == 'cat') & (df['age'] < 3)]out:
animal age visits priority a cat 2.5 1 yes f cat 2.0 3 no查找animal包含cat和dog的数据:
df3[df3['animal'].isin(['cat', 'dog'])]out:
animal age visits priority a cat 2.5 1 yes b cat 3.0 3 yes d dog NaN 3 yes e dog 5.0 2 no f cat 2.0 3 no h cat NaN 1 yes i dog 7.0 2 no j dog 3.0 1 no提取第3、4、8行的animal、age这两列数据:
df.loc[df.index[[3, 4, 8]], ['animal', 'age']]out:
animal age d dog NaN e dog 5.0 i dog 7.0按照 age 降序,visits 升序排列:
df.sort_values(by=['age', 'visits'], ascending=[False, True])out:
animal age visits priority i dog 7.0 2 no e dog 5.0 2 no g snake 4.5 1 no j dog 3.0 1 no b cat 3.0 3 yes a cat 2.5 1 yes f cat 2.0 3 no c snake 0.5 2 no h cat NaN 1 yes d dog NaN 3 yes1) 替换值
将 priority 列的 yes 值替换为 True,no 值替换为 False:
df['priority'].map({'yes': True, 'no': False})out:
a True b True c False d True e False f False g False h True i False j False Name: priority, dtype: bool2)保留小数
将年龄保留两位小数:
#方法一: df['age'].map(lambda x: '%.2f' % x) #方法二: df['age'].map(lambda x: format(x, '.2f')) #方法三:(不一定有效) df['age'].round(decimals=2)out:
a 2.50 b 3.00 c 0.50 d nan e 5.00 f 2.00 g 4.50 h nan i 7.00 j 3.00 Name: age, dtype: object3) 转为百分数
#方法一: df['age'].map(lambda x: '%.2f%%' % (x*100)) #方法二: df['age'].map(lambda x: format(x, '.2%')) # 不需要乘1001)groupby 分组
df.groupby('animal').size()out:
animal cat 4 dog 4 snake 2 dtype: int64注意上面得到的结果就是一个Series类型,所以需要进一步取值按照Series的方法取即可,例如:
print(df.groupby('animal').size().values) print(df.groupby('animal').size().index) print(type(df.groupby('animal').size()))out:
[4 4 2] Index(['cat', 'dog', 'snake'], dtype='object', name='animal') <class 'pandas.core.series.Series'>2) groupby 分组求和
df.groupby('animal').sum()out:
age visits animal cat 7.5 8 dog 15.0 8 snake 5.0 31) 纵向拼接
temp_df1 = pd.DataFrame(np.random.randn(3, 2)) # 生成由随机数组成的 DataFrame 1 temp_df2 = pd.DataFrame(np.random.randn(3, 2)) # 生成由随机数组成的 DataFrame 2 temp_df3 = pd.DataFrame(np.random.randn(3, 2)) # 生成由随机数组成的 DataFrame 3 print(temp_df1) print(temp_df2) print(temp_df3) pieces = [temp_df1, temp_df2, temp_df3] pd.concat(pieces, axis=0) # axis默认为0, 即默认纵向拼接out:
0 1 0 0.437607 -0.648355 1 -0.416831 -0.405202 2 1.681175 -0.031025 0 1 0 -0.730415 -0.806742 1 -0.914077 0.809963 2 -0.488658 -0.620225 0 1 0 -1.210932 0.606868 1 -1.539275 1.830870 2 -0.906066 0.440358 0 1 0 0.437607 -0.648355 1 -0.416831 -0.405202 2 1.681175 -0.031025 0 -0.730415 -0.806742 1 -0.914077 0.809963 2 -0.488658 -0.620225 0 -1.210932 0.606868 1 -1.539275 1.830870 2 -0.906066 0.4403582)横向拼接
temp_df1 = pd.DataFrame(np.random.randn(3, 2)) # 生成由随机数组成的 DataFrame 1 temp_df2 = pd.DataFrame(np.random.randn(3, 2)) # 生成由随机数组成的 DataFrame 2 temp_df3 = pd.DataFrame(np.random.randn(3, 2)) # 生成由随机数组成的 DataFrame 3 print(temp_df1) print(temp_df2) print(temp_df3) pieces = [temp_df1, temp_df2, temp_df3] pd.concat(pieces, axis=1)out:
0 1 0 1.116153 0.250272 1 -0.941279 -0.159497 2 -0.537866 1.675018 0 1 0 -0.103160 1.228339 1 -0.149218 -0.551139 2 -0.229225 -0.156848 0 1 0 0.971171 -0.715241 1 0.077248 0.941577 2 1.535163 0.333749 0 1 0 1 0 1 0 1.116153 0.250272 -0.103160 1.228339 0.971171 -0.715241 1 -0.941279 -0.159497 -0.149218 -0.551139 0.077248 0.941577 2 -0.537866 1.675018 -0.229225 -0.156848 1.535163 0.333749out:
a b c d e f g \ 0 0.265336 0.281261 0.626660 0.455936 0.469568 0.160094 0.254143 1 0.293103 0.429918 0.861056 0.704762 0.534546 0.997590 0.651032 2 0.653752 0.239481 0.956774 0.983244 0.835387 0.739893 0.446470 3 0.335220 0.832347 0.925990 0.083933 0.092788 0.144650 0.284757 4 0.923494 0.926540 0.227792 0.872578 0.471281 0.786390 0.731639 h i j 0 0.168989 0.034899 0.797001 1 0.942421 0.926441 0.218743 2 0.776017 0.662287 0.806842 3 0.247964 0.102461 0.051523 4 0.665478 0.116302 0.256650out:
0 1 2 0 0.865347 0.060866 0.750320 1 0.905023 0.779393 0.969498 2 0.800366 0.334823 0.346131 3 0.930328 0.295275 0.761584 4 0.922344 0.904810 0.062543 means: 0 0.558844 1 0.884638 2 0.493774 3 0.662396 4 0.629899 dtype: float64 0 1 2 0 0.306502 -0.497978 0.191476 1 0.020385 -0.105246 0.084860 2 0.306593 -0.158950 -0.147642 3 0.267932 -0.367120 0.099188 4 0.292445 0.274911 -0.567356out:
A B 0 a 12 1 a 345 2 a 3 3 b 1 4 b 45 5 c 14 6 a 4 7 a 52 8 b 54 9 c 23 10 c 235 11 c 21 12 b 57 13 b 3 14 c 87 A a 409 b 156 c 345 Name: B, dtype: int64当分析庞大的数据时,为了更好的发掘数据特征之间的关系,且不破坏原数据,就可以利用透视表 pivot_table 进行操作。
新建表将 A, B, C 列作为索引进行聚合:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3, 'B': ['A', 'B', 'C'] * 4, 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2, 'D': np.random.randn(12), 'E': np.random.randn(12)}) print(df) pd.pivot_table(df, index=['A', 'B'])out:
A B C D E 0 one A foo -1.039258 -0.443270 1 one B foo 0.388518 0.524361 2 two C foo -0.330776 -0.878025 3 three A bar 0.000832 1.133901 4 one B bar 0.418298 1.626217 5 one C bar 0.459358 -1.203031 6 two A foo -0.658593 -1.116155 7 three B foo -0.331466 1.130495 8 one C foo -0.197646 0.726132 9 one A bar -0.895106 -0.461336 10 two B bar 0.584046 0.674531 11 three C bar -0.401441 -0.017452 D E A B one A -0.967182 -0.452303 B 0.403408 1.075289 C 0.130856 -0.238450 three A 0.000832 1.133901 B -0.331466 1.130495 C -0.401441 -0.017452 two A -0.658593 -1.116155 B 0.584046 0.674531 C -0.330776 -0.878025将该 DataFrame 的 D 列聚合,按照 A,B 列为索引进行聚合,聚合的方式为默认求均值。
pd.pivot_table(df, values=['D'], index=['A', 'B'])out:
D A B one A -0.967182 B 0.403408 C 0.130856 three A 0.000832 B -0.331466 C -0.401441 two A -0.658593 B 0.584046 C -0.330776上一题中 D 列聚合时,采用默认求均值的方法,若想使用更多的方式可以在 aggfunc 中实现。
sum len D D A B one A -1.934364 2.0 B 0.806816 2.0 C 0.261711 2.0 three A 0.000832 1.0 B -0.331466 1.0 C -0.401441 1.0 two A -0.658593 1.0 B 0.584046 1.0 C -0.330776 1.0D 列按照 A,B 列进行聚合时,若关心 C 列对 D 列的影响,可以加入 columns 值进行分析。
pd.pivot_table(df, values=['D'], index=['A', 'B'], columns=['C'], aggfunc=np.sum)out:
D C bar foo A B one A -0.895106 -1.039258 B 0.418298 0.388518 C 0.459358 -0.197646 three A 0.000832 NaN B NaN -0.331466 C -0.401441 NaN two A NaN -0.658593 B 0.584046 NaN C NaN -0.33077在透视表中由于不同的聚合方式,相应缺少的组合将为缺省值,可以加入 fill_value 对缺省值处理:
pd.pivot_table(df, values=['D'], index=['A', 'B'], columns=['C'], aggfunc=np.sum, fill_value=0)out:
D C bar foo A B one A -0.895106 -1.039258 B 0.418298 0.388518 C 0.459358 -0.197646 three A 0.000832 0.000000 B 0.000000 -0.331466 C -0.401441 0.000000 two A 0.000000 -0.658593 B 0.584046 0.000000 C 0.000000 -0.33077在数据的形式上主要包括数量型和性质型,数量型表示着数据可数范围可变,而性质型表示范围已经确定不可改变,绝对型数据就是性质型数据的一种。
out:
id raw_grade grade 0 1 a a 1 2 b b 2 3 b b 3 4 a a 4 5 a a 5 6 e eout:
id raw_grade grade 0 1 a very good 1 2 b good 2 3 b good 3 4 a very good 4 5 a very good 5 6 e very badout:
id raw_grade grade 0 1 a very good 1 2 b good 2 3 b good 3 4 a very good 4 5 a very good 5 6 e very badout:
id raw_grade grade 5 6 e very bad 1 2 b good 2 3 b good 0 1 a very good 3 4 a very good 4 5 a very gooout:
grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64常常我们得到的数据是不符合我们最终处理的数据要求,包括许多缺省值以及坏的数据,需要我们对数据进行清洗。
在FilghtNumber中有数值缺失,其中数值为按 10 增长,补充相应的缺省值使得数据完整,并让数据为 int 类型。
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'], 'FlightNumber': [10045, np.nan, 10065, np.nan, 10085], 'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]], 'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']}) df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int) dfout:
From_To FlightNumber RecentDelays Airline 0 LoNDon_paris 10045 [23, 47] KLM(!) 1 MAdrid_miLAN 10055 [] <Air France> (12) 2 londON_StockhOlm 10065 [24, 43, 87] (British Airways. ) 3 Budapest_PaRis 10075 [13] 12. Air France 4 Brussels_londOn 10085 [67, 32] "Swiss Air"其中From_to应该为两独立的两列From和To,将From_to依照_拆分为独立两列建立为一个新表。
temp = df.From_To.str.split('_', expand=True) temp.columns = ['From', 'To'] tempout:
From To 0 LoNDon paris 1 MAdrid miLAN 2 londON StockhOlm 3 Budapest PaRis 4 Brussels londOn其中注意到地点的名字都不规范(如:londON应该为London)需要对数据进行标准化处理。
temp['From'] = temp['From'].str.capitalize() #capitalize() 返回一个首字母大写的字符串 temp['To'] = temp['To'].str.capitalize() print(temp['From']) print(temp['To'])out:
0 London 1 Madrid 2 London 3 Budapest 4 Brussels Name: From, dtype: object 0 Paris 1 Milan 2 Stockholm 3 Paris 4 London Name: To, dtype: object将最开始的 From_to 列删除,加入整理好的 From 和 to 列。
df = df.drop('From_To', axis=1) df = df.join(temp) print(df)out:
FlightNumber RecentDelays Airline From To 0 10045 [23, 47] KLM(!) London Paris 1 10055 [] <Air France> (12) Madrid Milan 2 10065 [24, 43, 87] (British Airways. ) London Stockholm 3 10075 [13] 12. Air France Budapest Paris 4 10085 [67, 32] "Swiss Air" Brussels London如同 airline 列中许多数据有许多其他字符,会对后期的数据分析有较大影响,需要对这类数据进行修正。
print(df['Airline']) # extract()按正则表达式进行数据提取 df['Airline'] = df['Airline'].str.extract('([a-zA-Z\s]+)', expand=False).str.strip() dfout:
FlightNumber RecentDelays Airline From To 0 10045 [23, 47] KLM London Paris 1 10055 [] Air France Madrid Milan 2 10065 [24, 43, 87] British Airways London Stockholm 3 10075 [13] Air France Budapest Paris 4 10085 [67, 32] Swiss Air Brussels London在 RecentDelays 中记录的方式为列表类型,由于其长度不一,这会为后期数据分析造成很大麻烦。这里将 RecentDelays 的列表拆开,取出列表中的相同位置元素作为一列,若为空值即用 NaN 代替。
delays = df['RecentDelays'].apply(pd.Series) delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)] df = df.drop('RecentDelays', axis=1).join(delays) dfout:
FlightNumber Airline From To delay_1 delay_2 delay_3 0 10045 KLM London Paris 23.0 47.0 NaN 1 10055 Air France Madrid Milan NaN NaN NaN 2 10065 British Airways London Stockholm 24.0 43.0 87.0 3 10075 Air France Budapest Paris 13.0 NaN NaN 4 10085 Swiss Air Brussels London 67.0 32.0 NaN班级一部分同学的数学成绩表,如下图所示:
df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})out:
name grades 0 Alice 58 1 Bob 83 2 Candy 79 3 Dany 65 4 Ella 93 5 Frank 45 6 Grace 61 7 Jenny 88但我们更加关心的是该同学是否及格,将该数学成绩按照是否>60来进行划分。
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella', 'Frank', 'Grace', 'Jenny'], 'grades': [58, 83, 79, 65, 93, 45, 61, 88]}) def choice(x): if x > 60: return 1 else: return 0 df.grades = pd.Series(map(lambda x: choice(x), df.grades)) dfout:
name grades 0 Alice 0 1 Bob 1 2 Candy 1 3 Dany 1 4 Ella 1 5 Frank 0 6 Grace 1 7 Jenny 1一个列为A的 DataFrame 数据,如下图所示:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})尝试将 A 列中连续重复的数据清除.
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]}) df.loc[df['A'].shift() != df['A']]out:
A 0 1 1 2 3 3 4 4 5 5 8 6 9 7有时候,DataFrame 中不同列之间的数据差距太大,需要对其进行归一化处理。
其中,Max-Min归一化是简单而常见的一种方式,公式如下: Y = X − X m i n X m a x − X m i n Y=\frac{X-X_{min}}{X_{max}-X_{min}} Y=Xmax−XminX−Xmin
def normalization(df): numerator = df.sub(df.min()) denominator = (df.max()).sub(df.min()) Y = numerator.div(denominator) return Y df = pd.DataFrame(np.random.random(size=(5, 3))) print(df) normalization(df)out:
0 1 2 0 0.055276 0.666860 0.206399 1 0.873721 0.924465 0.105095 2 0.161571 0.979359 0.678480 3 0.698888 0.091796 0.692453 4 0.694759 0.888997 0.528819 0 1 2 0 0.000000 0.647914 0.172474 1 1.000000 0.938152 0.000000 2 0.129874 1.000000 0.976209 3 0.786385 0.000000 1.000000 4 0.781340 0.898191 0.721407为了更好的了解数据包含的信息,最直观的方法就是将其绘制成图。
out:
out:
out:
out: