[pandas Summary 3] Common DataFrame attributes, functions, and indexing methods

Technology, 2022-07-10

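1. Common DataFrame attributes, functions, and indexing methods

1.1 DataFrame overview

A DataFrame is a tabular data structure holding an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). It has both a row index and a column index, and can be thought of as a dict of Series that all share one index. A column can be retrieved as a Series either dict-style (frame['xxx']) or attribute-style (frame.xxx). Rows can be retrieved by position or by label.

Assigning to a column that does not exist creates a new column.

>>> del frame['xxx']  # delete a column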
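The column operations described above can be sketched as follows. The frame contents and the column name Grade are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Name": ["Tom", "Kim", "Andy"], "Age": [18, 16, 19]},
    index=["No.1", "No.2", "No.3"],
)

# A column can be read dict-style or attribute-style; both return a Series.
ages = frame["Age"]
same_ages = frame.Age

# Assigning to a column that does not exist yet creates it.
frame["Grade"] = 9
columns_after_assign = list(frame.columns)

# del removes the column in place.
del frame["Grade"]
columns_after_del = list(frame.columns)
```

Attribute-style access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the dict style is the safer default.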

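1.2 Common DataFrame attributes

Attribute | Description
values | The DataFrame's values
index | Row index
index.name | Name of the row index
columns | Column index
columns.name | Name of the column index
ix | Returns rows of the DataFrame (removed in pandas 1.0; use loc for labels and iloc for positions)
ix[[x,y,...], [x,y,...]] | Reindexes the rows, then the columns
T | Transposes the rows and columns of the frame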
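A minimal sketch of these attributes on a current pandas release, where ix no longer exists. It assumes iloc (positions) and loc (labels) as the replacements; the data mirrors the student example used later in this article.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Tom", "Kim", "Andy"], "Age": [18, 16, 19], "Height": [1.6, 1.5, 1.7]},
    index=["No.1", "No.2", "No.3"],
)
df.index.name = "StudentID"
df.columns.name = "StudentInfo"

values = df.values                 # the underlying data as a NumPy array
first_row = df.iloc[0]             # row by integer position
same_row = df.loc["No.1"]          # the same row by label
subset = df.iloc[[0, 2], [0, 1]]   # rows 0 and 2, columns 0 and 1
transposed = df.T                  # rows and columns swap, and so do their names
```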

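1.3 Common DataFrame functions

1.3.1 Functions

Function | Description
DataFrame(dict, columns=dict.index, index=[dict.columnnum]) | Builds a DataFrame
DataFrame(2-D ndarray) | A matrix of data; row and column labels can also be passed
DataFrame(dict of arrays, lists, or tuples) | Each sequence becomes one column of the DataFrame. All sequences must have the same length
DataFrame(NumPy structured/record array) | Treated like a dict of arrays
DataFrame(dict of Series) | Each Series becomes one column. If no index is given explicitly, the indexes of the Series are merged to form the row index of the result
DataFrame(dict of dicts) | Each inner dict becomes one column. The keys are merged to form the row index of the result
DataFrame(list of dicts or Series) | Each item becomes one row of the DataFrame. The union of the keys or indexes becomes the column labels
DataFrame(list of lists or tuples) | Treated like a 2-D ndarray
DataFrame(DataFrame) | The indexes of the source DataFrame are reused
DataFrame(NumPy MaskedArray) | Treated like a 2-D ndarray, except that masked values become NA/missing values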
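A short sketch of several of the constructor inputs listed above. All variable names and values are illustrative.

```python
import numpy as np
import pandas as pd

# 2-D ndarray with explicit row and column labels.
from_array = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=["r1", "r2"], columns=["a", "b", "c"]
)

# Dict of equal-length lists: each list becomes one column.
from_lists = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Dict of Series: the Series indexes are merged into the row index.
from_series = pd.DataFrame(
    {"a": pd.Series([1, 2], index=["x", "y"]), "b": pd.Series([3], index=["y"])}
)

# Dict of dicts: outer keys become columns, inner keys become row labels.
from_nested = pd.DataFrame({"a": {"x": 1, "y": 2}, "b": {"y": 3}})

# List of dicts: each dict becomes one row, the union of keys the columns.
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
```

Wherever a label is present in one input but not another, the corresponding cell is filled with NaN.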

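Function | Description
df.reindex([x,y,...], fill_value=NaN, limit) | Returns a new object conformed to the new index. Labels missing from the original are filled with fill_value; limit caps the number of consecutive entries filled when a fill method is used
df.reindex([x,y,...], method=None) | Returns a new object conformed to the new index, filling gaps with the given method (for example 'ffill' or 'bfill')
df.reindex([x,y,...], columns=[x,y,...], copy=True) | Reindexes rows and columns at the same time; a new object is copied by default
df.drop(index, axis=0) | Drops the given labels from the given axis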
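The reindex and drop behaviour can be illustrated with a small frame. The labels a, c, e are made up so that the new index contains gaps.

```python
import pandas as pd

df = pd.DataFrame({"Age": [18, 16, 19]}, index=["a", "c", "e"])

# Labels absent from the original index receive fill_value instead of NaN.
filled = df.reindex(["a", "b", "c"], fill_value=0)

# method="ffill" copies the previous row forward; this requires a sorted index.
forward = df.reindex(["a", "b", "c", "d"], method="ffill")

# drop returns a new object without the given row label; df is unchanged.
dropped = df.drop("c", axis=0)
```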

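1.3.2 Sorting functions

Function | Description
df.sort_index(axis=0, ascending=True) | Sorts by the index labels along the given axis
df.sort_values(by=[a,b,...]) | Sorts by the values of the given columns (older pandas releases spelled this df.sort_index(by=[a,b,...]))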
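A minimal sketch of sorting by labels versus sorting by values. The index is deliberately out of order so the effect is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Tom", "Kim", "Andy"], "Age": [18, 16, 19]},
    index=["No.2", "No.3", "No.1"],
)

by_label = df.sort_index()                       # row labels, ascending
by_label_desc = df.sort_index(ascending=False)   # row labels, descending
by_column_name = df.sort_index(axis=1)           # column labels
by_age = df.sort_values(by="Age")                # values of one column, ascending
by_age_desc = df.sort_values(by="Age", ascending=False)
```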

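1.3.3 Summary statistics functions

Function | Description
df.count() | Number of non-NaN values
df.describe() | Produces several summary statistics at once
df.min() | Minimum
df.max() | Maximum
df.idxmax(axis=0, skipna=True) | Returns a Series holding the index label of each maximum
df.idxmin(axis=0, skipna=True) | Returns a Series holding the index label of each minimum
df.quantile(axis=0) | Computes a sample quantile (the median by default)
df.sum(axis=0, skipna=True, level=None) | Returns a Series of sums
df.mean(axis=0, skipna=True, level=None) | Returns a Series of means
df.median(axis=0, skipna=True, level=None) | Returns a Series of arithmetic medians
df.mad(axis=0, skipna=True, level=None) | Returns a Series of mean absolute deviations from the mean
df.var(axis=0, skipna=True, level=None) | Returns a Series of variances
df.std(axis=0, skipna=True, level=None) | Returns a Series of standard deviations
df.skew(axis=0, skipna=True, level=None) | Returns the sample skewness (third moment)
df.kurt(axis=0, skipna=True, level=None) | Returns the sample kurtosis (fourth moment)
df.cumsum(axis=0, skipna=True) | Returns the cumulative sum
df.cummin(axis=0, skipna=True) | Returns the cumulative minimum
df.cummax(axis=0, skipna=True) | Returns the cumulative maximum
df.cumprod(axis=0, skipna=True) | Returns the cumulative product
df.diff(axis=0) | Returns the first difference
df.pct_change(axis=0) | Returns the percentage change

In pandas 2.0 and later, df.mad and the level argument are no longer available; group with df.groupby(level=...) before aggregating instead of passing level.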
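A short sketch of a few of these functions on a current pandas release. Since DataFrame.mad is gone from pandas 2.0, the mean absolute deviation is computed by hand here; the scores are taken from the example data used later in this article.

```python
import pandas as pd

df = pd.DataFrame(
    {"Math": [60, 70, 100], "English": [98, 68, 69]},
    index=["No.1", "No.2", "No.3"],
)

col_sum = df.sum()           # axis=0: one value per column
row_sum = df.sum(axis=1)     # axis=1: one value per row
best = df.idxmax()           # row label of the maximum in each column
running_min = df.cummin()    # cumulative minimum down each column
change = df.diff()           # first difference; the first row is NaN

# Mean absolute deviation from the mean, replacing the removed df.mad().
mad = (df - df.mean()).abs().mean()
```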

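1.3.4 Computation functions

Function | Description
df.add(df2, fill_value=NaN, axis=1) | Element-wise addition; a value missing on one side after alignment is replaced by fill_value
df.sub(df2, fill_value=NaN, axis=1) | Element-wise subtraction; a value missing on one side after alignment is replaced by fill_value
df.div(df2, fill_value=NaN, axis=1) | Element-wise division; a value missing on one side after alignment is replaced by fill_value
df.mul(df2, fill_value=NaN, axis=1) | Element-wise multiplication; a value missing on one side after alignment is replaced by fill_value
df.apply(f, axis=0) | Applies f to the 1-D array formed by each column (axis=0) or each row (axis=1)
df.applymap(f) | Applies f to every element (renamed DataFrame.map in pandas 2.1)
df.cumsum(axis=0, skipna=True) | Cumulative sum; returns the accumulated DataFrame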
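The effect of fill_value is easiest to see when the two frames do not share all labels. The following is a minimal sketch with made-up data; it avoids applymap because that name is deprecated on recent pandas releases.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Math": [2, 4]}, index=["No.1", "No.2"])
df2 = pd.DataFrame({"Math": [1, 2]}, index=["No.2", "No.3"])

# The + operator yields NaN wherever a label exists on only one side.
plain = df1 + df2

# add(..., fill_value=0) treats the missing side as 0 instead.
filled = df1.add(df2, fill_value=0)

# apply passes each column to f as a Series; here f reduces it to one value.
col_range = df1.apply(lambda col: col.max() - col.min())

# A NumPy ufunc passed to apply works element by element.
squared = df1.apply(np.square)
```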

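1.4 DataFrame indexing methods

Indexing method | Description
df[val] | Selects a single column or a set of columns
df.ix[val] | Selects a single row or a set of rows
df.ix[:,val] | Selects a single column or a subset of columns
df.ix[val1,val2] | Selects rows and columns at the same time
reindex method | Conforms one or more axes to a new index
xs method | Selects a single row or column by label and returns a Series
icol, irow methods | Select a single column or row by integer position and return a Series
get_value, set_value | Get or set a single value by row label and column label

ix, icol, irow, get_value, and set_value are no longer available in current pandas; use loc and iloc for rows and columns, and at or iat for single values.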
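As a rough guide, the indexing forms in this section map onto current pandas as sketched below. The mapping in the comments is this sketch's reading of the older API, not an official equivalence table.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Tom", "Kim", "Andy"], "Age": [18, 16, 19], "Math": [95, 98, 96]},
    index=["No.1", "No.2", "No.3"],
)

cols = df[["Name", "Math"]]        # df[val]: columns by label
row_by_label = df.loc["No.2"]      # label-based row, formerly df.ix['No.2']
row_by_pos = df.iloc[1]            # position-based row, formerly df.ix[1] or irow
col_by_pos = df.iloc[:, 0]         # position-based column, formerly icol
block = df.loc[["No.1", "No.3"], ["Age", "Math"]]  # rows and columns together
cross = df.xs("No.1")              # xs: one row as a Series
value = df.at["No.3", "Math"]      # single value, formerly get_value
df.at["No.3", "Math"] = 97         # single assignment, formerly set_value
```

Keeping label access (loc, at) and position access (iloc, iat) separate removes the ambiguity that ix had when the index itself contained integers.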

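Arithmetic: by default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows. If a label is not found in both objects, they are reindexed to the union of the labels, and the unmatched positions become NaN.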
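The broadcasting rule above can be checked on a tiny frame. The labels a, b, c, x, y are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=["x", "y"])

# The Series index (a, b) is matched to the columns and applied to every row.
by_columns = df - pd.Series({"a": 1, "b": 10})

# Label c exists only in the Series and a only in the frame, so the result
# has the union of the labels and both unmatched columns are NaN.
partial = df + pd.Series({"b": 1, "c": 1})

# To match on the row index instead, use the method form with axis=0.
by_rows = df.sub(pd.Series({"x": 1, "y": 2}), axis=0)
```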

2. Example code for the common DataFrame attributes

# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Examples of the common DataFrame attributes from section 1.2:
values, index, index.name, columns, columns.name, ix, and T.
"""
# Note: the outputs below were recorded on an old pandas release, where the
# columns of a frame built from a dict come out sorted by name.
# DataFrame.ix was removed in pandas 1.0; on current releases use
# .iloc (positions) or .loc (labels) instead.
import pandas as pd
from pandas import DataFrame

if __name__ == '__main__':
    data = {'Name': ['Tom', 'Kim', 'Andy'],
            'Age': [18, 16, 19],
            'Height': [1.6, 1.5, 1.7]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind)
    #       Age  Height  Name
    # No.1   18     1.6   Tom
    # No.2   16     1.5   Kim
    # No.3   19     1.7  Andy

    # values: the DataFrame's data
    v = df.values  # <class 'numpy.ndarray'>
    # [[18 1.6 'Tom']
    #  [16 1.5 'Kim']
    #  [19 1.7 'Andy']]

    # Row index; if no index was supplied, this is the default integer index
    ind = df.index  # <class 'pandas.indexes.base.Index'>
    # Index(['No.1', 'No.2', 'No.3'], dtype='object')

    # Name of the row index; None until it is set
    iname = df.index.name  # None

    # Set the name of the row index, then read it back
    df.index.name = 'StudentID'
    iname = df.index.name  # StudentID

    # Column index
    col = df.columns  # <class 'pandas.indexes.base.Index'>
    # Index(['Age', 'Height', 'Name'], dtype='object')

    # Name of the column index; None until it is set
    cname = df.columns.name  # None

    # Set the name of the column index, then read it back
    df.columns.name = 'StudentInfo'
    cname = df.columns.name  # StudentInfo

    # ix with a single position returns that row as a Series
    ret = df.ix[0]  # first row
    # Age        18
    # Height    1.6
    # Name      Tom
    # Name: No.1, dtype: object

    ret = df.ix[1]  # second row, <class 'pandas.core.series.Series'>
    # Age        16
    # Height    1.5
    # Name      Kim
    # Name: No.2, dtype: object

    ret = df.ix[-1]  # last row
    # Age         19
    # Height     1.7
    # Name      Andy
    # Name: No.3, dtype: object

    # ix[[rowx, rowy, ...]] selects several rows and returns a DataFrame
    ret = df.ix[[0, 2]]
    # StudentInfo  Age  Height  Name
    # StudentID
    # No.1          18     1.6   Tom
    # No.3          19     1.7  Andy

    # ix[[rowx, rowy, ...], [colx, coly, ...]] selects rows, then columns
    ret = df.ix[[0, 2], [0, 1]]
    # StudentInfo  Age  Height
    # StudentID
    # No.1          18     1.6
    # No.3          19     1.7

    # T: transpose rows and columns
    print('Before transpose:\n', df)
    # Before transpose:
    # StudentInfo  Age  Height  Name
    # StudentID
    # No.1          18     1.6   Tom
    # No.2          16     1.5   Kim
    # No.3          19     1.7  Andy
    print('values before transpose:\n', df.values)
    # values before transpose:
    # [[18 1.6 'Tom']
    #  [16 1.5 'Kim']
    #  [19 1.7 'Andy']]
    dfT = df.T
    print('After transpose:\n', dfT)
    # After transpose:
    # StudentID   No.1 No.2  No.3
    # StudentInfo
    # Age           18   16    19
    # Height       1.6  1.5   1.7
    # Name         Tom  Kim  Andy
    print('values after transpose:\n', dfT.values)
    # values after transpose:
    # [[18 16 19]
    #  [1.6 1.5 1.7]
    #  ['Tom' 'Kim' 'Andy']]
    print('index.name before transpose:\n', df.index.name)      # StudentID
    print('index.name after transpose:\n', dfT.index.name)      # StudentInfo
    print('columns.name before transpose:\n', df.columns.name)  # StudentInfo
    print('columns.name after transpose:\n', dfT.columns.name)  # StudentID

3. Common DataFrame functions: DataFrame(), reindex(), drop()

# Note: the outputs below were recorded on an old pandas release.
# DataFrame.ix was removed in pandas 1.0; on current releases use
# .loc (labels) or .iloc (positions) instead.
def DataFrame_manual():
    '''
    A DataFrame is a data structure similar to a database table, with a row
    index and a column index. It can be seen as a dict of Series that share
    the same index. Underneath it is implemented with two-dimensional and
    one-dimensional blocks of data.
    '''
    import pandas as pd
    from pandas import DataFrame

    # 1. Creating a DataFrame
    # 1.1 From a dict of equal-length lists or NumPy arrays
    data = {'Name': ['Tom', 'Kim', 'Andy'],
            'Age': [18, 16, 19],
            'Height': [1.6, 1.5, 1.7]}

    # Default index [0, 1, 2, ...] and default column order
    df = pd.DataFrame(data)
    #    Age  Height  Name
    # 0   18     1.6   Tom
    # 1   16     1.5   Kim
    # 2   19     1.7  Andy

    # Explicit column order
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'])
    #    Name  Age  Height
    # 0   Tom   18     1.6
    # 1   Kim   16     1.5
    # 2  Andy   19     1.7

    # Explicit index
    df = pd.DataFrame(data, index=['1st', '2nd', '3th'])
    #      Age  Height  Name
    # 1st   18     1.6   Tom
    # 2nd   16     1.5   Kim
    # 3th   19     1.7  Andy

    # 1.2 From a nested dict
    # The outer keys become the column names and the inner keys become the
    # row labels. The resulting DataFrame is sorted by row label.
    data = {'Name': {'1st': 'Tom', '2nd': 'Kim', '3th': 'Andy'},
            'Age': {'1st': 18, '2nd': 16, '3th': 19},
            'Height': {'1st': 1.6, '2nd': 1.5, '3th': 1.7}}
    df = pd.DataFrame(data)  # row labels from the inner dicts, columns sorted by name
    #      Age  Height  Name
    # 1st   18     1.6   Tom
    # 2nd   16     1.5   Kim
    # 3th   19     1.7  Andy

    df = pd.DataFrame(data, ['3th', '2nd', '1st'])  # explicit row order
    #      Age  Height  Name
    # 3th   19     1.7  Andy
    # 2nd   16     1.5   Kim
    # 1st   18     1.6   Tom

    # 2. Accessing a DataFrame
    # 2.1 Dict-style indexing returns one column as a Series
    data = {'Name': ['Tom', 'Kim', 'Andy'],
            'Age': [18, 16, 19],
            'Height': [1.6, 1.5, 1.7]}
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'],
                      index=['1st', '2nd', '3th'])
    #      Name  Age  Height
    # 1st   Tom   18     1.6
    # 2nd   Kim   16     1.5
    # 3th  Andy   19     1.7
    s = df['Name']
    # 1st     Tom
    # 2nd     Kim
    # 3th    Andy
    # Name: Name, dtype: object

    # 2.2 ix returns rows
    s = df.ix['1st']  # single row by row label
    # Name      Tom
    # Age        18
    # Height    1.6
    # Name: 1st, dtype: object
    s = df.ix[0]  # single row by integer position
    # Name      Tom
    # Age        18
    # Height    1.6
    # Name: 1st, dtype: object
    s = df.ix[['3th', '2nd']]  # several rows
    #      Name  Age  Height
    # 3th  Andy   19     1.7
    # 2nd   Kim   16     1.5
    s = df.ix[range(3)]  # several rows by integer position
    #      Name  Age  Height
    # 1st   Tom   18     1.6
    # 2nd   Kim   16     1.5
    # 3th  Andy   19     1.7

    # 2.3 Value at a given column, then a given row
    ret = df['Name']['1st']    # Tom
    ret = df['Name'][0]        # Tom
    ret = df['Age']['1st']     # 18
    ret = df['Age'][0]         # 18
    ret = df['Height']['1st']  # 1.6
    ret = df['Height'][0]      # 1.6

    # 2.4 Value at a given row, then a given column
    ret = df.ix['1st']['Name']    # Tom
    ret = df.ix[0]['Name']        # Tom
    ret = df.ix['1st']['Age']     # 18
    ret = df.ix[0]['Age']         # 18
    ret = df.ix['1st']['Height']  # 1.6
    ret = df.ix[0]['Height']      # 1.6

    # 3. Modifying a DataFrame
    # 3.1 Add a column
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'],
                      index=['1st', '2nd', '3th'])
    df['Grade'] = 9  # new column 'Grade' with the same value, 9, in every row
    #      Name  Age  Height  Grade
    # 1st   Tom   18     1.6      9
    # 2nd   Kim   16     1.5      9
    # 3th  Andy   19     1.7      9

    # 3.2 Replace the values of a column
    df['Grade'] = [6, 7, 7]
    #      Name  Age  Height  Grade
    # 1st   Tom   18     1.6      6
    # 2nd   Kim   16     1.5      7
    # 3th  Andy   19     1.7      7

    # 3.3 Flag whether Grade is 7, assigning the new column from a Series
    s = pd.Series([False, True, True], index=['1st', '2nd', '3th'])
    df['HighGrade'] = s
    #      Name  Age  Height  Grade HighGrade
    # 1st   Tom   18     1.6      6     False
    # 2nd   Kim   16     1.5      7      True
    # 3th  Andy   19     1.7      7      True

    # 4. Naming the row and column indexes
    df = pd.DataFrame(data, columns=['Name', 'Age', 'Height'],
                      index=['1st', '2nd', '3th'])
    df.columns.name = 'Students'
    df.index.name = 'ID'
    # Students  Name  Age  Height
    # ID
    # 1st        Tom   18     1.6
    # 2nd        Kim   16     1.5
    # 3th       Andy   19     1.7

4. DataFrame sorting functions

import pandas as pd

def DataFrame_Sort():
    data = {'Name': {'No.1': 'Tom', 'No.2': 'Kim', 'No.3': 'Andy'},
            'Age': {'No.1': 18, 'No.2': 16, 'No.3': 19},
            'Height': {'No.1': 1.6, 'No.2': 1.5, 'No.3': 1.7}}
    df = pd.DataFrame(data)
    df.index.name = 'ID'
    df.columns.name = 'StudentInfo'
    # StudentInfo  Age  Height  Name
    # ID
    # No.1          18     1.6   Tom
    # No.2          16     1.5   Kim
    # No.3          19     1.7  Andy

    # Sort by row index, ascending
    ret = df.sort_index(ascending=True)
    # StudentInfo  Age  Height  Name
    # ID
    # No.1          18     1.6   Tom
    # No.2          16     1.5   Kim
    # No.3          19     1.7  Andy

    # Sort by row index, descending
    ret = df.sort_index(ascending=False)
    # StudentInfo  Age  Height  Name
    # ID
    # No.3          19     1.7  Andy
    # No.2          16     1.5   Kim
    # No.1          18     1.6   Tom

    # Sort by the values of the Age column, ascending
    ret = df.sort_values(by='Age', ascending=True)
    # StudentInfo  Age  Height  Name
    # ID
    # No.2          16     1.5   Kim
    # No.1          18     1.6   Tom
    # No.3          19     1.7  Andy

    # Sort by the values of the Age column, descending
    ret = df.sort_values(by='Age', ascending=False)
    # StudentInfo  Age  Height  Name
    # ID
    # No.3          19     1.7  Andy
    # No.1          18     1.6   Tom
    # No.2          16     1.5   Kim

5. DataFrame summary statistics functions

# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Aim: examples of the DataFrame summary statistics functions from section 1.3.3.
"""
# Note: the outputs below were recorded on an old pandas release.
# df.mad and the level argument were removed in pandas 2.0.
import pandas as pd
from pandas import DataFrame

if __name__ == '__main__':
    data = {'Name': ['Tom', 'Kim', 'Andy'],
            'Age': [18, 16, 19],
            'Height': [1.6, 1.5, 1.7]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind)
    df.index.name = 'ID'
    df.columns.name = 'StudentInfo'
    # StudentInfo  Age  Height  Name
    # ID
    # No.1          18     1.6   Tom
    # No.2          16     1.5   Kim
    # No.3          19     1.7  Andy

    # df.count(): number of non-NaN values
    cnt = df.count()
    # StudentInfo
    # Age       3
    # Height    3
    # Name      3
    # dtype: int64

    # df.describe(): several summary statistics at once
    # (count, mean, std, min, max, and so on)
    ret = df.describe()  # <class 'pandas.core.frame.DataFrame'>
    # StudentInfo        Age  Height
    # count         3.000000    3.00
    # mean         17.666667    1.60
    # std           1.527525    0.10
    # min          16.000000    1.50
    # 25%          17.000000    1.55
    # 50%          18.000000    1.60
    # 75%          18.500000    1.65
    # max          19.000000    1.70

    # df.min(): minimum of each column
    ret = df.min()
    # StudentInfo
    # Age         16
    # Height     1.5
    # Name      Andy
    # dtype: object

    # df.max(): maximum of each column
    ret = df.max()
    # StudentInfo
    # Age        19
    # Height    1.7
    # Name      Tom
    # dtype: object

    # The remaining examples use an all-numeric frame
    data = {'Age': [18, 16, 19],
            'Height': [1.6, 1.5, 1.7],
            'Math': [60, 70, 100],
            'English': [98, 68, 69],
            'Chinese': [50, 99, 70]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind)
    df.index.name = 'ID'
    df.columns.name = 'Student'
    # Student  Age  Chinese  English  Height  Math
    # ID
    # No.1      18       50       98     1.6    60
    # No.2      16       99       68     1.5    70
    # No.3      19       70       69     1.7   100

    # df.idxmax(axis=0, skipna=True): index label of each maximum
    ret = df.idxmax(axis=0)  # <class 'pandas.core.series.Series'>
    # Student
    # Age        No.3
    # Chinese    No.2
    # English    No.1
    # Height     No.3
    # Math       No.3
    # dtype: object
    ret = df.idxmax(axis=1)  # column name of the maximum in each row
    # ID
    # No.1    English
    # No.2    Chinese
    # No.3       Math
    # dtype: object

    # df.quantile(axis=0): sample quantile (the default is the median)
    ret = df.quantile(axis=0)  # median of each column
    # Student
    # Age        18.0
    # Chinese    70.0
    # English    69.0
    # Height      1.6
    # Math       70.0
    # dtype: float64

    # df.sum(axis=0, skipna=True): Series of sums
    ret = df.sum(axis=0)  # sum of each column
    # Student
    # Age         53.0
    # Chinese    219.0
    # English    235.0
    # Height       4.8
    # Math       230.0
    # dtype: float64
    ret = df.sum(axis=1)  # sum of each row (not meaningful for this sample)
    # ID
    # No.1    227.6
    # No.2    254.5
    # No.3    259.7
    # dtype: float64

    # df.mean(axis=0, skipna=True): Series of means
    ret = df.mean(axis=0)  # mean of each column
    # Student
    # Age        17.666667
    # Chinese    73.000000
    # English    78.333333
    # Height      1.600000
    # Math       76.666667
    # dtype: float64
    ret = df.mean(axis=1)  # mean of each row (not meaningful for this sample)
    # ID
    # No.1    45.52
    # No.2    50.90
    # No.3    51.94
    # dtype: float64

    # df.median(axis=0, skipna=True): Series of medians
    ret = df.median(axis=0)  # median of each column
    # Student
    # Age        18.0
    # Chinese    70.0
    # English    69.0
    # Height      1.6
    # Math       70.0
    # dtype: float64
    ret = df.median(axis=1)  # median of each row
    # ID
    # No.1    50.0
    # No.2    68.0
    # No.3    69.0
    # dtype: float64

    # df.mad(axis=0, skipna=True): mean absolute deviation from the mean
    # absolute deviation = |value - mean|
    ret = df.mad(axis=0)  # per column
    # Student
    # Age         1.111111
    # Chinese    17.333333
    # English    13.111111
    # Height      0.066667
    # Math       15.555556
    # dtype: float64
    ret = df.mad(axis=1)  # per row
    # ID
    # No.1    28.576
    # No.2    33.720
    # No.3    33.272
    # dtype: float64

    # df.var(axis=0, skipna=True): Series of variances
    ret = df.var(axis=0)  # per column
    # Student
    # Age          2.333333
    # Chinese    607.000000
    # English    290.333333
    # Height       0.010000
    # Math       433.333333
    # dtype: float64
    ret = df.var(axis=1)  # per row
    # ID
    # No.1    1417.552
    # No.2    1657.300
    # No.3    1634.018
    # dtype: float64

    # df.std(axis=0, skipna=True): Series of standard deviations
    ret = df.std(axis=0)  # per column
    # Student
    # Age         1.527525
    # Chinese    24.637370
    # English    17.039171
    # Height      0.100000
    # Math       20.816660
    # dtype: float64
    ret = df.std(axis=1)  # per row
    # ID
    # No.1    37.650392
    # No.2    40.709950
    # No.3    40.422989
    # dtype: float64

    # df.skew(axis=0, skipna=True): sample skewness (third moment)
    ret = df.skew(axis=0)  # per column
    # Student
    # Age       -0.935220
    # Chinese    0.539824
    # English    1.725342
    # Height     0.000000
    # Math       1.293343
    # dtype: float64
    ret = df.skew(axis=1)  # per row
    # ID
    # No.1    0.328682
    # No.2   -0.245853
    # No.3   -0.256661
    # dtype: float64

    # df.kurt(axis=0, skipna=True): sample kurtosis (fourth moment)
    ret = df.kurt(axis=0)  # per column; NaN because each column has only 3 values
    # Student
    # Age       NaN
    # Chinese   NaN
    # English   NaN
    # Height    NaN
    # Math      NaN
    # dtype: float64
    ret = df.kurt(axis=1)  # per row
    # ID
    # No.1   -0.582437
    # No.2   -2.079006
    # No.3   -1.879115
    # dtype: float64

    # df.cumsum(axis=0, skipna=True): cumulative sum
    ret = df.cumsum(axis=0)  # down each column
    # Student   Age  Chinese  English  Height   Math
    # ID
    # No.1     18.0     50.0     98.0     1.6   60.0
    # No.2     34.0    149.0    166.0     3.1  130.0
    # No.3     53.0    219.0    235.0     4.8  230.0
    ret = df.cumsum(axis=1)  # along each row
    # Student   Age  Chinese  English  Height   Math
    # ID
    # No.1     18.0     68.0    166.0   167.6  227.6
    # No.2     16.0    115.0    183.0   184.5  254.5
    # No.3     19.0     89.0    158.0   159.7  259.7

    # df.cummin(axis=0, skipna=True): cumulative minimum
    ret = df.cummin(axis=0)  # down each column
    # Student   Age  Chinese  English  Height  Math
    # ID
    # No.1     18.0     50.0     98.0     1.6  60.0
    # No.2     16.0     50.0     68.0     1.5  60.0
    # No.3     16.0     50.0     68.0     1.5  60.0
    ret = df.cummin(axis=1)  # along each row
    # Student   Age  Chinese  English  Height  Math
    # ID
    # No.1     18.0     18.0     18.0     1.6   1.6
    # No.2     16.0     16.0     16.0     1.5   1.5
    # No.3     19.0     19.0     19.0     1.7   1.7

    # df.cummax(axis=0, skipna=True): cumulative maximum
    ret = df.cummax(axis=0)  # down each column
    # Student   Age  Chinese  English  Height   Math
    # ID
    # No.1     18.0     50.0     98.0     1.6   60.0
    # No.2     18.0     99.0     98.0     1.6   70.0
    # No.3     19.0     99.0     98.0     1.7  100.0
    ret = df.cummax(axis=1)  # along each row
    # Student   Age  Chinese  English  Height   Math
    # ID
    # No.1     18.0     50.0     98.0    98.0   98.0
    # No.2     16.0     99.0     99.0    99.0   99.0
    # No.3     19.0     70.0     70.0    70.0  100.0

    # df.cumprod(axis=0, skipna=True): cumulative product
    ret = df.cumprod(axis=0)  # down each column
    # Student     Age   Chinese   English  Height      Math
    # ID
    # No.1       18.0      50.0      98.0    1.60      60.0
    # No.2      288.0    4950.0    6664.0    2.40    4200.0
    # No.3     5472.0  346500.0  459816.0    4.08  420000.0
    ret = df.cumprod(axis=1)  # along each row
    # Student   Age  Chinese   English    Height        Math
    # ID
    # No.1     18.0    900.0   88200.0  141120.0   8467200.0
    # No.2     16.0   1584.0  107712.0  161568.0  11309760.0
    # No.3     19.0   1330.0   91770.0  156009.0  15600900.0

    # df.diff(axis=0): first difference
    ret = df.diff(axis=0)  # down each column
    # Student  Age  Chinese  English  Height  Math
    # ID
    # No.1     NaN      NaN      NaN     NaN   NaN
    # No.2    -2.0     49.0    -30.0    -0.1  10.0
    # No.3     3.0    -29.0      1.0     0.2  30.0
    ret = df.diff(axis=1)  # along each row, <class 'pandas.core.frame.DataFrame'>
    # (as recorded by the author; this pandas version appears to difference the
    # integer columns and the float column Height separately)
    # Student  Age  Chinese  English  Height  Math
    # ID
    # No.1     NaN     32.0     48.0     NaN -38.0
    # No.2     NaN     83.0    -31.0     NaN   2.0
    # No.3     NaN     51.0     -1.0     NaN  31.0

    # df.pct_change(axis=0): percentage change
    ret = df.pct_change(axis=0)  # down each column
    # Student       Age   Chinese   English    Height      Math
    # ID
    # No.1          NaN       NaN       NaN       NaN       NaN
    # No.2    -0.111111  0.980000 -0.306122 -0.062500  0.166667
    # No.3     0.187500 -0.292929  0.014706  0.133333  0.428571
    ret = df.pct_change(axis=1)  # along each row
    # Student  Age   Chinese   English    Height       Math
    # ID
    # No.1     NaN  1.777778  0.960000 -0.983673  36.500000
    # No.2     NaN  5.187500 -0.313131 -0.977941  45.666667
    # No.3     NaN  2.684211 -0.014286 -0.975362  57.823529

6. DataFrame computation functions

# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Aim: examples of the DataFrame computation functions from section 1.3.4:
add, sub, div, mul, apply, applymap, and cumsum.
"""
# Note: the outputs below were recorded on an old pandas release, where the
# columns of a frame built from a dict come out sorted by name.
import numpy as np
import pandas as pd
from pandas import DataFrame

if __name__ == '__main__':
    data = {'Math': [2, 4, 6], 'English': [4, 8, 12]}
    ind = ['No.1', 'No.2', 'No.3']
    df1 = pd.DataFrame(data, index=ind)
    df1.index.name = 'ID'
    df1.columns.name = 'Student'
    # Student  English  Math
    # ID
    # No.1           4     2
    # No.2           8     4
    # No.3          12     6

    data = {'Math': [1, 2, 3], 'English': [2, 4, 6]}
    ind = ['No.1', 'No.2', 'No.3']
    df2 = pd.DataFrame(data, index=ind)
    df2.index.name = 'ID'
    df2.columns.name = 'Student'
    # Student  English  Math
    # ID
    # No.1           2     1
    # No.2           4     2
    # No.3           6     3

    # df.add(df2, fill_value=NaN, axis=1): element-wise addition
    ret = df1.add(df2)
    # Student  English  Math
    # ID
    # No.1           6     3
    # No.2          12     6
    # No.3          18     9

    # df.sub(df2, fill_value=NaN, axis=1): element-wise subtraction
    ret = df1.sub(df2)
    # Student  English  Math
    # ID
    # No.1           2     1
    # No.2           4     2
    # No.3           6     3

    # df.div(df2, fill_value=NaN, axis=1): element-wise division
    ret = df1.div(df2)
    # Student  English  Math
    # ID
    # No.1         2.0   2.0
    # No.2         2.0   2.0
    # No.3         2.0   2.0

    # df.mul(df2, fill_value=NaN, axis=1): element-wise multiplication
    ret = df1.mul(df2)
    # Student  English  Math
    # ID
    # No.1           8     2
    # No.2          32     8
    # No.3          72    18

    # df.apply(f, axis=0): apply f to the 1-D array formed by each column or row
    ret = df1.apply(np.square)  # np.square squares every element
    # Student  English  Math
    # ID
    # No.1          16     4
    # No.2          64    16
    # No.3         144    36

    # df.applymap(f): apply f to each element
    ret = df1.applymap(np.square)
    # Student  English  Math
    # ID
    # No.1          16     4
    # No.2          64    16
    # No.3         144    36

    # df.cumsum(axis=0, skipna=True): cumulative sum, returned as a DataFrame
    ret = df1.cumsum(axis=0)  # accumulate down each column
    # Student  English  Math
    # ID
    # No.1           4     2
    # No.2          12     6
    # No.3          24    12
    ret = df1.cumsum(axis=1)  # accumulate along each row
    # Student  English  Math
    # ID
    # No.1           4     6
    # No.2           8    12
    # No.3          12    18

7. Example code for the common DataFrame indexing methods

# -*- coding: utf-8 -*-
"""
@author: 蔚蓝的天空Tom
Aim: examples of the DataFrame indexing methods from section 1.4:
df[], df.ix[], df.reindex(), df.xs(), df.icol(), and so on.
"""
# Note: the outputs below were recorded on an old pandas release.
# DataFrame.ix was removed in pandas 1.0, and selecting columns by position
# with df[[0]] does not work on current releases when the column labels are
# strings; use .iloc (positions) or .loc (labels) instead.
import pandas as pd
from pandas import DataFrame

if __name__ == '__main__':
    data = {'Name': ['Tom', 'Kim', 'Andy'],
            'Age': [18, 16, 19],
            'Math': [95, 98, 96]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind, columns=['Name', 'Age', 'Math'])
    df.index.name = 'ID'
    df.columns.name = 'Student'
    # Student  Name  Age  Math
    # ID
    # No.1      Tom   18    95
    # No.2      Kim   16    98
    # No.3     Andy   19    96

    # df[val]: select a single column or a set of columns
    ret = df[[0]]  # first column
    # Student  Name
    # ID
    # No.1      Tom
    # No.2      Kim
    # No.3     Andy
    ret = df[[-1]]  # last column
    # Student  Math
    # ID
    # No.1       95
    # No.2       98
    # No.3       96
    ret = df[[-1, 0]]  # last column and first column
    # Student  Math  Name
    # ID
    # No.1       95   Tom
    # No.2       98   Kim
    # No.3       96  Andy

    # df.ix[val]: select a single row or a set of rows
    ret = df.ix[[0]]  # first row
    # Student  Name  Age  Math
    # ID
    # No.1      Tom   18    95
    ret = df.ix[[-1]]  # last row
    # Student  Name  Age  Math
    # ID
    # No.3     Andy   19    96
    ret = df.ix[[-1, 0]]  # last row and first row
    # Student  Name  Age  Math
    # ID
    # No.3     Andy   19    96
    # No.1      Tom   18    95

    # df.ix[:, val]: select a single column or a subset of columns
    ret = df.ix[0:2, [0]]  # first column, rows at positions 0 and 1
    # Student  Name
    # ID
    # No.1      Tom
    # No.2      Kim
    ret = df.ix[:-1, [0]]  # first column, all rows except the last
    # Student  Name
    # ID
    # No.1      Tom
    # No.2      Kim

    # df.ix[val1, val2]: select rows and columns at the same time
    ret = df.ix[[0], [0]]  # first row, first column
    # Student  Name
    # ID
    # No.1      Tom
    ret = df.ix[[0], [1]]  # first row, second column
    # Student  Age
    # ID
    # No.1      18
    ret = df.ix[[1], [0]]  # second row, first column
    # Student  Name
    # ID
    # No.2      Kim

df.reindex()+df.xs()+df.iloc[] + df.get_value() + df.get_values() + df.set_value()

import pandas as pd
from pandas import DataFrame

if __name__ == '__main__':
    data = {'Name': ['Tom', 'Kim', 'Andy'],
            'Age': [18, 16, 19],
            'Height': [1.7, 1.5, 1.6]}
    ind = ['No.1', 'No.2', 'No.3']
    df = pd.DataFrame(data, index=ind, columns=['Name', 'Age', 'Height'])
    df.index.name = 'ID'
    df.columns.name = 'Student'
    # Student  Name  Age  Height
    # ID
    # No.1      Tom   18     1.7
    # No.2      Kim   16     1.5
    # No.3     Andy   19     1.6

    # reindex conforms one or more axes to a new index
    ret = df.reindex(index=['No.3', 'No.2', 'No.1'])  # rows in the given order
    # Student  Name  Age  Height
    # ID
    # No.3     Andy   19     1.6
    # No.2      Kim   16     1.5
    # No.1      Tom   18     1.7
    ret = df.reindex(index=['No.3', 'No.2', 'No.1'], columns=['Name', 'Age'])
    # Recorded output; its last row shows a label 'No.' with NaN values, which
    # is what reindex returns for a label that is not in df. With 'No.1' as
    # written above, the last row would be Tom, 18.
    # Student  Name   Age
    # ID
    # No.3     Andy  19.0
    # No.2      Kim  16.0
    # No.       NaN   NaN
    ret = df.reindex(index=['No.1'], columns=['Name', 'Age'])
    # Student  Name  Age
    # ID
    # No.1      Tom   18
    ret = df.reindex(index=['No.1'], columns=['Name'])
    # Student  Name
    # ID
    # No.1      Tom

    # xs selects a single row or column by label and returns a Series
    ret = df.xs(key='No.1', axis=0)  # row No.1 (axis=0)
    # Student
    # Name      Tom
    # Age        18
    # Height    1.7
    # Name: No.1, dtype: object
    ret = df.xs(key='Name', axis=1)  # column Name; axis=1 is required
    # ID
    # No.1     Tom
    # No.2     Kim
    # No.3    Andy
    # Name: Name, dtype: object
    ret = df.xs(key='Age', axis=1)  # column Age; axis=1 is required
    # ID
    # No.1    18
