Python数据分析与展示之Pandas数据特征分析学习笔记手札及代码实战

    技术2022-07-11  79

    Pandas数据特征分析

    学习笔记手札及单元小结数据的排序Pandas库的数据排序 数据的基本统计分析数据的累计统计分析累计统计分析函数 数据的相关分析实例:房价增幅与M2增幅的相关性

    学习笔记手札及单元小结

    数据的排序

    Pandas库的数据排序

    .sort_index()方法在指定轴上根据索引进行排序,默认升序 .sort_index(axis=0, ascending=True)

    import pandas as pd import numpy as np b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b']) b Out[4]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 d 10 11 12 13 14 b 15 16 17 18 19 b.sort_index() Out[5]: 0 1 2 3 4 a 5 6 7 8 9 b 15 16 17 18 19 c 0 1 2 3 4 d 10 11 12 13 14 b.sort_index(ascending=False) Out[6]: 0 1 2 3 4 d 10 11 12 13 14 c 0 1 2 3 4 b 15 16 17 18 19 a 5 6 7 8 9 c =b.sort_index(axis=1,ascending=False) c Out[8]: 4 3 2 1 0 c 4 3 2 1 0 a 9 8 7 6 5 d 14 13 12 11 10 b 19 18 17 16 15 c = c.sort_index() c Out[10]: 4 3 2 1 0 a 9 8 7 6 5 b 19 18 17 16 15 c 4 3 2 1 0 d 14 13 12 11 10

    .sort_values()方法在指定轴上根据数值进行排序,默认升序 Series.sort_values(axis=0,ascending=True) DataFrame.sort_values(by,axis=0,ascending=True) by:axis轴上的某个索引或索引列表

    import pandas as pd import numpy as np b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b']) b Out[4]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 d 10 11 12 13 14 b 15 16 17 18 19 c = b.sort_values(2,ascending=False) c Out[6]: 0 1 2 3 4 b 15 16 17 18 19 d 10 11 12 13 14 a 5 6 7 8 9 c 0 1 2 3 4 c = c.sort_values('a',axis=1,ascending=False) c Out[8]: 4 3 2 1 0 b 19 18 17 16 15 d 14 13 12 11 10 a 9 8 7 6 5 c 4 3 2 1 0

    NaN统一放到排序末尾

    import pandas as pd import numpy as np a = pd.DataFrame(np.arange(12).reshape(3,4),index=['a','b','c']) a Out[4]: 0 1 2 3 a 0 1 2 3 b 4 5 6 7 c 8 9 10 11 b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','b','d']) b Out[6]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 b 10 11 12 13 14 d 15 16 17 18 19 c =a + b c Out[8]: 0 1 2 3 4 a 5.0 7.0 9.0 11.0 NaN b 14.0 16.0 18.0 20.0 NaN c 8.0 10.0 12.0 14.0 NaN d NaN NaN NaN NaN NaN c.sort_values(2,ascending = False) Out[9]: 0 1 2 3 4 b 14.0 16.0 18.0 20.0 NaN c 8.0 10.0 12.0 14.0 NaN a 5.0 7.0 9.0 11.0 NaN d NaN NaN NaN NaN NaN c.sort_values(2,ascending(2,ascending = True)) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-10-7fe7e363a39a> in <module> ----> 1 c.sort_values(2,ascending(2,ascending = True)) NameError: name 'ascending' is not defined c.sort_values(2,ascending = True) Out[11]: 0 1 2 3 4 a 5.0 7.0 9.0 11.0 NaN c 8.0 10.0 12.0 14.0 NaN b 14.0 16.0 18.0 20.0 NaN d NaN NaN NaN NaN NaN

    数据的基本统计分析

    import pandas as pd a = pd.Series([9,8,7,6],index=['a','b','c','d']) a Out[3]: a 9 b 8 c 7 d 6 dtype: int64 a.describe() Out[4]: count 4.000000 mean 7.500000 std 1.290994 min 6.000000 25% 6.750000 50% 7.500000 75% 8.250000 max 9.000000 dtype: float64 type(a.describe()) Out[5]: pandas.core.series.Series a.describa()['count'] #一定要注意规范书写代码 --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-6-30160ee76872> in <module> ----> 1 a.describa()['count'] ~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name) 5272 if self._info_axis._can_hold_identifiers_and_holds_name(name): 5273 return self[name] -> 5274 return object.__getattribute__(self, name) 5275 5276 def __setattr__(self, name: str, value) -> None: AttributeError: 'Series' object has no attribute 'describa' a.describe()['count'] Out[7]: 4.0 a.describe()['max'] Out[8]: 9.0

    数据的累计统计分析

    import pandas as pd import numpy as np b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b']) b Out[4]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 d 10 11 12 13 14 b 15 16 17 18 19 b.cumsum() Out[5]: 0 1 2 3 4 c 0 1 2 3 4 a 5 7 9 11 13 d 15 18 21 24 27 b 30 34 38 42 46 b.cumprod() Out[6]: 0 1 2 3 4 c 0 1 2 3 4 a 0 6 14 24 36 d 0 66 168 312 504 b 0 1056 2856 5616 9576 b.cummin() Out[7]: 0 1 2 3 4 c 0 1 2 3 4 a 0 1 2 3 4 d 0 1 2 3 4 b 0 1 2 3 4 b.cummax() Out[8]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 d 10 11 12 13 14 b 15 16 17 18 19

    累计统计分析函数

    import pandas as pd import numpy as np b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','b','d']) b Out[4]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 b 10 11 12 13 14 d 15 16 17 18 19 b.rolling(2).sum() Out[5]: 0 1 2 3 4 c NaN NaN NaN NaN NaN a 5.0 7.0 9.0 11.0 13.0 b 15.0 17.0 19.0 21.0 23.0 d 25.0 27.0 29.0 31.0 33.0 b.rolling(3).sum() Out[6]: 0 1 2 3 4 c NaN NaN NaN NaN NaN a NaN NaN NaN NaN NaN b 15.0 18.0 21.0 24.0 27.0 d 30.0 33.0 36.0 39.0 42.0

    数据的相关分析

    实例:房价增幅与M2增幅的相关性

    import pandas as pd import numpy as np b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','b','d']) b Out[4]: 0 1 2 3 4 c 0 1 2 3 4 a 5 6 7 8 9 b 10 11 12 13 14 d 15 16 17 18 19 b.rolling(2).sum() Out[5]: 0 1 2 3 4 c NaN NaN NaN NaN NaN a 5.0 7.0 9.0 11.0 13.0 b 15.0 17.0 19.0 21.0 23.0 d 25.0 27.0 29.0 31.0 33.0 b.rolling(3).sum() Out[6]: 0 1 2 3 4 c NaN NaN NaN NaN NaN a NaN NaN NaN NaN NaN b 15.0 18.0 21.0 24.0 27.0 d 30.0 33.0 36.0 39.0 42.0 import pandas as pd hprice = pd.Series([3.04,22.93,12.75,22.6,12.33],index=['2008','2009','2010','2011','2012']) m2 = pd.Series([8.18,18.38,9.13,7.82,6.69],index=['2008','2009','2010','2011','2012']) hprice.corr(m2) Out[10]: 0.5239439145220387
    Processed: 0.012, SQL: 9