Pandas: Pandas Beginner Notes (1): Pandas Data Formats and Basic Operations

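Pandas Beginner Notes

Starting today I am posting notes on basic Pandas syntax. Until now I have only used pandas in a piecemeal way, so I plan to spend some time each day recording its basic syntax systematically.
The main references are the video series "Python Pandas 数据分析，编程练习100例" (Python Pandas Data Analysis: 100 Programming Exercises) by the bilibili uploader @蚂蚁学python, and the W3Cschool Pandas tutorial in Chinese.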

P1 From list to series

# From list to series: build a pandas Series from a list
import pandas as pd

courses = ["Chinese", "Math", "English", "Chemistry"]
#courses_series = pd.Series(data=courses)
courses_series = pd.Series(courses)

print(courses_series)

Output:

>>> print(courses_series)
0      Chinese
1         Math
2      English
3    Chemistry
dtype: object

# Note: the index is the default integer index (0, 1, 2, ...).

P2 From dictionary to series

# From dictionary to series: build a pandas Series from a dictionary
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
#grades_series = pd.Series(data=grades)
grades_series = pd.Series(grades)

print(grades_series)

Output:

>>> print(grades_series)
Chinese       92
Math          95
English      100
Chemistry     98
dtype: int64

# Note: the dictionary keys become the index, and the corresponding values become the Series data.
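Because the keys become index labels, values can be looked up by label just as in the original dictionary. A minimal sketch of that, reusing the same `grades` data (the variable names `math_grade`, `labels`, and `values` are only for illustration):

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

# Look up a value by its label (the former dictionary key)
math_grade = grades_series["Math"]

# The labels and the values can also be inspected separately
labels = list(grades_series.index)
values = grades_series.to_list()

print(math_grade)
print(labels)
print(values)
```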

P3 From series to list

# From series to list: convert a Series back to a list
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

grades_original = grades_series.to_list()
print(grades_original)

Output:

>>> print(grades_original)
[92, 95, 100, 98]

# Only the data values (the dictionary values) are kept; the index labels are dropped.
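If the labels are needed as well, two other conversions may be more suitable than `to_list()`. The sketch below assumes the same `grades` data and uses `Series.to_dict()` and `Index.to_list()`; the names `grades_dict` and `subjects` are illustrative.

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

# Recover both labels and values as a dictionary
grades_dict = grades_series.to_dict()

# Or take only the index labels as a list
subjects = grades_series.index.to_list()

print(grades_dict)
print(subjects)
```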

P4 From series to dataframe

# From series to dataframe: convert a Series to a DataFrame
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

grades_df = pd.DataFrame(grades_series, columns=['grade'])
print(grades_df)

Output:

           grade
Chinese       92
Math          95
English      100
Chemistry     98

# The column name is set manually via columns=['grade']

Exercise: to also give the index column a name (say 'subject'):

grades_df = grades_series.reset_index() # move the index into a regular column
grades_df.columns = ['subject', 'grade'] # rename the columns

print(grades_df)

Output:

     subject  grade
0    Chinese     92
1       Math     95
2    English    100
3  Chemistry     98
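The two steps above (reset the index, then rename the columns) can probably be combined into one chained expression. This is only one possible variant, using `Series.rename_axis` to name the index and the `name` argument of `Series.reset_index` to name the value column:

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

# Name the index 'subject', then turn it into a column and
# name the value column 'grade' in the same call
grades_df = grades_series.rename_axis("subject").reset_index(name="grade")

print(grades_df)
```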

P5 Create series using numpy

# Create a pandas Series of length 9 with index 101-109, values 10-90, and dtype float

import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(10,100,10), # data= is omitted; arange(start, stop, step)
    index=np.arange(101,110),
    dtype='float'
)

print(s)

Output:

>>> print(s)
101    10.0
102    20.0
103    30.0
104    40.0
105    50.0
106    60.0
107    70.0
108    80.0
109    90.0
dtype: float64

P6 Change datatype in pandas series

import numpy as np
import pandas as pd

s1 = pd.Series(
    data = ["123", "456", "789"],
    index = list('abc')
)

# Goal: convert the strings in the data to integers (int)

s2 = s1.astype(int)   # s2 = s1.map(int) also works

Output:

>>> s1
a    123
b    456
c    789
dtype: object
>>> s2
a    123
b    456
c    789
dtype: int64 # the dtype has changed from object to int64
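`astype(int)` raises an error as soon as one string cannot be parsed as a number. When the data may be messy, `pd.to_numeric` with `errors="coerce"` is a common alternative. The sketch below uses a made-up Series `raw` containing one invalid entry; note that the result becomes float because the invalid entry turns into NaN.

```python
import pandas as pd

# Hypothetical input: the last entry is not a valid number
raw = pd.Series(["123", "456", "n/a"], index=list("abc"))

# Invalid strings become NaN instead of raising an error
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers)
```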

P7 Append elements to series

import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

## Now add the grades for two more courses, Biology and Spanish
new_grades_series = grades_series.append(
    pd.Series({"Biology": 99, "Spanish": 96})
)
print(new_grades_series)

Output:

>>> new_grades_series = grades_series.append(
...     pd.Series({"Biology": 99, "Spanish": 96})
... )
<stdin>:1: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
>>> print(new_grades_series)
Chinese       92
Math          95
English      100
Chemistry     98
Biology       99
Spanish       96
dtype: int64

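# pandas warns that Series.append is deprecated (it was removed in pandas 2.0, where this call raises an error). The same result can be obtained with pd.concat, which is a top-level pandas function rather than a Series method.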
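Following the hint in the warning, the append step can be rewritten with `pd.concat`, which takes a list of Series and joins them end to end. A minimal sketch with the same data:

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

# Concatenate the existing Series with a new one holding the extra courses
new_grades_series = pd.concat(
    [grades_series, pd.Series({"Biology": 99, "Spanish": 96})]
)

print(new_grades_series)
```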

P8 Reset_index

See: P4 From series to dataframe

P9 From dictionary to dataframe

import pandas as pd

df = pd.DataFrame(
    {
        "Abbr": ["A", "T", "C", "G"],
        "Nucleotide": ["Adenine", "Thymine", "Cytosine", "Guanine"],
        "Chinese Name": ["腺嘌呤", "胸腺嘧啶", "胞嘧啶", "鸟嘌呤"],
        "Count": [43442,52452,21342,23323]
    }
)
print(df)

Output:

>>> print(df)
  Abbr Nucleotide Chinese Name  Count
0    A    Adenine          腺嘌呤  43442
1    T    Thymine         胸腺嘧啶  52452
2    C   Cytosine          胞嘧啶  21342
3    G    Guanine          鸟嘌呤  23323

P10 Set index for dataframe

# The previous example used the default integer index; now use the Abbr column as the index
import pandas as pd

df = pd.DataFrame(
    {
        "Abbr": ["A", "T", "C", "G"],
        "Nucleotide": ["Adenine", "Thymine", "Cytosine", "Guanine"],
        "Chinese Name": ["腺嘌呤", "胸腺嘧啶", "胞嘧啶", "鸟嘌呤"],
        "Count": [43442,52452,21342,23323]
    }
)

df.set_index("Abbr", inplace=True)
print(df)

Output:

>>> print(df)
     Nucleotide Chinese Name  Count
Abbr                               
A       Adenine          腺嘌呤  43442
T       Thymine         胸腺嘧啶  52452
C      Cytosine          胞嘧啶  21342
G       Guanine          鸟嘌呤  23323

# The index is now the Abbr column
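`set_index` can also be called without `inplace=True`, in which case it returns a new DataFrame and leaves the original untouched. The sketch below uses a reduced version of the table (only the Abbr and Count columns, an assumption made to keep it short) and shows a label lookup plus the way back via `reset_index`:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Abbr": ["A", "T", "C", "G"],
        "Count": [43442, 52452, 21342, 23323],
    }
)

# Returns a new DataFrame; df itself keeps its integer index
indexed = df.set_index("Abbr")

# Rows can now be selected by label
count_t = indexed.loc["T", "Count"]

# reset_index moves Abbr back into a regular column
restored = indexed.reset_index()

print(count_t)
print(restored)
```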

P11 Generate datetime index (A month)

import pandas as pd

date_range = pd.date_range(start='2023-11-01', end='2023-11-30')
# Tested: the day can be written as either 01 or 1
print(date_range)

Output:

>>> print(date_range)
DatetimeIndex(['2023-11-01', '2023-11-02', '2023-11-03', '2023-11-04',
               '2023-11-05', '2023-11-06', '2023-11-07', '2023-11-08',
               '2023-11-09', '2023-11-10', '2023-11-11', '2023-11-12',
               '2023-11-13', '2023-11-14', '2023-11-15', '2023-11-16',
               '2023-11-17', '2023-11-18', '2023-11-19', '2023-11-20',
               '2023-11-21', '2023-11-22', '2023-11-23', '2023-11-24',
               '2023-11-25', '2023-11-26', '2023-11-27', '2023-11-28',
               '2023-11-29', '2023-11-30'],
              dtype='datetime64[ns]', freq='D')

# The dtype is datetime64[ns]

P12 Generate datetime index (Every Monday in a year)

import pandas as pd

date_range = pd.date_range(start='2023-1-1', end='2023-12-31', freq='W-MON')
# W-MON: W means weekly frequency; MON anchors it to Mondays

print(date_range)

Output:

>>> print(date_range)
DatetimeIndex(['2023-01-02', '2023-01-09', '2023-01-16', '2023-01-23',
               '2023-01-30', '2023-02-06', '2023-02-13', '2023-02-20',
               '2023-02-27', '2023-03-06', '2023-03-13', '2023-03-20',
               '2023-03-27', '2023-04-03', '2023-04-10', '2023-04-17',
               '2023-04-24', '2023-05-01', '2023-05-08', '2023-05-15',
               '2023-05-22', '2023-05-29', '2023-06-05', '2023-06-12',
               '2023-06-19', '2023-06-26', '2023-07-03', '2023-07-10',
               '2023-07-17', '2023-07-24', '2023-07-31', '2023-08-07',
               '2023-08-14', '2023-08-21', '2023-08-28', '2023-09-04',
               '2023-09-11', '2023-09-18', '2023-09-25', '2023-10-02',
               '2023-10-09', '2023-10-16', '2023-10-23', '2023-10-30',
               '2023-11-06', '2023-11-13', '2023-11-20', '2023-11-27',
               '2023-12-04', '2023-12-11', '2023-12-18', '2023-12-25'],
              dtype='datetime64[ns]', freq='W-MON')
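A quick way to check the result is to look at `dayofweek`, where Monday is 0. The sketch below regenerates the same range under the name `mondays` (an illustrative name) and counts the entries:

```python
import pandas as pd

mondays = pd.date_range(start="2023-1-1", end="2023-12-31", freq="W-MON")

# dayofweek is 0 for Monday, so every entry should equal 0
all_mondays = bool((mondays.dayofweek == 0).all())

# Number of Mondays in 2023
count = len(mondays)

print(all_mondays, count)
```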

P13 Generate datetime index (All hours in a day)

import pandas as pd

date_range = pd.date_range(start='2023-11-7', periods = 24, freq='H')
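# H means hourly (pandas 2.2 and later prefer the lowercase alias 'h')

# Alternative:
# date_range = pd.date_range(start='2023-11-7', end='2023-11-8', freq='H', inclusive='left')
# inclusive='left': closed on the left, open on the right, so 2023-11-08 00:00 is excluded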

print(date_range)

Output:

>>> print(date_range)
DatetimeIndex(['2023-11-07 00:00:00', '2023-11-07 01:00:00',
               '2023-11-07 02:00:00', '2023-11-07 03:00:00',
               '2023-11-07 04:00:00', '2023-11-07 05:00:00',
               '2023-11-07 06:00:00', '2023-11-07 07:00:00',
               '2023-11-07 08:00:00', '2023-11-07 09:00:00',
               '2023-11-07 10:00:00', '2023-11-07 11:00:00',
               '2023-11-07 12:00:00', '2023-11-07 13:00:00',
               '2023-11-07 14:00:00', '2023-11-07 15:00:00',
               '2023-11-07 16:00:00', '2023-11-07 17:00:00',
               '2023-11-07 18:00:00', '2023-11-07 19:00:00',
               '2023-11-07 20:00:00', '2023-11-07 21:00:00',
               '2023-11-07 22:00:00', '2023-11-07 23:00:00'],
              dtype='datetime64[ns]', freq='H')

P14 Generate Datetime in Dataframe

import pandas as pd

date_range = pd.date_range(start='2023-11-1', periods = 31)

# Build a DataFrame that uses the values of this DatetimeIndex as a column

df = pd.DataFrame(data=date_range, columns=['day'])

df['day of the year'] = df['day'].dt.dayofyear
# df['day'] is a datetime column, so its date fields are available through .dt; dayofyear is a built-in attribute giving the day number within the year

print(df)

Output:

>>> print(df)
          day  day of the year
0  2023-11-01              305
1  2023-11-02              306
2  2023-11-03              307
3  2023-11-04              308
4  2023-11-05              309
5  2023-11-06              310
6  2023-11-07              311
7  2023-11-08              312
8  2023-11-09              313
9  2023-11-10              314
10 2023-11-11              315
11 2023-11-12              316
12 2023-11-13              317
13 2023-11-14              318
14 2023-11-15              319
15 2023-11-16              320
16 2023-11-17              321
17 2023-11-18              322
18 2023-11-19              323
19 2023-11-20              324
20 2023-11-21              325
21 2023-11-22              326
22 2023-11-23              327
23 2023-11-24              328
24 2023-11-25              329
25 2023-11-26              330
26 2023-11-27              331
27 2023-11-28              332
28 2023-11-29              333
29 2023-11-30              334
30 2023-12-01              335

# Take a look at what the dayofyear attribute contains:

>>> df['day'].dt.dayofyear
0     305
1     306
2     307
3     308
4     309
5     310
6     311
7     312
8     313
9     314
10    315
11    316
12    317
13    318
14    319
15    320
16    321
17    322
18    323
19    324
20    325
21    326
22    327
23    328
24    329
25    330
26    331
27    332
28    333
29    334
30    335
Name: day, dtype: int64
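`dayofyear` is only one of the fields exposed through `.dt`. As a small sketch, the example below builds a three-day frame (shortened from the 31 days above to keep the output small) and adds a few other fields; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"day": pd.date_range(start="2023-11-01", periods=3)})

# Name of the weekday (English by default)
df["weekday"] = df["day"].dt.day_name()

# Month number, 1 to 12
df["month"] = df["day"].dt.month

# True only on the first day of a month
df["is_month_start"] = df["day"].dt.is_month_start

print(df)
```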

P15 Generate a dataset of random numbers with date as index

import pandas as pd
import numpy as np

date_range = pd.date_range(start='2023-11-7', periods=1000)

data = {
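    'norm': np.random.normal(loc=0, scale=1, size=1000),
    # np.random.normal: samples from a normal distribution; loc is the mean, scale the standard deviation, size the output shape
    'uniform': np.random.uniform(low=0, high=1, size=1000),
    # np.random.uniform: samples from a uniform distribution; low is the lower bound (inclusive), high the upper bound (exclusive), size the output shape
    'binomial': np.random.binomial(n=1, p=0.2, size=1000)
    # np.random.binomial: samples from a binomial distribution; n is the number of Bernoulli trials, p the success probability of each trial, size the output shape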
}

df = pd.DataFrame(data=data, index=date_range)

print(df)

Output:

>>> print(df)
                norm   uniform  binomial
2023-11-07 -0.673309  0.534941         0
2023-11-08 -0.641853  0.586685         0
2023-11-09  0.400873  0.312451         0
2023-11-10  2.219158  0.048007         0
2023-11-11 -1.581015  0.952856         0
...              ...       ...       ...
2026-07-29  0.697699  0.460210         0
2026-07-30  0.655960  0.089912         0
2026-07-31 -0.647422  0.630736         0
2026-08-01  0.925788  0.921439         0
2026-08-02  0.100924  0.508387         0

[1000 rows x 3 columns]
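The numbers above change on every run because no seed is set. If reproducible data is wanted, one option is NumPy's `Generator` API with a fixed seed. The helper `make_random_frame` and the seed value 42 below are assumptions for illustration, not part of the original exercise.

```python
import numpy as np
import pandas as pd


def make_random_frame(seed: int, periods: int = 1000) -> pd.DataFrame:
    """Build the same three-column random frame from a seeded generator."""
    rng = np.random.default_rng(seed)
    date_range = pd.date_range(start="2023-11-7", periods=periods)
    data = {
        "norm": rng.normal(loc=0, scale=1, size=periods),
        "uniform": rng.uniform(low=0, high=1, size=periods),
        "binomial": rng.binomial(n=1, p=0.2, size=periods),
    }
    return pd.DataFrame(data=data, index=date_range)


df_a = make_random_frame(seed=42)
df_b = make_random_frame(seed=42)

# The same seed yields identical frames
print(df_a.equals(df_b))
```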

P16 Print the first or last rows of a DataFrame

# Reuse data and date_range from P15
df = pd.DataFrame(data=data, index=date_range)

# head and tail, like their shell counterparts, print the first and last rows
print(df.head(10))
print(df.tail(5))

Output:

>>> print(df.head(10))
                norm   uniform  binomial
2023-11-07  0.402368  0.713925         0
2023-11-08  1.109252  0.370335         0
2023-11-09 -0.741132  0.492781         0
2023-11-10  1.299560  0.303082         0
2023-11-11 -0.738895  0.628298         0
2023-11-12 -0.758462  0.563461         0
2023-11-13  1.487287  0.279051         0
2023-11-14 -0.948288  0.198686         0
2023-11-15 -0.487149  0.299550         0
2023-11-16  1.048030  0.937323         0
>>> print(df.tail(5))
                norm   uniform  binomial
2026-07-29 -1.410419  0.094071         0
2026-07-30  1.301222  0.752692         1
2026-07-31  0.344197  0.177724         0
2026-08-01 -1.138749  0.094369         1
2026-08-02 -0.098699  0.030986         0

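P17 Print a DataFrame's basic information and summary statistics

# Reuse data and date_range from P15
df = pd.DataFrame(data=data, index=date_range)

# Take a look
print(df.info())
print(df.describe())

Output:

# df.info() shows basic information: the index, each column's non-null count and dtype, and the memory usage (it prints directly and returns None, hence the trailing None)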
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2023-11-07 to 2026-08-02
Freq: D
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   norm      1000 non-null   float64
 1   uniform   1000 non-null   float64
 2   binomial  1000 non-null   int64  
dtypes: float64(2), int64(1)
memory usage: 31.2 KB
None
# df.describe() shows, for each column, the count of non-null values, mean, standard deviation, minimum, quartiles, and maximum
>>> print(df.describe())
              norm      uniform     binomial
count  1000.000000  1000.000000  1000.000000
mean      0.042103     0.490103     0.209000
std       0.973974     0.293130     0.406798
min      -3.315361     0.000975     0.000000
25%      -0.641808     0.235426     0.000000
50%       0.019109     0.477758     0.000000
75%       0.775706     0.741953     0.000000
max       3.097657     0.999351     1.000000

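P18 Count how many times each value appears in a column

# Reuse data and date_range from P15
df = pd.DataFrame(data=data, index=date_range)

# Count the occurrences of 0 and of 1 in the binomial column
print(df["binomial"].value_counts())

Output: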

>>> print(df["binomial"].value_counts())
0    791
1    209
Name: binomial, dtype: int64
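When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. The sketch below uses a small hand-written Series `flips` (made-up data, so the result is deterministic) instead of the random column:

```python
import pandas as pd

# Hypothetical data: eight zeros and two ones
flips = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], name="binomial")

# Absolute frequency of each value
counts = flips.value_counts()

# Relative frequency: each count divided by the total
shares = flips.value_counts(normalize=True)

print(counts)
print(shares)
```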

P19 Save a DataFrame to a CSV file

# Reuse data and date_range from P15
df = pd.DataFrame(data=data, index=date_range)

# Write the first 100 rows
df.head(100).to_csv("/Users/emmett/Desktop/P19_output.csv")

Result:

(base) emmett@EmmettdeMacBook-Air ~ % head ~/Desktop/P19_output.csv         
,norm,uniform,binomial
2023-11-07,0.4023684187649195,0.7139247843762494,0
2023-11-08,1.1092517540800102,0.37033490685548265,0
2023-11-09,-0.7411318489406269,0.4927811317047385,0
2023-11-10,1.2995596866176715,0.30308160140089524,0
2023-11-11,-0.7388953663805584,0.6282983721098137,0
2023-11-12,-0.7584618817922961,0.5634610419207209,0
2023-11-13,1.487286995350167,0.2790511370963281,0
2023-11-14,-0.9482877456716767,0.1986855859133323,0
2023-11-15,-0.48714905003231695,0.2995498262349372,0

P20 Read a CSV file into a DataFrame

import pandas as pd

df = pd.read_csv("/Users/emmett/Desktop/P19_output.csv", index_col=0)

print(df.info())

Output:

>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2023-11-07 to 2024-02-14
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   norm      100 non-null    float64
 1   uniform   100 non-null    float64
 2   binomial  100 non-null    int64  
dtypes: float64(2), int64(1)
memory usage: 3.1+ KB
None
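Note that the info output above reports a plain `Index` rather than a `DatetimeIndex`: the dates were read back as strings. Passing `parse_dates=True` together with `index_col=0` asks `read_csv` to parse the index as dates. The sketch below reads from an in-memory string (`csv_text`, with shortened made-up values) instead of the Desktop file so that it runs anywhere:

```python
import io

import pandas as pd

# Stand-in for the CSV file written in P19 (values shortened)
csv_text = (
    ",norm,uniform,binomial\n"
    "2023-11-07,0.40,0.71,0\n"
    "2023-11-08,1.11,0.37,0\n"
)

# Without parse_dates the index holds strings
plain = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With parse_dates=True the index is parsed into datetimes
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(plain.index).__name__)
print(type(parsed.index).__name__)
```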

https://emmettpeng.github.io/2023/11/01/pandas-base/

Author

Emmett Peng

Posted on

November 1, 2023
