Pandas: Pandas入门笔记(2)————通过股票数据实践查看、筛选、设置索引、删除列、重命名等操作

Pandas入门笔记

从今天开始更新Pandas基础语法笔记,之前一直都是零零散散的用的,打算每天花一段时间系统记录一下pandas的基本语法。
主要参考了bilibili的up主@蚂蚁学python的视频Python Pandas 数据分析,编程练习100例W3Cschool的Pandas中文教程

这一部分是上一部分的延伸,是蚂蚁学python的视频的P21-30,使用的数据是一个股票数据,数据来源是:https://github.com/yatingliu2019/Python-Pandas-data-analysis/blob/main/%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90/00700.HK.csv

以下是数据的预览:

1
2
3
4
5
6
7
8
9
10
11
Date,Open,High,Low,Close,Volume
2021-09-30,456.0,464.6,453.8,461.4,17335451
2021-09-29,461.6,465.0,450.2,465.0,18250450
2021-09-28,467.0,476.2,464.6,469.8,20947276
2021-09-27,459.0,473.0,455.2,464.6,17966998
2021-09-24,461.4,473.4,456.2,460.2,16656914
2021-09-23,460.2,469.6,456.4,463.2,22210868
2021-09-21,446.0,453.8,443.2,450.0,16556875
2021-09-20,456.6,457.4,448.0,454.2,15513224
2021-09-17,445.8,467.6,445.2,461.8,23982628
2021-09-16,446.8,454.8,445.0,451.0,24519868

P21&22 加载csv表格到数据框,查看数据基本情况和统计

1
2
3
4
5
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')
print(df.info())
print(df.describe())

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4267 entries, 0 to 4266
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 4267 non-null object
1 Open 4267 non-null float64
2 High 4267 non-null float64
3 Low 4267 non-null float64
4 Close 4267 non-null float64
5 Volume 4267 non-null int64
dtypes: float64(4), int64(1), object(1)
memory usage: 200.1+ KB
None
# 说明数据已经被正确加载了
>>> print(df.describe())
Open High Low Close Volume
count 4267.000000 4267.000000 4267.000000 4267.000000 4.267000e+03
mean 211.010729 213.575773 208.063849 210.778935 1.198943e+07
std 170.116303 172.173404 167.716394 169.807145 1.413407e+07
min 3.375000 3.500000 3.375000 3.375000 4.523670e+05
25% 58.000000 59.200000 56.050000 57.500000 3.600698e+06
50% 170.900000 173.300000 168.000000 170.700000 7.647284e+06
75% 330.700000 333.100000 326.500000 330.000000 1.726551e+07
max 767.000000 775.500000 751.500000 766.500000 4.397750e+08
# 数据的基本统计情况

P23 将索引列变为普通索引列

1
2
3
4
5
6
7
8
9
# P20里面已经学到了,怎么把csv文件里的第一列作为索引列,使用index_col=0
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv', index_col=0)

# 现在要把Date这一列变为普通列
df.reset_index(inplace=True)
print(df)
print(df.columns)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
>>> print(df)
index Date Open High Low Close Volume
0 0 2021-09-30 456.000 464.600 453.800 461.400 17335451
1 1 2021-09-29 461.600 465.000 450.200 465.000 18250450
2 2 2021-09-28 467.000 476.200 464.600 469.800 20947276
3 3 2021-09-27 459.000 473.000 455.200 464.600 17966998
4 4 2021-09-24 461.400 473.400 456.200 460.200 16656914
... ... ... ... ... ... ... ...
4262 4262 2004-06-23 4.050 4.450 4.025 4.425 55016000
4263 4263 2004-06-21 4.125 4.125 3.950 4.000 22817000
4264 4264 2004-06-18 4.200 4.250 3.950 4.025 36598000
4265 4265 2004-06-17 4.150 4.375 4.125 4.225 83801500
4266 4266 2004-06-16 4.375 4.625 4.075 4.150 439775000

[4267 rows x 7 columns]
>>> print(df.columns)
Index(['index', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')

# 现在Date变成了普通列,索引有单独的index列了

P24 通过日期信息,添加年月列

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')
# 肯定是根据Date列的信息获得每一行对应的年份和月份

df["Date"] = pd.to_datetime(df["Date"])
# 把数据列通过to_datetime后就可以通过dt访问它的很多属性
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

print(df)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
>>> print(df)
Date Open High Low Close Volume Year Month
0 2021-09-30 456.000 464.600 453.800 461.400 17335451 2021 9
1 2021-09-29 461.600 465.000 450.200 465.000 18250450 2021 9
2 2021-09-28 467.000 476.200 464.600 469.800 20947276 2021 9
3 2021-09-27 459.000 473.000 455.200 464.600 17966998 2021 9
4 2021-09-24 461.400 473.400 456.200 460.200 16656914 2021 9
... ... ... ... ... ... ... ... ...
4262 2004-06-23 4.050 4.450 4.025 4.425 55016000 2004 6
4263 2004-06-21 4.125 4.125 3.950 4.000 22817000 2004 6
4264 2004-06-18 4.200 4.250 3.950 4.025 36598000 2004 6
4265 2004-06-17 4.150 4.375 4.125 4.225 83801500 2004 6
4266 2004-06-16 4.375 4.625 4.075 4.150 439775000 2004 6

[4267 rows x 8 columns]

P25 计算每一年的平均收盘价(特定行的某一列的均值)

1
2
3
4
5
6
7
8
9
10
# 需要计算每一年的平均收盘价,就是需要先筛选某一年的所有行,再计算这些行的Close那一列的均值
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')

df["Date"] = pd.to_datetime(df["Date"])
df["Year"] = df["Date"].dt.year

# 利用groupby函数对某一列的不同元素进行分组,也就是把Year那一列的归类
print(df.groupby("Year")["Close"].mean())

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
>>> print(df.groupby("Year")["Close"].mean())
Year
2004 4.338686
2005 6.568927
2006 15.865951
2007 37.882724
2008 54.818367
2009 96.369679
2010 157.299598
2011 189.737398
2012 228.987045
2013 337.136066
2014 271.291498
2015 144.824291
2016 176.562041
2017 291.066667
2018 372.678862
2019 346.225203
2020 479.141129
2021 586.649189
Name: Close, dtype: float64

#这样就输出了每年的收盘价(Close)的均值

P26 找出某列最小值的行

1
2
3
4
5
6
7
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')
print(df["Close"].min()) # Close这一列的最小值
print(df["Close"].argmin()) # Close最小值的那一行的索引
print(df.loc[df["Close"].argmin()]) # 根据索引locate那一行 loc不加[]返回series
print(df.loc[[df["Close"].argmin()]]) # loc加[]返回dataframe

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
>>> print(df["Close"].min()) # Close这一列的最小值
3.375
>>> print(df["Close"].argmin()) # Close最小值的那一行的索引
4240
>>> print(df.loc[df["Close"].argmin()]) # 根据索引locate那一行
Date 2004-07-26
Open 3.45
High 3.5
Low 3.375
Close 3.375
Volume 7439000
Name: 4240, dtype: object
>>> print(df.loc[[df["Close"].argmin()]]) # loc加[]返回dataframe
Date Open High Low Close Volume
4240 2004-07-26 3.45 3.5 3.375 3.375 7439000

问题是:如果最小值在多行出现了,argmin似乎只会返回其中的第一行,需要返回所有包含最小值的行怎么办?

1
2
# 利用条件进行筛选
print(df[df['Close'] == df['Close'].min()])

输出:

1
2
3
4
5
6
7
# 为了演示多行结果,我手动添加了Close列为3.375的两行,看看结果
>>> print(df[df['Close'] == df['Close'].min()])
Date Open High Low Close Volume
4240 2004-07-26 3.450 3.500 3.375 3.375 7439000
4267 2004-06-17 4.375 4.625 4.075 3.375 439775000
4268 2004-06-18 4.375 4.625 4.075 3.375 439775000
# 事实证明,直接条件筛选也可以

P27 筛选部分数据列

1
2
3
4
5
6
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')

# 假如只要某几列
print(df[["Date","Open","Close","Volume"]])

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
>>> print(df[["Date","Open","Close","Volume"]])
Date Open Close Volume
0 2021-09-30 456.000 461.400 17335451
1 2021-09-29 461.600 465.000 18250450
2 2021-09-28 467.000 469.800 20947276
3 2021-09-27 459.000 464.600 17966998
4 2021-09-24 461.400 460.200 16656914
... ... ... ... ...
4262 2004-06-23 4.050 4.425 55016000
4263 2004-06-21 4.125 4.000 22817000
4264 2004-06-18 4.200 4.025 36598000
4265 2004-06-17 4.150 4.225 83801500
4266 2004-06-16 4.375 4.150 439775000

[4267 rows x 4 columns]

P28 将Date设为索引列

1
2
3
4
5
6
7
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')

# 假如只要某几列
df.set_index("Date", inplace=True)
print(df)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
>>> print(df)
Open High Low Close Volume
Date
2021-09-30 456.000 464.600 453.800 461.400 17335451
2021-09-29 461.600 465.000 450.200 465.000 18250450
2021-09-28 467.000 476.200 464.600 469.800 20947276
2021-09-27 459.000 473.000 455.200 464.600 17966998
2021-09-24 461.400 473.400 456.200 460.200 16656914
... ... ... ... ... ...
2004-06-23 4.050 4.450 4.025 4.425 55016000
2004-06-21 4.125 4.125 3.950 4.000 22817000
2004-06-18 4.200 4.250 3.950 4.025 36598000
2004-06-17 4.150 4.375 4.125 4.225 83801500
2004-06-16 4.375 4.625 4.075 4.150 439775000

[4267 rows x 5 columns]

P29 删除不需要的数据列

1
2
3
4
5
6
7
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')

# 删除High和Low
df.drop(columns=["High", "Low"], inplace=True)
print(df)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
>>> print(df)
Date Open Close Volume
0 2021-09-30 456.000 461.400 17335451
1 2021-09-29 461.600 465.000 18250450
2 2021-09-28 467.000 469.800 20947276
3 2021-09-27 459.000 464.600 17966998
4 2021-09-24 461.400 460.200 16656914
... ... ... ... ...
4262 2004-06-23 4.050 4.425 55016000
4263 2004-06-21 4.125 4.000 22817000
4264 2004-06-18 4.200 4.025 36598000
4265 2004-06-17 4.150 4.225 83801500
4266 2004-06-16 4.375 4.150 439775000

[4267 rows x 4 columns]

P30 对列名进行重命名

1
2
3
4
5
6
7
8
9
10
11
import pandas as pd

df = pd.read_csv('/Users/emmett/Downloads/00700.HK.csv')

# 方法一:直接覆盖列名,缺点是全部重命名,不想重命名的只能维持不变
df.columns = ["D","O","H","L","C","V"]
print(df)

# 方法二:利用rename和字典传回修改规则,可以只修改部分,现在用这个办法把列名改回去,但是D和V不变
df.rename(columns={"O":"Open", "H":"High", "L":"Low", "C":"Close"}, inplace=True)
print(df)

输出:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
>>> # 方法一:直接覆盖列名,缺点是全部重命名,不想重命名的只能维持不变
>>> df.columns = ["D","O","H","L","C","V"]
>>> print(df)
D O H L C V
0 2021-09-30 456.000 464.600 453.800 461.400 17335451
1 2021-09-29 461.600 465.000 450.200 465.000 18250450
2 2021-09-28 467.000 476.200 464.600 469.800 20947276
3 2021-09-27 459.000 473.000 455.200 464.600 17966998
4 2021-09-24 461.400 473.400 456.200 460.200 16656914
... ... ... ... ... ... ...
4262 2004-06-23 4.050 4.450 4.025 4.425 55016000
4263 2004-06-21 4.125 4.125 3.950 4.000 22817000
4264 2004-06-18 4.200 4.250 3.950 4.025 36598000
4265 2004-06-17 4.150 4.375 4.125 4.225 83801500
4266 2004-06-16 4.375 4.625 4.075 4.150 439775000

[4267 rows x 6 columns]
>>> # 方法二:利用rename和字典传回修改规则,可以只修改部分,现在用这个办法把列名改回去,但是D和V不变
>>> df.rename(columns={"O":"Open", "H":"High", "L":"Low", "C":"Close"}, inplace=True)
>>> print(df)
D Open High Low Close V
0 2021-09-30 456.000 464.600 453.800 461.400 17335451
1 2021-09-29 461.600 465.000 450.200 465.000 18250450
2 2021-09-28 467.000 476.200 464.600 469.800 20947276
3 2021-09-27 459.000 473.000 455.200 464.600 17966998
4 2021-09-24 461.400 473.400 456.200 460.200 16656914
... ... ... ... ... ... ...
4262 2004-06-23 4.050 4.450 4.025 4.425 55016000
4263 2004-06-21 4.125 4.125 3.950 4.000 22817000
4264 2004-06-18 4.200 4.250 3.950 4.025 36598000
4265 2004-06-17 4.150 4.375 4.125 4.225 83801500
4266 2004-06-16 4.375 4.625 4.075 4.150 439775000

[4267 rows x 6 columns]

Pandas: Pandas入门笔记(2)————通过股票数据实践查看、筛选、设置索引、删除列、重命名等操作
https://emmettpeng.github.io/2023/11/15/pandas-base-2/
Author
Emmett Peng
Posted on
November 15, 2023
Licensed under