Pandas: Pandas Beginner Notes (1): Pandas Data Structures and Basic Operations

Pandas Beginner Notes

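Starting today I am writing up my notes on basic Pandas syntax. Until now I have only used Pandas in bits and pieces, so I plan to spend some time each day recording the basics systematically.
The main references are the video series "Python Pandas 数据分析,编程练习100例" by the bilibili uploader @蚂蚁学python, and the W3Cschool Pandas tutorial (in Chinese).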

P1 From list to series

# From list to Series: build a pandas Series from a list
import pandas as pd

courses = ["Chinese", "Math", "English", "Chemistry"]
#courses_series = pd.Series(data=courses)
courses_series = pd.Series(courses)

print(courses_series)

Output:

>>> print(courses_series)
0      Chinese
1         Math
2      English
3    Chemistry
dtype: object

# Note: the index is the default integer index (0, 1, 2, ...).
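The default integer index can be replaced with custom labels when the Series is built. A minimal sketch; the labels `c1` to `c4` are made up for illustration and are not part of the original notes:

```python
import pandas as pd

courses = ["Chinese", "Math", "English", "Chemistry"]

# Hypothetical labels; any sequence with the same length as the data works.
labels = ["c1", "c2", "c3", "c4"]
courses_labeled = pd.Series(courses, index=labels)

print(courses_labeled)
print(courses_labeled["c2"])  # label-based lookup
```

The data and the labels must have the same length, otherwise pandas raises a `ValueError`.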

P2 From dictionary to series

# From dictionary to Series: build a pandas Series from a dictionary
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
#grades_series = pd.Series(data=grades)
grades_series = pd.Series(grades)

print(grades_series)

Output:

>>> print(grades_series)
Chinese       92
Math          95
English      100
Chemistry     98
dtype: int64

# Note: the dictionary keys become the index, and the corresponding values become the Series data.
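Because the keys become the index, values can be read back by label as well as by position. A small sketch of both lookups; the variable names `math_grade`, `first_grade` and `top_two` are only illustrative:

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

math_grade = grades_series["Math"]    # lookup by label
first_grade = grades_series.iloc[0]   # lookup by position
top_two = grades_series.nlargest(2)   # two highest grades, labels preserved

print(math_grade, first_grade)
print(top_two)
```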

P3 From series to list

# From Series to list: convert a Series back to a list
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

grades_original = grades_series.to_list()
print(grades_original)

Output:

>>> print(grades_original)
[92, 95, 100, 98]

# Only the values (the dictionary's values) are kept; the index labels are dropped.
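If the labels are needed as well, they can be pulled from the index, or the whole Series can be turned back into a dictionary. A short sketch using the same data:

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

subjects = grades_series.index.to_list()  # labels only
values = grades_series.to_list()          # values only
round_trip = grades_series.to_dict()      # labels and values together

print(subjects)
print(values)
print(round_trip)
```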

P4 From series to dataframe

# From Series to DataFrame: convert a Series to a DataFrame
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

grades_df = pd.DataFrame(grades_series, columns=['grade'])
print(grades_df)

Output:

           grade
Chinese       92
Math          95
English      100
Chemistry     98

# The column name was added manually via columns=['grade'].

Question: what if the index column should also get a name (for example 'subject')?

grades_df = grades_series.reset_index() # move the index into a regular column
grades_df.columns = ['subject', 'grade'] # rename the columns

print(grades_df)

Output:

     subject  grade
0    Chinese     92
1       Math     95
2    English    100
3  Chemistry     98
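Both conversions in P4 can also be written with `Series.to_frame` and with `rename_axis` followed by `reset_index(name=...)`. This is an alternative sketch, not the approach used in the notes above; `grades_flat` is an illustrative name:

```python
import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

# One column named "grade"; the subjects stay as row labels.
grades_df = grades_series.to_frame(name="grade")

# Name the index first, then move it into a regular column in one chain.
grades_flat = grades_series.rename_axis("subject").reset_index(name="grade")

print(grades_df)
print(grades_flat)
```

The chained form avoids assigning to `columns` afterwards, which makes it harder to mislabel a column by getting the order wrong.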

P5 Create series using numpy

# Create a pandas Series of length 9 with index 101-109, values 10-90, and dtype float

import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(10, 100, 10),  # data= omitted; arange(start, stop, step), stop is exclusive
    index=np.arange(101, 110),
    dtype='float'
)

print(s)

Output:

>>> print(s)
101    10.0
102    20.0
103    30.0
104    40.0
105    50.0
106    60.0
107    70.0
108    80.0
109    90.0
dtype: float64

P6 Change datatype in pandas series

import numpy as np
import pandas as pd

s1 = pd.Series(
    data=["123", "456", "789"],
    index=list('abc')
)

# Task: convert the string values to integers (int)

s2 = s1.astype(int)  # s2 = s1.map(int) also works

Output:

>>> s1
a    123
b    456
c    789
dtype: object
>>> s2
a    123
b    456
c    789
dtype: int64  # the dtype has changed from object to int64
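`astype(int)` fails as soon as one string is not a valid number. A hedged sketch of one way to handle that with `pd.to_numeric`; the value `"n/a"` is invented here purely to trigger the failure case:

```python
import pandas as pd

# "n/a" is a made-up bad value, added only to show the failure mode.
raw = pd.Series(["123", "456", "n/a"], index=list("abc"))

# raw.astype(int) would raise ValueError on "n/a";
# errors="coerce" turns unparseable entries into NaN instead.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
```

Because NaN is a float, the result is a float column rather than an integer one.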

P7 Append elements to series

import pandas as pd

grades = {"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98}
grades_series = pd.Series(grades)

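# Now add the grades for two more courses, Biology and Spanish.
# Note: Series.append was removed in pandas 2.0, so this call only runs on pandas 1.x.
new_grades_series = grades_series.append(
    pd.Series({"Biology": 99, "Spanish": 96})
)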
print(new_grades_series)

Output:

>>> new_grades_series = grades_series.append(
...     pd.Series({"Biology": 99, "Spanish": 96})
... )
<stdin>:1: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
>>> print(new_grades_series)
Chinese       92
Math          95
English      100
Chemistry     98
Biology       99
Spanish       96
dtype: int64

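# pandas warns that Series.append is deprecated. The same result comes from the top-level pandas.concat function (it is not a Series method), which takes a list of Series.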
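Following the warning's advice, the same append can be written with `pd.concat`, which works on both pandas 1.x and 2.x. A minimal sketch; `extra_grades` is an illustrative name:

```python
import pandas as pd

grades_series = pd.Series({"Chinese": 92, "Math": 95, "English": 100, "Chemistry": 98})
extra_grades = pd.Series({"Biology": 99, "Spanish": 96})

# pd.concat takes a list of objects and returns a new Series;
# neither input is modified.
new_grades_series = pd.concat([grades_series, extra_grades])

print(new_grades_series)
```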

P8 Reset_index

See: P4 From series to dataframe

P9 From dictionary to dataframe

import pandas as pd

df = pd.DataFrame(
    {
        "Abbr": ["A", "T", "C", "G"],
        # C is Cytosine and G is Guanine, so the names follow the Abbr order
        "Nucleotide": ["Adenine", "Thymine", "Cytosine", "Guanine"],
        "Chinese Name": ["腺嘌呤", "胸腺嘧啶", "胞嘧啶", "鸟嘌呤"],
        "Count": [43442, 52452, 21342, 23323]
    }
)
print(df)

Output:

>>> print(df)
  Abbr Nucleotide Chinese Name  Count
0    A    Adenine          腺嘌呤  43442
1    T    Thymine         胸腺嘧啶  52452
2    C   Cytosine          胞嘧啶  21342
3    G    Guanine          鸟嘌呤  23323

P10 Set index for dataframe

# The previous example used the default integer index; now use the Abbr column as the index
import pandas as pd

df = pd.DataFrame(
    {
        "Abbr": ["A", "T", "C", "G"],
        # C is Cytosine and G is Guanine, so the names follow the Abbr order
        "Nucleotide": ["Adenine", "Thymine", "Cytosine", "Guanine"],
        "Chinese Name": ["腺嘌呤", "胸腺嘧啶", "胞嘧啶", "鸟嘌呤"],
        "Count": [43442, 52452, 21342, 23323]
    }
)

df.set_index("Abbr", inplace=True)
print(df)

Output:

>>> print(df)
     Nucleotide Chinese Name  Count
Abbr
A       Adenine          腺嘌呤  43442
T       Thymine         胸腺嘧啶  52452
C      Cytosine          胞嘧啶  21342
G       Guanine          鸟嘌呤  23323

# The index is now the Abbr column.
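`set_index` can also be used without `inplace=True`, in which case it returns a new DataFrame and leaves the original alone. The sketch below uses a trimmed copy of the table (the Chinese-name column is omitted for brevity) and shows a label lookup plus the reverse operation; `indexed`, `g_count` and `restored` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Abbr": ["A", "T", "C", "G"],
        "Nucleotide": ["Adenine", "Thymine", "Cytosine", "Guanine"],
        "Count": [43442, 52452, 21342, 23323],
    }
)

# Without inplace=True, set_index returns a new DataFrame; df is untouched.
indexed = df.set_index("Abbr")

g_count = indexed.loc["G", "Count"]  # row label, then column label
restored = indexed.reset_index()     # move "Abbr" back into a regular column

print(indexed)
print(g_count)
```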

P11 Generate datetime index (A month)

import pandas as pd

date_range = pd.date_range(start='2023-11-01', end='2023-11-30')
# Tested: the day can be written as 01 or just 1; both are parsed the same way
print(date_range)

Output:

>>> print(date_range)
DatetimeIndex(['2023-11-01', '2023-11-02', '2023-11-03', '2023-11-04',
               '2023-11-05', '2023-11-06', '2023-11-07', '2023-11-08',
               '2023-11-09', '2023-11-10', '2023-11-11', '2023-11-12',
               '2023-11-13', '2023-11-14', '2023-11-15', '2023-11-16',
               '2023-11-17', '2023-11-18', '2023-11-19', '2023-11-20',
               '2023-11-21', '2023-11-22', '2023-11-23', '2023-11-24',
               '2023-11-25', '2023-11-26', '2023-11-27', '2023-11-28',
               '2023-11-29', '2023-11-30'],
              dtype='datetime64[ns]', freq='D')

# The dtype is datetime64[ns]
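Hard-coding the last day of the month is easy to get wrong for other months. One possible way around it, sketched here as an alternative to the `end=` form above, is to ask the start timestamp how many days its month has; `month_days` is an illustrative name:

```python
import pandas as pd

start = pd.Timestamp("2023-11-01")

# days_in_month is 30 for November 2023, so the range ends on 2023-11-30.
month_days = pd.date_range(start=start, periods=start.days_in_month)

print(len(month_days))
print(month_days[0], month_days[-1])
```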

P12 Generate datetime index (Every Monday in a year)

import pandas as pd

date_range = pd.date_range(start='2023-1-1', end='2023-12-31', freq='W-MON')
# W-MON: W means weekly frequency; MON anchors each week on Monday

print(date_range)

Output:

>>> print(date_range)
DatetimeIndex(['2023-01-02', '2023-01-09', '2023-01-16', '2023-01-23',
               '2023-01-30', '2023-02-06', '2023-02-13', '2023-02-20',
               '2023-02-27', '2023-03-06', '2023-03-13', '2023-03-20',
               '2023-03-27', '2023-04-03', '2023-04-10', '2023-04-17',
               '2023-04-24', '2023-05-01', '2023-05-08', '2023-05-15',
               '2023-05-22', '2023-05-29', '2023-06-05', '2023-06-12',
               '2023-06-19', '2023-06-26', '2023-07-03', '2023-07-10',
               '2023-07-17', '2023-07-24', '2023-07-31', '2023-08-07',
               '2023-08-14', '2023-08-21', '2023-08-28', '2023-09-04',
               '2023-09-11', '2023-09-18', '2023-09-25', '2023-10-02',
               '2023-10-09', '2023-10-16', '2023-10-23', '2023-10-30',
               '2023-11-06', '2023-11-13', '2023-11-20', '2023-11-27',
               '2023-12-04', '2023-12-11', '2023-12-18', '2023-12-25'],
              dtype='datetime64[ns]', freq='W-MON')
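A quick sanity check that every generated date really is a Monday, using the index's `dayofweek` attribute (Monday is 0). This is only a verification sketch; `mondays` is an illustrative name:

```python
import pandas as pd

mondays = pd.date_range(start="2023-01-01", end="2023-12-31", freq="W-MON")

# dayofweek counts from Monday=0 to Sunday=6.
all_mondays = bool((mondays.dayofweek == 0).all())

print(len(mondays), all_mondays)
```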

P13 Generate datetime index (All hours in a day)

import pandas as pd

date_range = pd.date_range(start='2023-11-7', periods = 24, freq='H')
# H: H means hourly

# Alternative:
# date_range = pd.date_range(start='2023-11-7', end='2023-11-8', freq='H', inclusive='left')
# inclusive='left': closed on the left, open on the right (the end point is excluded)

print(date_range)

Output:

>>> print(date_range)
DatetimeIndex(['2023-11-07 00:00:00', '2023-11-07 01:00:00',
               '2023-11-07 02:00:00', '2023-11-07 03:00:00',
               '2023-11-07 04:00:00', '2023-11-07 05:00:00',
               '2023-11-07 06:00:00', '2023-11-07 07:00:00',
               '2023-11-07 08:00:00', '2023-11-07 09:00:00',
               '2023-11-07 10:00:00', '2023-11-07 11:00:00',
               '2023-11-07 12:00:00', '2023-11-07 13:00:00',
               '2023-11-07 14:00:00', '2023-11-07 15:00:00',
               '2023-11-07 16:00:00', '2023-11-07 17:00:00',
               '2023-11-07 18:00:00', '2023-11-07 19:00:00',
               '2023-11-07 20:00:00', '2023-11-07 21:00:00',
               '2023-11-07 22:00:00', '2023-11-07 23:00:00'],
              dtype='datetime64[ns]', freq='H')
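As far as I know, newer pandas releases (2.2 and later) warn that the uppercase `'H'` alias is deprecated in favor of lowercase `'h'`. A version-independent sketch is to pass a `Timedelta` as the frequency instead of a string alias; `hours` is an illustrative name:

```python
import pandas as pd

# A Timedelta is accepted as freq, so the result does not depend on
# how the hourly alias is spelled in the installed pandas version.
hours = pd.date_range(start="2023-11-07", periods=24, freq=pd.Timedelta(hours=1))

print(hours[0], hours[-1], len(hours))
```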

P14 Generate Datetime in Dataframe

import pandas as pd

date_range = pd.date_range(start='2023-11-1', periods = 31)

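# Build a DataFrame whose 'day' column holds the values of this DatetimeIndex

df = pd.DataFrame(data=date_range, columns=['day'])

df['day of the year'] = df['day'].dt.dayofyear
# df['day'] has a datetime dtype, so its date parts are available through .dt; dayofyear is a built-in attribute giving the day's position within the year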

print(df)

Output:

>>> print(df)
          day  day of the year
0  2023-11-01              305
1  2023-11-02              306
2  2023-11-03              307
3  2023-11-04              308
4  2023-11-05              309
5  2023-11-06              310
6  2023-11-07              311
7  2023-11-08              312
8  2023-11-09              313
9  2023-11-10              314
10 2023-11-11              315
11 2023-11-12              316
12 2023-11-13              317
13 2023-11-14              318
14 2023-11-15              319
15 2023-11-16              320
16 2023-11-17              321
17 2023-11-18              322
18 2023-11-19              323
19 2023-11-20              324
20 2023-11-21              325
21 2023-11-22              326
22 2023-11-23              327
23 2023-11-24              328
24 2023-11-25              329
25 2023-11-26              330
26 2023-11-27              331
27 2023-11-28              332
28 2023-11-29              333
29 2023-11-30              334
30 2023-12-01              335

# Take a look at what the dayofyear attribute contains:

>>> df['day'].dt.dayofyear
0     305
1     306
2     307
3     308
4     309
5     310
6     311
7     312
8     313
9     314
10    315
11    316
12    317
13    318
14    319
15    320
16    321
17    322
18    323
19    324
20    325
21    326
22    327
23    328
24    329
25    330
26    331
27    332
28    333
29    334
30    335
Name: day, dtype: int64
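`dayofyear` is one of several date parts exposed through `.dt`. A short sketch with a few others, on a five-day slice to keep the output small; the new column names are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"day": pd.date_range(start="2023-11-01", periods=5)})

df["weekday"] = df["day"].dt.day_name()          # weekday name as text
df["month"] = df["day"].dt.month                 # month number, 1-12
df["is_weekend"] = df["day"].dt.dayofweek >= 5   # Monday=0 ... Sunday=6

print(df)
```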

P15 Generate a dataset of random numbers with date as index

import pandas as pd
import numpy as np

date_range = pd.date_range(start='2023-11-7', periods=1000)

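data = {
    'norm': np.random.normal(loc=0, scale=1, size=1000),
    # np.random.normal: draws from a normal distribution; loc is the mean, scale the standard deviation, size the output shape
    'uniform': np.random.uniform(low=0, high=1, size=1000),
    # np.random.uniform: draws from a uniform distribution; low is the lower bound (inclusive), high the upper bound (exclusive), size the output shape
    'binomial': np.random.binomial(n=1, p=0.2, size=1000)
    # np.random.binomial: draws from a binomial distribution; n is the number of trials, p the success probability of each trial, size the output shape
}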

df = pd.DataFrame(data=data, index=date_range)

print(df)

Output:

>>> print(df)
                norm   uniform  binomial
2023-11-07 -0.673309  0.534941         0
2023-11-08 -0.641853  0.586685         0
2023-11-09  0.400873  0.312451         0
2023-11-10  2.219158  0.048007         0
2023-11-11 -1.581015  0.952856         0
...              ...       ...       ...
2026-07-29  0.697699  0.460210         0
2026-07-30  0.655960  0.089912         0
2026-07-31 -0.647422  0.630736         0
2026-08-01  0.925788  0.921439         0
2026-08-02  0.100924  0.508387         0

[1000 rows x 3 columns]
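Since no seed is set, every run produces different numbers. If reproducible values are wanted, one option is NumPy's `default_rng` with a fixed seed. This is a sketch of that alternative, not what the notes ran; the helper `make_random_frame` and the seed value 42 are assumptions made for the example:

```python
import numpy as np
import pandas as pd


def make_random_frame(seed, n=1000):
    """Hypothetical helper: same three columns as above, seeded for repeatability."""
    rng = np.random.default_rng(seed)
    date_range = pd.date_range(start="2023-11-07", periods=n)
    return pd.DataFrame(
        {
            "norm": rng.normal(loc=0, scale=1, size=n),
            "uniform": rng.uniform(low=0, high=1, size=n),
            "binomial": rng.binomial(n=1, p=0.2, size=n),
        },
        index=date_range,
    )


df = make_random_frame(seed=42)  # the seed value is arbitrary
print(df.shape)
```

Calling the helper again with the same seed returns an identical DataFrame, which makes later outputs comparable between runs.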

P16 Print the first or last rows of a DataFrame

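# Result from P15 (data and date_range are defined there)
df = pd.DataFrame(data=data, index=date_range)

# head and tail work like their shell namesakes: they return the first and last rows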
print(df.head(10))
print(df.tail(5))

Output:

>>> print(df.head(10))
                norm   uniform  binomial
2023-11-07  0.402368  0.713925         0
2023-11-08  1.109252  0.370335         0
2023-11-09 -0.741132  0.492781         0
2023-11-10  1.299560  0.303082         0
2023-11-11 -0.738895  0.628298         0
2023-11-12 -0.758462  0.563461         0
2023-11-13  1.487287  0.279051         0
2023-11-14 -0.948288  0.198686         0
2023-11-15 -0.487149  0.299550         0
2023-11-16  1.048030  0.937323         0
>>> print(df.tail(5))
                norm   uniform  binomial
2026-07-29 -1.410419  0.094071         0
2026-07-30  1.301222  0.752692         1
2026-07-31  0.344197  0.177724         0
2026-08-01 -1.138749  0.094369         1
2026-08-02 -0.098699  0.030986         0

P17 Print basic information and summary statistics of a DataFrame

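# Result from P15 (data and date_range are defined there)
df = pd.DataFrame(data=data, index=date_range)

# Take a look (df.info() prints its report itself and returns None, so print() adds a trailing "None")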
print(df.info())
print(df.describe())

Output:

# df.info() shows basic information: the index, each column's name, non-null count and dtype, plus memory usage
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2023-11-07 to 2026-08-02
Freq: D
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   norm      1000 non-null   float64
 1   uniform   1000 non-null   float64
 2   binomial  1000 non-null   int64
dtypes: float64(2), int64(1)
memory usage: 31.2 KB
None
# df.describe() reports, for each column, the count of non-null values, mean, standard deviation, minimum, quartiles, and maximum
>>> print(df.describe())
              norm      uniform     binomial
count  1000.000000  1000.000000  1000.000000
mean      0.042103     0.490103     0.209000
std       0.973974     0.293130     0.406798
min      -3.315361     0.000975     0.000000
25%      -0.641808     0.235426     0.000000
50%       0.019109     0.477758     0.000000
75%       0.775706     0.741953     0.000000
max       3.097657     0.999351     1.000000

P18 Count how many times each value occurs in a column

# Result from P15 (data and date_range are defined there)
df = pd.DataFrame(data=data, index=date_range)

# Count how often 0 and 1 occur in the binomial column
print(df["binomial"].value_counts())

Output:

>>> print(df["binomial"].value_counts())
0    791
1    209
Name: binomial, dtype: int64
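`value_counts` can also report shares instead of raw counts via `normalize=True`. A small sketch on a hand-made ten-element column that stands in for `df["binomial"]`; the data here is invented for the example:

```python
import pandas as pd

# A tiny hand-made stand-in for df["binomial"] from P15.
binomial = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0], name="binomial")

counts = binomial.value_counts()                # absolute frequencies
shares = binomial.value_counts(normalize=True)  # relative frequencies, summing to 1

print(counts)
print(shares)
```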

P19 Save a DataFrame to a CSV file

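# Result from P15 (data and date_range are defined there)
df = pd.DataFrame(data=data, index=date_range)

# Write the first 100 rows (head(100) selects rows, not columns)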
df.head(100).to_csv("/Users/emmett/Desktop/P19_output.csv")

Result:

(base) emmett@EmmettdeMacBook-Air ~ % head ~/Desktop/P19_output.csv
,norm,uniform,binomial
2023-11-07,0.4023684187649195,0.7139247843762494,0
2023-11-08,1.1092517540800102,0.37033490685548265,0
2023-11-09,-0.7411318489406269,0.4927811317047385,0
2023-11-10,1.2995596866176715,0.30308160140089524,0
2023-11-11,-0.7388953663805584,0.6282983721098137,0
2023-11-12,-0.7584618817922961,0.5634610419207209,0
2023-11-13,1.487286995350167,0.2790511370963281,0
2023-11-14,-0.9482877456716767,0.1986855859133323,0
2023-11-15,-0.48714905003231695,0.2995498262349372,0
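The header line of the file starts with an empty cell because the index has no name. One way to label it is the `index_label` argument of `to_csv`. The sketch below writes to a temporary directory rather than the Desktop path used above, and uses a small made-up table with a single `value` column:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": np.arange(5)},
    index=pd.date_range(start="2023-11-07", periods=5),
)

# A temporary directory stands in for the Desktop path from the notes.
out_path = os.path.join(tempfile.mkdtemp(), "P19_output.csv")

# index_label fills in the otherwise empty first header cell.
df.head(3).to_csv(out_path, index_label="date")

with open(out_path) as f:
    lines = f.read().splitlines()

print(lines[0])
```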

P20 Read a CSV file into a DataFrame

import pandas as pd

df = pd.read_csv("/Users/emmett/Desktop/P19_output.csv", index_col=0)

print(df.info())

Output:

>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 2023-11-07 to 2024-02-14
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   norm      100 non-null    float64
 1   uniform   100 non-null    float64
 2   binomial  100 non-null    int64
dtypes: float64(2), int64(1)
memory usage: 3.1+ KB
None
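The report above shows a plain `Index` rather than a `DatetimeIndex`: after the round trip through CSV the dates come back as strings. Passing `parse_dates=True` together with `index_col=0` asks `read_csv` to parse the index as dates. A sketch using an in-memory CSV with two made-up rows in the same layout, so it does not depend on the file from P19:

```python
import io

import pandas as pd

# Two rows in the same layout as P19's file; the values are shortened for the example.
csv_text = (
    ",norm,uniform,binomial\n"
    "2023-11-07,0.40,0.71,0\n"
    "2023-11-08,1.11,0.37,0\n"
)

df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(df.index).__name__)
print(df.index[0])
```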

https://emmettpeng.github.io/2023/11/01/pandas-base/
Author: Emmett Peng
Posted on: November 1, 2023