跳转至

pandas从入门到放弃

Create/Read/Write

Create

data = pd.DataFrame({"v1": [1,2,3], "v2": [4,5,6]})

Read

data = pd.read_csv("xxx.csv")

Write

data.to_csv("xxx.csv")

Index/Select/Assign

Index

  • iloc: 基于整数位置进行数据选择
  • loc: 基于标签(Label)索引进行数据选择

Select

top_oceania_wines = reviews.loc[
    (reviews.country.isin(['Australia', 'New Zealand'])) & 
    (reviews.country.isin(['Australia', 'New Zealand'])) & (reviews.points >= 95) (reviews.points >= 95) 
]

Assign

reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)

Summary/Map

Summary

reviews.points.mean()   # 平均值
reviews.points.median() # 中位数
reviews.taster_name.unique() # 去重后的数据 set<tasker_name>
reviews.taster_name.value_counts() # 去重数据 以及出现的次数 map<tasker_name, int>
df.loc['A'].idxmax() # A列最大值的索引idx  (lambda df : min(df['A'].idx) => idx)

Map

  • map:主要用于Series对象,可以方便地进行元素级别的转换,特别是基于映射关系(如字典)。
  • apply:更为灵活,可用于DataFrame和Series对象,既能进行元素级别的转换,也能对整行或整列进行操作。
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
df['A'] = df['A'].apply(lambda row: row['A'] + 5, axis=1)

需要注意的是,apply函数有个重要的参数 axis

  • axis=0:默认值,表示对每列(列)应用函数。
  • axis=1:表示对每行(行)应用函数。较为常用
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30],
    'C': [100, 200, 300]
})
col_sum = df.apply(sum, axis=0) 
print(col_sum)
# 输出:
# A      6
# B     60
# C    600
# dtype: int64

row_sum = df.apply(sum, axis=1)
print(row_sum)
# 输出:
# 0    111
# 1    222
# 2    333
# dtype: int64

Group/Sort

Group

reviews.groupby('taster').size() # group by taster count(1)
reviews.groupby(["variety"]).price.agg([min, max]) # group by variety min(price), max(price)

Sort

reviews.groupby('price').points.max().sort_index(ascending=True)
reviews.groupby(["variety"]).price.agg([min, max]).sort_values(by=["min", "max"], ascending=False)

DataType/Missing Value

DataType

reviews.points.dtype
reviews.points.astype('str')

MissingValue

reviews[pd.isnull(reviews.price)]
reviews.region_1.fillna("Unknown")