Transform data from long format to wide format

在数据存储时,由于缺失等情况,为了节省存储空间,很多时候会采取long format,待到建模或分析等时候,wide format会使操作更便捷。
于是,将long format的数据转换成wide format,就成了很常见的一种数据转换的场景。
在pandas中运用unstack方法还是非常方便的,不过对于数据量很大的情况,pandas可能hold不住。
本文记录了用pandas和sqlalchemy进行format转换的方法。

1
2
3
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, MetaData, Table, select, and_, func, case

数据观察

对于每个用户,有数据的维度可能是不同的。如果维度较多,而缺失情况较多的情况下,有时会采用long format的形式存储以降低存储空间。
如下表所示,id列代表用户id,对于1号用户,prod有abcde五个维度,这五个维度都有数值num;对于2号用户,prod只有abcd四个维度,即维度e的值为空;同理,对于3号用户,维度b的值为空。

1
2
3
4
5
6
7
8
df_long = pd.DataFrame(
{
'id': [1]*5+[2]*4+[3]*4,
'prod': list('abcde')+list('abcd')+list('acde'),
'num': np.random.randint(low=0, high=100, size=13)
}
)
df_long[['id','prod','num']]

id prod num
0 1 a 81
1 1 b 39
2 1 c 61
3 1 d 90
4 1 e 35
5 2 a 83
6 2 b 26
7 2 c 76
8 2 d 56
9 3 a 8
10 3 c 51
11 3 d 65
12 3 e 21

pandas解决方案

在pandas中可以用groupby和agg方法进行聚合后,用unstack展开数据,一行代码即可搞定。

1
df_long.groupby(['id','prod']).agg({'num':sum}).unstack()

num
prod a b c d e
id
1 81.0 39.0 61.0 90.0 35.0
2 83.0 26.0 76.0 56.0 NaN
3 8.0 NaN 51.0 65.0 21.0

sql解决方案

但在实际应用中,很多情况下是因为数据量过大才采用了long format的存储方式,这时候pandas可能无法胜任该任务,需要在sql或hive中解决。

1
2
engine = create_engine('postgres://postgres:houzi0105@localhost:5432/bigdata')
engine.execute('CREATE SCHEMA IF NOT EXISTS test;')
<sqlalchemy.engine.result.ResultProxy at 0x7f32758dcac8>
1
df_long.to_sql('test', engine, schema='test', if_exists='replace', index=False)

原始数据

1
2
3
4
5
metadata = MetaData(engine, schema='test')
tbl_test = Table('test', metadata, autoload=True, autoload_with=engine)
sql1 = select([tbl_test])
pd.read_sql(sql1, engine)

id num prod
0 1 81 a
1 1 39 b
2 1 61 c
3 1 90 d
4 1 35 e
5 2 83 a
6 2 26 b
7 2 76 c
8 2 56 d
9 3 8 a
10 3 51 c
11 3 65 d
12 3 21 e

利用case when语法展开

这一步除了对case when语法的理解要求外,还需要掌握Python的list comprehensions和unpacking的技巧。

1
2
3
4
5
6
7
8
9
sql2 = select(
[
tbl_test.c.id,
*[
case([(tbl_test.c.prod == prod, tbl_test.c.num)]).label(prod)
for prod in pd.read_sql(select([tbl_test.c.prod]).distinct(), engine)['prod']
]
]
).alias('spread')

星号里面的list即为如下的case when语句:

1
2
3
4
5
for x in [
case([(tbl_test.c.prod == prod, tbl_test.c.num)]).label(prod)
for prod in pd.read_sql(select([tbl_test.c.prod]).distinct(), engine)['prod']
]:
print(x.compile(compile_kwargs={"literal_binds": True}))
CASE WHEN (test.test.prod = 'c') THEN test.test.num END
CASE WHEN (test.test.prod = 'b') THEN test.test.num END
CASE WHEN (test.test.prod = 'a') THEN test.test.num END
CASE WHEN (test.test.prod = 'e') THEN test.test.num END
CASE WHEN (test.test.prod = 'd') THEN test.test.num END

得到的结果如下。可以看到,数据行数没有发生变化,但是列的维度得到了扩展,列名与prod列中的维度名一致,对应的值在对应的位置,若对应位置无对应值则为空。

1
pd.read_sql(sql2, engine)

id c b a e d
0 1 NaN NaN 81.0 NaN NaN
1 1 NaN 39.0 NaN NaN NaN
2 1 61.0 NaN NaN NaN NaN
3 1 NaN NaN NaN NaN 90.0
4 1 NaN NaN NaN 35.0 NaN
5 2 NaN NaN 83.0 NaN NaN
6 2 NaN 26.0 NaN NaN NaN
7 2 76.0 NaN NaN NaN NaN
8 2 NaN NaN NaN NaN 56.0
9 3 NaN NaN 8.0 NaN NaN
10 3 51.0 NaN NaN NaN NaN
11 3 NaN NaN NaN NaN 65.0
12 3 NaN NaN NaN 21.0 NaN

对扩展数据进行聚合

按id分组,对每一个维度的数据取max,max函数会自动舍弃空值,保留原始数值。

1
2
3
4
5
6
sql3 = select(
[
sql2.c.id,
*[func.max(sql2.c[col]).label(col) for col in sql2.c.keys() if col != 'id']
]
).group_by(sql2.c.id)
1
pd.read_sql(sql3, engine)

id c b a e d
0 1 61 39.0 81 35.0 90
1 3 51 NaN 8 21.0 65
2 2 76 26.0 83 NaN 56

查看SQL语句

这样,只需简单两段SQL语句,就完成了数据从long format到wide format的转换。

1
print(sql2.compile(compile_kwargs={'literal_binds': True}))
SELECT test.test.id, CASE WHEN (test.test.prod = 'c') THEN test.test.num END AS c, CASE WHEN (test.test.prod = 'b') THEN test.test.num END AS b, CASE WHEN (test.test.prod = 'a') THEN test.test.num END AS a, CASE WHEN (test.test.prod = 'e') THEN test.test.num END AS e, CASE WHEN (test.test.prod = 'd') THEN test.test.num END AS d 
FROM test.test
1
print(sql3.compile(compile_kwargs={'literal_binds': True}))
SELECT spread.id, max(spread.c) AS c, max(spread.b) AS b, max(spread.a) AS a, max(spread.e) AS e, max(spread.d) AS d 
FROM (SELECT test.test.id AS id, CASE WHEN (test.test.prod = 'c') THEN test.test.num END AS c, CASE WHEN (test.test.prod = 'b') THEN test.test.num END AS b, CASE WHEN (test.test.prod = 'a') THEN test.test.num END AS a, CASE WHEN (test.test.prod = 'e') THEN test.test.num END AS e, CASE WHEN (test.test.prod = 'd') THEN test.test.num END AS d 
FROM test.test) AS spread GROUP BY spread.id

感谢师父的提示和帮助!

坚持原创技术分享,您的支持将鼓励我继续创作!