分享
三行代码  ›  专栏  ›  技术社区  ›  Alice jinx

按组将行插入数据帧,并且条目来自另一个dataframe_复杂匹配项

  •  1
  • Alice jinx  · 技术社区  · 1 周前

    import pandas as pd
    
    df_recorded = pd.DataFrame({
        'id': ['2008 11', '2008 11', '2008 11', '2008 07', '2008 07', '2008 12', '2008 12', '2008 12'],
        'info': ['recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', ],
        'score': [98, 68, 79, 75, 66, 62, 60, 60],
        'date' : ['2010-12-10', '2010-10-01', '2010-09-12', '2010-12-10', '2010-11-01', '2010-12-07', '2010-11-10', '2010-09-12']
    })
    
    df_missed = pd.DataFrame({
        'id': ['2008 11', '2008 07', '2008 12'],
        'missed_score': [62, 72, 80],
        'missed_date': ['2010-08-01', '2010-10-20', '2010-07-23']
    })
    
    
            id      info  score        date
    0  2008 11  recorded     98  2010-12-10
    1  2008 11  recorded     68  2010-10-01
    2  2008 11  recorded     79  2010-09-12
    3  2008 07  recorded     75  2010-12-10
    4  2008 07  recorded     66  2010-11-01
    5  2008 12  recorded     62  2010-12-07
    6  2008 12  recorded     60  2010-11-10
    7  2008 12  recorded     60  2010-09-12
    
    df_missed
            id  missed_score missed_date
    0  2008 11            62  2010-08-01
    1  2008 07            72  2010-10-20
    2  2008 12            80  2010-07-23
    

    我想为“df_recorded”中的每个组在末尾添加一行,例如,在“info”列中添加相同的“id=2008 11”和一个名为“missed”的新条目,然后通过搜索df_missed table来添加分数和日期,因此结果应该如下所示:

    Target result:
             id      info  score        date
    0   2008 11  recorded     98  2010-12-10
    1   2008 11  recorded     68  2010-10-01
    2   2008 11  recorded     79  2010-09-12
    3   2008 11    missed     62  2010-08-01 # new record
    4   2008 07  recorded     75  2010-12-10
    5   2008 07  recorded     66  2010-11-01
    6   2008 07    missed     72  2010-10-20 # new record
    7   2008 12  recorded     62  2010-12-07
    8   2008 12  recorded     60  2010-11-10
    9   2008 12  recorded     60  2010-09-12
    10  2008 12    missed     80  2010-07-23 # new record
    

    我试图用循环来编写代码,但速度很慢,效率低下。所以,如果你有什么想法可以让它变得更好,请帮忙。非常感谢。

    2 回复  |  直到 1 周前
        1
  •  4
  •   Henry Yik    1 周前

    你可以简单地重命名缺失的列 concat :

    df_missed.columns = ["id", "score", "date"]
    
    df = pd.concat([df_recorded,df_missed], ignore_index=True, sort=False).sort_values("id", ascending=False)
    df.loc[df["info"].isnull(),"info"] = "missing"
    print (df)
    
             id      info  score        date
    5   2008 12  recorded     62  2010-12-07
    6   2008 12  recorded     60  2010-11-10
    7   2008 12  recorded     60  2010-09-12
    10  2008 12   missing     80  2010-07-23
    0   2008 11  recorded     98  2010-12-10
    1   2008 11  recorded     68  2010-10-01
    2   2008 11  recorded     79  2010-09-12
    8   2008 11   missing     62  2010-08-01
    3   2008 07  recorded     75  2010-12-10
    4   2008 07  recorded     66  2010-11-01
    9   2008 07   missing     72  2010-10-20
    
        2
  •  1
  •   NAGA RAJ S    1 周前

    import pandas as pd
    
    df_recorded = pd.DataFrame({
        'id': ['2008 11', '2008 11', '2008 11', '2008 07', '2008 07', '2008 12', '2008 12', '2008 12'],
        'info': ['recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', 'recorded', ],
        'score': [98, 68, 79, 75, 66, 62, 60, 60],
        'date' : ['2010-12-10', '2010-10-01', '2010-09-12', '2010-12-10', '2010-11-01', '2010-12-07', '2010-11-10', '2010-09-12']
    })
    
    df_missed = pd.DataFrame({
        'id': ['2008 11', '2008 07', '2008 12'],
        'missed_score': [62, 72, 80],
        'missed_date': ['2010-08-01', '2010-10-20', '2010-07-23']
    })
    df_missed.rename(columns={'missed_score':'score','missed_date':'date'},inplace=True)
    
    df_recorded=df_recorded.append(df_missed,ignore_index=True)
    df_recorded=df_recorded.fillna('missed')
    df_recorded.sort_values(by='id')
    

    输出

    id  info    score   date
    3   2008 07     recorded    75  2010-12-10
    4   2008 07     recorded    66  2010-11-01
    9   2008 07     missed  72  2010-10-20
    0   2008 11     recorded    98  2010-12-10
    1   2008 11     recorded    68  2010-10-01
    2   2008 11     recorded    79  2010-09-12
    8   2008 11     missed  62  2010-08-01
    5   2008 12     recorded    62  2010-12-07
    6   2008 12     recorded    60  2010-11-10
    7   2008 12     recorded    60  2010-09-12
    10  2008 12     missed  80  2010-07-23