Question

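I have a dataframe that looks similar to the following, with x number of persons (more than 1,000), x number of transactions per person, and x number of variables (more than 1,000):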

Person_ID	transaction_ID	variable_1	variable_2	variable_3	variable_X
person1	transaction1	123	0	1	abc
person1	transaction2	456	1	0	def
person1	transaction3	123	0	1	abc
personx	transaction1	123	0	1	abc
personx	transaction2	456	0	1	def

I want to pad it with rows containing -10 at the beginning of every person id group so that the total number of rows per person id group is 6, like the following:

Person_ID	transaction_ID	variable_1	variable_2	variable_3	variable_X
person1	-10	-10	-10	-10	-10
person1	-10	-10	-10	-10	-10
person1	-10	-10	-10	-10	-10
person1	transaction1	123	0	1	abc
person1	transaction2	456	1	0	def
person1	transaction3	123	0	1	abc
personx	-10	-10	-10	-10	-10
personx	-10	-10	-10	-10	-10
personx	-10	-10	-10	-10	-10
personx	-10	-10	-10	-10	-10
personx	transaction1	123	0	1	abc
personx	transaction2	456	0	1	def

Here is the code I have tried (updated to use concat) and the error it produces.

df2 = pd.DataFrame([[''] * len(newdf.columns)], columns=newdf.columns)
df2

for row in newdf.groupby('person_id')['transaction_id']:
   x=newdf.groupby('person_id')['person_id'].nunique()
   if x.any() < 6:
       newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)

RuntimeWarning: '<' not supported between instances of 'int' and 'tuple', sort order is undefined for incomparable objects.
  newdf=pd.concat([newdf, df2*(6-x)], ignore_index=True)

It appended several NaN rows to the bottom of the dataframe, but not in between groups as needed. Thank you in advance, as I am a beginner.

Answer 1

Use groupby + apply:

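def func1(df):
    # Number of padding rows this group needs to reach 6 rows
    n = 6 - len(df)
    if n > 0:
        df1 = pd.DataFrame(df['Person_ID'].iloc[0], columns=['Person_ID'], index=range(0, n))
        # Fill every other column of the padding rows with -10 and put them before the group
        return pd.concat([df1.reindex(df.columns, axis=1, fill_value=-10), df])
    # Groups that already have 6 or more rows are returned unchanged
    return df
out = df.groupby('Person_ID', group_keys=False).apply(func1).reset_index(drop=True)

out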

Example code:

import pandas as pd
data1 = {'Person_ID': ['person1', 'person1', 'person1', 'personx', 'personx'], 
         'transaction_ID': ['transaction1', 'transaction2', 'transaction3', 'transaction1', 'transaction2'], 
         'variable_1': [123, 456, 123, 123, 456], 
         'variable_2': [0, 1, 0, 0, 0], 
         'variable_3': [1, 0, 1, 1, 1], 
         'variable_X': ['abc', 'def', 'abc', 'abc', 'def']}
df = pd.DataFrame(data1)
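
For reference, here is a minimal end-to-end sketch of the same padding idea applied to the example data. The helper name `pad_group` and its `target` and `fill` parameters are hypothetical and not part of the answer above. It also swaps `apply` for a plain loop over the groups, on the assumption that this keeps the `Person_ID` column available inside each group regardless of how a given pandas version treats grouping columns in `apply`.

```python
import pandas as pd

df = pd.DataFrame({
    "Person_ID": ["person1", "person1", "person1", "personx", "personx"],
    "transaction_ID": ["transaction1", "transaction2", "transaction3",
                       "transaction1", "transaction2"],
    "variable_1": [123, 456, 123, 123, 456],
    "variable_2": [0, 1, 0, 0, 0],
    "variable_3": [1, 0, 1, 1, 1],
    "variable_X": ["abc", "def", "abc", "abc", "def"],
})


def pad_group(group, target=6, fill=-10):
    """Prepend rows of `fill` so the group has at least `target` rows (hypothetical helper)."""
    n = target - len(group)
    if n <= 0:
        # Already long enough: nothing to pad
        return group
    # n rows where every column holds the fill value ...
    pad = pd.DataFrame(fill, index=range(n), columns=group.columns)
    # ... except Person_ID, which keeps the group's own id
    pad["Person_ID"] = group["Person_ID"].iloc[0]
    return pd.concat([pad, group])


# sort=False keeps the persons in their order of first appearance
out = pd.concat(
    [pad_group(g) for _, g in df.groupby("Person_ID", sort=False)],
    ignore_index=True,
)
print(out)
```

With more than 1,000 persons, the loop builds one small dataframe per person; that is usually acceptable, but an approach that builds all padding rows in one step may be faster.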

Answer 2

You can use pd.concat() instead of .append(), and you can use reindex() to repeat the rows.

Take this example:

import pandas as pd

data = [['Person1', 'transaction1', 803.5, 1],
        ['Person2', 'transaction2', 776.6, 2],
        ['Person3', 'transaction3', 3.9, 0],
        ['Person4', 'transaction1', 8.1, 7],
        ['Person5', 'transaction2', 1.7, 1],
        ['Person6', 'transaction3', 505.6, 2],
        ['Person7', 'transaction1', 1.5, 1]]

df = pd.DataFrame(data, columns=['Person_ID', 'transaction_ID', 'variable_1', 'variable_2'])

dfnew = df.copy() #create a real copy (plain assignment would only alias df)

new_column = df['Person_ID'] #you are going to use this column to insert its values

for column, values in df.items(): #fill every cell with -10 (iteritems() was removed in pandas 2.0)
  dfnew[column] = -10

dfnew.insert(0, 'New_Column_Person_ID', new_column) #insert values of the first column

unique_values=df.groupby('Person_ID')['Person_ID'].nunique()

index_unique_values = pd.DataFrame(unique_values.index)

z = pd.concat([dfnew, index_unique_values], ignore_index=True) #concat instead of append method

z = z.reindex(z.index.repeat(3)) #repeat rows (reindex returns a new dataframe, so assign it)
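
The code above repeats every row a fixed three times, which is not yet the layout asked for in the question. One way the concat + repeat idea could be carried through to that layout is sketched below. The names `target`, `missing`, `padding` and `result` are hypothetical, and the sketch assumes that ordering the persons by `Person_ID` in the result is acceptable and that the rows within each person are already in the desired order.

```python
import pandas as pd

df = pd.DataFrame({
    "Person_ID": ["person1", "person1", "person1", "personx", "personx"],
    "transaction_ID": ["transaction1", "transaction2", "transaction3",
                       "transaction1", "transaction2"],
    "variable_1": [123, 456, 123, 123, 456],
    "variable_2": [0, 1, 0, 0, 0],
})

target = 6  # desired number of rows per person

# How many padding rows each person still needs (never negative)
missing = (target - df.groupby("Person_ID", sort=False).size()).clip(lower=0)

# Repeat each Person_ID once per missing row
padding = pd.DataFrame({"Person_ID": missing.index.repeat(missing.to_numpy())})
# Add the remaining columns, filled with -10
padding = padding.reindex(columns=df.columns, fill_value=-10)

# Put the padding in front of the data; a stable sort on Person_ID then
# keeps each person's padding rows ahead of that person's real rows
result = (
    pd.concat([padding, df], ignore_index=True)
      .sort_values("Person_ID", kind="stable", ignore_index=True)
)
print(result)
```

All padding rows are built in a single step here, so no Python-level loop over persons is needed.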
