I need to fill an empty column in DataFrame2 with values from DataFrame1, in effect updating that column in DataFrame2.
Both DataFrames have two columns in common.
Is there a way to do this using Java, or is there a different approach?
Sample Input :
1) File1.csv
BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN ,404154,1000,Y
0681220958,BIN ,735332,1000,Y
5992410180,BIN ,454680,1000,Y
6995270884,SREBIN ,1000252750295575,1000,Y
Here BILL_ID is the system ID and BILL_NBR is the external ID.
2) File2.csv
TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC ," ",BIN ,404154
22365, XYZ ," ",BIN ,735332
45890, LKJ ," ",BIN ,454680
23456, MPK ," ",SREBIN ,1000252750295575
Sample Output
As shown below, the BILL_ID value should be populated in File2.csv:
01234, ABC ,501841898,BIN ,404154
22365, XYZ ,681220958,BIN ,735332
45890, LKJ ,5992410180,BIN ,454680
23456, MPK ,6995270884,SREBIN ,1000252750295575
I have created two DataFrames and loaded each file's data into one of them, but I am not sure how to proceed.
EDIT
Basically, I want clarity on the three steps below:
1) How do I get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv? For this step I have written: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");
2) How do I get the BILL_ID values from File1.csv based on the values retrieved in step 1?
3) How do I update the BILL_ID values accordingly in File2.csv?
I am new to Spark and would appreciate any pointers.
Solution
You need to join the two DataFrames on the columns they share, BILL_NBR and BILL_NBR_TYPE_CD.
Assumption: there is a one-to-one relationship between BILL_NBR and BILL_ID.
Assuming your DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you:
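// Needs org.apache.spark.sql.Dataset and org.apache.spark.sql.Row. Inner join on both common columns.
Dataset<Row> f1 = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> f2 = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
Dataset<Row> result = f2.join(f1, f2.col("BILL_NBR").equalTo(f1.col("BILL_NBR"))
    .and(f2.col("BILL_NBR_TYPE_CD").equalTo(f1.col("BILL_NBR_TYPE_CD"))))
    .select(f2.col("TXN_ID"), f2.col("TXN_TYPE"), f1.col("BILL_ID"), f2.col("BILL_NBR_TYPE_CD"), f2.col("BILL_NBR"));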
Note: I have not been able to run the code above. Please let me know if you hit any compile-time or runtime errors.
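Since the question also asks whether a different approach is possible, here is a minimal plain-Java sketch of the same lookup without Spark. It assumes both files fit in memory and have already been parsed into String arrays in the column order of their headers. The names BillIdFiller, key, indexFile1 and fill are hypothetical, and trimming the key values is my assumption, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of the join; class and method names are hypothetical. */
final class BillIdFiller {

    /** Builds the composite lookup key from the two common columns. */
    static String key(String typeCd, String billNbr) {
        // Trimming is an assumption, so that "BIN " and "BIN" compare equal.
        return typeCd.trim() + "|" + billNbr.trim();
    }

    /**
     * Indexes File1 rows (BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR, VERSION, PRIM_SW)
     * by (BILL_NBR_TYPE_CD, BILL_NBR).
     */
    static Map<String, String> indexFile1(List<String[]> file1Rows) {
        Map<String, String> billIdByKey = new HashMap<>();
        for (String[] row : file1Rows) {
            String k = key(row[1], row[2]);
            String billId = row[0].trim();
            String previous = billIdByKey.put(k, billId);
            // Enforce the one-to-one assumption: a key may not map to two BILL_IDs.
            if (previous != null && !previous.equals(billId)) {
                throw new IllegalStateException("Two BILL_IDs for key " + k);
            }
        }
        return billIdByKey;
    }

    /**
     * Returns copies of File2 rows (TXN_ID, TXN_TYPE, BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR)
     * with BILL_ID filled in where a match exists; unmatched rows are copied unchanged.
     */
    static List<String[]> fill(List<String[]> file1Rows, List<String[]> file2Rows) {
        Map<String, String> billIdByKey = indexFile1(file1Rows);
        List<String[]> result = new ArrayList<>();
        for (String[] row : file2Rows) {
            String[] copy = row.clone();
            String billId = billIdByKey.get(key(row[3], row[4]));
            if (billId != null) {
                copy[2] = billId;
            }
            result.add(copy);
        }
        return result;
    }
}
```

Unlike the inner join above, this sketch keeps File2 rows that have no match in File1 and leaves their BILL_ID blank. Values are handled as strings throughout, so leading zeros in BILL_ID are preserved.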