I download a XLS file from the web using selenium.
I tried many options I found in stack-overflow and other websites to read the XLS file :
import pandas as pd
df = pd.read_excel('test.xls') # Read XLS file
Expected "little-endian" marker, found b'\xff\xfe'
And
df = pd.ExcelFile('test.xls').parse('Sheet1') # Read XLSX file
Expected "little-endian" marker, found b'\xff\xfe'
And again
from xlrd import open_workbook
book = open_workbook('test.xls')
CompDocError: Expected "little-endian" marker, found b'\xff\xfe'
I have tried different encoding: utf-8, ANSII, utf_16_be, utf16
I have even tried to get the encoding of the file from notepad or other applications.
Type of file : Microsoft Excel 97-2003 Worksheet (.xls)
I can open the file with Excel without any issue.
What's frustrating is that if I open the file with excel and just press save I then can read the file with of the previous python command.
I would be really grateful if someone could provide me other ideas I could try. I need to open this file with a python script only.
Thanks,
Max
Solution(Somewhat messy but simple) that could potentially work for any type of Excel file :
Called VBA from python to Open and save the file in Excel. Excel "clean-up" the file and then Python is able to read it with any read Excel type function
Solution inspired by @Serge Ballesta and @John Y comments.
## Open a file in Excel and save it to correct the encoding error
import win32com.client
import pandas
downloadpath="c:\\firefox_downloads\\"
filename="myfile.xls"
xl=win32com.client.Dispatch("Excel.Application")
xl.Application.DisplayAlerts = False # disables Excel pop up message (for saving the file)
wb = xl.Workbooks.Open(Filename=downloadpath+filename)
wb.SaveAs(downloadpath+filename)
wb.Close
xl.Application.DisplayAlerts = True # enables Excel pop up message for saving the file
df = pandas.ExcelFile(downloadpath+filename).parse('Sheet1') # Read XLSX file
Thank you all!
解决方案
What does pd mean?? What
pandas is made for data science. In my opinion, you have to use openpyxl (read and write only xlsx) or xlwt/xlrd (read xls... and write only xls).
from xlrd import open_workbook
book = open_workbook()
sheet =....
It has several examples with this on Internet...