I'm trying to scrape data from a specific page on LinkedIn. My script can log in to LinkedIn, but I hit a snag when I try to access the page containing the data. When I call requests.get(data_url), I get the HTML for the loading screen that LinkedIn displays before it loads the actual page content. Is there a way to make requests wait for LinkedIn to display the site data before scraping the HTML? I essentially need to let the page fully render before I can 'get' the contents. My current script is below.
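import requests
from bs4 import BeautifulSoup

client = requests.Session()

HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/uas/login-submit'

# Fetch the homepage and pull the CSRF token out of the login form.
html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
csrf = soup.find(id="loginCsrfParam-login")['value']

login_information = {
    'session_key': 'EMAIL',
    'session_password': 'PASSWORD',
    'loginCsrfParam': csrf,
}

# Log in, then request the data page with the same session.
client.post(LOGIN_URL, data=login_information)
r = client.get(data_url)  # data_url: the page holding the data (not defined in this snippet)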
Solution
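If any part of the web page is rendered dynamically, for example with JavaScript, BeautifulSoup might not be able to work with it: requests returns only the HTML the server sends and does not run scripts, so the rendered content never reaches the parser.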
I use Selenium with PhantomJS. I load the page, wait for it to fully load, and then enter the login details. Selenium has a nice API that lets you programmatically check for specific HTML elements and wait for them to appear, which is very useful in cases like this.
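As a rough illustration of the "wait for an element to appear" idea, here is a minimal sketch using Selenium's explicit waits. It uses headless Chrome rather than PhantomJS, because Selenium 4 no longer includes a PhantomJS driver. The helper names fetch_rendered_html and make_headless_chrome, the wait_selector argument, and the 15-second default timeout are assumptions made up for this example, not anything from LinkedIn or from the script above.

```python
def make_headless_chrome():
    """Create a headless Chrome driver (assumes Chrome and Selenium 4 are installed)."""
    # Imported lazily so that defining these helpers does not itself require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url, wait_selector, timeout=15):
    """Load url and return the page HTML once wait_selector is present in the DOM.

    wait_selector is a hypothetical CSS selector for an element that only
    exists after the page's JavaScript has rendered the real content.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Poll the DOM until the element appears; until() raises
    # selenium.common.exceptions.TimeoutException if it never does.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
    )
    return driver.page_source
```

With a driver from make_headless_chrome(), you would log in through the same driver first, then call fetch_rendered_html(driver, data_url, '<selector for your data>') and hand the returned string to BeautifulSoup(html, 'html.parser') as before. Call driver.quit() when you are done so the browser process does not linger.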