I'm trying to extract the fanfiction from an Archive of Our Own URL in order to use the NLTK library to do some linguistic analysis on it. However every attempt at scraping the HTML from the URL is returning everything BUT the fanfic (and the comments form, which I don't need).
First I tried with the built in urllib library (and BeautifulSoup):
import urllib
from bs4 import BeautifulSoup
html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html,"html.parser")
soup.prettify()
Then I found out about the Requests library, and how the User Agent could be part of the problem, so I tried this with the same results:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gec