最节省时间的编程语言_我如何使用编程技巧来节省超过8个小时的写作时间-CSDN博客

最节省时间的编程语言

最近在Facebook上， David Smooke （ Hackernoon的首席执行官）发表了一篇文章，其中列出了2018年的Top Tech Stories 。他还提到，如果有人希望列出类似的JavaScript清单，他很乐意在Hackernoon的首页上使用它。

在不断争取更多人阅读我的作品的努力中，我不能错过这个机会，因此我立即开始计划如何制作这样的清单。

由于这一年快要结束了，我的时间有限，因此我决定不手工搜索帖子，而是使用网络抓取技巧。

我相信学习如何制作这样的刮板是一个有用的练习，并且可以作为一个有趣的案例研究。

如果您已阅读有关如何创建instagram机器人的文章，那么您就会知道，使用Node.js与网站进行交互的最佳方法是使用控制Chrome实例的puppeteer库。这样，我们可以做潜在用户可以在网站上做的所有事情。

这是存储库的链接。

创建刮板

让我们用这个简单的助手来抽象地创建一个伪造者的浏览器和页面：

const createBrowser = async () => {
 const browser = await puppeteer.launch({ headless: true })

 return async function getPage<T>(url: string, callback: (page: puppeteer.Page) => Promise<T>) {
   const page = await browser.newPage()

   try {
     await page.goto(url, { waitUntil: 'domcontentloaded' })

     page.on('console', (msg) => console.log(msg.text()))

     const result = await callback(page)

     await page.close()

     return result
   } catch (e) {
     await page.close()

     throw e
   }
 }
}

我们在回调中使用该页面，因此可以避免一遍又一遍地重复相同的代码。由于这个帮手，我们不需要去看场定的url，从里面听console.logs担心page.evaluate和关闭页万事俱备后。函数的结果将在promise中返回，因此我们可以稍后await它，而不必在回调中使用结果。

让我们谈谈数据

在一个网站上，我们可以找到所有由Hackernoon发布的带有JavaScript标签的文章。它们是按日期排序的，但是有时会突然出现像2016年这样的文章，所以我们必须提防这一点。

我们可以仅从此帖子预览中提取所有需要的信息，而无需在新选项卡中实际打开该帖子，这使我们的工作更加轻松。

在上面显示的框中，我们看到了所需的所有数据：

作者的姓名和个人简介网址
文章标题和网址
拍手数
阅读时间
日期

这是文章的界面：

interface Article {
 articleUrl: string
 date: string
 claps: number
 articleTitle: string
 authorName: string
 authorUrl: string
 minRead: string
}

在“ 中”上，存在无限滚动，这意味着当我们向下滚动时，将加载更多文章。如果我们要使用GET请求获取静态HTML并使用JSDOM之类的库进行解析，那么获取这些文章将是不可能的，因为我们不能对静态HTML使用滚动。因此，当与网站进行任何形式的交互时， p可以挽救生命。

要获取所有已加载的帖子，我们可以使用：

Array.from(document.querySelectorAll('.postArticle'))
         .slice(offset)
         .map((post) => {})

现在，我们可以将每个帖子用作选择器的上下文-代替编写document.querySelector我们现在要编写post.querySelector 。这样，我们可以将搜索仅限于给定的post元素。

另外，请注意.slice(offset)片段-由于我们向下滚动而不是打开新页面，因此已经解析的文章仍然在那里。当然，我们可以再次解析它们，但这并不会真正有效。偏移量从0开始，每次我们刮掉一些文章时，我们都会将集合的长度添加到偏移量中。

offset += scrapedArticles.length

收集帖子数据

抓取数据时最常见的错误是“无法读取null的'textContent'属性”。我们将创建一个简单的辅助函数，以防止我们试图获取不存在的元素的属性。

function safeGet<T extends Element, K>(
  element: T,
  callback: (element: T) => K,
  fallbackValue = null,
): K {
  if (!element) {
    return fallbackValue
  }

  return callback(element)
}

safeGet仅在element存在时才执行回调。现在，让我们使用它来访问包含我们感兴趣的数据的元素的属性。

发表文章的日期

const dateElement = post.querySelector('time')
const date = safeGet(
  dateElement,
  (el) => new Date(el.dateTime).toUTCString(),
  '',
)

如果dateElement发生了一些事情，但dateElement它，我们的safeGet将防止出现错误。 <time>元素具有一个名为dateTime的属性，该属性保存文章发布日期的字符串表示形式。

const authorDataElement = post.querySelector<HTMLLinkElement>(

'.postMetaInline-authorLockup a[data-action="show-user-card"]',
)

const { authorUrl, authorName } = safeGet(
  authorDataElement,
  (el) => {
    return {
      authorUrl: removeQueryFromURL(el.href),
      authorName: el.textContent,
    }
  },
  {},
)

在此<a>元素内，我们可以找到用户的个人资料URL和他/她的名字。

同样，在这里我们使用removeQueryFromURL因为在我们要删除的查询中，作者的个人资料URL和帖子的URL都具有此奇怪的源参数：

https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1？ 源=--------

？ URL中的字符表示查询参数的开始，因此让我们简单地删除其后的所有内容。

const removeQueryFromURL = (url: string) => url.split('?').shift()

我们将字符串拆分为？并只返回第一部分。

拍手

在上面的示例帖子中，我们看到“拍手”的数量为204，这是准确的。但是，一旦数字超过1000，它们就会显示为1K，2K，2.5K。如果我们需要拍手的确切数目，这可能是个问题。在我们的用例中，此舍入效果很好。

const clapsElement = post.querySelector('span > button')

const claps = safeGet(
  clapsElement,
  (el) => {
    const clapsString = el.textContent

    if (clapsString.endsWith('K')) {
      return Number(clapsString.slice(0, -1)) * 1000
    }

    return Number(clapsString)
  },
  0,
)

如果拍手的字符串表示形式以K结尾，我们只需删除K字母，然后将其乘以1000，就可以了。

文章的网址和标题

const articleTitleElement = post.querySelector('h3')
const articleTitle = safeGet(
  articleTitleElement,
  (el) => el.textContent
)

const articleUrlElement = post.querySelector<HTMLLinkElement>(
  '.postArticle-readMore a',
)

const articleUrl = safeGet(
  articleUrlElement,
  (el) => removeQueryFromURL(el.href)
)

同样，由于选择器是在post上下文中使用的，因此我们不需要对其结构过分具体。

“最小阅读量”

const minReadElement = post.querySelector<HTMLSpanElement>('span[title]')
const minRead = safeGet(minReadElement, (el) => el.title)

在这里，我们使用一个稍微不同的选择器：我们寻找一个包含data-title属性的<span> 。

注意：稍后我们将使用.title属性，因此区分它们很重要。

好的，我们现在已经抓取了当前显示在页面上的所有文章，但是如何滚动以加载更多文章？

滚动以加载更多文章

 // scroll to the bottom of the page 
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight)
})

 // wait to fetch the new articles 
await page.waitFor(7500)

我们将页面滚动到底部并等待7.5秒。这是一个“安全”的时间-文章可能会在2秒钟内加载，但我们宁愿确保所有帖子都已加载，而不要错过一些帖子。如果时间是一个重要因素，我们可能会在请求处设置一些拦截器，该拦截器将获取帖子并在完成后继续前进。

什么时候结束刮

如果按日期对帖子进行排序，我们可以在我们遇到2017年以来的文章时停止抓取。但是，由于在2018年以来的文章之间出现了一些古怪的旧文章，我们无法这样做。我们可以做的是为2018年或以后发布的文章筛选已删除的文章。如果结果数组是空的，我们可以有把握地认为有没有更多的文章中，我们感兴趣的是，在matchingArticles我们一直被张贴在2018或更高版本的文章和parsedArticles我们只有被张贴在2018年的文章。

const matchingArticles = scrapedArticles.filter((article) => {
  return article && new Date(article.date).getFullYear() >= 2018
})

if (!matchingArticles.length) {
  return articles
}

const parsedArticles = matchingArticles.filter((article) => {
  return new Date(article.date).getFullYear() === 2018
})

articles = [...articles, ...parsedArticles]

如果matchingArticles为空，我们将返回所有文章，从而结束抓取。

放在一起

这是获取文章所需的全部代码：

const scrapArticles = async () => {
 const createPage = await createBrowser()

 return createPage<Article[]>('https: //hackernoon.com/tagged/javascript', async (page) => { 
   let articles: Article[] = []
   let offset = 0

   while (true) {
     console.log({ offset })

     const scrapedArticles: Article[] = await page.evaluate((offset) => {
       function safeGet<T extends Element, K>(
         element: T,
         callback: (element: T) => K,
         fallbackValue = null,
       ): K {
         if (!element) {
           return fallbackValue
         }

         return callback(element)
       }

       const removeQueryFromURL = (url: string) => url.split('?').shift()

       return Array.from(document.querySelectorAll('.postArticle'))
         .slice(offset)
         .map((post) => {
           try {
             const dateElement = post.querySelector('time')
             const date = safeGet(dateElement, (el) => new Date(el.dateTime).toUTCString(), '')

             const authorDataElement = post.querySelector<HTMLLinkElement>(
               '.postMetaInline-authorLockup a[data-action="show-user-card"]',
             )

             const { authorUrl, authorName } = safeGet(
               authorDataElement,
               (el) => {
                 return {
                   authorUrl: removeQueryFromURL(el.href),
                   authorName: el.textContent,
                 }
               },
               {},
             )

             const clapsElement = post.querySelector('span > button')

             const claps = safeGet(
               clapsElement,
               (el) => {
                 const clapsString = el.textContent

                 if (clapsString.endsWith('K')) {
                   return Number(clapsString.slice(0, -1)) * 1000
                 }

                 return Number(clapsString)
               },
               0,
             )

             const articleTitleElement = post.querySelector('h3')
             const articleTitle = safeGet(articleTitleElement, (el) => el.textContent)

             const articleUrlElement = post.querySelector<HTMLLinkElement>(
               '.postArticle-readMore a',
             )
             const articleUrl = safeGet(articleUrlElement, (el) => removeQueryFromURL(el.href))

             const minReadElement = post.querySelector<HTMLSpanElement>('span[title]')
             const minRead = safeGet(minReadElement, (el) => el.title)

             return {
               claps,
               articleTitle,
               articleUrl,
               date,
               authorUrl,
               authorName,
               minRead,
             } as Article
           } catch (e) {
             console.log(e.message)
             return null
           }
         })
     }, offset)

     offset += scrapedArticles.length

     // scroll to the bottom of the page 
     await page.evaluate(() => {
       window.scrollTo(0, document.body.scrollHeight)
     })

     // wait to fetch the new articles 
     await page.waitFor(7500)

     const matchingArticles = scrapedArticles.filter((article) => {
       return article && new Date(article.date).getFullYear() >= 2018
     })

     if (!matchingArticles.length) {
       return articles
     }

     const parsedArticles = matchingArticles.filter((article) => {
       return new Date(article.date).getFullYear() === 2018
     })

     articles = [...articles, ...parsedArticles]

     console.log(articles[articles.length - 1])
   }
 })
}

在以适当的格式保存数据之前，让我们按拍子降序对文章进行排序：

const sortArticlesByClaps = (articles: Article[]) => {
  return articles.sort(
    (fArticle, sArticle) => sArticle.claps - fArticle.claps
  )
}

现在让我们将文章输出为可读格式，因为到目前为止它们仅存在于我们计算机的内存中。

输出格式

JSON格式

我们可以使用JSON格式将所有数据转储到单个文件中。以这种方式存储所有文章可能会在将来的某个时候派上用场。

转换为JSON格式归结为键入：

const jsonRepresentation = JSON.stringify(articles)

我们现在可以停止使用文章的JSON表示，然后将我们认为属于该文章的文章复制并粘贴到列表中。但是，您可以想象，这也可以自动化。

HTML

与从JSON格式手动复制所有内容相比， HTML格式肯定会使从列表中复制和粘贴项目更加容易。

大卫在他的文章中列出以下方式的文章：

我们希望清单采用这样的格式。我们可以再次使用puppeteer创建HTML元素并对其进行操作，但是，由于我们正在使用HTML ，因此我们可以将值嵌入字符串中-浏览器将始终解析它们。

const createHTMLRepresentation = async (articles: Article[]) => {
 const list = articles
   .map((article) => {
     return `
       <li>
         <a href="${article.articleUrl}">${article.articleTitle}</a> by
         <a href="${article.authorUrl}">${article.authorName}</a>
         [${article.minRead}] (${article.claps})
       </li>
     `
   })
   .join('')

 return `
   <!DOCTYPE html>
   <html lang="en">
     <head>
       <meta charset="UTF-8" />
       <meta name="viewport" content="width=device-width, initial-scale=1.0" />
       <meta http-equiv="X-UA-Compatible" content="ie=edge" />
       <title>Articles</title>
     </head>
     <body>
       <ol>
         ${list}
       </ol>
     </body>
   </html>
 `
}

如您所见，我们只是在文章上使用.map()并返回一个字符串，其中包含以我们喜欢的方式格式化的数据。现在，我们有了一个包含<li>元素的数组-每个元素代表一篇文章。现在我们只需要.join()来创建一个字符串并将其嵌入到简单的HTML5模板中。

保存文件

剩下要做的最后一件事是将表示形式保存在单独的文件中。

const scrapedArticles = await scrapArticles()
const articles = sortArticlesByClaps(scrapedArticles)

console.log(`Scrapped ${articles.length} articles.`)

const jsonRepresentation = JSON.stringify(articles)
const htmlRepresentation = createHTMLRepresentation(articles)

await Promise.all([
  fs.writeFileAsync(jsonFilepath, jsonRepresentation),
  fs.writeFileAsync(htmlFilepath, htmlRepresentation),
])

结果

根据抓取工具 ，今年在Hackernoon上发布了894篇带有JavaScript标签的文章，平均每天有2.45篇文章。

HTML文件如下所示：

<li> 
 <a href="https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5">I'm harvesting credit card numbers and passwords from your site. Here's how.</a> by 
 <a href="https://hackernoon.com/@david.gilbertson">David Gilbertson</a> 
 [10 min read] (222000) 
 </li> 
 
 <li> 
 <a href="https://hackernoon.com/part-2-how-to-stop-me-harvesting-credit-card-numbers-and-passwords-from-your-site-844f739659b9">Part 2: How to stop me harvesting credit card numbers and passwords from your site</a> by 
 <a href="https://hackernoon.com/@david.gilbertson">David Gilbertson</a> 
 [16 min read] (18300) 
 </li> 
 
 <li> 
 <a href="https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1">JAVASCRIPT 2018 — TOP 20 HACKERNOON ARTICLES OF THE YEAR</a> by 
 <a href="https://hackernoon.com/@maciejcieslar">Maciej Cieślar</a> 
 [2 min read] (332) 
 </li>

现在是JSON文件：

[ 
 { 
 "claps": 222000, 
 "articleTitle": "I'm harvesting credit card numbers and passwords from your site. Here's how.", 
 "articleUrl": "https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5", 
 "date": "Sat, 06 Jan 2018 08:48:50 GMT", 
 "authorUrl": "https://hackernoon.com/@david.gilbertson", 
 "authorName": "David Gilbertson", 
 "minRead": "10 min read" 
 }, 
 { 
 "claps": 18300, 
 "articleTitle": "Part 2: How to stop me harvesting credit card numbers and passwords from your site", 
 "articleUrl": "https://hackernoon.com/part-2-how-to-stop-me-harvesting-credit-card-numbers-and-passwords-from-your-site-844f739659b9", 
 "date": "Sat, 27 Jan 2018 08:38:33 GMT", 
 "authorUrl": "https://hackernoon.com/@david.gilbertson", 
 "authorName": "David Gilbertson", 
 "minRead": "16 min read" 
 }, 
 { 
 "claps": 218, 
 "articleTitle": "JAVASCRIPT 2018 -- TOP 20 HACKERNOON ARTICLES OF THE YEAR", 
 "articleUrl": "https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1", 
 "date": "Sat, 29 Dec 2018 16:26:36 GMT", 
 "authorUrl": "https://hackernoon.com/@maciejcieslar", 
 "authorName": "Maciej Cieślar", 
 "minRead": "2 min read" 
 } 
 ]