So I have a scraper that fetches articles. However, it doesn't always work properly, and I'd like a better way to detect when it fails. For example, here is the kind of text I want it to scrape: Hello. This is a sequence of sentences that are put together. They don't have to follow this exact format, but something very close to this would be nice! Just basically stuff like this put together with the occasional weird formatting, which depends on what is scraped.
But obviously, what I get back may instead be something like: REGISTER | LOGIN | LOGOUT | Sign in to your account Forgot your password? {* #signInForm *}....
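Is there a Python library that can check the general format of a string? Basically, I'm scraping articles and want to check whether the scraped text is article-y. If there is no such library, is some kind of regex matching the best approach? Could that be made to work reasonably well?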
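For concreteness, a minimal sketch of what such a regex-based check could look like, assuming the articles are English prose. The function name `looks_like_article`, the two patterns, and all three thresholds are hypothetical choices made for illustration, not an established method, and they would need tuning against real scraped data.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
MIN_SENTENCES = 3        # prose usually has several sentences
MIN_AVG_WORDS = 5.0      # menu items and buttons are very short
MAX_SYMBOL_RATIO = 0.02  # share of characters that look like markup residue

# Rough "sentence": starts with a capital letter, ends with ., ! or ?
SENTENCE_RE = re.compile(r"[A-Z][^.!?]*[.!?]")
# Characters common in nav bars and template debris but rare in prose.
SYMBOL_RE = re.compile(r"[|{}*#<>\[\]]")


def looks_like_article(text: str) -> bool:
    """Return True if text passes some crude 'article-y' heuristics."""
    text = text.strip()
    if not text:
        return False

    sentences = SENTENCE_RE.findall(text)
    if len(sentences) < MIN_SENTENCES:
        return False

    # Average number of whitespace-separated words per matched sentence.
    avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
    # Fraction of all characters that are suspicious symbols.
    symbol_ratio = len(SYMBOL_RE.findall(text)) / len(text)

    return avg_words >= MIN_AVG_WORDS and symbol_ratio <= MAX_SYMBOL_RATIO


if __name__ == "__main__":
    good = ("Hello. This is a sequence of sentences that are put together. "
            "They don't have to follow this exact format, but something very "
            "close to this would be nice!")
    bad = ("REGISTER | LOGIN | LOGOUT | Sign in to your account "
           "Forgot your password? {* #signInForm *}....")
    print(looks_like_article(good))  # True
    print(looks_like_article(bad))   # False
```

A check like this only catches obvious junk such as login forms and navigation bars. It will misjudge very short articles, text with many abbreviations, and non-English content.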
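Any help would be greatly appreciated, thanks!!
[edit] If you vote to close, would you mind leaving a comment explaining why? My reasoning: there is no Stack Exchange site for NLP, so where else could I ask this question? Thanks.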