Collecting and lightly cleaning web page data

1. Initial crawling of the data
2. Preparation
Python generally uses the UTF-8 encoding.

· Using ASCII
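The original screenshot for this step is missing, so here is a minimal sketch of what ASCII encoding looks like in Python. The sample string "Python" is an assumption chosen for illustration; it is not from the original post.

```python
# Sample text is hypothetical; any pure-ASCII string behaves the same way.
text = "Python"

# encode() turns a str into bytes using the named codec.
data = text.encode("ascii")
print(data)        # b'Python'
print(list(data))  # [80, 121, 116, 104, 111, 110]

# ASCII only covers code points 0-127, so Chinese text cannot be encoded with it.
try:
    "汉字".encode("ascii")
    ascii_can_encode_chinese = True
except UnicodeEncodeError:
    ascii_can_encode_chinese = False

print(ascii_can_encode_chinese)  # False
```

Each ASCII character maps to exactly one byte, which is why the bytes literal looks identical to the text.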
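· GBK (the text is converted to bytes so that it can be stored in a form the computer understands)
b'...': a bytes literal
\x: the two characters that follow are a hexadecimal value
b'\xba\xba\xd7\xd6' (every two hexadecimal digits represent one byte; in GBK each Chinese character takes two bytes)
Hexadecimal digit values: b = 11, d = 13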
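The byte values above can be reproduced with a short sketch. It assumes the sample string was 汉字, whose GBK bytes match the fragment shown; the original screenshot is not available to confirm this.

```python
# Assumed sample string; its GBK encoding matches the bytes quoted above.
text = "汉字"

gbk_bytes = text.encode("gbk")
print(gbk_bytes)       # b'\xba\xba\xd7\xd6'
print(len(gbk_bytes))  # 4 bytes: two per Chinese character

# Each \xNN escape is one byte written as two hex digits.
# b = 11 and a = 10, so 0xba = 11 * 16 + 10.
first_byte = gbk_bytes[0]
print(first_byte)      # 186
print(int("ba", 16))   # 186

# d = 13, so 0xd7 = 13 * 16 + 7.
print(gbk_bytes[2])    # 215
```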

· Using UTF-8
In UTF-8, every three bytes represent one Chinese character (for common characters).
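For comparison with GBK, a minimal sketch of the same idea in UTF-8. The sample string 汉字 is again an assumption carried over for illustration.

```python
# Assumed sample string, the same one used for the GBK comparison.
text = "汉字"

utf8_bytes = text.encode("utf-8")
print(utf8_bytes)       # b'\xe6\xb1\x89\xe5\xad\x97'
print(len(utf8_bytes))  # 6 bytes: three per Chinese character

# Slicing the bytes in groups of three recovers one character per group.
chars = [utf8_bytes[i:i + 3].decode("utf-8") for i in range(0, len(utf8_bytes), 3)]
print(chars)            # ['汉', '字']
```

The same two characters take 4 bytes in GBK and 6 bytes in UTF-8, so the byte length of a string depends on the codec.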
Converting the stored bytes back to text (decode turns the bytes back into Chinese characters)
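A minimal sketch of decoding, assuming the stored bytes are the GBK encoding of 汉字. It also shows why the codec used for decoding has to match the one used for encoding.

```python
# Bytes assumed to have been produced by "汉字".encode("gbk").
gbk_bytes = b"\xba\xba\xd7\xd6"

# decode() is the inverse of encode(): bytes -> str.
text = gbk_bytes.decode("gbk")
print(text)  # 汉字

# Decoding with the wrong codec fails, because these bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(wrong_codec_failed)  # True

# Transcoding GBK -> UTF-8 means decoding to str first, then encoding again.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("utf-8") == text)  # True
```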
3. Data parsing
Example:
From fetching the web page data to parsing it:

Jupyter QtConsole 4.4.1
Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

import requests

ur1='https://wiki.mbalib.com/wiki/Portal:证券'

r=requests.get(ur1)

r.status_code
Out[4]: 200

r.encoding
Out[5]: 'utf-8'

r.text
Out[6]: '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh" lang="zh" dir="ltr">\n  <head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=0, minimum-scale=1.0, maximum-scale=1.0" />\n    <meta name="apple-itunes-app" content="app-id=908915361" />\n    <meta name="keywords" content="Portal:证券,A股,B股,H股,IPO,万科,万科股份,万科股权之争,上海证券交易所,上海黄金交易所,东京证券交易所" />\n<link rel="shortcut icon" href="/favicon.ico" />\n<link rel="search" type="application/opensearchdescription+xml" href="/w/opensearch_desc.php" title="MBA智库百科 (中文)" />\n<link rel="copyright" href="http://www.mbalib.com" />\n    <title>Portal:证券 - MBA智库百科</title>\n    <link rel="stylesheet" type="text/css" media="screen,projection" href="https://img.mbalib.com/wiki/MBALib/main_v921.css?v=20200303" />\n    <link rel="stylesheet" type="text/css" media="print" href="//img.mbalib.com/wiki/common/commonPrint.css" />\n    <!--[if lt IE 5.5000]><style type="text/css">@import "//img.mbalib.com/wiki/MBALib/IE50Fixes.css";</style><![endif]-->\n    <!--[if IE 5.5000]><style type="text/css">@import "//img.mbalib.com/wiki/MBALib/IE55Fixes.css";</style><![endif]-->\n    <!--[if IE 6]><style type="text/css">@import "//img.mbalib.com/wiki/MBALib/IE60Fixes.css";</style><![endif]-->\n      <link rel="stylesheet" href="//img.mbalib.com/wiki/MBALib/iconfont/iconfont.css">\n\n      <!--[if IE]><script type="text/javascript" src="//img.mbalib.com/wiki/common/IEFixes.js"></script>\n    <meta http-equiv="imagetoolbar" content="no" /><![endif]-->\n\n    <script type="text/javascript" src="//img.mbalib.com/common/jquery/jquery.js"></script>\n    <script type="text/javascript" src="//img.mbalib.com/common/notice_v0927.js?v=411"></script>\n    <script type="text/javascript" 
src="//img.mbalib.com/common/common.js?v=1010"></script>\n\t<script type="text/javascript" src="//img.mbalib.com/wiki/common/wiki.js?v=1010"></script>\n\n      <script>\n          //360\n          (function(){\n              var src = "https://jspassport.ssl.qhimg.com/11.0.1.js?d182b3f28525f2db83acfaaf6e696dba";\n              document.write(\'<script src="\' + src + \'" id="sozz"><\\/script>\');\n          })();\n      </script>\n      <!-- 请置于所有广告位代码之前 -->\n      <script src="https://dup.baidustatic.com/js/ds.js"></script>\n\n\n      <script>\n          $(document).ready(function(){initSetup();});\n          var _hmt = _hmt || [];\n          (function() {\n              var hm = document.createElement("script");\n              hm.src = "https://hm.baidu.com/hm.js?96771760d942f755aa887b0b28d1c30a";\n              var s = document.getElementsByTagName("script")[0];\n              s.parentNode.insertBefore(hm, s);\n          })();\n      </script>\n\n    <script type= "text/javascript">\n\t\t\tvar skin = "MBALib";\n\t\t\tvar stylepath = "/w/skins";\n\n\t\t\tvar wgArticlePath = "/wiki/$1";\n\t\t\tvar wgScriptPath = "/w";\n\t\t\tvar wgServer = "https://wiki.mbalib.com";\n                        \n\t\t\tvar wgCanonicalNamespace = "Portal";\n\t\t\tvar wgNamespaceNumber = 102;\n\t\t\tvar wgPageName = "Portal:证券";\n\t\t\tvar wgPageURL = "Portal:%E8%AF%81%E5%88%B8";\n\t\t\tvar wgTitle = "证券";\n\t\t\tvar wgVariantTitle = "证券";\n\t\t\tvar wgArticleId = 160595;\n\t\t\tvar wgIsArticle = true;\n                        \n\t\t\tvar wgUserName = null;\n\t\t\tvar wgUserLanguage = "zh";\n\t\t\tvar wgContentLanguage = "zh";\n\t\t</script>\n\t\t        <script type="text/javascript" src="//img.mbalib.com/wiki/common/wikibits.js"></script>\n    <style type="text/css">/*<![CDATA[*/\n@import "/w/index.php?title=MediaWiki:Common.css&usemsgcache=yes&action=raw&ctype=text/css&smaxage=18000";\n@import 
"/w/index.php?title=MediaWiki:MBALib.css&usemsgcache=yes&action=raw&ctype=text/css&smaxage=18000";\n@import "/w/index.php?title=-&action=raw&gen=css&maxage=18000";\n/*]]>*/</style>                <!-- Head Scripts -->\n    <script type="text/javascript" src="/w/skins/common/ajax.js"></script>\n\n\n\n  </head>\n  <body                 class="ns-102  article-160595">\n        <!-- 广告位:WIKI-07-365x110 -->\n<!-- 广告位:Mobile-Float -->\n<script>\n(function() {\n    var s = "_" + Math.random().toString(36).slice(2);\n    document.write(\'<div id="\' + s + \'"></div>\');\n\tif($(window).width()<992)\n\t{\n\t\t(window.slotbydup=window.slotbydup || []).push({\n\t        id: \'2406317\',\n\t        container: s,\n\t        size: \'20,3\',\n\t        display: \'float\',\n\t        async:true\n\t    });\n\t}else\n\t{\n    (window.slotbydup=window.slotbydup || []).push({\n        id: \'2403011\',\n        container: s,\n        size: \'365,110\',\n        display: \'float\'\n    });\n\t}\n})();\n</script>\n    <div id="globalWrapper" \t class=" ">\n      <div id="column-content">\n\t<div id="content">\n\t  <a name="top" id="contentTop"></a>\n\t  \t  <!--<h1 class="firstHeading">Portal:证券</h1>-->\n\t  <div class="firstHeading-wrap">\n            <h1 class="firstHeading">Portal:证券</h1>\n                        <script src="//img.mbalib.com/common/jquery/jquery.qrcode.min.js"></script>\n            <script>\n            (function(){\n                var created = 0;\n                var t;\n\t\t\t\tif(location.hash)\n\t\t\t\t{\n\t\t\t\t\tvar m = location.hash.substring(1);\n\t\t\t\t\tif(m == \'from_qrcode\')\n\t\t\t\t\t{\n\t\t\t\t\t\tif(typeof stateWikiVisitQrcode === \'function\' )\n\t\t\t\t\t\t\tstateWikiVisitQrcode();\n\t\t\t\t\t}\n\t\t\t\t}\n\n\t\t\t\tfunction createQrcode(url)\n\t\t\t\t{\n\t\t\t\t\tif(created == 0)\n\t\t\t\t\t{\n\t\t\t\t\t\t$("#wx-img").html(\'\');\n\t\t                $("#wx-img").qrcode({\n\t\t                    //render: "table", //table方式\n\t\t             
       width: 200, //宽度\n\t\t                    height:200, //高度\n\t\t                    text: url //任意内容\n\t\t                });\n\t\t                if(typeof stateWikiShowQrcode === \'function\' )\n\t\t                \tstateWikiShowQrcode();\n\t\t                created = 1;\n\t\t\t\t\t}\n\t\t\t\t}\n                \n                $("#qrcode_wiki").mouseover(function(){\n                \tclearTimeout(t);\n\n                \tvar url = window.location.href;\n                \tvar url2 = \'https://www.mbalib.com/app/download?from_source=wiki_qrcode\';\n                \tvar pattern = /^(http|https):\\/\\/wiki\\.(test\\.|dev\\.)?mbalib\\.com\\/wiki\\/([^?#]*)/i;\n                \tvar d = pattern.exec(url);\n                \tvar key = d[3];\n                \turl = url2 + \'&key=\' + key;\n                \t\n\t                //url = url + \'#from_qrcode\';\n\t                if(false && url.length > 100)//create short url 2018.10.16不使用短链了,APP扫码不支持短链\n\t                {\n\t\t                if(created == 0)\n\t\t                {\n\t\t\t                $.post(\'https://ke.mbalib.com/api/shorturl\',{url:url},function(result){\n\t\t\t\t                if(result.url)\n\t\t\t\t                {\n\t\t\t\t\t                url = result.url;\n\t\t\t\t\t            }\n\t\t\t\t                createQrcode(url);\n\t\t\t\t                $(".app-guide-box").show();\n\t\t\t\t            },\'json\');\n\t\t                }else\n\t\t                {\n\t\t\t                $(".app-guide-box").show();\n\t\t\t            }\n\t\t            }else\n\t\t            {\n\t                \tcreateQrcode(url);\n\t\t                $(".app-guide-box").show();\n\t\t            }\n                }),$("#qrcode_wiki").mouseout(function(){\n                \tt=setTimeout(\'$(".app-guide-box").hide();\',500);\n                });\n\n                $(".app-guide-box").mouseover(function(){\n                \tclearTimeout(t);\n                  
}),$(".app-guide-box").mouseout(function(){\n                      t=setTimeout(\'$(".app-guide-box").hide();\',500);\n                });\n            })();\n            </script>\n      </div>\n      <div style="clear:both;"></div>\n\t  <div id="bodyContent" >\n\t    <h3 id="siteSub">出自 MBA智库百科(<a href="//wiki.mbalib.com/">https://wiki.mbalib.com/</a>)</h3>\n\t    <div id="contentSub"></div>\n\t    \t    \t    <!-- start content -->\n\t    <div style="float:left; width:63%;"> \n<div style="clear: both;"></div> \n<div style="position: relative;border: 1px solid #aaafaa;background: #d6eaff;color: black;padding: .1em;text-align: center;font-weight: bold;font-size: 100%;margin-bottom: 0px;height:&nbsp;;border-bottom: none;"><span class="plainlinks" style="position: absolute;top: 1px;right: 6px;background: transparent;border: 0px;margin-bottom:.1em;font-size:80%;font-weight: normal;color: black;"> <a href="https://wiki.mbalib.com/w/index.php?title=Portal:%E8%AF%81%E5%88%B8/%E7%83%AD%E7%82%B9%E8%AF%9D%E9%A2%98&amp;action=edit"  class="external text" title="https://wiki.mbalib.com/w/index.php?title=Portal:%E8%AF%81%E5%88%B8/%E7%83%AD%E7%82%B9%E8%AF%9D%E9%A2%98&amp;action=edit" target="_blank"><span style="color:black" title="正在編輯 Portal:证券/热点话题">编辑</span></a> &nbsp;</span><div class="headline-2"><a class="headline" name=".E7.83.AD.E7.82.B9.E8.AF.9D.E9.A2.98"></a><h2 style="font-size:100%;font-weight:bold;border: none; margin: 0; padding:0; padding-bottom:.1em; color:black">热点话题</h2></div></div>\n<div style="display: block;border: 1px solid #aaafaa;vertical-align: top;background: white;color: black;margin-bottom: 10px;padding: 1em;margin-top: 0em;height:;">\n<div class="floatleft"><span><a href="/wiki/Image:%E4%B8%87%E7%A7%91%E8%82%A1%E6%9D%83%E4%B9%8B%E4%BA%895.jpg" class="image" title=""><img src="/w/images/thumb/8/8e/%E4%B8%87%E7%A7%91%E8%82%A1%E6%9D%83%E4%B9%8B%E4%BA%895.jpg/190px-%E4%B8%87%E7%A7%91%E8%82%A1%E6%9D%83%E4%B9%8B%E4%BA%895.jpg" alt="" width="190" 
height="143" longdesc="/wiki/Image:%E4%B8%87%E7%A7%91%E8%82%A1%E6%9D%83%E4%B9%8B%E4%BA%895.jpg" /></a></span></div><center><b><a href="/wiki/%E4%B8%87%E7%A7%91%E8%82%A1%E6%9D%83%E4%B9%8B%E4%BA%89" title="万科股权之争">万科股权之争</a></b></center>\n<dl><dd>\u3000\u3000 2016年11月29日,中国<a href="/wiki/%E6%81%92%E5%A4%A7%E9%9B%86%E5%9B%A2" title="恒大集团">恒大集团</a>发布公告披露,公司从2016年11月18日至11月29日透过其<a href="/wiki/%E9%99%84%E5%B1%9E%E5%85%AC%E5%8F%B8" title="附属公司">附属公司</a>在<a href="/wiki/%E5%B8%82%E5%9C%BA" title="市场">市场</a>上进一步收购共约5.10亿股<a href="/wiki/%E4%B8%87%E7%A7%91" title="万科">万科</a><a href="/wiki/A%E8%82%A1" title="A股">A股</a>,连同前收购,公司共持有约15.53亿股万科A股股票,占万科已发行股本总额约14.07%。截至本公布日期,本收购及前收购之总代价约为人民币362.73亿元。\n</dd></dl>\n<dl><dd>\u3000\u3000根据万科目前的<a href="/wiki/%E8%82%A1%E6%9C%AC%E7%BB%93%E6%9E%84" title="股本结构">股本结构</a>,大股东宝能系持股比例为25.40%,其次为华润持股比例15.31%。恒大此番增持<a href="/wiki/%E4%B8%87%E7%A7%91%E8%82%A1%E4%BB%BD" title="万科股份">万科股份</a>至14.07%,距离第二大股东位置十分逼近。对于收购万科股票的原因,恒大多次在公告中披露:万科为中国最大<a href="/wiki/%E6%88%BF%E5%9C%B0%E4%BA%A7%E5%BC%80%E5%8F%91%E5%95%86" title="房地产开发商">房地产开发商</a>之一,其财务表现强劲,收购事项为公司的投资行为。而持续一年有余的万科股权之争仍在持续,恒大多次增持令股权争夺再添变数。\n</dd></dl>\n<dl><dd>\u3000\u3000<b>相关知识:</b>《证券法》规定,投资者持有一个上市公司已发行股份的5%时,应在该事实发生之日起3日内,向国务院证券监督管理机构、证券交易所作出书面报告,通知该上市公司并予以公告,并且履行有关法律规定的义务。业内称之为"举牌"。...[<a href="/wiki/%E4%B8%BE%E7%89%8C" title="举牌">详细</a>]\n</dd></dl>\n<dl><dd>\u3000\u3000<b>相关条目:</b><a href="/wiki/%E5%A7%9A%E6%8C%AF%E5%8D%8E" title="姚振华">姚振华</a>、<a href="/wiki/%E7%8E%8B%E7%9F%B3" title="王石">王石</a>、<a href="/wiki/%E4%B8%87%E7%A7%91" title="万科">万科</a>、<a href="/wiki/%E8%82%A1%E6%9D%83" title="股权">股权</a>、<a href="/wiki/%E8%AE%B8%E5%AE%B6%E5%8D%B0" title="许家印">许家印</a>\n</dd></dl>\n<div class="noprint" style="clear: both; text-align: right; margin: .3em -.2em -.2em .3em; padding: .3em -.2em -.2em .3em; font-weight: bold;">  </div></div>\n<div style="clear: both;"></div> \n<div style="position: relative;border: 1px solid #aaafaa;background: #d6eaff;color: black;padding: .1em;text-align: 
center;font-weight: bold;font-size: 100%;margin-bottom: 0px;height:&nbsp;;border-bottom: none;"><span class="plainlinks" style="position: absolute;top: 1px;right: 6px;background: transparent;border: 0px;margin-bottom:.1em;font-size:80%;font-weight: normal;color: black;"> <a href="https://wiki.mbalib.com/w/index.php?title=Portal:%E8%AF%81%E5%88%B8/%E5%88%86%E7%B1%BB&amp;action=edit"  class="external text" title="https://wiki.mbalib.com/w/index.php?title=Portal:%E8%AF%81%E5%88%B8/%E5%88%86%E7%B1%BB&amp;action=edit" target="_blank"><span style="color:black" title="正在編輯 Portal:证券/分类">编辑</span></a> &nbsp;</span><div class="headline-2"><a class="headline" name=".E8.AF.81.E5.88.B8.E5.88.86.E7.B1.BB"></a><h2 style="font-size:100%;font-weight:bold;border: none; margin: 0; padding:0; padding-bottom:.1em; color:black">证券分类</h2></div></div>\n<div style="display: block;border: 1px solid #aaafaa;vertical-align: top;background: white;color: black;margin-bottom: 10px;padding: 1em;margin-top: 0em;height:;">\n<table width="100%" style="font-size:12px;">\n<tr>\n<td valign="top"><a href="/wiki/Category:%E8%82%A1%E7%A5%A8" title="Category:股票"><b>股票</b></a><br><a href="/wiki/A%E8%82%A1" title="A股">A股</a>\u3000<a href="/wiki/B%E8%82%A1" title="B股">B股</a>\u3000<a href="/wiki/H%E8%82%A1" title="H股">H股</a>\u3000<a href="/wiki/%E8%93%9D%E7%AD%B9%E8%82%A1" title="蓝筹股">蓝筹股</a>\u3000<a href="/wiki/IPO" title="IPO">IPO</a>\u3000<a href="/wiki/Category:%E8%82%A1%E7%A5%A8%E6%9C%AF%E8%AF%AD" title="Category:股票术语">&gt;&gt;</a>\n</td><td valign="top"><a href="/wiki/Category:%E6%9D%83%E8%AF%81" title="Category:权证"><b>权证</b></a><br><a href="/wiki/%E8%AE%A4%E8%B4%AD%E6%9D%83%E8%AF%81" title="认购权证">认购权证</a>\u3000<a href="/wiki/%E8%AE%A4%E6%B2%BD%E6%9D%83%E8%AF%81" title="认沽权证">认沽权证</a>\u3000<a href="/wiki/%E9%85%8D%E8%82%A1%E6%9D%83%E8%AF%81" title="配股权证">配股权证</a>\u3000<a href="/wiki/Category:%E6%9D%83%E8%AF%81%E6%9C%AF%E8%AF%AD" title="Category:权证术语">&gt;&gt;</a>\n</td><td valign="top"><a 
href="/wiki/Category:%E5%9F%BA%E9%87%91" title="Category:基金"><b>基金</b></a><br><a href="/wiki/%E5%BC%80%E6%94%BE%E5%BC%8F%E5%9F%BA%E9%87%91" title="开放式基金">开放式基金</a>\u3000<a href="/wiki/%E5%B0%81%E9%97%AD%E5%BC%8F%E5%9F%BA%E9%87%91" title="封闭式基金">封闭式基金</a>\u3000<a href="/wiki/%E6%8C%87%E6%95%B0%E5%9F%BA%E9%87%91" title="指数基金">指数基金</a>\u3000<a href="/wiki/Category:%E5%9F%BA%E9%87%91%E6%9C%AF%E8%AF%AD" title="Category:基金术语">&gt;&gt;</a>\n</td></tr>\n<tr>\n<td height="5" colspan="3">\n</td></tr>\n<tr>\n<td valign="top"><a href="/wiki/Category:%E5%A4%96%E6%B1%87" title="Category:外汇"><b>外汇</b></a><br><a href="/wiki/%E5%A4%96%E6%B1%87%E5%B8%82%E5%9C%BA" title="外汇市场">外汇市场</a>\u3000<a href="/wiki/%E5%A4%96%E6%B1%87%E4%BA%A4%E6%98%93" title="外汇交易">外汇交易</a>\u3000<a href="/wiki/%E5%A4%96%E6%B1%87%E5%82%A8%E5%A4%87" title="外汇储备">外汇储备</a>\u3000<a href="/wiki/Category:%E5%A4%96%E6%B1%87%E6%9C%AF%E8%AF%AD" title="Category:外汇术语">&gt;&gt;</a>\n</td><td valign="top"><a href="/wiki/Category:%E6%9C%9F%E8%B4%A7" title="Category:期货"><b>期货</b></a><br><a href="/wiki/%E6%9C%9F%E8%B4%A7%E5%B8%82%E5%9C%BA" title="期货市场">期货市场</a>\u3000<a href="/wiki/%E6%9C%9F%E8%B4%A7%E4%BA%A4%E6%98%93" title="期货交易">期货交易</a>\u3000<a href="/wiki/%E6%9C%9F%E8%B4%A7%E4%BA%A4%E6%98%93%E6%89%80" title="期货交易所">期货交易所</a>\u3000<a href="/wiki/Category:%E6%9C%9F%E8%B4%A7%E6%9C%AF%E8%AF%AD" title="Category:期货术语">&gt;&gt;</a>\n</td><td valign="top"><a href="/wiki/Category:%E6%9C%9F%E6%9D%83" title="Category:期权"><b>期权</b></a><br><a href="/wiki/%E8%82%A1%E7%A
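
The parsing step itself is not shown in the surviving text, so the following is only a sketch of one way to pull entries out of HTML like the output above. It uses the standard library html.parser rather than whatever tool the original screenshots used. The class name WikiLinkParser and the sample_html snippet (copied from a few tags in the output) are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class WikiLinkParser(HTMLParser):
    """Collect (title, path) pairs from <a href="/wiki/..." title="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        title = attr.get("title")
        # Keep only internal wiki links that carry a non-empty title.
        if href.startswith("/wiki/") and title:
            # unquote() turns percent-encoded UTF-8 back into readable text.
            self.links.append((title, unquote(href)))


# A short excerpt in the shape of the r.text output; in practice feed r.text.
sample_html = (
    '<td valign="top">'
    '<a href="/wiki/Category:%E8%82%A1%E7%A5%A8" title="Category:股票"><b>股票</b></a><br>'
    '<a href="/wiki/A%E8%82%A1" title="A股">A股</a>\u3000'
    '<a href="/wiki/IPO" title="IPO">IPO</a>'
    '</td>'
)

parser = WikiLinkParser()
parser.feed(sample_html)
parser.close()

for title, path in parser.links:
    print(title, path)
```

With a live response, `parser.feed(r.text)` would replace the sample string; which links are worth keeping depends on the page, so the filter above is only a starting point.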