Mining DBLP co-authorship with FP-Growth in practice (1): extracting target information (venues, authors, etc.) from the DBLP dataset



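First, download the DBLP dataset from the official site: http://dblp.uni-trier.de/xml/. Only dblp.xml.gz is needed. Decompressing it gives a dblp.xml file of more than 1 GB, so it is on the large side.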
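The decompression step can also be done from Python. This is a minimal sketch, not part of the original workflow: the helper name `decompress_gz` is hypothetical, and it assumes dblp.xml.gz has already been downloaded into the working directory. It streams the data in chunks so the 1 GB+ result never has to sit in memory at once.

```python
import gzip
import shutil


def decompress_gz(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .gz file to dst_path, one chunk at a time."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Intended use (assumes the download is already in the current directory):
# decompress_gz("dblp.xml.gz", "dblp.xml")
```

Running `gunzip dblp.xml.gz` on the command line does the same job; the Python version is only convenient when the rest of the pipeline is already a script.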






Extract a sample from the raw data:

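# Copy the first 30 lines of dblp.xml into a small sample file.
# The file declares ISO-8859-1 in its XML header, so use that encoding.
from itertools import islice

with open("dblp.xml", "r", encoding="ISO-8859-1") as r, \
        open("dblpExample.xml", "w", encoding="ISO-8859-1") as w:
    for i, line in enumerate(islice(r, 30)):
        print("extract line", i)
        w.write(line)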
The result:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
...
</article>
...
</dblp>

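This turns out not to be useful, because the first lines of the file show only one kind of record. So a different approach is used below: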




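Only the following venues are to be extracted: SDM, ICDM, ECML-PKDD, PAKDD, WSDM, DMKD, TKDE, KDD Explorations, ACM Trans. on KDD, CVPR, ICML, NIPS, COLT, SIGIR and SIGKDD (the list of sixteen names CVPR twice, so there are fifteen distinct venues, a mix of conferences and journals). For each of them we want all records from at least 2000 up to the present.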
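One way to pull such records out of the full file is to scan it line by line and keep only records whose `key` starts with a wanted prefix and whose `<year>` is 2000 or later. The sketch below is an assumption about how this could be done, not the method used in this post. The names `iter_records`, `KEY_PREFIXES` and `MIN_YEAR` are hypothetical, and only the four key prefixes visible in the samples in this post are listed; prefixes for the remaining venues would have to be looked up in dblp.xml. It assumes every record spans several lines, and it tolerates a closing tag that shares a line with the next opening tag.

```python
import re

# Prefixes seen in the sample records of this post (assumption: incomplete list).
KEY_PREFIXES = ("conf/sdm/", "conf/icdm/", "conf/pkdd/", "conf/pakdd/")
MIN_YEAR = 2000

OPEN_RE = re.compile(r'<(inproceedings|article)\b[^>]*\bkey="([^"]+)"')
YEAR_RE = re.compile(r"<year>(\d{4})</year>")


def _try_open(line, pos, prefixes):
    """Return (buffer, closing tag) if a wanted record opens in line[pos:]."""
    m = OPEN_RE.search(line, pos)
    if m and m.group(2).startswith(prefixes):
        return [line[m.start():]], "</%s>" % m.group(1)
    return None, None


def iter_records(lines, key_prefixes=KEY_PREFIXES, min_year=MIN_YEAR):
    """Yield the XML text of each wanted record published in min_year or later."""
    prefixes = tuple(key_prefixes)
    buf, close_tag = None, None
    for line in lines:
        if buf is None:
            # Not inside a record: check whether a wanted one starts here.
            buf, close_tag = _try_open(line, 0, prefixes)
            continue
        end = line.find(close_tag)
        if end < 0:
            buf.append(line)
            continue
        # The record ends on this line: cut it off right after the closing tag.
        end += len(close_tag)
        buf.append(line[:end] + "\n")
        text = "".join(buf)
        year = YEAR_RE.search(text)
        if year and int(year.group(1)) >= min_year:
            yield text
        # The rest of the line may already open the next record.
        buf, close_tag = _try_open(line, end, prefixes)
```

Because `iter_records` accepts any iterable of lines, it can be fed an open file object, for example one opened with `open("dblp.xml", encoding="ISO-8859-1")`, and it never holds more than one record in memory.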

A look at SDM:

<inproceedings mdate="2014-02-12" key="conf/sdm/HanN08">
<author>Shuguo Han</author>
<author>Wee Keong Ng</author>
<title>Preemptive Measures against Malicious Party in Privacy-Preserving Data Mining.</title>
<pages>375-386</pages>
<year>2008</year>
<booktitle>SDM</booktitle>
<ee>http://dx.doi.org/10.1137/1.9781611972788.34</ee>
<crossref>conf/sdm/2008</crossref>
<url>db/conf/sdm/sdm2008.html#HanN08</url>
</inproceedings>
<inproceedings mdate="2015-12-30" key="conf/sdm/LiGGDZ15">
<author>Kang Li</author>
<author>Jing Gao</author>
<author>Suxin Guo</author>
<author>Nan Du</author>
<author>Aidong Zhang</author>
<title>Functional Node Detection on Linked Data.</title>
<pages>1-9</pages>
<year>2015</year>
<booktitle>SDM</booktitle>
<ee>http://dx.doi.org/10.1137/1.9781611974010.1</ee>
<crossref>conf/sdm/2015</crossref>
<url>db/conf/sdm/sdm2015.html#LiGGDZ15</url>
</inproceedings>
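For co-authorship mining, the fields that matter in a record like these are the `key`, the `<author>` elements, the `<year>` and the venue. As a rough sketch (the function name `parse_record` and the dictionary layout are assumptions made for illustration), a single record can be parsed with the standard library's `xml.etree.ElementTree`. The record used here is the first SDM sample above.

```python
import xml.etree.ElementTree as ET

RECORD = """<inproceedings mdate="2014-02-12" key="conf/sdm/HanN08">
<author>Shuguo Han</author>
<author>Wee Keong Ng</author>
<title>Preemptive Measures against Malicious Party in Privacy-Preserving Data Mining.</title>
<pages>375-386</pages>
<year>2008</year>
<booktitle>SDM</booktitle>
<ee>http://dx.doi.org/10.1137/1.9781611972788.34</ee>
<crossref>conf/sdm/2008</crossref>
<url>db/conf/sdm/sdm2008.html#HanN08</url>
</inproceedings>"""


def parse_record(xml_text):
    """Extract key, author list, year and venue from one record."""
    elem = ET.fromstring(xml_text)
    return {
        "key": elem.get("key"),
        "authors": [a.text for a in elem.findall("author")],
        "year": int(elem.findtext("year")),
        # Conference papers carry <booktitle>, journal articles carry <journal>.
        "venue": elem.findtext("booktitle") or elem.findtext("journal"),
    }


info = parse_record(RECORD)
print(info["authors"])  # ['Shuguo Han', 'Wee Keong Ng']
```

One caveat: DBLP records may contain character entities that are defined only in dblp.dtd. `ET.fromstring` raises a `ParseError` on those when a record is parsed in isolation like this, so a real run needs a parser that loads the DTD or some entity handling of its own.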


A look at ICDM:

<inproceedings mdate="2014-09-17" key="conf/icdm/LazarevicKKKT03">
<author>Aleksandar Lazarevic</author>
<author>Ramdev Kanapady</author>
<author>Chandrika Kamath</author>
<author>Vipin Kumar</author>
<author>Kumar K. Tamma</author>
<title>Localized Prediction of Continuous Target Variables Using Hierarchical Clustering.</title>
<pages>139-146</pages>
<year>2003</year>
<crossref>conf/icdm/2003</crossref>
<booktitle>ICDM</booktitle>
<ee>http://dx.doi.org/10.1109/ICDM.2003.1250913</ee>
<ee>http://doi.ieeecomputersociety.org/10.1109/ICDM.2003.1250913</ee>
<url>db/conf/icdm/icdm2003.html#LazarevicKKKT03</url>
</inproceedings>
<inproceedings mdate="2014-09-17" key="conf/icdm/CampagnaP09">
<author>Andrea Campagna</author>
<author>Rasmus Pagh</author>
<title>Finding Associations and Computing Similarity via Biased Pair Sampling.</title>
<pages>61-70</pages>
<year>2009</year>
<booktitle>ICDM</booktitle>
<ee>http://dx.doi.org/10.1109/ICDM.2009.35</ee>
<ee>http://doi.ieeecomputersociety.org/10.1109/ICDM.2009.35</ee>
<crossref>conf/icdm/2009</crossref>
<url>db/conf/icdm/icdm2009.html#CampagnaP09</url>
</inproceedings>

ECML-PKDD on its own:

<inproceedings mdate="2013-08-30" key="conf/pkdd/TomasevM13a">
<author>Nenad Tomasev</author>
<author>Dunja Mladenic</author>
<title>Image Hub Explorer: Evaluating Representations and Metrics for Content-Based Image Retrieval and Object Recognition.</title>
<pages>637-640</pages>
<year>2013</year>
<booktitle>ECML/PKDD (3)</booktitle>
<ee>http://dx.doi.org/10.1007/978-3-642-40994-3_44</ee>
<crossref>conf/pkdd/2013-3</crossref>
<url>db/conf/pkdd/pkdd2013-3.html#TomasevM13a</url>
</inproceedings>
<inproceedings mdate="2015-08-30" key="conf/pkdd/BudhathokiV15">
<author>Kailash Budhathoki</author>
<author>Jilles Vreeken</author>
<title>The Difference and the Norm - Characterising Similarities and Differences Between Databases.</title>
<pages>206-223</pages>
<year>2015</year>
<booktitle>ECML/PKDD (2)</booktitle>
<ee>http://dx.doi.org/10.1007/978-3-319-23525-7_13</ee>
<crossref>conf/pkdd/2015-2</crossref>
<url>db/conf/pkdd/pkdd2015-2.html#BudhathokiV15</url>
</inproceedings>
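Note that the `<booktitle>` here is "ECML/PKDD (3)" or "ECML/PKDD (2)" rather than a single fixed string, so an exact match on the booktitle would miss these records. Two possible workarounds are sketched below; both helper names are hypothetical. The first strips a trailing volume number in parentheses, the second ignores the booktitle and takes the venue from the record key instead.

```python
import re

# Matches a trailing volume marker such as " (3)".
_VOLUME_SUFFIX = re.compile(r"\s*\(\d+\)$")


def normalize_booktitle(booktitle):
    """Drop a trailing ' (n)' volume marker: 'ECML/PKDD (3)' -> 'ECML/PKDD'."""
    return _VOLUME_SUFFIX.sub("", booktitle.strip())


def venue_from_key(key):
    """Take everything before the last '/': 'conf/pkdd/TomasevM13a' -> 'conf/pkdd'."""
    return key.rsplit("/", 1)[0]


print(normalize_booktitle("ECML/PKDD (3)"))     # ECML/PKDD
print(venue_from_key("conf/pkdd/TomasevM13a"))  # conf/pkdd
```

Matching on the key prefix is probably the more robust of the two, since in the samples shown here the key stays `conf/pkdd/...` across years while the booktitle changes with the volume.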


PAKDD on its own:

<inproceedings mdate="2008-05-15" key="conf/pakdd/HanN08">
<author>Shuguo Han</author>
<author>Wee Keong Ng</author>
<title>Privacy-Preserving Linear Fisher Discriminant Analysis.</title>
<pages>136-147</pages>
<year>2008</year>
<booktitle>PAKDD</booktitle>
<ee>http://dx.doi.org/10.1007/978-3-540-68125-0_14</ee>
<crossref>conf/pakdd/2008</crossref>
<url>db/conf/pakdd/pakdd2008.html#HanN08</url>
</inproceedings>
<inproceedings mdate="2005-05-18" key="conf/pakdd/BoWJ05">
<author>Liefeng Bo</author>
<author>Ling Wang</author>
<author>Licheng Jiao</author>
<title>Training Support Vector Machines Using Greedy Stagewise Algorithm.</title>
<pages>632-638</pages>
<year>2005</year>
<crossref>conf/pakdd/2005</crossref>
<booktitle>PAKDD</booktitle>
<ee>http://dx.doi.org/10.1007/11430919_73</ee>
<url>db/conf/pakdd/pakdd2005.html#BoWJ05</url>
</inproceedings>
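Two details are visible in these samples. The key suffix `HanN08` occurs under both `conf/sdm/` and `conf/pakdd/`, so only the full key identifies a paper. And the same two authors wrote both papers, which is exactly the kind of repeated co-occurrence that frequent-itemset mining is meant to find. As a small illustration, the author lists can be treated as transactions and co-author pairs counted directly. This is plain pair counting, not FP-Growth, and the names `papers` and `count_pairs` are hypothetical; the author lists are copied from the sample records in this post.

```python
from collections import Counter
from itertools import combinations

# One transaction per paper, indexed by the full DBLP key.
papers = {
    "conf/sdm/HanN08": ["Shuguo Han", "Wee Keong Ng"],
    "conf/pakdd/HanN08": ["Shuguo Han", "Wee Keong Ng"],
    "conf/pakdd/BoWJ05": ["Liefeng Bo", "Ling Wang", "Licheng Jiao"],
}


def count_pairs(transactions):
    """Count in how many transactions each unordered author pair appears."""
    counts = Counter()
    for authors in transactions:
        # Sort so that (A, B) and (B, A) map to the same dictionary key.
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(papers.values())
print(pair_counts[("Shuguo Han", "Wee Keong Ng")])  # 2
```

Counting every pair this way grows quickly with the number of authors per paper and says nothing about larger groups, which is where an algorithm such as FP-Growth comes in.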

WSDM on its own:

<inproceedin