GitHub User and Repository Analysis Crawler

About the crawler

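Having finished a Stackoverflow crawler, this time I set out to write one for GitHub. It uses the Scrapy framework to crawl GitHub user and repository information, and downloads images through a pipeline.
GitHub is a great community: you can find plenty of excellent projects and practical libraries here. It is a paradise for coders and, as the joke goes, the world's largest same-sex dating community. The crawled data falls into two main categories, the User class and the Repo class, covering user profiles and repository information respectively.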
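As a rough sketch of the kind of setup described above (not the author's actual project code), Scrapy's built-in `ImagesPipeline` can be enabled in `settings.py`, after which any item carrying an `image_urls` field has those URLs downloaded. The `./avatars` path, the `make_user_item` helper and its field names other than `image_urls` and `images` are assumptions made for illustration.

```python
# settings.py (sketch): turn on Scrapy's built-in image pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
# Where downloaded images are written; this path is an assumption.
IMAGES_STORE = "./avatars"


def make_user_item(login, avatar_url, followers, following):
    """Hypothetical helper: build a dict-style item for one user.

    ImagesPipeline reads URLs from `image_urls` and stores the download
    results in `images` (the pipeline's default field names).
    """
    return {
        "login": login,
        "followers": followers,
        "following": following,
        "image_urls": [avatar_url],
        "images": [],
    }
```

A spider would yield such an item for each profile page; note that `ImagesPipeline` needs Pillow installed to process the downloads.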

User class

First, let's see which big names make up the site-wide top 10 on GitHub by number of followers.

| User | Repos | Stars | Followers | Following |
| --- | --- | --- | --- | --- |
| https://github.com/torvalds | 4 | 2 | 53500 | 0 |
| https://github.com/JakeWharton | 93 | 213 | 34000 | 12 |
| https://github.com/tj | 253 | 1700 | 27600 | 46 |
| https://github.com/ruanyf | 43 | 125 | 24700 | 0 |
| https://github.com/addyosmani | 292 | 732 | 24700 | 241 |
| https://github.com/paulirish | 261 | 683 | 23300 | 239 |
| https://github.com/mojombo | 61 | 121 | 20100 | 11 |
| https://github.com/gaearon | 202 | 1100 | 17000 | 171 |
| https://github.com/sindresorhus | 877 | 2200 | 16900 | 40 |
| https://github.com/daimajia | 60 | 2900 | 16400 | 236 |
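A ranking like this is a plain top-N selection over the crawled records. Here is a minimal sketch, assuming each user is stored as a dict; the `top_by` helper and the field names are hypothetical, and the follower counts are a few values read from the table above.

```python
import heapq

# A few (user, followers) records read from the table above.
users = [
    {"user": "tj", "followers": 27600},
    {"user": "torvalds", "followers": 53500},
    {"user": "ruanyf", "followers": 24700},
    {"user": "JakeWharton", "followers": 34000},
]


def top_by(records, field, n=10):
    """Return the n records with the largest value of `field`, descending."""
    return heapq.nlargest(n, records, key=lambda r: r[field])


for row in top_by(users, "followers", n=3):
    print(row["user"], row["followers"])
```

`heapq.nlargest` avoids sorting the whole dataset, which helps when only ten rows are needed out of a very large crawl.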

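Linus takes first place by an overwhelming margin. Honestly, anyone who has never heard of Linus should be embarrassed to call themselves a programmer; this is a matter of faith. The master is also rather aloof: he follows nobody, perhaps because he is so strong that he has no friends. After all, talk is cheap, show me the code. JakeWharton comes second with 34000+ followers. I have looked at a bit of Android before; I don't know what else to say except to pay my respects.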

China still has two seeded players in the top 10: Ruan Yifeng (ruanyf) and daimajia (代码家).

GitHub's location field is free-form, so it is hard to count registered accounts per country. 77473 users match the keyword "China", and 48667 match the keyword "USA".
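To illustrate why free-form locations are hard to count, here is a minimal sketch of keyword matching over location strings. The `count_by_location` helper and the sample values are made up for illustration; the post does not say how the counts above were actually obtained.

```python
def count_by_location(locations, keyword):
    """Count profiles whose free-form location contains `keyword`.

    Matching is case-insensitive; empty or missing locations are skipped.
    """
    needle = keyword.casefold()
    return sum(1 for loc in locations if loc and needle in loc.casefold())


# Made-up sample values, only to show the matching behaviour.
sample = ["Beijing, China", "china", "San Francisco, USA", None, "Hangzhou"]
print(count_by_location(sample, "China"))  # 2
print(count_by_location(sample, "USA"))    # 1
```

Note that "Hangzhou" is not counted for "China": a keyword match misses anyone who lists only a city, which is one reason such totals are approximate.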

So let's look at the situation at home. Among these 77473 users in China, the top 10 by followers are as follows.

| User | Following | Followers |
| --- | --- | --- |
| https://github.com/ruanyf | 0 | 25.2k |
| https://github.com/daimajia | 236 | 16.5k |
| https://github.com/yyx990803 | 89 | 16.2k |
| https://github.com/michaelliao | 0 | 12.4k |
| https://github.com/JacksonTian | 145 | 12.1k |
| https://github.com/Trinea | 37 | 11.9k |
| https://github.com/lifesinger | 12 | 10k |
| https://github.com/stormzhang | 88 | 9.6k |
| https://github.com/cloudwu | 1 | 9.5k |
| https://github.com/onevcat | 120 | 9k |
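Counts in this table use an abbreviated form such as `25.2k`, which has to be turned into a number before sorting or comparing. This is a small sketch under the assumption that a trailing `k` means thousands; the `parse_count` helper is hypothetical, and the result is only as precise as the rounded display value.

```python
def parse_count(text):
    """Convert a display count such as "25.2k" or "236" to an int."""
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() guards against floating-point noise in the multiplication
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_count("25.2k"))  # 25200
print(parse_count("9k"))     # 9000
print(parse_count("236"))    # 236
```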

Evan You (尤雨溪), the author of vue.js, ranks third. Liao Xuefeng (廖雪峰) follows closely in fourth place; I have read his Python tutorial myself.

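Top 10 by number of personal repositories. Exact repository counts are not visible for organizations, so only individual accounts are included.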

| User | Repos |
| --- | --- |
| https://github.com/pombredanne | 35.4k |
| https://github.com/gitter-badger | 27.1k |
| https://github.com/carriercomm | 18.8k |
| https://github.com/digideskio | 16.9k |
| https://github.com/bestwpw | 13.8k |
| https://github.com/modulexcite | 10.7k |
| https://github.com/happyqq | 9.1k |
| https://github.com/kleopatra999 | 8.2k |
| https://github.com/treejames | 7.2k |
| https://github.com/carabina | 7.2k |

The top two have an enormous number of projects, both above 27k. Impressive. How did they manage that?

Repo class

Top 10 repositories by stars

| Repo | Fork | Star | Watch |
| --- | --- | --- | --- |
| https://github.com/freeCodeCamp/freeCodeCamp | 11121 | 261439 | 7638 |
| https://github.com/twbs/bootstrap | 50468 | 109702 | 6833 |
| https://github.com/vhf/free-programming-books | 20950 | 83871 | 6221 |
| https://github.com/facebook/react | 12036 | 65030 | 4402 |
| https://github.com/d3/d3 | 16709 | 63463 | 3171 |
| https://github.com/getify/You-Dont-Know-JS | 9232 | 57138 | 3279 |
| https://github.com/sindresorhus/awesome | 7113 | 57119 | 3787 |
| https://github.com/angular/angular.js | 27738 | 55503 | 4407 |
| https://github.com/tensorflow/tensorflow | 26135 | 54976 | 4968 |
| https://github.com/robbyrussell/oh-my-zsh | 12298 | 52575 | 1895 |

Top 10 repositories by forks

| Repo | Fork | Star | Watch |
| --- | --- | --- | --- |
| https://github.com/jtleek/datasharing | 170171 | 3858 | 546 |
| https://github.com/rdpeng/ProgrammingAssignment2 | 101258 | 469 | 117 |
| https://github.com/octocat/Spoon-Knife | 90787 | 9969 | 308 |
| https://github.com/twbs/bootstrap | 50468 | 109702 | 6833 |
| https://github.com/rdpeng/ExData_Plotting1 | 43190 | 136 | 18 |
| https://github.com/angular/angular.js | 27738 | 55503 | 4407 |
| https://github.com/rdpeng/RepData_PeerAssessment1 | 27072 | 57 | 17 |
| https://github.com/tensorflow/tensorflow | 26135 | 54976 | 4968 |
| https://github.com/DataScienceSpecialization/courses | 24094 | 2538 | 819 |
| https://github.com/udacity/frontend-nanodegree-resume | 24044 | 706 | 118 |

How many repositories appear in both top 10 lists? The answer is 3.

| Repo | Star | Fork | Watch |
| --- | --- | --- | --- |
| https://github.com/twbs/bootstrap | 109702 | 50468 | 6833 |
| https://github.com/angular/angular.js | 55503 | 27738 | 4407 |
| https://github.com/tensorflow/tensorflow | 54976 | 26135 | 4968 |

And do you know how many overlap between the two top 100 lists? The answer is 51; for the top 500 it is 270.
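The overlap between two rankings is a set intersection. The sketch below uses the repository names from the two top 10 lists in this section; the `overlap` helper is a hypothetical name, and the same function would apply unchanged to top 100 or top 500 lists.

```python
stars_top10 = [
    "freeCodeCamp/freeCodeCamp", "twbs/bootstrap",
    "vhf/free-programming-books", "facebook/react", "d3/d3",
    "getify/You-Dont-Know-JS", "sindresorhus/awesome",
    "angular/angular.js", "tensorflow/tensorflow",
    "robbyrussell/oh-my-zsh",
]
forks_top10 = [
    "jtleek/datasharing", "rdpeng/ProgrammingAssignment2",
    "octocat/Spoon-Knife", "twbs/bootstrap", "rdpeng/ExData_Plotting1",
    "angular/angular.js", "rdpeng/RepData_PeerAssessment1",
    "tensorflow/tensorflow", "DataScienceSpecialization/courses",
    "udacity/frontend-nanodegree-resume",
]


def overlap(a, b):
    """Return the items of `a` that also appear in `b`, keeping a's order."""
    b_set = set(b)
    return [x for x in a if x in b_set]


common = overlap(stars_top10, forks_top10)
print(len(common), common)
```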

There are 1586 repositories with more than 1000 forks. Let's see how many each language has, taking the top 10 languages to produce a bar chart.

Raising the threshold to 10000 forks leaves 41 repositories.

JavaScript, Java and Python hold the top 3 spots pretty firmly. JavaScript in particular is hugely popular, and of course my beloved Python has plenty of potential too.
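The per-language counts behind such a chart can be computed with a simple tally. This is a minimal sketch assuming each repository record carries `language` and `forks` fields; the `language_counts` helper and the sample records are made up and are not the crawled data.

```python
from collections import Counter


def language_counts(repos, min_forks, top=10):
    """Count repositories per language among those with more than
    `min_forks` forks, returning the `top` most common languages."""
    counts = Counter(
        r["language"]
        for r in repos
        if r["forks"] > min_forks and r["language"]
    )
    return counts.most_common(top)


# Made-up sample records, only to show the shape of the computation.
repos = [
    {"name": "a", "language": "JavaScript", "forks": 5000},
    {"name": "b", "language": "JavaScript", "forks": 1200},
    {"name": "c", "language": "JavaScript", "forks": 20000},
    {"name": "d", "language": "Python", "forks": 1500},
    {"name": "e", "language": "Python", "forks": 12000},
    {"name": "f", "language": "Java", "forks": 900},
    {"name": "g", "language": None, "forks": 3000},
]
print(language_counts(repos, min_forks=1000))
print(language_counts(repos, min_forks=10000))
```

The same function works for the stars thresholds by swapping the field that is compared.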

There are 10410 repositories with more than 1000 stars.

402 of them have more than 10000 stars.

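The distribution across the major languages is basically the same as for forks. The only difference is that HTML is replaced by CSS, which hardly matters, since these two languages are practically inseparable.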

Here is a fun ranking: the top 3 repositories on the whole site by amount of code.

Repo
https://github.com/opengapps/arm
https://github.com/kiang/data.fda.gov.tw
https://github.com/hanxiao/hanxiao.github.io

Now let's look at how Python is doing.

Top 10 Python repositories by stars

| Repo | Fork | Star | Watch |
| --- | --- | --- | --- |
| https://github.com/vinta/awesome-python | 6215 | 33163 | 2957 |
| https://github.com/jakubroztocil/httpie | 1949 | 29302 | 856 |
| https://github.com/pallets/flask | 8430 | 26618 | 1681 |
| https://github.com/nvbn/thefuck | 1273 | 26200 | 554 |
| https://github.com/rg3/youtube-dl | 4846 | 25453 | 1064 |
| https://github.com/django/django | 10298 | 25208 | 1523 |
| https://github.com/kennethreitz/requests | 4462 | 24600 | 1007 |
| https://github.com/ansible/ansible | 7496 | 22732 | 1634 |
| https://github.com/josephmisiti/awesome-machine-learning | 5320 | 21963 | 2221 |
| https://github.com/scrapy/scrapy | 5338 | 20053 | 1430 |

Top 10 Python repositories by forks

| Repo | Fork | Star | Watch |
| --- | --- | --- | --- |
| https://github.com/shadowsocks/shadowsocks | 10533 | 17302 | 1520 |
| https://github.com/django/django | 10298 | 25208 | 1523 |
| https://github.com/scikit-learn/scikit-learn | 9952 | 18159 | 1646 |
| https://github.com/pallets/flask | 8430 | 26618 | 1681 |
| https://github.com/ansible/ansible | 7496 | 22732 | 1634 |
| https://github.com/udacity/fullstack-nanodegree-vm | 6495 | 122 | 22 |
| https://github.com/vinta/awesome-python | 6215 | 33163 | 2957 |
| https://github.com/odoo/odoo | 6045 | 6481 | 1130 |
| https://github.com/scrapy/scrapy | 5338 | 20053 | 1430 |
| https://github.com/josephmisiti/awesome-machine-learning | 5320 | 21963 | 2221 |

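shadowsocks does not make the top 10 by stars, yet it takes first place by forks; this "ladder" has fulfilled a lot of people's dreams of climbing over the firewall. The other ladder, XX-NET, sadly missed the top 10 on both counts. Ouch, that one stings.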

| Repo | Fork | Star | Watch |
| --- | --- | --- | --- |
| https://github.com/XX-net/XX-Net | 4682 | 13787 | 1343 |

As usual, let's look at the intersection of these two top 10 lists: there are 5, shown below. (The two top 100 lists share 52.)

| Repo | Star | Fork | Watch |
| --- | --- | --- | --- |
| https://github.com/django/django | 25208 | 10298 | 1523 |
| https://github.com/pallets/flask | 26618 | 8430 | 1681 |
| https://github.com/ansible/ansible | 22732 | 7496 | 1634 |
| https://github.com/vinta/awesome-python | 33163 | 6215 | 2957 |
| https://github.com/josephmisiti/awesome-machine-learning | 21963 | 5320 | 2221 |

The two big web frameworks, django and flask, live up to expectations, and the awesome series is popular in every language.

Thanks for reading (ง •̀_•́)ง (,,• ₃ •,,)