ecw2c理解元数据:使用BigQuery k-means将4,000个堆栈溢出标签聚类

本文探讨如何利用无监督学习中的k-means聚类算法,结合BigQuery的强大功能,对4,000多个Stack Overflow标签进行有效分组,以创建有意义的标签类别。" 113502442,8038675,Redis数据操作详解:Bit命令,"['Redis', '数据库', '数据结构', 'NoSQL']
摘要由CSDN通过智能技术生成
您如何将超过4,000个活动的Stack Overflow标签分组为有意义的组? 对于无监督学习和k均值聚类来说,这是一项完美的任务-现在您可以在BigQuery中完成所有这些工作。 让我们找出方法。
Visualizing a universe of clustered tags.
Felipe Hoffa is a Developer Advocate for Google Cloud. In this post he works with BigQuery – Google’s serverless data warehouse – to run k-means clustering over Stack Overflow’s published dataset, which is refreshed and uploaded to Google’s Cloud once a quarter. You can check out more about working with Stack Overflow data and BigQuery here and here.

4,000+ tags are a lot

These are the most active Stack Overflow tags since 2018 — they’re a lot. In this picture I only have 240 tags — how would you group and categorize 4,000+ of them?
# Tags with >180 questions since 2018
SELECT tag, COUNT(*) questions
FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`,
  UNNEST(SPLIT(tags, '|')) tag
WHERE creation_date > '2018-01-01'
GROUP BY 1
HAVING questions>180
ORDER BY 2 DESC
Top Stack Overflow tags by number of questions.

提示:共现标签

Let’s find tags that usually go together:
Co-occurring tags on Stack Overflow questions
These groupings make sense:
  • 'javascript'与'html'相关。
  • 'python'与'pandas'相关。
  • 'c#'与'.net'相关。
  • “打字稿”与“ angular”有关。
  • 等…
So I’ll take these relationships and I’ll save them on an auxiliary table — plus a percentage of how frequently a relationship happens for each tag.
CREATE OR REPLACE TABLE `deleting.stack_overflow_tag_co_ocurrence`
AS
WITH data AS (
  SELECT * 
  FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
  WHERE creation_date > '2018-01-01'
), active_tags AS (
  SELECT tag, COUNT(*) c
  FROM data, UNNEST(SPLIT(tags, '|')) tag
  GROUP BY 1
  HAVING c>180
)

SELECT *, questions/questions_tag1 percent
FROM (
    SELECT *, MAX(questions) OVER(PARTITION BY tag1) questions_tag1
    FROM (
        SELECT tag1, tag2, COUNT(*) questions
        FROM data, UNNEST(SPLIT(tags, '|')) tag1, UNNEST(SPLIT(tags, '|')) tag2
        WHERE tag1 IN (SELECT tag FROM active_tags)
        AND tag2 IN (SELECT tag FROM active_tags)
        GROUP BY 1,2
        HAVING questions>30
    )
)

One-hot encoding

Now get ready for some SQL magic. BigQuery ML does a good job of hot-encoding strings, but it doesn’t handle arrays as I wish it did (stay tuned). So I’m going to create a string first that will define all the columns where I want to find co-occurrence. Then I can use that string to get a huge table, with a 1 for every time a tag co-occurs with the main one at least certain % of time. Let’s see first a subset of these results: What you see here is a co-occurrence matrix:
  • 'javascript'显示与'php','html','css','node.js'和'jquery'的关系。
  • 'android'显示与'java'的关系
  • “机器学习”显示了与“ python”的关系,但并非相反。
  • “多线程”显示了与“ python”,“ java”的关系 ','c#'和'android`
  • “单元测试”与此处几乎所有列的关系,除了与“ php”,“ html”,“ css”和“ jquery”有关。 / li>
You can reduce or augment the sensibility of these relations with the percent threshold:
SELECT tag1 
,IFNULL(ANY_VALUE(IF(tag2='javascript',1,null)),0) Xjavascript
,IFNULL(ANY_VALUE(IF(tag2='python',1,null)),0 ) Xpython
,IFNULL(ANY_VALUE(IF(tag2='java',1,null)),0) Xjava
,IFNULL(ANY_VALUE(IF(tag2='c#',1,null)),0) XcH
,IFNULL(ANY_VALUE(IF(tag2='android',1,null)),0) Xandroid
,IFNULL(ANY_VALUE(IF(tag2='php',1,null)),0) Xphp
,IFNULL(ANY_VALUE(IF(tag2='html',1,null)),0) Xhtml
,IFNULL(ANY_VALUE(IF(tag2='css',1,null)),0) Xcss
,IFNULL(ANY_VALUE(IF(tag2='node.js',1,null)),0) XnodeDjs
,IFNULL(ANY_VALUE(IF(tag2='jquery',1,null)),0) Xjquery
,SUM(questions) questions_tag1
FROM `deleting.stack_overflow_tag_co_ocurrence`
WHERE percent>0.03
GROUP BY tag1
ORDER BY questions_tag1 DESC 
LIMIT 100

K-means clustering time

Now — instead of using this small table, let’s use the whole table to compute k-means with BigQuery. With this line, I’m creating a one-hot encoding string that I can use later to define the 4,000+ columns I’ll use for k-means:
one_hot_big = client.query("""
SELECT STRING_AGG(
  FORMAT("IFNULL(ANY_VALUE(IF(tag2='%s',1,null)),0)X%s", tag2, REPLACE(REPLACE(REPLACE(REPLACE(tag2,'-','_'),'.','D'),'#','H'),'+','P')) 
) one_hot
FROM (
  SELECT tag2, SUM(questions) questions 
  FROM `deleting.stack_overflow_tag_co_ocurrence`
  GROUP BY tag2
  # ORDER BY questions DESC
  # LIMIT 10
)
""").to_dataframe().iloc[0]['one_hot']
And training a k-means model in BigQuery is really easy:
CREATE MODEL `deleting.kmeans_tagsubtag_50_big_a_01`
OPTIONS ( 
    model_type='kmeans',
    distance_type='COSINE',
    num_clusters=50 )
AS
WITH tag_and_subtags AS (
    SELECT tag1, %s
    FROM `deleting.stack_overflow_tag_co_ocurrence`
    WHERE percent>0.03
    GROUP BY tag1
)
SELECT * EXCEPT(tag1) 
FROM tag_and_subtags
Now we wait — while BigQuery shows us the progress of our training: And when it’s done, we even get an evaluation of our model:

Davies–Bouldin index: 1.8530
Mean squared distance: 0.8174

Performance note

Do we really need 4,000 one-hot encoded dimensions to obtain better clusters? Turns out that 500 are enough — and I like the results better. It also reduces the time for training the model in BigQuery from 24 minutes to 3. The same with only 30 dimensions lowers the time to 90 seconds — but I like the results better with 500. More on hyper-parameter tuning below.
500 one-hot encoded dimensions reduces time per iteration to 30 seconds, and a lower loss.
Davies–Bouldin index: 1.6910
Mean squared distance: 0.52332

Get ready for the results: The 50 clusters are…

Now it’s time to see our results. I’ll even look out for some tags I’m interested in: How are googleamazon, and azure represented in each cluster? These are the 50 groups that k-means clustering found — given the 1-hot encoding of related tags we did earlier in this post. Some results make a lot of sense — while others give great insight into what are the prevalent surrounding technologies to any Stack Overflow tag. Naming each centroid is always a challenge. Here I used the top 5 centroid weight vectors — see how below.

centroid 45: amazon-web-services, aws-lambda, amazon-s3, amazon-ec2, python
—–
amazon-web-services, amazon-s3, aws-lambda, amazon-ec2, amazon-dynamodb, terraform, aws-sdk, amazon-cloudformation, amazon-redshift, aws-api-gateway, amazon-cognito, boto3, cloud, alexa, amazon-rds, amazon-elastic-beanstalk, amazon-ecs, alexa-skills-kit, amazon-cloudfront, serverless, aws-cli, amazon-iam, amazon-cloudwatch, elastic-beanstalk, amazon-sqs, serverless-framework, amazon-athena, aws-amplify, aws-appsync, amazon-sns, alexa-skill, amazon-route53, amazon, amazon-kinesis, amazon-sagemaker, autoscaling, amazon-elb, amazon-ses, aws-cognito, aws-iot, terraform-provider-aws, api-gateway, amazon-vpc, aws-serverless, aws-codepipeline, aws-codebuild, amazon-rds-aurora, bitnami, amazon-lex, aws-step-functions, aws-code-deploy, aws-iam, aws-fargate, dynamodb-queries, boto

amazon: amazon-cognito, amazon-ses, amazon-redshift, aws-lambda, amazon-ecs, amazon-s3, amazon-web-services, amazon-athena, aws-api-gateway, amazon-rds, amazon, amazon-cloudfront, amazon-lex, aws-iot, amazon-elb, aws-code-deploy, amazon-cloudwatch, aws-cli

centroid 17: android, java, android-layout, android-recyclerview, kotlin
—–
android, json, xml, kotlin, android-studio, android-recyclerview, android-layout, android-fragments, xslt, serialization, android-intent, retrofit2, android-activity, android-room, nullpointerexception, retrofit, gson, android-volley, textview, android-viewpager, xml-parsing, recycler-adapter, android-edittext, android-sqlite, protocol-buffers, xsd, deserialization, android-constraintlayout, android-asynctask, fragment, android-architecture-components, android-livedata, imageview, scrollview, android-databinding, android-glide, android-animation, xquery, xslt-1.0, android-jetpack, android-manifest, navigation-drawer, adapter, bottomnavigationview, xslt-2.0, android-toolbar, onclicklistener, android-tablayout, android-cardview, android-spinner, android-adapter, picasso, android-linearlayout, transformation, android-drawable, android-architecture-navigation, android-imageview, android-custom-view, json-deserialization, android-view, android-actionbar, searchview, biztalk, android-coordinatorlayout, android-lifecycle, android-softkeyboard, floating-action-button, recyclerview-layout, swipe, android-relativelayout, android-xml, android-collapsingtoolbarlayout, android-button, android-scrollview, saxon, android-nestedscrollview, android-styles, xml-namespaces, xsl-fo, android-fragmentactivity, android-dialogfragment, android-viewholder, xml-serialization

centroid 7: android, java, javascript, ios, python
—–
android-gradle, bluetooth, rx-java2, build.gradle, dependencies, rx-java, google-play, sdk, android-ndk, corda, video-streaming, android-emulator, libgdx, android-webview, apk, location, java-native-interface, google-play-services, dagger-2, adb, codenameone, android-8.0-oreo, google-places-api, android-notifications, android-studio-3.0, broadcastreceiver, speech-recognition, arcore, sharedpreferences, streaming, gps, android-service, version, coordinates, androidx, native, sms, here-api, android-camera, android-permissions, uri, android-mediaplayer, locale, vert.x, exoplayer, google-maps-markers, settings, alarmmanager, spinner, proguard, okhttp3, text-to-speech, okhttp, updates, android-camera2, android-source, whatsapp, nfc, share, inputstream, google-fabric, xmpp, calculator, manifest, wifi, mpandroidchart, android-9.0-pie, rx-android, call, android-workmanager, mp4, hls, video-processing, release, barcode, android-support-library, alertdialog, android-viewmodel, dji-sdk, barcode-scanner, filepath, sip, google-cloud-messaging, gradle-plugin, android-arrayadapter, screen, payment, toolbar, google-play-console, dagger, mp3, indexoutofboundsexception, ejabberd, httpurlconnection, libraries, android-proguard, coroutine, h.264, simpledateformat, jacoco, background-process, rtsp, offline, root, sensor, splash-screen, android-bluetooth, android-testing, android-resources, android-tv, emulation, android-bitmap, android-listview, multipart, chromecast, android-broadcastreceiver, video-capture, google-maps-android-api-2, pojo, android-canvas, visibility, broadcast, google-play-games, dao, kotlin-android-extensions, avd, lint, android-jobscheduler, android-library, kotlinx.coroutines, firebase-mlkit, expandablelistview, obfuscation, android-contentprovider, appcelerator, mvp, live-streaming, in-app-billing, android-context, audio-streaming, arabic, android-alertdialog, kotlin-coroutines, zxing, android-videoview, fingerprint, braintree, audio-recording, deprecated, job-scheduling, android-wifi, wear-os, bottom-sheet, android-things, device, marker, right-to-left, google-login, mobile-application, media-player, countdowntimer, opengl-es-2.0, nullable, face-detection, exoplayer2.x, android-8.1-oreo, beacon, drawable, gradlew, mapbox-android, classnotfoundexception, parcelable, android-keystore, voice-recognition, toast, aar, google-places, android-theme, android-progressbar, paging, accelerometer, playback, gradle-kotlin-dsl, samsung-mobile, photo, ibeacon, android-appcompat, noclassdeffounderror, branch.io, rtmp, sceneform, foreground-service, google-cast, appcelerator-titanium, android-widget, logcat, android-pendingintent, android-fileprovider, android-gps, sha1, jodatime, android-sensors, android-appbarlayout, surfaceview, mpeg-dash, android-mvvm

google: google-maps-android-api-2, google-play-services, google-fabric, google-places, google-places-api, google-play-console, google-cast, google-login, google-maps-markers, google-play-games, google-play, google-cloud-messaging

centroid 28: angular, typescript, javascript, angular6, angular5
—–
angular, typescript, angular6, angular5, rxjs, angular-material, angular7, service, observable, routing, angular-cli, components, angular-reactive-forms, karma-jasmine, primeng, ag-gri

  • 1
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值