by Kartik Godawat

Understanding my browsing pattern using Pandas and Seaborn

By three methods we may learn wisdom. First, by reflection, which is noblest; Second, by imitation, which is easiest; and third by experience, which is the bitterest. — Confucius

For the purpose of tracking the time I spend on a browser, I use the Limitless extension on Chrome. While it gives me the time spent under categories, I thought it might be useful to inspect all my browsing data for the past year.

Here began my quest to understand everything there was in my browsing data.

In the process, I used Pandas and Seaborn. Pandas is a python library for data manipulation and analysis. Seaborn is built on top of matplotlib, which makes creating visualizations easier than ever.

Getting History Data

The first step in the process was to get all the browsing data for the past year. Google chrome stores the past 3 months of history on a device in SQLite format, but I ended up exporting my Google tracked data using Google TakeOut. The exported json has my browsing history across all devices, including mobile.

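As an aside, the local Chrome copy can also be read directly with sqlite3. Here is a minimal sketch, assuming the default Linux profile path and the commonly documented urls table (both vary by OS and Chrome version):

import os
import sqlite3
import pandas as pd

# Chrome locks this file while running, so copy it elsewhere or close the browser first
history_db = os.path.expanduser("~/.config/google-chrome/Default/History")

con = sqlite3.connect(history_db)
# The urls table holds url, title, visit_count and last_visit_time
# (WebKit timestamps: microseconds since 1601-01-01, not the Unix epoch)
local_history = pd.read_sql("SELECT url, title, visit_count, last_visit_time FROM urls", con)
con.close()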

The history stored by Chrome or tracked by Google does not give me the session information i.e. time spent on each tab. So my analysis is mainly focused on the number of visits and the time of visit rather than the session or the duration. A part of me is relieved actually, to know that Google is not tracking it yet.

Once the data was downloaded, I began by loading the data into a Pandas dataframe:

import json
import pandas as pd

with open("BrowserHistory.json") as f:
    data = json.loads(f.read())
    df = pd.DataFrame(data["Browser History"])

# A possible param if differentiation is needed b/w different clients
df.drop('client_id', axis=1, inplace=True)
df.drop('favicon_url', axis=1, inplace=True)
df.sample(1)

This is what the output looks like (a single sample row, with columns such as page_transition, title, url, and time_usec):

page_transition: Contains info on how the page was opened, like reload, type & enter, link open, etc. I was satisfied with filtering only on LINK and TYPED:

df = df[(df['page_transition'] == "LINK") | (df['page_transition'] == "TYPED")]
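
The same filter can also be written a bit more compactly with isin (equivalent behavior):

df = df[df['page_transition'].isin(["LINK", "TYPED"])]
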
Extracting/Extrapolating new columns (features):

To start off, I needed to convert the time (in microseconds) into a human-readable datetime format. Then I needed to derive features from it, like hour, day, month, or day_of_week. From the URL field, extracting the top-level domain could be a useful field for analysis. So I used tldextract to create a new domain column in the dataframe.

import datetime
import tldextract

def convert_time(x):
    return datetime.datetime.fromtimestamp(x/1000000)

days_arr = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
def get_day_of_week(x):
    return days_arr[x.weekday()]

def get_domain(x):
    domain = tldextract.extract(x)[1]
    sub_domain = tldextract.extract(x)[0]
    if sub_domain == "mail":
        return sub_domain + "." + domain
    # Ugly hack to differentiate b/w drive.google.com and google.com
    if domain == "google" and sub_domain == "www":
        return "google_search"
    return domain

# time_usec column is picked and for each row, convert_time(row) is called.
# The result is stored in the same dataframe under column dt
df['dt'] = df['time_usec'].apply(convert_time)
...
df['domain'] = df['url'].apply(get_domain)
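
The elided lines add the remaining columns used below (date, hour, month, day_of_week, is_weekend, is_secure). Here is a minimal sketch of how they might be derived, assuming the "Y"/"N" flags used in the plots later:

# Sketch of the elided feature columns (my assumption, based on the columns used later)
df['date'] = df['dt'].apply(lambda x: x.date())
df['hour'] = df['dt'].apply(lambda x: x.hour)
df['month'] = df['dt'].apply(lambda x: x.month)
df['day_of_week'] = df['dt'].apply(get_day_of_week)
df['is_weekend'] = df['day_of_week'].apply(lambda d: "Y" if d in ["Sat", "Sun"] else "N")
df['is_secure'] = df['url'].apply(lambda u: "Y" if u.startswith("https") else "N")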

Then I extrapolated the domain information to group well-known domains into one of the categories (buckets) defined by me:

def get_category(x):
    if x in ["coursera", "kadenze", "fast", "kaggle", "freecodecamp"]:
        return "Learning"
    elif x in ["ycombinator", "medium", "hackernoon"]:
        return "TechReads"
    ...
    else:
        return "Other"

# Cluster popular domains into a category
df['category'] = df['domain'].apply(get_category)
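
A quick way to sanity-check the bucketing (my addition, not shown in the original) is to look at the per-category counts:

df['category'].value_counts()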

After all these operations, the dataframe contains the following columns, and basic analysis can begin.

Available columns: title, date, hour, month, is_secure, is_weekend, day_of_week, domain, category

Exploring data and creating visualizations

Secure vs Insecure usage:

Once I have a dataframe with some numerical and categorical columns (month), creating a plot is super easy.

import random
import seaborn as sns

sns.countplot(x="month", hue="is_secure", data=df)

# Manual inspection, picking 50 random domains which were insecure
random.sample(list(df[df["is_secure"] == "N"].domain.unique()), 50)

# To view data for such domains
df[(df["domain"] == "mydomain") & (df["is_secure"] == "N")]["url"]

After looking at a couple of such visits, I ended up checking this site. It asks for Passport or Aadhar (India’s equivalent of SSN) number, along with email and mobile, while booking a jungle-safari, over HTTP. I failed to notice it earlier! Final booking is handled through a separate and secure gateway. However, I still would feel much safer typing my demographics and passport details over HTTPS.

Instead of manually exploring rows, one stricter solution could be to add all such domains to an extension like BlockSite. They could be enabled as and when needed.

Weekday vs Weekend browser usage:

# is_weekend="Y" for saturday and sunday, "N" otherwise
sns.countplot(x="hour", hue="is_weekend", data=df)

Browser usage over months:

To achieve this, I selected a subset of rows based on a month condition and then grouped everything by hour and date, to form a GitHub-style heatmap visualization.

from matplotlib import pyplot as plt

# Count visits after grouping by hour and date
df_new = df[(df["month"] >= 11)].groupby(["hour", "date"])["domain"].size()
df_new = df_new.reset_index(name="count")

plt.figure(figsize=(16, 5))

# Pivot the dataframe to create an [hour x date] matrix containing counts
sns.heatmap(df_new.pivot(index="hour", columns="date", values="count"), annot=False, cmap="PuBuGn")

The above code can easily be extended with more filter conditions, for example to separate productive from non-productive tab-open timings, or to view patterns over days:

cat_arr = ["Shopping", "TravelBookings", "YouTube", "Social"]

df_new = df[df["category"].isin(cat_arr)].groupby(["hour", "date"])["domain"].size()
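
The remaining plotting steps would be the same as in the previous block (reset_index, pivot, heatmap):

df_new = df_new.reset_index(name="count")
plt.figure(figsize=(16, 5))
sns.heatmap(df_new.pivot(index="hour", columns="date", values="count"), annot=False, cmap="PuBuGn")
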
Browser visits by day of week and hour:

I created another type of aggregated heatmap, this time visualizing visits with respect to the hour and the day of the week.

df_heat = df.groupby(["hour", "day_of_week"])["domain"].size().reset_index()
df_heat2 = df_heat.pivot(index="hour", columns="day_of_week", values="domain")
sns.heatmap(df_heat2[days_arr], cmap="YlGnBu")

One would expect light usage from 5 pm on Friday through Monday morning. But what was interesting for me to reflect on was the light-colored areas on Wednesday evenings.

Now to use the custom categories I manually bucketed the domains into: I generate the same heatmap again, but with a condition on popular shopping sites. Note that the list was created manually, based on my memory and random peeks into the unique domains I visited.

df_heat = df[df["category"] == "Shopping"].groupby(["hour", "day_of_week"])["category"].size().reset_index()
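
The pivot and heatmap calls would mirror the previous day-of-week block (assuming, as in that block, that the size() column keeps the name of the selected column, here "category"):

df_heat2 = df_heat.pivot(index="hour", columns="day_of_week", values="category")
sns.heatmap(df_heat2[days_arr], cmap="YlGnBu")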

It’s good to have the satisfaction that I usually do not go on a shopping spree during office hours. However, the chart encouraged me to manually explore Thursday (20:00–21:00) and Friday (15:00–16:00, 00:00–01:00). At a higher level, I was very confident that I never shop during office hours. However, the heat-map shows some instances of such visits, shattering my illusions.

Most revisited stackoverflow questions:

A good friend once told me:

Understanding stackoverflow usage helps you understand either your areas of improvements or configurations/syntax you ought to remember.

In any case, it’s good to have a cursory look at the most frequent visits for each month/quarter.

df_so = df[df["domain"] == "stackoverflow"].groupby(["url", "title"]).size()
df_so = df_so.reset_index(name='count').sort_values('count', ascending=False)[["title", 'count']]

df_so.head(15)

Maybe I should cache the page which shows me how to iterate over a Pandas dataframe!

Apart from stackoverflow, one of my most visited sites related to Pandas would be Chris Albon’s notes on python and data-wrangling.

In general, it is very interesting to observe how your most-visited pages change theme over the months. For example, they may move from simple questions to more complex, deeper ones as you build your understanding of something new.

Lastly, just for fun, I ended up concatenating titles of all my stack-overflow searches for the past year. I then generated a decent looking word-cloud out of it.

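The word-cloud code isn't included in the post; here is a minimal sketch using the wordcloud package (my assumption, not necessarily the exact approach used):

from wordcloud import WordCloud
from matplotlib import pyplot as plt

# Concatenate all stackoverflow page titles into a single string
text = " ".join(df[df["domain"] == "stackoverflow"]["title"].dropna())

# Generate and display the word cloud
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.figure(figsize=(16, 8))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()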

Thank you very much for your time. If you enjoyed reading, please give me some claps so more people see the article. Thank you! And, until next time, have a great day :)

A working notebook is present on GitHub with some more visualizations and some quick hacks around the data. Please do try it out with your own history dump and share interesting insights!

Translated from: https://www.freecodecamp.org/news/understanding-my-browsing-pattern-using-pandas-and-seaborn-162b97e33e51/
