《CMU15-445/645》fall2018 HOMEWORK #1 - SQL

最新推荐文章于 2023-06-29 22:55:29 发布

千裡

最新推荐文章于 2023-06-29 22:55:29 发布

阅读量814

点赞数 1

分类专栏：数据库

本文链接：https://blog.csdn.net/weixin_44520881/article/details/111235025

版权

数据库专栏收录该内容

3 篇文章 0 订阅

订阅专栏

课程介绍

cmu15-445是一门关于数据库的课程，看到该课程的homeworks和project就觉得十分有挑战性。目前该课程已经更新到2020fall，为了方便学习，同时之前版本也在官网上公布了homeworks的参考答案，参考资料相对较多，因此选择2018秋季学期课程来学习。课程首页，youtube课程视频
homework1对应的作业文档为HOMEWORK #1 - SQL

HOMEWORK #1 - SQL

配置sqlite以及下载数据库省略，作业给出的数据库关系图如下。
在这里插入图片描述
如下题目在官网有参考答案，我还是记录一下我的解决过程。参考答案

Q1 [0 POINTS] (Q1_SAMPLE)

Count the number of cities. The purpose of this query is to make sure that the formatting of your output matches exactly the formatting of our auto- grading script.
Details: Print the number of cities (eliminating duplicates).

查询名字唯一的城市的数量

select count(distinct city) 
from station;

Q2 [5 POINTS] (Q2_WARMUP)

Count the number of stations in each city.
Details: Print city name and number of stations. Sort by number of stations (increasing), and break ties by city name (increasing).

需要查询每个城市的火车站的数量，并且先按照每个城市的火车站的数量排序，再按照城市名排序（都是升序）。此题并不难。

select city, count(station_id) as cnt 
from station 
group by city 
order by cnt asc, city asc;

Q3 [10 POINTS] (Q3_POPULAR_CITY)

Find the percentage of trips in each city. A trip belongs to a city as long as its start station or end station is in the city. For example, if a trip started from station A in city P and ended in station B in city Q, then the trip belongs to both city P and city Q. If P equals to Q, the trip is only counted once.
Details: Print city name and ratio between the number of trips that belong to that city against the total number of trips (a decimal between 0-1, round to four decimal places using ROUND()). Sort by ratio (decreasing), and break ties by city name (increasing).

此题需要查询每个城市名，以及其所占trip总数的比例（当其作为trip的起点或终点时候，起点终点不重复计数）。
按照如下步骤，将sql语句分解。

select city, (A) as ratio
from (B) as city_trip_cnt, (C) as trip_cnt
order by ratio DESC, city ASC;

A: 
round(city_trip_cnt.cnt * 1.0 / trip_cnt.cnt, 4)

B:
select city, count(distinct(id)) as cnt
from trip, station
where station_id  = start_station_id or station_id = end_station_id
group by city

C:
select count(*) as cnt from trip;

最终结果为：

select city, (ROUND(city_trip_cnt.cnt * 1.0 / all_trip_cnt.cnt, 4)) as ratio
from (	select city, count(distinct(id)) as cnt
		from trip, station
		where station_id = start_station_id or station_id = end_station_id
		group by city) as city_trip_cnt, (select count(*) as cnt from trip) as all_trip_cnt
order by ratio desc, city asc;

Q4 [15 POINTS] (Q4_MOST_POPULAR_STATION)

For each city, find the most popular station in that city. “Popular” means that the station has the highest count of visits. As above, either starting a trip or finishing a trip at a station, the trip is counted as one “visit” to that station. The trip is only counted once if the start station and the end station are the same.
Details: For each station, print city name, most popular station name and its visit count. Sort by city name, ascending.

此题有一定难度，需要找到每个城市最收欢迎的车站，输出访问次数。官方给出的参考答案如下：

with visit(station_id, station_name, city, cnt) as (
	select station_id, station_name, city, count(id) as cnt 
	from trip, station 
	where station_id = start_station_id or station_id = end_station_id 
	group by station_id)

select visit.city, visit.station_name, visit.cnt 
from visit 
where visit.cnt = (	select max(cnt) 
					from visit as max_visit 
					where max_visit.city = visit.city) 
order by city;

我按照参考答案的思路不用with as语法实现了一遍，发现with as相比来说清晰了很多。代码如下

select visit.city, visit.station_name, visit.cnt
from (A) as visit
where visit.cnt = (	select max(max_visit.cnt)
					from (A) as max_visit
					where max_visit.city = visit.city)
order by city asc;

A: 
select city, station_id, station_name, count(id) as cnt
from trip, station
where station_id = start_station_id or station_id = end_station_id
group by station_id;

最终代码为

select visit.city, visit.station_name, visit.cnt
from (	select city, station_id, station_name, count(id) as cnt
		from trip, station
		where station_id = start_station_id or station_id = end_station_id
		group by station_id) as visit
where visit.cnt = (	select max(max_visit.cnt)
					from (	select city, station_id, station_name, count(id) as cnt
							from trip, station
							where station_id = start_station_id or station_id = end_station_id
							group by station_id) as max_visit
					where max_visit.city = visit.city)
order by city asc;

可以看到使用with as语法降低了代码的复杂度。关于参考代码给出的group by而在select 中又出现了不是group by中的字段，感觉这样写不是很合适。（这里可能是因为group by的是pk 所以不会出现问题吧?）。

Q5 [15 POINTS] (Q5_DAYS_MOST_BIKE_UTILIZATION):

Find the top 10 days that have the highest average bike utilization. For simplicity, we only consider trips that use bikes with id <= 100. The average bike utilization on date D is calculated as the sum of the durations of all the trips that happened on date D divided by the total number of bikes with id <= 100, which is a constant. If a trip overlaps with date D, but starts before date D or ends after date D, then only the interval that overlaps with date D (from 0:00 to 24:00) will be counted when calculating the average bike utilization of date D. And we only calculate the average bike utilization for the date that has been either a start or an end date of a trip. You can assume that no trip has negative time (i.e., for all trips, start time <= end time).
Details: For the dates with the top 10 average duration, print the date and the average bike duration on that date (in seconds, round to four decimal places using the ROUND() function). Sort by the average duration, decreasing. Please refer to the updated note before Q1 when calculating the duration of a trip.
Hint: All timestamps are stored as text after loaded from csv in sqlite. You can use datetime(timestamp string) to get the timestamp out of the string and date(timestamp string) to get the date out of the string. You may also find the funtion strftime() helpful in computing the duration between two timestamps.

日期函数参考SQLite 日期 & 时间
with dates as中的unoin对date()取得的日期进行去重。最终的tdate即有trip的每一天。有一个容易误解的点是，最后的select中排序输出的天数只包含那些出现在start_time和end_time中的天数。

with dates as (
	select date(start_time) as tdate 
	from trip 
	union
	select date(end_time) as tdate
	from trip)
select tdate, round(sum(strftime('%s', min(datetime(end_time), datetime(tdate, '+1 day'))) - strftime('%s', max(datetime(start_time), datetime(tdate)))) * 1.0 / (select count(distinct(bike_id)) from trip where bike_id <= 100), 4) as avg_duration 
from trip, dates 
where bike_id <= 100 and datetime(start_time) < datetime(tdate, '+1 day') and datetime(end_time) > datetime(tdate) 
group by tdate 
order by avg_duration desc 
limit 10;

通过该问题，了解了日期的函数究竟是怎么一回事。date：年月日，time：时分秒，datetime：年月日+时分秒。

Q6 [10 POINTS] (Q6_OVERLAPPING_TRIPS)

One of the possible data-entry errors is to record a bike as being used in two different trips, at the same time. Thus, we want to spot pairs of overlapping intervals (start time, end time). To keep the output manageable, we ask you to do this check for bikes with id between 100 and 200 (both inclusive). Note: Assume that no trip has negative time, i.e., for all trips, start time <= end time.
Details: For each conflict (a pair of conflict trips), print the bike id, former trip id, former start time, former end time, latter trip id, latter start time, latter end time. Sort by bike id (increasing), break ties with former trip id (increasing) and then latter trip id (increasing).
Hint: (1) Report each conflict pair only once, so that former trip id < latter trip id. (2) We give you the (otherwise tricky) condition for conflicts: start1 < end2 AND end1 > start2

这题需要严格按照hint来写，并且非常简单。

select f_trip.bike_id, f_trip.id, f_trip.start_time, f_trip.end_time, l_trip.id, l_trip.start_time, l_trip.end_time
from trip as f_trip, trip as l_trip
where f_trip.bike_id >= 100 and f_trip.bike_id <= 200 and f_trip.bike_id = l_trip.bike_id and f_trip.id < l_trip.id
	and f_trip.start_time < l_trip.end_time and f_trip.end_time > l_trip.start_time
order by f_trip.bike_id ASC, f_trip.id ASC, l_trip.id;

Q7 [10 POINTS] (Q7_MULTI_CITY_BIKES)

Find all the bikes that have been to more than one city. A bike has been to a city as long as the start station or end station in one of its trips is in that city.
Details: For each bike that has been to more than one city, print the bike id and the number of cities it has been to. Sort by the number of cities (decreasing), then bike id (increasing).

根据题意，直接将trip和station连接然后用count(distinct city)去重即可。

select bike_id, count(distinct city) as cnt
from trip, station
where start_station_id = station_id or end_station_id = station_id
group by bike_id
having cnt > 1
order by cnt desc, bike_id ASC;

Q8 [10 POINTS] (Q8_BIKE_POPULARITY_BY_WEATHER)

Find what is the average number of trips made per day on each type of weather day. The type of weather on a day is specified by weather.events, such as ‘Rain’, ‘Fog’ and so on. For simplicity, we consider all days that does not have a weather event (weather.events = ‘\N’) as a single type of weather. Here a trip belongs to a date only if its start time is on that date. We use the weather at the starting position of that trip as its weather type as well. There are also ‘Rain’ and ‘rain’ in weather.events. For simplicity, we consider them as different types of weathers. When counting the total number of days for a weather, we consider a weather happened on a date as long as it happened in at least one region on that date.
Details: Print the name of the weather and the average number of trips made per day on that type of weather (round to four decimal places using ROUND()). Sort by the average number of trips (decreasing), then weather name (increasing).

这题读题目读的很迷惑，根据代码倒是很简单，直接求所有天气，该天气共有多少天，以及该天气中有多少天是出行(start_time)天，求个比例即可。

with event_cnt (events, cnt) as (
	select events, count(distinct date) as cnt 
	from weather 
	group by events
)
select w.events, round(1.0 * count(distinct t.id) / event_cnt.cnt, 4) as avg_num
from trip as t, station as s, weather as w, event_cnt
where t.start_station_id = s.station_id and date(t.start_time) = w.date and s.zip_code = w.zip_code and w.events = event_cnt.events
group by w.events
order by avg_num desc, w.events asc

Q9 [10 POINTS] (Q9_TEMPERATURE_SHORTER_TRIPS)

A short trip is a trip whose duration is <= 60 seconds. Compute the average temperature that a short trip starts versus the average temperature that a non-short trip starts. We use weather.mean_temp on the date of the start time as the Temperature measurement.
Details: Print the average temperature that a short trip starts and the average temperature that a non-short trip starts. (on the same row, and both round to four decimal places using ROUND()) Please refer to the updated note before Q1 when calculating the duration of a trip.

这题有个坑点就是连接trip、station、weater表的时候要使用date 和 zip_code 和station这几个字段来连接。其他没什么难点了。
查询short_trip

select round(1.0 * sum(mean_temp) / count(*), 4) as temp
from trip, station, weather
where strftime("%s", end_time) - strftime("%s", start_time) <= 60 and start_station_id = station_id 
	and station.zip_code = weather.zip_code and date(start_time) = date

查询long_trip

select round(1.0 * sum(mean_temp) / count(*), 4) as temp
from trip, station, weather
where strftime("%s", end_time) - strftime("%s", start_time) > 60 and start_station_id = station_id 
	and station.zip_code = weather.zip_code and date(start_time) = date

合并上述两个结果，最终结果如下：

select short_trip.temp, long_trip.temp
from (
	select round(1.0 * sum(mean_temp) / count(*), 4) as temp
	from trip, station, weather
	where strftime("%s", end_time) - strftime("%s", start_time) <= 60 and start_station_id = station_id 
			and station.zip_code = weather.zip_code and date(start_time) = date
) as short_trip, (
	select round(1.0 * sum(mean_temp) / count(*), 4) as temp
	from trip, station, weather
	where strftime("%s", end_time) - strftime("%s", start_time) > 60 and start_station_id = station_id 
		and station.zip_code = weather.zip_code and date(start_time) = date
) as long_trip;

Q10 [15 POINTS] (Q10_RIDING_IN_STORM)

For each zip code that has experienced ‘Rain-Thunderstorm’ weather, find the station that has the most number of trips in that zip code under the storm weather. For simplicity, we only consider the start time of a trip when deciding the station and the weather for that trip.
Details: Print the zip code that has experienced the ‘Rain-Thunderstorm’ weather, the name of the station that has the most number of trips under the strom weather in that zip code, and the total number of trips that station has under the storm weather. Sort by the zip code (increasing). You do not need to print the zip code that has experienced ‘Rain-Thunderstorm’ weather but no trip happens on any storm day in that zip code.

这题有点难，由于要找最大的，同时要考虑并列的，因此where中也需要重复查询storm_count，因此抽出来一个storm_count。

select zip_code, station_name, cnt
from (A)
where cnt = (B)
order by zip_code asc;

A:
with storm_count as (
	select s.zip_code, s.station_id, s.station_name, count(*) as cnt
	from weather as w, station as s, trip as t
	where date(t.start_time) = w.date and t.start_station_id = s.station_id and s.zip_code = w.zip_code
	group by s.zip_code, s.station_id
)

B:
select max(max_s_c.cnt)
from (A) as max_s_c
where storm_count.zip_code = max_s_c.zip_code

整合以后的最终结果为

with storm_count as (
	select s.zip_code, s.station_id, s.station_name, count(*) as cnt
	from weather as w, station as s, trip as t
	where date(t.start_time) = w.date and t.start_station_id = s.station_id and s.zip_code = w.zip_code
			and events = 'Rain-Thunderstorm'
	group by s.zip_code, s.station_id
)
select zip_code, station_name, cnt
from storm_count
where cnt = (	select max(max_s_c.cnt)
				from storm_count as max_s_c
				where storm_count.zip_code = max_s_c.zip_code)
order by zip_code asc;