Big data in finance 2024

In this assignment, we’ll learn the basics of analyzing and thinking about alternative data.

1. What you are being graded on: Sophistication of analysis and communication. A key challenge with data science in organizations is that most stakeholders are non-technical, so if you communicate in jargon, you will confuse people. Don’t show me a bunch of screenshots; give me a PowerPoint or writeup that makes sense and that you’d be proud to show anyone. With the students’ permission, I have posted some past examples. They were not the most sophisticated work, but they served nicely as writeups.

Some things to consider

o State the business problem, or the business implications of what you’re doing.

o Are you doing descriptive, explanatory or predictive analysis? State the economic model that you are testing in principle, if applicable.

o Automate, automate, automate. Try to do as little as possible by hand. The key here is to build your skills at coding and deploying. Describe how you are going to collect the data and so forth.

2. Choice of topic: I am spelling out some feasible options given the short course timeline. I have a “simple choice” that I think constitutes the bare minimum to pass this course, which is to propose an interesting data source, collect part of it, assemble into a database, and explain a reasonable economic application.

However, for those truly interested in data science, I have created a few potential topics that you can think about which I think would really help your career by giving you something that you can talk about in an interview.

Finally, you may propose your own. Please contact me ahead of time if you’re going to do a project other than what is listed below. Some students already have, and I am happy to approve; I just want to be sure it fits within the course objectives. The main thing is to build your CV and learning experience.

What to submit

1. Due date:  midnight.

a. Date chosen to avoid Friday night.

2. What to submit: A writeup and code. The writeup may be an .ipynb file with embedded output + writeup.

3. You must use code. Python, R, Julia are acceptable. NO VBA, Excel etc.

Quant trading redux

· You may propose a paper to replicate that was not on the list. It must have been posted on SSRN in the last 3 years. Let me know before you do, so I can approve the paper. The expectation is that you will do an extension, not merely a replication. That is to say, you will be expected to do more, and some aspect of it has to be “original”. For example, you could extend the analysis internationally; you can ask me for that data. I am not “requiring” alternative data in a replication. Beyond replication, it is sufficient to think through what exactly a measure captures and how to improve it based on theory.

If you are ambitious, you may propose a trading strategy to test. However, if you are not anchored in a replication, you have to have reasonable motivation and I will judge the basic theoretical premise. Here are some things I’d be curious about:

· Take the open source asset pricing library and calculate turn of the month effects. How does this relate to traditional theories of mispricing?

· Find crypto data and motivate some calendar anomalies, testing them.

· Feel free to go intraday and build a machine learning model, but please follow standard back-testing procedures.
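For the turn-of-the-month idea in the list above, a first pass can be as simple as flagging the days in the window and comparing average returns inside and outside it. The sketch below is only illustrative: the function names `tom_flags` and `tom_effect` are hypothetical, the window (last 1 and first 3 trading days of each month) is one common convention and an assumption here, and the returns are made up. It does not use the open source asset pricing library; you would plug your own return series in.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def tom_flags(trading_days, first_n=3, last_n=1):
    """Map each trading day to True if it falls in the turn-of-the-month window.

    The window is the last `last_n` (>= 1) and first `first_n` trading days of
    each calendar month. Both lengths are assumptions, not a fixed standard.
    Note: if the sample starts or ends mid-month, the edge months are mislabelled.
    """
    by_month = defaultdict(list)
    for d in sorted(trading_days):
        by_month[(d.year, d.month)].append(d)
    flags = {}
    for days_in_month in by_month.values():
        window = set(days_in_month[:first_n]) | set(days_in_month[-last_n:])
        for d in days_in_month:
            flags[d] = d in window
    return flags


def tom_effect(returns_by_day, first_n=3, last_n=1):
    """Return (mean return inside the window, mean return outside it)."""
    flags = tom_flags(returns_by_day, first_n, last_n)
    inside = [r for d, r in returns_by_day.items() if flags[d]]
    outside = [r for d, r in returns_by_day.items() if not flags[d]]
    return mean(inside), mean(outside)


# Toy usage: weekdays in Jan-Feb 2024 stand in for trading days (holidays
# ignored), with made-up returns. Real work would load actual returns.
days = [date(2024, 1, 1) + timedelta(n) for n in range(60)]
days = [d for d in days if d.weekday() < 5]
flags = tom_flags(days)
toy_returns = {d: (0.01 if flags[d] else 0.0) for d in days}
print(tom_effect(toy_returns))
```

A difference in means is only the descriptive step; a proper test would add standard errors and the usual back-testing safeguards. The same flagging approach carries over to other calendar effects, such as day-of-week patterns in crypto data.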

Discretionary research idea

For those of you who don’t want to code, I will judge this one very carefully on the basis of your logic and writing.

Go to the Bloomberg terminal, Eikon terminal, etc. and write me an investment thesis predicated on (i) forward-looking quant signals, (ii) alternative data. For each argument you make, you must cite an academic or industry whitepaper that quantitatively supports it. In other words, please do not hallucinate like a large language model. Please submit the paper.

YOU MUST SHOW ME SCREENSHOTS OF THE DATA YOU COLLECTED, DOCUMENT WHAT DATA YOU COLLECTED AND WHERE INCLUDING BLOOMBERG CODES

GPT + news headlines

Here’s a benchmark paper: https://arxiv.org/pdf/2304.07619.pdf

· You can use RavenPack, Tiingo, GDELT, Common Crawl etc., but you must be explicit about your data. As in HW2, you can discuss the advantages and disadvantages.

· You just need to use one or two large language models. You can also try a variety of prompts, given one reasonable model. If not too many people ask I can maybe help you.

Oldie but goodie

o Use OpenStreetMap to create a point-in-time nowcasting indicator of country-level economic growth. I have already downloaded the giant file from OSM for you (as of last year, when a group asked, anyway). Use the macroeconomic forecasting regression framework that we learned in class to see if there is a predictive relationship between various types of establishments and future economic aggregates. osm.planet_history

My plan is to post some queries on Confluence to make it easier to deal with the data, based on the code of some of the students last year.

o Instead of macro-aggregates, you can also use the same type of data to predict REIT prices, but you would have to collect these REITs yourself.
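The forecasting framework from class is not reproduced here. As a generic stand-in, the sketch below fits a one-step-ahead predictive regression, y[t+h] = a + b·x[t] + e, by ordinary least squares. The function name `predictive_ols` and the toy numbers are hypothetical; in the OSM setting, x might be a point-in-time establishment count and y a growth aggregate, but that mapping is an assumption.

```python
def predictive_ols(x, y, horizon=1):
    """OLS of y[t + horizon] on x[t]; returns (intercept, slope).

    x and y are equal-length sequences ordered in time. For the regression to
    be genuinely predictive, x[t] must contain only information that was
    available at time t (point-in-time data).
    """
    if horizon < 1 or len(x) != len(y) or len(x) <= horizon + 1:
        raise ValueError("need horizon >= 1 and enough aligned observations")
    xs = x[:-horizon]   # predictor observed at t
    ys = y[horizon:]    # outcome observed at t + horizon
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy usage with made-up numbers where y[t+1] = 2 + 0.5 * x[t] exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 2.5, 3.0, 3.5, 4.0]
print(predictive_ols(x, y))
```

The sketch returns only point estimates; judging predictability would also require standard errors and an out-of-sample evaluation.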

I want to know more about Hong Kong 

· Understanding trends in the financial adviser industry: Scrape data from webb-site. One of his datasets provides the entire registration history of SFC-licensed professionals, including the firms they work at and when. The point of this question is to ask how the industry has changed over the last 20 years.

o What is the underlying source of data?

o Describe your scraper, the database you created to store the data, including the fields.

o Answer some interesting questions in the data:

§ Are Chinese banks becoming more dominant?

§ Is Hong Kong’s finance industry becoming less internationally diverse, or more?

§ Do people move from job to job faster than they used to?

§ How did these numbers change as a result of COVID-19? Are there interesting comparisons between Chinese and foreign banks?

§ Document 3 more interesting observations that wouldn’t bore me! :-)
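As one way to approach the job-to-job question above, you could reduce the scraped registration history to employment spells and compare tenure across cohorts. The sketch below assumes a hypothetical record layout of (person_id, firm, start_date, end_date), which may not match the fields webb-site actually exposes, and uses made-up rows.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_tenure_days_by_start_year(spells):
    """Median length in days of completed spells, grouped by start year.

    spells: iterable of (person_id, firm, start_date, end_date_or_None).
    Ongoing spells (end is None) are dropped. This biases recent cohorts
    toward short tenures, so compare cohorts with care or use survival methods.
    """
    by_year = defaultdict(list)
    for _person_id, _firm, start, end in spells:
        if end is None:
            continue
        by_year[start.year].append((end - start).days)
    return {year: median(v) for year, v in sorted(by_year.items())}


# Toy usage with made-up spells.
toy_spells = [
    ("p1", "Firm A", date(2005, 1, 1), date(2005, 1, 11)),
    ("p2", "Firm B", date(2005, 3, 1), date(2005, 3, 31)),
    ("p1", "Firm C", date(2020, 6, 1), date(2020, 6, 6)),
    ("p3", "Firm A", date(2020, 7, 1), None),  # still employed, dropped
]
print(median_tenure_days_by_start_year(toy_spells))
```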

Some other projects with looser guidelines 

· Can ChatGPT or Claude Opus reliably identify and interpret stock charts?

· Build a tickerization model that maps, given a sentence, any entities in the sentence and then maps them to tickers. To break this down, this is (i) an entity recognition problem, and (ii) a mapping problem.

· Unfortunately, credit scoring data is hard to get. Use the Dewey SafeGraph dataset. Define a default as the closure of an establishment, meaning that an establishment or parent company existed pre-COVID-19 but did not by 2022 or 2023 (whenever I downloaded the data). Given data ending in 2022 for your feature set and 2023 outcomes, build me a default prediction model using the features in the Dewey data.

· Perplexity fact checking problem: https://aclanthology.org/2021.naacl-main.158.pdf

Maybe what you can do is merge two datasets and have a large language model check whether the match is accurate?

· Scrape the OpenRice dataset and tell me truly: what is the best restaurant in Hong Kong? No, just kidding. Download the reviews for 100 restaurants in Hong Kong and use an NLP model to classify the complaints.

· What is the economic impact of celebrity events? Singapore recently paid a lot of money to bring Taylor Swift to Singapore, which until Crazy Rich Asians was not considered too cool. But since it was the sole location in Asia for the recent Taylor Swift concerts, this may change. As one commentator noted, “Singapore felt so exciting then, leading some commentators to suggest that Singapore needed Ms Swift to inject some fun into an otherwise staid country.”

What is the economic impact of having a celebrity such as Taylor Swift? Using the SafeGraph dataset, please conduct a differences-in-differences study comparing hyperlocal areas where Taylor Swift had a concert to areas where she did not, in terms of either prolonged foot traffic or spending. You must download her concert locations, geo-locate them, and identify plausible counterfactuals. This is a completely useless question in business, but it’s good for a policymaker to know, and also useful if you want to get a PhD or attend the academic Swiftposium next year. It is also a good data science exercise and probably a hilarious thing to talk about in an interview.
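The differences-in-differences comparison described above reduces, in its simplest 2x2 form, to the change in the treated areas minus the change in the control areas. The sketch below shows only that arithmetic on made-up foot-traffic numbers; the function name `did_2x2` and the row layout are hypothetical, and nothing here reads the SafeGraph data.

```python
from statistics import mean


def did_2x2(rows):
    """Two-group, two-period difference-in-differences estimate.

    rows: iterable of (treated, post, outcome), where `treated` marks areas
    near a concert venue and `post` marks observations after the concert.
    Returns (treated post - treated pre) - (control post - control pre).
    """
    cells = {}
    for treated, post, outcome in rows:
        cells.setdefault((bool(treated), bool(post)), []).append(outcome)
    m = {key: mean(values) for key, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


# Toy usage with made-up weekly foot-traffic counts.
toy_rows = [
    (True, False, 100.0), (True, False, 100.0),   # concert areas, before
    (True, True, 130.0), (True, True, 130.0),     # concert areas, after
    (False, False, 90.0), (False, True, 100.0),   # comparison areas
]
print(did_2x2(toy_rows))
```

The estimate is only as credible as the parallel-trends assumption behind it, which is why the choice of counterfactual areas matters. A fuller version would run this as a regression with area and time fixed effects and clustered standard errors.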

If nothing else 

Find a publicly listed data source.

· What is the economic model? What is the Y, and what is the X? What is the logic of this regression, and why is it valuable?

· What measure do you create from the data?

· What resources are there to substantiate this interpretation of the data?

· Collect a sample of the data

· Describe the SQL schema you would create

· Show me some summary statistics. Ideally you would tell me something interesting about the dataset and about data cleaning. In the ideal case you would demonstrate the application, but you can stop at whatever point is feasible for you.
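For the schema and summary-statistics steps in the list above, a minimal sketch using Python’s built-in `sqlite3` module might look like the following. The table and column names (`data_source`, `observation`, `collected_at` and so on) are hypothetical placeholders, and the rows are made up; adapt them to whatever source you pick.

```python
import sqlite3

# Hypothetical two-table schema: one row per source, one row per observation.
# `collected_at` records when the value was scraped, which keeps the data
# point-in-time if the source later revises its numbers.
SCHEMA = """
CREATE TABLE data_source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    url       TEXT
);
CREATE TABLE observation (
    obs_id       INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES data_source(source_id),
    entity       TEXT NOT NULL,
    obs_date     TEXT NOT NULL,   -- ISO date, e.g. 2024-01-01
    value        REAL,            -- NULL when the source had no value
    collected_at TEXT NOT NULL,
    UNIQUE (source_id, entity, obs_date, collected_at)
);
"""


def build_db(path=":memory:"):
    """Create a database with the schema above and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def summary(conn):
    """Per-entity row count, non-missing count, mean value, and date range."""
    return conn.execute(
        "SELECT entity, COUNT(*), COUNT(value), AVG(value), "
        "MIN(obs_date), MAX(obs_date) "
        "FROM observation GROUP BY entity ORDER BY entity"
    ).fetchall()


# Toy usage with made-up rows.
conn = build_db()
conn.execute("INSERT INTO data_source (source_id, name) VALUES (1, 'example source')")
conn.executemany(
    "INSERT INTO observation (source_id, entity, obs_date, value, collected_at) "
    "VALUES (1, ?, ?, ?, '2024-03-01')",
    [("shop_A", "2024-01-01", 10.0),
     ("shop_A", "2024-01-02", 20.0),
     ("shop_B", "2024-01-01", None)],
)
for row in summary(conn):
    print(row)
```

Reporting the non-missing count next to the row count is a cheap way to surface data-cleaning issues in the writeup.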
