Modernizing data architecture with data products in asset management

Sake: Hi, my name is Sake. I am Co-Founder and CEO at Nex. I'm joined here by Chitra and Darryl.

A quick bit on Nex: we are exhibiting here in the Marketplace section. We bring automation to data engineering, which is the integration, preparation, and monitoring of data, and we help produce data products.

I'll let Chitra introduce herself and tell us a little about the role of data at her company.

Chitra: Hello everyone. I'm Chitra Hota. I head data engineering and architecture at Oaktree. My team builds data engineering pipelines and frameworks, data lakes, and data warehouses to enable data analytics and reporting at scale across Oaktree.

Darryl: Hi everybody. I'm Darryl Cherry, Chief Architect at Clearwater Analytics. For us, data is really the lifeblood of our business. We're pulling hundreds of gigs of data in from thousands of data sources every day, processing that data, and doing lots of different analytics, providing products for our customers in terms of investment accounting, analytics, performance, regulatory compliance, and other things. So it's very fundamental to what we do, day in and day out, for our customers.

Sake: Yeah, thank you for sharing that. So you're both in financial services, in the asset management sector. What is the most challenging part of data in this sector?

Chitra: For the audience here, the most challenging part of data for asset management is unstructured data. We have a lot of unstructured data that needs to be married with the traditional data, and that unstructured data can come in PDFs, text files, Excel, you name it. So that is currently the biggest challenge.

Sake: And Darryl, in your case you're working with a broad variety of data across different asset classes. Can you share with us a little about the variety of data you see, and the challenges that come with it?

Darryl: Yeah, for us a lot of the challenge comes from the diverse, heterogeneous nature of the data we get. Like I mentioned, hundreds of gigs from thousands of sources, and this is data for lots of different securities and asset types: stocks, bonds, mortgage loans, limited partnerships, all kinds of data. When that data arrives, we have to do lots of normalization to get it into a single format. We enrich the data with our own data from different security vendors, for security data and pricing data, and then we put it in a common format so we can do all the reporting on it for our clients. That nature of diverse data, in lots of different formats and needing lots of cleaning and normalization, becomes a big challenge for us, and we're always trying to find better ways to handle it. You have a spectrum of how you can handle that data, from doing things very manually to doing things fully automated, and we drive to get full automation into our data analysis and ingestion.
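The ingest flow Darryl describes, normalizing heterogeneous records into one schema and then enriching them with security reference data, can be sketched as follows. This is a minimal illustration, not Clearwater's code; the field names and the in-memory reference table are assumptions.

```python
# Stand-in for the security reference data a vendor would supply.
REFERENCE = {"AAPL": {"asset_class": "equity", "currency": "USD"}}

def normalize(record):
    # Different sources name the identifier differently; map to one schema.
    ticker = record.get("ticker") or record.get("symbol") or record.get("id")
    return {
        "ticker": ticker.upper(),
        "quantity": float(record.get("qty", record.get("quantity", 0))),
    }

def enrich(record):
    # Attach reference attributes; unknown tickers are flagged rather than dropped.
    record.update(REFERENCE.get(record["ticker"],
                                {"asset_class": "unknown", "currency": None}))
    return record

def ingest(raw_records):
    """Normalize then enrich every incoming record."""
    return [enrich(normalize(r)) for r in raw_records]

print(ingest([{"symbol": "aapl", "qty": "10"}]))
```

The key design point is that normalization happens before enrichment, so the reference lookup only ever has to deal with one identifier format.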

Sake: Ok. In this session we're also talking about data architecture. When you think about the architecture you have put together and are evolving, how do you think of it in terms of data products? What is the architecture like, and what role do data products play in it?

Chitra: The architecture has to be simple and flexible, and it should be able to process both structured and unstructured data. Data products are a very critical part of it, because they enable easy transformation of data, they have a host of adapters available, and you can connect to multiple data sources. When you're starting to aggregate data for your end users, they help you shorten the time to market, and you need fewer data engineers when you're rolling out a data platform. So they're very critical to what we're doing.

Darryl: Yeah, for us, from a data product perspective, what's really critical is the lineage of the data and the transparency of it. We get data in and process it through our system; if something is amiss later, like we got data that was invalid or something needs to be corrected, we have to be able to trace that data back and understand where it came from, what we need to change, and which sources we need to contact to make sure we're getting the proper data in. That's a real key element for Clearwater in terms of data products.
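Record-level lineage of the kind Darryl describes can be sketched minimally as a log where every derived record keeps a pointer to its parent, so a bad value can be walked back to its originating source. The class and field names here are illustrative, not any vendor's actual API.

```python
import uuid

class LineageLog:
    """Minimal record-level lineage: each derived record points at its parent,
    so any value can be traced back upstream to its original source."""

    def __init__(self):
        self.entries = {}

    def record(self, value, source, parent_id=None):
        rid = str(uuid.uuid4())
        self.entries[rid] = {"value": value, "source": source, "parent": parent_id}
        return rid

    def trace(self, rid):
        """Walk parent pointers back to the originating source."""
        chain = []
        while rid is not None:
            entry = self.entries[rid]
            chain.append(entry["source"])
            rid = entry["parent"]
        return chain

log = LineageLog()
raw_id = log.record("px=101,2", "vendor_sftp_feed")          # raw arrival
norm_id = log.record(101.2, "normalizer", parent_id=raw_id)  # derived value
print(log.trace(norm_id))  # most recent step first, original source last
```

When a downstream value looks wrong, `trace` answers both questions from the panel: where the data came from, and which source to contact for a correction.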

Sake: Ok, yeah. So data products become an entity that makes it easier to work with data, allows the users of data to be more agile, and creates collaboration around it. And that works great in the structured world, with structured data and so on, right?

Both of you are doing interesting work on the generative AI side. So talk to us a little about data in that landscape. How do you work with data when working with large language models? And what do you see there from a data architecture perspective?

Chitra: Sure. We deal with voice as well as unstructured data in the form of PDFs. For voice, we obviously have to transcribe it, and then use that transcription as an input to our data models. For the unstructured PDF data there are two parts. Some PDFs will have structure in them, like a table format; we parse that out and use it along with the traditional data. What's very interesting now is that we are using the code interpreter to upload some of these PDFs, start interrogating them, and then take the results out and keep them as summaries, to use again as inputs for a lot of the insights we have to create, like summarization of fact sheets. That is an integral part of the architecture now, and will be for a great period of time to come.
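The table-parsing step Chitra mentions can be sketched as follows. A table extracted from a PDF (e.g. by a library like pdfplumber or by Textract) typically arrives as a list of rows with stray empty cells and numeric strings, and has to be normalized before it can be joined with traditional data. The function and field names below are illustrative, not Oaktree's actual code.

```python
def normalize_pdf_table(rows):
    """Turn raw extracted PDF table rows into clean dict records.
    The first row is assumed to be the header; blank cells become None."""
    header = [(h or "").strip().lower().replace(" ", "_") for h in rows[0]]
    records = []
    for row in rows[1:]:
        rec = {}
        for key, cell in zip(header, row):
            cell = (cell or "").strip()
            try:
                # Convert numeric strings like "1,234.50" to floats.
                rec[key] = float(cell.replace(",", ""))
            except ValueError:
                rec[key] = cell or None
        records.append(rec)
    return records

raw = [["Fund Name", "NAV", "Return %"],
       ["Fund A", "1,234.50", "7.2"],
       ["Fund B", None, "3.1"]]
print(normalize_pdf_table(raw))
```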

Sake: Ok. So, with large language models, generative AI, and unstructured data?

Darryl: At Clearwater, we have a number of different projects that we're doing around gen AI, some internal and some external. What we found is that LLMs do a fantastic job at understanding unstructured data. We've had a number of different data sources and knowledge bases that we've been able to consolidate, vectorize, and then run semantic search over, which lets our internal CS and operations teams use LLMs to ask questions and pull data out, things they couldn't do before. Normally, someone would have to understand all the documentation and all the data. We're effectively taking new users of our system, new customer-support and ops people, and turning them into power users immediately: they can get answers to their questions at their fingertips, and we can extract all the data we want from our different knowledge sources. The other thing we've really explored, and had a lot of success with, is the ability to take documents like PDFs, have them uploaded, and have the LLM analyze them. We can extract lots of data from these documents, pulling out investment policy, compliance numbers, all kinds of information that helps us understand what our customers are doing and how they need to structure their different assets and compliance rules.
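The "vectorize then semantic search" pattern Darryl describes can be sketched in miniature. In a real system the embeddings would come from a model (e.g. via Bedrock or OpenAI) and the vectors would live in a vector store; here a toy bag-of-words embedding stands in for both, and the knowledge-base entries are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query, docs, top_k=1):
    """Rank documents by similarity to the query, as a vector store would."""
    q = embed(query)
    scored = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:top_k]

kb = ["how to reconcile a cash position report",
      "steps to onboard a new custodian feed",
      "regulatory compliance reporting schedule"]
print(semantic_search("where is my cash position report", kb))
```

The retrieved passages are what would then be handed to the LLM as context for answering the user's question.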

Sake: Ok. So asset management is interesting in that there's a lot of structured data when we think about portfolios: stock market tickers, prices. But there's also a lot of unstructured data: filings, for example, investor letters, all these reports that come out. What are the practical use cases? How are investors, or your end customers, benefiting from this? Can you share some interesting applications that have come out of this innovation?

Chitra: We have a couple of applications that we've built in house. One of them is called Earnings Call Transcriber: we record earnings calls, the ones we're allowed to record, transcribe them, then take that text, summarize it, and send it to the analysts. So this is really automation. The output can also be used for sentiment analysis and named-entity recognition, and joined with data that has traditionally been mined from holdings, transactions, and issuers. That's one very good example of how we use unstructured data.
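The sentiment-analysis step Chitra mentions would in practice be an LLM or a trained model; as a hedged sketch of the idea, a crude lexicon-based scorer over a call transcript looks like this. The word lists are invented for illustration.

```python
POSITIVE = {"growth", "beat", "strong", "record", "upside"}
NEGATIVE = {"miss", "decline", "weak", "impairment", "downgrade"}

def sentiment_score(transcript):
    """Crude lexicon-based sentiment over an earnings-call transcript:
    +1 per positive term, -1 per negative, normalized by total hits.
    Returns a value in [-1.0, 1.0]; 0.0 when no sentiment terms appear."""
    tokens = transcript.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

print(sentiment_score("record growth this quarter despite one miss"))
```

The resulting score is the kind of derived signal that can then be joined against holdings, transactions, and issuer data, as described above.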

The second would be the quarterly reports, loan notices, and bankruptcy notices, which all arrive as PDFs. When you're making an investment decision, you need all these alternate forms of data available to you, so your insights are more risk-adjusted, and you're deciding whether or not to invest based on all of these alternate data sources in addition to the traditional ones you already have in house.

Sake: Ok. Darryl, what are some of the cool use cases you are seeing? How is the end user benefiting, both from the LLM-based unstructured-data approaches and from the rest of the data architecture?

Darryl: We have a couple of different use cases that have really highlighted value for our customers. One of them is on our website: as customers are looking at our product and analyzing their reports, they can ask questions about the reports and about their data, and we can give them links. The unstructured data we're storing has all the information about where the different reports live in the product, and we can use tooling on top of that to give them links. So they can ask a question, say, "What are my cash position reports?", and we can instantaneously provide links that take them to that data. That was a really huge win for our customers.

The other one is around our LPx product. We're beta testing some products around looking at different fund prospectuses and other documents we have, where customers can ask questions. When they see data changing in their reports and they're not sure why, they can now do some investigation and quickly use the LLM to ask: why did this IRR change? Why did it go up so high? They can get information quickly from the large language model that has all that information behind it: the fund information, investment policy statements, and other documents they're using.

Sake: Those are some very cool applications. One of the things I want to ask: as you're heading into more automation, how does the role of investors change with the benefit of this generative AI? You can ask so many questions, you can analyze so much data. Do investors become capable of producing more returns? How do you see that, from the perspective of technology driving the business?

Chitra: First of all, they're getting more savvy, and not just the investors, but also the marketing teams, the legal teams, and the product specialist teams. They used to spend time writing out commentary for funds, for example, or writing a performance report; now they will have this readily available, produced in less time, because the summarization is already there, like the pilot features that Darryl is building with his team. It's going to be table stakes going forward. We just have to understand that we're using a private version of an LLM, and the documents we feed it become our own repository; we vectorize them and keep them for as long as we need them.

Darryl: Sorry, I didn't hear your question. Could you repeat it?

Sake: Sure, to repeat my question: with all this technology you're building, do you see the role of investors, how they act and how they make decisions, changing over time? What's your thought process on that? How will investing change with the technology?

Darryl: That's a great question. I think a lot of it is that information is now going to be at people's fingertips. Things that were very complex before, that required deep domain knowledge and lots of access to information, are now something someone can get from a chat bot or a copilot. It takes someone who is, like I mentioned, an early user, someone new in the system, and makes them more of an expert in what they're doing. They can get lots of information at their fingertips: what's going on in the market, what's going on with their assets, how their assets are changing, what the impacts of market events like interest rate hikes are. Instead of having to watch what goes on and learn gradually, they can know these things much more quickly, take action on them, and even get suggestions on actions they could take.

Sake: I have a technology engineering question. Both of you are experts in technology. What specific AWS technologies are you leveraging, for example? And do you have any advice for people who are building to scale on the cloud?

Chitra: Sure. We're big believers in using cloud-native AWS services. We're using Amazon Transcribe, but not just Transcribe itself: we've built metadata and configurations around it, done custom coding, and scaled it so that when we do 100 earnings calls, for example, it actually scales to that. The other thing we've done is with Textract: we use the service as AWS offers it, and when we need version changes or feature requests, AWS has actually worked with us to get those done. We use that for the document-parsing parts of the product. In addition, we're obviously using Glue with Python and Spark ETL, Glue orchestration, and EventBridge; a lot of these services are native to AWS, and Lambda is the mainstay of our scaling. That's how we scale most of our platforms.

Darryl: We're using pretty much the gamut of Amazon technologies. We're heavy in EKS, so we're a Kubernetes shop for our compute, with lots of scale there; obviously S3 from a storage perspective, because it's tons of storage for cheap, which is great. We've been using Bedrock and SageMaker as well. We got into the early preview of Bedrock, so we started playing with that to see how it operates with some of the ML models and the Titan models, and we use SageMaker for doing some training and fine-tuning. Lambdas for compute on top of EKS, since there are situations where Lambda is really the right tool for the job, and API Gateway. Those are really the fundamental building blocks we've used to build our platform out.
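The event-driven glue between Transcribe, EventBridge, and Lambda that Chitra describes can be sketched like this. The event shape follows Amazon Transcribe's job-state-change notification as delivered via EventBridge; the handler itself and the returned actions are illustrative, not Oaktree's code.

```python
def handler(event, context=None):
    """Illustrative Lambda handler for an Amazon Transcribe
    'Transcribe Job State Change' event delivered via EventBridge.
    Decides the next pipeline step based on the job status."""
    detail = event.get("detail", {})
    job = detail.get("TranscriptionJobName")
    status = detail.get("TranscriptionJobStatus")
    if status == "COMPLETED":
        # In a real pipeline this would fetch the transcript from S3 and
        # kick off summarization / NER; here we just name the action.
        return {"action": "summarize", "job": job}
    if status == "FAILED":
        return {"action": "alert", "job": job}
    return {"action": "ignore", "job": job}

print(handler({"detail": {"TranscriptionJobName": "call-42",
                          "TranscriptionJobStatus": "COMPLETED"}}))
```

Because the handler is a pure function of the event, it scales out naturally: a hundred completed earnings calls simply mean a hundred concurrent Lambda invocations.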

Chitra: On top of large language models and data ingestion: we haven't started using Bedrock yet, we're still using OpenAI, but we do use API Gateway, and Lake Formation overall, as well as Redshift and Athena.

Sake: Darryl, specifically: you deal with a variety of data at the company, from many different places and systems. Can you share a little about how you also use Nex in some of these cases, across that data variety?

Darryl: Yeah, absolutely. Nex gives us some really key benefits that we're enjoying as we use it for data ingestion. One is the fact that it can connect to so many different data sources. Like I said, we have thousands of data sources we're connecting to, and some are SFTP, some are file shares, some are APIs. No matter how we're accessing data, from other databases or from file servers, we can pull it in. The flexibility of Nex's transformations was also key. With all these different data sources, we have lots of data that needs to be cleaned up properly; it arrives in weird formats that we have to change and transform. Nex gives us the ability not only to use its built-in transformations, but, if we need something custom because a certain vendor is giving us data in a way that isn't standard, to build a custom transformation to handle that particular vendor.
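The shape of a vendor-specific custom transformation like the one Darryl describes can be sketched as a small function. Nex's actual transformation API isn't shown here; the vendor's field names and quirks (non-standard column names, European decimal commas) are hypothetical.

```python
def transform_vendor_x(record):
    """Hypothetical custom transform for a vendor whose feed uses
    non-standard field names and European number formatting
    (e.g. "1.234,56" meaning 1234.56)."""
    return {
        "security_id": record["SecID"].strip().upper(),
        # Drop thousands separators ("."), then treat "," as the decimal point.
        "price": float(record["Px"].replace(".", "").replace(",", ".")),
        "trade_date": record["Dt"],  # this feed is assumed to already be ISO 8601
    }

print(transform_vendor_x({"SecID": " abc123 ", "Px": "1.234,56",
                          "Dt": "2023-11-28"}))
```

Keeping each vendor's quirks isolated in one such function is what lets the rest of the pipeline assume a single clean schema.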
