Exploring the Upper Limits of Text-Based Collaborative Filtering Using Large Language Models: Discoveries and Insights
dataset
Due to memory constraints in some end-to-end (E2E) training experiments, we constructed interaction sequences for each user by selecting their latest 23 items. We remove users with fewer than 5 interactions, simply because we do not consider cold-user settings. After this basic pre-processing, we randomly selected 200,000 users (and their interactions) from both the MIND and HM datasets, and 50,000 users from Bili.
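The pre-processing described above can be sketched as follows; the pandas layout and column names (`user_id`, `item_id`, `timestamp`) are illustrative assumptions, not the paper's actual pipeline.

```python
import pandas as pd

def preprocess(interactions: pd.DataFrame, min_interactions: int = 5,
               max_seq_len: int = 23) -> pd.DataFrame:
    # Drop cold users: fewer than 5 interactions.
    counts = interactions.groupby("user_id")["item_id"].transform("count")
    interactions = interactions[counts >= min_interactions]
    # Keep each user's latest 23 items (memory constraint for E2E training).
    interactions = interactions.sort_values(["user_id", "timestamp"])
    return interactions.groupby("user_id").tail(max_seq_len)
```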
Q1: How does the recommender system's performance respond to a continuous increase in the item encoder's size? Are its performance limits attainable at the scale of hundreds of billions of parameters?
All LMs are frozen in this study.
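A minimal sketch of this setup: item text is encoded by a frozen model that never receives gradients, and only a small adapter on top is trained with the recommender. The deterministic hash-based "encoder" below is a stand-in assumption for a real frozen LM; dimensions are illustrative.

```python
import zlib
import numpy as np

LM_DIM, REC_DIM = 768, 64
rng = np.random.default_rng(0)

def frozen_lm_encode(title: str) -> np.ndarray:
    # Stand-in for the frozen LM: deterministic per title, never updated.
    seed = zlib.crc32(title.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(LM_DIM)

# Only this adapter on top of the frozen representations is trainable.
W_adapter = rng.standard_normal((LM_DIM, REC_DIM)) / np.sqrt(LM_DIM)

def item_embedding(title: str) -> np.ndarray:
    return frozen_lm_encode(title) @ W_adapter
```

The resulting item embeddings would then feed a sequential backbone such as SASRec in place of learned ID embeddings.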
(answer to Q1) the TCF model with a 175B parameter LM may not have reached its performance ceiling.
Q2: Can super-large LMs, such as GPT-3 with 175-billion parameters, generate universal text representations?
(answer to Q2) even the item representation learned by an extremely large LM (e.g., GPT-3) may not result in a universal representation, at least not for the text recommendation task.
Q3: Can recommender models with a 175-billion-parameter LM as the item encoder easily beat the simplest ID-embedding-based models (IDCF), especially for warm item recommendation?
This is a significant advancement, as no previous study has explicitly claimed that TCF with a frozen NLP encoder can attain performance comparable to its IDCF counterparts for warm or popular item recommendation.
The answer to Q3 is that, for text-centric recommendation, TCF with the SASRec backbone and a 175B-parameter frozen LM can achieve performance similar to standard IDCF, even for popular item recommendation. However, even by retraining a super-large LM item encoder, TCF with a DSSM backbone has little chance of competing with its corresponding IDCF. The simple IDCF remains a highly competitive approach in the warm item recommendation setting.
Q4: How close is the TCF paradigm to a universal recommender model?
we first pre-train a SASRec-based TCF model with the 175B-parameter frozen LM as the item encoder on a large-scale text recommendation dataset. We then directly evaluate the pre-trained model on the test sets of MIND, HM, and QB.
(answer to Q4) while TCF models with large LMs do exhibit a certain degree of transfer learning capability, they still fall significantly short of being a universal recommender model, as we had initially envisioned.
Q5: Will the classic TCF paradigm be replaced by a recent prompt-engineering-based recommendation method that utilizes ChatGPT (called ChatGPT4Rec)?
We randomly selected 1024 users from the testing sets of MIND, HM, and Bili, and created two tasks for ChatGPT. In the first task (Task 1 in Table 6), ChatGPT was asked to select the most preferred item from four candidates (one ground truth and three randomly selected items), given the user’s historical interactions as a condition. The second task (Task 2 in Table 6) was to ask ChatGPT to rank the top-10 preferred items from 100 candidates (one ground truth and 99 randomly selected items, excluding all historical interactions), also provided with the user’s historical interactions as input.
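A hypothetical prompt builder for Task 1 (choose the preferred item among four candidates given the user's history); the wording is illustrative, not the paper's exact template.

```python
def build_task1_prompt(history: list[str], candidates: list[str]) -> str:
    # One ground-truth item plus three randomly selected negatives.
    assert len(candidates) == 4
    lines = ["A user interacted with the following items, most recent last:"]
    lines += [f"- {title}" for title in history]
    lines.append("Which ONE of these candidates will the user prefer next?")
    lines += [f"{i}. {title}" for i, title in enumerate(candidates, 1)]
    lines.append("Answer with the candidate number only.")
    return "\n".join(lines)
```

Task 2 would be analogous, with 100 candidates and a request to rank the top 10.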
the answer to Q5 is that, based on its current performance and limitations, ChatGPT is unable to replace the classical TCF paradigm.
Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
VIP5: Towards Multimodal Foundation Models for Recommendation
The multimodal version of P5.
task
- sequential recommendation
- direct recommendation
- explanation
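The text-to-text framing of these tasks can be sketched with illustrative templates; the actual P5 prompt collections are larger and worded differently, so everything below is an assumption.

```python
# Every recommendation task is phrased as a text-to-text problem.
templates = {
    "sequential": "User_{uid} has interacted with items {seq}. "
                  "Predict the next item the user will interact with.",
    "direct": "Will user_{uid} like item_{iid}? Answer yes or no.",
    "explanation": "Explain why user_{uid} would enjoy item_{iid}.",
}

def make_prompt(task: str, **fields) -> str:
    return templates[task].format(**fields)
```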
framework
experiment
dataset
Towards Open-World Recommendation with Knowledge Augmentation from Large Language Models
we posit that instead of solely learning from narrowly defined data in closed systems, recommender systems should be open-world systems that can proactively acquire knowledge from the external world.
open-world knowledge for recommendation
- reasoning knowledge: inferred from the user's behavior history, it enables a more comprehensive understanding of the user.
- factual knowledge: provides valuable common sense information about the candidate items and thereby improves the recommendation quality.
shortcomings of LLMs as recommenders.
- Predictive accuracy: LLMs are generally outperformed by classical recommenders
- Inference latency.
- Compositional gap: asking LLMs for direct recommendation results is currently beyond their capability, and doing so cannot fully exploit the open-world knowledge encoded in LLMs
framework
Knowledge Reasoning and Generation
two challenges:
- compositional gap: a user's clicks on items are motivated by multiple key aspects, and user interests are diverse and multifaceted, which requires multiple reasoning steps.
- the generated factual knowledge may be correct but useless, as it may not align with the inferred user preferences.
factorization prompting
- preference reasoning prompt
- Item factual prompt
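Illustrative versions of the two factorized prompts; the exact KAR wording differs, and the aspect lists and phrasing here are assumptions. The idea is to decompose the reasoning into explicit aspects so that a single LLM call does not have to bridge the full compositional gap at once.

```python
def preference_reasoning_prompt(history: list[str], aspects: list[str]) -> str:
    # Asks the LLM to reason about user preferences aspect by aspect.
    return ("Given the items a user interacted with: " + ", ".join(history)
            + ". Analyze the user's preferences from the following aspects: "
            + ", ".join(aspects) + ". Provide a step-by-step analysis.")

def item_factual_prompt(item: str, aspects: list[str]) -> str:
    # Asks for factual knowledge about a candidate item, on the same aspects,
    # so the generated facts stay aligned with the inferred preferences.
    return (f"Introduce {item} and describe its attributes, covering: "
            + ", ".join(aspects) + ".")
```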
Knowledge Adaptation
new challenges:
- The knowledge generated by LLMs is usually in the form of text, which cannot be directly leveraged by traditional RSs that typically process categorical features.
- Even if some LLMs are open-sourced, the decoded outputs are usually large dense vectors (e.g., 4096 dimensions per token) and lie in a semantic space that differs significantly from the recommendation space.
- The generated knowledge may contain noise or unreliable information
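One minimal way to address the first two challenges: learn a projection that compresses the LLM's large dense output into a low-dimensional feature a conventional RS can consume. The single linear layer below stands in for KAR's learned adaptor (the paper uses a more elaborate mixture-of-experts design); all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
LLM_DIM, FEAT_DIM = 4096, 32

# Trainable projection: semantic space -> recommendation feature space.
W = rng.standard_normal((LLM_DIM, FEAT_DIM)) / np.sqrt(LLM_DIM)

def adapt(knowledge_vec: np.ndarray) -> np.ndarray:
    # tanh keeps the adapted feature bounded, like a dense RS feature.
    return np.tanh(knowledge_vec @ W)
```

Noise in the generated knowledge (the third challenge) would then be handled by the downstream model learning to down-weight unreliable features during training.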
Experiments