https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/json
JSON Document Loader in LangChain
This content is based on LangChain’s official documentation (langchain.com.cn) and explains JSONLoader, a tool that extracts structured data from JSON files using jq queries, in simplified terms. The source code, examples, and key points follow the original documentation.
Key Note: JSON (JavaScript Object Notation) is a lightweight data format for storage and exchange. JSONLoader uses the jq library to query specific fields/structures from JSON files, converting results into LangChain Document objects.
1. What is JSONLoader?
JSONLoader parses JSON files and extracts targeted data using a jq schema (an expression written in jq, a query language for JSON).
- Core function: Use jq queries to filter and extract specific fields (e.g., chat messages) from JSON.
- Output: Each extracted item becomes a Document object, with page_content (extracted data) and metadata (e.g., file source, sequence number).
- Dependencies: Requires the jq Python package (for parsing jq schemas).
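To make the output shape concrete, here is a minimal standard-library sketch of what extracting the content of each message amounts to. It is not how JSONLoader is implemented: the function name extract_documents is hypothetical, and plain dictionaries stand in for Document objects.

```python
import json


def extract_documents(raw_json: str, source: str) -> list[dict]:
    """Mimic the shape of the loader's output with plain dictionaries (illustration only)."""
    data = json.loads(raw_json)
    documents = []
    # seq_num starts at 1, matching the numbering shown in the loader's output
    for seq_num, message in enumerate(data["messages"], start=1):
        documents.append(
            {
                "page_content": message["content"],
                "metadata": {"source": source, "seq_num": seq_num},
            }
        )
    return documents


raw = '{"messages": [{"content": "Bye!"}, {"content": "Oh no worries! Bye"}]}'
docs = extract_documents(raw, source="facebook_chat.json")
```

Each message yields one item, and the metadata records where the item came from and its position in the sequence.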
2. Prerequisites
First, install the required jq package:
pip install jq
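A quick way to confirm the dependency is available before constructing the loader is to ask Python whether the module can be found. The helper name is_installed below is an assumption made for this sketch; it uses only the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("jq"):
    print("The jq package is missing; run: pip install jq")
```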
3. Step 1: Import Required Modules
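The code below imports all necessary libraries, exactly as in the original documentation. In newer LangChain releases, JSONLoader is imported from langchain_community.document_loaders instead.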
from langchain.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint
4. Step 2: View the JSON File Structure
Before extracting data, let’s load and inspect the sample JSON file (facebook_chat.json) to understand its structure.
Code (Exact as Original):
# Define the path to the JSON file
file_path = './example_data/facebook_chat.json'
# Load and print the full JSON data
data = json.loads(Path(file_path).read_text())
pprint(data)
Output (Exact as Original):
{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              # ... remaining messages omitted for brevity
             ],
 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
 'thread_path': 'inbox/User 1 and User 2 chat',
 'title': 'User 1 and User 2 chat'}
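Printing the whole file works for a small sample like this one. For larger files, a summary of the top-level keys can be easier to scan when deciding which jq schema to write. The sketch below assumes the file holds a JSON object at the top level; the helper name summarize and the shortened inline sample are illustrative only.

```python
import json


def summarize(data: dict) -> dict:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} item(s)"
        else:
            summary[key] = type(value).__name__
    return summary


# A shortened stand-in for facebook_chat.json
sample = json.loads(
    '{"title": "User 1 and User 2 chat", "is_still_participant": true, '
    '"magic_words": [], '
    '"messages": [{"content": "Bye!"}, {"content": "Oh no worries! Bye"}]}'
)
structure = summarize(sample)
```

Here the messages key shows up as a list, which is the array that the examples iterate over with .messages[].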
5. Example 1: Basic Extraction (Extract content from messages)
Extract the content field from all items in the messages array using a simple jq schema.
Code (Exact as Original):
# Initialize JSONLoader with jq schema to extract .messages[].content
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content'  # jq query: get "content" from each "messages" item
)
# Load extracted data into Document objects
data = loader.load()
# Print the results
pprint(data)
Output (Exact as Original):
[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
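The output above includes one entry (seq_num 7) whose page_content is an empty string. If such entries are unwanted downstream, they can be filtered out after loading. The sketch below uses plain dictionaries as stand-ins for Document objects, and the helper name drop_empty is hypothetical.

```python
def drop_empty(documents: list[dict]) -> list[dict]:
    """Keep only entries whose page_content has non-whitespace text."""
    return [doc for doc in documents if doc["page_content"].strip()]


loaded = [
    {"page_content": "Here is $129", "metadata": {"seq_num": 6}},
    {"page_content": "", "metadata": {"seq_num": 7}},
    {"page_content": "Online is at least $100", "metadata": {"seq_num": 8}},
]
kept = drop_empty(loaded)
```

The surviving entries keep their original seq_num values, so a gap in the numbering shows where an entry was dropped.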
6. Example 2: Extract Data with Custom Metadata
Extract both content (as page_content) and additional fields (e.g., sender_name, timestamp_ms) as metadata using a custom metadata_func.
Step 6.1: Define the Metadata Extraction Function
# Define a function to extract custom metadata from each message
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")  # Add sender name to metadata
    metadata["timestamp_ms"] = record.get("timestamp_ms")  # Add timestamp to metadata
    return metadata
Step 6.2: Initialize JSONLoader with Metadata
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',  # jq query: iterate over each "messages" item (full record)
    content_key="content",  # Field to use as page_content
    metadata_func=metadata_func  # Custom metadata function
)
# Load data with metadata
data = loader.load()
# Print the results
pprint(data)
Output (Exact as Original):
[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
# ... remaining documents follow the same structure
]
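Because metadata_func receives the full record, it can also derive new values rather than only copying fields. As one possible variation, the sketch below adds a readable UTC timestamp next to timestamp_ms. The function name metadata_func_with_datetime and the sent_at key are assumptions made for this illustration and do not appear in the original documentation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def metadata_func_with_datetime(record: dict, metadata: dict) -> dict:
    """Copy sender and timestamp, and add an ISO 8601 UTC timestamp when one is present."""
    metadata["sender_name"] = record.get("sender_name")
    timestamp_ms = record.get("timestamp_ms")
    metadata["timestamp_ms"] = timestamp_ms
    if timestamp_ms is not None:
        # Integer milliseconds via timedelta avoid float rounding
        sent_at = EPOCH + timedelta(milliseconds=timestamp_ms)
        metadata["sent_at"] = sent_at.isoformat()  # hypothetical key
    return metadata


record = {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851}
result = metadata_func_with_datetime(
    record, {"source": "facebook_chat.json", "seq_num": 1}
)
```

The function has the same (record, metadata) signature as the one above, so it could be passed as metadata_func in the same way.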
7. Example 3: Customize Metadata (Modify source Field)
Use metadata_func to modify default metadata (e.g., simplify the source path to be more readable).
Step 7.1: Define a Modified Metadata Function
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    # Simplify the "source" path to start from "langchain"
    if "source" in metadata:
        source = metadata["source"].split("/")
        source = source[source.index("langchain"):]
        metadata["source"] = "/".join(source)
    return metadata
Step 7.2: Load Data with Modified Metadata
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)
data = loader.load()
pprint(data)
Output (Exact as Original):
[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
# ... remaining documents follow the same structure
]
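One caveat with the function in Step 7.1: source.index("langchain") raises ValueError when the file path has no langchain directory in it, which is likely outside the original author's machine. A more defensive variant might look like the sketch below. The helper name shorten_source and the example paths are hypothetical.

```python
def shorten_source(source: str, anchor: str = "langchain") -> str:
    """Trim the path to start at `anchor`; return it unchanged if `anchor` is absent."""
    parts = source.split("/")
    if anchor not in parts:
        return source
    return "/".join(parts[parts.index(anchor):])


with_anchor = shorten_source("/Users/me/langchain/docs/example_data/facebook_chat.json")
without_anchor = shorten_source("/tmp/data/facebook_chat.json")
```

Inside metadata_func, the assignment would then become metadata["source"] = shorten_source(metadata["source"]).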
8. Common jq Schemas for JSON Structures
Below are reusable jq schemas for common JSON structures (from the original documentation):
| JSON Structure | jq_schema | Description |
|---|---|---|
| [{"text": ...}, {"text": ...}, {"text": ...}] | ".[].text" | Extract “text” from each item in a top-level array |
| {"key": [{"text": ...}, {"text": ...}]} | ".key[].text" | Extract “text” from each item in the “key” array |
| ["...", "...", "..."] | ".[]" | Extract all items from a top-level array |
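For readers less familiar with jq, the three schemas in the table correspond roughly to the plain Python expressions below. This is only an analogy to show which values each schema selects; it does not use jq or JSONLoader, and the sample values are made up.

```python
# Sample data for the three structures in the table
array_of_objects = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
keyed_array = {"key": [{"text": "a"}, {"text": "b"}]}
array_of_strings = ["a", "b", "c"]

# ".[].text": the "text" field of each item in a top-level array
from_array = [item["text"] for item in array_of_objects]

# ".key[].text": the "text" field of each item in the "key" array
from_key = [item["text"] for item in keyed_array["key"]]

# ".[]": every item of a top-level array
from_strings = list(array_of_strings)
```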
Key Takeaways
- Dependencies: Install jq before using JSONLoader.
- Basic Extraction: Use jq_schema to target specific fields (e.g., .messages[].content).
- Custom Metadata: Use metadata_func to add/modify metadata (e.g., sender name, timestamp).
- jq Schemas: Adapt the schema to match your JSON structure (see common examples above).
- Output: Each extracted item is a Document with page_content and metadata (default: source + seq_num).