JSON Document Loader in LangChain

Original article · latest recommended article published 2025-11-18 11:43:25 · 207 views

License: CC 4.0 BY-SA

Tags: #json #langchain #server

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/json

This article is based on LangChain's official documentation (langchain.com.cn) and explains JSONLoader, a tool that extracts structured data from JSON files using jq queries, in simplified terms. The source code, examples, and key points of the original documentation are preserved.

Key Note: JSON (JavaScript Object Notation) is a lightweight data format for storage and exchange. JSONLoader uses the jq library to query specific fields/structures from JSON files, converting results into LangChain Document objects.

1. What is JSONLoader?

JSONLoader parses a JSON file and extracts targeted data using a jq schema, that is, an expression written in jq, a query language for JSON.

Core function: Use jq queries to filter and extract specific fields (e.g., chat messages) from JSON.
Output: Each extracted item becomes a Document object, with page_content (extracted data) and metadata (e.g., file source, sequence number).
Dependencies: Requires the jq Python package (for parsing jq schemas).
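
To make the output shape concrete, here is a minimal plain-Python sketch of the result described above. It uses neither LangChain nor jq: SketchDocument and sketch_load are hypothetical stand-ins introduced for illustration, and the list comprehension only mimics what the jq query .messages[].content selects.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for LangChain's Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def sketch_load(data: dict, source: str) -> list:
    # Plain-Python equivalent of the jq query `.messages[].content`
    contents = [message["content"] for message in data["messages"]]
    # One document per extracted item; seq_num starts at 1
    return [
        SketchDocument(page_content=text, metadata={"source": source, "seq_num": i})
        for i, text in enumerate(contents, start=1)
    ]


sample = {"messages": [{"content": "Bye!"}, {"content": "Oh no worries! Bye"}]}
docs = sketch_load(sample, source="facebook_chat.json")
for doc in docs:
    print(doc)
```

Each printed item carries the extracted text in page_content and the source plus a 1-based seq_num in metadata, which is the shape the real loader's output takes in the examples later in this article.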

2. Prerequisites

First, install the required jq package:

pip install jq
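
As an optional check, the following sketch reports whether the jq package can be imported in the current environment. It relies only on the standard library; the variable name jq_available is an illustrative choice.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path
jq_available = importlib.util.find_spec("jq") is not None

if jq_available:
    print("jq is installed; JSONLoader can use it.")
else:
    print("jq is missing; run `pip install jq` first.")
```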

3. Step 1: Import Required Modules

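In recent LangChain releases, JSONLoader is provided by the langchain_community package (from langchain_community.document_loaders import JSONLoader). The code below keeps the import path used in the original documentation: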

from langchain.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint

4. Step 2: View the JSON File Structure

Before extracting data, let’s load and inspect the sample JSON file (facebook_chat.json) to understand its structure.

Code (Exact as Original):

# Define the path to the JSON file
file_path = './example_data/facebook_chat.json'

# Load and print the full JSON data
data = json.loads(Path(file_path).read_text())
pprint(data)

Output (abridged from the original):

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              # ... remaining messages omitted for brevity
             ],
 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
 'thread_path': 'inbox/User 1 and User 2 chat',
 'title': 'User 1 and User 2 chat'}
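
For large files, printing everything can be noisy. The sketch below summarizes the structure instead: top-level keys, the fields of one message, and the message count. The inline JSON is a shortened sample that mirrors the structure shown above, not the full file, and the variable names are illustrative.

```python
import json

# Inline sample mirroring the structure shown above (shortened; not the full file)
raw = """
{
  "messages": [
    {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
    {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669}
  ],
  "participants": [{"name": "User 1"}, {"name": "User 2"}],
  "title": "User 1 and User 2 chat"
}
"""
data = json.loads(raw)

top_level_keys = sorted(data)                 # keys of the outer object
message_fields = sorted(data["messages"][0])  # fields of the first message
message_count = len(data["messages"])

print("Top-level keys:", top_level_keys)
print("Fields per message:", message_fields)
print("Number of messages:", message_count)
```

Knowing that messages is an array of objects with a content field is what motivates the jq schema used in the next section.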

5. Example 1: Basic Extraction (Extract `content` from `messages`)

Extract the content field from all items in the messages array using a simple jq schema.

Code (Exact as Original):

# Initialize JSONLoader with jq schema to extract .messages[].content
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content'  # jq query: get "content" from each "messages" item
)

# Load extracted data into Document objects
data = loader.load()

# Print the results
pprint(data)

Output (Exact as Original):

[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
 Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
 Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
 Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
 Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
 Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
 Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
 Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
 Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
 Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
 Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
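
Note that the document with seq_num 7 has an empty page_content. If empty documents are unwanted downstream, they can be filtered after loading. The sketch below shows one way to do this; SketchDocument is a hypothetical stand-in for LangChain's Document so that the example runs without LangChain installed, and the metadata is shortened.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for LangChain's Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Three of the documents from the output above, with shortened metadata
loaded = [
    SketchDocument("Here is $129", {"seq_num": 6}),
    SketchDocument("", {"seq_num": 7}),
    SketchDocument("Online is at least $100", {"seq_num": 8}),
]

# Keep only documents whose text is non-empty after stripping whitespace
non_empty = [doc for doc in loaded if doc.page_content.strip()]

print([doc.metadata["seq_num"] for doc in non_empty])
```

The same list comprehension should work on the list returned by loader.load(), since it only reads the page_content attribute.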

6. Example 2: Extract Data with Custom Metadata

Extract both content (as page_content) and additional fields (e.g., sender_name, timestamp_ms) as metadata using a custom metadata_func.

Step 6.1: Define the Metadata Extraction Function

# Define a function to extract custom metadata from each message
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")  # Add sender name to metadata
    metadata["timestamp_ms"] = record.get("timestamp_ms")  # Add timestamp to metadata
    return metadata
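
To see what this function does on its own, it can be called directly with one record and a metadata dictionary. This is a minimal sketch: the record is the first message from the sample file, while the default_metadata values (in particular the source path) are placeholders chosen for illustration.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


# One record as produced by the jq query `.messages[]`
record = {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851}
# Assumed default metadata for the first record (placeholder source path)
default_metadata = {"source": "facebook_chat.json", "seq_num": 1}

result = metadata_func(record, default_metadata)
print(result)
```

The function updates the dictionary it receives and returns that same dictionary, so the default keys (source, seq_num) are kept alongside the new ones.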

Step 6.2: Initialize JSONLoader with Metadata

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',  # jq query: iterate over each "messages" item (full record)
    content_key="content",  # Field to use as page_content
    metadata_func=metadata_func  # Custom metadata function
)

# Load data with metadata
data = loader.load()

# Print the results
pprint(data)

Output (abridged from the original):

[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
 Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
 # ... remaining documents follow the same structure
]
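
The timestamp_ms values are hard to read as raw integers. As a possible extension, a metadata function could also store a human-readable time. The sketch below assumes that timestamp_ms is milliseconds since the Unix epoch in UTC; the function name metadata_func_with_iso_time and the sent_at key are hypothetical and not part of the original documentation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def metadata_func_with_iso_time(record: dict, metadata: dict) -> dict:
    """Hypothetical variant: also store the timestamp as an ISO 8601 string."""
    metadata["sender_name"] = record.get("sender_name")
    timestamp_ms = record.get("timestamp_ms")
    metadata["timestamp_ms"] = timestamp_ms
    if timestamp_ms is not None:
        # Integer milliseconds since the Unix epoch, interpreted as UTC (assumption)
        sent_at = EPOCH + timedelta(milliseconds=timestamp_ms)
        metadata["sent_at"] = sent_at.isoformat()
    return metadata


record = {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851}
enriched = metadata_func_with_iso_time(record, {"seq_num": 1})
print(enriched["sent_at"])
```

Adding the milliseconds as an integer timedelta avoids floating-point rounding. Such a function could be passed as metadata_func in the same way as the one above.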

7. Example 3: Customize Metadata (Modify `source` Field)

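Use metadata_func to modify the default metadata, for example to shorten the source path so that it is easier to read. The function below assumes that the path contains a directory named langchain; if it does not, source.index("langchain") raises a ValueError.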

Step 7.1: Define a Modified Metadata Function

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    # Simplify the "source" path to start from "langchain"
    if "source" in metadata:
        source = metadata["source"].split("/")
        source = source[source.index("langchain"):]
        metadata["source"] = "/".join(source)
    return metadata
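
If the file may live outside a langchain directory, a more defensive variant can leave the path unchanged instead of failing. This is a sketch, not the original documentation's code: shorten_source and its anchor parameter are hypothetical names, and the example paths are made up.

```python
def shorten_source(source: str, anchor: str = "langchain") -> str:
    """Hypothetical helper: trim the path to start at `anchor`, if present."""
    parts = source.split("/")
    if anchor not in parts:
        # list.index would raise ValueError here, so keep the path unchanged
        return source
    return "/".join(parts[parts.index(anchor):])


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    if "source" in metadata:
        metadata["source"] = shorten_source(metadata["source"])
    return metadata


print(shorten_source("/Users/me/langchain/docs/example_data/facebook_chat.json"))
print(shorten_source("/data/chats/facebook_chat.json"))
```

The first call trims the path to start at langchain; the second returns the path as given because no langchain component is present.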

Step 7.2: Load Data with Modified Metadata

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)

data = loader.load()
pprint(data)

Output (abridged from the original):

[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
 Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
 # ... remaining documents follow the same structure
]

8. Common `jq` Schemas for JSON Structures

Below are reusable jq schemas for common JSON structures (from the original documentation):

| JSON Structure | `jq_schema` | Description |
| --- | --- | --- |
| `[{"text": ...}, {"text": ...}, {"text": ...}]` | `".[].text"` | Extract “text” from each item in a top-level array |
| `{"key": [{"text": ...}, {"text": ...}]}` | `".key[].text"` | Extract “text” from each item in the “key” array |
| `["...", "...", "..."]` | `".[]"` | Extract all items from a top-level array |
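
For readers new to jq, the three schemas above can be read as the following plain-Python expressions. This sketch only illustrates what each schema selects; it does not call jq, and the sample values ("a", "b", "c") are made up.

```python
# Structure 1: top-level array of objects, jq_schema ".[].text"
array_of_objects = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
from_array_of_objects = [item["text"] for item in array_of_objects]

# Structure 2: array nested under a key, jq_schema ".key[].text"
nested_under_key = {"key": [{"text": "a"}, {"text": "b"}]}
from_nested = [item["text"] for item in nested_under_key["key"]]

# Structure 3: top-level array of strings, jq_schema ".[]"
array_of_strings = ["a", "b", "c"]
from_strings = list(array_of_strings)

print(from_array_of_objects, from_nested, from_strings)
```

In each case, every element of the resulting list corresponds to one Document produced by the loader.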

Key Takeaways

Dependencies: Install jq before using JSONLoader.
Basic Extraction: Use jq_schema to target specific fields (e.g., .messages[].content).
Custom Metadata: Use metadata_func to add/modify metadata (e.g., sender name, timestamp).
jq Schemas: Adapt the schema to match your JSON structure (see common examples above).
Output: Each extracted item is a Document with page_content and metadata (default: source + seq_num).