JSON Document Loader in LangChain

https://python.langchain.com.cn/docs/modules/data_connection/document_loaders/how_to/json

This content is based on LangChain’s official documentation (langchain.com.cn) and explains, in simplified terms, the JSONLoader: a tool that extracts structured data from JSON files using jq queries. The code, examples, and key points of the original documentation are kept unchanged.

Key Note: JSON (JavaScript Object Notation) is a lightweight data format for storage and exchange. JSONLoader uses the jq library to query specific fields/structures from JSON files, converting results into LangChain Document objects.

1. What is JSONLoader?

JSONLoader parses JSON files and extracts targeted data using a jq schema (a query written in jq, a query language for JSON).

  • Core function: Use jq queries to filter and extract specific fields (e.g., chat messages) from JSON.
  • Output: Each extracted item becomes a Document object, with page_content (extracted data) and metadata (e.g., file source, sequence number).
  • Dependencies: Requires the jq Python package (for parsing jq schemas).
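
To make the output shape concrete, here is a minimal sketch in plain Python. `SimpleDocument` and `to_documents` are hypothetical stand-ins written for illustration; they are not LangChain's own classes or implementation. The sketch only shows the idea that each extracted item gets its own page_content plus a source and a sequence number starting at 1.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_documents(items, source):
    """Wrap each extracted item in a document, numbering items from 1."""
    return [
        SimpleDocument(page_content=str(item), metadata={"source": source, "seq_num": i})
        for i, item in enumerate(items, start=1)
    ]


docs = to_documents(["Bye!", "Oh no worries! Bye"], source="facebook_chat.json")
for doc in docs:
    print(doc.page_content, doc.metadata)
```

The real loader records the full file path in source, as the outputs later in this article show; the short file name here is only to keep the example readable.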

2. Prerequisites

First, install the required jq package:

pip install jq
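
If you are unsure whether the installation succeeded, a quick check along these lines may help. It uses only the standard library and does not import jq itself; the variable name `jq_available` is just an illustrative choice.

```python
import importlib.util

# find_spec returns None when the package cannot be located.
jq_available = importlib.util.find_spec("jq") is not None

if jq_available:
    print("jq is installed.")
else:
    print("jq is missing; run `pip install jq` first.")
```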

3. Step 1: Import Required Modules

The code below imports all necessary libraries—exactly as in the original documentation:

from langchain.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint
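
In more recent LangChain releases the document loaders are shipped in the separate langchain-community package, so the import path above may fail or raise a deprecation warning depending on the installed version. The fallback below is a sketch under that assumption, not part of the original documentation; setting `JSONLoader = None` when neither package is present is an illustrative choice.

```python
try:
    # Newer releases: loaders live in the langchain-community package.
    from langchain_community.document_loaders import JSONLoader
except ImportError:
    try:
        # Older releases: the import path used in this article.
        from langchain.document_loaders import JSONLoader
    except ImportError:
        JSONLoader = None  # neither package is installed

if JSONLoader is None:
    print("Install langchain-community (or an older langchain) to use JSONLoader.")
```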

4. Step 2: View the JSON File Structure

Before extracting data, let’s load and inspect the sample JSON file (facebook_chat.json) to understand its structure.

Code (Exact as Original):

# Define the path to the JSON file
file_path = './example_data/facebook_chat.json'

# Load and print the full JSON data
data = json.loads(Path(file_path).read_text())
pprint(data)

Output (Exact as Original):

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              # ... remaining messages omitted for brevity
             ],
 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
 'thread_path': 'inbox/User 1 and User 2 chat',
 'title': 'User 1 and User 2 chat'}
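
For larger files, printing everything can be hard to read. One possible alternative is to summarize only the top-level keys, which is usually enough to decide what the jq schema should target. The helper `summarize` is hypothetical, and `sample` is a trimmed copy of the data above so the snippet runs without the file.

```python
sample = {
    "is_still_participant": True,
    "messages": [
        {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
        {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669},
    ],
    "participants": [{"name": "User 1"}, {"name": "User 2"}],
    "title": "User 1 and User 2 chat",
}


def summarize(data):
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in data.items():
        if isinstance(value, list):
            summary[key] = f"list of {len(value)} item(s)"
        else:
            summary[key] = type(value).__name__
    return summary


summary = summarize(sample)
for key, description in summary.items():
    print(f"{key}: {description}")
```

Here the summary makes it clear that messages is the array worth iterating over.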

5. Example 1: Basic Extraction (Extract content from messages)

Extract the content field from all items in the messages array using a simple jq schema.

Code (Exact as Original):

# Initialize JSONLoader with jq schema to extract .messages[].content
loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content'  # jq query: get "content" from each "messages" item
)

# Load extracted data into Document objects
data = loader.load()

# Print the results
pprint(data)

Output (Exact as Original):

[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
 Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
 Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 3}),
 Document(page_content='I thought you were selling the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 4}),
 Document(page_content='Im not interested in this bag. Im interested in the blue one!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 5}),
 Document(page_content='Here is $129', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 6}),
 Document(page_content='', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 7}),
 Document(page_content='Online is at least $100', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 8}),
 Document(page_content='How much do you want?', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 9}),
 Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
 Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
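
To see what the schema .messages[].content selects, it can help to read it as plain Python. The sketch below does not use jq or JSONLoader; it mirrors the query by hand on three content values taken from the output above. It also shows one possible way to drop empty entries such as the one at seq_num 7, which is an optional post-processing step and not something the loader does for you.

```python
chat = {
    "messages": [
        {"content": "Here is $129"},
        {"content": ""},
        {"content": "Online is at least $100"},
    ]
}

# Plain-Python reading of '.messages[].content':
#   .messages  -> look up the "messages" key
#   []         -> iterate over every element of that array
#   .content   -> look up "content" on each element
#                 (jq yields null for a missing key, mirrored here by .get)
contents = [message.get("content") for message in chat["messages"]]

# Optional post-processing: keep only entries with visible text.
non_empty = [text for text in contents if text and text.strip()]

print(contents)
print(non_empty)
```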

6. Example 2: Extract Data with Custom Metadata

Extract both content (as page_content) and additional fields (e.g., sender_name, timestamp_ms) as metadata using a custom metadata_func.

Step 6.1: Define the Metadata Extraction Function

# Define a function to extract custom metadata from each message
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")  # Add sender name to metadata
    metadata["timestamp_ms"] = record.get("timestamp_ms")  # Add timestamp to metadata
    return metadata

Step 6.2: Initialize JSONLoader with Metadata

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',  # jq query: iterate over each "messages" item (full record)
    content_key="content",  # Field to use as page_content
    metadata_func=metadata_func  # Custom metadata function
)

# Load data with metadata
data = loader.load()

# Print the results
pprint(data)

Output (Exact as Original):

[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
 Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
 # ... remaining documents follow the same structure
]
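
Because metadata_func is an ordinary Python function, it can be tried on a single record before it is handed to the loader. The sketch below repeats the function from Step 6.1 and calls it by hand; the shortened source value and the second, incomplete record are made up for illustration.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


# One record as selected by '.messages[]', plus default-style metadata.
record = {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851}
default_metadata = {"source": "facebook_chat.json", "seq_num": 1}  # shortened path, illustrative

# Pass a copy so the original dict is left untouched.
result = metadata_func(record, dict(default_metadata))
print(result)

# record.get returns None for absent keys, so incomplete records do not raise.
partial = metadata_func({"content": "Hi"}, {"source": "facebook_chat.json", "seq_num": 2})
print(partial)
```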

7. Example 3: Customize Metadata (Modify source Field)

Use metadata_func to modify default metadata (e.g., simplify the source path to be more readable).

Step 7.1: Define a Modified Metadata Function

def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    # Simplify the "source" path to start from "langchain"
    if "source" in metadata:
        source = metadata["source"].split("/")
        source = source[source.index("langchain"):]
        metadata["source"] = "/".join(source)
    return metadata

Step 7.2: Load Data with Modified Metadata

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[]',
    content_key="content",
    metadata_func=metadata_func
)

data = loader.load()
pprint(data)

Output (Exact as Original):

[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1, 'sender_name': 'User 2', 'timestamp_ms': 1675597571851}),
 Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2, 'sender_name': 'User 1', 'timestamp_ms': 1675597435669}),
 # ... remaining documents follow the same structure
]
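
One caveat with the function in Step 7.1: source.index("langchain") raises ValueError when no path component is named langchain, which will be the case for most paths outside the original author's machine. A more defensive variant might look like the sketch below. The helper `shorten_source` and its `anchor` parameter are hypothetical, and the code assumes forward-slash separators as in the original.

```python
def shorten_source(source: str, anchor: str = "langchain") -> str:
    """Return the path from `anchor` onward, or the path unchanged if `anchor` is absent."""
    parts = source.split("/")
    if anchor not in parts:
        return source
    return "/".join(parts[parts.index(anchor):])


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    if "source" in metadata:
        metadata["source"] = shorten_source(metadata["source"])
    return metadata


long_path = "/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json"
print(shorten_source(long_path))
print(shorten_source("/tmp/data/facebook_chat.json"))  # no "langchain" component: returned as-is
```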

8. Common jq Schemas for JSON Structures

Below are reusable jq schemas for common JSON structures (from the original documentation):

JSON Structure | jq_schema | Description
--- | --- | ---
[{"text": ...}, {"text": ...}, {"text": ...}] | ".[].text" | Extract “text” from each item in a top-level array
{"key": [{"text": ...}, {"text": ...}]} | ".key[].text" | Extract “text” from each item in the “key” array
["...", "...", "..."] | ".[]" | Extract all items from a top-level array
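
As a rough guide to what each of these schemas selects, the three cases can be written out as plain Python. This is only a hand-written equivalent on made-up sample values ("a", "b", "c"); it does not call jq or JSONLoader.

```python
import json

# Structure 1: top-level array of objects  ->  jq_schema ".[].text"
array_of_objects = json.loads('[{"text": "a"}, {"text": "b"}, {"text": "c"}]')
texts_1 = [item["text"] for item in array_of_objects]

# Structure 2: array of objects under a key  ->  jq_schema ".key[].text"
keyed_array = json.loads('{"key": [{"text": "a"}, {"text": "b"}]}')
texts_2 = [item["text"] for item in keyed_array["key"]]

# Structure 3: top-level array of strings  ->  jq_schema ".[]"
array_of_strings = json.loads('["a", "b", "c"]')
texts_3 = list(array_of_strings)

print(texts_1, texts_2, texts_3)
```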

Key Takeaways

  • Dependencies: Install jq before using JSONLoader.
  • Basic Extraction: Use jq_schema to target specific fields (e.g., .messages[].content).
  • Custom Metadata: Use metadata_func to add/modify metadata (e.g., sender name, timestamp).
  • jq Schemas: Adapt the schema to match your JSON structure (see common examples above).
  • Output: Each extracted item is a Document with page_content and metadata (default: source + seq_num).