################################
Tutorial 2: Data Loading
################################

.. include:: ../links.ref
.. include:: ../tags.ref
.. include:: ../abbrs.ref

.. contents:: Table of Contents
   :local:
   :depth: 2

Data Loading Overview
=====================

LlamaIndex provides powerful data-loading capabilities, with support for more
than 100 data sources. Loading is the first step of a RAG pipeline, and the
quality of everything downstream depends on it.

.. code-block:: text

   Data loading flow:

   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
   │   Raw data   │───►│    Reader    │───►│   Document   │
   │ (any format) │    │   (loader)   │    │ (normalized) │
   └──────────────┘    └──────────────┘    └──────────────┘

Types of Data Connectors
------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Type
     - Examples
     - Notes
   * - Local files
     - PDF, Word, Markdown
     - The most common data sources
   * - Databases
     - MySQL, PostgreSQL
     - Structured data
   * - Cloud storage
     - S3, Google Drive
     - Remote files
   * - Web
     - Web pages, APIs
     - Online data
   * - Knowledge bases
     - Notion, Confluence
     - Collaboration platforms

SimpleDirectoryReader
=====================

The most commonly used file loader; it supports many file formats out of the
box.

Basic Usage
-----------

.. code-block:: python

   from llama_index.core import SimpleDirectoryReader

   # Load an entire directory
   reader = SimpleDirectoryReader(input_dir="./data")
   documents = reader.load_data()

   print(f"Loaded {len(documents)} documents")
   for doc in documents:
       print(f"- {doc.metadata.get('file_name', 'unknown')}")

Filtering by File Type
----------------------

.. code-block:: python

   # Only load files with specific extensions
   reader = SimpleDirectoryReader(
       input_dir="./data",
       required_exts=[".pdf", ".md", ".txt"]  # only load these types
   )
   documents = reader.load_data()

Loading Subdirectories Recursively
----------------------------------

.. code-block:: python

   # Recursively load all subdirectories
   reader = SimpleDirectoryReader(
       input_dir="./data",
       recursive=True,              # include subdirectories
       exclude=["*.log", "temp/*"]  # exclusion patterns
   )
   documents = reader.load_data()

Custom Metadata
---------------

.. code-block:: python

   def custom_metadata_func(file_path: str) -> dict:
       """Custom metadata extraction function."""
       import os
       from datetime import datetime

       stat = os.stat(file_path)
       return {
           "file_name": os.path.basename(file_path),
           "file_path": file_path,
           "file_size": stat.st_size,
           "created_at": datetime.fromtimestamp(stat.st_ctime).isoformat(),
           "modified_at": datetime.fromtimestamp(stat.st_mtime).isoformat(),
       }

   reader = SimpleDirectoryReader(
       input_dir="./data",
       file_metadata=custom_metadata_func
   )
   documents = reader.load_data()

Loading PDF Files
=================

Basic PDF Loading
-----------------

.. code-block:: python

   from llama_index.readers.file import PDFReader

   # Use the dedicated PDF reader
   pdf_reader = PDFReader()
   documents = pdf_reader.load_data(file="./documents/report.pdf")

   for doc in documents:
       print(f"Page: {doc.metadata.get('page_label', 'N/A')}")
       print(f"Preview: {doc.text[:200]}...")

Advanced PDF Parsing
--------------------

.. code-block:: python

   # Install the advanced PDF parser
   # pip install llama-index-readers-llama-parse

   from llama_parse import LlamaParse

   # LlamaParse produces higher-quality PDF parses
   parser = LlamaParse(
       api_key="your-llama-cloud-api-key",
       result_type="markdown",  # emit Markdown output
       language="zh",           # Chinese is supported
       verbose=True
   )

   documents = parser.load_data("./documents/complex_report.pdf")

Handling Scanned PDFs
---------------------

Scanned PDFs contain page images rather than extractable text, so they need
OCR support: convert each PDF page to an image first, then run OCR on it, as
sketched after the snippet below.

.. code-block:: python

   # OCR dependencies for scanned PDFs
   # pip install pytesseract pdf2image

   from llama_index.readers.file import ImageReader

   # Reader for OCR-ing individual page images
   image_reader = ImageReader(
       text_type="plain_text",
       parser_config={"lang": "chi_sim+eng"}  # mixed Chinese and English
   )
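The ``ImageReader`` above handles individual images. For a whole scanned PDF
you can also drive the OCR yourself. The following is a minimal sketch,
assuming ``pdf2image`` and ``pytesseract`` are installed together with their
system dependencies (poppler and tesseract); the file path and metadata keys
are illustrative, not part of any LlamaIndex API.

.. code-block:: python

   from pdf2image import convert_from_path  # needs the poppler system package
   import pytesseract                       # needs the tesseract binary
   from llama_index.core import Document

   def ocr_pdf(pdf_path: str, lang: str = "chi_sim+eng") -> list:
       """Render each page of a scanned PDF to an image and OCR it."""
       pages = convert_from_path(pdf_path)
       documents = []
       for page_num, page_image in enumerate(pages, start=1):
           text = pytesseract.image_to_string(page_image, lang=lang)
           documents.append(Document(
               text=text,
               metadata={"file_path": pdf_path, "page": page_num, "source": "ocr"},
           ))
       return documents

   documents = ocr_pdf("./documents/scanned_report.pdf")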
Loading from Databases
======================

SQL Databases
-------------

.. code-block:: python

   from llama_index.readers.database import DatabaseReader

   # Connect to the database
   db_reader = DatabaseReader(
       uri="mysql://user:password@localhost:3306/mydb"
   )

   # Run a SQL query and load the results
   documents = db_reader.load_data(
       query="SELECT id, title, content FROM articles WHERE status = 'published'"
   )

   print(f"Loaded {len(documents)} records from the database")

Custom Database Loading
-----------------------

.. code-block:: python

   from sqlalchemy import create_engine, text
   from llama_index.core import Document

   def load_from_database(connection_string: str, query: str) -> list:
       """Custom database loading function."""
       engine = create_engine(connection_string)
       documents = []

       with engine.connect() as conn:
           result = conn.execute(text(query))
           for row in result:
               doc = Document(
                   text=str(row.content),
                   metadata={
                       "id": row.id,
                       "title": row.title,
                       "source": "database",
                       "table": "articles"
                   }
               )
               documents.append(doc)

       return documents

   # Use the custom loader
   docs = load_from_database(
       "postgresql://user:pass@localhost/db",
       "SELECT * FROM knowledge_base"
   )

Loading Web Data
================

Scraping Web Pages
------------------

.. code-block:: python

   from llama_index.readers.web import SimpleWebPageReader

   # Load a single web page
   web_reader = SimpleWebPageReader(html_to_text=True)
   documents = web_reader.load_data(
       urls=["https://docs.llamaindex.ai/en/stable/"]
   )

   print(f"Page content length: {len(documents[0].text)}")

Crawling a Whole Site
---------------------

.. code-block:: python

   from llama_index.readers.web import WholeSiteReader

   # Crawl an entire site (use with care)
   site_reader = WholeSiteReader(
       prefix="https://docs.llamaindex.ai",
       max_depth=2  # maximum crawl depth
   )
   documents = site_reader.load_data(
       base_url="https://docs.llamaindex.ai/en/stable/"
   )

Loading from APIs
-----------------

.. code-block:: python

   import requests
   from llama_index.core import Document

   def load_from_api(api_url: str, headers: dict = None) -> list:
       """Load data from a JSON API."""
       response = requests.get(api_url, headers=headers)
       data = response.json()

       documents = []
       for item in data.get("items", []):
           doc = Document(
               text=item.get("content", ""),
               metadata={
                   "id": item.get("id"),
                   "title": item.get("title"),
                   "source": "api",
                   "url": api_url
               }
           )
           documents.append(doc)

       return documents

   # Example call; adapt the parsing above to the actual response shape
   # (the GitHub readme endpoint returns a single object, not an "items" list)
   docs = load_from_api(
       "https://api.github.com/repos/run-llama/llama_index/readme",
       headers={"Accept": "application/vnd.github.v3+json"}
   )
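Real APIs are usually paginated and can fail transiently. Below is a minimal
sketch that extends the idea of ``load_from_api`` with pagination and basic
error surfacing; the ``items`` and ``next`` response fields are assumptions
about the API's shape, so adapt them to the service you are calling.

.. code-block:: python

   import requests
   from llama_index.core import Document

   def load_paginated_api(api_url: str, headers: dict = None,
                          max_pages: int = 10) -> list:
       """Fetch a paginated JSON API page by page until exhausted."""
       documents = []
       url = api_url
       for _ in range(max_pages):
           response = requests.get(url, headers=headers, timeout=30)
           response.raise_for_status()  # surface HTTP errors early
           data = response.json()
           for item in data.get("items", []):  # assumed response field
               documents.append(Document(
                   text=item.get("content", ""),
                   metadata={"id": item.get("id"), "source": "api", "url": url},
               ))
           url = data.get("next")  # assumed "next page" field
           if not url:
               break
       return documents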
Loading from Cloud Storage
==========================

AWS S3
------

.. code-block:: python

   # pip install llama-index-readers-s3
   from llama_index.readers.s3 import S3Reader

   s3_reader = S3Reader(
       bucket="my-bucket",
       prefix="documents/",  # optional key prefix
       aws_access_id="your-access-key",
       aws_access_secret="your-secret-key"
   )
   documents = s3_reader.load_data()

Google Drive
------------

.. code-block:: python

   # pip install llama-index-readers-google
   from llama_index.readers.google import GoogleDriveReader

   drive_reader = GoogleDriveReader(
       credentials_path="./credentials.json"
   )

   # Load a specific folder
   documents = drive_reader.load_data(folder_id="your-folder-id")

Knowledge Base Platforms
========================

Notion
------

.. code-block:: python

   # pip install llama-index-readers-notion
   from llama_index.readers.notion import NotionPageReader

   notion_reader = NotionPageReader(
       integration_token="your-notion-token"
   )

   # Load specific pages
   documents = notion_reader.load_data(
       page_ids=["page-id-1", "page-id-2"]
   )

   # Or load an entire database
   documents = notion_reader.load_data(
       database_id="your-database-id"
   )

Confluence
----------

.. code-block:: python

   # pip install llama-index-readers-confluence
   from llama_index.readers.confluence import ConfluenceReader

   confluence_reader = ConfluenceReader(
       base_url="https://your-domain.atlassian.net/wiki",
       user="your-email@example.com",
       api_token="your-api-token"
   )

   # Load a specific space
   documents = confluence_reader.load_data(
       space_key="MYSPACE",
       include_attachments=True
   )

Custom Readers
==============

Writing a Custom Data Loader
----------------------------

.. code-block:: python

   from llama_index.core.readers.base import BaseReader
   from llama_index.core import Document
   from typing import List
   import json

   class CustomJSONReader(BaseReader):
       """Custom JSON file reader."""

       def __init__(self, text_field: str = "content"):
           self.text_field = text_field

       def load_data(self, file_path: str) -> List[Document]:
           with open(file_path, 'r', encoding='utf-8') as f:
               data = json.load(f)

           documents = []
           items = data if isinstance(data, list) else [data]

           for item in items:
               text = item.get(self.text_field, "")
               metadata = {k: v for k, v in item.items()
                           if k != self.text_field}
               doc = Document(text=text, metadata=metadata)
               documents.append(doc)

           return documents

   # Use the custom reader
   reader = CustomJSONReader(text_field="body")
   documents = reader.load_data("./data/articles.json")

Bulk Loading Multiple Formats
-----------------------------

.. code-block:: python

   from pathlib import Path
   from typing import List
   from llama_index.core import Document
   from llama_index.readers.file import PDFReader

   class MultiFormatLoader:
       """Bulk loader that dispatches on file extension."""

       def __init__(self):
           self.readers = {}

       def register_reader(self, extension: str, reader):
           """Register a reader for a file extension."""
           self.readers[extension] = reader

       def load_directory(self, directory: str) -> List[Document]:
           """Load every supported file under a directory."""
           documents = []
           dir_path = Path(directory)

           for file_path in dir_path.rglob("*"):
               if file_path.is_file():
                   ext = file_path.suffix.lower()
                   if ext in self.readers:
                       docs = self.readers[ext].load_data(str(file_path))
                       documents.extend(docs)

           return documents

   # Usage example
   loader = MultiFormatLoader()
   loader.register_reader(".pdf", PDFReader())
   loader.register_reader(".json", CustomJSONReader())

   all_docs = loader.load_directory("./data")

Data Preprocessing
==================

Cleaning Documents
------------------

.. code-block:: python

   import re
   from llama_index.core import Document

   def clean_document(doc: Document) -> Document:
       """Clean a document's text content."""
       text = doc.text

       # Collapse runs of whitespace
       text = re.sub(r'\s+', ' ', text)

       # Drop special characters (keep word chars, CJK, common punctuation)
       text = re.sub(r'[^\w\s\u4e00-\u9fff.,!?;:\'\"()-]', '', text)

       # Trim leading/trailing whitespace
       text = text.strip()

       return Document(text=text, metadata=doc.metadata)

   # Clean in bulk
   cleaned_docs = [clean_document(doc) for doc in documents]

Adding Global Metadata
----------------------

.. code-block:: python

   from typing import List
   from llama_index.core import Document

   def add_global_metadata(documents: List[Document],
                           metadata: dict) -> List[Document]:
       """Attach shared metadata to every document."""
       for doc in documents:
           doc.metadata.update(metadata)
       return documents

   # Add project information
   documents = add_global_metadata(documents, {
       "project": "knowledge-base",
       "version": "1.0",
       "indexed_at": "2024-01-01"
   })
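When the same content arrives from several sources, duplicates inflate the
index and skew retrieval. The following is a minimal deduplication sketch to
run alongside cleaning; the ``content_hash`` metadata key is our own
convention, not a LlamaIndex field.

.. code-block:: python

   import hashlib
   from typing import List
   from llama_index.core import Document

   def deduplicate(documents: List[Document]) -> List[Document]:
       """Drop documents whose normalized text was already seen."""
       seen = set()
       unique = []
       for doc in documents:
           digest = hashlib.sha256(doc.text.strip().encode("utf-8")).hexdigest()
           if digest in seen:
               continue
           seen.add(digest)
           doc.metadata["content_hash"] = digest  # our own key, for auditing
           unique.append(doc)
       return unique

   documents = deduplicate(documents)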
Worked Example: Unifying Multiple Data Sources
==============================================

.. code-block:: python

   from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
   from llama_index.readers.database import DatabaseReader
   from llama_index.readers.web import SimpleWebPageReader

   def build_unified_knowledge_base():
       """Build a unified knowledge base from several sources."""
       all_documents = []

       # 1. Load local documents
       print("Loading local documents...")
       local_reader = SimpleDirectoryReader(
           input_dir="./documents",
           recursive=True,
           required_exts=[".pdf", ".md", ".txt"]
       )
       local_docs = local_reader.load_data()
       for doc in local_docs:
           doc.metadata["source_type"] = "local_file"
       all_documents.extend(local_docs)
       print(f"  - Loaded {len(local_docs)} local documents")

       # 2. Load database records
       print("Loading database records...")
       db_reader = DatabaseReader(uri="sqlite:///knowledge.db")
       db_docs = db_reader.load_data(
           query="SELECT title, content FROM knowledge WHERE active = 1"
       )
       for doc in db_docs:
           doc.metadata["source_type"] = "database"
       all_documents.extend(db_docs)
       print(f"  - Loaded {len(db_docs)} database records")

       # 3. Load web pages
       print("Loading web pages...")
       web_reader = SimpleWebPageReader(html_to_text=True)
       web_docs = web_reader.load_data(urls=[
           "https://example.com/doc1",
           "https://example.com/doc2"
       ])
       for doc in web_docs:
           doc.metadata["source_type"] = "web"
       all_documents.extend(web_docs)
       print(f"  - Loaded {len(web_docs)} web pages")

       # 4. Build a unified index
       print(f"\nLoaded {len(all_documents)} documents in total")
       print("Building the vector index...")
       index = VectorStoreIndex.from_documents(all_documents)

       return index

   # Build the knowledge base
   knowledge_index = build_unified_knowledge_base()

   # Query it
   query_engine = knowledge_index.as_query_engine()
   response = query_engine.query("Summarize our product's key features")
   print(response)

Summary
=======

This tutorial covered:

- How to use SimpleDirectoryReader
- Loading from a range of sources: PDF, databases, the web, and more
- Loading from cloud storage and knowledge base platforms
- Writing a custom Reader
- Data preprocessing and cleaning
- Best practices for unifying multiple data sources

Next Steps
----------

In the next tutorial we will look at node parsing, i.e. how to split
documents intelligently into smaller, indexable units.

Exercises
=========

1. Load your own local documents with SimpleDirectoryReader
2. Write a custom Reader for a data format of your choice
3. Load data from a database or an API
4. Build a knowledge base that unifies multiple data sources (a starter
   hint follows below)
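Hint for exercise 4: once the index is built, persist it so the sources are
not re-loaded on every run. A minimal sketch, reusing ``knowledge_index``
from the worked example above; the ``./storage`` path is illustrative.

.. code-block:: python

   from llama_index.core import StorageContext, load_index_from_storage

   # After building: save the index to disk
   knowledge_index.storage_context.persist(persist_dir="./storage")

   # On later runs: reload instead of re-ingesting every source
   storage_context = StorageContext.from_defaults(persist_dir="./storage")
   index = load_index_from_storage(storage_context)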