Design Document: Multi-Repo GitHub Documentation Indexing
Overview
This document describes the design for extending the RAG Indexer Lambda to support multi-repository GitHub documentation indexing. The feature enables the system to fetch, process, and index documentation from multiple GitHub repositories configured via the GITHUB_REPOS environment variable.
Key Changes from Current Implementation
| Aspect | Current | Multi-Repo Extension |
|---|---|---|
| Repository Configuration | Single repo via environment variables | Comma-separated list of owner/repo entries |
| File Path Storage | Simple file paths | Repo-prefixed paths (owner/repo/path/to/file.md) |
| Embedding Storage | Single source (github) | Repo-prefixed source files |
| Documentation Generation | Single repo info | Aggregated from all configured repos |
| Error Handling | Fail on any error | Continue on repo-specific errors |
Architecture Overview
graph TD
A[Event Trigger] --> B{Event Type}
B -->|Scheduled Event| C[Multi-Repo GitHub Indexing]
B -->|S3 Event| D[Single S3 File Indexing]
B -->|Manual| C
B -->|Manual| E[Full S3 Scan]
C --> F[Parse GITHUB_REPOS Config]
F --> G[Iterate Repositories]
G --> H[Fetch Files from GitHub]
H --> I[Change Detection via SHA]
I --> J[Process Files]
J --> K[Generate Embeddings]
K --> L[Store in DynamoDB]
L --> M[Generate Project Documentation]
M --> N[Upload to S3]
D --> O[Fetch Single S3 File]
O --> P[Process File]
P --> Q[Store Embedding]
style C fill:#e1f5ff
style M fill:#e1f5ff
style N fill:#e1f5ff
Architecture
Component Diagram
graph TB
subgraph "Lambda Handler"
A[RAGIndexerHandler] --> B[indexFromGitHub]
A --> C[indexFromS3]
A --> D[indexSingleS3File]
end
subgraph "Multi-Repo Indexing"
B --> E[ConfigParser]
B --> F[GitHubFetcher]
B --> G[EmbeddingGenerator]
B --> H[DocumentationGenerator]
end
subgraph "Services"
F --> I[GitHubClient]
G --> J[EmbeddingClient]
H --> K[S3DataClient]
end
subgraph "Persistence"
L[DynamoDB] --> M[EmbeddingsStore]
L --> N[FetcherStateStore]
K --> O[S3Bucket]
end
style A fill:#e1f5ff
style B fill:#e1f5ff
style E fill:#fff4e1
style F fill:#e1f5ff
style G fill:#e1f5ff
style H fill:#fff4e1
Data Flow Diagram
sequenceDiagram
participant Event as Event Source
participant Handler as RAGIndexerHandler
participant Config as ConfigParser
participant GitHub as GitHubClient
participant State as FetcherStateStore
participant Embed as EmbeddingGenerator
participant Store as EmbeddingsStore
participant DocGen as DocumentationGenerator
participant S3 as S3DataClient
Event->>Handler: Scheduled Event
Handler->>Config: Parse GITHUB_REPOS
Config-->>Handler: Repo List
loop For Each Repository
Handler->>State: Get Previous Hashes
State-->>Handler: Hash Map
Handler->>GitHub: List Contents (owner/repo/docs/)
GitHub-->>Handler: File Entries
Handler->>GitHub: Download Changed Files
GitHub-->>Handler: File Content
Handler->>State: Save Updated Hashes
end
Handler->>Embed: Generate Embeddings
Embed-->>Handler: Embedding Records
Handler->>Store: Delete Old GitHub Embeddings
Store-->>Handler: Ack
Handler->>Store: Batch Write Embeddings
Store-->>Handler: Ack
Handler->>DocGen: Generate Documentation
DocGen-->>Handler: ProjectDocumentation
Handler->>S3: PutObject profile-data/project-summary.json
S3-->>Handler: ETag
Handler-->>Event: IndexingResult
Components and Interfaces
Modified Interfaces
RAGIndexerDeps
export interface RAGIndexerDeps {
  gitHubClient: GitHubClient;
  gitHubRepos: Array<{ owner: string; name: string; docsPath: string }>;
  s3Client: S3DataClient;
  s3Config: S3FetcherConfig;
  embeddingClient: EmbeddingClient;
  embeddingsStore: EmbeddingsWriteStore;
  fetcherStateStore: FetcherStateStore;
}
Changes:
- gitHubRepos changed from single repo config to array of repo configurations
- Each repo includes owner, name, and optional docsPath (defaults to docs)
GitHubFile
export interface GitHubFile {
  path: string; // Now includes repo prefix: owner/repo/path/to/file.md
  content: string;
  sha: string;
  lastModified: number;
}
Changes:
- path now includes repo prefix for namespace isolation
New Interfaces
ProjectDocumentation
export interface ProjectDocumentation {
  version: string;
  generatedAt: string;
  repoInfo: {
    owner: string;
    name: string;
    branch: string;
  }[];
  summary: {
    description: string;
    techStack: string[]; // Limited to 10 entries
    keyFeatures: string[];
    documentationFiles: string[]; // Limited to 20 entries
  };
  structure: {
    directories: string[]; // Limited to 50 entries
    mainFiles: string[];
  };
  apiEndpoints: {
    path: string;
    method: string;
    description: string;
  }[];
  dataModels: {
    name: string;
    type: string;
    description: string;
  }[]; // Limited to 100 entries
}
Modified Functions
indexFromGitHub
Signature:
export async function indexFromGitHub(
  deps: RAGIndexerDeps,
): Promise<IndexingResult>
Changes:
- Now iterates through all configured repositories
- Uses repo-prefixed paths for file storage
- Generates and uploads project documentation
- Continues processing on repo-specific errors
Algorithm:
- Parse GITHUB_REPOS environment variable into repo configurations
- Load previous SHA hashes from FetcherStateStore
- For each repository:
  - Filter previous hashes to repo-specific entries
  - Fetch files from GitHub with change detection
  - Add repo prefix to file paths
  - Update hash map with repo-prefixed paths
- Process all changed files through RAG pipeline
- Delete old GitHub embeddings from DynamoDB
- Batch write new embeddings (25 per batch)
- Generate project documentation
- Upload documentation to S3
- Save updated hash map
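The "filter previous hashes to repo-specific entries" step could look roughly like the sketch below. It assumes the stored hash map is a plain object keyed by repo-prefixed path; the names HashMap and hashesForRepo are illustrative and do not come from the existing codebase.

```typescript
// Assumption: previous hashes are kept as a plain object of
// repo-prefixed path -> SHA, which serializes directly into the state record.
type HashMap = Record<string, string>;

// Returns only the entries that belong to the given repository.
function hashesForRepo(all: HashMap, owner: string, repo: string): HashMap {
  // The trailing slash keeps "acme/api" from also matching "acme/api-v2".
  const prefix = `${owner}/${repo}/`;
  const scoped: HashMap = {};
  for (const filePath of Object.keys(all)) {
    const sha = all[filePath];
    if (sha !== undefined && filePath.startsWith(prefix)) {
      scoped[filePath] = sha;
    }
  }
  return scoped;
}
```

Matching on the full owner/repo/ prefix, including the trailing slash, is what keeps repositories with similar names from sharing hash entries.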
processFile
Signature:
export async function processFile(
  content: string,
  source: 'github' | 's3',
  sourceFile: string, // Now includes repo prefix
  lastModified: number,
  embeddingClient: EmbeddingClient,
): Promise<{ records: EmbeddingRecord[]; errors: string[] }>
Changes:
- No functional changes, but sourceFile now includes repo prefix
- Raw text handling unchanged (2048 char truncation, 'other' section type)
New Functions
generateProjectDocumentation
Signature:
export async function generateProjectDocumentation(
  repos: Array<{ owner: string; name: string }>,
  files: GitHubFile[],
): Promise<ProjectDocumentation>
Purpose: Extract structured documentation from fetched GitHub files
Algorithm:
- Initialize empty collections for each data type
- For each file:
  - Extract directory structure from path
  - Categorize as documentation if docs/ or .md
  - Parse package.json for tech stack
  - Parse infrastructure/template.yaml for API endpoints
  - Parse TypeScript files for data models
- Apply limits to each collection
- Build and return ProjectDocumentation object
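As a rough illustration of the per-file pass, the sketch below collects directories, documentation files, and tech stack entries. Several details are assumptions rather than part of this design: "docs/" is matched as a whole path segment, the tech stack is taken from the dependencies keys of package.json, and the names FileLike and collectFromFiles are hypothetical. API endpoint and data model extraction are omitted.

```typescript
// Hypothetical reduced view of GitHubFile: only the fields this pass reads.
interface FileLike {
  path: string;
  content: string;
}

interface CollectedStructure {
  directories: string[];
  documentationFiles: string[];
  techStack: string[];
}

function collectFromFiles(files: FileLike[]): CollectedStructure {
  const directories: string[] = [];
  const documentationFiles: string[] = [];
  const techStack: string[] = [];
  const pushUnique = (list: string[], value: string): void => {
    if (list.indexOf(value) === -1) list.push(value);
  };

  for (const file of files) {
    const segments = file.path.split('/');
    const baseName = segments[segments.length - 1] ?? '';
    const dirSegments = segments.slice(0, -1);

    // Directory structure: the path without its file name.
    if (dirSegments.length > 0) pushUnique(directories, dirSegments.join('/'));

    // Documentation: a docs directory segment or a .md file name (assumed reading).
    if (dirSegments.indexOf('docs') !== -1 || baseName.endsWith('.md')) {
      pushUnique(documentationFiles, file.path);
    }

    // Tech stack: assumed to be the dependency names listed in package.json.
    if (baseName === 'package.json') {
      try {
        const pkg = JSON.parse(file.content) as { dependencies?: Record<string, string> };
        for (const dep of Object.keys(pkg.dependencies ?? {})) pushUnique(techStack, dep);
      } catch {
        // A malformed package.json is skipped instead of failing the whole run.
      }
    }
  }
  return { directories, documentationFiles, techStack };
}
```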
Limits:
- Tech stack: 10 entries
- Documentation files: 20 entries
- Directories: 50 entries
- Data models: 100 entries
- API endpoints: 50 entries
generateAndUploadDocumentation
Signature:
async function generateAndUploadDocumentation(
  repos: Array<{ owner: string; name: string }>,
  files: GitHubFile[],
  s3Client: S3DataClient,
  bucket: string,
): Promise<void>
Purpose: Generate and upload project documentation to S3
Algorithm:
- Call generateProjectDocumentation
- Serialize as pretty-printed JSON (2-space indentation)
- Upload to profile-data/project-summary.json
- Log success or error
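The serialize-and-upload steps might be sketched as follows. The S3DataClient interface is not shown in this document, so the PutObjectClient interface and its putObject signature are placeholders invented for the example, as are the function names.

```typescript
// Placeholder client: the real S3DataClient method names and signatures may differ.
interface PutObjectClient {
  putObject(bucket: string, key: string, body: string, contentType: string): Promise<void>;
}

const PROJECT_SUMMARY_KEY = 'profile-data/project-summary.json';

// Pretty-prints with 2-space indentation, as required for the uploaded file.
function serializeProjectSummary(doc: unknown): string {
  return JSON.stringify(doc, null, 2);
}

// Uploads the summary; failures are logged and reported, never thrown,
// so that an S3 problem does not abort indexing.
async function uploadProjectSummary(
  doc: unknown,
  client: PutObjectClient,
  bucket: string,
): Promise<boolean> {
  try {
    await client.putObject(bucket, PROJECT_SUMMARY_KEY, serializeProjectSummary(doc), 'application/json');
    console.log(`Uploaded ${PROJECT_SUMMARY_KEY} to ${bucket}`);
    return true;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Failed to upload ${PROJECT_SUMMARY_KEY}: ${message}`);
    return false;
  }
}
```

Returning a boolean instead of rethrowing matches the "log error, continue indexing" handling described for S3 upload errors.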
Data Models
EmbeddingRecord
export interface EmbeddingRecord {
  chunkId: string;
  embedding: number[];
  content: string;
  source: 'github' | 's3';
  sectionType: string;
  metadata: {
    sourceFile: string; // Repo-prefixed: owner/repo/path/to/file.md
    lastUpdated: number;
    chunkIndex: number;
  };
  updatedAt: number;
}
Changes:
- metadata.sourceFile now includes repo prefix
IndexingResult
export interface IndexingResult {
  source: 'github' | 's3';
  repo?: string; // Optional repo identifier for multi-repo indexing
  filesProcessed: number;
  chunksGenerated: number;
  embeddingsStored: number;
  errors: string[];
}
Changes:
- Added optional repo field for tracking per-repo results
ProjectDocumentation
See "New Interfaces" section above.
Correctness Properties
A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.
Property 1: Repo Configuration Parsing
For any comma-separated string of owner/repo entries, parsing the GITHUB_REPOS environment variable SHALL produce an array of repository configurations with trimmed owner and repo names, and default docsPath set to docs.
Validates: Requirements 1.1, 1.2, 1.3
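A minimal parsing sketch for this property follows. The names RepoConfig and parseGitHubRepos are illustrative, and returning malformed entries in a separate list (instead of throwing or falling back to defaults) is an assumption made for the example.

```typescript
// Illustrative shape; mirrors the element type of gitHubRepos.
interface RepoConfig {
  owner: string;
  name: string;
  docsPath: string;
}

// Parses a GITHUB_REPOS-style value such as "owner-a/repo-a, owner-b/repo-b".
function parseGitHubRepos(raw: string): { repos: RepoConfig[]; invalid: string[] } {
  const repos: RepoConfig[] = [];
  const invalid: string[] = [];
  for (const entry of raw.split(',')) {
    const trimmed = entry.trim();
    if (trimmed === '') continue; // tolerate blank entries and trailing commas
    const parts = trimmed.split('/').map((part) => part.trim());
    const owner = parts[0] ?? '';
    const repoName = parts[1] ?? '';
    if (parts.length !== 2 || owner === '' || repoName === '') {
      invalid.push(trimmed); // assumption: the caller decides how to log these
      continue;
    }
    repos.push({ owner, name: repoName, docsPath: 'docs' }); // default docsPath
  }
  return { repos, invalid };
}
```

A caller could log the invalid entries and fall back to the preconfigured repository list when repos comes back empty.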
Property 2: Multi-Repo File Fetching
For any set of configured repositories, when a scheduled indexing event occurs, the RAG_Indexer SHALL fetch files from all repositories and no files shall be missed due to early termination.
Validates: Requirements 2.1, 2.4
Property 3: File Extension Filtering
For any set of files in a GitHub repository directory, filtering to .md files SHALL include only files with the .md extension and exclude all other files.
Validates: Requirement 2.2
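One possible filter is sketched below. It relies on directory listing entries carrying a name and a type, as the GitHub contents API returns; treating the extension match as case-sensitive and skipping non-file entries are assumptions, and filterMarkdownFiles is a hypothetical name.

```typescript
// Keeps only regular files whose name ends in ".md".
function filterMarkdownFiles<T extends { name: string; type: string }>(entries: T[]): T[] {
  // Assumption: the match is case-sensitive, so "NOTES.MD" is excluded.
  return entries.filter((entry) => entry.type === 'file' && entry.name.endsWith('.md'));
}
```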
Property 4: Change Detection Accuracy
For any file in a GitHub repository, comparing the current SHA hash against the previously stored SHA SHALL correctly identify unchanged files (skip download) and changed files (download content).
Validates: Requirements 3.2, 3.3, 3.4
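The comparison itself can be sketched as a partition of the remote listing. RemoteEntry and detectChanges are illustrative names, and the previous hashes are assumed to be a plain object keyed by the same path that appears in the listing.

```typescript
// Illustrative listing entry: a path plus the blob SHA reported by GitHub.
interface RemoteEntry {
  path: string;
  sha: string;
}

// Splits entries into those that need a download and those that can be skipped.
function detectChanges(
  entries: RemoteEntry[],
  previous: Record<string, string>,
): { changed: RemoteEntry[]; unchanged: RemoteEntry[] } {
  const changed: RemoteEntry[] = [];
  const unchanged: RemoteEntry[] = [];
  for (const entry of entries) {
    // A missing previous hash counts as changed, so new files are downloaded.
    if (previous[entry.path] === entry.sha) {
      unchanged.push(entry);
    } else {
      changed.push(entry);
    }
  }
  return { changed, unchanged };
}
```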
Property 5: Repo-Prefixed Path Namespace
For any file fetched from any repository, prepending the owner/repo/ prefix to the file path SHALL produce a unique namespace that prevents path collisions across repositories.
Validates: Requirements 4.1, 4.2
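A small sketch of the prefixing step follows; toRepoPrefixedPath is a hypothetical helper, and stripping leading slashes from the file path is an assumption about how paths arrive.

```typescript
// Builds the owner/repo/path/to/file.md form used for storage.
function toRepoPrefixedPath(owner: string, repo: string, filePath: string): string {
  const relative = filePath.replace(/^\/+/, ''); // avoid "owner/repo//docs/..."
  return `${owner}/${repo}/${relative}`;
}
```

Two repositories that both contain docs/README.md therefore map to distinct keys such as acme/api/docs/README.md and acme/web/docs/README.md.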
Property 6: Embedding Storage Isolation
For any set of embeddings, storing them with source: 'github' and repo-prefixed metadata.sourceFile SHALL ensure that deleting GitHub embeddings does not affect S3 embeddings.
Validates: Requirements 4.2, 4.3
Property 7: Batch Write Size Limit
For any set of embedding records, writing to DynamoDB SHALL use batch sizes of at most 25 records per BatchWrite request.
Validates: Requirement 4.4
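Splitting records into batches could be done with a generic helper like the one below. The names toBatches and DYNAMODB_BATCH_LIMIT are illustrative; the limit of 25 matches the maximum number of put or delete requests DynamoDB accepts in one BatchWriteItem call.

```typescript
const DYNAMODB_BATCH_LIMIT = 25;

// Splits items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number = DYNAMODB_BATCH_LIMIT): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError(`Batch size must be a positive integer, got ${size}`);
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}
```

For 60 records this yields batches of 25, 25, and 10.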
Property 8: Documentation Generation Completeness
For any set of fetched GitHub files, generating project documentation SHALL extract all directories, tech stack entries, API endpoints, and data models that meet the categorization criteria.
Validates: Requirements 6.2, 6.3, 6.4, 6.5, 6.6
Property 9: Documentation Generation Limits
For any set of extracted documentation data, applying limits SHALL ensure that tech stack has at most 10 entries, documentation files has at most 20 entries, directories has at most 50 entries, and data models has at most 100 entries.
Validates: Requirement 6.7
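Applying the four limits named in this property might look like the sketch below. Keeping the first N entries in collection order is an assumption; the design does not state which entries survive. DOC_LIMITS, CollectedDocs, and applyDocumentationLimits are hypothetical names.

```typescript
// The four limits covered by Property 9.
const DOC_LIMITS = {
  techStack: 10,
  documentationFiles: 20,
  directories: 50,
  dataModels: 100,
} as const;

interface DataModelEntry {
  name: string;
  type: string;
  description: string;
}

interface CollectedDocs {
  techStack: string[];
  documentationFiles: string[];
  directories: string[];
  dataModels: DataModelEntry[];
}

// Returns a copy with every collection truncated to its limit (first N kept).
function applyDocumentationLimits(collected: CollectedDocs): CollectedDocs {
  return {
    techStack: collected.techStack.slice(0, DOC_LIMITS.techStack),
    documentationFiles: collected.documentationFiles.slice(0, DOC_LIMITS.documentationFiles),
    directories: collected.directories.slice(0, DOC_LIMITS.directories),
    dataModels: collected.dataModels.slice(0, DOC_LIMITS.dataModels),
  };
}
```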
Property 10: JSON Serialization Format
For any ProjectDocumentation object, serializing to JSON SHALL produce pretty-printed output with 2-space indentation.
Validates: Requirement 7.2
Property 11: Error Continuation
For any set of repositories, when an error occurs processing one repository, the RAG_Indexer SHALL continue processing remaining repositories and log the error.
Validates: Requirements 2.4, 7.3, 9.5
Property 12: HTTP Status Code Accuracy
For any indexing run, the returned HTTP status code SHALL be 200 when no errors occur, 207 when partial errors occur, and 500 when a fatal error occurs.
Validates: Requirements 9.2, 9.3, 9.4
Property 13: Raw Text Truncation
For any file content that does not parse as valid ProfileData JSON, truncating to 2048 characters before generating an embedding SHALL ensure the content length does not exceed 2048 characters.
Validates: Requirement 10.2
Property 14: Raw Text Section Type
For any file content that does not parse as valid ProfileData JSON, assigning the section type 'other' SHALL ensure the chunk's sectionType field is set to 'other'.
Validates: Requirement 10.3
Property 15: Chunk ID Pattern
For any raw text content, generating a chunk ID using the pattern {source}#other#{sanitized_file_path} SHALL produce a valid identifier with special characters replaced by underscores.
Validates: Requirement 10.4
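Properties 13 through 15 can be illustrated together in one small helper. The set of characters treated as "special" is not specified here, so the sketch assumes anything outside letters, digits, dot, underscore, and hyphen is replaced; buildRawTextChunk is a hypothetical name.

```typescript
const RAW_TEXT_MAX_CHARS = 2048;

interface RawTextChunk {
  chunkId: string;
  content: string;
  sectionType: 'other';
}

// Builds the chunk fields for content that is not valid ProfileData JSON.
function buildRawTextChunk(
  source: 'github' | 's3',
  sourceFile: string,
  content: string,
): RawTextChunk {
  // Assumption: every character outside [A-Za-z0-9._-] counts as special.
  const sanitized = sourceFile.replace(/[^A-Za-z0-9._-]/g, '_');
  return {
    chunkId: `${source}#other#${sanitized}`,
    // slice counts UTF-16 code units, so a surrogate pair at the boundary may be split.
    content: content.slice(0, RAW_TEXT_MAX_CHARS),
    sectionType: 'other',
  };
}
```

Under this assumed character set, the slashes of a repo-prefixed path also become underscores, so acme/api/docs/guide.md yields github#other#acme_api_docs_guide.md.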
Error Handling Strategy
Error Categories
| Category | Description | Handling |
|---|---|---|
| Repository-specific errors | GitHub API errors for a specific repo | Log and continue with other repos |
| Embedding generation errors | Bedrock API failures | Log error, continue with other chunks |
| S3 upload errors | Profile data bucket upload failures | Log error, continue indexing |
| Configuration errors | Invalid GITHUB_REPOS format | Log error, use default repos |
| Fatal errors | DynamoDB connection failures, critical infrastructure issues | Return 500 status |
Error Propagation
// Repository-specific errors are caught and logged
for (const repo of deps.gitHubRepos) {
  try {
    const fetchResult = await fetchFromGitHubWithConfig(...);
    // Process files...
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    result.errors.push(`Repository ${repo.owner}/${repo.name} failed: ${message}`);
    // Continue to next repo
  }
}

// Fatal errors are caught at the top level
try {
  // Main processing logic
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  console.error('RAG Indexer fatal error:', message);
  return { statusCode: 500, body: JSON.stringify({ error: message }) };
}
Status Code Logic
| Condition | Status Code | Body |
|---|---|---|
| No errors | 200 | {"filesProcessed": N, "embeddingsStored": N, "errors": [], "results": [...]} |
| Partial errors | 207 | Same as 200, but errors array is non-empty |
| Fatal error | 500 | {"error": "error message"} |
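The table can be turned into a small response builder, sketched below. Summing filesProcessed and embeddingsStored across the per-source results is an assumption about how the top-level counters are derived; ResultSummary, buildIndexingResponse, and buildFatalResponse are illustrative names.

```typescript
// Reduced view of IndexingResult: only the fields the response body needs.
interface ResultSummary {
  filesProcessed: number;
  embeddingsStored: number;
  errors: string[];
}

interface HandlerResponse {
  statusCode: 200 | 207 | 500;
  body: string;
}

// 200 when no result carries errors, 207 when at least one does.
function buildIndexingResponse(results: ResultSummary[]): HandlerResponse {
  let filesProcessed = 0;
  let embeddingsStored = 0;
  const errors: string[] = [];
  for (const result of results) {
    filesProcessed += result.filesProcessed;
    embeddingsStored += result.embeddingsStored;
    errors.push(...result.errors);
  }
  return {
    statusCode: errors.length === 0 ? 200 : 207,
    body: JSON.stringify({ filesProcessed, embeddingsStored, errors, results }),
  };
}

// 500 with the error message as the only body field.
function buildFatalResponse(message: string): HandlerResponse {
  return { statusCode: 500, body: JSON.stringify({ error: message }) };
}
```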
Testing Strategy
Dual Testing Approach
- Unit Tests: Verify specific examples, edge cases, and error conditions
- Property Tests: Verify universal properties across all inputs (when applicable)
Property-Based Testing
The following properties are suitable for property-based testing:
- Repo Configuration Parsing - Test with random comma-separated strings
- File Extension Filtering - Test with random file name lists
- Change Detection Accuracy - Test with random SHA pairs
- Repo-Prefixed Path Namespace - Test with random repo/file combinations
- Embedding Storage Isolation - Test with random embedding sets
- Batch Write Size Limit - Test with random record counts
- Documentation Generation Limits - Test with large datasets
- JSON Serialization Format - Test with random documentation objects
- Error Continuation - Test with random error scenarios
- Raw Text Truncation - Test with random text lengths
- Raw Text Section Type - Test with random non-JSON content
- Chunk ID Pattern - Test with random file paths
Example-Based Testing
The following behaviors are better tested with specific examples:
- HTTP Status Code Accuracy - Test specific error scenarios
- S3 Event Processing - Test with specific S3 event structures
- Scheduled Event Processing - Test with specific scheduled event structures
- Manual Invocation - Test with unknown event types
Integration Testing
The following require integration tests:
- GitHub API Integration - Test with real GitHub API calls
- DynamoDB Persistence - Test with real DynamoDB table
- S3 Upload - Test with real S3 bucket
- Embedding Generation - Test with real Bedrock API
Test Configuration
- Property-based tests: Minimum 100 iterations per property
- Tag format: Feature: multi-repo-indexing, Property {number}: {property_text}
- Integration tests: 1-3 representative examples per external service
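To make the iteration count and tag format concrete, here is a dependency-free sketch of a property check loop. A real suite would more likely use a property-testing library such as fast-check; the helpers seededRandom and forAll are invented for this example, and the property shown is the raw text truncation bound.

```typescript
// Small deterministic generator (linear congruential) so failures are reproducible.
function seededRandom(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Runs `property` against `iterations` generated values; throws on the first failure.
function forAll<T>(
  tag: string,
  generate: (rand: () => number) => T,
  property: (value: T) => boolean,
  iterations: number = 100,
  seed: number = 1,
): number {
  const rand = seededRandom(seed);
  for (let i = 0; i < iterations; i++) {
    const value = generate(rand);
    if (!property(value)) {
      throw new Error(`${tag} failed on iteration ${i} with input ${JSON.stringify(value)}`);
    }
  }
  return iterations;
}

// Example: content of any length up to 5000 never exceeds 2048 chars after truncation.
const truncationRuns = forAll(
  'Feature: multi-repo-indexing, Property 13: Raw Text Truncation',
  (rand) => Math.floor(rand() * 5000),
  (length) => 'x'.repeat(length).slice(0, 2048).length <= 2048,
);
```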
Test Files
| File | Purpose |
|---|---|
| backend/src/handlers/rag-indexer-handler.test.ts | Handler logic tests |
| backend/src/services/rag/github-fetcher.test.ts | GitHub fetcher tests |
| backend/src/services/rag/s3-fetcher.test.ts | S3 fetcher tests |
| backend/src/services/rag/embedding-service.test.ts | Embedding generation tests |
| backend/src/services/rag/chunker.test.ts | Chunking tests |
| scripts/drive-ingest.test.ts | Integration tests |
Implementation Notes
Environment Variables
| Variable | Format | Default | Description |
|---|---|---|---|
| GITHUB_REPOS | owner/repo,owner/repo,... | Preconfigured list | Comma-separated list of repositories to index |
| GITHUB_TOKEN | Token string | Required | GitHub Personal Access Token |
| EMBEDDINGS_TABLE | Table name | Required | DynamoDB table for embeddings |
| PROFILE_DATA_BUCKET | Bucket name | Required | S3 bucket for profile data |
DynamoDB Schema
Table: EMBEDDINGS_TABLE
Primary Key: chunkId (String)
Items:
- Embedding records: chunkId, embedding, content, source, sectionType, metadata, updatedAt
- State records: chunkId = __fetcher_state__#github_hashes or __fetcher_state__#s3_etags, data = hash/ETag map
S3 Object Keys
| Key | Purpose |
|---|---|
| profile-data/project-summary.json | Generated project documentation |
| profile-data/*.json | Profile data files (existing) |
GitHub API Rate Limits
- Unauthenticated: 60 requests/hour
- Authenticated: 5000 requests/hour
- Consider pagination for repositories with many files
Performance Considerations
- Batch Processing: Process files in parallel where possible
- Change Detection: Only fetch changed files to reduce API calls
- Batch Writes: Use DynamoDB batch write (25 items) for efficiency
- Error Isolation: Continue processing on non-fatal errors
Security Considerations
- GitHub Token: Store in AWS Secrets Manager or SSM Parameter Store
- Environment Variables: Encrypt at rest using AWS KMS
- S3 Upload: Use S3 bucket policies to restrict access