Design Document: Multi-Repo GitHub Documentation Indexing

Overview

This document describes the design for extending the RAG Indexer Lambda to support multi-repository GitHub documentation indexing. The feature enables the system to fetch, process, and index documentation from multiple GitHub repositories configured via the GITHUB_REPOS environment variable.

Key Changes from Current Implementation

Aspect	Current	Multi-Repo Extension
Repository Configuration	Single repo via environment variables	Comma-separated list of `owner/repo` entries
File Path Storage	Simple file paths	Repo-prefixed paths (`owner/repo/path/to/file.md`)
Embedding Storage	Single source (`github`)	Repo-prefixed source files
Documentation Generation	Single repo info	Aggregated from all configured repos
Error Handling	Fail on any error	Continue on repo-specific errors

Architecture Overview

graph TD
    A[Event Trigger] --> B{Event Type}
    B -->|Scheduled Event| C[Multi-Repo GitHub Indexing]
    B -->|S3 Event| D[Single S3 File Indexing]
    B -->|Manual| C
    B -->|Manual| E[Full S3 Scan]
    
    C --> F[Parse GITHUB_REPOS Config]
    F --> G[Iterate Repositories]
    G --> H[Fetch Files from GitHub]
    H --> I[Change Detection via SHA]
    I --> J[Process Files]
    J --> K[Generate Embeddings]
    K --> L[Store in DynamoDB]
    L --> M[Generate Project Documentation]
    M --> N[Upload to S3]
    
    D --> O[Fetch Single S3 File]
    O --> P[Process File]
    P --> Q[Store Embedding]
    
    style C fill:#e1f5ff
    style M fill:#e1f5ff
    style N fill:#e1f5ff

Architecture

Component Diagram

graph TB
    subgraph "Lambda Handler"
        A[RAGIndexerHandler] --> B[indexFromGitHub]
        A --> C[indexFromS3]
        A --> D[indexSingleS3File]
    end
    
    subgraph "Multi-Repo Indexing"
        B --> E[ConfigParser]
        B --> F[GitHubFetcher]
        B --> G[EmbeddingGenerator]
        B --> H[DocumentationGenerator]
    end
    
    subgraph "Services"
        F --> I[GitHubClient]
        G --> J[EmbeddingClient]
        H --> K[S3DataClient]
    end
    
    subgraph "Persistence"
        L[DynamoDB] --> M[EmbeddingsStore]
        L --> N[FetcherStateStore]
        K --> O[S3Bucket]
    end
    
    style A fill:#e1f5ff
    style B fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#e1f5ff
    style G fill:#e1f5ff
    style H fill:#fff4e1

Data Flow Diagram

sequenceDiagram
    participant Event as Event Source
    participant Handler as RAGIndexerHandler
    participant Config as ConfigParser
    participant GitHub as GitHubClient
    participant State as FetcherStateStore
    participant Embed as EmbeddingGenerator
    participant Store as EmbeddingsStore
    participant DocGen as DocumentationGenerator
    participant S3 as S3DataClient

    Event->>Handler: Scheduled Event
    Handler->>Config: Parse GITHUB_REPOS
    Config-->>Handler: Repo List
    
    loop For Each Repository
        Handler->>State: Get Previous Hashes
        State-->>Handler: Hash Map
        
        Handler->>GitHub: List Contents (owner/repo/docs/)
        GitHub-->>Handler: File Entries
        
        Handler->>GitHub: Download Changed Files
        GitHub-->>Handler: File Content
        
        Handler->>State: Save Updated Hashes
    end
    
    Handler->>Embed: Generate Embeddings
    Embed-->>Handler: Embedding Records
    
    Handler->>Store: Delete Old GitHub Embeddings
    Store-->>Handler: Ack
    
    Handler->>Store: Batch Write Embeddings
    Store-->>Handler: Ack
    
    Handler->>DocGen: Generate Documentation
    DocGen-->>Handler: ProjectDocumentation
    
    Handler->>S3: PutObject profile-data/project-summary.json
    S3-->>Handler: ETag
    
    Handler-->>Event: IndexingResult

Components and Interfaces

Modified Interfaces

RAGIndexerDeps

export interface RAGIndexerDeps {
  gitHubClient: GitHubClient;
  gitHubRepos: Array<{ owner: string; name: string; docsPath: string }>;
  s3Client: S3DataClient;
  s3Config: S3FetcherConfig;
  embeddingClient: EmbeddingClient;
  embeddingsStore: EmbeddingsWriteStore;
  fetcherStateStore: FetcherStateStore;
}

Changes:

gitHubRepos changed from single repo config to array of repo configurations
Each repo includes owner, name, and docsPath; the parser sets docsPath to docs by default, so the field is always present after parsing
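
The parsing step behind gitHubRepos could look roughly like the sketch below. The helper name parseGitHubRepos, the returned invalid list, and the rule that an entry must contain exactly one slash are assumptions made for illustration; only the trimming and the docs default come from this document.

```typescript
// Shape taken from RAGIndexerDeps.gitHubRepos.
interface RepoConfig {
  owner: string;
  name: string;
  docsPath: string;
}

// Hypothetical helper: the name, return shape, and validation rules are assumptions.
function parseGitHubRepos(
  raw: string,
  defaultDocsPath = 'docs',
): { repos: RepoConfig[]; invalid: string[] } {
  const repos: RepoConfig[] = [];
  const invalid: string[] = [];

  for (const entry of raw.split(',')) {
    const trimmed = entry.trim();
    if (trimmed === '') continue; // tolerate blank entries and trailing commas

    const parts = trimmed.split('/').map((part) => part.trim());
    const owner = parts[0] ?? '';
    const name = parts[1] ?? '';
    if (parts.length !== 2 || owner === '' || name === '') {
      invalid.push(trimmed); // the caller decides whether to log or fall back to defaults
      continue;
    }
    repos.push({ owner, name, docsPath: defaultDocsPath });
  }
  return { repos, invalid };
}
```

Returning the rejected entries instead of throwing lets the caller apply the "log error, use default repos" handling listed under Error Categories.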

GitHubFile

export interface GitHubFile {
  path: string;  // Now includes repo prefix: owner/repo/path/to/file.md
  content: string;
  sha: string;
  lastModified: number;
}

Changes:

path now includes repo prefix for namespace isolation

New Interfaces

ProjectDocumentation

export interface ProjectDocumentation {
  version: string;
  generatedAt: string;
  repoInfo: {
    owner: string;
    name: string;
    branch: string;
  }[];
  summary: {
    description: string;
    techStack: string[];      // Limited to 10 entries
    keyFeatures: string[];
    documentationFiles: string[];  // Limited to 20 entries
  };
  structure: {
    directories: string[];    // Limited to 50 entries
    mainFiles: string[];
  };
  apiEndpoints: {
    path: string;
    method: string;
    description: string;
  }[];  // Limited to 50 entries
  dataModels: {
    name: string;
    type: string;
    description: string;
  }[];  // Limited to 100 entries
}

Modified Functions

indexFromGitHub

Signature:

export async function indexFromGitHub(
  deps: RAGIndexerDeps,
): Promise<IndexingResult>

Changes:

Now iterates through all configured repositories
Uses repo-prefixed paths for file storage
Generates and uploads project documentation
Continues processing on repo-specific errors

Algorithm:

1. Parse GITHUB_REPOS environment variable into repo configurations
2. Load previous SHA hashes from FetcherStateStore
3. For each repository:
   - Filter previous hashes to repo-specific entries
   - Fetch files from GitHub with change detection
   - Add repo prefix to file paths
   - Update hash map with repo-prefixed paths
4. Process all changed files through RAG pipeline
5. Delete old GitHub embeddings from DynamoDB
6. Batch write new embeddings (25 per batch)
7. Generate project documentation
8. Upload documentation to S3
9. Save updated hash map
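
The per-repository hash filtering and path prefixing in the loop can be sketched as follows. The helper names and the HashMap alias are hypothetical; the owner/repo/ prefix format is the one described for GitHubFile.path.

```typescript
// Maps a repo-prefixed path to its last known SHA (hypothetical alias).
type HashMap = Record<string, string>;

interface RepoRef {
  owner: string;
  name: string;
}

// The trailing slash keeps "acme/api" from also matching "acme/api-v2".
function repoPrefix(repo: RepoRef): string {
  return `${repo.owner}/${repo.name}/`;
}

// Turns a repo-relative path into the stored, repo-prefixed form.
function toPrefixedPath(repo: RepoRef, relativePath: string): string {
  return repoPrefix(repo) + relativePath.replace(/^\/+/, '');
}

// Selects the entries that belong to one repo and strips the prefix again,
// so they can be compared with the repo-relative paths GitHub returns.
function hashesForRepo(all: HashMap, repo: RepoRef): HashMap {
  const prefix = repoPrefix(repo);
  const result: HashMap = {};
  for (const [path, sha] of Object.entries(all)) {
    if (path.startsWith(prefix)) {
      result[path.slice(prefix.length)] = sha;
    }
  }
  return result;
}
```

Whether the fetcher compares prefixed or repo-relative paths is not stated in this document; the sketch assumes repo-relative paths inside the fetch and prefixed paths everywhere else.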

processFile

Signature:

export async function processFile(
  content: string,
  source: 'github' | 's3',
  sourceFile: string,  // Now includes repo prefix
  lastModified: number,
  embeddingClient: EmbeddingClient,
): Promise<{ records: EmbeddingRecord[]; errors: string[] }>

Changes:

No functional changes, but sourceFile now includes repo prefix
Raw text handling unchanged (2048 char truncation, 'other' section type)

New Functions

generateProjectDocumentation

Signature:

export async function generateProjectDocumentation(
  repos: Array<{ owner: string; name: string }>,
  files: GitHubFile[],
): Promise<ProjectDocumentation>

Purpose: Extract structured documentation from fetched GitHub files

Algorithm:

1. Initialize empty collections for each data type
2. For each file:
   - Extract directory structure from path
   - Categorize as documentation if the path is under docs/ or ends in .md
   - Parse package.json for tech stack
   - Parse infrastructure/template.yaml for API endpoints
   - Parse TypeScript files for data models
3. Apply limits to each collection
4. Build and return ProjectDocumentation object
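
Two of the per-file steps are simple enough to sketch. Both helpers below are hypothetical: this document does not define what counts as a tech stack entry, so the sketch assumes it means the package names listed under dependencies and devDependencies, and it treats any path with a docs segment or a .md suffix as documentation.

```typescript
// Assumption: "docs/" means any path segment named docs; ".md" is matched case-insensitively.
function isDocumentationFile(path: string): boolean {
  return path.split('/').includes('docs') || path.toLowerCase().endsWith('.md');
}

// Assumption: tech stack entries are dependency names from package.json.
// Returns an empty list when the content is not a JSON object.
function extractTechStack(packageJsonText: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(packageJsonText);
  } catch {
    return [];
  }
  if (typeof parsed !== 'object' || parsed === null) return [];

  const pkg = parsed as Record<string, unknown>;
  const names = new Set<string>(); // de-duplicates packages listed in both sections
  for (const field of ['dependencies', 'devDependencies']) {
    const section = pkg[field];
    if (typeof section === 'object' && section !== null && !Array.isArray(section)) {
      for (const name of Object.keys(section)) names.add(name);
    }
  }
  return Array.from(names);
}
```

Swallowing the parse error keeps one malformed package.json from aborting documentation generation for the remaining files.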

Limits:

Tech stack: 10 entries
Documentation files: 20 entries
Directories: 50 entries
Data models: 100 entries
API endpoints: 50 entries
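
A minimal sketch of applying these limits is shown below. The constant and function names are hypothetical, and keeping the first N entries in encounter order is an assumption, since this document does not say which entries survive truncation.

```typescript
// Limits copied from the list above.
const COLLECTION_LIMITS = {
  techStack: 10,
  documentationFiles: 20,
  directories: 50,
  dataModels: 100,
  apiEndpoints: 50,
} as const;

interface ExtractedCollections<Model, Endpoint> {
  techStack: string[];
  documentationFiles: string[];
  directories: string[];
  dataModels: Model[];
  apiEndpoints: Endpoint[];
}

// Returns truncated copies; the input arrays are left untouched.
function applyLimits<Model, Endpoint>(
  collected: ExtractedCollections<Model, Endpoint>,
): ExtractedCollections<Model, Endpoint> {
  return {
    techStack: collected.techStack.slice(0, COLLECTION_LIMITS.techStack),
    documentationFiles: collected.documentationFiles.slice(0, COLLECTION_LIMITS.documentationFiles),
    directories: collected.directories.slice(0, COLLECTION_LIMITS.directories),
    dataModels: collected.dataModels.slice(0, COLLECTION_LIMITS.dataModels),
    apiEndpoints: collected.apiEndpoints.slice(0, COLLECTION_LIMITS.apiEndpoints),
  };
}
```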

generateAndUploadDocumentation

Signature:

async function generateAndUploadDocumentation(
  repos: Array<{ owner: string; name: string }>,
  files: GitHubFile[],
  s3Client: S3DataClient,
  bucket: string,
): Promise<void>

Purpose: Generate and upload project documentation to S3

Algorithm:

1. Call generateProjectDocumentation
2. Serialize as pretty-printed JSON (2-space indentation)
3. Upload to profile-data/project-summary.json
4. Log success or error
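
The serialization and upload-request steps can be sketched without committing to a particular client. The interface of S3DataClient is not given in this document, so the request shape below is an assumption modeled on the parameter names of the S3 PutObject operation, and buildDocumentationUpload is a hypothetical helper.

```typescript
const PROJECT_SUMMARY_KEY = 'profile-data/project-summary.json';

// Assumed request shape; field names follow the S3 PutObject parameters.
interface PutObjectRequest {
  Bucket: string;
  Key: string;
  Body: string;
  ContentType: string;
}

// Serializes the documentation object with 2-space indentation and
// pairs it with the fixed object key.
function buildDocumentationUpload(doc: object, bucket: string): PutObjectRequest {
  return {
    Bucket: bucket,
    Key: PROJECT_SUMMARY_KEY,
    Body: JSON.stringify(doc, null, 2),
    ContentType: 'application/json',
  };
}
```

Keeping the request construction free of I/O makes the 2-space formatting rule easy to verify in a unit test; the actual upload and logging stay in generateAndUploadDocumentation.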

Data Models

EmbeddingRecord

export interface EmbeddingRecord {
  chunkId: string;
  embedding: number[];
  content: string;
  source: 'github' | 's3';
  sectionType: string;
  metadata: {
    sourceFile: string;  // Repo-prefixed: owner/repo/path/to/file.md
    lastUpdated: number;
    chunkIndex: number;
  };
  updatedAt: number;
}

Changes:

metadata.sourceFile now includes repo prefix

IndexingResult

export interface IndexingResult {
  source: 'github' | 's3';
  repo?: string;  // Optional repo identifier for multi-repo indexing
  filesProcessed: number;
  chunksGenerated: number;
  embeddingsStored: number;
  errors: string[];
}

Changes:

Added optional repo field for tracking per-repo results

ProjectDocumentation

See "New Interfaces" section above.

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system: in essence, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Repo Configuration Parsing

For any comma-separated string of owner/repo entries, parsing the GITHUB_REPOS environment variable SHALL produce an array of repository configurations with trimmed owner and repo names, and default docsPath set to docs.

Validates: Requirements 1.1, 1.2, 1.3

Property 2: Multi-Repo File Fetching

For any set of configured repositories, when a scheduled indexing event occurs, the RAG_Indexer SHALL fetch files from all repositories and SHALL NOT skip any repository because of early termination.

Validates: Requirements 2.1, 2.4

Property 3: File Extension Filtering

For any set of files in a GitHub repository directory, filtering to .md files SHALL include only files with the .md extension and exclude all other files.

Validates: Requirement 2.2
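
A filter satisfying this property might look like the sketch below. The entry shape mirrors the name and type fields of a GitHub contents listing; the case-sensitive match on .md is an assumption, as this document does not say whether .MD should be accepted.

```typescript
// Minimal view of a directory listing entry.
interface DirectoryEntry {
  name: string;
  type: string; // 'file', 'dir', ...
}

// Keeps regular files whose name ends in ".md" (case-sensitive by assumption).
function filterMarkdownFiles<T extends DirectoryEntry>(entries: T[]): T[] {
  return entries.filter((entry) => entry.type === 'file' && entry.name.endsWith('.md'));
}
```

Checking type as well as the extension avoids treating a directory named archive.md as a document.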

Property 4: Change Detection Accuracy

For any file in a GitHub repository, comparing the current SHA hash against the previously stored SHA SHALL correctly identify unchanged files (skip download) and changed files (download content).

Validates: Requirements 3.2, 3.3, 3.4
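
The comparison itself is small. In this sketch, detectChanges and RemoteEntry are hypothetical names, and a file with no stored SHA is treated as changed so that new files are downloaded.

```typescript
interface RemoteEntry {
  path: string;
  sha: string;
}

// Splits a listing into files that must be downloaded and files that can be skipped.
function detectChanges(
  entries: RemoteEntry[],
  previousHashes: Record<string, string>,
): { changed: RemoteEntry[]; unchanged: RemoteEntry[] } {
  const changed: RemoteEntry[] = [];
  const unchanged: RemoteEntry[] = [];
  for (const entry of entries) {
    // A missing previous hash is undefined and never equals a SHA, so new files count as changed.
    if (previousHashes[entry.path] === entry.sha) {
      unchanged.push(entry);
    } else {
      changed.push(entry);
    }
  }
  return { changed, unchanged };
}
```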

Property 5: Repo-Prefixed Path Namespace

For any file fetched from any repository, prepending the owner/repo/ prefix to the file path SHALL produce a unique namespace that prevents path collisions across repositories.

Validates: Requirements 4.1, 4.2

Property 6: Embedding Storage Isolation

For any set of embeddings, storing them with source: 'github' and repo-prefixed metadata.sourceFile SHALL ensure that deleting GitHub embeddings does not affect S3 embeddings.

Validates: Requirements 4.2, 4.3

Property 7: Batch Write Size Limit

For any set of embedding records, writing to DynamoDB SHALL use batch sizes of at most 25 records per BatchWrite request.

Validates: Requirement 4.4
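
Splitting records into request-sized groups can be sketched as below; toBatches and the constant name are hypothetical. The sketch covers only the grouping, not the write itself.

```typescript
const DYNAMODB_BATCH_LIMIT = 25;

// Splits items into consecutive groups of at most batchSize elements.
function toBatches<T>(items: readonly T[], batchSize: number = DYNAMODB_BATCH_LIMIT): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}
```

A BatchWriteItem response can also report unprocessed items, so a complete writer would resubmit those; that retry loop is outside this sketch.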

Property 8: Documentation Generation Completeness

For any set of fetched GitHub files, generating project documentation SHALL extract all directories, tech stack entries, API endpoints, and data models that meet the categorization criteria.

Validates: Requirements 6.2, 6.3, 6.4, 6.5, 6.6

Property 9: Documentation Generation Limits

For any set of extracted documentation data, applying limits SHALL ensure that tech stack has at most 10 entries, documentation files has at most 20 entries, directories has at most 50 entries, and data models has at most 100 entries.

Validates: Requirement 6.7

Property 10: JSON Serialization Format

For any ProjectDocumentation object, serializing to JSON SHALL produce pretty-printed output with 2-space indentation.

Validates: Requirement 7.2

Property 11: Error Continuation

For any set of repositories, when an error occurs processing one repository, the RAG_Indexer SHALL continue processing remaining repositories and log the error.

Validates: Requirements 2.4, 7.3, 9.5

Property 12: HTTP Status Code Accuracy

For any indexing run, the returned HTTP status code SHALL be 200 when no errors occur, 207 when partial errors occur, and 500 when a fatal error occurs.

Validates: Requirements 9.2, 9.3, 9.4

Property 13: Raw Text Truncation

For any file content that does not parse as valid ProfileData JSON, truncating to 2048 characters before generating an embedding SHALL ensure the content length does not exceed 2048 characters.

Validates: Requirement 10.2

Property 14: Raw Text Section Type

For any file content that does not parse as valid ProfileData JSON, assigning the section type 'other' SHALL ensure the chunk's sectionType field is set to 'other'.

Validates: Requirement 10.3

Property 15: Chunk ID Pattern

For any raw text content, generating a chunk ID using the pattern {source}#other#{sanitized_file_path} SHALL produce a valid identifier with special characters replaced by underscores.

Validates: Requirement 10.4
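
Properties 13 to 15 can be illustrated together. The sketch below is not the actual processFile logic: buildRawTextChunk is a hypothetical helper, and "special characters" is assumed to mean every character outside A-Z, a-z, and 0-9.

```typescript
const RAW_TEXT_MAX_CHARS = 2048;

// Assumption: any character that is not an ASCII letter or digit becomes "_".
function sanitizeForChunkId(filePath: string): string {
  return filePath.replace(/[^A-Za-z0-9]/g, '_');
}

// Builds the identifying fields of a raw-text chunk.
function buildRawTextChunk(
  content: string,
  source: 'github' | 's3',
  sourceFile: string,
): { chunkId: string; sectionType: 'other'; content: string } {
  return {
    chunkId: `${source}#other#${sanitizeForChunkId(sourceFile)}`,
    sectionType: 'other',
    // slice counts UTF-16 code units, so the result never exceeds 2048 units.
    content: content.slice(0, RAW_TEXT_MAX_CHARS),
  };
}

// Example: 'acme/api/docs/read me.md' yields 'github#other#acme_api_docs_read_me_md'.
```

Under this sanitization rule, distinct paths such as a/b.md and a_b.md map to the same identifier, so a real implementation may need a stricter scheme if such paths can coexist.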

Error Handling Strategy

Error Categories

Category	Description	Handling
Repository-specific errors	GitHub API errors for a specific repo	Log and continue with other repos
Embedding generation errors	Bedrock API failures	Log error, continue with other chunks
S3 upload errors	Profile data bucket upload failures	Log error, continue indexing
Configuration errors	Invalid `GITHUB_REPOS` format	Log error, use default repos
Fatal errors	DynamoDB connection failures, critical infrastructure issues	Return 500 status

Error Propagation

// Repository-specific errors are caught and logged
for (const repo of deps.gitHubRepos) {
  try {
    const fetchResult = await fetchFromGitHubWithConfig(...);
    // Process files...
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    result.errors.push(`Repository ${repo.owner}/${repo.name} failed: ${message}`);
    // Continue to next repo
  }
}

// Fatal errors are caught at the top level
try {
  // Main processing logic
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  console.error('RAG Indexer fatal error:', message);
  return { statusCode: 500, body: JSON.stringify({ error: message }) };
}

Status Code Logic

Condition	Status Code	Body
No errors	200	`{"filesProcessed": N, "embeddingsStored": N, "errors": [], "results": [...]}`
Partial errors	207	Same as 200, but errors array is non-empty
Fatal error	500	`{"error": "error message"}`
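
The table translates into a small amount of code. The function and type names below are hypothetical, and the assumption that the top-level counters are sums over the per-source results is mine; the table only shows the body shape.

```typescript
interface RunSummary {
  filesProcessed: number;
  embeddingsStored: number;
  errors: string[];
}

interface HandlerResponse {
  statusCode: 200 | 207 | 500;
  body: string;
}

// 200 when the error list is empty, 207 otherwise.
function buildResponse(results: RunSummary[]): HandlerResponse {
  let filesProcessed = 0;
  let embeddingsStored = 0;
  const errors: string[] = [];
  for (const result of results) {
    filesProcessed += result.filesProcessed;
    embeddingsStored += result.embeddingsStored;
    errors.push(...result.errors);
  }
  return {
    statusCode: errors.length === 0 ? 200 : 207,
    body: JSON.stringify({ filesProcessed, embeddingsStored, errors, results }),
  };
}

// 500 with only the error message, matching the last table row.
function buildFatalResponse(error: unknown): HandlerResponse {
  const message = error instanceof Error ? error.message : String(error);
  return { statusCode: 500, body: JSON.stringify({ error: message }) };
}
```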

Testing Strategy

Dual Testing Approach

Unit Tests: Verify specific examples, edge cases, and error conditions
Property Tests: Verify universal properties across all inputs (when applicable)

Property-Based Testing

The following properties are suitable for property-based testing:

Repo Configuration Parsing - Test with random comma-separated strings
File Extension Filtering - Test with random file name lists
Change Detection Accuracy - Test with random SHA pairs
Repo-Prefixed Path Namespace - Test with random repo/file combinations
Embedding Storage Isolation - Test with random embedding sets
Batch Write Size Limit - Test with random record counts
Documentation Generation Limits - Test with large datasets
JSON Serialization Format - Test with random documentation objects
Error Continuation - Test with random error scenarios
Raw Text Truncation - Test with random text lengths
Raw Text Section Type - Test with random non-JSON content
Chunk ID Pattern - Test with random file paths

Example-Based Testing

The following behaviors are better tested with specific examples:

HTTP Status Code Accuracy - Test specific error scenarios
S3 Event Processing - Test with specific S3 event structures
Scheduled Event Processing - Test with specific scheduled event structures
Manual Invocation - Test with unknown event types

Integration Testing

The following require integration tests:

GitHub API Integration - Test with real GitHub API calls
DynamoDB Persistence - Test with real DynamoDB table
S3 Upload - Test with real S3 bucket
Embedding Generation - Test with real Bedrock API

Test Configuration

Property-based tests: Minimum 100 iterations per property
Tag format: Feature: multi-repo-indexing, Property {number}: {property_text}
Integration tests: 1-3 representative examples per external service
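
To show what a 100-iteration property test with this tag format involves, here is a dependency-free sketch. The checkProperty helper and the seeded generator are stand-ins written for illustration; a property-based testing library such as fast-check would normally provide generation and shrinking, and this document does not name the library in use.

```typescript
// Small seeded pseudo-random generator (mulberry32) so failures are reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical harness: runs the predicate on generated inputs and throws on the first failure.
function checkProperty<T>(
  label: string,
  generate: (rand: () => number) => T,
  holds: (input: T) => boolean,
  iterations = 100,
  seed = 1,
): void {
  const rand = mulberry32(seed);
  for (let i = 0; i < iterations; i++) {
    const input = generate(rand);
    if (!holds(input)) {
      throw new Error(`${label} failed on iteration ${i}`);
    }
  }
}

// Usage: Property 13 with random content lengths between 0 and 4999.
checkProperty(
  'Feature: multi-repo-indexing, Property 13: Raw Text Truncation',
  (rand) => 'x'.repeat(Math.floor(rand() * 5000)),
  (content) => content.slice(0, 2048).length <= 2048,
);
```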

Test Files

File	Purpose
`backend/src/handlers/rag-indexer-handler.test.ts`	Handler logic tests
`backend/src/services/rag/github-fetcher.test.ts`	GitHub fetcher tests
`backend/src/services/rag/s3-fetcher.test.ts`	S3 fetcher tests
`backend/src/services/rag/embedding-service.test.ts`	Embedding generation tests
`backend/src/services/rag/chunker.test.ts`	Chunking tests
`scripts/drive-ingest.test.ts`	Integration tests

Implementation Notes

Environment Variables

Variable	Format	Default	Description
`GITHUB_REPOS`	`owner/repo,owner/repo,...`	Preconfigured list	Comma-separated list of repositories to index
`GITHUB_TOKEN`	Token string	Required	GitHub Personal Access Token
`EMBEDDINGS_TABLE`	Table name	Required	DynamoDB table for embeddings
`PROFILE_DATA_BUCKET`	Bucket name	Required	S3 bucket for profile data

DynamoDB Schema

Table: EMBEDDINGS_TABLE

Primary Key: chunkId (String)

Items:

Embedding records: chunkId, embedding, content, source, sectionType, metadata, updatedAt
State records: chunkId = __fetcher_state__#github_hashes or __fetcher_state__#s3_etags, data = hash/ETag map

S3 Object Keys

Key	Purpose
`profile-data/project-summary.json`	Generated project documentation
`profile-data/*.json`	Profile data files (existing)

GitHub API Rate Limits

Unauthenticated: 60 requests/hour
Authenticated: 5000 requests/hour
Plan for large repositories: the contents API lists at most 1,000 files per directory, so larger directories require the Git Trees API
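
GitHub reports the remaining quota in the x-ratelimit-limit, x-ratelimit-remaining, and x-ratelimit-reset response headers, the last one as a UTC epoch time in seconds. The sketch below shows one way to turn them into a wait time; the helper names are hypothetical, and it assumes the HTTP client exposes headers as a plain object with lower-case keys.

```typescript
interface RateLimitInfo {
  limit: number;
  remaining: number;
  resetAtMs: number;
}

// Returns null when any of the three headers is missing or not numeric.
function parseRateLimit(headers: Record<string, string | undefined>): RateLimitInfo | null {
  const limit = Number(headers['x-ratelimit-limit']);
  const remaining = Number(headers['x-ratelimit-remaining']);
  const resetSeconds = Number(headers['x-ratelimit-reset']);
  if (![limit, remaining, resetSeconds].every((value) => Number.isFinite(value))) {
    return null;
  }
  return { limit, remaining, resetAtMs: resetSeconds * 1000 };
}

// Milliseconds to wait before the next request; 0 while quota above the reserve remains.
function msUntilRequestAllowed(info: RateLimitInfo, nowMs: number, reserve = 0): number {
  if (info.remaining > reserve) return 0;
  return Math.max(0, info.resetAtMs - nowMs);
}
```

With several repositories sharing one token, checking the remaining quota between repositories gives the indexer a chance to stop early and report a partial result instead of failing midway.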

Performance Considerations

Batch Processing: Process files in parallel where possible
Change Detection: Only fetch changed files to reduce API calls
Batch Writes: Use DynamoDB batch write (25 items) for efficiency
Error Isolation: Continue processing on non-fatal errors

Security Considerations

GitHub Token: Store in AWS Secrets Manager or SSM Parameter Store
Environment Variables: Encrypt at rest using AWS KMS
S3 Upload: Use S3 bucket policies to restrict access