Design Document: Multi-Repo GitHub Documentation Indexing

Multi-repo indexing design: chunking strategy, embedding pipeline, cross-repo retrieval scoring, and incremental update protocol.

September 15, 2025·11 min read

Overview

This document describes the design for extending the RAG Indexer Lambda to support multi-repository GitHub documentation indexing. The feature enables the system to fetch, process, and index documentation from multiple GitHub repositories configured via the GITHUB_REPOS environment variable.

Key Changes from Current Implementation

| Aspect | Current | Multi-Repo Extension |
| --- | --- | --- |
| Repository Configuration | Single repo via environment variables | Comma-separated list of owner/repo entries |
| File Path Storage | Simple file paths | Repo-prefixed paths (owner/repo/path/to/file.md) |
| Embedding Storage | Single source (github) | Repo-prefixed source files |
| Documentation Generation | Single repo info | Aggregated from all configured repos |
| Error Handling | Fail on any error | Continue on repo-specific errors |

Architecture Overview

graph TD
    A[Event Trigger] --> B{Event Type}
    B -->|Scheduled Event| C[Multi-Repo GitHub Indexing]
    B -->|S3 Event| D[Single S3 File Indexing]
    B -->|Manual| C
    B -->|Manual| E[Full S3 Scan]
    
    C --> F[Parse GITHUB_REPOS Config]
    F --> G[Iterate Repositories]
    G --> H[Fetch Files from GitHub]
    H --> I[Change Detection via SHA]
    I --> J[Process Files]
    J --> K[Generate Embeddings]
    K --> L[Store in DynamoDB]
    L --> M[Generate Project Documentation]
    M --> N[Upload to S3]
    
    D --> O[Fetch Single S3 File]
    O --> P[Process File]
    P --> Q[Store Embedding]
    
    style C fill:#e1f5ff
    style M fill:#e1f5ff
    style N fill:#e1f5ff

Architecture

Component Diagram

graph TB
    subgraph "Lambda Handler"
        A[RAGIndexerHandler] --> B[indexFromGitHub]
        A --> C[indexFromS3]
        A --> D[indexSingleS3File]
    end
    
    subgraph "Multi-Repo Indexing"
        B --> E[ConfigParser]
        B --> F[GitHubFetcher]
        B --> G[EmbeddingGenerator]
        B --> H[DocumentationGenerator]
    end
    
    subgraph "Services"
        F --> I[GitHubClient]
        G --> J[EmbeddingClient]
        H --> K[S3DataClient]
    end
    
    subgraph "Persistence"
        L[DynamoDB] --> M[EmbeddingsStore]
        L --> N[FetcherStateStore]
        K --> O[S3Bucket]
    end
    
    style A fill:#e1f5ff
    style B fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#e1f5ff
    style G fill:#e1f5ff
    style H fill:#fff4e1

Data Flow Diagram

sequenceDiagram
    participant Event as Event Source
    participant Handler as RAGIndexerHandler
    participant Config as ConfigParser
    participant GitHub as GitHubClient
    participant State as FetcherStateStore
    participant Embed as EmbeddingGenerator
    participant Store as EmbeddingsStore
    participant DocGen as DocumentationGenerator
    participant S3 as S3DataClient

    Event->>Handler: Scheduled Event
    Handler->>Config: Parse GITHUB_REPOS
    Config-->>Handler: Repo List
    
    loop For Each Repository
        Handler->>State: Get Previous Hashes
        State-->>Handler: Hash Map
        
        Handler->>GitHub: List Contents (owner/repo/docs/)
        GitHub-->>Handler: File Entries
        
        Handler->>GitHub: Download Changed Files
        GitHub-->>Handler: File Content
        
        Handler->>State: Save Updated Hashes
    end
    
    Handler->>Embed: Generate Embeddings
    Embed-->>Handler: Embedding Records
    
    Handler->>Store: Delete Old GitHub Embeddings
    Store-->>Handler: Ack
    
    Handler->>Store: Batch Write Embeddings
    Store-->>Handler: Ack
    
    Handler->>DocGen: Generate Documentation
    DocGen-->>Handler: ProjectDocumentation
    
    Handler->>S3: PutObject profile-data/project-summary.json
    S3-->>Handler: ETag
    
    Handler-->>Event: IndexingResult

Components and Interfaces

Modified Interfaces

RAGIndexerDeps

export interface RAGIndexerDeps {
  gitHubClient: GitHubClient;
  gitHubRepos: Array<{ owner: string; name: string; docsPath: string }>;
  s3Client: S3DataClient;
  s3Config: S3FetcherConfig;
  embeddingClient: EmbeddingClient;
  embeddingsStore: EmbeddingsWriteStore;
  fetcherStateStore: FetcherStateStore;
}

Changes:

  • gitHubRepos changed from single repo config to array of repo configurations
  • Each repo includes owner, name, and docsPath (set to docs by default when the configuration is parsed)

GitHubFile

export interface GitHubFile {
  path: string;  // Now includes repo prefix: owner/repo/path/to/file.md
  content: string;
  sha: string;
  lastModified: number;
}

Changes:

  • path now includes repo prefix for namespace isolation

New Interfaces

ProjectDocumentation

export interface ProjectDocumentation {
  version: string;
  generatedAt: string;
  repoInfo: {
    owner: string;
    name: string;
    branch: string;
  }[];
  summary: {
    description: string;
    techStack: string[];      // Limited to 10 entries
    keyFeatures: string[];
    documentationFiles: string[];  // Limited to 20 entries
  };
  structure: {
    directories: string[];    // Limited to 50 entries
    mainFiles: string[];
  };
  apiEndpoints: {
    path: string;
    method: string;
    description: string;
  }[];  // Limited to 50 entries
  dataModels: {
    name: string;
    type: string;
    description: string;
  }[];  // Limited to 100 entries
}

Modified Functions

indexFromGitHub

Signature:

export async function indexFromGitHub(
  deps: RAGIndexerDeps,
): Promise<IndexingResult>

Changes:

  • Now iterates through all configured repositories
  • Uses repo-prefixed paths for file storage
  • Generates and uploads project documentation
  • Continues processing on repo-specific errors

Algorithm:

  1. Parse GITHUB_REPOS environment variable into repo configurations
  2. Load previous SHA hashes from FetcherStateStore
  3. For each repository:
    • Filter previous hashes to repo-specific entries
    • Fetch files from GitHub with change detection
    • Add repo prefix to file paths
    • Update hash map with repo-prefixed paths
  4. Process all changed files through RAG pipeline
  5. Delete old GitHub embeddings from DynamoDB
  6. Batch write new embeddings (25 per batch)
  7. Generate project documentation
  8. Upload documentation to S3
  9. Save updated hash map
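
The per-repository hash filtering in step 3 can be sketched as below. The names `hashesForRepo` and `HashMap` are hypothetical, and stripping the prefix before comparison is an assumption; the design only says that previous hashes are filtered to repo-specific entries.

```typescript
// Map of repo-prefixed path -> SHA, as persisted between runs (assumed shape).
type HashMap = Record<string, string>;

// Hypothetical helper: keeps only the entries stored under owner/name/ and
// strips that prefix so the keys line up with paths returned for one repo.
export function hashesForRepo(all: HashMap, owner: string, name: string): HashMap {
  // The trailing slash stops "acme/api" from also matching "acme/api-v2/...".
  const prefix = `${owner}/${name}/`;
  const scoped: HashMap = {};
  for (const [path, sha] of Object.entries(all)) {
    if (path.startsWith(prefix)) {
      scoped[path.slice(prefix.length)] = sha;
    }
  }
  return scoped;
}
```

Matching on the full `owner/name/` prefix, including the trailing slash, keeps repositories whose names share a common stem from leaking into each other's hash maps.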

processFile

Signature:

export async function processFile(
  content: string,
  source: 'github' | 's3',
  sourceFile: string,  // Now includes repo prefix
  lastModified: number,
  embeddingClient: EmbeddingClient,
): Promise<{ records: EmbeddingRecord[]; errors: string[] }>

Changes:

  • No functional changes, but sourceFile now includes repo prefix
  • Raw text handling unchanged (2048 char truncation, 'other' section type)

New Functions

generateProjectDocumentation

Signature:

export async function generateProjectDocumentation(
  repos: Array<{ owner: string; name: string }>,
  files: GitHubFile[],
): Promise<ProjectDocumentation>

Purpose: Extract structured documentation from fetched GitHub files

Algorithm:

  1. Initialize empty collections for each data type
  2. For each file:
    • Extract directory structure from path
    • Categorize as documentation if the path is under docs/ or ends in .md
    • Parse package.json for tech stack
    • Parse infrastructure/template.yaml for API endpoints
    • Parse TypeScript files for data models
  3. Apply limits to each collection
  4. Build and return ProjectDocumentation object

Limits:

  • Tech stack: 10 entries
  • Documentation files: 20 entries
  • Directories: 50 entries
  • Data models: 100 entries
  • API endpoints: 50 entries
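
A minimal sketch of how these limits might be applied. `DOC_LIMITS` and `capped` are hypothetical names, and keeping the first N entries in encounter order is an assumption; the design states only the maximum sizes.

```typescript
// Maximum entries per collection, taken from the limits listed above.
export const DOC_LIMITS = {
  techStack: 10,
  documentationFiles: 20,
  directories: 50,
  dataModels: 100,
  apiEndpoints: 50,
} as const;

// Hypothetical helper: returns at most the configured number of entries,
// preserving the order in which they were collected.
export function capped<T>(items: readonly T[], key: keyof typeof DOC_LIMITS): T[] {
  return items.slice(0, DOC_LIMITS[key]);
}
```

For example, `capped(allDirectories, 'directories')` would return the first 50 directories and leave shorter lists unchanged.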

generateAndUploadDocumentation

Signature:

async function generateAndUploadDocumentation(
  repos: Array<{ owner: string; name: string }>,
  files: GitHubFile[],
  s3Client: S3DataClient,
  bucket: string,
): Promise<void>

Purpose: Generate and upload project documentation to S3

Algorithm:

  1. Call generateProjectDocumentation
  2. Serialize as pretty-printed JSON (2-space indentation)
  3. Upload to profile-data/project-summary.json
  4. Log success or error
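
The serialize-and-upload steps could look roughly like this. The `PutObjectCapable` interface is an assumed stand-in for S3DataClient, whose real method signatures are not shown in this document; `serializeSummary` and `uploadSummary` are hypothetical names.

```typescript
// Assumed minimal surface of the S3 client; not the real S3DataClient interface.
interface PutObjectCapable {
  putObject(bucket: string, key: string, body: string, contentType: string): Promise<void>;
}

const PROJECT_SUMMARY_KEY = 'profile-data/project-summary.json';

// Pretty-prints with 2-space indentation, as the design requires.
export function serializeSummary(doc: unknown): string {
  return JSON.stringify(doc, null, 2);
}

// Uploads the summary and reports success instead of throwing, so an S3
// failure is logged without aborting the rest of the indexing run.
export async function uploadSummary(
  doc: unknown,
  s3: PutObjectCapable,
  bucket: string,
): Promise<boolean> {
  try {
    await s3.putObject(bucket, PROJECT_SUMMARY_KEY, serializeSummary(doc), 'application/json');
    console.log(`Uploaded ${PROJECT_SUMMARY_KEY} to ${bucket}`);
    return true;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Failed to upload ${PROJECT_SUMMARY_KEY}: ${message}`);
    return false;
  }
}
```

Returning a boolean rather than rethrowing is one way to match the "log error, continue indexing" handling that this design specifies for S3 upload errors.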

Data Models

EmbeddingRecord

export interface EmbeddingRecord {
  chunkId: string;
  embedding: number[];
  content: string;
  source: 'github' | 's3';
  sectionType: string;
  metadata: {
    sourceFile: string;  // Repo-prefixed: owner/repo/path/to/file.md
    lastUpdated: number;
    chunkIndex: number;
  };
  updatedAt: number;
}

Changes:

  • metadata.sourceFile now includes repo prefix

IndexingResult

export interface IndexingResult {
  source: 'github' | 's3';
  repo?: string;  // Optional repo identifier for multi-repo indexing
  filesProcessed: number;
  chunksGenerated: number;
  embeddingsStored: number;
  errors: string[];
}

Changes:

  • Added optional repo field for tracking per-repo results

ProjectDocumentation

See "New Interfaces" section above.

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Repo Configuration Parsing

For any comma-separated string of owner/repo entries, parsing the GITHUB_REPOS environment variable SHALL produce an array of repository configurations with trimmed owner and repo names, and default docsPath set to docs.

Validates: Requirements 1.1, 1.2, 1.3
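
A minimal sketch of a parser that would satisfy Property 1. The function name `parseGitHubRepos`, the choice to skip empty entries, and the choice to throw on malformed entries are all assumptions; the property itself only requires trimming and the docs default.

```typescript
// Shape matches the gitHubRepos entries in RAGIndexerDeps.
interface RepoConfig {
  owner: string;
  name: string;
  docsPath: string;
}

// Hypothetical helper: parses "owner/repo,owner/repo" into repo configurations.
export function parseGitHubRepos(raw: string, defaultDocsPath = 'docs'): RepoConfig[] {
  return raw
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0) // assumption: tolerate trailing commas
    .map((entry) => {
      const [owner, name, ...rest] = entry.split('/').map((part) => part.trim());
      if (!owner || !name || rest.length > 0) {
        // Assumption: the caller catches this and applies its fallback.
        throw new Error(`Invalid GITHUB_REPOS entry: "${entry}"`);
      }
      return { owner, name, docsPath: defaultDocsPath };
    });
}
```

Called as `parseGitHubRepos(' acme/api , acme/web ')`, this yields two configurations with trimmed names and `docsPath` set to `docs`.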

Property 2: Multi-Repo File Fetching

For any set of configured repositories, when a scheduled indexing event occurs, the RAG_Indexer SHALL fetch files from all repositories and no files shall be missed due to early termination.

Validates: Requirements 2.1, 2.4

Property 3: File Extension Filtering

For any set of files in a GitHub repository directory, filtering to .md files SHALL include only files with the .md extension and exclude all other files.

Validates: Requirement 2.2
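
As an illustration of Property 3, a filter along these lines would do. `filterMarkdown` is a hypothetical name, and treating the extension as case-sensitive is an assumption the design does not settle.

```typescript
// Hypothetical helper: keeps only paths whose final extension is exactly ".md".
// Assumption: matching is case-sensitive, so "README.MD" is excluded.
export function filterMarkdown(paths: string[]): string[] {
  return paths.filter((path) => path.endsWith('.md'));
}
```

Files such as `guide.mdx` or `notes.md.txt` are excluded because the check looks at the end of the path rather than searching for `.md` anywhere in the name.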

Property 4: Change Detection Accuracy

For any file in a GitHub repository, comparing the current SHA hash against the previously stored SHA SHALL correctly identify unchanged files (skip download) and changed files (download content).

Validates: Requirements 3.2, 3.3, 3.4
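
The comparison in Property 4 can be sketched as a partition over the listed files. `ListedFile` and `partitionBySha` are hypothetical names; a file with no stored SHA is treated as changed, which is an assumption that covers newly added files.

```typescript
// Minimal view of a listing entry: only the fields change detection needs.
interface ListedFile {
  path: string;
  sha: string;
}

// Hypothetical helper: splits listed files into those to download (changed or
// new) and those to skip (SHA identical to the previously stored value).
export function partitionBySha(
  listed: ListedFile[],
  previous: Record<string, string>,
): { changed: ListedFile[]; unchanged: ListedFile[] } {
  const changed: ListedFile[] = [];
  const unchanged: ListedFile[] = [];
  for (const file of listed) {
    if (previous[file.path] === file.sha) {
      unchanged.push(file);
    } else {
      changed.push(file);
    }
  }
  return { changed, unchanged };
}
```

Only the `changed` list needs a content download, which is where the savings in GitHub API calls come from.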

Property 5: Repo-Prefixed Path Namespace

For any file fetched from any repository, prepending the owner/repo/ prefix to the file path SHALL produce a unique namespace that prevents path collisions across repositories.

Validates: Requirements 4.1, 4.2
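
A short sketch of the prefixing step behind Property 5. `toRepoPrefixedPath` is a hypothetical name, and stripping leading slashes from the file path is an assumption made so the joined path never contains a double slash.

```typescript
// Hypothetical helper: builds the owner/repo/path/to/file.md form used as the
// storage path for a fetched file.
export function toRepoPrefixedPath(owner: string, name: string, filePath: string): string {
  const relative = filePath.replace(/^\/+/, ''); // drop any leading slashes
  return `${owner}/${name}/${relative}`;
}
```

Two repositories that both contain `docs/README.md` end up with distinct keys, for example `acme/api/docs/README.md` and `acme/web/docs/README.md`.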

Property 6: Embedding Storage Isolation

For any set of embeddings, storing them with source: 'github' and repo-prefixed metadata.sourceFile SHALL ensure that deleting GitHub embeddings does not affect S3 embeddings.

Validates: Requirements 4.2, 4.3

Property 7: Batch Write Size Limit

For any set of embedding records, writing to DynamoDB SHALL use batch sizes of at most 25 records per BatchWrite request.

Validates: Requirement 4.4
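
Splitting records into request-sized groups is the core of Property 7. The sketch below uses a hypothetical `toBatches` helper and leaves out the actual DynamoDB call, since the write store's interface is not shown here.

```typescript
// Batch size required by Property 7 (25 records per BatchWrite request).
const MAX_BATCH_SIZE = 25;

// Hypothetical helper: splits items into consecutive batches of at most `size`.
export function toBatches<T>(items: readonly T[], size: number = MAX_BATCH_SIZE): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error(`Batch size must be a positive integer, got ${size}`);
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}
```

With 60 records this produces batches of 25, 25, and 10, each of which can be sent as one write request.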

Property 8: Documentation Generation Completeness

For any set of fetched GitHub files, generating project documentation SHALL extract all directories, tech stack entries, API endpoints, and data models that meet the categorization criteria.

Validates: Requirements 6.2, 6.3, 6.4, 6.5, 6.6

Property 9: Documentation Generation Limits

For any set of extracted documentation data, applying limits SHALL ensure that tech stack has at most 10 entries, documentation files has at most 20 entries, directories has at most 50 entries, and data models has at most 100 entries.

Validates: Requirement 6.7

Property 10: JSON Serialization Format

For any ProjectDocumentation object, serializing to JSON SHALL produce pretty-printed output with 2-space indentation.

Validates: Requirement 7.2

Property 11: Error Continuation

For any set of repositories, when an error occurs processing one repository, the RAG_Indexer SHALL continue processing remaining repositories and log the error.

Validates: Requirements 2.4, 7.3, 9.5

Property 12: HTTP Status Code Accuracy

For any indexing run, the returned HTTP status code SHALL be 200 when no errors occur, 207 when partial errors occur, and 500 when a fatal error occurs.

Validates: Requirements 9.2, 9.3, 9.4

Property 13: Raw Text Truncation

For any file content that does not parse as valid ProfileData JSON, truncating to 2048 characters before generating an embedding SHALL ensure the content length does not exceed 2048 characters.

Validates: Requirement 10.2

Property 14: Raw Text Section Type

For any file content that does not parse as valid ProfileData JSON, assigning the section type 'other' SHALL ensure the chunk's sectionType field is set to 'other'.

Validates: Requirement 10.3

Property 15: Chunk ID Pattern

For any raw text content, generating a chunk ID using the pattern {source}#other#{sanitized_file_path} SHALL produce a valid identifier with special characters replaced by underscores.

Validates: Requirement 10.4
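
Properties 13 to 15 can be read together as one small transformation. In this sketch `rawTextChunk` is a hypothetical name, and "special characters" is interpreted as anything other than ASCII letters and digits, which is an assumption; the design does not define the exact character set.

```typescript
const RAW_TEXT_MAX_CHARS = 2048;

interface RawTextChunk {
  chunkId: string;
  sectionType: 'other';
  content: string;
}

// Hypothetical helper: builds the chunk fields for content that did not parse
// as ProfileData JSON.
export function rawTextChunk(
  source: 'github' | 's3',
  sourceFile: string,
  content: string,
): RawTextChunk {
  // Assumption: every character outside [A-Za-z0-9] becomes an underscore.
  const sanitized = sourceFile.replace(/[^A-Za-z0-9]/g, '_');
  return {
    chunkId: `${source}#other#${sanitized}`,
    sectionType: 'other',
    // slice counts UTF-16 code units, so the result never exceeds 2048 units.
    content: content.slice(0, RAW_TEXT_MAX_CHARS),
  };
}
```

Under this interpretation, the file `acme/api/docs/a.md` from GitHub gets the chunk ID `github#other#acme_api_docs_a_md`.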

Error Handling Strategy

Error Categories

| Category | Description | Handling |
| --- | --- | --- |
| Repository-specific errors | GitHub API errors for a specific repo | Log and continue with other repos |
| Embedding generation errors | Bedrock API failures | Log error, continue with other chunks |
| S3 upload errors | Profile data bucket upload failures | Log error, continue indexing |
| Configuration errors | Invalid GITHUB_REPOS format | Log error, use default repos |
| Fatal errors | DynamoDB connection failures, critical infrastructure issues | Return 500 status |

Error Propagation

// Repository-specific errors are caught and logged
for (const repo of deps.gitHubRepos) {
  try {
    const fetchResult = await fetchFromGitHubWithConfig(...);
    // Process files...
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    result.errors.push(`Repository ${repo.owner}/${repo.name} failed: ${message}`);
    // Continue to next repo
  }
}

// Fatal errors are caught at the top level
try {
  // Main processing logic
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  console.error('RAG Indexer fatal error:', message);
  return { statusCode: 500, body: JSON.stringify({ error: message }) };
}

Status Code Logic

| Condition | Status Code | Body |
| --- | --- | --- |
| No errors | 200 | {"filesProcessed": N, "embeddingsStored": N, "errors": [], "results": [...]} |
| Partial errors | 207 | Same as 200, but errors array is non-empty |
| Fatal error | 500 | {"error": "error message"} |
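
The status code rules can be expressed as two small helpers. `buildResponse` and `buildFatalResponse` are hypothetical names, and the `results` array is omitted from the summary type for brevity; this is a sketch of the mapping, not the handler's actual code.

```typescript
interface IndexingSummary {
  filesProcessed: number;
  embeddingsStored: number;
  errors: string[];
}

interface HandlerResponse {
  statusCode: number;
  body: string;
}

// Hypothetical helper: 200 for a clean run, 207 when any non-fatal error
// was recorded during indexing.
export function buildResponse(summary: IndexingSummary): HandlerResponse {
  const statusCode = summary.errors.length === 0 ? 200 : 207;
  return { statusCode, body: JSON.stringify(summary) };
}

// Hypothetical helper for the top-level catch: fatal errors map to 500.
export function buildFatalResponse(error: unknown): HandlerResponse {
  const message = error instanceof Error ? error.message : String(error);
  return { statusCode: 500, body: JSON.stringify({ error: message }) };
}
```

Keeping the 200/207 decision in one place makes it straightforward to cover with example-based tests for each row of the table.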

Testing Strategy

Dual Testing Approach

  • Unit Tests: Verify specific examples, edge cases, and error conditions
  • Property Tests: Verify universal properties across all inputs (when applicable)

Property-Based Testing

The following properties are suitable for property-based testing:

  1. Repo Configuration Parsing - Test with random comma-separated strings
  2. File Extension Filtering - Test with random file name lists
  3. Change Detection Accuracy - Test with random SHA pairs
  4. Repo-Prefixed Path Namespace - Test with random repo/file combinations
  5. Embedding Storage Isolation - Test with random embedding sets
  6. Batch Write Size Limit - Test with random record counts
  7. Documentation Generation Limits - Test with large datasets
  8. JSON Serialization Format - Test with random documentation objects
  9. Error Continuation - Test with random error scenarios
  10. Raw Text Truncation - Test with random text lengths
  11. Raw Text Section Type - Test with random non-JSON content
  12. Chunk ID Pattern - Test with random file paths
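
To show the shape of such a test, here is a dependency-free sketch for the batch size property. Everything in it is illustrative: `toBatches` is a local stand-in for the code under test, and the seeded generator is hand-rolled so the example runs on its own; a dedicated property-testing library would normally take its place.

```typescript
// Local stand-in for the function under test.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Seeded linear congruential generator so a failing run can be replayed.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Checks, for `iterations` random record counts, that no batch exceeds 25
// records and that batching neither drops nor reorders records.
export function checkBatchSizeProperty(iterations: number, seed: number): void {
  const random = seededRandom(seed);
  for (let run = 0; run < iterations; run++) {
    const count = Math.floor(random() * 500); // 0..499 records
    const records = Array.from({ length: count }, (_, i) => i);
    const batches = toBatches(records, 25);
    const oversized = batches.some((batch) => batch.length > 25);
    const flattened = batches.reduce<number[]>((all, batch) => all.concat(batch), []);
    const lossless = flattened.length === count && flattened.every((value, i) => value === i);
    if (oversized || !lossless) {
      throw new Error(`Property violated for ${count} records (seed ${seed}, run ${run})`);
    }
  }
}
```

Calling `checkBatchSizeProperty(100, 42)` exercises 100 random record counts, and the seed in the error message allows any failure to be reproduced.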

Example-Based Testing

The following behaviors are better tested with specific examples:

  1. HTTP Status Code Accuracy - Test specific error scenarios
  2. S3 Event Processing - Test with specific S3 event structures
  3. Scheduled Event Processing - Test with specific scheduled event structures
  4. Manual Invocation - Test with unknown event types

Integration Testing

The following require integration tests:

  1. GitHub API Integration - Test with real GitHub API calls
  2. DynamoDB Persistence - Test with real DynamoDB table
  3. S3 Upload - Test with real S3 bucket
  4. Embedding Generation - Test with real Bedrock API

Test Configuration

  • Property-based tests: Minimum 100 iterations per property
  • Tag format: Feature: multi-repo-indexing, Property {number}: {property_text}
  • Integration tests: 1-3 representative examples per external service

Test Files

| File | Purpose |
| --- | --- |
| backend/src/handlers/rag-indexer-handler.test.ts | Handler logic tests |
| backend/src/services/rag/github-fetcher.test.ts | GitHub fetcher tests |
| backend/src/services/rag/s3-fetcher.test.ts | S3 fetcher tests |
| backend/src/services/rag/embedding-service.test.ts | Embedding generation tests |
| backend/src/services/rag/chunker.test.ts | Chunking tests |
| scripts/drive-ingest.test.ts | Integration tests |

Implementation Notes

Environment Variables

| Variable | Format | Default | Description |
| --- | --- | --- | --- |
| GITHUB_REPOS | owner/repo,owner/repo,... | Preconfigured list | Comma-separated list of repositories to index |
| GITHUB_TOKEN | Token string | Required | GitHub Personal Access Token |
| EMBEDDINGS_TABLE | Table name | Required | DynamoDB table for embeddings |
| PROFILE_DATA_BUCKET | Bucket name | Required | S3 bucket for profile data |

DynamoDB Schema

Table: EMBEDDINGS_TABLE

Primary Key: chunkId (String)

Items:

  • Embedding records: chunkId, embedding, content, source, sectionType, metadata, updatedAt
  • State records: chunkId = __fetcher_state__#github_hashes or __fetcher_state__#s3_etags, data = hash/ETag map

S3 Object Keys

| Key | Purpose |
| --- | --- |
| profile-data/project-summary.json | Generated project documentation |
| profile-data/*.json | Profile data files (existing) |

GitHub API Rate Limits

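  • Unauthenticated: 60 requests/hour
  • Authenticated: 5,000 requests/hour
  • The repository contents listing returns at most 1,000 entries per directory; docs directories larger than that need a different listing approach, such as the Git Trees API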

Performance Considerations

  1. Batch Processing: Process files in parallel where possible
  2. Change Detection: Only fetch changed files to reduce API calls
  3. Batch Writes: Use DynamoDB batch write (25 items) for efficiency
  4. Error Isolation: Continue processing on non-fatal errors

Security Considerations

  1. GitHub Token: Store in AWS Secrets Manager or SSM Parameter Store
  2. Environment Variables: Encrypt at rest using AWS KMS
  3. S3 Upload: Use S3 bucket policies to restrict access