Document Management
Knowledge bases are populated with documents that get chunked, embedded, and indexed for vector search. You can upload files directly or connect external document sources for automatic synchronization.
Uploading Documents
Upload files directly through the admin portal by clicking Upload Documents on the knowledge base detail page.
Document upload interface with drag-and-drop area, file list, and processing status indicators
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Text is extracted; scanned PDFs require OCR-enabled processing |
| Word | .docx | Tables and formatted text are preserved |
| Plain Text | .txt | Indexed as-is |
| HTML | .html | Tags are stripped; text content is extracted |
After upload, documents go through the indexing pipeline: text extraction, chunking, embedding generation, and vector storage. Track progress on the document list where each file shows its current status.
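The chunking stage of that pipeline can be pictured with a minimal sketch. The function name, the character-based windows, and the default sizes are illustrative assumptions; the chunking strategy the product actually uses is not described here.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    chunk_size and overlap are hypothetical values chosen for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlapping windows are one common way to keep a sentence that straddles a chunk boundary retrievable from either side; each resulting chunk would then be embedded and stored in the vector index.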
External Document Sources
Connect external storage platforms to automatically sync documents into your knowledge base.
Supported Providers
| Provider | Authentication | Description |
|---|---|---|
| SharePoint | OAuth / App credentials | Sync from SharePoint document libraries via Microsoft Graph API |
| Google Drive | Service account / OAuth | Sync folders or shared drives |
| Amazon S3 | Access key / IAM role | Sync from S3 buckets with optional prefix filtering |
| Dropbox | OAuth | Sync from Dropbox folders |
| OneDrive | OAuth / App credentials | Sync from OneDrive for Business via Microsoft Graph API |
| Box | OAuth / JWT | Sync from Box folders |
| Azure Blob | Connection string / SAS token | Sync from Azure Blob Storage containers |
External document sources panel showing connected providers with sync status and schedule
Configuring an External Source
- Go to the knowledge base detail page and select the External Sources tab.
- Click Add Source.
- Select the provider type (SharePoint, Google Drive, etc.).
- Choose an existing storage integration or create a new one with the provider credentials.
- Fill in the provider-specific fields (site URL, document library, folder path, etc.).
- Optionally configure file type filters to limit which file extensions are synced (e.g., PDF, DOCX only).
- Optionally configure a document whitelist to sync only specific files (see below).
- Optionally configure citation URL mapping to transform source URLs into public-facing links.
- Save and trigger the initial sync.
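The file type filter in the steps above can be sketched as a simple extension check. This is an illustration only: the function name is hypothetical, and treating an empty filter as "sync everything" and comparing extensions case-insensitively are assumptions rather than documented behavior.

```python
from pathlib import PurePosixPath


def passes_extension_filter(path: str, allowed_extensions: set[str]) -> bool:
    """Return True if a source file should be considered for syncing."""
    # Assumption: no configured filter means every file type is allowed.
    if not allowed_extensions:
        return True
    # Assumption: "PDF", ".pdf", and "pdf" all refer to the same extension.
    normalized = {ext.lower().lstrip(".") for ext in allowed_extensions}
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in normalized
```

For example, with a filter of `{"PDF", "DOCX"}`, the path `HR/Benefits Guide.pdf` passes and `notes.txt` does not.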
Document Whitelist
The document whitelist lets you control exactly which files are synced from an external source. When a whitelist is configured, only files matching the list are downloaded and indexed. All other files in the source folder are skipped.
Setting Up a Whitelist
- Open the external source configuration (create or edit).
- In the Document Whitelist section, add filenames using one of these methods:
  - Upload a file: Click Upload .txt or .csv and select a file containing document names.
    - .txt files: one filename per line.
    - .csv files: the first column is used as the filename. A header row (containing "filename", "name", or "document") is automatically detected and skipped.
  - Paste manually: Type or paste filenames directly into the text area, one per line.
- Choose the Match by option:
  - Filename: Matches the file name only (e.g., Benefits Guide.pdf). Use this when filenames are unique across folders.
  - Full path: Matches the complete folder path (e.g., HR/Benefits Guide.pdf). Use this when the same filename exists in multiple folders.
- Save the source configuration.
The whitelist count is displayed on the source card (e.g., "10 whitelisted") so you can see at a glance which sources have a whitelist active.
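To make the two upload formats concrete, here is a minimal sketch of how a whitelist file could be parsed. The function name and the `file_type` parameter are hypothetical. Header detection is simplified to an exact, case-insensitive match on the first cell, which is an assumption; the real detection logic may differ.

```python
import csv
import io

# Header words listed in the documentation above.
HEADER_KEYWORDS = ("filename", "name", "document")


def parse_whitelist(content: str, file_type: str) -> list[str]:
    """Return the filenames listed in an uploaded whitelist file."""
    if file_type == "txt":
        # One filename per line; blank lines are ignored (assumption).
        return [line.strip() for line in content.splitlines() if line.strip()]
    if file_type == "csv":
        rows = [row for row in csv.reader(io.StringIO(content)) if row and row[0].strip()]
        # Assumption: a row is a header when its first cell equals a keyword.
        if rows and rows[0][0].strip().lower() in HEADER_KEYWORDS:
            rows = rows[1:]
        # Only the first column is used as the filename.
        return [row[0].strip() for row in rows]
    raise ValueError(f"unsupported whitelist file type: {file_type}")
```

An exact match on the first cell is used here instead of a substring test so that a real file such as `Rename Policy.pdf` is not mistaken for a header row.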
Updating a Whitelist
When you modify the whitelist and re-sync:
- New files added to the list: Downloaded and indexed on the next sync.
- Files removed from the list: Automatically cleaned up. The file is removed from cloud storage, its search index chunks are deleted, and the document is marked as Deleted.
- No whitelist (empty list): All files in the source folder are synced. This is the default behavior.
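
The reconciliation described above can be sketched as set arithmetic. The function and parameter names are hypothetical, names are compared case-insensitively, and the sketch only models whitelist-driven changes; how the product handles files that disappear from the source itself is not covered.

```python
def reconcile(indexed: set[str], source_files: set[str],
              whitelist: set[str]) -> tuple[set[str], set[str]]:
    """Return (to_download, to_delete) for the next sync."""
    allowed = {name.casefold() for name in whitelist}
    if not allowed:
        # Empty whitelist: every source file is synced, nothing is cleaned up.
        return set(source_files) - indexed, set()
    wanted = {f for f in source_files if f.casefold() in allowed}
    to_download = wanted - indexed
    # Previously indexed files that are no longer on the list are cleaned up.
    to_delete = {f for f in indexed if f.casefold() not in allowed}
    return to_download, to_delete
```

With `a.pdf` and `b.pdf` already indexed, a source containing `a.pdf`, `b.pdf`, and `c.pdf`, and a whitelist of `A.pdf` and `c.pdf`, the sketch schedules `c.pdf` for download and `b.pdf` for cleanup.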
TIP
Start with a small whitelist while testing your source connection. Once you confirm the correct files are being pulled, expand the list or remove the whitelist entirely to sync all files.
WARNING
The whitelist is case-insensitive. Benefits Guide.pdf and benefits guide.pdf are treated as the same entry. The maximum whitelist size is 10,000 entries.
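The two Match by modes and the case-insensitive comparison can be illustrated with a short sketch. The function name and the `match_by` values are hypothetical, and forward-slash paths are assumed.

```python
from pathlib import PurePosixPath


def is_whitelisted(file_path: str, whitelist: list[str],
                   match_by: str = "filename") -> bool:
    """Check a source file against the whitelist, ignoring case."""
    if not whitelist:
        # An empty whitelist means all files are synced.
        return True
    entries = {entry.strip().casefold() for entry in whitelist}
    if match_by == "filename":
        candidate = PurePosixPath(file_path).name  # drop the folder part
    elif match_by == "full_path":
        candidate = file_path
    else:
        raise ValueError(f"unknown match_by value: {match_by}")
    return candidate.strip().casefold() in entries
```

Under these assumptions, `HR/Benefits Guide.pdf` matches the entry `benefits guide.pdf` in filename mode, but in full path mode it only matches an entry that includes the folder, such as `hr/benefits guide.pdf`.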
Source Actions
Each connected source displays a set of action buttons:
| Action | Icon | Description |
|---|---|---|
| Sync | sync | Sync changes since the last run (delta sync) |
| Full Sync | sync_problem | Re-download all files from the source |
| Test Connection | wifi_tethering | Verify that the credentials and source path are valid |
| Edit | edit | Modify source configuration, whitelist, or filters |
| Delete | delete | Remove the source configuration (previously synced documents remain) |
Sync Statistics
After each sync completes, the source card displays statistics:
| Metric | Description |
|---|---|
| Added | New documents downloaded and added to the knowledge base |
| Updated | Existing documents that were re-downloaded due to changes at the source |
| Skipped | Documents that were unchanged or filtered out (by extension or whitelist) |
| Failed | Documents that encountered errors during download or processing |
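
As a rough illustration of how the Added, Updated, and Skipped counts relate to a delta sync, the sketch below compares a source listing with what was seen on the previous run. The version tokens (for example an ETag or modification time) and the function name are assumptions. Failures occur during download or processing, which the sketch does not model, so that counter stays at zero.

```python
def classify_sync(source: dict[str, str], known: dict[str, str]) -> dict[str, int]:
    """Tally sync outcomes from {filename: version_token} mappings.

    source: files currently at the provider.
    known:  files and version tokens recorded by the previous sync.
    """
    stats = {"added": 0, "updated": 0, "skipped": 0, "failed": 0}
    for name, version in source.items():
        if name not in known:
            stats["added"] += 1      # new document
        elif known[name] != version:
            stats["updated"] += 1    # changed at the source
        else:
            stats["skipped"] += 1    # unchanged since the last run
    return stats
```

Files excluded by an extension filter or whitelist would also count as skipped; that step is omitted here for brevity.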
Sync Schedules
| Schedule | Description |
|---|---|
| Manual | Sync only when triggered manually from the admin portal |
| Hourly | Check for new or updated files every hour |
| Daily | Sync once per day at the configured time |
| Weekly | Sync once per week on the configured day and time |
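
The schedules above can be sketched as a function that computes the next run time. The function name, the parameter names, and the default time and weekday are hypothetical. The sketch uses naive datetimes and does not handle time zones.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def next_sync(schedule: str, now: datetime,
              at: time = time(2, 0), weekday: int = 0) -> Optional[datetime]:
    """Return the next scheduled run, or None for manual sources.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    if schedule == "manual":
        return None
    if schedule == "hourly":
        return now + timedelta(hours=1)

    candidate = datetime.combine(now.date(), at)
    if schedule == "daily":
        if candidate <= now:
            candidate += timedelta(days=1)
        return candidate
    if schedule == "weekly":
        # Move forward to the configured weekday in the current week.
        candidate += timedelta(days=(weekday - now.weekday()) % 7)
        if candidate <= now:
            candidate += timedelta(days=7)
        return candidate
    raise ValueError(f"unknown schedule: {schedule}")
```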
TIP
Start with a manual schedule while validating that the correct files are being indexed. Switch to automatic sync once you are satisfied with the results.
Document Status Tracking
Each document displays its current status in the documents table:
| Status | Badge Color | Description |
|---|---|---|
| Pending | Gray | Queued for processing |
| Processing | Blue | Text extraction and embedding generation in progress |
| Indexed | Green | Successfully indexed and available for search |
| Failed | Red | Processing failed; check the error details for troubleshooting |
| Deleted | Faded red | Removed from the knowledge base (whitelist change, source deletion, or manual removal). The document is excluded from search and will not be re-indexed. |
| Deactivated | Amber | Manually deactivated by an administrator. The document remains in storage but is excluded from search. |
Filtering by Status
Use the Status dropdown above the documents table to filter by any status. The All Status view shows all documents except deleted ones. Select Deleted explicitly to review removed documents.
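The filtering rule can be expressed in a few lines. This is a sketch: the `Status` enum mirrors the status table, while the dictionary shape of a document and the function name are assumptions.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "Pending"
    PROCESSING = "Processing"
    INDEXED = "Indexed"
    FAILED = "Failed"
    DELETED = "Deleted"
    DEACTIVATED = "Deactivated"


def filter_by_status(documents: list[dict],
                     selected: Optional[Status] = None) -> list[dict]:
    """Apply the Status dropdown; selected=None stands for "All Status"."""
    if selected is None:
        # "All Status" hides deleted documents.
        return [d for d in documents if d["status"] is not Status.DELETED]
    return [d for d in documents if d["status"] is selected]
```

Deleted documents only appear when `Status.DELETED` is passed explicitly, which matches the dropdown behavior described above.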
Collection Statistics
The stats ribbon at the top of each knowledge base shows real-time document and indexing metrics:
| Metric | Description |
|---|---|
| Documents | Total number of documents in the collection |
| Chunks | Total number of text chunks created from all documents |
| Indexed | Documents that have been successfully indexed and are available for search |
| Pending | Documents waiting to be processed |
| Failed | Documents that failed during indexing |
| Deleted | Documents that have been removed (via whitelist reconciliation, source deletion, or manual deletion) |
| Deactivated | Documents that have been manually deactivated by an administrator |
| Last Sync | Timestamp of the most recent sync or indexing operation |
Documents Table
The documents table provides a sortable, filterable view of all documents in the knowledge base.
Sorting
Click any column header to sort the table by that column. Click again to reverse the sort direction. An arrow icon indicates the active sort column and direction.
Sortable columns: Title, Source, Chunks, Status, Added, Last Indexed.
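The click-to-sort behavior can be modeled as a small piece of state. In this sketch the column keys, the function names, and the choice of ascending order on the first click are assumptions.

```python
from typing import Optional

# Hypothetical keys for the sortable columns listed above.
SORTABLE_COLUMNS = ("title", "source", "chunks", "status", "added", "last_indexed")


def next_sort_state(current: Optional[tuple[str, bool]],
                    clicked: str) -> tuple[str, bool]:
    """Return (column, descending) after a header click."""
    if clicked not in SORTABLE_COLUMNS:
        raise ValueError(f"column is not sortable: {clicked}")
    if current is not None and current[0] == clicked:
        # Clicking the active column again reverses the direction.
        return (clicked, not current[1])
    # Assumption: a newly selected column starts in ascending order.
    return (clicked, False)


def sort_documents(rows: list[dict], state: tuple[str, bool]) -> list[dict]:
    """Return rows ordered by the active sort column."""
    column, descending = state
    return sorted(rows, key=lambda row: row[column], reverse=descending)
```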
Filtering
| Filter | Description |
|---|---|
| Search | Free-text search across document title, source URL, and filename |
| Type | Filter by document source type: Files (uploaded), Web URLs, or GCS imports |
| Status | Filter by processing status: All, Ready (indexed), Processing, Queued (pending), Failed, or Deleted |
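
The Search filter can be pictured as a substring check over the three fields it covers. The field keys and the function name are hypothetical, and case-insensitive substring matching is an assumption about how the search behaves.

```python
def matches_search(doc: dict, query: str) -> bool:
    """Return True if the query appears in the title, source URL, or filename."""
    needle = query.strip().casefold()
    if not needle:
        return True  # an empty search box shows every document
    fields = (doc.get("title"), doc.get("source_url"), doc.get("filename"))
    return any(needle in (value or "").casefold() for value in fields)
```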
Bulk Actions
Select multiple documents using the checkboxes to perform bulk operations such as re-indexing or deletion. A floating toolbar appears at the bottom of the page when documents are selected.
WARNING
Deleting a document removes it from the vector index, so searches no longer return results from that document. This action cannot be undone.
