
Document Management

Knowledge bases are populated with documents that are chunked, embedded, and indexed for vector search. You can upload files directly or connect external document sources for automatic synchronization.

Uploading Documents

Upload files directly through the admin portal by clicking Upload Documents on the knowledge base detail page.

Figure: Uploading documents to a knowledge base (document upload interface with drag-and-drop area, file list, and processing status indicators)

Supported Formats

Format | Extension | Notes
PDF | .pdf | Text is extracted; scanned PDFs require OCR-enabled processing
Word | .docx | Tables and formatted text are preserved
Plain Text | .txt | Indexed as-is
HTML | .html | Tags are stripped; text content is extracted

After upload, documents go through the indexing pipeline: text extraction, chunking, embedding generation, and vector storage. Track progress on the document list where each file shows its current status.
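
The four pipeline stages can be pictured with a small sketch. Everything here is illustrative: the function names, the fixed 50-word chunk size, and the hash-based stand-in for an embedding model are assumptions, not the product's actual implementation.

```python
import hashlib


def extract_text(raw: bytes) -> str:
    """Stage 1: text extraction (here: plain UTF-8 decoding only)."""
    return raw.decode("utf-8", errors="replace")


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Stage 2: split the text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(chunk: str, dims: int = 8) -> list[float]:
    """Stage 3: placeholder vector derived from a hash, NOT a real embedding model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def index_document(doc_id: str, raw: bytes, store: dict) -> int:
    """Stage 4: store (chunk, vector) pairs in memory; returns the chunk count."""
    chunks = chunk_text(extract_text(raw))
    store[doc_id] = [(c, embed(c)) for c in chunks]
    return len(chunks)
```

A 120-word document would produce three chunks under these assumed settings, which is the kind of number shown in a per-document chunk count.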

External Document Sources

Connect external storage platforms to automatically sync documents into your knowledge base.

Supported Providers

Provider | Authentication | Description
SharePoint | OAuth / App credentials | Sync from SharePoint document libraries via Microsoft Graph API
Google Drive | Service account / OAuth | Sync folders or shared drives
Amazon S3 | Access key / IAM role | Sync from S3 buckets with optional prefix filtering
Dropbox | OAuth | Sync from Dropbox folders
OneDrive | OAuth / App credentials | Sync from OneDrive for Business via Microsoft Graph API
Box | OAuth / JWT | Sync from Box folders
Azure Blob | Connection string / SAS token | Sync from Azure Blob Storage containers

Figure: Managing external document sources for a knowledge base (external document sources panel showing connected providers with sync status and schedule)

Configuring an External Source

  1. Go to the knowledge base detail page and select the External Sources tab.
  2. Click Add Source.
  3. Select the provider type (SharePoint, Google Drive, etc.).
  4. Choose an existing storage integration or create a new one with the provider credentials.
  5. Fill in the provider-specific fields (site URL, document library, folder path, etc.).
  6. Optionally configure file type filters to limit which file extensions are synced (e.g., PDF, DOCX only).
  7. Optionally configure a document whitelist to sync only specific files (see below).
  8. Optionally configure citation URL mapping to transform source URLs into public-facing links.
  9. Save and trigger the initial sync.
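
To make steps 6 and 8 concrete, here is a minimal sketch of what a file type filter and a citation URL mapping might do. The function names and the prefix-replacement rule are assumptions for illustration; the product may apply different matching or rewriting rules.

```python
from pathlib import PurePosixPath


def passes_type_filter(path: str, allowed_extensions: set[str]) -> bool:
    """Return True if the file's extension is allowed.

    Assumption: an empty filter set means no restriction.
    """
    if not allowed_extensions:
        return True
    return PurePosixPath(path).suffix.lower().lstrip(".") in allowed_extensions


def map_citation_url(source_url: str, source_prefix: str, public_prefix: str) -> str:
    """Swap a source URL prefix for a public-facing one; other URLs are unchanged."""
    if source_url.startswith(source_prefix):
        return public_prefix + source_url[len(source_prefix):]
    return source_url
```

With these hypothetical helpers, a filter of `{"pdf", "docx"}` would accept `HR/Benefits Guide.PDF` and reject `notes.txt`, and an internal link prefix could be rewritten to a public site prefix before it appears in a citation.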

Document Whitelist

The document whitelist lets you control exactly which files are synced from an external source. When a whitelist is configured, only files matching the list are downloaded and indexed. All other files in the source folder are skipped.

Setting Up a Whitelist

  1. Open the external source configuration (create or edit).
  2. In the Document Whitelist section, add filenames using one of these methods:
    • Upload a file: Click Upload .txt or .csv and select a file containing document names.
      • .txt files: one filename per line.
      • .csv files: the first column is used as the filename. A header row (containing "filename", "name", or "document") is automatically detected and skipped.
    • Paste manually: Type or paste filenames directly into the text area, one per line.
  3. Choose the Match by option:
    • Filename: Matches the file name only (e.g., Benefits Guide.pdf). Use this when filenames are unique across folders.
    • Full path: Matches the complete folder path (e.g., HR/Benefits Guide.pdf). Use this when the same filename exists in multiple folders.
  4. Save the source configuration.
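
The upload rules in step 2 can be sketched as a small parser. This is an approximation under stated assumptions: `parse_whitelist` is a hypothetical name, and the header check here treats a row as a header only when its first cell is exactly one of the listed words, which may differ from the portal's detection logic.

```python
import csv
import io

# Header words named in the documentation.
HEADER_WORDS = ("filename", "name", "document")


def parse_whitelist(content: str, kind: str) -> list[str]:
    """Parse whitelist text.

    kind is "txt" (one filename per line) or "csv" (first column is the filename).
    """
    if kind == "txt":
        return [line.strip() for line in content.splitlines() if line.strip()]
    rows = [row for row in csv.reader(io.StringIO(content)) if row and row[0].strip()]
    # Assumed header rule: skip the first row if its first cell is a header word.
    if rows and rows[0][0].strip().lower() in HEADER_WORDS:
        rows = rows[1:]
    return [row[0].strip() for row in rows]
```

Blank lines are dropped and surrounding whitespace is trimmed in this sketch, so a pasted list and an uploaded file with the same names would yield the same entries.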

The whitelist count is displayed on the source card (e.g., "10 whitelisted") so you can see at a glance which sources have a whitelist active.

Updating a Whitelist

When you modify the whitelist and re-sync:

  • New files added to the list: Downloaded and indexed on the next sync.
  • Files removed from the list: Automatically cleaned up. The file is removed from cloud storage, its search index chunks are deleted, and the document is marked as Deleted.
  • No whitelist (empty list): All files in the source folder are synced. This is the default behavior.
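
The three cases above amount to a set comparison between the whitelist, the files at the source, and the files already indexed. The sketch below assumes names are already normalized and uses a hypothetical `reconcile` function; it is not the product's reconciliation code.

```python
def reconcile(whitelist: set[str], source_files: set[str], indexed: set[str]):
    """Return (to_download, to_delete) for the next sync."""
    if not whitelist:
        # No whitelist: every source file is synced and nothing is cleaned up.
        return source_files - indexed, set()
    wanted = source_files & whitelist
    to_download = wanted - indexed      # newly whitelisted files
    to_delete = indexed - whitelist     # files removed from the whitelist
    return to_download, to_delete
```

For example, with `a.pdf` and `b.pdf` whitelisted and `a.pdf` and `c.pdf` already indexed, the next sync would download `b.pdf` and clean up `c.pdf`.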

TIP

Start with a small whitelist while testing your source connection. Once you confirm the correct files are being pulled, expand the list or remove the whitelist entirely to sync all files.

WARNING

Whitelist matching is case-insensitive: Benefits Guide.pdf and benefits guide.pdf are treated as the same entry. A whitelist can contain at most 10,000 entries.
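
A short sketch of how case-insensitive matching and the two Match by modes could work together. The helper names are hypothetical, and whether the 10,000-entry limit is counted before or after duplicates are merged is an assumption (this sketch counts after merging).

```python
from pathlib import PurePosixPath

MAX_ENTRIES = 10_000  # limit stated in the warning above


def build_whitelist(entries: list[str]) -> set[str]:
    """Normalize entries for case-insensitive lookup and enforce the size limit."""
    normalized = {e.strip().casefold() for e in entries if e.strip()}
    if len(normalized) > MAX_ENTRIES:
        raise ValueError(f"whitelist has {len(normalized)} entries; maximum is {MAX_ENTRIES}")
    return normalized


def is_whitelisted(path: str, whitelist: set[str], match_by: str = "filename") -> bool:
    """match_by is "filename" (name only) or "full_path" (complete folder path)."""
    key = PurePosixPath(path).name if match_by == "filename" else path
    return key.casefold() in whitelist
```

In Filename mode, `HR/BENEFITS GUIDE.PDF` matches an entry of `Benefits Guide.pdf`; in Full path mode the entry would need to be `HR/Benefits Guide.pdf` (in any letter case).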

Source Actions

Each connected source displays a set of action buttons:

Action | Icon | Description
Sync | sync | Sync changes since the last run (delta sync)
Full Sync | sync_problem | Re-download all files from the source
Test Connection | wifi_tethering | Verify that the credentials and source path are valid
Edit | edit | Modify source configuration, whitelist, or filters
Delete | delete | Remove the source configuration (previously synced documents remain)
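
The difference between Sync and Full Sync can be sketched as a choice of which files to fetch. This sketch assumes change detection by last-modified timestamp and uses a hypothetical `files_to_fetch` function; an actual connector may rely on provider-specific change tracking instead.

```python
from datetime import datetime, timezone
from typing import Optional


def files_to_fetch(files: dict[str, datetime],
                   last_sync: Optional[datetime],
                   full: bool = False) -> list[str]:
    """files maps path -> last-modified time at the source."""
    if full or last_sync is None:
        # Full Sync (or first run): re-download everything.
        return sorted(files)
    # Delta sync: only files modified since the last run.
    return sorted(p for p, modified in files.items() if modified > last_sync)
```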

Sync Statistics

After each sync completes, the source card displays statistics:

Metric | Description
Added | New documents downloaded and added to the knowledge base
Updated | Existing documents that were re-downloaded due to changes at the source
Skipped | Documents that were unchanged or filtered out (by extension or whitelist)
Failed | Documents that encountered errors during download or processing
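
One way to read these four metrics is as the outcome of a per-file decision during a sync run. The sketch below is illustrative only: `run_sync`, the version-tag comparison, and the `download` callback are assumptions, and extension or whitelist filtering (which would also count as Skipped) is left out for brevity.

```python
from collections import Counter
from typing import Callable


def run_sync(remote: dict[str, str], known: dict[str, str],
             download: Callable[[str], None]) -> Counter:
    """remote and known map path -> version tag (for example a hash)."""
    stats = Counter({"added": 0, "updated": 0, "skipped": 0, "failed": 0})
    for path, version in remote.items():
        if known.get(path) == version:
            stats["skipped"] += 1       # unchanged since the last sync
            continue
        try:
            download(path)
        except Exception:
            stats["failed"] += 1        # download or processing error
            continue
        stats["updated" if path in known else "added"] += 1
        known[path] = version           # remember the version just synced
    return stats
```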

Sync Schedules

Schedule | Description
Manual | Sync only when triggered manually from the admin portal
Hourly | Check for new or updated files every hour
Daily | Sync once per day at the configured time
Weekly | Sync once per week on the configured day and time
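
As a rough illustration of how these schedules translate into a next run time, consider the sketch below. The `next_run` function, its parameter names, the default time, and the rule that hourly runs fire one hour after the current time are all assumptions; the scheduler's real behavior (including time zones) is not documented here.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def next_run(schedule: str, now: datetime,
             at: time = time(2, 0), weekday: int = 0) -> Optional[datetime]:
    """Return the next scheduled sync after `now`, or None for manual schedules.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    if schedule == "manual":
        return None
    if schedule == "hourly":
        return now + timedelta(hours=1)
    if schedule == "daily":
        candidate = datetime.combine(now.date(), at)
        return candidate if candidate > now else candidate + timedelta(days=1)
    if schedule == "weekly":
        days_ahead = (weekday - now.weekday()) % 7
        candidate = datetime.combine(now.date() + timedelta(days=days_ahead), at)
        return candidate if candidate > now else candidate + timedelta(days=7)
    raise ValueError(f"unknown schedule: {schedule}")
```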

TIP

Start with a manual schedule while validating that the correct files are being indexed. Switch to automatic sync once you are satisfied with the results.

Document Status Tracking

Each document displays its current status in the documents table:

Status | Badge Color | Description
Pending | Gray | Queued for processing
Processing | Blue | Text extraction and embedding generation in progress
Indexed | Green | Successfully indexed and available for search
Failed | Red | Processing failed; check the error details for troubleshooting
Deleted | Faded red | Removed from the knowledge base (whitelist change, source deletion, or manual removal). The document is excluded from search and will not be re-indexed.
Deactivated | Amber | Manually deactivated by an administrator. The document remains in storage but is excluded from search.

Filtering by Status

Use the Status dropdown above the documents table to filter by any status. The All Status view shows all documents except deleted ones. Select Deleted explicitly to review removed documents.
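
The filter behavior described above, where the default view hides deleted documents, can be expressed in a few lines. The document shape (a dict with a `status` key) and the function name are assumptions for illustration.

```python
def filter_by_status(documents: list[dict], status: str = "All") -> list[dict]:
    """Return documents for the chosen status; "All" excludes Deleted ones."""
    if status == "All":
        return [d for d in documents if d["status"] != "Deleted"]
    return [d for d in documents if d["status"] == status]
```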

Collection Statistics

The stats ribbon at the top of each knowledge base shows real-time document and indexing metrics:

Metric | Description
Documents | Total number of documents in the collection
Chunks | Total number of text chunks created from all documents
Indexed | Documents that have been successfully indexed and are available for search
Pending | Documents waiting to be processed
Failed | Documents that failed during indexing
Deleted | Documents that have been removed (via whitelist reconciliation, source deletion, or manual deletion)
Deactivated | Documents that have been manually deactivated by an administrator
Last Sync | Timestamp of the most recent sync or indexing operation

Documents Table

The documents table provides a sortable, filterable view of all documents in the knowledge base.

Sorting

Click any column header to sort the table by that column. Click again to reverse the sort direction. An arrow icon indicates the active sort column and direction.

Sortable columns: Title, Source, Chunks, Status, Added, Last Indexed.
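
The click-to-sort behavior can be modeled as a small piece of state: the active column plus a direction flag. This is a sketch with hypothetical names and lowercase column keys, assuming a newly selected column starts in ascending order.

```python
def toggle_sort(state: tuple[str, bool], clicked: str) -> tuple[str, bool]:
    """state is (column, descending). Clicking the active column reverses direction."""
    column, descending = state
    if clicked == column:
        return (clicked, not descending)
    return (clicked, False)  # assumed: a new column starts ascending


def sort_documents(documents: list[dict], state: tuple[str, bool]) -> list[dict]:
    """Return a new list sorted by the active column and direction."""
    column, descending = state
    return sorted(documents, key=lambda d: d[column], reverse=descending)
```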

Filtering

Filter | Description
Search | Free-text search across document title, source URL, and filename
Type | Filter by document source type: Files (uploaded), Web URLs, or GCS imports
Status | Filter by processing status: All, Ready (indexed), Processing, Queued (pending), Failed, or Deleted
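
The Search filter covers three fields. A minimal sketch of such a match, assuming case-insensitive substring matching and hypothetical field names (`title`, `source_url`, `filename`):

```python
def matches_search(doc: dict, query: str) -> bool:
    """Return True if the query appears in the title, source URL, or filename."""
    needle = query.strip().casefold()
    if not needle:
        return True  # an empty search matches every document
    fields = (doc.get("title"), doc.get("source_url"), doc.get("filename"))
    return any(needle in (value or "").casefold() for value in fields)
```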

Bulk Actions

Select multiple documents using the checkboxes to perform bulk operations such as re-indexing or deletion. A floating toolbar appears at the bottom of the page when documents are selected.

WARNING

Deleting a document removes it from the vector index, so future searches will no longer return results from that document. This action cannot be undone.
