---
title: Content Indexing
description: >-
  Understand content indexing in Clear Ideas. Learn about text extraction, OCR
  capabilities, supported file formats, and indexing status for enhanced search.
ogTitle: Content Indexing Guide
ogDescription: >-
  Master content indexing in Clear Ideas. Learn about text extraction, OCR,
  supported formats, and indexing status for enhanced AI search
ogImage: /assets/images/og/guide-content-indexing.webp
navigation:
  icon: fasl fa-sync
---

# Content Indexing

The Clear Ideas App includes an integrated indexing system for the text in supported document formats. When indexing is enabled in the site settings, files in a supported format are indexed automatically on upload. Indexed content is what makes full-text search and the AI features (AI-enhanced search, AI Chat, and document summaries) possible.

## How Content Indexing Works

Content indexing extracts text from uploaded documents and makes it searchable. The indexing process:

1. **Upload**: Document is uploaded to a site
2. **Text Extraction**: Text is extracted from the document (using OCR for images/PDFs if needed)
3. **Indexing**: Extracted text is processed and indexed
4. **Search Availability**: Content becomes available for search and AI features

**Automatic Process**: Indexing happens automatically when content is uploaded, provided indexing is enabled in site settings.
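
As an illustration only, the four steps above can be sketched as a small Python pipeline. Every name here (`Document`, `SearchIndex`, `run_ocr`, `on_upload`) is hypothetical and not part of any Clear Ideas API. The OCR step is a placeholder, and the index is a simple in-memory word map rather than a real search engine.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    text: str = ""           # native text; empty for image-only documents
    needs_ocr: bool = False  # hypothetical flag marking image-only content


@dataclass
class SearchIndex:
    entries: dict = field(default_factory=dict)  # term -> set of document names

    def add(self, name: str, text: str) -> None:
        # Step 3: split the extracted text into terms and record the document under each.
        for term in text.lower().split():
            self.entries.setdefault(term, set()).add(name)

    def search(self, term: str) -> set:
        # Step 4: indexed content can be looked up by term.
        return self.entries.get(term.lower(), set())


def run_ocr(doc: Document) -> str:
    # Placeholder standing in for a real OCR engine (assumption, not the product's OCR).
    return f"text recognized from {doc.name}"


def extract_text(doc: Document) -> str:
    # Step 2: use native text when available, otherwise fall back to OCR.
    return run_ocr(doc) if doc.needs_ocr else doc.text


def on_upload(doc: Document, index: SearchIndex, indexing_enabled: bool) -> bool:
    # Step 1: an upload triggers indexing only when the site setting is enabled.
    if not indexing_enabled:
        return False
    index.add(doc.name, extract_text(doc))
    return True
```

In this sketch, a document uploaded while `indexing_enabled` is `False` is stored but never reaches the index, which mirrors the behavior described above.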

## Supported File Formats

Clear Ideas supports indexing for a wide range of document formats:

- AsciiDoc - .asciidoc (text/asciidoc)
- CommonMark - .commonmark (text/commonmark)
- Creole - .creole (text/x-creole)
- CSV - .csv (text/csv)
- DocBook - .docbook (application/docbook+xml)
- EPUB - .epub (application/epub+zip)
- Excel - .xls (application/vnd.ms-excel)
- Excel - .xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
- HTML - .html, .htm (text/html)
- ICML - .icml (application/vnd.adobe.icml)
- JATS - .jats (application/jats+xml)
- Jira - .jira (text/x-jira)
- JSON - .json (application/json)
- Jupyter Notebook - .ipynb (application/x-ipynb+json)
- LaTeX - .latex (application/x-latex)
- Markdown - .md, .markdown (text/markdown)
- MediaWiki - .mediawiki (text/x-mediawiki)
- OpenDocument Text - .odt (application/vnd.oasis.opendocument.text)
- OPML - .opml (application/x-opml)
- Org mode - .org (text/x-org)
- PDF - .pdf (application/pdf)
- PowerPoint - .ppt (application/vnd.ms-powerpoint)
- PowerPoint Presentation - .pptx (application/vnd.openxmlformats-officedocument.presentationml.presentation)
- reStructuredText - .rst (text/x-rst)
- RTF - .rtf (application/rtf)
- Texinfo - .texinfo (application/x-texinfo)
- Textile - .textile (text/x-textile)
- Wiki - .wiki (text/x-wiki)
- Word Document - .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document)

**Text-Based Formats**: Documents with native text content (DOCX, MD, HTML, etc.) are indexed directly.

**Image-Based Formats**: Documents whose text is stored only as page images, such as scanned PDFs, are indexed through OCR.
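
If you want to check files before uploading them, a lookup like the following can help. It is a minimal sketch that copies only a few of the extension and MIME type pairs from the list above. The names `INDEXABLE_TYPES` and `indexable_mime_type` are hypothetical, and the check is not part of the Clear Ideas App.

```python
from pathlib import PurePath
from typing import Optional

# A subset of the extension -> MIME type pairs listed above; extend as needed.
INDEXABLE_TYPES = {
    ".csv": "text/csv",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".htm": "text/html",
    ".html": "text/html",
    ".markdown": "text/markdown",
    ".md": "text/markdown",
    ".pdf": "application/pdf",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def indexable_mime_type(filename: str) -> Optional[str]:
    """Return the MIME type if the extension is in the table, otherwise None."""
    suffix = PurePath(filename).suffix.lower()  # case-insensitive match on the extension
    return INDEXABLE_TYPES.get(suffix)
```

A result of `None` only means the extension is missing from this partial table, so add the remaining entries from the list before relying on it.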

## Optical Character Recognition (OCR)

Many supported formats are inherently text-based. Others, such as PDFs made up of scanned pages, contain text only as images and may require Optical Character Recognition (OCR) before their text can be extracted.

### How OCR Works

**Image Analysis**: OCR analyzes images and PDF pages to identify text characters

**Text Extraction**: Recognized characters are assembled into machine-readable text

**Processing**: The extracted text is processed and indexed for search

**AI Features**: Once indexed, the text becomes available to AI-enhanced search and AI Chat

### OCR Requirements

**Site Settings**: OCR must be enabled in site settings (see [Advanced Search Settings](/site-administrator-guide/advanced-search-settings))

**Account Settings**: OCR must be enabled at the account level

**File Types**: OCR applies to PDFs and image-based documents

### OCR Processing Time

**Small Documents**: Typically process quickly

**Large Documents**: May take longer depending on page count and complexity

**Background Processing**: OCR processing happens in the background

## Security and Privacy

All extracted text remains encrypted throughout its lifecycle in the Clear Ideas App. The only exception is when the text is used to fulfill an AI Chat request.

### Encryption

**At Rest**: Extracted text is encrypted when stored

**In Transit**: Text is encrypted when transmitted

**AI Processing**: Text is decrypted only when needed for AI features, then re-encrypted

For further details on the security of extracted text, refer to our [Encryption & Privacy](/security/encryption-and-privacy) guide.

## Content Indexing Status

Administrators can monitor the content indexing status of each file through a dedicated icon displayed next to the file.

### Indexing Status Icons

**Indexed**: File has been successfully indexed and is searchable

**Indexing**: File is currently being indexed

**Not Indexed**: File has not been indexed (the format may be unsupported, or indexing may be disabled)

**Error**: Indexing failed (retry the file or check its format)

For more information, please visit our [Information Icons](/guide/site-view#information-icons) section.
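
For teams that track indexing state in their own scripts or reports, the four states above map naturally onto an enumeration. This is a hedged sketch: `IndexingStatus`, `is_searchable`, and `files_needing_attention` are illustrative names, and the Clear Ideas App is not known to expose statuses in this form.

```python
from enum import Enum


class IndexingStatus(Enum):
    INDEXED = "indexed"          # successfully indexed and searchable
    INDEXING = "indexing"        # currently being indexed
    NOT_INDEXED = "not_indexed"  # unsupported format or indexing disabled
    ERROR = "error"              # indexing failed


def is_searchable(status: IndexingStatus) -> bool:
    # Only fully indexed files appear in search results.
    return status is IndexingStatus.INDEXED


def files_needing_attention(statuses: dict) -> list:
    # Return the names of files whose indexing failed, sorted for stable output.
    return sorted(name for name, status in statuses.items()
                  if status is IndexingStatus.ERROR)
```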

### Checking Indexing Status

**File List**: Indexing status icons appear next to files in list and tile views

**File Details**: Indexing status is shown in file information panels

**Settings**: Site settings show overall indexing status and statistics

## Enabling Content Indexing

### Account-Level Settings

Content indexing must be enabled at the account level:

1. Navigate to **Settings > Search**
2. Enable **Full-Text Search** and **OCR for PDFs** as needed
3. Save settings

### Site-Level Settings

Site-level settings can narrow the account defaults for an individual site:

1. Navigate to **Site Settings > Search**
2. Configure indexing settings for the site
3. Save settings

**Note**: Site settings can only restrict features enabled at the account level. See [Advanced Search Settings](/site-administrator-guide/advanced-search-settings) for details.
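
The rule in the note above amounts to a logical AND between the two levels: a feature is active on a site only when both the account and the site enable it. The sketch below models that rule. `SearchSettings` and `effective_settings` are hypothetical names chosen for illustration, with fields named after the **Full-Text Search** and **OCR for PDFs** options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    full_text_search: bool
    ocr_for_pdfs: bool


def effective_settings(account: SearchSettings, site: SearchSettings) -> SearchSettings:
    # A site can switch off a feature the account allows,
    # but it cannot switch on a feature the account has disabled.
    return SearchSettings(
        full_text_search=account.full_text_search and site.full_text_search,
        ocr_for_pdfs=account.ocr_for_pdfs and site.ocr_for_pdfs,
    )
```

Under this model, a site that requests OCR while the account has OCR disabled still ends up with OCR off.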

## Indexing and AI Features

### AI-Enhanced Search

Indexed content enables AI-enhanced search, which uses semantic understanding to find relevant content even when exact keywords don't match.

**Benefits**:
- Better search results
- Semantic understanding
- Context-aware results

### AI Chat

Indexed content powers AI Chat, allowing AI models to answer questions based on your documents.

**Benefits**:
- Document-based answers
- Context-aware responses
- Comprehensive understanding

### Document Summaries

Indexed content enables AI document summaries, providing quick overviews of document content.

**Benefits**:
- Quick document understanding
- Time-saving summaries
- Content overviews

## Redacted Representations and Indexing

When PDF redaction is finalized, Clear Ideas generates a separate hidden redacted representation and processes that artifact through the same extraction pipeline used for other indexed content.

This allows restricted-role users to work from:

- redacted extracted text
- redacted search data
- redacted AI summaries

while authorized roles can continue to work from the original representation.

This representation-aware approach ensures that search, AI Chat, and summaries align with the version of the document the user is permitted to access.
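
Conceptually, representation-aware access means that every lookup first selects which version of the extracted text a user may see. The following is a minimal sketch of that selection for a file whose redaction has been finalized. `RedactedFile` and `text_for_role` are hypothetical names, and the sketch does not describe how Clear Ideas implements the check internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedactedFile:
    original_text: str  # extracted from the original PDF
    redacted_text: str  # extracted from the hidden redacted representation


def text_for_role(file: RedactedFile, role_is_restricted: bool) -> str:
    # Restricted roles only ever receive the redacted extraction;
    # authorized roles continue to work from the original.
    return file.redacted_text if role_is_restricted else file.original_text
```

Because search, AI Chat, and summaries would all read through the same selection step in this model, a restricted user never receives text from the original representation.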

**Important**:

- redacted PDFs support OCR and extracted-text processing for restricted-role AI and search behavior
- finalized redacted PDFs are not yet guaranteed to preserve searchable embedded OCR or text metadata inside the exported PDF itself

See [PDF Redaction](/site-administrator-guide/pdf-redaction).

## Indexing Statistics

### Page Equivalents Indexed

**Metric**: Site statistics show "Page Equivalents Indexed", an estimate of the number of pages processed for search and AI features

**Usage**: This metric helps you understand how much content has been indexed

**Tracking**: Monitor this metric over time to see how indexing volume changes

### Indexing Limits

**Plan Limits**: Your plan may include limits on indexed pages

**Overage**: Additional indexed pages beyond plan limits may incur charges

**Monitoring**: Track indexing usage to stay within plan limits
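
To make the relationship between usage and plan limits concrete, the arithmetic can be sketched as follows. The function names are hypothetical, the figures in the usage note are made up, and actual limits and charges depend on your plan.

```python
def overage_pages(pages_indexed: int, plan_limit: int) -> int:
    # Pages beyond the plan limit; zero while usage is within the limit.
    return max(0, pages_indexed - plan_limit)


def usage_ratio(pages_indexed: int, plan_limit: int) -> float:
    # Fraction of the plan limit consumed so far (values above 1.0 mean overage).
    if plan_limit <= 0:
        raise ValueError("plan_limit must be positive")
    return pages_indexed / plan_limit
```

For example, with an illustrative limit of 10,000 page equivalents and 12,500 indexed, `overage_pages(12_500, 10_000)` returns `2500`.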

## Best Practices

### Enable Indexing

**When Needed**: Enable indexing for sites where search and AI features are important

**Selective**: To optimize performance, consider disabling indexing for sites that do not need it

**Review**: Regularly review indexing settings to ensure they match your needs

### Monitor Status

**Regular Checks**: Check indexing status regularly to ensure content is being indexed

**Error Resolution**: Address indexing errors promptly to ensure content is searchable

**Status Icons**: Use status icons to quickly identify indexing issues

### Optimize Content

**File Formats**: Use supported formats for best indexing results

**Text Quality**: Ensure documents have clear, readable text for OCR

**Structure**: Well-structured documents index more effectively

## Troubleshooting

### Content Not Indexing

**Check Settings**: Verify indexing is enabled at account and site levels

**File Format**: Ensure file format is supported for indexing

**Status Icons**: Check status icons for indexing errors

**Retry**: Some indexing failures may resolve on retry

### Slow Indexing

**Large Files**: Large files may take longer to index

**OCR Processing**: OCR processing may take time for image-heavy documents

**Background Processing**: Indexing happens in the background - allow time for completion

### Indexing Errors

**Format Issues**: Some file formats may not index correctly

**Corrupted Files**: Corrupted files may fail to index

**Support**: Contact support if indexing errors persist

## Related Documentation

- [Advanced Search Settings](/site-administrator-guide/advanced-search-settings) - Configure search and indexing
- [Site AI Settings](/site-administrator-guide/site-ai-settings) - Enable AI features that use indexed content
- [Encryption and Privacy](/security/encryption-and-privacy) - Learn about security practices
- [Site Statistics](/site-administrator-guide/site-statistics) - View indexing statistics
