Content Indexing

The Clear Ideas App includes an integrated indexing system for the text content of supported document formats. When indexing is enabled in the site settings, every uploaded file in a supported format is indexed automatically. Indexing makes document content searchable and available to AI features.

How Content Indexing Works

Content indexing extracts text from uploaded documents and makes it searchable. The process has four steps:

  1. Upload: Document is uploaded to a site
  2. Text Extraction: Text is extracted from the document (using OCR for images/PDFs if needed)
  3. Indexing: Extracted text is processed and indexed
  4. Search Availability: Content becomes available for search and AI features

Automatic Process: Indexing happens automatically when content is uploaded, provided indexing is enabled in site settings.
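The four steps above can be pictured with a small sketch. This is an illustration only, not the Clear Ideas implementation: the names Document, SearchIndex, run_ocr, and index_on_upload are hypothetical, and the OCR step is a stand-in that returns the text it is given.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    raw_text: str = ""        # text the file carries (hypothetical field)
    needs_ocr: bool = False   # True for image-based content such as scanned PDFs


@dataclass
class SearchIndex:
    # Maps a document name to the set of lowercase words it contains
    entries: dict = field(default_factory=dict)

    def add(self, name: str, text: str) -> None:
        self.entries[name] = set(text.lower().split())

    def search(self, term: str) -> list:
        term = term.lower()
        return sorted(n for n, words in self.entries.items() if term in words)


def run_ocr(doc: Document) -> str:
    # Stand-in for a real OCR engine (assumption): returns the text unchanged
    return doc.raw_text


def index_on_upload(doc: Document, index: SearchIndex, indexing_enabled: bool) -> bool:
    """Steps 2 to 4: extract text, index it, make it searchable."""
    if not indexing_enabled:
        return False  # indexing is disabled in site settings, so nothing is indexed
    text = run_ocr(doc) if doc.needs_ocr else doc.raw_text
    index.add(doc.name, text)
    return True


index = SearchIndex()
index_on_upload(Document("report.md", "Quarterly revenue summary"), index, True)
print(index.search("revenue"))
```

In this sketch a document uploaded while indexing is disabled never reaches the index, which mirrors the "provided indexing is enabled" condition above.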

Supported File Formats

Clear Ideas supports indexing for a wide range of document formats:

  • JSON - .json (application/json)
  • Word Document - .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  • OpenDocument Text - .odt (application/vnd.oasis.opendocument.text)
  • EPUB - .epub (application/epub+zip)
  • PowerPoint Presentation - .pptx (application/vnd.openxmlformats-officedocument.presentationml.presentation)
  • PowerPoint - .ppt (application/vnd.ms-powerpoint)
  • RTF - .rtf (application/rtf)
  • Texinfo - .texinfo (application/x-texinfo)
  • LaTeX - .latex (application/x-latex)
  • PDF - .pdf (application/pdf)
  • reStructuredText - .rst (text/x-rst)
  • Textile - .textile (text/x-textile)
  • MediaWiki - .mediawiki (text/x-mediawiki)
  • DocBook - .docbook (application/docbook+xml)
  • JATS - .jats (application/jats+xml)
  • Org mode - .org (text/x-org)
  • Jupyter Notebook - .ipynb (application/x-ipynb+json)
  • CSV - .csv (text/csv)
  • AsciiDoc - .asciidoc (text/asciidoc)
  • CommonMark - .commonmark (text/commonmark)
  • Creole - .creole (text/x-creole)
  • OPML - .opml (application/x-opml)
  • ICML - .icml (application/vnd.adobe.icml)
  • Wiki - .wiki (text/x-wiki)
  • Jira - .jira (text/x-jira)
  • HTML - .html, .htm (text/html)
  • Markdown - .md, .markdown (text/markdown)
  • Excel - .xls (application/vnd.ms-excel)
  • Excel - .xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)

Text-Based Formats: Documents with native text content (DOCX, TXT, MD, etc.) are indexed directly.

Image-Based Formats: Documents whose text is stored as images (such as scanned PDFs and images) are indexed through OCR.
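If you want to check a batch of files before uploading, the list above can be turned into a simple lookup. The sketch below copies the extensions and MIME types from the list; the helper name indexable_mime_type is hypothetical, and matching on file extension alone is an assumption about how a client-side check might work, not a description of how Clear Ideas detects formats.

```python
from pathlib import PurePath
from typing import Optional

# Extension -> MIME type, copied from the supported formats list above
SUPPORTED_FORMATS = {
    ".json": "application/json",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".epub": "application/epub+zip",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".ppt": "application/vnd.ms-powerpoint",
    ".rtf": "application/rtf",
    ".texinfo": "application/x-texinfo",
    ".latex": "application/x-latex",
    ".pdf": "application/pdf",
    ".rst": "text/x-rst",
    ".textile": "text/x-textile",
    ".mediawiki": "text/x-mediawiki",
    ".docbook": "application/docbook+xml",
    ".jats": "application/jats+xml",
    ".org": "text/x-org",
    ".ipynb": "application/x-ipynb+json",
    ".csv": "text/csv",
    ".asciidoc": "text/asciidoc",
    ".commonmark": "text/commonmark",
    ".creole": "text/x-creole",
    ".opml": "application/x-opml",
    ".icml": "application/vnd.adobe.icml",
    ".wiki": "text/x-wiki",
    ".jira": "text/x-jira",
    ".html": "text/html",
    ".htm": "text/html",
    ".md": "text/markdown",
    ".markdown": "text/markdown",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def indexable_mime_type(filename: str) -> Optional[str]:
    """Return the listed MIME type for a file name, or None if the extension is not listed."""
    extension = PurePath(filename).suffix.lower()  # case-insensitive match
    return SUPPORTED_FORMATS.get(extension)


for name in ["Report.PDF", "notes.markdown", "archive.zip"]:
    print(name, "->", indexable_mime_type(name))
```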

Optical Character Recognition (OCR)

Many supported formats are inherently text-based. Others, such as PDFs, may need Optical Character Recognition (OCR) before their text can be extracted.

How OCR Works

Image Analysis: OCR analyzes images and PDF pages to identify text characters

Text Extraction: Recognized characters are converted to searchable text

Processing: Extracted text is processed and indexed for search

AI Features: Once extracted, text becomes accessible for AI-enhanced search and AI chat functionalities

OCR Requirements

Site Settings: OCR must be enabled in site settings (see Advanced Search Settings)

Account Settings: OCR must be enabled at the account level

File Types: OCR applies to PDFs and image-based documents
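Taken together, the three requirements act as a gate on how text is extracted. The following sketch expresses that gate under stated assumptions: the Extraction enum and choose_extraction function are hypothetical names, and the "skipped" outcome for image-based files without OCR is an assumption about behavior that this guide does not spell out.

```python
from enum import Enum


class Extraction(Enum):
    DIRECT = "direct"     # text is read straight from the file
    OCR = "ocr"           # text is recognized from page images
    SKIPPED = "skipped"   # image-based file, but OCR is not available (assumed outcome)


def choose_extraction(is_image_based: bool, account_ocr: bool, site_ocr: bool) -> Extraction:
    """Pick an extraction path from the file type and the two OCR settings."""
    if not is_image_based:
        return Extraction.DIRECT
    # OCR must be enabled at both the account level and the site level
    if account_ocr and site_ocr:
        return Extraction.OCR
    return Extraction.SKIPPED


print(choose_extraction(is_image_based=True, account_ocr=True, site_ocr=False))
```

The point of the sketch is that enabling OCR at only one level is not enough for an image-based document.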

OCR Processing Time

Small Documents: Typically process quickly

Large Documents: May take longer depending on page count and complexity

Background Processing: OCR processing happens in the background

Security and Privacy

All extracted text remains encrypted throughout its lifecycle in the Clear Ideas App. The only exception is the moment the text is used to fulfill an AI Chat request.

Encryption

At Rest: Extracted text is encrypted when stored

In Transit: Text is encrypted when transmitted

AI Processing: Text is decrypted only when needed for AI features, then re-encrypted

For further details on the security of extracted text, refer to our Encryption & Privacy guide.

Content Indexing Status

Administrators can monitor the content indexing status of each file through a dedicated icon displayed next to the file.

Indexing Status Icons

Indexed: File has been successfully indexed and is searchable

Indexing: File is currently being indexed

Not Indexed: File has not been indexed (may be unsupported format or indexing disabled)

Error: Indexing failed (may need to retry or check file format)

For more information, please visit our Information Icons section.
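The four statuses can be treated as a small state model when reviewing many files at once. The sketch below is hypothetical: the IndexingStatus enum, the files_needing_attention helper, and the idea of receiving statuses as a dictionary are assumptions for illustration, since this guide describes icons in the interface rather than a programmatic interface.

```python
from enum import Enum


class IndexingStatus(Enum):
    INDEXED = "indexed"          # successfully indexed and searchable
    INDEXING = "indexing"        # currently being indexed
    NOT_INDEXED = "not_indexed"  # unsupported format or indexing disabled
    ERROR = "error"              # indexing failed


# Statuses that call for a follow-up from an administrator
NEEDS_ATTENTION = {IndexingStatus.NOT_INDEXED, IndexingStatus.ERROR}


def files_needing_attention(statuses: dict) -> list:
    """Return the names of files that are neither indexed nor in progress, sorted by name."""
    return sorted(name for name, status in statuses.items() if status in NEEDS_ATTENTION)


statuses = {
    "handbook.docx": IndexingStatus.INDEXED,
    "scan.pdf": IndexingStatus.ERROR,
    "slides.pptx": IndexingStatus.INDEXING,
    "archive.zip": IndexingStatus.NOT_INDEXED,
}
print(files_needing_attention(statuses))
```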

Checking Indexing Status

File List: Indexing status icons appear next to files in list and tile views

File Details: Indexing status is shown in file information panels

Settings: Site settings show overall indexing status and statistics

Enabling Content Indexing

Account-Level Settings

Content indexing must be enabled at the account level:

  1. Navigate to Settings > Search
  2. Enable Full-Text Search and OCR for PDFs as needed
  3. Save settings

Site-Level Settings

Site-level settings can narrow the account defaults for an individual site:

  1. Navigate to Site Settings > Search
  2. Configure indexing settings for the site
  3. Save settings

Note: Site settings can only restrict features enabled at the account level. See Advanced Search Settings for details.
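Because site settings can only restrict what the account allows, the effective setting for a site behaves like a logical AND of the two levels. A minimal sketch of that rule follows; the SearchSettings class and effective_settings function are hypothetical names, with fields modeled on the Full-Text Search and OCR for PDFs options mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    full_text_search: bool
    ocr_for_pdfs: bool


def effective_settings(account: SearchSettings, site: SearchSettings) -> SearchSettings:
    """A feature is active for a site only if it is enabled at both levels."""
    return SearchSettings(
        full_text_search=account.full_text_search and site.full_text_search,
        ocr_for_pdfs=account.ocr_for_pdfs and site.ocr_for_pdfs,
    )


account = SearchSettings(full_text_search=True, ocr_for_pdfs=False)
site = SearchSettings(full_text_search=True, ocr_for_pdfs=True)

# The site asks for OCR, but the account has it disabled, so OCR stays off
print(effective_settings(account, site))
```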

Indexing and AI Features

AI-Enhanced Search

Indexed content enables AI-enhanced search, which uses semantic understanding to find relevant content even when exact keywords don't match.

Benefits:

  • Better search results
  • Semantic understanding
  • Context-aware results

AI Chat

Indexed content powers AI Chat, allowing AI models to answer questions based on your documents.

Benefits:

  • Document-based answers
  • Context-aware responses
  • Comprehensive understanding

Document Summaries

Indexed content enables AI document summaries, providing quick overviews of document content.

Benefits:

  • Quick document understanding
  • Time-saving summaries
  • Content overviews

Indexing Statistics

Page Equivalents Indexed

Metric: Site statistics show "Page Equivalents Indexed", an estimate of the number of pages processed for search and AI features

Usage: This metric helps you understand the scope of indexing and how much of it you use

Tracking: Monitor indexing statistics to see how much content is being processed

Indexing Limits

Plan Limits: Your plan may include limits on indexed pages

Overage: Additional indexed pages beyond plan limits may incur charges

Monitoring: Track indexing usage to stay within plan limits
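To make the relationship between usage and plan limits concrete, here is a small worked sketch. The numbers are invented for illustration, the function names are hypothetical, and the 90% warning threshold is an assumption; actual limits and any overage pricing depend on your plan.

```python
def overage_pages(pages_indexed: int, plan_limit: int) -> int:
    """Pages indexed beyond the plan limit (0 when usage is within the limit)."""
    return max(0, pages_indexed - plan_limit)


def is_near_limit(pages_indexed: int, plan_limit: int, threshold: float = 0.9) -> bool:
    """True once usage reaches the given fraction of the plan limit."""
    return pages_indexed >= threshold * plan_limit


# Hypothetical figures: a 10,000-page plan with 12,500 page equivalents indexed
print(overage_pages(12_500, 10_000))
print(is_near_limit(9_500, 10_000))
```

Checking the "Page Equivalents Indexed" figure against a threshold like this gives some warning before overage applies.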

Best Practices

Enable Indexing

When Needed: Enable indexing for sites where search and AI features are important

Selective: Consider disabling indexing for sites where it's not needed to optimize performance

Review: Regularly review indexing settings to ensure they match your needs

Monitor Status

Regular Checks: Check indexing status regularly to ensure content is being indexed

Error Resolution: Address indexing errors promptly to ensure content is searchable

Status Icons: Use status icons to quickly identify indexing issues

Optimize Content

File Formats: Use supported formats for best indexing results

Text Quality: Ensure documents have clear, readable text for OCR

Structure: Well-structured documents index more effectively

Troubleshooting

Content Not Indexing

Check Settings: Verify indexing is enabled at account and site levels

File Format: Ensure file format is supported for indexing

Status Icons: Check status icons for indexing errors

Retry: Some indexing failures may resolve on retry
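The checks above are easiest to work through in order, stopping at the first one that fails. The sketch below encodes that order as a hypothetical diagnose function; the parameter names and the status strings are assumptions for illustration and do not correspond to a Clear Ideas API.

```python
def diagnose(account_enabled: bool, site_enabled: bool,
             format_supported: bool, status: str) -> str:
    """Return the first applicable troubleshooting step for a file that is not searchable."""
    if not account_enabled:
        return "Enable indexing at the account level (Settings > Search)."
    if not site_enabled:
        return "Enable indexing for the site (Site Settings > Search)."
    if not format_supported:
        return "Use a supported file format."
    if status == "error":
        return "Retry indexing; contact support if the error persists."
    if status == "indexing":
        return "Indexing is still running in the background; check again later."
    return "No problem detected."


print(diagnose(account_enabled=True, site_enabled=False,
               format_supported=True, status="not_indexed"))
```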

Slow Indexing

Large Files: Large files may take longer to index

OCR Processing: OCR processing may take time for image-heavy documents

Background Processing: Indexing happens in the background - allow time for completion

Indexing Errors

Format Issues: Some file formats may not index correctly

Corrupted Files: Corrupted files may fail to index

Support: Contact support if indexing errors persist