
WebSearchAPI.ai Scraper API Reference

Complete reference for the WebSearchAPI.ai Scraper endpoint with advanced content extraction capabilities

WebSearchAPI.ai offers a powerful web scraping API built for developers who need full control over content extraction. Extract clean, structured content from any URL with browser rendering, CSS selectors, JavaScript injection, and AI-optimized markdown formatting that is ideal for LLMs.

Introduction

The WebSearchAPI.ai Scraper API allows you to:

  • Extract and parse content from any URL with AI-optimized formatting
  • Use full browser rendering for JavaScript-heavy websites
  • Target specific content with CSS selectors
  • Inject custom JavaScript before scraping
  • Generate AI-powered alt text for images
  • Customize markdown output formatting
  • Extract links and images with summaries
  • Control privacy and caching behavior

Endpoint Details

POST /scrape

Base URL: https://api.websearchapi.ai

Full Endpoint: https://api.websearchapi.ai/scrape

Authentication

Authentication Required: All API requests require authentication using your API key in the Authorization header.

Include your API key in the Authorization header using the Bearer token format:

Authorization: Bearer YOUR_API_KEY

You can obtain an API key by signing up for a WebSearchAPI.ai account. Each account receives 1,000 free API credits monthly.

Request Format

All requests to the Scraper API should be made as HTTP POST requests with a JSON body containing the scraping parameters.

Required Headers

Header          Value
Content-Type    application/json
Authorization   Bearer YOUR_API_KEY

Request Body

The request body should be a JSON object containing your scraping parameters:

{
  "url": "https://example.com",
  "returnFormat": "markdown",
  "engine": "browser",
  "targetSelector": "article, main",
  "removeSelector": "header, footer, nav, .ads",
  "withLinksSummary": true,
  "withImagesSummary": true
}

Request Parameters

The parameters below are the ones used throughout this reference; see the sections that follow for details.

Parameter            Description
url                  The URL to scrape
engine               Rendering engine: direct, browser, or cf-browser-rendering
returnFormat         Output format, e.g. markdown
respondWith          Advanced processing mode, e.g. readerlm-v2
targetSelector       CSS selector(s) for the content to extract
removeSelector       CSS selector(s) for content to strip out
injectPageScript     Custom JavaScript to execute before scraping
withLinksSummary     Extract links with a summary
withImagesSummary    Extract images with a summary
withGeneratedAlt     Generate AI-powered alt text for images
timeout              Request timeout (e.g. 15)
tokenBudget          Limit on extracted content size for LLM use
retainImages         Control which images are kept (e.g. none)
mdHeadingStyle, mdBulletListMarker, mdEmDelimiter, mdStrongDelimiter, mdLinkStyle
                     Markdown output formatting options
dnt, noCache, proxy, setCookie
                     Privacy and caching controls

Example Requests

curl -X POST https://api.websearchapi.ai/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "url": "https://example.com/article",
  "returnFormat": "markdown",
  "engine": "browser",
  "targetSelector": "article, main",
  "removeSelector": "header, footer, nav, .ads",
  "withLinksSummary": true,
  "withImagesSummary": true,
  "withGeneratedAlt": true,
  "timeout": 15
}'
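
The same request in JavaScript, sketched with Node 18+'s built-in fetch:

// Equivalent of the curl example above (Node 18+).
const response = await fetch('https://api.websearchapi.ai/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/article',
    returnFormat: 'markdown',
    engine: 'browser',
    targetSelector: 'article, main',
    removeSelector: 'header, footer, nav, .ads',
    withLinksSummary: true,
    withImagesSummary: true,
    withGeneratedAlt: true,
    timeout: 15,
  }),
});

const result = await response.json();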

Response Format

Success Response (200 OK)

A successful API call returns a JSON object with the following structure:

{
  "code": 200,
  "data": {
    "title": "Example Article Title",
    "url": "https://example.com/article",
    "content": "# Example Article Title\n\nThis is the extracted content in markdown format...",
    "links": {
      "https://example.com/page1": "Page 1 Title",
      "https://example.com/page2": "Page 2 Title"
    },
    "images": {
      "https://example.com/image1.jpg": "AI-generated alt text for image 1",
      "https://example.com/image2.jpg": "AI-generated alt text for image 2"
    }
  }
}
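
A sketch of reading these fields in JavaScript, assuming result holds the parsed response body:

// Destructure the fields shown in the example response above.
const { title, url, content, links, images } = result.data;

console.log(`Scraped "${title}" from ${url}`);
console.log(content); // markdown content, ready for an LLM or renderer

// links and images are objects keyed by URL.
for (const [href, linkTitle] of Object.entries(links ?? {})) {
  console.log(`${href} -> ${linkTitle}`);
}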

Response Fields

Field          Description
code           Status code of the request (200 on success)
data.title     Title of the scraped page
data.url       The URL that was scraped
data.content   Extracted content in the requested returnFormat
data.links     Map of link URL to link title (with withLinksSummary)
data.images    Map of image URL to alt text (with withImagesSummary; AI-generated when withGeneratedAlt is set)

Error Responses

WebSearchAPI.ai returns standard HTTP status codes with JSON error details:

{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid request parameters",
    "details": {
      "url": "Invalid URL format"
    }
  }
}

A VALIDATION_ERROR response is returned when the request contains invalid or missing parameters; the details object identifies the offending fields.
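
A sketch of handling this error shape in JavaScript, assuming response is a fetch response from /scrape:

// On a non-2xx status, parse the JSON error body shown above.
if (!response.ok) {
  const { code, message, details } = (await response.json()).error ?? {};
  console.error(`Scrape failed (HTTP ${response.status}): ${code} - ${message}`);
  if (details) console.error(details); // e.g. { url: "Invalid URL format" }
}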

Rate Limits and Quotas

WebSearchAPI.ai uses a credit-based system for API usage. Each account receives 1,000 free API credits monthly. Scraper API credit usage varies based on features used:

Operation                        Credits
Basic scraping (direct engine)   1 credit
Browser rendering                2 credits
With link/image summaries        +1 credit
With AI-generated alt text       +1 credit
Screenshot/pageshot              2 credits

Pro Tip: Optimize Credit Usage. Use the direct engine for simple, static pages to save credits. Reserve the browser and cf-browser-rendering engines for JavaScript-heavy sites. Enable AI features like withGeneratedAlt only when needed for accessibility or LLM processing.
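
As an illustration of the credit table above (not an official calculator), a rough per-request estimate in JavaScript might look like this:

// Illustrative helper derived from the credit table above; not an official API.
function estimateCredits({ engine = 'direct', withSummaries = false, withGeneratedAlt = false }) {
  let credits = engine === 'direct' ? 1 : 2; // browser engines cost 2 credits
  if (withSummaries) credits += 1;           // link/image summaries add 1
  if (withGeneratedAlt) credits += 1;        // AI-generated alt text adds 1
  return credits;
}

estimateCredits({ engine: 'browser', withSummaries: true }); // => 3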

Advanced Features

Browser Rendering Engines

The Scraper API offers three rendering engines optimized for different use cases:

Engine                 Best For                            Speed     JavaScript Support
direct                 Static HTML pages, simple sites     Fastest   None
browser                JavaScript-heavy sites, SPAs        Medium    Full
cf-browser-rendering   Complex SPAs, anti-bot protection   Slower    Full + Advanced

Choosing the Right Engine

Direct Engine - Perfect for blogs, documentation sites, and static content. Fastest response time with minimal credit usage.

Browser Engine - Ideal for modern websites with JavaScript rendering. Handles React, Vue, and Angular applications.

CF Browser Rendering - Advanced rendering for complex SPAs and sites with anti-bot measures. Best for challenging scraping scenarios.

CSS Selectors for Precise Extraction

Use targetSelector and removeSelector to focus on the content you need:

{
  "url": "https://example.com/article",
  "targetSelector": "article, main, .content",
  "removeSelector": "header, footer, nav, .ads, .sidebar, .comments"
}

This approach:

  • Reduces noise in extracted content
  • Saves tokens for LLM processing
  • Improves content quality
  • Focuses on relevant information

JavaScript Injection

Execute custom JavaScript before scraping to manipulate the DOM:

{
  "url": "https://example.com",
  "engine": "browser",
  "injectPageScript": "document.querySelector('.cookie-banner')?.remove(); document.querySelector('.newsletter-popup')?.remove();"
}

Common use cases:

  • Remove popups and overlays
  • Trigger lazy-loaded content (see the sketch after this list)
  • Manipulate page elements
  • Extract data from JavaScript variables
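
For example, to trigger lazy-loaded content you might inject a scroll before extraction (a sketch; the scroll logic will depend on the target page):

{
  "url": "https://example.com/gallery",
  "engine": "browser",
  "injectPageScript": "window.scrollTo(0, document.body.scrollHeight);"
}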

AI-Enhanced Features

Generated Alt Text

Enable AI-powered alt text generation for images:

{
  "url": "https://example.com",
  "withImagesSummary": true,
  "withGeneratedAlt": true
}

This feature:

  • Generates descriptive alt text for images
  • Improves accessibility
  • Enhances LLM understanding of visual content
  • Useful for RAG applications

ReaderLM-v2 Processing

Use advanced AI processing for better content extraction:

{
  "url": "https://example.com",
  "respondWith": "readerlm-v2"
}

Benefits:

  • Superior HTML-to-Markdown conversion
  • Better preservation of complex structures
  • Enhanced table and list formatting
  • Optimized for LLM consumption

Markdown Customization

Customize markdown output to match your preferences:

{
  "url": "https://example.com",
  "returnFormat": "markdown",
  "mdHeadingStyle": "atx",
  "mdBulletListMarker": "-",
  "mdEmDelimiter": "*",
  "mdStrongDelimiter": "**",
  "mdLinkStyle": "inline"
}

Perfect for:

  • Maintaining consistent documentation styles
  • Matching existing markdown conventions
  • Optimizing for specific markdown parsers
  • Personal formatting preferences

Privacy and Compliance

Control privacy settings for GDPR compliance:

{
  "url": "https://example.com",
  "dnt": true,
  "noCache": true,
  "proxy": "eu"
}

Privacy features:

  • dnt: Sends Do Not Track header
  • noCache: Bypasses cache for fresh data
  • proxy: Use EU-based proxies for GDPR compliance
  • setCookie: Custom cookie settings for authenticated content
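
Since the exact setCookie value format isn't shown in this reference, the following is only a sketch that assumes a cookie-header-style string:

{
  "url": "https://example.com/account",
  "setCookie": "session_id=YOUR_SESSION_COOKIE"
}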

Use Cases

RAG Applications

Extract clean, token-optimized content for retrieval-augmented generation systems. Perfect for building knowledge bases and AI assistants.

Knowledge Base Building

Scrape documentation sites and build comprehensive knowledge graphs with link extraction and structured content.

Content Migration

Migrate content from legacy systems to modern platforms with preserved formatting and structure.

News Aggregation

Extract articles from news sites with clean formatting, removing ads and navigation clutter automatically.

Academic Research

Extract citations, references, and structured data from research papers and academic websites.

E-commerce Data

Extract product information, prices, reviews, and inventory status from e-commerce platforms.

Market Intelligence

Monitor competitor websites for pricing changes, product launches, and content updates.

SEO Analysis

Extract page content, metadata, and structure for SEO audits and competitive analysis.

Compliance Monitoring

Track legal and regulatory websites for updates and changes to compliance requirements.

Best Practices

1. Choose the Right Engine

Start with direct engine and upgrade to browser only when needed:

// Try the fast direct engine first (lowest credit cost)
let payload = { url: targetUrl, engine: 'direct' };
// scrapeWithRetry() is defined under Best Practice 4 below
let result = await scrapeWithRetry(payload);

// If the page needs JavaScript, retry with full browser rendering
if (!result.data.content) {
  payload.engine = 'browser';
  result = await scrapeWithRetry(payload);
}

2. Use Selectors Wisely

Target specific content to reduce noise and save tokens:

{
  "targetSelector": "article, main, .post-content",
  "removeSelector": "header, footer, nav, aside, .ads, .related-posts"
}

3. Optimize Token Budget

Set token limits for LLM optimization:

{
  "tokenBudget": 10000,
  "retainImages": "none"
}

4. Handle Errors Gracefully

Always implement retry logic with exponential backoff:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeWithRetry(payload, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await fetch('https://api.websearchapi.ai/scrape', {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(payload),
      });
      if (response.ok) return await response.json();

      if (response.status === 429) {
        // Rate limited: back off exponentially (1s, 2s, 4s, ...)
        await sleep(Math.pow(2, i) * 1000);
        continue;
      }

      throw new Error(`HTTP ${response.status}`);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
    }
  }
}

5. Monitor Credit Usage

Track credits to avoid unexpected charges:

// Credit headers arrive as strings; convert before comparing.
const creditsConsumed = Number(response.headers.get('X-Credits-Consumed'));
const creditsRemaining = Number(response.headers.get('X-Credits-Remaining'));

if (creditsRemaining < 100) {
  console.warn('Low credits! Consider upgrading.');
}