Integrate ZenRows with LlamaIndex to enable your RAG applications to access, index, and synthesize up-to-date web content from any website, including those with anti-bot protection and dynamic content.

What Is LlamaIndex?

LlamaIndex is an open-source framework that connects LLMs to external data sources such as databases, documents, and APIs. It provides tools for data ingestion, indexing, and query-based retrieval, and is commonly used to build retrieval-augmented generation (RAG) applications that feed AI agents with up-to-date information.
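To make the retrieval idea concrete, here's a toy sketch of the retrieval step at the heart of RAG. Plain word overlap stands in for the embedding similarity a framework like LlamaIndex computes; no LlamaIndex or LLM is involved, and the sample chunks are invented for illustration:

```python
# Toy illustration of the retrieval step in RAG (no LlamaIndex or LLM involved):
# index small text chunks, then fetch the ones most relevant to a query.

def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (a crude stand-in for
    the embedding similarity a real framework computes)."""
    query_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

chunks = [
    "ZenRows bypasses anti-bot protection for web scraping.",
    "LlamaIndex connects LLMs to external data sources.",
    "Bananas are rich in potassium.",
]
print(retrieve(chunks, "How does LlamaIndex connect LLMs to data?"))
```

A real RAG pipeline replaces the word-overlap score with vector embeddings and adds a synthesis step where an LLM answers from the retrieved chunks.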

Key Integration Benefits

  • Uninterrupted access to data: Build a reliable data layer that can access information from any website without getting blocked by anti-bot measures.
  • Real-time information retrieval: Extract fresh data quickly and efficiently, before it becomes stale.
  • Direct extraction of LLM-friendly data: Get pre-formatted, LLM-friendly data, such as the Markdown or JSON version of any website. ZenRows can also extract specific data directly.
  • Less code, more data: Scrape data continuously through a single API call to an auto-managed, auto-scaled solution.
  • Business-oriented development: Spend engineering time on your product instead of debugging and maintaining scrapers.
  • Handle dynamic content easily: Access heavily dynamic websites without writing complex waits and user simulations.
  • Borderless data retrieval: Give AI applications access to data from any location, without IP limitations, using residential proxies with geo-targeted IPs.
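As an illustration of the geo-targeting point above, the sketch below builds (but does not send) a request URL for the ZenRows Universal Scraper API with premium_proxy and proxy_country set. The endpoint and parameter names follow the ZenRows API; the key and target URL are placeholders:

```python
# Sketch: how geo-targeted scraping maps onto ZenRows API parameters.
# No request is sent here -- we only build the URL to show the shape of the call.
from urllib.parse import urlencode

def build_zenrows_url(api_key: str, target_url: str, country: str) -> str:
    params = {
        "apikey": api_key,
        "url": target_url,
        "premium_proxy": "true",   # residential IPs (required for geo-targeting)
        "proxy_country": country,  # two-letter country code, e.g. "us", "de"
    }
    return f"https://api.zenrows.com/v1/?{urlencode(params)}"

print(build_zenrows_url("YOUR_ZENROWS_API_KEY", "https://example.com", "us"))
```

When you use ZenRowsWebReader, LlamaIndex handles this request construction for you; the sketch only shows what the reader's parameters translate to at the API level.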

Use Cases of LlamaIndex-ZenRows Integration

  • Real-time price monitoring: Use ZenRows to scrape prices from several product sites in real-time and synthesize a comprehensive comparison with an LLM.
  • Competitive research: Scrape competitors’ offerings, product launches, strategies, and more with ZenRows, then use an LLM to correlate and analyze the data.
  • News and trends summarization: Use ZenRows to aggregate news, trends, hashtags, and more, across similar platforms. Summarize the aggregated data with an LLM and extract specific insights.
  • Dynamic chatbots: Build a chatbot that can access the web or specific web pages in real time to provide updated information.

Getting Started: Basic Usage

This example demonstrates how to extract content from a protected website using the ZenRowsWebReader. The ZenRowsWebReader enables you to use the official ZenRows Universal Scraper API as a data loader for web scraping in LlamaIndex.

1. Install the package

pip3 install llama-index-readers-web

2. Basic implementation

Import ZenRowsWebReader from the llama-index-readers-web package and initialize it as a reader instance, setting your ZenRows parameters through its constructor. Then load the target site as a document and print its content in the specified format (Markdown):
Python
# pip3 install llama-index-readers-web
from llama_index.readers.web import ZenRowsWebReader

api_key = "YOUR_ZENROWS_API_KEY"

# initialize the reader
reader = ZenRowsWebReader(
    api_key=api_key,
    js_render=True,
    premium_proxy=True,
    response_type="markdown",
)

# scrape a single URL
documents = reader.load_data(["https://www.scrapingcourse.com/antibot-challenge/"])
print(documents[0].text)
The code returns the target site’s content in Markdown format, as shown:
Markdown
[![](https://www.scrapingcourse.com/assets/images/logo.svg) Scraping Course](http://www.scrapingcourse.com/)

# Antibot Challenge

![](https://www.scrapingcourse.com/assets/images/challenge.svg)

## You bypassed the Antibot challenge! :D

Advanced Usage: Building a Simple RAG System

This example creates a simple RAG system that indexes multiple websites and responds to queries using the collected data. You’ll need an OpenAI API key for the LLM and embedding features, so have it ready.

1. Install the packages

pip3 install llama-index-readers-web llama-index-llms-openai llama-index-embeddings-openai

2. Set up ZenRowsWebReader

Import the required packages and specify your ZenRows and OpenAI API keys. Initialize ZenRowsWebReader using the desired ZenRows parameters. Include js_render and premium_proxy to effectively bypass anti-bot measures.
Python
# pip3 install llama-index-readers-web llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import VectorStoreIndex
from llama_index.readers.web import ZenRowsWebReader
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

api_key = "YOUR_ZENROWS_API_KEY"

# set up ZenRowsWebReader
reader = ZenRowsWebReader(
    api_key=api_key,
    js_render=True,
    premium_proxy=True,
    response_type="markdown",
    wait=2000,
)

3. Set up a vector index

Specify the target URLs in a list, load their web pages as documents, and create a vectorized index of the documents:
Python
# ...
urls = [
    "https://www.scrapingcourse.com/ecommerce",
    "https://www.scrapingcourse.com/button-click",
    "https://www.scrapingcourse.com/infinite-scrolling",
]

# load each URL as a document
documents = reader.load_data(urls)

# create index
index = VectorStoreIndex.from_documents(documents)

4. Query the index

Initialize a query engine from the index, pass a prompt to query it, and return the query response:
Python
# ...
# query the content
query_engine = index.as_query_engine()
response = query_engine.query("What are the key features?")
print(response)

5. Complete code

Python
# pip3 install llama-index-readers-web llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import VectorStoreIndex
from llama_index.readers.web import ZenRowsWebReader
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

api_key = "YOUR_ZENROWS_API_KEY"

# set up ZenRowsWebReader
reader = ZenRowsWebReader(
    api_key=api_key,
    js_render=True,
    premium_proxy=True,
    response_type="markdown",
    wait=2000,
)

urls = [
    "https://www.scrapingcourse.com/ecommerce",
    "https://www.scrapingcourse.com/button-click",
    "https://www.scrapingcourse.com/infinite-scrolling",
]

# load each URL as a document
documents = reader.load_data(urls)

# create index
index = VectorStoreIndex.from_documents(documents)

# query the content
query_engine = index.as_query_engine()
response = query_engine.query("What are the key features?")
print(response)
LlamaIndex uses ZenRows to retrieve each website’s information in Markdown format, vectorizes it, and synthesizes a response based on the query. Here’s a sample response from the above code:
Markdown
The key features include a menu with options like Shop, Home, Cart, Checkout, and My account. Additionally, there is a list of products with images, names, prices, and options to select or add to cart for each item.
Congratulations! 🎉 You’ve integrated ZenRows with LlamaIndex.

API Reference

  • url (str): Required. The URL to scrape.
  • js_render (bool): Enable JavaScript rendering with a headless browser. Essential for modern web apps, SPAs, and sites with dynamic content (default: False).
  • js_instructions (str): Execute custom JavaScript on the page to interact with elements, scroll, click buttons, or manipulate content.
  • premium_proxy (bool): Use residential IPs to bypass anti-bot protection. Essential for accessing protected sites (default: False).
  • proxy_country (str): Set the country of the IP used for the request. Use for accessing geo-restricted content. Two-letter country code.
  • session_id (int): Maintain the same IP for multiple requests for up to 10 minutes. Essential for multi-step processes.
  • custom_headers (dict): Include custom headers in your request to mimic browser behavior.
  • wait_for (str): Wait for a specific CSS Selector to appear in the DOM before returning content.
  • wait (int): Wait a fixed amount of milliseconds after page load.
  • block_resources (str): Block specific resources (images, fonts, etc.) from loading to speed up scraping.
  • response_type (str): Convert HTML to other formats. Options: “markdown”, “plaintext”, “pdf”.
  • css_extractor (str): Extract specific elements using CSS selectors (JSON format).
  • autoparse (bool): Automatically extract structured data from HTML (default: False).
  • screenshot (str): Capture an above-the-fold screenshot of the page (default: “false”).
  • screenshot_fullpage (str): Capture a full-page screenshot (default: “false”).
  • screenshot_selector (str): Capture a screenshot of a specific element using a CSS Selector.
  • screenshot_format (str): Choose between “png” (default) and “jpeg” formats for screenshots.
  • screenshot_quality (int): For JPEG format, set the quality from 1 to 100. Lower values reduce file size but decrease quality.
  • original_status (bool): Return the original HTTP status code from the target page (default: False).
  • allowed_status_codes (str): Return the content even if the target page fails with the specified status codes. Useful for debugging or when you need content from error pages.
  • json_response (bool): Capture network requests in JSON format, including XHR or Fetch data. Ideal for intercepting API calls made by the web page (default: False).
  • outputs (str): Specify which data types to extract from the scraped HTML. Accepted values: emails, phone numbers, headings, images, audios, videos, links, menus, hashtags, metadata, tables, favicon.
For complete parameter documentation and details, see the official ZenRows Universal Scraper API Reference.
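To illustrate two parameters from the list, the sketch below composes a css_extractor value (a JSON string mapping output field names to CSS selectors) together with wait_for. The selectors shown (.product-name, .product-price) are hypothetical and would need to match the target page:

```python
# Sketch: composing css_extractor (a JSON string of field -> CSS selector)
# with wait_for. The selectors (.product-name, .product-price) are
# hypothetical examples, not taken from any real page.
import json

extractor = json.dumps({
    "names": ".product-name",
    "prices": ".product-price",
})

# Keyword arguments you could pass to ZenRowsWebReader alongside api_key:
reader_kwargs = {
    "js_render": True,
    "premium_proxy": True,
    "wait_for": ".product-price",  # wait until prices are in the DOM
    "css_extractor": extractor,
}
print(reader_kwargs["css_extractor"])
```

Passing css_extractor changes the response from full-page content to just the extracted fields, so pair it with selectors you have verified in the browser's dev tools.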

Troubleshooting

The returned response is incomplete:

  • Solution 1: Ensure you activate js_render and premium_proxy to bypass anti-bot measures and scrape reliably.
  • Solution 2: Apply enough wait time to allow dynamic content to load completely before scraping. If a specific element holding the required data loads slowly, you can also wait for it using the wait_for parameter.
  • Solution 3: If only partial responses are returned, the LLM may be missing relevant information in the retrieved chunks. Increase the number of chunks the LLM receives from the documents by adding a similarity_top_k parameter to the query engine as shown:
    Python
    # ...
    # query the content
    query_engine = index.as_query_engine(similarity_top_k=10)
    # ...
    
  • Solution 4: If you’ve used the css_extractor parameter to target specific elements, ensure you’ve entered the correct selectors.

API key or authentication error

  • Solution: Ensure you’ve supplied your LLM (e.g., OpenAI) and ZenRows API keys correctly.
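One way to reduce this class of error is to read keys from environment variables and fail fast with a clear message when one is missing. The sketch below is generic Python, independent of ZenRows or any LLM provider, and the variable name used is only a convention:

```python
# Read API keys from environment variables instead of hard-coding them,
# failing fast with a clear message when a key is missing.
# The variable name ZENROWS_API_KEY is a convention, not a requirement.
import os

def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

os.environ.setdefault("ZENROWS_API_KEY", "demo-key")  # placeholder for illustration
zenrows_key = require_env("ZENROWS_API_KEY")
print(f"Loaded ZenRows key ({len(zenrows_key)} characters)")
```

Failing at startup with an explicit message is easier to debug than an opaque 401 from the API halfway through an indexing run.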

Module not found

  • Solution: Install all the required modules:
    • llama-index-readers-web
    • llama-index-llms-openai
    • llama-index-embeddings-openai

Frequently Asked Questions (FAQ)

What are the main use cases of the LlamaIndex-ZenRows integration?
The use cases are diverse, but the primary application is to enable AI applications to access and reason over live, real-world web data, even from sites with anti-bot protections or dynamic content.

Can I scrape specific elements from a page?
Yes, you can scrape data from specific elements using their CSS selectors via the css_extractor parameter.

Does ZenRowsWebReader support all ZenRows features?
Yes. The ZenRowsWebReader inherits all the features and capabilities of the ZenRows Universal Scraper API.

Which LLMs does LlamaIndex support?
LlamaIndex supports many popular LLMs, such as Groq, OpenAI, and Anthropic. Check LlamaIndex’s official documentation for the full list of supported LLMs.

Can LlamaIndex scrape websites on its own?
LlamaIndex isn’t explicitly designed for scraping information from websites. However, you can add a scraping layer by pairing it with a web scraping tool like ZenRows, which provides anti-bot bypass capabilities.