How to Integrate LlamaIndex with ZenRows

Integrate ZenRows with LlamaIndex to enable your RAG applications to access, index, and synthesize up-to-date web content from any website, including those with anti-bot protection and dynamic content.

What Is LlamaIndex?

LlamaIndex is an open-source framework that connects LLMs to external data sources such as databases, documents, and APIs. It provides tools for data ingestion, indexing, and query-based retrieval, and is commonly used to build retrieval-augmented generation (RAG) applications that feed AI agents with up-to-date information.

Key Integration Benefits

Uninterrupted access to data: Build a reliable data layer that can access information from any website without getting blocked by anti-bot measures.
Real-time information retrieval: Extract real-time data faster and more efficiently before it becomes stale.
Direct extraction of LLM-friendly data: Get pre-formatted LLM-friendly data, such as the Markdown or JSON version of any website. ZenRows also enables the extraction of specific data directly.
Less code, more data: Scrape data continuously with an auto-managed and auto-scaled solution with a simple API call.
Business-oriented development: Spend less engineering time and fewer resources on debugging and fixing scrapers.
Handle dynamic content easily: Access heavily dynamic websites without performing complex waits and user simulations.
Borderless data retrieval: Expose AI applications to data from any specific location without IP limitations using residential proxies with geo-targeted IPs.

Use Cases of LlamaIndex-ZenRows Integration

Real-time price monitoring: Use ZenRows to scrape prices from several product sites in real-time and synthesize a comprehensive comparison with an LLM.
Competitive research: Scrape several competitors’ offerings, product launches, strategies, and more with ZenRows and draw a correlation between the data using an LLM.
News and trends summarization: Use ZenRows to aggregate news, trends, hashtags, and more, across similar platforms. Summarize the aggregated data with an LLM and extract specific insights.
Dynamic chatbots: Build a chatbot that can access the web or specific web pages in real time to provide updated information.

Getting Started: Basic Usage

This example demonstrates how to extract content from a protected website using the ZenRowsWebReader. The ZenRowsWebReader enables you to use the official ZenRows Universal Scraper API as a data loader for web scraping in LlamaIndex.

Install the package

pip3 install llama-index-readers-web

Basic implementation

Import ZenRowsWebReader from the llama-index-readers-web package and initialize it as a reader instance, setting your ZenRows parameters through its arguments. Then load the target site as a document and print its content in the specified format (Markdown in this case):

Python

# pip3 install llama-index-readers-web
from llama_index.readers.web import ZenRowsWebReader

api_key = "YOUR_ZENROWS_API_KEY"

# initialize the reader
reader = ZenRowsWebReader(
    api_key=api_key,
    js_render=True,
    premium_proxy=True,
    response_type="markdown",
)

# scrape a single URL
documents = reader.load_data(["https://www.scrapingcourse.com/antibot-challenge/"])
print(documents[0].text)

The code returns the target page's content as Markdown, as shown:

Markdown

[![](https://www.scrapingcourse.com/assets/images/logo.svg) Scraping Course](http://www.scrapingcourse.com/)

# Antibot Challenge

![](https://www.scrapingcourse.com/assets/images/challenge.svg)

## You bypassed the Antibot challenge! :D

Advanced Usage: Building a Simple RAG System

This example creates a simple RAG system that indexes multiple websites and responds to queries using the collected data. You’ll need an OpenAI API key for the LLM and embedding features, so have it ready before you start.

Install the packages

pip3 install llama-index-readers-web llama-index-llms-openai llama-index-embeddings-openai

Set up ZenRowsWebReader

Import the required packages and specify your ZenRows and OpenAI API keys. Initialize ZenRowsWebReader using the desired ZenRows parameters. Include js_render and premium_proxy to effectively bypass anti-bot measures.

Python

# pip3 install llama-index-readers-web llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import VectorStoreIndex
from llama_index.readers.web import ZenRowsWebReader
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

api_key = "YOUR_ZENROWS_API_KEY"

# set up ZenRowsWebReader
reader = ZenRowsWebReader(
    api_key=api_key,
    js_render=True,
    premium_proxy=True,
    response_type="markdown",
    wait=2000,
)

Set up a vector index

Specify the target URLs in a list, load their web pages as documents, and create a vectorized index of the documents:

Python

# ...
urls = [
    "https://www.scrapingcourse.com/ecommerce",
    "https://www.scrapingcourse.com/button-click",
    "https://www.scrapingcourse.com/infinite-scrolling",
]

# load each URL as a document
documents = reader.load_data(urls)

# create index
index = VectorStoreIndex.from_documents(documents)

Query the index

Initialize a query engine from the index, pass a prompt to query it, and return the query response:

Python

# ...
# query the content
query_engine = index.as_query_engine()
response = query_engine.query("What are the key features?")
print(response)
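
To check which scraped chunks an answer was built from, you can inspect the response's source nodes. The following is a minimal sketch, assuming the response object exposes a `source_nodes` list whose items carry a `score` and a `node` with a `get_content()` method, as LlamaIndex query responses do. `summarize_sources` is a hypothetical helper name, not part of either library.

```python
def summarize_sources(response, preview_chars=80):
    """Return (score, text preview) pairs for each chunk used in the answer."""
    summary = []
    for source in response.source_nodes:
        text = source.node.get_content()
        # collapse whitespace so each preview stays on one line
        preview = " ".join(text.split())[:preview_chars]
        summary.append((source.score, preview))
    return summary
```

After running the query above, looping over `summarize_sources(response)` and printing each pair shows which parts of the scraped pages were retrieved and how closely they matched the prompt.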

Complete code

Python

# pip3 install llama-index-readers-web llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import VectorStoreIndex
from llama_index.readers.web import ZenRowsWebReader
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

api_key = "YOUR_ZENROWS_API_KEY"

# set up ZenRowsWebReader
reader = ZenRowsWebReader(
    api_key=api_key,
    js_render=True,
    premium_proxy=True,
    response_type="markdown",
    wait=2000,
)

urls = [
    "https://www.scrapingcourse.com/ecommerce",
    "https://www.scrapingcourse.com/button-click",
    "https://www.scrapingcourse.com/infinite-scrolling",
]

# load each URL as a document
documents = reader.load_data(urls)

# create index
index = VectorStoreIndex.from_documents(documents)

# query the content
query_engine = index.as_query_engine()
response = query_engine.query("What are the key features?")
print(response)

LlamaIndex uses ZenRows to retrieve each website’s information in Markdown format, vectorizes it, and synthesizes a response based on the query. Here’s a sample response from the above code:

Markdown

The key features include a menu with options like Shop, Home, Cart, Checkout, and My account. Additionally, there is a list of products with images, names, prices, and options to select or add to cart for each item.

Congratulations! 🎉 You’ve integrated ZenRows with LlamaIndex.

API Reference

Parameter	Type	Description
`url`	str	Required. The URL to scrape
`js_render`	bool	Enable JavaScript rendering with a headless browser. Essential for modern web apps, SPAs, and sites with dynamic content (default: False)
`js_instructions`	str	Execute custom JavaScript on the page to interact with elements, scroll, click buttons, or manipulate content
`premium_proxy`	bool	Use residential IPs to bypass anti-bot protection. Essential for accessing protected sites (default: False)
`proxy_country`	str	Set the country of the IP used for the request. Use for accessing geo-restricted content. Two-letter country code
`session_id`	int	Maintain the same IP for multiple requests for up to 10 minutes. Essential for multi-step processes
`custom_headers`	dict	Include custom headers in your request to mimic browser behavior
`wait_for`	str	Wait for a specific CSS Selector to appear in the DOM before returning content
`wait`	int	Wait a fixed amount of milliseconds after page load
`block_resources`	str	Block specific resources (images, fonts, etc.) from loading to speed up scraping
`response_type`	str	Convert HTML to other formats. Options: “markdown”, “plaintext”, “pdf”
`css_extractor`	str	Extract specific elements using CSS selectors (JSON format)
`autoparse`	bool	Automatically extract structured data from HTML (default: False)
`screenshot`	str	Capture an above-the-fold screenshot of the page (default: “false”)
`screenshot_fullpage`	str	Capture a full-page screenshot (default: “false”)
`screenshot_selector`	str	Capture a screenshot of a specific element using CSS Selector
`screenshot_format`	str	Choose between “png” (default) and “jpeg” formats for screenshots
`screenshot_quality`	int	For JPEG format, set the quality from 1 to 100. Lower values reduce file size but decrease quality
`original_status`	bool	Return the original HTTP status code from the target page (default: False)
`allowed_status_codes`	str	Returns the content even if the target page fails with the specified status codes. Useful for debugging or when you need content from error pages
`json_response`	bool	Capture network requests in JSON format, including XHR or Fetch data. Ideal for intercepting API calls made by the web page (default: False)
`outputs`	str	Specify which data types to extract from the scraped HTML. Accepted values: emails, phone numbers, headings, images, audios, videos, links, menus, hashtags, metadata, tables, favicon

For complete parameter documentation and details, see the official ZenRows Universal Scraper API Reference.
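
A few of the constraints listed in the table can be checked locally before any request is sent. The sketch below is an illustration only: `validate_zenrows_params` is a hypothetical helper, the checks cover just the constraints stated in the table (two-letter country code, quality from 1 to 100, the listed response and screenshot formats), and the `.product` selector is a placeholder.

```python
def validate_zenrows_params(params):
    """Check a few constraints from the parameter table; return params unchanged."""
    country = params.get("proxy_country")
    if country is not None and not (len(country) == 2 and country.isalpha()):
        raise ValueError("proxy_country must be a two-letter country code")

    quality = params.get("screenshot_quality")
    if quality is not None and not 1 <= quality <= 100:
        raise ValueError("screenshot_quality must be between 1 and 100")

    response_type = params.get("response_type")
    if response_type is not None and response_type not in ("markdown", "plaintext", "pdf"):
        raise ValueError("response_type must be markdown, plaintext, or pdf")

    screenshot_format = params.get("screenshot_format")
    if screenshot_format is not None and screenshot_format not in ("png", "jpeg"):
        raise ValueError("screenshot_format must be png or jpeg")

    return params


params = validate_zenrows_params(
    {
        "js_render": True,
        "premium_proxy": True,
        "proxy_country": "us",
        "response_type": "markdown",
        "wait_for": ".product",  # hypothetical selector, replace with a real one
    }
)
```

Assuming the reader accepts these options as keyword arguments in the same way as `js_render` and `premium_proxy` in the examples above, the validated dictionary can be passed as `ZenRowsWebReader(api_key=api_key, **params)`.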

Troubleshooting

The returned response is incomplete:

Solution 1: Ensure you activate js_render and premium_proxy to bypass anti-bot measures and scrape reliably.
Solution 2: Apply enough wait time to allow dynamic content to load completely before scraping. If a specific element holding the required data loads slowly, you can also wait for it using the wait_for parameter.
Solution 3: If the answers are incomplete, the LLM may not be receiving the chunks that contain the relevant information. Increase the number of chunks retrieved from the documents by passing the similarity_top_k parameter to the query engine, as shown:
Python
```
# ...
# query the content
query_engine = index.as_query_engine(similarity_top_k=10)
# …
```
Solution 4: If you’ve used the css_extractor parameter to target specific elements, ensure you’ve entered the correct selectors.
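
Before adjusting the query engine, it can help to confirm whether the scrape itself came back thin. The sketch below is one way to do that, under two assumptions that are not confirmed by this guide: that `load_data` returns one document per URL in the same order as the input list, and that 200 characters is a sensible minimum for a real page. `find_thin_documents` is a hypothetical helper name.

```python
def find_thin_documents(documents, urls, min_chars=200):
    """Return the URLs whose scraped text is shorter than min_chars."""
    thin = []
    # assumes documents and urls line up one-to-one, in order
    for url, doc in zip(urls, documents):
        text = (doc.text or "").strip()
        if len(text) < min_chars:
            thin.append(url)
    return thin
```

Any URL returned by `find_thin_documents(documents, urls)` is a candidate for a longer wait, a wait_for selector, or a check of js_render and premium_proxy.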

API key or authentication error

Solution: Ensure you’ve supplied your LLM (e.g., OpenAI) and ZenRows API keys correctly.
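
One way to catch a missing or placeholder key early is to read both keys from environment variables and fail before any request is made. This is a minimal sketch: `require_env` is a hypothetical helper, and the variable name `ZENROWS_API_KEY` is an assumption chosen for illustration (`OPENAI_API_KEY` is the name used earlier in this guide).

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    # treat the tutorial placeholders (YOUR_...) as missing
    if not value or value.startswith("YOUR_"):
        raise RuntimeError(
            f"Environment variable {name} is missing or still a placeholder"
        )
    return value
```

With this in place, `api_key = require_env("ZENROWS_API_KEY")` and `require_env("OPENAI_API_KEY")` at the top of the script report which key is absent instead of failing later with an authentication error.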

Module not found

Solution: Install all the required modules:
- llama-index-readers-web
- llama-index-llms-openai
- llama-index-embeddings-openai
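
To see which of these packages is actually missing from the active environment, you can probe for their import paths. The sketch below uses only the standard library. The mapping from package name to import path is stated as an assumption based on the imports used in this guide and LlamaIndex's usual naming, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# pip package name -> import path (assumed from the imports used in this guide)
REQUIRED = {
    "llama-index-readers-web": "llama_index.readers.web",
    "llama-index-llms-openai": "llama_index.llms.openai",
    "llama-index-embeddings-openai": "llama_index.embeddings.openai",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose modules cannot be found."""
    missing = []
    for package, module in required.items():
        try:
            found = importlib.util.find_spec(module) is not None
        except ModuleNotFoundError:
            # raised when a parent package (e.g. llama_index) is absent
            found = False
        if not found:
            missing.append(package)
    return missing
```

Printing `missing_packages()` lists the packages to install with pip3. An empty list means all three are importable from the current interpreter.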

Resources

ZenRowsWebReader on GitHub

Frequently Asked Questions (FAQ)

What is the main use case of LlamaIndex-ZenRows integration?

The use cases of LlamaIndex-ZenRows integration are diverse. However, the primary application is to enable AI applications to access and reason over live, real-world web data, even from sites with anti-bot protections or dynamic content.

Does LlamaIndex-ZenRows integration support extraction via CSS selectors?

Yes, you can scrape data from specific elements using their CSS selectors via the css_extractor parameter.
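
As a rough illustration, the css_extractor value can be built from a Python dictionary and serialized to JSON. This sketch assumes the JSON maps an output field name to a CSS selector and that the reader accepts css_extractor as a keyword argument like the other parameters. The helper name, the field names, and the `.product-name` and `.price` selectors are hypothetical.

```python
import json


def css_extractor_kwargs(api_key, selectors):
    """Build reader keyword arguments that extract specific elements."""
    return {
        "api_key": api_key,
        "js_render": True,
        "premium_proxy": True,
        # css_extractor is passed as a JSON string, per the parameter table
        "css_extractor": json.dumps(selectors),
    }


kwargs = css_extractor_kwargs(
    "YOUR_ZENROWS_API_KEY",
    {"names": ".product-name", "prices": ".price"},  # hypothetical selectors
)
```

The resulting dictionary can then be unpacked into the reader, for example `ZenRowsWebReader(**kwargs)`, before calling `load_data` as in the earlier examples.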

Can I use all of ZenRows' parameters with ZenRowsWebReader?

Yes. The ZenRowsWebReader inherits all the features and capabilities of the ZenRows Universal Scraper API.

Which LLM integrations does LlamaIndex support?

LlamaIndex supports many popular LLMs, such as Groq, OpenAI, Anthropic, and more. Check LlamaIndex’s official documentation for the supported LLMs.

Can I use ZenRows with LlamaIndex for Web Scraping?

Yes. LlamaIndex isn’t designed for web scraping on its own, but you can add a scraping layer by pairing it with a web scraping tool like ZenRows, which provides anti-bot bypass capabilities.
