What Is LlamaIndex?
LlamaIndex is an open-source framework that connects LLMs to external data sources such as databases, documents, and APIs. It provides tools for data ingestion, indexing, and query-based retrieval, and is commonly used to build retrieval-augmented generation (RAG) applications that feed AI agents with up-to-date information.
Key Integration Benefits
- Uninterrupted access to data: Build a reliable data layer that can access information from any website without getting blocked by anti-bot measures.
- Real-time information retrieval: Extract real-time data faster and more efficiently before it becomes stale.
- Direct extraction of LLM-friendly data: Get pre-formatted LLM-friendly data, such as the Markdown or JSON version of any website. ZenRows also enables the extraction of specific data directly.
- Less code, more data: Scrape data continuously with an auto-managed and auto-scaled solution with a simple API call.
- Business-oriented development: No extra engineering time and resources will be wasted on debugging or fixes.
- Handle dynamic content easily: Access heavily dynamic websites without performing complex waits and user simulations.
- Borderless data retrieval: Expose AI applications to data from any specific location without IP limitations using residential proxies with geo-targeted IPs.
Use Cases of LlamaIndex-ZenRows Integration
- Real-time price monitoring: Use ZenRows to scrape prices from several product sites in real-time and synthesize a comprehensive comparison with an LLM.
- Competitive research: Scrape several competitors’ offerings, product launches, strategies, and more with ZenRows and draw a correlation between the data using an LLM.
- News and trends summarization: Use ZenRows to aggregate news, trends, hashtags, and more, across similar platforms. Summarize the aggregated data with an LLM and extract specific insights.
- Dynamic chatbots: Build a chatbot that can access the web or specific web pages in real time to provide updated information.
Getting Started: Basic Usage
This example demonstrates how to extract content from a protected website using the ZenRowsWebReader.
The ZenRowsWebReader enables you to use the official ZenRows Universal Scraper API as a data loader for web scraping in LlamaIndex.
1
Install the package
2
Basic implementation
Import ZenRowsWebReader from llama-index-readers-web and initialize it as a reader instance. Then, set your ZenRows parameters through this instance. Load the target site as a document and return its content in the specified format (Markdown).
The code returns the target site's content in Markdown format.
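The code for this step is not included in this copy of the page. The following is a minimal sketch of what the description implies, not the official snippet. It assumes `ZenRowsWebReader` is importable from `llama_index.readers.web`, accepts `api_key` plus ZenRows parameters as keyword arguments, and exposes `load_data`. The `ZENROWS_API_KEY` environment variable, the `load_markdown` helper, and the `https://example.com/` target are placeholders introduced for illustration.

```python
import os

# ZenRows parameters for this sketch; names follow the API Reference table.
READER_PARAMS = {
    "js_render": True,
    "premium_proxy": True,
    "response_type": "markdown",
}


def load_markdown(url: str, api_key: str) -> str:
    """Load one page through ZenRowsWebReader and return its Markdown text."""
    # Imported lazily so this file can be read and imported without the package.
    from llama_index.readers.web import ZenRowsWebReader

    reader = ZenRowsWebReader(api_key=api_key, **READER_PARAMS)
    documents = reader.load_data([url])
    return documents[0].text


# Runs only when a key is available; the variable name is a placeholder.
api_key = os.environ.get("ZENROWS_API_KEY")
if api_key:
    print(load_markdown("https://example.com/", api_key))
```

Reading the key from the environment keeps it out of source control; swap in your own target URL before running.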
Advanced Usage: Building a Simple RAG System
This example creates a simple RAG system that indexes multiple websites and responds to queries using the collected data. You’ll need an OpenAI API key to use the LLM and embedding features, so have it ready.
1
Install the packages
2
Set up ZenRowsWebReader
Import the required packages and specify your ZenRows and OpenAI API keys. Initialize ZenRowsWebReader using the desired ZenRows parameters. Include js_render and premium_proxy to effectively bypass anti-bot measures.
3
Set up a vector index
Specify the target URLs in a list, load their web pages as documents, and create a vectorized index of the documents.
4
Query the index
Initialize a query engine from the index, pass a prompt to query it, and return the query response.
5
Complete code
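Putting steps 2 to 4 together, the pipeline might look like the sketch below. It is an approximation under stated assumptions rather than the original listing: it assumes `ZenRowsWebReader.load_data` accepts a list of URLs, that `VectorStoreIndex.from_documents` uses LlamaIndex's default OpenAI LLM and embedding model (which read `OPENAI_API_KEY` from the environment), and that the required packages are installed. `build_query_engine`, the `ZENROWS_API_KEY` variable, the URLs, and the prompt are placeholders.

```python
import os

# Placeholder targets for illustration; replace with the pages you want to index.
URLS = ["https://example.com/", "https://example.org/"]


def build_query_engine(zenrows_api_key: str, urls: list[str]):
    """Scrape the URLs with ZenRows, index them, and return a query engine."""
    # Lazy imports: these need the llama-index packages to be installed.
    from llama_index.core import VectorStoreIndex
    from llama_index.readers.web import ZenRowsWebReader

    # Step 2: reader with JS rendering and premium proxies enabled.
    reader = ZenRowsWebReader(
        api_key=zenrows_api_key,
        js_render=True,
        premium_proxy=True,
        response_type="markdown",
    )
    # Step 3: load the pages as documents and build a vector index.
    documents = reader.load_data(urls)
    index = VectorStoreIndex.from_documents(documents)
    # Step 4: expose the index through a query engine.
    return index.as_query_engine()


# Runs only when both keys are present in the environment.
zenrows_key = os.environ.get("ZENROWS_API_KEY")
if zenrows_key and os.environ.get("OPENAI_API_KEY"):
    engine = build_query_engine(zenrows_key, URLS)
    print(engine.query("Summarize the main topics covered on these pages."))
```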
API Reference
| Parameter | Type | Description |
|---|---|---|
| url | str | Required. The URL to scrape |
| js_render | bool | Enable JavaScript rendering with a headless browser. Essential for modern web apps, SPAs, and sites with dynamic content (default: False) |
| js_instructions | str | Execute custom JavaScript on the page to interact with elements, scroll, click buttons, or manipulate content |
| premium_proxy | bool | Use residential IPs to bypass anti-bot protection. Essential for accessing protected sites (default: False) |
| proxy_country | str | Set the country of the IP used for the request. Use for accessing geo-restricted content. Two-letter country code |
| session_id | int | Maintain the same IP for multiple requests for up to 10 minutes. Essential for multi-step processes |
| custom_headers | dict | Include custom headers in your request to mimic browser behavior |
| wait_for | str | Wait for a specific CSS Selector to appear in the DOM before returning content |
| wait | int | Wait a fixed amount of milliseconds after page load |
| block_resources | str | Block specific resources (images, fonts, etc.) from loading to speed up scraping |
| response_type | str | Convert HTML to other formats. Options: “markdown”, “plaintext”, “pdf” |
| css_extractor | str | Extract specific elements using CSS selectors (JSON format) |
| autoparse | bool | Automatically extract structured data from HTML (default: False) |
| screenshot | str | Capture an above-the-fold screenshot of the page (default: “false”) |
| screenshot_fullpage | str | Capture a full-page screenshot (default: “false”) |
| screenshot_selector | str | Capture a screenshot of a specific element using CSS Selector |
| screenshot_format | str | Choose between “png” (default) and “jpeg” formats for screenshots |
| screenshot_quality | int | For JPEG format, set the quality from 1 to 100. Lower values reduce file size but decrease quality |
| original_status | bool | Return the original HTTP status code from the target page (default: False) |
| allowed_status_codes | str | Returns the content even if the target page fails with the specified status codes. Useful for debugging or when you need content from error pages |
| json_response | bool | Capture network requests in JSON format, including XHR or Fetch data. Ideal for intercepting API calls made by the web page (default: False) |
| outputs | str | Specify which data types to extract from the scraped HTML. Accepted values: emails, phone numbers, headings, images, audios, videos, links, menus, hashtags, metadata, tables, favicon |
For complete parameter documentation and details, see the official ZenRows Universal Scraper API Reference.
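Because a misspelled parameter name is easy to miss, it can help to check a parameter set against the names in the table before passing it to the reader. The sketch below is one way to do that. `check_params` is a hypothetical helper, not part of ZenRows or LlamaIndex, and the CSS selectors and country code are illustrative values. Per the table, `css_extractor` is passed as a JSON string, so it is built here with `json.dumps`.

```python
import json

# Parameter names copied from the API Reference table.
KNOWN_PARAMS = {
    "url", "js_render", "js_instructions", "premium_proxy", "proxy_country",
    "session_id", "custom_headers", "wait_for", "wait", "block_resources",
    "response_type", "css_extractor", "autoparse", "screenshot",
    "screenshot_fullpage", "screenshot_selector", "screenshot_format",
    "screenshot_quality", "original_status", "allowed_status_codes",
    "json_response", "outputs",
}


def check_params(params: dict) -> dict:
    """Raise on unknown parameter names; return the params unchanged otherwise."""
    unknown = sorted(set(params) - KNOWN_PARAMS)
    if unknown:
        raise ValueError(f"Unknown ZenRows parameter(s): {', '.join(unknown)}")
    return params


# Hypothetical selectors; adjust them to the structure of the target page.
css_extractor = json.dumps({"title": "h1", "prices": ".price"})

params = check_params({
    "js_render": True,
    "premium_proxy": True,
    "proxy_country": "us",
    "wait_for": ".price",
    "css_extractor": css_extractor,
})
print(params)
```

The check only covers names; it does not validate value types or combinations, which the official API reference describes.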
Troubleshooting
The returned response is incomplete:
- Solution 1: Ensure you activate js_render and premium_proxy to bypass anti-bot measures and scrape reliably.
- Solution 2: Apply enough wait time to allow dynamic content to load completely before scraping. If a specific element holding the required data loads slowly, you can also wait for it using the wait_for parameter.
- Solution 3: If only partial responses are returned, the LLM may be missing relevant information in the retrieved chunks. Increase the number of chunks the LLM receives from the documents by adding a similarity_top_k parameter to the query engine.
- Solution 4: If you’ve used the css_extractor parameter to target specific elements, ensure you’ve entered the correct selectors.
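For Solution 3, `similarity_top_k` is passed when the query engine is created from the index. A minimal sketch follows. The `make_query_engine` wrapper and its default of 5 are illustrative choices, and `index` is assumed to be a LlamaIndex index such as the `VectorStoreIndex` built in the RAG example.

```python
def make_query_engine(index, top_k: int = 5):
    """Create a query engine that retrieves `top_k` chunks per query.

    `index` is expected to be a LlamaIndex index (e.g. VectorStoreIndex).
    The wrapper name and the default of 5 are assumptions for this sketch.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # similarity_top_k controls how many retrieved chunks reach the LLM.
    return index.as_query_engine(similarity_top_k=top_k)


# Usage, given an existing index:
# query_engine = make_query_engine(index, top_k=5)
# response = query_engine.query("Your question here")
```

Larger values give the LLM more context at the cost of longer prompts, so increase the value gradually rather than setting it very high.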
API key or authentication error
- Solution: Ensure you’ve supplied your LLM (e.g., OpenAI) and ZenRows API keys correctly.
Module not found
- Solution: Install all the required modules: llama-index-readers-web, llama-index-llms-openai, and llama-index-embeddings-openai.
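To see which of these modules is actually missing, a quick check such as the following may help. It is a sketch using only the standard library. The import paths in `REQUIRED_MODULES` are assumed to correspond to the three packages listed above, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Import paths assumed to correspond to the three pip packages listed above.
REQUIRED_MODULES = [
    "llama_index.readers.web",
    "llama_index.llms.openai",
    "llama_index.embeddings.openai",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. llama_index) is absent or broken.
            found = False
        if not found:
            missing.append(name)
    return missing


print(missing_modules(REQUIRED_MODULES))
```

An empty list means all three modules are importable; any names printed point to the packages that still need installing.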
Frequently Asked Questions (FAQ)
What is the main use case of LlamaIndex-ZenRows integration?
The use cases of LlamaIndex-ZenRows integration are diverse. However, the primary application is to enable AI applications to access and reason over live, real-world web data, even from sites with anti-bot protections or dynamic content.
Does LlamaIndex-ZenRows integration support extraction via CSS selectors?
Yes, you can scrape data from specific elements using their CSS selectors via the css_extractor parameter.
Can I use all of ZenRows' parameters with ZenRowsWebReader?
Yes. The ZenRowsWebReader inherits all the features and capabilities of the ZenRows Universal Scraper API.
Which LLM integrations does LlamaIndex support?
LlamaIndex supports many popular LLMs, such as Groq, OpenAI, Anthropic, and more. Check LlamaIndex’s official documentation for the supported LLMs.
Can I use ZenRows with LlamaIndex for web scraping?
LlamaIndex isn’t explicitly designed for scraping information from websites. However, you can add a scraping layer to LlamaIndex by pairing it with a web scraping tool like ZenRows, which provides it with anti-bot bypass capabilities.