Reimagining Web Scraping in the Age of AI
How LLMs Simplify Data Extraction by “Reading” Websites Like a Human
Introduction
Web scraping has long been a standard technique for gathering data from websites. However, traditional approaches come with significant headaches: they’re often brittle, difficult to maintain, and costly to manage, especially when the website structure changes. In this blog post, we’ll explore a more resilient strategy: using Large Language Models (LLMs) to “read” the webpage the same way a human would. We’ll then show you—with real code snippets—how to implement an AI Scraper Agent that can parse, understand, and extract data with minimal overhead and high scalability.
1. Challenges with Traditional Web Scraping
Complexity and Cost: Traditional scrapers rely on CSS selectors, XPath, or HTML parsing libraries. They’re prone to breaking whenever websites undergo minor structural changes.
Anti-Scraping Measures: Websites frequently implement CAPTCHAs, rate limiting, or other deterrents that increase maintenance overhead.
High Maintenance: Constant site updates mean your engineering team must keep patching or rebuilding the scraper logic.
2. A Better Way: The AI-Powered Scraper Agent
Instead of parsing DOM elements directly, an AI-powered approach uses Large Language Models to interpret a webpage as though it were a block of text. This means:
Less Fragility: Tiny HTML changes won’t break your scraper because the AI looks at the textual content itself (e.g., as Markdown).
Scalability: Adding a new site is as simple as passing in a different URL or prompt.
Reduced Engineering Overhead: You don’t have to write complex HTML parsing logic. Instead, you simply “ask” the AI to find the relevant data.
3. Essential Tools for an AI Scraper Agent
Firecrawl: Converts websites into Markdown for easy LLM consumption.
OpenAI + Instructor: Provides the LLM intelligence to parse and extract the data. In the code, we wrap OpenAI’s client with an instructor_client to handle advanced extraction workflows.
Langsmith: Monitors and traces your AI agent’s activity, including cost, inputs, and outputs, for easier debugging.
4. AI Scraper Agent: The Core Architecture
Below is a simplified step-by-step overview. Then we’ll dive into actual code snippets to show how it all comes together:
Define a Data Model with Pydantic.
Create an AI Agent that knows:
Its role and system-level instructions
The data it needs to find
Tools it can use (Firecrawl, LLM parser, etc.)
Invoke the Agent to:
Scrape the page (via Firecrawl)
Convert content to Markdown
Use the LLM to parse structured data
Recursively update your data model as fields are filled
Store Results in JSON or a database, making sure not to overwrite previously completed fields.
5. Code Walkthrough
Below are real code snippets from an example AI Scraper Agent project. Feel free to adapt them for your own use cases.
5.1. Setting Up the Environment and Imports
import openai
import re, time, os
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
import json
from tenacity import retry, wait_random_exponential, stop_after_attempt
from termcolor import colored
import tiktoken
from langsmith import traceable
from langsmith.wrappers import wrap_openai
import tempfile, requests
from openai import OpenAI
import instructor
from pydantic import BaseModel, Field, create_model
from typing import List, Optional, Dict, Any, Type, get_type_hints, Union
openai: For interacting with OpenAI’s GPT models.
firecrawl.FirecrawlApp: Our scraping tool to convert pages to Markdown.
langsmith: Observability and monitoring.
instructor: A thin layer on top of OpenAI for advanced extraction.
load_dotenv()
# Initialize OpenAI client with LangSmith wrapper and instructor
client = wrap_openai(openai.Client())
instructor_client = instructor.from_openai(client, mode=instructor.Mode.TOOLS)
# Constants
GPT_MODEL = "gpt-4o"
max_token = 100000
Here, we set up environment variables (API keys, etc.) and specify the GPT model we’ll be using.
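Since each tool reads its API key from the environment, a quick sanity check after load_dotenv() can save debugging time. The variable names below are assumptions based on each library’s conventions, so check your library versions for the exact names they expect:
# Illustrative sanity check; variable names are assumptions, not from the original project.
for var in ("OPENAI_API_KEY", "FIRECRAWL_API_KEY", "LANGSMITH_API_KEY"):
    if not os.getenv(var):
        print(colored(f"Warning: {var} is not set", "red"))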
5.2. Defining the Data Model
We use Pydantic to create strongly typed models. For example, if you’re scraping monthly car sales data, you might define:
class ModelSales(BaseModel):
    model_name: str = Field(..., description="The name of the car model.")
    units_sold: int = Field(..., description="The number of units sold in the given month for the make model.")

class ManufacturerSales(BaseModel):
    month: int = Field(..., description="The month for which the sales data is reported, e.g., '10'.")
    year: int = Field(..., description="The year for which the sales data is reported, e.g., 2024.")
    manufacturer_name: str = Field(..., description="The name of the car manufacturer.")
    total_units_sold: int = Field(..., description="The total number of units sold by manufacturer in the given month.")
    models: List[ModelSales] = Field(..., description="A list of sales data for each model under this manufacturer.")

class DataPoints(BaseModel):
    manufacturers: List[ManufacturerSales] = Field(..., description="A list of sales data grouped by manufacturer.")
This structure ensures consistency and makes it easier to integrate with other systems or analytics pipelines.
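To see the payoff of strong typing, here’s a small illustrative check (placeholder values, not real sales figures): Pydantic validates every field on construction and raises a validation error if, say, units_sold comes back as free-form text.
# Illustrative only: placeholder values to show validation in action.
sample = ManufacturerSales(
    month=11,
    year=2024,
    manufacturer_name="ExampleMotors",
    total_units_sold=100,
    models=[ModelSales(model_name="Example S", units_sold=100)],
)
print(sample.json())  # serializes cleanly for downstream pipelines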
5.3. Converting Webpages to Markdown
@traceable(run_type="tool", name="Scrape")
@retry(wait=wait_random_exponential(multiplier=1, max=60), stop=stop_after_attempt(3))
def scrape(url, data_points, links_scraped):
    """
    Scrape a given URL and extract structured data with retry logic.
    """
    app = FirecrawlApp()
    try:
        # Add delay between requests to avoid overwhelming the server
        time.sleep(2)
        # Firecrawl scrapes and returns a dictionary with { "markdown": ... }
        scraped_data = app.scrape_url(url)
        markdown = scraped_data["markdown"][: (max_token * 2)]
        links_scraped.append(url)
        extracted_data = extract_data_from_content(markdown, data_points, links_scraped, url)
        return extracted_data
    except Exception as e:
        print(f"Error scraping URL {url}")
        raise
FirecrawlApp automatically converts the target URL into a Markdown string. We pass this Markdown to the next step, which is the LLM extraction.
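One note on the truncation above: markdown[:(max_token * 2)] slices by characters as a rough proxy for token count. If you want to cap by actual tokens instead, a small helper along these lines (our addition, using the tiktoken import from section 5.1) would do it:
# Sketch: truncate by token count rather than characters; helper name is ours.
def truncate_to_tokens(text, limit=max_token, model=GPT_MODEL):
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # fallback for unrecognized model names
    tokens = enc.encode(text)
    return enc.decode(tokens[:limit])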
5.4. Extracting Data With the LLM
After converting to Markdown, we feed the content to the LLM to find the data points we care about:
def extract_data_from_content(content, data_points, links_scraped, url):
    """
    Extract structured data from parsed content using the GPT model.
    """
    # Dynamically create a Pydantic model to handle missing fields
    FilteredModel = create_filtered_model(data_points, DataPoints, links_scraped)

    # Use the LLM to fill in these fields based on the Markdown
    result = instructor_client.chat.completions.create(
        model=GPT_MODEL,
        response_model=FilteredModel,
        messages=[{"role": "user", "content": content}],
    )

    filtered_data = filter_empty_fields(result)

    data_to_update = [
        {"name": key, "value": value["value"], "reference": url, "type": value["type"]}
        for key, value in filtered_data.items()
        if key != 'relevant_urls_might_contain_further_info'
    ]

    # Merge results into our main data structure
    update_data(data_points, data_to_update)

    return result.json()
Here, the instructor_client uses the GPT model to parse the Markdown, matching it against a FilteredModel that we dynamically create from our DataPoints structure.
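The helpers create_filtered_model and filter_empty_fields aren’t shown in the snippets above. As a rough sketch of what they might look like (assuming Pydantic v2 and the create_model and typing imports from section 5.1; the project’s actual implementation may differ), the first builds a partial model covering only still-empty fields, and the second keeps only the fields the LLM actually filled:
# Sketch only: one plausible shape for the two helpers referenced above.
def create_filtered_model(data_points, base_model, links_scraped):
    """Build a model containing only the fields that still have no value, all optional."""
    unfilled = {dp["name"] for dp in data_points if dp.get("value") is None}
    fields = {
        name: (Optional[field.annotation], None)
        for name, field in base_model.model_fields.items()
        if name in unfilled
    }
    # Also let the LLM suggest URLs that may contain the remaining data.
    fields["relevant_urls_might_contain_further_info"] = (Optional[List[str]], None)
    return create_model("FilteredModel", **fields)

def filter_empty_fields(model_instance):
    """Return {field_name: {"value": ..., "type": ...}} for fields the LLM actually filled."""
    filled = {}
    for name, value in model_instance.model_dump().items():
        if value in (None, "", [], {}):
            continue
        type_name = "list" if isinstance(value, list) else "dict" if isinstance(value, dict) else type(value).__name__
        filled[name] = {"value": value, "type": type_name}
    return filled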
5.5. Updating Your Data Model Incrementally
As the LLM finds partial matches, we iteratively update our data structure without overwriting existing valid data:
def update_data(data_points, datas_update):
    """
    Update the state with new data points found and save to file.
    """
    try:
        for data in datas_update:
            for obj in data_points:
                if obj["name"] == data["name"]:
                    # If it's a list, we handle merges differently
                    if data["type"].lower() == "list":
                        data_value = json.loads(data["value"]) if isinstance(data["value"], str) else data["value"]
                        ...
                    else:
                        # Single-value field
                        obj["value"] = json.loads(data["value"]) if data["type"].lower() == "dict" else data["value"]

        # Save interim updates to file
        save_json_pretty(data_points, f"{entity_name}.json")
        return "data updated and saved"
    except Exception as e:
        return "Unable to update data points"
This method merges new fields into data_points. Finally, it persists the data to a JSON file for safekeeping.
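save_json_pretty isn’t shown in the post either; a minimal version (our assumption of its behavior) could be as simple as:
# Sketch: write the current data_points list to disk as readable JSON.
def save_json_pretty(data, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False, default=str)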
5.6. Putting It All Together: The “Agent” Workflow
You can think of the “agent” as the orchestration layer that decides how to use the above tools:
@traceable(name="Call agent")
def call_agent(
    prompt, system_prompt, tools, plan, data_points, entity_name, links_scraped
):
    """
    Call the AI agent to perform tasks based on the given prompt and tools.
    """
    messages = []
    ...
    # 1. Possibly create a 'plan' by instructing the model
    # 2. Provide user/system instructions
    # 3. The model will request tools (like 'scrape') if it needs more data
    # 4. We handle any tool calls, run them, and feed the results back
    # 5. Stop once the model says it's finished
    while state == "running":
        chat_response = chat_completion_request(messages, tool_choice=None, tools=tools)
        ...
        if current_choice.finish_reason == "tool_calls":
            # Use the requested tool with the arguments provided by the LLM
            for tool_call in tool_calls:
                function = tool_call.function.name
                arguments = json.loads(tool_call.function.arguments)
                result = tools_list[function](
                    arguments["url"], data_points, links_scraped
                ) if function == "scrape" else ...
                ...
        if current_choice.finish_reason == "stop":
            state = "finished"

    return messages[-1]["content"]
The agent reads the conversation messages, decides on the next action, and calls the appropriate tool (e.g., the scrape function). This pattern makes it easy to add or remove functionality without rewriting the entire logic.
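The tools and tools_list arguments that call_agent receives aren’t defined above. If you’re reconstructing them, the OpenAI function-calling schema for the scrape tool might look roughly like this (a sketch; the real project may register additional tools):
# Sketch: describe the scrape tool to the model and map its name to our Python function.
tools = [
    {
        "type": "function",
        "function": {
            "name": "scrape",
            "description": "Scrape a URL and extract the requested data points from its content.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL of the page to scrape."}
                },
                "required": ["url"],
            },
        },
    }
]

tools_list = {"scrape": scrape}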
6. Real-World Example: Scraping Monthly China Auto Sales
In the code below, we’re scraping a Chinese automotive website to get monthly sales data:
entity_name = 'tesla_nov_2024_sales_data'
monthly_sales_page = "http://www.myhomeok.com/xiaoliang/changshang/104_86.htm"
6.1. Running the Research (Scraping) Loop
Finally, we run through each generated URL, instruct the agent to scrape, and accumulate results:
data_keys = list(DataPoints.__fields__.keys())
data_fields = DataPoints.__fields__
data_points = [{"name": key, "value": None, "reference": None, "description": data_fields[key].description} for key in data_keys]
filename = f"{entity_name}.json"
scrape(monthly_sales_page, data_points, [])
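If you’d rather let the agent decide when to scrape instead of calling scrape directly as above, the same run can be driven through call_agent. The prompts below are illustrative, and plan is assumed here to be a simple boolean flag:
# Illustrative prompts; adjust to your own data requirements.
system_prompt = "You are a research agent. Use the scrape tool to collect the requested monthly car sales data."
prompt = f"Collect the manufacturer's monthly sales data from {monthly_sales_page} and fill in every data point."
call_agent(prompt, system_prompt, tools, plan=True,
           data_points=data_points, entity_name=entity_name, links_scraped=[])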
7. What the Final Output Looks Like
After the AI agent completes its scraping, you’ll end up with a JSON file containing structured data similar to the example below. More impressively, it was even able to translate the Chinese content into English:
{
  "description": "A list of sales data grouped by manufacturer.",
  "name": "manufacturers",
  "reference": null,
  "value": [
    {
      "manufacturer_name": "Tesla",
      "models": [
        {
          "model_name": "Model Y",
          "units_sold": 46595
        },
        {
          "model_name": "Model 3",
          "units_sold": 32261
        }
      ],
      "month": 11,
      "reference": "http://www.myhomeok.com/xiaoliang/changshang/104_86.htm",
      "total_units_sold": 78856,
      "year": 2024
    }
  ]
}
This can be used for reporting, analytics, or further data manipulation without having to rework HTML parsing logic each time the website changes.
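For example, loading the saved file back for a quick look (assuming it holds the full data_points list written by save_json_pretty) takes only a few lines:
# Load the saved results and print each captured data point.
with open(f"{entity_name}.json", "r", encoding="utf-8") as f:
    results = json.load(f)

for entry in results:
    print(entry["name"], "->", entry["value"])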
8. Key Benefits of an AI Scraper Agent
Resilience to Page Layout Changes: Because we rely on semantic understanding rather than brittle HTML selectors, minor site redesigns are less likely to break the scraper.
Scalability: Onboarding a new site or data requirement is mostly a matter of updating your system prompts or data models.
Reduced Engineering Overhead: No need to rebuild and maintain complicated parsing scripts for each website.
Easy Debugging and Cost Monitoring: Tools like Langsmith help you see how many tokens you’re spending and trace exactly which steps the agent took.
9. Future Work: Expanding AI Scraper Capabilities
As powerful as an AI-driven scraping solution can be, there’s still more you can do to broaden its reach and flexibility. Here are two exciting areas we’ll explore in future installments:
Navigating and Paginating Through Pages Using LLMs
Dynamic Page Discovery: Let the agent locate and follow “Next” or “Load More” links without manual selectors.
Adaptive Querying: Ask the AI to figure out the best way to access subsequent pages (e.g., analyzing relative or parameter-based URLs).
Resilience to Changes: Avoid breaking changes by using semantic understanding rather than DOM element IDs or classes.
Bypassing Login and Authentication Using LLMs
Form Submission & Cookies: Automate the login process by asking the LLM to detect and fill the correct login fields, handle session cookies, and maintain persistence.
Two-Factor & Captchas: Combine AI-based solutions or third-party services to tackle advanced authentication steps like CAPTCHAs or multi-factor flows.
Context-Aware Workflows: Instruct the LLM to adapt to new login pages or flows without rewriting custom scripts.
By focusing on AI-driven strategies, you can keep your scrapers flexible, reduce hardcoded logic, and maintain robust performance—even for sites with tricky pagination or authentication hurdles. Stay tuned for more details in our upcoming posts!
Conclusion
By leveraging Large Language Models to parse and extract data, you can build a significantly more robust and scalable web scraper. Thanks to tools like Firecrawl, Langsmith, and OpenAI’s GPT models, you get both ease of setup and powerful debugging/tracing capabilities. No more struggling with CSS selectors breaking every time the HTML changes—let the AI do the heavy lifting so you can focus on using the data rather than wrestling with code maintenance.
Ready to transform your data extraction process?
Check out Firecrawl for converting web pages to Markdown.
Use OpenAI’s GPT models with Instructor for context-aware parsing.
Leverage Langsmith for monitoring, cost management, and debugging your agents.
Say goodbye to brittle scrapers and hello to a more intelligent approach!