How I Simplified Web Scraping and Data Extraction with Firecrawl and LLM
I started building JobXtension (still in development), a Chrome Extension designed to simplify job hunting. It automatically extracts and organizes key details from job description pages, such as job title, company name, location, required skills, and responsibilities. The data is stored in a database, making it easy for users to access through a dashboard.
When I began, I knew the biggest challenge would be extracting job details like the title, location, skills, and description from various job portals. I initially tried several scraping methods, including Puppeteer and Scrapingdog. However, they turned out to be either too complex or unreliable, especially for dynamic web pages. I needed a solution that was simple, efficient, and adaptable.
That’s when I discovered Firecrawl.
The Problem
Manually copying job details for every application felt tedious, and existing scraping libraries often required detailed configuration to work with modern web pages. Each job portal presented its own challenges, like different layouts and dynamic elements, which made the process time-consuming.
To make my tool effective, I needed:
A way to extract data directly from the currently active Chrome tab.
A structure to organize extracted data in a clear, consistent format (JSON).
An efficient pipeline that could handle diverse job postings.
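To make the "clear, consistent format" requirement concrete, here is a minimal sketch of what one job record could look like. The `JobDetails` class and its field names are illustrative assumptions that mirror the fields mentioned above, not the exact schema from my project.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class JobDetails:
    """Hypothetical shape for one extracted job posting."""
    title: str
    company: str
    location: str
    skills: list[str] = field(default_factory=list)
    description: str = ""

    def to_json(self) -> str:
        # Serialize the record to a JSON string for storage or transport
        return json.dumps(asdict(self), indent=2)


# Example values are made up for illustration
job = JobDetails(
    title="Developer Advocate",
    company="Example Corp",
    location="Remote",
    skills=["Python", "Technical writing"],
)
print(job.to_json())
```

Pinning the fields down early means every job portal, whatever its layout, ends up producing the same keys for the dashboard to read.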
The Solution: Firecrawl + LLM
I integrated Firecrawl to scrape the active tab effortlessly. Firecrawl simplified the heavy lifting of parsing job description pages. It returned the content in multiple formats like Markdown and HTML, which were perfect for my use case.
Next, I combined this with an LLM (Large Language Model) to interpret the scraped data and structure it in JSON. This streamlined the process entirely. Here's the workflow I used:
Scraping the Active Tab: Given the URL of the active tab, Firecrawl fetched the page and returned its content, handling dynamic content seamlessly.
Parsing with LLM: Using the scraped data, I prompted the LLM to extract the relevant job details - title, location, skills, and description - and format them into a neat JSON structure.
Prerequisites
You’ll need to set up the required API keys.
Firecrawl API Key
Head over to Firecrawl’s website and sign up for an account.
Navigate to the API Keys section in your dashboard to generate an API key and keep it safe.
Firecrawl provides 500 free credits for initial use.
OpenAI API Key
Go to OpenAI’s platform and create an account if you don’t already have one.
In the dashboard, select "API Keys" and generate a new key.
Alternatives: If OpenAI isn't available, consider LLaMA models, Google’s Gemini, or open models hosted on Hugging Face.
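One way to keep those keys safe is to read them from environment variables instead of pasting them into source files. This is a minimal sketch; the `require_env` helper and the variable names `FIRECRAWL_API_KEY` and `OPENAI_API_KEY` are my assumptions for illustration, so adjust them to your own setup.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value


# Hypothetical variable names; uncomment once the variables are set:
# firecrawl_key = require_env("FIRECRAWL_API_KEY")
# openai_key = require_env("OPENAI_API_KEY")
```

Failing early with a readable error is easier to debug than an authentication failure halfway through a scrape.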
Install Required Libraries
Install the necessary libraries with this command:
pip install firecrawl-py openai
The Code in Action
While I can’t share my exact project code (more updates to come in future blogs!), here’s a simplified version of the solution:
from firecrawl import FirecrawlApp
from openai import OpenAI
openai_client = OpenAI(api_key="your-openai-api-key")
app = FirecrawlApp(api_key="your-firecrawl-api-key")
# Scrape the job page
scrape_result = app.scrape_url(
    'https://www.ycombinator.com/companies/firecrawl/jobs/EK9HRDs-founding-developer-relations-community-support',
    params={'formats': ['markdown']}
)
scrape_content = scrape_result.get('markdown')
# Use an LLM to extract and structure the data
prompt = f"""
Given this scraped job description:
{scrape_content}
Extract details like:
- Job Title
- Company Name
- Location
- Skills
- Description
Format these details in JSON.
"""
completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",  # "gpt-3.5" alone is not a valid model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2  # low temperature keeps the output consistent
)
print(completion.choices[0].message.content)
In just a few lines of code, I was able to extract and structure job details with ease.
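The model's reply comes back as plain text, and models often wrap JSON in a Markdown code fence. Before storing the result, it helps to turn that text into a real Python dictionary. Here is a minimal sketch, assuming the reply contains a single JSON object; `parse_json_reply` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_json_reply(reply: str) -> dict:
    """Extract a JSON object from an LLM reply that may be wrapped in a code fence."""
    # Prefer the contents of a fenced block (optionally tagged "json") if present
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, reply, re.DOTALL)
    candidate = match.group(1) if match else reply
    # json.loads raises json.JSONDecodeError if the text is not valid JSON
    return json.loads(candidate.strip())


sample_reply = (
    FENCE + "json\n"
    '{"Job Title": "Developer Advocate", "Location": "Remote"}\n'
    + FENCE
)
print(parse_json_reply(sample_reply)["Job Title"])
```

In practice you would pass `completion.choices[0].message.content` to the helper and catch `json.JSONDecodeError` for replies that are not valid JSON. OpenAI's JSON mode (`response_format={"type": "json_object"}`) is another option on models that support it.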
Why Firecrawl Worked For Me
Firecrawl stood out because:
It was easy to integrate with the languages I code in.
It handled dynamic web pages gracefully.
It provided clean and structured output (Markdown/HTML), simplifying downstream processing.
Conclusion
What started as a complex and overwhelming task turned into a hassle-free experience with Firecrawl and an LLM. This combination let me focus on building features instead of wrestling with scraping logic.
Firecrawl has SDKs in JavaScript, Go, Python, and Rust. Feel free to use whichever language you’re most comfortable with.
If you’re working on something similar and struggling with scraping, give Firecrawl a try—it’s a game-changer!
Leveraging the right tools can make all the difference.
Connect with me on Bluesky (I am an active Bluesky user). I’m always up for discussions about Developer Tools, Open Source, or DevRel. Let’s share ideas and grow together!
Till then …
Keep Learning 🚀 and Keep Building ❤️