AI Analysis of a Web Page

All Articles AI Culture Data Management Level 12 News Python Salesforce Software Development Testing

AI can help when a project needs to analyze a number of web pages for information.

Recently, we had a project that needed to take an arbitrary web page (provided as input) and attempt to extract job listing information.

Without AI, we would consider that an intractable problem for a limited budget. It might be doable if we were focused on a single site, as long as the site's layout remained consistent. But supporting any number of layouts and content distributions?

Enter the LLM.

In this case, we're using OpenAI's platform API, and that service doesn't fetch the page contents for you. It would be nice if it did, but we need to perform that step ourselves.

That's not all bad, though, because we can take the opportunity to strip out some tags that may confuse the results. Script, style, iframe, navigation, and any forms all need to go.

The rest of the content can be passed on in a prompt after being chunked for length. When chunking HTML, we need to be aware of what tag we're in, so if the chunk isn't big enough for the complete tag, we can break it down and give it a closing tag.

OpenAI has a feature that allows us to pass a Pydantic model to make better sense of the model's results and give us a structured output. For each chunk, we can get an object that has the potential title, description, date, relevant links, and other information on the page as needed.

After all of the chunks are run, we simply need to combine the description contents, then we can run the full record through any needed ensuing processes.

Originally published on 2025-04-02 by Matt Lewellyn

Reach out to us to discuss your complex deployment needs (or to chat about Star Trek)