I've started testing changedetection (https://github.com/dgtlmoon/changedetection.io) for similar use cases (monitoring government grant webpages). It can also detect changes in PDFs, but I haven't tested that feature much. It has worked fine for me so far.
Can you point me to a tutorial on how to set that up properly for websites? I tried it a while ago and could not get it to work...
Hello! For changedetection.io there are setup instructions for installing it with pip: https://github.com/dgtlmoon/changedetection.io/wiki/Microsoft-Windows What is your use case?
@xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.
But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.
Here’s how I’ve approached this (a simplified code sketch follows the list):
- Crawl the page to extract links
- Detect new document URLs
- Download each document and extract keywords
- Generate an AI summary using a local LLM
- Add the result to a readable feed
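If it helps to picture the flow, here's a minimal Python sketch of those steps. The page URL, the seen-URLs file, and the `summarize` function are placeholders for illustration, not my actual code; the real summarize step hands the extracted text to a local LLM.

```python
# Sketch: crawl a page, detect new PDF links, download them, extract text.
# PAGE_URL and SEEN_FILE are placeholders, not real endpoints.
import json
import pathlib
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

PAGE_URL = "https://example.gov/grants"     # page to monitor (placeholder)
SEEN_FILE = pathlib.Path("seen_urls.json")  # remembers already-processed links


def crawl_links(page_url):
    """Crawl the page and return all absolute links to PDF documents."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        urljoin(page_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith(".pdf")
    }


def new_links(links):
    """Diff against the previously seen set and persist the union."""
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    fresh = links - seen
    SEEN_FILE.write_text(json.dumps(sorted(seen | links)))
    return fresh


def summarize(pdf_path):
    """Placeholder: pull text out of the PDF; the real step feeds it to a local LLM."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return text[:300]


for url in new_links(crawl_links(PAGE_URL)):
    filename = pathlib.Path(url.rsplit("/", 1)[-1])
    filename.write_bytes(requests.get(url, timeout=60).content)
    print(filename.name, "->", summarize(filename))
```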
P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.
Do you send the result of the diff to an Ollama instance? I would be curious to see the pipeline 😇
@xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.
As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here's a snippet from the YAML config to illustrate how that works:
```yaml
extract:
  events:
    selector: "results[*]"
    fields:
      url: pdf_url
      title: title
      order_number: executive_order_number
download:
  extensions: [".pdf"]
gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1–2 sentences
    - Key provisions: 3–5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
```
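Under the hood, that prompt ends up going to the local model via the GPT4All Python bindings, roughly like this (the model filename and the input file are just placeholders; use whichever GGUF and extracted text you actually have):

```python
from gpt4all import GPT4All

# Placeholder filename: point this at whatever GGUF model you have downloaded locally.
model = GPT4All("deepseek-distill-7b.Q4_0.gguf")

# Text extracted from a downloaded document (e.g. by pypdf) in the previous step.
document_text = open("downloaded_order.txt", encoding="utf-8").read()

prompt = (
    "Analyze this Executive Order document:\n"
    "- Purpose: 1-2 sentences\n"
    "- Key provisions: 3-5 bullet points\n\n"
    + document_text
)

with model.chat_session():
    print(model.generate(prompt, max_tokens=512))
```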
To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config (with a small sketch of the filtering step after it):
```yaml
processing:
  extract_regex:
    - "object of cultural heritage"
    - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
    - "project(?:s)?"
    - "circumstances"
    - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
    - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"
```
Let me know if you're experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!