Hello! I'm evaluating tools to track changes in:

  • Government/legal PDFs (new regulations, court rulings)
  • News sites without reliable RSS
  • Tender portals
  • Property management messages (e.g. service notices)
  • Bank terms and policy updates

Current options I've tried:
  • Huginn — powerful, but requires significant setup and has no unified feed
  • changedetection.io — good for HTML, limited for documents

Key needs:
✓ Local processing (no cloud dependencies)
✓ Multi-page PDF support
✓ Customizable alert rules
✓ Low manual monitoring overhead (robust, offline-first operation)

What's working well for others? Especially interested in:

  1. Solutions combining OCR + text analysis
  2. Experience with local LLMs for this (NLP, not just diff)
  3. Creative workarounds you've built

(P.S. I'm testing a deep-scraping + LLM pipeline; if the results look promising, I'll share it.)

top 6 comments
xyro@lemmy.ca 5 points 1 month ago

I've started testing changedetection.io (https://github.com/dgtlmoon/changedetection.io) for similar use cases (monitoring government grant webpages). It can also detect changes in PDFs, but I haven't tested that feature much. It has worked fine for me so far.

theorangeninja@sopuli.xyz 1 point 1 month ago

Can you point me to a tutorial on how to set that up properly for websites? I tried it a while ago and couldn't get it to work...

alfablend@lemmy.world 2 points 1 month ago

Hello! For changedetection.io there are setup instructions covering installation via pip: https://github.com/dgtlmoon/changedetection.io/wiki/Microsoft-Windows

What is your use case?

alfablend@lemmy.world -1 points 1 month ago

@xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.

But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.

Here’s how I’ve approached this:

  1. Crawl the page to extract links
  2. Detect new document URLs
  3. Download each document and extract keywords
  4. Generate an AI summary using a local LLM
  5. Add the result to a readable feed
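
In rough code terms, the flow looks something like this. It's a simplified sketch, not my exact implementation: it assumes the requests and beautifulsoup4 packages, and the URL, the PDF-only selector, and the file paths are just placeholders.

# Simplified sketch of the crawl -> detect -> download flow (steps 1-3);
# steps 4-5 (keyword extraction, local-LLM summary, feed entry) hook in at the end.
# The URL, selector, and paths are illustrative placeholders.
import json
import pathlib
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEEN_FILE = pathlib.Path("seen_urls.json")
DOWNLOAD_DIR = pathlib.Path("docs")

def extract_links(page_url: str) -> list[str]:
    """Step 1: crawl the page and collect document links."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, a["href"]) for a in soup.select("a[href$='.pdf']")]

def new_urls(urls: list[str]) -> list[str]:
    """Step 2: keep only URLs that were not seen on previous runs."""
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    fresh = [u for u in urls if u not in seen]
    SEEN_FILE.write_text(json.dumps(sorted(seen | set(urls))))
    return fresh

def download(url: str) -> pathlib.Path:
    """Step 3: fetch the new document to disk for later text extraction."""
    DOWNLOAD_DIR.mkdir(exist_ok=True)
    path = DOWNLOAD_DIR / url.rsplit("/", 1)[-1]
    path.write_bytes(requests.get(url, timeout=60).content)
    return path

if __name__ == "__main__":
    for url in new_urls(extract_links("https://example.gov/grants")):
        doc = download(url)
        # Steps 4-5: extract keywords, summarize with the local LLM, add to the feed.
        print(f"new document: {url} -> {doc}")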

P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.

xyro@lemmy.ca 1 point 1 month ago

Do you send the result of the diff to an Ollama instance? I'd be curious to see the pipeline 😇

alfablend@lemmy.world 1 point 1 month ago

@xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.

As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here's a snippet from the YAML config to illustrate how that works:

extract:
  events:
    selector: "results[*]"
    fields:
      url: pdf_url
      title: title
      order_number: executive_order_number

download:
  extensions: [".pdf"]

gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1–2 sentences
    - Key provisions: 3–5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
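
To give an idea of how that prompt reaches the model, here's a simplified sketch using the gpt4all Python bindings and pypdf. The model filename, token limit, and paths are placeholders, not the exact settings from my pipeline.

# Rough sketch of the "gpt" step: extract text from a downloaded PDF and ask a
# locally hosted model (via the gpt4all Python bindings) to summarize it.
# The model filename and generation settings are placeholders.
from gpt4all import GPT4All
from pypdf import PdfReader

PROMPT = """Analyze this Executive Order document:
- Purpose: 1-2 sentences
- Key provisions: 3-5 bullet points
- Agencies involved: list
- Revokes/amends: if any
- Policy impact: neutral analysis

Document text:
{text}
"""

def pdf_text(path: str, max_chars: int = 8000) -> str:
    """Pull raw text out of the PDF and cap it so the prompt fits the context window."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return text[:max_chars]

def summarize(path: str) -> str:
    model = GPT4All("deepseek-model.gguf")  # placeholder model filename
    with model.chat_session():
        return model.generate(PROMPT.format(text=pdf_text(path)), max_tokens=512)

if __name__ == "__main__":
    print(summarize("docs/executive_order.pdf"))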

To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

processing:
  extract_regex:
    - "object of cultural heritage"
    - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
    - "project(?:s)?"
    - "circumstances"
    - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
    - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"

Let me know if you're experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!