"What’s Your Preferred Self-Hosted Solution for Deep Monitoring (Beyond Simple Page Changes)?" (lemmy.world)

submitted 1 month ago by alfablend@lemmy.world to c/selfhosted@lemmy.world

Hello! I'm evaluating tools to track changes in:

Government/legal PDFs (new regulations, court rulings)
News sites without reliable RSS
Tender portals
Property management messages (e.g. service notices)
Bank terms and policy updates

Current options I've tried:
• Huginn: powerful but requires significant setup, no unified feed
• changedetection.io: good for HTML, limited for documents

Key needs:
✓ Local processing (no cloud dependencies)
✓ Multi-page PDF support
✓ Customizable alert rules
✓ Trying to reduce manual monitoring overhead — looking for robust, offline-first approaches

What's working well for others? Especially interested in:

Solutions combining OCR + text analysis
Experience with local LLMs for this (NLP, not just diff)
Creative workarounds you've built

(P.S. Testing a deep scraping + LLM pipeline — if results look promising, will share.)

[–] xyro@lemmy.ca 5 points 1 month ago* (last edited 1 month ago) (2 children)

I've started testing changedetection.io (https://github.com/dgtlmoon/changedetection.io) for a similar use case (monitoring government grant webpages). It can also detect changes in PDFs, but I haven't tested that feature much. It has worked fine for me so far.

[–] theorangeninja@sopuli.xyz 1 points 1 month ago (1 children)

Can you point me to a tutorial on how to set that up properly for websites? I tried it a while ago and could not get it to work...

[–] alfablend@lemmy.world 2 points 1 month ago

Hello! The changedetection.io wiki has setup instructions for a pip install on Windows: https://github.com/dgtlmoon/changedetection.io/wiki/Microsoft-Windows What is your use case?

[–] alfablend@lemmy.world -1 points 1 month ago (1 children)

@xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.

But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.

Here’s how I’ve approached this:

Crawl the page to extract links
Detect new document URLs
Download each document and extract keywords
Generate an AI summary using a local LLM
Add the result to a readable feed
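
For anyone who wants to try the first two steps by hand, here is a minimal sketch using only the Python standard library. It is not the actual pipeline code: the names `LinkCollector` and `find_new_documents` and the `seen` set are made up for illustration, and it assumes the document links are plain `<a href>` tags ending in a known extension.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))


def find_new_documents(html, base_url, seen, extensions=(".pdf",)):
    """Return document URLs on the page that are not in `seen` yet."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    new_urls = []
    for url in parser.links:
        is_document = url.lower().endswith(tuple(extensions))
        if is_document and url not in seen and url not in new_urls:
            new_urls.append(url)
    return new_urls


if __name__ == "__main__":
    page = '<a href="/docs/a.pdf">A</a> <a href="/docs/b.pdf">B</a>'
    already_seen = {"https://example.org/docs/a.pdf"}
    print(find_new_documents(page, "https://example.org/grants", already_seen))
```

Downloading, keyword extraction, and the LLM summary would then run only on the URLs this returns, which keeps the expensive steps limited to documents that are actually new.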

P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.

[–] xyro@lemmy.ca 1 points 1 month ago (1 children)

Do you send the result of the diff to an Ollama instance? I would be curious to see the pipeline 😇

[–] alfablend@lemmy.world 1 points 1 month ago

@xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.

As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here's a snippet from the YAML config to illustrate how that works:

extract:
  events:
    selector: "results[*]"
    fields:
      url: pdf_url
      title: title
      order_number: executive_order_number

download:
  extensions: [".pdf"]

gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1–2 sentences
    - Key provisions: 3–5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
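
To make the `extract` and `download` parts more concrete, here is a rough Python sketch of what that mapping could do. It assumes the source is a JSON payload with a top-level `results` array and that `results[*]` means "each item in that array"; the function names `extract_events` and `filter_downloadable` are hypothetical and not taken from the tool.

```python
# Field mapping copied from the config: output name -> key in each result item
FIELDS = {
    "url": "pdf_url",
    "title": "title",
    "order_number": "executive_order_number",
}


def extract_events(payload, fields=FIELDS):
    """Build one event dict per item in payload["results"]."""
    events = []
    for item in payload.get("results", []):
        # Missing keys become None instead of raising
        events.append({name: item.get(source) for name, source in fields.items()})
    return events


def filter_downloadable(events, extensions=(".pdf",)):
    """Keep only events whose URL ends with an allowed extension."""
    allowed = tuple(extensions)
    return [e for e in events if e["url"] and e["url"].lower().endswith(allowed)]


if __name__ == "__main__":
    sample = {
        "results": [
            {
                "pdf_url": "https://example.org/eo-1.pdf",
                "title": "Sample order",
                "executive_order_number": 1,
            },
            {"title": "Entry without a PDF"},
        ]
    }
    print(filter_downloadable(extract_events(sample)))
```

Each surviving event would then be downloaded and its text passed to the model together with the prompt from the `gpt` section.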

To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

processing:
  extract_regex:
    - "object of cultural heritage"
    - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
    - "project(?:s)?"
    - "circumstances"
    - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
    - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"
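
As a rough illustration of that pre-filtering step, the sketch below applies three of the patterns with Python's `re` module. The case-insensitive matching and the rule "use the capture group if the pattern has one, otherwise the whole match" are assumptions on my part, and `extract_blocks` is a made-up name.

```python
import re

# Three patterns from the config, written as raw strings
# (the doubled backslashes in the YAML become single ones here)
PATTERNS = [
    r"address[:\s]\s*(.{10,100}?)(?=\n|$)",
    r"project(?:s)?",
    r"client\s*:?\s*(.{10,100}?)(?=\n|$)",
]


def extract_blocks(text, patterns=PATTERNS):
    """Return the matched snippets, grouped in pattern order."""
    blocks = []
    for pattern in patterns:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Prefer the capture group when the pattern defines one
            blocks.append(match.group(1) if match.groups() else match.group(0))
    return blocks


if __name__ == "__main__":
    document = (
        "Address: 12 Example Street, Springfield\n"
        "Two projects were reviewed.\n"
        "Client: Example Heritage Trust Ltd\n"
    )
    for block in extract_blocks(document):
        print(block)
```

Only the returned snippets would go into the prompt, so the model sees a few short lines instead of the full document text.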

Let me know if you're experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!