Hello! I'm evaluating tools to track changes in:

  • Government/legal PDFs (new regulations, court rulings)
  • News sites without reliable RSS
  • Tender portals
  • Property management messages (e.g. service notices)
  • Bank terms and policy updates

Current options I've tried:
  • Huginn — powerful, but requires significant setup and has no unified feed
  • changedetection.io — good for HTML, limited for documents

Key needs:
✓ Local processing (no cloud dependencies)
✓ Multi-page PDF support
✓ Customizable alert rules
✓ Low manual monitoring overhead (robust, offline-first approaches preferred)

What's working well for others? Especially interested in:

  1. Solutions combining OCR + text analysis
  2. Experience with local LLMs for this (NLP, not just diff)
  3. Creative workarounds you've built

(P.S. I'm testing a deep scraping + LLM pipeline; if the results look promising, I'll share.)

alfablend@lemmy.world:

@xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.

As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here's a snippet from the YAML config to illustrate how that works:

extract:
  events:
    selector: "results[*]"   # selector for document entries in the source feed
    fields:
      url: pdf_url
      title: title
      order_number: executive_order_number

download:
  extensions: [".pdf"]       # only fetch linked PDF documents

gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1–2 sentences
    - Key provisions: 3–5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
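
For anyone curious about the glue behind that config, here's a minimal Python sketch of the flow it drives: fetch the feed, pull entries via the selector, download the PDFs, and summarize with a local GPT4All model. The feed URL, model filename, and overall structure are placeholders for illustration, not the actual pipeline code:

import requests
from io import BytesIO
from pypdf import PdfReader
from gpt4all import GPT4All

FEED_URL = "https://example.gov/api/orders"   # placeholder source feed

PROMPT = """Analyze this Executive Order document:
- Purpose: 1-2 sentences
- Key provisions: 3-5 bullet points
- Agencies involved: list
- Revokes/amends: if any
- Policy impact: neutral analysis
"""

# extract.events: each entry under "results[*]" describes one document
events = requests.get(FEED_URL, timeout=30).json()["results"]

# load a local model through GPT4All (filename is a placeholder)
model = GPT4All("deepseek-distill-7b.gguf")

for event in events:
    pdf_url = event["pdf_url"]
    if not pdf_url.lower().endswith(".pdf"):
        continue  # download.extensions: only fetch PDFs

    # download the document and extract its raw text
    reader = PdfReader(BytesIO(requests.get(pdf_url, timeout=60).content))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # truncate so the prompt fits the model's context window
    summary = model.generate(PROMPT + "\n\n" + text[:8000], max_tokens=400)
    print(event["title"], summary, sep="\n")

One thing the sketch leaves out is state: detecting *new* URLs means remembering which ones have already been processed, so only unseen documents reach the model.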

To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

processing:
  extract_regex:
    - "object of cultural heritage"
    - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
    - "project(?:s)?"
    - "circumstances"
    - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
    - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"

Let me know if you're experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!