Turn hot topics into structured social evidence.
MindSpider is an open-source sentiment crawling system that first discovers emerging topics, then expands them into platform-level crawls for deeper discussion, reaction, and feedback analysis.
Daily feeds across social, technical, and community surfaces.
Platform-specific passes for posts, comments, and engagement evidence.
Broad topic extraction first, platform-level sentiment crawling second.
System Shape
Discovery to crawl, without the manual gap
Discovery Sources
Daily signal intake
Weibo, Zhihu, Bilibili, Toutiao, GitHub, CoolApk, and adjacent feeds seed the topic graph before deeper crawling begins.
Agent Layer
AI topic extraction
Model-assisted summarization produces topic names, summaries, and keyword lists from noisy daily sources.
Crawl Queue
Keyword fan-out
The extracted topics become crawl tasks for each platform adapter, keeping downstream work tied to explicit evidence.
Platform Pass
Deep sentiment crawling
Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu are crawled with browser automation to capture comments, reactions, and discussion context.
Output
Tables + reports
Data lands in explicit tables like `daily_topics`, `topic_news_relation`, and `crawling_tasks`.
Discovery sources
The broad pass is designed to recognize momentum before you choose a crawl target.
Deep-crawl targets
The second pass goes deeper on the platforms where sentiment, discussion, and feedback actually live.
Data outputs
What comes out is designed for operators, not just demos.
How It Works
A crawler pipeline shaped like an analyst workflow.
Discover rising topics
MindSpider pulls daily hot signals from news and community sources, then uses AI extraction to turn raw headlines into reusable topic clusters.
Fan out into platform crawls
Those topic clusters become structured keyword queues for deep crawls across Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu.
Persist evidence for analysis
Tasks, content, and relationships are written into MySQL-ready tables so you can review trajectories, compare platforms, and build downstream reports.
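The fan-out step can be pictured with a minimal sketch. Everything named here (`Topic`, `CrawlTask`, `fan_out`, the platform identifiers, and the `pending` status value) is hypothetical and chosen for illustration; the actual MindSpider classes and field names may differ.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative platform identifiers; the real adapter names may differ.
PLATFORMS = ["xhs", "douyin", "kuaishou", "bilibili", "weibo", "tieba", "zhihu"]


@dataclass(frozen=True)
class Topic:
    """A topic produced by the extraction stage (hypothetical shape)."""
    topic_id: str
    name: str
    keywords: tuple[str, ...]


@dataclass(frozen=True)
class CrawlTask:
    """One unit of deep-crawl work, tied back to the topic that triggered it."""
    topic_id: str
    platform: str
    keyword: str
    status: str = "pending"


def fan_out(topics, platforms=PLATFORMS):
    """Expand every topic keyword into one pending task per platform.

    A (platform, keyword) pair is queued only once; if two topics share a
    keyword, the task stays linked to the first topic that produced it.
    """
    tasks = []
    seen = set()
    for topic in topics:
        for keyword, platform in product(topic.keywords, platforms):
            key = (platform, keyword)
            if key in seen:
                continue
            seen.add(key)
            tasks.append(CrawlTask(topic.topic_id, platform, keyword))
    return tasks
```

Calling `fan_out(topics)` with a day's extracted topics would yield a flat task list that can be written to a queue table and worked through platform by platform.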
Architecture
Three lanes, one intent: make topic movement inspectable.
Daily signal intake
Broad Topic Extraction
The first lane watches public trend surfaces, normalizes source data, and asks the model layer to produce topics worth pursuing.
- Daily news and hot-list collection
- AI-generated topic summaries
- Keyword lists written to durable storage
Platform-specific evidence collection
Deep Sentiment Crawling
The second lane takes the extracted keywords and turns them into structured crawl tasks for each target platform.
- Per-platform crawler adapters
- Login-aware browser sessions
- Comment, post, and interaction capture
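As a rough sketch of how per-platform crawl tasks might be scheduled with asyncio, the example below caps concurrency with a semaphore and records a status for each task. The `crawl_keyword` function is a stand-in for a real adapter that would drive a logged-in browser session; it, `run_tasks`, and the status values are assumptions rather than the project's actual interface.

```python
import asyncio


async def crawl_keyword(platform: str, keyword: str) -> dict:
    """Hypothetical adapter stub; a real one would drive a browser session."""
    await asyncio.sleep(0)  # placeholder for page loads and network waits
    return {"platform": platform, "keyword": keyword, "items": []}


async def run_tasks(tasks, max_concurrency: int = 3):
    """Run (platform, keyword) pairs with a cap on concurrent crawls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(platform: str, keyword: str) -> dict:
        async with semaphore:
            try:
                result = await crawl_keyword(platform, keyword)
                return {**result, "status": "completed"}
            except Exception as exc:
                # Record the failure so one bad page does not stop the batch.
                return {
                    "platform": platform,
                    "keyword": keyword,
                    "status": "failed",
                    "error": str(exc),
                }

    # gather() returns results in the same order the tasks were submitted.
    return await asyncio.gather(*(guarded(p, k) for p, k in tasks))
```

A batch could then be run with `asyncio.run(run_tasks([("weibo", "ai"), ("zhihu", "ai")]))`. Capturing failures per task, instead of letting them propagate, is one way to keep progress tracking accurate across a long crawl.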
Tables, tasks, and replayability
Structured Output Layer
Instead of dumping text into blobs, MindSpider stores topic relations, crawl progress, and platform outputs in explicit database structures.
- MySQL-oriented persistence
- Task progress and status tracking
- Reusable datasets for reports and follow-on agents
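To make the idea of explicit tables concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table names come from the project description; every column name, the status values, and the helper functions are assumptions for illustration and will not match the real schema exactly.

```python
import sqlite3

# Table names follow the project description; all columns are assumed.
SCHEMA = """
CREATE TABLE daily_topics (
    topic_id     TEXT PRIMARY KEY,
    topic_name   TEXT NOT NULL,
    summary      TEXT,
    keywords     TEXT,
    extract_date TEXT NOT NULL
);
CREATE TABLE topic_news_relation (
    topic_id TEXT NOT NULL REFERENCES daily_topics(topic_id),
    news_id  TEXT NOT NULL,
    PRIMARY KEY (topic_id, news_id)
);
CREATE TABLE crawling_tasks (
    task_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    topic_id TEXT NOT NULL REFERENCES daily_topics(topic_id),
    platform TEXT NOT NULL,
    keyword  TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'pending'
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the illustrative tables in a fresh SQLite database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def task_counts_by_status(conn: sqlite3.Connection, topic_id: str) -> dict:
    """Summarize crawl progress for one topic as {status: count}."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM crawling_tasks "
        "WHERE topic_id = ? GROUP BY status",
        (topic_id,),
    )
    return dict(rows.fetchall())
```

Because tasks reference their topic by ID, a progress report for any topic reduces to a single grouped query, which is the kind of review that loose text exports make difficult.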
Open-Source Status
MindSpider is live as a project identity, but its latest code lives inside BettaFish.
The original MindSpider repository documents the pipeline clearly and is still useful for understanding the architecture. The maintainers now position it as a module inside BettaFish for newer work.
- Use this site as the product-facing front door for the project story.
- Use the GitHub repository and README for current installation details.
- Treat the self-serve onboarding flow here as intentionally in-progress.
Upstream repositories
Keep both links visible so operators can read the original README and follow the newer monorepo path without guessing.
Feature Surface
Product language for a system that still respects the code.
AI topic extraction
Convert noisy daily news and hot lists into themes, summaries, and keyword sets that agents can keep working with.
Playwright-first crawling
Browser automation is built into the deep crawl layer, making dynamic pages and login-heavy flows more realistic to operate.
Platform-aware storage
Outputs are mapped into structured tables for notes, videos, threads, tasks, and topic relationships instead of loose export files.
Keyword queue control
The system manages topic-to-keyword fan-out so follow-up crawls stay tied to the signals that triggered them.
Open-source inspectability
Everything is visible in code: pipeline stages, database schema, platform adapters, and the operational assumptions around them.
Built for analyst handoff
The output is meant to be reviewed, queried, and reused by humans or later agents instead of dying as a one-shot scrape.
Frequently Asked Questions
Is MindSpider a hosted SaaS product today?
No. This site presents MindSpider as an open-source project and product identity. The current Start Free path is a placeholder while the guided onboarding flow is rebuilt.
What does the two-stage pipeline actually mean?
Stage one identifies promising topics from daily feeds. Stage two takes the resulting keywords and runs deeper platform crawls to gather sentiment-bearing evidence.
Which technologies shape the implementation?
The README centers Python, Playwright, MySQL, asyncio, and a DeepSeek-compatible analysis layer for topic extraction and downstream interpretation.
Can I self-host it?
Yes. The project is presented as an inspectable open-source system, so the primary path today is repository-driven setup rather than a closed hosted dashboard.
Why mention BettaFish here?
Because the upstream README now states that the latest MindSpider code is maintained as a submodule inside BettaFish. Linking both avoids sending users to stale expectations.
Start Path
Start free now, then decide how deep you want the project story to go.
Today that means a guided placeholder path, the public README, and the upstream repositories. The site flow is being rebuilt so those three pieces eventually feel like one coherent onboarding track.